Open Source

Does your AI agent
actually work?
Prove it.

Real-world tasks. Containerized environments. Automated scoring. No hand-holding. Give your agent 15 minutes and see what it can actually do, starting with DevOps and security. Built on Harbor.

harbor — ai-agent-evals
# Run the evaluation
$ harbor run -p "./compromised-prod-server-easy" --agent terminus-2 --model claude-opus-4-6 -k 1

environment_setup       0:00:03
agent_setup             0:00:53
agent_execution         0:06:36
verifier                0:01:30

Tests: 26/30 passed, 4 failed

  ✓ test_ssh_access
  ✓ test_nginx_config_valid
  ✓ test_ssl_chain_valid
  ✓ test_basic_auth_works
  ✓ test_cors_headers
  ✓ test_security_header_hsts
  ✗ test_no_rogue_ssh_key
  ✗ test_no_backdoor_persistence
  ...

How It Works

01

Pick a Challenge

Choose from real-world DevSecOps scenarios. Each ships with an instruction file, containerized environment, and automated verification tests.

02

Point Your Agent

Run any AI agent: Claude, GPT, local models via Ollama, anything with an API. The Harbor framework gives each agent terminal access inside a Docker container, where it reads the task.

03

Watch & Score

The agent has 15 minutes. Automated pytest verification runs afterward. Every test maps to a real security or operational requirement. No partial-credit fudging.
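To make "every test maps to a requirement" concrete, here is a minimal sketch of what a check in the spirit of test_no_rogue_ssh_key might look like. It is not the project's actual verifier. The helper find_rogue_keys, the allowlist, and the sample key material are all hypothetical, and a real test would presumably read authorized_keys from the container rather than from a string.

```python
# Hypothetical sketch only: not the benchmark's real test code.
# Key blobs below are made-up placeholders, not real SSH keys.

KEY_TYPE_PREFIXES = ("ssh-", "ecdsa-", "sk-")

ALLOWED_BLOBS = {"AAAAC3ExampleOpsKey"}

COMPROMISED = """\
# ops team
ssh-ed25519 AAAAC3ExampleOpsKey ops@jumphost
ssh-rsa AAAAB3ExampleAttackerKey attacker@elsewhere
"""

CLEANED = """\
# ops team
ssh-ed25519 AAAAC3ExampleOpsKey ops@jumphost
"""


def find_rogue_keys(authorized_keys_text, allowed_blobs):
    """Return key blobs that appear in the file but not in the allowlist.

    Simplification: options containing quoted spaces are not parsed properly,
    and lines without a recognizable key type are skipped.
    """
    rogue = []
    for raw in authorized_keys_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split()
        # An entry looks like "[options] keytype base64-blob [comment]".
        for i, field in enumerate(fields[:-1]):
            if field.startswith(KEY_TYPE_PREFIXES):
                blob = fields[i + 1]
                if blob not in allowed_blobs:
                    rogue.append(blob)
                break
    return rogue


def test_no_rogue_ssh_key():
    # Passes only if the agent removed every key outside the allowlist.
    assert find_rogue_keys(CLEANED, ALLOWED_BLOBS) == []
```

The test is binary on purpose: one leftover attacker key fails it, which matches the no-partial-credit stance described above.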

Current Challenges

Easy

Compromised Prod Server

Jumphost breached. Attacker planted SSH backdoors with persistence, broke HTTPS stack. Agent must clean up, restore TLS, rewrite Go app, harden everything—from a terminal.

Tests: 30
Time limit: 15 min
Domains: SSH + TLS + Go

Hard

Compromised Prod Server

Same scenario, harder constraints. Obfuscated code, moved certificates, deeper persistence mechanisms. The attacker was more thorough—your agent needs to be too.

Tests: 33
Time limit: 15 min
Domains: SSH + TLS + Go + Recovery

Evaluation Pipeline

┌─────────────────────┐       ┌───────────────────────────────────────┐
│ instruction.md      │       │ Docker Container                      │
│ task definition     │──────▶│ + artifacts                           │
└─────────────────────┘       │ ┌──────────┐     ┌──────────────┐     │
                              │ │ AI Agent │────▶│ Terminal     │     │
                              │ └──────────┘     └──────────────┘     │
┌─────────────────────┐       │                                       │
│ pytest verifier     │◀──────│ execution artifacts                   │
│ automated scoring   │       └───────────────────────────────────────┘
└─────────┬───────────┘
          ▼
┌─────────────────────┐
│ score + report      │
│ pass/fail per test  │
└─────────────────────┘

Sample Results

Real runs. Agent: terminus-2. No cherry-picking.

Challenge                   Model             Passed   Time      Score
Compromised Server (Easy)   claude-opus-4-6   26/30    6m 36s    87%
Compromised Server (Easy)   claude-opus-4-6   8/30     5m 28s    27%
Compromised Server (Hard)   claude-opus-4-6   22/33    15m 00s   67%

* Same model, same agent, different runs. Variance is the point—consistency matters.
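The page does not state how the Score column is computed, but passed tests divided by total tests, rounded to the nearest percent, reproduces all three figures. The sketch below assumes that formula and groups runs by challenge to show the run-to-run spread. The names score_percent and summarize are illustrative, not part of Harbor.

```python
from statistics import mean

# (challenge, passed, total), copied from the Sample Results table.
RUNS = [
    ("Compromised Server (Easy)", 26, 30),
    ("Compromised Server (Easy)", 8, 30),
    ("Compromised Server (Hard)", 22, 33),
]


def score_percent(passed, total):
    """Assumed scoring rule: pass rate rounded to a whole percent."""
    return round(100 * passed / total)


def summarize(runs):
    """Group pass rates by challenge and report run count, mean, and spread."""
    by_challenge = {}
    for name, passed, total in runs:
        by_challenge.setdefault(name, []).append(100 * passed / total)
    return {
        name: {
            "runs": len(rates),
            "mean": mean(rates),
            "spread": max(rates) - min(rates),  # percentage points
        }
        for name, rates in by_challenge.items()
    }


for name, stats in summarize(RUNS).items():
    print(f"{name}: {stats['runs']} run(s), "
          f"mean {stats['mean']:.0f}%, spread {stats['spread']:.0f} points")
```

With only two Easy runs the spread is a rough indicator at best. Repeating a task several times before comparing agents or models gives a more trustworthy picture.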

Run your own eval.

Clone the repo. Pick a model. Pick a task. See what your agent can actually do.