Open Source

Does your AI agent
actually work?
Prove it.

Real-world tasks. Containerized environments. Automated scoring. No hand-holding. Give your agent 15 minutes and see what it can actually do — starting with DevOps, security, and beyond. Built on Harbor.

harbor — ai-agent-evals

# Run the evaluation
$ harbor run -p "./compromised-prod-server-easy" --agent terminus-2 --model claude-opus-4-6 -k 1

environment_setup        0:00:03
agent_setup             0:00:53
agent_execution         0:06:36
verifier                0:01:30

Tests: 26/30 passed, 4 failed

✓ test_ssh_access
✓ test_nginx_config_valid
✓ test_ssl_chain_valid
✓ test_basic_auth_works
✓ test_cors_headers
✓ test_security_header_hsts
✗ test_no_rogue_ssh_key
✗ test_no_backdoor_persistence
...

// process

How It Works

Pick a Challenge

Choose from real-world DevSecOps scenarios. Each ships with an instruction file, containerized environment, and automated verification tests.

Point Your Agent

Run any AI agent: Claude, GPT, local models via Ollama, anything with an API. The Harbor framework gives each agent terminal access inside a Docker container, where it reads the task.

Watch & Score

The agent has 15 minutes. Automated pytest verification runs afterward. Every test maps to a real security or operational requirement, and each one either passes or fails. No partial credit.
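To make "every test maps to a real requirement" concrete, here is a minimal sketch of what a check like test_no_rogue_ssh_key could look like. It is not the challenge's actual test: the file path, the trusted key, and the rogue_keys helper are all hypothetical, and the parser assumes plain "type key [comment]" lines with no leading options field.

```python
from pathlib import Path

# Hypothetical values: a real challenge defines its own path and trusted keys.
AUTHORIZED_KEYS = Path("/root/.ssh/authorized_keys")
TRUSTED_KEYS = {("ssh-ed25519", "AAAAC3NzaC1lZDI1NTE5AAAAITRUSTEDKEY")}


def rogue_keys(text: str, trusted=TRUSTED_KEYS) -> list[str]:
    """Return entries whose (type, key) pair is not in the trusted set.

    Assumes plain "type key [comment]" lines with no leading options field.
    """
    rogue = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        fields = line.split()
        if tuple(fields[:2]) not in trusted:
            rogue.append(line)
    return rogue


def test_no_rogue_ssh_key():
    # A missing file means no keys at all, which also satisfies the requirement.
    text = AUTHORIZED_KEYS.read_text() if AUTHORIZED_KEYS.exists() else ""
    assert rogue_keys(text) == []
```

Comparing the key material rather than the trailing comment matters here, because an attacker can set the comment to anything.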

// scenarios

Current Challenges

Easy

Compromised Prod Server

Jumphost breached. Attacker planted SSH backdoors with persistence, broke HTTPS stack. Agent must clean up, restore TLS, rewrite Go app, harden everything—from a terminal.

Tests: 30

Time limit: 15 min

Domains: SSH + TLS + Go

Hard

Compromised Prod Server

Same scenario, harder constraints. Obfuscated code, moved certificates, deeper persistence mechanisms. The attacker was more thorough—your agent needs to be too.

Tests: 33

Time limit: 15 min

Domains: SSH + TLS + Go + Recovery

// architecture

Evaluation Pipeline

┌─────────────────────┐       ┌───────────────────────────────────────┐
│  instruction.md     │       │           Docker Container            │
│  task definition    │──────▶│                                       │
│  + artifacts        │       │  ┌──────────┐     ┌──────────────┐    │
└─────────────────────┘       │  │ AI Agent │────▶│   Terminal   │    │
                              │  └──────────┘     └──────────────┘    │
┌─────────────────────┐       │                                       │
│  pytest verifier    │◀──────│  execution artifacts                  │
│  automated scoring  │       └───────────────────────────────────────┘
└─────────┬───────────┘
          │
          ▼
┌─────────────────────┐
│  score + report     │
│  pass/fail per test │
└─────────────────────┘
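The two timed stages of this pipeline, agent execution under a hard limit followed by verification, can be sketched in a few lines of Python. This is an illustration only, not Harbor's implementation: run_agent and run_verifier are hypothetical helpers, and they run plain local commands, whereas the real pipeline runs the agent inside a Docker container.

```python
import subprocess
import sys
import time

AGENT_TIME_LIMIT_S = 15 * 60  # the 15-minute budget each challenge states


def run_agent(cmd: list[str], limit_s: float = AGENT_TIME_LIMIT_S) -> dict:
    """Run the agent command, stopping it if it exceeds the time limit."""
    start = time.monotonic()
    try:
        subprocess.run(cmd, timeout=limit_s, check=False)
        timed_out = False
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising.
        timed_out = True
    return {"timed_out": timed_out, "elapsed_s": time.monotonic() - start}


def run_verifier(test_cmd: list[str]) -> bool:
    """Run the verification command; exit code 0 means every test passed."""
    return subprocess.run(test_cmd, check=False).returncode == 0


if __name__ == "__main__":
    # Stand-in commands; a real run would launch the agent and pytest instead.
    result = run_agent([sys.executable, "-c", "print('agent working')"])
    passed = run_verifier([sys.executable, "-c", "raise SystemExit(0)"])
    print(result["timed_out"], passed)
```

Verification runs whether or not the agent finished in time, so a run that hits the limit is still scored on whatever state it left behind.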

// benchmarks

Sample Results

Real runs. Agent: terminus-2. No cherry-picking.

Challenge	Model	Passed	Time	Score
Compromised Server (Easy)	claude-opus-4-6	26/30	6m 36s	87%
Compromised Server (Easy)	claude-opus-4-6	8/30	5m 28s	27%
Compromised Server (Hard)	claude-opus-4-6	22/33	15m 00s	67%

* Same model, same agent, different runs. Variance is the point—consistency matters.
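The Score column is the share of tests passed, rounded to a whole percent. The short sketch below recomputes it from the table and adds a per-challenge spread (highest score minus lowest) as one simple way to quantify run-to-run variance. The score and summarize helpers are illustrative names, not part of Harbor.

```python
from statistics import mean

# (challenge, passed, total) copied from the results table
RUNS = [
    ("Compromised Server (Easy)", 26, 30),
    ("Compromised Server (Easy)", 8, 30),
    ("Compromised Server (Hard)", 22, 33),
]


def score(passed: int, total: int) -> int:
    """Percentage of tests passed, rounded to the nearest whole number."""
    return round(100 * passed / total)


def summarize(runs):
    """Group scores by challenge and report their mean and spread."""
    by_challenge = {}
    for name, passed, total in runs:
        by_challenge.setdefault(name, []).append(score(passed, total))
    return {
        name: {"scores": s, "mean": mean(s), "spread": max(s) - min(s)}
        for name, s in by_challenge.items()
    }


if __name__ == "__main__":
    for name, stats in summarize(RUNS).items():
        print(name, stats)
```

For the two Easy runs this gives scores of 87 and 27, a 60-point spread from the same model and agent.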

Run your own eval.