Real-world tasks. Containerized environments. Automated scoring. No hand-holding. Give your agent 15 minutes and see what it can actually do, starting with DevOps and security. Built on Harbor.
Choose from real-world DevSecOps scenarios. Each ships with an instruction file, containerized environment, and automated verification tests.
Run any AI agent: Claude, GPT, local models via Ollama, anything with an API. Powered by the Harbor framework, each agent gets terminal access inside a Docker container and reads the task's instruction file.
The agent has 15 minutes. Automated pytest verification runs afterward. Every test maps to a real security or operational requirement. No partial-credit fudging.
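To make "every test maps to a requirement" concrete, here is a minimal sketch of what one pytest-style check could look like. It is illustrative only and not taken from the actual test suite: the allow-list, the `unexpected_keys` helper, and the sample key lines are all hypothetical.

```python
# Hypothetical illustration; not one of the benchmark's real tests.
# Requirement being modeled: no unauthorized SSH keys remain after cleanup.

ALLOWED_KEY_COMMENTS = {"admin@jumphost"}  # assumed allow-list of key comments


def unexpected_keys(authorized_keys_text: str) -> list[str]:
    """Return key lines whose trailing comment is not on the allow-list."""
    bad = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        comment = line.split()[-1]  # last field is treated as the key comment
        if comment not in ALLOWED_KEY_COMMENTS:
            bad.append(line)
    return bad


def test_no_backdoor_ssh_keys():
    # A real check would read the file from the container; this sample is inline.
    sample = "ssh-ed25519 AAAAC3Nza admin@jumphost\n"
    assert unexpected_keys(sample) == []
```

A test like this either passes or fails, which is what keeps scoring binary per requirement.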
Jumphost breached. Attacker planted SSH backdoors with persistence, broke HTTPS stack. Agent must clean up, restore TLS, rewrite Go app, harden everything—from a terminal.
Same scenario, harder constraints. Obfuscated code, moved certificates, deeper persistence mechanisms. The attacker was more thorough—your agent needs to be too.
Real runs. Agent: terminus-2. No cherry-picking.
| Challenge | Model | Passed | Time | Score |
|---|---|---|---|---|
| Compromised Server (Easy) | claude-opus-4-6 | 26/30 | 6m 36s | |
| Compromised Server (Easy) | claude-opus-4-6 | 8/30 | 5m 28s | |
| Compromised Server (Hard) | claude-opus-4-6 | 22/33 | 15m 00s | |
* Same model, same agent, different runs. Variance is the point—consistency matters.
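For a quick read of the Passed column, the counts can be turned into pass rates. The sketch below uses only the numbers from the table; `pass_rate` is a hypothetical helper, and the percentage is not the benchmark's Score column.

```python
# Pass counts copied from the results table: (challenge, passed, total).
runs = [
    ("Compromised Server (Easy)", 26, 30),
    ("Compromised Server (Easy)", 8, 30),
    ("Compromised Server (Hard)", 22, 33),
]


def pass_rate(passed: int, total: int) -> float:
    """Percentage of verification tests that passed."""
    return 100.0 * passed / total


for name, passed, total in runs:
    print(f"{name}: {passed}/{total} = {pass_rate(passed, total):.1f}%")
# Compromised Server (Easy): 26/30 = 86.7%
# Compromised Server (Easy): 8/30 = 26.7%
# Compromised Server (Hard): 22/33 = 66.7%
```

The two Easy runs differ by 18 of 30 tests, a 60-point gap from the same model and agent.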
Clone the repo. Pick a model. Pick a task. See what your agent can actually do.