ADR-0019: Swarm End-to-End Evaluations (Sandbox Benchmarking)

  • Status: Accepted
  • Date: 2026-03-26
  • Deciders: DeAcero Agentic Team
  • Supersedes: None

Context

Individual skill evaluations (e.g., standard evals.json unit prompting) test an AI's capability in isolation. However, Cornerstone is an intertwined ecosystem of autonomous agents (Archaeologist, Architect, Technical Writer, GitOps Expert) collaborating to modernize legacy systems. We need a systemic method to evaluate the emergent behavior and stability of the entire Swarm without modifying production code.

Decision

We will build the Swarm E2E Evaluator (tools/run_swarm_eval.py).

  1. The Sandbox Fixture: A dummy polyglot legacy system — SQL datastores, COBOL business rules, and C# monoliths — will reside in tests/e2e_swarm/legacy_app/.
  2. The Execution Engine: The evaluator script will generate a sterile project with cookiecutter, inject the legacy code plus an objective.md prompt, and run the Agentic Swarm with full autonomy (--permission-mode auto).
  3. The Grading Rubric: The swarm is graded passively by the existing CI tools. The modernization is judged successful only if the spawned project passes import-linter (architecture compliance), check_adr_gate.py (governance compliance), and pytest (business-logic correctness).
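The passive grading step could be sketched as follows. This is an illustrative outline, not the actual run_swarm_eval.py: the gate commands follow the tools named above, but the function names and the gate dictionary are assumptions.

```python
import subprocess

# Hypothetical gate table for the grading rubric. Each gate is a CI
# command executed inside the spawned project; the command names mirror
# the ADR (import-linter's CLI, check_adr_gate.py, pytest), but this
# mapping itself is an assumption for illustration.
GATES = {
    "architecture": ["lint-imports"],
    "governance": ["python", "tools/check_adr_gate.py"],
    "business_logic": ["pytest", "-q"],
}


def grade_project(project_dir, gates=GATES):
    """Run every CI gate inside project_dir; report pass/fail per gate."""
    results = {}
    for name, cmd in gates.items():
        proc = subprocess.run(cmd, cwd=project_dir, capture_output=True)
        # A gate passes only when its process exits with status 0.
        results[name] = proc.returncode == 0
    return results


def swarm_passed(results):
    """The swarm succeeds only if every gate passed."""
    return all(results.values())
```

Because grading relies solely on exit codes of tools the project already runs in CI, the evaluator never needs to inspect or modify the swarm's output directly.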

Consequences

Positive

  • Ensures that updates to .agents/skills or template architecture do not inadvertently break the Swarm's capacity to collaborate.
  • Measures token/time efficiency of modernization tasks across different LLM models (Claude vs. Gemini).
  • Explicitly tests the multi-domain routing capability of the Swarm.
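To make the cross-model efficiency comparison concrete, each run could record wall-clock time and token usage in a small per-run record. The schema below is entirely illustrative; the ADR does not specify how metrics are stored.

```python
import time
from dataclasses import dataclass, field


@dataclass
class EvalRunMetrics:
    """Illustrative per-run record for comparing models (e.g. Claude
    vs. Gemini). Field names are assumptions, not part of the ADR."""
    model: str
    tokens_used: int = 0
    wall_seconds: float = 0.0
    _started_at: float = field(default_factory=time.monotonic)

    def finish(self, tokens_used):
        """Close out the run, capturing elapsed time and token spend."""
        self.tokens_used = tokens_used
        self.wall_seconds = time.monotonic() - self._started_at
        return self
```

A benchmarking harness would create one record per (model, objective) pair and compare the resulting tokens_used and wall_seconds columns across models.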

Negative

  • Running an E2E Evaluation consumes real API tokens and time (typically >10 minutes per run). It is reserved for manual benchmarking rather than per-commit CI checks.