ADR-0001: Adopt a Central Telemetry Microservice for Cross-Team AI Agent Observability
Status: accepted
Deciders: DeAcero Platform Team, AI Agent Guild
Date: 2026-03-17
Context and Problem Statement
Multiple teams generate Python projects from the Cornerstone template. Each project runs AI agents (Claude, Gemini, or any LLM) inside CI pipelines, invoking skills and discovery tools. Today there is no shared visibility into how agents behave across teams — which skills are used most, which models are called, what it costs in USD, whether ADR gates pass or fail, or when new knowledge artifacts are created.
The objective is to unify AI agent usage horizontally across all DeAcero development teams. Without a telemetry system, this mandate cannot be measured.
Decision Drivers
- Teams must never be required to send telemetry — activation is controlled via env var; absence means silent no-op
- Must support Claude (Anthropic), Gemini (Vertex AI), and any LLM agnostically via OpenTelemetry/OpenLLMetry
- Minimal instrumentation burden per team — SDK auto-instruments with decorators and hooks
- Central service must be self-hosted (no third-party SaaS; data residency stays within DeAcero infrastructure)
- Generated projects must function fully without a running observability server
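The consent model in the drivers above can be sketched in stdlib-only Python. This is illustrative, not the final SDK; the env var name `CORNERSTONE_OBSERVABILITY_URL` and the `/v1/events` path are assumptions.

```python
# Sketch of env-var-gated telemetry: absence of the variable means a silent
# no-op, so generated projects run fully without an observability server.
import json
import os
import urllib.request


def emit_event(event_type: str, payload: dict) -> bool:
    """Send a telemetry event, or silently no-op when no URL is configured."""
    url = os.environ.get("CORNERSTONE_OBSERVABILITY_URL")  # hypothetical name
    if not url:
        return False  # consent not given: no network call, zero side effects

    body = json.dumps({"event_type": event_type, **payload}).encode()
    req = urllib.request.Request(
        f"{url}/v1/events",  # assumed ingest path
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False  # telemetry must never break a CI pipeline
```

Because the check happens before any network activity, a team that never sets the variable pays no latency and needs no code change.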
Considered Options
- Central FastAPI + PostgreSQL microservice with thin SDK in generated template
- Managed SaaS (Langfuse, Honeycomb, Datadog)
- Pure OpenTelemetry Collector without a custom service
- File-based telemetry (append JSONL locally, no central aggregation)
Decision Outcome
Chosen option: Option 1 — Central FastAPI + PostgreSQL microservice with thin SDK.
Rationale: SaaS solutions (Option 2) create vendor lock-in and raise data residency concerns for DeAcero. A pure OTEL Collector (Option 3) cannot natively represent the custom event schema required (e.g., `skill.invoked` with model and cost, `knowledge.created`, `project.generated`). File-based telemetry (Option 4) cannot support cross-team aggregation or a web dashboard. Option 1 satisfies every driver: env-var activation, stdlib-only SDK, self-hosted service.
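To make the "custom event schema" argument concrete, one plausible shape for a `skill.invoked` event is shown below. Field names and values are illustrative assumptions, not the final schema.

```python
# Illustrative payload for a custom event that plain OTEL spans do not model
# natively: skill usage with model identity and per-call cost attribution.
import json

skill_invoked = {
    "schema_version": 1,           # see Negative Consequences: schemas evolve
    "event_type": "skill.invoked",
    "team": "platform",            # hypothetical team slug
    "project": "demo-service",     # hypothetical project slug
    "skill": "generate-adr",
    "model": "claude-sonnet",      # Claude, Gemini, or any LLM
    "cost_usd": 0.0042,            # enables cost attribution per team/model
}

# Events serialize to JSON for the central FastAPI ingest endpoint.
encoded = json.dumps(skill_invoked)
```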
Positive Consequences
- Single pane of glass for all AI agent activity across teams
- Cost attribution per team, project, and model becomes measurable
- `knowledge.created`/`knowledge.used` events enable measuring ROI of the ADR-first mandate
- Teams with no observability URL set are unaffected (zero code change required)
Negative Consequences
- DeAcero must operate the PostgreSQL + FastAPI service (uptime, migrations, backups)
- SDK must maintain backward compatibility as event schemas evolve (versioned via a `schema_version` field)
- Air-gapped CI environments cannot send telemetry even if desired
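The `schema_version` field lets the service accept events from SDKs of different ages. A minimal sketch of tolerant, versioned parsing on the service side (hypothetical helper and version semantics):

```python
# Tolerant parsing: unknown fields are ignored rather than rejected, and
# missing fields get defaults, so old SDKs keep working as schemas evolve.
def parse_event(raw: dict) -> dict:
    version = raw.get("schema_version", 1)  # pre-versioning events default to 1
    event = {"schema_version": version, "event_type": raw["event_type"]}
    if version >= 2:
        # Hypothetical: v2 added cost attribution; older events default to 0.
        event["cost_usd"] = raw.get("cost_usd", 0.0)
    return event
```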
Pros and Cons of the Options
Option 1: Central FastAPI + PostgreSQL
- Good, because fully self-hosted and controlled by DeAcero
- Good, because event schema matches exactly what is needed
- Good, because consent model is a single env var — no code change in project
- Bad, because DeAcero must maintain the service
Option 2: Managed SaaS
- Good, because zero infrastructure to maintain
- Bad, because data residency and vendor lock-in are unacceptable for DeAcero
Option 3: OpenTelemetry Collector only
- Good, because standards-based and multi-vendor
- Bad, because OTEL spans do not map cleanly to custom events such as `project.generated` or `knowledge.created`
Option 4: File-based telemetry
- Good, because zero infrastructure, works offline
- Bad, because cross-team aggregation and dashboarding are impossible
Implementation
- SDK placement in template: `{{cookiecutter.project_slug}}/.telemetry/`
- Central service: `services/observability/` (FastAPI + PostgreSQL + Docker Compose)
- Dashboard: static HTML (Tier 1) + optional Grafana (Tier 2)
- CLI: `cornerstone report` subcommand
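The decision drivers call for decorator-based auto-instrumentation with minimal burden per team. A minimal sketch of what that could look like (decorator name `track_skill` is hypothetical; a `print` stands in for the no-op-aware event emitter):

```python
# Hypothetical decorator: wraps a skill function and records a skill.invoked
# event with its duration, without any change to the function body itself.
import functools
import time


def track_skill(skill_name: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            duration_ms = (time.monotonic() - start) * 1000
            # The real SDK would emit through the env-var-gated sender here.
            print(f"skill.invoked skill={skill_name} duration_ms={duration_ms:.1f}")
            return result
        return wrapper
    return decorator


@track_skill("generate-adr")
def generate_adr(title: str) -> str:
    return f"ADR: {title}"
```

Teams annotate their skill entry points once; whether the event actually leaves the machine remains controlled solely by the env var.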