ADR-0001: Adopt a Central Telemetry Microservice for Cross-Team AI Agent Observability
Status: accepted
Deciders: DeAcero Platform Team, AI Agent Guild
Date: 2026-03-17
Context and Problem Statement
Multiple teams generate Python projects from the Cornerstone template. Each project runs AI agents (Claude, Gemini, or any LLM) inside CI pipelines, invoking skills and discovery tools. Today there is no shared visibility into how agents behave across teams — which skills are used most, which models are called, what it costs in USD, whether ADR gates pass or fail, or when new knowledge artifacts are created.
The objective is to unify AI agent usage horizontally across all DeAcero development teams. Without a telemetry system, this mandate cannot be measured.
Decision Drivers
- Teams must never be required to send telemetry — activation is controlled via env var; absence means silent no-op
- Must support Claude (Anthropic), Gemini (Vertex AI), and any LLM agnostically via OpenTelemetry/OpenLLMetry
- Minimal instrumentation burden per team — SDK auto-instruments with decorators and hooks
- Central service must be self-hosted (no third-party SaaS; data residency stays within DeAcero infrastructure)
- Generated projects must function fully without a running observability server
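The consent model in the drivers above can be sketched in stdlib-only Python. This is illustrative, not the final SDK; the env var name `CORNERSTONE_OBSERVABILITY_URL` and the `/v1/events` path are assumptions.

```python
# Sketch of env-var-gated telemetry: absence of the variable means a silent
# no-op, so generated projects run fully without an observability server.
import json
import os
import urllib.request


def emit_event(event_type: str, payload: dict) -> bool:
    """Send a telemetry event, or silently no-op when no URL is configured."""
    url = os.environ.get("CORNERSTONE_OBSERVABILITY_URL")  # hypothetical name
    if not url:
        return False  # consent not given: no network call, zero side effects

    body = json.dumps({"event_type": event_type, **payload}).encode()
    req = urllib.request.Request(
        f"{url}/v1/events",  # assumed ingest path
        data=body,
        headers={"Content-Type": "application/json"},
    )
    try:
        with urllib.request.urlopen(req, timeout=2) as resp:
            return 200 <= resp.status < 300
    except OSError:
        return False  # telemetry must never break a CI pipeline
```

Because the check happens before any network activity, a team that never sets the variable pays no latency and needs no code change.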
Considered Options
- Central FastAPI + PostgreSQL microservice with thin SDK in generated template
- Managed SaaS (Langfuse, Honeycomb, Datadog)
- Pure OpenTelemetry Collector without a custom service
- File-based telemetry (append JSONL locally, no central aggregation)
Decision Outcome
Chosen option: Option 1 — Central FastAPI + PostgreSQL microservice with thin SDK.
Rationale: SaaS solutions (Option 2) create vendor lock-in and raise data residency concerns for DeAcero. A pure OTEL Collector (Option 3) cannot natively represent the custom event schema required (e.g., `skill.invoked` with model and cost, `knowledge.created`, `project.generated`). File-based telemetry (Option 4) cannot support cross-team aggregation or a web dashboard. Option 1 satisfies every driver: env-var activation, stdlib-only SDK, self-hosted service.
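To make the "custom event schema" argument concrete, one plausible shape for a `skill.invoked` event is shown below. Field names and values are illustrative assumptions, not the final schema.

```python
# Illustrative payload for a custom event that plain OTEL spans do not model
# natively: skill usage with model identity and per-call cost attribution.
import json

skill_invoked = {
    "schema_version": 1,           # see Negative Consequences: schemas evolve
    "event_type": "skill.invoked",
    "team": "platform",            # hypothetical team slug
    "project": "demo-service",     # hypothetical project slug
    "skill": "generate-adr",
    "model": "claude-sonnet",      # Claude, Gemini, or any LLM
    "cost_usd": 0.0042,            # enables cost attribution per team/model
}

# Events serialize to JSON for the central FastAPI ingest endpoint.
encoded = json.dumps(skill_invoked)
```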
Positive Consequences
- Single pane of glass for all AI agent activity across teams
- Cost attribution per team, project, and model becomes measurable
- `knowledge.created`/`knowledge.used` events enable measuring ROI of the ADR-first mandate
- Teams with no observability URL set are unaffected (zero code change required)
Negative Consequences
- DeAcero must operate the PostgreSQL + FastAPI service (uptime, migrations, backups)
- SDK must maintain backward compatibility as event schemas evolve (versioned via a `schema_version` field)
- Air-gapped CI environments cannot send telemetry even if desired
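The `schema_version` field lets the service accept events from SDKs of different ages. A minimal sketch of tolerant, versioned parsing on the service side (hypothetical helper and version semantics):

```python
# Tolerant parsing: unknown fields are ignored rather than rejected, and
# missing fields get defaults, so old SDKs keep working as schemas evolve.
def parse_event(raw: dict) -> dict:
    version = raw.get("schema_version", 1)  # pre-versioning events default to 1
    event = {"schema_version": version, "event_type": raw["event_type"]}
    if version >= 2:
        # Hypothetical: v2 added cost attribution; older events default to 0.
        event["cost_usd"] = raw.get("cost_usd", 0.0)
    return event
```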
Pros and Cons of the Options
Option 1: Central FastAPI + PostgreSQL
- Good, because fully self-hosted and controlled by DeAcero
- Good, because event schema matches exactly what is needed
- Good, because consent model is a single env var — no code change in project
- Bad, because DeAcero must maintain the service
Option 2: Managed SaaS
- Good, because zero infrastructure to maintain
- Bad, because data residency and vendor lock-in are unacceptable for DeAcero
Option 3: OpenTelemetry Collector only
- Good, because standards-based and multi-vendor
- Bad, because OTEL spans do not map cleanly to custom events such as `project.generated` or `knowledge.created`
Option 4: File-based telemetry
- Good, because zero infrastructure, works offline
- Bad, because cross-team aggregation and dashboarding are impossible
Implementation
- SDK placement in template: `{{cookiecutter.project_slug}}/.telemetry/`
- Central service: `services/observability/` (FastAPI + PostgreSQL + Docker Compose)
- Dashboard: static HTML (Tier 1) + optional Grafana (Tier 2)
- CLI: `cornerstone report` subcommand
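The decision drivers call for decorator-based auto-instrumentation with minimal burden per team. A minimal sketch of what that could look like (decorator name `track_skill` is hypothetical; a `print` stands in for the no-op-aware event emitter):

```python
# Hypothetical decorator: wraps a skill function and records a skill.invoked
# event with its duration, without any change to the function body itself.
import functools
import time


def track_skill(skill_name: str):
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.monotonic()
            result = fn(*args, **kwargs)
            duration_ms = (time.monotonic() - start) * 1000
            # The real SDK would emit through the env-var-gated sender here.
            print(f"skill.invoked skill={skill_name} duration_ms={duration_ms:.1f}")
            return result
        return wrapper
    return decorator


@track_skill("generate-adr")
def generate_adr(title: str) -> str:
    return f"ADR: {title}"
```

Teams annotate their skill entry points once; whether the event actually leaves the machine remains controlled solely by the env var.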