Metrics and Success Tracking — Handbook

Overview

Traditional engineering metrics — lines of code, story points completed, velocity charts — measure human productivity in a human-paced workflow. Agentic development introduces a fundamentally different execution model where AI agents generate the bulk of code and humans focus on specification, orchestration, and governance. This page defines four metric pillars purpose-built for measuring agentic team performance, plus a framework for addressing the talent development challenges that arise when agents absorb junior-level work.

Why Traditional Metrics Fall Short

Story points measure the perceived difficulty of work as estimated by human developers. Lines of code measure volume of output. Neither captures what matters in an agentic team:

Story points assume that the person estimating will also execute the work. When an agent executes, the estimation model breaks — a task that takes a human three days might take an agent fifteen minutes, but the context engineering to enable that execution might take two hours.
Lines of code reward verbosity. An agent can generate thousands of lines in minutes. The question is not how much code was produced, but whether that code delivered business value without introducing architectural debt.
Velocity tracks throughput of human-estimated work. In an agentic team, throughput depends on context quality, agent reliability, and pipeline flow — none of which velocity captures.

The four pillars below replace these legacy metrics with measurements aligned to how agentic teams actually operate.

Pillar I: Leverage and Autonomy Metrics

Leverage metrics answer the fundamental question: how much productive work does each human enable through agent orchestration?

Operator Leverage Ratio

The Operator Leverage Ratio measures the number of concurrent agent sequences a single human supervisor can effectively manage.

Calculation: Active agent sequences / Human supervisors

Target progression:

Starting out: 1:1 — One agent per operator while the team builds context engineering skills and establishes workflows.
Intermediate: 1:3 to 1:5 — Operators manage multiple agents as specs improve and agent recoveries become less frequent.
At scale: 1:5 to 1:10 — High-quality context, mature Llmops practices, and reliable evaluation harnesses allow operators to supervise larger agent fleets.

A ratio that plateaus below target indicates bottlenecks in context quality, agent reliability, or review throughput. Investigate which factor is constraining scale.

Correction Ratio

The Correction Ratio tracks how frequently Human In The Loop intervention is required during agent execution.

Calculation: Number of human interventions / Total agent task completions

Interpretation:

Low ratio (below 0.1) — The agent is completing tasks with minimal human input. Context quality and architectural constraints are effective.
Moderate ratio (0.1 to 0.3) — Normal for complex feature work. Some agent recoveries are expected.
High ratio (above 0.3) — The agent requires correction on more than 30% of tasks. This signals a systemic issue: refine the Live Specs, enrich the Context Index, or tighten architectural constraints. The problem is almost always in the inputs, not the agent.

Track Correction Ratio per task type. A high ratio for one category (e.g., database migration tasks) with low ratios elsewhere points to a specific context gap rather than a general quality problem.

Spec-to-Code Ratio (SCR)

The Spec-to-Code Ratio measures the percentage of initial requirements that result in a functional pull request without requiring human rewrite of the generated code.

Calculation: PRs merged without human code changes / Total PRs submitted by agents

Target: Above 0.7 for mature teams. Below 0.5 indicates that specifications are not detailed enough for agents to execute faithfully.

SCR is the single most actionable metric for the Context Architect. When it drops, the root cause is almost always in the spec, not the agent. Common culprits: ambiguous acceptance criteria, missing edge cases, and stale Golden Samples that no longer reflect current code patterns.

Pillar II: Efficiency and Velocity Metrics

Efficiency metrics measure how effectively the pipeline converts compute resources into delivered value.

Mean Time to Unblock (MTTU)

Mean Time to Unblock measures the latency between an agent raising a Blocker Flag (hitting a retry limit, encountering an unresolvable error, or requesting human input) and a human providing the context or decision needed to resume execution.

Calculation: Average time from Blocker Flag to human intervention across all blocked agent runs

Target: Under 30 minutes during working hours. Every minute of MTTU is a minute of wasted compute (the agent is consuming resources while blocked) and blocked pipeline throughput.

Improvement levers:

Reduce frequency — Better specs and context reduce the number of blockers agents encounter.
Reduce detection latency — Real-time alerting on the AgentOps Dashboard ensures operators see blockers immediately, not at the next Daily Flow Sync.
Reduce resolution latency — Documenting common blocker patterns and their resolutions allows operators to respond faster.

Flow Efficiency

Flow Efficiency measures the ratio of active compute time (the agent is executing, generating code, running tests) to total wall-clock time (including wait states, blocked time, and queue time).

Calculation: Active agent compute time / Total wall-clock time from task assignment to PR submission

Target: Above 0.6. A Flow Efficiency below 0.4 means the agent spends more time waiting than working — the bottleneck is in human processes (review queues, context preparation, approval gates), not agent execution speed.

Common drag factors:

Review queue buildup — Agent Operators cannot review PRs as fast as agents produce them. Solution: increase review capacity or expand autonomous merge authority for low-risk tasks.
Context preparation delays — Context Packets are not ready when agents are available. Solution: maintain a buffer of prepared Context Packets.
Infrastructure wait times — Agent sandbox provisioning or test environment setup takes too long. Solution: pre-warm execution environments.

Pillar III: Economic and FinOps Metrics

Economic metrics connect agent execution to business outcomes. They answer the question every executive will ask: is this investment paying off?

Metric	Business Impact	Tactical Goal
Cost per Feature	Predicts R&D margins accurately.	Reduce via long-term memory caching.
Token Budget Management	Prevents "runaway" recursive loops.	Implement hard-stops at the Orchestration Layer.
Blended Efficiency	Justifies the ROI of the AI infrastructure.	Maintain lower cost than manual outsourced dev.

Cost per Feature

Track the fully loaded cost of delivering each feature through agent execution: API token consumption, compute infrastructure, human supervision time, and review effort.

Why it matters: Cost per Feature is the metric that determines whether agentic development is economically viable for your organization. If agent-delivered features cost more than human-delivered features (including the human's salary, benefits, and overhead), the investment is not paying off.

Reduction strategies:

Long-term memory caching — Store and reuse context that agents need repeatedly (architectural rules, domain glossaries, API schemas) rather than re-injecting it on every run.
Task routing optimization — Route only tasks with positive ROI to agents. Some tasks are cheaper to do manually — typically one-off tasks with high ambiguity and low repetition.
Model selection — Use smaller, cheaper models for routine tasks (formatting, simple refactors) and reserve large models for complex feature work.

Token Budget Management

The Token Budget is the maximum compute spend authorized per time period. It functions as a circuit breaker that prevents the most expensive failure mode in agentic systems: recursive agent loops that consume thousands of dollars in tokens while producing no value.

Implementation:

Set hard-stop limits at the Orchestration Layer. When an agent exceeds its token allocation for a single task, execution halts and a Blocker Flag is raised.
Track consumption in real time on the AgentOps Dashboard.
Review budget utilization weekly during Context and Allocation Planning.

Blended Efficiency

Blended Efficiency compares the total cost of your hybrid human-agent team against the cost of an equivalent fully-human team delivering the same output.

Calculation: Total agentic team cost (salaries + compute + infrastructure) / Estimated cost of equivalent human-only team

Target: Below 1.0 — meaning the agentic team delivers the same output for less total cost. Early-stage teams may run above 1.0 while building context engineering maturity. If Blended Efficiency remains above 1.0 after six months, re-evaluate whether the organization's work profile is suited for agentic execution.

Pillar IV: Quality and Governance Metrics

Quality metrics ensure that speed and volume do not come at the expense of structural integrity. These are the Guardrails that prevent agentic teams from accumulating architectural debt faster than they deliver value.

Architectural Violation Rate

The Architectural Violation Rate tracks how often agent-generated code attempts to violate domain boundaries, dependency rules, or structural constraints defined by the Principal Systems Architect.

Calculation: Architecture test failures / Total agent PRs submitted

Interpretation:

Below 0.05 — Agents are operating within architectural boundaries. Context and constraints are well-defined.
0.05 to 0.15 — Moderate violation rate. Review the architectural rules agents receive and update Golden Samples.
Above 0.15 — Agents are frequently violating boundaries. This indicates a systemic problem: either the architectural rules are not included in Context Packets, or they are too ambiguous for agents to follow.

Pattern Consistency Score

The Pattern Consistency Score measures how closely agent-generated code adheres to the Golden Samples and established patterns defined by the Principal Systems Architect.

Assessment methods:

Automated — Static analysis tools compare generated code structure, naming conventions, and dependency patterns against Golden Sample templates.
LLM-as-a-Judge — A secondary LLM evaluates generated code against a rubric derived from Golden Samples, scoring adherence on dimensions like naming, structure, error handling, and documentation.
Human review sampling — Agent Operators periodically deep-review a random sample of merged PRs specifically for pattern adherence.

Target: Above 0.8. Scores below 0.7 indicate that Golden Samples need updating or that Context Packets are not including them consistently.

The False Signal: Agent Pass Rate Without Violation Tracking

A common implementation pitfall is tracking "Agent Pass Rate" (the percentage of agent-generated PRs that pass automated tests) as the primary quality metric while ignoring Architectural Violation Rate.

This creates a dangerous false signal. An agent can produce code that passes every functional test while silently violating architectural principles — creating cross-domain coupling, bypassing the event bus for direct database writes, or introducing circular dependencies. These violations do not break tests today, but they compound into structural debt that becomes exponentially expensive to fix.

Always pair Agent Pass Rate with Architectural Violation Rate and Pattern Consistency Score. High pass rates with high violation rates indicate that your test suite is incomplete, not that your agents are performing well.

What Comes Next

These metrics and talent frameworks complete the operational layer of the Agentic Development Framework. With team structure, governance routines, and success metrics defined, you have the full picture of how to build and run an agentic development team — from strategic planning through daily execution to performance measurement.