Design Patterns

Spec Engineering

How to write effective Live Specs that produce high Spec-to-Code Ratios and minimize agent intervention.

Overview

Spec Engineering is the discipline of writing Live Spec documents that agents can execute with minimal intervention. A well-engineered spec produces a high Spec-to-Code Ratio — the agent's output closely follows the specification rather than improvising. A poorly engineered spec produces hallucinations, missed edge cases, and looping, regardless of how capable the model is.

This pattern provides a structured methodology for authoring specs. It moves teams from "write some requirements and hope the agent figures it out" to a repeatable process: start from executable acceptance criteria, decompose into testable behaviors, attach Golden Samples and architectural constraints, and iterate based on measured outcomes. The Context Architect owns this process, but every engineer who writes specs benefits from the methodology.

Spec engineering is the highest-leverage activity in an agentic development workflow. A 30-minute investment in a precise spec saves hours of agent retries, Rescue Mission escalations, and human correction. Teams that measure Spec-to-Code Ratio consistently find that spec quality — not model selection, not prompt engineering tricks — is the dominant factor in agent output quality.

Problem

Most teams write specs that are readable by humans but ambiguous to agents:

  • Prose-heavy requirements. Specs describe what the feature should do in narrative form. A human reader infers the unstated details from experience. An agent makes no such inference — it either invents the details (hallucination) or asks clarifying questions (delays).

  • Untestable acceptance criteria. Criteria like "the component should handle errors gracefully" or "the API should be performant" cannot be validated by an Eval Harness. The agent has no way to verify its own output against these criteria, and neither does the automated gate.

  • Missing boundary conditions. Specs describe the happy path but omit error handling, empty states, concurrent access, edge cases with special characters, and boundary values. Agents either ignore these cases or invent handling that does not match team expectations.

  • Implicit architectural knowledge. The spec says "create a new service" but does not specify which service layer pattern to follow, which directory to place files in, which naming conventions to use, or which dependencies are allowed. The agent makes choices that may violate architectural rules not captured in the spec.

  • No feedback loop. Teams do not measure spec quality systematically. A spec that caused three rescue missions looks identical in the backlog to a spec that the agent nailed on the first attempt. Without Spec-to-Code Ratio and Correction Ratio data tied back to individual specs, there is no signal for improvement.

Solution

Apply a six-step spec engineering methodology that produces machine-readable, testable, self-contained specifications.

The Six Steps

  1. Start with acceptance criteria — Write executable, testable conditions before anything else.
  2. Write the behavioral contract — Define inputs, outputs, edge cases, and error handling.
  3. Define the system constitution reference — Link to coding standards, architectural constraints, and security rules.
  4. Decompose into an actionable task map — Break the spec into ordered subtasks the agent can execute sequentially.
  5. Attach Golden Samples and context references — Provide concrete examples of expected output quality.
  6. Review and validate before agent execution — Treat specs like code: review them, version them, and test them.
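
A minimal sketch of the artifacts these steps produce, modeled as TypeScript types (every type and field name here is illustrative, not a prescribed schema):

```typescript
// Illustrative model of a Live Spec's parts, one per step of the methodology.
interface AcceptanceCriterion {                // Step 1
  id: string;                                  // e.g. "AC-1"
  criterion: string;                           // executable, testable condition
  validation: "unit-test" | "integration-test" | "manual-check";
}

interface SpecTask {                           // Step 4
  id: number;
  deliverable: string;                         // file or directory the task produces
  validates: string[];                         // AC ids this task satisfies
  dependencies: number[];                      // task ids that must finish first
}

interface LiveSpec {
  acceptanceCriteria: AcceptanceCriterion[];   // Step 1
  behavioralContract: string;                  // Step 2: inputs, outputs, errors
  constitutionRef: string;                     // Step 3: standards/constraints link
  taskMap: SpecTask[];                         // Step 4
  goldenSamples: string[];                     // Step 5: example file paths
  reviewedBy: string | null;                   // Step 6: null until reviewed
}

// Step 6 gate: ready for agent execution only when every acceptance
// criterion is covered by at least one task and the spec was reviewed.
function specIsReady(spec: LiveSpec): boolean {
  const covered = new Set(spec.taskMap.flatMap((t) => t.validates));
  return (
    spec.reviewedBy !== null &&
    spec.acceptanceCriteria.every((ac) => covered.has(ac.id))
  );
}
```

A spec that fails `specIsReady` goes back to review rather than to an agent.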

Implementation


Code Examples

Spec Quality Scoring
// scripts/score-spec.ts
interface SpecQualityScore {
  acceptanceCriteriaScore: number;   // 0-25
  behavioralContractScore: number;   // 0-25
  contextCompletenessScore: number;  // 0-25
  taskDecompositionScore: number;    // 0-25
  total: number;                     // 0-100
  grade: "A" | "B" | "C" | "D" | "F";
  suggestions: string[];
}

function scoreSpec(specContent: string): SpecQualityScore {
  const suggestions: string[] = [];

  // Score acceptance criteria
  let acScore = 0;
  const acTable = specContent.match(/\| AC-\d+.*\|/g);
  if (acTable && acTable.length > 0) {
    acScore += 10; // Has structured ACs
    if (acTable.every((row) => row.includes("-test") || row.includes("-check"))) {
      acScore += 10; // All ACs have validation methods
    } else {
      suggestions.push(
        "Some acceptance criteria lack validation methods"
      );
    }
    if (acTable.length >= 5) {
      acScore += 5; // Sufficient coverage
    } else {
      suggestions.push(
        "Consider adding more acceptance criteria for edge cases"
      );
    }
  } else {
    suggestions.push(
      "Add structured acceptance criteria with IDs and validation methods"
    );
  }

  // Score behavioral contract
  let bcScore = 0;
  if (specContent.includes("**Inputs:**")) bcScore += 5;
  else suggestions.push("Add an Inputs table to the behavioral contract");
  if (specContent.includes("**Outputs (success):**")) bcScore += 5;
  else suggestions.push("Add a success Outputs table");
  if (specContent.includes("**Outputs (error):**")) bcScore += 5;
  else suggestions.push("Add an error Outputs table");
  if (specContent.includes("**Behavior — Normal Flow:**")) bcScore += 5;
  else suggestions.push("Add normal flow behavior description");
  if (specContent.includes("**Behavior — Error Flow:**")) bcScore += 5;
  else suggestions.push("Add error flow behavior description");

  // Score context completeness
  let ctxScore = 0;
  if (specContent.includes("## System Constitution")) ctxScore += 5;
  else suggestions.push("Add System Constitution reference");
  if (specContent.includes("Golden Sample")) ctxScore += 10;
  else suggestions.push("Add golden sample references");
  if (specContent.includes("Context Packet")) ctxScore += 5;
  else suggestions.push("Add context packet file listing");
  if (specContent.includes("token") || specContent.includes("budget"))
    ctxScore += 5;
  else suggestions.push("Add token budget estimate");

  // Score task decomposition
  let tdScore = 0;
  const taskMatches = specContent.match(/### Task \d+/g);
  if (taskMatches && taskMatches.length > 0) {
    tdScore += 10;
    if (specContent.includes("**Dependencies:**")) tdScore += 5;
    if (specContent.includes("**Validates:**")) tdScore += 5;
    if (specContent.includes("**Golden Sample:**")) tdScore += 5;
  } else {
    suggestions.push("Decompose spec into ordered subtasks with dependencies");
  }

  const total = acScore + bcScore + ctxScore + tdScore;
  const grade =
    total >= 85
      ? "A"
      : total >= 70
        ? "B"
        : total >= 55
          ? "C"
          : total >= 40
            ? "D"
            : "F";

  return {
    acceptanceCriteriaScore: acScore,
    behavioralContractScore: bcScore,
    contextCompletenessScore: ctxScore,
    taskDecompositionScore: tdScore,
    total,
    grade,
    suggestions,
  };
}
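
The score can feed a dispatch gate that blocks agent execution on low-quality specs. A minimal sketch, assuming a score shape like `SpecQualityScore` above (`gateSpec` and its threshold are illustrative, not part of the scorer):

```typescript
// Hypothetical gate: refuse to dispatch a spec to an agent when its quality
// score falls below a configurable threshold, and surface the suggestions.
interface ScoreLike {
  total: number;
  grade: string;
  suggestions: string[];
}

function gateSpec(
  score: ScoreLike,
  minTotal = 70 // assumed threshold: grade B or better
): { pass: boolean; report: string } {
  const pass = score.total >= minTotal;
  const lines = [
    `Spec quality: ${score.total}/100 (grade ${score.grade})`,
    // On failure, include the scorer's suggestions as action items.
    ...(pass ? [] : score.suggestions.map((s) => `  - ${s}`)),
  ];
  return { pass, report: lines.join("\n") };
}
```

Wired into CI, the report lands on the pull request that proposed the spec, so authors see exactly which section to strengthen before any agent tokens are spent.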

Before and After Spec Comparison
# BEFORE: Vague spec that causes agent failures
# ------------------------------------------------
feature: User Authentication
description: Add login functionality
requirements:
  - Users can log in with email and password
  - Show error on invalid credentials
  - Use JWT for sessions
notes: Follow existing patterns

# AFTER: Engineered spec that agents execute reliably
# ------------------------------------------------
feature: User Authentication — Login Endpoint
spec_version: 1
status: ready
author: "@context-architect"

behavioral_contract:
  purpose: "Authenticate user by email/password, return signed JWT"
  inputs:
    - name: email
      type: string
      constraints: "RFC 5322 email format"
      required: true
    - name: password
      type: string
      constraints: "8-128 characters"
      required: true
  outputs_success:
    - name: token
      type: string
      description: "RS256-signed JWT, 24h expiry, contains userId/email/role"
    - name: user
      type: object
      description: "{ id: string, email: string, role: string }"
  outputs_error:
    - status: 401
      code: AUTH_INVALID_CREDENTIALS
      when: "Email exists but password does not match"
    - status: 401
      code: AUTH_USER_NOT_FOUND
      when: "No user with provided email"
    - status: 429
      code: AUTH_RATE_LIMITED
      when: "5+ failures from same IP in 15 minutes"
    - status: 400
      code: AUTH_INVALID_INPUT
      when: "Email or password fails format validation"

acceptance_criteria:
  - id: AC-1
    criterion: "Returns 200 with JWT on valid credentials"
    validation: integration-test
    priority: must-have
  - id: AC-2
    criterion: "Returns 401 AUTH_INVALID_CREDENTIALS on wrong password"
    validation: integration-test
    priority: must-have
  - id: AC-3
    criterion: "Returns 429 after 5 failed attempts from same IP in 15min"
    validation: integration-test
    priority: must-have
  - id: AC-4
    criterion: "JWT uses RS256, contains userId/email/role, 24h expiry"
    validation: unit-test
    priority: must-have
  - id: AC-5
    criterion: "bcrypt timing-safe comparison for passwords"
    validation: unit-test
    priority: must-have

system_constitution_ref: "context/constitution/coding-standards.md"
golden_samples:
  - path: "src/users/services/user-service.ts"
    demonstrates: "Service pattern, error handling"
  - path: "src/users/__tests__/user-service.test.ts"
    demonstrates: "Test structure, mocking"

context_packet:
  - "prisma/schema.prisma"
  - "src/shared/errors/app-error.ts"
  - "docs/api/auth-spec.yaml"

task_map:
  - id: 1
    name: "Input validation schema"
    deliverable: "src/auth/schemas/login-schema.ts"
    validates: [AC-4]
    dependencies: []
  - id: 2
    name: "Auth service login method"
    deliverable: "src/auth/services/auth-service.ts"
    validates: [AC-1, AC-2, AC-3, AC-4, AC-5]
    dependencies: [1]
  - id: 3
    name: "Auth controller login endpoint"
    deliverable: "src/auth/controllers/auth-controller.ts"
    validates: [AC-1, AC-2, AC-3]
    dependencies: [2]
  - id: 4
    name: "Rate limiting middleware"
    deliverable: "src/auth/middleware/rate-limiter.ts"
    validates: [AC-3]
    dependencies: []
  - id: 5
    name: "Tests for all acceptance criteria"
    deliverable: "src/auth/__tests__/"
    validates: [AC-1, AC-2, AC-3, AC-4, AC-5]
    dependencies: [1, 2, 3, 4]

token_budget: 12000
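
The `dependencies` fields in the task map imply an execution order. A minimal sketch of how an executor might derive it (type and function names are illustrative):

```typescript
// Derive a dependency-respecting execution order from a task_map like the
// one above: repeatedly pick tasks whose dependencies are all complete.
interface MappedTask {
  id: number;
  name: string;
  dependencies: number[]; // task ids that must finish first
}

function executionOrder(tasks: MappedTask[]): number[] {
  const order: number[] = [];
  const done = new Set<number>();
  let progressed = true;
  while (order.length < tasks.length && progressed) {
    progressed = false;
    for (const t of tasks) {
      if (!done.has(t.id) && t.dependencies.every((d) => done.has(d))) {
        done.add(t.id);
        order.push(t.id);
        progressed = true;
      }
    }
  }
  // If no task became runnable in a full pass, the dependencies are circular.
  if (order.length < tasks.length) {
    throw new Error("cycle in task_map dependencies");
  }
  return order;
}
```

For the task map above this yields an order where tasks 1 and 4 can run first and task 5 (tests) runs last, and it fails fast on a circular `dependencies` entry instead of letting the agent loop.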

Considerations

Benefits
  • **Higher Spec-to-Code Ratio.** Engineered specs leave less room for agent improvisation. When the spec defines inputs, outputs, error cases, and task decomposition, the agent follows the specification rather than guessing. Teams that adopt this methodology typically see [[spec-to-code-ratio]] increase from 50-60% to 80-90%.
  • **Fewer rescue missions.** The most common [[rescue-mission]] root cause is "ambiguous spec" (~40% of rescues). Engineered specs with testable acceptance criteria and explicit boundary conditions eliminate most of these. The [[correction-ratio]] drops because the agent gets it right on the first attempt.
  • **Specs become institutional knowledge.** A well-engineered spec is not a throwaway document — it is a reusable reference. When a similar feature needs to be built in the future, the team can adapt an existing spec rather than starting from scratch. The spec library grows into a valuable asset.
  • **Measurable quality improvement.** By tying [[spec-to-code-ratio]] and [[correction-ratio]] back to individual specs, teams can identify which spec patterns work and which do not. This creates a data-driven feedback loop for continuous improvement.
  • **Reduced onboarding time for new engineers.** A library of well-engineered specs teaches new team members how the codebase works, what patterns to follow, and what quality looks like. The specs are living documentation of the team's engineering standards.
Challenges
  • **Initial slowdown perception.** Engineers accustomed to jumping straight into code perceive spec writing as overhead. The response is data: show that a 30-minute spec investment saves 2-3 hours of agent retries and human correction. Track the numbers to make the case.
  • **Spec maintenance as code evolves.** Specs that reference specific file paths, database schemas, or API contracts become stale as the codebase changes. Include spec staleness checks in the monthly Context Hygiene ceremony. Flag specs where referenced files have changed since the spec was last updated.
  • **Translation gap between business and technical language.** Product managers write requirements in business language. [[context-architect]] roles translate these into technical specs. The translation step introduces potential information loss. Pair the [[context-architect]] with the product manager during spec authoring to minimize gaps.
  • **Over-engineering specs for simple tasks.** Not every task needs a full six-step spec. A simple CRUD endpoint following an existing pattern may need only acceptance criteria and a golden sample reference. Scale the spec effort to the task complexity — use the routing matrix from [[agent-task-routing]] to determine how thorough the spec needs to be.
  • **Resistance to peer review of specs.** Engineers may view spec review as bureaucratic overhead. Frame it as the highest-leverage review activity: 15 minutes of spec review prevents hours of wasted agent execution. Start with lightweight reviews (a quick read-through) and formalize only for high-complexity tasks.
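
The staleness flag described under spec maintenance can be sketched as a filesystem check. This is one heuristic (modification-time comparison); the function name and approach are assumptions, not a prescribed tool:

```typescript
// Hygiene-ceremony helper: flag a spec when any file it references was
// modified after the spec itself, or no longer exists.
import { existsSync, statSync } from "node:fs";

function staleReferences(specPath: string, referencedPaths: string[]): string[] {
  const specMtime = statSync(specPath).mtimeMs;
  return referencedPaths.filter(
    // A missing file is treated as stale too: the spec points at a path
    // that has been moved or deleted.
    (p) => !existsSync(p) || statSync(p).mtimeMs > specMtime
  );
}
```

Running this over the `golden_samples` and `context_packet` paths of every spec in the library turns the monthly Context Hygiene ceremony into a short review of a concrete stale-spec list.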