Workflow · Intermediate

Rescue Mission Workflow

The structured diagnosis-correction-resume cycle for recovering stuck agents and turning failures into reusable context.

Overview

A Rescue Mission is the structured process for recovering an agent that has become stuck — looping on the same error, misinterpreting a spec, violating architectural constraints, or exhausting its Token Budget without producing usable output. Rather than treating stuck agents as ad-hoc emergencies, this pattern defines a repeatable three-step cycle: Diagnose the root cause, Inject corrective context, and Resume execution. After every rescue, a feedback step enriches the Context Index so the same blocker does not recur.

Rescue missions sit at Phase 4 of the Four-Phase Escalation Ladder: they activate after automated retries (Phase 1), parameter adjustments (Phase 2), and the Blocker Flag pause (Phase 3) have all failed to resolve the issue. The Agent Operator who monitors the agent flags the problem, the Daily Flow Sync triages it, and a designated engineer (often the Agent Operator or a senior engineer) executes the rescue.
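
The ladder described above can be sketched as a small enum with two helpers. This is only an illustration under stated assumptions: `EscalationPhase`, `escalate`, and `needsRescue` are hypothetical names introduced here, not part of any existing tooling.

```typescript
// Hypothetical sketch of the Four-Phase Escalation Ladder.
// All names below are illustrative assumptions, not an existing API.
enum EscalationPhase {
  AutomatedRetry = 1,      // Phase 1: automated retries
  ParameterAdjustment = 2, // Phase 2: parameter adjustments
  BlockerFlagPause = 3,    // Phase 3: Blocker Flag pause
  RescueMission = 4,       // Phase 4: human-led rescue mission
}

// Returns the phase to move to when the current phase fails to resolve the issue.
// Phase 4 is the top of the ladder, so it has no successor.
function escalate(current: EscalationPhase): EscalationPhase | null {
  return current < EscalationPhase.RescueMission
    ? ((current + 1) as EscalationPhase)
    : null;
}

// A rescue mission is warranted only once every earlier phase has failed.
function needsRescue(failedPhases: EscalationPhase[]): boolean {
  const earlierPhases = [
    EscalationPhase.AutomatedRetry,
    EscalationPhase.ParameterAdjustment,
    EscalationPhase.BlockerFlagPause,
  ];
  return earlierPhases.every((phase) => failedPhases.indexOf(phase) !== -1);
}
```

Keeping the phases in one enum makes it harder for a script to jump straight to a rescue while an earlier phase is still untried.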

The value of this pattern compounds over time. Each rescue mission produces a documented diagnosis and a context update. After several months, the Context Index covers the failure modes the team has encountered, and rescue frequency drops. Teams that track Mean Time To Unblock (MTTU) typically see a 40-60% reduction within the first quarter of adopting this workflow.
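
For teams that want to check their own numbers against that range, the reduction is a simple percentage change between two average MTTU values. The helper name `mttuReductionPercent` and the sample values are assumptions made for illustration; they are not measured data.

```typescript
// Percentage reduction in Mean Time To Unblock between two periods.
// Both inputs are average MTTU values in minutes. The function name is hypothetical.
function mttuReductionPercent(
  baselineMinutes: number,
  currentMinutes: number
): number {
  if (baselineMinutes <= 0) {
    throw new RangeError("baselineMinutes must be positive");
  }
  return ((baselineMinutes - currentMinutes) / baselineMinutes) * 100;
}

// Made-up example values: 50 minutes before adoption, 25 minutes after.
const exampleReduction = mttuReductionPercent(50, 25); // 50
```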

Problem

Agents get stuck for predictable reasons, but teams without a structured rescue workflow handle these situations poorly:

  • Ad-hoc fixes that do not persist. An engineer unblocks the agent with a quick prompt adjustment or manual code edit. The fix works once, but the underlying cause — a missing context file, an ambiguous spec section, a constraint not captured in the System Constitution — remains. The next agent that encounters the same situation gets stuck again.

  • No root cause analysis. Without a diagnosis step, teams treat symptoms. They see "the agent produced incorrect SQL" and conclude "the agent is bad at SQL." The actual root cause might be that the database schema context was not included in the Context Packet, or that the spec did not specify which ORM to use.

  • Inconsistent escalation. Some engineers let agents spin for hours before intervening. Others kill the agent at the first sign of trouble and do the work themselves. Without a defined escalation process, the team oscillates between wasting tokens and wasting human time.

  • Knowledge stays in people's heads. The engineer who rescued the agent knows what went wrong and how to fix it. That knowledge lives in their memory, not in the Context Index. When they are out sick or leave the team, the knowledge leaves with them.

  • No visibility into failure patterns. Without tracking rescue missions, the team cannot identify systemic issues. If 60% of rescues are caused by incomplete specs, the team should invest in better spec engineering — but they do not know that without data.

Solution

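Implement the Rescue Mission Workflow as a five-step process that every team member follows when an agent reaches Phase 4 of the escalation ladder. The core Diagnose, Inject, Resume cycle is bracketed by a detection step before it and an enrichment step after it.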

The Five Steps

  1. Detect — Identify that the agent is stuck via the Agentops Dashboard or manual observation. Raise a Blocker Flag.
  2. Diagnose — Determine the root cause category. The diagnosis drives the correction.
  3. Inject — Provide the specific corrective context the agent needs.
  4. Resume — Restart agent execution with the corrected context.
  5. Enrich — Update the Context Index to prevent recurrence.

The entire cycle should take 15-45 minutes for a trained team. If rescues consistently run longer because of the diagnosis step, the monitoring and diagnostic tooling needs improvement.
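
One way to keep the five steps in order is to treat them as a small state machine over a step log. The sketch below is an assumption about how a team might encode this: `RescueStep`, `StepLogEntry`, and `nextRescueStep` are hypothetical names, not an existing API.

```typescript
// Hypothetical sketch: the five rescue steps as an ordered state machine.
type RescueStep = "detect" | "diagnose" | "inject" | "resume" | "enrich";

const RESCUE_STEPS: readonly RescueStep[] = [
  "detect",
  "diagnose",
  "inject",
  "resume",
  "enrich",
];

interface StepLogEntry {
  step: RescueStep;
  completedAt: Date;
}

// Returns the next step to perform, or null when the cycle is complete.
// Throws if the log skips or reorders steps (for example, resuming before diagnosing).
function nextRescueStep(log: StepLogEntry[]): RescueStep | null {
  if (log.length > RESCUE_STEPS.length) {
    throw new Error("Log has more entries than there are steps");
  }
  log.forEach((entry, i) => {
    if (entry.step !== RESCUE_STEPS[i]) {
      throw new Error(
        `Step ${i + 1} should be "${RESCUE_STEPS[i]}", got "${entry.step}"`
      );
    }
  });
  return RESCUE_STEPS[log.length] ?? null;
}
```

Because a mission only counts as complete after `enrich` is logged, a check like this makes a skipped enrichment step visible instead of silent.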

Implementation

Code Examples

Rescue Mission Tracking
// scripts/rescue-tracker.ts
interface RescueMission {
  id: string;
  taskId: string;
  detectedAt: Date;
  resolvedAt: Date | null;
  diagnosis: {
    category: string;
    evidence: string[];
    rootCause: string;
  };
  injection: {
    description: string;
    filesAdded: string[];
    specChanges: string[];
  };
  outcome: "resolved" | "re-escalated" | "abandoned";
  contextUpdates: string[];
  mttu: number | null; // minutes from detection to resolution
}

// Average MTTU in minutes across resolved missions; returns 0 when none are resolved.
function calculateMTTU(missions: RescueMission[]): number {
  const resolved = missions.filter(
    (m) => m.outcome === "resolved" && m.mttu !== null
  );
  if (resolved.length === 0) return 0;
  return resolved.reduce((sum, m) => sum + (m.mttu ?? 0), 0) / resolved.length;
}

// Counts missions per diagnosis category.
function categoryBreakdown(
  missions: RescueMission[]
): Record<string, number> {
  return missions.reduce(
    (acc, m) => {
      const cat = m.diagnosis.category;
      acc[cat] = (acc[cat] || 0) + 1;
      return acc;
    },
    {} as Record<string, number>
  );
}

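// Builds a Markdown summary. The "Last 30 days" label is static, so callers
// should pass only the missions from that window.
function generateRescueReport(missions: RescueMission[]): string {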
  const mttu = calculateMTTU(missions);
  const breakdown = categoryBreakdown(missions);
  const totalContextUpdates = missions.reduce(
    (sum, m) => sum + m.contextUpdates.length,
    0
  );

  return [
    `## Rescue Mission Report`,
    ``,
    `**Period:** Last 30 days`,
    `**Total Rescues:** ${missions.length}`,
    `**Average MTTU:** ${mttu.toFixed(0)} minutes`,
    `**Context Updates Generated:** ${totalContextUpdates}`,
    ``,
    `### Root Cause Breakdown`,
    ...Object.entries(breakdown)
      .sort(([, a], [, b]) => b - a)
      .map(
        ([cat, count]) =>
          `- ${cat}: ${count} (${((count / missions.length) * 100).toFixed(0)}%)`
      ),
  ].join("\n");
}
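
The `mttu` field above is stored alongside `detectedAt` and `resolvedAt`, so the three values can drift apart. A minimal sketch of deriving it from the timestamps instead follows; `deriveMttuMinutes` is a hypothetical helper, not part of the original script.

```typescript
// Derives MTTU in minutes from the two timestamps.
// Returns null while the mission is still open (resolvedAt is null).
// `deriveMttuMinutes` is an illustrative name, not an existing function.
function deriveMttuMinutes(
  detectedAt: Date,
  resolvedAt: Date | null
): number | null {
  if (resolvedAt === null) return null;
  const minutes = (resolvedAt.getTime() - detectedAt.getTime()) / 60000;
  if (minutes < 0) {
    throw new RangeError("resolvedAt precedes detectedAt");
  }
  return minutes;
}
```

Calling this when a mission is closed, rather than entering `mttu` by hand, keeps the reported average consistent with the recorded timestamps.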

Blocker Detection Script
// scripts/detect-blockers.ts
interface AgentSession {
  id: string;
  taskId: string;
  startedAt: Date;
  tokenBudget: number;
  tokensUsed: number;
  retryCount: number;
  lastGateResult: "pass" | "fail";
  lastError: string | null;
  expectedDurationMinutes: number;
}

interface BlockerAlert {
  sessionId: string;
  taskId: string;
  reason: string;
  severity: "warning" | "critical";
  suggestedAction: string;
}

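// Scans active sessions and returns an alert for each triggered check.
// A single session can produce more than one alert.
function detectBlockers(sessions: AgentSession[]): BlockerAlert[] {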
  const alerts: BlockerAlert[] = [];
  const now = new Date();

  for (const session of sessions) {
    const elapsedMinutes =
      (now.getTime() - session.startedAt.getTime()) / 60000;
    const budgetPercent = (session.tokensUsed / session.tokenBudget) * 100;

    // Check: over 80% of the token budget used while gates are still failing
    if (budgetPercent > 80 && session.lastGateResult === "fail") {
      alerts.push({
        sessionId: session.id,
        taskId: session.taskId,
        reason: `Token budget at ${budgetPercent.toFixed(0)}% with failing gates`,
        severity: "critical",
        suggestedAction: "Initiate rescue mission — budget exhaustion likely",
      });
    }

    // Check: excessive retries
    if (session.retryCount >= 3) {
      alerts.push({
        sessionId: session.id,
        taskId: session.taskId,
        reason: `${session.retryCount} retries with no improvement`,
        severity: "critical",
        suggestedAction:
          "Raise blocker flag — automated retries are not resolving the issue",
      });
    }

    // Check: running longer than expected
    if (elapsedMinutes > session.expectedDurationMinutes * 2) {
      alerts.push({
        sessionId: session.id,
        taskId: session.taskId,
        reason: `Running ${elapsedMinutes.toFixed(0)}min (expected ${session.expectedDurationMinutes}min)`,
        severity: "warning",
        suggestedAction:
          "Check agent progress — may be stuck or working on out-of-scope items",
      });
    }
  }

  return alerts;
}
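
Because the three checks above run independently, one session can appear in several alerts. For triage it may help to collapse them into one entry per session. The following is a minimal sketch under that assumption; `SessionAlert`, `SessionSummary`, and `summarizeAlerts` are hypothetical names, and `SessionAlert` mirrors only the alert fields it needs.

```typescript
type AlertSeverity = "warning" | "critical";

// Subset of the alert fields used for grouping (illustrative type).
interface SessionAlert {
  sessionId: string;
  reason: string;
  severity: AlertSeverity;
}

interface SessionSummary {
  severity: AlertSeverity; // highest severity seen for the session
  reasons: string[];       // every reason reported, in input order
}

// Groups alerts by session and keeps the highest severity per session.
function summarizeAlerts(
  alerts: SessionAlert[]
): Record<string, SessionSummary> {
  const summary: Record<string, SessionSummary> = {};
  for (const alert of alerts) {
    const entry: SessionSummary = summary[alert.sessionId] ?? {
      severity: "warning",
      reasons: [],
    };
    if (alert.severity === "critical") entry.severity = "critical";
    entry.reasons.push(alert.reason);
    summary[alert.sessionId] = entry;
  }
  return summary;
}
```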

Considerations

Benefits
  • Faster unblocking. A structured diagnosis checklist eliminates the guesswork that makes ad-hoc rescues slow. Teams that follow this workflow consistently typically see Mean Time To Unblock (MTTU) improvements of 40-60% within three months.
  • Compound improvement. Every rescue enriches the Context Index. The first rescue for a given blocker type might take 45 minutes; the fifth rescue for a similar blocker might take 15 minutes because the context already exists. After enough rescues, that blocker category can stop occurring altogether.
  • Systematic knowledge capture. Rescue injection records and context updates convert individual troubleshooting knowledge into team-level assets. The Context Index grows from rescue missions, not from planned documentation sprints.
  • Data-driven process improvement. Tracking rescue frequency by root cause category reveals where the team should invest. If 40% of rescues are caused by ambiguous specs, improving spec quality yields the highest return. Without tracking, the investment decision is a guess.
  • Reduced recurring failures. The enrichment step (Step 5) is the key differentiator from ad-hoc fixes. Each rescue addresses the root cause in the Context Index, not just the symptom in the current task. Over time, the team encounters fewer repeat blockers.

Challenges
  • Requires real-time monitoring. The detection step depends on visibility into agent execution state. Teams without an Agentops Dashboard or equivalent monitoring must rely on manual observation, which delays detection and increases MTTU.
  • Diagnosis skill takes practice. The five-category diagnosis framework is straightforward in theory but requires pattern recognition that develops over weeks. Expect diagnosis to take 20-30 minutes initially; it should drop to 5-10 minutes after a dozen rescues.
  • Balancing speed and thoroughness. Under deadline pressure, teams skip the enrichment step (Step 5) to save time. This creates technical debt in the Context Index and makes it likely that the same blocker will recur. The Flow Manager should track enrichment completion rates and flag gaps.
  • Post-rescue context review burden. Each rescue generates context updates that should be reviewed like code, because incorrect context injections can cause new problems. Include rescue-generated context updates in the weekly Architecture Governance Review.
  • Requires team discipline. The workflow only works if every team member follows it consistently. One engineer who bypasses the process and applies quick fixes without enrichment undermines the compound improvement benefit. The Daily Flow Sync should include a brief review of rescue completion status.
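
The enrichment completion rate mentioned under "Balancing speed and thoroughness" can be computed from rescue records. This sketch assumes a rescue counts as enriched when it has at least one context update; that definition, the `RescueRecord` type, and the `enrichmentCompletionRate` name are assumptions made for illustration.

```typescript
// Minimal record shape for this calculation (illustrative type).
interface RescueRecord {
  outcome: "resolved" | "re-escalated" | "abandoned";
  contextUpdates: string[];
}

// Fraction (0 to 1) of resolved rescues that produced at least one context update.
// Returns null when there are no resolved rescues to measure.
function enrichmentCompletionRate(records: RescueRecord[]): number | null {
  const resolved = records.filter((r) => r.outcome === "resolved");
  if (resolved.length === 0) return null;
  const enriched = resolved.filter((r) => r.contextUpdates.length > 0).length;
  return enriched / resolved.length;
}
```

Returning null rather than 0 for an empty period avoids reporting a missing sample as a failure to enrich.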