Rain Lag

The Analog Incident Relay Race: Passing Paper Batons Through a Multi‑Team Outage

How to run outages like a well-coordinated relay race—using structured handoffs, predictive alerting, escalation rules, and reusable playbooks to keep multi-team incidents under control.

If you’ve ever watched an engineering org handle a major outage, you’ve probably seen a strange hybrid of high-tech dashboards and low-tech chaos: people scribbling on notepads, pasting timelines into chats, and “who owns this now?” echoing across channels.

It looks a lot like a relay race where nobody is quite sure who’s holding the baton.

In this post, we’ll treat incident response as an analog relay race and show how to turn those paper batons into a structured, reliable handoff system. We’ll cover:

  • How to design explicit ownership transitions for each phase of an outage
  • Using predictive alerting and time-based escalation rules to reduce MTTR
  • Preventing notification storms across multiple tools and teams
  • Running focused retrospectives that actually lead to change
  • Building and sharing reusable incident response playbooks

Why Incidents Feel Like a Bad Relay Race

In a real relay race, teams lose not because their runners are slow, but because they fumble the handoffs.

Most complex outages fail the same way:

  • The database team fixes the immediate problem, but nobody owns the data repair backlog that will surface days later.
  • A support manager promises customer follow-ups, but there’s no clear owner once the incident channel goes quiet.
  • SREs close their laptops at shift change assuming “someone else” is on it.

The technical fix might be fast; the organizational response is not.

The solution is to treat each phase of the incident lifecycle as a relay leg with clear ownership and formal handoffs.


Map Your Incident Lifecycle as Relay Legs

Start by defining the major “legs” of your incident relay. A common pattern:

  1. Detection & Triage – Identify the issue, confirm impact, and declare an incident.
  2. Containment & Stabilization – Stop the bleeding and restore service.
  3. Data Repair & Remediation – Fix corrupted data, reconcile queues, clean up side effects.
  4. Customer Communication & Follow-up – Notify, update, and close the loop with customers.
  5. Retrospective & Improvement – Learn, prioritize actions, and update playbooks.

Each leg must have:

  • A named owner role (not just a team): e.g., Incident Commander, Data Repair Lead, Customer Comms Lead.
  • A clear entry condition: what triggers this leg to begin.
  • A clear exit condition: what it means to be “done.”
  • A documented handoff: who receives the baton and how.

Think of the baton as a single source of truth: a doc, ticket, or incident record that moves from owner to owner.
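The legs above can be expressed as plain data, which makes the "who receives the baton next?" question answerable by lookup instead of memory. This is a minimal sketch; the `RelayLeg` type, the leg descriptions, and the `next_owner` helper are illustrative assumptions, not part of any particular incident tool.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelayLeg:
    name: str
    owner_role: str        # a named role, not just a team
    entry_condition: str   # what triggers this leg to begin
    exit_condition: str    # what "done" means for this leg

# Illustrative legs, following the lifecycle described above.
LEGS = [
    RelayLeg("Detection & Triage", "Incident Commander",
             "Alert fires or issue reported", "Incident declared; severity set"),
    RelayLeg("Containment & Stabilization", "Incident Commander",
             "Incident declared", "Service stable for agreed window"),
    RelayLeg("Data Repair & Remediation", "Data Repair Lead",
             "Service stable", "Data fixed or tracked in backlog"),
    RelayLeg("Customer Communication & Follow-up", "Customer Comms Lead",
             "Repair scoped", "All notices sent; status page closed"),
    RelayLeg("Retrospective & Improvement", "Post-incident Facilitator",
             "Incident closed", "Actions created and owned"),
]

def next_owner(current_leg: str) -> Optional[str]:
    """Who receives the baton once the current leg's exit condition is met."""
    names = [leg.name for leg in LEGS]
    i = names.index(current_leg)
    return LEGS[i + 1].owner_role if i + 1 < len(LEGS) else None
```

With this in place, a bot or checklist can answer "Containment is done, who's next?" mechanically rather than relying on whoever happens to be in the channel.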

Example: Ownership in Each Leg

  • Detection & Triage

    • Owner: On-call SRE (initial) → Incident Commander (after declaration)
    • Exit: Incident is declared; severity set; initial scope understood.
  • Containment & Stabilization

    • Owner: Incident Commander
    • Exit: Impact reduced or resolved; service stable for agreed time window.
  • Data Repair & Remediation

    • Owner: Data Repair Lead (often from the owning service team)
    • Exit: Data fixed or queued in a tracked backlog; risks documented.
  • Customer Communication & Follow-up

    • Owner: Customer Comms Lead (Support, PM, or Customer Success)
    • Exit: All required notices and follow-ups completed; status page closed.
  • Retrospective & Improvement

    • Owner: Post-incident Facilitator (often SRE or Engineering Manager)
    • Exit: Facts documented; contributing factors listed; actions created and owned.

Make Handoffs Explicit, Not Informal

Most outages rely on ad-hoc transitions:

“Looks good now. I think Data will take it from here?”

That’s how batons get dropped.

Replace that with explicit, structured handoffs.

A Simple Handoff Template

For every leg transition, capture:

  • From: Role/Person handing off
  • To: Role/Person accepting ownership
  • Scope: What exactly you now own
  • State: Current status, risks, and open questions
  • Artifacts: Links to dashboards, logs, docs, tickets
  • Next checkpoints: Deadlines or review times

In practice, this can be as simple as a standard block pasted into your incident channel and stored in your incident record:

  [HANDOFF]
  From: Incident Commander (Alice)
  To: Data Repair Lead (Ravi)
  Scope: Reconcile orders created 12:10–12:27 UTC from Kafka topic `orders` to DB `orders_v2`.
  State: 1,842 orders affected; 600 verified; 0 customer-facing discrepancies so far.
  Artifacts: Runbook #42, Repair dashboard, Jira EPIC-321.
  Next: First status update at 15:00 UTC; completion ETA 18:00 UTC.

You can keep this “analog” (a text template everyone uses) while wiring it into your modern tooling.
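One easy wiring step is to generate the handoff block from structured fields, so the same text lands in chat and in the incident record without drift. The `render_handoff` helper below is a hypothetical sketch, not part of any incident platform's API.

```python
def render_handoff(from_, to, scope, state, artifacts, next_checkpoints):
    """Render the standard handoff block from structured fields."""
    lines = [
        "[HANDOFF]",
        f"From: {from_}",
        f"To: {to}",
        f"Scope: {scope}",
        f"State: {state}",
        f"Artifacts: {artifacts}",
        f"Next: {next_checkpoints}",
    ]
    return "\n".join(lines)

# Example usage, mirroring the template above (names are illustrative).
block = render_handoff(
    from_="Incident Commander (Alice)",
    to="Data Repair Lead (Ravi)",
    scope="Reconcile orders created 12:10-12:27 UTC",
    state="1,842 orders affected; 600 verified",
    artifacts="Runbook #42, Jira EPIC-321",
    next_checkpoints="First status update at 15:00 UTC",
)
```

Because the block is generated, required fields can't be silently omitted, and the same function can post to chat and append to the incident timeline.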


Reduce MTTR with Predictive Alerting

The earlier you detect an incident, the more time you have for each relay leg. This is where predictive alerting and ML-based confidence scores matter.

Instead of only triggering alerts on hard thresholds (e.g., 5xx > X%), use models that:

  • Learn normal behavior over time (per service, per time of day).
  • Raise alerts when patterns deviate with a confidence score (e.g., 0.92 likelihood of outage).
  • Combine multiple signals (latency, error rates, saturation, complaint volume) into a single early-warning alert.

Operationally, this looks like:

  • Low-to-medium confidence signals create warning-level incidents or dashboards for human review.
  • High-confidence signals can automatically page on-call and open an incident ticket.

The goal is not to drown teams in false positives, but to give the first runner in the relay a head start.
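The routing logic can stay simple even when the model behind it is not. The sketch below assumes each signal has already been scored for "deviation from normal" on a 0–1 scale; the weights and thresholds are illustrative assumptions, not values from any real ML model.

```python
# Illustrative per-signal weights; a real system would learn these.
SIGNAL_WEIGHTS = {"latency": 0.3, "error_rate": 0.4, "saturation": 0.2, "complaints": 0.1}

def outage_confidence(deviations: dict) -> float:
    """Combine per-signal deviation scores (0..1) into one confidence score."""
    return sum(SIGNAL_WEIGHTS[name] * score for name, score in deviations.items())

def route(confidence: float) -> str:
    """Map confidence to an operational response, per the tiers above."""
    if confidence >= 0.85:
        return "page"      # high confidence: page on-call, open an incident
    if confidence >= 0.50:
        return "warn"      # medium: warning-level incident for human review
    return "observe"       # low: dashboard only, no interruption
```

A spike across several signals at once produces a high combined score and a page, while a single noisy metric stays below the paging threshold.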


Use Time-Based Escalation to Avoid Silent Failures

Once an incident starts, time-based escalation rules prevent it from stalling.

Examples:

  • If no one acknowledges a P1 alert in 5 minutes → escalate to the next level of the escalation policy (secondary on-call, SMS, or a direct manager).
  • If an incident remains at high severity for 30 minutes with no status update → auto-ping the Incident Commander and leadership channel.
  • If a backlog item from the incident (e.g., data repair) is still open after 48 hours → auto-escalate to the owning team’s manager.

In tools like PagerDuty, Opsgenie, or home-grown systems, treat these as SLOs for incident response itself.

Your relay race shouldn’t depend on someone remembering to pass the baton. Time-based rules make sure it moves.
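A scheduler only needs to compare elapsed time against each rule's deadline. This is a minimal sketch with illustrative rule names, deadlines, and action labels; a real system would pull these from your paging tool's escalation policies.

```python
from datetime import datetime, timedelta, timezone

# Illustrative time-based rules: (event name, deadline, escalation action).
RULES = [
    ("p1_unacknowledged", timedelta(minutes=5), "escalate_to_secondary"),
    ("no_status_update", timedelta(minutes=30), "ping_ic_and_leadership"),
    ("repair_backlog_open", timedelta(hours=48), "escalate_to_manager"),
]

def due_escalations(event_start_times: dict, now: datetime) -> list:
    """Return the actions whose time-based deadline has expired."""
    actions = []
    for name, deadline, action in RULES:
        started = event_start_times.get(name)
        if started is not None and now - started >= deadline:
            actions.append(action)
    return actions
```

Run on a timer (every minute is plenty), this turns "someone should have noticed by now" into a guaranteed page.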


Prevent Notification Storms in Long Incidents

Multi-team outages often degenerate into notification storms:

  • Each team’s tool sends the same alert to different channels.
  • Status updates are copied to email, Slack, SMS, ticketing tools—often with slight differences.
  • People mute channels to stay sane… and miss the one message they actually need.

Solve this by coordinating and throttling alerts:

  1. Designate a primary broadcast channel for the incident (one Slack channel, one MS Teams room, etc.).
  2. Route all tools through an aggregator (PagerDuty, incident management platform, or a custom service) that:
    • Deduplicates alerts from multiple sources.
    • Applies rate limits (e.g., at most one customer-facing update every 30 minutes unless impact changes).
    • Supports role-based subscriptions (IC vs on-call vs leadership vs support).
  3. Promote a single, canonical status artifact:
    • Public: status page, customer update doc.
    • Internal: incident timeline + owner fields.

Your goal: more signal, less noise, especially when multiple teams and tools are involved.
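The deduplication and rate-limit steps can be sketched as a small gate in front of the broadcast channel. This is an assumption-laden toy (in-memory state, a single fingerprint string per alert); a production aggregator like PagerDuty does this with dedup keys and persistent state.

```python
class AlertThrottle:
    """Deduplicate identical alerts and rate-limit updates per incident."""

    def __init__(self, min_interval_s: float = 1800):  # e.g. 30 min between updates
        self.min_interval_s = min_interval_s
        self.seen = set()       # fingerprints already broadcast (dedup)
        self.last_sent = {}     # incident_id -> timestamp of last broadcast

    def should_broadcast(self, incident_id: str, fingerprint: str, now: float) -> bool:
        if fingerprint in self.seen:
            return False        # same alert arriving from another tool
        last = self.last_sent.get(incident_id)
        if last is not None and now - last < self.min_interval_s:
            return False        # rate limit: too soon since the last update
        self.seen.add(fingerprint)
        self.last_sent[incident_id] = now
        return True
```

Everything that passes the gate goes to the single primary channel; everything else is still recorded in the incident timeline, just not broadcast.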


Run Focused, Fact-Driven Retrospectives

After the race, good teams analyze the tape. The same applies to outages.

A focused retrospective should:

  1. Collect facts, not opinions

    • Timeline of events (who did what, when, with links).
    • Metrics (MTTD, MTTA, MTTR; impact duration; customer impact).
  2. Define contributing factors

    • Technical: architecture flaws, capacity limits, missing safeguards.
    • Process: missing runbooks, unclear handoffs, slow approvals.
    • Human: fatigue, inadequate training, role confusion.
  3. Prioritize follow-up actions

    • Each action must have:
      • A clear owner
      • A due date
      • A business impact rationale
    • Leadership tasks should be explicit: funding, hiring, tool investments, policy changes.

Keep retrospectives blameless but accountable: the point is to fix systems, not shame people, while ensuring actions don’t vanish into a black hole.
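The fact-collection step is easier when the timeline metrics are computed the same way every time. A minimal sketch, assuming four timestamps are recorded per incident; note that some teams measure TTR from detection rather than from start, so pick one convention and document it.

```python
from datetime import datetime, timezone

def incident_metrics(started, detected, acknowledged, resolved):
    """Per-incident durations in minutes; averaged across incidents these
    become MTTD, MTTA, and MTTR (here TTR is measured from incident start)."""
    return {
        "ttd_min": (detected - started).total_seconds() / 60,      # time to detect
        "tta_min": (acknowledged - detected).total_seconds() / 60, # time to acknowledge
        "ttr_min": (resolved - started).total_seconds() / 60,      # time to resolve
    }
```

Feeding these numbers into the retrospective doc up front keeps the discussion anchored to facts rather than recollection.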


Standardize With Reusable Incident Response Playbooks

You don’t want to reinvent the race plan for every outage. That’s where incident response playbooks come in.

Resources like the open-source AWS Incident Response Playbooks (in the aws-samples/aws-incident-response-playbooks repo) are a great starting point.

Use or adapt playbooks to:

  • Standardize roles and responsibilities (IC, Ops Lead, Comms Lead, etc.).
  • Document checklists for common scenarios (database saturation, region failure, authentication outage).
  • Define handoff patterns between teams and time zones.
  • Encode escalation logic and notification rules.
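A playbook becomes reusable the moment it is data rather than tribal knowledge. The structure below is a deliberately small sketch of what that data might look like; the field names and checklist content are illustrative assumptions, not the format used by aws-samples/aws-incident-response-playbooks.

```python
# Hypothetical playbook-as-data for one common scenario.
PLAYBOOK = {
    "scenario": "database saturation",
    "roles": {
        "IC": "Coordinates the response and owns handoffs",
        "Ops Lead": "Executes containment steps",
        "Comms Lead": "Owns customer and internal updates",
    },
    "checklist": [
        "Confirm saturation via connection-pool and CPU dashboards",
        "Enable read-replica failover if replication lag is low",
        "Post first customer update within 30 minutes",
    ],
    "escalation": {"unacknowledged_after_min": 5, "no_update_after_min": 30},
}

def checklist_for(playbook: dict, scenario: str) -> list:
    """Return the checklist if this playbook matches the scenario."""
    return playbook["checklist"] if playbook["scenario"] == scenario else []
```

Stored in version control, playbooks like this can be reviewed, diffed, and improved after each retrospective like any other artifact.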

Then, contribute back:

  • Your improved runbooks for multi-team outages.
  • New playbooks that capture how your organization passes batons between SRE, platform, product teams, and support.

Over time, your “analog” processes become reliable enough to script; your “paper batons” become templates everyone knows and trusts.


Bringing It All Together

Running a good incident isn’t about heroics; it’s about choreography.

To turn your multi-team outages into a well-run relay race:

  • Define relay legs in your incident lifecycle and attach clear owners.
  • Use structured, explicit handoffs so batons never get dropped.
  • Leverage predictive alerting and time-based escalation to detect and respond faster.
  • Coordinate and throttle notifications to avoid storm fatigue.
  • Run focused retrospectives that turn pain into prioritized, owned actions.
  • Build and share reusable playbooks, including those from aws-samples/aws-incident-response-playbooks, to standardize how teams work together.

You may still scribble notes on paper during the chaos—that’s fine. The key is that every scribble rolls up into a clear baton, passed cleanly from one owner to the next, until the incident is not just resolved, but truly finished.
