The Cardboard Incident Relay: Running Production Fire Drills Without Pager Chaos

Modern production systems fail in creative, surprising ways. Your incident response should not.

Yet in many teams, “incident response” still means: a flurry of pagers, a pile-up of Slack pings, confused managers asking for updates, and engineers scrambling without a shared plan.

You can do better—with a cardboard baton.

Think of a “Cardboard Incident Relay” as a low-cost, low-stakes way to rehearse high-stakes situations: a paper baton passed between teammates as you simulate real outages. No on-call stress, no 2 a.m. adrenaline—just structured practice.

In this post, we’ll walk through how to:

Build clear, standardized incident playbooks
Run tabletop exercises (“cardboard relays”) to practice responses
Expose weak communication and escalation paths before real crises
Calm the chaos with better monitoring, alerting, and incident channels
Archive and analyze incidents for continuous improvement

Why You Need Fire Drills for Production

Fire departments don’t wait for a real building to catch fire to figure out how to respond. They drill. Repeatedly.

Engineering teams need the same discipline.

Without structured practice:

People improvise roles in the moment
Conflicting instructions slow down response
Nobody is sure who can make which decision
Post-incident reviews devolve into blame or hand-waving

The “Cardboard Incident Relay” approach treats incident management as a trainable skill, not a panic-driven event. Instead of chaos, you get:

Predictable roles and responsibilities
Faster, more confident decisions
Fewer miscommunications and duplicated work
A calmer culture around failure

And you can implement it with nothing more than a scenario, a physical or virtual baton, and time on the calendar.

Step 1: Establish Clear, Standardized Playbooks

You can’t practice what you haven’t defined.

Start with a small set of incident response playbooks for your most common or most damaging production scenarios. For example:

Database latency / timeouts
Authentication service degradation
Payment processing failures
Major feature outage for a key customer segment

Each playbook should answer:

Trigger conditions
- What metrics, alerts, or user reports indicate this incident type?
- What thresholds turn “annoying” into “incident”?
Roles and responsibilities
- Incident Commander (IC): owns decisions and coordination
- Technical Lead: digs into root cause, drives mitigation
- Communications Lead: posts updates to stakeholders and status pages
- Scribe: records timeline, actions, and key decisions
Immediate actions (first 5–15 minutes)
- Who is paged?
- Where do you coordinate (channel, bridge, tools)?
- What safe mitigations can be applied quickly (e.g., rollbacks, feature flags, traffic shedding)?
Escalation and decision points
- At what point do you roll back, fail over, or declare a larger incident?
- When do you pull in leadership, legal, or customer support?
Communication templates
- Internal status updates: frequency, level of detail
- External comms: status page, customer success briefs, etc.
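One way to keep playbooks standardized is to store each one as structured data, so that gaps are easy to spot. The sketch below is only an illustration: the `Playbook` class, its field names, and the example latency threshold are assumptions, not part of any particular tool.

```python
from dataclasses import dataclass, field


@dataclass
class Playbook:
    """One incident playbook. Field names are illustrative, not a standard."""

    name: str
    triggers: list[str] = field(default_factory=list)
    roles: dict[str, str] = field(default_factory=dict)  # role -> responsibility
    immediate_actions: list[str] = field(default_factory=list)
    escalation_points: list[str] = field(default_factory=list)
    comms_templates: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Return the names of sections that are still empty."""
        sections = {
            "triggers": self.triggers,
            "roles": self.roles,
            "immediate_actions": self.immediate_actions,
            "escalation_points": self.escalation_points,
            "comms_templates": self.comms_templates,
        }
        return [section for section, content in sections.items() if not content]


# Hypothetical draft for the database latency scenario; the threshold is made up.
db_latency = Playbook(
    name="Database latency / timeouts",
    triggers=["p95 query latency above 500 ms for 5 minutes"],
    roles={
        "Incident Commander": "owns decisions and coordination",
        "Technical Lead": "digs into root cause, drives mitigation",
        "Communications Lead": "posts updates to stakeholders and status pages",
        "Scribe": "records timeline, actions, and key decisions",
    },
    immediate_actions=["Page the database on-call", "Open the incident channel"],
)

print(db_latency.missing_sections())  # ['escalation_points', 'comms_templates']
```

A check like `missing_sections()` fits the "good enough, then improve" approach: a draft can ship with gaps, as long as the gaps are visible.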

These are living documents. Don’t wait for perfect. Aim for clear and good enough, then improve them through drills.

Step 2: Run Tabletop “Cardboard Relay” Exercises

Once you have playbooks, you need to practice using them—without waking anyone up in the middle of the night.

A tabletop exercise is a structured, low-stress simulation. The “cardboard relay” metaphor is simple:

You gather participants (in a room or video call)
You walk through a scripted incident scenario
A cardboard baton (or virtual token) represents who currently holds the Incident Commander role
As the scenario evolves, the baton can be passed—intentionally and visibly
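If you run the drill remotely, the baton can be a tiny piece of state rather than a physical object. The following is a minimal sketch of a virtual baton that only changes hands after the new holder confirms; the `Baton` class and its method names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Baton:
    """Tracks who currently holds the Incident Commander role."""

    holder: str
    pending: Optional[str] = None          # person who has been offered the baton
    log: list[str] = field(default_factory=list)

    def offer(self, to: str, at: str) -> None:
        """The current IC announces a handoff; the role does not move yet."""
        self.pending = to
        self.log.append(f"{at} {self.holder} offers IC to {to}")

    def confirm(self, by: str, at: str) -> None:
        """The role moves only when the offered person confirms."""
        if by != self.pending:
            raise ValueError(f"{by} was not offered the baton")
        self.log.append(f"{at} {by} confirms IC")
        self.holder, self.pending = by, None


baton = Baton(holder="Sam")
baton.offer("Alex", at="10:15")
print(baton.holder)   # Sam (still IC until Alex confirms)
baton.confirm("Alex", at="10:15")
print(baton.holder)   # Alex
```

Splitting the pass into an offer and a confirmation means there is never a moment when nobody, or two people, believe they are in charge.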

How to run a basic tabletop

Choose a scenario
Example: “Checkout error rates jump to 30% starting at 10:03 a.m.”
Assign initial roles
- IC holds the baton
- Technical Lead, Communications Lead, and Scribe are named
Simulate time and events
A facilitator reveals information in stages:
- 10:05 – Alert triggers on payment error rate
- 10:08 – Support tickets spike
- 10:12 – Monitoring shows DB CPU at 90%
- 10:18 – A key enterprise customer emails their account manager
Enforce realism
- No magical “I check everything at once” moves
- Actions cost time
- If someone says “I’d run query X,” the facilitator returns plausible results
Practice baton passing
- The IC may pass the baton if their context is limited, if their shift is ending, or if a more appropriate IC appears (e.g., a regional on-call with better context)
- The baton pass must be explicit: “I am handing IC to Alex as of 10:15; Alex, please confirm.”
Debrief immediately
After the run:
- What slowed the team down?
- Where were responsibilities unclear?
- What tools or data were missing?
- What would you change in the playbook?

This is where you discover design flaws in your process—while stakes are low.
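A facilitator can keep the staged reveal honest with a simple simulated clock. Here is a rough sketch using the example timeline from the steps above; the helper names (`clock`, `revealed`, `spend`) and the four-minute action cost are assumptions for illustration.

```python
from datetime import datetime, timedelta


def clock(hhmm):
    """Parse a wall-clock string such as '10:05' into a comparable datetime."""
    return datetime.strptime(hhmm, "%H:%M")


# The staged events from the example scenario.
SCENARIO = [
    (clock("10:05"), "Alert triggers on payment error rate"),
    (clock("10:08"), "Support tickets spike"),
    (clock("10:12"), "Monitoring shows DB CPU at 90%"),
    (clock("10:18"), "A key enterprise customer emails their account manager"),
]


def revealed(scenario, now):
    """Return only the events participants may know at simulated time `now`."""
    return [event for at, event in scenario if at <= now]


def spend(now, minutes):
    """Actions cost time: advance the simulated clock by the action's duration."""
    return now + timedelta(minutes=minutes)


now = clock("10:09")
print(revealed(SCENARIO, now))       # the first two events
now = spend(now, 4)                  # hypothetical: running a query takes 4 minutes
print(len(revealed(SCENARIO, now)))  # 3 (the DB CPU reading is now visible)
```

Charging every action against the clock is what prevents the "I check everything at once" move: while someone runs a query, the scenario keeps moving.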

Step 3: Expose Communication and Escalation Weaknesses Early

Tabletop drills reliably surface the soft spots in your incident response:

Unclear ownership: “Wait, who talks to the CEO?”
Missing escalation paths: “How do we get the database team on the line?”
Decision paralysis: “Who approves traffic shedding or a rollback?”
Information overload: “We had five dashboards and no shared understanding.”

Design your scenarios with these goals in mind:

Force cross-team coordination (e.g., infra + app + data)
Introduce conflicting pressures (e.g., speed vs. data integrity)
Include non-technical stakeholders (e.g., support, product, marketing)

Every weakness you expose on cardboard is one you can fix before it causes a failure in production.

Step 4: Replace Pager Chaos With Structured Decision-Making

Most pager chaos comes from ad-hoc decisions made under pressure:

Multiple people trying to be in charge
Repeated questions: “What’s the status? Who’s doing what?”
Engineers pulled into incidents with no clear need

Drills help you refine your decision-making model:

Who can declare an incident of a given severity?
Who can end it?
What’s the chain for escalation and delegation?
When do you stop debugging and choose a mitigation (rollback, feature flag, failover)?

Write these rules down. Practice them. Adjust as you learn. Over time, your team internalizes a calm, consistent response pattern.
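Writing the rules down can be as literal as a lookup table. The sketch below shows one possible shape; the severity labels (`SEV1`, `SEV3`), the role names, and the 15-minute debugging budget are all assumptions that you would replace with your own.

```python
# Hypothetical authority table: which roles may take which decision at each severity.
AUTHORITY = {
    ("declare", "SEV3"): {"on-call engineer", "incident commander"},
    ("declare", "SEV1"): {"incident commander", "engineering manager"},
    ("end", "SEV3"): {"incident commander"},
    ("end", "SEV1"): {"incident commander"},
    ("rollback", "SEV1"): {"incident commander"},
}


def can(role, action, severity):
    """Return True if `role` is allowed to take `action` at this severity."""
    return role in AUTHORITY.get((action, severity), set())


def should_mitigate(minutes_debugging, budget_minutes=15):
    """Stop debugging and pick a mitigation once the time budget is used up."""
    return minutes_debugging >= budget_minutes


print(can("on-call engineer", "declare", "SEV3"))  # True
print(can("on-call engineer", "declare", "SEV1"))  # False
print(should_mitigate(20))                         # True
```

Any combination that is missing from the table is denied by default, which turns an undefined case into a visible question for the next drill instead of an argument during an incident.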

Step 5: Implement Monitoring and Intelligent Alerting

All the process in the world won’t help if your alerts are noise.

Your goal: fewer, smarter alerts that focus on real user impact.

Key practices:

Coverage before cleverness
- Monitor critical user journeys (signup, login, checkout, search)
- Monitor infrastructure fundamentals (CPU, memory, disk, latency, error rates)
Meaningful thresholds
- Tune alerts to impact: an error rate going from 0.01% to 0.02% might be irrelevant, but 2% to 10% is not
- Combine signals where possible (e.g., error rate + latency + traffic anomaly)
Noise reduction
- Rate-limit flapping alerts
- Group related alerts into a single incident
- Run periodic reviews of “never actionable” alerts and remove or adjust them
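To make these practices concrete, here is a small sketch of an impact-based paging rule plus a cooldown for flapping alerts. It is not tied to any monitoring product; the function and class names and every threshold (5% and 1% error rates, 800 ms latency, 50% traffic shift, 300-second cooldown) are assumptions chosen for illustration.

```python
def should_page(error_rate, p95_latency_ms, traffic_ratio,
                page_error_rate=0.05, warn_error_rate=0.01,
                max_latency_ms=800, max_traffic_shift=0.5):
    """Decide whether to page, based on user impact rather than raw change.

    traffic_ratio is current traffic divided by expected traffic for this time.
    """
    if error_rate >= page_error_rate:
        return True  # clear user impact on its own
    # A moderate error rate only pages when another signal corroborates it.
    corroborating = (
        p95_latency_ms > max_latency_ms
        or abs(traffic_ratio - 1.0) > max_traffic_shift
    )
    return error_rate >= warn_error_rate and corroborating


class FlapLimiter:
    """Suppress repeats of the same alert inside a cooldown window."""

    def __init__(self, cooldown_s=300):
        self.cooldown_s = cooldown_s
        self._last_sent = {}  # alert key -> time the alert was last sent

    def allow(self, alert_key, now_s):
        last = self._last_sent.get(alert_key)
        if last is not None and now_s - last < self.cooldown_s:
            return False
        self._last_sent[alert_key] = now_s
        return True


print(should_page(0.0002, 300, 1.0))  # False: 0.02% errors, nothing else wrong
print(should_page(0.10, 300, 1.0))    # True: 10% errors is user impact
print(should_page(0.02, 1200, 1.0))   # True: 2% errors plus slow responses
```

Note that the rule uses absolute thresholds: a doubling from 0.01% to 0.02% never pages, while a jump to 10% always does.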

Then incorporate these alerts into your tabletop exercises. Confirm that:

The right people are paged at the right time
Alerts are understandable in context
On-call isn’t drowning in meaningless warnings

Step 6: Use Dedicated, Structured Incident Channels

When a real incident hits, you don’t want decisions buried in random threads.

Set up a pattern such as:

A dedicated incident channel: #inc-<date>-<short-description>
A standard way to create that channel (bot command, runbook link, etc.)
Automatic posting of:
- Incident start time
- On-call / IC assignment
- Links to relevant dashboards, logs, tickets
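A bot command for this pattern mostly comes down to two small functions: one that builds the channel name and one that builds the opening post. The sketch below assumes nothing about your chat tool's API; `incident_channel`, `opening_post`, the `max_len` default, and the example URL are hypothetical.

```python
import re
from datetime import date


def incident_channel(day, description, max_len=80):
    """Build a channel name of the form #inc-<date>-<short-description>."""
    # Lowercase, and collapse anything that is not a letter or digit into "-".
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    # max_len is an assumption; check the limit your chat tool enforces.
    return f"#inc-{day.isoformat()}-{slug}"[:max_len].rstrip("-")


def opening_post(started_at, ic, links):
    """Build the first message posted automatically in the channel."""
    lines = [f"Incident start: {started_at}", f"IC: @{ic}"]
    lines += [f"{label}: {url}" for label, url in links.items()]
    return "\n".join(lines)


print(incident_channel(date(2024, 3, 5), "Checkout errors: 30%!"))
# #inc-2024-03-05-checkout-errors-30

print(opening_post("10:03", "alex", {"Dashboard": "https://example.com/checkout"}))
```

Generating the name in code keeps channels sortable by date and makes them easy to find later when you archive the incident.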

Within that channel, enforce a few simple rules:

Directed messages: Use @name to assign actions explicitly
Time-stamped updates: The IC or Comms Lead posts periodic summaries covering status, hypotheses, actions, and next steps
Ack required: When given a task, the assignee acknowledges: “On it, ETA 5 min”

This creates a transparent, auditable trail of what happened when—and by whom.
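The "ack required" rule is easy to check mechanically, whether by a bot or by the Scribe. Here is a minimal sketch; the `Task` fields, the `needs_nudge` helper, and the three-minute grace period are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Task:
    assignee: str
    action: str
    assigned_min: int                 # minutes since incident start
    acked_min: Optional[int] = None   # None until the assignee replies


def needs_nudge(tasks, now_min, grace_min=3):
    """Return tasks that were assigned but not acknowledged within the grace period."""
    return [
        task for task in tasks
        if task.acked_min is None and now_min - task.assigned_min > grace_min
    ]


tasks = [
    Task("alex", "Check DB CPU dashboard", assigned_min=2, acked_min=3),
    Task("priya", "Draft status page update", assigned_min=4),
]

for task in needs_nudge(tasks, now_min=10):
    print(f"@{task.assignee} please ack: {task.action}")
# @priya please ack: Draft status page update
```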

Practice this structure during your tabletop exercises so it becomes second nature.

Step 7: Archive and Analyze Communication for Learning

Your incident channel history is one of your most valuable training assets—if you keep it and use it.

A robust incident practice includes:

Archiving communication
- Preserve channels, logs, and dashboards associated with each incident
- Link them from your incident ticket or documentation
Transcription (when needed)
- For voice/video bridges, record and transcribe calls
- Particularly useful for complex, multi-hour incidents
Post-incident reviews that actually teach
- Reconstruct the timeline from the archive: what was known, decided, and done at each step
- Highlight communication patterns: where did confusion or delays come from?
- Update playbooks, on-call rotations, tooling, or training based on real findings
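Reconstructing the timeline usually means merging several archives into one ordered list and then looking for long silences. The sketch below assumes each archive has already been exported as `(timestamp, source, text)` tuples; that format, the function names, and the sample entries are made up for illustration.

```python
from datetime import datetime


def build_timeline(*sources):
    """Merge entries from several archives into one list ordered by time."""
    entries = [entry for source in sources for entry in source]
    return sorted(entries, key=lambda entry: entry[0])


def longest_gap(timeline):
    """Return (gap, before, after) for the largest silence between entries.

    Raises ValueError if the timeline has fewer than two entries.
    """
    pairs = zip(timeline, timeline[1:])
    return max(((b[0] - a[0], a, b) for a, b in pairs), key=lambda item: item[0])


# Made-up sample entries, loosely following the checkout scenario.
alerts = [(datetime(2024, 3, 5, 10, 3), "alert", "Checkout error rate at 30%")]
chat = [
    (datetime(2024, 3, 5, 10, 5), "chat", "IC: paging payments on-call"),
    (datetime(2024, 3, 5, 10, 20), "chat", "Rollback started"),
]

timeline = build_timeline(chat, alerts)
gap, before, after = longest_gap(timeline)
print(gap)                 # 0:15:00
print(before[2], "->", after[2])
```

A 15-minute silence is not automatically a problem, but it is a good place to ask in the review what people knew and what they were waiting for.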

Then, reuse these real incidents as future tabletop scenarios. You now have:

Authentic, organization-specific examples
Real decision dilemmas your team already faced
A feedback loop from reality back into practice

Putting It All Together

The Cardboard Incident Relay is less about props and more about mindset:

Incidents are inevitable; chaos is optional
Calm, effective response is a muscle you build, not a talent you hire
Practice should be low-cost, low-stress, but highly realistic

By defining playbooks, running regular tabletop drills, tightening your monitoring and alerting, structuring incident channels, and archiving your communication, you turn messy firefights into repeatable, improvable operations.

Start small:

Write a playbook for a single, common incident type.
Schedule a one-hour tabletop drill next week.
Use a cardboard baton—or its digital equivalent—to clearly mark the Incident Commander.

Then iterate. After a few months of regular drills, you should see the difference: fewer surprises, less panic, and a team that treats production incidents as what they are: hard problems, not emergencies.

Your future self (and your sleep schedule) will thank you.