The Paper-Driven Chaos Deck: Designing Low‑Tech Incident Cards That Tame High‑Tech Outages
How to design and use simple, paper-based chaos decks to run powerful SRE tabletop exercises that improve incident response, collaboration, and production resilience—without expensive tools or complex setups.
Modern systems fail in complex, surprising ways—and most of that complexity lives in people and processes, not just in code. Yet many organizations still treat incident response as something you only “really learn” during a crisis.
You don’t need a full-blown chaos engineering platform or elaborate staging environments to change that. A stack of paper, a pen, and some thoughtfully designed incident cards are enough to transform how your teams learn to handle outages.
This is the idea behind a paper-driven chaos deck: a low-tech toolkit for running realistic, repeatable tabletop incident simulations that build real-world resilience.
Why Paper-Based Chaos Decks Work So Well
At first glance, paper feels almost primitive compared to production systems, observability stacks, and sophisticated chaos tools. But that’s exactly why chaos decks are powerful.
1. Frictionless and accessible
You don’t need:
- A dedicated environment
- Special accounts or cloud credits
- Simulation tools or complex runbooks
You just need a room (physical or virtual), people, and a deck of incident cards. That low barrier means you can:
- Run exercises during team meetings, onboarding, and game days
- Include non-technical stakeholders (support, product, leadership)
- Avoid the excuse of “we’ll do this when we have the right tool”
2. Structured chaos reveals real gaps
Chaos decks give you structured unpredictability. Each card defines a scenario, constraint, or twist, helping you:
- Expose gaps in incident response plans
- Surface missing documentation or unclear ownership
- Stress-test your on-call rotations, escalation paths, and runbooks
Because the format is consistent (draw cards, respond as a team, reflect), you can repeat exercises and track how your capabilities improve over time.
3. Low-stakes practice builds high-stakes muscle memory
Teams under real incident pressure:
- Default to familiar patterns (even if they’re inefficient)
- Struggle to communicate clearly
- Forget “textbook” processes they never actually practiced
Regular chaos deck sessions build muscle memory for:
- How to declare incidents
- How to communicate internally and externally
- How to make decisions with limited or conflicting information
When a real high‑tech outage hits, the behaviors and collaboration patterns you practiced with paper spill over into the incident bridge.
Anatomy of a Chaos Deck: What’s on the Cards?
A good chaos deck is intentional. It’s not a random pile of disasters; it’s a curated set of prompts aligned with your actual production risks and SRE concerns.
Core card types
You can start with four broad categories:
Incident Scenario Cards
A short description of what’s going wrong. Examples:
- “API latency is spiking across EU users; dashboards show no obvious cause.”
- “Background jobs are stuck; queue depth is climbing, but CPU is low.”
- “New deployment just rolled out; customer support reports timeouts.”
Signal & Detection Cards
How the issue surfaces. Examples:
- “Pager alert: SLO burn rate for checkout latency breaches threshold.”
- “No alert fired. Incident discovered via a tweet from a major customer.”
- “Synthetic monitoring shows failures, but real-user metrics look normal.”
Constraint Cards
Limitations that force realistic trade-offs. Examples:
- “Primary SRE is on a plane; you’re down one experienced responder.”
- “You can’t roll back; the database schema already migrated.”
- “Regulatory: you must avoid data loss at all costs.”
Twist / Escalation Cards
Changes introduced mid-incident. Examples:
- “The mitigation worked, but error rates return after 20 minutes.”
- “A parallel outage appears in an unrelated service.”
- “Legal asks for an impact update in the next 10 minutes.”
You can mix and match: start with an Incident card, flip a Signal card for how it’s detected, then add Constraints or Twists as the exercise unfolds.
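If you later want a digital twin of the paper deck (for remote sessions, say), the mix-and-match flow can be sketched in a few lines of Python. This is only an illustration: the names `CardType`, `Card`, and `ChaosDeck` are hypothetical, and the card texts are copied from the examples above.

```python
import random
from dataclasses import dataclass
from enum import Enum


class CardType(Enum):
    INCIDENT = "Incident Scenario"
    SIGNAL = "Signal & Detection"
    CONSTRAINT = "Constraint"
    TWIST = "Twist / Escalation"


@dataclass(frozen=True)
class Card:
    kind: CardType
    text: str


class ChaosDeck:
    """A deck kept as one shuffled pile per card type (hypothetical helper)."""

    def __init__(self, cards, seed=None):
        self._rng = random.Random(seed)  # pass a seed to replay a session
        self._piles = {kind: [] for kind in CardType}
        for card in cards:
            self._piles[card.kind].append(card)
        for pile in self._piles.values():
            self._rng.shuffle(pile)

    def draw(self, kind):
        """Remove and return the top card of a pile, or None if it is empty."""
        pile = self._piles[kind]
        return pile.pop() if pile else None

    def opening_hand(self):
        """One Incident card plus one Signal card to start the exercise."""
        return self.draw(CardType.INCIDENT), self.draw(CardType.SIGNAL)


DECK = [
    Card(CardType.INCIDENT, "Background jobs are stuck; queue depth is climbing, but CPU is low."),
    Card(CardType.INCIDENT, "New deployment just rolled out; customer support reports timeouts."),
    Card(CardType.SIGNAL, "No alert fired. Incident discovered via a tweet from a major customer."),
    Card(CardType.CONSTRAINT, "Regulatory: you must avoid data loss at all costs."),
    Card(CardType.TWIST, "Legal asks for an impact update in the next 10 minutes."),
]

if __name__ == "__main__":
    deck = ChaosDeck(DECK, seed=7)
    incident, signal = deck.opening_hand()
    print(f"[{incident.kind.value}] {incident.text}")
    print(f"[{signal.kind.value}] {signal.text}")
    twist = deck.draw(CardType.TWIST)  # flipped later, mid-exercise
    print(f"[{twist.kind.value}] {twist.text}")
```

Keeping one pile per type mirrors how a facilitator would sort the physical cards, and drawing without replacement means a scenario is not repeated within a session.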
Aligning Cards With SRE Concerns
To be more than storytelling, your chaos deck should map directly to your SRE priorities: reliability, scalability, and efficiency.
Reliability-focused cards
- “Error budget for search API is at 90% burn for the month. Another spike will freeze all changes. An incident starts now.”
- “Persistent storage is unusually slow; read latency is within SLO, write latency is not.”
- “A dependency’s third-party API is intermittently failing; you have no direct control.”
These scenarios test:
- SLO literacy (Can the team reason about error budgets?)
- Dependency management (Do you have fallbacks or graceful degradation?)
- Prioritization under pressure (Which users or regions do you protect first?)
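The error-budget card is easier to facilitate with numbers on the table. The sketch below works through the arithmetic behind "90% burn", assuming a request-based availability SLO. The 99.9% target, the 50 million monthly requests, and the traffic and error rates are all hypothetical values chosen for illustration; substitute your own.

```python
def error_budget(slo_target, total_requests):
    """Failed requests the SLO tolerates over the whole window."""
    return (1.0 - slo_target) * total_requests


def budget_remaining(slo_target, total_requests, burned_fraction):
    """Failed requests still allowed after part of the budget is burned."""
    return error_budget(slo_target, total_requests) * (1.0 - burned_fraction)


def minutes_until_exhausted(remaining_failures, requests_per_minute, error_rate):
    """How long the remaining budget lasts at a sustained error rate."""
    failures_per_minute = requests_per_minute * error_rate
    if failures_per_minute == 0:
        return float("inf")
    return remaining_failures / failures_per_minute


if __name__ == "__main__":
    # Hypothetical figures, not taken from any real service.
    slo, monthly_requests = 0.999, 50_000_000
    remaining = budget_remaining(slo, monthly_requests, burned_fraction=0.90)
    minutes = minutes_until_exhausted(remaining, requests_per_minute=1_200, error_rate=0.05)
    print(f"Failures left in budget: {remaining:,.0f}")
    print(f"Minutes until change freeze at 5% errors: {minutes:.0f}")
```

A useful prompt during the exercise is to ask the team to estimate the second number before anyone computes it.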
Scalability-focused cards
- “Traffic doubles during an unplanned marketing campaign. Autoscaling lags by 10–15 minutes.”
- “A hot partition forms in your database; one shard is maxed out while the others sit idle.”
- “Your cache hit rate drops suddenly; origin is melting.”
These explore:
- Capacity planning and scaling strategies
- Observability for load patterns and hot spots
- Playbooks for load shedding or rate limiting
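For the traffic-doubling card, a rough calculation helps the team argue about load shedding with concrete numbers. The sketch below assumes a fixed serving capacity until autoscaling catches up; the capacity and traffic figures in the usage note are hypothetical.

```python
def shed_fraction(capacity_rps, incoming_rps):
    """Fraction of incoming requests to reject so accepted load fits capacity."""
    if incoming_rps <= capacity_rps:
        return 0.0
    return 1.0 - capacity_rps / incoming_rps


def rejected_during_lag(capacity_rps, incoming_rps, lag_minutes):
    """Total requests rejected while waiting for new capacity to arrive."""
    excess_rps = max(0.0, incoming_rps - capacity_rps)
    return excess_rps * lag_minutes * 60


if __name__ == "__main__":
    # Hypothetical: 600 rps baseline doubles to 1,200 rps; capacity is 1,000 rps.
    print(f"Shed {shed_fraction(1_000, 1_200):.1%} of requests")
    print(f"Rejected over a 15-minute lag: {rejected_during_lag(1_000, 1_200, 15):,.0f}")
```

With those assumed figures, the team would need to shed about one request in six, or 180,000 requests over a 15-minute autoscaling lag, which naturally raises the question of which traffic to drop first.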
Efficiency-focused cards
Efficiency is not just about cost; it’s also about time, focus, and process.
Examples:
- “On-call is flooded with low-priority alerts during a real incident.”
- “A manual deployment step is forgotten, causing partial rollout failures.”
- “Two teams both assume the other owns the failing service.”
These probe:
- Alert hygiene and prioritization
- Runbook quality and automation gaps
- Ownership clarity and incident role definition
How to Run a Chaos Deck Tabletop Exercise
You don’t need a heavy process. A simple, repeatable format is enough.
1. Set the stage (5–10 minutes)
- Define the goal: e.g., practice incident command, test SLO literacy, onboard new on-call.
- Assign roles: Incident Commander, Communications Lead, Operations, Observers.
- Explain that this is low-stakes learning: the goal is insight, not blame.
2. Draw and read the initial cards (5 minutes)
- Draw an Incident Scenario card and a Signal card.
- Read them aloud and confirm everyone understands the setup.
Optional: let the team ask “clarifying questions” and answer only with what’s written on the cards or what they’d actually know from dashboards/runbooks.
3. Respond as if it’s real (20–30 minutes)
Ask the team to talk through, step by step:
- How do you declare the incident? What severity?
- Who gets paged or invited? Who leads?
- What’s the first place you look? Which dashboards/logs?
- What initial hypotheses do you form?
- What experiments or mitigations do you try first?
At a suitable moment, introduce Constraint or Twist cards to simulate:
- Missing people
- Tool failures
- New surprises or compounding impact
Keep the pace realistic but focused; you’re practicing decision-making, not solving every technical detail.
4. Run a quick debrief (15–20 minutes)
This is where the real value lives. Discuss:
- What went well in communication and decision-making?
- Where did we get stuck? Why?
- Did roles and ownership feel clear?
- What documentation or automation would have helped?
- Which SLOs, dashboards, or alerts did we wish we had?
Capture concrete follow-ups:
- Create/update runbooks
- Refine on-call rotations or escalation paths
- Tune alerts and SLOs
- Add or adjust cards in the deck to reflect new learning
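When scheduling a session, it helps to add up the phase durations suggested above. This small sketch uses exactly those ranges; the `fits` helper and the 60-minute slot are illustrative assumptions.

```python
# (phase, minimum minutes, maximum minutes), taken from the four steps above
AGENDA = [
    ("Set the stage", 5, 10),
    ("Draw and read the initial cards", 5, 5),
    ("Respond as if it's real", 20, 30),
    ("Run a quick debrief", 15, 20),
]


def total_minutes(agenda):
    """Return the (shortest, longest) total session length."""
    shortest = sum(low for _, low, _ in agenda)
    longest = sum(high for _, _, high in agenda)
    return shortest, longest


def fits(agenda, slot_minutes):
    """True only if even the longest version of the agenda fits the slot."""
    return total_minutes(agenda)[1] <= slot_minutes


if __name__ == "__main__":
    shortest, longest = total_minutes(AGENDA)
    print(f"Session length: {shortest}-{longest} minutes")
    print(f"Guaranteed to fit in one hour: {fits(AGENDA, 60)}")
```

The ranges total 45 to 65 minutes, so a one-hour slot works as long as at least one phase is kept below its upper bound. Trim the response phase, not the debrief.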
Integrating Chaos Decks into SRE & IR Learning Paths
Chaos decks are not just a one-off workshop trick; they can be part of an ongoing learning program.
Onboarding new SREs and responders
For new team members, chaos deck sessions:
- Introduce real-world incidents they’re likely to face
- Teach how incidents are declared and run at your org
- Reinforce cultural expectations (blamelessness, collaboration)
Run short sessions as part of onboarding, gradually increasing complexity.
Regular practice for experienced teams
For established SRE and incident response teams:
- Schedule monthly or quarterly tabletop sessions
- Rotate facilitators so more people learn to lead incidents
- Track improvements over time (fewer communication gaps, clearer triage)
Over time, you build shared tacit knowledge: the subtle, practiced skills that don’t appear in docs but make real incidents go more smoothly.
Feeding improvements back into the system
Use chaos deck outcomes to:
- Refine your incident response process and role definitions
- Drive updates to your runbooks, SLOs, and alerts
- Identify systemic risks that deserve engineering investment
The deck itself becomes a living artifact of your learning: you add cards based on real incidents and retire ones that are no longer relevant.
Measuring the Impact of a Paper-Driven Chaos Deck
Even with a simple paper deck, you can track progress. Look for qualitative and quantitative signals:
- Time to clarity: How long until the team can state a shared hypothesis and plan?
- Role fluency: Do people naturally assume and respect incident roles?
- Communication quality: Are status updates clear, concise, and audience-appropriate?
- Follow-through: Are action items from debriefs actually implemented?
Compare these signals across sessions to spot trends, such as more confident responders, better triage, and fewer surprises in real incidents.
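The two quantitative signals, time to clarity and follow-through, are simple enough to log after each session. The sketch below is one possible way to do that. `SessionRecord`, `trend`, and every number in `SESSIONS` are made up for illustration and are not results from real exercises. Role fluency and communication quality are better kept as facilitator notes.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SessionRecord:
    date: str                  # ISO date of the tabletop session
    minutes_to_clarity: float  # time until a shared hypothesis and plan
    action_items_opened: int
    action_items_closed: int

    @property
    def follow_through(self):
        """Share of debrief action items that were actually completed."""
        if self.action_items_opened == 0:
            return 1.0  # nothing was left undone
        return self.action_items_closed / self.action_items_opened


def trend(values):
    """Mean of the later half minus mean of the earlier half."""
    if len(values) < 2:
        return 0.0
    mid = len(values) // 2
    return mean(values[mid:]) - mean(values[:mid])


# Illustrative placeholder data only.
SESSIONS = [
    SessionRecord("2024-01-15", 18.0, 5, 2),
    SessionRecord("2024-02-12", 14.0, 4, 3),
    SessionRecord("2024-03-11", 11.0, 4, 4),
    SessionRecord("2024-04-08", 9.0, 3, 3),
]

if __name__ == "__main__":
    clarity = trend([s.minutes_to_clarity for s in SESSIONS])
    follow = trend([s.follow_through for s in SESSIONS])
    print(f"Change in minutes to clarity: {clarity:+.1f}")
    print(f"Change in follow-through rate: {follow:+.2f}")
```

A negative change in minutes to clarity and a positive change in follow-through both point in the right direction. With only a handful of sessions, treat either number as a conversation starter and not as proof.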
Conclusion: Low-Tech Cards, High-Impact Learning
High-tech outages don’t require equally high-tech training tools. A paper-driven chaos deck offers:
- A low-cost, low-friction way to practice incidents regularly
- Structured tabletop exercises that reveal real gaps in your processes
- A repeatable framework that aligns with SRE priorities—reliability, scalability, and efficiency
- A powerful method to build collaboration, decision-making, and incident muscle memory across the organization
Start small: write 10–20 cards based on your actual production risks and recent incidents. Run a one-hour tabletop with your team. Iterate on both your deck and your process.
The next time a real outage hits, you’ll be glad your first time working together under pressure was with paper, not production.