The Paper-First Chaos Carousel: Spinning Up Low-Tech Drills for High-Stakes Incidents

How to use simple, paper-first chaos drills and tabletop exercises to build real incident response skills—before you touch production or spin up fancy chaos engineering tools.

Modern systems fail in complex, surprising ways—but you don’t need complex tools to prepare for those failures.

Before you unleash chaos engineering tools on your production environment, there’s a safer, cheaper, and often more effective starting point: paper-first chaos drills.

Think of them as a “Chaos Carousel” you can spin up anytime: no infrastructure, no special permissions, just a small group of people, some prompts, and a willingness to ask, “Okay, what if this broke… then what?”

This post walks through how to design, run, and grow a paper-first chaos practice using tabletop exercises, and how to connect what you learn to real-world tools like AWS Fault Injection Simulator and Chaos Monkey—without burning out your team.


Why Start with Paper-First Chaos Drills?

Paper-first chaos drills are low-tech, low-risk practice sessions for incident response:

  • No changes to production
  • No access to cloud consoles required
  • No risk of real outages

Instead of injecting real failures, you simulate incidents on paper (or whiteboards, shared docs, or slides). The goal isn’t realism at the infrastructure level; it’s realism in team behavior, decision-making, and communication.

This approach is powerful because it:

  • Builds muscle memory for how you respond under pressure
  • Exposes gaps in runbooks, alerting, ownership, and escalation paths
  • Aligns expectations between engineering, SRE, support, and leadership
  • Costs almost nothing to run and repeat

You’re rehearsing failure—and your response—where it’s safest to learn: in a room, not in prod.


Tabletop Exercises: The Core of Paper-First Chaos

Tabletop exercises are a key practice in both security and chaos engineering. They’re structured, discussion-based scenarios where a group walks through a hypothetical incident step by step.

Think of a tabletop exercise as a story:

"It’s 2:13 a.m. PagerDuty goes off. Latency has tripled in your payments API. Customers can’t check out. What happens in the first five minutes?"

From there, the facilitator feeds new information:

  • A new alert fires
  • Metrics become available
  • A dependency fails
  • A message from a stakeholder arrives

Participants respond as they would in real life:

  • Who gets paged next?
  • What dashboard do you open first?
  • Do you declare a full incident yet? At what severity?
  • What do you say to customer support or leadership?

The focus is not on "passing" the scenario. It’s on uncovering:

  • Missing runbooks
  • Unclear ownership
  • Overreliance on one person’s knowledge
  • Broken or noisy alerts
  • Communication gaps between teams

Done well, tabletop exercises become a core chaos engineering practice: structured, repeatable, and increasingly realistic over time.


Designing Your First Paper-First Chaos Scenario

You don’t need to start from scratch. Many organizations use structured guides and templates to keep exercises consistent and repeatable.

A basic scenario template can include:

  1. Scenario Name
    Example: "Regional database outage in checkout path"

  2. Business Impact
    • What’s at risk? (revenue, reputation, safety, SLAs)
    • What metrics matter most? (error rate, latency, failed payments)

  3. Starting Conditions
    • Time of day
    • Who is on call
    • What alerts fire first

  4. Timeline Prompts (released by the facilitator)
    • T+5 minutes: A second alert from another service
    • T+10 minutes: Slack explodes with internal messages
    • T+20 minutes: Executive asks for an ETA on resolution

  5. Artifacts to Reference
    • Runbooks (if they exist)
    • Dashboards
    • On-call procedures
    • Incident communication templates

  6. Learning Goals
    Example goals:
    • Validate that everyone knows how to declare an incident
    • Practice updating a status page
    • Identify which dependencies you don’t monitor well

To keep scenarios relevant, tailor them to your own high-stakes incidents:

  • Past outages that hurt the business
  • Scenarios tied to SLAs or regulatory risk
  • Single points of failure you already suspect but haven’t fully explored

The key is repeatability: use the same basic structure across exercises so you can compare improvements over time.
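
To make that structure easy to reuse, you could even capture it as data. Here's a minimal sketch in Python; the class names, fields, and example values are illustrative assumptions, not a standard format:

    from dataclasses import dataclass, field

    @dataclass
    class TimelinePrompt:
        minutes_in: int   # e.g., 5 means "reveal this at T+5 minutes"
        event: str        # what the facilitator announces at that point

    @dataclass
    class ChaosScenario:
        name: str
        business_impact: list[str]       # revenue, SLAs, reputation, ...
        starting_conditions: list[str]   # time of day, who is on call, first alerts
        timeline: list[TimelinePrompt] = field(default_factory=list)
        artifacts: list[str] = field(default_factory=list)       # runbooks, dashboards, templates
        learning_goals: list[str] = field(default_factory=list)

    # The "regional database outage" example from above, expressed as data
    checkout_db_outage = ChaosScenario(
        name="Regional database outage in checkout path",
        business_impact=["Failed payments", "Checkout error rate", "SLA breach risk"],
        starting_conditions=["2:13 a.m.", "One engineer on call", "Latency alert on the payments API"],
        timeline=[
            TimelinePrompt(5, "A second alert fires from another service"),
            TimelinePrompt(10, "Slack explodes with internal messages"),
            TimelinePrompt(20, "Executive asks for an ETA on resolution"),
        ],
        artifacts=["Payments runbook", "Checkout dashboard", "Status page template"],
        learning_goals=["Everyone knows how to declare an incident", "Practice a status page update"],
    )

Keeping scenarios in a small file like this (or a shared doc with the same fields) makes it easy to diff drills over time and notice which gaps keep reappearing.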


Running the Chaos Carousel: How to Facilitate

A good facilitator keeps the exercise focused, realistic, and psychologically safe.

Roles

  • Facilitator: Guides the scenario, introduces new information, keeps time
  • Participants: On-call engineers, SREs, developers, support, sometimes product or leadership
  • Observer / Note-taker: Captures decisions, questions, and follow-up items

Ground Rules

  • Blame-free: This is practice, not performance review
  • Assume good intent: People are doing their best with what they know
  • Stay in character: Respond as you truly would if this were live
  • Timebox: Typically 45–90 minutes

Flow

  1. Set Context (5–10 minutes)
    • State the goals (e.g., "test our incident declaration process")
    • Introduce the scenario and constraints

  2. Simulate the Incident (30–60 minutes)
    • Present the initial alert
    • Ask: "What do you do next? Who do you involve?"
    • Reveal new prompts as time goes on
    • Encourage participants to reference real tools: "What dashboard? Which runbook?"

  3. Debrief (15–30 minutes)
    • What went well?
    • Where were we confused or blocked?
    • What surprised you?
    • What concrete actions should we take next?

The debrief is where chaos drills pay off. It turns a fictional story into real improvements for your systems and processes.
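
If it helps the facilitator keep pace, the timeline prompts can even be scripted. Here's a tiny, optional Python sketch; the prompts and timing are placeholders, and nothing about a paper-first drill requires it:

    import time

    # Hypothetical prompt schedule: (minutes after start, what the facilitator reveals)
    PROMPTS = [
        (0, "T+0: Latency has tripled in the payments API. Customers can't check out."),
        (5, "T+5: A second alert fires from an upstream dependency."),
        (10, "T+10: Slack explodes with internal messages."),
        (20, "T+20: An executive asks for an ETA on resolution."),
    ]

    def run_drill(prompts=PROMPTS, time_scale=1.0):
        """Print each prompt at its offset; set time_scale below 1.0 to compress the clock."""
        elapsed = 0
        for offset, text in prompts:
            time.sleep((offset - elapsed) * 60 * time_scale)
            elapsed = offset
            print(text)
            print("Facilitator: ask the room what they do next, and who they involve.\n")

    if __name__ == "__main__":
        # Play a 20-minute timeline in roughly 2 minutes of wall-clock time
        run_drill(time_scale=0.1)

Script or no script, the point is that prompts land on a clock rather than when the room feels ready, which is what makes the drill feel like a real incident.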


From Paper to Production: Operationalizing with Chaos Tools

Once you’ve run a few paper-first exercises, patterns will emerge:

  • Certain dependencies are consistently fragile
  • Alerts are missing or too noisy
  • Runbooks are outdated or nonexistent
  • Some teams are always in the critical path

That’s when it’s time to connect your learnings to real chaos engineering tools.

Examples:

  • AWS Fault Injection Simulator (FIS)
    Turn a paper scenario like "database latency spike in us-east-1" into an automated experiment (see the sketch after this list) that:
    • Introduces network latency on a specific resource
    • Fails over to another region
    • Observes impact on error rate and user experience

  • Open-Source Tools (e.g., Netflix’s Chaos Monkey)
    If your paper drill centered on "what if we lose a node in this service?", formalize that by:
    • Randomly terminating instances in a controlled environment
    • Ensuring autoscaling and redundancy work as expected
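
For instance, a "database latency spike" tabletop could eventually become an automated FIS run, and a "lost node" drill a controlled, Chaos Monkey-style termination. The sketch below assumes boto3 is configured with credentials, that an FIS experiment template already exists, and that instances are opted in via a tag; the template ID and tag names are placeholders:

    import random
    import boto3

    fis = boto3.client("fis")
    ec2 = boto3.client("ec2")

    def start_latency_experiment(template_id="EXT_PLACEHOLDER"):
        """Start a pre-defined AWS FIS experiment template (e.g., one that injects network latency)."""
        response = fis.start_experiment(experimentTemplateId=template_id)
        return response["experiment"]["id"]

    def terminate_random_opted_in_instance(tag_key="chaos-drill", tag_value="opt-in"):
        """Chaos Monkey-style: terminate one random running instance, limited to explicitly tagged hosts."""
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": f"tag:{tag_key}", "Values": [tag_value]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if not instances:
            return None  # nothing is opted in; that is a finding in itself
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim

Treat this as a sketch to run in a staging or game-day environment first, scoped by the guardrails (stop conditions, opt-in tags) your paper drills told you that you need.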

Don’t skip straight to tools. Use paper-first practice to decide what’s worth automating. This reduces risk and ensures your chaos experiments are aligned with real business concerns—rather than chaos for chaos’s sake.


Spreading the Load: Rotating Participation and On-Call Skills

One of the biggest risks in incident response is knowledge concentration: only a handful of experts know what to do when things break.

Paper-first chaos drills are a perfect way to distribute this knowledge.

  • Rotate who participates: Involve different engineers, teams, and time zones
  • Rotate roles: Sometimes a senior engineer is the primary responder; other times, a newer team member leads
  • Invite non-engineers: Support, product, and ops learn how incidents unfold and how they can help

Benefits:

  • Fewer single points of failure in human expertise
  • More people feel confident taking action during incidents
  • On-call duty feels shared and fair, not like a punishment for a few experts

You’re not just rehearsing technical response—you’re building a culture where incidents are a team sport.


Leadership’s Role: Making Practice Sustainable

Incident response and on-call duty are emotionally demanding. Without support, teams burn out.

Strong leadership involvement turns chaos practice from a stress multiplier into a sustainable reliability engine.

Leaders can:

  • Sponsor time for regular chaos drills and debriefs
  • Reward learning, not just uptime—celebrate good incident practices
  • Ensure follow-through on action items from tabletop exercises
  • Invest in tools and staffing to reduce alert noise and toil
  • Model calm behavior by participating occasionally and respecting incident process

When people see that leadership takes incident practice seriously—and uses it to fix systemic issues rather than blame individuals—psychological safety rises, burnout falls, and reliability improves.


Getting Started: A Simple Playbook

You don’t need a big program to begin. Start small and iterate.

  1. Pick One High-Stakes Scenario
    Something realistic and impactful (e.g., "checkout failures on Black Friday").

  2. Find a Template
    Use or create a simple guide with: scenario, impact, timeline prompts, roles, and goals.

  3. Schedule a 60-Minute Session
    Include one facilitator, a few engineers, maybe a support or product partner.

  4. Run the Exercise and Debrief
    Capture 3–5 concrete improvements (e.g., "Create a runbook for declaring SEV-1 incidents").

  5. Repeat and Rotate
    Next month, change the scenario and participants. Over time, build a backlog of tabletop exercises and track the improvements they drive.

  6. Bridge to Automation
    Once patterns emerge, decide which scenarios to turn into experiments in AWS FIS, Chaos Monkey, or similar tools.


Conclusion: Spin the Carousel Before You Spin Up Chaos

High-stakes incidents are unavoidable. Being unprepared is optional.

Paper-first chaos drills and tabletop exercises give you a safe, inexpensive way to rehearse failure, expose weak spots, and build shared confidence—long before you inject a single packet drop or terminate a single instance in production.

Start with paper. Learn where your systems, processes, and communication are fragile. Then, and only then, move toward automation with tools like AWS Fault Injection Simulator or Chaos Monkey.

If you make the Paper-First Chaos Carousel a regular practice—with rotating participants and visible leadership support—you’ll not only improve your technical reliability. You’ll also create a healthier, more resilient incident culture where everyone knows what to do when things go wrong—and no one has to carry that burden alone.
