The Paper-First Chaos Carousel: Spinning Up Low-Tech Drills for High-Stakes Incidents

How to use simple, paper-first chaos drills and tabletop exercises to build real incident response skills—before you touch production or spin up fancy chaos engineering tools.

Modern systems fail in complex, surprising ways—but you don’t need complex tools to prepare for those failures.

Before you unleash chaos engineering tools on your production environment, there’s a safer, cheaper, and often more effective starting point: paper-first chaos drills.

Think of them as a “Chaos Carousel” you can spin up anytime: no infrastructure, no special permissions, just a small group of people, some prompts, and a willingness to ask, “Okay, what if this broke… then what?”

This post walks through how to design, run, and grow a paper-first chaos practice using tabletop exercises, and how to connect what you learn to real-world tools like AWS Fault Injection Simulator and Chaos Monkey—without burning out your team.


Why Start with Paper-First Chaos Drills?

Paper-first chaos drills are low-tech, low-risk practice sessions for incident response:

  • No changes to production
  • No access to cloud consoles required
  • No risk of real outages

Instead of injecting real failures, you simulate incidents on paper (or whiteboards, shared docs, or slides). The goal isn’t realism at the infrastructure level; it’s realism in team behavior, decision-making, and communication.

This approach is powerful because it:

  • Builds muscle memory for how you respond under pressure
  • Exposes gaps in runbooks, alerting, ownership, and escalation paths
  • Aligns expectations between engineering, SRE, support, and leadership
  • Costs almost nothing to run and repeat

You’re rehearsing failure—and your response—where it’s safest to learn: in a room, not in prod.


Tabletop Exercises: The Core of Paper-First Chaos

Tabletop exercises are a key practice in both security and chaos engineering. They’re structured, discussion-based scenarios where a group walks through a hypothetical incident step by step.

Think of a tabletop exercise as a story:

"It’s 2:13 a.m. PagerDuty goes off. Latency has tripled in your payments API. Customers can’t check out. What happens in the first five minutes?"

From there, the facilitator feeds new information:

  • A new alert fires
  • Metrics become available
  • A dependency fails
  • A message from a stakeholder arrives

Participants respond as they would in real life:

  • Who gets paged next?
  • What dashboard do you open first?
  • Do you declare a full incident yet? At what severity?
  • What do you say to customer support or leadership?

The focus is not on "passing" the scenario. It’s on uncovering:

  • Missing runbooks
  • Unclear ownership
  • Overreliance on one person’s knowledge
  • Broken or noisy alerts
  • Communication gaps between teams

Done well, tabletop exercises become a core chaos engineering practice: structured, repeatable, and increasingly realistic over time.


Designing Your First Paper-First Chaos Scenario

You don’t need to start from scratch. Many organizations use structured guides and templates to keep exercises consistent and repeatable.

A basic scenario template can include:

  1. Scenario Name
    Example: "Regional database outage in checkout path"

  2. Business Impact
    • What’s at risk? (revenue, reputation, safety, SLAs)
    • What metrics matter most? (error rate, latency, failed payments)

  3. Starting Conditions
    • Time of day
    • Who is on call
    • What alerts fire first

  4. Timeline Prompts (released by the facilitator)
    • T+5 minutes: A second alert from another service
    • T+10 minutes: Slack explodes with internal messages
    • T+20 minutes: Executive asks for an ETA on resolution

  5. Artifacts to Reference
    • Runbooks (if they exist)
    • Dashboards
    • On-call procedures
    • Incident communication templates

  6. Learning Goals
    Example goals:
    • Validate that everyone knows how to declare an incident
    • Practice updating a status page
    • Identify which dependencies you don’t monitor well

To keep scenarios relevant, tailor them to your own high-stakes incidents:

  • Past outages that hurt the business
  • Scenarios tied to SLAs or regulatory risk
  • Single points of failure you already suspect but haven’t fully explored

The key is repeatability: use the same basic structure across exercises so you can compare improvements over time.
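
To make that structure easy to reuse, you could even capture it as data. Here's a minimal sketch in Python; the class names, fields, and example values are illustrative assumptions, not a standard format:

    from dataclasses import dataclass, field

    @dataclass
    class TimelinePrompt:
        minutes_in: int   # e.g., 5 means "reveal this at T+5 minutes"
        event: str        # what the facilitator announces at that point

    @dataclass
    class ChaosScenario:
        name: str
        business_impact: list[str]       # revenue, SLAs, reputation, ...
        starting_conditions: list[str]   # time of day, who is on call, first alerts
        timeline: list[TimelinePrompt] = field(default_factory=list)
        artifacts: list[str] = field(default_factory=list)       # runbooks, dashboards, templates
        learning_goals: list[str] = field(default_factory=list)

    # The "regional database outage" example from above, expressed as data
    checkout_db_outage = ChaosScenario(
        name="Regional database outage in checkout path",
        business_impact=["Failed payments", "Checkout error rate", "SLA breach risk"],
        starting_conditions=["2:13 a.m.", "One engineer on call", "Latency alert on the payments API"],
        timeline=[
            TimelinePrompt(5, "A second alert fires from another service"),
            TimelinePrompt(10, "Slack explodes with internal messages"),
            TimelinePrompt(20, "Executive asks for an ETA on resolution"),
        ],
        artifacts=["Payments runbook", "Checkout dashboard", "Status page template"],
        learning_goals=["Everyone knows how to declare an incident", "Practice a status page update"],
    )

Keeping scenarios in a small file like this (or a shared doc with the same fields) makes it easy to diff drills over time and notice which gaps keep reappearing.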


Running the Chaos Carousel: How to Facilitate

A good facilitator keeps the exercise focused, realistic, and psychologically safe.

Roles

  • Facilitator: Guides the scenario, introduces new information, keeps time
  • Participants: On-call engineers, SREs, developers, support, sometimes product or leadership
  • Observer / Note-taker: Captures decisions, questions, and follow-up items

Ground Rules

  • Blame-free: This is practice, not performance review
  • Assume good intent: People are doing their best with what they know
  • Stay in character: Respond as you truly would if this were live
  • Timebox: Typically 45–90 minutes

Flow

  1. Set Context (5–10 minutes)
    • State the goals (e.g., "test our incident declaration process")
    • Introduce the scenario and constraints

  2. Simulate the Incident (30–60 minutes)
    • Present the initial alert
    • Ask: "What do you do next? Who do you involve?"
    • Reveal new prompts as time goes on
    • Encourage participants to reference real tools: "What dashboard? Which runbook?"

  3. Debrief (15–30 minutes)
    • What went well?
    • Where were we confused or blocked?
    • What surprised you?
    • What concrete actions should we take next?

The debrief is where chaos drills pay off. It turns a fictional story into real improvements for your systems and processes.
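
If it helps the facilitator keep pace, the timeline prompts can even be scripted. Here's a tiny, optional Python sketch; the prompts and timing are placeholders, and nothing about a paper-first drill requires it:

    import time

    # Hypothetical prompt schedule: (minutes after start, what the facilitator reveals)
    PROMPTS = [
        (0, "T+0: Latency has tripled in the payments API. Customers can't check out."),
        (5, "T+5: A second alert fires from an upstream dependency."),
        (10, "T+10: Slack explodes with internal messages."),
        (20, "T+20: An executive asks for an ETA on resolution."),
    ]

    def run_drill(prompts=PROMPTS, time_scale=1.0):
        """Print each prompt at its offset; set time_scale below 1.0 to compress the clock."""
        elapsed = 0
        for offset, text in prompts:
            time.sleep((offset - elapsed) * 60 * time_scale)
            elapsed = offset
            print(text)
            print("Facilitator: ask the room what they do next, and who they involve.\n")

    if __name__ == "__main__":
        # Play a 20-minute timeline in roughly 2 minutes of wall-clock time
        run_drill(time_scale=0.1)

Script or no script, the point is that prompts land on a clock rather than when the room feels ready, which is what makes the drill feel like a real incident.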


From Paper to Production: Operationalizing with Chaos Tools

Once you’ve run a few paper-first exercises, patterns will emerge:

  • Certain dependencies are consistently fragile
  • Alerts are missing or too noisy
  • Runbooks are outdated or nonexistent
  • Some teams are always in the critical path

That’s when it’s time to connect your learnings to real chaos engineering tools.

Examples:

  • AWS Fault Injection Simulator (FIS)
    Turn a paper scenario like "database latency spike in us-east-1" into an automated experiment (see the sketch after this list) that:
    • Introduces network latency on a specific resource
    • Fails over to another region
    • Observes impact on error rate and user experience

  • Open-Source Tools (e.g., Netflix’s Chaos Monkey)
    If your paper drill centered on "what if we lose a node in this service?", formalize that by:
    • Randomly terminating instances in a controlled environment
    • Ensuring autoscaling and redundancy work as expected
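
For instance, a "database latency spike" tabletop could eventually become an automated FIS run, and a "lost node" drill a controlled, Chaos Monkey-style termination. The sketch below assumes boto3 is configured with credentials, that an FIS experiment template already exists, and that instances are opted in via a tag; the template ID and tag names are placeholders:

    import random
    import boto3

    fis = boto3.client("fis")
    ec2 = boto3.client("ec2")

    def start_latency_experiment(template_id="EXT_PLACEHOLDER"):
        """Start a pre-defined AWS FIS experiment template (e.g., one that injects network latency)."""
        response = fis.start_experiment(experimentTemplateId=template_id)
        return response["experiment"]["id"]

    def terminate_random_opted_in_instance(tag_key="chaos-drill", tag_value="opt-in"):
        """Chaos Monkey-style: terminate one random running instance, limited to explicitly tagged hosts."""
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": f"tag:{tag_key}", "Values": [tag_value]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instances = [i["InstanceId"] for r in reservations for i in r["Instances"]]
        if not instances:
            return None  # nothing is opted in; that is a finding in itself
        victim = random.choice(instances)
        ec2.terminate_instances(InstanceIds=[victim])
        return victim

Treat this as a sketch to run in a staging or game-day environment first, scoped by the guardrails (stop conditions, opt-in tags) your paper drills told you that you need.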

Don’t skip straight to tools. Use paper-first practice to decide what’s worth automating. This reduces risk and ensures your chaos experiments are aligned with real business concerns—rather than chaos for chaos’s sake.


Spreading the Load: Rotating Participation and On-Call Skills

One of the biggest risks in incident response is knowledge concentration: only a handful of experts know what to do when things break.

Paper-first chaos drills are a perfect way to distribute this knowledge.

  • Rotate who participates: Involve different engineers, teams, and time zones
  • Rotate roles: Sometimes a senior engineer is the primary responder; other times, a newer team member leads
  • Invite non-engineers: Support, product, and ops learn how incidents unfold and how they can help

Benefits:

  • Fewer single points of failure in human expertise
  • More people feel confident taking action during incidents
  • On-call duty feels shared and fair, not like a punishment for a few experts

You’re not just rehearsing technical response—you’re building a culture where incidents are a team sport.


Leadership’s Role: Making Practice Sustainable

Incident response and on-call duty are emotionally demanding. Without support, teams burn out.

Strong leadership involvement turns chaos practice from a stress multiplier into a sustainable reliability engine.

Leaders can:

  • Sponsor time for regular chaos drills and debriefs
  • Reward learning, not just uptime—celebrate good incident practices
  • Ensure follow-through on action items from tabletop exercises
  • Invest in tools and staffing to reduce alert noise and toil
  • Model calm behavior by participating occasionally and respecting incident process

When people see that leadership takes incident practice seriously—and uses it to fix systemic issues rather than blame individuals—psychological safety rises, burnout falls, and reliability improves.


Getting Started: A Simple Playbook

You don’t need a big program to begin. Start small and iterate.

  1. Pick One High-Stakes Scenario
    Something realistic and impactful (e.g., "checkout failures on Black Friday").

  2. Find a Template
    Use or create a simple guide with: scenario, impact, timeline prompts, roles, and goals.

  3. Schedule a 60-Minute Session
    Include one facilitator, a few engineers, maybe a support or product partner.

  4. Run the Exercise and Debrief
    Capture 3–5 concrete improvements (e.g., "Create a runbook for declaring SEV-1 incidents").

  5. Repeat and Rotate
    Next month, change the scenario and participants. Over time, build a backlog of tabletop exercises and track the improvements they drive.

  6. Bridge to Automation
    Once patterns emerge, decide which scenarios to turn into experiments in AWS FIS, Chaos Monkey, or similar tools.


Conclusion: Spin the Carousel Before You Spin Up Chaos

High-stakes incidents are unavoidable. Being unprepared is optional.

Paper-first chaos drills and tabletop exercises give you a safe, inexpensive way to rehearse failure, expose weak spots, and build shared confidence—long before you inject a single packet drop or terminate a single instance in production.

Start with paper. Learn where your systems, processes, and communication are fragile. Then, and only then, move toward automation with tools like AWS Fault Injection Simulator or Chaos Monkey.

If you make the Paper-First Chaos Carousel a regular practice—with rotating participants and visible leadership support—you’ll not only improve your technical reliability. You’ll also create a healthier, more resilient incident culture where everyone knows what to do when things go wrong—and no one has to carry that burden alone.
