Rain Lag

The Analog Incident Compass Trainset: Practice Outages Without Touching Prod

How to build a tabletop, paper-based “incident trainset” that lets your team rehearse outages, practice on-call skills, and uncover resilience gaps—without ever risking production.

Introduction

Most teams only really meet their incident response plan when something is already on fire.

The document exists. The runbooks exist. The dashboards exist. But the first time many people experience them together is at 3 a.m., with customer impact mounting and nerves fraying.

Tabletop outage drills are how you fix that.

In this post, we’ll explore the idea of an “Analog Incident Compass Trainset”: a fully paper (or whiteboard) simulation of your production world that lets you practice real incidents without ever touching prod. Think of it as a model railway for outages: a safe little world that behaves like your real one, where you can drive trains off cliffs, derail services, and learn from the wreckage.

We’ll cover:

  • Why tabletop drills matter more than yet another incident doc
  • How “serious games” like the Analog Incident Compass Trainset build on-call skill and confidence
  • How to design scenarios with realistic artifacts and branching narratives
  • How to instrument your simulations with simple analytics to track team progress

Why Tabletop Outage Drills Beat Static Playbooks

Written incident plans are necessary, but they’re not sufficient. Knowing about a process isn’t the same as being able to run it under pressure.

Tabletop outage drills turn static plans into real-world readiness by:

  • Exercising decision-making under pressure: Who declares the incident? When do you page another team? Do you roll back or roll forward?
  • Practicing coordination: How do the on-call, incident commander, comms, and stakeholders move in sync instead of in conflict?
  • Improving communication: What do we say to customers, execs, and internal teams, and when?
  • Exposing gaps: The moment someone asks, “Where’s the dashboard for this?” or “Who even owns this service?” you’ve found a valuable hole—before a customer does.

All of this can be done in a low-risk, no-impact environment. That’s the core of the Analog Incident Compass Trainset: realistic stress, zero production risk.


From Wheel of Misfortune to the Analog Incident Compass Trainset

The Trainset belongs to a growing family of serious games for reliability and incident response.

You may have heard of:

  • Wheel of Misfortune – An SRE practice where you “spin” into a random outage scenario and work through it as a group.
  • Game days / chaos drills – Intentional fault-injection in staging or production to test resilience and response.

The Analog Incident Compass Trainset takes inspiration from these, but emphasizes:

  • No code, no infrastructure needed – Everything runs on paper, index cards, and laptops open to static fake dashboards.
  • Guided but flexible narrative – A facilitator steers the story while letting the team’s choices meaningfully shape what happens.
  • Repeatability and instrumentation – Run the same scenario with different cohorts, measure performance, compare outcomes.

It’s a model of your system and your org, shrunk down to tabletop scale so you can safely experiment with how they behave in crisis.


Building Your Paper Railway: Core Components

You don’t need much to get started. Think in terms of:

  1. Map of the world
    A one-page diagram of your system:

    • Key services and dependencies
    • External providers (payment processor, auth provider, CDN, etc.)
    • Data stores and critical queues / topics

    This is the “track layout” for your trainset.

  2. Incident scenario cards
    Each card is a possible derailment:

    • Past real incidents (sanitized if necessary)
    • Fictional but plausible failures
    • External incidents (cloud provider outages, certificate expiry, config pushes)
  3. Artifacts packet
    To make it feel real, give teams artifacts they’d actually use:

    • Monitoring links or screenshots
    • Time-series graphs (CPU, latency, error rate)
    • Logs (sanitized snippets)
    • Runbooks / playbooks
    • Ticket snapshots or fake Slack threads
  4. Roles
    Assign people to:

    • Incident Commander (IC)
    • Primary on-call / tech lead
    • Comms lead (customer & stakeholder updates)
    • Scribe / recorder
    • Supporting engineers or representatives from other teams
  5. Facilitator & script
    One person runs the simulation:

    • Releases artifacts over time (“At T+5 minutes, you see this graph…”)
    • Plays the role of external systems, customers, and management
    • Tracks decisions and timings

That’s it. You now have a basic trainset: world, trains, track, and a way to crash them.
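If you later want to generate scenario cards or track the facilitator's timed artifact releases digitally, the kit above maps naturally onto a couple of small data structures. A minimal sketch in Python; the field names, card, and `due_at` helper are illustrative, not a prescribed format:

```python
from dataclasses import dataclass, field

@dataclass
class Artifact:
    kind: str        # "graph", "log", "runbook", "slack"...
    label: str
    release_at: int  # minutes after scenario start (T+N)

@dataclass
class ScenarioCard:
    title: str
    fault: str       # the underlying derailment
    artifacts: list = field(default_factory=list)

    def due_at(self, t_minutes: int):
        """Artifacts the facilitator should have handed out by T+t."""
        return [a for a in self.artifacts if a.release_at <= t_minutes]

card = ScenarioCard(
    title="Checkout latency spike",
    fault="Expired TLS certificate at the payment processor",
    artifacts=[
        Artifact("graph", "API latency by region", 0),
        Artifact("log", "Gateway TLS handshake errors", 5),
        Artifact("slack", "Sales: is this impacting EU customers?", 10),
    ],
)
```

At T+5, `card.due_at(5)` tells the facilitator that the latency graph and the TLS error log should both be on the table, while the Slack escalation is still held back.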


Designing Realistic, Stress-Testing Scenarios

The art is in scenario design. Strong scenarios:

  • Are grounded in real risk: model common failure modes—bad deploys, dependency timeouts, database saturation, expired certificates.
  • Require cross-team interaction: the best incidents touch multiple services or owners.
  • Have layers of discoverable information: early clues, misleading signals, and eventually the smoking gun.

Use Past and Fictional Incidents

Mix both:

  • Past incidents let you rehearse known weak spots and validate that fixes (technical or process) actually change behavior.
  • Fictional incidents keep people from “cheating” with prior knowledge and let you simulate future or edge-case risks.

In both, the goal is to:

  • Practice debugging under stress
  • Reinforce incident management protocols
  • Refine cross-team communication patterns

Include Production-like Artifacts

To avoid abstract puzzles, mirror actual investigations:

  • Monitoring & dashboards: Provide printed screenshots or static HTML exports. Example: “API latency by region,” “Error rate by endpoint,” “DB connections.”
  • Time-series data: Show snapshots at T+0, T+5, T+15. Let the team request additional views.
  • Runbooks: Include playbooks with some missing or outdated steps to reveal gaps.
  • Slack or ticket excerpts: Incoming customer complaints, sales asking “Is this impacting EU customers?” etc.

The more the exercise feels like a real shift, the more transferable the learning.


Adding Branching Narratives with Interactive Tools

You can take your trainset further by scripting branching stories using interactive narrative tools like Ink (from Inkle) or similar engines.

Why?

  • Different choices should have different consequences: delaying incident declaration might cause more customer impact; rolling back might fix one symptom but expose another.
  • Branching paths let you explore best case, worst case, and weird case outcomes from the same initial fault.

How It Works in Practice

  1. Author the story in Ink, defining:

    • The core fault (e.g., misconfigured feature flag + cache stampede)
    • Decision points (declare incident, roll back, fail over, page another team)
    • Consequences (graphs change, customers get angrier, dependencies fail)
  2. Run it analog-style:

    • The facilitator uses the Ink script as a behind-the-scenes guide.
    • When the team chooses an action, the facilitator jumps to the matching branch and presents the next artifact.
  3. Capture paths and outcomes:

    • Which branch did this team follow?
    • Where did they get stuck?
    • Did they find the fastest or safest resolution path?

You now have a guided but flexible exercise, replayable across cohorts while still keeping it dynamic.
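Ink is a full narrative language, but the underlying structure is simple enough to sketch directly: a graph of story beats where each choice names the next node. A toy Python stand-in, with node names and beat text invented for illustration:

```python
# Each node is a story beat; each choice maps to the next node.
# This is a stand-in for an Ink script, not Ink itself.
SCENARIO = {
    "start": {
        "text": "T+0: p99 latency doubles on checkout.",
        "choices": {"declare incident": "declared", "keep watching": "watch"},
    },
    "watch": {
        "text": "T+10: error rate climbs; support tickets arrive.",
        "choices": {"declare incident": "declared"},
    },
    "declared": {
        "text": "IC assigned. Roll back or fail over?",
        "choices": {"roll back": "recovered", "fail over": "recovered"},
    },
    "recovered": {"text": "T+25: metrics return to baseline.", "choices": {}},
}

def run(path):
    """Replay a list of team choices; return the beats visited."""
    node = "start"
    beats = [SCENARIO[node]["text"]]
    for choice in path:
        node = SCENARIO[node]["choices"][choice]
        beats.append(SCENARIO[node]["text"])
    return beats

# Two cohorts, two paths through the same initial fault:
fast = run(["declare incident", "roll back"])
slow = run(["keep watching", "declare incident", "roll back"])
```

Both paths end at the same recovery beat, but the slower cohort passes through an extra ten minutes of customer impact, exactly the kind of branch-level difference worth capturing across teams.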


Treat Simulations Like Chaos Experiments

Even in pure tabletop form, these drills function as chaos-style experiments on your processes and architecture.

They help you uncover:

  • Organizational single points of failure: “Only Alice knows how to restart that job.”
  • Monitoring blind spots: “We have no graph for queue depth here.”
  • Runbook decay: Steps that reference non-existent tools or owners.
  • Escalation bottlenecks: Confusion about who can authorize a rollback or customer announcement.

The key is to treat the exercise as an experiment, not a performance review:

  • Hypothesize: “We believe teams can diagnose a partial region outage within 15 minutes.”
  • Test via simulation.
  • Observe and measure.
  • Improve documentation, ownership, or architecture.

You’re effectively hardening your system by hardening the human and process layer.


Instrumenting Your Trainset: Logs, Heatmaps, and Cohorts

To move beyond “that was interesting” into measurable improvement, instrument your simulations.

You don’t need heavy tooling. Start with:

  1. Action log
    During the drill, record:

    • Timestamps for key events (incident declared, first mitigation, full recovery)
    • Decisions made (rollback vs roll-forward, when to page, what to communicate)
    • Requests for artifacts (what data did they ask for, and when?)
  2. Skill heatmaps
    After each exercise, rate (lightweight, 1–5 scale):

    • Technical diagnosis
    • Use of monitoring & logs
    • IC leadership
    • Communication (internal & external)
    • Cross-team collaboration

    Visualizing this over multiple sessions gives you a skill heatmap. You’ll see areas where:

    • Individuals excel (future ICs or mentors)
    • Teams struggle (e.g., everyone underuses logging)
  3. Cohort views
    Run the same scenario with:

    • Different teams (backend vs SRE vs support)
    • Different experience levels (junior vs senior on-call)

    Compare:

    • Time to declare an incident
    • Time to root cause
    • Time to first customer communication

These simple analytics turn your Analog Incident Compass Trainset into a feedback loop, not a one-off exercise.
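The cohort metrics above fall straight out of the action log. A minimal sketch of the computation, assuming the scribe records `(timestamp, event)` pairs; the event names and times here are made up:

```python
from datetime import datetime

# Raw action log as recorded by the scribe during one drill.
log = [
    ("2024-05-01T10:00", "first_alert"),
    ("2024-05-01T10:07", "incident_declared"),
    ("2024-05-01T10:19", "first_customer_comms"),
    ("2024-05-01T10:31", "root_cause_identified"),
]

def minutes_between(events, start, end):
    """Elapsed minutes between two named events in the log."""
    ts = {name: datetime.fromisoformat(t) for t, name in events}
    return (ts[end] - ts[start]).total_seconds() / 60

metrics = {
    "time_to_declare": minutes_between(log, "first_alert", "incident_declared"),
    "time_to_comms": minutes_between(log, "first_alert", "first_customer_comms"),
    "time_to_root_cause": minutes_between(log, "first_alert", "root_cause_identified"),
}
```

Compute the same three numbers for every cohort that runs a scenario and you have a comparable baseline: a spreadsheet row per drill is plenty to start spotting trends.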


Making It Stick: Cadence and Culture

To get real value, treat these drills as part of your operational culture, not a novelty.

  • Run them regularly: Monthly or quarterly, with rotating scenarios and participants.
  • Normalize learning, not blame: The goal is to reveal weak spots so you can fix them, not to embarrass anyone.
  • Feed outcomes back into the system:
    • Update runbooks and ownership docs.
    • Add missing dashboards or alerts.
    • Adjust your incident response policy as needed.

Over time, you’ll see:

  • More confident on-call engineers
  • Fewer surprises in real incidents
  • Faster, calmer, more coordinated responses when things do break

Conclusion

The Analog Incident Compass Trainset is a simple idea: build a safe, small-scale model of your production world and drive failures through it on purpose.

By combining tabletop outage drills, realistic artifacts, branching narratives, and lightweight analytics, you can:

  • Turn static incident plans into lived muscle memory
  • Build on-call confidence without touching production
  • Discover organizational and technical gaps before customers do

You don’t need fancy tooling to start—just paper, people, and a willingness to treat incidents as a skill to be practiced, not a crisis to be merely survived.

Build your trainset. Derail a few services. Learn. Then, when the real trains wobble on the real tracks, your team will already know which way the compass points.
