Rain Lag

The Analog Incident Train Station Sand Table: Rehearsing Outages With a Tactile, Moveable Paper Landscape

How a low‑tech, paper‑based “sand table” of your systems can transform incident response practice, reveal hidden dependencies, and build a stronger culture of preparedness.

Introduction

Digital systems fail in stubbornly analog ways.

Alerts fire. People scramble. Communication channels clog. Someone pings the wrong team. A critical dependency you’d forgotten about suddenly becomes the linchpin of the entire outage. None of this chaos is visible on a dashboard.

That’s where the analog incident train station sand table comes in.

Borrowing from military sand tables and model train layouts, this is a physical, tactile model of your infrastructure and workflows—often built from paper, sticky notes, strings, and movable pieces. You use it to rehearse outages in slow motion: walking through realistic failures step‑by‑step, with everyone around the same table.

It’s low‑tech, cheap, and surprisingly powerful.

In this post, we’ll explore what an incident sand table is, how it works, why it’s so effective, and how you can build and use one to improve your organization’s incident response.


What Is an Analog Incident Train Station Sand Table?

Think of a train station control room, with a big map of tracks, switches, and trains. Now swap the trains for:

  • Services and microservices
  • Databases and queues
  • External vendors and APIs
  • User segments and clients
  • Teams, roles, and communication channels

Then build that world out of paper, index cards, tape, string, and little movable tokens.

That’s your incident sand table: a physical, moveable landscape of your system.

Key characteristics:

  • Tactile and physical: People stand (or sit) around it, move pieces, draw connections, and literally point at things.
  • Low‑tech: No special software required; paper and markers are enough.
  • Scenario‑driven: You use it to play through outages, like an incident response tabletop exercise.
  • Collaborative: Engineers, SREs, support, product, and leadership share the same picture.

Instead of staring at dashboards and diagrams during an exercise, participants inhabit the system: they move parts around, simulate failures, and act out how they would respond.


Why Not Just Use Dashboards and Diagrams?

You already have architecture diagrams, runbooks, and dashboards. Why bother with scissors and tape?

Because those tools are:

  • Abstract: Diagrams are static and often outdated; dashboards show metrics, not relationships.
  • Individual: Each person sees their own screen; shared understanding is implicit, not explicit.
  • Used under time pressure: During real incidents, there’s no time to slow down and examine how the system really behaves.

The analog sand table adds what digital tools often miss:

1. Embodied understanding

Moving a "database" card and seeing that it’s connected by strings to six different services makes dependency sprawl visceral, not theoretical.

2. Shared mental model

There is one model in the middle of the room. Everyone is literally “on the same page” and can challenge or clarify assumptions on the spot.

3. Space for reflection

You run scenarios in slow motion. You can stop, rewind, and ask, “What would really happen here?” That’s hard to do when pager alerts are blaring.


How the Sand Table Works in Practice

You can think of a sand table session as a live‑action tabletop exercise.

Step 1: Build the landscape

You start by mapping:

  • Core components: Services, data stores, queues, caches, external APIs
  • User entry points: Web, mobile, partners, internal tools
  • Key dependencies: Networks, DNS, identity providers, cloud regions
  • Teams and roles: On‑call SRE, incident commander, customer support, comms, product

Concrete materials work well:

  • Index cards or sticky notes for components and teams
  • Colored string or tape for connections and data flows
  • Tokens or small objects for customers or requests
  • Different colors to indicate criticality or ownership

The purpose is not perfect fidelity but a useful, manipulable approximation of your system.
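If you want a record of the table between sessions, the cards and strings can be written down as a small dependency list. The sketch below is illustrative only: the card names (`web`, `api-gateway`, `orders-db`, and so on) and the helpers `fan_in` and `most_depended_on` are hypothetical, not part of any particular tool. It counts how many strings arrive at each card, which is the number you feel when you pick up a heavily connected card.

```python
from collections import defaultdict

# Hypothetical cards and strings; replace with your own landscape.
# Each pair (a, b) means "a depends on b": a string runs from card a to card b.
STRINGS = [
    ("web", "api-gateway"),
    ("mobile", "api-gateway"),
    ("api-gateway", "auth"),
    ("api-gateway", "checkout"),
    ("checkout", "payments-vendor"),
    ("checkout", "orders-db"),
    ("auth", "users-db"),
    ("reporting", "orders-db"),
]


def fan_in(strings):
    """Count how many cards depend directly on each card."""
    counts = defaultdict(int)
    for _source, target in strings:
        counts[target] += 1
    return dict(counts)


def most_depended_on(strings, top=3):
    """Return the `top` cards with the most incoming strings, busiest first."""
    counts = fan_in(strings)
    # Sort by descending count, then by name so ties are stable.
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:top]


if __name__ == "__main__":
    for card, n in most_depended_on(STRINGS):
        print(f"{card}: {n} incoming strings")
```

A list like this is only a snapshot of what the group agreed on at the table; the conversation that produced it is still the point of the exercise.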

Step 2: Choose an incident scenario

Craft a specific, realistic failure, for example:

  • Primary database in Region A becomes read‑only
  • Third‑party payment processor times out intermittently
  • DNS misconfiguration makes the API unreachable
  • Internal auth service is degraded and returns 500s for some users

Write the scenario on a card and define:

  • Starting conditions (time of day, load, active campaigns)
  • Initial symptoms (alerts, user reports, dashboards)
  • Known unknowns (what’s ambiguous at the start)
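
For teams that keep a library of scenarios, the same card can be captured as a small record. This is a minimal sketch, assuming a hypothetical `ScenarioCard` type whose fields mirror the three items above; the example values are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioCard:
    """One incident scenario, mirroring the fields written on the paper card."""

    failure: str
    starting_conditions: list[str] = field(default_factory=list)
    initial_symptoms: list[str] = field(default_factory=list)
    known_unknowns: list[str] = field(default_factory=list)

    def missing_sections(self) -> list[str]:
        """Names of sections that are still empty, so gaps show up before the session."""
        sections = {
            "starting_conditions": self.starting_conditions,
            "initial_symptoms": self.initial_symptoms,
            "known_unknowns": self.known_unknowns,
        }
        return [name for name, items in sections.items() if not items]


# Illustrative values only; write your own from real traffic patterns and alerts.
card = ScenarioCard(
    failure="Third-party payment processor times out intermittently",
    starting_conditions=["Weekday evening", "Promotional campaign running"],
    initial_symptoms=["Checkout latency alert", "Support tickets about failed payments"],
)
print(card.missing_sections())  # known_unknowns has not been filled in yet
```

Checking for empty sections before the session is a cheap way to catch a scenario that has no ambiguity in it, which tends to make for a flat exercise.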

Step 3: Play it out step‑by‑step

Then, with the whole group:

  1. Trigger the failure: Move or flip a card to indicate the broken component.
  2. Simulate signals: Place “alert” tokens at the relevant services or dashboards.
  3. Assign roles: Incident commander, comms, primary responder, subject‑matter experts.
  4. Act out the response in rounds of a few minutes each:
    • What does each role see right now?
    • Who talks to whom, via which channel?
    • What action do they take? (e.g., failover, feature flag, rollback, comms)

You physically move tokens and cards to reflect these decisions:

  • A “customer request” token fails to reach the database
  • A “message” token travels from support to the incident channel
  • A “runbook” card is pulled in when someone decides to consult docs

Step 4: Observe information flow and coordination

As you play, you watch for:

  • Where does information pool or stall?
  • Who gets overloaded? (Too many lines converging on one role or system)
  • Which dependencies surprise people?
  • What assumptions differ between teams?

This is where the sand table shines: the bottlenecks are visible in how crowded certain areas become, or how often a token has to travel back and forth.
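
If a note-taker logs each "message" token move during the session, the same crowding can be tallied afterwards. The sketch below assumes a hypothetical log format of (round, sender, receiver) tuples and invented role names; it is one possible way to summarize the moves, not a prescribed method.

```python
from collections import Counter

# Hypothetical log of "message" token moves from one session: (round, sender, receiver).
MOVES = [
    (1, "support", "incident-channel"),
    (1, "on-call SRE", "incident commander"),
    (2, "comms", "incident commander"),
    (2, "product", "incident commander"),
    (2, "on-call SRE", "database SME"),
    (3, "support", "incident commander"),
    (3, "database SME", "on-call SRE"),
]


def inbound_load(moves):
    """Count how many message tokens arrived at each role or channel."""
    return Counter(receiver for _round, _sender, receiver in moves)


def back_and_forth(moves):
    """Count tokens per unordered pair, to spot tokens bouncing between the same two parties."""
    return Counter(frozenset((sender, receiver)) for _round, sender, receiver in moves)


if __name__ == "__main__":
    for role, n in inbound_load(MOVES).most_common(3):
        print(f"{role}: {n} inbound messages")
    for pair, n in back_and_forth(MOVES).items():
        if n > 1:
            print(f"{' <-> '.join(sorted(pair))}: {n} exchanges")
```

In this made-up log, the incident commander receives four of the seven messages, which is the kind of pattern worth raising in the debrief.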

Step 5: Debrief and capture improvements

After the scenario, debrief explicitly:

  • What worked well?
  • Where were the slowdowns or confusion?
  • Which runbooks or dashboards were missing or unclear?
  • Which communication patterns helped or hurt?

Turn these into concrete changes:

  • Edit or create runbooks
  • Adjust on‑call rotations or escalation paths
  • Add or refine alerts and dashboards
  • Clarify interfaces between teams

Repeated over time, these sessions form an improvement loop for both technical and human processes.


Designing an Effective Sand Table: Lessons from Systems Thinking

Good sand tables are informed by good systems design. A few principles help.

1. Model reliability, not just functionality

Don’t only draw “happy path” flows. Make reliability concerns first‑class:

  • Show replicas, failover paths, and backups
  • Represent SLOs (e.g., by marking particularly critical paths)
  • Include operational tools (observability stack, feature flags, CI/CD)

This keeps the conversation anchored in resilience.

2. Make interfaces explicit

Treat every boundary as an interface:

  • Between microservices
  • Between your system and external vendors
  • Between teams (SRE ↔ Product, Support ↔ Engineering)

Label what flows across each interface:

  • Data types
  • Contracts (SLAs/SLOs)
  • Communication channels (Slack, PagerDuty, email)

This reveals where unclear or brittle interfaces will hurt you during incidents.

3. Embrace multiple scales

Use visual cues for levels of abstraction:

  • High‑level: user journeys and critical business flows
  • Mid‑level: services and data stores
  • Low‑level: key components that frequently fail (e.g., caches, message brokers)

You don’t need every detail, but you do need enough fidelity to reason about failure modes and coordination patterns.

4. Invite cross‑functional participation

Incidents are socio‑technical. Pull in:

  • Engineers and SREs
  • Support and Customer Success
  • Product and Marketing (for user impact and comms)
  • Incident managers or leadership (if relevant)

Each group sees different parts of the system. The sand table makes that diversity of perspective a feature, not a source of misalignment.


Why This Low‑Tech Approach Works So Well

Despite (or because of) its simplicity, an analog sand table delivers real benefits.

1. Better visualization of complexity and dependencies

Seeing services, queues, and user flows laid out on a table—with strings criss‑crossing—makes complexity concrete. People quickly spot:

  • Hidden single points of failure
  • Overloaded shared components
  • Overly complex paths for critical user journeys
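
The first item in that list can be checked on paper by covering one card at a time and asking whether a request token can still reach its destination. The sketch below does the same thing in code, under a simplifying assumption: a journey succeeds if any single path from the entry card to the goal card survives. The card names and the helpers `reachable` and `single_points_of_failure` are hypothetical.

```python
from collections import deque

# Hypothetical request-flow strings for one journey: card -> cards it can call next.
# Every card appears as a key, even if it calls nothing.
FLOWS = {
    "web": ["lb"],
    "lb": ["api-a", "api-b"],
    "api-a": ["cache", "orders-db"],
    "api-b": ["orders-db"],
    "cache": ["orders-db"],
    "orders-db": [],
}


def reachable(flows, start, goal, removed=None):
    """Breadth-first search: can a token travel from start to goal while skipping `removed`?"""
    if start == removed or goal == removed:
        return False
    seen, queue = {start}, deque([start])
    while queue:
        card = queue.popleft()
        if card == goal:
            return True
        for nxt in flows.get(card, []):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(flows, start, goal):
    """Cards between start and goal whose removal alone blocks every path."""
    middle = [card for card in flows if card not in (start, goal)]
    return [card for card in middle if not reachable(flows, start, goal, removed=card)]


print(single_points_of_failure(FLOWS, "web", "orders-db"))
```

With these example flows, only the `lb` card is reported, because the two API cards cover for each other and the cache can be bypassed. Real journeys often need several dependencies at once, so treat a result like this as a prompt for discussion at the table, not a verdict.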

2. Safer practice for rare, high‑stakes events

Serious outages are rare but costly, so it’s hard to build experience with them without real pain. A sand table gives you a safe sandbox to practice:

  • Declaring incidents
  • Handing off roles
  • Making decisions under uncertainty
  • Communicating with stakeholders

3. Stronger culture of preparedness and learning

Regular sessions turn incident readiness into a habit, not a one‑off initiative. Teams start to:

  • Talk more openly about failure
  • Normalize post‑incident learning
  • See reliability as a shared responsibility, not just “SRE’s job”

4. Accessible and inexpensive

You don’t need a big budget or sophisticated training platform.

Basic kit:

  • Paper, index cards, sticky notes
  • Markers, tape, string
  • Any flat surface

This makes it attainable for organizations of any size, from startups to large enterprises.


Getting Started: A Simple Recipe

You can run a first sand table session in half a day.

  1. Pick one critical user journey
    Example: “User signs in and completes a purchase.”

  2. Map just enough of the system
    Include the main services, data stores, and external dependencies for that journey.

  3. Invite 5–10 people
    Cross‑functional if possible, including at least one person who knows the architecture well.

  4. Define a focused scenario
    E.g., “Payment provider is degraded for 30% of transactions.”

  5. Run through 30–45 minutes of simulation
    Pause periodically to clarify what would really happen.

  6. Spend as long on debrief as on the exercise
    Capture changes to runbooks, alerts, and processes.

Then, schedule the next session. Each iteration will refine both your sand table and your response capability.


Conclusion

Modern incidents are rarely purely technical. They’re the intersection of infrastructure, software, people, and communication under pressure.

The analog incident train station sand table gives you a way to see and rehearse that whole system—not just the logs and metrics. By turning your architecture into a tactile, moveable landscape, it enables teams to:

  • Visualize complex dependencies
  • Practice realistic outage scenarios
  • Spot information bottlenecks and coordination gaps
  • Iteratively improve both runbooks and human processes

All with paper, tape, and a few hours of focused attention.

If you care about resilience, don’t wait for the next real outage to discover how your system and your organization behave under stress. Build a sand table, gather your team, and start rehearsing today.
