The Analog Incident Train Station Sand Table: Rehearsing Outages With a Tactile, Moveable Paper Landscape

Introduction

Digital systems fail in stubbornly analog ways.

Alerts fire. People scramble. Communication channels clog. Someone pings the wrong team. A critical dependency you’d forgotten about suddenly becomes the linchpin of the entire outage. None of this chaos is visible on a dashboard.

That’s where the analog incident train station sand table comes in.

Borrowing from military sand tables and model train layouts, it is a physical, tactile model of your infrastructure and workflows, usually built from paper, sticky notes, string, and movable pieces. You use it to rehearse outages in slow motion, walking through realistic failures step by step with everyone around the same table.

It’s low‑tech, cheap, and surprisingly powerful.

In this post, we’ll explore what an incident sand table is, how it works, why it’s so effective, and how you can build and use one to improve your organization’s incident response.

What Is an Analog Incident Train Station Sand Table?

Think of a train station control room, with a big map of tracks, switches, and trains. Now swap the trains for:

Services and microservices
Databases and queues
External vendors and APIs
User segments and clients
Teams, roles, and communication channels

Then build that world out of paper, index cards, tape, string, and little movable tokens.

That’s your incident sand table: a physical, moveable landscape of your system.

Key characteristics:

Tactile and physical: People stand (or sit) around it, move pieces, draw connections, and literally point at things.
Low‑tech: No special software required; paper and markers are enough.
Scenario‑driven: You use it to play through outages, like an incident response tabletop exercise.
Collaborative: Engineers, SREs, support, product, and leadership share the same picture.

Instead of staring at dashboards and diagrams during an exercise, participants inhabit the system: they move parts around, simulate failures, and act out how they would respond.

Why Not Just Use Dashboards and Diagrams?

You already have architecture diagrams, runbooks, and dashboards. Why bother with scissors and tape?

Because those tools are:

Abstract: Diagrams are static and often outdated; dashboards show metrics, not relationships.
Individual: Each person sees their own screen; shared understanding is implicit, not explicit.
Used under time pressure: During real incidents there’s no time to slow down and examine how the system really behaves.

The analog sand table adds what digital tools often miss:

1. Embodied understanding

Moving a "database" card and seeing that it’s connected by strings to six different services makes dependency sprawl visceral, not theoretical.

2. Shared mental model

There is one model in the middle of the room. Everyone is literally “on the same page” and can challenge or clarify assumptions on the spot.

3. Space for reflection

You run scenarios in slow motion. You can stop, rewind, and ask, “What would really happen here?” That’s hard to do when pager alerts are blaring.

How the Sand Table Works in Practice

You can think of a sand table session as a live‑action tabletop exercise.

Step 1: Build the landscape

You start by mapping:

Core components: Services, data stores, queues, caches, external APIs
User entry points: Web, mobile, partners, internal tools
Key dependencies: Networks, DNS, identity providers, cloud regions
Teams and roles: On‑call SRE, incident commander, customer support, comms, product

Concrete materials work well:

Index cards or sticky notes for components and teams
Colored string or tape for connections and data flows
Tokens or small objects for customers or requests
Different colors to indicate criticality or ownership

The purpose is not perfect fidelity but a useful, manipulable approximation of your system.
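
If you want a record of the table that survives after the paper is packed away, the cards and strings can be written down as a small graph. The following is only a sketch under assumptions: the `SandTable` class, its method names, and the card names (`web`, `checkout`, `orders-db`) are hypothetical and chosen for illustration. It does not replace the physical model.

```python
from dataclasses import dataclass, field


@dataclass
class SandTable:
    """A plain record of the cards (nodes) and strings (edges) on the table."""

    cards: dict = field(default_factory=dict)    # card name -> kind of card
    strings: set = field(default_factory=set)    # (source, target) pairs

    def add_card(self, name: str, kind: str) -> None:
        self.cards[name] = kind

    def connect(self, source: str, target: str) -> None:
        # Refuse a string to a card that is not on the table yet.
        for name in (source, target):
            if name not in self.cards:
                raise KeyError(f"no card named {name!r}")
        self.strings.add((source, target))

    def dependents_of(self, name: str) -> list:
        """Cards with a string running to `name`, sorted for stable output."""
        return sorted(src for src, dst in self.strings if dst == name)


# Hypothetical cards for a "sign in and purchase" journey.
table = SandTable()
table.add_card("web", "entry point")
table.add_card("checkout", "service")
table.add_card("orders-db", "data store")
table.connect("web", "checkout")
table.connect("checkout", "orders-db")

print(table.dependents_of("orders-db"))
```

Counting the dependents of a card is the digital equivalent of counting the strings tied to it on the table.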

Step 2: Choose an incident scenario

Craft a specific, realistic failure, for example:

Primary database in Region A becomes read‑only
Third‑party payment processor times out intermittently
DNS misconfiguration makes the API unreachable
Internal auth service is degraded and returns 500s for some users

Write the scenario on a card and define:

Starting conditions (time of day, load, active campaigns)
Initial symptoms (alerts, user reports, dashboards)
Known unknowns (what’s ambiguous at the start)
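
A scenario card can also be kept in a structured form so that facilitators notice when one of the three sections is still blank. This is a minimal sketch, assuming a hypothetical `ScenarioCard` class; the example conditions and symptoms are made up for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ScenarioCard:
    """One incident scenario, mirroring the sections written on the card."""

    failure: str
    starting_conditions: list = field(default_factory=list)
    initial_symptoms: list = field(default_factory=list)
    known_unknowns: list = field(default_factory=list)

    def missing_sections(self) -> list:
        """Names of sections that are still empty."""
        sections = {
            "starting conditions": self.starting_conditions,
            "initial symptoms": self.initial_symptoms,
            "known unknowns": self.known_unknowns,
        }
        return [name for name, items in sections.items() if not items]


# Illustrative card; the details below are invented, not prescribed.
card = ScenarioCard(
    failure="Primary database in Region A becomes read-only",
    starting_conditions=["Weekday peak load", "Marketing campaign running"],
    initial_symptoms=["Write-error alert fires", "Users report failed checkouts"],
)

print(card.missing_sections())
```

Here the card reports that the known unknowns have not been written down yet, which is a prompt to fill them in before the session starts.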

Step 3: Play it out step‑by‑step

Then, with the whole group:

Trigger the failure: Move or flip a card to indicate the broken component.
Simulate signals: Place “alert” tokens at the relevant services or dashboards.
Assign roles: Incident commander, comms, primary responder, subject‑matter experts.
Act out the response in rounds of a few minutes each:
- What does each role see right now?
- Who talks to whom, via which channel?
- What action do they take? (e.g., failover, feature flag, rollback, comms)

You physically move tokens and cards to reflect these decisions:

A “customer request” token fails to reach the database
A “message” token travels from support to the incident channel
A “runbook” card is pulled in when someone decides to consult docs

Step 4: Observe information flow and coordination

As you play, you watch for:

Where does information pool or stall?
Who gets overloaded? (Too many lines converging on one role or system)
Which dependencies surprise people?
What assumptions differ between teams?

This is where the sand table shines: the bottlenecks are visible in how crowded certain areas become, or how often a token has to travel back and forth.
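
If a note-taker logs each token move as it happens, the same crowding can be tallied afterwards. The sketch below assumes a hypothetical log format of `(sender, receiver)` pairs and invented role names; it simply counts inbound moves per role and moves per pair of roles.

```python
from collections import Counter

# Each entry is one token move observed at the table: (sender, receiver).
# The roles and the sequence are illustrative only.
moves = [
    ("support", "incident-channel"),
    ("primary-responder", "incident-commander"),
    ("support", "incident-commander"),
    ("incident-commander", "primary-responder"),
    ("comms", "incident-commander"),
    ("primary-responder", "incident-commander"),
]


def inbound_load(moves):
    """Count how many tokens arrive at each role or system."""
    return Counter(receiver for _, receiver in moves)


def pair_traffic(moves):
    """Count moves per unordered pair; high counts suggest back-and-forth."""
    return Counter(frozenset(pair) for pair in moves)


busiest, count = inbound_load(moves).most_common(1)[0]
print(busiest, count)

chattiest, trips = pair_traffic(moves).most_common(1)[0]
print(sorted(chattiest), trips)
```

In this invented log, the incident commander receives four of the six tokens, and the commander and primary responder exchange three of them. Both are candidates for discussion in the debrief.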

Step 5: Debrief and capture improvements

After the scenario, debrief explicitly:

What worked well?
Where were the slowdowns or confusion?
Which runbooks or dashboards were missing or unclear?
Which communication patterns helped or hurt?

Turn these into concrete changes:

Edit or create runbooks
Adjust on‑call rotations or escalation paths
Add or refine alerts and dashboards
Clarify interfaces between teams

Over time, repeated sessions form an improvement loop: each session’s findings feed changes to both technical and human processes, and the next session puts those changes to the test.

Designing an Effective Sand Table: Lessons from Systems Thinking

Good sand tables are informed by good systems design. A few principles help.

1. Model reliability, not just functionality

Don’t only draw “happy path” flows. Make reliability concerns first‑class:

Show replicas, failover paths, and backups
Represent SLOs (e.g., by marking particularly critical paths)
Include operational tools (observability stack, feature flags, CI/CD)

This keeps the conversation anchored in resilience.

2. Make interfaces explicit

Treat every boundary as an interface:

Between microservices
Between your system and external vendors
Between teams (SRE ↔ Product, Support ↔ Engineering)

Label what flows across each interface:

Data types
Contracts (SLAs/SLOs)
Communication channels (Slack, PagerDuty, email)

This reveals where unclear or brittle interfaces will hurt you during incidents.

3. Embrace multiple scales

Use visual cues for levels of abstraction:

High‑level: user journeys and critical business flows
Mid‑level: services and data stores
Low‑level: key components that frequently fail (e.g., caches, message brokers)

You don’t need every detail, but you do need enough fidelity to reason about failure modes and coordination patterns.

4. Invite cross‑functional participation

Incidents are socio‑technical. Pull in:

Engineers and SREs
Support and Customer Success
Product and Marketing (for user impact and comms)
Incident managers or leadership (if relevant)

Each group sees different parts of the system. The sand table makes that diversity of perspective a feature, not a source of misalignment.

Why This Low‑Tech Approach Works So Well

Despite (or because of) its simplicity, an analog sand table delivers real benefits.

1. Better visualization of complexity and dependencies

Seeing services, queues, and user flows laid out on a table—with strings criss‑crossing—makes complexity concrete. People quickly spot:

Hidden single points of failure
Overloaded shared components
Overly complex paths for critical user journeys
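
One way to double-check a suspected single point of failure is to remove each card in turn and ask whether the user journey can still reach its destination. The sketch below makes a strong simplifying assumption: several strings leaving one card are treated as alternative routes, so it does not model components that are all required at once. The function names and the example layout are hypothetical.

```python
from collections import deque


def reachable(edges, start, goal, removed=None):
    """Breadth-first search from start to goal, skipping the removed card."""
    if removed in (start, goal):
        return False
    seen, queue = {start}, deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in edges.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(edges, start, goal):
    """Cards whose removal alone cuts every route from start to goal."""
    cards = set(edges) | {n for targets in edges.values() for n in targets}
    return sorted(
        card
        for card in cards - {start, goal}
        if not reachable(edges, start, goal, removed=card)
    )


# Illustrative layout: two payment routes, but only one checkout service.
edges = {
    "web": ["checkout"],
    "checkout": ["payments-primary", "payments-backup"],
    "payments-primary": ["orders-db"],
    "payments-backup": ["orders-db"],
}

print(single_points_of_failure(edges, "web", "orders-db"))
```

In this layout only `checkout` is flagged, because either payment route can fail without cutting the journey.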

2. Safer practice for rare, high‑stakes events

Serious outages are rare but impactful. It’s hard to gain experience without real pain. A sand table gives you a safe sandbox to practice:

Declaring incidents
Handing off roles
Making decisions under uncertainty
Communicating with stakeholders

3. Stronger culture of preparedness and learning

Regular sessions turn incident readiness into a habit, not a one‑off initiative. Teams start to:

Talk more openly about failure
Normalize post‑incident learning
See reliability as a shared responsibility, not just “SRE’s job”

4. Accessible and inexpensive

You don’t need a big budget or sophisticated training platform.

Basic kit:

Paper, index cards, sticky notes
Markers, tape, string
Any flat surface

This makes it attainable for organizations of any size, from startups to large enterprises.

Getting Started: A Simple Recipe

You can run a first sand table session in half a day.

1. Pick one critical user journey. Example: “User signs in and completes a purchase.”
2. Map just enough of the system. Include the main services, data stores, and external dependencies for that journey.
3. Invite 5–10 people. Cross‑functional if possible, including at least one person who knows the architecture well.
4. Define a focused scenario. E.g., “Payment provider is degraded for 30% of transactions.”
5. Run through 30–45 minutes of simulation. Pause periodically to clarify what would really happen.
6. Spend as long on debrief as on the exercise. Capture changes to runbooks, alerts, and processes.

Then, schedule the next session. Each iteration will refine both your sand table and your response capability.

Conclusion

Modern incidents are never purely technical. They’re the intersection of infrastructure, software, people, and communication under pressure.

The analog incident train station sand table gives you a way to see and rehearse that whole system—not just the logs and metrics. By turning your architecture into a tactile, moveable landscape, it enables teams to:

Visualize complex dependencies
Practice realistic outage scenarios
Spot information bottlenecks and coordination gaps
Iteratively improve both runbooks and human processes

All with paper, tape, and a few hours of focused attention.

If you care about resilience, don’t wait for the next real outage to discover how your system and your organization behave under stress. Build a sand table, gather your team, and start rehearsing today.