The Analog Incident Sandtable: Replaying Production Disasters With Paper and Pens

Digital systems, white‑hot production incidents, complex microservices… and you’re solving them with paper buildings and hand‑drawn arrows on a table.

That’s the core idea behind the Analog Incident Sandtable: a deliberately low‑tech, physical way to replay real outages, explore dependencies, and practice coordinated incident response in a safe, repeatable environment.

No Kubernetes cluster. No observability dashboards. Just markers, paper, tape, and people.

In this post, we’ll walk through what an Analog Incident Sandtable is, how it works, and why this deceptively simple technique can help with downtime prevention, resilience, and your incident‑readiness culture.

What Is an Analog Incident Sandtable?

The Analog Incident Sandtable is a tabletop simulator for your production environment.

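The name borrows from military sand tables, the terrain models used to plan and rehearse operations. Instead of a combat map with units and terrain, you build: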

Paper buildings to represent key systems and services (e.g., “Payments API,” “User Service,” “DB Cluster,” “Third‑Party Auth”).
Hand‑drawn traffic flows using markers or string to show how data moves between components.
Physical artifacts like sticky notes to represent alerts, user reports, logs, or status page updates.

You then replay a real incident (or a realistic scenario) step‑by‑step:

What failed first?
What did we notice?
Who got paged?
What actions were taken, and in what order?
How did those actions change the system state and the blast radius?

The goal isn’t to recreate every technical detail. Instead, you’re building a shared mental model of cause‑and‑effect and coordination dynamics during an outage.

Why Go Low‑Tech in a High‑Tech World?

On the surface, a paper‑and‑marker exercise seems primitive compared to sophisticated chaos engineering tools or full staging environments. But the analog format has some powerful advantages:

1. Radical Accessibility

Everyone can participate: engineers, support, product managers, SREs, on‑call leads, even executives.

You don’t need access to prod or specialist tooling. You only need:

A table or whiteboard
Paper/cardstock
Markers and sticky notes

This lowers the barrier to entry and makes the discussion about the system and the people, not the tools.

2. Slowed‑Down, Shared Understanding

Reality is messy and fast. In a live incident, you don’t have time to pause and ask:

“Wait, which service talks to this backend, and what happens if this queue fills up?”

The sandtable gives you permission to slow down. You can:

Pause at any moment
Ask naive questions
Move components around
Redraw flows to represent changing conditions

This turns elusive mental models into something visible and negotiable.

3. Safe Exploration of Failure

In production, experimenting during a live incident is risky.

On the sandtable, you can safely explore:

“What if this service failed first instead?”
“What if this alert fired 10 minutes earlier?”
“What if we had a rate limit here?”

You’re effectively running counterfactual simulations — without risking user impact.
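If you want to carry a "what if this failed first?" question beyond the table, a small model can help. The sketch below is only an illustration, not part of the sandtable method itself: the component names and the `DEPENDS_ON` map are hypothetical, and it assumes a failure reaches every component that depends, directly or transitively, on the failed one.

```python
from collections import deque

# Hypothetical dependency map: each component lists what it depends on.
DEPENDS_ON = {
    "Web": ["API Gateway"],
    "API Gateway": ["Auth Service", "Payments API"],
    "Auth Service": ["User DB"],
    "Payments API": ["Orders DB", "Payment Processor"],
    "User DB": [],
    "Orders DB": [],
    "Payment Processor": [],
}

def blast_radius(failed, depends_on=DEPENDS_ON):
    """Return every component that transitively depends on `failed`."""
    # Invert the map so we can walk from a dependency to its dependents.
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents[dep].append(name)

    impacted, queue = set(), deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents[current]:
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted

if __name__ == "__main__":
    # Compare two counterfactual "first failures" side by side.
    for first_failure in ("User DB", "Payment Processor"):
        print(first_failure, "->", sorted(blast_radius(first_failure)))
```

Real systems rarely fail this cleanly (caches, retries, and fallbacks all soften the spread), so treat the result as a conversation starter rather than a prediction.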

Setting Up an Analog Incident Sandtable

You don’t need a big budget or a detailed playbook. Start simple.

Step 1: Choose a Real Incident (or a High‑Value Scenario)

Pick an incident that:

Had real user impact, or
Revealed confusing dependencies, or
Exposed coordination challenges between teams.

Alternatively, design a realistic scenario around a known risk, like:

Losing a core database
A failed rollout in a key service
A third‑party provider outage

Step 2: Map the System as Paper Buildings

On index cards or folded paper, write the names of:

Core services (e.g., API Gateway, Auth Service, Payments, Search)
Data stores (User DB, Orders DB, Redis Cache)
External dependencies (Payment Processor, Email Provider)

Arrange them on the table to roughly reflect your architecture.

Step 3: Draw Traffic Flows

Use markers, string, or arrows to represent:

Request flows (web → API → services → DB)
Async flows (queues, workers, event buses)
Critical dependency paths (e.g., what must be healthy for “checkout” to work)

Don’t over‑optimize for accuracy on the first pass. The important part is visibility and shared agreement.
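Once the cards and arrows are on the table, some teams may want to record them so the map survives the session. One possible way, sketched here with hypothetical names (`FLOWS`, `must_be_healthy`, and the "Email Queue" card are all invented for illustration), is a list of arrows tagged as request or async. The sketch assumes that async arrows do not block the caller, which may not hold in your system.

```python
# Hypothetical cards and arrows from the table: (source, target, kind).
FLOWS = [
    ("Web", "API Gateway", "request"),
    ("API Gateway", "Auth Service", "request"),
    ("API Gateway", "Payments", "request"),
    ("Auth Service", "User DB", "request"),
    ("Payments", "Orders DB", "request"),
    ("Payments", "Payment Processor", "request"),
    ("Payments", "Email Queue", "async"),
    ("Email Queue", "Email Provider", "async"),
]

def must_be_healthy(entry_point, flows=FLOWS):
    """Components reachable from `entry_point` via synchronous request arrows."""
    required, stack = set(), [entry_point]
    while stack:
        current = stack.pop()
        if current in required:
            continue
        required.add(current)
        for source, target, kind in flows:
            # Assumption: async arrows (queues, events) do not block the caller.
            if source == current and kind == "request":
                stack.append(target)
    return required

if __name__ == "__main__":
    # Critical dependency path for a payment request.
    print(sorted(must_be_healthy("Payments")))
```

Calling `must_be_healthy("Web")` lists everything on the synchronous path from the front end down, which is one way to answer "what must be healthy for checkout to work?"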

Step 4: Introduce Time and Signals

Recreate the incident as a timeline:

Minute 0: Something fails (mark it physically on the table)
Minute X: First alert fires (place a sticky note next to whoever is on call)
Minute Y: Customers start reporting errors (another note)
Minute Z: A mitigation or change is applied (move or mark components)

As you move through time, adjust the table:

Darken or cross out failed components
Draw new arrows for degraded or rerouted traffic
Add notes for mitigations, rollbacks, config changes

This is where the tabletop simulator truly comes to life.
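If you want a written record of the timeline alongside the table, a minimal sketch might look like the following. Everything in it is hypothetical: the `Event` fields, the sample timeline, and the simplifying assumption that a mitigation fully restores the component it targets.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    minute: int
    kind: str      # "failure", "alert", "report", or "mitigation"
    target: str    # the card on the table this event touches
    note: str

# Hypothetical timeline; a real session would take it from the incident record.
TIMELINE = [
    Event(12, "report", "Checkout", "Customers report payment errors"),
    Event(0, "failure", "Orders DB", "Primary stops accepting writes"),
    Event(7, "alert", "Payments", "Error-rate alert pages on-call"),
    Event(25, "mitigation", "Orders DB", "Failover to replica"),
]

def replay(events):
    """Yield (event, failed_components) in chronological order."""
    failed = set()
    for event in sorted(events, key=lambda e: e.minute):
        if event.kind == "failure":
            failed.add(event.target)
        elif event.kind == "mitigation":
            # Assumption: the mitigation fully restores its target.
            failed.discard(event.target)
        yield event, frozenset(failed)

def time_to_detect(events):
    """Minutes between the first failure and the first alert, or None."""
    failures = [e.minute for e in events if e.kind == "failure"]
    alerts = [e.minute for e in events if e.kind == "alert"]
    if not failures or not alerts:
        return None
    return min(alerts) - min(failures)

if __name__ == "__main__":
    for event, failed in replay(TIMELINE):
        print(f"t+{event.minute:>2} min  {event.kind:<10} {event.target}: {event.note}"
              f"  | down: {sorted(failed)}")
    print("time to detect:", time_to_detect(TIMELINE), "min")
```

Stepping through `replay` one event at a time mirrors what the facilitator does by hand: advance the clock, update the table, and pause for questions.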

What You Learn: Dependencies, Decisions, Dynamics

Running sandtable sessions isn’t just about replaying the past; it’s about probing how your system and your organization respond to failure.

Refining Service and System Dependencies

As you walk through the incident, you’ll uncover:

Hidden dependencies: “We didn’t realize the notifications service depends on this database.”
Implicit assumptions: “We thought this queue drained to a different worker group.”
Fragile coupling: “This minor feature actually blocks checkout if it fails.”

You can then feed those findings back into how you document and model dependencies:

More accurate runbooks and diagrams
Better internal documentation
Clearer ownership boundaries between teams
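One lightweight way to turn those discoveries into documentation work is to compare the dependencies you had written down with the ones the group agreed on at the table. The sketch below assumes both can be expressed as sets of (service, dependency) pairs. The edges shown are made up for illustration, and `diff_dependencies` is a hypothetical helper.

```python
# Hypothetical edges as (service, dependency) pairs taken from existing docs.
DOCUMENTED = {
    ("Checkout", "Payments"),
    ("Payments", "Orders DB"),
    ("Notifications", "Email Provider"),
}

# Edges the group agreed on at the table during the replay.
OBSERVED = {
    ("Checkout", "Payments"),
    ("Payments", "Orders DB"),
    ("Notifications", "Email Provider"),
    ("Notifications", "Orders DB"),     # hidden dependency surfaced in the session
    ("Checkout", "Recommendations"),    # minor feature that blocks checkout
}

def diff_dependencies(documented, observed):
    """Split edges into those missing from the docs and those only in the docs."""
    return {
        "undocumented": sorted(observed - documented),
        "unconfirmed": sorted(documented - observed),
    }

if __name__ == "__main__":
    for label, edges in diff_dependencies(DOCUMENTED, OBSERVED).items():
        print(label)
        for service, dependency in edges:
            print(f"  {service} -> {dependency}")
```

An "unconfirmed" edge is not necessarily wrong. It may simply not have come up in the incident you replayed, so treat that list as items to double-check rather than delete.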

Understanding Coordination Dynamics

The sandtable reveals how people coordinate under stress:

Who spoke to whom, and when?
How were decisions made and communicated?
Where did confusion, handoffs, or delays occur?

You’ll often see patterns like:

Two teams debugging the same subsystem in parallel
Nobody clearly owning a critical dependency
Misaligned mental models of “What’s actually broken?”

Those observations feed directly into improvements for incident command, communication channels, and escalation paths.

From Insights to Concrete Mitigations

A good Analog Incident Sandtable session doesn’t end with “That was interesting.” It ends with actionable changes.

Teams use the outcomes to propose and prioritize mitigations such as:

Requirements changes
- Adjusting SLAs, timeouts, or availability targets based on realistic failure modes.
Design and architecture updates
- Introducing circuit breakers, retries with backoff, bulkheads, graceful degradation, or alternative flows when dependencies fail.
Better detection and observability logic
- New or tuned alerts
- Improved dashboards focused on user impact, not just resource metrics
- Health checks that reflect true functionality, not just process uptime
Maintenance and operational improvements
- Better change management for high‑risk components
- Regular failover tests or DR exercises
- Improved runbooks and standard operating procedures

Each mitigation is grounded in the specific failures and delays you saw on the table.
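To make one of these mitigations concrete, here is a minimal circuit breaker sketch. It is a simplified illustration and not a production implementation: the class name, thresholds, and injectable `clock` parameter are assumptions chosen for readability, and it is not thread-safe.

```python
import time

class CircuitOpenError(RuntimeError):
    """Raised when the breaker refuses to call the dependency."""

class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after   # seconds to wait before a trial call
        self.clock = clock               # injectable so the sketch is testable
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                # Fail fast instead of piling load onto a struggling dependency.
                raise CircuitOpenError("dependency call skipped")
            # Cool-down elapsed: half-open, let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would typically catch `CircuitOpenError` and fall back to a degraded path, for example showing cached data or hiding an optional feature. This is the kind of alternative flow a sandtable session often suggests.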

Building Resilience and an Incident‑Readiness Culture

The Analog Incident Sandtable is more than a one‑off exercise. Over time, repeated sessions directly support broader organizational goals:

1. Downtime Prevention and Faster Recovery

By rehearsing real incidents, teams:

Spot weak links before they cause another outage
Become familiar with alternative mitigation paths
Learn which signals to trust first when everything is on fire

This leads to faster, more coordinated responses when the next real incident hits.

2. Normalizing Failure as a Learning Opportunity

Instead of treating incidents as shameful anomalies, sandtable sessions frame them as valuable practice material.

Teams learn to say:

“We had an outage. Let’s put it on the sandtable, learn from it, and get better.”

This reduces blame and encourages honest reflection.

3. Revealing Training and Knowledge Gaps

The exercise naturally surfaces where training is missing:

People who don’t know how a core system works
Confusion about alert meanings or log interpretation
Unclear handoffs between product, support, and engineering

You can then design targeted training or onboarding materials to close those gaps before the next real incident.

Practical Tips for Running Effective Sessions

To get the most out of Analog Incident Sandtables:

Keep the group small but cross‑functional: 5–10 people, including at least one person who lived the incident and a few who did not.
Timebox it: 60–90 minutes is enough to explore one incident deeply.
Assign a facilitator: Someone to keep time, navigate the timeline, and encourage quieter voices.
Capture learnings as you go: Use a separate board or document for “Findings” and “Potential Mitigations.”
End with prioritization: Ask, “What 2–3 changes would most reduce the impact of a similar incident?” and log owners and next steps.

Conclusion: Paper as an Engine of Resilience

The Analog Incident Sandtable looks simple: just paper buildings and hand‑drawn arrows on a table. But behind that simplicity is a powerful mechanism for:

Making complex systems understandable
Practicing incident response without risk
Refining your model of dependencies and failure modes
Building faster, more coordinated reactions to real outages
Strengthening your culture of resilience and readiness

You don’t need new tooling to start. Pick a past incident, gather your team, clear a table, and sketch your system.

Then press play on the outage — this time, in a space where you can pause, rewind, and redesign how your systems and your teams respond when it matters most.