
The Analog Outage Story Sandtable: Sculpting Paper Terrain to Rehearse How Failures Flow Through Your System

How to use low‑tech “sandtable” outage rehearsals and paper terrain models to understand cascading failures, map dependencies, and strengthen your incident response plan before a real crisis hits.

Introduction

Military planners have used sandtables for centuries: physical models of terrain that commanders use to walk through missions, explore what‑ifs, and anticipate how reality might go off the rails. In the digital age, we’ve mostly traded sand and string for dashboards and diagrams.

But when it comes to complex outages and cascading failures, a bit of analog thinking is exactly what most organizations are missing.

In this post, we’ll explore the “Analog Outage Story Sandtable”: a low‑tech, paper‑and‑markers way to rehearse how incidents unfold across your systems. You’ll learn how to sculpt “paper terrain” that represents your infrastructure and supply chain, then use it to run tabletop exercises that reveal weak spots, dependency risks, and missing parts of your incident response plan—before a real failure does it for you.


Why Analog Still Matters in a Digital Failure

When a real incident hits, you’re juggling:

  • Partial and lagging telemetry
  • Conflicting dashboards
  • Frantic messages in multiple tools
  • Stakeholders begging for updates

Simulation tools, chaos experiments, and digital twins are powerful, but they’re not always easy to set up or safe to run for every system. An analog sandtable exercise strips things down:

  • No code deployments
  • No risk to production
  • No tooling requirements beyond pens, sticky notes, and a table

You’re rehearsing the story of how failures move through your ecosystem, not the precise CPU metrics at millisecond resolution. That story is what aligns engineering, operations, security, and business stakeholders around what really matters when things go wrong.

Think of it as:

A safe, collaborative rehearsal of your worst day—told with paper, markers, and brutal honesty.


Step 1: Sculpt the Paper Terrain – Map What Actually Exists

Your sandtable starts with a physical map of your system landscape. This is not a neat architecture diagram built for a slide deck; it’s a working model of reality.

What to put on the table

Use index cards, sticky notes, or printed cutouts. Each card should represent a discrete element, such as:

  • Core services: authentication, billing, search, API gateway, etc.
  • Data stores: databases, caches, object storage, analytics warehouses
  • Infrastructure: load balancers, message queues, DNS, CDNs, VPNs
  • Applications: customer‑facing apps, internal tools, mobile clients
  • Security/compliance layers: WAF, IAM, SIEM, monitoring and alerting
  • Upstream dependencies: payment processors, email providers, cloud regions, identity providers
  • Downstream consumers: key integrations, partners, major internal teams relying on the system

Lay them out in rough “zones” (e.g., Edge, Core Services, Data, Third Parties, Business Functions). Draw arrows for dependencies.

This is your paper terrain: an imperfect but tangible model of your real environment.
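
You can stay fully analog here, but some teams like to keep a digital copy of the paper map for later sessions. A few lines of Python are enough; every zone and card name below is a made-up placeholder for whatever actually lands on your table.

```python
# A minimal digital copy of the paper terrain: zones mapped to the cards
# inside them. All names are hypothetical stand-ins for your own cards.
terrain = {
    "Edge": ["cdn", "waf", "api-gateway"],
    "Core Services": ["auth", "billing", "search"],
    "Data": ["orders-db", "cache", "object-storage"],
    "Third Parties": ["payment-processor", "email-provider", "idp"],
    "Business Functions": ["checkout", "monthly-close", "customer-support"],
}

# Print the terrain zone by zone, the same way you'd read the table.
for zone, cards in terrain.items():
    print(f"{zone}: {', '.join(cards)}")
```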

Aim for interaction, not perfection

The goal is not a flawless CMDB in paper form. It’s a shared mental model:

  • Engineers see where their service sits in the bigger picture.
  • Business stakeholders see how technical failures impact customer‑facing capabilities.
  • Security sees where critical controls actually live.

If people argue over where a service belongs or what it depends on—good. That argument is the work.


Step 2: Map Dependencies Like Your Outage Depends on It (Because It Does)

Failures rarely stay put. A “small” issue becomes a major incident because no one fully understood how components interact under stress.

Your sandtable exercise should force explicit dependency mapping:

  1. Draw directional arrows between components that rely on each other.
  2. Label arrows with the type of dependency:
    • Data (e.g., “writes orders to DB1”)
    • Identity (e.g., “needs SSO from IdP-X”)
    • Network (e.g., “tunnels over VPN-Y”)
    • Business (e.g., “relied on for monthly close process”)
  3. Note criticality directly on the card: High / Medium / Low, or use colored dots.

Encourage teams to be honest about:

  • Hidden dependencies: “Oh, that cron job actually hits the old API too.”
  • Couplings via shared resources: “These services are ‘independent’ but share the same DB cluster.”
  • Operational dependencies: “If this dashboard fails, we’re flying blind.”

You’re not just documenting relationships; you’re revealing where a failure in one card makes three other cards silently unusable.
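If you want to sanity-check the arrows after the session, the dependency map fits in a small script. Here's a rough sketch (component names invented for illustration) that inverts the "depends on" arrows and walks them outward to find everything a single failure silently takes out:

```python
from collections import defaultdict

# "X depends on Y" lives in depends_on[X]. To ask "what breaks if Y fails?"
# we invert the arrows and walk them. All names here are hypothetical.
depends_on = {
    "checkout": {"billing", "auth"},
    "billing": {"orders-db", "payment-processor"},
    "auth": {"idp"},
    "search": {"cache"},
    "reporting-cron": {"orders-db", "legacy-api"},  # the hidden dependency
}

# Invert the edges: dependents[Y] = every card that relies on Y.
dependents = defaultdict(set)
for component, deps in depends_on.items():
    for dep in deps:
        dependents[dep].add(component)

def blast_radius(failed):
    """Everything transitively unusable once `failed` goes down."""
    affected, stack = set(), [failed]
    while stack:
        for dependent in dependents[stack.pop()]:
            if dependent not in affected:
                affected.add(dependent)
                stack.append(dependent)
    return affected

# Flip one card, see which others go dark (set order may vary):
print(blast_radius("orders-db"))  # {'billing', 'checkout', 'reporting-cron'}
```

It's the paper exercise in code form: flip one card, then watch three others go dark.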


Step 3: Prioritize Critical Systems and Key Organizational Activities

Not all systems deserve equal attention during outage rehearsals. Start with those that:

  • Directly affect revenue or core mission delivery
  • Handle sensitive data or regulated functions
  • Are operational linchpins (e.g., authentication, messaging, logging)

Ask:

  • “If this card disappears, what is the first business activity that breaks?”
  • “What’s the time to serious damage (financial, legal, reputational)?”

Mark systems that support:

  • Customer sign‑up, login, and checkout
  • Regulatory reporting
  • Production operations (e.g., manufacturing control systems)
  • Critical internal workflows (e.g., incident management platform itself)

Your initial sandtable sessions should zoom in on these high‑impact components. Once your team gets good at rehearsing failures here, you can expand to broader scenarios.
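
If you've copied the criticality dots into a digital list, a tiny sort produces a shortlist for those first sessions. The cards and tags below are hypothetical examples:

```python
# Hypothetical criticality dots from the cards, plus the first business
# activity that breaks: a quick way to shortlist your first sessions.
cards = [
    {"name": "auth",    "criticality": "high",   "first_break": "login"},
    {"name": "billing", "criticality": "high",   "first_break": "checkout"},
    {"name": "search",  "criticality": "medium", "first_break": "product discovery"},
    {"name": "wiki",    "criticality": "low",    "first_break": "internal docs"},
]

# Sort high-criticality cards to the top of the rehearsal queue.
rank = {"high": 0, "medium": 1, "low": 2}
for card in sorted(cards, key=lambda c: rank[c["criticality"]]):
    print(f"{card['name']:<8} {card['criticality']:<7} first break: {card['first_break']}")
```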


Step 4: Pull the Supply Chain Onto the Table

Most modern outages aren’t purely “internal.” They flow through:

  • Cloud providers and specific regions
  • Managed databases and SaaS tools
  • Third‑party APIs (payments, comms, identity, analytics)
  • Open source libraries and pre‑built components

To rehearse realistically, integrate supply chain risk analysis into your terrain:

  1. Create cards for major third‑party systems you rely on.
  2. For each, add:
    • What internal services depend on it
    • What business processes are affected if it fails
    • Existing SLAs/SLOs (and how long you could realistically be without it)
  3. Consider upstream of upstream where relevant: cloud region, network provider, DNS authority.

Then, during your exercise, have scenarios where:

  • A major SaaS provider suffers an outage
  • A cloud region goes down
  • A key security vendor is compromised

Walk through the cascade, step by step. You’ll quickly see whether you’ve:

  • Concentrated risk on a single provider
  • Neglected failover strategies
  • Underestimated the blast radius of external failures
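
The first point lends itself to a quick back-of-the-envelope check, on paper (count the cards behind each vendor) or in a few lines of code. This sketch uses invented vendor and service names:

```python
# Hypothetical third-party cards mapped to the internal services behind them.
third_parties = {
    "payment-processor": {"billing", "refunds"},
    "email-provider": {"signup", "alerting", "marketing"},
    "cloud-region-a": {"auth", "billing", "search", "orders-db"},
}

# A crude concentration check: the more internal services sitting behind one
# external card, the bigger the blast radius when that provider has a bad day.
for vendor, services in sorted(third_parties.items(), key=lambda kv: -len(kv[1])):
    print(f"{vendor}: {len(services)} internal services exposed -> {sorted(services)}")
```
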

Step 5: Run the Outage Story as a Tabletop Exercise

Now you have terrain. Time for the story.

Set the scenario

Pick a realistic spark:

  • “Primary database cluster in Region A becomes unreachable.”
  • “Our identity provider is compromised; we must revoke trust immediately.”
  • “Third‑party payment processor is partially down and intermittently failing.”

Appoint a facilitator to reveal information gradually and keep the group focused.

Play through, minute by minute

For each step in the scenario, ask:

  • Which cards are directly affected right now?
  • Which dependencies fail as a result?
  • What do users experience?
  • What do internal teams see or not see?
  • How would we even know this is happening?

Move or flip cards to indicate:

  • ❌ Fully down
  • ⚠️ Degraded
  • ❓ Unknown/uncertain state

Document:

  • Detection: How we notice the problem
  • Diagnosis: How we narrow down the cause
  • Communication: Who we inform, how, and when
  • Coordination: Which teams lead and which support
  • Decision points: When to fail over, degrade features, or declare an incident

You’re rehearsing both the technical flow of failure and the human response.
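
If your facilitator wants a script to read from, the injects can live in a simple timeline, with card states mirroring the ❌/⚠️/❓ markers on the table. Everything named below is invented for illustration:

```python
from enum import Enum

class State(Enum):
    DOWN = "❌ fully down"
    DEGRADED = "⚠️ degraded"
    UNKNOWN = "❓ unknown"

# One row per facilitator inject: minutes into the scenario, the card being
# flipped, its new state, and what responders would plausibly observe.
timeline = [
    (0, "orders-db",  State.DOWN,     "primary cluster unreachable in Region A"),
    (4, "billing",    State.DEGRADED, "checkout errors spike; retries mask the cause"),
    (9, "dashboards", State.UNKNOWN,  "metrics pipeline lags; graphs still look fine"),
]

# Read the scenario out minute by minute, flipping cards as you go.
for minute, card, state, observed in timeline:
    print(f"t+{minute:>2}m  {card:<10}  {state.value:<13}  {observed}")
```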


Step 6: Use the Exercise to Validate and Improve Your Incident Response Plan

Every sandtable session is a test of your incident response plan in disguise. As you walk through the scenario, keep a copy of your plan visible.

Ask repeatedly:

  • Does the plan match what we’re actually doing?
  • Are roles and responsibilities clear at each stage?
  • Are there documented runbooks for the actions we’re taking?
  • Do we have communication templates for customers, executives, regulators?

Capture gaps as you go:

  • Missing contact details for a key third‑party
  • Ambiguous ownership of a high‑impact service
  • No clear criteria for when to declare a “major” incident
  • Lack of playbooks for partial third‑party failure

Treat these as actionable backlog items. The value of the sandtable isn’t the story you tell today—it’s the improvements it prompts before tomorrow’s real incident.
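
One lightweight way to make sure those gaps actually reach a backlog is to capture them in a fixed shape during the session. A possible sketch, with invented examples:

```python
from dataclasses import dataclass

@dataclass
class Gap:
    """One gap surfaced at the sandtable, captured as a backlog item."""
    description: str
    owner: str
    scenario: str
    severity: str  # e.g. "high" if it would block a real response

# Invented examples of the kinds of gaps a session tends to surface.
backlog = [
    Gap("No escalation contact for the payment processor", "ops-lead",
        "third-party partial outage", "high"),
    Gap("Ambiguous ownership of the legacy reporting job", "platform-team",
        "primary database failure", "medium"),
]

for gap in backlog:
    print(f"[{gap.severity}] {gap.description} -> owner: {gap.owner}")
```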


Step 7: Consider Complementary Immersive and Digital Tools

Analog sandtables are powerful for alignment and shared understanding. They’re not a replacement for technical simulations—they’re a complement.

Once you’ve used paper to:

  • Identify fragile dependencies
  • Clarify roles and responsibilities
  • Define plausible outage scenarios

…you can graduate some of those stories into richer simulations:

  • VR or 3D visualizations of infrastructure for training incident commanders
  • Simulated control rooms where teams practice working under time pressure
  • Chaos engineering experiments that safely test assumptions you surfaced
  • Automated tabletop tools that walk teams through digital scenario scripts

The sandtable is the low‑cost, low‑risk rehearsal stage. More immersive tools are where you stress‑test muscle memory and tooling at scale, once you know what really matters.


Conclusion: Practice the Story Before the Crisis Writes It for You

Complex systems fail in complex ways, and those failures find your organization's blind spots. Waiting for a real outage to reveal those blind spots is an expensive way to learn.

The Analog Outage Story Sandtable offers a pragmatic alternative:

  • Build a tangible map of your systems and supply chain.
  • Make dependencies—and their risks—explicit.
  • Focus on critical systems that support core business activities.
  • Rehearse realistic failure stories with the actual people who will respond.
  • Use what you learn to refine your incident response plan and inform more advanced simulations.

You don’t need a lab or a massive budget. You need paper, a table, time, and the willingness to walk through your worst day before it happens.

If you haven’t yet, pick one critical system, gather the right people, and build your first paper terrain. The story you tell around that table might be the reason your next real outage is survivable instead of catastrophic.
