The Analog Incident Story Trainyard Shadowbox: Safely Reenacting Real Outages with Paper Models

How to build a miniature, paper-based replica of your production system—an "incident story trainyard"—to safely replay real outages, explore resilience gaps, and involve the whole organization without touching live infrastructure.

Digital systems fail in very physical ways: cascading outages, blocked paths, dead ends, and strange side routes that only appear when you least expect them. Yet most of our incident analysis tools are abstract dashboards, dense logs, and intimidating diagrams.

What if, instead, you could spread your system out on a table as a miniature paper trainyard—complete with tracks (dependencies), trains (requests), and switches (routing rules)—and literally move incidents around with your hands?

That’s the idea behind an analog incident story trainyard shadowbox: a paper-based miniature replica of your production system designed to safely reenact real outages without touching live infrastructure.

In this post, you’ll learn:

  • What a shadowbox is and why it works
  • How to model your system using a simple connectivity-only topology
  • How to build your own paper incident trainyard
  • How to replay real incidents and use them to guide resilience improvements

What Is an “Incident Story Trainyard Shadowbox”?

A shadowbox is a physical, simplified representation of a complex system. Think of it as a 3D storyboard for how your architecture behaves under stress. In our context, it’s a tabletop model of your production environment built from paper cards, string, magnets, and other simple materials.

The incident story trainyard metaphor comes from rail networks:

  • Tracks represent service dependencies
  • Switches represent routing logic or feature flags
  • Trains represent user requests, messages, or jobs moving through the system
  • Stations and yards represent services, databases, caches, queues

By laying these out physically, you can:

  • Walk through real incidents step by step
  • Change tracks (dependencies) and see how failures propagate
  • Visualize bottlenecks, single points of failure, and resilience gaps

All of this happens offline—no experiments on production, no risk to customers—while still drawing from real incident data and real system topology.


Why Do This Physically Instead of Digitally?

On paper this can sound whimsical, but the design principles are serious. The approach blends:

  • Experimental design – you’re creating a controlled environment to explore “what if” questions.
  • Chaos engineering – you’re probing how your system behaves under failure modes, but in a low-risk, simulated form.

A physical model adds advantages that digital-only tools rarely provide:

1. Lower barrier to participation

A dense architecture diagram in a wiki is intimidating. A tabletop model made of colored cards and string is approachable.

  • Non-technical stakeholders can literally see where things break.
  • New engineers can learn the system’s shape without reading thousands of lines of YAML.
  • Product, support, and operations can participate together using a shared, tangible artifact.

2. Shared mental model, not just shared dashboard

When people move requests (trains) along paths (tracks) together, they’re forced into common language and reasoning:

  • “If this cache fails, what path does the request take now?”
  • “Who notices first? Which alarms trigger?”
  • “Where does the backlog accumulate?”

This collaborative exploration builds alignment much faster than individuals staring at dashboards alone.

3. Lightweight and easy to update

Full-scale simulation environments are expensive to build and harder to maintain. A simple connectivity-only model—which captures:

  • what depends on what (service dependency graph), and
  • how many replicas or nodes each component has

is often good enough for answering many outage questions:

  • Does this failover path exist?
  • Is this component a single point of failure?
  • What gets overloaded next when this fails?

Because the model is intentionally coarse-grained, you can keep it current as architecture changes without heroic effort.


The Minimal Model: Connectivity and Replica Counts

You don’t need to reproduce every detail of your system in miniature. In fact, you shouldn’t.

A minimal topological model focuses on:

  1. Service nodes – each major service, data store, queue, third-party dependency, etc.
  2. Edges – arrows showing who calls whom or who depends on whom.
  3. Replica counts – whether a service runs as 1 instance, N instances, or across multiple regions.

That’s it. No packet-level simulation. No CPU graphs.
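
If you want to sanity-check the paper layout against something machine-readable, the whole connectivity-only model fits comfortably in a few lines of code. Here is a minimal sketch in Python; the service names, kinds, and replica counts are purely illustrative:

    # Minimal connectivity-only model: names, replica counts, dependency edges.
    # Everything below is an illustrative example, not a real architecture.
    from dataclasses import dataclass

    @dataclass
    class Service:
        name: str
        kind: str      # "api", "worker", "cache", "db", "queue", "third-party"
        replicas: int  # 1, N, or per-region counts collapsed into one number

    services = {
        "api-gateway": Service("api-gateway", "api", 3),
        "checkout":    Service("checkout", "api", 3),
        "payments":    Service("payments", "api", 2),
        "orders-db":   Service("orders-db", "db", 1),
        "cache":       Service("cache", "cache", 2),
    }

    # Edges point from caller to callee: who depends on whom.
    depends_on = {
        "api-gateway": ["checkout"],
        "checkout":    ["payments", "orders-db", "cache"],
        "payments":    ["orders-db"],
    }

    # Single-instance services that others call are single-point-of-failure candidates.
    called = {callee for deps in depends_on.values() for callee in deps}
    spofs = [s.name for s in services.values() if s.replicas == 1 and s.name in called]
    print("Single points of failure to discuss:", spofs)

Any single-instance service that other components call is worth a conversation before you even cut the first card.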

Why this is often sufficient:

  • Many of the most impactful outages trace back to dependency failures or capacity limits, not obscure low-level details.
  • The shape of the graph and the multiplicity of nodes already give powerful insights into resilience and blast radius.
  • For tabletop reenactments, you care about causal structure (“this breaks, then that breaks”), not exact timings.

You’re building a tool for thinking and storytelling, not a substitute for your load test harness.


How to Build Your Shadowbox Trainyard

You can start simple and improve over time. Here’s a basic setup.

Materials

  • Index cards or sticky notes (for services)
  • Colored string or thin tape (for dependencies)
  • Small tokens (for requests/messages): poker chips, paper circles, or Lego pieces
  • Stickers or colored dots (to indicate replica counts, regions, or roles)
  • A large board, whiteboard, or just a table

Step 1: Define your scope

Pick a bounded part of the system:

  • A critical user journey (e.g., checkout flow)
  • A specific bounded context (e.g., payments platform)
  • The set of services involved in a recent major incident

Trying to model your entire company’s systems in one shot will stall the effort.

Step 2: Create service cards

For each service in scope, create a card with:

  • Name (clear and human-readable)
  • Type (API, worker, cache, DB, queue, third-party)
  • Replica info – e.g., 1x, 3x, multi-region, active/passive

You can color-code by type (blue for services, yellow for data stores, green for queues, etc.).
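
If you would rather print labels than hand-letter every card, a short script keeps the wording consistent across the yard. A minimal sketch, with purely illustrative entries:

    # Hypothetical helper for printing card labels from a small service list.
    services = [
        {"name": "Checkout API", "kind": "API",   "replicas": "3x, multi-region"},
        {"name": "Orders DB",    "kind": "DB",    "replicas": "1x, active/passive"},
        {"name": "Job Queue",    "kind": "queue", "replicas": "3x"},
    ]

    for svc in services:
        print(f"{svc['name']}\n  type: {svc['kind']}\n  replicas: {svc['replicas']}\n")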

Step 3: Lay out the dependency graph

On the table or board:

  • Place services roughly in the order of the request flow (left to right or top to bottom)
  • Connect callers to callees with string or arrows
  • Use different colors or patterns for synchronous vs. async calls if helpful

Aim for accuracy over aesthetics: this is a working tool, not a design poster.

Step 4: Add simple failure and capacity markers

To support outage replays, add a few conventions:

  • A “down” token (e.g., a red X) you can place on a service to indicate failure
  • A way to indicate degraded behavior (e.g., yellow exclamation marks for elevated latency)
  • Optional: simple capacity markers (e.g., each token represents 100 requests per second that a service can handle)
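
If you adopt the optional capacity convention, a quick calculation keeps the token counts honest. A minimal sketch, using made-up capacities and the one-token-per-100-requests-per-second convention above:

    # Convert per-service capacity into capacity tokens for the physical model.
    # The capacities are made-up numbers for illustration.
    RPS_PER_TOKEN = 100

    capacity_rps = {
        "api-gateway": 900,
        "checkout":    600,
        "orders-db":   250,
    }

    for name, rps in capacity_rps.items():
        tokens = max(1, round(rps / RPS_PER_TOKEN))
        print(f"{name}: place {tokens} capacity token(s) on the card")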

Step 5: Represent requests as movable tokens

Pick a token type to represent:

  • A user request
  • A job/message
  • A scheduled task

You’ll physically move these along the graph as you reenact incidents.


Reenacting Real Incidents in the Shadowbox

Once the trainyard is ready, you can replay past outages step by step.

1. Choose an incident

Start with a recent, well-documented one:

  • You know the timeline (alerts, detection, mitigation)
  • You’ve identified contributing factors
  • You have some sense of the dependencies involved

This makes it easier to validate your model against reality.

2. Set the initial state

  • Place a handful of request tokens at the system’s “entry point” (e.g., web frontend, API gateway)
  • Set all services to “healthy” (no failure markers)
  • State your assumptions aloud (typical traffic, all regions up, etc.)

3. Introduce the failure

At the time the incident began:

  • Mark the failing component(s) with the “down” token
  • Explain the real trigger (deploy, config change, regional outage, network issue, etc.)

Now, walk the requests through the system and answer together:

  • Where do they go now that this component is down?
  • Do they time out, trigger retries, or take alternate paths?
  • Which services see increased load as a result?
  • Who would be alerted first, given your current observability setup?
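
Questions like these map directly onto the connectivity-only model. As a minimal sketch, the snippet below answers the simplest of them (which services can no longer complete their requests), reusing the illustrative graph from earlier and assuming synchronous calls with no fallback paths:

    # Mark one component as down and find every service that transitively
    # depends on it. The graph is the illustrative checkout example.
    depends_on = {
        "api-gateway": ["checkout"],
        "checkout":    ["payments", "orders-db", "cache"],
        "payments":    ["orders-db"],
    }

    def blast_radius(failed: str) -> set[str]:
        """Services whose requests can no longer complete once `failed` is down."""
        impacted = {failed}
        changed = True
        while changed:
            changed = False
            for caller, callees in depends_on.items():
                if caller not in impacted and impacted.intersection(callees):
                    impacted.add(caller)
                    changed = True
        return impacted - {failed}

    print(sorted(blast_radius("orders-db")))  # who breaks when the database goes down

The physical walk and this calculation should agree; when they do not, either the paper model or your mental model of the dependencies needs an update.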

4. Explore propagation and mitigation

As the team moves tokens and markers:

  • Trace where queues back up or where traffic concentrates
  • Simulate the actual mitigations taken (e.g., feature flag turned off, traffic shifted)
  • Ask: What other options could we have had?

This is where the story emerges—people fill in operational details, confusion points, communication gaps, and unspoken assumptions.

5. Capture insights and potential changes

Document:

  • Surprising dependencies no one had in mind
  • Single points of failure that should be made redundant
  • Missing alerts or misleading dashboards
  • Candidate guardrails, feature flags, or capacity buffers

Because everyone can see the system, it’s easier to have grounded conversations about trade-offs: cost vs. resilience, complexity vs. safety.


From Play to Practice: What Teams Gain

A shadowbox is not a toy; it becomes a shared learning environment.

Better incident response practice

You can run tabletop drills using the trainyard:

  • Assign roles (on-call engineer, incident commander, comms lead, product owner)
  • Simulate detection, triage, communication, and mitigation
  • Practice runbooks and refine them based on what actually happens in the model

More inclusive, cross-functional learning

Because the model is physical and visual:

  • Support teams can understand why certain errors occur and what’s realistic to promise customers
  • Product managers can see the architectural cost of new features or reliability commitments
  • Leadership can understand resilience investments beyond abstract “nines” and SLO charts

Safer exploration than full chaos experiments

Chaos engineering against production is powerful but expensive and risky. The shadowbox approach:

  • Costs almost nothing to run
  • Has zero risk to users
  • Still exposes important resilience gaps and design flaws

You can even prototype chaos experiments on the table first, to decide whether they’re worth attempting in live environments.


Keeping the Shadowbox Useful Over Time

The value of the model depends on its freshness and relevance.

  • Update it with each major architectural change: new services, retired dependencies, region migrations.
  • Use it in regular rituals: post-incident reviews, quarterly resilience reviews, onboarding for new team members.
  • Keep it minimal: if it gets too detailed, you’ll stop updating it. Focus on dependencies and replica counts first; only add more detail when multiple incidents demand it.

Think of the shadowbox as a living companion to your architecture docs and runbooks—a place where those abstractions are physically tested against messy reality.


Conclusion: Bringing Outages Into the Light

Outages are stories of how complex systems actually behave, not how we wish they behaved. A physical incident story trainyard shadowbox lets you:

  • Make those stories visible and tangible
  • Involve more people in understanding and improving resilience
  • Rehearse responses and design trade-offs without risking production

By building a miniature paper replica of your system, grounded in a simple connectivity-only model, you gain a flexible, low-cost lab for exploring failure. Over time, reenacting real incidents in this space transforms them from isolated crises into shared learning experiences—and that’s where real resilience starts to grow.
