Rain Lag

The Analog Incident Story Trainyard Terrarium: A Desk-Sized Paper Ecosystem for Watching Outages Evolve

How a low-tech, desk-sized paper ‘trainyard terrarium’ can help teams visualize complex outages, quantify impact, and turn postmortems into real reliability improvements.

Incidents rarely happen in a straight line. They creep in from multiple directions: a fragile API here, a delayed courier there, a warehouse stockout made worse by a storm two states away. By the time it becomes “an outage,” you’re staring at a tangled story that doesn’t fit neatly into a single dashboard.

Enter the analog incident story trainyard terrarium: a desk-sized, paper-based ecosystem that lets you watch outages evolve over time—visually, slowly, and holistically.

This isn’t a fancy digital twin or a new observability tool. It’s a deliberately low-tech model of your system and its surroundings, laid out like a miniature trainyard and terrarium combined. It’s where you map users, services, suppliers, couriers, warehouses, weather, and time—with paper trains, tracks, cards, and tokens—to understand how real-world disruptions ripple through your ecosystem.

In this post, we’ll explore how this kind of analog model can transform your incident analysis, make impact concrete, and create better postmortems and follow-up actions grounded in Design-for-Reliability (DfR) principles.


Why Go Analog for Incidents?

Digital tools excel at precision and speed, but they often hide the shape of an incident. A desk-sized paper model forces you to:

  • Slow down. When you’re placing cards by hand, you’re thinking about cause, effect, and sequence—not just copying timestamps.
  • Make complexity visible. You see how software, logistics, suppliers, and environment intersect in one physical view.
  • Tell the story. Stakeholders can literally walk around the model and ask, “Where did this start? Who felt it next?”

The result is a more truthful narrative of how outages unfold—not just in your code, but across the entire system of people, processes, and physical world constraints.


Building Your Trainyard Terrarium

Think of your desk as a layout board for an evolving story. You’re building a tiny ecosystem where incidents live, move, and resolve.

Core Elements

Here’s a simple structure you can start with:

  1. Tracks (Flows)
    Use tape, string, or drawn lines to represent flows:

    • Data/API calls
    • Order fulfillment flows
    • Shipment routes
    • Supplier deliveries
  2. Stations (Domains)
    Place labeled cards or sticky notes as stations:

    • App / Website (user-facing surface)
    • Backend Services (payments, auth, tracking)
    • Warehouse / Fulfillment Centers
    • Couriers / Carriers
    • Suppliers / Manufacturers
    • Environment Nodes (regions vulnerable to flooding, heat, storms)
  3. Trains (Events & Entities)
    Use small cards, colored tokens, or cut-out “trains” to represent:

    • Customer orders
    • API requests
    • Delivery trucks
    • Inventory shipments

    Each train moves along the tracks as time passes.
  4. Overlays (Conditions & Failures)
    Use colored markers or transparent sticky notes to indicate:

    • System failure (red)
    • Degradation / delays (orange)
    • Risk / saturation thresholds (yellow)
    • Normal operation (green)
  5. Timeline Rail
    Along the bottom or side, maintain a time rail:

    • Incident start, detection, escalation, mitigation, resolution
    • Key external events: storms, road closures, supplier production delays

This gives you a desk-sized “ecosystem” where software, logistics, and environment meet.
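If you ever want a small digital shadow of the paper layout, the stations, tracks, and trains map naturally onto a tiny graph. Here is a minimal sketch; the station names, statuses, and the `advance` helper are illustrative, not part of any real tool:

```python
from dataclasses import dataclass

# Stations are nodes; tracks are directed edges between them.
@dataclass
class Station:
    name: str
    status: str = "green"  # overlay color: green / yellow / orange / red

@dataclass
class Train:
    label: str     # e.g. an order ID or shipment ID
    position: str  # name of the station it currently sits at

# Tracks mirror the tape/string on the desk: (from_station, to_station).
tracks = [("Supplier", "Warehouse"), ("Warehouse", "Courier"),
          ("Courier", "Customer")]

stations = {n: Station(n) for pair in tracks for n in pair}
trains = [Train("Order #1042", "Warehouse")]

def advance(train: Train) -> None:
    """Move a train one hop along the first matching track."""
    for src, dst in tracks:
        if train.position == src:
            train.position = dst
            return

advance(trains[0])
print(trains[0].position)  # Courier
```

The point is not to replace the paper model, only to show that its vocabulary (stations, tracks, trains, overlays) is already a well-formed data structure if you later need one.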


Making Impact Tangible: Quantify on Paper

Incidents feel abstract until you can answer:

  • How many users were affected?
  • For how long?
  • What did it cost us? (revenue, refunds, penalties, reputational damage)

In the trainyard terrarium, you make this explicit.

Quantifying Incident Impact

Add visual counters and annotations:

  • User tokens: Each token = 100 users. Stack them at the affected station (e.g., “Checkout API”). As time passes, add more tokens for newly impacted users.
  • Downtime strips: Use strips of paper on your time rail to mark uptime vs downtime/degradation for each critical service.
  • Financial markers: Place small cards at impact points (e.g., “Lost $4,200 in abandoned carts,” “Late-delivery refunds: $1,750”).

By the end, your terrarium acts as a physical heatmap of impact. This makes prioritization less emotional and more evidence-driven.
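The paper counters translate directly into arithmetic. A small sketch, assuming one token = 100 users and illustrative numbers taken from markers like the ones above:

```python
# One row per affected station: (station, tokens, minutes degraded, dollars).
# Numbers are illustrative stand-ins for what the paper markers record.
impact = [
    ("Checkout API", 12, 45, 4200.00),   # lost revenue in abandoned carts
    ("Tracking API", 30, 180, 1750.00),  # late-delivery refunds
]

TOKEN_SIZE = 100  # users per token, as on the desk

total_users = sum(tokens * TOKEN_SIZE for _, tokens, _, _ in impact)
total_cost = sum(cost for *_, cost in impact)

# "User-minutes" (users x minutes degraded) identifies the worst hotspot.
worst = max(impact, key=lambda row: row[1] * TOKEN_SIZE * row[2])

print(f"Users affected: {total_users}")        # 4200
print(f"Direct cost:    ${total_cost:,.2f}")   # $5,950.00
print(f"Largest user-minutes hit: {worst[0]}") # Tracking API
```

Totals like these are exactly what the stacked tokens and financial cards express physically; writing them down once makes the prioritization argument portable beyond the desk.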


Storyboarding the Incident: A Holistic Timeline

Outages today often span multiple domains:

  • Software systems
  • Transportation logistics
  • Suppliers and upstream inventory
  • Environmental factors like flooding or temperature

The terrarium lets you storyboard those interactions.

Example Story: A Multidomain Disruption

Imagine this sequence mapped onto your model:

  1. Supplier Delay
    A supplier’s factory experiences a heatwave. Cooling systems fail; production slows. You place a red marker on the supplier station and note: “Reduced output, ETA +3 days.”

  2. Warehouse Stockout Risk
    Trains representing shipments to your warehouse stop arriving. Inventory tokens start dropping on the warehouse card.

  3. Courier Disruption
    Simultaneously, flooding hits a key transport hub. You place a flood icon on the affected region: “Courier delays 24–48 hours.” Trains on courier tracks stack up.

  4. User-Facing Symptoms
    The website still accepts orders, but tracking updates stall and deliveries slip. You move user tokens to “Affected by Late Delivery” and mark the tracking API as “Degraded.”

  5. Real-World Impacts
    Complaints rise, refunds increase, and a social media spike appears. You add cards for “Support volume +40%” and “Refunds +$X.”

On the desk, you see how delayed tracking updates, courier delays, late restocking, and environmental conditions create a single, evolving outage.
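The time rail underneath this story is just an ordered event log, which makes it easy to replay digitally after the review. A sketch with illustrative timestamps, domains, and notes mirroring the five steps above:

```python
from datetime import datetime

# Each event: (timestamp, domain, note, severity) -- a digital copy of the
# time rail. Timestamps and notes are illustrative.
events = [
    (datetime(2024, 6, 1, 8, 0),  "supplier",  "Heatwave slows production, ETA +3 days", "red"),
    (datetime(2024, 6, 1, 14, 0), "warehouse", "Inventory below reorder point",          "orange"),
    (datetime(2024, 6, 2, 6, 0),  "courier",   "Flooding at hub, delays 24-48h",         "red"),
    (datetime(2024, 6, 2, 9, 30), "software",  "Tracking API degraded, updates stall",   "orange"),
    (datetime(2024, 6, 2, 12, 0), "support",   "Support volume +40%, refunds rising",    "orange"),
]

# Replay in order to watch the outage cross domains over time.
for ts, domain, note, severity in sorted(events):
    print(f"{ts:%b %d %H:%M}  [{severity:>6}]  {domain:<9}  {note}")
```

Sorting by timestamp reproduces the same left-to-right reading the time rail gives you on the desk, with each domain's contribution visible in sequence.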


Postmortems: Balance Praise, Honesty, and Action

A trainyard terrarium isn’t just for real-time understanding; it’s a perfect scaffold for postmortems that are:

  • Balanced
  • Honest
  • Actionable

Acknowledging What Went Well

Start by annotating successful responses directly on the map:

  • "Incident detected within 10 minutes via monitoring at Station X."
  • "On-call re-routed 40% of orders to an unaffected warehouse."
  • "Support team quickly published accurate delay expectations."

Place green check marks or small “win” stickers where these occurred. This highlights strengths in detection, mitigation, and communication.

Being Honest About Shortcomings

Next, mark where the team fell short—without blame, but with precision:

  • Slow or missing detection: No alert when courier delays started affecting delivery times.
  • Poor visibility: Environmental risks (flood-prone routes, temperature-sensitive goods) weren’t modeled in planning.
  • Communication gaps: Customers saw generic error messages instead of clear, contextual explanations.

Use red or orange markers with short notes: “No alert configured,” “Assumed supplier reliability,” “Support playbook missing scenario.”

This honest assessment of shortcomings is core to learning. On the desk, those red markers act as visual prompts: we can do better here.


From Insight to Improvement: Actionable Follow-Ups

A postmortem is only as good as its follow-up tasks. The terrarium should end up surrounded by cards that say: “Here’s what we’ll actually do next.”

Designing Actionable, Prioritized Tasks

Organize follow-up tasks into clear categories and priorities:

  • Monitoring & Detection

    • Add alerting for spikes in courier delays beyond X hours.
    • Track supplier lead time variance as a first-class metric.
  • Resilience & Redundancy

    • Introduce secondary supplier for critical SKUs.
    • Add alternative courier paths for flood-prone regions.
  • Communication & UX

    • Improve customer messaging for delays: accurate ETAs and explanations.
    • Create internal playbooks for multi-domain disruptions.
  • Data & Modeling

    • Integrate environmental risk data (weather, temperature) into planning.
    • Log and correlate tracking delays with customer impact.

For each task, assign:

  • Owner
  • Deadline
  • Expected risk reduction (e.g., “Reduces single-supplier risk by 40% for item group A.”)

Pin these around the terrarium so stakeholders can see the direct line from outage story → insight → concrete change.
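Each follow-up card carries the same few fields, so a small record type keeps them sortable and reviewable. A sketch with hypothetical owners, dates, and risk-reduction notes:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class FollowUp:
    task: str
    category: str        # Monitoring, Resilience, Communication, Data
    owner: str           # illustrative team names
    deadline: date
    risk_reduction: str  # expected effect, stated explicitly

tasks = [
    FollowUp("Alert on courier delays beyond 4h", "Monitoring",
             "oncall-team", date(2024, 7, 1),
             "Cuts detection lag for logistics incidents"),
    FollowUp("Secondary supplier for critical SKUs", "Resilience",
             "supply-chain", date(2024, 8, 15),
             "Reduces single-supplier risk for item group A"),
]

# Nearest commitments surface first, matching how cards get reviewed.
for t in sorted(tasks, key=lambda t: t.deadline):
    print(f"{t.deadline}  {t.category:<11} {t.owner:<13} {t.task}")
```

The forcing function is the same as on paper: a follow-up without an owner, a deadline, and a stated risk reduction simply cannot be constructed.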


Infusing Design-for-Reliability (DfR) into the Terrarium

Design-for-Reliability (DfR) is about designing systems to withstand real-world complexity: manufacturing variance, messy usage patterns, and unpredictable environments.

Your trainyard terrarium becomes a DfR thinking tool when you:

  1. Model variability explicitly

    • Show best-case and worst-case lead times for suppliers.
    • Represent seasonal courier performance changes.
    • Add “hot zones” for high-temperature risk or flood-prone regions.
  2. Stress test the system on paper

    • Run what-if scenarios: “What if this supplier fails?” “What if this courier hub is down for 72 hours?”
    • Move trains and tokens to see where congestion or failure appears first.
  3. Design mitigations visibly

    • Draw alternative tracks for rerouting orders or traffic.
    • Place backup stations (secondary suppliers, cloud regions, warehouses) and simulate cutovers.

DfR is easier to explain and reason about when you can point at a paper ecosystem and say, “Here is where we add resilience, not just more alerts.”


Making It a Living Practice

To get lasting value from your analog incident story trainyard terrarium:

  • Keep it out, not in a drawer. Let it live on a table or wall so people can walk up, ask questions, and learn from past incidents.
  • Use it during incident reviews. Build the story together as you review logs, timelines, and metrics.
  • Update it for new realities. When you add a new supplier, region, or system, update the model.
  • Cross-train with it. Onboard new engineers, ops, and support staff by walking them through real incidents on the terrarium.

Conclusion: Seeing the Whole Ecosystem

Modern outages aren’t just about servers and code. They’re about the messy intersection of:

  • Software reliability
  • Logistics and transportation
  • Supplier performance
  • Environmental conditions
  • Human response and communication

A desk-sized paper ecosystem—your incident story trainyard terrarium—gives you a way to see all of this at once. It helps you:

  • Quantify impact in human and financial terms
  • Tell balanced incident stories that celebrate successes and confront shortcomings
  • Generate clear, prioritized follow-up tasks
  • Apply Design-for-Reliability thinking across digital and physical domains

In a world full of dashboards and alerts, a carefully crafted analog model might be the clearest window you have into how outages really evolve—and how your organization can learn to navigate them better over time.
