Rain Lag

The Analog Incident Story Subway Map: A Fold-Out Blueprint for Tracing Outage Cause and Effect

How to design a fold-out, subway-style paper map that turns messy incident postmortems into a clear, visual story of system outages, failure detection, and circuit breaker behavior.

The Analog Incident Story Subway Map: Designing a Fold-Out Paper Network for Tracing Outage Cause and Effect

Modern distributed systems fail in tangled, multi-step ways. Dashboards scroll by, alerts fire, people swarm video calls—and when the dust settles, you’re left with a familiar question:

What actually happened, in what order, and why?

Most teams answer that question with a written postmortem doc, maybe a diagram or two, and a list of action items. That’s a good start, but written reports alone rarely capture the complexity of how an outage truly unfolded across systems, people, and time.

Enter the Analog Incident Story Subway Map: a fold-out paper map that turns an outage into a visual, navigable network—like a subway map for your system’s failure journey.

In this post, we’ll walk through how to design that map so it:

  • Embeds insights from structured postmortems
  • Shows cause-and-effect across multiple systems
  • Incorporates real-time failure detection metrics
  • Visually explains how circuit breakers contain damage
  • Highlights the parallel tracks of detection and isolation
  • Uses quantitative risk data (e.g., via Laplace inversion–based analysis)
  • Becomes a living artifact for reviews and training—not just a pretty picture

Why an Analog Subway Map for Incidents?

A subway map is a powerful metaphor for incidents:

  • Stations represent events: symptoms, alarms, system transitions, human actions.
  • Lines represent flows: requests through services, error propagation, detection signals, circuit breaker states.
  • Transfers show how a single failure jumps to another component or domain.
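
The metaphor above maps naturally onto a small graph model. The following is a minimal sketch, not a prescribed format: the line names and station names are hypothetical, and a "transfer" is taken to mean any station that appears on more than one line.

```python
from collections import defaultdict

# Hypothetical incident: each line is an ordered list of station names.
lines = {
    "checkout-requests": ["cart submitted", "orders-db pool exhausted", "checkout 5xx"],
    "payments-requests": ["payment started", "orders-db pool exhausted", "payment timeout"],
    "search-requests": ["query received", "results served"],
}


def find_transfers(lines):
    """Return stations shared by more than one line, with the lines they join."""
    seen = defaultdict(set)
    for line_name, stations in lines.items():
        for station in stations:
            seen[station].add(line_name)
    return {s: sorted(names) for s, names in seen.items() if len(names) > 1}


transfers = find_transfers(lines)
print(transfers)
```

In this made-up example the only transfer station is the shared database failure, which is exactly the point where one service's problem becomes another's.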

Unlike a pure timeline, a subway-style map emphasizes relationships and paths, not just sequence. In a complex outage, that’s often what matters most:

  • How did this symptom connect to that root cause?
  • Where could we have detected the problem earlier?
  • Which circuit breakers successfully prevented a cascade—and which didn’t?

The fold-out paper format makes this even more powerful:

  • People can physically point, annotate, and argue with the map in real time.
  • It can be laid in the middle of the table during reviews, making it a shared reference point.
  • Different scales (zoomed-in panels, zoomed-out overview) can coexist on the same sheet.

This isn’t nostalgia for paper; it’s an intentional choice to create a tactile, collaborative artifact that outlives a single Google Doc link.


Step 1: Start from a Structured, Thorough Postmortem

The subway map is only as good as the story underneath it. Before you ever draw a line, you need a rigorous, structured postmortem process.

At a minimum, that process should produce:

  1. A detailed timeline of:
    • System signals (metrics, logs, traces)
    • User-visible symptoms
    • Operator interventions
  2. Explicit cause-and-effect chains, not just a root cause label
  3. Contributing factors (technical, process, organizational)
  4. Detection and response analysis:
    • When the system could have known vs. when it actually did
    • When humans could have intervened vs. when they actually did
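
One way to keep those outputs machine-readable before drawing anything is sketched below. All class names, field names, timestamps, and event descriptions are illustrative assumptions rather than a standard postmortem schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime
    kind: str  # "signal", "symptom", or "intervention"
    note: str


@dataclass(frozen=True)
class DetectionGap:
    could_have_known: datetime
    actually_knew: datetime

    @property
    def lag(self) -> timedelta:
        return self.actually_knew - self.could_have_known


def trace_back(symptom, caused_by):
    """Follow an explicit cause-and-effect chain from a symptom toward its origin."""
    chain = [symptom]
    while chain[-1] in caused_by and caused_by[chain[-1]] not in chain:
        chain.append(caused_by[chain[-1]])
    return chain


# Hypothetical incident data.
events = [
    TimelineEvent(datetime(2024, 3, 5, 10, 5), "symptom", "checkout returns 5xx"),
    TimelineEvent(datetime(2024, 3, 5, 10, 2), "signal", "orders-db lock waits climb"),
    TimelineEvent(datetime(2024, 3, 5, 10, 20), "intervention", "rollback started"),
]
timeline = sorted(events, key=lambda e: e.at)

caused_by = {
    "checkout returns 5xx": "api-gateway timeouts",
    "api-gateway timeouts": "orders-db lock waits climb",
}
chain = trace_back("checkout returns 5xx", caused_by)

gap = DetectionGap(
    could_have_known=datetime(2024, 3, 5, 10, 2),
    actually_knew=datetime(2024, 3, 5, 10, 11),
)
print(chain, gap.lag)
```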

Your map will transform this linear narrative into a network of events and flows, but the logical consistency and completeness come from the postmortem itself.

Tip: Treat the postmortem as the “script” and the subway map as the “storyboard.” If the script is vague, the storyboard will be confusing.


Step 2: Define the Visual Language of the Map

Before drawing, define a consistent visual vocabulary so that everyone reads the map the same way.

Core Elements

  • Stations (nodes)

    • Symptom stations (e.g., user-facing errors): squares
    • Internal events (service failures, retries, DB contention): circles
    • Human actions (deploy, rollback, feature flag change): diamonds
    • Circuit breaker state changes: hexagons
  • Lines (paths)

    • Request flow line: solid colored lines for each major service path
    • Error propagation line: dotted red overlay
    • Detection signals line: dashed blue lines representing metrics/alerts
    • Circuit breaker line: thick black or orange line showing isolation boundaries
  • Annotations

    • Time stamps
    • Metric values (e.g., latency, error rate)
    • Ticket/incident IDs
    • Short text notes (2–5 words) for key insights

With a legend on the fold-out itself, an engineer seeing the map for the first time should be able to decode:

“This is a user symptom, flowing from service A to B to C, while the circuit breaker at B tripped here, and the detection system only noticed over there.”
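
If you draft the map digitally before printing, the legend can live in one lookup table so every incident uses the same shapes and strokes. The sketch below emits Graphviz DOT text as one possible drafting format; the choice of DOT, the green request-flow color, and the station names are assumptions, not part of the map design itself.

```python
# Legend from the list above, expressed as Graphviz node shapes and edge attributes.
STATION_SHAPES = {
    "symptom": "square",
    "internal": "circle",
    "human": "diamond",
    "breaker": "hexagon",
}
LINE_STYLES = {
    "request": "style=solid, color=green",  # one color per service path (assumed)
    "error": "style=dotted, color=red",
    "detection": "style=dashed, color=blue",
    "breaker": "penwidth=3, color=black",
}


def to_dot(stations, segments):
    """Render stations {name: kind} and segments [(src, dst, line)] as DOT text."""
    out = ["digraph incident {", "  rankdir=LR;"]
    for name, kind in stations.items():
        out.append(f'  "{name}" [shape={STATION_SHAPES[kind]}];')
    for src, dst, line in segments:
        out.append(f'  "{src}" -> "{dst}" [{LINE_STYLES[line]}];')
    out.append("}")
    return "\n".join(out)


# Hypothetical stations and segments.
stations = {
    "checkout 5xx": "symptom",
    "orders-db contention": "internal",
    "rollback": "human",
    "breaker B opens": "breaker",
}
segments = [
    ("orders-db contention", "checkout 5xx", "error"),
    ("breaker B opens", "checkout 5xx", "breaker"),
]
dot = to_dot(stations, segments)
print(dot)
```

Because an unknown station kind or line type raises a `KeyError`, the table also acts as a cheap check that nobody invents a symbol the legend does not explain.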


Step 3: Show How Failure Detection Sees the Outage

Most incident diagrams focus solely on the failing system. To understand and improve response, you must also depict the system that detects failure.

Design a dedicated detection line:

  • Each station on this line is a detection event:
    • Metric threshold crossed
    • Alert fired
    • Dashboard opened
    • On-call acknowledgment
  • The line runs in parallel to the request/error lines, with arrows up or down connecting detection events to the underlying system events they refer to.

This gives you:

  • A clear picture of where observability lagged reality
  • Visual gaps where no detection event exists despite obvious issues
  • Places where alerts fired but didn’t influence operator action

The subway metaphor lets you “ride” the detection line and see how far it runs behind, or ahead of, the failure progression.
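
The arrows between the detection line and the system lines can be backed by numbers. Here is a rough sketch, assuming each detection event records which system event it refers to; the event names and timestamps are invented for illustration.

```python
from datetime import datetime

# Hypothetical system events: when each thing actually happened.
system_events = {
    "orders-db contention": datetime(2024, 3, 5, 10, 2, 0),
    "api-gateway timeouts": datetime(2024, 3, 5, 10, 4, 0),
    "checkout 5xx": datetime(2024, 3, 5, 10, 5, 0),
}

# Hypothetical detection events: (label, system event it refers to, timestamp).
detection_events = [
    ("p99 latency threshold crossed", "api-gateway timeouts", datetime(2024, 3, 5, 10, 6, 30)),
    ("alert fired", "api-gateway timeouts", datetime(2024, 3, 5, 10, 7, 0)),
    ("on-call acknowledgment", "checkout 5xx", datetime(2024, 3, 5, 10, 13, 0)),
]


def detection_lags(system_events, detection_events):
    """Seconds between each system event and the detection events that point at it."""
    lags = {}
    for label, refers_to, at in detection_events:
        lag_s = (at - system_events[refers_to]).total_seconds()
        lags.setdefault(refers_to, []).append((label, lag_s))
    return lags


def undetected(system_events, detection_events):
    """System events with no detection station at all: the visual gaps on the map."""
    detected = {refers_to for _, refers_to, _ in detection_events}
    return sorted(set(system_events) - detected)


lags = detection_lags(system_events, detection_events)
gaps = undetected(system_events, detection_events)
print(lags, gaps)
```

The lag values become annotations on the connecting arrows, and anything returned by `undetected` is a station on the system line with no counterpart on the detection line.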


Step 4: Visualize Circuit Breakers as Isolation Lines

Circuit breakers are your system’s equivalent of track switches and barriers. The map should make this isolation behavior highly visible.

Design a circuit breaker line that:

  • Runs alongside the primary service lines
  • Has state-change stations (closed → open, open → half-open, etc.)
  • Connects to the services it protects via short branch lines

When a circuit breaker trips:

  • Draw a bold icon or color change at that station
  • Fade or cross-hatch the downstream lines to indicate traffic no longer flows
  • Annotate with:
    • Trigger condition (e.g., 50% errors over 10 seconds)
    • Time to open/close
    • % of traffic shed

This makes it visually obvious where cascades were stopped and where missing or misconfigured breakers allowed failure to spread further than intended.
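
To make the state-change stations concrete, here is a minimal circuit breaker sketch that logs every transition. It assumes a sliding-window error-rate trigger like the 50%-over-10-seconds example above; the cooldown, the minimum call count, and the single-trial half-open rule are illustrative choices, and production libraries differ in their details.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, error_threshold=0.5, window_s=10.0, cooldown_s=30.0, min_calls=4):
        self.error_threshold = error_threshold
        self.window_s = window_s
        self.cooldown_s = cooldown_s
        self.min_calls = min_calls
        self.state = "closed"
        self.opened_at = None
        self.calls = deque()   # (timestamp, succeeded) within the window
        self.transitions = []  # (timestamp, old_state, new_state): the map's stations

    def _move(self, now, new_state):
        self.transitions.append((now, self.state, new_state))
        self.state = new_state

    def allow(self, now):
        """Return True if a call may proceed; moves open -> half-open after the cooldown."""
        if self.state == "open" and now - self.opened_at >= self.cooldown_s:
            self._move(now, "half-open")
        return self.state != "open"

    def record(self, now, succeeded):
        """Record a call outcome and trip or reset the breaker if warranted."""
        if self.state == "half-open":
            # A single trial call decides: success closes, failure re-opens.
            if succeeded:
                self.calls.clear()
                self._move(now, "closed")
            else:
                self.opened_at = now
                self._move(now, "open")
            return
        self.calls.append((now, succeeded))
        while self.calls and now - self.calls[0][0] > self.window_s:
            self.calls.popleft()
        failures = sum(1 for _, ok in self.calls if not ok)
        if (self.state == "closed" and len(self.calls) >= self.min_calls
                and failures / len(self.calls) >= self.error_threshold):
            self.opened_at = now
            self._move(now, "open")


# Hypothetical call outcomes, with time in seconds since the incident began.
breaker = CircuitBreaker()
for t, ok in [(0.0, True), (1.0, False), (2.0, False), (3.0, False)]:
    if breaker.allow(t):
        breaker.record(t, ok)
blocked = not breaker.allow(10.0)  # still inside the cooldown
if breaker.allow(33.0):            # cooldown elapsed: trial call allowed
    breaker.record(33.0, True)
print(breaker.transitions)
```

Each entry in `transitions` corresponds to one hexagon station, and the timestamps give you the "time to open/close" annotation directly.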

Overlaying the detection and circuit breaker lines lets you ask:

  • Did the detection system or the circuit breaker act first?
  • Did they agree on what was failing?
  • Where did one system compensate for the other’s weaknesses?
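
The first of those questions can be answered mechanically once both lines carry timestamps. A small sketch, assuming times are measured in seconds since the first symptom and `None` means the system never reacted; the service names and values are hypothetical.

```python
def who_acted_first(alert_at, breaker_open_at):
    """Compare the first alert time with the first breaker-open time for one service."""
    if alert_at is None and breaker_open_at is None:
        return "neither"
    if breaker_open_at is None:
        return "detection only"
    if alert_at is None:
        return "breaker only"
    if alert_at == breaker_open_at:
        return "tie"
    return "detection first" if alert_at < breaker_open_at else "breaker first"


# Hypothetical per-service observations: (first alert, first breaker open).
observations = {
    "orders-api": (95.0, 40.0),
    "payments": (30.0, None),
    "search": (None, None),
}
verdicts = {svc: who_acted_first(a, b) for svc, (a, b) in observations.items()}
print(verdicts)
```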

Step 5: Ground the Map in Quantitative Risk Data

To keep the subway map from becoming a purely qualitative story, anchor it in quantitative risk analysis.

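One approach is Laplace inversion-based outage probability analysis, where a reliability model is solved in the Laplace domain and then inverted numerically to recover time-domain probabilities (similar reliability methods work as well). Use it to estimate: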

  • Probability of each component failing within a given time window
  • Distribution of outage durations
  • Probability of certain cascades occurring
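
As an illustration of what such an analysis involves, the sketch below numerically inverts a Laplace transform to get a failure probability within a time window. The post does not name a specific inversion algorithm or model, so both are assumptions here: the Gaver-Stehfest method is used because it is short, and the component is assumed to have an exponentially distributed time to failure with a made-up rate.

```python
from math import exp, factorial, log


def stehfest_coefficients(n=12):
    """Gaver-Stehfest weights V_1..V_n (n must be even)."""
    if n % 2:
        raise ValueError("n must be even")
    half = n // 2
    coeffs = []
    for k in range(1, n + 1):
        total = 0.0
        for j in range((k + 1) // 2, min(k, half) + 1):
            total += (j ** half * factorial(2 * j)) / (
                factorial(half - j) * factorial(j) * factorial(j - 1)
                * factorial(k - j) * factorial(2 * j - k)
            )
        coeffs.append((-1) ** (k + half) * total)
    return coeffs


def invert_laplace(F, t, n=12):
    """Approximate f(t) from its Laplace transform F(s), for t > 0."""
    a = log(2.0) / t
    weights = stehfest_coefficients(n)
    return a * sum(v * F(k * a) for k, v in enumerate(weights, start=1))


# Assumed model: time to failure ~ Exponential(rate), so the CDF's transform
# is rate / (s * (s + rate)). The rate is a hypothetical value.
rate = 0.2  # failures per hour


def cdf_transform(s):
    return rate / (s * (s + rate))


p_fail_within_1h = invert_laplace(cdf_transform, 1.0)
exact = 1.0 - exp(-rate * 1.0)  # closed form, available only for this simple model
print(p_fail_within_1h, exact)
```

For this toy model the closed form makes inversion unnecessary; the numerical route matters when the transform comes from a larger model with no convenient time-domain solution.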

You don’t need to show the math on the map, but you should encode the results visually:

  • Line thickness proportional to the probability of that failure path
  • Station size or halo indicating components with high outage likelihood
  • Shaded zones showing high-risk clusters where small perturbations are likely to propagate
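
A simple way to turn probabilities into line weights is a linear scale with a floor, so that unlikely paths stay visible in print. The point sizes below are arbitrary assumptions; pick values that suit your paper size and printer.

```python
def stroke_width(probability, min_pt=0.5, max_pt=6.0):
    """Map a failure-path probability in [0, 1] to a line width in points."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return min_pt + (max_pt - min_pt) * probability


# Hypothetical failure paths and their estimated probabilities.
paths = {"db -> gateway -> checkout": 0.5, "cache -> search": 0.0}
widths = {name: stroke_width(p) for name, p in paths.items()}
print(widths)
```

Strictly speaking this is linear rather than proportional because of the floor, so say so in the legend panel to avoid misreading thin lines as zero risk.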

Include a small panel on the fold-out with:

  • A summary table of key probabilities
  • A short explanation of what the shading and thickness represent
  • Pointers to the underlying analysis for those who want deeper detail

This way, the map is not just “what happened” but also “how likely was this” and “where are we inherently fragile.”


Step 6: Make the Fold-Out Map a Living Artifact

The subway map should not be a one-off diagram that gets archived and forgotten.

Treat it as a living artifact by:

  1. Bringing it to every incident review

    • Lay it on the table (or share a scanned PDF if the review is remote)
    • Let participants trace paths with pens or cursors
    • Add sticky notes or digital callouts for new insights
  2. Iterating on the template

    • After a few incidents, refine the legend and visual language
    • Standardize colors, shapes, and annotations across incidents
  3. Using it for training and drills

    • New engineers “ride the lines” of past incidents as onboarding
    • Game days simulate similar patterns and refer back to the map
  4. Connecting it to process improvements

    • Highlight stations that correspond to successful mitigations and remediations
    • Mark which stations led to long-term improvements (e.g., new SLOs, alert changes, architectural changes)

The more you use the map actively, the more it becomes a shared mental model for how your system fails—and how your organization responds.


Putting It All Together

An Analog Incident Story Subway Map is more than a stylish poster. It’s a design pattern for thinking about outages:

  • Structured postmortems provide the narrative backbone.
  • A subway-style visual language turns linear timelines into navigable networks.
  • Parallel lines for request flow, error propagation, detection, and circuit breakers expose interactions that are hard to see in text.
  • Quantitative risk analysis (including Laplace inversion-based methods) anchors the story in estimated probabilities rather than intuition.
  • The fold-out, analog format encourages collaborative exploration and continuous refinement.

As systems grow more distributed and failure modes more intertwined, teams need tools that help them see the whole story at once. A carefully designed incident subway map doesn’t replace logs, dashboards, or traces—but it weaves them into a coherent, human-readable journey from first symptom to root cause and beyond.

If your incident reviews feel like re-reading the same dry report, try drawing your next outage as a subway map on a fold-out sheet. You might discover that the path to better reliability is easier to follow when it’s printed right in front of you.
