The Paper Incident Story Subway Map: Hand‑Designing Underground Routes for Hidden Failure Signals
How hand-drawn “subway maps” of alerts, failures, and escalations can reveal hidden reliability risks that dashboards and code never show—and how to build your own using RBDs, FTA, and gemba-style observation.
Complex systems rarely fail in straight lines. Things break sideways, alerts hop between tools, and tiny glitches tunnel through strange routes before they finally surface as a “major incident.” If you’ve ever stared at three different dashboards while your pager screams and still felt blind, you already know: our mental model of the system almost never matches the real thing.
One powerful way to close that gap is delightfully low‑tech: draw it.
This post explores a technique I call the Incident Story Subway Map—a hand‑designed, diagram‑heavy way to map how failures flow through your systems, tools, and people. Think of it as a visual subway map for your incidents: which “lines” (signals) run where, which “stations” (tools/teams) they pass through, and where they silently dead‑end.
We’ll connect this idea to established reliability tools—Reliability Block Diagrams (RBDs), Fault Tree Analysis (FTA), and gemba‑style observation—and show how combining them surfaces hidden failure paths you won’t see in logs or code.
Why You Need a “Subway Map” for Incidents
Most teams have some version of:
- Monitoring configs in one repo
- Alert routing rules in another
- Runbooks in a wiki
- Escalation policies in an ops tool
Each artifact describes a slice of reality, but none tells the whole story. When something breaks, we scramble through these layers, trying to reconstruct the causal chain in our heads.
A subway map captures this as a single, visual narrative:
- Where failures start (alert sources)
- How they travel (alerting routes and transformations)
- Where they branch or stall (routing logic, filters, silencing rules)
- How and when humans enter the picture (escalations, on-calls, handoffs)
Once you see the full network of signals and paths, you begin to notice:
- Important components with no alert coverage
- Alerts that fire but never reach a responsible human
- Escalation paths that are theoretically defined but practically broken
- Overlapping alerts that create noise instead of clarity
That’s where reliability diagramming tools come in.
Layer 1: Reliability Block Diagrams – Who Keeps the Lights On?
A Reliability Block Diagram (RBD) visually represents how component reliability affects the overall system. Blocks in series represent components where any failure breaks the whole; blocks in parallel represent redundancy.
For an incident subway map, RBDs give you the infrastructure backbone:
- Draw major services, dependencies, and critical components as blocks.
- Connect them to show how user-visible functionality depends on them.
- Highlight single points of failure and critical redundancy.
Example (textual):
User Request → API Gateway → Auth Service → App Service → DB Cluster
(with a parallel branch: App Service → Cache Cluster)
In your subway map, this RBD layer is the base transit network: which tracks even exist for failures to travel.
Use this layer to answer:
- What must remain healthy to deliver our key user journeys?
- Where would a local failure propagate into a system-wide outage?
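The series/parallel arithmetic behind an RBD is simple enough to script, which is handy when you want to see how much a redundant branch actually buys you. Here is a minimal sketch, assuming made-up reliability numbers and (purely for illustration) treating the cache as a redundant read path next to the DB:

```python
# Minimal reliability-block arithmetic for the path above.
# Series blocks multiply; parallel (redundant) blocks combine on failure probability.
# The per-component numbers are made up purely to illustrate the calculation.
from math import prod

def series(*r: float) -> float:
    """All blocks must work: R = R1 * R2 * ... * Rn."""
    return prod(r)

def parallel(*r: float) -> float:
    """At least one block must work: R = 1 - (1 - R1)(1 - R2)...(1 - Rn)."""
    return 1 - prod(1 - x for x in r)

gateway, auth, app, db, cache = 0.999, 0.998, 0.995, 0.999, 0.990

# Strict series: any single failure breaks the user journey.
strict_series = series(gateway, auth, app, db)

# Treating the cache as a redundant read path alongside the DB (for illustration).
with_parallel_read = series(gateway, auth, app, parallel(db, cache))

print(f"strict series:        {strict_series:.5f}")
print(f"with parallel branch: {with_parallel_read:.5f}")
```

Even toy numbers make the point: every series block drags the journey's reliability down, while the parallel pair contributes almost no unavailability at all.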
Layer 2: Alert Route Diagramming – How Do Signals Actually Travel?
Once you understand which blocks matter, the next step is to diagram how alerts move when something fails.
Most teams have a surprising amount of complexity here:
- Metrics → Alert manager → Notification router → Chat/Email/Pager
- Logs → SIEM → Correlation rules → Incident tool
- Synthetic checks → External provider → Webhook → On‑call system
When you draw these as explicit paths, hidden relationships pop out.
What to Map
For each critical component in your RBD:
- Alert Sources
  - Metrics, logs, traces, synthetic checks, health endpoints.
- Transformation Steps
  - Alert rules, correlation engines, deduplication, noise filters.
- Routing and Escalation
  - On‑call rotations, escalation policies, fallback channels.
- Human Interaction Points
  - Who sees what first? Who can acknowledge, escalate, or close?
This is where the “subway” metaphor shines: draw each alert source as a line, each tool/team as a station, and escalations as transfers.
Why Diagramming Alert Routes Matters
- You see alerts that dead‑end (no configured recipient, a disabled channel); the sketch at the end of this section shows how explicit routes make these easy to find.
- You catch route splits where a single event fans out into mixed, confusing signals.
- You identify latent delays, e.g. an alert that goes to email first and only pages 30 minutes later.
Diagramming this makes it much easier to safely customize and improve your escalation paths, because you can:
- Simulate proposed changes on paper.
- Validate that high‑severity alerts always have a clear, timely path.
- Ensure that lower‑priority noise doesn’t share the same line as critical signals.
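Once the routes are drawn, they are just a directed graph, and questions like "does this alert ever reach a human?" become mechanical. Here is a minimal sketch; every source, tool, and channel name in it is hypothetical:

```python
# A tiny directed graph of alert routes: source -> transformation -> channel.
# Every name here is illustrative, not a real configuration.
routes = {
    "cpu_high":        ["alertmanager"],
    "checkout_errors": ["alertmanager"],
    "cert_expiry":     ["email_ops"],        # this one only goes to email
    "alertmanager":    ["pagerduty", "slack_ops"],
    "pagerduty":       [],                   # terminal: a human gets paged
    "slack_ops":       [],                   # terminal: a (possibly noisy) channel
    "email_ops":       [],                   # terminal: an unwatched inbox
}

# Stations where a responsible human reliably sees the signal.
human_endpoints = {"pagerduty"}

def reaches_human(node: str, seen=None) -> bool:
    """Depth-first walk: does any path from this node hit a human endpoint?"""
    seen = seen if seen is not None else set()
    if node in human_endpoints:
        return True
    seen.add(node)
    return any(reaches_human(nxt, seen)
               for nxt in routes.get(node, []) if nxt not in seen)

for alert in ("cpu_high", "checkout_errors", "cert_expiry"):
    verdict = "reaches a human" if reaches_human(alert) else "DEAD END: no one is paged"
    print(f"{alert:16} {verdict}")
```

The same few lines let you replay a proposed routing change on paper before touching the real configuration.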
Layer 3: Making Hidden Connections Explicit
So much of alerting and escalation logic lives in code or opaque configuration:
- Terraform modules for alert managers
- YAML files for rules and routes
- JSON policies in incident tools
Individually, each file “makes sense.” Collectively, they can create emergent behavior that no engineer has fully visualized.
By showing explicit connections between:
- Monitors / rules
- Services / components
- Teams / on‑call roles
- Channels / tools
…you reveal:
- Orphan alerts that don’t correspond to any currently owned component.
- Orphan components with no direct alert coverage.
- Implicit dependencies (e.g., one team’s alerts always wake another team first).
Your subway map becomes a living document of reality, not just configuration intent.
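To make the orphan checks concrete: if you can export which component each alert points at and who owns each component, the cross‑check is mechanical. A minimal sketch with placeholder names; in practice the two inventories would come from your monitoring rules and service catalog:

```python
# Cross-check two inventories you already have in config form:
# which component each alert claims to cover, and which team owns each component.
# All names here are illustrative placeholders.

alert_targets = {
    "HighErrorRate":  "checkout-api",
    "DiskAlmostFull": "legacy-batch",     # component nobody owns any more
    "LatencyP99High": "checkout-api",
}

component_owners = {
    "checkout-api": "team-payments",
    "auth-service": "team-identity",      # owned, but no alert points at it
    "legacy-batch": None,                 # explicitly unowned
}

covered = set(alert_targets.values())

orphan_alerts = [a for a, c in alert_targets.items()
                 if component_owners.get(c) is None]
orphan_components = [c for c, owner in component_owners.items()
                     if owner is not None and c not in covered]

print("Alerts pointing at unowned/unknown components:", orphan_alerts)
print("Owned components with no alert coverage:      ", orphan_components)
```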
Layer 4: Fault Tree Analysis – Where Do Incidents Really Come From?
If RBDs show how systems stay up, Fault Tree Analysis (FTA) shows how they go down.
An FTA diagram starts from a top event (e.g., “checkout unavailable”) and works backward:
- Decompose into intermediate events (API fails, DB rejects writes, etc.).
- Combine them with logic gates (AND/OR) to show required combinations.
- Identify basic events at the leaves (specific component or process failures).
When you merge FTA with your subway map, you can:
- Trace each failure path to see which alerts currently fire (if any).
- Identify critical failure paths that lack early warning.
- Prioritize work on the failure modes most likely to cause major incidents.
For example, you might discover that a top‑level outage only occurs when two things go wrong together:
- A silent cache failure, and
- A background job slowdown
…and that neither of them currently has a page‑worthy alert.
This is where you shift from “we drew a map” to “we know which tracks most need maintenance.”
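A fault tree this small can be written directly as logic gates, which is a quick way to check which combinations of basic events produce the top event without any page firing. The sketch below is illustrative, not a real checkout model:

```python
# A miniature fault tree for the example above. The top event needs an AND of
# two basic events, or a third event on an OR branch. Names are illustrative.

def AND(*events: bool) -> bool:
    return all(events)

def OR(*events: bool) -> bool:
    return any(events)

def checkout_unavailable(cache_failed_silently: bool,
                         background_job_slow: bool,
                         db_rejects_writes: bool) -> bool:
    # Top event: the DB rejecting writes is enough on its own (OR branch);
    # otherwise it takes the silent cache failure AND the slow job together.
    return OR(db_rejects_writes,
              AND(cache_failed_silently, background_job_slow))

# Only the DB event pages anyone today (in this made-up setup).
paged_basic_events = {"db_rejects_writes"}

# Enumerate the cache/job combinations with the DB healthy and flag silent outages.
for cache in (False, True):
    for job in (False, True):
        if checkout_unavailable(cache, job, db_rejects_writes=False):
            print(f"Outage with cache_failed={cache}, job_slow={job}: "
                  f"no page fires, because neither event is in {paged_basic_events}")
```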
The Power of a Consistent Visual Legend
All of this only works if people can read the map.
Define a simple, consistent visual language:
- Shapes
  - Rectangles: components/services
  - Circles: alerts or events
  - Diamonds: decision points (routing/escalation logic)
  - Parallelograms: human actions or handoffs
- Colors
  - Red: failure states
  - Green: healthy states
  - Orange: degraded or at‑risk
  - Blue: tools/platforms (monitoring, paging, chat)
- Lines
  - Solid: synchronous dependency or immediate notification
  - Dashed: asynchronous or delayed path
  - Double‑line: high‑criticality route
Using the same legend across RBDs, alert routes, and FTA diagrams means:
- Engineers, SREs, and incident managers share a common visual vocabulary.
- New team members can ramp up by “reading the map” instead of reading every config.
- Post‑incident reviews can point to specific symbols and paths, avoiding confusion.
Print the legend at the edge of every diagram. Make it boringly consistent.
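If you eventually redraw the paper map in a tool such as Graphviz, the legend can live in one shared style table so every diagram renders identically. A minimal, hypothetical sketch that emits DOT node declarations:

```python
# Encode the legend once as Graphviz DOT attributes, so a digitized map can't
# drift from the paper legend. Shapes and colors mirror the list above; the
# node names are placeholders.

NODE_SHAPES = {
    "component": "box",            # rectangles: components/services
    "alert":     "circle",         # circles: alerts or events
    "decision":  "diamond",        # diamonds: routing/escalation decisions
    "human":     "parallelogram",  # parallelograms: human actions or handoffs
}

STATE_COLORS = {
    "failure": "red",
    "healthy": "green",
    "degraded": "orange",
    "tool": "blue",
}

def dot_node(name: str, kind: str, state: str) -> str:
    """Emit one DOT node declaration using the shared legend."""
    return f'  "{name}" [shape="{NODE_SHAPES[kind]}", color="{STATE_COLORS[state]}"];'

print("digraph incident_map {")
print(dot_node("Checkout API", "component", "degraded"))
print(dot_node("HighErrorRate", "alert", "failure"))
print(dot_node("Escalate to on-call?", "decision", "healthy"))
print(dot_node("IC declares incident", "human", "healthy"))
print("}")
```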
Gemba: Go Where the Incidents Actually Happen
There’s a lean manufacturing concept called gemba—"the real place" where work actually happens. In software reliability, the gemba is:
- The incident channel during a live outage
- The on‑call engineer’s laptop at 3 a.m.
- The handoffs between teams in the first 30 minutes of an incident
Watching real work as it happens surfaces:
- Steps people take that are missing from any runbook
- Workarounds that never appear in dashboards
- Delays caused by confusion over who owns what
- Tools people actually rely on, versus what the architecture diagram claims
Take your early subway map and sit beside an on‑call engineer (or watch a recording) during an incident:
- Trace their actions as new lines and stations on the map.
- Mark where they jump tools, ask questions, or get stuck.
- Note where reality diverges from your diagrams.
You’ll quickly find:
- “Shadow routes” not represented in any official system (e.g., DMing a specific expert).
- Places where no one notices an alert because the chat channel is noisy.
- Repeated manual checks that should be automated or monitored.
Gemba turns your map from a design artifact into a field‑tested reliability model.
Combining Maps, Data, and Observation
The strongest incident models come from three intersecting views:
- Visual failure maps
  - RBDs for system structure
  - Alert route diagrams for signal flow
  - FTA for failure paths
- Real‑world data
  - Historical incident timelines
  - Alert fire/ack/resolve patterns
  - MTTA/MTTR (mean time to acknowledge/resolve) per path or team, computed as in the sketch after this list
- Gemba‑style observation
  - Watching on‑call behavior
  - Shadowing incident commanders
  - Debriefing with responders post‑incident
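For the MTTA/MTTR piece, a short script over a flat incident export is usually enough to see which routes are slow. A minimal sketch; the routes, field names, and timestamps are all illustrative:

```python
# Compute mean time to acknowledge (MTTA) and resolve (MTTR) per alert route,
# from a flat incident export. Field names, routes, and timestamps are made up.
from datetime import datetime
from collections import defaultdict
from statistics import mean

incidents = [
    {"route": "pagerduty", "fired": "2024-05-01T03:02", "acked": "2024-05-01T03:07", "resolved": "2024-05-01T03:50"},
    {"route": "pagerduty", "fired": "2024-05-03T14:10", "acked": "2024-05-03T14:12", "resolved": "2024-05-03T14:40"},
    {"route": "email",     "fired": "2024-05-02T09:00", "acked": "2024-05-02T10:45", "resolved": "2024-05-02T12:30"},
]

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

by_route = defaultdict(lambda: {"tta": [], "ttr": []})
for inc in incidents:
    by_route[inc["route"]]["tta"].append(minutes_between(inc["fired"], inc["acked"]))
    by_route[inc["route"]]["ttr"].append(minutes_between(inc["fired"], inc["resolved"]))

for route, times in by_route.items():
    print(f"{route:10} MTTA={mean(times['tta']):6.1f} min  MTTR={mean(times['ttr']):6.1f} min")
```

If the email route's acknowledge time is measured in hours, that is a dashed, delayed line on the map that probably should not carry anything critical.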
When you overlay these three views, you can:
- Validate whether your critical paths see real traffic in incidents.
- Adjust your maps to reflect actual human behavior, not just intent.
- Prioritize work based on where incidents really come from, not just theoretical risks.
Your Incident Story Subway Map becomes:
- A teaching tool for new engineers
- A planning tool for reliability investments
- A diagnostic tool during post‑incident reviews
How to Start Your Own Paper Incident Story Subway Map
You don’t need specialized software. Start with:
- One key user journey (e.g., sign‑in, checkout, message send).
- A whiteboard or large sheet of paper.
- A small cross‑functional group (dev, SRE, incident manager).
Then:
- Draw an RBD for that journey’s critical path.
- Overlay all known alert sources and routes.
- Add failure paths using lightweight FTA thinking.
- Define and stick to a simple legend.
- Validate the map by walking through a recent real incident.
You’ll almost certainly discover at least one of:
- A critical component with no direct, actionable alert
- An alert that pings the wrong team first
- A failure path that is only detectable by users complaining
Those discoveries are your roadmap for the next set of reliability improvements.
Conclusion: Draw the Underground
Dashboards tell you what is happening. Logs tell you where it happened. Configs tell you what you meant to build.
A hand‑designed Incident Story Subway Map shows you how failure signals actually travel—through systems, tools, and people—on their way from first anomaly to resolved incident.
By combining:
- RBDs to map critical dependencies
- Alert route diagrams to expose and improve signal paths
- FTA to prioritize critical failure modes
- Consistent visual legends to foster shared understanding
- Gemba‑style observation to ground your model in reality
…you get a far more accurate, actionable picture of your reliability posture than any single tool can provide.
Sometimes the fastest way to fix your most modern systems is to pick up a pen and draw the underground lines your incidents ride on every day.