Rain Lag

The Analog Incident Subway Blueprint: How to Navigate Outages When Observability Goes Dark

When dashboards die and telemetry disappears, a room‑sized, paper “subway map” of your systems can become your last reliable observability layer. Here’s how to design, use, and maintain one so your team can still navigate incidents in the dark.

Introduction: When the Lights Go Out on Observability

Modern incident response assumes one thing: your observability stack is alive.

But what happens when:

  • Your monitoring cluster is down
  • Your cloud provider is having a “partial disruption” (read: chaos)
  • Your SSO or VPN fails and no one can reach dashboards
  • A network partition isolates your primary tools

In those moments, you don’t just lose metrics and traces—you lose orientation. Teams no longer share a common picture of what the system is, what’s broken, and what might break next.

That’s where an Analog Incident Subway Blueprint comes in: a room‑sized, paper network map that acts as a backup observability layer when everything digital goes dark.

Think of it as a physical, glanceable, subway‑style diagram of your infrastructure, applications, data stores, and critical external services—annotated with known failure modes and incident heuristics. It’s not a nostalgic gimmick; it’s a practical tool for high‑stress, low‑information situations.


Why Paper Still Matters in a Digital Incident Room

A wall‑sized paper map offers three advantages you don’t get from dashboards:

  1. It never goes down. No login, no network, no power required.
  2. It’s inherently shared. Everyone in the room is literally looking at the same picture.
  3. It anchors discussion under stress. In emergencies, people think and communicate better when they can point at something.

When your screens are blank or inaccessible, a map taped across the wall becomes your backup observability interface: not live telemetry, but live cognition—the place where people reconstruct what’s happening using whatever signals they still have.


Cognitive Maps: Learning from Firefighters

Firefighters are trained to build cognitive maps of burning buildings:

  • Where exits and stairwells are
  • Which walls are load‑bearing
  • Where fire is likely to spread next

They rarely see the whole building clearly, but they maintain an internal model that lets them act quickly with partial information.

Your engineers need something similar.

The goal of your analog subway blueprint is not perfect fidelity. It’s to help responders form and update a shared cognitive map of:

  • What components exist
  • How they relate
  • Where problems are most likely to propagate

The more often teams rehearse with the map, the more those cognitive maps sharpen. During a real outage, they’re not starting from zero; they’re filling in a known structure with new incident clues.


Step 1: Break the System Down by Dependency Type

A useful analog map starts with clear layers of dependencies. At minimum, separate:

  1. Infrastructure
    • Compute (nodes, clusters, autoscaling groups)
    • Network (VPCs, subnets, gateways, load balancers)
    • Storage (block, object, shared file systems)
  2. Applications / Services
    • User‑facing services (APIs, web apps, mobile backends)
    • Internal services (auth, billing, search, recommendation)
    • Batch / background workers
  3. Data Systems
    • Databases (SQL, NoSQL)
    • Caches, queues, event streams
    • Analytics / pipelines
  4. External Services
    • Third‑party APIs (payments, messaging, auth)
    • Managed SaaS dependencies (logging, email, CDNs)

On the wall, you might arrange these as horizontal bands from bottom (infrastructure) to top (user‑facing applications). This makes the map actionable during an outage because responders can say things like:

  • “If this database is degraded, what services in the band above it die?”
  • “If this region is unhealthy, which user flows at the top are affected?”

The goal is rapid blast‑radius estimation, not detailed architecture documentation.
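The band-by-band blast-radius question above can be sanity-checked while you draft the map. Here is a minimal sketch, assuming a hypothetical component inventory (all names are illustrative, not from any real system): dependencies are recorded as "X needs Y", and the blast radius of a failure is everything that transitively depends on it.

```python
# Hypothetical sketch: the map's horizontal bands as a dependency graph,
# so "if this database is degraded, what dies above it?" can be checked
# while drafting the paper map. All component names are illustrative.

# depends_on[X] = components X needs in order to function
depends_on = {
    "web-frontend":     ["api-gateway"],
    "api-gateway":      ["orders-service", "auth-service"],
    "orders-service":   ["orders-db", "payments-service"],
    "payments-service": ["payment-provider-api"],
    "auth-service":     ["users-db"],
}

def blast_radius(failed):
    """Everything that transitively depends on the failed component."""
    impacted = set()
    frontier = [failed]
    while frontier:
        current = frontier.pop()
        for component, deps in depends_on.items():
            if current in deps and component not in impacted:
                impacted.add(component)
                frontier.append(component)
    return impacted

print(sorted(blast_radius("orders-db")))
# ['api-gateway', 'orders-service', 'web-frontend']
```

A run like this is a drafting aid, not live tooling: if the script's answer surprises you, the paper map probably has a missing or misplaced line.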


Step 2: Apply Application Dependency Mapping—On Paper

Most organizations already have some kind of application dependency mapping in a tool: service to service, service to database, service to external API.

The analog blueprint borrows those concepts and flattens them into a subway‑style map:

  • Lines = critical business flows (e.g., “checkout workflow”, “user login”, “data ingestion”)
  • Stations = key components along the way (services, databases, queues, third‑party APIs)
  • Interchanges = shared dependencies used by multiple lines (auth, user profile, payments)

For example, the “checkout line” might visually pass through:

Web Frontend → API Gateway → Orders Service → Payments Service → Payment Provider API → Orders DB → Notifications Service

By laying multiple lines on the same map, you can instantly see:

  • Where the map is dense with intersections (high‑risk shared dependencies)
  • Which paths have no redundancy
  • Which external services sit directly on critical lines

Under stress, responders should be able to glance at the map and think: “If this station is on fire, which lines shut down?”
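The subway metaphor itself can be expressed as data while you lay out the wall map. The sketch below is an assumption-laden illustration (the line and station names echo the checkout example but are invented): lines are business flows, stations are components, and an interchange is any station shared by two or more lines.

```python
# Illustrative model of the subway metaphor: lines = business flows,
# stations = components. All names here are assumptions for the example.
from collections import defaultdict

lines = {
    "checkout": ["web-frontend", "api-gateway", "orders-service",
                 "payments-service", "payment-provider-api",
                 "orders-db", "notifications-service"],
    "login":    ["web-frontend", "api-gateway", "auth-service", "users-db"],
    "ingest":   ["event-stream", "ingest-workers", "analytics-db"],
}

def lines_through(station):
    """Which business flows shut down if this station is on fire?"""
    return [name for name, stations in lines.items() if station in stations]

def interchanges():
    """Stations shared by multiple lines: the high-risk shared dependencies."""
    seen = defaultdict(list)
    for name, stations in lines.items():
        for station in stations:
            seen[station].append(name)
    return {s: used_by for s, used_by in seen.items() if len(used_by) > 1}

print(lines_through("api-gateway"))   # ['checkout', 'login']
print(sorted(interchanges()))         # ['api-gateway', 'web-frontend']
```

The `interchanges()` output is exactly what you would draw largest on the wall: the dense intersections where a single failure takes out several lines at once.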


Step 3: Prioritize the Most Consequential Dependencies

A room‑sized map is not the place to represent every cron job and debug service.

You want consequence‑driven design:

  • Include the components whose failure would:
    • Take down key user journeys
    • Corrupt or delay critical data
    • Break regulatory or security controls
  • Omit or collapse low‑impact details into broader nodes (e.g., “auxiliary workers cluster”).

Use visual emphasis to signal importance:

  • Thicker lines for core business flows
  • Larger stations for high‑impact components
  • Color coding for severity tiers (e.g., red = single point of failure)

The objective is speed: in the first five minutes of an incident, responders should be able to trace probable failure paths and estimate blast radius without paging through documentation.


Step 4: Encode Observability Knowledge Directly Onto the Map

Even when your observability system is down, you still know something about your system’s behavior. That knowledge can be pre‑encoded onto the paper map.

Here are useful layers to add:

1. Common Failure Modes

Next to each major station, add a small legend:

  • “Typical failures: connection exhaustion, throttling, stale cache, DB lock contention”
  • “Frequent issues: TLS errors from provider, regional latency spikes”

These notes act as cognitive shortcuts when live error dashboards are unavailable.

2. Anomaly Patterns

Use subtle icons or labels for known patterns, such as:

  • “Often fails after deploys”
  • “Sensitive to traffic spikes”
  • “Impacted when Region A is unhealthy”

This helps responders think in hypotheses: “Could this be another post‑deploy cache desync?”

3. Alert Flows (When They Work)

Even if your alerting system is currently dark, document what normally alerts on what:

  • Which SLOs or alerts are associated with each station
  • Which teams own and respond to them

If someone has partial access (e.g., via a laptop still on VPN), they can cross‑reference what’s failing with where it would show up—using the paper map as their navigation guide.
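One way to keep these annotation layers consistent across reprints is to store them as data and generate the printed legends from it. This is a hedged sketch, not a prescribed format; every station, failure mode, and team name below is a made-up example.

```python
# Hypothetical sketch: map annotations (failure modes, anomaly patterns,
# normal alert coverage, owners) kept as data so the printed legends can
# be regenerated whenever the map is updated. All entries are examples.

annotations = {
    "orders-db": {
        "failure_modes": ["connection exhaustion", "DB lock contention"],
        "patterns": ["sensitive to traffic spikes"],
        "alerts": ["orders-db latency SLO"],
        "owner": "storage-team",
    },
    "payment-provider-api": {
        "failure_modes": ["TLS errors from provider", "throttling"],
        "patterns": ["impacted when Region A is unhealthy"],
        "alerts": ["payment success-rate SLO"],
        "owner": "payments-team",
    },
}

def legend(station):
    """Render one station's annotation block for printing next to it."""
    a = annotations[station]
    return "\n".join([
        f"{station} (owner: {a['owner']})",
        "  typical failures: " + ", ".join(a["failure_modes"]),
        "  known patterns: " + ", ".join(a["patterns"]),
        "  normally alerts: " + ", ".join(a["alerts"]),
    ])

print(legend("payment-provider-api"))
```

Because the legends are generated, reviewing them becomes part of the same quarterly maintenance pass as the map itself, rather than a separate hand-lettering chore.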


Step 5: Make It Operable in the Heat of an Incident

An analog blueprint is only useful if it’s easy to update in real time.

Equip the incident room with:

  • Colored sticky notes (for current issues, hypotheses, and mitigations)
  • Dry‑erase markers (if laminated) to highlight suspected paths
  • A simple color language, e.g.:
    • Red sticky = confirmed impacted component
    • Yellow sticky = suspected issue / under investigation
    • Blue sticky = mitigation applied / temporary workaround

As the incident unfolds, the map becomes a living storyboard:

  • Teams mark affected stations and lines
  • Draw arrows for suspected propagation paths
  • Annotate decisions (“traffic shifted to Region B at 14:32”)

After the incident, you can photograph the annotated map and use it to reconstruct the timeline for post‑incident review.


Step 6: Maintain and Rehearse—Build Shared Mental Models

A map that’s outdated or unused is dangerous. Treat the analog blueprint as a first‑class operational asset:

  1. Regular Maintenance

    • Review quarterly (or after major architecture changes)
    • Add new critical services and retire old ones
    • Validate that dependency lines still reflect reality
  2. Rehearsal and Training

    • Run tabletop exercises with only the map, no dashboards
    • Practice “you lost observability—what do you do?”
    • Rotate facilitators from different teams to broaden ownership
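A facilitator can seed these tabletop exercises with a trivial scenario picker. This is one possible format, not a prescribed drill structure; the station list is invented for the example.

```python
# Assumption-level sketch: pick a random station, declare it failed, and
# ask the room to work the scenario using only the wall map. The seed
# makes a drill reproducible so different teams can run the same one.
import random

stations = ["orders-db", "api-gateway", "payment-provider-api",
            "auth-service", "event-stream"]

def drill_scenario(seed=None):
    rng = random.Random(seed)
    failed = rng.choice(stations)
    return (f"Scenario: {failed} is degraded and all dashboards are "
            f"unreachable. Using only the wall map, identify the affected "
            f"lines, estimate the blast radius, and propose mitigations.")

print(drill_scenario(seed=7))
```

Rotating both the seed and the facilitator keeps the drills from always exercising the same failure path.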

Over time, these rehearsals:

  • Build shared mental models across infra, app, and data teams
  • Reduce time spent arguing about what exists vs. what’s happening
  • Make people more comfortable working from incomplete information

When a real “observability dark” event arrives, the team isn’t staring helplessly at blank screens. They’re working from a familiar navigation tool they’ve practiced with.


Conclusion: Design for the Day Your Dashboards Disappear

Most organizations over‑optimize for the world where all their tools are online. The Analog Incident Subway Blueprint is a bet on the opposite scenario: assume your observability stack will fail when you need it most.

By investing in a room‑sized, paper map that:

  • Breaks your system into clear dependency types
  • Visualizes application and data flows in a subway‑style layout
  • Prioritizes the most consequential dependencies
  • Encodes observability knowledge and failure heuristics
  • Is regularly maintained and rehearsed with

…you create a robust, low‑tech observability layer that survives outages, auth failures, and vendor incidents.

When observability goes dark, your team shouldn’t be guessing in the void. They should be standing in front of a wall‑sized blueprint, tracing lines, marking stations, and navigating their way back to stability—together.
