The Analog Incident Cartographer’s Desk: Hand‑Drawing Reliability Maps Before Your Next Pager Storm

How hand‑drawn dependency maps and tabletop exercises can transform chaotic incident response into a disciplined, reliable practice—before your next pager storm hits.

There’s a special moment in every bad incident: the one right after every pager lights up.

It’s the moment when everyone’s talking, metrics are screaming, dashboards are red, and someone asks the hardest question in operations:

“What actually depends on what?”

If you can’t answer that in minutes, not hours, you don’t have an incident response problem—you have a dependency mapping problem.

This is where analog incident cartography—literally hand‑drawing reliability maps—and disciplined tabletop exercises come in. Together, they turn incident response from last‑minute firefighting into a deliberate, continuously improving capability.


Why Incident Tabletop Exercises Matter Before the Pager Storm

Incident response doesn’t start when production goes sideways. It starts in the conference room (or Zoom call) long before that, with tabletop exercises.

A tabletop is a simulated incident: you walk through a plausible failure scenario in real time, with the actual people who would respond, and see how your organization behaves.

Well‑run tabletops help you:

  • Test your playbooks: Do runbooks exist? Are they findable? Are they any good?
  • Reveal process gaps: Who’s on point for external comms? Who decides on customer‑impacting mitigations?
  • Expose tooling shortcomings: Do you have the right observability, logging, and access? Is everything behind five people’s personal SSH configs?
  • Stress communication paths: How do you coordinate across engineering, support, security, and leadership?

All of this happens without burning customer trust or revenue.

The key: simulate real‑world incidents, not textbook ones. Bring in realistic failure modes: partial region loss, degraded third‑party APIs, security signals that might be false positives, slow data corruption, a misconfigured feature flag in only one shard.

These exercises reliably uncover problems that only show up under pressure:

  • “We thought on‑call had database access—they don’t.”
  • “We have no idea what depends on this internal API.”
  • “Legal needs a heads‑up but isn’t in the incident channel.”

When the actual pager storm arrives, your organization shouldn’t be discovering these for the first time.


Dependency Mapping: More Than an Asset Inventory

Most organizations have lists:

  • Services
  • Databases
  • Queues
  • Third‑party providers
  • Teams

But during an incident, a list is almost useless. What responders need is structure.

A dependency map answers questions your asset list can’t:

  • What calls what?
  • Which services depend on this database?
  • If this third‑party goes down, which customer flows break?
  • If this team is offline, who can make which changes?

Where an asset inventory says “We have 40 microservices,” a dependency map says:

“Service A depends on B and C; B uses Database X; C relies on Third‑Party Y; both A and C serve the checkout experience for enterprise customers.”

This is the difference between:

  • Technical blast radius: which systems will fail.
  • Business blast radius: which customers, features, and revenue streams will feel it.
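
To make this concrete, here is a minimal sketch of how a map like the quoted one can be captured and queried. Every name and the structure itself are invented for illustration; the point is that a few lines of traversal turn “which systems fail?” into “which customer flows feel it?”.

```python
from collections import deque

# Hypothetical dependency map: "X depends on Y" means X degrades when Y does.
depends_on = {
    "service-a": ["service-b", "service-c"],
    "service-b": ["database-x"],
    "service-c": ["third-party-y"],
}

# Business annotations: which customer flows each component serves.
serves_flow = {
    "service-a": ["enterprise checkout"],
    "service-c": ["enterprise checkout"],
}

def blast_radius(failed: str) -> set:
    """Everything that transitively depends on `failed` (technical blast radius)."""
    # Invert the edges once so we can ask "who depends on me?"
    dependents = {}
    for svc, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    impacted, queue = set(), deque([failed])
    while queue:
        for parent in dependents.get(queue.popleft(), []):
            if parent not in impacted:
                impacted.add(parent)
                queue.append(parent)
    return impacted

technical = blast_radius("database-x")  # -> {"service-b", "service-a"}
business = {flow for svc in technical for flow in serves_flow.get(svc, [])}
print(technical, business)  # who fails, and who feels it
```

The traversal is the easy part; the real work, and the real value of drawing together, is making `depends_on` and `serves_flow` match reality.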

And critically, maps reveal hidden single points of failure:

  • That “small” internal auth service every product uses.
  • The one engineer who knows how to restart the legacy batch job.
  • The shared CI/CD pipeline that every team needs to ship an emergency fix.

Without a map, these remain invisible—until they fail.


The Case for Analog: Why Hand‑Drawing Maps Works Better Than You Think

You probably already have diagrams: auto‑generated architecture graphs, fancy service maps in your APM tool, maybe a CMDB. They’re useful—but they’re not enough.

When you’re deep in an incident, those maps often fail you because they’re:

  • Too dense: hundreds of edges, unreadable at a glance.
  • Too abstract: show technical links but not business impact or team ownership.
  • Too distant: created once, never updated, nobody trusts them.

Enter the analog incident cartographer’s desk.

Imagine a whiteboard. Sticky notes. Sharpies. A stack of index cards. The goal isn’t aesthetic perfection; it’s shared understanding.

Manually sketching reliability maps has distinct advantages:

  1. Forces intentional thinking
    When you draw a line between two systems, you have to ask: “Does this really depend on that? In which direction? Under what conditions?”

  2. Makes complexity tangible
    As the board fills up, people feel the weight of that tangle of arrows. It’s often the first time the team visually grasps how fragile certain paths are.

  3. Improves memory under pressure
    The physical act of drawing—arguing over arrows, rearranging sticky notes—locks the model into people’s heads. When production burns later, they can recall “that fragile path we circled in red.”

  4. Encourages cross‑functional collaboration
    Infrastructure, app teams, data, SRE, security, and support can all point at the same board. Disagreements surface and get resolved in real time.

  5. Surfaces non‑technical dependencies
    You can map people and process too: who can approve a failover, who has production access, which third parties gate critical flows.

Think of it less as drawing an architecture diagram and more as charting an incident map: roads, bridges, choke points, and critical infrastructure that keep your digital city running.


How to Run an Analog Mapping + Tabletop Session

You get the biggest payoff when you combine analog maps with tabletop exercises.

Here’s a lightweight approach.

1. Pick a Scenario That Actually Worries You

Start with something realistic and painful:

  • Primary database region degradation.
  • Major third‑party outage (payments, auth, DNS, email).
  • Widespread latency due to a network partition.
  • Security incident: suspicious exfiltration from a key system.

Write a one‑paragraph scenario, with enough detail to be plausible but open‑ended enough to explore.
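
For example (an invented scenario; adapt it to your own stack): “At 14:05 UTC, p99 latency on checkout triples. The primary database region reports rising replication lag, failover has never been exercised in production, and the payment provider’s status page is still green. Support is starting to see enterprise tickets.” That’s enough detail to argue about and enough ambiguity to explore.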

2. Build the First Draft Map Together

In the room (or on a virtual whiteboard), ask:

  • What’s directly involved in this scenario? (services, data stores, third parties)
  • What do those components depend on?
  • What customer flows rely on them?

Draw:

  • Boxes for services and data stores.
  • Different shapes or colors for third parties.
  • Lines with arrows for direction of dependency.
  • Annotations for:
    • Business criticality (e.g., revenue, SLAs)
    • Team ownership
    • Known risk areas (e.g., “single AZ,” “no failover,” “one maintainer”).

Don’t aim for perfection; aim for shared visibility. Expect lots of “Wait, does that actually depend on…?” and “We should check that later.” Capture those follow‑ups.
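
After the session, the sticky notes translate naturally into a tiny machine‑readable record. Here is a minimal sketch, with invented names and fields, that captures what the steps above say to draw: components, directed dependencies, and annotations for ownership, criticality, and known risks.

```python
# One record per box on the whiteboard; all names and fields are illustrative.
components = {
    "checkout-api": {
        "kind": "service",
        "owner": "payments-team",
        "criticality": "revenue/SLA",
        "depends_on": ["orders-db", "auth-svc", "3p:card-processor"],
        "risks": ["no failover for orders-db"],
    },
    "orders-db": {
        "kind": "datastore",
        "owner": "platform-team",
        "criticality": "revenue/SLA",
        "depends_on": [],
        "risks": ["single AZ", "one maintainer"],
    },
}

# The "we should check that later" pile, captured verbatim.
follow_ups = [
    "Does checkout-api really call auth-svc synchronously?",
    "Confirm orders-db backup restore time.",
]
```

Nothing fancier than a dict is needed at this stage; the format only has to outlive the photo of the whiteboard.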

3. Run the Tabletop Against the Map

Now walk through the incident as if it’s live:

  • Simulate monitoring alerts firing.
  • Ask: “What do you look at first?”
  • Use the map to trace potential blast radius:
    • “If this database is degraded, which services go noisy?”
    • “Who needs to be pulled into the incident channel?”
    • “What customer journeys are at risk?”

As you go, mark the map (see the sketch after this list):

  • Red lines for fragile dependencies.
  • Question marks where you’re unsure of behavior.
  • Stars for likely mitigation points (feature flags, failovers, graceful degradation).
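
If you record those marks alongside the map data (continuing the hypothetical `components` structure from the previous step), pulling up every red line or open question during the debrief is trivial:

```python
# Marks from the tabletop: "red" = fragile, "?" = unverified, "*" = mitigation
# point. Edge tuples and names are illustrative, matching the earlier sketch.
edge_marks = {
    ("checkout-api", "orders-db"): "red",
    ("checkout-api", "auth-svc"): "?",
    ("checkout-api", "3p:card-processor"): "*",
}

fragile = [edge for edge, mark in edge_marks.items() if mark == "red"]
unverified = [edge for edge, mark in edge_marks.items() if mark == "?"]
print("Fix first:", fragile)
print("Verify before the next tabletop:", unverified)
```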

You’ll quickly see:

  • Process gaps: nobody knows who can declare a customer‑visible incident.
  • Tooling gaps: missing dashboards, no synthetics, logs nobody knows where to find.
  • Org gaps: key teams or third parties with no clear contact path.

4. Turn Discoveries into Concrete Improvements

End the session by extracting actionable work:

  • Add or fix monitoring and alerts.
  • Create or update runbooks.
  • Clarify incident roles and escalation paths.
  • Reduce single points of failure (technical or human).

Then, capture a cleaned‑up digital version of the map (photo, wiki diagram, or light documentation), but keep the analog spirit: it’s a living tool, not a static artifact.
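
One low‑ceremony way to do that capture, sketched here against the hypothetical `components` structure from earlier, is to emit Graphviz DOT text: the source diffs cleanly in a wiki or repo, and `dot -Tsvg` regenerates the picture on demand.

```python
def to_dot(components: dict) -> str:
    """Emit Graphviz DOT for a components dict like the one sketched earlier.

    Render with: dot -Tsvg map.dot -o map.svg
    """
    lines = ["digraph reliability_map {"]
    for name, info in components.items():
        # Shape encodes kind; the label carries owner and criticality.
        shape = "cylinder" if info.get("kind") == "datastore" else "box"
        label = f'{name}\\n{info.get("owner", "?")} | {info.get("criticality", "?")}'
        lines.append(f'  "{name}" [shape={shape}, label="{label}"];')
        for dep in info.get("depends_on", []):
            lines.append(f'  "{name}" -> "{dep}";')
    lines.append("}")
    return "\n".join(lines)
```

Keeping the .dot source next to the runbook preserves the living‑tool property: anyone can edit the text and regenerate the diagram.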


Maintaining Your Maps Without Drowning in Updates

You don’t need a perfect, real‑time map of everything. You need a good‑enough, up‑to‑date map of what really matters.

Some practical rules:

  • Focus on critical paths: customer onboarding, checkout, authentication, data pipelines tied to SLAs.
  • Review maps on a cadence: quarterly or after major architecture shifts.
  • Update after incidents: if reality surprised you, the map was wrong—fix it.
  • Link maps to ownership: each map should clearly show which teams own which components (a lightweight check is sketched after this list).
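
If the maps live as files in a repo, the cadence and ownership rules above can even be linted in CI. A minimal sketch, assuming each map carries an `owner` string and a `reviewed` date (invented fields, not from any particular tool):

```python
import datetime

MAX_AGE_DAYS = 90  # roughly quarterly, per the cadence above

def lint_map(meta: dict) -> list:
    """Return complaints about a map's metadata; an empty list means it passes."""
    problems = []
    if not meta.get("owner"):
        problems.append("map has no owning team")
    reviewed = meta.get("reviewed")  # e.g. datetime.date(2024, 1, 15)
    if reviewed is None:
        problems.append("map has never been reviewed")
    elif (datetime.date.today() - reviewed).days > MAX_AGE_DAYS:
        problems.append(f"last review is older than {MAX_AGE_DAYS} days")
    return problems
```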

Over time, you build a modest library of reliable, human‑understandable maps for key flows and platforms. Not exhaustive, but trustworthy.


From Firefighting to a Disciplined Capability

When organizations skip mapping and tabletop work, incidents feel like:

  • Chaotic Slack channels.
  • Endless guess‑and‑check changes.
  • Surprises about what’s impacted.
  • Postmortems filled with phrases like “we didn’t realize” and “we assumed.”

With practiced tabletops and well‑maintained dependency maps, incidents look more like:

  • Faster triage: you know where to look and who to call.
  • Better prioritization: you focus on the highest business impact, not the noisiest alert.
  • More coordinated response: roles are known, handoffs are smoother, comms are clearer.
  • Continuous improvement: every exercise and incident feeds back into maps, playbooks, and architectures.

That’s the real payoff: incident response stops being a barely controlled bonfire and becomes a discipline—one your organization practices on purpose, not just survives.


Conclusion: Set Up the Cartographer’s Desk Before the Storm

You don’t need a new tool to get better at incidents. You need a marker, a whiteboard, and an hour with the right people.

Before your next pager storm:

  1. Pick a realistic, scary scenario.
  2. Gather the teams who would respond.
  3. Hand‑draw the dependency map—systems, services, people, and third parties.
  4. Walk through the incident as if it’s real.
  5. Turn everything you learn into updates: maps, playbooks, monitoring, ownership.

Make analog incident cartography a habit, not a one‑off exercise. When the real alerts start blaring, your responders won’t just have tools and dashboards—they’ll have a mental map of the terrain.

And that’s the difference between being lost in the storm and leading the way through it.
