Rain Lag

The Analog Incident Story Cabinet of Cartographers: Hand‑Drawing How On‑Call Engineers Really Navigate Failure

Hand‑drawn incident story‑maps reveal how on‑call engineers actually think and move through outages—often very differently from formal runbooks. This post explores why analog mapping matters, how to do it, and how it fits into a human‑centered operating model alongside SRE and Team Topologies.

When a real incident hits, it rarely follows the neat boxes and arrows of our official diagrams.

Instead, on‑call engineers improvise. They follow hunches, jump between tools, DM the one person who “just knows” a weird edge case, and slowly converge on what’s actually happening. The official architecture diagram might be pinned to Confluence—but the real map of how people navigate the failure lives in their heads.

This is where The Analog Incident Story Cabinet of Cartographers comes in.

It’s a practice of asking engineers to hand‑draw how they moved through an incident: what they saw, where they went, who they talked to, what they tried, and how they updated their understanding. These analog story‑maps reveal the real terrain of operations—messy, human, and incredibly valuable.


Shared understanding is the real incident superpower

We often talk about incident response in terms of:

  • Better tooling
  • Faster alerts
  • Tighter runbooks
  • Clearer roles

All of that matters, but during an incident, one thing matters more than anything else:

Can the team build and maintain a shared understanding of what’s going on?

This is close to what emergency‑management and military practitioners call a common operational picture, and to what cognitive systems researchers call common ground. In practice, it looks like:

  • Everyone roughly agrees on what is broken and for whom
  • People share a plausible hypothesis about what’s causing it
  • There’s alignment on what we’re trying next and why

When this shared understanding is weak, you see:

  • Multiple people debugging the same thing in parallel
  • Conflicting explanations shouted into Slack
  • Individuals going off on solo debugging adventures
  • Confusion about whether we’re gaining or losing ground

Technical skill and procedures help, but only insofar as they support this evolving mental picture. The real work of incident management is continually updating and aligning mental models under pressure.


Mental models are invisible—until you draw them

Every on‑call engineer carries an internal map of:

  • How the system is wired together
  • Which components tend to fail and how
  • What “normal” looks like in each dashboard
  • Who to call for which kind of weirdness
  • What worked or failed in similar incidents before

These mental models drive:

  • Which logs they open first
  • Which metrics they trust
  • Which Slack channels they join
  • Which hypotheses feel “worth trying”

The problem: these models are mostly invisible.

You can’t see that one engineer thinks the issue is always “database first,” while another thinks “network first.” You don’t notice that half the team believes a component is stateless while the other half thinks it holds critical session data—until that mismatch blows up coordination in the middle of an outage.

Making these models visible—through talk, drawings, and shared artifacts—is one of the fastest ways to improve coordination and reduce stress during incidents.

And that’s exactly what analog incident story‑maps are designed to do.


What is the Analog Incident Story Cabinet of Cartographers?

Think of it as a recurring practice and a shared library.

  • Analog – Done by hand: pen, paper, whiteboard, sticky notes.
  • Incident Story – Not just the system; the narrative of how people moved through the incident.
  • Cabinet – A place (physical or digital) to keep these artifacts and revisit them.
  • Cartographers – The on‑call engineers and incident responders who draw the maps.

At its core:

After significant incidents, each key responder hand‑draws a map of how they actually navigated the failure.

Each map shows:

  • Entry point: How they first noticed something was wrong
  • Path: Where they went (dashboards, services, people, tools)
  • Forks: Decision points and discarded hypotheses
  • Landmarks: The clues that shaped their thinking
  • Outcome: What finally clicked and what they did about it

The result is a collection of raw, human‑scale maps that often look nothing like your formal architecture diagram or runbook.
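If your cabinet is digital, one optional way to make the maps searchable is to keep a small structured index entry next to each photo. The sketch below is only an illustration, assuming Python: the `StoryMap` and `Fork` names, the fields, and every value in the example are hypothetical, and the entry is meant to sit beside the drawing, not replace it.

```python
from dataclasses import dataclass, field


@dataclass
class Fork:
    """A decision point: one hypothesis and whether it was discarded."""
    hypothesis: str
    discarded: bool


@dataclass
class StoryMap:
    """Index entry for one hand-drawn map (hypothetical schema)."""
    responder: str
    entry_point: str                                     # how they first noticed
    path: list[str] = field(default_factory=list)        # dashboards, people, tools
    forks: list[Fork] = field(default_factory=list)      # decision points
    landmarks: list[str] = field(default_factory=list)   # clues that shaped thinking
    outcome: str = ""                                    # what finally clicked

    def dead_ends(self) -> list[str]:
        """Return the hypotheses this responder tried and dropped."""
        return [f.hypothesis for f in self.forks if f.discarded]


# Made-up example entry, transcribed from an imaginary drawing.
example = StoryMap(
    responder="responder-a",
    entry_point="checkout latency alert",
    path=["latency dashboard", "cache metrics", "#infra channel", "queue depth graph"],
    forks=[
        Fork("cache eviction storm", discarded=True),
        Fork("queue backlog", discarded=False),
    ],
    landmarks=["weird spike in queue depth"],
    outcome="scaled up queue consumers",
)

print(example.dead_ends())
```

Keeping the fields aligned with the five elements of the map (entry point, path, forks, landmarks, outcome) means the index stays a thin transcription of what was drawn.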


Why hand‑drawn maps (instead of just better diagrams)?

A natural question: why not just improve the official diagrams? Why insist on analog, hand‑drawn maps?

Because analog maps capture what digital diagrams usually miss:

  1. The narrative flow
    Digital diagrams show structure. Hand‑drawn maps show sequence: "I saw this, so I checked that, then I pinged them." This is how people actually move through incidents.

  2. The uncertainty and dead ends
    Engineers draw crossed‑out paths, big question marks, “weird spike??” notes. These are the edges of knowledge—where tools, docs, or understanding failed them.

  3. The informal system
    Lines to “#infra‑secrets” or “ask Maya” or “old Jira ticket from 2022” are not in your architecture diagram. They are in your incident map.

  4. The lived topology
    Over time, these maps show what the team feels is connected. Two services that “should be independent” but always appear side‑by‑side on incident maps are not independent in practice.

  5. Low friction, high honesty
    Hand‑drawing is fast and forgiving. People are less self‑conscious and more candid than when editing a polished diagram in Lucidchart or Figma.
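The "lived topology" idea can be checked roughly once a few maps have been transcribed. The following is a minimal sketch, assuming each map has been reduced by hand to the set of services drawn on it; the service names and the threshold of two maps are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def co_occurrence(maps: list[set[str]]) -> Counter:
    """Count how many maps each pair of services appears on together."""
    counts: Counter = Counter()
    for services in maps:
        # Sort so each pair is always stored in the same order.
        for pair in combinations(sorted(services), 2):
            counts[pair] += 1
    return counts


# Hypothetical transcriptions of three hand-drawn maps.
maps = [
    {"checkout", "cache", "queue"},
    {"checkout", "queue"},
    {"cache", "queue", "auth"},
]

pairs = co_occurrence(maps)

# Pairs that show up together on at least two maps are candidates for
# "independent on paper, coupled in practice".
frequent = sorted(pair for pair, n in pairs.items() if n >= 2)
print(frequent)
```

A count like this is only a prompt for discussion. The drawings and the conversation around them still carry the reasons two services keep showing up together.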


How to run an analog cartography session

You can start small. After a notable incident (even a minor one), do this within a day or two:

  1. Gather the responders
    3–8 people who were actively involved is ideal.

  2. Give them simple tools
    Paper, sticky notes, pens, markers. If you’re remote, use blank slides or a virtual whiteboard, but still encourage a hand‑drawn style.

  3. Ask each person to draw their own map
    Prompt them with:

    • "Start with how you first realized something was wrong."
    • "Draw what you looked at, who you talked to, and what you tried."
    • "Mark dead ends, confusing things, or moments where your understanding changed."

  4. Show, tell, and compare
    Each person walks through their map while others listen and annotate their own:

    • "I didn’t know that dashboard existed."
    • "Wait, you thought it was the cache? I was sure it was the queue."
    • "I didn’t realize that alert is always noisy for you."

  5. Identify patterns and gaps
    As a group, look for:

    • Repeated detours or blind alleys
    • Diverging assumptions about how components behave
    • Missing or hard‑to‑find tools and dashboards
    • Social dependencies ("We can’t fix this without Alex")

  6. Capture the meta‑insights
    Don’t just store the drawings. Write down a few bullets:

    • What surprised us?
    • What do these maps show that our runbooks/diagrams don’t?
    • What simple changes would have shortened these paths?

  7. Put the maps in the cabinet
    The "cabinet" might be:

    • A literal folder in your team space
    • A wiki page with photos of each map
    • A Notion/Confluence space titled "Incident Story Maps"

Repeat this practice periodically, not just for major outages. Over time, your cabinet becomes a visual history of how the system and the team evolved together.


How these maps change your operating model

The Analog Incident Story Cabinet of Cartographers doesn’t replace SRE or Team Topologies. It fits alongside them, reinforcing their core ideas.

With SRE practices

  • Error budgets & SLOs tell you when you’re in trouble.
  • Incident cartography helps you understand how you actually find your way out.

Patterns that emerge from story‑maps can feed back into:

  • Better alert design (surfacing the right landmarks earlier)
  • More realistic runbooks (reflecting the actual paths people take)
  • Tooling investments (automating the painful parts everyone keeps drawing)

With Team Topologies

Team Topologies emphasizes team interactions and flow of change. Story‑maps make these flows visible in incidents:

  • Where do we repeatedly pivot between teams?
  • Which teams are surprise dependencies in critical paths?
  • Where is cognitive load highest during an outage?

These maps often reveal that your "intended" team boundaries and your "operational" team boundaries are out of sync—an important signal for rethinking ownership or interfaces.
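One rough way to look for repeated pivots between teams is to label each stop on a map with the team that owns it and count the handoffs. This is a sketch under that assumption, in Python; the team names and paths are made up, and the labeling itself is a manual judgment call.

```python
from collections import Counter


def team_pivots(paths: list[list[str]]) -> Counter:
    """Count directed handoffs between different teams along each path."""
    pivots: Counter = Counter()
    for path in paths:
        # Compare each stop with the next one on the same map.
        for current, following in zip(path, path[1:]):
            if current != following:
                pivots[(current, following)] += 1
    return pivots


# Hypothetical paths: the owning team of each stop, in the order drawn.
paths = [
    ["payments", "payments", "platform", "payments"],
    ["payments", "platform", "data"],
]

pivots = team_pivots(paths)
print(pivots.most_common(1))
```

A handoff that shows up on most maps is worth raising when you discuss ownership or interfaces between those teams.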

Toward a human‑centered operating model

Most operating models are tool‑ and process‑centric:

  • Here is the diagram.
  • Here is the runbook.
  • Here is the escalation chain.

Analog incident cartography pulls the focus back to the human system:

  • How do people actually sense, interpret, and act?
  • Where does understanding break down?
  • Where is coordination fragile?

When you treat those questions as first‑class, you get an operating model that’s not just robust on paper, but resilient in practice.


Getting started: a minimal playbook

To introduce this in your organization, you don’t need a big program. Try this:

  1. Pick one recent incident that people still remember.
  2. Invite the responders to a 45–60 minute "mapping retrospective."
    Make it clear: this is not about blame; it’s about understanding.
  3. Run the hand‑draw exercise and comparison discussion.
  4. Publish the maps and a brief summary of what you learned.
  5. Iterate—do it again after the next significant incident.

If people find it useful, you can formalize it as a lightweight part of your incident review process.


Conclusion: Draw the map you actually walk

Every team has two maps of their system:

  1. The official, digital map: architecture diagrams, service catalogs, runbooks.
  2. The unofficial, mental map: the one engineers use at 3 a.m. when everything is on fire.

Most organizations invest heavily in the first and almost ignore the second.

The Analog Incident Story Cabinet of Cartographers is a way to bring those mental maps into the light—to honor the real routes engineers take through failure, and to learn from them.

When you do, you get:

  • Faster, calmer incident response
  • Clearer shared understanding under stress
  • More grounded improvements to tooling and process
  • A living, human‑centered operating model that evolves with your system

If you want to know how your system really works in production, don’t just open your architecture diagram.

Ask your engineers to pick up a pen and draw the story of how they got out of the last outage.

That’s the map you’re actually walking—and the one worth improving.
