The Paper Runbook Observatory: Building a Wall-Sized Analog Map of How Your Incidents Actually End
How to turn messy, real-world incident response into a wall-sized, analog “paper runbook” that reveals what really happens, exposes gaps in your processes, and feeds back into better digital runbooks and faster recovery.
Digital runbooks are optimistic fiction.
They describe how incidents should unfold, step by step. But when things are on fire, humans improvise, skip steps, invent new ones, and bring in teams that were never on the original diagram.
The result: your official runbooks and your real incident behavior slowly drift apart.
This is where the Paper Runbook Observatory comes in—a wall-sized, analog map that makes the actual flow of your incidents visible. Think of it as a flight recorder turned into an enormous, shared visual workspace. It shows how incidents really start, who actually gets involved, what paths are common, and how they really end.
In this post, we’ll explore how to design and use a paper runbook observatory, why physical visual management still matters in an age of dashboards, and how this approach can feed directly into better digital runbooks, improved coordination, and faster incident recovery.
Why Build a Wall-Sized Paper Runbook?
It might sound counterintuitive in a digital-first world. You already have:
- Ticketing systems
- Pager alerts
- Chat transcripts
- Postmortems
- Digital runbook tools
So why drag everything back onto paper and tape it to a wall?
Because your brain sees patterns on a wall that it will never notice in a spreadsheet or JSON blob.
A wall-sized map:
- Compresses context: Dozens of incidents and paths are simultaneously visible.
- Forces prioritization: You can’t render every detail, so you capture the essential flow.
- Invites collaboration: People can walk up, point, argue, annotate, and learn together.
- Makes friction visible: Bottlenecks, loops, and awkward handoffs are obvious when they’re physically drawn.
This isn’t anti-digital. It’s an observatory: a way to step back and see how your incident response system actually behaves, then improve everything else (including your digital tooling) based on those observations.
Step 1: Collect Real Incident Paths, Not Ideal Ones
The observatory starts with data—but not the data you think.
You’re not diagramming the intended process from your official runbooks. You’re reconstructing the real paths taken during actual, recent incidents.
Sources to mine:
- Postmortems: Timelines, chat logs, decision points, escalations
- Pager/alert histories: Who got paged, in what order, with what effect
- Ticket systems: Ownership transfers, priority changes, resolution steps
- Incident review meetings: Unofficial “oh, we actually called X directly” moments
For each incident, identify:
- Trigger/entry point (what first indicated trouble?)
- Initial responder and team
- Major decisions and branch points
- Escalations or handoffs between teams
- Workarounds or “shadow” processes
- How the incident actually ended (rollback, patch, feature flag, hotfix, etc.)
You’re not trying to capture every mouse click. Focus on the meaningful steps that changed the state of the incident or shifted responsibility for it.
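If it helps to keep these reconstructions consistent before they hit the wall, a tiny structure is enough. Here is a minimal sketch in Python; all names and values (the `IncidentPath` class, `INC-042`, the teams) are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class IncidentPath:
    """One reconstructed incident path, ready to become sticky notes."""
    incident_id: str
    trigger: str                       # what first indicated trouble
    initial_responder: str             # person or team who picked it up
    decisions: List[str] = field(default_factory=list)    # branch points
    handoffs: List[str] = field(default_factory=list)     # "SRE -> DBA", etc.
    workarounds: List[str] = field(default_factory=list)  # shadow processes
    resolution: str = ""               # rollback, patch, feature flag, ...

# Example: a single incident reconstructed from its postmortem
inc_042 = IncidentPath(
    incident_id="INC-042",
    trigger="Latency alert on checkout service",
    initial_responder="SRE on-call",
    decisions=["Error rate still rising?", "Is rollback safe?"],
    handoffs=["SRE -> App Team", "App Team -> DBA"],
    workarounds=["DM'd DBA directly instead of paging"],
    resolution="Rollback of release 2024-11-03",
)
```

One flat record per incident is plenty; the point is a shared checklist of fields, not a database.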
Step 2: Use a Standard Visual Language (e.g., BPMN)
If you just start drawing boxes and arrows, you’ll quickly end up with a beautiful mess.
To keep the map readable and consistent across teams, use a standard notation like BPMN (Business Process Model and Notation) or a simplified variant. You don’t have to use every symbol—just enough to distinguish important categories.
A lightweight set might include:
- Rounded rectangles: Activities ("Run database diagnostics", "Notify on-call SRE")
- Diamonds: Decisions ("Error rate decreasing?", "Is rollback safe?")
- Circles: Start and end points ("User complaint received", "Incident closed")
- Swimlanes: Teams or roles (SRE, App Team, Security, Support, Management)
- Arrow styles:
- Solid arrows: Normal flow
- Dashed arrows: Out-of-band contact (e.g., "DM’d someone on Slack")
- Red arrows: Escalations or emergency interventions
Post a legend beside the map. The goal is that anyone walking up can quickly understand:
- What type of step they’re looking at
- Which team owns it
- How the incident moved from one party to another
Standard notation turns the wall from an art piece into a shared language about operations.
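To keep everyone using the same vocabulary, it can help to write the legend down once and generate card templates from it. A minimal sketch, with purely illustrative names, might look like this:

```python
# The wall's legend encoded as data, so card templates (and any later
# digitization) use exactly the same vocabulary. All names are illustrative.
LEGEND = {
    "activity":  {"shape": "rounded-rect", "example": "Run database diagnostics"},
    "decision":  {"shape": "diamond",      "example": "Is rollback safe?"},
    "start_end": {"shape": "circle",       "example": "User complaint received"},
}

ARROWS = {
    "normal":      {"style": "solid",  "meaning": "expected flow"},
    "out_of_band": {"style": "dashed", "meaning": "DM or side call"},
    "escalation":  {"style": "red",    "meaning": "emergency intervention"},
}

SWIMLANES = ["SRE", "App Team", "Security", "Support", "Management"]

# Print a one-page legend to pin next to the wall.
for element, spec in LEGEND.items():
    print(f"{element:<10} {spec['shape']:<14} e.g. {spec['example']}")
for arrow, spec in ARROWS.items():
    print(f"{arrow:<12} {spec['style']:<7} = {spec['meaning']}")
```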
Step 3: Turn Your Wall into an Incident Observatory
Now comes the fun part: build the map.
Use:
- Wide wall space (or multiple boards)
- Sticky notes or index cards for activities and decisions
- Colored tape or string for flows between steps
- Different colors for different teams or severities
Two common layouts work well:
1. By Lifecycle Stages
Organize horizontally by stage, such as:
- Detection
- Triage
- Diagnosis
- Mitigation
- Recovery
- Follow-up
Within each stage, place the actual steps you saw in your incidents, grouped under each team’s swimlane. Draw arrows to indicate the path each incident took through these stages.
2. By Incident Type
Alternatively, use sections of the wall for common incident categories:
- Performance degradation
- Outage / downtime
- Security event
- Data integrity issue
- Dependency failure
For each category, map multiple real incidents and their flows side by side. This makes it easier to spot how, say, security incidents follow very different paths from performance incidents.
Over time, as you add more incidents, the wall becomes an observatory:
- Thick clusters of arrows indicate common paths.
- Rare but painful paths stand out as odd branches.
- Places where incidents get "stuck" are visually cluttered or overconnected.
You now have a living model of how your system responds under pressure.
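If you go with the lifecycle layout, a small script can pre-sort an incident's steps into wall columns before you start cutting cards. This is a sketch only; the steps and teams below are made up, and the stage annotations are assumed to have been added by hand from the postmortem:

```python
from collections import defaultdict

STAGES = ["Detection", "Triage", "Diagnosis", "Mitigation", "Recovery", "Follow-up"]

# Each step is (stage, team, description), annotated by hand from postmortems.
steps = [
    ("Detection",  "SRE",      "Latency alert fired"),
    ("Triage",     "SRE",      "Checked checkout dashboard"),
    ("Diagnosis",  "App Team", "Traced slow query to new index"),
    ("Mitigation", "App Team", "Rolled back release"),
    ("Recovery",   "SRE",      "Confirmed error rate back to baseline"),
]

# Group into wall columns: one column per stage, one card per step.
columns = defaultdict(list)
for stage, team, description in steps:
    columns[stage].append(f"[{team}] {description}")

for stage in STAGES:
    print(stage)
    for card in columns.get(stage, []):
        print("  -", card)
```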
Step 4: Use Visual Management to Improve Communication
This giant paper map isn’t just a documentation artifact—it’s a visual management tool.
Visual management (from lean manufacturing and operations) is about making the state of work instantly visible. In incident management, that means:
- New hires can walk up and grasp how incidents typically move through the organization.
- Cross-team confusion becomes visible when you see unclear handoffs or looping escalations.
- Leadership can see where most of the time is spent by glancing at dense clusters of steps.
Enhance the wall with visual cues:
- Colored dots on steps that are frequent sources of confusion
- Clock icons next to long-duration stages
- Lightning bolts where temporary workarounds are repeatedly used
This visual vocabulary turns the wall into a conversation starter:
- “Why do so many incidents stall between SRE and Database?”
- “Why are we always improvising during initial triage for performance issues?”
- “Why does Security only show up at the very end of the flow?”
You’re not just documenting; you’re making coordination problems visible and discussable.
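Deciding where the dots and clock icons go doesn't have to be guesswork. A rough tally over a handful of timelines is enough; the durations and threshold below are assumptions for illustration:

```python
from collections import Counter

# Minutes spent at each (team, step), pulled by hand from a few incident timelines.
durations = [
    ("SRE",      "Initial triage",      12),
    ("SRE",      "Handoff to Database", 45),
    ("Database", "Identify slow query", 30),
    ("SRE",      "Handoff to Database", 50),  # same handoff, another incident
]

# Clock icons go on steps whose average duration exceeds a threshold you pick.
totals, counts = Counter(), Counter()
for team, step, minutes in durations:
    totals[(team, step)] += minutes
    counts[(team, step)] += 1

for key in totals:
    avg = totals[key] / counts[key]
    if avg > 30:  # arbitrary threshold; tune to your own data
        print("clock icon:", key, f"avg {avg:.0f} min")
```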
Step 5: Compare Reality vs. Runbooks
Now that you’ve mapped actual flows, it’s time for a reality check.
Pick a few critical runbooks and ask:
- Where does the official runbook diverge from the wall?
- Which steps are consistently skipped? Why?
- Which new steps appear on the wall but not in the runbook?
- Where do responders routinely “go off script” to succeed?
You’ll likely discover:
- Steps that are theoretical, not practical.
- Escalation paths that no one uses.
- Teams that are quietly doing key work but aren’t named in runbooks.
- Informal channels (DMs, side calls) that carry critical decisions.
This gap analysis is gold.
Instead of blaming responders for “not following process,” treat the wall as evidence that the process doesn’t match reality. Either fix the process, or update the runbook to reflect the real, effective behavior.
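The gap analysis itself can be as simple as two set differences. A sketch, with entirely hypothetical step names:

```python
# Steps as written in the official runbook vs. steps observed on the wall.
official = {
    "Acknowledge page",
    "Check service dashboard",
    "Escalate to App Team",
    "Open incident ticket",
}
observed = {
    "Acknowledge page",
    "Check service dashboard",
    "DM database on-call directly",
    "Roll back latest release",
}

print("In the runbook but never done:", official - observed)
print("Done but not in the runbook:  ", observed - official)
```

Either list is a prompt for a conversation, not a verdict.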
Step 6: Feed Insights Back into Digital Runbooks & Metrics
The observatory isn’t an end state. It’s a feedback engine.
From your wall, you can:
- Redesign runbooks to match the common, successful paths.
- Clarify ownership at ambiguous handoff points.
- Automate obvious steps (paging, routing, logging, dashboards) that show up in every incident path.
Then, connect these changes to hard outcomes:
- MTTA (Mean Time to Acknowledge): Can you shorten the early detection/triage stages by clarifying who owns the first response?
- MTTR (Mean Time to Resolve): Are there repeat bottlenecks at diagnosis or escalation that can be preempted or streamlined?
- Resilience: Do your redesigned paths reduce reliance on heroics or one specific expert?
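If you want a baseline before and after a redesign, both metrics fall out of three timestamps per incident. A minimal sketch, using made-up times:

```python
from datetime import datetime

def minutes_between(start: str, end: str) -> float:
    fmt = "%Y-%m-%d %H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 60

# (alert fired, first human acknowledged, incident resolved) -- illustrative times
incidents = [
    ("2024-11-03 14:02", "2024-11-03 14:09", "2024-11-03 15:40"),
    ("2024-11-10 09:15", "2024-11-10 09:31", "2024-11-10 10:05"),
]

mtta = sum(minutes_between(alert, ack) for alert, ack, _ in incidents) / len(incidents)
mttr = sum(minutes_between(alert, res) for alert, _, res in incidents) / len(incidents)
print(f"MTTA: {mtta:.0f} min, MTTR: {mttr:.0f} min")
```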
As you run new incidents with updated digital runbooks, periodically:
- Add their paths to the wall.
- See if the map is converging toward a smaller set of clear, efficient flows.
- Adjust again.
The cycle is:
Observe → Map → Discuss → Redesign → Run → Observe.
Step 7: Layer in Postmortem Findings to Spot Systemic Issues
Finally, enrich your observatory with postmortem insights.
For each incident on the wall, attach:
- Key contributing factors (e.g., “alert fatigue,” “missing dashboard,” “ambiguous ownership”)
- Notable mistakes and recoveries
- Learnings and action items
Use simple tags or icons stuck on the relevant steps. When you zoom out, you’ll start seeing patterns:
- The same contributing factor appears on many different incidents.
- The same workaround is used by different teams.
- Certain teams are frequent endpoints for “we don’t know what else to do.”
This turns the wall into more than a flow map—it becomes a map of systemic weakness and resilience. Instead of treating incidents as isolated fires, you see where your organization is structurally fragile.
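Counting the tags periodically keeps this honest. A sketch, with invented incident IDs and factor labels:

```python
from collections import Counter

# Contributing-factor tags attached to each incident on the wall.
tags_by_incident = {
    "INC-042": ["alert fatigue", "ambiguous ownership"],
    "INC-051": ["missing dashboard", "ambiguous ownership"],
    "INC-063": ["ambiguous ownership", "single expert dependency"],
}

factor_counts = Counter(tag for tags in tags_by_incident.values() for tag in tags)
for factor, count in factor_counts.most_common():
    print(f"{factor}: seen in {count} incidents")
```

When the same factor tops that list quarter after quarter, it stops being an incident detail and becomes an organizational priority.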
Bringing It All Together
A paper runbook observatory might feel surprisingly low-tech, but that’s exactly its strength.
By taking incident response out of the screen and onto a shared wall, you:
- Reveal the gap between theory and practice.
- Use visual management to make coordination problems obvious.
- Apply standard notation so complex, multi-team workflows become readable.
- Encourage cross-team learning and alignment around what actually works.
- Feed observations back into digital runbooks, reducing MTTA/MTTR and improving resilience.
- Combine postmortem insights with process flows to expose systemic, recurring issues.
If your incidents feel chaotic, if your runbooks feel ignored, or if every major outage turns into a brand new improvisation, try building your own paper observatory.
Sometimes the fastest way to understand a complex, socio-technical system is to step away from the dashboard, pick up a marker, and cover a wall with how your incidents actually end.