The Analog Incident Observatory Stairwell: Climbing Paper Steps Through a Live Outage Without Losing the Plot
How operational runbooks, cognitive load management, and analog-friendly procedures help incident teams navigate live outages without losing the plot—even when tools fail and stress peaks.
When everything is on fire, no one wants to read.
And yet, the teams that handle major incidents best are rarely the ones with the “smartest” individuals or the fanciest tools. They’re the ones that can still follow a story—the story of the incident—while everything around them is noisy, confusing, and emotionally charged.
A powerful way to keep that story intact is what I like to call the Analog Incident Observatory Stairwell: a metaphor for climbing a simple, visible, step-by-step path through chaos, guided by runbooks that work even when your tools don’t.
This post explores why operational runbooks and analog-friendly procedures matter so much in live outages, and how they help teams avoid losing the plot.
Why Runbooks Are the Foundation of Incident Response
An operational runbook is not just a checklist. It’s a narrative scaffold:
- It tells you where to start.
- It suggests what to observe.
- It defines clear decision points.
- It keeps everyone on the same page about what happens next.
Runbooks serve as the foundation of effective incident preparation and response because they:
- Codify experience: They turn hard-won tribal knowledge from past incidents into reusable guidance.
- Reduce ramp-up time: Newer engineers can contribute meaningfully without years of exposure to every failure mode.
- Standardize the basics: Common diagnostic and remediation steps don’t need to be reinvented at 2 a.m.
- Anchor communication: The incident commander, responders, and stakeholders can all refer to the same logical flow.
High-performing teams don’t assume everyone will “just know what to do.” They bake that knowledge into runbooks and keep them current.
The Real Gap: Preparation, Not Talent
It’s tempting to explain differences between strong and struggling incident teams in terms of raw skill: “We just need more senior engineers,” or “We need better tools.”
In reality, the performance gap is mostly about preparation:
- High-performing teams rehearse incidents (game days, drills, simulations).
- They design, test, and refine their runbooks.
- They create clear roles and expectations for who does what when things go wrong.
Struggling teams often have:
- Incomplete or outdated documentation.
- Runbooks that exist but are unused or unknown.
- No clear incident command structure.
- Heavy dependence on a few “heroes.”
During a live outage, everyone feels the pressure. The question isn’t, “Who is the smartest?” It’s, “Which team prepared a map before they got lost?”
Runbooks are that map—and they become even more critical once stress kicks in.
Stress Makes Smart People Do Dumb Things
A major incident is not a normal working environment. You’re dealing with:
- Customer pressure and potential financial impact.
- Leadership watching closely.
- Systems behaving unpredictably.
- Time compression: minutes feel like seconds.
Under these conditions, very predictable psychological responses kick in:
- Tunnel vision: People lock onto one hypothesis and ignore contradictory evidence.
- Shortened attention span: Multitasking becomes chaos; important details are dropped.
- Decision fatigue: The more micro-decisions required, the lower the quality of each subsequent decision.
- Fight-or-flight mode: Emotional reactivity increases; calm analysis decreases.
These responses can impair even highly skilled engineers.
Runbooks act as a cognitive prosthetic under stress:
- They reduce the number of decisions you have to make from first principles.
- They provide a structured path when your brain is tempted to jump around.
- They offer pre-validated sequences of actions that you can trust when your intuition is overloaded.
In other words, runbooks help you keep your head when your nervous system desperately wants to lose it.
Cognitive Load Is a First-Class Incident Metric
Most incident teams pay close attention to technical indicators:
- Error rates
- Latencies
- Saturation metrics
- Logs and traces
But during an incident, there’s another dimension that matters just as much: the team’s collective cognitive load.
When cognitive load is too high:
- Key steps get skipped.
- The same questions get asked repeatedly.
- People talk past each other.
- Incident timelines become impossible to reconstruct.
Managing cognitive load means:
- Limiting the number of parallel workstreams.
- Assigning clear roles (incident commander, scribe, comms lead, domain experts).
- Using runbooks to offload routine decision-making.
- Writing things down so the brain doesn’t have to retain every detail.
A well-structured runbook is not just a “technical checklist”; it’s an instrument for load balancing human attention. It tells the team:
- “Here is the minimal path through this type of incident.”
- “Here is what must be tracked and recorded.”
- “Here is when to stop, reassess, or escalate.”
When you treat cognitive load as a first-class concern, your runbooks naturally evolve from static documents into living guides for how people think together under pressure.
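One way to picture "writing things down so the brain doesn't have to" is a tiny scribe log. The sketch below is only an illustration: the `Timeline` class, its `note` and `render` methods, and the sample roles are hypothetical names, not part of any particular incident tool. It keeps timestamped entries and renders them in time order, which is the same thing a scribe does by hand on a whiteboard.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Timeline:
    """Hypothetical scribe log: an append-only list of (time, role, text)."""
    entries: list = field(default_factory=list)

    def note(self, role, text, at=None):
        # Default to the current UTC time; an explicit time lets the scribe
        # backfill something that was only written on paper at first.
        at = at or datetime.now(timezone.utc)
        self.entries.append((at, role, text))

    def render(self):
        # Sort by timestamp so backfilled notes land in the right place.
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(f"{at:%H:%M} [{role}] {text}" for at, role, text in ordered)


timeline = Timeline()
timeline.note("scribe", "Error rate rising on checkout",
              at=datetime(2024, 1, 1, 2, 5, tzinfo=timezone.utc))
timeline.note("incident commander", "Incident declared",
              at=datetime(2024, 1, 1, 2, 1, tzinfo=timezone.utc))
print(timeline.render())
```

The rendered text is plain enough to copy onto a whiteboard or print, so the record survives even if the tool that produced it does not.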
The Power of Analog-Friendly Procedures
Modern incident tooling is impressive: dashboards, chatops integrations, auto-remediation, AI-assisted triage. But in a truly gnarly outage, some or all of that may be compromised or unavailable.
That’s where the Analog Incident Observatory Stairwell comes in.
Imagine you’re in a high-rise building. The power fails. Elevators stop. Emergency lights flicker on. How do you get out? You take the stairwell:
- It’s low-tech.
- It’s predictable.
- It’s visible and physical.
Analog-friendly procedures are your incident stairwell—simple, robust paths that still work when your digital “elevators” are down.
Examples of analog-friendly practices:
- Printed runbooks for the highest-risk, highest-impact scenarios (e.g., widespread database failure, identity provider outage, total network partition).
- Whiteboard-based timelines and status boards to maintain shared awareness when tools lag or are unavailable.
- Paper-based checklists for incident command, ensuring no critical communication step is missed.
- Pre-agreed analog fallbacks: If chat is down, use phone bridges or SMS trees with clear activation rules.
Why this matters:
- It removes dependence on the very systems that may be failing.
- It provides a tactile, visual anchor when digital noise is overwhelming.
- It keeps the team aligned and oriented around a shared, physical representation of the incident path.
You don’t need to go fully Luddite. The point is not to replace tools, but to ensure the core of your response plan doesn’t collapse when tools do.
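The "pre-agreed fallback" idea is simple enough to write down as an ordered list. Here is a minimal sketch, assuming a team has agreed on the order chat, then phone bridge, then SMS tree; the `FALLBACK_ORDER` list and the `pick_channel` function are hypothetical names used only for illustration.

```python
# Assumed, pre-agreed order of communication channels (illustrative only).
FALLBACK_ORDER = ["chat", "phone bridge", "SMS tree"]


def pick_channel(available, order=FALLBACK_ORDER):
    """Return the first channel in the agreed order that is currently up.

    `available` maps channel name -> bool. Channels missing from the map
    are treated as down. Returns None when nothing on the list is usable.
    """
    for channel in order:
        if available.get(channel, False):
            return channel
    return None


# Chat is down, so the team moves to the phone bridge without debating it.
print(pick_channel({"chat": False, "phone bridge": True, "SMS tree": True}))
```

The value is less in the code than in the fact that the order was decided in advance: the same list can sit on a printed card next to the runbook.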
Designing Runbooks as “Paper Steps” You Can Actually Climb
Not all runbooks are created equal. Some are glorified wikis: dense, outdated, and ignored. To function as a usable stairwell in a live outage, they must be designed for action under stress.
Some principles:
1. Make the First Steps Frictionless
Runbooks should answer: “What do we do in the first 5 minutes?”
- Start with a simple trigger description (what symptoms suggest using this runbook?).
- Provide immediate actions: stabilize, stop further harm, gather key facts.
- Avoid long preambles or theory.
2. Keep Steps Atomic and Observable
Each step should describe a concrete, observable action, for example:
- “Run X query and paste results into the incident channel.”
- “Check dashboard Y and note error rate trend.”
- “If CPU > 80% for more than 5 minutes, go to Step 7; otherwise, go to Step 9.”
This supports clear handoffs and makes it easy to tell if a step is truly “done.”
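To show what "observable" can mean in practice, here is a rough sketch of the CPU rule from the example steps written as a check anyone could verify. The `CpuSample` type and the `cpu_sustained_above` and `next_step` functions are hypothetical, and the sketch assumes one sample per minute with no gaps in the data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuSample:
    minute: int          # minutes since the first sample (assumed contiguous)
    cpu_percent: float


def cpu_sustained_above(samples, threshold=80.0, duration_minutes=5):
    """True if CPU stayed above `threshold` for more than `duration_minutes` in a row."""
    run_start = None
    for sample in sorted(samples, key=lambda s: s.minute):
        if sample.cpu_percent > threshold:
            if run_start is None:
                run_start = sample.minute
            if sample.minute - run_start > duration_minutes:
                return True
        else:
            # Any dip to or below the threshold resets the streak.
            run_start = None
    return False


def next_step(samples):
    """Mirror the example rule: go to Step 7 if the condition holds, else Step 9."""
    return 7 if cpu_sustained_above(samples) else 9


sustained = [CpuSample(minute, 90.0) for minute in range(7)]
print(next_step(sustained))
```

Writing the rule this precisely forces the questions a stressed responder would otherwise have to answer on the spot: is it "more than" or "at least" 5 minutes, and does a brief dip reset the clock?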
3. Build in Decision Points and Branches
Runbooks shouldn’t assume a single path. Good ones include branching logic:
- “If database connections are saturated, follow the ‘Connection Storm’ branch.”
- “If external dependency Z is degraded, follow the ‘Third-Party Impact’ branch.”
These branches should be visually obvious and easy to follow—especially in printed or whiteboarded form.
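Branching logic like this can be kept as plain data and rendered into text that prints cleanly. The sketch below is one possible shape, not a standard format: the `RUNBOOK` dictionary, the step identifiers, the `escalate` step, and the `walk` and `render` functions are all assumptions made for illustration.

```python
# Hypothetical runbook: each step has an action and yes/no branches.
RUNBOOK = {
    "start": {
        "action": "Check whether database connections are saturated.",
        "branches": {"yes": "connection_storm", "no": "check_dependency"},
    },
    "check_dependency": {
        "action": "Check whether external dependency Z is degraded.",
        "branches": {"yes": "third_party_impact", "no": "escalate"},
    },
    "connection_storm": {"action": "Follow the 'Connection Storm' branch.", "branches": {}},
    "third_party_impact": {"action": "Follow the 'Third-Party Impact' branch.", "branches": {}},
    "escalate": {"action": "Stop, reassess, and escalate.", "branches": {}},
}


def walk(runbook, answers, start="start"):
    """Follow branches using recorded answers; return the step ids visited."""
    path = [start]
    current = start
    while runbook[current]["branches"]:
        current = runbook[current]["branches"][answers[current]]
        path.append(current)
    return path


def render(runbook):
    """Render the runbook as indented plain text suitable for printing."""
    lines = []
    for step_id, step in runbook.items():
        lines.append(f"[{step_id}] {step['action']}")
        for answer, target in step["branches"].items():
            lines.append(f"    if {answer}: go to [{target}]")
    return "\n".join(lines)


print(render(RUNBOOK))
print(walk(RUNBOOK, {"start": "no", "check_dependency": "yes"}))
```

Because every branch names its target explicitly, the printed version reads the same way as the data, and the path taken during an incident can be recorded as a short list of step names.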
4. Make Them Role-Aware
Highlight who is expected to act:
- Steps for incident commander (coordination, communication, decision-making).
- Steps for responders (diagnostics, mitigations).
- Steps for comms lead (stakeholder updates, status pages).
This allows team members to focus only on the steps relevant to their role and lowers cognitive load.
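As a small illustration of role-aware steps, the sketch below splits one shared step list into per-role cards. The `STEPS` content and the `checklist_by_role` function are made up for this example. Each card keeps the original step numbers, on the assumption that people will want to refer to "step 4" and mean the same thing.

```python
from collections import defaultdict

# Hypothetical shared runbook: (role, action) pairs in execution order.
STEPS = [
    ("incident commander", "Confirm roles and open the incident timeline."),
    ("responder", "Run diagnostics and report findings."),
    ("comms lead", "Post the first stakeholder update."),
    ("responder", "Apply the agreed mitigation."),
    ("incident commander", "Decide whether to escalate."),
]


def checklist_by_role(steps):
    """Group steps by role while preserving the shared step numbers."""
    grouped = defaultdict(list)
    for number, (role, action) in enumerate(steps, start=1):
        grouped[role].append(f"{number}. {action}")
    return dict(grouped)


for role, card in checklist_by_role(STEPS).items():
    print(role.upper())
    print("\n".join(card))
```

Each card is short enough to print on a single page, which fits the paper-based checklists described earlier.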
5. Test, Debrief, Iterate
Runbooks are hypotheses until tested:
- Use them in game days or simulations.
- After real incidents, conduct post-incident reviews specifically focused on how well the runbook supported the team.
- Capture pain points: missing branches, unclear wording, overly long sections.
The goal: each iteration makes the “paper steps” more trustworthy and more climbable next time.
Bringing It All Together
Live outages are not exams where you prove how smart you are. They’re environments that erode your attention, fragment your communication, and test your preparation.
Teams that consistently navigate incidents without losing the plot do a few things differently:
- They treat runbooks as core infrastructure, not optional documentation.
- They understand that the biggest performance lever is preparation, not heroics.
- They engineer for human cognition, not just for system resilience.
- They invest in analog-friendly stairwells—procedures that still work when the lights go out and the tools are on fire.
The next time you review your incident program, ask:
- If our main tools were unavailable, could we still coordinate a response?
- Are our runbooks usable under stress, or just readable at leisure?
- Do we actively manage cognitive load, or only system load?
If the answers make you uneasy, that’s not a failure—it’s an invitation. Start small: pick one high-impact failure mode, design a sharp, analog-friendly runbook, print it, drill it.
Then, when the next outage comes—and it will—you’ll have a stairwell to climb instead of a dark, spinning elevator shaft.
You won’t just keep the system alive. You’ll keep the story straight.