Rain Lag

The Analog Incident Story Compass Tower: Stacking Paper Floors to Walk Through an Outage From Lobby to Rooftop

How to design incident post‑mortems like an architectural walkthrough—using an “Analog Story Compass Tower” to structure learning, accountability, and better tools from every outage.


When an outage hits, most teams focus on one goal: get it fixed, fast.

Then, right after the adrenaline fades, someone opens a blank document and types two cursed words: “Post-mortem template”.

Too often what follows is rushed, shallow, and treated as a bureaucratic checkbox. Yet this document is the only durable artifact that explains what happened, why it happened, and how you will avoid repeating it.

You wouldn’t design a skyscraper without drawings, or rebuild one after a collapse based on fuzzy memories. Incidents are no different.

This post introduces a mental model and a practical format: the Analog Incident Story Compass Tower, a way of visualizing and structuring post‑mortems as a stack of paper “floors” you can walk from lobby to rooftop. Along the way we’ll connect incident work to design disciplines such as architecture, software design, and visual design.


Why Post‑Mortems Must Be Treated as Essential Documents

A post‑mortem is not:

  • A blame artifact
  • A CYA report for management
  • A mandatory attachment for compliance

A good post‑mortem is:

  • The canonical record of what happened
  • A shared learning object for the whole organization
  • A map that future responders can walk when something similar breaks

If you treat post-mortems as paperwork to “get over with,” you lose:

  • Accurate memory – Details decay fast after an incident.
  • Root causes – Superficial summaries hide systemic issues.
  • Reusability – Sloppy narratives can’t be searched, reused, or taught.

High‑reliability fields such as aviation, medicine, and spaceflight are obsessive about these documents. They know that the cost of a careful narrative now is tiny compared to the cost of repeating the same failure later.

Your incident report should be something you’re proud to hand a new hire and say, “Walk this like a building. You’ll understand our systems, our tools, and how we think.”


The Compass Tower Metaphor: Walking the Outage Like a Building

Imagine every incident post‑mortem as a tower of paper floors laid out on a desk.

  • The Lobby is where you enter: a clear, human summary.
  • Each Floor is a distinct layer of understanding.
  • The Rooftop is where you step back: systemic fixes and design lessons.

You should be able to:

  • Start at the lobby and quickly understand what happened.
  • Walk floor by floor to see how and why it happened.
  • Reach the rooftop to learn what you changed and what you’ll design differently.

This is your Analog Story Compass: a repeatable structure that orients anyone who steps into the story—SREs, support engineers, executives, new hires, even auditors.

Let’s design this tower floor by floor.


Floor Plan: A Reusable Post‑Mortem Template

Below is a practical, reusable template you can adapt. Think of each section as a floor in your tower.

Lobby: Executive Story

Purpose: Provide a crisp, non-technical narrative anyone can understand.

Include:

  • Incident name & ID
  • Dates & times (with time zones)
  • Systems affected
  • Impact in plain language (e.g., “25% of users could not log in for 42 minutes.”)
  • Status (resolved, mitigating, under investigation)

This is the foyer: if someone reads only this, they should know what building they’re in.
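
As one possible way to keep the Lobby consistent across incidents, the fields above can be captured in a small record. The sketch below is only an illustration: the class name `LobbySummary`, its field names, and the `INC-0000` ID are assumptions, not part of any particular incident platform.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class LobbySummary:
    """Hypothetical record for the Lobby floor; all names are illustrative."""
    incident_id: str
    name: str
    started_at: datetime             # must carry a time zone
    resolved_at: Optional[datetime]  # None while the incident is still open
    systems_affected: List[str]
    impact: str                      # plain-language impact statement
    status: str                      # e.g. "resolved", "mitigating", "under investigation"

    def __post_init__(self):
        # Simple guard for "dates & times (with time zones)".
        for stamp in (self.started_at, self.resolved_at):
            if stamp is not None and stamp.tzinfo is None:
                raise ValueError("timestamps must carry a time zone")

    def duration_minutes(self) -> Optional[int]:
        """Whole minutes from start to resolution, or None if unresolved."""
        if self.resolved_at is None:
            return None
        return int((self.resolved_at - self.started_at).total_seconds() // 60)

    def render(self) -> str:
        """Return the summary as plain text, one field per line."""
        return "\n".join([
            f"{self.incident_id}: {self.name}",
            f"Started: {self.started_at.isoformat()}",
            f"Systems: {', '.join(self.systems_affected)}",
            f"Impact: {self.impact}",
            f"Status: {self.status}",
        ])


summary = LobbySummary(
    incident_id="INC-0000",  # placeholder ID
    name="Login outage",
    started_at=datetime(2024, 1, 1, 10, 14, tzinfo=timezone.utc),
    resolved_at=datetime(2024, 1, 1, 10, 56, tzinfo=timezone.utc),
    systems_affected=["auth-service"],
    impact="25% of users could not log in for 42 minutes.",
    status="resolved",
)
print(summary.render())
```

Rejecting naive timestamps at construction time is a design choice: it enforces the time-zone requirement before the summary is ever shared.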


Floor 1: Timeline Walkthrough

Purpose: Anchor the narrative in time.

  • Use timestamped events (e.g., “10:14 UTC – alert fired…”)
  • Note who did what, and what tools or dashboards they used
  • Include automations as actors (e.g., autoscaling, scripts)

This floor is like walking a corridor lined with clocks: you see the flow and pacing of the response.
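
A timeline like this is easy to keep as structured data and render on demand. The following is a minimal sketch, assuming hypothetical names (`TimelineEvent`, `render_timeline`) and made-up example events; it treats automations as actors in the same way as people.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class TimelineEvent:
    at: datetime    # timezone-aware timestamp
    actor: str      # a person, a team, or an automation such as "autoscaler"
    action: str     # what happened
    tool: str = ""  # dashboard or tool used, if any


def render_timeline(events):
    """Return 'HH:MM UTC - actor: action' lines, oldest event first."""
    lines = []
    for event in sorted(events, key=lambda e: e.at):
        stamp = event.at.astimezone(timezone.utc).strftime("%H:%M UTC")
        suffix = f" (via {event.tool})" if event.tool else ""
        lines.append(f"{stamp} - {event.actor}: {event.action}{suffix}")
    return lines


# Illustrative events, deliberately listed out of order.
events = [
    TimelineEvent(datetime(2024, 1, 1, 10, 20, tzinfo=timezone.utc),
                  "on-call engineer", "opened latency dashboard", "dashboard"),
    TimelineEvent(datetime(2024, 1, 1, 10, 14, tzinfo=timezone.utc),
                  "alerting", "alert fired"),
]
for line in render_timeline(events):
    print(line)
```

Sorting at render time means responders can jot events down in whatever order they remember them; mixing naive and timezone-aware timestamps will raise a `TypeError` during the sort, which is a useful early warning.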


Floor 2: Technical Anatomy

Purpose: Explain the system and failure mode in enough detail to teach.

  • Visual: one simple, labeled diagram of the system as it was during the incident
  • Key components involved (services, queues, databases, external APIs)
  • What degraded or failed (e.g., “Write latency on the primary DB spiked above 2 seconds.”)

Borrow from software and visual design here:

  • Use consistent shapes and colors across incidents.
  • Keep diagrams high-level but accurate.
  • Avoid clutter—this is not your entire architecture, just the relevant slice.

Floor 3: Causal Story (Not Just “Root Cause”)

Purpose: Show how contributing factors layered to create the outage.

Instead of one “root cause,” build a causal stack:

  • Triggering event – what set things off
  • Contributing conditions – e.g., missing alarms, risky deploy patterns
  • System characteristics – e.g., lack of backpressure, single point of failure
  • Organizational context – e.g., on-call fatigue, tooling gaps, unclear ownership

Use techniques from architecture and systems design:

  • Think in layers (foundation, structure, facade)
  • Ask: If we changed this layer, would the failure still have happened?

Your goal is a coherent story, not a culprit.
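
One way to make the causal stack concrete is to record each factor with its layer and with the answer to the counterfactual question above. This is a rough sketch under assumed names (`Layer`, `Factor`, `still_fails_if_changed`); the example factors are invented for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class Layer(Enum):
    TRIGGER = "Triggering event"
    CONDITION = "Contributing conditions"
    SYSTEM = "System characteristics"
    ORGANIZATION = "Organizational context"


@dataclass(frozen=True)
class Factor:
    layer: Layer
    description: str
    # Answer to: "If we changed this, would the failure still have happened?"
    still_fails_if_changed: bool


def group_by_layer(factors):
    """Return a dict mapping every Layer to its factors (possibly empty)."""
    stack = {layer: [] for layer in Layer}
    for factor in factors:
        stack[factor.layer].append(factor)
    return stack


def empty_layers(factors):
    """Layers with no recorded factor: a prompt to keep asking questions."""
    return [layer for layer, items in group_by_layer(factors).items() if not items]


# Illustrative factors only.
factors = [
    Factor(Layer.TRIGGER, "Config change deployed to all regions at once", False),
    Factor(Layer.SYSTEM, "No backpressure between API and queue", False),
]
print([layer.value for layer in empty_layers(factors)])
```

An empty layer is not proof that nothing contributed there; in this sketch it simply flags a floor of the story that nobody has examined yet.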


Floor 4: Response Design & Tooling

Purpose: Analyze how your tools and processes shaped the response.

Questions to answer:

  • How quickly were alerts noticed and understood?
  • Were dashboards and logs easy for the whole team to use, or only experts?
  • Did runbooks help, confuse, or get ignored?
  • What workarounds did responders invent on the fly?

Here you treat incident response as a UX design problem:

  • If the on‑call interface were a product, would you ship it?
  • How many clicks to see “Is it us or the provider?”
  • Can a new engineer reasonably orient themselves within minutes?

This is where you flag needs for better incident management tools and support resources.
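
The first question, how quickly alerts were noticed and understood, can be answered with simple arithmetic on timeline timestamps. Here is a small sketch; the function name, the dictionary keys, and the idea of marking a single "understood" moment are assumptions made for illustration.

```python
from datetime import datetime, timezone


def response_intervals(alert_fired, acknowledged, understood):
    """Return minutes from alert to acknowledgement and to understanding."""
    if not (alert_fired <= acknowledged <= understood):
        raise ValueError("timestamps must be in chronological order")

    def minutes(start, end):
        return (end - start).total_seconds() / 60

    return {
        "time_to_acknowledge_min": minutes(alert_fired, acknowledged),
        "time_to_understand_min": minutes(alert_fired, understood),
    }


# Illustrative timestamps only.
intervals = response_intervals(
    alert_fired=datetime(2024, 1, 1, 10, 14, tzinfo=timezone.utc),
    acknowledged=datetime(2024, 1, 1, 10, 17, tzinfo=timezone.utc),
    understood=datetime(2024, 1, 1, 10, 31, tzinfo=timezone.utc),
)
print(intervals)
```

A large gap between the two numbers suggests the alert reached someone quickly but the dashboards or runbooks did not help them make sense of it.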


Floor 5: Decisions & Tradeoffs

Purpose: Document the key choices made under pressure.

For each major decision:

  • What options were considered?
  • Why was this option chosen?
  • What constraints existed (risk tolerance, regulations, SLAs)?

This is architectural thinking: no building is perfectly safe, cheap, and beautiful at once. You document tradeoffs to make future decisions faster and clearer.


Rooftop: Changes, Experiments, and Long‑Term Design

Purpose: Step back above the building and look at your city.

Break this into:

  1. Immediate fixes – already done (config changes, rollbacks, new alerts)
  2. Short‑term actions (days–weeks) – small design improvements: better docs, refined dashboards, playbook updates
  3. Long‑term design work (months) – systemic projects: re-architecting a dependency, restructuring on-call, training programs

Explicitly link each action to a layer in your causal stack so you can see which floors are being reinforced.
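
Linking actions to causal layers can be checked mechanically. The sketch below assumes hypothetical layer keys and an `Action` record; it reports which layers of the causal stack have no action attached.

```python
from dataclasses import dataclass

# Hypothetical short keys for the four layers of the causal stack.
CAUSAL_LAYERS = ("trigger", "condition", "system", "organization")


@dataclass(frozen=True)
class Action:
    title: str
    horizon: str  # "immediate", "short_term", or "long_term"
    layer: str    # one of CAUSAL_LAYERS


def coverage(actions):
    """Map each causal layer to the titles of the actions that reinforce it."""
    by_layer = {layer: [] for layer in CAUSAL_LAYERS}
    for action in actions:
        if action.layer not in by_layer:
            raise ValueError(f"unknown causal layer: {action.layer!r}")
        by_layer[action.layer].append(action.title)
    return by_layer


def unreinforced_layers(actions):
    """Layers that no action addresses yet."""
    return [layer for layer, titles in coverage(actions).items() if not titles]


# Illustrative actions only.
actions = [
    Action("Roll back config change", "immediate", "trigger"),
    Action("Add alert on write latency", "short_term", "condition"),
    Action("Restructure on-call rotation", "long_term", "organization"),
]
print(unreinforced_layers(actions))
```

In this example the "system" layer comes back unreinforced, which is exactly the kind of gap the rooftop view is meant to expose.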


Tools as Part of the Architecture: Make Them Learnable

A beautifully structured post-mortem is useless if only two people can operate your incident tools.

When choosing and configuring incident management tooling, treat it like public infrastructure:

  • Learnability: Can a new team member be dangerous-but-useful within a day?
  • Consistency: Do alerts, runbooks, and dashboards follow familiar patterns?
  • Visibility: Can everyone see what’s happening in real time—chat, timeline, changes?

And critically, prioritize support resources from your tooling vendors and internal teams:

  • Clear documentation and quick-start guides
  • Short, focused tutorials and videos
  • Shared forums or channels where people ask and answer questions
  • Regular training sessions and incident drills

You are designing not just software, but a learning environment.


Incident Work as a Complex Design Problem

Outages are not just bugs; they’re complex design problems:

  • Technical: code paths, failovers, performance dynamics
  • Social: communication, roles, expectations
  • Cognitive: how people perceive, reason, and act under stress

Other disciplines have wrestled with similar complexity for decades:

  • Architecture: load paths, failure modes, safety margins
  • Software design: modularity, interfaces, observability
  • Visual design: hierarchy, contrast, readability under constraints

Borrow their practices:

  • Sketch: rough diagrams before precise metrics.
  • Iterate: refine your post‑mortem template after every few incidents.
  • Standardize: reuse visual patterns and language.
  • Critique: review post‑mortems like design reviews, not status reports.

If you treat incidents as design opportunities, your system—and your team—get structurally better over time.


Putting It Into Practice

To adopt the Analog Incident Story Compass Tower:

  1. Define your floors: Customize the sections above into a template.
  2. Print it once: Literally lay out the “floors” on a table after your next major incident. Walk through them with the team.
  3. Refine the blueprint: Remove sections no one uses, add what’s missing.
  4. Integrate into tools: Bake the template into your incident platform and knowledge base.
  5. Train on the tower: Teach new hires to read old incidents from lobby to rooftop.
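
Steps 1 and 4 can start very small: a list of floors and a function that prints an empty template. The sketch below is an illustration only, with assumed names (`FLOORS`, `render_template`) and a placeholder incident ID; a real incident platform would have its own template mechanism.

```python
# Floor titles and purposes follow the template described in this post.
FLOORS = [
    ("Lobby: Executive Story",
     "Provide a crisp, non-technical narrative anyone can understand."),
    ("Floor 1: Timeline Walkthrough",
     "Anchor the narrative in time."),
    ("Floor 2: Technical Anatomy",
     "Explain the system and failure mode in enough detail to teach."),
    ("Floor 3: Causal Story",
     "Show how contributing factors layered to create the outage."),
    ("Floor 4: Response Design & Tooling",
     "Analyze how your tools and processes shaped the response."),
    ("Floor 5: Decisions & Tradeoffs",
     "Document the key choices made under pressure."),
    ("Rooftop: Changes, Experiments, and Long-Term Design",
     "Step back above the building and look at your city."),
]


def render_template(incident_id, floors=FLOORS):
    """Return an empty plain-text post-mortem with one section per floor."""
    parts = [f"Post-mortem {incident_id}", ""]
    for title, purpose in floors:
        parts += [title, f"Purpose: {purpose}", "", "TODO", ""]
    return "\n".join(parts)


print(render_template("INC-0000"))  # placeholder ID
```

Keeping the floors in a plain list makes step 3 cheap: removing an unused section or adding a missing one is a one-line change.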

Over time, your incident library becomes a skyline: a visible, navigable landscape of what you’ve survived and how you’ve improved.


Conclusion: Build Towers Worth Walking

Every incident tells a story, but most organizations only capture a sketch.

By treating post‑mortems as essential architectural documents, using a clear, reusable template, and choosing tools and support resources that your entire team can master, you turn outages into designed learning experiences instead of random scars.

The Analog Incident Story Compass Tower is just a metaphor—but it’s a powerful one. When you can walk an outage from lobby to rooftop, floor by floor, you gain something rare: the ability not just to fix what broke, but to design a future where it’s harder for the same failure to happen again.
