Rain Lag

The Analog Incident Story Blueprint Scroll: Unrolling a Floor‑Length Map for Pre‑Mortem Outage Rehearsals

How a giant paper “incident scroll” turns abstract system risk into a tangible, collaborative rehearsal—before an outage ever happens.

Introduction

Most teams know how to do a post‑mortem after an outage: gather logs, reconstruct the timeline, debate the root cause, promise to “do better next time.” But by then, the damage is already done—customers are impacted, reputations are dented, sleep has been lost.

A pre‑mortem flips that script. Instead of asking, “What went wrong?” you ask, “Imagine it’s three months in the future: this launch failed spectacularly. What happened?” You rehearse the outage before it ever exists.

Now add one more twist: do it analog.

In an age of dashboards and shared docs, there’s surprising power in rolling out a floor‑length paper map—a literal incident story blueprint scroll the team can walk along. It becomes a physical stage for your imagined outage: systems, dependencies, human actors, and failure modes all laid out in ink.

This post explains how to design and run pre‑mortem outage rehearsals using an analog, floor‑length “incident scroll” that makes your systems and risks impossible to ignore.


Why Pre‑Mortem Outage Rehearsals Matter

A pre‑mortem is a structured exercise you conduct before a major release, migration, or architecture change. The core idea:

"Assume the project has failed badly. Work backward to figure out why."

Instead of generic risk checklists, people are free to imagine vivid, concrete failure stories. This helps you:

  • Surface hidden failure modes that don’t show up in traditional risk logs.
  • Challenge optimism bias ("that won’t happen to us") with realistic disaster narratives.
  • Align the team on where the real landmines lie.
  • Convert imagined disasters into plans: mitigations, playbooks, and monitoring improvements.

Pre‑mortems are especially powerful in complex IT environments where no single person sees the full system. They allow everyone—engineering, SRE, security, product, support—to contribute pieces of the puzzle.


The Incident Story Blueprint Scroll: Why Go Analog?

Using a floor‑length paper scroll or map for the pre‑mortem might sound old‑fashioned, but that’s the point. Analog tools have advantages that software alone rarely delivers:

  • Physical presence: A giant scroll on the floor or wall is hard to ignore. It anchors everyone’s attention in one shared space.
  • Shared mental model: People literally stand around the same map, pointing at services, drawing arrows, and arguing over connections.
  • Embodied walkthrough: Participants can walk the timeline, following a hypothetical outage minute by minute, step by step.
  • Low friction: Markers and sticky notes are faster than fiddling with diagram tools in a meeting.

Think of the scroll as your Analog Incident Story Blueprint—a narrative space where you map systems, actors, and time, then play out the outage as a story from first symptom to final resolution.


Step 1: Define Clear Goals for the Pre‑Mortem

Before unrolling paper, decide what you’re rehearsing and why.

Clarify:

  1. Scope

    • Are you focusing on a specific change (e.g., database migration, new feature launch)?
    • Or a category of failure (e.g., identity provider outage, regional cloud failure)?
  2. Goals

    • What do you want to walk away with?
    • Examples:
      • A prioritized risk register
      • A list of new alerts and dashboards to build
      • Runbook improvements and on‑call training topics
  3. Participants

    • Include people from:
      • Application engineering
      • SRE / platform / infrastructure
      • Security
      • Customer support / operations
      • Product or business stakeholders, where relevant

Announce the goal up front: “By the end of this session, we will have a list of top outage scenarios, associated risks, and concrete action items to prevent or handle them better.”


Step 2: Unroll the Scroll and Draw the System Map

Unroll your paper across the floor, a long table, or a hallway wall. This is your canvas for the system and the story.

Divide it into two main dimensions:

  1. Horizontal axis: Time / Incident Timeline

    • Left side: T‑0 (the moment the change goes live, or the first symptom appears)
    • Right side: T+X hours (incident resolved, RCA started)
    • Mark key time buckets: T+5m, T+15m, T+1h, etc.
  2. Vertical axis: System Layers & Actors

    • Top: external actors (users, third‑party services)
    • Then: edge / API / frontend
    • Mid: services, microservices, and business logic
    • Lower: data stores, queues, caches, infrastructure, networking
    • Bottom: teams and roles (on‑call, support, incident commander, etc.)

Use colored markers and sticky notes to:

  • Draw boxes for services, tools, and external dependencies.
  • Sketch data flows and critical paths.
  • Annotate known single points of failure.
  • Mark where observability exists (and where it doesn’t).

This rough, high‑level map doesn’t need to be perfect. The goal is to externalize shared understanding and reveal missing knowledge.
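If you later want to digitize what went on the scroll, the two axes translate directly into a small data model: a vertical layer and a horizontal time position for each sticky note. A minimal sketch in Python (the layer names mirror the scroll's rows above; the example notes are hypothetical):

```python
from dataclasses import dataclass
from enum import Enum

class Layer(Enum):
    """Vertical axis of the scroll, top to bottom."""
    EXTERNAL = 1   # users, third-party services
    EDGE = 2       # edge / API / frontend
    SERVICE = 3    # services, microservices, business logic
    DATA = 4       # data stores, queues, caches, infrastructure
    PEOPLE = 5     # on-call, support, incident commander

@dataclass
class StickyNote:
    """One sticky note placed on the scroll."""
    text: str
    layer: Layer
    minute: int  # horizontal position: minutes after T-0

# Hypothetical notes from one rehearsed scenario.
notes = [
    StickyNote("Config flag disables rate limiting", Layer.SERVICE, 0),
    StickyNote("DB connection pool saturates", Layer.DATA, 5),
    StickyNote("Users see 500s at the edge", Layer.EDGE, 15),
    StickyNote("On-call paged", Layer.PEOPLE, 18),
]

# Reading the scroll left to right is just a sort on the time axis.
timeline = sorted(notes, key=lambda n: n.minute)
```

The point is not the tooling; it is that a scroll laid out this way round-trips cleanly into whatever digital system you use afterwards.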


Step 3: Imagine “What Went Wrong” in Realistic Detail

With the scroll laid out, shift into pre‑mortem mode.

Pose the core question:

“It’s six weeks after this change went live. We suffered a major, customer‑visible outage. Walk me through what went wrong.”

Facilitate collaborative brainstorming:

  1. Silent idea generation (5–10 minutes)

    • Everyone writes hypothetical failures on sticky notes:
      • “New config flag disabled rate limiting → downstream DB overwhelmed.”
      • “Third‑party auth provider changed API → login failures.”
      • “Migration script stalled halfway → inconsistent data in region B.”
  2. Cluster by theme and location on the scroll

    • Place each sticky where it would show up in the system map and timeline.
    • Group similar failure types (e.g., data corruption, latency spikes, auth failures).
  3. Tell the incident story

    • Pick one scenario and narrate it as if it already happened:
      • When did the first symptom occur?
      • What did customers experience?
      • What did on‑call see (or not see)?
      • How did the issue escalate and propagate?

Write this story along the timeline axis. Draw arrows to show cascading effects: how a minor issue in a supporting service becomes a full‑scale outage because of retries, thundering herds, or misconfigured fallbacks.
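The retry amplification behind those arrows is worth doing as arithmetic during the session. A minimal sketch (illustrative numbers, not from any real system) of how naive immediate retries multiply the load a struggling dependency sees:

```python
def offered_load(base_rps: float, failure_rate: float, max_retries: int) -> float:
    """Total requests/second a downstream sees when every failed call
    is immediately retried, up to max_retries extra attempts.

    Each attempt fails with probability failure_rate, and each failure
    under the cap triggers one more attempt, so load grows geometrically:
    base * (1 + f + f^2 + ... + f^max_retries).
    """
    return base_rps * sum(failure_rate ** k for k in range(max_retries + 1))

# 1000 rps, 50% of calls failing, clients retrying up to 3 times:
# 1000 * (1 + 0.5 + 0.25 + 0.125) = 1875 rps hammering a dependency
# that is already overloaded -- the retry storm the scroll should capture.
load = offered_load(base_rps=1000, failure_rate=0.5, max_retries=3)
```

Walking this calculation on paper makes the case for backoff, jitter, and retry budgets far more concrete than "retries can be dangerous."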


Step 4: Thoroughly Map Dependencies and Cascading Risks

The scroll shines when you start tracing dependencies. Many serious incidents begin with something small:

  • A “non‑critical” internal API goes down, but three revenue‑critical services quietly depend on it.
  • A background job misbehaves, clogging queues and delaying user‑facing operations.
  • A shared cache cluster is tuned for one use case but thrashed by another.

Use the session to:

  • Trace service → data store → third‑party chains.
  • Mark shared resources (databases, caches, message buses, feature flag systems, CI/CD tools).
  • Identify cross‑team dependencies that will matter in a crisis.

Every time someone says, “Wait, I didn’t know that depended on X,” circle that spot. These are prime candidates for better monitoring, isolation, or reliability improvements.
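The "I didn't know that depended on X" question is a reverse graph traversal, and you can answer it mechanically once the scroll's arrows are written down. A sketch with a hypothetical dependency map (all service names invented for illustration):

```python
from collections import deque

# Hypothetical edges: service -> the things it depends on,
# transcribed from the arrows drawn on the scroll.
DEPENDS_ON = {
    "checkout": ["pricing-api", "auth"],
    "pricing-api": ["internal-config-api"],
    "reporting": ["internal-config-api"],
    "auth": ["user-db"],
}

def impacted_by(failed: str) -> set[str]:
    """Walk the graph in reverse to find every service that
    transitively depends on the failed component."""
    # Invert the edges: dependency -> its direct dependents.
    dependents: dict[str, list[str]] = {}
    for svc, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(svc)

    impacted: set[str] = set()
    queue = deque([failed])
    while queue:
        node = queue.popleft()
        for svc in dependents.get(node, []):
            if svc not in impacted:
                impacted.add(svc)
                queue.append(svc)
    return impacted

# The "non-critical" internal-config-api going down quietly takes out
# pricing-api and reporting, and ultimately checkout.
blast_radius = impacted_by("internal-config-api")
```

Circling the spot on the scroll and running this traversal answer the same question; doing both keeps the paper map and your real dependency data honest with each other.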


Step 5: Turn Imagined Failures into Concrete Plans

A pre‑mortem is only valuable if it produces tangible outcomes.

For each major scenario explored on the scroll, capture:

  1. Risk Register Entries

    • Description of the risk
    • Likelihood and impact (even if roughly scored)
    • System components involved
  2. Mitigation Plans

    • Design or architecture changes (e.g., circuit breakers, bulkheads, better backpressure)
    • Process changes (e.g., staged rollouts, feature flags, change freeze windows)
    • Test coverage (e.g., chaos tests, load tests, integration tests)
  3. Action Items

    • New or improved alerts and SLOs
    • Runbook updates and incident roles
    • Training or simulations for on‑call engineers

Assign clear owners and deadlines. Translate the sticky notes into your usual tools: ticketing system, risk tracker, or reliability roadmap.
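The risk register entries above translate naturally into structured records you can sort and track. A minimal sketch, assuming a rough 1–5 scoring scheme and likelihood × impact prioritization (the entries are the hypothetical scenarios from Step 3):

```python
from dataclasses import dataclass, field

@dataclass
class RiskEntry:
    """One row of the risk register captured from the scroll."""
    description: str
    likelihood: int  # rough score: 1 (rare) .. 5 (almost certain)
    impact: int      # rough score: 1 (minor) .. 5 (customer-visible outage)
    components: list[str] = field(default_factory=list)

    @property
    def score(self) -> int:
        # Simple likelihood x impact score for ranking.
        return self.likelihood * self.impact

register = [
    RiskEntry("Config flag disables rate limiting", 3, 5, ["edge", "db"]),
    RiskEntry("Auth provider API change breaks login", 2, 4, ["auth"]),
    RiskEntry("Migration script stalls mid-way in one region", 2, 5, ["db"]),
]

# Highest likelihood x impact first: this is the order in which
# mitigations and action items should be assigned owners.
prioritized = sorted(register, key=lambda r: r.score, reverse=True)
```

Even a crude score like this beats an unordered pile of sticky notes when you have to decide which mitigation gets funded first.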


Step 6: Integrate Insights with Incident Tooling

Don’t let the analog scroll live in a silo. Tie it into your digital incident ecosystem:

  • Alerting & Monitoring

    • For each major failure path, ask: “How would we know this is happening quickly?”
    • Add or refine alerts, dashboards, SLOs, and traces.
  • Escalation Paths

    • If a scenario crosses teams or vendors, ensure your escalation policies reflect that reality.
    • Update on‑call rotations, contact lists, and incident commander playbooks.
  • RCA & Incident Management Tools

    • Use your pre‑mortem stories as ready‑made templates for future RCAs.
    • When a real incident occurs, compare it to the scenarios you rehearsed. Did you predict it? Did mitigations help?

This closes the loop between rehearsal, detection, and response: what you imagine on paper should directly improve what you see and do in production.
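For the "how would we know this is happening quickly?" question, one widely used answer is burn-rate alerting on SLOs: page when a window is consuming error budget much faster than the rate that would exactly exhaust it over the SLO period. A sketch of the core calculation (the thresholds and numbers are illustrative, not a recommendation for your system):

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How fast a measurement window is consuming error budget,
    relative to the rate that would exactly exhaust it over the
    SLO period: observed error ratio / allowed error ratio.

    A burn rate of 1.0 spends the budget exactly on schedule;
    a much higher rate over a short window is a page.
    """
    allowed = 1.0 - slo_target       # error budget, e.g. 0.001 for 99.9%
    observed = errors / requests     # error ratio in this window
    return observed / allowed

# 99.9% SLO; in the last hour: 120 errors out of 10,000 requests.
# Observed ratio 1.2% vs allowed 0.1% -> burning budget ~12x too fast.
rate = burn_rate(errors=120, requests=10_000, slo_target=0.999)
should_page = rate >= 10  # hypothetical fast-burn paging threshold
```

For each failure path on the scroll, asking "what would the burn rate look like, and how soon?" is a concrete way to turn a rehearsed scenario into an alert definition.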


Step 7: Repeat and Refine—Don’t Let the Scroll Get Stale

Systems evolve; architectures drift; dependencies creep in quietly. A one‑off pre‑mortem quickly becomes outdated.

Make the incident scroll a recurring ritual:

  • Run a pre‑mortem before any major release or migration.
  • Schedule quarterly or semi‑annual outage rehearsals focused on new high‑risk areas.
  • Update the system map as services are added, retired, or re‑architected.

Over time, you’ll build a library of incident stories and risk patterns. You’ll also reduce complacency—no one assumes, “We’re safe now,” because they’re regularly confronted with new, plausible failure modes.

When a real outage does happen, use the scroll:

  • Mark the actual incident path.
  • Compare it to your previous imagined scenarios.
  • Feed new insights back into your next pre‑mortem.

Conclusion

Pre‑mortems help teams fail on paper instead of in production. By unrolling an analog, floor‑length incident story blueprint scroll, you turn invisible complexity into something the whole team can see, touch, and walk through together.

The process is straightforward:

  1. Define clear goals and scope.
  2. Unroll the scroll and sketch your system timeline and layers.
  3. Imagine realistic “what went wrong” stories.
  4. Map dependencies and reveal cascading failure paths.
  5. Convert scenarios into risk registers, mitigations, and action items.
  6. Integrate insights with your incident tooling.
  7. Repeat and refine as your architecture evolves.

The result isn’t just a better diagram. It’s a shared, embodied understanding of how your systems really behave under stress—and a concrete plan to prevent the next headline‑worthy outage before it ever happens.
