Rain Lag

The Analog Incident Story Map Cabinet: How Failures Really Spread in Complex Systems

Exploring a new, sociotechnical way to visualize and manage cascading incidents through the "Analog Incident Story Map Cabinet"—a design science approach that reveals how failures truly propagate across people, processes, and technology.

When something goes wrong in a complex system (a power grid, a large-scale IT infrastructure, or a critical industrial network), we often talk as if there were a single “root cause” and a clean, linear chain of events. But that story is almost always a polite fiction.

In reality, failures unfold as messy, multi-stage narratives in which misunderstandings, minor deviations, technical glitches, automation quirks, and organizational blind spots all interlock. Traditional models of cascading failures, such as percolation or epidemic models, capture little of this sociotechnical reality.

Enter the Analog Incident Story Map Cabinet—a concept and method that treats incident response like cartography. Instead of relying on simplistic failure trees or abstract contagion models, it maps how failures truly spread over time and across people, processes, and technology.

This approach is grounded in Design Science Research (DSR) and brings together structured analysis, physical visualization, and standards-based playbooks, all while embracing the messy human side of incidents.


From Epidemic Metaphors to Incident Cartography

For years, researchers have borrowed ideas from percolation and epidemic models to describe cascading failures, particularly in networks like power transmission systems. These models imagine failures spreading like a virus: one node affects its neighbors, and so on.

Those models are useful—but only up to a point.

They struggle with:

  • Human decision-making (dispatchers, operators, engineers under stress)
  • Organizational dynamics (policies, incentives, communication patterns)
  • Tools and automation quirks (control systems, alarms, dashboards)

In real incidents, what people see, understand, and decide at each moment heavily shapes how a failure spreads. The same technical fault can become a minor blip or a full-blown crisis depending on social and organizational context.
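
For reference, the epidemic-style view described above can be sketched in a few lines. This is a minimal illustration, not a model taken from the research: the five-node network, the function name simulate_cascade, and the single spread_prob parameter are all assumptions made for the example.

```python
import random
from collections import deque


def simulate_cascade(neighbors, seed_node, spread_prob, rng):
    """Return the set of failed nodes after an epidemic-style cascade."""
    failed = {seed_node}
    frontier = deque([seed_node])
    while frontier:
        node = frontier.popleft()
        for nxt in neighbors.get(node, []):
            # Each failed node gets one chance to "infect" each healthy neighbor.
            if nxt not in failed and rng.random() < spread_prob:
                failed.add(nxt)
                frontier.append(nxt)
    return failed


# Hypothetical five-node network as an adjacency list (not a real grid).
grid = {
    "A": ["B", "C"],
    "B": ["A", "D"],
    "C": ["A", "D"],
    "D": ["B", "C", "E"],
    "E": ["D"],
}

rng = random.Random(42)  # fixed seed so repeated runs give the same result
print(sorted(simulate_cascade(grid, "A", spread_prob=0.5, rng=rng)))
```

Note what the sketch leaves out: the only inputs are the network and a probability. Nothing in it represents what an operator saw, which procedure applied, or how an alarm was displayed.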

The Analog Incident Story Map Cabinet reframes cascading incidents not as single failures with knock-on effects, but as multi-stage, systemic events that emerge from interacting sociotechnical elements.


What Is the Analog Incident Story Map Cabinet?

Imagine a large, physical cabinet filled with sliding drawers.

Each drawer is a story map of an incident: a timeline of events, decisions, system states, communications, and interventions. It’s an “analog” representation, but it’s based on highly structured data and analysis.

The cabinet becomes a cartographic archive of failures:

  • Each drawer = one incident
  • Each incident = a mapped narrative of how failures propagated
  • Across drawers = patterns and archetypes of how incidents typically unfold

This is more than a metaphor. It’s a design artifact emerging from a Design Science Research (DSR) process, where:

  1. A real-world problem is identified (poor mental models of cascading failures).
  2. An artifact is created (the story map cabinet and its method).
  3. The artifact is evaluated in practice (with real incidents and operators).

The result is a tangible way to see and compare incident narratives—not just data points.
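
One way to picture the structured data behind a drawer is sketched below. The class names (StoryEvent, IncidentStoryMap, Cabinet), the field names, and the sample incidents are hypothetical and chosen only for illustration; the source does not specify a schema.

```python
from collections import Counter
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class StoryEvent:
    timestamp: datetime
    actor: str        # person, team, or automated system
    kind: str         # e.g. "event", "decision", "system_state", "communication"
    description: str


@dataclass
class IncidentStoryMap:
    """One drawer: the mapped narrative of a single incident."""
    incident_id: str
    archetype: str
    events: list[StoryEvent] = field(default_factory=list)

    def timeline(self) -> list[StoryEvent]:
        # Present events in chronological order regardless of entry order.
        return sorted(self.events, key=lambda e: e.timestamp)


@dataclass
class Cabinet:
    """The archive: one drawer per incident, keyed by incident id."""
    drawers: dict[str, IncidentStoryMap] = field(default_factory=dict)

    def file(self, story_map: IncidentStoryMap) -> None:
        self.drawers[story_map.incident_id] = story_map

    def archetype_counts(self) -> Counter:
        # Looking across drawers: how often does each pattern recur?
        return Counter(m.archetype for m in self.drawers.values())


# Hypothetical sample data, invented for this example.
cabinet = Cabinet()
cabinet.file(IncidentStoryMap("INC-001", "Automation Surprise", [
    StoryEvent(datetime(2024, 1, 1, 10, 5), "operator", "decision", "Overrides setpoint"),
    StoryEvent(datetime(2024, 1, 1, 10, 0), "controller", "system_state", "Switches mode"),
]))
cabinet.file(IncidentStoryMap("INC-002", "Automation Surprise"))
cabinet.file(IncidentStoryMap("INC-003", "Slow-Burn Drift"))

print(cabinet.archetype_counts().most_common())
```

Keeping the archetype on each drawer is what makes the cross-drawer comparison cheap: recurring patterns fall out of a simple count.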


Six Recurrent Incident Archetypes

By systematically mapping multiple incidents, the research identifies six recurrent incident archetypes—common patterns in how failures propagate.

While the exact labels and nuances depend on the specific domain (e.g., power transmission, IT operations), archetypes typically capture patterns like:

  1. Slow-Burn Drift
    Small deviations accumulate unnoticed until a threshold is crossed.

  2. Alarm Storm Overload
    Too many alerts desensitize operators, causing critical signals to be missed.

  3. Hidden Dependency Cascade
    A seemingly isolated fault reveals deep, uncharted interdependencies.

  4. Control Room Coordination Breakdown
    Miscommunication and misaligned mental models amplify a manageable event.

  5. Automation Surprise
    Automated systems behave as designed—but not as expected.

  6. Recovery-Induced Failure
    Well-intentioned recovery actions trigger new problems elsewhere in the system.

These archetypes are not just after-the-fact labels. They are actionable templates that guide:

  • How to recognize an unfolding pattern early
  • Which kinds of interventions are likely to help (or harm)
  • How to structure training and rehearsals

Instead of reinventing the wheel with each incident, teams can ask two questions: “Which archetype are we in right now?” and “What does the playbook suggest for this pattern?”


A Structured Incident-Response Playbook Aligned with NIST

The Analog Incident Story Map Cabinet is not just about storytelling—it’s tightly connected to a structured incident-response playbook.

This playbook is aligned with NIST guidelines (such as NIST’s Computer Security Incident Handling Guide, SP 800-61, and related frameworks), so it:

  • Uses recognizable phases (in SP 800-61 Rev. 2: Preparation; Detection & Analysis; Containment, Eradication & Recovery; Post-Incident Activity)
  • Defines roles and responsibilities
  • Encourages evidence-based decision-making

Where this work goes further is in tailoring the playbook to the six incident archetypes. For each archetype, the playbook defines:

  • Early warning signs and indicators
  • Expected sociotechnical interactions (who needs to talk to whom, over what channel)
  • Recommended interventions (technical actions, communication steps, escalation paths)
  • Known pitfalls (common missteps observed in past incidents)

This alignment with NIST brings standardization and legitimacy, while the archetype-based structure makes it practically usable in real operations.
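
The four per-archetype fields listed above could be held in a simple lookup table. The sketch below is an assumption about shape only: the archetype names come from this article, but the PlaybookEntry type, the lookup helper, and every string inside the sample entry are illustrative placeholders, not content from the actual playbook.

```python
from dataclasses import dataclass
from enum import Enum


class Archetype(Enum):
    SLOW_BURN_DRIFT = "Slow-Burn Drift"
    ALARM_STORM_OVERLOAD = "Alarm Storm Overload"
    HIDDEN_DEPENDENCY_CASCADE = "Hidden Dependency Cascade"
    COORDINATION_BREAKDOWN = "Control Room Coordination Breakdown"
    AUTOMATION_SURPRISE = "Automation Surprise"
    RECOVERY_INDUCED_FAILURE = "Recovery-Induced Failure"


@dataclass(frozen=True)
class PlaybookEntry:
    early_warning_signs: tuple[str, ...]
    interactions: tuple[str, ...]   # who talks to whom, over what channel
    interventions: tuple[str, ...]
    pitfalls: tuple[str, ...]


# Placeholder content for one archetype; a real playbook would cover all six.
PLAYBOOK = {
    Archetype.ALARM_STORM_OVERLOAD: PlaybookEntry(
        early_warning_signs=("alert volume rising faster than acknowledgements",),
        interactions=("shift lead to on-call engineer, by voice",),
        interventions=("suppress duplicate alerts", "escalate to a second operator"),
        pitfalls=("acknowledging alerts in bulk without reading them",),
    ),
}


def lookup(archetype):
    """Return the playbook entry for an archetype, or None if not yet written."""
    return PLAYBOOK.get(archetype)


entry = lookup(Archetype.ALARM_STORM_OVERLOAD)
print(entry.interventions)
```

Returning None for an archetype without an entry makes gaps in the playbook visible instead of hiding them behind a default.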


A Deeply Sociotechnical View of Failure

At the heart of the Analog Incident Story Map Cabinet is a sociotechnical perspective.

That means:

  • Failures are not purely technical events.
  • They arise from interactions between people, processes, and technology.

The story maps highlight:

  • What information was available to which person, when
  • How procedures shaped or constrained decisions
  • How tools and interfaces amplified or dampened signals
  • Where informal workarounds diverged from formal processes

Instead of blaming “human error” or “system fault,” these maps show how human decisions make sense in context—and how that context is shaped by design choices, organizational culture, and automation.

This perspective is crucial to understanding cascading incidents in systems like power grids, where:

  • Operators manage high-stakes, time-critical decisions
  • System state is only partially observable
  • Tools and alarms can mislead as much as they guide


Logging Everything: From Crisis to Learning

One of the key insights of this work is that all actions during incidents can be logged and revisited:

  • Operator commands
  • System responses
  • Communications (where policy and privacy allow)
  • Timings and order of events

These logs feed directly into the incident story maps.
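
As a rough sketch of that step, separate log streams can be merged into a single chronological timeline. The LogRecord type, the build_timeline function, and the sample entries are hypothetical; the sketch also assumes each stream is already sorted by time and that all clocks are synchronized, which real systems may not guarantee.

```python
import heapq
from datetime import datetime
from typing import NamedTuple


class LogRecord(NamedTuple):
    timestamp: datetime
    source: str     # which log the record came from
    message: str


def build_timeline(*streams):
    """Merge already time-sorted log streams into one chronological list."""
    return list(heapq.merge(*streams, key=lambda r: r.timestamp))


# Hypothetical sample logs, invented for this example.
commands = [
    LogRecord(datetime(2024, 1, 1, 10, 0, 5), "operator_command", "open breaker 7"),
    LogRecord(datetime(2024, 1, 1, 10, 2, 0), "operator_command", "close breaker 7"),
]
responses = [
    LogRecord(datetime(2024, 1, 1, 10, 0, 6), "system_response", "breaker 7 open"),
]
comms = [
    LogRecord(datetime(2024, 1, 1, 10, 1, 0), "communication", "control room calls field crew"),
]

for record in build_timeline(commands, responses, comms):
    print(record.timestamp.time(), record.source, record.message)
```

The merged list is the raw material for a story map: the ordering across sources is what shows which decision followed which signal.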

The benefits are substantial:

  1. Post-Incident Insight
    Teams can reconstruct the incident as a narrative: what happened, when, why it made sense, and how context evolved.

  2. Training and Simulation
    Real incidents become training scenarios. New operators can be walked through past story maps, seeing how familiar patterns re-emerge.

  3. Stakeholder Communication
    Managers, regulators, and external stakeholders get a clear, visual, and structured explanation of the incident—without oversimplified blame.

  4. Design Feedback Loop
    Insights from story maps can inform redesign of tools, processes, and organizational structures.

This turns incident response from a one-off firefight into a continuous learning cycle.


Why Traditional Cascade Models Fall Short

Percolation and epidemic models treat failures like infections that pass between connected nodes with some probability, independent of who is watching or what they decide.

In complex sociotechnical systems, this misses:

  • Conditional behavior: Failures propagate only if certain procedures are followed or skipped.
  • Operator adaptation: Humans improvise, compensate, and sometimes introduce new failure paths.
  • Policy and regulation: Rules shape which actions are even considered.
  • Tool-mediated perception: Dashboards, alarms, and interfaces filter what is visible.
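
To make the contrast concrete, the sketch below lets each link between failures carry a condition on the operating context, so the same initial fault can stay contained or escalate. It is an illustration under assumed names: the propagate function, the failure labels, and the two context flags are invented for the example and do not come from the research.

```python
from collections import deque


def propagate(edges, start, context):
    """Follow only those edges whose condition holds in the given context."""
    reached = {start}
    frontier = deque([start])
    while frontier:
        node = frontier.popleft()
        for target, condition in edges.get(node, []):
            if target not in reached and condition(context):
                reached.add(target)
                frontier.append(target)
    return reached


# Hypothetical failure paths; each edge is (target, condition on context).
edges = {
    "line_trip": [
        ("overload_neighbor", lambda ctx: not ctx["load_shed_procedure_followed"]),
    ],
    "overload_neighbor": [
        ("regional_outage", lambda ctx: not ctx["alarm_visible_on_dashboard"]),
    ],
}

contained = propagate(edges, "line_trip", {
    "load_shed_procedure_followed": True,
    "alarm_visible_on_dashboard": True,
})
escalated = propagate(edges, "line_trip", {
    "load_shed_procedure_followed": False,
    "alarm_visible_on_dashboard": False,
})

print(sorted(contained))
print(sorted(escalated))
```

Here the spread depends on procedure and perception instead of chance: with the procedure followed, only the initial fault is reached, while with both safeguards absent the cascade runs to the end.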

Cascading incidents in, say, power transmission are better understood as multi-stage systemic events:

  • Early technical deviations
  • Local compensations and workarounds
  • Shifting operating margins
  • Misaligned mental models across teams
  • Late-stage, system-wide constraints suddenly binding

The Analog Incident Story Map Cabinet captures this multi-stage reality in a structured, analyzable way, rather than flattening it into an abstract contagion process.


From Artifacts to Practice: Why This Matters

The value of the Analog Incident Story Map Cabinet is not just conceptual. It offers:

  • A concrete artifact (the cabinet and its maps) for shared understanding
  • A framework to recognize recurring incident archetypes
  • A standards-aligned playbook tailored to these archetypes
  • A method to turn raw logs into structured narratives and enduring organizational memory

For organizations that operate critical infrastructure or complex digital systems, this approach can:

  • Improve situational awareness during incidents
  • Reduce the likelihood of repeated mistakes
  • Enhance training and onboarding
  • Support transparent communication with regulators and stakeholders

Ultimately, it helps teams see failures as patterns in a landscape, not as isolated anomalies.


Conclusion: Drawing Better Maps of Failure

Incidents will never be fully eliminated in complex systems. But we can become much better at understanding how they unfold.

The Analog Incident Story Map Cabinet offers a new way to:

  • Visualize how failures really propagate
  • Recognize recurrent incident archetypes
  • Align response actions with robust standards like NIST
  • Embrace a sociotechnical view that honors the real conditions of work

Instead of searching for a single root cause, we can build better maps—maps that help us navigate crises in real time and learn from them afterward.

In a world of growing complexity, the organizations that invest in this kind of incident cartography will be the ones that respond faster, recover smarter, and learn deeper from every failure.
