The Analog Incident Story Lighthouse Map: Hand‑Drawing Safe Paths Through Chaotic Production Nights
How an analog, visual “Lighthouse Map” can turn chaotic production incidents into a disciplined, repeatable learning practice that protects both reliability and people.
Introduction: When Production Feels Like a Storm at Sea
If you’ve ever taken a 3 a.m. incident call, you know the feeling: dashboards glowing, Slack pinging nonstop, people talking past each other, and a creeping sense that you’re guessing more than you’re reasoning. By the time the system is stable again, everyone is exhausted, and the follow‑up review, if it happens at all, feels rushed, ad hoc, and forgettable.
The Analog Incident Story Lighthouse Map is an attempt to change that story.
It’s a deliberately analog, visual framework for incident analysis: a hand‑drawn “map” that helps teams plot safe paths through chaotic production nights. It uses incident archetypes and a threat/vulnerability traceability matrix to systematically anticipate and understand failure modes, while connecting technical signals to human decisions. Over time, it becomes a shared mental model for how your organization learns from trouble.
This post walks through what the Lighthouse Map is, how it works, and how it can support sustainable on‑call without burning people out.
From One‑Off Fire Drills to a Disciplined Practice
Most teams treat incident reviews like short‑term penance: something you do after a “big” outage to appease stakeholders. The result is predictable:
- Every review looks different.
- Lessons don’t accumulate; they evaporate.
- Tacit knowledge lives in a few people’s heads.
- On‑call feels like heroics, not a designed practice.
The Lighthouse Map reframes incident reviews as a disciplined, repeatable practice. Instead of improvising each time, you work from a visual template that:
- Defines review principles (blamelessness, curiosity, systemic focus).
- Lays out evaluation criteria (signal quality, decision context, vulnerability exposure, response coordination).
- Embeds learning checkpoints so every review traverses the same core questions.
By making the process visible and predictable, the Lighthouse Map reduces cognitive overhead during stressful times and ensures your reviews keep getting better instead of resetting to zero.
What Is the “Lighthouse Map”?
Think of the Lighthouse Map as a large, analog canvas you fill during or after an incident. It’s not a new tool; it’s a new way of structuring conversation and attention.
A typical map includes:
- Incident Storyline
  A timeline of what happened: alerts, observations, decisions, actions, and outcomes. This is the narrative spine.
- Incident Archetype Panel
  A small library of recurring patterns, such as:
  - Configuration drift
  - Capacity exhaustion
  - Dependency failure
  - Bad rollout / change gone wrong
  - Latent bug activated by a rare condition
  You tag the story with one or more archetypes. Over time, this helps you recognize patterns early.
- Threat/Vulnerability Traceability Matrix
  A structured grid that links:
  - Threats (what could go wrong or did go wrong)
  - Vulnerabilities (where the system or process was exposed)
  - Controls / Mitigations (what exists now, what’s missing, what’s planned)
  For example:
  - Threat: Cache cluster node loss
  - Vulnerability: No automatic failover test; manual recovery steps undocumented
  - Control: Add quarterly failover game days; create runbook; strengthen alerting
- Human Factors & Coordination Area
  A space to capture:
  - Who got paged, and when
  - How information flowed (or didn’t)
  - Hand‑offs between people and teams
  - Decision points and the context those people had
- Learning Checkpoints
  Pivotal questions like:
  - What surprised us most?
  - Where were we blind? What did we assume?
  - When did we feel stuck, and why?
  - What helped reduce confusion or pressure?
By the end of a session, the map is a dense, visual story of one incident that connects technical conditions to human experience.
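The map itself stays on paper, but if you later transcribe it for an archive, the five sections translate naturally into a small record. The sketch below is one possible shape, not a prescribed schema: the class names, field names, and the `empty_sections` helper are all assumptions made for illustration, and the incident label is hypothetical. It shows how a transcription could flag sections that nobody filled in during the session.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class StorylineEntry:
    time: str  # wall-clock time as written on the map, e.g. "03:12"
    kind: str  # "alert", "observation", "decision", "action", or "outcome"
    note: str


@dataclass
class LighthouseMap:
    """Hypothetical transcription of one hand-drawn map."""
    incident: str
    storyline: list[StorylineEntry] = field(default_factory=list)
    archetypes: list[str] = field(default_factory=list)
    matrix: list[dict[str, str]] = field(default_factory=list)  # threat / vulnerability / control rows
    human_factors: list[str] = field(default_factory=list)
    checkpoints: dict[str, str] = field(default_factory=dict)  # question -> answer


def empty_sections(m: LighthouseMap) -> list[str]:
    """Return the names of map sections that were left blank."""
    sections = {
        "storyline": m.storyline,
        "archetypes": m.archetypes,
        "matrix": m.matrix,
        "human_factors": m.human_factors,
        "checkpoints": m.checkpoints,
    }
    return [name for name, content in sections.items() if not content]


example = LighthouseMap(
    incident="cache-node-loss",  # hypothetical incident label
    storyline=[StorylineEntry("03:12", "alert", "Cache latency alert fired")],
    archetypes=["Dependency failure"],
)
print(empty_sections(example))  # prints ['matrix', 'human_factors', 'checkpoints']
```

A check like this is only a reminder to revisit the wall before the session ends; the conversation around the paper is still where the learning happens.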
Making Tacit Knowledge Visible and Shareable
High‑performing on‑call engineers accumulate an enormous amount of tacit knowledge:
- “This alert is noisy but rarely critical.”
- “When this service slows down, check that dependency first.”
- “If Alice is on call, she knows the hidden debug flag to flip.”
This knowledge is gold—and also fragile. It often disappears when people rotate teams, leave the company, or just burn out.
The Lighthouse Map’s analog, collaborative nature is designed to pull that knowledge out of heads and put it on paper:
- During mapping sessions, facilitators actively ask, “How did you know to do that?” or “What made that alert feel important?”
- Those answers get written directly on the map alongside the timeline and matrix.
- Over time, common heuristics, shortcuts, and mental models emerge as explicit artifacts.
The result is a shared navigation chart: new team members can see how experienced responders think, not just what buttons they pushed. Incident response becomes teachable rather than mystical.
Connecting Technical Signals to Human Factors
Traditional post‑mortems often fixate on root cause in the narrowest sense: a bug, a bad deploy, a missing index. The Lighthouse Map keeps you honest by requiring a holistic view that includes both:
- Technical conditions: alerts fired (or didn’t), system states, infrastructure health, vulnerabilities present.
- Human factors: decisions under pressure, communication patterns, role clarity, fatigue, ambiguity.
By laying these dimensions side by side, the map prompts questions such as:
- Why did this alert get ignored? Was it noise fatigue, ambiguous severity, or a trust issue with the monitoring system?
- Why did the on‑call engineer choose rollback instead of failover? What information did they have at the time?
- How did hand‑offs between time zones or teams help or hinder progress?
This moves you away from blame and toward situational understanding. The goal isn’t to identify who “messed up,” but to understand why reasonable people made reasonable decisions that still led to trouble.
Using Archetypes and Traceability to Anticipate Failure Modes
One of the most powerful aspects of the Lighthouse Map is how it uses incident archetypes and a threat/vulnerability traceability matrix to make emerging risks visible.
Incident Archetypes
By tagging each incident with one or more archetypes, you can:
- See which categories dominate your landscape (e.g., “change management” vs. “capacity” vs. “dependencies”).
- Spot early warnings: “This new alert sequence looks just like our typical dependency failure pattern.”
- Design preventive experiments targeted at your most common archetypes.
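Seeing which categories dominate is a simple counting exercise once the tags from each map are copied somewhere searchable. The following is a minimal sketch under that assumption; the incident labels and the `archetype_tally` function are hypothetical, and the tags are sample data rather than real results.

```python
from collections import Counter

# Hypothetical archive: incident label -> archetype tags copied from each map.
tagged_incidents = {
    "incident-a": ["Dependency failure"],
    "incident-b": ["Bad rollout", "Configuration drift"],
    "incident-c": ["Dependency failure", "Capacity exhaustion"],
    "incident-d": ["Dependency failure"],
}


def archetype_tally(incidents):
    """Count how many incidents carry each archetype tag."""
    tally = Counter()
    for tags in incidents.values():
        tally.update(set(tags))  # count an archetype at most once per incident
    return tally


tally = archetype_tally(tagged_incidents)
for archetype, count in tally.most_common():
    print(f"{archetype}: {count}")
```

With this sample data, "Dependency failure" comes out on top with three incidents, which is the kind of signal that would point preventive experiments at dependencies first.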
Threat/Vulnerability Traceability Matrix
The matrix ensures every incident is examined with the same disciplined lens:
- For each threat that materialized, ask: What vulnerabilities made it possible?
- For each vulnerability, ask: What control exists, and how effective is it?
- Track these over time so you can see whether mitigations are actually reducing exposure.
This is how you move from reactive patching to systematic risk reduction. The map becomes a living inventory of your organization’s “known dragons” and how you’re taming them.
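If the matrix rows are transcribed after each session, the question "are mitigations actually reducing exposure?" can be checked by listing the rows whose control is not yet in place. The sketch below assumes a `status` field with the values "exists", "planned", and "missing", mirroring the three control states described for the matrix. The `MatrixRow` class, the `open_exposures` helper, the statuses assigned to each row, and the third row's vulnerability are illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass
class MatrixRow:
    threat: str
    vulnerability: str
    control: str
    status: str  # assumed values: "exists", "planned", or "missing"


def open_exposures(rows):
    """Return the rows whose control is not in place yet."""
    return [row for row in rows if row.status != "exists"]


# Sample rows based on the cache example; statuses are made up for illustration.
rows = [
    MatrixRow("Cache cluster node loss", "No automatic failover test",
              "Quarterly failover game days", "planned"),
    MatrixRow("Cache cluster node loss", "Manual recovery steps undocumented",
              "Recovery runbook", "missing"),
    MatrixRow("Cache cluster node loss", "Node loss detected late",  # hypothetical row
              "Strengthened alerting", "exists"),
]

for row in open_exposures(rows):
    print(f"{row.threat} -> {row.vulnerability}: {row.control} ({row.status})")
```

Running the same check after each review gives a rough trend: if the list of open exposures for a recurring threat is not shrinking, the mitigations are not landing.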
Designing Sustainable On‑Call with the Lighthouse Map
You can’t have reliable systems if your people are chronically overloaded. The Lighthouse Map explicitly links incident learning to sustainable on‑call design.
Each review session asks:
- Alerts
  - Were alerts timely, actionable, and clear?
  - Which alerts created noise or confusion?
  - What changes would reduce fatigue (aggregation, suppression, better routing)?
- Rotations
  - Was the rotation depth and coverage appropriate?
  - Did time zone or hand‑off patterns help or harm response?
  - Are we concentrating too much cognitive load on a small group?
- Runbooks & Playbooks
  - Where did responders improvise because documentation didn’t exist or wasn’t trusted?
  - Which runbooks actually helped, and why?
  - What small updates would have made this night far less stressful?
Because these questions are built into the template, they get asked in every review rather than only when someone remembers them. The outcome is not just a more resilient system, but a more humane on‑call environment where reliability doesn’t depend on heroic sacrifice.
Building a Culture of Continuous Learning
Using the Lighthouse Map once is helpful; using it consistently is transformative.
As incident after incident is plotted:
- You accumulate a library of maps that tell the history of your system’s evolution—and your team’s.
- Patterns become obvious: recurring vulnerabilities, common communication failure points, archetypes that drive most of your pain.
- You get better at recognizing early signals during live incidents because you’ve seen similar stories unfold on the wall.
Over time, the Lighthouse Map becomes more than a template. It becomes a ritual:
- People expect that every significant incident will be mapped.
- Engineers see that their experiences become organizational knowledge, not just war stories.
- Leaders gain a clear way to track how incidents are shaping both technical roadmaps and team practices.
This is what a continuous‑learning culture looks like in practice: each incident strengthens not only your systems but also the way your people think, decide, and collaborate.
Getting Started with Your Own Lighthouse Map
You don’t need a fancy tool to begin. Start analog:
- Grab a large sheet of paper or a whiteboard.
  Divide it into the sections described above: Storyline, Archetypes, Threat/Vulnerability Matrix, Human Factors, Learning Checkpoints.
- Choose a recent, meaningful incident.
  Invite everyone who participated, including people from support, product, and operations.
- Walk through the story together.
  Draw the timeline, mark decisions, tag archetypes, and fill in the matrix.
- Pause at each learning checkpoint.
  Capture surprises, uncertainties, and emotional load explicitly.
- Turn insights into follow‑through.
  Create a short list of improvements across three dimensions: systems, on‑call design, and communication practices.
With repetition, this analog practice will feel less like a meeting and more like collective navigation—your team gathering around a lighthouse chart, updating it with new knowledge so the next journey through rough seas is a bit safer.
Conclusion: Drawing Safe Paths Through Chaos
Chaotic production nights are not going away. Systems will grow more complex, dependencies more tangled, expectations higher. What you can change is how your organization responds and learns.
The Analog Incident Story Lighthouse Map offers a way to:
- Systematically anticipate and understand failure modes.
- Make tacit incident knowledge explicit and shareable.
- Connect technical conditions to human factors.
- Design sustainable on‑call that protects both systems and people.
- Build a culture where each incident is a chance to learn, not just a crisis to survive.
In a world of dashboards and automation, picking up a marker and drawing your incident story by hand can feel surprisingly grounding. The map doesn’t remove the storms—but it gives your team a common chart and a shared lighthouse to steer by, one chaotic night at a time.