The Analog Incident Story Trainboard Planetarium: A Wall of Paper Stars That Predict Your Next Outage Orbit
How a low-tech wall of paper stars can transform your incident retrospectives into a powerful, data-driven system for predicting and preventing your next major outage.
What if your incident data looked less like a spreadsheet and more like a night sky?
Imagine a wall in your office covered in paper stars, orbit lines, and hand-drawn constellations. Each star is an incident, each orbit a system, each constellation a pattern of recurring failure. It’s part timeline, part map, part story. This is your Analog Incident Story Trainboard Planetarium: a physical, highly visual way to understand how your systems fail, learn from those failures, and predict the next outage before it happens.
In an era obsessed with dashboards and automation, this might sound quaint. But that’s the point. By going analog, you force the organization to slow down, think deeply, and see the patterns that digital tools often hide behind filters and charts.
This post walks through how to build and use an analog "planetarium" as a data-driven, structured approach to incident retrospectives and continuous improvement.
Why an Analog Planetarium for Incidents?
Outages rarely happen out of nowhere. They emerge from:
- small near misses
- weak signals in logs and alerts
- unresolved tech debt
- organizational blind spots
Most incident reviews focus on the last few hours before the outage. The planetarium forces you to zoom out and see:
- long-term patterns across months and systems
- systemic causes instead of one-off mistakes
- organizational dynamics, not just technical faults
You’re not replacing digital tools. You’re complementing them with a tactile, narrative, pattern-seeking surface that everyone can gather around.
Step 1: Make It Data-Driven (Not Just Story-Driven)
The planetarium is not a feelings board. It’s a data-driven artifact that makes your incident history visible at a glance.
What goes on the wall?
Each incident gets a paper star with:
- Date/time and duration
- Systems / services impacted
- Severity level
- Customer impact (e.g., % of traffic affected)
- Primary contributing factors (e.g., config error, capacity, dependency, human factors)
- Detection source (alert, customer report, internal report)
Place the stars on the wall by time on the horizontal axis and by system or domain on the vertical axis. Alternatively, use a radial layout in which distance from the “core” (your most critical systems) encodes severity. Over time, you’ll see galaxies of incidents emerge.
The key is that every star is backed by real data, pulled from your incident tracking systems. The wall is just the visualization layer.
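As a rough sketch of what "backed by real data" could look like, the record below mirrors the fields listed for each star and converts an incident into a position for the time-by-system layout. All names here (`Star`, `wall_position`, the centimeter units, the severity scale) are illustrative assumptions, not part of any particular incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Star:
    """One paper star; field names are hypothetical and mirror the list above."""
    started_at: datetime
    duration: timedelta
    systems: tuple[str, ...]
    severity: int                      # assumed scale: 1 = most severe
    traffic_affected_pct: float
    contributing_factors: tuple[str, ...]
    detection_source: str              # "alert", "customer report", "internal report"


def wall_position(star, wall_start, wall_end, wall_width_cm, system_rows, row_height_cm):
    """Return (x_cm, y_cm): time on the horizontal axis, system on the vertical axis."""
    span = (wall_end - wall_start).total_seconds()
    elapsed = (star.started_at - wall_start).total_seconds()
    fraction = min(max(elapsed / span, 0.0), 1.0)   # clamp to the wall's edges
    x_cm = round(fraction * wall_width_cm, 1)
    row = system_rows.index(star.systems[0])        # first listed system picks the row
    y_cm = (row + 0.5) * row_height_cm              # center of that row
    return x_cm, y_cm
```

Placing a star by its first listed system is a simplification; an incident that spans several systems might instead get one star per row, joined by a connection line.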
Step 2: Prepare for Each Incident Review Like a Mission Briefing
The planetarium is most powerful when each incident review is deliberate, structured, and time-boxed.
Before the review, prepare:
- Clear goals
  - Are you trying to reduce repeat incidents in one system?
  - Are you examining detection gaps?
  - Are you trying to understand human/organizational factors?
- Relevant data
  - Incident timeline and metrics (latency, error rates, etc.)
  - Historical incidents for the same system or failure mode
  - Near misses and minor alerts related to this incident
- Defined roles
  - Facilitator: keeps time, maintains psychological safety, focuses discussion
  - Scribe: captures insights and decisions on paper and in tools
  - Domain experts: provide context for systems involved
  - Observer(s): from other teams to broaden perspectives
- Planetarium updates
  - Add the new star(s) for the incident
  - Mark related historical stars with a light outline or connection lines
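One way to decide which historical stars to outline is to look for past incidents that share a system or a contributing factor with the new one. The sketch below assumes incidents are plain dictionaries with hypothetical `id`, `systems`, and `factors` keys; a real export from your tracker will likely look different.

```python
def related_incidents(new, history):
    """Return past incidents that share a system or contributing factor with `new`.

    Each result is (incident_id, shared_systems, shared_factors).
    """
    related = []
    for past in history:
        shared_systems = set(new["systems"]) & set(past["systems"])
        shared_factors = set(new["factors"]) & set(past["factors"])
        if shared_systems or shared_factors:
            related.append(
                (past["id"], sorted(shared_systems), sorted(shared_factors))
            )
    return related
```

The shared systems and factors in each result can double as the label on the connection line drawn between the two stars.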
When people walk into the room, they should see:
"This is not a blame session. This is a mission briefing to understand the orbit of this outage in the context of all the others."
Step 3: Facilitate Retrospectives So Every Voice Becomes a Star
The wall is a backdrop. The real work is in how you talk about the incident.
A solid facilitation structure:
- Set the tone
  - No blame, no shaming.
  - Focus on systems, processes, and conditions.
- Reconstruct the shared story
  - Walk through the timeline.
  - Use the wall to connect this incident to past ones.
- Invite all voices
  - Actively ask: "What did we miss?" "What surprised you?" "What felt confusing?"
  - Make space for people outside the primary owning team.
- Identify lessons and convert them to actions
  - For each insight, ask: "So what?" and "Now what?"
  - Turn insights into concrete process improvements, e.g.:
    - runbooks updated
    - alerts tuned
    - ownership clarified
    - training created
Post-review, update the wall:
- Tag stars with symbols for lessons learned, actions completed, and open risks.
Step 4: Use the Accident Triangle to Watch the Skies for Early Warnings
The accident triangle (also known as the safety triangle) suggests that for every major incident, there are many more:
- near misses
- minor incidents
- unreported anomalies
On your planetarium, don’t only map the big outages. Also map:
- Minor alerts that self-resolved
- Partial degradations
- Customer reports that didn’t become full incidents
Use different shapes or colors:
- Large stars for major incidents
- Small stars for minor incidents
- Dots for near misses
Over time, you’ll see clusters where near misses orbit the same system. That’s where your next major outage is likely to appear.
Make a habit of asking in each review:
- "What near misses preceded this?"
- "Where else are we seeing similar weak signals?"
The accident triangle turns your planetarium into a predictive map, not just a memorial.
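The cluster-spotting you do by eye on the wall can be cross-checked with a simple count. This is a minimal sketch, assuming each event is a dictionary with hypothetical `system`, `kind`, and `day` keys; the 90-day window and the threshold of three are arbitrary starting points, and a flagged system is a prompt for a closer look rather than a forecast.

```python
from collections import Counter
from datetime import date, timedelta


def near_miss_hotspots(events, today, window_days=90, threshold=3):
    """List (system, count) for systems with many recent near misses or minor incidents."""
    cutoff = today - timedelta(days=window_days)
    counts = Counter(
        e["system"]
        for e in events
        if e["kind"] in ("near_miss", "minor") and e["day"] >= cutoff
    )
    hotspots = [(system, n) for system, n in counts.items() if n >= threshold]
    # Busiest systems first; ties broken alphabetically for a stable order.
    return sorted(hotspots, key=lambda item: (-item[1], item[0]))
```

Major incidents are deliberately left out of the count, since the goal is to find systems where small signals are piling up before anything large has happened.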
Step 5: Practice Framing Analysis – How You Tell the Story Changes the Future
Incidents aren’t just technical events; they’re stories we tell ourselves about what happened and why.
Framing analysis means intentionally examining how an incident is described:
- Is the narrative blame-focused ("Alice misconfigured…") or system-focused ("The process allowed a single unchecked change…")?
- Does the framing highlight heroics ("Bob saved the day at 3 AM") instead of resilience ("We improved automation so Bob never has to do that again")?
- Are we over-emphasizing rare edge cases and ignoring common structural problems?
On the wall, you can:
- Annotate stars with brief narrative labels (e.g., "The One Where CI Failed Us," "The Hidden Dependency in Payments").
- Periodically review these labels and ask: "What kind of organization do these stories say we are?"
Consciously reframing incidents helps you move from:
"Who broke it?" → "What allowed this to break in this way?"
Step 6: Apply SMART-FOCUS to Analyze Incidents Systematically
To go beyond gut feeling, use a structured lens like SMART-FOCUS:
Sociotechnical Model Analysis of Responses, Threats, Failures, Opportunities, Control, Utility, and Sustainability
For each major incident, step through:
- S – Sociotechnical Model: How did humans, tools, and org structure interact?
- R – Responses: How did detection, escalation, and mitigation actually happen?
- T – Threats: What external or internal threats were involved (traffic spikes, third-party failures, misalignment)?
- F – Failures: What specific technical and process failures occurred?
- O – Opportunities: What chances did we have to catch this earlier or reduce impact?
- C – Control: What controls existed? Were they bypassed, ignored, or insufficient?
- U – Utility: Did the systems and processes work as designed? Were they usable under stress?
- S – Sustainability: Are our fixes and processes sustainable over time, or are we adding fragile heroics?
Map SMART-FOCUS findings as icons or small sticky notes around each star. Over time, you’ll see recurring themes:
- repeated gaps in detection
- brittle manual controls
- unsustainable runbooks
This transforms the wall into a sociotechnical diagnostic tool, not just a technical record.
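If the sticky notes are also transcribed into your tooling, recurring themes can be tallied across incidents. The sketch below assumes findings are recorded as `(incident_id, lens, theme)` tuples, where `lens` is one of the SMART-FOCUS headings above and `theme` is a short free-text label; both the tuple shape and the function name are illustrative.

```python
from collections import Counter

LENSES = (
    "Sociotechnical Model", "Responses", "Threats", "Failures",
    "Opportunities", "Control", "Utility", "Sustainability",
)


def recurring_themes(findings, min_incidents=2):
    """Return {(lens, theme): incident_count} for themes seen in several incidents."""
    seen = set()
    for incident_id, lens, theme in findings:
        if lens not in LENSES:
            raise ValueError(f"unknown SMART-FOCUS lens: {lens!r}")
        seen.add((incident_id, lens, theme))   # count each incident at most once
    counts = Counter((lens, theme) for _, lens, theme in seen)
    return {key: n for key, n in counts.items() if n >= min_incidents}
```

Counting distinct incidents rather than raw notes keeps one heavily annotated outage from looking like a trend.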
Step 7: Close the Loop – From Constellations to Continuous Improvement
None of this matters if it doesn’t change how you operate.
Establish a continuous improvement loop tied to your planetarium:
- From incident to insight
  - Each major incident yields verified insights captured on the wall and in your tooling.
- From insight to prevention strategy
  - Translate insights into:
    - updated monitoring and alerting
    - improved deployment practices
    - clearer ownership and escalation paths
    - targeted training for on-call and engineering teams
- From strategy to practice
  - Track which improvements have shipped.
  - Mark stars where related improvements are live (e.g., a green ring around stars linked to completed actions).
- From practice back to signals
  - Watch the wall over the next quarter.
  - Are similar incidents still appearing in that constellation, or did the pattern change?
Your analog planetarium now supports a living, evolving learning system: every outage or near miss reshapes the sky.
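The "did the pattern change?" question can be given a rough numeric answer by comparing incident counts in one constellation before and after an improvement shipped. This is only a sketch: the function name, the list-of-dates input, and the 90-day window (standing in for "the next quarter") are assumptions, and with small counts the difference should be read as a hint, not proof.

```python
from datetime import date, timedelta


def pattern_changed(incident_days, shipped_on, window_days=90):
    """Compare incident counts in equal windows before and after a fix shipped."""
    before_start = shipped_on - timedelta(days=window_days)
    after_end = shipped_on + timedelta(days=window_days)
    before = sum(1 for d in incident_days if before_start <= d < shipped_on)
    after = sum(1 for d in incident_days if shipped_on <= d < after_end)
    return {"before": before, "after": after, "delta": after - before}
```

A negative `delta` suggests the constellation is quieting down; a flat or positive one is a cue to revisit the insight behind the fix.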
Bringing It All Together
The Analog Incident Story Trainboard Planetarium is more than a quirky wall decoration. It’s a:
- Data-driven map of your incident history
- Story surface for narrative and framing analysis
- Early warning system using the accident triangle
- Sociotechnical lens via SMART-FOCUS
- Continuous improvement engine that keeps teams aligned on learning, not blame
You don’t need expensive tools to build one:
- paper, markers, tape, sticky notes
- a blank wall
- a commitment to honest, structured reflection
In a world of complex, distributed systems, outages will happen. Your job isn’t to pretend they won’t—it’s to learn from every orbit, every star, every faint signal in the night sky.
Stand in front of that wall with your team. Look up at your own galaxy of incidents. Then ask together:
"What universe of failure are we living in—and how do we design a better one?"