The Analog Incident Story Trainboard Planetarium: A Wall of Paper Stars That Predict Your Next Outage Orbit

What if your incident data looked less like a spreadsheet and more like a night sky?

Imagine a wall in your office covered in paper stars, orbit lines, and hand-drawn constellations. Each star is an incident, each orbit a system, each constellation a pattern of recurring failure. It’s part timeline, part map, part story. This is your Analog Incident Story Trainboard Planetarium: a physical, highly visual way to understand how your systems fail, learn from those failures, and spot where the next outage is likely to occur before it happens.

In an era obsessed with dashboards and automation, this might sound quaint. But that’s the point. By going analog, you force the organization to slow down, think deeply, and see the patterns that digital tools often hide behind filters and charts.

This post walks through how to build and use an analog "planetarium" as a data-driven, structured approach to incident retrospectives and continuous improvement.

Why an Analog Planetarium for Incidents?

Outages rarely happen out of nowhere. They emerge from:

small near misses
weak signals in logs and alerts
unresolved tech debt
organizational blind spots

Most incident reviews focus on the last few hours before the outage. The planetarium forces you to zoom out and see:

long-term patterns across months and systems
systemic causes instead of one-off mistakes
organizational dynamics, not just technical faults

You’re not replacing digital tools. You’re complementing them with a tactile, narrative, pattern-seeking surface that everyone can gather around.

Step 1: Make It Data-Driven (Not Just Story-Driven)

The planetarium is not a feelings board. It’s a data-driven artifact that makes your incident history visible at a glance.

What goes on the wall?

Each incident gets a paper star with:

Date/time and duration
Systems / services impacted
Severity level
Customer impact (e.g., % of traffic affected)
Primary contributing factors (e.g., config error, capacity, dependency, human factors)
Detection source (alert, customer report, internal report)
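To keep every star tied to a record, it can help to fix the fields in one place before printing or hand-writing labels. The following is a minimal sketch; the class name `IncidentStar`, its field names, the severity scale, and the sample incident are all illustrative assumptions, not a schema from any particular incident tracker.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class IncidentStar:
    """One paper star. Field names are hypothetical, not a standard schema."""
    incident_id: str
    started_at: datetime
    duration: timedelta
    systems: tuple[str, ...]
    severity: int                      # assumed scale: 1 = most severe
    traffic_affected_pct: float        # customer impact, 0-100
    contributing_factors: tuple[str, ...]
    detection_source: str              # "alert", "customer report", "internal report"

    def label(self) -> str:
        """Short text to write on the paper star."""
        minutes = int(self.duration.total_seconds() // 60)
        return (
            f"{self.incident_id} | {self.started_at:%Y-%m-%d %H:%M} | "
            f"{minutes} min | SEV{self.severity} | "
            f"{', '.join(self.systems)} | {self.detection_source}"
        )


# Made-up example incident, for illustration only.
star = IncidentStar(
    incident_id="INC-001",
    started_at=datetime(2024, 3, 5, 14, 30),
    duration=timedelta(minutes=47),
    systems=("payments", "checkout"),
    severity=2,
    traffic_affected_pct=12.5,
    contributing_factors=("config error", "dependency"),
    detection_source="customer report",
)
print(star.label())
# INC-001 | 2024-03-05 14:30 | 47 min | SEV2 | payments, checkout | customer report
```

Making the record immutable (`frozen=True`) is a design choice: a star on the wall should match the incident record, so corrections are made by creating a new star rather than silently editing an old one.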

Place each star by time on the horizontal axis and by system or domain on the vertical axis. Alternatively, use a radial layout in which severity sets the distance from the “core” (your most critical systems). Over time, you’ll see galaxies of incidents emerge.

The key is that every star is backed by real data, pulled from your incident tracking systems. The wall is just the visualization layer.
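If you want consistent placement for the time-by-system layout, a small helper can turn a date and a system into wall coordinates. This is only a sketch: the function name `wall_position`, the wall width, and the row height are assumptions you would replace with your own measurements.

```python
from datetime import date


def wall_position(incident_date, system, systems, wall_start, wall_end,
                  wall_width_cm=300.0, row_height_cm=25.0):
    """Return (x_cm, y_cm) for a star: time runs left to right, one row per system.

    All dimensions are hypothetical defaults; measure your own wall.
    """
    span_days = (wall_end - wall_start).days
    if span_days <= 0:
        raise ValueError("wall_end must be after wall_start")
    if not wall_start <= incident_date <= wall_end:
        raise ValueError("incident_date falls outside the wall's time range")
    # Horizontal position is proportional to elapsed time.
    x = wall_width_cm * (incident_date - wall_start).days / span_days
    # Vertical position is the middle of the system's row.
    y = row_height_cm * (systems.index(system) + 0.5)
    return round(x, 1), round(y, 1)


rows = ["payments", "search", "auth"]  # made-up system names
pos = wall_position(date(2024, 2, 20), "search", rows,
                    wall_start=date(2024, 1, 1), wall_end=date(2024, 4, 10))
print(pos)  # (150.0, 37.5)
```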

Step 2: Prepare for Each Incident Review Like a Mission Briefing

The planetarium is most powerful when each incident review is deliberate, structured, and time-boxed.

Before the review, prepare:

Clear goals
- Are you trying to reduce repeat incidents in one system?
- Are you examining detection gaps?
- Are you trying to understand human/organizational factors?
Relevant data
- Incident timeline and metrics (latency, error rates, etc.)
- Historical incidents for the same system or failure mode
- Near misses and minor alerts related to this incident
Defined roles
- Facilitator: keeps time, maintains psychological safety, focuses discussion
- Scribe: captures insights and decisions on paper and in tools
- Domain experts: provide context for systems involved
- Observer(s): from other teams to broaden perspectives
Planetarium updates
- Add the new star(s) for the incident
- Mark related historical stars with a light outline or connection lines
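Finding which historical stars to outline can be done by hand, but a quick lookup helps when the history is long. The sketch below assumes each incident is a plain dictionary with `id`, `systems`, and `factors` keys; those key names and the sample records are hypothetical.

```python
def related_incidents(new, history):
    """Return past incidents that share a system or a contributing factor.

    Each result is (incident_id, shared_systems, shared_factors).
    """
    related = []
    for past in history:
        shared_systems = set(new["systems"]) & set(past["systems"])
        shared_factors = set(new["factors"]) & set(past["factors"])
        if shared_systems or shared_factors:
            related.append(
                (past["id"], sorted(shared_systems), sorted(shared_factors))
            )
    return related


# Made-up records, for illustration only.
history = [
    {"id": "INC-001", "systems": ["payments"], "factors": ["config error"]},
    {"id": "INC-002", "systems": ["search"], "factors": ["capacity"]},
    {"id": "INC-003", "systems": ["auth"], "factors": ["dependency"]},
]
new = {"id": "INC-004", "systems": ["payments", "checkout"], "factors": ["dependency"]}

links = related_incidents(new, history)
print(links)
# [('INC-001', ['payments'], []), ('INC-003', [], ['dependency'])]
```

Each returned tuple tells you which star to outline and what to write on the connecting line: the shared system, the shared factor, or both.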

When people walk into the room, the framing should be unmistakable:

"This is not a blame session. This is a mission briefing to understand the orbit of this outage in the context of all the others."

Step 3: Facilitate Retrospectives So Every Voice Becomes a Star

The wall is a backdrop. The real work is in how you talk about the incident.

A solid facilitation structure:

Set the tone
- No blame, no shaming.
- Focus on systems, processes, and conditions.
Reconstruct the shared story
- Walk through the timeline.
- Use the wall to connect this incident to past ones.
Invite all voices
- Actively ask: "What did we miss?" "What surprised you?" "What felt confusing?"
- Make space for people outside the primary owning team.
Identify lessons and convert them to actions
- For each insight, ask: "So what?" and "Now what?"
- Turn insights into concrete process improvements, e.g.:
  - runbooks updated
  - alerts tuned
  - ownership clarified
  - training created

Post-review, update the wall:

Tag stars with symbols for lessons learned, actions completed, and open risks.

Step 4: Use the Accident Triangle to Watch the Skies for Early Warnings

The accident triangle (also known as the safety triangle) suggests that for every major incident, there are many more:

near misses
minor incidents
unreported anomalies

On your planetarium, don’t only map the big outages. Also map:

Minor alerts that self-resolved
Partial degradations
Customer reports that didn’t become full incidents

Use different shapes or colors:

Large stars for major incidents
Small stars for minor incidents
Dots for near misses

Over time, you’ll see clusters where near misses orbit the same system. That’s where your next major outage is likely to appear.

Make a habit of asking in each review:

"What near misses preceded this?"
"Where else are we seeing similar weak signals?"

The accident triangle turns your planetarium into a predictive map, not just a memorial.
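The clusters are usually visible on the wall, but a simple count can confirm what you think you see. The sketch below tallies near misses and minor incidents per system and lists the systems above a threshold; the event kinds, the `min_signals` threshold, and the sample events are assumptions, and a high count is a prompt for investigation rather than a forecast.

```python
from collections import Counter


def weak_signal_hotspots(events, min_signals=3):
    """Rank systems by count of near misses and minor incidents.

    `events` is an iterable of (system, kind) pairs, where kind is one of
    "near_miss", "minor", or "major" (assumed labels).
    Returns (system, weak_signal_count, major_count) tuples, busiest first.
    """
    signals = Counter()
    majors = Counter()
    for system, kind in events:
        if kind in ("near_miss", "minor"):
            signals[system] += 1
        elif kind == "major":
            majors[system] += 1
    return [
        (system, count, majors[system])
        for system, count in signals.most_common()
        if count >= min_signals
    ]


# Made-up events, for illustration only.
events = [
    ("payments", "near_miss"), ("payments", "near_miss"),
    ("payments", "minor"), ("payments", "near_miss"),
    ("search", "near_miss"),
    ("auth", "minor"), ("auth", "minor"), ("auth", "major"), ("auth", "minor"),
]
hotspots = weak_signal_hotspots(events)
print(hotspots)  # [('payments', 4, 0), ('auth', 3, 1)]
```

In this made-up data, `payments` has four weak signals and no major incident yet, which is exactly the kind of cluster worth discussing before it turns into one.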

Step 5: Practice Framing Analysis – How You Tell the Story Changes the Future

Incidents aren’t just technical events; they’re stories we tell ourselves about what happened and why.

Framing analysis means intentionally examining how an incident is described:

Is the narrative blame-focused ("Alice misconfigured…") or system-focused ("The process allowed a single unchecked change…")?
Does the framing highlight heroics ("Bob saved the day at 3 AM") instead of resilience ("We improved automation so Bob never has to do that again")?
Are we over-emphasizing rare edge cases and ignoring common structural problems?
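A crude text check can surface candidate phrases for the first question before the review. The sketch below only looks for a short, hypothetical list of cue phrases (`BLAME_CUES` is an assumption, not a validated lexicon), so treat a match as a prompt to reread the sentence, not as a verdict on the author.

```python
# Hypothetical cue phrases; tune the list to your own write-ups.
BLAME_CUES = (
    "human error",
    "failed to",
    "forgot to",
    "should have",
    "misconfigured",
)


def blame_cues(text):
    """Return the cue phrases found in an incident description, in list order."""
    lowered = text.lower()
    return [cue for cue in BLAME_CUES if cue in lowered]


blame_version = "Alice misconfigured the load balancer and should have asked for review."
system_version = "The process allowed a single unchecked change to reach production."

print(blame_cues(blame_version))   # ['should have', 'misconfigured']
print(blame_cues(system_version))  # []
```

Simple substring matching will miss most blame framing and will flag some neutral sentences (a service can be "misconfigured" without anyone being blamed), which is why the output is best used as a conversation starter.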

On the wall, you can:

Annotate stars with brief narrative labels (e.g., "The One Where CI Failed Us," "The Hidden Dependency in Payments").
Periodically review these labels and ask: "What kind of organization do these stories say we are?"

Consciously reframing incidents helps you move from:

"Who broke it?" → "What allowed this to break in this way?"

Step 6: Apply SMART-FOCUS to Analyze Incidents Systematically

To go beyond gut feeling, use a structured lens like SMART-FOCUS:

Sociotechnical Model Analysis of Responses, Threats, Failures, Opportunities, Control, Utility, and Sustainability

For each major incident, step through:

S, M, A – Sociotechnical Model Analysis: How did humans, tools, and org structure interact?
R – Responses: How did detection, escalation, and mitigation actually happen?
T – Threats: What external or internal threats were involved (traffic spikes, third-party failures, misalignment)?
F – Failures: What specific technical and process failures occurred?
O – Opportunities: What chances did we have to catch this earlier or reduce impact?
C – Control: What controls existed? Were they bypassed, ignored, or insufficient?
U – Utility: Did the systems and processes work as designed? Were they usable under stress?
S – Sustainability: Are our fixes and processes sustainable over time, or are we adding fragile heroics?

Map SMART-FOCUS findings as icons or small sticky notes around each star. Over time, you’ll see recurring themes:

repeated gaps in detection
brittle manual controls
unsustainable runbooks

This transforms the wall into a sociotechnical diagnostic tool, not just a technical record.
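If the sticky notes are also transcribed, recurring themes can be counted rather than eyeballed. This sketch assumes each finding is recorded as a `(lens, theme)` pair per incident; the lens names mirror the list above, while the function name, the data layout, and the sample findings are illustrative assumptions.

```python
from collections import Counter

LENSES = (
    "sociotechnical model", "responses", "threats", "failures",
    "opportunities", "control", "utility", "sustainability",
)


def recurring_themes(findings, min_count=2):
    """Count (lens, theme) pairs across incidents and keep the repeated ones.

    `findings` maps incident id -> list of (lens, theme) pairs.
    Returns (lens, theme, count) tuples, most frequent first.
    """
    counts = Counter()
    for notes in findings.values():
        for lens, theme in notes:
            if lens not in LENSES:
                raise ValueError(f"unknown lens: {lens!r}")
            counts[(lens, theme)] += 1
    repeated = [(lens, theme, n) for (lens, theme), n in counts.items() if n >= min_count]
    # Sort by count (descending), then alphabetically for a stable order.
    return sorted(repeated, key=lambda item: (-item[2], item[0], item[1]))


# Made-up findings, for illustration only.
findings = {
    "INC-001": [("responses", "customer reported before alert fired"),
                ("control", "manual approval skipped")],
    "INC-002": [("responses", "customer reported before alert fired")],
    "INC-003": [("control", "manual approval skipped"),
                ("responses", "customer reported before alert fired"),
                ("sustainability", "runbook needs an expert to follow")],
}
themes = recurring_themes(findings)
for lens, theme, n in themes:
    print(f"{n}x {lens}: {theme}")
```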

Step 7: Close the Loop – From Constellations to Continuous Improvement

None of this matters if it doesn’t change how you operate.

Establish a continuous improvement loop tied to your planetarium:

From incident to insight
- Each major incident yields verified insights captured on the wall and in your tooling.
From insight to prevention strategy
- Translate insights into:
  - updated monitoring and alerting
  - improved deployment practices
  - clearer ownership and escalation paths
  - targeted training for on-call and engineering teams
From strategy to practice
- Track which improvements have shipped.
- Mark stars where related improvements are live (e.g., a green ring around stars linked to completed actions).
From practice back to signals
- Watch the wall over the next quarter.
- Are similar incidents still appearing in that constellation, or did the pattern change?

Your analog planetarium now supports a living, evolving learning system: every outage or near miss reshapes the sky.
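To check whether a constellation actually changed after an improvement shipped, you can compare incident counts in equal windows before and after the ship date. The sketch below assumes a 90-day window as a stand-in for "the next quarter"; the function name and the sample dates are hypothetical, and with counts this small the comparison is a discussion aid, not statistical evidence.

```python
from datetime import date


def before_after_counts(incident_dates, shipped_on, window_days=90):
    """Count incidents in the window before and the window after a ship date.

    Incidents on the ship date itself count as "after".
    """
    before = sum(
        1 for d in incident_dates
        if 0 < (shipped_on - d).days <= window_days
    )
    after = sum(
        1 for d in incident_dates
        if 0 <= (d - shipped_on).days < window_days
    )
    return before, after


# Made-up incident dates for one constellation, for illustration only.
constellation = [
    date(2024, 2, 1),    # more than 90 days before the ship date: ignored
    date(2024, 3, 15),
    date(2024, 5, 20),
    date(2024, 7, 10),
    date(2024, 10, 1),   # more than 90 days after the ship date: ignored
]
counts = before_after_counts(constellation, shipped_on=date(2024, 6, 1))
print(counts)  # (2, 1)
```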

Bringing It All Together

The Analog Incident Story Trainboard Planetarium is more than a quirky wall decoration. It’s a:

Data-driven map of your incident history
Story surface for narrative and framing analysis
Early warning system using the accident triangle
Sociotechnical lens via SMART-FOCUS
Continuous improvement engine that keeps teams aligned on learning, not blame

You don’t need expensive tools to build one:

paper, markers, tape, sticky notes
a blank wall
a commitment to honest, structured reflection

In a world of complex, distributed systems, outages will happen. Your job isn’t to pretend they won’t—it’s to learn from every orbit, every star, every faint signal in the night sky.

Stand in front of that wall with your team. Look up at your own galaxy of incidents. Then ask together:

"What universe of failure are we living in—and how do we design a better one?"