The Cardboard Reliability Observatory: Turning Your Weirdest Incidents into a Hands‑On Museum

How to build a low‑tech, high‑impact “Cardboard Reliability Observatory” that turns your team’s strangest incidents into a physical museum of learning, storytelling, and continuous improvement.

Introduction

Most teams treat incidents like something to quietly bury: write a postmortem, file a few tickets, move on. Six months later, the same kind of failure shows up again wearing a slightly different hat.

What if, instead, you turned your weirdest incidents into a museum exhibit?

The Cardboard Reliability Observatory is a simple, physical way to visualize, explore, and remember your team’s strangest and most instructive failures. It’s part workshop, part art project, part reliability lab — and it works precisely because it’s low‑tech, playful, and tangible.

In this post, we’ll walk through what a Cardboard Reliability Observatory is, why it works, and how to run one with your team to build a living library of edge cases and surprises.


What Is a Cardboard Reliability Observatory?

Think of a Reliability Observatory as a hands‑on museum of incidents:

  • Each exhibit is one real incident from your system’s history.
  • It’s built from cardboard, markers, tape, string, sticky notes, and whatever craft supplies you have lying around.
  • It visualizes how the incident unfolded: people, code, systems, dependencies, signals, timelines.
  • It invites interaction: you can walk around it, trace failure paths with your finger, add questions, point at weirdness.

Instead of a dry Confluence page that no one rereads, you get something like a science museum exhibit:

  • A model of the system slice involved in the incident
  • Annotated failure paths (“this request got stuck here for 17 minutes”)
  • Context cards about on‑call conditions, dashboards, and surprises
  • Tags for themes: observability gaps, dependency risk, operational load, UX confusion, etc.

The goal isn’t art. The goal is shared, embodied understanding of how your system – and your team – behaves under stress.


Why Turn Incidents into Museum Exhibits?

1. Storytelling beats static documents

Humans remember stories and physical experiences better than bulleted lists. When people gather around a cardboard model and hear a teammate say:

“Here’s the cache that started returning 500s, and here’s the undocumented failover path we didn’t know about.”

they build a shared narrative of what happened. That narrative becomes:

  • Easier to recall during future design discussions
  • Easier to explain to new team members
  • Easier to connect with other incidents (“wait, this looks a lot like the billing outage from last year…”)

2. Psychological safety around failure

Turning incidents into a museum exhibit sends a cultural signal:

“We put failures on display so we can learn from them, not to shame anyone.”

The playful, low‑stakes nature of cardboard and markers helps remove the sting. It’s hard to run a blameful witch‑hunt while someone is drawing a wobbly API gateway with a Sharpie.

This atmosphere encourages:

  • Engineers to admit uncertainty and knowledge gaps
  • Honest discussion of human factors: alerts, fatigue, distraction
  • Curiosity instead of defensiveness

3. Low‑tech beats paralysis

When you say “chaos engineering GameDay,” some teams hear:

  • Big investment
  • Fancy tooling
  • Risky experiments in production

When you say “we’re going to make cardboard models of our weirdest incidents,” the barrier to entry is much lower.

The Observatory is a gentle, low‑risk on‑ramp to structured reliability work:

  • No new tools to learn
  • No infrastructure changes
  • Just time, space, cardboard, and facilitation

From there, it’s easier to grow into more advanced practices.


How to Run a Cardboard Reliability Observatory Workshop

You can run this as a 2–3 hour workshop with 6–20 people. Here’s a concrete structure you can adapt.

1. Curate your “weirdest incidents”

In advance, pick 3–6 incidents that:

  • Were unusual or surprising (not just common capacity hiccups)
  • Had non‑obvious root causes or interactions
  • Involved both technical and human dynamics

Pull the existing retrospectives, timelines, graphs, Slack logs, and tickets for each. Your goal is to bring raw data, not polished stories.

2. Set the ground rules

Start the session by explicitly framing:

  • Blamelessness: We’re here to understand systems, not judge people.
  • Learning focus: The value is in the questions we surface and patterns we see.
  • Psychological safety: It’s okay to say “I don’t know” or “I don’t understand this part.”

Make this visible on a poster or whiteboard.

3. Split into incident teams

Divide participants into small groups (3–5 people) and assign each group one incident. Each group gets:

  • A printed incident summary and timeline
  • Graphs/log snippets if available
  • Cardboard, sticky notes, tape, markers, string

Their mission: build a museum exhibit for that incident.

4. Build the cardboard exhibit (45–60 minutes)

Give teams a structured prompt:

  1. Map the actors

    • Draw/label services, queues, databases, external providers
    • Add people/roles: on‑call, SRE, support, product
  2. Lay out the timeline

    • Mark key times: when it started, when it was detected, mitigation steps, resolution
    • Show detection sources: alerts, customer tickets, dashboards
  3. Trace the failure path

    • Use string or colored tape to show the path the failure took
    • Annotate with sticky notes: “unexpected retry storm,” “silent failure here,” “alert fired but was ignored”
  4. Highlight “weirdness”

    • Use a different color/sticker for surprising factors:
      • Hidden dependencies
      • Non‑intuitive config
      • Tooling or process gaps
      • Human factors (shift change, conflicting priorities, unclear ownership)
  5. Capture data‑driven insights

    • For each anomaly, note what data you have:
      • Metrics? Logs? Traces? Screenshots? Slack timestamps?
    • Note where data was missing or misleading.

The point isn’t accuracy down to the last microservice — it’s capturing the cognitive model the team had (or lacked) during the incident.

5. Museum walk and storytelling (45–60 minutes)

Once exhibits are ready, do a gallery walk.

For each incident:

  • The group has 8–10 minutes to tell the story using the exhibit:
    • What did we think was happening at first?
    • What actually happened?
    • What surprised us?
    • How did we eventually understand and fix it?
  • The rest of the group can ask clarifying and curiosity‑driven questions, not “gotchas.”

Encourage:

  • “If I’d been on call, I think I would have looked here first…”
  • “This looks similar to [other incident] — could the same thing happen again?”

This is where the museum format shines: people physically point at components, walk the failure path, and negotiate a shared mental model.

6. Extract cross‑cutting themes (30 minutes)

After the museum walk, regroup and ask:

  • What patterns showed up across incidents?
    • Repeated observability gaps?
    • Fragile dependencies?
    • Knowledge silos?
    • Alert design issues?
  • Where did data help us resolve faster?
  • Where did missing data force us to guess?

Capture themes on a whiteboard. This is your data‑driven retrospective of retrospectives.


Making It Data‑Driven, Not Just Decorative

A cardboard observatory is fun, but its real power comes from how you connect it back to concrete change.

Use structured incident questions

For each incident (and exhibit), systematically probe:

  • Detection
    • How was it detected?
    • What signals did we have? What was noisy or missing?
  • Diagnosis
    • What hypotheses did we test first? Why?
    • What data helped us rule things out?
  • Coordination
    • Who was involved? How did they communicate?
    • Were roles and ownership clear?
  • Resolution
    • What finally worked? Was it obvious or a shot in the dark?
  • Learning
    • What would have made this incident boring instead of weird?

These questions anchor the conversation in observable facts and behaviors, not opinions about competence.
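
If you want to reuse the same probes across exhibits, it can help to capture the answers in a small, structured checklist after each session. Here's a minimal sketch in Python; the field names and example answers are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

# Illustrative template for recording the structured incident questions
# after a workshop. Field names are a suggestion, not a required format.
@dataclass
class IncidentReview:
    incident_id: str
    title: str
    detection: dict = field(default_factory=dict)     # how it was detected, signal quality
    diagnosis: dict = field(default_factory=dict)      # hypotheses tested, data that helped
    coordination: dict = field(default_factory=dict)   # who was involved, clarity of ownership
    resolution: dict = field(default_factory=dict)     # what finally worked, was it obvious
    learning: list = field(default_factory=list)       # what would have made this incident boring

# Example usage: fill in what the group surfaced at the exhibit.
review = IncidentReview(
    incident_id="2024-03-cache-failover",
    title="Cache 500s and the undocumented failover path",
    detection={"signal": "customer tickets before alerts", "gap": "no alert on failover path"},
    diagnosis={"first_hypothesis": "bad deploy", "ruled_out_by": "deploy timeline vs. error onset"},
    coordination={"roles_clear": False, "note": "two people debugging the same dashboard separately"},
    resolution={"fix": "manual failback", "obvious": False},
    learning=["Document the failover path", "Alert on failover activation"],
)
```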

Prioritize follow‑ups

From the cross‑cutting themes, identify a small number of high‑leverage improvements, for example:

  • Instrument a new metric or trace to close a repeated blind spot
  • Simplify or document a fragile dependency path
  • Adjust alert thresholds or routing
  • Codify an escalation pattern that worked well

Turn these into clear, owned, time‑bounded actions, and track them like any other work.
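
One lightweight way to keep those actions honest is to record each one with an owner, a due date, and the theme it came from, then review the list at the next Observatory session. The sketch below is one possible shape, using hypothetical fields rather than any particular tracker's API.

```python
from dataclasses import dataclass
from datetime import date

# Illustrative record for an Observatory follow-up action.
# The fields (theme, owner, due) are assumptions, not a required format.
@dataclass
class FollowUpAction:
    theme: str          # cross-cutting theme the action came from
    description: str    # the concrete, bounded change
    owner: str          # a single accountable person
    due: date           # a real deadline, reviewed like any other work
    done: bool = False

actions = [
    FollowUpAction(
        theme="observability gap",
        description="Add a trace span around the cache failover path",
        owner="priya",
        due=date(2024, 7, 1),
    ),
    FollowUpAction(
        theme="alert design",
        description="Route failover alerts to the on-call channel, not email",
        owner="sam",
        due=date(2024, 7, 15),
    ),
]

# A simple review step: surface anything overdue at the next session.
overdue = [a for a in actions if not a.done and a.due < date.today()]
for a in overdue:
    print(f"OVERDUE ({a.owner}): {a.description}")
```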


From Cardboard to Continuous Improvement

The Observatory works best when it’s not a one‑off event.

Keep a living library

Designate a physical space (or a digital twin built from photos and diagrams) as your Reliability Observatory. Over time:

  • Add new exhibits for notable incidents
  • Update old exhibits when architectures change
  • Use them to onboard new team members: “Here are three incidents that shaped how we design this system.”

This transforms your incident history from “old tickets and PDFs” into a living knowledge base.
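
If you keep a digital twin, even a tiny script that indexes exhibit photos and notes into one browsable page goes a long way. The sketch below assumes a hypothetical `observatory/` folder with one subfolder per exhibit; adjust the layout to whatever your team actually uses.

```python
from pathlib import Path

# Build a simple Markdown index of the Observatory's digital twin.
# Assumed (hypothetical) layout:
#   observatory/<exhibit-name>/photos/*.jpg
#   observatory/<exhibit-name>/notes.md
ROOT = Path("observatory")
exhibits = sorted(p for p in ROOT.iterdir() if p.is_dir())

lines = ["# Reliability Observatory - Exhibit Index", ""]
for exhibit in exhibits:
    photos_dir = exhibit / "photos"
    photos = sorted(photos_dir.glob("*.jpg")) if photos_dir.exists() else []
    notes = exhibit / "notes.md"
    lines.append(f"## {exhibit.name}")
    lines.append(f"- Photos: {len(photos)}")
    lines.append(f"- Notes: {'yes' if notes.exists() else 'missing'}")
    lines.append("")

(ROOT / "INDEX.md").write_text("\n".join(lines))
print(f"Indexed {len(exhibits)} exhibits")
```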

Bridge to GameDays and chaos experiments

Once your team is comfortable with cardboard explorations, you can:

  • Convert past incidents into GameDay scenarios:
    • “Recreate this failure mode safely and see how we perform today.”
  • Use the identified weak spots to design targeted chaos experiments.

The Observatory provides a low‑anxiety starting point for more formal reliability practices.
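
When you're ready to turn an exhibit into a GameDay, it helps to write the scenario down in a structured form before anyone touches a system: what you'll inject, what signals you expect, and when to abort. The sketch below is a hypothetical, tool-agnostic scenario description, not tied to any particular chaos platform.

```python
from dataclasses import dataclass, field

# A hypothetical description of a GameDay scenario derived from a past
# incident exhibit. Names and values are illustrative only.
@dataclass
class GameDayScenario:
    source_incident: str                                   # the exhibit this scenario recreates
    hypothesis: str                                        # what we expect to happen today
    injection: str                                         # the safe failure we will introduce
    expected_signals: list = field(default_factory=list)   # alerts/dashboards that should fire
    abort_conditions: list = field(default_factory=list)   # when to stop immediately

scenario = GameDayScenario(
    source_incident="2024-03-cache-failover",
    hypothesis="Failover now fires an alert within 2 minutes and on-call knows the runbook",
    injection="Disable the primary cache in the staging environment",
    expected_signals=["cache-failover alert", "elevated p99 latency on the checkout dashboard"],
    abort_conditions=["customer-facing error rate above baseline", "no alert within 10 minutes"],
)

print(f"GameDay for {scenario.source_incident}: {scenario.hypothesis}")
```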

Invest in facilitation

The success of this approach depends heavily on good facilitation:

  • Keep blame out of the room
  • Ensure quieter voices contribute
  • Keep things moving from storytelling to learning to action

If you can, train a few reliability champions or SREs in facilitation skills so they can run these sessions regularly.


Conclusion

Incidents are expensive — not just in downtime, but in stress, lost sleep, and eroded trust. Failing to learn deeply from them is like burning money.

The Cardboard Reliability Observatory offers a different path:

  • Turn your weirdest incidents into tangible exhibits
  • Encourage storytelling, shared mental models, and psychological safety
  • Use data‑driven questions to turn anecdotes into actionable insight
  • Build a living library of edge cases that improves both system and human reliability over time

You don’t need a fancy chaos engineering platform to start. You need cardboard, markers, a couple of hours, and the willingness to put your failures on display — not as scar tissue to hide, but as artifacts to learn from.

Pick one incident. Grab some cardboard. Build your first exhibit. That’s your first step toward an Observatory that helps your team not just survive incidents, but grow from them.
