The Analog Incident Cabinet of Curiosities: Building a Tangible Museum of Tiny Outage Clues
Explore how treating incident artifacts—logs, screenshots, scribbled notes, and pager alerts—as a “cabinet of curiosities” can transform painful outage postmortems into a powerful, tangible system for learning and improving reliability.
What if every outage your team has ever survived lived in a museum?
Not as a bullet point in a slide deck, but as a physical exhibit: screenshots pinned to corkboards, pagers taped next to scribbled timelines, printouts of graphs frozen at the moment everything went sideways. A cabinet of curiosities for incidents—a tangible archive of how things broke, how you found out, and how you fought your way out.
This isn’t nostalgia. It’s about making the messy reality of incident response visible and learnable. When you treat incident artifacts as curiosities worth preserving and revisiting, you create a culture that:
- Values deep postmortems over blame
- Reduces the pain of reconstructing timelines
- Turns tiny clues into concrete reliability improvements
Let’s open the drawers of this cabinet, one curiosity at a time.
Drawer 1: The Pager That Wouldn’t Stop Buzzing
Every outage has a first sensory moment: the vibration of a phone, a Slack ping, a red light on a wallboard. That first alert is your entry ticket to the incident museum.
What this curiosity represents:
- How you detect problems
- How clearly your alerts describe what’s wrong
- How quickly the right people are notified
Questions to ask when you put the “pager” in the cabinet:
- Did the alert fire early enough to prevent user impact?
- Was it noisy, vague, or misleading?
- Did the alert route to the right on‑call person or team?
- Did context (runbooks, links, dashboards) accompany the alert?
Practical improvement:
Treat each alert as a design artifact. After an incident, adjust:
- Alert thresholds (too noisy? too late?)
- Routing rules (did it escalate correctly?)
- Attached context (dashboards, logs, runbooks)
In your cabinet, keep a printed copy or screenshot of the alert as it actually appeared. It’s a small object, but it captures the first moment of awareness—where reliability really begins.
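To make "alerts as design artifacts" concrete, here is a minimal Python sketch of an alert-hygiene check you might run after an incident. The record structure and field names (runbook_url, dashboard_url, route) are illustrative assumptions, not any particular alerting tool's schema.

```python
from dataclasses import dataclass


@dataclass
class AlertDefinition:
    """Illustrative alert record; field names are assumptions, not a real tool's schema."""
    name: str
    threshold: float
    route: str                  # on-call team or escalation policy the alert pages
    runbook_url: str = ""
    dashboard_url: str = ""


def lint_alert(alert: AlertDefinition) -> list[str]:
    """Return hygiene problems for one alert definition."""
    problems = []
    if not alert.runbook_url:
        problems.append("no runbook attached")
    if not alert.dashboard_url:
        problems.append("no dashboard attached")
    if not alert.route:
        problems.append("no routing target (who gets paged?)")
    return problems


# Usage: review the alerts that actually fired during the incident.
fired = [AlertDefinition(name="checkout-error-rate", threshold=0.05, route="payments-oncall")]
for alert in fired:
    for problem in lint_alert(alert):
        print(f"{alert.name}: {problem}")
```

Running a check like this over every alert that appears in the cabinet keeps "attach context" from being an aspiration and makes it a reviewable property of each alert.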
Drawer 2: The Scribbled Timeline on a Whiteboard
During a live outage, someone usually becomes the unofficial historian: jotting times, decisions, and observations on a whiteboard, a notepad, or a shared doc. This messy timeline is often the most accurate story of what actually happened.
Why it matters:
Postmortems depend on timelines, yet reconstructing them after the fact is painful:
- Logs are incomplete or rotated away
- Chat history is fragmented across channels
- People’s memories are biased and omit small but vital events
Your “scribbled timeline” curiosity is a reminder that:
- Capture during the incident is gold
- Even rough notes beat perfect reconstructions done days later
Make this curiosity work for you:
- Designate an incident scribe in your response roles
- Provide a simple template: Time – Action – Actor – Evidence
- Snap a photo of the whiteboard or export the shared doc right after resolution
When you print and archive these timelines in your cabinet, you build a visible sequence of “how we actually work under pressure.” That sequence is a direct input into improving on‑call rotations, tooling, handoff practices, and training.
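If you also want the Time – Action – Actor – Evidence template in a machine-readable form, a tiny append-only log is enough. The sketch below is a minimal example; the file name and field layout are assumptions, not a prescribed format.

```python
import csv
from datetime import datetime, timezone

FIELDS = ["time", "action", "actor", "evidence"]


def record(path: str, action: str, actor: str, evidence: str = "") -> None:
    """Append one Time - Action - Actor - Evidence row to the live incident timeline."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if f.tell() == 0:  # write the header only when starting a new timeline file
            writer.writeheader()
        writer.writerow({
            "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
            "action": action,
            "actor": actor,
            "evidence": evidence,
        })


# Usage by the scribe during the incident:
record("incident-timeline.csv", "Rolled back the checkout deploy", "dana", "link to deploy log")
record("incident-timeline.csv", "Error rate back to baseline", "scribe", "dashboard screenshot")
```

The point is not the tooling; it is that the scribe captures rows in the moment, so the postmortem starts from evidence rather than reconstruction.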
Drawer 3: The Graph Frozen at 03:17
Somewhere there’s a graph, log snippet, or tracing view that shows the exact moment things broke. Maybe it’s a traffic spike, a sudden rise in error rates, or a latency curve going vertical.
This curiosity captures diagnosis—how the team moved from “something’s wrong” to “this is probably it.”
Curator questions:
- How long did it take to find this view?
- Was the necessary dashboard already built, or made on the fly?
- Did you have the observability depth (metrics, logs, traces) to see the true cause?
Reliability improvements that stem from this drawer:
- Standardize a core dashboard set per service (golden signals: latency, traffic, errors, saturation)
- Add saved views for commonly suspected patterns: database saturation, cache misses, resource exhaustion
- Define an on‑call handbook with links to these key dashboards
Print the “outage graph” and stick it in your cabinet with notes like: “We found this 45 minutes in. Should have been 5.” That sentence alone drives better monitoring design and incident readiness.
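One way to standardize the core dashboard set is to generate it from a per-service template, one panel per golden signal. The sketch below assumes Prometheus-style metric names and a tool-agnostic output dictionary; both the metric names and the structure are illustrative, not a specific dashboard product's API.

```python
# Minimal sketch: a golden-signals panel set generated per service.
# Metric names follow Prometheus conventions but are assumptions, not a required schema.
GOLDEN_SIGNAL_PANELS = {
    "latency":    'histogram_quantile(0.99, rate({service}_request_duration_seconds_bucket[5m]))',
    "traffic":    'sum(rate({service}_requests_total[5m]))',
    "errors":     'sum(rate({service}_requests_total{{status=~"5.."}}[5m]))',
    "saturation": 'max({service}_worker_utilization)',
}


def dashboard_for(service: str) -> dict:
    """Return a tool-agnostic dashboard spec: one panel per golden signal."""
    return {
        "title": f"{service} golden signals",
        "panels": [
            {"title": signal, "query": query.format(service=service)}
            for signal, query in GOLDEN_SIGNAL_PANELS.items()
        ],
    }


# Usage: generate the same baseline view for every service in the on-call handbook.
print(dashboard_for("checkout"))
```

When the "45 minutes to find the graph" note goes into the cabinet, the remediation ticket can point at a template like this instead of a vague "improve dashboards".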
Drawer 4: The Chat Log and the Half‑Finished Runbook
Open another drawer and you find printed Slack transcripts, snippets of CLI commands, and a half‑completed runbook that someone followed until it stopped making sense.
This is the coordination drawer—where you see how humans actually collaborate during incidents.
What you’ll notice here:
- Repeated questions: “Who owns this?” “Can someone restart X?”
- Conflicting commands: two people trying different mitigations at once
- Missing or outdated runbook steps
Turn these artifacts into improvements:
- Update runbooks based on what people actually did, not what you wish they’d do
- Add clear ownership tags to services and dashboards
- Define explicit roles during incidents: incident commander, scribe, subject‑matter experts, communications lead
This drawer is where you see the gap between process on paper and process in reality—and that gap is where your next iteration of incident management practices comes from.
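Updating runbooks "based on what people actually did" can start with something as low-tech as mining the incident channel export for the commands responders pasted. The sketch below assumes a plain-text chat export and a convention that commands appear after a $ prompt or inside backticks; both assumptions are illustrative.

```python
import re

# Matches lines like "$ kubectl rollout undo ..." and inline `command` snippets.
SHELL_PROMPT = re.compile(r"^\s*\$\s+(.+)$")
BACKTICKS = re.compile(r"`([^`]+)`")


def commands_from_chat(export_path: str) -> list[str]:
    """Collect commands responders actually ran, to diff against the runbook later."""
    commands = []
    with open(export_path, encoding="utf-8") as f:
        for line in f:
            match = SHELL_PROMPT.match(line)
            if match:
                commands.append(match.group(1).strip())
            commands.extend(snippet.strip() for snippet in BACKTICKS.findall(line))
    return commands


# Usage: compare what was actually run against the runbook's documented steps.
# ran = commands_from_chat("incident-channel-export.txt")
# documented = set(open("runbook-steps.txt").read().splitlines())
# print([cmd for cmd in ran if cmd not in documented])
```

The diff between "commands people ran" and "commands the runbook lists" is a very direct measure of that gap between process on paper and process in reality.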
Drawer 5: The Postmortem with Margin Notes
Mature teams treat the postmortem itself as an artifact worthy of preservation. Not a checkbox document, but a working analysis that grows richer over time.
Open this drawer and you see printed postmortems covered in:
- Stakeholder comments
- Cost estimates
- Compliance notes
- Diagrams of cross‑team dependencies
Here, the incident story expands beyond the pure technical failure.
Depth you can add to postmortems:
- How it was detected: Which signal, by whom, at what time?
- How it was handled in real time: Key decisions, dead ends, coordination patterns
- What remediation was taken: Short‑term mitigations vs. long‑term fixes
- Who was impacted: Customers, internal teams, SLAs, and contracts
- Cost estimation: Lost revenue, reputational risk, internal thrash
- Dependencies and ownership: Which teams, vendors, or services were involved?
Each postmortem becomes a multi‑layered exhibit. Over time, your cabinet tells the story of your organization’s growing reliability maturity.
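If you want every exhibit in this drawer to reach the same depth, it can help to encode the sections above as a template. Here is a minimal sketch using a plain dataclass; the field names mirror the list above and are assumptions, not any particular postmortem tool's schema.

```python
from dataclasses import dataclass, field


@dataclass
class Postmortem:
    """Illustrative postmortem template; field names mirror the sections above."""
    title: str
    detected_by: str            # which signal, by whom, at what time
    realtime_handling: str      # key decisions, dead ends, coordination patterns
    remediation: list[str]      # short-term mitigations and long-term fixes
    impact: str                 # customers, internal teams, SLAs, contracts
    estimated_cost: str         # lost revenue, reputational risk, internal thrash
    dependencies: list[str] = field(default_factory=list)  # teams, vendors, services

    def missing_sections(self) -> list[str]:
        """Flag empty sections so a postmortem cannot quietly skip them."""
        return [name for name, value in vars(self).items() if not value]
```

A template like this does not make the analysis deep by itself, but an empty "estimated_cost" field is much harder to ignore than a section nobody thought to write.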
Drawer 6: The Human Side of On‑Call
Tucked in one drawer are calendars, rotation schedules, and a couple of tired‑looking selfies someone stuck there as a joke. This is the on‑call drawer—the human cost of keeping systems up.
Your curiosities here might include:
- A screenshot of a 3 a.m. alert storm
- A calendar showing someone on‑call 3 weeks in a row
- A survey result where engineers rate on‑call stress as “very high”
Use these clues to guide better reliability practices:
- Design sustainable rotations: limit consecutive weeks, ensure follow‑the‑sun where possible
- Build robust escalation trees so no one person is a single point of failure
- Introduce quiet hours for non‑urgent work when people are on‑call
- Add training and shadowing so new on‑call engineers aren’t learning in the middle of a critical outage
Your cabinet reminds you that reliability is not just about MTTR; it’s about burnout, resilience, and team health.
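A small analysis of the rotation itself can back these practices with numbers. Here is a minimal sketch that flags consecutive on-call weeks from a schedule; the schedule format (one engineer name per week, in order) is an assumption rather than an export from any scheduling tool.

```python
def consecutive_weeks(schedule: list[str], limit: int = 2) -> list[str]:
    """Flag engineers who are on-call more than `limit` weeks in a row.

    `schedule` is a list of engineer names, one entry per consecutive week
    (an illustrative format, not tied to a real scheduling tool).
    """
    warnings = []
    run_person, run_length = None, 0
    for week, person in enumerate(schedule, start=1):
        if person == run_person:
            run_length += 1
        else:
            run_person, run_length = person, 1
        if run_length > limit:
            warnings.append(f"{person} is on-call {run_length} weeks in a row as of week {week}")
    return warnings


# Usage: the third consecutive week trips the default limit of 2.
print(consecutive_weeks(["ana", "ana", "ana", "raj", "mei"]))
```

The tired selfie makes the point emotionally; a report like this makes it undeniable in a planning meeting.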
Drawer 7: The Remediation Roadmap
Finally, a drawer filled with sticky notes, Jira tickets, and architecture sketches. This is where each curiosity spawns change.
For every incident artifact, ask:
What will we build, change, or stop doing because of this?
Examples:
- A misleading alert → Rewrite alert messaging, adjust threshold, add runbook
- A missing dashboard → Create a standardized service dashboard template
- A confusing handoff → Formalize an incident commander role and training
- An overloaded single team → Revisit ownership boundaries and dependencies
The goal is simple: no curiosity without a consequence. Every object in the cabinet should trace to at least one improvement in how you build, deploy, monitor, or coordinate.
This is where your analog museum powers your digital future.
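The "no curiosity without a consequence" rule is easy to audit if each cataloged artifact records its linked remediation tickets. A minimal sketch follows, with the catalog structure and ticket IDs assumed for illustration rather than taken from a real ticketing system.

```python
# Each cabinet entry records what it is and which remediation work it produced.
# The structure and ticket IDs are illustrative assumptions.
cabinet = [
    {"artifact": "misleading checkout alert screenshot", "remediation_tickets": ["REL-101", "REL-102"]},
    {"artifact": "03:17 latency graph printout",         "remediation_tickets": []},
]


def orphaned_curiosities(entries: list[dict]) -> list[str]:
    """Return artifacts that never led to a tracked change: these need follow-up."""
    return [entry["artifact"] for entry in entries if not entry["remediation_tickets"]]


print(orphaned_curiosities(cabinet))  # -> ['03:17 latency graph printout']
```

Reviewing this list during the regular cabinet walkthrough keeps the museum from becoming a gallery of unresolved lessons.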
Bringing the Cabinet to Life
You don’t need a fancy office wall to start a cabinet of curiosities. You just need to treat your incident artifacts as first‑class learning tools.
Practical ways to begin:
- Create a physical or virtual “Incident Wall”
  - Pin screenshots, timelines, and postmortems on a physical board, or use a shared digital space.
- Standardize what you collect
  - First alert screenshot
  - Rough live timeline
  - Key graphs/logs/traces
  - Final postmortem
  - Remediation ticket list
- Review the cabinet regularly
  - Use it for on‑call training
  - Walk new hires through past incidents as guided tours
  - Revisit old curiosities to see if promised remediations actually happened
- Connect artifacts to practice changes
  - Every object should link to at least one improvement in tooling, process, or team design.
Conclusion: Curiosity as a Reliability Superpower
Outages will happen. The question is not whether you can avoid them entirely, but whether you can learn from them deeply enough that each one makes you meaningfully more resilient.
An analog incident cabinet of curiosities is a deliberate commitment to that learning. By preserving tiny clues—pager alerts, scribbled notes, frozen graphs, chat logs—you:
- Make incident history concrete and memorable
- Reduce the pain of retroactive timeline reconstruction
- Anchor your postmortems in real evidence, not fuzzy recollection
- Turn each failure into better tools, better processes, and better on‑call practices
When your organization walks past that cabinet—literal or virtual—they’re not just seeing war stories. They’re seeing proof that you take reliability seriously and that every outage is not just a setback, but another carefully cataloged curiosity helping you build a more robust future.