The Analog Incident Story Observatory Shelf: Turning Your On-Call Week into a Desk-Sized Constellation of Risks
How a simple, analog “observatory shelf” can transform stressful on-call work into a visible, shared constellation of risks, patterns, and learning signals—without adding cognitive overload.
Introduction: When Your Brain Becomes the Dashboard
If you’ve ever finished an on-call week and felt like you’d been hit by a slow-moving train, you’re not alone.
On-call work is tightly linked to fatigue, sleep disruption, and cognitive overload. Aviation, healthcare, and transportation have all developed regulation-focused guidance to manage fatigue safely. Tech is catching up, but we still largely treat on-call as a necessary inconvenience instead of a complex, human-centered system.
What if we treated your on-call week as something observable, not just survivable?
This post introduces a concept I’ll call the Analog Incident Story Observatory Shelf—a physical, desk-sized “constellation” of incidents, signals, and risks that turns your week on-call into a tangible system you can actually see and reason about.
It’s not a product. It’s a pattern: a way to represent incidents physically so that teams can see patterns across many events, not just the latest outage.
1. The Human Reality of On-Call: Not Just Pagers and Playbooks
Modern incident response is often framed as a tooling problem: better alerts, better dashboards, better runbooks. But research and decades of safety science say the quiet part out loud:
Human factors are central.
On-call response lives at the intersection of:
- Stress and arousal: Fight-or-flight responses shift how we perceive risk and time.
- Cognitive load: Juggling multiple alerts, Slack threads, logs, and dashboards taxes working memory.
- Group dynamics: Who speaks, who hesitates, and how we coordinate under pressure can matter more than the specific tool we’re using.
Regulated domains (like aviation) limit duty hours and treat fatigue like a safety hazard, not an individual weakness. Yet many engineering teams:
- Stack consecutive on-call weeks.
- Mix high-stakes incidents with regular project work.
- Rely on heroics and informal compensating behaviors.
Meanwhile, grey literature—blog posts, internal Google Docs, conference talks—offers a lot of guidance ("rotate weekly", "protect no-meeting days", "always have a backup"). But when you look for empirical evidence, the gap is obvious. Much of what we call “best practice” is really just hypotheses about what might work.
That’s an opportunity: treat your on-call setup as an experimentable system.
2. Incidents as Event-Driven Systems (Not Just Pages)
Under the hood, good incident response systems are event-driven. They don’t start with people; they start with event producers:
- Monitoring checks
- Application logs
- E-commerce events (checkout failures, abandoned carts)
- Customer tickets and email
- Even physical sensors in data centers or offices
These heterogeneous signals get normalized into a unified data model: alerts, incidents, tickets, or “events” in some central system. That normalization is critical. It’s what lets you:
- Correlate a spike in error rate with a recent deploy.
- See that a customer email is describing the same failure your logs show.
- Notice that a “minor” hardware sensor alarm always precedes a certain class of outage.
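To make "unified data model" concrete, here is a minimal sketch of what that normalization step might look like. Everything in it is hypothetical: the field names, the producer payloads, and the helper functions are illustrations, not the API of any particular incident tool.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class UnifiedEvent:
    """One normalized event, whichever producer emitted it."""
    source: str           # "monitoring", "logs", "ecommerce", "ticket", "sensor"
    service: str          # which system the event points at
    severity: str         # e.g. "info", "warning", "critical"
    occurred_at: datetime # when the producer saw it
    summary: str          # one human-readable line


def from_monitoring_alert(alert: dict) -> UnifiedEvent:
    """Map a hypothetical monitoring payload onto the unified shape."""
    return UnifiedEvent(
        source="monitoring",
        service=alert["service"],
        severity=alert["severity"],
        occurred_at=datetime.fromisoformat(alert["timestamp"]),
        summary=alert["title"],
    )


def from_support_ticket(ticket: dict) -> UnifiedEvent:
    """Map a hypothetical customer ticket onto the same shape."""
    return UnifiedEvent(
        source="ticket",
        service=ticket.get("suspected_service", "unknown"),
        severity="warning",
        occurred_at=datetime.fromisoformat(ticket["created_at"]),
        summary=ticket["subject"],
    )
```

Once checkout failures, sensor alarms, and customer emails all share one shape, correlating them by service and time becomes a sorting exercise instead of a memory exercise.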
Yet the people on-call mostly see this as a firehose of interruptions.
The point of an Observatory Shelf is to externalize all those events and their relationships so that:
- You stop treating each alert as an isolated annoyance.
- You start seeing patterns of signals, responses, and outcomes across an entire week.
3. Dashboards Under Stress: Why the UI You Have Isn’t the UI You Need
Designing good engineering dashboards for incident response is hard.
Most dashboards are optimized for monitoring, not decision-making under stress. They:
- Show dozens of graphs and metrics.
- Assume you’re calm, rested, and have time to explore.
- Are tuned for experts who already know where to look.
But during an incident, your brain is very different:
- Narrowed attention: You can’t parse 12 panels and a legend with 30 colors.
- Working memory limits: You can track a few moving parts, not a whole system.
- Pressure to act: Each second you spend looking feels like failure.
Effective incident dashboards:
- Show what matters right now, not every metric you have.
- Make it obvious when to escalate, roll back, or declare an incident.
- Provide simple, direct narratives ("this changed, then that changed") rather than forcing the responder to assemble the story.
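As a toy illustration of the "this changed, then that changed" idea, here is a sketch that turns a change event and a symptom into one sentence a tired responder can read at a glance. The event shapes and names are made up for the example.

```python
from datetime import datetime


def narrate(change: dict, symptom: dict) -> str:
    """Render a minimal "this changed, then that changed" storyline."""
    minutes_later = (symptom["at"] - change["at"]).total_seconds() / 60
    return (
        f"{change['at']:%H:%M} {change['what']}; "
        f"{minutes_later:.0f} min later, {symptom['what']}."
    )


print(narrate(
    {"at": datetime(2024, 5, 2, 14, 2), "what": "deploy of checkout-service v241"},
    {"at": datetime(2024, 5, 2, 14, 7), "what": "checkout error rate rose from 0.2% to 4%"},
))
# 14:02 deploy of checkout-service v241; 5 min later, checkout error rate rose from 0.2% to 4%.
```

The point is not the code; it is that someone (or something) assembles the story before the responder has to.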
Digital dashboards are crucial—but when they live entirely on-screen, they disappear the moment the incident closes.
The Analog Incident Story Observatory Shelf complements them by giving you:
- A persistent, physical representation of what happened and how.
- A way to re-experience the system without needing to replay logs or dashboards.
4. What Is an Analog Incident Story Observatory Shelf?
Imagine a shelf, whiteboard, or section of wall near the team’s workspace.
Over the course of an on-call week, the current responder adds small, tangible artifacts:
- Index cards for incidents or pages.
- Colored stickers for severity, affected systems, customer impact.
- Strings or arrows connecting related events.
- Sticky notes capturing key decisions ("rolled back deploy", "paged database team").
By Friday, your shelf looks like a constellation of risk stories:
- A cluster of minor alerts around a particular service.
- A line of related events eventually culminating in a major outage.
- A lonely card representing a weird, one-off event that felt scary but didn’t recur.
It’s low-tech on purpose. The constraints are features:
- You can’t capture everything, so you’re forced to prioritize what really mattered.
- The representation is shared and visible, not hidden in someone’s personal notes.
- The team can literally stand around it and see system behavior over time.
This is not about replacing your tooling. It’s about:
Turning every on-call week into a physical model of your incident system and its risks.
5. Signal Transformations: From Interruptions to Learning Channels
An incident isn't just a failure; it's a signal transformation:
- Inputs: conditions, events, context (traffic spikes, deploys, dependencies).
- Transformations: decisions, interventions, communications.
- Outputs: outcomes, side effects, new risks.
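If it helps to see that framing as data, a single transformation could be written down roughly like this. A sketch only: the type and field names are mine, not part of any tool or method cited here.

```python
from dataclasses import dataclass


@dataclass
class SignalTransformation:
    """One incident viewed as inputs -> transformations -> outputs."""
    inputs: list[str]         # conditions, events, context
    interventions: list[str]  # decisions, interventions, communications
    outputs: list[str]        # outcomes, side effects, new risks


example = SignalTransformation(
    inputs=["traffic spike", "checkout deploy 20 minutes earlier"],
    interventions=["rolled back deploy", "paged database team"],
    outputs=["error rate recovered", "new risk noted: rollback path is slow and manual"],
)
```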
During an on-call week, you experience many such transformations:
- That low-level disk alert that always resolves itself after a backup job.
- The 2 a.m. alarm you ignore because it’s “always noisy” (and one day, it isn’t).
- The quick Slack ping from support that reveals a blind spot in your monitoring.
The Observatory Shelf lets you:
- Capture these transformations visually.
  - Use arrows to show how one event led to another decision or incident.
  - Stack cards to show repeated patterns.
- Treat each channel of signal as a learning artifact, not just noise.
  - Monitoring: What did alerts not see in time?
  - Humans: Who noticed early? Who was confused?
  - Processes: Which runbooks helped, which ones got ignored?
- Review the constellation as a team.
  - Walk the timeline: "What changed here? Why did we believe this was safe?"
  - Ask: "Where were we lucky rather than good?"
Instead of a postmortem for one big outage, you get a multi-incident, multi-signal review that turns on-call into a structured learning channel.
6. On-Call Conventions as Hypotheses: Use the Shelf to Experiment
Remember that gap between guidance and empirical evidence?
- "One-week rotations are best."
- "Nobody should be on-call more than X hours."
- "Secondary on-call is enough backup."
These are claims, not laws of physics. Your context—team size, product risk, customer sensitivity—may make them wrong for you.
The Observatory Shelf is an experiment platform:
- Make a change to on-call.
  - Shorter rotations.
  - Explicit fatigue rules ("no deploys after midnight local time").
  - Required backup during known high-risk launches.
- Instrument the change visually.
  - Mark weeks where the new rule applied.
  - Track how many incidents, escalations, and handoffs occurred.
  - Note perceived fatigue and stress (simple 1–5 ratings on cards).
- Review patterns (a small tallying sketch follows this list).
  - Did incident outcomes change?
  - Did decision quality or handoff clarity improve?
  - Did people feel more or less burnt out?
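If you later transcribe the cards into anything digital, even a spreadsheet, comparing weeks with and without the new rule is a few lines of arithmetic. A sketch under the assumption that each row summarizes one week of shelf cards; the field names and the numbers are illustrative, not real data.

```python
from statistics import mean

# Each dict summarizes one on-call week's cards (illustrative numbers only).
weeks = [
    {"rule_applied": False, "incidents": 9, "escalations": 3, "fatigue": 4},
    {"rule_applied": False, "incidents": 7, "escalations": 2, "fatigue": 4},
    {"rule_applied": True, "incidents": 8, "escalations": 1, "fatigue": 3},
    {"rule_applied": True, "incidents": 6, "escalations": 1, "fatigue": 2},
]

for applied in (False, True):
    group = [w for w in weeks if w["rule_applied"] == applied]
    label = "with rule" if applied else "without rule"
    print(
        f"{label}: "
        f"avg incidents {mean(w['incidents'] for w in group):.1f}, "
        f"avg escalations {mean(w['escalations'] for w in group):.1f}, "
        f"avg fatigue (1-5) {mean(w['fatigue'] for w in group):.1f}"
    )
```

A handful of weeks won't settle anything statistically, but it turns the Friday review from vibes into numbers you can argue about.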
Treat what accumulates on your shelf as evidence, not decoration.
Over time, you can graduate some conventions from "rules of thumb" to "supported by our own data and stories".
7. How to Start Your Own Observatory Shelf
You don't need to redesign your incident process to try this. Start small:
- Pick a space
  - A shelf, a whiteboard, or a large poster near the team (or visible on video if remote).
- Define simple artifacts (a minimal digital sketch of a card follows this list)
  - One index card per incident or page.
  - A small set of tags: service, severity, time of day, who responded.
- Add as you go
  - Encourage the on-call engineer to add to the shelf immediately after each event.
  - Keep it fast: 1–2 minutes of writing and tagging.
- Run a weekly constellation review
  - 30 minutes at the end of the on-call week.
  - Walk through the shelf in chronological order.
  - Ask: "What did we learn? What surprised us? What should we test next week?"
- Adjust based on cognitive load
  - If adding to the shelf feels like too much during a crisis, back off.
  - Remember: This is meant to reduce overload, not add performative process.
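If you ever want a lightweight digital mirror of the cards, the tag set above fits in a few lines. This is a hypothetical sketch, not a required part of the practice; the shelf itself stays analog.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class ShelfCard:
    """One index card; keep it small enough to fill in within 1-2 minutes."""
    service: str           # tag: which system
    severity: str          # tag: e.g. "low", "medium", "high"
    occurred_at: datetime  # tag: time of day matters for fatigue patterns
    responder: str         # tag: who responded
    note: str              # one line: what happened, what was decided


card = ShelfCard(
    service="checkout",
    severity="medium",
    occurred_at=datetime(2024, 5, 3, 2, 10),
    responder="primary on-call",
    note="2 a.m. alarm; confirmed the nightly backup job was the cause.",
)
```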
Conclusion: From Pages to Patterns
On-call will never be easy—but it doesn’t have to be opaque.
By acknowledging that human factors drive incident outcomes, recognizing that our current practices are often untested hypotheses, and embracing the idea that every alert is part of a broader event-driven system, we can upgrade on-call from a blur of pages to a source of structured knowledge.
An Analog Incident Story Observatory Shelf is a deliberately simple tool:
- It turns an overwhelming week into a visible constellation of risks and signals.
- It shifts attention from isolated outages to systemic patterns.
- It creates space to experiment, observe, and refine your on-call conventions.
Most importantly, it helps ensure that the real dashboard during an incident—your team’s collective cognition—has the support it needs.
You don’t need to buy anything to start. Just give your incidents a place to live in the physical world, and see what new patterns emerge when you can finally stand back and look at the whole sky.