Cabinet of Shadows: Mapping the Outages Your Metrics Never See
Why near misses are your most valuable reliability data, why dashboards keep ignoring them, and how generative AI can help surface the hidden stories inside your systems before they turn into outages.
Every reliability team has one.
Not a dashboard. Not a postmortem doc. Something quieter.
A shared memory of “almost” incidents.
- The deploy that spiked error rates for 90 seconds, then mysteriously recovered.
- The database failover that took just long enough to scare everyone, but didn’t actually cause an outage.
- The feature flag mistake that should have broken production, but didn’t, thanks to a quirk in traffic patterns.
No customer tickets. No pager rotation disaster. No official post-incident review.
And almost always: no metrics that really tell the story.
These are your near misses. They live in what you might call a Cabinet of Shadows: a growing archive of almost-outages your graphs barely register—but your systems desperately want you to notice.
This post is about why those shadows matter, why traditional observability misses them, and how generative AI can help you surface and interpret the stories hiding between the lines of your metrics.
Near Misses: The Incidents Your System Is Whispering About
In safety-critical fields like aviation and industrial safety, there’s a powerful idea: near misses (or close calls).
A near miss is an unplanned event that could have caused harm or disruption, but didn’t.
OSHA and other safety disciplines treat these as gold:
- No one gets hurt.
- Nothing explodes.
- Operations resume.
Yet the event is still logged, analyzed, and often triggers corrective action.
Why? Because near misses tell you something simple and terrifying:
Your system is fragile in ways that haven’t hurt you—yet.
In software infrastructure, we have the same patterns:
- A misconfigured security group that didn’t lock out a critical service only because traffic was low.
- A cache node that almost hit max memory right before an automatic refresh.
- A Kafka partition that was seconds away from falling behind irrecoverably.
No official outage. No red status page. But the system raised its hand.
Why Dashboards Don’t See Your Near Misses
Traditional reliability metrics are optimized for one thing: detecting and quantifying harm that already occurred.
We design SLOs and dashboards around:
- Error rates
- Latency percentiles
- Saturation metrics
- Availability and uptime
This is necessary—but not sufficient.
The result:
- If the error rate spike self-recovers, it’s a “blip.”
- If the latency stays under the SLO, it’s “fine.”
- If customers don’t complain, it “didn’t happen.”
In other words, we treat “no damage, no outage” as a non-event.
In safety and reliability disciplines, that’s exactly backward.
What doesn’t break you today is often the clearest data about what will break you tomorrow.
Your dashboards underrepresent near misses because they’re tuned for thresholds and steady-state behavior, not for subtle deviations and fragile boundary conditions.
That’s how you end up with a shadow cabinet of incidents:
- Stories that live in Slack threads, ad-hoc Zoom calls, and hallway conversations.
- “Oh yeah, remember that time the queue almost exploded?”
- “We got lucky there; we should really fix that someday.”
If it’s not in a metric, ticket, or postmortem, it disappears into the shadows.
From Dashboards to Oscilloscopes: Seeing the Fine-Grained Signals
In electronics, if your signal is weird but not obviously broken, you don’t just stare at a single voltage reading. You reach for tools like:
- Oscilloscopes – to see time-varying signals in high resolution.
- Vectorscopes – to visualize complex color signals, like SMPTE color bars.
Why? Because complex systems exhibit subtle, patterned anomalies that a single number can’t express.
Consider SMPTE color bars on a vectorscope:
- On a plain monitor, it’s just a stack of colored rectangles.
- On a vectorscope, each color maps to a specific point or region.
- If one point is slightly off, you know exactly which component of the signal is drifting.
This is the kind of perspective we often lack in infrastructure:
- We see “latency under 200 ms” and call it good.
- We don’t see that certain paths, tenants, or regions are consistently skewed.
- We miss patterns of near misses because we lack a “vectorscope for our systems.”
What we need isn’t just more metrics; we need better ways to inspect fine-grained behavior:
- Correlated traces that show “almost” resource exhaustion.
- High-resolution time windows during deploys, failovers, and load shifts.
- Anomaly visualizations that highlight small-but-persistent deviations.
Near misses often look like tiny anomalies in complex signals. The job is not to scream every time a blip appears—but to map where and how the system repeatedly brushes against the edge of failure.
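To make that concrete, here is a minimal sketch of a "persistence" detector: it ignores one-off blips but flags a signal that sits slightly above its baseline for many consecutive windows, even though no threshold is ever breached. The tolerance and window counts are illustrative assumptions, not tuned values.

```python
# A minimal sketch: flag small-but-persistent deviations that a threshold
# alert never sees. Baseline, tolerance, and window counts are illustrative.

def persistent_deviation(samples, baseline, tolerance=0.10, min_windows=12):
    """Return True if `samples` stays more than `tolerance` (fractional)
    above `baseline` for at least `min_windows` consecutive points."""
    streak = 0
    for value in samples:
        if value > baseline * (1 + tolerance):
            streak += 1
            if streak >= min_windows:
                return True
        else:
            streak = 0
    return False


# Example: p95 latency (ms) per 1-minute window. It never breaches a 200 ms
# SLO, but hovers ~15% above its 120 ms baseline for a sustained stretch.
p95_latency = [118, 122, 139, 141, 140, 138, 142, 139, 141, 140, 138, 142, 139, 141]
print(persistent_deviation(p95_latency, baseline=120))  # True: quiet drift worth a look
```

The point is not that this particular heuristic is right; it's that "close to the edge, repeatedly" is a computable question your current alerts probably never ask.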
Mapping the Cabinet of Shadows: Treat Near Misses as First-Class Incidents
To bring your shadow incidents into the light, you need both process and tooling changes.
1. Declare Near Misses as Real Incidents
Create an explicit category:
- Severity N (Near Miss) – an event that could have caused customer or business impact but didn’t, this time.
Trigger a lightweight investigation when you see:
- Abnormal but self-recovering error/latency spikes.
- Resource saturation that auto-scales “just in time.”
- Manual interventions that “save” a system from cascading failure.
The goal is not to over-bureaucratize every blip. It’s to build a culture where people can say:
“We got lucky. Let’s understand why.”
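One way to nominate candidates without burying the team in paperwork is a simple heuristic over metrics you already collect. The sketch below is illustrative, assuming per-minute error-rate samples and made-up thresholds: it finds excursions that crossed an alerting line but recovered on their own before anyone would have paged.

```python
# A rough heuristic for nominating "Severity N" candidates: an error-rate
# spike that crossed a threshold but recovered by itself within a short
# window. Sample format and thresholds are assumptions for illustration.

def find_self_recovering_spikes(error_rates, threshold=0.05, max_duration=5):
    """error_rates: list of (minute, error_fraction) samples, one per minute.
    Returns spikes that exceeded `threshold` but recovered within
    `max_duration` minutes -- the ones a paging alert never fires for."""
    spikes, start = [], None
    for minute, rate in error_rates:
        if rate > threshold and start is None:
            start = minute
        elif rate <= threshold and start is not None:
            duration = minute - start
            if duration <= max_duration:
                spikes.append({"start": start, "duration_min": duration})
            start = None
    return spikes


samples = [(0, 0.01), (1, 0.01), (2, 0.09), (3, 0.08), (4, 0.01), (5, 0.01)]
print(find_self_recovering_spikes(samples))
# [{'start': 2, 'duration_min': 2}] -- a two-minute blip nobody wrote down
```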
2. Log the Story, Not Just the Numbers
Near misses are rarely convincing if you only capture metrics. You need narrative:
- What was happening in the system at the time?
- What did people notice first? What did they ignore?
- Which actions seemed to change the outcome?
- What could have gone differently with slightly more load, delay, or failure?
These stories:
- Reveal hidden dependencies not represented in diagrams.
- Surface tacit knowledge (“everyone knows you never restart that service at 9 AM”).
- Highlight organizational fragility (single-person heroics, unclear runbooks).
In other words, the story is the data.
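It also helps to give those stories a consistent, minable shape. Here is one possible record structure; the fields simply mirror the questions above, and the names are illustrative rather than any standard schema.

```python
# One possible shape for a near-miss record that keeps the narrative next to
# the numbers. Field names are illustrative, not a standard.

from dataclasses import dataclass, field


@dataclass
class NearMissRecord:
    title: str
    timeline: str              # what was happening in the system at the time
    first_noticed: str         # what people saw first, and what they ignored
    actions_taken: list[str]   # which interventions seemed to change the outcome
    counterfactual: str        # what slightly more load, delay, or failure would have done
    hidden_dependencies: list[str] = field(default_factory=list)
    tacit_knowledge: list[str] = field(default_factory=list)


record = NearMissRecord(
    title="Queue almost fell behind during Tuesday backfill",
    timeline="Backfill overlapped with a deploy; consumer lag climbed for 40 minutes.",
    first_noticed="Lag graph looked 'a bit high'; nobody paged because the SLO held.",
    actions_taken=["Paused the backfill manually", "Scaled consumers by hand"],
    counterfactual="With 10% more traffic, lag likely becomes irrecoverable.",
    tacit_knowledge=["'Everyone knows' not to run backfills during deploys"],
)
```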
Generative AI as a Signal Interpreter for Incident Stories
As generative AI becomes more common in engineering workflows, it’s not just a chatbot you ask for code samples. It’s also a pattern recognizer for narratives.
Your organization already produces:
- Slack and Teams threads during “spicy” deploys
- Ad-hoc incident docs
- PagerDuty/On-Call notes
- PR comments and commit messages
Most of these contain near-miss stories—but they remain unstructured and unanalyzed.
Generative AI can help you:
1. Surface Hidden Near Misses
Using conversational and document data (with appropriate privacy and access controls), AI can:
- Cluster similar events: “deploys involving Service X that required a manual rollback.”
- Flag recurring phrases: “almost,” “luckily,” “good thing we…,” “could have been bad.”
- Suggest candidate near misses that were never formally logged.
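As a starting point, even a crude phrase scan over exported chat history can produce a surprisingly useful candidate list before any model is involved. The sketch below assumes a simple message format and a hand-picked phrase list; a real pipeline would hand these candidates to an LLM or a human reviewer rather than treating matches as verdicts.

```python
# A minimal sketch of near-miss "phrase mining" over exported chat messages.
# The message format and phrase list are assumptions, not a vetted taxonomy.

import re

NEAR_MISS_PHRASES = [
    r"\balmost\b", r"\bluckily\b", r"\bgood thing\b",
    r"\bcould have been bad\b", r"\bclose call\b", r"\bgot lucky\b",
]
PATTERN = re.compile("|".join(NEAR_MISS_PHRASES), re.IGNORECASE)


def candidate_near_misses(messages):
    """messages: iterable of dicts like {"channel": ..., "ts": ..., "text": ...}.
    Yields messages whose wording suggests an unlogged near miss."""
    for msg in messages:
        if PATTERN.search(msg.get("text", "")):
            yield msg


export = [
    {"channel": "#deploys", "ts": "2024-05-02T09:14",
     "text": "good thing we rolled back before the cache filled up"},
    {"channel": "#deploys", "ts": "2024-05-02T09:20", "text": "ship it"},
]
for hit in candidate_near_misses(export):
    print(hit["channel"], hit["ts"], "->", hit["text"])
```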
2. Interpret and Reframe Incident Narratives
AI can read through your incident write-ups and:
- Extract common contributing factors (e.g., “unowned dependency,” “manual failover,” “missing alert”).
- Map them onto emerging theoretical models of complex systems and safety (e.g., drift into failure, local rationality, sharp-end/blunt-end dynamics).
- Propose alternative framings:
- From “we made a dumb mistake” to “the system made it easy to do the wrong thing.”
- From “lucky save” to “repeatable pattern of narrowly avoided failure.”
This isn’t about replacing human judgment; it’s about having a co-analyst that:
- Doesn’t get tired of reading 200 Slack messages.
- Can connect distant dots across months of operational history.
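In practice, the co-analyst is mostly a prompt plus plumbing. The sketch below uses a placeholder `complete()` function standing in for whichever model client your organization uses; the prompt wording is one illustrative framing, not a recipe.

```python
# A sketch of the "co-analyst" step: extract contributing factors and a
# reframed narrative from a near-miss write-up. `complete()` is a stand-in
# for your actual LLM client; the prompt text is illustrative.

EXTRACTION_PROMPT = """You are helping a reliability team analyze a near miss.
Given the write-up below:
1. List contributing factors (e.g., unowned dependency, manual failover, missing alert).
2. Reframe any "someone made a dumb mistake" language as a statement about what
   the system made easy or hard to do.
3. Note whether this looks like a repeatable pattern rather than a one-off lucky save.

Write-up:
{writeup}
"""


def complete(prompt: str) -> str:
    """Placeholder for your LLM client call (OpenAI, Anthropic, a local model, ...)."""
    raise NotImplementedError("wire this to your model endpoint")


def analyze_near_miss(writeup: str) -> str:
    return complete(EXTRACTION_PROMPT.format(writeup=writeup))
```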
3. Design Better Views of Subtle Anomalies
Just as oscilloscopes and vectorscopes gave engineers new visual grammars for understanding signals, AI can help design better visualizations of near-miss patterns:
- Overlaying timelines of deployments, config changes, and resource saturation.
- Highlighting “danger zones” where multiple weak signals co-occur.
- Prototyping dashboards focused on fragility indicators, not just KPI health.
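As a rough illustration of what such a view might look like, here is a small matplotlib sketch over synthetic data: one fragility signal (cache memory saturation) overlaid with discrete events (deploys, a manual override), with the window where they co-occur shaded as a near-miss zone. Signal names and numbers are invented for the example.

```python
# A small sketch of the "vectorscope" idea: overlay a saturation curve with
# event markers so near-miss clusters are visible at a glance. Synthetic data.

import matplotlib.pyplot as plt
import numpy as np

minutes = np.arange(0, 180)
memory_pct = 60 + 25 * np.exp(-((minutes - 95) ** 2) / 400)  # climbs toward ~85%
deploys = [30, 90]            # deploy timestamps (minutes)
manual_overrides = [100]      # someone "saved it" by hand

fig, ax = plt.subplots(figsize=(9, 3))
ax.plot(minutes, memory_pct, label="cache memory %")
ax.axhline(90, linestyle="--", color="red", label="hard limit")
for t in deploys:
    ax.axvline(t, color="gray", alpha=0.5)
for t in manual_overrides:
    ax.axvline(t, color="orange", alpha=0.8)
# Shade the window where saturation and a deploy co-occur: a near-miss zone.
ax.axvspan(85, 110, color="red", alpha=0.1)
ax.set_xlabel("minutes")
ax.set_ylabel("percent")
ax.legend(loc="upper left")
plt.tight_layout()
plt.show()
```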
Practical Steps: Bringing Light to the Shadows
You don’t need a full AI platform to start. You can begin mapping your Cabinet of Shadows with incremental steps:
1. Name and track near misses.
   - Add a “Near Miss” label or severity.
   - Encourage engineers to log brief summaries, even if there was “no impact.”
2. Collect the stories.
   - Store Slack incident channels and meeting notes in a searchable place.
   - Ask one question after every spicy event: “If this had gone 10% worse, what would have happened?”
3. Experiment with AI-assisted analysis.
   - Use generative AI to summarize weeks of operational chatter into recurring themes.
   - Ask it: “What near-miss patterns do you see across these incidents?”
4. Build a ‘vectorscope’ view.
   - Identify a handful of signals that correlate with near misses: deployment frequency, partial failures, manual overrides.
   - Visualize them together during known tricky windows (e.g., big launches, backfills, migrations).
5. Close the loop.
   - For each significant near miss, trace back to design, process, or culture:
     - Can we change defaults so the “wrong” action is harder?
     - Can we make the danger more visible earlier?
     - Can we make it easier to report “we got lucky” without blame?
Conclusion: Listening to What Didn’t Happen
Outages are loud. Near misses are quiet.
Dashboards are great at telling you when you already lost. They’re much worse at telling you how close you came to losing—and how often.
By treating near misses as first-class data, adopting tools and views more like oscilloscopes and vectorscopes, and using generative AI to help surface and interpret the rich narratives around “almost incidents,” you can finally start mapping your Cabinet of Shadows.
You don’t reduce fragility by staring harder at your uptime chart. You reduce it by listening carefully to the incidents that never quite happened—and acting as if they did.
The shadows are already full of stories. The question is whether you’ll read them before the next one steps into the light.