Rain Lag

The Analog Incident Story Cabinet of Shadows: Mapping the Outages Your Metrics Never See

Why your reliability looks great on dashboards while hiding a cabinet full of shadow incidents, protection failures, and near-misses you never measure, and how to map these invisible outages and turn firefighting into proactive risk management.

Modern systems are full of ghosts.

On your dashboards, the graphs look clean: uptime is high, error rates are low, SLAs are green. But in the corners of incident channels, private chats, and hallway conversations, there’s a different story: near-misses, misconfigured protections, failed failovers that “magically” fixed themselves, and recurring weirdness nobody has time to fully investigate.

This is your Cabinet of Shadows—the analog, messy, human record of outages and failures that your metrics never see.

In this post, we’ll explore why traditional reliability metrics miss this shadow zone, how hidden protection failures distort your understanding of risk, and what it looks like to systematically map these invisible outages into a usable, visual model of system risk.


The Problem with Two-State Outage Models

Most reliability frameworks rest on a simple idea: systems are either up or down. This binary view underpins common measures such as uptime percentage, mean time to failure (MTTF), and SLAs.

But real systems aren’t binary; they’re layered and contingent. Between “fully healthy” and “completely down” lies a spectrum:

  • Components that appear healthy but have silent protection failures
  • Redundant paths that no longer fail over correctly
  • Miscalibrated alerts that don’t fire when they should
  • Automation that only works under certain load conditions

Traditional two-state models:

  • Treat the system as either functioning or failed
  • Assume protections either work or don’t, with no nuance
  • Rarely model partial failures or hidden protection degradation

The result: your reliability looks better on paper than it is in reality. The model assumes protections are in place and effective. The system you’re actually running doesn’t always match that assumption.
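The gap between the two views can be made concrete with a toy calculation. This is a minimal sketch, not a measurement: the state names and time shares below are illustrative assumptions.

```python
from enum import Enum

class SystemState(Enum):
    HEALTHY_PROTECTED = "healthy, protections verified"
    HEALTHY_DEGRADED = "healthy, a protection silently failed"
    PARTIAL_FAILURE = "degraded performance or scope"
    FULL_OUTAGE = "complete outage"

# Assumed fraction of time spent in each state over a quarter.
time_share = {
    SystemState.HEALTHY_PROTECTED: 0.80,
    SystemState.HEALTHY_DEGRADED: 0.18,   # invisible on an up/down dashboard
    SystemState.PARTIAL_FAILURE: 0.015,
    SystemState.FULL_OUTAGE: 0.005,
}

# A binary model counts everything except a full outage as "up".
binary_uptime = 1.0 - time_share[SystemState.FULL_OUTAGE]

# A layered model asks a different question: how often were we
# actually running with all protections intact?
fully_protected = time_share[SystemState.HEALTHY_PROTECTED]

print(f"binary uptime:        {binary_uptime:.1%}")
print(f"fully protected time: {fully_protected:.1%}")
```

With these made-up numbers, the binary model reports 99.5% uptime while the system spent a fifth of the quarter exposed to a single further failure.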


The Hidden Layer: Protection Failures You Don’t See

When you design a system, you add layers of protection:

  • Load balancers
  • Rate limiters
  • Circuit breakers
  • Retry logic
  • Backup routes or regions
  • Feature flags and kill switches

We often model these as “if X fails, Y takes over.” But what if Y is already broken and you don’t know it?

These are hidden protection failures:

  • A failover route that hasn’t been updated since the last migration
  • A backup job that appears “successful” but silently skips key data
  • An alerting rule that’s been disabled “temporarily” and never re-enabled
  • A circuit breaker that never actually trips because it’s misconfigured

To understand the real reliability of your system, you need to estimate the probability that protections are already failed at the moment you’re observing the system.

That’s non-trivial:

  • You rarely have direct metrics that say, “this safety net is currently broken.”
  • Failures may be latent for weeks or months before being revealed.
  • The probability of hidden failure changes over time—especially after deployments, config changes, or partial incidents.

Most organizations simply don’t model this. As a result, dashboards implicitly assume: “If we haven’t seen it fail recently, it’s fine.” That assumption is often wrong.
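One crude but useful counter to that assumption is to model the probability that a protection is already broken as a function of time since it was last verified. The sketch below assumes failures arrive at a constant rate (an exponential model) and that a successful drill resets the clock; the rate parameter is an assumption you would estimate from your own drill history.

```python
import math

def p_latent_failure(days_since_verified: float, mean_days_to_break: float) -> float:
    """P(a protection is silently broken), assuming a constant failure
    rate and that the last successful test reset the clock.
    mean_days_to_break is an assumed parameter, e.g. estimated from drills."""
    rate = 1.0 / mean_days_to_break
    return 1.0 - math.exp(-rate * days_since_verified)

# Illustrative: a failover path whose config drifts, on average, every 300 days.
for days in (30, 90, 180):
    print(f"{days:>3} days since last drill -> "
          f"P(broken) ~ {p_latent_failure(days, 300):.0%}")
```

Even this simplistic model makes one thing visible that dashboards hide: the longer a safety net goes unverified, the more likely "it's fine" is already false.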


Time-Dependent State Probabilities: Reliability Is a Moving Target

Systems aren’t just in one of a few discrete states; they move through them over time. More importantly, the probability of being in each state evolves as:

  • Code and infra change
  • Traffic patterns shift
  • Operators learn, automate, and occasionally break things

Conceptually, your system has a set of states, for example:

  • Healthy and fully protected
  • Healthy but with one or more protections silently degraded
  • Partially failed (e.g., degraded performance, limited regions)
  • Fully failed (outage)

Each state has a time-dependent probability:

  • Just after a deployment, your probability of hidden misconfiguration might spike.
  • After a major incident, you may temporarily improve protections, lowering some risks.
  • As time passes, entropy creeps in—configs drift, dependencies update, assumptions age—pushing probabilities back up.

Yet most standard metrics:

  • Report point-in-time values (e.g., today’s uptime)
  • Ignore latent risk and hidden failures
  • Treat history as a series of independent snapshots, not a continuous evolution

Without modeling these time-dependent probabilities, you miss patterns like:

  • “Our risk of protection failure climbs significantly within two weeks of each large deployment.”
  • “Our backups are statistically more likely to be misconfigured within 90 days of a major infra change.”

You don’t need a PhD in stochastic processes to gain value here. The key is to stop treating reliability as static and acknowledge that risk is dynamic.
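A lightweight way to make "risk is dynamic" operational is a small Markov-style model: start from a state distribution, apply assumed daily transition probabilities, and watch how the hidden-degradation probability evolves after a deployment. Every number below is an illustrative assumption, not data.

```python
# States: 0 = healthy + protected, 1 = silently degraded, 2 = outage.
# transitions[i][j]: assumed daily probability of moving from state i to j.
transitions = [
    [0.97, 0.02, 0.01],  # drift slowly erodes protections
    [0.10, 0.85, 0.05],  # degraded state is riskier and rarely noticed
    [0.90, 0.05, 0.05],  # outages get fixed, usually back to protected
]

def step(dist: list[float]) -> list[float]:
    """Advance the state distribution by one day."""
    return [
        sum(dist[i] * transitions[i][j] for i in range(3))
        for j in range(3)
    ]

# Day 0: just after a deployment, we believe everything is healthy.
dist = [1.0, 0.0, 0.0]
for day in range(14):
    dist = step(dist)

print(f"after two weeks: P(silently degraded) ~ {dist[1]:.1%}")
```

The point is not the specific numbers but the shape: starting from "everything is fine," the probability of hidden degradation climbs steadily until something (a drill, an incident, a test) resets it.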


The Shadow Zone: Near-Misses and Misoperations That Never Hit Dashboards

In every resilient system, there’s an invisible layer of incidents that didn’t quite become incidents:

  • A deployment that was rolled back just in time
  • A database that hit 95% capacity before someone noticed by chance
  • A misconfigured firewall rule caught by a senior engineer in review
  • A failover test that exposed serious gaps—but never affected customers

These are near-misses and protection misoperations. They:

  • Don’t cause visible outages
  • Rarely show up in error-rate or latency metrics
  • Often get handled informally, without tickets or post-incident analysis

Yet they are direct evidence of:

  • Hidden vulnerabilities
  • Failing or degraded protections
  • Organizational blind spots in review, testing, or change management

This entire shadow zone is usually absent from reliability reports and dashboards. You get:

  • A clean uptime number
  • An incident count that looks manageable
  • No clear signal of how close you are to catastrophic failure

Your Cabinet of Shadows is real, but it lives in:

  • People’s memories
  • DMs and side-channels
  • Unshared runbooks and private notes

Mapping this space is the first step toward understanding your true reliability.
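Mapping starts with making shadow stories recordable in a shared, structured form. A minimal sketch of such a record (the field names, enum values, and example event are all hypothetical):

```python
from dataclasses import dataclass, field
from datetime import date
from enum import Enum

class EventKind(Enum):
    OUTAGE = "outage"
    PARTIAL_DEGRADATION = "partial degradation"
    NEAR_MISS = "near miss"
    PROTECTION_MISOPERATION = "protection misoperation"

@dataclass
class ShadowEvent:
    """One entry pulled out of the Cabinet of Shadows."""
    kind: EventKind
    occurred: date
    summary: str
    protections_implicated: list[str] = field(default_factory=list)
    customer_visible: bool = False
    caught_by_luck: bool = False  # found by chance, not by monitoring

# Hypothetical example record:
event = ShadowEvent(
    kind=EventKind.NEAR_MISS,
    occurred=date(2024, 3, 12),
    summary="DB hit 95% disk; noticed manually during unrelated debugging",
    protections_implicated=["disk-capacity alert"],
    caught_by_luck=True,
)
```

The `caught_by_luck` flag is worth singling out: events discovered by chance rather than by monitoring are exactly the ones that indicate a protection gap.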


Psychological Safety: Turning Ghost Stories into Data

You can’t measure what people are afraid to say out loud.

To surface the shadow incidents, you need psychological safety—a culture where people feel safe to report:

  • Mistakes that almost caused outages
  • Near-misses that never made it to customers
  • Misconfigurations they caught “just in time”
  • Failed experiments and ugly surprises

Teams that actively cultivate this environment often see:

  • A sharp increase in near-miss reporting—for example, a 40% rise in reports in six months
  • More detailed, higher-quality narratives about how risk actually manifests
  • Earlier detection of patterns in protection failures

Key leadership practices that support this:

  • Blameless post-incident reviews focused on learning, not punishment
  • Regular near-miss review sessions treated as first-class citizens, not “less important” than outages
  • Leaders publicly sharing their own mistakes and near-misses
  • Explicit recognition and appreciation for people who report weak signals early

The goal is to pull stories out of the shadows and treat them as critical input to your risk model.


From Cabinet of Stories to Visual Risk Dashboard

Once you have more visible signals—outages, near-misses, and indicators of hidden failures—you can start building a visual risk analysis dashboard that:

  1. Unifies different types of events

    • Full outages (customer impact)
    • Partial degradations
    • Near-misses
    • Protection misoperations and test failures
  2. Visualizes time-dependent risk

    • Show how risk changes before and after major releases
    • Highlight clusters of near-misses that precede real outages
    • Indicate aging protections (e.g., last tested 180+ days ago)
  3. Surfaces hidden failure indicators

    • Failed failover drills
    • Backup restore tests that only partially succeeded
    • Alert tests that didn’t fire as expected

This dashboard doesn’t replace traditional observability; it complements it by mapping the shadow zone:

  • Metrics tell you: what is happening right now.
  • Incident and near-miss stories tell you: what almost happened, and why.
  • Combined, they form a dynamic map of system risk.
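One of the simplest widgets on such a dashboard is "time since last verified test" for each protection, with stale ones flagged. A sketch, assuming you maintain an inventory of protections and test dates (the names, dates, and threshold below are hypothetical):

```python
from datetime import date

# Assumed inventory: protection name -> date of last verified test.
last_verified = {
    "region failover": date(2023, 9, 1),
    "backup restore": date(2024, 2, 20),
    "paging alert path": date(2024, 3, 1),
}

STALE_AFTER_DAYS = 180  # policy threshold, an assumption

def stale_protections(today: date) -> list[tuple[str, int]]:
    """Protections whose last verified test is older than the threshold."""
    out = []
    for name, verified in last_verified.items():
        age = (today - verified).days
        if age > STALE_AFTER_DAYS:
            out.append((name, age))
    return sorted(out, key=lambda pair: -pair[1])

for name, age in stale_protections(date(2024, 4, 1)):
    print(f"{name}: last verified {age} days ago")
```

An unverified protection is not a protection; this list is the dashboard's honest answer to "which safety nets are we merely assuming work?"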

From Firefighting to Proactive Risk Management

When you only track visible outages, you’re limited to reactive firefighting:

  • An incident happens
  • You scramble to respond
  • You write a retrospective
  • You patch the specific cause

When you systematically map the Cabinet of Shadows, your posture shifts:

  • You see patterns in protection failures before they trigger outages
  • You detect rising risk levels after certain types of changes
  • You prioritize work based on risk concentration, not just recent pain

Concrete changes you might see:

  • Running regular, structured game days that deliberately stress protections
  • Tracking time since last verified test of backups, failover, and alerting
  • Using near-miss density as a leading indicator to trigger preventative work
  • Adjusting change management policies when near-misses spike after certain releases
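The near-miss-density trigger from the list above can be sketched in a few lines: count reports in a trailing window and compare against a policy threshold. The dates and threshold are illustrative assumptions.

```python
from datetime import date, timedelta

def near_miss_density(events: list[date], today: date, window_days: int = 30) -> int:
    """Count near-misses reported in the trailing window."""
    cutoff = today - timedelta(days=window_days)
    return sum(1 for d in events if cutoff < d <= today)

# Illustrative reports clustered after a hypothetical big release on March 10.
reports = [date(2024, 3, 12), date(2024, 3, 14), date(2024, 3, 20),
           date(2024, 3, 25), date(2024, 1, 5)]

DENSITY_THRESHOLD = 3  # assumed policy: above this, schedule preventative work

density = near_miss_density(reports, today=date(2024, 3, 31))
if density > DENSITY_THRESHOLD:
    print(f"{density} near-misses in 30 days: trigger preventative review")
```

Note one subtlety: if reporting culture improves, density rises even when risk does not, so interpret this signal alongside the psychological-safety work rather than in isolation.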

The transformation is simple to describe, hard to execute: move from asking, “How many outages did we have?” to asking, “How many times did the system show us it was nearly unsafe—and what did we learn?”


Conclusion: Open the Cabinet, Map the Shadows

Your system is more fragile—and more informative—than your dashboards admit.

By:

  • Questioning simplistic up/down outage models
  • Accounting for hidden protection failures and their time-varying probabilities
  • Treating near-misses and misoperations as first-class data
  • Building psychological safety so people actually report them
  • Consolidating signals into visual, time-aware risk dashboards

…you move from a world where reliability is an illusion of green graphs to one where you can see and shape the true landscape of risk.

The Cabinet of Shadows exists whether you acknowledge it or not. The choice is whether it remains a collection of whispered stories—or becomes a map you can steer by.
