The Analog Incident Story Domino Garden: Standing Up Tiny Paper Failures to See What Really Starts the Crash
How analog “domino gardens,” dummy loads, and partial simulators help engineering teams discover the true single points of failure that trigger cascading incidents in complex systems.
Complex systems rarely fail in simple ways.
From distributed services to overloaded power grids, big outages are often triggered by something small: a slow dependency, a misconfigured retry policy, a single hot path nobody realized was shared. By the time the incident page goes off, the system-wide spike has already revealed just how tightly coupled everything really was.
This is where the idea of an Analog Incident Story Domino Garden comes in: instead of waiting for production to show you how your system fails, you build an inexpensive, physical or analog “domino” simulation of your architecture. Then you stand up lots of tiny paper failures and see which one actually starts the crash.
In this post, we’ll explore how to:
- Use analog or physical models to explore cascading failures
- Combine mathematical models with simulator-style experiments
- Design “domino gardens” and dummy loads that expose real single points of failure
- Visualize blast radius clearly using L1/L2/L3 impact levels
- Leverage code-aware tooling to map real dependency graphs
Single Points of Failure Hide in Overload
Systems often look fine at normal load. Diagrams are neat, components appear decoupled, and everyone nods along with the architecture doc. But overload changes everything:
- Background jobs start competing for the same database
- A supposedly “optional” dependency becomes critical due to synchronous calls
- Retries and timeouts synchronize, causing thundering herds
When a single component crosses a critical threshold, you can see sudden, system-wide spikes:
- Latency blows up across unrelated services
- Error rates jump for endpoints that never call the failing service directly
- Resource utilization (CPU, memory, I/O) spikes in unexpected places
These patterns are a signal that you have hidden coupling. The architecture on paper says “loosely coupled,” but the runtime behavior says otherwise. The challenge is to discover these hidden links before production does it for you.
Why Build a “Domino Garden” for Incidents?
A domino garden is an experimental setup where you:
- Represent components as physical or analog “dominoes” (cards, bricks, simulated nodes).
- Give each one rules about how and when it can knock others over.
- Try different ways of placing and tipping them to see which arrangements cause a cascade.
In engineering terms, this means:
- Modeling services and dependencies
- Injecting small, cheap failures (the “paper failures”)
- Observing which ones lead to cascading impact
The goal is not a perfect digital twin. Instead, you’re looking for:
- Which single failures start the biggest chain reactions
- Which adjacent components amplify an incident
- Where backpressure, retries, or shared infrastructure make things worse
This kind of analog thinking is powerful because it forces you to externalize your mental model. Once the “dominoes” are on the table—literally or figuratively—disagreements and assumptions surface quickly.
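To make this concrete, here is a minimal sketch of a domino-style cascade in Python. The service names, the dependency edges, and the deliberately brutal rule that a service falls as soon as any one of its dependencies falls are illustrative assumptions, not a description of any real architecture.

```python
from collections import deque

# Hypothetical dependency graph: edges point from a service to what it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["cache", "payments-db"],
    "cart": ["cache"],
    "search": ["search-index"],
    "cache": [],
    "payments-db": [],
    "search-index": [],
}

def cascade(initial_failure: str) -> set[str]:
    """Knock over one domino and return every service that falls with it.

    Assumes the simplest possible rule: a service fails as soon as any one
    of its dependencies has failed (no redundancy, no fallbacks).
    """
    # Invert the graph so we can walk from a failed dependency to its callers.
    dependents = {name: [] for name in DEPENDS_ON}
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(service)

    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        down = queue.popleft()
        for caller in dependents[down]:
            if caller not in failed:
                failed.add(caller)
                queue.append(caller)
    return failed

# Tip each domino in turn and see which single failure causes the widest cascade.
for service in DEPENDS_ON:
    blast = cascade(service)
    print(f"{service:12s} -> {len(blast)} services down: {sorted(blast)}")
```

Even a toy like this surfaces the question that matters: which single node shows up in the most cascades? In this made-up graph, the shared cache knocks over cart, payments, and checkout, which is exactly the kind of hidden coupling the garden is meant to expose.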
Dummy Loads and Constrained Simulations: Safe Failure Probing
You rarely want to trigger a real cascading incident. You want to test just enough to learn something about the system’s weak points.
Two key tools here are:
1. Dummy Loads
A dummy load is a stand-in for real traffic or real work:
- A fake client that hammers only one endpoint with controlled patterns
- A synthetic workload that stresses a single database partition
- A stub service that simulates slow responses or specific error codes
Dummy loads let you study behaviors like impedance (how one part of the system resists or amplifies load) without blowing up the entire ecosystem.
Examples:
- Send a constant trickle of slow responses from a single dependency to see how callers handle partial degradation.
- Simulate a modest CPU spike on one node and see how your autoscaler reacts.
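To make the first example concrete, here is a minimal stub dependency written with nothing but Python's standard library. The port number, the added latency, and the error rate are arbitrary knobs for the experiment, not values taken from any real service.

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Arbitrary knobs for the experiment -- dial these up and down per run.
ADDED_LATENCY_SECONDS = 0.1   # simulate a consistently slow dependency
ERROR_RATE = 0.05             # fraction of requests that fail outright

class StubDependency(BaseHTTPRequestHandler):
    """A fake downstream service: always slow, occasionally broken."""

    def do_GET(self):
        time.sleep(ADDED_LATENCY_SECONDS)   # constant partial degradation
        if random.random() < ERROR_RATE:
            self.send_response(503)         # controlled, known error mode
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

    def log_message(self, fmt, *args):
        pass  # keep the experiment's output quiet

if __name__ == "__main__":
    # Point one caller (and only one) at this stub and watch how it copes.
    HTTPServer(("localhost", 8080), StubDependency).serve_forever()
```

Route a single caller at this stub and you have the "constant trickle of slow responses" experiment from the first bullet: fully contained, fully reversible, and cheap to repeat with different knob settings.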
2. Constrained Simulations
A constrained simulation intentionally limits how far a failure can propagate:
- Run tests in a staging environment with only a subset of services
- Use feature flags or routing rules to keep certain paths offline
- Replace potentially dangerous dependencies with safe mocks or stubs
You’re not trying to recreate the full system. You’re trying to answer focused questions like:
- "What happens to Service A when Dependency B adds 100ms latency?"
- "At what point do retries cause more harm than good for this path?"
These are your paper failures: cheap, reversible experiments that illuminate where the real dominoes stand.
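Some paper failures do not even need a running system. The second question above, for example, yields to a few lines of arithmetic: if each attempt fails independently with probability p and you allow r retries, the expected number of requests per logical call is 1 + p + p^2 + ... + p^r. The failure rates and retry counts below are arbitrary, chosen only to show the shape of the curve.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected requests sent per logical call when each attempt fails
    independently with `failure_prob` and we retry up to `max_retries` times."""
    # Attempt k happens only if the previous k attempts all failed.
    return sum(failure_prob ** k for k in range(max_retries + 1))

# How much extra load do retries add as the dependency degrades?
for p in (0.01, 0.10, 0.50, 0.90):
    for retries in (0, 3, 5):
        amp = expected_attempts(p, retries)
        print(f"failure rate {p:>4.0%}, {retries} retries -> {amp:.2f}x load")
```

At a 1% failure rate the extra traffic from retries is negligible; at 90% the same retry policy more than triples the load hitting a dependency that is already struggling. That is the point at which retries stop helping and start writing the next chapter of the incident.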
Hybrid Approaches: Math + Analog + Code
A lot of engineering problems sit in a weird middle ground:
- Too complex for a clean closed-form mathematical model
- Too large or risky to fully simulate in production-like detail
That’s why hybrid approaches work best:
- Mathematical modeling for the core dynamics (see the sketch after this list):
  - Queueing theory for request backlogs
  - Simple differential equations for resource growth/decay
  - Probability models for error rates and retries
- Analog / physical models for structure and intuition:
  - Domino layouts on paper or a whiteboard
  - Card-based dependency mapping workshops
  - Simple spreadsheet simulations of cascading thresholds
- Code-level simulators for realism where it really matters:
  - Partial test harnesses around critical hot paths
  - Local or containerized environments that spin up just the key services
  - Fault-injection tools that mimic network partitions, slow disks, or timeouts
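For the queueing-theory item above, even the textbook M/M/1 result W = 1 / (mu - lambda) carries the core intuition: mean time in the system stays flat for a long while, then explodes as arrivals approach capacity. A minimal sketch, using a made-up service rate:

```python
def mm1_wait_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).

    Only valid while arrival_rate < service_rate; beyond that the
    backlog grows without bound.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

SERVICE_RATE = 100.0  # requests/second one instance can handle (assumed)

# Watch latency climb gently, then explode, as utilization approaches 100%.
for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    wait = mm1_wait_time(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"utilization {utilization:.0%}: mean time in system {wait * 1000:.0f} ms")
```

The nonlinearity is the whole point: a component at 90% utilization looks only modestly slower than one at 80%, but the last few percentage points are where the domino becomes trivially easy to tip.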
The trick is to avoid the trap of "all or nothing" realism. You don’t need a full replica of production to answer:
"Which single component failure would take down the most important user journeys?"
You need targeted fidelity where it counts.
The Power of Simplified or Partial Simulators
A simulator that handles only part of the system can still be extremely valuable—if it focuses on the critical operations most likely to start or amplify an incident.
Good partial simulators often share these traits:
- Scope-limited: Only the main payment flow, or only the core search path
- Configurable: Easy to dial latency, error rates, and capacity up and down
- Fast feedback: Runs in seconds or minutes, not hours
Examples of what you can learn from such simulators:
- Where retry storms begin
- Which retries are safe versus destructive
- When backpressure kicks in—and whether it works as intended
In your “domino garden,” these simulators are like small sections of tightly packed tiles you can rearrange and stress-test without rebuilding the entire field.
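To show just how small such a simulator can be, here is a discrete-time sketch of one scope-limited service with a bounded queue. The arrival rate, capacity, queue bound, and tick count are all assumptions; what matters is the shape of the behavior, not the numbers.

```python
def simulate(arrivals_per_tick: int, capacity_per_tick: int,
             max_queue: int | None, ticks: int = 60) -> dict:
    """Tiny discrete-time model of one overloaded service.

    max_queue=None means an unbounded queue (no backpressure);
    a finite max_queue sheds excess load instead of buffering it forever.
    """
    queue = 0
    served = shed = 0
    for _ in range(ticks):
        queue += arrivals_per_tick
        if max_queue is not None and queue > max_queue:
            shed += queue - max_queue       # backpressure: reject the overflow
            queue = max_queue
        done = min(queue, capacity_per_tick)
        served += done
        queue -= done
    return {"served": served, "shed": shed, "backlog": queue}

# Overload scenario: 120 requests arrive per tick, capacity is 100 per tick.
print("no backpressure:  ", simulate(120, 100, max_queue=None))
print("bounded queue=200:", simulate(120, 100, max_queue=200))
```

Without a bound, the backlog quietly grows by 20 requests every tick and nobody notices until latency is measured in minutes; with a bound, the service sheds the excess and its worst-case queueing delay stays fixed. That is the "does backpressure actually work as intended?" question, answered in about twenty lines.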
Visualizing Blast Radius: L1, L2, L3
Even when you identify a dangerous domino, it can be hard to convey its importance to the team. That’s where blast radius visualization comes in.
A practical approach is to label impact levels:
- L1 (Local): Impact confined to the failing component and its immediate callers
- L2 (Adjacent): Impact spreads to related services or user journeys, but the whole system is not degraded
- L3 (Systemic): Broad, system-wide impact—major outage, business-critical functionality down
You can map this visually:
- Draw services as nodes and color them by impact level under a given failure scenario
- Overlay user journeys and highlight where they intersect failing paths
- Annotate high-risk edges: “L1 failure here escalates to L3 in under 30 seconds”
This turns an abstract sentence like:
"If caching layer X goes down, several internal APIs degrade."
into a concrete picture:
"If X fails, we get an L3 incident on checkout within 2 minutes unless backpressure Y engages."
Teams can then prioritize improvements based on blast radius, not just theoretical severity.
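If your dependency graph lives in data rather than on a whiteboard, you can compute a first-pass label for every domino automatically. Here is a minimal sketch; the graph, the journey definitions, and the crude classification rule are illustrative assumptions you would replace with your own criteria for "systemic."

```python
from collections import deque

# Hypothetical reverse dependency graph: for each service, who calls it.
CALLERS = {
    "cache":           ["payments", "cart", "search-index"],
    "payments":        ["checkout"],
    "cart":            ["checkout"],
    "checkout":        [],
    "search-index":    ["search"],
    "search":          [],
    "recs-db":         ["recommendations"],
    "recommendations": [],
}

# Business-critical user journeys, expressed as the services they touch (assumed).
CRITICAL_JOURNEYS = {
    "buy something":  {"checkout", "payments", "cart"},
    "find a product": {"search", "search-index"},
}

def impacted(failure: str) -> set[str]:
    """Every service reachable by walking 'is called by' edges from the failure."""
    seen, queue = {failure}, deque([failure])
    while queue:
        for caller in CALLERS[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

def blast_level(failure: str) -> str:
    hit = impacted(failure)
    journeys_hit = [name for name, svcs in CRITICAL_JOURNEYS.items() if svcs & hit]
    if len(journeys_hit) == len(CRITICAL_JOURNEYS):
        return "L3"  # every critical journey is degraded: treat as systemic
    if hit <= {failure, *CALLERS[failure]} and not journeys_hit:
        return "L1"  # confined to the component and its immediate callers
    return "L2"      # spreads to some journeys, but not everything

for service in CALLERS:
    print(f"{service:15s} -> {blast_level(service)} {sorted(impacted(service))}")
```

In this toy graph the shared cache is the one L3 domino, the recommendations path stays at L1, and the rest land at L2. A rule this crude will misjudge some dominoes (checkout itself only scores L2 here), and that is precisely the kind of disagreement you want surfaced in a review rather than during an incident.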
Tooling That Understands Real Code and Real Graphs
All of this is much easier when your tools see the system as it actually exists in the code, not as the architecture diagram pretends it does.
Good dependency-mapping and simulation tooling should:
- Parse real import paths across repositories
- Understand multiple languages (your backend, your frontends, your scripts)
- Handle barrel files and indirection layers instead of giving up at the first alias
When your tools know that:
- Service A depends on Library B
- Library B uses Client C behind a barrel file
- Client C actually calls Service D and E
…your “domino garden” becomes grounded in reality, not guesses.
This enables experiments like:
- "Show me all L2/L3 dominos touched by this shared library."
- "Simulate a 50% latency increase on this internal API and visualize the impact."
Without this fidelity, you’re knocking over cardboard shapes that only vaguely resemble your real graph.
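You do not need heavyweight tooling to get started, either. As a rough sketch of the "show me the dominoes touched by this shared library" query, the snippet below extracts import edges from a directory of Python files with the standard ast module and then walks them in reverse. A real tool would resolve aliases, barrel files, relative imports, and multiple languages; the repository path and library name here are placeholders.

```python
import ast
from collections import defaultdict, deque
from pathlib import Path

def module_name(path: Path, root: Path) -> str:
    """Crude file-to-module mapping: 'payments/client.py' -> 'payments.client'."""
    return ".".join(path.relative_to(root).with_suffix("").parts)

def importers_of(repo_root: str) -> dict[str, set[str]]:
    """Map each imported module name to the set of modules that import it."""
    root = Path(repo_root)
    importers = defaultdict(set)
    for path in root.rglob("*.py"):
        this_module = module_name(path, root)
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    importers[alias.name].add(this_module)
            elif isinstance(node, ast.ImportFrom) and node.module:
                importers[node.module].add(this_module)
    return importers

def dominoes_touched(repo_root: str, library: str) -> set[str]:
    """Every module that directly or transitively imports `library`."""
    importers = importers_of(repo_root)
    affected = set(importers.get(library, set()))
    queue = deque(affected)
    while queue:
        module = queue.popleft()
        for caller in importers.get(module, set()) - affected:
            affected.add(caller)
            queue.append(caller)
    return affected

# Placeholder path and library name -- substitute your own repository and target.
print(sorted(dominoes_touched("./src", "shared_cache_client")))
```

Crude as it is, it already answers the question that matters: how many modules fall over if this one library does, and whether that number surprises anyone in the room.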
Bringing It All Together
The Analog Incident Story Domino Garden isn’t about building a perfect model of your system. It’s about:
- Making hidden couplings visible
- Using paper failures and dummy loads to probe for weak points safely
- Combining math, analog modeling, and code-level simulations for practical insight
- Focusing fidelity where it matters most: the first few dominoes that start a cascade
- Communicating risk using clear blast radius levels (L1/L2/L3)
- Grounding everything in real dependency graphs understood by your tools
If you treat incidents as stories, your job isn’t just to fix the last chapter. It’s to discover the opening line—the tiny failure that quietly sets every other piece in motion.
Build your domino garden before production does it for you. Stand up the paper failures. See what actually falls.
Then, redesign the board so that when something inevitably tips over, the story ends at L1 instead of L3.