The Analog Incident Story Domino Garden: Standing Up Tiny Paper Failures to See What Really Starts the Crash
How analog “domino gardens,” dummy loads, and partial simulators help engineering teams discover the true single points of failure that trigger cascading incidents in complex systems.
Complex systems rarely fail in simple ways.
From distributed services to overloaded power grids, big outages are often triggered by something small: a slow dependency, a misconfigured retry policy, a single hot path nobody realized was shared. By the time the incident page goes off, the system-wide spike has already revealed just how tightly coupled everything really was.
This is where the idea of an Analog Incident Story Domino Garden comes in: instead of waiting for production to show you how your system fails, you build an inexpensive, physical or analog “domino” simulation of your architecture. Then you stand up lots of tiny paper failures and see which one actually starts the crash.
In this post, we’ll explore how to:
- Use analog or physical models to explore cascading failures
- Combine mathematical models with simulator-style experiments
- Design “domino gardens” and dummy loads that expose real single points of failure
- Visualize blast radius clearly using L1/L2/L3 impact levels
- Leverage code-aware tooling to map real dependency graphs
Single Points of Failure Hide in Overload
Systems often look fine at normal load. Diagrams are neat, components appear decoupled, and everyone nods along with the architecture doc. But overload changes everything:
- Background jobs start competing for the same database
- A supposedly “optional” dependency becomes critical due to synchronous calls
- Retries and timeouts synchronize, causing thundering herds
When a single component crosses a critical threshold, you can see sudden, system-wide spikes:
- Latency blows up across unrelated services
- Error rates jump for endpoints that never call the failing service directly
- Resource utilization (CPU, memory, I/O) spikes in unexpected places
These patterns are a signal that you have hidden coupling. The architecture on paper says “loosely coupled,” but the runtime behavior says otherwise. The challenge is to discover these hidden links before production does it for you.
Why Build a “Domino Garden” for Incidents?
A domino garden is an experimental setup where you:
- Represent components as physical or analog “dominoes” (cards, bricks, simulated nodes).
- Give each one rules about how and when it can knock others over.
- Try different ways of placing and tipping them to see which arrangements cause a cascade.
In engineering terms, this means:
- Modeling services and dependencies
- Injecting small, cheap failures (the “paper failures”)
- Observing which ones lead to cascading impact
The goal is not a perfect digital twin. Instead, you’re looking for:
- Which single failures start the biggest chain reactions
- Which adjacent components amplify an incident
- Where backpressure, retries, or shared infrastructure make things worse
This kind of analog thinking is powerful because it forces you to externalize your mental model. Once the “dominoes” are on the table—literally or figuratively—disagreements and assumptions surface quickly.
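To make this concrete, here is a minimal sketch of a domino-style cascade in Python. The service names, the dependency edges, and the deliberately brutal rule that a service falls as soon as any one of its dependencies falls are illustrative assumptions, not a description of any real architecture.

```python
from collections import deque

# Hypothetical dependency graph: edges point from a service to what it depends on.
DEPENDS_ON = {
    "checkout": ["payments", "cart"],
    "payments": ["cache", "payments-db"],
    "cart": ["cache"],
    "search": ["search-index"],
    "cache": [],
    "payments-db": [],
    "search-index": [],
}

def cascade(initial_failure: str) -> set[str]:
    """Knock over one domino and return every service that falls with it.

    Assumes the simplest possible rule: a service fails as soon as any one
    of its dependencies has failed (no redundancy, no fallbacks).
    """
    # Invert the graph so we can walk from a failed dependency to its callers.
    dependents = {name: [] for name in DEPENDS_ON}
    for service, deps in DEPENDS_ON.items():
        for dep in deps:
            dependents[dep].append(service)

    failed = {initial_failure}
    queue = deque([initial_failure])
    while queue:
        down = queue.popleft()
        for caller in dependents[down]:
            if caller not in failed:
                failed.add(caller)
                queue.append(caller)
    return failed

# Tip each domino in turn and see which single failure causes the widest cascade.
for service in DEPENDS_ON:
    blast = cascade(service)
    print(f"{service:12s} -> {len(blast)} services down: {sorted(blast)}")
```

Even a toy like this surfaces the question that matters: which single node shows up in the most cascades? In this made-up graph, the shared cache knocks over cart, payments, and checkout, which is exactly the kind of hidden coupling the garden is meant to expose.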
Dummy Loads and Constrained Simulations: Safe Failure Probing
You rarely want to trigger a real cascading incident. You want to test just enough to learn something about the system’s weak points.
Two key tools here are:
1. Dummy Loads
A dummy load is a stand-in for real traffic or real work:
- A fake client that hammers only one endpoint with controlled patterns
- A synthetic workload that stresses a single database partition
- A stub service that simulates slow responses or specific error codes
Dummy loads let you study behaviors like impedance (how one part of the system resists or amplifies load) without blowing up the entire ecosystem.
Examples:
- Send a constant trickle of slow responses from a single dependency to see how callers handle partial degradation.
- Simulate a modest CPU spike on one node and see how your autoscaler reacts.
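To make the first example concrete, here is a minimal stub dependency written with nothing but Python's standard library. The port number, the added latency, and the error rate are arbitrary knobs for the experiment, not values taken from any real service.

```python
import random
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

# Arbitrary knobs for the experiment -- dial these up and down per run.
ADDED_LATENCY_SECONDS = 0.1   # simulate a consistently slow dependency
ERROR_RATE = 0.05             # fraction of requests that fail outright

class StubDependency(BaseHTTPRequestHandler):
    """A fake downstream service: always slow, occasionally broken."""

    def do_GET(self):
        time.sleep(ADDED_LATENCY_SECONDS)   # constant partial degradation
        if random.random() < ERROR_RATE:
            self.send_response(503)         # controlled, known error mode
            self.end_headers()
            return
        self.send_response(200)
        self.end_headers()
        self.wfile.write(b'{"status": "ok"}')

    def log_message(self, fmt, *args):
        pass  # keep the experiment's output quiet

if __name__ == "__main__":
    # Point one caller (and only one) at this stub and watch how it copes.
    HTTPServer(("localhost", 8080), StubDependency).serve_forever()
```

Route a single caller at this stub and you have the "constant trickle of slow responses" experiment from the first bullet: fully contained, fully reversible, and cheap to repeat with different knob settings.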
2. Constrained Simulations
A constrained simulation intentionally limits how far a failure can propagate:
- Run tests in a staging environment with only a subset of services
- Use feature flags or routing rules to keep certain paths offline
- Replace potentially dangerous dependencies with safe mocks or stubs
You’re not trying to recreate the full system. You’re trying to answer focused questions like:
- "What happens to Service A when Dependency B adds 100ms latency?"
- "At what point do retries cause more harm than good for this path?"
These are your paper failures: cheap, reversible experiments that illuminate where the real dominoes stand.
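Some paper failures do not even need a running system. The second question above, for example, yields to a few lines of arithmetic: if each attempt fails independently with probability p and you allow r retries, the expected number of requests per logical call is 1 + p + p^2 + ... + p^r. The failure rates and retry counts below are arbitrary, chosen only to show the shape of the curve.

```python
def expected_attempts(failure_prob: float, max_retries: int) -> float:
    """Expected requests sent per logical call when each attempt fails
    independently with `failure_prob` and we retry up to `max_retries` times."""
    # Attempt k happens only if the previous k attempts all failed.
    return sum(failure_prob ** k for k in range(max_retries + 1))

# How much extra load do retries add as the dependency degrades?
for p in (0.01, 0.10, 0.50, 0.90):
    for retries in (0, 3, 5):
        amp = expected_attempts(p, retries)
        print(f"failure rate {p:>4.0%}, {retries} retries -> {amp:.2f}x load")
```

At a 1% failure rate the extra traffic from retries is negligible; at 90% the same retry policy more than triples the load hitting a dependency that is already struggling. That is the point at which retries stop helping and start writing the next chapter of the incident.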
Hybrid Approaches: Math + Analog + Code
A lot of engineering problems sit in a weird middle ground:
- Too complex for a clean closed-form mathematical model
- Too large or risky to fully simulate in production-like detail
That’s why hybrid approaches work best:
- Mathematical modeling for the core dynamics (see the sketch after this list):
  - Queueing theory for request backlogs
  - Simple differential equations for resource growth/decay
  - Probability models for error rates and retries
- Analog / physical models for structure and intuition:
  - Domino layouts on paper or a whiteboard
  - Card-based dependency mapping workshops
  - Simple spreadsheet simulations of cascading thresholds
- Code-level simulators for realism where it really matters:
  - Partial test harnesses around critical hot paths
  - Local or containerized environments that spin up just the key services
  - Fault-injection tools that mimic network partitions, slow disks, or timeouts
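For the queueing-theory item above, even the textbook M/M/1 result W = 1 / (mu - lambda) carries the core intuition: mean time in the system stays flat for a long while, then explodes as arrivals approach capacity. A minimal sketch, using a made-up service rate:

```python
def mm1_wait_time(arrival_rate: float, service_rate: float) -> float:
    """Mean time in system for an M/M/1 queue: W = 1 / (mu - lambda).

    Only valid while arrival_rate < service_rate; beyond that the
    backlog grows without bound.
    """
    if arrival_rate >= service_rate:
        return float("inf")
    return 1.0 / (service_rate - arrival_rate)

SERVICE_RATE = 100.0  # requests/second one instance can handle (assumed)

# Watch latency climb gently, then explode, as utilization approaches 100%.
for utilization in (0.50, 0.80, 0.90, 0.95, 0.99):
    wait = mm1_wait_time(utilization * SERVICE_RATE, SERVICE_RATE)
    print(f"utilization {utilization:.0%}: mean time in system {wait * 1000:.0f} ms")
```

The nonlinearity is the whole point: a component at 90% utilization looks only modestly slower than one at 80%, but the last few percentage points are where the domino becomes trivially easy to tip.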
The trick is to avoid the trap of "all or nothing" realism. You don’t need a full replica of production to answer:
"Which single component failure would take down the most important user journeys?"
You need targeted fidelity where it counts.
The Power of Simplified or Partial Simulators
A simulator that handles only part of the system can still be extremely valuable—if it focuses on the critical operations most likely to start or amplify an incident.
Good partial simulators often share these traits:
- Scope-limited: Only the main payment flow, or only the core search path
- Configurable: Easy to dial latency, error rates, and capacity up and down
- Fast feedback: Runs in seconds or minutes, not hours
Examples of what you can learn from such simulators:
- Where retry storms begin
- Which retries are safe versus destructive
- When backpressure kicks in—and whether it works as intended
In your “domino garden,” these simulators are like small sections of tightly packed tiles you can rearrange and stress-test without rebuilding the entire field.
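To show just how small such a simulator can be, here is a discrete-time sketch of one scope-limited service with a bounded queue. The arrival rate, capacity, queue bound, and tick count are all assumptions; what matters is the shape of the behavior, not the numbers.

```python
def simulate(arrivals_per_tick: int, capacity_per_tick: int,
             max_queue: int | None, ticks: int = 60) -> dict:
    """Tiny discrete-time model of one overloaded service.

    max_queue=None means an unbounded queue (no backpressure);
    a finite max_queue sheds excess load instead of buffering it forever.
    """
    queue = 0
    served = shed = 0
    for _ in range(ticks):
        queue += arrivals_per_tick
        if max_queue is not None and queue > max_queue:
            shed += queue - max_queue       # backpressure: reject the overflow
            queue = max_queue
        done = min(queue, capacity_per_tick)
        served += done
        queue -= done
    return {"served": served, "shed": shed, "backlog": queue}

# Overload scenario: 120 requests arrive per tick, capacity is 100 per tick.
print("no backpressure:  ", simulate(120, 100, max_queue=None))
print("bounded queue=200:", simulate(120, 100, max_queue=200))
```

Without a bound, the backlog quietly grows by 20 requests every tick and nobody notices until latency is measured in minutes; with a bound, the service sheds the excess and its worst-case queueing delay stays fixed. That is the "does backpressure actually work as intended?" question, answered in about twenty lines.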
Visualizing Blast Radius: L1, L2, L3
Even when you identify a dangerous domino, it can be hard to convey its importance to the team. That’s where blast radius visualization comes in.
A practical approach is to label impact levels:
- L1 (Local): Impact confined to the failing component and its immediate callers
- L2 (Adjacent): Impact spreads to related services or user journeys, but the whole system is not degraded
- L3 (Systemic): Broad, system-wide impact—major outage, business-critical functionality down
You can map this visually:
- Draw services as nodes and color them by impact level under a given failure scenario
- Overlay user journeys and highlight where they intersect failing paths
- Annotate high-risk edges: “L1 failure here escalates to L3 in under 30 seconds”
This turns an abstract sentence like:
"If caching layer X goes down, several internal APIs degrade."
into a concrete picture:
"If X fails, we get an L3 incident on checkout within 2 minutes unless backpressure Y engages."
Teams can then prioritize improvements based on blast radius, not just theoretical severity.
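If your dependency graph lives in data rather than on a whiteboard, you can compute a first-pass label for every domino automatically. Here is a minimal sketch; the graph, the journey definitions, and the crude classification rule are illustrative assumptions you would replace with your own criteria for "systemic."

```python
from collections import deque

# Hypothetical reverse dependency graph: for each service, who calls it.
CALLERS = {
    "cache":           ["payments", "cart", "search-index"],
    "payments":        ["checkout"],
    "cart":            ["checkout"],
    "checkout":        [],
    "search-index":    ["search"],
    "search":          [],
    "recs-db":         ["recommendations"],
    "recommendations": [],
}

# Business-critical user journeys, expressed as the services they touch (assumed).
CRITICAL_JOURNEYS = {
    "buy something":  {"checkout", "payments", "cart"},
    "find a product": {"search", "search-index"},
}

def impacted(failure: str) -> set[str]:
    """Every service reachable by walking 'is called by' edges from the failure."""
    seen, queue = {failure}, deque([failure])
    while queue:
        for caller in CALLERS[queue.popleft()]:
            if caller not in seen:
                seen.add(caller)
                queue.append(caller)
    return seen

def blast_level(failure: str) -> str:
    hit = impacted(failure)
    journeys_hit = [name for name, svcs in CRITICAL_JOURNEYS.items() if svcs & hit]
    if len(journeys_hit) == len(CRITICAL_JOURNEYS):
        return "L3"  # every critical journey is degraded: treat as systemic
    if hit <= {failure, *CALLERS[failure]} and not journeys_hit:
        return "L1"  # confined to the component and its immediate callers
    return "L2"      # spreads to some journeys, but not everything

for service in CALLERS:
    print(f"{service:15s} -> {blast_level(service)} {sorted(impacted(service))}")
```

In this toy graph the shared cache is the one L3 domino, the recommendations path stays at L1, and the rest land at L2. A rule this crude will misjudge some dominoes (checkout itself only scores L2 here), and that is precisely the kind of disagreement you want surfaced in a review rather than during an incident.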
Tooling That Understands Real Code and Real Graphs
All of this is much easier when your tools see the system as it actually exists in the code, not as the architecture diagram pretends it does.
Good dependency-mapping and simulation tooling should:
- Parse real import paths across repositories
- Understand multiple languages (your backend, your frontends, your scripts)
- Handle barrel files and indirection layers instead of giving up at the first alias
When your tools know that:
- Service A depends on Library B
- Library B uses Client C behind a barrel file
- Client C actually calls Service D and E
…your “domino garden” becomes grounded in reality, not guesses.
This enables experiments like:
- "Show me all L2/L3 dominos touched by this shared library."
- "Simulate a 50% latency increase on this internal API and visualize the impact."
Without this fidelity, you’re knocking over cardboard shapes that only vaguely resemble your real graph.
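You do not need heavyweight tooling to get started, either. As a rough sketch of the "show me the dominoes touched by this shared library" query, the snippet below extracts import edges from a directory of Python files with the standard ast module and then walks them in reverse. A real tool would resolve aliases, barrel files, relative imports, and multiple languages; the repository path and library name here are placeholders.

```python
import ast
from collections import defaultdict, deque
from pathlib import Path

def module_name(path: Path, root: Path) -> str:
    """Crude file-to-module mapping: 'payments/client.py' -> 'payments.client'."""
    return ".".join(path.relative_to(root).with_suffix("").parts)

def importers_of(repo_root: str) -> dict[str, set[str]]:
    """Map each imported module name to the set of modules that import it."""
    root = Path(repo_root)
    importers = defaultdict(set)
    for path in root.rglob("*.py"):
        this_module = module_name(path, root)
        tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                for alias in node.names:
                    importers[alias.name].add(this_module)
            elif isinstance(node, ast.ImportFrom) and node.module:
                importers[node.module].add(this_module)
    return importers

def dominoes_touched(repo_root: str, library: str) -> set[str]:
    """Every module that directly or transitively imports `library`."""
    importers = importers_of(repo_root)
    affected = set(importers.get(library, set()))
    queue = deque(affected)
    while queue:
        module = queue.popleft()
        for caller in importers.get(module, set()) - affected:
            affected.add(caller)
            queue.append(caller)
    return affected

# Placeholder path and library name -- substitute your own repository and target.
print(sorted(dominoes_touched("./src", "shared_cache_client")))
```

Crude as it is, it already answers the question that matters: how many modules fall over if this one library does, and whether that number surprises anyone in the room.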
Bringing It All Together
The Analog Incident Story Domino Garden isn’t about building a perfect model of your system. It’s about:
- Making hidden couplings visible
- Using paper failures and dummy loads to probe for weak points safely
- Combining math, analog modeling, and code-level simulations for practical insight
- Focusing fidelity where it matters most: the first few dominoes that start a cascade
- Communicating risk using clear blast radius levels (L1/L2/L3)
- Grounding everything in real dependency graphs understood by your tools
If you treat incidents as stories, your job isn’t just to fix the last chapter. It’s to discover the opening line—the tiny failure that quietly sets every other piece in motion.
Build your domino garden before production does it for you. Stand up the paper failures. See what actually falls.
Then, redesign the board so that when something inevitably tips over, the story ends at L1 instead of L3.