The Cardboard Reliability Puppet Stage: Acting Out Incidents Before They Hit Production
How a cardboard “reliability puppet stage” turns incident practice into playful, low‑stakes simulations that surface failures early and strengthen SRE teams.
Modern Site Reliability Engineering (SRE) spends a lot of time dealing with incidents after they’ve impacted users. But what if you could rehearse incidents like a theater company before opening night? What if your team could explore cascading failures, miscommunications, and weird edge cases using nothing more than cardboard, string, and paper marionettes?
Enter the Cardboard Reliability Puppet Stage—a deliberately low‑tech, playful way to simulate incidents, practice responses, and surface problems before they ever appear in production.
In this post, we’ll explore why this works, how it fits into SRE practice, and how you can build your own puppet stage for reliability learning.
Why SRE Needs More Play (and More Practice)
SRE is fundamentally about reliability under uncertainty. Teams:
- Monitor and improve availability and performance of large‑scale systems.
- Design for fault tolerance, graceful degradation, and rapid recovery.
- Try to detect and address issues early, before users ever feel pain.
We already have sophisticated tools: observability stacks, chaos experiments, load tests, runbooks, and post‑incident reviews. But many of these are:
- High stakes – Running chaos in production, even carefully, can be stressful.
- Time‑pressured – Real incidents don’t wait for perfect explanations.
- Abstract – Diagrams and dashboards don’t always reveal the human interactions and misunderstandings that make incidents worse.
That’s where playful, low‑cost tools come in. By acting out incidents with cardboard stages and paper puppets, you:
- Lower the emotional stakes.
- Make invisible dependencies physically visible.
- Rehearse communication and coordination, not just technical fixes.
This is not a replacement for real tools. It’s a complement that focuses on learning faster than your systems can fail.
What Is a “Reliability Puppet Stage”?
A reliability puppet stage is exactly what it sounds like:
- A simple cardboard “stage” that represents your system.
- Paper or cardboard puppets representing services, users, teams, tools, and even failure modes.
- Strings, sticks, or labels to show dependencies and data flows.
- A facilitator who narrates an incident scenario while team members act out their responses using the puppets.
Think of it as a tabletop incident simulation with props. The focus is on interaction and storytelling, not artistic quality.
Why Puppets Work Surprisingly Well
Educational puppetry practice and classroom experience suggest that:
- Storytelling and role‑play increase engagement and memory.
- Students (and adults) feel safer exposing confusion “through the puppet” than in their own voice.
- Physical props encourage independent problem‑solving, experimentation, and “what if?” thinking.
SRE teams can leverage the same dynamics. When a paper marionette labeled “Cache” collapses dramatically, it’s funny—but it also makes the failure tangible and discussable.
Designing Your Cardboard Incident Stage
You don’t need much:
- A piece of cardboard or whiteboard for the “stage”
- Index cards or paper for the puppets
- Tape, markers, string, and sticks (or just hands)
Step 1: Map Your Cast of Characters
Create puppets for:
- Core services: API, frontend, auth, payments, database, cache, message queue.
- External dependencies: third‑party APIs, DNS, cloud provider.
- People: end user, internal customer, on‑call engineer.
- Tools & processes: monitoring, alerting, runbooks, incident commander.
- Failure modes: “Network Partition,” “Thundering Herd,” “Disk Full,” “Deployment Gone Wrong.”
Each puppet should have a clear label. Don’t overcomplicate the design—stick figures are fine.
Step 2: Show Relationships
Use string, arrows, or drawn lines to represent:
- Data flows (user → frontend → API → DB)
- Dependencies (service A depends on service B)
- Control paths (alerts → on‑call → incident channel)
The point is to externalize mental models. People often discover mismatched understandings during this step alone.
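If you want a digital twin of the strings on the stage, the same relationships can be sketched as a small dependency map. This is a minimal sketch under assumptions: the puppet names and edges below are a hypothetical cast, not a real architecture, and `affected_by` is an illustrative helper that answers "who wobbles when this puppet falls?"

```python
from collections import deque

# Hypothetical cast: each puppet maps to the puppets it depends on.
DEPENDS_ON = {
    "user": ["frontend"],
    "frontend": ["api"],
    "api": ["cache", "db", "auth"],
    "auth": ["db"],
    "cache": [],
    "db": [],
}


def affected_by(failed, depends_on):
    """Return every puppet that directly or indirectly depends on `failed`."""
    # Invert the strings: for each puppet, who is tied to it from upstream?
    dependents = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    # Breadth-first walk from the failed puppet toward the user.
    seen = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for parent in dependents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return sorted(seen)


print(affected_by("db", DEPENDS_ON))  # ['api', 'auth', 'frontend', 'user']
```

Comparing this list with what the team expected to wobble is a quick way to spot the mismatched mental models this step is meant to expose.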
Step 3: Define Simple Scenarios
Start with small, realistic incident prompts, such as:
- “Cache eviction bug causes a sudden spike on the database.”
- “Third‑party payment provider latency triples.”
- “New deployment silently disables retries on a critical client.”
Write each scenario on a card. The facilitator introduces it, then the puppet show begins.
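If the deck outgrows a stack of index cards, the cards can also live in a small script. The sketch below is only one possible shape: `ScenarioCard`, its field names, the `first_to_fall` labels, and the pairing of each prompt with a twist are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ScenarioCard:
    prompt: str         # what the facilitator reads aloud
    first_to_fall: str  # puppet that misbehaves first (hypothetical label)
    twists: list = field(default_factory=list)  # optional complications


DECK = [
    ScenarioCard(
        prompt="Cache eviction bug causes a sudden spike on the database.",
        first_to_fall="cache",
        twists=["Now the retries overload the queue."],
    ),
    ScenarioCard(
        prompt="Third-party payment provider latency triples.",
        first_to_fall="payment provider",
        twists=["You have no latency metrics for this new dependency."],
    ),
    ScenarioCard(
        prompt="New deployment silently disables retries on a critical client.",
        first_to_fall="deployment",
        twists=["The status page isn't updated for 30 minutes."],
    ),
]


def draw_card(deck, seed=None):
    """Pick one card; pass a seed to make the draw repeatable."""
    return random.Random(seed).choice(deck)


card = draw_card(DECK, seed=7)
print(card.prompt)
for twist in card.twists:
    print("  twist:", twist)
```

Passing a seed lets a facilitator rerun the same session with another team; leaving it out gives a surprise draw.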
How to Run an Incident Puppet Show
You can run these sessions as short, 45–60 minute exercises.
1. Set the Scene
- The facilitator lays out the puppets on the stage in the “normal” state.
- Briefly explain the system flow: who calls whom, what’s critical, what’s nice‑to‑have.
- Assign roles: someone plays the on‑call, someone plays observability, someone plays the database, etc.
2. Trigger the Incident
- The facilitator introduces the failure card: e.g., “At 10:03, the cache starts returning errors.”
- The person controlling that puppet acts it out: wobbling, falling over, or changing position.
- The facilitator tracks time (“It’s now T+5 minutes… T+15 minutes…”) as the scenario unfolds.
3. Act Out Detection and Response
Ask the team to react in character:
- Does monitoring notice? Move the “Alerting” puppet and read out a mock alert.
- What does the on‑call do first? Which puppet do they talk to? Which dashboard do they “look at”?
- If someone makes an assumption (“The DB must be fine”), reflect it in puppet movement.
This lets you see not just the system’s behavior, but the team’s behavior.
4. Introduce Twists
Once a response pattern emerges, the facilitator can:
- Add a second failure puppet (“Now the retries overload the queue”).
- Reveal missing observability (“You realize you don’t have latency metrics for this new dependency”).
- Simulate communication gaps (“The status page isn’t updated for 30 minutes”).
These twists surface hidden coupling, missing safeguards, and procedural fragility.
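A facilitator who wants to prepare the clock and the twists ahead of time could keep a simple run sheet. This is a hedged sketch: `Beat`, `run_sheet`, and the minute marks are invented for illustration, and the events reuse the examples from this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Beat:
    minute: int  # minutes after the incident starts (T+0)
    event: str   # what the facilitator announces


# Beats can be jotted down in any order; the minute marks are assumptions.
SCRIPT = [
    Beat(15, "Twist: the retries overload the queue."),
    Beat(0, "The cache starts returning errors."),
    Beat(5, "Twist: there are no latency metrics for the new dependency."),
    Beat(30, "Twist: the status page still has not been updated."),
]


def run_sheet(beats):
    """Return the beats as announcement lines, earliest first."""
    ordered = sorted(beats, key=lambda beat: beat.minute)
    return [f"T+{beat.minute} min: {beat.event}" for beat in ordered]


for line in run_sheet(SCRIPT):
    print(line)
```

The sheet is a prompt, not a script to follow rigidly: if the team's response makes a twist irrelevant, skip it.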
5. Debrief and Capture Learnings
After 20–30 minutes of play, pause and discuss:
- What detection path did we rely on most? Was it robust?
- Where did we guess instead of knowing? What data was missing?
- Which steps took the longest? Where did confusion or misalignment appear?
- If this happened in production tomorrow, what would we want to be different?
Turn these into concrete follow‑ups:
- New alerts or dashboards.
- Clearer ownership boundaries.
- Runbook updates.
- Experiments to validate assumptions.
This is where learning deliberately outpaces failure: the incident never actually happened, but the learning is real.
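To answer "which steps took the longest?" with more than a hunch, a note-taker can log the simulated minute of each milestone and compute the gaps afterwards. The log entries and the `phase_durations` helper below are hypothetical; treat this as a sketch, not a prescribed format.

```python
def phase_durations(log):
    """Return (from_label, to_label, minutes) gaps, longest first.

    `log` is a list of (minute, label) tuples in any order.
    """
    ordered = sorted(log)
    gaps = [
        (a_label, b_label, b_min - a_min)
        for (a_min, a_label), (b_min, b_label) in zip(ordered, ordered[1:])
    ]
    return sorted(gaps, key=lambda gap: gap[2], reverse=True)


# Hypothetical notes from one session, in simulated minutes.
session_log = [
    (0, "cache starts erroring"),
    (5, "alert fires"),
    (12, "on-call suspects the database"),
    (25, "cache identified as the cause"),
    (30, "mitigation applied"),
]

for start, end, minutes in phase_durations(session_log):
    print(f"{minutes:>2} min  {start} -> {end}")
```

In this made-up session the longest gap (13 minutes) sits between suspecting the database and identifying the cache, which points the debrief at diagnosis rather than detection.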
Why This Low‑Tech Method Works in High‑Tech Systems
Despite its simplicity, a cardboard stage taps into several powerful dynamics:
1. Safe, Low‑Stakes Exploration
A puppet show is obviously a simulation. That psychological distance:
- Lowers fear of blame.
- Encourages people to admit uncertainty.
- Makes experimenting with “bad ideas” acceptable.
When the risk is close to zero, people tend to be more creative and more candid in their reflection.
2. Making Complexity Tangible
Large‑scale systems are often too big to hold in one person’s head. Puppets:
- Turn invisible network calls into visible lines and positions.
- Highlight coupling simply by how cluttered the stage becomes.
- Make it easier to say, “Wait—why does this depend on that?”
That’s the moment where hidden problems surface.
3. Practicing the Human Side of Incidents
Postmortems often reveal not only technical failures, but also:
- Miscommunication and unclear ownership.
- Slow or noisy incident coordination.
- Unclear decision‑making authority.
Puppet simulations help teams rehearse:
- Who becomes incident commander.
- How status is shared internally and externally.
- How handoffs work if the incident spans teams or time zones.
You’re not just debugging the system—you’re debugging the response process.
4. Storytelling as a Learning Engine
Through the lens of educational puppetry, incidents become stories:
- There’s a cast (services and people).
- There’s conflict (failures, overloads, bugs).
- There’s resolution (mitigation, recovery, learning).
Story structure makes information stick. When people later face a real incident, they can recall: “This feels like the puppet scenario where the cache quietly failed and the DB melted.”
Making Puppet Stages Part of SRE Practice
To get lasting value, integrate puppet stages into your reliability culture:
- Regular drills: Run a session monthly or quarterly, like fire drills.
- Cross‑team participation: Include product, support, and leadership occasionally.
- Rotate facilitators: Spread ownership; different people see different failure modes.
- Version the system: Update the puppets when architecture changes.
You can even connect puppet stages with other practices:
- Use scenarios derived from real post‑incident reviews.
- Validate new architecture designs by acting them out before implementation.
- Use findings to prioritize backlog items for reliability work.
Over time, these playful rehearsals build muscle memory. The team becomes more confident, more aligned, and more skilled at detecting and addressing issues early—exactly what SRE aims to achieve.
Conclusion: Learning Faster Than You Fail
A cardboard box and a handful of paper marionettes will never replace production telemetry, chaos experiments, or solid engineering. But they don’t have to. Their power lies in how they:
- Lower the stakes enough to invite honest exploration.
- Turn abstract architectures into concrete, shareable models.
- Use storytelling, role‑play, and educational puppetry techniques to deepen engagement and independent problem‑solving.
A Cardboard Reliability Puppet Stage reframes incident practice as iterative learning, not just damage control. By acting out failures before they hit production, you train your team to spot weak signals earlier, coordinate more effectively, and respond with less panic and more clarity.
In other words: you rehearse the outage so your users never have to see it.
All you need to start is some cardboard, a few markers—and a willingness to let your systems and your team be a little bit theatrical in the name of reliability.