The Pencil-Drawn Outage Puppet Theater: Acting Out Hidden Failure Paths With Paper Characters
How a low-tech, paper-and-pencil “puppet theater” can reveal hidden failure paths, sharpen incident response, and turn reliability work into something teams actually look forward to doing.
If you’ve ever tried to reason about a complex system during a major incident, you know the feeling: logs are flying by, dashboards are red, Slack is on fire, and half the dependencies you’re discovering in real time weren’t even on the architecture diagram.
Now imagine you’d already rehearsed this outage—using nothing but pencils, paper, and a bit of theater.
That’s the idea behind the Pencil-Drawn Outage Puppet Theater: a low-tech, collaborative way to act out incidents and failure paths using paper “characters” to represent systems, services, and users. It’s playful, but it’s not a joke. This method can expose hidden failure modes, clarify mental models, and dramatically improve your team’s incident readiness.
This post explores how it works, why it’s effective, and how you can run your first session.
What Is a Pencil-Drawn Outage Puppet Theater?
Think of it as tabletop incident rehearsal meets storyboarding.
- You draw systems, services, and users as simple paper characters.
- You place them on a table or whiteboard and move them around like puppets as the story unfolds.
- You narrate an incident scenario in scenes, like a play, exploring how the outage starts, spreads, and is ultimately resolved.
The point isn’t artistic quality. Stick figures are fine. What matters is:
- Narrative structure (a clear beginning, middle, end)
- Visible interactions (who calls what, who depends on whom)
- Shared understanding (everyone can see the same “stage”)
By externalizing the system as a story with characters and actions, you make complexity legible not just to senior engineers, but to anyone involved in delivery and operations.
Why Use Puppets When You Have Dashboards?
Because this approach solves some problems that tools can’t.
1. A Low-Risk, Low-Cost Incident Simulator
Running real failure injection in production or staging is powerful but expensive and sometimes politically tricky.
Pencil-drawn puppet theaters provide a low-risk, cost-effective way to simulate incidents:
- No infrastructure required beyond paper, pens, and maybe sticky notes.
- No risk of breaking real systems.
- Easy to try risky “what-if” ideas you’d never run live.
You can quickly explore:
- “What happens if this region dies?”
- “What if this third-party API returns junk instead of errors?”
- “What if our alerting is delayed by 10 minutes?”
Because the simulation is lightweight, you can run many scenarios in an afternoon that would be impractical to test for real.
2. Uncover Hidden Failure Paths and Edge Cases
Formal reviews and automated tests are crucial—but they typically focus on known paths:
- Documented dependencies
- Intended flows
- Expected error conditions
When you act out outages as stories, people naturally improvise:
- “Wait—doesn’t the mobile app also talk directly to Service B?”
- “If this queue backs up, won’t billing be delayed too?”
- “Who owns this legacy cron job that retries forever?”
That improvisation is where the hidden failure paths emerge.
The combination of narrative and visual puppets makes it easier to spot:
- Unexpected transitive dependencies
- Non-obvious shared bottlenecks (e.g., “all of these call the same feature flag service”)
- Edge cases around retries, timeouts, and degraded modes
Traditional architecture diagrams are static; puppet theater is dynamic. You see how the system behaves over time, not just where the boxes and arrows live.
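If it helps to capture what the table revealed, the arrows between cards can be written down as a small dependency map afterwards. The sketch below is only an illustration: the `CALLS` map, the `Checkout` service, and the `min_callers` threshold are assumptions made up for the example, not a description of any real system. It lists everything a character depends on, directly or indirectly, and flags services that several others call.

```python
from collections import defaultdict

# Hypothetical dependency map: each character lists what it calls directly.
CALLS = {
    "Mobile App": ["API Gateway", "Service B"],
    "API Gateway": ["Search Service", "Checkout"],
    "Search Service": ["Feature Flags", "User DB"],
    "Checkout": ["Feature Flags", "Payment Processor"],
    "Service B": ["Feature Flags"],
}


def transitive_dependencies(service, calls):
    """Return every service reachable from `service`, directly or indirectly."""
    seen = set()
    stack = list(calls.get(service, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(calls.get(current, []))
    return seen


def shared_bottlenecks(calls, min_callers=3):
    """Return services that at least `min_callers` others call directly."""
    callers = defaultdict(set)
    for caller, callees in calls.items():
        for callee in callees:
            callers[callee].add(caller)
    return {
        name: sorted(who)
        for name, who in callers.items()
        if len(who) >= min_callers
    }


if __name__ == "__main__":
    print(sorted(transitive_dependencies("Mobile App", CALLS)))
    print(shared_bottlenecks(CALLS))
```

With this made-up map, Feature Flags is the only service with three direct callers, which is the kind of shared bottleneck a session tends to surface. The map is only as good as what people said out loud at the table, so treat it as notes rather than ground truth.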
Turning Systems Into Characters and Scenes
The heart of the method is treating components as characters in a story.
Step 1: Cast the Characters
Give each important entity a simple paper card:
- Systems & Services: API Gateway, Search Service, Payment Processor, Feature Flags, Auth Service
- Data Stores: User DB, Cache, Analytics Warehouse
- External Dependencies: Email Provider, SMS Gateway, Third-Party API
- Users & Roles: End User, On-call Engineer, SRE, Support, Manager
Add short notes to each card:
- Purpose: “stores user sessions”, “handles checkout”
- SLO/SLA snippets: “p95 < 200ms”, “3 nines uptime target”
- Known quirks: “rate-limited”, “slow cold starts”, “single-region”
These become your puppets.
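If you want to keep the cast between sessions, or print cards instead of hand-writing them, the same notes fit a tiny record type. This is a minimal sketch, assuming a hypothetical `CharacterCard` class whose field names are invented here to mirror the notes above; paper works just as well.

```python
from dataclasses import dataclass, field


@dataclass
class CharacterCard:
    """One paper puppet: a name, a kind, and the short notes written on it."""
    name: str
    kind: str  # e.g. "service", "data store", "external dependency", "role"
    purpose: str = ""
    slo: str = ""
    quirks: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Format the card as plain text, skipping notes that are empty."""
        lines = [f"{self.name} ({self.kind})"]
        if self.purpose:
            lines.append(f"  Purpose: {self.purpose}")
        if self.slo:
            lines.append(f"  SLO: {self.slo}")
        if self.quirks:
            lines.append(f"  Quirks: {', '.join(self.quirks)}")
        return "\n".join(lines)


# Example cast; the notes are illustrative, not real measurements.
cast = [
    CharacterCard("User DB", "data store", purpose="stores user sessions",
                  slo="p95 < 200ms", quirks=["single-region"]),
    CharacterCard("Email Provider", "external dependency",
                  quirks=["rate-limited"]),
]

if __name__ == "__main__":
    for card in cast:
        print(card.render())
        print()
```

Keeping the fields optional matches how cards usually start out: a name and a kind first, with quirks added as people remember them mid-scene.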
Step 2: Define the Plot
Like any good play, your outage needs a story arc:
- Setup – Normal operation; how users and systems interact on a good day.
- Inciting Incident – Something fails or degrades.
- Escalation – The impact spreads, new symptoms appear, people get paged.
- Climax – Key decisions, tradeoffs, and interventions.
- Resolution – Systems recover, follow-ups are identified.
You can outline this on a whiteboard and then let the team fill in the details.
Step 3: Act It Out
Now, run the scenario as a live rehearsal:
- One person plays the Narrator (“At 14:03, latency to the database spikes…”).
- Others pick up characters and move them or speak for them (“Search Service: I can’t get results from the DB, I’ll start timing out users”).
- A facilitator asks questions: “Who notices first?” “What alerts fire?” “Who gets paged?” “What do you try?”
You’re not aiming for perfection; you’re aiming for learning. Every “Wait, I don’t know” marks a spot for follow-up.
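One way to prepare the Narrator is to sketch the escalation ahead of time: given who depends on whom, which characters feel the failure first, second, and third? The example below is a rough sketch under a big assumption, namely that impact spreads one dependency hop per scene. Real outages are messier, since caches, retries, and timeouts can delay or amplify the spread. The `DEPENDS_ON` map is hypothetical.

```python
# Hypothetical map: each character lists what it depends on directly.
DEPENDS_ON = {
    "End User": ["API Gateway"],
    "API Gateway": ["Search Service", "Auth Service"],
    "Search Service": ["User DB"],
    "Auth Service": ["User DB"],
}


def escalation_scenes(failed, depends_on):
    """Group affected characters by how many hops they are from the failure."""
    # Invert the map so we can ask "who depends on X?"
    dependents = {}
    for name, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(name)

    scenes = [[failed]]
    seen = {failed}
    while True:
        next_scene = []
        for name in scenes[-1]:
            for dependent in dependents.get(name, []):
                if dependent not in seen:
                    seen.add(dependent)
                    next_scene.append(dependent)
        if not next_scene:
            return scenes
        scenes.append(sorted(next_scene))


if __name__ == "__main__":
    for number, scene in enumerate(escalation_scenes("User DB", DEPENDS_ON), start=1):
        print(f"Scene {number}: {', '.join(scene)}")
```

The output is a starting script, not a prediction. The useful part of the session is where the room disagrees with it.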
A Rehearsal Technique for Real Incidents
Theater folk rehearse so that during a performance they can respond under pressure without freezing. Outage puppet theaters serve the same purpose.
They help teams practice:
- Coordination under stress – Who leads? Who communicates externally? Who dives into logs?
- Decision-making – Do we roll back, fail over, or go into read-only mode?
- Communication patterns – How do we keep stakeholders informed without distracting responders?
By walking through incidents in a calm environment, teams build muscle memory:
- Newer engineers learn what “good” incident response looks like.
- Senior engineers surface implicit knowledge they’ve never documented.
- Everyone learns the language of impact, risk, and tradeoffs.
When a real outage hits, it feels less like chaos and more like a hard but familiar scene they’ve rehearsed.
Cross-Functional Learning, Not Just SRE Playtime
One of the biggest advantages of the puppet theater format is that it welcomes non-engineers into the reliability conversation.
Because the whole system is visible on the table:
- Product managers can ask, “If this fails, which user flows actually break?”
- Support can share, “Here’s what customers complain about when this is slow.”
- Security can highlight, “This fail-open behavior is risky.”
The visual, narrative style helps bridge vocabulary gaps:
- Instead of “the DB’s p95 latency exceeded thresholds,” you might say, “The User DB character is responding slowly, so Search Service is making users wait longer, and they start refreshing the page.”
This shared story:
- Builds psychological safety (“It’s okay not to know everything; we’re exploring together”).
- Encourages open communication (“This part of the system always confused me—can we talk through it?”).
- Enables cross-functional tradeoff discussions (“If we prioritize this reliability fix, it reduces these three outage scenes we just saw”).
From Ambiguous Projects to Concrete Storyboards
Many reliability projects start with vague slogans:
- “We need better resilience.”
- “We should reduce incident MTTR.”
- “We have to make the payment pipeline more robust.”
The puppet theater turns those fuzzy goals into concrete, visual storyboards:
- You draw the current flow: who talks to whom, in what order.
- You act out failure scenarios and identify specific weak spots.
- You sketch improved flows directly on paper: new timeouts, retries, fallbacks, canary paths.
Each iteration is quick:
- Change the cards.
- Move arrows.
- Try a new scene.
Instead of arguing in abstract terms, you point to the table and say:
“With this change, when the Email Provider goes down, we defer emails to a queue and keep checkout working, instead of failing the whole order.”
Now “better resilience” is no longer philosophy; it’s a sequence of visible, agreed-upon behaviors.
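To make the Email Provider scene concrete, here is a minimal sketch of that behavior: checkout confirms the order, and a failed email is deferred to a queue instead of failing the whole order. Every name in it (`send_confirmation`, `checkout`, `drain_deferred`, the `provider_up` flag) is a hypothetical stand-in. A real system would call an actual email client and would likely use a durable queue rather than an in-memory one.

```python
from collections import deque


class EmailProviderDown(Exception):
    """Raised by the stand-in email call when the provider is unavailable."""


def send_confirmation(order_id, provider_up):
    """Stand-in for a real email client call (hypothetical)."""
    if not provider_up:
        raise EmailProviderDown(f"could not send email for order {order_id}")


def checkout(order_id, deferred_emails, provider_up=True):
    """Confirm the order; defer the email instead of failing if sending breaks."""
    status = {"order_id": order_id, "order": "confirmed", "email": "sent"}
    try:
        send_confirmation(order_id, provider_up)
    except EmailProviderDown:
        deferred_emails.append(order_id)
        status["email"] = "deferred"
    return status


def drain_deferred(deferred_emails, provider_up=True):
    """Retry deferred emails in order; stop at the first failure."""
    sent = []
    while deferred_emails:
        order_id = deferred_emails[0]
        try:
            send_confirmation(order_id, provider_up)
        except EmailProviderDown:
            break
        sent.append(deferred_emails.popleft())
    return sent


if __name__ == "__main__":
    queue = deque()
    print(checkout(101, queue, provider_up=False))  # order confirmed, email deferred
    print(drain_deferred(queue, provider_up=True))  # provider is back, email goes out
```

Acting this out on the table first tends to raise the questions the code glosses over: how long deferred emails may wait, who gets alerted when the queue grows, and whether customers should see a notice in the meantime.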
Making Reliability Work Engaging (and Memorable)
Borrowing from performance and puppetry—roles, scenes, scripts—isn’t just cute. It helps knowledge stick.
People tend to remember:
- Stories better than stats
- Characters better than components
- Dialogues better than diagrams
After a session, someone might say:
- “Remember when Feature Flags crashed and took half the app with it?”
- “We realized the On-call Engineer didn’t know the backdoor read-only endpoint existed.”
These shorthand references become part of your team’s shared folklore, making future discussions faster and more grounded.
The theatrical format also makes reliability work less intimidating:
- New hires can participate on day one—no deep system knowledge needed.
- People feel more comfortable asking “naive” questions when everyone is playing.
- It’s easier to schedule a “puppet theater session” than “another 2-hour incident meeting.”
Engagement matters because the more people actively participate, the richer the mental models your organization develops.
How to Run Your First Puppet Theater Session
You don’t need a big rollout. Start small.
- Pick a scope: Choose one user journey or subsystem (e.g., “user login” or “checkout flow”).
- Gather 4–8 people: Include at least one engineer, one on-call or SRE, and one non-engineer (PM, support, etc.).
- Prepare basic materials:
  - Index cards or sticky notes
  - Pens/markers
  - Table or whiteboard surface
- Draw the characters: Make cards for key services, data stores, users, and roles.
- Outline one incident scenario: For example, “The database’s primary node becomes slow and occasionally times out during peak hours.”
- Act it out:
  - Walk through normal operation.
  - Introduce the incident.
  - Narrate who notices, what alerts fire, who gets paged, what actions are taken.
- Capture insights: On a separate sheet, note:
  - “Unknowns” and surprises
  - Hidden dependencies
  - Process or tooling gaps
  - Improvement ideas
- Decide on 1–3 follow-ups: Turn insights into concrete tasks (e.g., “Document dependency on Feature Flags,” “Add alert for queue backlog,” “Define read-only mode playbook”).
Repeat regularly. Treat these sessions like tabletop fire drills for your systems.
Conclusion: Rehearse the Outage Before It Happens
Modern systems are too complex to fully understand from static diagrams or code alone. We need ways to play with failure, safely, cheaply, and collaboratively.
The Pencil-Drawn Outage Puppet Theater offers:
- A low-risk, low-cost simulator for incidents and outages
- A way to surface hidden failure paths and edge cases
- A shared narrative that aligns engineers, PMs, and stakeholders
- A rehearsal space to practice incident response before the real thing
- A tool to turn ambiguous reliability goals into concrete, visual storyboards
- A memorable, engaging format borrowed from performance and puppetry
You don’t need new software. Just paper, pencils, and a willingness to treat your systems like characters on a stage.
Start with one small scenario. Draw the puppets. Tell the story. You might be surprised how much your team learns before anything ever goes down in production.