The Cardboard Reliability Theater: Acting Out Incidents With Zero Screens and Maximum Insight
How low‑tech, screen‑free tabletop exercises—“cardboard reliability theater”—can supercharge your incident response, improve team coordination, and tie practical drills into serious frameworks like NIST CSF 2.0.
If your incident practice mainly looks like “everyone jumps on Zoom and stares at dashboards,” you’re leaving a lot of resilience on the table.
Yes, we need good tools, metrics, and alerts. But the most complex system in your incident response stack isn’t Grafana or PagerDuty—it’s the people and the way they work together under stress.
Cardboard Reliability Theater is a deliberately low‑tech way to practice incidents: no laptops, no dashboards, no terminals—just people, paper, and props. Think of it as tabletop incident response meets improv theater. You act out real failures, follow real playbooks, and stress‑test your real processes—without touching production.
This post explains how and why to run these screen‑free rehearsals, how to connect them to serious frameworks like NIST CSF 2.0, and how to turn each “performance” into data that improves your systems and your playbooks.
Why Practice Incidents at All?
Most teams agree they should “be prepared” for incidents. Fewer have a structured way to do it.
Incident response tabletop exercises give you a safe sandbox to:
- Rehearse real‑world failures without impacting customers
- Expose gaps in your monitoring, processes, and on‑call rotations
- Build muscle memory so people aren’t improvising everything under stress
- Normalize talking about failure openly and constructively
When you treat these exercises as part of your operating rhythm (not a once‑a‑year compliance checkbox), you build a culture where incidents are expected, prepared for, and learned from.
Playbooks: Scripts for High‑Stress Moments
In theater, even great improvisers work from some kind of structure. In incident response, that structure is your incident response playbooks.
A good playbook answers, in concrete terms:
- Who is responsible for what? (incident commander, scribe, comms lead, SMEs)
- What are the first three actions when X breaks? (triage, containment, comms)
- How do we communicate? (channels, frequency, audiences)
- When do we escalate? (severity levels, decision criteria)
- Where do we record decisions and timelines?
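One way to make those five questions checkable is to treat a playbook as structured data. The sketch below is a hypothetical Python model, not a standard schema: the `Playbook` class, its field names, the `gaps` helper, and the sample values (including the 25% threshold) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List

# Roles named in the list above; treated here as the minimum a playbook must staff.
REQUIRED_ROLES = {"incident commander", "scribe", "comms lead"}


@dataclass
class Playbook:
    """Hypothetical playbook model: one field per question a playbook should answer."""

    trigger: str                    # the failure this playbook covers
    roles: Dict[str, str]           # who: role name -> person or rotation
    first_actions: List[str]        # what: the first three actions, in order
    comms_channels: List[str]       # how: channels and audiences for updates
    escalation_criteria: List[str]  # when: conditions that force escalation
    timeline_location: str          # where: home of the decision log

    def gaps(self) -> List[str]:
        """Return a description of every question this playbook leaves open."""
        found = []
        missing_roles = REQUIRED_ROLES - {name.lower() for name in self.roles}
        if missing_roles:
            found.append("unassigned roles: " + ", ".join(sorted(missing_roles)))
        if len(self.first_actions) < 3:
            found.append("fewer than three first actions")
        if not self.comms_channels:
            found.append("no communication channels")
        if not self.escalation_criteria:
            found.append("no escalation criteria")
        if not self.timeline_location.strip():
            found.append("no place to record decisions")
        return found


# Made-up example playbook with a few deliberate holes.
payments_outage = Playbook(
    trigger="Payments service 5xx spike",
    roles={"Incident Commander": "primary on-call", "Comms Lead": "support manager"},
    first_actions=["Triage impact", "Contain by pausing deploys"],
    comms_channels=["#incident channel", "status page"],
    escalation_criteria=["Error rate above 25% for 10 minutes"],
    timeline_location="",
)

for gap in payments_outage.gaps():
    print("GAP:", gap)
```

Running a check like this before an exercise gives the facilitator a short list of weak spots to probe on stage.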
Under pressure, people don’t rise to the occasion—they fall back to their training. Clear, pre‑defined playbooks mean:
- Newer engineers can contribute quickly
- Senior folks aren’t making every decision from scratch
- The team has a shared mental model of “what good looks like”
Cardboard Reliability Theater is where these playbooks stop being theoretical documents and start becoming living scripts you practice, revise, and trust.
Connecting the Theater to NIST CSF 2.0
This isn’t just fun role‑play. You can directly map these exercises into your risk management work using frameworks like the NIST Cybersecurity Framework (CSF) 2.0.
NIST CSF 2.0 organizes its outcomes into six functions:
- Govern (risk strategy, roles, policy, oversight)
- Identify (understand risks, assets, dependencies)
- Protect (preventive controls)
- Detect (visibility and alerts)
- Respond (contain, communicate, coordinate)
- Recover (restore, improve, learn)
Cardboard Reliability Theater mainly targets the Respond and Recover functions, but it also surfaces issues in Govern, Identify, Protect, and Detect.
For example, during an exercise you might discover:
- You can’t answer “Which services depend on this database?” → Identify gap
- The team doesn’t know who can approve a firewall change in an emergency → Protect gap
- No alert exists for this failure mode → Detect gap
By explicitly mapping your exercise findings back to NIST CSF 2.0, you:
- Turn a “fun drill” into evidence for audits and stakeholders
- Ensure that incident practice is embedded in your risk program, not ad‑hoc
- Prioritize fixes based on broader risk context, not just what feels urgent
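As a rough illustration of that mapping, findings can be tagged with a CSF function as they are written down and then grouped for reporting. This is a minimal sketch, not an official NIST artifact: the `CsfFunction` enum simply lists the six CSF 2.0 function names (Govern included), while `group_by_function` and the finding texts are hypothetical and reuse the three example gaps above.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, List, Tuple


class CsfFunction(Enum):
    """The six functions defined by NIST CSF 2.0."""

    GOVERN = "Govern"
    IDENTIFY = "Identify"
    PROTECT = "Protect"
    DETECT = "Detect"
    RESPOND = "Respond"
    RECOVER = "Recover"


def group_by_function(
    findings: List[Tuple[str, CsfFunction]],
) -> Dict[str, List[str]]:
    """Group finding descriptions under the name of the function they map to."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for text, function in findings:
        grouped[function.value].append(text)
    return dict(grouped)


# The three example gaps from the text, tagged as the text tags them.
findings = [
    ("Cannot list which services depend on this database", CsfFunction.IDENTIFY),
    ("Unclear who can approve an emergency firewall change", CsfFunction.PROTECT),
    ("No alert exists for this failure mode", CsfFunction.DETECT),
]

for name, items in group_by_function(findings).items():
    print(f"{name}: {len(items)} finding(s)")
```

A grouped view like this is easy to paste into an audit packet or a risk register review.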
Why Go Screen‑Free?
At first, “no laptops” sounds counterintuitive. Isn’t the whole point to practice with the tools you’ll use in a real incident?
It’s not an either/or. You should absolutely run tool‑centric drills. But screens pull attention away from the human system, and that’s where many incidents really go sideways.
Screen‑free, role‑play style exercises surface questions like:
- Who speaks up? Who stays quiet?
- Does someone clearly take command, or do we get decision paralysis?
- How do we handle conflicting opinions without stalling?
- Who remembers to update stakeholders outside the incident channel?
- Do we default to blame, or to curiosity?
You’ll see behaviors and communication patterns that are invisible in log files but crucial to real‑world outcomes.
And because you’re not allowed to “just check the logs,” people are forced to:
- Rely on playbooks and process
- Explain their mental models out loud
- Clarify assumptions about data, systems, and responsibilities
That is where the deepest learning happens.
Borrowing from Chaos Engineering: Perturb the Plot
Chaos engineering teaches us to deliberately inject failure into systems to understand how they behave. Cardboard Reliability Theater applies the same idea to your people and processes.
Don’t just run a simple, linear scenario. Perturb it mid‑exercise:
- Halfway through a simulated outage, the primary on‑call is “unreachable”
- The incident commander is suddenly “pulled into another emergency”—who steps up?
- The status page post accidentally over‑shares technical detail—how do you correct and coordinate with Legal/PR?
- A new, unrelated alert appears—do you triage it or ignore it?
You’re not trying to trick people; you’re trying to discover:
- How well do we handle surprise?
- Where are our single points of human failure?
- How resilient are our communication patterns when plans break?
These perturbations stress‑test not just the tech assumptions in your playbooks, but the social and organizational assumptions too.
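A facilitator who wants the surprises to be unpredictable, even to themselves, could draw them at random before the session. The following is a small sketch under assumptions: the function name `schedule_perturbations`, the 15-minute quiet start, and the use of a seed are illustrative choices, and the card texts paraphrase the list above.

```python
import random
from typing import List, Optional, Tuple

# Perturbation cards paraphrased from the examples above.
PERTURBATIONS = [
    "Primary on-call is unreachable",
    "Incident commander is pulled into another emergency",
    "Status page post over-shares technical detail",
    "A new, unrelated alert appears",
]


def schedule_perturbations(
    duration_min: int,
    count: int,
    quiet_start_min: int = 15,
    seed: Optional[int] = None,
) -> List[Tuple[int, str]]:
    """Pick `count` distinct cards and distinct minutes after a quiet start."""
    window = range(quiet_start_min, duration_min)
    if count > len(PERTURBATIONS) or count > len(window):
        raise ValueError("not enough cards or minutes for that many perturbations")
    rng = random.Random(seed)  # a seed makes the plan reproducible
    cards = rng.sample(PERTURBATIONS, k=count)
    minutes = sorted(rng.sample(window, k=count))
    return list(zip(minutes, cards))


for minute, card in schedule_perturbations(duration_min=60, count=2, seed=7):
    print(f"minute {minute}: {card}")
```

The quiet start gives the team time to settle into a pattern before the plot is disturbed.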
Treat Each Performance as Data
Theater is temporary, but your learning shouldn’t be.
Approach every exercise as a structured experiment:
1. Start with hypotheses
- “If service X is down, the owning team will be paged within 5 minutes.”
- “Our runbook for database failover is clear enough for any L2 on‑call.”
2. Observe the performance
- Time how long it takes to assign roles
- Note when confusion or disagreements appear
- Track how often someone says “Wait, who’s responsible for that?”
3. Falsify assumptions
If your hypothesis was “everyone knows the escalation path,” and your actors argue for 10 minutes about who to call, that assumption is falsified. Good! Now you know what to fix.
4. Capture insights concretely
After the exercise, run a short retro:
- What surprised us?
- What worked well that we want to keep?
- What slowed us down or confused us?
- Which playbooks or policies need updates?
5. Feed back into systems and playbooks
- Update runbooks and incident roles
- Adjust monitoring, alerting, or dashboards
- Clarify ownership, on‑call rotations, or escalation paths
- Map changes back to NIST CSF (e.g., Respond/Recover functions)
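The hypothesis-then-observation loop can be recorded in a few lines of code. This is only a sketch: the `Hypothesis` class and `evaluate` function are hypothetical names, the 12-minute observation is invented for the example, and it covers only hypotheses that state a time bound, such as the paging example above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Hypothesis:
    statement: str
    limit_min: float  # the time bound the hypothesis claims


def evaluate(hypothesis: Hypothesis, observed_min: Optional[float]) -> str:
    """Compare the time observed in the exercise to the claimed bound.

    `observed_min` is None when the expected event never happened.
    """
    if observed_min is None:
        return "falsified (never happened)"
    if observed_min <= hypothesis.limit_min:
        return "held"
    return (
        f"falsified (took {observed_min:g} min, "
        f"limit {hypothesis.limit_min:g} min)"
    )


paging = Hypothesis("Owning team is paged within 5 minutes of service X going down", 5)
print(evaluate(paging, observed_min=12))
```

Writing the bound down before the exercise keeps the retro honest: the result is read off the scribe's timeline rather than argued from memory.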
Treat your cardboard stage as a lab for your incident process. Every exercise should change something in the real world.
How to Run Your First Cardboard Reliability Theater Session
You don’t need budget approval or fancy tools. You need:
- 60–90 minutes
- A room (or virtual whiteboard if remote)
- Sticky notes / index cards / “cardboard”
1. Define a realistic scenario
Pick a failure you’re actually worried about:
- Core database cluster unavailable
- Major cloud region outage
- Critical authentication service misconfigured
Write a short, concrete starting situation:
“It’s 10:07 AM on a Tuesday. PagerDuty just paged the primary on‑call for the Payments service: 5xx rates have spiked to 40%.”
2. Cast the roles
Assign people to roles with name cards:
- Incident Commander
- Scribe
- Comms Lead (internal and/or external)
- Technical leads (SRE, app engineer, DB, network, security)
- Optional: Exec, Customer Support, Legal/PR
3. Set the rules
- No laptops, no phones, no dashboards
- All “system state” comes from the facilitator (using prewritten cards)
- You may reference real playbooks or printed runbooks
4. Run the exercise
The facilitator reveals new information over time:
- “You get a report from Support: top customer unable to check out”
- “Logs show elevated timeouts to the database tier”
- “Cloud provider status page reports issues in your primary region”
Let the team coordinate, ask questions, and make decisions as if it were real, but using dialogue instead of keyboards.
5. Add perturbations
Once the team settles into a pattern, introduce surprises:
- “The primary DB expert is on a plane with no Wi‑Fi”
- “Your Slack workspace just went down; how do you communicate now?”
Observe what breaks and what adapts.
6. Debrief and document
Reserve at least 20 minutes to debrief:
- What did we assume that turned out to be false?
- Where did we waste time or duplicate effort?
- Which decisions were hard, and why?
- What changes are we committing to by next week?
Translate those into concrete tickets and playbook updates.
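To keep the "by next week" commitment from evaporating, debrief notes can be turned into dated action items before anyone leaves the room. The sketch below assumes notes are captured as (summary, owner) pairs; `ActionItem`, `commitments`, `unowned`, the sample notes, and the exercise date are all hypothetical and not tied to any ticketing system.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import List, Optional, Tuple


@dataclass
class ActionItem:
    summary: str
    owner: Optional[str]  # None means nobody has picked it up yet
    due: date


def commitments(
    notes: List[Tuple[str, Optional[str]]],
    exercise_day: date,
    days: int = 7,
) -> List[ActionItem]:
    """Turn (summary, owner) debrief notes into items due `days` after the exercise."""
    due = exercise_day + timedelta(days=days)
    return [ActionItem(summary, owner, due) for summary, owner in notes]


def unowned(items: List[ActionItem]) -> List[str]:
    """Return summaries of items that still need an owner."""
    return [item.summary for item in items if not item.owner]


# Made-up debrief notes for illustration.
notes = [
    ("Add fallback channel to comms playbook", "comms lead"),
    ("Document who can approve emergency firewall changes", None),
]
items = commitments(notes, exercise_day=date(2025, 3, 4))

print("Due:", items[0].due.isoformat())
print("Needs an owner:", unowned(items))
```

Anything that appears in the unowned list is a good last question for the debrief.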
Combining Structure and Creativity for Real Resilience
The most effective incident response programs don’t rely solely on:
- Frameworks and policies (NIST, compliance docs), or
- Ad‑hoc heroics and clever tooling, or
- Feel‑good simulations with no follow‑through
They combine:
- The discipline of structured frameworks like NIST CSF 2.0
- The clarity of well‑designed incident playbooks
- The insight from creative, low‑tech simulations like Cardboard Reliability Theater
That combination gives you something deeper than uptime: a team that knows how to think, communicate, and adapt together when things go wrong.
Conclusion: Step Onto the Cardboard Stage
Modern systems are too complex to avoid failure. The question isn’t if you’ll have incidents, but how prepared you’ll be when they hit.
Cardboard Reliability Theater offers:
- A cheap, low‑risk way to rehearse serious failures
- A lens on human and organizational issues your tools can’t see
- A structured pipeline from practice → insight → better playbooks and systems
Pick one scenario, gather a small group, ban laptops for an hour, and act it out.
You might feel silly at first. That’s fine. The first rehearsal is always awkward.
But over time, you’ll build a team that doesn’t just respond to incidents but performs under pressure, with clarity, coordination, and confidence. And that’s the kind of reliability no dashboard can measure directly, but every customer can feel.