The Paper Failure Greenroom: Backstage Rituals for Rehearsing Incidents Before Opening Night
Learn how to use theater-inspired tabletop exercises and incident drills as a backstage greenroom for safely rehearsing failures before they ever reach production.
In theater, nobody walks on stage for opening night without rehearsal. There are read-throughs, blocking, tech runs, and dress rehearsals. Mistakes are welcomed early so they don’t happen under the spotlight.
Yet in many engineering organizations, “opening night” is the first time the team really experiences a serious incident together. Roles are unclear, communication is improvised, and the only script is whatever someone remembers from a dusty runbook.
This is where the Paper Failure Greenroom comes in: a backstage space to rehearse incidents safely, on paper (or in a sandbox), before anything breaks in production.
Incidents as Opening Night, Tabletop as Rehearsal
Think of your production environment as the main stage. Your customers are the audience. A major outage? That’s your high-stakes premiere.
But the real work of building a reliable incident response practice happens offstage.
Tabletop exercises—guided, low-stakes discussions that walk through hypothetical incidents—are your script read-throughs and blocking rehearsals. They’re:
- Low-cost: No need for complex tooling. A scenario, a facilitator, and a conference room (or Zoom) are enough.
- Low-stakes: Nothing in production is harmed. You simulate on paper (or slides), not in live systems.
- High-learning: They expose gaps in processes, tooling, and assumptions long before real customers are affected.
In a tabletop, you’re not trying to replicate every detail of an outage. You’re practicing how the cast—engineers, SREs, support, product, leadership—moves together when something goes wrong.
Designing Drills Across the Full Incident Lifecycle
A good rehearsal doesn’t just cover the climax of the play. It works through the entire story arc. Incident drills should do the same, covering the end-to-end lifecycle:
- Detection
  - How do we first notice something is wrong?
  - Who gets paged? What alerts exist? What’s missing?
- Triage & Classification
  - How do we decide severity?
  - Which teams are involved? Who’s the Incident Commander?
- Coordination & Communication
  - How are updates shared within the team?
  - How do we communicate with stakeholders and customers?
- Mitigation & Resolution
  - What are our first safe actions?
  - Where do we find runbooks and historical incidents?
- Closure & Learning
  - How do we know the incident is fully resolved?
  - How and when do we run a retrospective?
A well-crafted scenario walks the team through each of these phases. The point isn’t to “win the game” by resolving quickly; it’s to surface friction:
- "We don’t know who owns this system."
- "We don’t have a clear definition of sev-1 vs sev-2."
- "Nobody knows where the runbooks live."
- "We forgot to update the status page for 30 minutes."
Those discoveries are gold. They’re the equivalent of catching a missed line or bad lighting cue during dress rehearsal—invaluable before opening night.
Severity Systems and Roles: Your Cast List and Script
You can’t run a coherent play if nobody knows who they are on stage.
Effective, repeatable incident response depends on two foundational elements:
1. A Clear Severity System
A severity system is your shared language for how bad something is and how quickly you must respond. For example:
- SEV-1: Critical impact, large customer base affected, or major data risk. All hands on deck.
- SEV-2: Significant degradation, notable customer impact, but partial workarounds exist.
- SEV-3: Minor degradation or localized impact, normal working hours response.
- SEV-4: Cosmetic or low-impact issues, handled as regular work.
Tabletop exercises pressure-test these definitions:
- Do people classify the same scenario consistently?
- Do they know what a SEV-1 automatically triggers (e.g., incident channel, comms lead, leadership page)?
When everyone shares the same mental model of severity, the first 10 minutes of a real incident are far less chaotic.
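One way to make the severity system concrete enough to rehearse against is to write it down as data. The sketch below is a minimal illustration, not a prescribed format: the `SeverityPolicy` type, the `POLICIES` table, and `classification_agreement` are hypothetical names, and the SEV-2 triggers are assumptions (the SEV-1 triggers come from the example above). The helper measures how consistently participants classify the same scenario.

```python
from collections import Counter
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1
    SEV2 = 2
    SEV3 = 3
    SEV4 = 4


@dataclass(frozen=True)
class SeverityPolicy:
    summary: str
    auto_triggers: tuple[str, ...]  # what declaring this level sets in motion


# Hypothetical policy table; adapt the wording and triggers to your own org.
POLICIES: dict[Severity, SeverityPolicy] = {
    Severity.SEV1: SeverityPolicy(
        "Critical impact, large customer base affected, or major data risk",
        ("incident channel", "comms lead", "leadership page"),
    ),
    Severity.SEV2: SeverityPolicy(
        "Significant degradation, partial workarounds exist",
        ("incident channel", "comms lead"),  # assumption, not from the article
    ),
    Severity.SEV3: SeverityPolicy(
        "Minor degradation or localized impact",
        ("normal working hours response",),
    ),
    Severity.SEV4: SeverityPolicy(
        "Cosmetic or low-impact issue",
        ("handled as regular work",),
    ),
}


def classification_agreement(votes: list[Severity]) -> tuple[Severity, float]:
    """Return the most common severity and the share of participants who chose it."""
    if not votes:
        raise ValueError("need at least one vote")
    top, count = Counter(votes).most_common(1)[0]
    return top, count / len(votes)


if __name__ == "__main__":
    # Each participant privately classifies the same tabletop scenario.
    votes = [Severity.SEV1, Severity.SEV2, Severity.SEV1, Severity.SEV1]
    top, share = classification_agreement(votes)
    print(f"Most common: {top.name}, agreement: {share:.0%}")
    print("Auto-triggers:", ", ".join(POLICIES[top].auto_triggers))
```

A low agreement share in a drill is a useful signal that the definitions, not the people, need work.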
2. Defined Roles and Responsibilities
In theater, you have a director, stage manager, actors, lighting, sound. In incidents, you need similarly clear roles, for example:
- Incident Commander (IC) – Owns the response process, not the keyboard. Keeps focus, assigns tasks, manages the timeline.
- Operations Lead – Drives the technical diagnosis and mitigation work.
- Communications Lead – Handles updates to stakeholders, status pages, and internal channels.
- Scribe – Captures a detailed timeline and decisions for later review.
Tabletop drills let people practice being in these roles before the pressure is real. You can:
- Rotate participants through roles across sessions.
- Let newer engineers try being IC with mentorship.
- Refine responsibilities when confusion appears.
Roles turn an incident from “everyone talking at once” into an organized performance.
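Rotating people through roles is easy to forget to do fairly. As a rough sketch, assuming the four roles listed above and a fixed participant list, a simple round-robin shift gives everyone a turn in each seat over successive sessions. The function name `rotate_roles` and the participant names are hypothetical.

```python
ROLES = [
    "Incident Commander",
    "Operations Lead",
    "Communications Lead",
    "Scribe",
]


def rotate_roles(participants: list[str], session: int) -> dict[str, str]:
    """Assign each role to a participant, shifting by one seat per session."""
    if len(participants) < len(ROLES):
        raise ValueError("need at least one participant per role")
    offset = session % len(participants)
    return {
        role: participants[(i + offset) % len(participants)]
        for i, role in enumerate(ROLES)
    }


if __name__ == "__main__":
    team = ["Ana", "Ben", "Chi", "Dev"]  # placeholder names
    for session in range(3):
        print(f"Session {session}: {rotate_roles(team, session)}")
```

With more participants than roles, the extra people sit out as observers in a given session and rotate in later.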
Building Psychological Safety Through Rehearsal
Incidents are stressful. Pages at 3 a.m., angry customer messages, leadership asking for updates—it’s a lot, especially for newer on-call engineers.
Regular simulations create psychological safety by making the unfamiliar feel familiar.
When people have:
- Seen a similar scenario in rehearsal,
- Practiced declaring an incident and assuming a role,
- Walked through communication patterns and escalation paths,
then real incidents feel less like chaos and more like a high-intensity version of something they already know how to do.
This reduces:
- Fear of “messing up” in front of others.
- Hesitation to take ownership or speak up.
- Cognitive overload when alarms start firing.
And it increases:
- Confidence in using tools and runbooks.
- Trust that the team will support, not blame.
- Willingness to surface uncertainty (“I don’t know” is allowed).
The greenroom is where actors loosen up, get into character, and shake off nerves. Your tabletop sessions should serve the same purpose for your on-call staff.
The "Play a Drill" Mindset: Frequent, Realistic, Immersive
To get real value, treat drills as a regular practice, not a one-off compliance tick box.
Adopt a "play a drill" mindset:
- Frequent
  - Run small, focused exercises monthly or even biweekly.
  - Rotate systems, severities, and participants.
  - Keep most sessions to 60–90 minutes to respect people’s time.
- Realistic
  - Base scenarios on real incidents (yours or others’).
  - Include ambiguity: conflicting alerts, partial data, uncertain ownership.
  - Reflect real constraints: incomplete dashboards, noisy logs, time pressure.
- Immersive
  - Use your actual tools: incident channels, ticket systems, status page drafts.
  - Have people play their real roles as they would in production.
  - Introduce time jumps ("10 more minutes pass, error rate doubles") to keep the pace.
- Safe to Fail
  - Explicitly frame drills as learning spaces, not performance evaluations.
  - Reward surfacing problems (“We realized nobody has access to X”) instead of punishing mistakes.
The more your rehearsals resemble “the real thing,” the more opening night feels like just another performance—important, but not overwhelming.
Turning Rehearsals into Learning: Post-Exercise Reviews
The exercise is only half the value. The other half comes from what you do after.
Run a short post-exercise review (15–30 minutes) while everything is fresh:
- Reconstruct the Story
  - What happened, step by step?
  - When did we first recognize the problem?
  - What decisions were made, and why?
- Highlight What Worked
  - Clear handoffs? Great IC leadership? Solid customer updates?
  - Tools or runbooks that genuinely helped?
- Surface Gaps and Risks
  - Missing alerts or dashboards.
  - Unclear ownership or role confusion.
  - Bottlenecks in approvals or communication.
- Define Concrete Improvements
  - Update severity definitions or role descriptions.
  - Add or refine runbooks and alerts.
  - Create follow-up tickets for systemic fixes.
- Share Learnings Broadly
  - Summarize outcomes for the wider org: “In this rehearsal, we learned X and are changing Y.”
  - Use it to increase organizational risk awareness beyond just the on-call team.
Treat the review as a script rewrite. Each session makes the next performance tighter, clearer, and less surprising.
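If the review notes tend to evaporate after the meeting, a lightweight record can help. The following is only a sketch of one possible shape: the `Finding` type, its fields, and `review_summary` are hypothetical, and the owners and actions in the example are made up for illustration. It separates what worked from gaps, and flags any gap that still lacks a concrete action or owner.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    description: str
    kind: str          # "worked" or "gap"
    action: str = ""   # concrete improvement, for gaps
    owner: str = ""    # who follows up, for gaps


def review_summary(findings: list[Finding]) -> dict[str, list[str]]:
    """Group findings into what worked, owned follow-ups, and unresolved gaps."""
    gaps = [f for f in findings if f.kind == "gap"]
    return {
        "worked": [f.description for f in findings if f.kind == "worked"],
        "follow_ups": [
            f"{f.action} (owner: {f.owner})" for f in gaps if f.action and f.owner
        ],
        "needs_owner_or_action": [
            f.description for f in gaps if not (f.action and f.owner)
        ],
    }


if __name__ == "__main__":
    notes = [
        Finding("Clear handoffs between IC and Ops Lead", "worked"),
        Finding(
            "Nobody knows where the runbooks live",
            "gap",
            action="Link runbooks from the incident channel",  # illustrative
            owner="platform team",                             # illustrative
        ),
        Finding("Status page not updated for 30 minutes", "gap"),
    ]
    for section, items in review_summary(notes).items():
        print(section, "->", items)
```

Anything left under `needs_owner_or_action` at the end of the review is a candidate for a follow-up ticket before the session closes.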
Getting Started: Your First Paper Failure Greenroom Session
You don’t need a huge program to begin. Start small:
- Pick a System and a Scenario
  - Example: Payment processing latency spikes during peak hours.
- Define the Cast
  - Choose an IC, Ops Lead, Comms Lead, Scribe, plus a facilitator.
  - Invite a few observers from related teams.
- Prepare the Beats, Not the Script
  - Outline key stages: initial alert, customer reports, escalating impact.
  - Decide where to inject new information (“Now the database CPU is 95%”).
- Run for 60 Minutes
  - Work through detection, triage, mitigation, comms, and closure.
  - Pause occasionally to clarify intent, not to overcorrect.
- Review and Capture Learnings
  - What surprised people?
  - What would we change about our real incident process tomorrow?
Then, schedule the next one.
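For the "beats, not the script" step, a facilitator might keep the scenario as a short list of timed injects. The sketch below uses the payment latency example from this section; the `Beat` type, the minute marks, the phase labels, and both helper functions are assumptions made for illustration. One helper tells the facilitator which injects are due at a given point in the session, and the other checks that the beats touch every lifecycle phase.

```python
from dataclasses import dataclass

# Phase labels are shorthand for the lifecycle stages described earlier.
LIFECYCLE = ("detection", "triage", "coordination", "mitigation", "closure")


@dataclass(frozen=True)
class Beat:
    minute: int   # minutes into the session when the inject is revealed
    phase: str    # lifecycle phase the inject is meant to exercise
    inject: str   # what the facilitator tells the room


# Illustrative timings; tune them to your own 60-minute session.
SCENARIO = [
    Beat(0, "detection", "Alert: payment processing latency spikes during peak hours."),
    Beat(10, "triage", "Customer reports start arriving through support."),
    Beat(15, "coordination", "Support asks what to tell affected customers."),
    Beat(20, "mitigation", "Now the database CPU is 95%."),
    Beat(30, "mitigation", "10 more minutes pass, error rate doubles."),
    Beat(45, "closure", "Latency is back to normal. How do we confirm it is resolved?"),
]


def due_injects(beats: list[Beat], elapsed_minutes: int) -> list[Beat]:
    """Return, in time order, every beat that should have been revealed by now."""
    ordered = sorted(beats, key=lambda b: b.minute)
    return [b for b in ordered if b.minute <= elapsed_minutes]


def uncovered_phases(beats: list[Beat]) -> list[str]:
    """List lifecycle phases that no beat exercises."""
    covered = {b.phase for b in beats}
    return [phase for phase in LIFECYCLE if phase not in covered]


if __name__ == "__main__":
    for beat in due_injects(SCENARIO, elapsed_minutes=20):
        print(f"[{beat.minute:>2} min] ({beat.phase}) {beat.inject}")
    print("Uncovered phases:", uncovered_phases(SCENARIO))
```

Keeping the beats this sparse leaves room for the team to improvise, which is the point of preparing beats rather than a full script.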
Conclusion: Don’t Debut Underprepared
Opening night will come. Systems will fail, alerts will fire, customers will notice.
You can either treat each incident as an unscripted improv performance—or you can build a Paper Failure Greenroom, where your teams rehearse, refine, and grow confident together long before the house lights go up.
By:
- Using tabletop exercises as low-stakes rehearsals,
- Practicing clear severity systems and roles,
- Emphasizing psychological safety and frequent, immersive drills, and
- Turning every exercise into a learning opportunity,
you transform incidents from terrifying unknowns into challenging but familiar performances.
Rehearse your failures on paper first, backstage, where it’s safe. That way, when the curtain rises on your next real incident, your team is ready to deliver a calm, coordinated, and reliable show.