The Paper-First Incident Drillbook: Rehearsing On-Call Nightmares While It’s Still Daylight
How to design paper-first incident tabletop drills that actually rehearse your real on-call nightmares—before they wake you up at 3 a.m.
Ask any on-call engineer what they fear most and you’ll hear the same themes: alerts that don’t make sense, missing runbooks, no clear owner, Slack chaos, leadership pinging for updates while nobody even knows what’s broken yet.
You don’t fix that kind of chaos by buying another monitoring tool. You fix it by rehearsing incidents before they happen—on paper, while it’s safe, quiet, and everyone is awake.
This is where a paper-first incident drillbook comes in: pre-planned tabletop exercises that walk your team through realistic, organization-specific incidents, step by step, until good practice becomes muscle memory.
Below is a practical guide to designing, running, and improving these drills.
Why “Paper-First” Beats Wing-It Chaos
When teams attempt incident drills for the first time, a common pattern emerges:
- Someone says, “Let’s just make up a scenario and see how we do.”
- The exercise turns into an unstructured debate about architecture diagrams.
- Nobody practices the actual incident process.
The result: interesting conversation, no operational improvement.
“Paper-first” means you:
- Plan the drill in advance, in writing.
- Base it on your real processes and threat landscape, not hypotheticals.
- Codify roles, steps, and prompts on paper (or in a doc) before you get in a room.
This shifts the exercise from improvisation to deliberate practice. Like rehearsing a play, everyone works from the same script—then learns where the script is missing or wrong.
Step 1: Anchor Drills in Real Processes
A drill can’t improve a process that doesn’t exist.
Before you design scenarios, identify the core workflows you want to rehearse. For most teams, that includes:
- Incident detection and triage (What happens when an alert fires?)
- Incident command (Who leads? How do they communicate?)
- Technical investigation and mitigation (How do responders coordinate?)
- Stakeholder communication (How and when do you update customers, execs, support?)
- Documentation and follow-up (How is the incident recorded and reviewed?)
Make sure you have at least a rough incident response process written down (even a one-page checklist) before running a drill. Then design scenarios specifically to:
- Exercise that process end to end.
- Reveal where the process is unclear, unrealistic, or missing.
If your “process” is really just tribal knowledge, the drill will expose that quickly—which is valuable. Capture what actually happens in the drill and use it to formalize the process.
Step 2: Design Scenarios That Reflect Your Real Threats
Effective tabletop drills don’t borrow generic plots from cyber-disaster movies. They simulate the on-call nightmares your team is most likely to face.
To design realistic scenarios:
- Review your history:
  - Past SEV-1 or SEV-2 incidents
  - Near-misses and recurring issues
  - Outages at similar companies or in your industry
- Map to your environment. Consider:
  - Cloud provider(s) and regions in use
  - Critical dependencies (payment gateways, SSO, databases, message queues)
  - Regulatory constraints (e.g., data privacy, uptime SLAs)
- Pick 2–4 high-value scenario types, for example:
  - Sudden traffic spike causing cascading failures
  - Data store corruption or unavailable primary database
  - Security incident (compromised credentials, suspicious access)
  - Third-party outage impacting key flows (payments, logins, notifications)
Each scenario should include:
- Background: What’s normal? What’s at stake (revenue, safety, compliance, brand)?
- Trigger: What’s the first sign? (Alerts, customer tickets, dashboard anomalies.)
- Constraints: What’s broken or unavailable? Any time pressure?
- Complications: What twists will you introduce mid-exercise? (E.g., “PagerDuty is down,” “Your primary contact at Vendor X is offline.”)
Your goal is to make people say: “This is exactly what would keep me up at night.”
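If you keep scenarios in a repository rather than a doc, the four parts above can be captured as a small record that flags whatever is still missing. This is only a sketch: the Scenario class, its field names, and the sample values are illustrative assumptions, not a required schema.

```python
from dataclasses import dataclass, field


@dataclass
class Scenario:
    """One tabletop scenario (hypothetical structure for illustration)."""
    name: str
    background: str   # what is normal, and what is at stake
    trigger: str      # the first visible sign of trouble
    constraints: list[str] = field(default_factory=list)    # what is broken, time pressure
    complications: list[str] = field(default_factory=list)  # mid-exercise twists

    def missing_parts(self) -> list[str]:
        """Return the names of any parts that are still empty."""
        parts = {
            "background": self.background.strip(),
            "trigger": self.trigger.strip(),
            "constraints": self.constraints,
            "complications": self.complications,
        }
        return [name for name, value in parts.items() if not value]


# Sample values are made up for the example.
draft = Scenario(
    name="Third-party payment outage",
    background="Checkout normally succeeds; revenue is at stake.",
    trigger="Customer tickets about failed payments.",
    constraints=["Vendor status page shows no incident"],
)
print(draft.missing_parts())  # ['complications']
```

A check like this is mainly a reminder to write the mid-exercise twists in advance instead of improvising them in the room.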
Step 3: Involve All the Right Stakeholders
On-call nightmares are rarely pure technical puzzles. They’re socio-technical: they go wrong at the intersection of people, process, and infrastructure.
You’ll only see those failure points if the right people are in the room.
Who should participate?
- On-call engineers / SREs / devs who would respond first.
- Incident commander (or whoever fills that role today).
- Support / customer success (often first to hear from customers).
- Product / business owner for the impacted service.
- Security (especially for any scenario with a security angle).
- Communications / PR / marketing for customer-facing impact.
You don’t need every person for every exercise, but for each scenario, ask:
“Who would be actively involved in a real incident like this?”
Then make sure they’re not just observers—they should speak, decide, and act during the drill.
Stakeholder participation reveals:
- Gaps in ownership (“Who is actually allowed to make this call?”)
- Breakdowns in handoffs (support → engineering, engineering → leadership, etc.)
- Missing or conflicting communication paths (too many channels, not enough structure).
Step 4: Choose the Right Format(s): Discussion vs. Hands-On
Not every drill needs to be a full technical simulation. There are two main styles, both useful:
1. Discussion-Based Tabletop
Everyone sits around a table (physical or virtual) and walks through the scenario step by step. The facilitator reveals new information over time.
Best for:
- New or evolving processes.
- Cross-functional communication practice.
- Early-stage maturity, small teams, or limited time.
Strengths:
- Low cost: you can run it with slides or a shared doc.
- Easier to involve non-technical stakeholders.
- Great for testing decision-making and coordination.
2. Operational / Hands-On Exercise
You simulate (or carefully induce) real system conditions and let the team respond in the actual tools they’d use in production.
Best for:
- Mature teams with stable processes.
- Practicing technical debugging and mitigation.
- Validating monitoring, dashboards, and runbooks.
Strengths:
- High-fidelity rehearsal of alerts, dashboards, and runbooks.
- Builds deep technical muscle memory.
Many organizations benefit from a hybrid approach:
- Start with discussion-based tabletops to refine process and roles.
- Graduate key scenarios into hands-on exercises once the process is stable.
Step 5: Build a Simple Paper-First Drill Template
You don’t need a fancy platform. A well-structured document is enough.
Here’s a minimal drill template you can adapt:
1. Scenario Overview
- Name: Payment Gateway Partial Outage
- Type: Availability / Third-party dependency
- Services impacted: Checkout API, Order processing
- Business impact: Revenue loss, abandoned carts, support load
2. Objectives
- Validate incident detection and triage flow.
- Practice stakeholder communication to support and leadership.
- Test decision-making around failover to backup payment provider.
3. Participants & Roles
- Incident Commander: Name
- Primary Responder (Service X): Name
- Support Lead: Name
- Product Owner: Name
- Observer / Note-taker: Name
4. Timeline & Injects
- T+0 min: Alert from monitoring: Checkout errors > 15%
- T+5 min: Support reports surge in “payment failed” tickets.
- T+10 min: Error logs show that most failures are requests to PaymentProviderA.
- T+15 min: New inject: Exec asks, “Should we post a status page update?”
- T+20 min: New inject: Backup payment provider has higher fees; finance raises concern.
For each inject, the facilitator asks:
- What do you do?
- Who does it?
- Where is it documented?
- How do you communicate it, and to whom?
5. Artifacts & Tools
- Link to runbook(s).
- Status page guidance.
- Internal incident channel naming convention.
- Monitoring dashboards.
6. After-Action Review Notes
Leave space to capture observations, decisions, and gaps as they appear.
This is all “paper-first”—you prepare the playbook before the rehearsal.
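For teams that prefer to generate the facilitator's script rather than type it out, the timeline and the four facilitator questions from the template can be combined mechanically. The sketch below assumes a hypothetical Inject record and a facilitator_script helper; both names are invented for illustration, and the events are taken from the example timeline.

```python
from dataclasses import dataclass

# The four facilitator questions from the template.
PROMPTS = [
    "What do you do?",
    "Who does it?",
    "Where is it documented?",
    "How do you communicate it, and to whom?",
]


@dataclass(frozen=True)
class Inject:
    minute: int  # minutes after the drill starts (T+N)
    event: str   # what the facilitator reveals


def facilitator_script(injects: list[Inject]) -> list[str]:
    """Return script lines: each inject in time order, followed by the prompts."""
    lines = []
    for inject in sorted(injects, key=lambda i: i.minute):
        lines.append(f"T+{inject.minute} min: {inject.event}")
        lines.extend(f"  ask: {prompt}" for prompt in PROMPTS)
    return lines


# Deliberately listed out of order to show that the script sorts by time.
timeline = [
    Inject(5, "Support reports surge in 'payment failed' tickets."),
    Inject(0, "Alert from monitoring: Checkout errors > 15%"),
    Inject(15, "Exec asks whether to post a status page update."),
]
for line in facilitator_script(timeline):
    print(line)
```

Keeping the injects as data makes it easy to reuse the same scenario with a different twist next quarter: add or swap one Inject and regenerate the script.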
Step 6: Run a Structured After-Action Review
The most important part of any drill is what happens after it ends.
Without a structured after-action review (AAR), the same issues will reappear in the next incident.
Hold the AAR as soon as possible (ideally immediately) and focus on:
- What actually happened?
  - Timeline of key decisions and actions.
- What went well?
  - Call out behaviors, tools, or steps that should be repeated.
- What was confusing or broken?
  - Unclear ownership.
  - Missing or outdated runbooks.
  - Tooling gaps (no dashboard, useless alerts).
- What will we change before the next drill?
  - Convert each issue into a small, concrete action with an owner and a due date.
Example improvements:
- “Create a short incident commander checklist.”
- “Add a standard internal update cadence (every 15 minutes).”
- “Document backup payment provider failover steps.”
Capture the AAR outcomes in your incident drillbook—a living knowledge base that grows with every exercise.
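One way to keep AAR actions from going stale is to store them as records and check them before the next drill. The following is a minimal sketch under that assumption: the ActionItem class, the helper functions, the owner name, and the dates are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class ActionItem:
    description: str
    owner: Optional[str] = None  # person accountable for the change
    due: Optional[date] = None   # agreed completion date


def unassigned(items: list[ActionItem]) -> list[str]:
    """Return descriptions of items that lack an owner or a due date."""
    return [i.description for i in items if not i.owner or i.due is None]


def overdue(items: list[ActionItem], today: date) -> list[str]:
    """Return descriptions of items whose due date is before today."""
    return [i.description for i in items if i.due is not None and i.due < today]


# Owner and dates below are invented for the example.
items = [
    ActionItem("Create a short incident commander checklist.", "alice", date(2024, 3, 1)),
    ActionItem("Document backup payment provider failover steps."),
]
print(unassigned(items))  # ['Document backup payment provider failover steps.']
print(overdue(items, today=date(2024, 3, 15)))  # ['Create a short incident commander checklist.']
```

Reviewing both lists at the start of each drill is a cheap way to confirm that the previous session actually changed something.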
Step 7: Start Small, Repeat Often
You do not need a huge budget to get value from incident drills. Consistency beats complexity.
Practical cadence and scope:
- Quarterly cross-team tabletop for a major scenario (e.g., company-wide outage, security incident).
- Monthly lightweight drill for critical services (30–60 minutes, just a few participants).
- Rotate scenarios so each core risk area (availability, security, data integrity, third-party dependencies) gets covered over time.
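The cadence above can be laid out as a simple rotating calendar. The sketch below makes one simplifying assumption that is not in the text: in every third month the cross-team tabletop takes the place of the lightweight drill instead of running alongside it. The drill_calendar function name is likewise illustrative.

```python
from itertools import cycle

# Risk areas named in the rotation bullet above.
RISK_AREAS = ["availability", "security", "data integrity", "third-party dependencies"]


def drill_calendar(months: int) -> list[tuple[int, str, str]]:
    """Return (month, drill kind, risk area) for each month, numbered from 1.

    Assumption: every month gets a lightweight drill, except every third
    month, which is upgraded to a cross-team tabletop. Risk areas rotate
    in a fixed order.
    """
    areas = cycle(RISK_AREAS)
    plan = []
    for month in range(1, months + 1):
        kind = "cross-team tabletop" if month % 3 == 0 else "lightweight drill"
        plan.append((month, kind, next(areas)))
    return plan


for month, kind, area in drill_calendar(6):
    print(f"month {month}: {kind} ({area})")
```

Because there are four risk areas and the tabletop lands every third month, a twelve-month plan built this way gives each risk area exactly one cross-team tabletop.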
Keep early drills deliberately simple:
- One scenario, one hour, one facilitator.
- Use existing tools: video call, shared doc, chat channel.
- Aim for 1–3 clear improvements per session—no more.
Over time, that repetition builds:
- Muscle memory: People know what to do when paged at 3 a.m.
- Shared mental models: Teams understand how systems and roles fit together.
- Cultural safety: Incidents become learnable events, not personal failures.
Conclusion: Practice in Daylight to Sleep Better at Night
Real incidents are messy, emotional, and expensive. But the skills needed to handle them—clear communication, decisive leadership, effective technical response—are all trainable.
A paper-first incident drillbook gives you a low-risk way to:
- Rehearse real on-call nightmares before they happen.
- Expose gaps in process, tooling, and ownership while the stakes are low.
- Turn every exercise into lasting improvements through structured reviews.
You don’t need a perfect plan or a big platform to start. You just need:
- A simple written process.
- A realistic scenario.
- The right people in the room.
- A commitment to capture and act on what you learn.
Run your incidents on paper while it’s still daylight—so that when the real 3 a.m. page comes in, it feels less like a nightmare and more like a script you’ve already rehearsed.