The One-Page Incident Rehearsal: Script Tiny Fire Drills Before Your Next Production Outage
How to use a single-page script and tiny incident fire drills to sharpen your SRE practice, protect SLOs, and respond faster and more calmly to real production outages.
Most teams wait for real outages to test how good their incident response really is.
By then, it’s too late.
There’s a better pattern: tiny, scripted fire drills guided by a one-page rehearsal runbook. Done regularly, these short exercises harden your incident muscle memory, protect SLOs, and make real incidents feel less like chaos and more like execution.
This post walks through how to design and run one-page incident rehearsals as a core SRE practice, with concrete templates you can start using this week.
Why Rehearsal Belongs at the Heart of SRE
SRE isn’t just about building reliable systems—it’s about operating them reliably under stress. That’s where rehearsal shines.
Treat incident rehearsal as a first-class SRE practice for three reasons:
- Direct protection of SLOs
  Faster detection, better triage, and clearer communication all reduce user impact. Rehearsal improves:
  - MTTA (Mean Time to Acknowledge) – You spot and acknowledge issues sooner.
  - MTTR (Mean Time to Recover) – You coordinate faster, escalate correctly, and avoid confusion.
  - Error budget burn – You waste less of your error budget fumbling around.
- Operational continuity
  When key people are out, rehearsed processes keep things running. Standardized drills:
  - Spread knowledge across the team
  - Reduce dependency on “that one person who knows how it works”
  - Make handovers and on-call rotations less risky
- Psychological safety under pressure
  In a real incident, adrenaline is high. Rehearsals make the process familiar so:
  - People know their roles and responsibilities
  - Communication patterns are automatic
  - The team can stay calm and systematic
To get these benefits, you don’t need complex game days. You need small, frequent, scripted practice.
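To make the error budget point concrete, here is a back-of-the-envelope sketch. The SLO target and the minutes saved are illustrative assumptions, not measurements from any particular system:

```python
# Back-of-the-envelope: how much error budget a faster response preserves.
# All numbers below are illustrative assumptions, not real measurements.

SLO_TARGET = 0.999                  # 99.9% monthly availability target
MINUTES_PER_MONTH = 30 * 24 * 60

error_budget_min = (1 - SLO_TARGET) * MINUTES_PER_MONTH
print(f"Monthly error budget: {error_budget_min:.1f} minutes")   # ~43.2 minutes

# Suppose rehearsal shaves 15 minutes of fumbling off a single full outage.
saved_min = 15
print(f"Budget preserved by one faster response: "
      f"{saved_min / error_budget_min:.0%} of the month's budget")  # ~35%
```

At a tight SLO, a few rehearsed minutes per incident is a meaningful slice of the whole month's budget.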
Step 1: Standardize Incident “Response Codes”
Before you rehearse, you need a common language.
Incident response codes are short, standardized labels that encode priority and required actions. They let you communicate the situation instantly without a paragraph of explanation.
Example response code framework:
- Code 0 – Informational
  - No user impact.
  - Example: Non-prod system degraded, metrics anomaly under investigation.
  - Actions: Track, learn, improve, but no broad alert.
- Code 1 – Low Impact
  - Minor user impact, limited scope or easy workaround.
  - Actions: On-call engineer investigates, no full incident room, optional status page update.
- Code 2 – Moderate Incident
  - Noticeable user impact; SLOs at risk but not fully violated yet.
  - Actions: Declare incident, create channel, assign roles (incident commander, comms, ops lead), regular updates.
- Code 3 – Major Incident
  - Widespread impact or severe SLO breach.
  - Actions: Full response team, exec visibility, customer communications, strict timelines for updates.
- Code 4 – Critical / Safety / Legal
  - Regulatory, safety, security, or critical data integrity risk.
  - Actions: All-hands, special procedures, legal/compliance engaged.
Your codes don’t have to follow this exact structure, but they must be:
- Clear: Everyone understands what each code means.
- Actionable: Each code maps to specific required actions (who joins, who communicates, update cadence).
- Consistent: The same code means the same response across teams.
In rehearsal, you use these codes constantly: “We’re escalating this from Code 1 to Code 2 based on error rate and support tickets.” That repetition makes quick, aligned decisions easier during real outages.
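One way to keep codes clear, actionable, and consistent is to record the code-to-actions mapping somewhere versioned and queryable, so drills and tooling read from the same table. A minimal sketch, assuming made-up field names and role titles; adapt it to your own paging and chat tools:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class ResponseCode:
    """One entry in the team's shared response-code table."""
    code: int
    name: str
    declare_incident: bool            # open a channel/ticket and assign roles?
    external_comms: bool              # post to status page / key customers?
    update_interval_min: Optional[int]  # internal update cadence; None = ad hoc
    required_roles: list[str] = field(default_factory=list)

# Illustrative table mirroring the framework above; tune to your own context.
RESPONSE_CODES = {
    0: ResponseCode(0, "Informational", False, False, None),
    1: ResponseCode(1, "Low Impact", False, False, None, ["on-call"]),
    2: ResponseCode(2, "Moderate Incident", True, False, 30,
                    ["incident commander", "comms lead", "ops lead"]),
    3: ResponseCode(3, "Major Incident", True, True, 15,
                    ["incident commander", "comms lead", "ops lead", "exec sponsor"]),
    4: ResponseCode(4, "Critical / Safety / Legal", True, True, 15,
                    ["incident commander", "comms lead", "ops lead",
                     "exec sponsor", "legal/compliance"]),
}
```

Chat or paging automation can then read this table so that, for example, declaring a Code 2 automatically invites the required roles and sets the update cadence.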
Step 2: Design the One-Page Rehearsal Script
Your rehearsal runbook should fit on one page. If it’s longer, no one will use it under pressure.
The goal: a script you can quickly print, share, or keep on a second monitor.
A good one-page script includes:
- Scenario header
  - Name: e.g., “Partial DB outage – write failures”
  - Default response code (starting level)
  - Systems affected
  - Dependencies to watch
- Objectives for this drill
  - Practice detection and triage within X minutes
  - Validate escalation path and ownership
  - Exercise external communication (status page, stakeholder updates)
- Timeline checklist (end-to-end response)
  - 0–5 minutes: Detection & Acknowledgment
    - Who detects the issue? (monitoring, support, manual report?)
    - How is it acknowledged? (paging tool, Slack, SMS?)
    - Is the response code assigned?
  - 5–15 minutes: Triage & Containment
    - Assign roles: Incident Commander (IC), Ops Lead, Comms Lead.
    - Identify blast radius: which services, regions, tenants?
    - Implement quick containment or mitigation if possible.
  - 15–30 minutes: Communication & Escalation
    - Open or update incident channel/ticket.
    - Decide whether to escalate the response code.
    - Internal updates: every X minutes.
    - External communication (status page, key customers) if relevant.
  - 30+ minutes: Resolution & Verification
    - Apply fix or rollback plan.
    - Verify via metrics, logs, synthetic checks.
    - Downgrade or close the incident code when stable.
- Key questions to ask during the drill
  - How would we know this is happening if no one told us?
  - Who owns the failing component?
  - What is the fastest safe rollback or mitigation?
  - Do we know when to escalate? Who approves?
- Metrics to track
  - Time to detect (simulated)
  - Time to acknowledge
  - Time to engage the right people
  - Time to communicate externally
  - Time to resolution (simulated)
This script is not a full runbook for fixing the system; it’s a playbook for running the incident response.
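During a drill, the facilitator can simply timestamp each milestone as it happens and compute the tracked metrics for the mini-retro afterwards. A minimal sketch; the milestone names mirror the list above, and the print-based reporting is just an assumption for illustration:

```python
from datetime import datetime, timezone

class DrillClock:
    """Records drill milestones and reports elapsed times for the mini-retro."""

    def __init__(self):
        self.marks = {}

    def mark(self, milestone: str) -> None:
        self.marks[milestone] = datetime.now(timezone.utc)

    def report(self) -> None:
        start = self.marks.get("incident_start")
        if start is None:
            return
        for name, ts in self.marks.items():
            if name == "incident_start":
                continue
            elapsed = (ts - start).total_seconds() / 60
            print(f"{name:>24}: {elapsed:5.1f} min after start")

# Facilitator usage during a drill:
clock = DrillClock()
clock.mark("incident_start")        # facilitator injects the scenario
clock.mark("detected")              # someone notices (simulated alert or report)
clock.mark("acknowledged")          # on-call acknowledges and assigns a code
clock.mark("right_people_engaged")
clock.mark("external_update")
clock.mark("resolved")
clock.report()
```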
Step 3: Run Tiny, Focused Fire Drills
Avoid the temptation to simulate “everything breaks at once.” Those exercises are rare, heavy, and hard to repeat.
Instead, plan tiny, focused fire drills that each simulate one specific failure mode. Examples:
- API latency spikes in one region
- Database connection pool exhaustion
- Authentication service partial outage
- Message queue backlog for a single critical queue
- Misconfigured feature flag impacting 5% of traffic
Each drill should:
- Take 30–45 minutes end-to-end.
- Include a small group: 3–6 people is ideal.
- Focus on depth of response quality, not breadth of catastrophe.
You can run them as tabletop exercises (purely on paper/in tools, no real system impact) or pair them with a light chaos experiment in a safe environment if your maturity allows.
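If you do pair a drill with a light chaos experiment, keep the injection boring and reversible. As a sketch of the “API latency spikes” scenario in a staging service you control, something like the decorator below could wrap a request handler; the environment variable, delay values, and handler are all assumptions, and none of this should ever point at production:

```python
import functools
import os
import random
import time

def inject_latency(handler):
    """Add artificial delay to a handler when CHAOS_LATENCY_MS is set.

    Intended only for a staging/drill environment; with the variable unset,
    behaviour is unchanged.
    """
    @functools.wraps(handler)
    def wrapper(*args, **kwargs):
        budget_ms = int(os.environ.get("CHAOS_LATENCY_MS", "0"))
        if budget_ms > 0:
            # Jitter so dashboards show a realistic latency spike, not a flat line.
            time.sleep(random.uniform(0.5, 1.0) * budget_ms / 1000)
        return handler(*args, **kwargs)
    return wrapper

@inject_latency
def get_orders(customer_id: str) -> dict:
    # Placeholder for the real staging handler.
    return {"customer": customer_id, "orders": []}
```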
Sample schedule:
- Weekly: 15–30 minute micro-drill with the current on-call.
- Monthly: 45–60 minute scenario with cross-team participants.
- Quarterly: Larger, cross-org exercise that chains a couple of failure modes together.
Step 4: Practice End-to-End, Not Just the Fix
Many teams implicitly rehearse the technical fix by working on bugs and outages. What they don’t rehearse is everything around it.
Ensure every drill covers the full lifecycle:
- Detection
  - Which alerts would fire?
  - Are they tuned to be actionable?
  - Could support or sales spot this first via tickets or calls?
- Triage
  - Who owns the system?
  - Where are the dashboards and logs?
  - How do you decide if this is Code 1 vs Code 2 vs Code 3?
- Communication (see the update sketch after this list)
  - Who declares the incident?
  - How often do you post updates?
  - What do you say when you don’t yet know the root cause?
- Escalation
  - When do you bring in a second on-call or a specialist?
  - When do you inform stakeholders or leadership?
  - What about legal/compliance in Code 4 scenarios?
- Resolution & Verification
  - How do you confirm the system is really healthy?
  - How long do you monitor closely after the fix?
  - When do you officially close the incident?
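A useful rehearsal artifact for the Communication stage is a pre-agreed update template, so the Comms Lead never has to invent wording mid-incident, even before the root cause is known. A minimal sketch; the field names and phrasing are assumptions to adapt to your status page and audience:

```python
def format_update(code: int, summary: str, impact: str,
                  actions: str, next_update_min: int) -> str:
    """Build a short, honest status update, even when root cause is unknown."""
    return (
        f"[Code {code}] {summary}\n"
        f"Impact: {impact}\n"
        f"What we're doing: {actions}\n"
        f"Root cause: still under investigation\n"
        f"Next update in {next_update_min} minutes."
    )

# Example used in a drill:
print(format_update(
    code=2,
    summary="Elevated error rates on checkout API",
    impact="Roughly 5% of checkout requests failing in eu-west",
    actions="Rolling back the 14:10 deploy; monitoring error rates",
    next_update_min=30,
))
```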
By rehearsing end-to-end, you develop smoother team coordination, not just better debugging.
Step 5: Continuously Refine the Script and Improve MTTA/MTTR
The one-page script is a living document.
After each drill, hold a 10–15 minute mini-retro to ask:
- What slowed us down?
- Where did we get confused about ownership or response codes?
- Did we miss an obvious mitigation path?
- Were any communications unclear, too late, or too vague?
Capture 1–3 improvements and update the script immediately:
- Add or refine a checklist item
- Clarify when to escalate response codes
- Link to the right dashboards, runbooks, or playbooks
- Adjust who should be paged first
Over time, you should see measurable improvement in:
- MTTA – People know exactly how and when to acknowledge and classify the incident.
- MTTR – Fewer false starts, faster mitigation, quicker escalations.
- Resilience – Teams stay effective even when senior folks are unavailable.
Make these metrics visible. Track them across drills just like you track them for real incidents.
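One lightweight way to keep the numbers visible is to log every drill and real incident into the same simple record and review the trend at each retro. A minimal sketch, assuming you capture acknowledge and recover durations in minutes; the figures below are made up for illustration:

```python
from statistics import mean

# Each record: (date, is_drill, minutes_to_acknowledge, minutes_to_recover)
# Figures are illustrative, not real measurements.
history = [
    ("2024-01-10", True, 9.0, 38.0),
    ("2024-02-14", True, 6.5, 31.0),
    ("2024-02-20", False, 7.0, 44.0),   # real incident
    ("2024-03-13", True, 4.0, 22.0),
]

drills = [r for r in history if r[1]]
print(f"Drill MTTA trend: {[r[2] for r in drills]}")
print(f"Drill MTTR trend: {[r[3] for r in drills]}")
print(f"Average MTTA across drills: {mean(r[2] for r in drills):.1f} min")
print(f"Average MTTR across drills: {mean(r[3] for r in drills):.1f} min")
```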
Ready-to-Use One-Page Rehearsal Template
Here’s a simple template you can copy into your docs tool and adapt.
```markdown
# One-Page Incident Rehearsal Script

## Scenario
- Name: ________________________________
- Date: ________________________________
- Facilitator: _________________________
- Participants: ________________________
- Default Response Code: [0/1/2/3/4]
- Systems/Services affected: ___________
- Dependencies to monitor: _____________

## Objectives (pick 2–3)
- [ ] Improve time to acknowledge alert
- [ ] Practice assigning roles quickly
- [ ] Validate escalation path
- [ ] Exercise external communication
- [ ] Test our dashboards/runbooks

## Timeline & Checklist

**0–5 min: Detection & Acknowledgment**
- [ ] Alert/source that detects issue: ___________
- [ ] On-call acknowledges via: ________________
- [ ] Assign initial response code: [0/1/2/3/4]

**5–15 min: Triage & Containment**
- [ ] Assign roles: IC, Ops Lead, Comms Lead
- [ ] Identify blast radius (services/regions)
- [ ] Identify potential quick mitigations

**15–30 min: Communication & Escalation**
- [ ] Create/update incident channel/ticket
- [ ] Decide whether to escalate response code
- [ ] Internal update frequency: every __ min
- [ ] External update? [Yes/No] If yes, where?

**30+ min: Resolution & Verification**
- [ ] Proposed fix/rollback: ________________
- [ ] Verification via metrics/logs/checks
- [ ] Downgrade/close incident when stable

## Key Learnings (post-drill)
- What went well?
- What slowed us down?
- What will we change in our process/script?

## Metrics (simulated)
- Time to detect: ______
- Time to acknowledge: ______
- Time to engage right people: ______
- Time to first internal update: ______
- Time to (simulated) resolution: ______
```
Customize the response codes, roles, and timelines to reflect your context and risk appetite.
Conclusion: Start Small, Repeat Often
You don’t need a massive game day to get better at incidents. You need a repeatable, one-page script and a habit of running tiny, focused fire drills.
To summarize:
- Define clear incident response codes so priority and actions are instantly obvious.
- Treat incident rehearsal as core SRE work, not an optional extra.
- Use a one-page runbook to keep drills lightweight and usable under pressure.
- Run small, specific fire drills that simulate real failure modes.
- Practice the entire response lifecycle, not just the technical fix.
- Continuously refine your script to improve MTTA, MTTR, and resilience.
Pick one scenario, copy the template, book 45 minutes with your on-call team, and run your first one-page incident rehearsal this week. Your future self—awake at 2 a.m. during a real outage—will be glad you did.