The Paper-Clock Outage Kitchen Timer: 10-Minute Analog Drills for Relentless On‑Call Practice
How to use a simple paper “clock” and 10‑minute timeboxes to transform on‑call from chaotic guesswork into calm, rehearsed incident response using SRE principles, observability, and continuous micro‑drills.
Incidents rarely wait for convenient times. They erupt on Friday evenings, during deploy windows, or in the middle of someone’s first on‑call shift. Yet many teams “train” for incidents only through infrequent game days or by learning in the heat of a real outage.
There’s a better way: relentless, lightweight practice.
This post walks through a simple approach: using a paper-clock outage kitchen timer to run 10-minute analog drills that simulate mini incidents. You’ll see how to:
- Use a low‑tech paper “clock” to structure focused practice
- Apply SRE concepts—SLIs, SLOs, error budgets—inside a tight timebox
- Continuously prioritize scenarios by risk, not imagination
- Borrow from business continuity and disaster recovery (BC/DR) playbooks
- Sharpen observability skills under realistic time pressure
- Run tiny retrospectives that continuously improve your playbooks
- Build a culture where real incidents feel like rehearsals
Why a Paper-Clock?
With all our sophisticated tooling, it’s oddly powerful to train with something deliberately low‑tech: a paper circle with minutes marked around the edge and a hand you physically move.
Why this works:
- Tactile urgency: Moving a physical hand to “10 minutes from now” makes the countdown feel real.
- No app overhead: No context switching, no notifications, no fiddling with UI. Just you, the clock, and the problem.
- Shared artifact: In a conference room or remote screen-share, everyone sees the same timebox.
- Psychological boundary: The paper clock reminds everyone: this is practice. It’s okay to fail, to pause, to reflect.
All you need is:
- A printed circle with 0–10 minutes marked (or 0–15 if you want buffer)
- A paper arrow with a pin/tack or even just a pen to draw the “deadline”
That’s your outage kitchen timer.
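If you'd rather print a clean dial than draw one by hand, a minimal sketch like the one below (assuming Python with matplotlib installed; the layout is just one option) can generate the 0–10 minute face for you:

```python
# Minimal sketch: generate a printable 0-10 minute dial (assumes matplotlib is installed).
import math
import matplotlib.pyplot as plt

MINUTES = 10  # bump to 15 if you want buffer

fig, ax = plt.subplots(figsize=(6, 6))
ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, linewidth=2))

for minute in range(MINUTES):
    # Put 0 at the top and count clockwise, like a kitchen timer.
    angle = math.pi / 2 - 2 * math.pi * minute / MINUTES
    x, y = math.cos(angle), math.sin(angle)
    ax.plot([0.92 * x, x], [0.92 * y, y], color="black")  # tick mark
    label = "0 / 10" if minute == 0 else str(minute)
    ax.text(1.12 * x, 1.12 * y, label, ha="center", va="center")

ax.set_xlim(-1.3, 1.3)
ax.set_ylim(-1.3, 1.3)
ax.set_aspect("equal")
ax.axis("off")
plt.savefig("paper_clock.pdf")  # print, cut out, pin a paper arrow through the center
```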
The 10-Minute Mini Incident
Each drill is a compressed simulation of an incident, treated seriously but kept tiny by design.
A basic format:
- Setup (1–2 minutes)
  - Pick a scenario card (more on these later).
  - One person is Incident Commander (IC).
  - One person plays Scribe.
  - Others act as on‑call responders.
- Set the paper-clock (10 minutes)
  - Move the hand to 10 minutes from “now”. When it hits zero, the incident stops.
- Run the incident (10 minutes)
  - Someone acts as the “system” (or injects logs/metrics/symptoms).
  - The team investigates, communicates, and tries to mitigate.
- Mini-retro (5–10 minutes)
  - What went well, what was confusing, what to change in docs/tools.
That’s it. 20–25 minutes total for a full learning loop.
The tight timebox forces focus: there’s no time for perfection, so you practice prioritization, communication, and risk tradeoffs.
Weaving SRE Concepts into Each Drill
To keep these from becoming vague “debugging puzzles,” ground every drill in core SRE concepts.
1. SLIs in the Hot Seat
Begin each scenario by stating or asking for the Service Level Indicators (SLIs) that matter:
- “For this API, what are our primary SLIs?”
- “Which SLI is degrading right now? Latency, error rate, availability, freshness?”
Make participants pull up the relevant dashboards or describe which signals they’d check. During the drill, occasionally nudge:
- “Which SLI is your north star during this incident?”
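To make that question concrete, it helps to show what “an SLI” means in practice. Here is a minimal sketch of two ratio-style SLIs a responder should be able to name on demand (the function names and request counts are invented drill inputs, not anyone's real telemetry):

```python
# Ratio-style SLIs for the drill; all counts are invented inputs, not real telemetry.

def availability_sli(good_requests: int, total_requests: int) -> float:
    """Fraction of requests served without a 5xx."""
    return good_requests / total_requests if total_requests else 1.0

def latency_sli(fast_requests: int, total_requests: int) -> float:
    """Fraction of requests under the latency threshold (say, 300 ms)."""
    return fast_requests / total_requests if total_requests else 1.0

# Drill prompt: "Which SLI is degrading right now?"
print(f"availability: {availability_sli(99_412, 100_000):.4f}")  # 0.9941 -> this one is burning
print(f"latency:      {latency_sli(99_930, 100_000):.4f}")       # 0.9993 -> healthy
```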
2. SLOs as Decision Boundaries
Frame the scenario against your Service Level Objectives (SLOs):
- “We target 99.9% availability over 30 days. This incident has already burned 20% of the monthly budget. Does that change your tradeoffs?”
Encourage the team to:
- Weigh speed of mitigation vs. risk of regression
- Decide when to take a blunt, safe mitigation (e.g., temporary feature disable) vs. debugging deeper
3. Error Budgets as Permission to Act
Even in a 10-minute drill, you can simulate error budget burn:
- “You’ve already used half your error budget this week from earlier issues. Is a risky hotfix acceptable?”
- “Alternatively, you’re well within budget. Is a quick rollback that risks a latency spike acceptable?”
The point: practice explicit, budget-informed tradeoffs, not implicit intuition.
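The arithmetic behind these prompts fits on a whiteboard, but a small sketch makes it explicit (assuming a 99.9% availability SLO over a 30-day window measured in requests; every number below is an invented drill input):

```python
# Error budget math for the drill prompts above; all numbers are invented drill inputs.
SLO = 0.999
WINDOW_REQUESTS = 50_000_000                 # expected requests in the 30-day window
ERROR_BUDGET = (1 - SLO) * WINDOW_REQUESTS   # bad requests we can afford: 50,000

bad_so_far = 25_000   # burned by earlier issues this window ("half the budget")
burn_rate = 2_000     # bad requests per minute during the current incident

budget_left = ERROR_BUDGET - bad_so_far
minutes_left = budget_left / burn_rate

print(f"budget remaining: {budget_left / ERROR_BUDGET:.0%}")    # 50%
print(f"minutes until the budget is gone: {minutes_left:.0f}")  # ~13 -> mitigate bluntly, debug later
```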
Picking the Right Scenarios: Risk-Driven Practice
Many teams pick dramatic but unlikely scenarios—total datacenter loss, global DNS meltdown—then ignore more common, insidious issues.
Instead, score and prioritize drill cards based on real risk signals:
- Past incidents: Recurring patterns in your incident log
- Near misses: Alerts dismissed as “flaky” that later turned serious
- Known weak spots: Single points of failure, risky long-lived feature flags, brittle data pipelines
- Business impact: Systems directly tied to revenue, SLAs, or safety
Create a simple risk score:
- Likelihood (1–5) × Impact (1–5) = Priority score
Rank your scenario cards by this score and draw from the top.
Example scenario cards:
- “Elevated 5xx errors on Checkout API from one region only”
- “Background job backlog growing, but user impact not yet visible”
- “Feature flag rollout causing 2% error spike in a single tenant”
- “Primary DB node slow but not dead; read replicas healthy”
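A spreadsheet is plenty for this scoring, but a minimal sketch in code looks like the following (the likelihood and impact numbers are illustrative guesses, not data):

```python
# Rank scenario cards by likelihood x impact; the scores are illustrative guesses.
cards = [
    ("Elevated 5xx on Checkout API, one region only",       4, 5),
    ("Background job backlog growing, no user impact yet",  4, 3),
    ("Feature flag rollout: 2% error spike in one tenant",  3, 3),
    ("Primary DB node slow but not dead; replicas healthy", 2, 4),
    ("Total datacenter loss",                               1, 5),
]

ranked = sorted(cards, key=lambda c: c[1] * c[2], reverse=True)
for name, likelihood, impact in ranked:
    print(f"{likelihood * impact:>2}  {name}")
# Draw drills from the top of this list.
```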
Practice the most likely and highest-impact failures first, then occasionally sprinkle in rare, catastrophic scenarios for BC/DR alignment.
Borrowing from BC/DR: Frequent, Varied Tabletop Drills
Business continuity and disaster recovery disciplines have long known that frequent, realistic exercises beat rare, extravagant ones.
Bring that mindset to engineering:
- Tabletop, not only game days: Instead of quarterly “big bang” game days, run weekly 10-minute tabletops.
- Variety over spectacle: Rotate between network issues, dependency failures, misconfigurations, and capacity problems.
- Role variation: Sometimes the IC is senior; sometimes they’re new. Sometimes your DB expert is “unavailable” and others must cope.
The goal: make incident handling feel like a muscle you train regularly, not a rare performance.
Observability Under Time Pressure
On-call success is less about memorizing systems and more about reading signals and forming good hypotheses quickly.
Use the drills to sharpen:
1. Signal Triage
Practice fast answers to:
- “Which dashboards do we go to first for this symptom?”
- “Which logs do we tail first?”
- “Which traces clarify whether this is client-side, network, or backend?”
Set a clear expectation: no random clicking. Ask the team to narrate:
“We’re seeing elevated 5xx on this endpoint, so I’m checking the service’s SLO dashboard, then drilling into request traces filtered by 5xx, then correlating with deploy times.”
2. Hypothesis → Test → Learn Loop
In 10 minutes, you want several tight loops:
- Form a hypothesis (“We think pod restarts in region X are caused by config change Y”).
- Test it (check event logs, rollout history, or pod status).
- Update your belief quickly.
The facilitator should occasionally freeze and ask:
- “What’s your current hypothesis?”
- “What evidence would disconfirm it?”
Over time, you’ll see engineers become faster and more disciplined in their reasoning.
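One way to keep that loop honest is to have the Scribe keep an explicit hypothesis log. A minimal sketch of what that might look like (the structure and example entries are hypothetical, not a prescribed format):

```python
# Hypothetical sketch of a Scribe's running hypothesis log during a drill.
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    statement: str
    supporting: list[str] = field(default_factory=list)
    disconfirming: list[str] = field(default_factory=list)
    status: str = "open"  # open | confirmed | rejected

log: list[Hypothesis] = []

h = Hypothesis("Pod restarts in region X caused by config change Y")
h.supporting.append("Restarts began ~2 min after Y rolled out")
h.disconfirming.append("Restarts also seen in a region Y has not reached")
h.status = "rejected"  # evidence disconfirmed it; form the next hypothesis
log.append(h)
```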
The 5–10 Minute Retro: Turning Drills into Real Improvement
Without reflection, drills become theater. The magic is in the mini-retrospective.
Right after each drill, while context is fresh:
1. Ask Three Questions
- What worked well?
  - Specific tools, dashboards, or communication patterns.
- What was confusing or slow?
  - Missing alerts, ambiguous ownership, poor naming in dashboards.
- What will we change this week?
  - Concrete actions; avoid vague “we should” statements.
2. Update Your Artifacts Immediately
Within the retro timebox, actually touch the artifacts:
- Runbooks / Playbooks
  - Add missing steps (“Check feature flag X before diving into logs”).
  - Simplify overly long flows.
- Incident Response Plans
  - Clarify who is IC, Scribe, Comms.
  - Document escalation paths and paging rules.
- Dashboards / Alerts
  - Create or fix an alert that would have helped.
  - Add a fast “triage” panel to your main service dashboard.
The rule: at least one small, concrete improvement per drill. Over a quarter, those tiny changes compound into a dramatically better incident response posture.
Building a Culture of Relentless, Lightweight Practice
This approach only works if it becomes habit, not a novelty.
1. Make It Routine
- Cadence: Start with once a week per team; consider 2–3 times a week during critical periods.
- Timebox: Protect 25–30 minutes on the calendar—no overrun.
- Rotation: Rotate roles (IC, Scribe, observer) and participants, including new hires.
2. Normalize “Safe Failure”
Drills should feel psychologically safe:
- It’s okay to miss a clue or choose a suboptimal mitigation.
- The point is to expose gaps here so they don’t hurt you there.
Celebrate:
- Discovering a missing runbook.
- Realizing a key alarm doesn’t page the right team.
- Admitting, “I don’t know where that dashboard lives.”
3. Measure Practice, Not Just Incidents
Track:
- Number of drills run per team per month
- New or updated runbooks from drills
- New alerts/dashboards implemented
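None of this needs tooling; a shared sheet, or a tiny script over a drill log, is enough. A sketch, assuming each drill is recorded as a small dict (the records below are hypothetical):

```python
# Count drills and resulting improvements per team per month; all records are hypothetical.
from collections import defaultdict

drill_log = [
    {"team": "payments", "month": "2024-03", "runbooks_updated": 1, "alerts_added": 0},
    {"team": "payments", "month": "2024-03", "runbooks_updated": 0, "alerts_added": 1},
    {"team": "search",   "month": "2024-03", "runbooks_updated": 2, "alerts_added": 0},
]

summary = defaultdict(lambda: {"drills": 0, "runbooks": 0, "alerts": 0})
for d in drill_log:
    s = summary[(d["team"], d["month"])]
    s["drills"] += 1
    s["runbooks"] += d["runbooks_updated"]
    s["alerts"] += d["alerts_added"]

for (team, month), s in sorted(summary.items()):
    print(f'{team:<10} {month}  drills={s["drills"]}  runbooks={s["runbooks"]}  alerts={s["alerts"]}')
```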
Share progress in engineering reviews. Make practice an explicit pillar of reliability, not a side project.
Bringing It All Together
The paper-clock outage kitchen timer is deliberately simple. That’s the point. It lowers the barrier to practice so much that it’s hard not to do it.
By running 10-minute mini incidents that:
- Embed SRE fundamentals (SLIs, SLOs, error budgets)
- Are prioritized by real risk, not fantasy
- Borrow frequency and variation from BC/DR exercises
- Emphasize observability and fast hypothesis testing
- End with short, concrete retros that update runbooks and plans
…you build a culture where real incidents feel like rehearsed scenarios, not chaotic surprises.
Print the clock. Pin the arrow. Schedule the first 10-minute drill. In a few weeks, your on‑call rotations will feel noticeably calmer—and much more prepared for the next real outage.