The Analog Incident Cookbook: Recipe-Style Playcards for Your Nastiest Production Failures
How to turn your worst production incidents into a physical ‘cookbook’ of recipe-style playcards that guide calm, repeatable, and collaborative response during outages.
When production is on fire, no one wants to read a wiki novel.
Yet most teams still rely on sprawling runbooks, ancient Confluence pages, or tribal memory when things go wrong. In the heat of an outage, these are hard to search, hard to follow, and way too easy to misinterpret.
There’s a better way: turn your worst incidents into analog recipe cards—simple, physical (or print-ready) playcards you can grab the moment something smells familiar.
This post explains how to design an “Incident Cookbook”: a curated set of recipe-style cards based on your nastiest failures, aligned with modern runbooks, playbooks, and tooling.
Runbooks vs. Playbooks: The Foundation of Your Cookbook
Before we get into cards and cookbooks, it’s important to understand the two pillars of good incident practice:
Runbooks: The “How-To” Instructions
Incident response runbooks are detailed, step-by-step guides for handling specific failure modes.
They answer questions like:
- “If service X returns persistent 500s, what logs do I pull and in what order?”
- “If the database CPU spikes above 90% for more than 5 minutes, which queries do I inspect?”
- “Which feature flags or rollout configs can safely be toggled to stabilize things?”
Runbooks provide:
- Repeatability – Same failure, similar response.
- Reliability – Reduced reliance on one “hero” who remembers what to do.
- Onboarding value – New on-call engineers have a safety net.
Playbooks: The “How We Work Together” Strategy
Where runbooks are about procedures, playbooks are about coordination and strategy.
Playbooks define:
- Who takes Incident Commander, Communications Lead, Ops Lead, etc.
- How and when to escalate (teams, managers, vendors).
- What communication channels to use (Slack, Zoom, status page, email).
- How often to broadcast updates to stakeholders.
Playbooks keep the team aligned and reduce chaos: everyone knows their role, where to speak, and when to step back.
Your Incident Cookbook lives on top of this foundation. Each card is a quick entry point into the right runbooks and the right parts of your playbook.
Why Recipe-Style Playcards Work in a Digital World
Incident response might be heavily digital, but the idea of analog recipe cards still makes sense:
- Low cognitive load: When adrenaline is high, you want a 1-page reference, not a 12-page document.
- Pattern recognition: Cards are designed around recognizable symptoms (“Login latencies > 5s for EU users”) rather than just components.
- Faster triage: You flip through a small set of curated “nasty incidents” and quickly see, “This looks like card #7.”
- Deliberate design: The act of condensing an incident into a card forces your team to truly understand the root cause and response.
This doesn’t replace your detailed docs. It routes you to them in a structured, calmer way.
The Core of a Good Incident Recipe Card
Each card in your Incident Cookbook should be:
- Short (1 page or less)
- Actionable (clear first steps)
- Grounded in real past incidents
Here’s a template you can adapt.
1. Title & Context
- Name: “API Latency Spikes During Traffic Surges”
- Category: Performance / API
- Last Updated: YYYY-MM-DD
- Related Systems: api-gateway, auth-service, db-primary
2. When to Reach for This Card (Symptoms)
Describe observable indicators:
- Median API latency > 1s for 5+ mins
- Error rate > 2% on `/login` or `/checkout`
- Alert: `APIGatewayHighLatency` firing in [observability tool]
This section helps on-call recognize pattern similarity quickly.
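To make "this looks like card #7" even faster, some teams index cards by the alerts that usually announce them. A minimal sketch in Python, where the card index, alert names, and titles are all illustrative placeholders you would swap for your own:

```python
# Hypothetical index mapping alert names to recipe cards.
# Replace with the alerts and card titles from your own Cookbook.
CARD_INDEX = {
    "APIGatewayHighLatency": "Card #7: API Latency Spikes During Traffic Surges",
    "CheckoutErrorRateHigh": "Card #3: Elevated 5xx on /checkout",
    "DBPrimaryCPUSaturated": "Card #5: Database CPU Saturation",
}

def suggest_cards(firing_alerts):
    """Return the recipe cards whose trigger alerts are currently firing."""
    return [CARD_INDEX[a] for a in firing_alerts if a in CARD_INDEX]

print(suggest_cards(["APIGatewayHighLatency", "UnrelatedAlert"]))
```

Even a lookup this crude shortens triage: the on-call engineer starts from two or three candidate cards instead of a blank page.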
3. First 5 Minutes: Structured Response
Follow a staged response so panic doesn’t take over. A simple structure is:
1. Identify
   - Confirm the alert in [tool]: dashboard link.
   - Check whether multiple regions or services are affected.
2. Diagnose (Initial)
   - Look at error breakdown: 4xx vs 5xx.
   - Check recent deploys affecting API or database.
3. Stabilize (If Needed)
   - If error rate > 5%, consider triggering traffic shed or feature flag rollback (link to runbook).
Keep this section minimal but direct—you’re buying time and sanity.
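The Stabilize threshold is exactly the kind of judgment call worth encoding once, so no one recomputes it under adrenaline. A tiny sketch using the 5% error-rate threshold from the stage above (the function name and counter inputs are illustrative):

```python
def should_stabilize(error_count, total_count, threshold=0.05):
    """Stabilize-step guard: trip mitigation when the error rate
    crosses the card's threshold (5% by default, per the card)."""
    if total_count == 0:
        # No traffic observed: nothing to shed or roll back yet.
        return False
    return error_count / total_count > threshold
```

Wiring a check like this into an alert or a dashboard annotation means the card's "if error rate > 5%" line and the tooling can never silently disagree.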
4. Deep Diagnosis: Don’t Skip Root Cause
The biggest trap in incidents is leaping straight to the first “fix” that appears to work.
Instead, the card should remind responders:
- “Do not ship permanent fixes without understanding root cause.”
- Link to a relevant runbook section: “Deep Dive Queries for API Latency”.
- Include a short checklist:
  - What changed in infra? (deploys, config, autoscaling rules)
  - What changed in usage? (traffic spikes, new customer behavior)
  - Are there recurring patterns from past incidents? (time of day, region)
Fully understanding and verifying the root cause:
- Prevents the same incident from recurring next week.
- Leads to better long-term solutions (not band-aids).
- Makes future cards vastly more valuable.
5. Plan, Test, Deploy, Verify
Document a staged response pattern on each card:
- Plan: Enumerate options (rollback, scaling, config change) and potential blast radius.
- Test: Can you reproduce in staging? Can you run a limited canary?
- Deploy: Who approves? Who executes? What’s the rollout pattern?
- Verify: Which dashboards and metrics confirm success? What’s the rollback trigger?
Having this structure printed is a subtle but powerful reminder to slow down just enough to stay safe.
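The Verify step in particular benefits from a precise, pre-agreed definition of "success" and "rollback trigger". A sketch of that idea, assuming illustrative p50 latency thresholds that would come from the card itself:

```python
def verify_rollout(latency_samples_ms, success_p50_ms=1000, rollback_p50_ms=2000):
    """Verify-step sketch: classify a rollout from recent latency samples.
    Thresholds here are illustrative -- take the real ones from the card."""
    ordered = sorted(latency_samples_ms)
    p50 = ordered[len(ordered) // 2]  # median of the sample window
    if p50 <= success_p50_ms:
        return "success"
    if p50 >= rollback_p50_ms:
        return "rollback"
    return "keep-watching"
```

Writing the rollback trigger down as a number (not "latency looks bad") removes the mid-incident debate about whether to revert.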
6. Communication & Coordination Cues
Tie back into your playbook:
- “If incident is SEV-1 for > 10 minutes, assign an Incident Commander and start a dedicated Slack channel.”
- “Post updates to #status every 15 minutes.”
- “If SLO breach looks likely, ping Product and Support for customer messaging.”
This keeps cross-functional coordination from being an afterthought.
7. Post-Incident Reflection Pointers
Every nasty incident is a learning artifact. On each card, add prompts for your post-incident review:
- What surprised us this time?
- What signals were noisy, missing, or misleading?
- What should we automate next time (e.g., auto-detection or guardrails)?
As these learnings accumulate, you refine the card rather than letting it rot.
Turning Nasty Incidents into Cookbook Recipes
Not every small glitch deserves a card. Focus on the painful, high-value incidents:
- Prolonged outages or SEV-1/SEV-2 events.
- Recurring problems (same class of issue appearing multiple times).
- Incidents that required cross-team or cross-functional coordination.
For each such incident:
1. Run a proper post-incident review.
2. Capture:
   - Root cause (verified, not guessed).
   - Contributing factors (alerts, deploys, human error).
   - Effective mitigations and failed experiments.
3. Extract a pattern: What would you want future you to recognize quickly?
4. Distill the pattern into a 1-page card using the template above.
Your Cookbook becomes a curated library of proven responses to recognizable patterns.
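If you keep each card as structured data, printing stays cheap: regenerate the page whenever the card changes. A sketch that renders a card dict into print-ready Markdown, where the field names mirror the template above and the example values are illustrative:

```python
def render_card(card):
    """Render a 1-page recipe card as print-ready Markdown.
    Field names mirror the card template in this post."""
    lines = [
        f"# {card['name']}",
        f"Category: {card['category']} | Last updated: {card['updated']}",
        "",
        "## Symptoms",
    ]
    lines += [f"- {s}" for s in card["symptoms"]]
    lines += ["", "## First 5 minutes"]
    lines += [f"{i}. {step}" for i, step in enumerate(card["first_steps"], 1)]
    return "\n".join(lines)

# Illustrative card, using the example from the template above.
example = {
    "name": "API Latency Spikes During Traffic Surges",
    "category": "Performance / API",
    "updated": "YYYY-MM-DD",
    "symptoms": ["Median API latency > 1s for 5+ mins"],
    "first_steps": ["Confirm the alert", "Check error breakdown"],
}
```

Keeping the source of truth in version control and treating the printed card as build output means edits from post-incident reviews flow into the binder automatically.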
Make It Cross-Functional by Design
Incidents are rarely purely technical. Your Cookbook should anticipate cross-functional collaboration from day one:
- Engineering: primary responders, diagnostics, and fixes.
- Support: frontline for user pain; early signals of lingering issues.
- Product/PM: trade-offs between uptime, feature flags, and customer impact.
- Marketing/Comms: status pages, public communication for major outages.
Add simple cues on each card:
- “If incident impacts revenue-related flows, loop in Product and Finance.”
- “If more than 50 customers are affected or social media reports spike, notify Comms.”
These prompts encourage coordination instead of siloed firefighting.
Don’t Forget the Tools: On-Call, Automation, and AI
The Cookbook is analog in form, but powered by your digital ecosystem.
Connect each card into your tooling:
- Alerting: Link relevant alerts from PagerDuty, Opsgenie, or your homegrown stack.
- Scheduling: Ensure your on-call rotation is healthy and sustainable.
- Observability: Embed links to dashboards, logs, traces, and SLO views.
- Automation:
  - Safe “one-click” mitigations (e.g., reduce traffic to a region, toggle feature flags).
  - Scripts to collect diagnostics with minimal manual effort.
- AI coordination (where available):
  - Summarize evolving incident context for late joiners.
  - Suggest likely recipe cards based on current alert patterns.
These tools support your runbooks/playbooks and help reduce responder fatigue—which is critical if you want people to follow process instead of improvising under stress.
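"Safe one-click mitigations" deserve guardrails of their own: a blast-radius limit and an audit trail. A sketch of that pattern, where the action names, scope units, and limit are all illustrative assumptions:

```python
import time

AUDIT_LOG = []

def one_click_mitigation(action, scope_pct, executor, max_scope_pct=25):
    """Guardrailed 'one-click' mitigation sketch: refuse anything with a
    blast radius above max_scope_pct, and record every attempt.
    The action/scope/executor names are illustrative."""
    entry = {"ts": time.time(), "action": action,
             "scope_pct": scope_pct, "by": executor}
    if scope_pct > max_scope_pct:
        entry["result"] = "refused: blast radius too large"
    else:
        # A real implementation would call your traffic or flag tooling here.
        entry["result"] = "executed"
    AUDIT_LOG.append(entry)
    return entry["result"]
```

The point is not the five lines of logic but the shape: mitigations that are easy to trigger should be equally easy to bound and to account for afterwards.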
How to Start Your Incident Cookbook This Quarter
You don’t need a full library to see value. Start small:
1. Pick 3–5 recent “nasty” incidents.
2. For each, run (or revisit) a root-cause review.
3. Create a draft card based on:
   - Symptoms
   - First 5 minutes
   - Deeper diagnosis
   - Plan → Test → Deploy → Verify
   - Communication cues
4. Print them. Physically. Put them near the on-call station or in a shared binder.
5. During the next incident, try using them:
   - If a card fits, follow it and annotate anything that felt off.
   - If no card fits, note whether a new pattern is emerging.
6. After each major incident, update or add cards.
Within a few months, you’ll have a tailored, battle-tested Incident Cookbook that:
- Reduces panic.
- Speeds up safe response.
- Codifies tribal knowledge.
- Turns your worst firefights into reusable wisdom.
Conclusion: Cook with Your Failures, Don’t Fear Them
The point of an Incident Cookbook is not nostalgia for index cards—it’s about making critical knowledge fast, visible, and usable under stress.
By combining:
- Detailed runbooks (how to do the work),
- Clear playbooks (how to coordinate the work), and
- Recipe-style playcards (how to quickly recognize and respond to known patterns),
you transform your nastiest failures into a practical asset.
Your future self—and your future on-call engineers—will thank you when the next outage starts, and instead of panic, someone quietly says:
“This looks like a Card #4 incident. Let’s pull the Cookbook.”
And you get to work—calmly, deliberately, and together.