Rain Lag

The Pencil‑Only Incident Arcade: Reliability Games You Can Play in 15 Minutes Between Meetings

How to design fast, low‑friction incident response games that teams can play with just a pencil and paper—building real reliability skills in 15‑minute chunks between meetings.

Most incident response training is either a massive, once‑a‑quarter chaos drill or an incident that hits you for real at 2 a.m.

Everything in between? Usually an empty space.

But what if practicing reliability skills felt more like a quick puzzle round than a corporate fire drill? What if you could run a realistic incident game in the 15 minutes before your next meeting—with nothing more than a pencil and a printout?

That’s the idea behind the Pencil‑Only Incident Arcade: small, repeatable, low‑stress exercises that sharpen how your team responds to failure, without requiring production access, heroic scheduling, or elaborate simulation tools.

In this post, we’ll walk through how to design these reliability mini‑games so you can:

  • Practice people, process, and observability together
  • Use realistic failure and threat scenarios
  • Keep stakes low and learning high
  • Make it fun enough that people want to come back

Why 15‑Minute Reliability Games Work

Long, carefully staged incident simulations absolutely have value. But they’re expensive—in time, coordination, and cognitive load. That means they happen rarely, and skill decay fills the gaps in between.

Short, pencil‑only games solve different problems:

  • They fit in natural gaps: end of a standup, just before a meeting, or during an onboarding session.
  • They set clear, low stakes: no one is touching production, and failure is safe by design.
  • They reduce tool complexity: no need to log into five systems; the exercise is self‑contained.
  • They encourage repetition: frequent, low‑effort practice is how people build robust mental models.

Think of these as the crossword puzzles of incident response: small, self‑contained challenges that accumulate into deep expertise over time.


Core Design Principle: It’s a Puzzle, Not a Drill

The heart of a Pencil‑Only Incident Arcade game is a puzzle:

"Here’s what you see. What do you think is happening? What would you do next?"

Instead of simulating all the operational mess of a live incident, you carve out a focused slice:

  • A short narrative (“Pager just fired for service X”)
  • A small bundle of signals (logs, alerts, graphs, tickets, Slack snippets)
  • A concrete question or goal ("Find the likely root cause" or "Decide the first three actions")

Participants only need:

  • A pencil or pen
  • Printed scenario sheets (or a single shared screen)
  • Access to your actual runbooks or docs (on laptop/phone is fine)

The constraint—no live tooling, no infinite clicking—forces people to:

  • Read carefully
  • Build a mental model from incomplete information
  • Practice structured thinking and communication

That’s much closer to how the best responders actually work in real incidents.


What a 15‑Minute Game Looks Like

Here’s a simple structure you can use for most sessions:

Minute 0–2: Setup

  • Facilitator hands out or shows the scenario
  • Quickly states the goal and the timebox

Minute 2–8: Individual or small‑group investigation

  • Participants read the scenario
  • They jot down:
    • What they think is happening
    • What they’d check next
    • What action they’d take first

Minute 8–13: Debrief and discussion

  • Ask: "What’s your hypothesis?" and "What would you do first?"
  • Compare answers, walk through the "official" solution or likely path
  • Discuss trade‑offs and what information was missing or misleading

Minute 13–15: Capture learning

  • Note any runbooks that didn’t match reality
  • Identify any missing dashboards or alerts
  • Capture one or two improvements to try in the real system

That’s it: the whole game fits in 15 minutes.
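
If you reuse this structure often, it can help to keep the timebox as data so that edited variants still add up. The following is a minimal sketch, not a prescribed tool: the `Phase` class and both helper functions are hypothetical names, and only the phase boundaries come from the structure above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Phase:
    name: str
    start: int  # minute the phase begins
    end: int    # minute the phase ends


# Phase boundaries copied from the structure above.
PHASES = [
    Phase("Setup", 0, 2),
    Phase("Individual or small-group investigation", 2, 8),
    Phase("Debrief and discussion", 8, 13),
    Phase("Capture learning", 13, 15),
]


def check_agenda(phases, total_minutes=15):
    """Return True if the phases are contiguous and fill the timebox exactly."""
    cursor = 0
    for phase in phases:
        if phase.start != cursor or phase.end <= phase.start:
            return False
        cursor = phase.end
    return cursor == total_minutes


def format_agenda(phases):
    """Render one line per phase, for example for the top of a printed sheet."""
    return [
        f"{p.start:02d}-{p.end:02d} min ({p.end - p.start} min): {p.name}"
        for p in phases
    ]
```

Calling `format_agenda(PHASES)` gives four lines you can paste onto a scenario sheet, and `check_agenda` flags a variant whose phases overlap, leave a gap, or overrun the timebox.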


Testing People, Process, and Observability Together

Pencil‑only games are not just about technical diagnosis—they’re an opportunity to exercise the entire incident ecosystem.

Design scenarios to touch all three dimensions:

1. People

  • Who would be paged first?
  • Who else needs to be looped in and when?
  • How would you explain this to a non‑engineer?

Add questions like:

  • "What would you say in the status channel after 5 minutes?"
  • "Who owns this dependency, and how would you contact them?"

2. Process

  • Do you have a runbook for this class of incident?
  • Does the suggested process actually match how people want to respond?
  • Are escalation paths clear?

Ask participants to:

  • Locate the relevant runbook
  • Decide whether to follow it as‑is, adapt it, or ignore it (and explain why)

3. Observability

  • Are the right metrics and logs even available in your scenario?
  • Which dashboard would you open first and why?
  • In the scenario, is there an alert that fired too late, or one that never fired at all?

You can even provide simplified screenshots of actual dashboards or log snippets, and ask:

  • "What signal here changes your mind about the root cause?"

This not only exercises people’s judgment, it reveals where your real systems need better instrumentation or documentation.


Sourcing Realistic Scenarios

Your arcade should feel grounded. Nothing kills engagement faster than obviously fake problems.

Good sources of scenarios:

  • Your own incident history

    • Strip identifying details and sensitive data
    • Compress timelines into a single snapshot
    • Focus each mini‑game on one key decision point
  • Public post‑mortems and write‑ups

    • Cloud provider outages
    • Famous incidents from large tech companies
    • Security breach reports (sanitized)
  • Threat and failure catalogs

    • Common misconfigurations (TLS, DNS, IAM)
    • Dependency outages (database, external API, message queue)
    • Malware or ransomware behaviors (sudden I/O spikes, suspicious processes)

Design prompts like:

  • "A third‑party payment provider is intermittently failing. How do you handle blast radius and communication?"
  • "CPU on a critical service is pegged at 100%, but traffic volume hasn’t changed. What are the likely culprits?"
  • "New deploy rolled out 20 minutes ago. Error rate and latency both spiked. What’s your rollback and verification plan?"

The more recognizable the pattern, the more transferable the learning.


Keeping Stakes Low and Learning High

Psychological safety matters. People learn more—and are more honest—when they’re not worried about looking incompetent.

Some guidelines:

  • Make it explicitly safe to be wrong. Emphasize that the goal is to explore thinking, not to "catch" people.
  • Celebrate diverse answers. Often there are several "reasonable" first moves; even sub‑optimal choices can be instructive.
  • Debrief the why behind decisions. "Why did you choose rollback over feature flagging?" surfaces mental models.
  • Use the game to debug your system, not your people. If everyone makes the same flawed assumption, that’s a design or documentation problem.

Your best metric is not who "wins" the puzzle; it’s how many concrete improvements you find for documentation, runbooks, and tooling.


Adding Lightweight Game Mechanics

You don’t need a full‑blown RPG. A few simple mechanics go a long way toward making the arcade habit‑forming.

Ideas to experiment with:

  • Time pressure

    • "You have 6 minutes to decide your first three actions."
    • Simulates the real‑world stress of early incident minutes.
  • Scoring

    • +1 point for identifying the correct likely cause
    • +1 for a good first action
    • +1 for a clear communication plan
    • Bonus point for suggesting a system or process improvement
  • Quests or storylines

    • Run a short series of connected incidents: "The Week of the Haunted Database" or "The Curious Case of the Flapping Load Balancer".
  • Leaderboards and prizes

    • Weekly or monthly totals
    • Small rewards: stickers, shout‑outs, choosing the next scenario, or a silly trophy

The goal is not cut‑throat competition; it’s repetition and engagement. Make it something people look forward to, not another compliance box to tick.
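
The scoring and leaderboard ideas above can be tallied on paper, but if you want running totals across weeks, a few lines of code are enough. This is a rough sketch under assumptions: the `Answer` fields, the function names, and the sample players are all made up for illustration, and the one-point-per-criterion rule simply mirrors the scoring list above.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Answer:
    player: str
    correct_cause: bool = False
    good_first_action: bool = False
    clear_comms_plan: bool = False
    suggested_improvement: bool = False  # the bonus point


def score(answer):
    """One point per criterion met, following the scoring list above."""
    return sum([
        answer.correct_cause,
        answer.good_first_action,
        answer.clear_comms_plan,
        answer.suggested_improvement,
    ])


def leaderboard(answers):
    """Total points per player across sessions, highest total first."""
    totals = Counter()
    for answer in answers:
        totals[answer.player] += score(answer)
    return totals.most_common()


# Hypothetical results from two sessions.
answers = [
    Answer("ana", correct_cause=True, good_first_action=True),
    Answer("bo", correct_cause=True),
    Answer("ana", clear_comms_plan=True, suggested_improvement=True),
]
```

With this sample data, `leaderboard(answers)` ranks "ana" first with 4 points and "bo" second with 1. Whether a first action counts as "good" remains a facilitator judgment; the code only adds up the marks.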


Building a Library of Reliability Mini‑Games

Over time, your Pencil‑Only Incident Arcade can become a living library.

After each session:

  1. Refine the scenario

    • Remove parts that confused people for the wrong reasons
    • Tighten hints or add one extra clue if everyone got stuck
  2. Capture what worked

    • Which questions sparked good discussion?
    • Which decision points revealed real gaps in process or tooling?
  3. Tag your scenarios by:

    • Service or domain (payments, auth, storage, ML)
    • Failure mode (latency, data loss, security, dependency, capacity)
    • Level (beginner, intermediate, advanced)
  4. Package for reuse

    • A one‑page scenario sheet
    • A facilitator guide with:
      • Expected path(s)
      • Common misconceptions
      • Key learning objectives

This library becomes invaluable for:

  • Onboarding new engineers and SREs
  • Cross‑training between teams
  • Refreshing muscle memory after long periods without major incidents
  • Sharing reliability culture across the org
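
One way to keep such a library searchable is to store the three tags from step 3 as fields on each scenario. The sketch below assumes a simple in-memory list; the `Scenario` class, the `find` helper, and the three example titles are hypothetical and not taken from any real incident.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    title: str
    domain: str        # e.g. payments, auth, storage, ML
    failure_mode: str  # e.g. latency, data loss, security, dependency, capacity
    level: str         # beginner, intermediate, or advanced


def find(library, **tags):
    """Return the scenarios whose fields match every tag given."""
    return [
        s for s in library
        if all(getattr(s, key) == value for key, value in tags.items())
    ]


# Hypothetical entries for illustration only.
LIBRARY = [
    Scenario("Payment provider flaps", "payments", "dependency", "beginner"),
    Scenario("Token cache stampede", "auth", "latency", "intermediate"),
    Scenario("Disk fills overnight", "storage", "capacity", "beginner"),
]
```

For an onboarding session, `find(LIBRARY, level="beginner")` would return the two beginner scenarios; combining tags, as in `find(LIBRARY, domain="auth", level="advanced")`, narrows the result and here returns an empty list.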

Getting Started Next Week

You don’t need buy‑in from an entire organization to start. Try this small pilot:

  1. Pick one real incident from the last 6–12 months.
  2. Write a one‑page snapshot:
    • Symptoms (alerts, graphs, logs)
    • Constraints (what you can and cannot do)
    • Goal (stabilize, reduce impact, confirm cause)
  3. Book 20 minutes at the end of an existing team meeting: 15 for the game, plus 5 of slack.
  4. Run the game once. Use the simple 15‑minute structure above.
  5. Ask three questions afterward:
    • "What surprised you?"
    • "What felt missing from our runbooks or dashboards?"
    • "Should we do this again?"

If the answer to that last one is "yes," you’ve just opened your first cabinet in the Incident Arcade.


Conclusion: Practice Failure Like a Craft, Not a Crisis

You can’t eliminate incidents, but you can change how your team experiences them.

By turning reliability practice into short, pencil‑only games, you:

  • Normalize talking about failure openly
  • Make space for safe mistakes and experimentation
  • Strengthen the links between people, process, and observability
  • Build a reusable, shareable library of hard‑won knowledge

Most importantly, you stop relying on real outages as your primary training ground.

The next time you have 15 minutes before a meeting, don’t just scroll or refresh your inbox. Pull a scenario from your Incident Arcade, grab a pencil, and play.

Your next 2 a.m. self will thank you.
