The Cardboard Reliability Arcade Cabinet: Turning On-Call Chaos into a Cooperative Paper Game
How a low-fi, cooperative “arcade cabinet” workshop can transform on-call chaos, reliability challenges, and chaos engineering into a hands-on, blame-free game your team actually wants to play.
On-call can feel like an endless boss battle you never signed up for.
Pager blasts at 3 a.m. You fumble through logs, Slack channels light up, and suddenly everyone is both confused and somehow at fault. Afterward, there’s a retro where you try to learn something—but mostly people just want it to be over.
What if, instead of waiting for production to explode, you could practice that chaos together, on purpose, in a safe, playful way?
Enter the Cardboard Reliability Arcade Cabinet: a low-fi, cooperative tabletop game where your team "runs" a fictional system landscape, experiments with DevOps practices, and triggers controlled disasters—using nothing but paper, pens, and some cardboard.
In this post, we’ll explore how to design and run this workshop-style game so you can turn on-call chaos into a shared learning experience instead of a lonely firefight.
Why Turn Reliability into a Game?
Traditional training for reliability, SRE, or incident response is often:
- Passive (slide decks, one-way talks)
- Abstract (concepts without context)
- Individualized (one person on-call, isolated from the team)
But the real work of reliability is collaborative and emergent. Things break in unexpected ways, across teams, services, and assumptions.
A game—especially a tabletop-style, cooperative one—lets you:
- Practice under pressure without real-world stakes
- Make decisions together and see how they compound over time
- Surface hidden assumptions
- Learn from “bad” decisions without blame or burnout
Think of it like a cybersecurity tabletop exercise, but focused on SRE and DevOps reliability. The cardboard cabinet is your “training sim” for:
- Detection: How do we notice something’s wrong?
- Triage: What do we do first? Who does what?
- Communication: Who needs to know and how do we sync?
Only this time, the system is fictional—but the patterns are very real.
The Core Metaphor: A Low-Fi Cooperative Arcade Cabinet
Visualize a big piece of cardboard set up like the face of an arcade cabinet:
- At the top: a system map of your fictional landscape—services, databases, external APIs, message queues
- In the middle: dials and gauges representing SLIs (latency, error rate, throughput, saturation)
- At the bottom: slots for practice cards and incident cards—these shape the game’s current "run"
The team stands around the cabinet like players at an arcade machine, but instead of joysticks they have:
- Architecture pattern cards (e.g., blue/green deployments, circuit breakers, retries with backoff)
- DevOps practice cards (e.g., feature flags, runbooks, observability investments, on-call handoff rituals)
- Chaos cards they can play to trigger failures (inspired by chaos engineering tools like Mutineer)
- Error tokens that represent user-visible impact or SLO burn
The goal is not to "beat" the game. The goal is to make tradeoffs, trigger incidents, and learn from the outcomes together.
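The error tokens work best when they map to something quantitative. One optional house rule (not part of the core game) is to tie each token to a slice of an error budget. The sketch below assumes a 99.9% availability SLO and invented traffic numbers; adjust both to fit your fiction.

```python
# One possible house rule for converting error tokens into SLO burn.
# Assumptions (not canon): a 99.9% availability SLO over a 30-day window,
# fictional traffic volume, and each cardboard token standing for 1,000 failed requests.

SLO_TARGET = 0.999                 # 99.9% of requests must succeed
REQUESTS_PER_WINDOW = 10_000_000   # fictional traffic for the 30-day window
FAILURES_PER_TOKEN = 1_000         # failed requests represented by one token

def error_budget_total() -> float:
    """Total failed requests the SLO allows over the window."""
    return (1 - SLO_TARGET) * REQUESTS_PER_WINDOW

def budget_burned(tokens_on_table: int) -> float:
    """Fraction of the error budget consumed by the tokens in front of the team."""
    return (tokens_on_table * FAILURES_PER_TOKEN) / error_budget_total()

if __name__ == "__main__":
    for tokens in (1, 5, 10):
        print(f"{tokens} token(s) -> {budget_burned(tokens):.0%} of the error budget burned")
```

With these numbers, ten tokens on the table means the fictional error budget is gone, which is a natural cue for a "stop shipping features, invest in reliability" round.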
Running the System: Choices That Shape Reliability and Risk
At the start of a session, the group chooses or is assigned a scenario:
- A fast-moving startup with minimal process
- A regulated financial platform with strict uptime requirements
- A legacy monolith slowly being split into services
Each scenario comes with:
- A set of baseline architecture choices
- Current DevOps maturity level
- Initial reliability goals and constraints (SLOs, budgets, team size)
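If you prepare scenario cards ahead of the session, those three ingredients fit naturally into a small structured record. This is only a sketch of how a facilitator might encode one; the field names and example values are invented.

```python
# A hypothetical encoding of a scenario card; field names and values are invented.
from dataclasses import dataclass

@dataclass
class Scenario:
    name: str
    baseline_architecture: list[str]   # architecture cards already "in play"
    devops_maturity: str               # e.g., "minimal process", "regulated", "legacy split"
    slo_availability: float            # initial reliability goal
    budget_points: int                 # abstract spend available per round
    team_size: int

startup = Scenario(
    name="Fast-moving startup",
    baseline_architecture=["single region", "managed Postgres", "no staging environment"],
    devops_maturity="minimal process",
    slo_availability=0.995,
    budget_points=6,
    team_size=4,
)
```

Writing scenarios down like this also keeps sessions comparable when you rerun the workshop later.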
Players then “run” the system over a series of rounds. Each round typically moves through five phases:

1. Planning Phase
   - As a group, choose a limited number of practice cards and architecture cards to play.
   - Examples:
     - Add structured logging and tracing
     - Add a staging environment with smoke tests
     - Implement circuit breakers between services
     - Introduce feature flags and progressive delivery

2. Operation Phase
   - The facilitator reveals a traffic/event card (e.g., seasonal traffic spike, new feature launch, external dependency slowdown).
   - The cabinet’s dials move based on simple rules that connect your choices to system behavior (one possible rule set is sketched below).

3. Chaos Phase
   - Someone plays (or randomly draws) a Chaos Card, simulating a failure:
     - Database latency triples
     - Message queue fills and starts dropping messages
     - DNS misconfiguration causes partial outage
     - A bad deploy corrupts a critical cache

4. Incident Response Phase (the heart of the exercise)
   - The group must:
     - Detect the problem using the observability they’ve invested in
     - Triage the incident (what’s affected, how bad is it?)
     - Choose response actions (rollback, failover, rate limiting, feature kill switches)
     - Communicate: who do they notify? What do they say? How?
   - As they act, the facilitator updates error tokens, user impact, and SLIs.

5. Mini-Retrospective Phase
   - What helped?
   - What hurt?
   - What practices do they wish they’d invested in earlier?
By the end of several rounds, the players will have shaped the system’s reliability posture through their choices—and felt the consequences when chaos arrived.
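If you want the dial movements to stay consistent between rounds and between facilitators, it helps to write the "simple rules" down somewhere. Below is a minimal sketch of one invented rule set, in Python purely for precision; the card names and modifiers are made up, so treat them as a starting point to argue over rather than a balanced design.

```python
# An invented example of "simple rules" for moving the cabinet's dials.
# Card names and modifiers are illustrative; tune them to taste.

DIALS = {"latency_ms": 200, "error_rate": 0.01, "saturation": 0.5}

# Practice/architecture cards nudge the dials down; event/chaos cards push them up.
CARD_EFFECTS = {
    "circuit breakers":         {"error_rate": -0.005},
    "structured logging":       {},                      # helps detection, not the dials
    "traffic spike":            {"latency_ms": +150, "saturation": +0.3},
    "database latency triples": {"latency_ms": +400, "error_rate": +0.02},
}

def apply_card(dials: dict, card: str) -> dict:
    """Return a new dial state after playing a card."""
    new_state = dict(dials)
    for dial, delta in CARD_EFFECTS.get(card, {}).items():
        new_state[dial] = max(0, new_state[dial] + delta)
    return new_state

state = DIALS
for card in ["circuit breakers", "traffic spike", "database latency triples"]:
    state = apply_card(state, card)
    print(card, "->", state)
```

The point is not simulation fidelity; it is that everyone can see why a needle moved and debate the rule, which is itself a useful reliability conversation.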
Incorporating Chaos Engineering (Without Real Incidents)
Chaos engineering tools like Mutineer help teams experiment with failure in production or realistic test environments. The cardboard cabinet brings those ideas into a paper-first training space.
Chaos concepts you can easily model in the game:
- Latency injection: service response times spike
- Resource exhaustion: CPU, memory, or connections saturate
- Dependency failures: third-party APIs time out or misbehave
- Network partitions: parts of your system can’t talk to each other
- Misconfigurations: bad feature flags, incorrect routing rules, expired certificates
Each Chaos Card should specify:
- What fails and how (e.g., “Payments API returns 500 for 20% of requests”)
- What the system “dials” do (latency/error rates increase, throughput drops)
- What signals are available, depending on your observability investments
- Potential mitigation paths (e.g., fallback flows, feature toggles)
The twist: the players can choose when to trigger some of these failures, turning chaos into intentional practice. This reframes failure from something to fear into something to explore.
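For facilitators who prefer printing a deck over hand-writing cards, the four fields above fit a small template. The sketch below is a hypothetical encoding in Python, loosely echoing how chaos experiments are often described; none of the names come from a real tool such as Mutineer.

```python
# A hypothetical Chaos Card template covering the four fields listed above.
from dataclasses import dataclass

@dataclass
class ChaosCard:
    failure: str             # what fails and how
    dial_effects: dict       # how the cabinet's gauges move
    signals: list[str]       # what players can see, given their observability investments
    mitigations: list[str]   # plausible response paths

payments_degraded = ChaosCard(
    failure="Payments API returns 500 for 20% of requests",
    dial_effects={"error_rate": +0.20, "throughput": -0.10},
    signals=[
        "error-rate alert fires (only if SLO alerting was played)",
        "support tickets arrive two rounds later (always available)",
    ],
    mitigations=[
        "flip the payments kill switch",
        "fail over to the backup provider",
        "rate-limit retries to avoid a retry storm",
    ],
)
```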
Designing the Workshop: A 4-Hour Reliability Session
You can run the Cardboard Reliability Arcade Cabinet as a structured 4-hour workshop.
Suggested Agenda
- 0:00–0:30 – Setup and Context
  - Introduce the fictional system and scenario
  - Explain the rules, cards, and dials
  - Align on goals: learning, experimentation, no blame
- 0:30–1:30 – First Run: Learning the System
  - Play 2–3 rounds with simple events and lighter chaos
  - Focus on understanding how choices affect reliability
  - Encourage discussion: “Why pick this practice now?”
- 1:30–2:00 – Debrief #1
  - What surprised the group?
  - Which practices felt over- or under-valued?
  - What did detection and triage look like?
- 2:00–3:00 – Second Run: Turn Up the Heat
  - Introduce more complex incidents (compound failures, cascading outages)
  - Allow players to design their own Chaos Cards based on real-world scars
  - Increase time pressure for incident response rounds
- 3:00–3:30 – Debrief #2: From Game to Reality
  - Compare game choices to your real systems
  - Identify gaps: missing runbooks, weak observability, brittle dependencies
  - Capture a shortlist of concrete reliability improvements
- 3:30–4:00 – Maturity Check-In and Next Steps
  - Reflect on how a future run might look different once improvements are in place
  - Define how often you’ll rerun the workshop (quarterly is common)
  - Decide who will evolve the game with new scenarios and cards
Because everything is physical—cards, dials, tokens—the workshop stays tactile and engaging, even for people who dread traditional training.
Learning Over Blame: A Safe Space to Fail Loudly
A critical design principle: the arcade cabinet is a no-blame environment.
You’re not simulating who caused the failure. You’re exploring how the system responds and how the team collaborates.
To reinforce this:
- Avoid individual scoring. Use team-based outcomes instead.
- Treat "bad" decisions as data, not mistakes.
- Praise curiosity and experimentation over “getting it right.”
- Encourage players to try deliberately risky moves to see what happens.
The result is a space where:
- Newer engineers can safely experience high-pressure incidents
- Senior engineers can model calm, structured response
- Everyone can articulate the tradeoffs behind practices like feature flags, circuit breakers, and SLO-based alerting
Over time, this shared experience helps reshape real-world incident culture: fewer accusations, more joint problem-solving.
Making Reliability Tangible with Physical Artifacts
Reliability is often taught in abstractions: SLI, SLO, MTTR, error budgets. In the cardboard cabinet, these become objects you can touch:
- SLI Dials: Big cardboard gauges for latency, errors, saturation. Moving them makes impact visible.
- Error Tokens: Physical counters representing user-visible issues or SLO burn. Stack them in front of the team as the incident worsens.
- Runbooks & Checklists: Laminated cards listing possible actions (“Enable canary rollback,” “Page database specialist,” “Rate-limit expensive endpoints”).
- Communication Channels: Simple cards for “Status Page,” “Customer Support,” “Internal Chat,” prompting players to choose how to communicate.
These props:
- Make invisible system behavior visible and memorable
- Encourage the whole team (including product, support, PMs) to participate
- Anchor difficult tradeoffs in shared, concrete language
Tracking Reliability Maturity Over Time
The real power of the Cardboard Reliability Arcade Cabinet emerges when you revisit it regularly.
Each session becomes a snapshot of your team’s reliability mindset:
- Which practices are “no-brainers” now that were controversial before?
- Are you investing earlier in observability or automation?
- Does the team respond to complex incidents with more structure and less panic?
Capture artifacts across sessions:
- Photos of the cabinet state after big incidents
- Lists of agreed “If this was prod, we’d…” actions
- Evolving house rules that reflect your organization’s learning
You can even align the game’s “maturity levels” with real-world roadmaps:
- Bronze: reactive, minimal observability, ad hoc incident response
- Silver: defined on-call, basic SLOs, some automation
- Gold: proactive chaos experiments, strong observability, practiced comms
As your real systems improve, so can the fictional ones.
Conclusion: Build Your Own Cardboard Cabinet
You don’t need a budget for fancy simulation tools to improve reliability culture. With cardboard, markers, and a few hours, you can:
- Turn on-call chaos into a cooperative learning game
- Practice incident response and communication without real outages
- Explore architecture and DevOps tradeoffs in a safe, blame-free space
- Gradually level up your team’s reliability maturity over time
If you’re tired of postmortems that don’t change behavior, or if newer engineers are terrified of taking the pager, try building your own Cardboard Reliability Arcade Cabinet.
You might still get paged at 3 a.m.—but your team will have already practiced that boss fight together, in cardboard, long before production is on the line.