Rain Lag

The Paper Incident Story Street Festival: Building a Pop-Up Reliability Fair With Analog Games and Micro-Workshops

How to turn reliability, incident response, and SRE practices into a hands-on street festival using paper incidents, analog games, and bite-sized workshops that make complex concepts approachable and human-centered.

Imagine if your incident handbook, SRE playbook, and on-call war stories escaped Confluence and turned into a street festival.

No dashboards. No terminals. Just paper, markers, cardboard, and curious colleagues wandering between booths:

  • One booth runs a frantic tabletop game about an outage rolling through your fictional microservices.
  • Another hosts a 20-minute micro-workshop on SLAs and SLOs using sticky notes and string.
  • Around the corner, a “War Story Tent” fills with on-call veterans trading tales of 3 a.m. pages.

That’s the idea behind The Paper Incident Story Street Festival: a pop-up reliability fair that turns SRE concepts into approachable, physical, and memorable experiences.

This post walks through how to design and run your own reliability street festival, using analog games and micro-workshops to make incident culture tangible and fun.


Why a Street Festival for Reliability?

Most teams meet reliability through docs, tools, and the occasional painful real incident. Those are important, but they’re also:

  • Abstract: Escalation paths and error budgets are easy to skim and forget.
  • Intimidating: New engineers often feel they need to be experts before they can even participate.
  • Tool-centric: The human side (communication, stress, teamwork) gets less attention.

A street festival flips that script:

  • Low-stakes, playful environment: You can “break” a paper system without waking up a real customer.
  • Tactile learning: People physically move cards, roll dice, and re-route paper packets.
  • Social, not solitary: Learning happens in small groups, over shared puzzles and stories.

The goal isn’t to replace documentation or formal training. It’s to seed a healthier incident culture—where people are more comfortable with the language, the flows, and most importantly, with each other.


Designing Your Reliability Street Festival

Think of your event layout like a real fair: a row of themed booths, each with a distinct activity. Attendees wander, try a few things, and come away with new perspectives.

A simple structure might include:

  1. Analog Game Zone – “Paper incidents” as tabletop games.
  2. Micro-Workshop Alley – Short, focused sessions on core topics.
  3. Story & Debrief Corner – War stories, reflection, and cross-team discussion.
  4. Knowledge Wall – A living board where insights and questions accumulate.

You don’t need a huge budget. Most of this can be built with:

  • Paper, index cards, markers, tape
  • Dice, timers, colored stickers
  • A few volunteers to host and facilitate

The booths below cover these elements, with an extra stop for the human side of on-call.


Booth 1: Analog Games for Paper Incidents

Taking inspiration from games like Operation Raven, you can create tabletop-style scenarios that simulate incidents in a playful way.

Game Concept: “The Distributed Doughnut Shop”

Theme: You run a fictional online doughnut delivery platform. Behind the scenes, it’s a tangle of services: orders, payments, routing, notifications, inventory.

Components:

  • A paper map showing services as nodes
  • Incident cards ("payment latency spikes," "email vendor outage")
  • Role cards (Incident Commander, Comms, On-Call for Service X)
  • Timer and “customer satisfaction” track

How to play (15–25 minutes):

  1. Facilitator sets the scene: it’s a normal day, then the first incident card is revealed.
  2. Players must:
    • Identify which services might be impacted
    • Decide what signals they’d check (logs, metrics, traces—represented by cards)
    • Choose an escalation path from a printed "org map" of teams
  3. Time-limited decisions raise or lower the “customer satisfaction” meter.
  4. After resolution, a quick debrief: what went well, what was confusing, who felt overloaded.
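
If you want to try out the pacing before printing any cards, the round structure above can be sketched in a few lines of Python. Treat this as a rough prototype under stated assumptions, not the game's official rules: the two incident cards come from the component list, but the time limit, the 0 to 10 satisfaction scale, and the point values are all hypothetical and meant to be tuned at the table.

```python
import random
from dataclasses import dataclass

# Incident cards taken from the component list above.
INCIDENT_CARDS = ["payment latency spikes", "email vendor outage"]

# Hypothetical tuning knobs; adjust them to taste.
TIME_LIMIT_SECONDS = 120
MAX_SATISFACTION = 10


@dataclass
class Decision:
    seconds_taken: int        # how long the table needed to decide
    sensible_escalation: bool  # facilitator's call on the chosen escalation path


def draw_incident(rng: random.Random) -> str:
    """Reveal one incident card at random."""
    return rng.choice(INCIDENT_CARDS)


def score_decision(satisfaction: int, decision: Decision) -> int:
    """Move the customer satisfaction meter after one timed decision."""
    delta = 1 if decision.seconds_taken <= TIME_LIMIT_SECONDS else -2
    delta += 1 if decision.sensible_escalation else -1
    # Clamp the meter so it stays on the printed track.
    return max(0, min(MAX_SATISFACTION, satisfaction + delta))


if __name__ == "__main__":
    rng = random.Random()
    meter = 5
    print("Incident:", draw_incident(rng))
    meter = score_decision(meter, Decision(seconds_taken=90, sensible_escalation=True))
    print("Customer satisfaction:", meter)
```

Running a few imaginary rounds this way can show whether the penalties feel too harsh or too forgiving before real players sit down.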

Why analog games work

  • They externalize mental models. You can see how different people imagine the architecture and escalation paths.
  • They create safe failure. You can “misroute” an escalation and just laugh, then talk about what you’d do differently.
  • They practice roles. People who never act as Incident Commander can try it in a low-stakes way.

Run a couple of different games or variants: one focused on communication bottlenecks, another on cascading failures, another on prioritizing competing incidents.


Booth 2: Micro-Workshops on Core Reliability Themes

Instead of hour-long lectures, design 10–25 minute micro-workshops, each centered on one specific concept:

  • Incident response
  • On-call life
  • SLAs/SLOs
  • Post-incident learning

Each micro-workshop follows the pattern: Explain → Experience → Debrief.

Micro-Workshop: “SLOs With Sticky Notes”

Objective: Make SLAs/SLOs concrete and negotiable.

Flow (20 minutes):

  1. Explain (5 min): A simple framing: “An SLO is our promise to ourselves about reliability. An SLA is a promise to customers with consequences.”
  2. Experience (10 min):
    • Give groups a fictional product (e.g., video streaming, payments, search).
    • Ask them to pick 1–2 key user journeys and write them on sticky notes.
    • For each journey, they choose an SLI (e.g., latency, errors) and set a target SLO.
  3. Debrief (5 min): Discuss trade-offs: “What happens if we tighten this SLO? Who pays the cost? What if we loosen it?”
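
During the debrief, it can help to translate a target percentage into minutes so the trade-off feels concrete. The sketch below is one way to do that arithmetic, assuming a simple time-based availability SLO measured over a rolling window; the function name and the 30-day default are illustrative choices, not something prescribed by the workshop.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability for a time-based availability SLO.

    slo_target is a fraction such as 0.999 (that is, 99.9%).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    total_minutes = window_days * 24 * 60
    # The error budget is whatever share of the window the SLO does not cover.
    return (1.0 - slo_target) * total_minutes


if __name__ == "__main__":
    for target in (0.99, 0.999, 0.9999):
        budget = error_budget_minutes(target)
        print(f"{target:.2%} over 30 days allows about {budget:.1f} minutes")
```

Over 30 days, the three targets work out to roughly 432, 43.2, and 4.3 minutes. Each extra nine cuts the budget by a factor of ten, which is a useful prompt for the "who pays the cost?" question.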

Participants leave with a felt sense of how SLOs are about user experience and trade-offs, not abstract percentages.

Micro-Workshop: “Escalation Paths as a Subway Map”

Objective: Demystify escalation and incident roles.

Flow (15–20 minutes):

  1. Facilitator shows a blank “subway map” template with lines for different escalation paths (technical, managerial, customer-facing).
  2. Teams map how incidents currently flow in their world: who gets paged, who they call, where decisions are made.
  3. Compare maps between teams—are they consistent? Overcomplicated? Missing stations?
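
If teams want to keep comparing maps after the session, the paper maps can be transcribed into small graphs. The following is a minimal sketch, assuming each map is written down as "who hands off to whom"; the two example teams and their station names are invented for illustration.

```python
from collections import deque

# Hypothetical maps: each station lists the stations it can escalate to.
team_a = {
    "pager": ["primary on-call"],
    "primary on-call": ["secondary on-call", "incident commander"],
    "secondary on-call": ["incident commander"],
    "incident commander": ["customer support lead"],
}
team_b = {
    "pager": ["primary on-call"],
    "primary on-call": ["engineering manager"],
    "engineering manager": ["incident commander"],
}


def stations(escalation_map):
    """Every station that appears on a map, as a source or a target."""
    found = set(escalation_map)
    for targets in escalation_map.values():
        found.update(targets)
    return found


def missing_stations(escalation_map, other_map):
    """Stations the other team drew that this map does not have."""
    return stations(other_map) - stations(escalation_map)


def hops_to(escalation_map, start, goal):
    """Fewest hand-offs from start to goal (breadth-first search), or None."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        station, hops = queue.popleft()
        if station == goal:
            return hops
        for nxt in escalation_map.get(station, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, hops + 1))
    return None


if __name__ == "__main__":
    print("Missing from team B:", sorted(missing_stations(team_b, team_a)))
    print("Team A hops to IC:", hops_to(team_a, "pager", "incident commander"))
    print("Team B hops to IC:", hops_to(team_b, "pager", "incident commander"))
```

A station that exists on one map but not another, or a path that takes noticeably more hand-offs, is a good conversation starter rather than a verdict.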

This turns what’s usually a dense doc into a visual artifact people can question and improve.

Micro-Workshop: “Five-Minute Postmortem”

Objective: Practice fast, blame-aware learning.

Flow (10–15 minutes):

  1. Give each group a tiny fictional incident (a short story on a card).
  2. Use a one-page template:
    • What happened?
    • What made it harder?
    • What made it easier?
    • One thing we’d change in our system.
    • One thing we’d change in our process.
  3. Debrief: focus on systems and conditions, not individuals.

Participants get the feel of post-incident learning without an hour-long meeting.


Booth 3: The Human Side of On-Call

Reliability work is not just alerts and runbooks. It’s stress, judgment calls, and collaboration under pressure.

Design spaces that explicitly center this human side:

The “On-Call Lounge” Discussion Circle

Short facilitated circles (15–20 minutes) where people discuss:

  • How do you personally manage the stress of being on-call?
  • What norms should we have around sleep, handoffs, and saying no?
  • What does good psychological safety look like during an incident?

Provide prompt cards and leave room for people to share coping strategies, boundaries, and support needs.

Communication Role-Play Booth

Run a quick role-play:

  • One person is the Incident Commander.
  • One is a stakeholder outside the response team (PM, exec, or customer support lead).
  • One is a stressed on-call engineer.

Give them a simple scenario and 5 minutes to act it out, then 5 minutes to debrief:

  • What language helped?
  • Where did confusion or tension show up?
  • How might we script status updates more clearly?

These exercises build empathy and highlight communication as a reliability tool.


Booth 4: Peer Learning and Incident Culture

Treat the festival as a community hub, not just a training event.

War Story Tent

Set up a cozy corner with chairs and a whiteboard. Every half hour, host an informal session:

  • One volunteer tells a short story about a memorable incident.
  • Others ask questions: What surprised you? What did you wish you’d known? How did the team handle it emotionally?

Encourage comparisons:

  • Different on-call models (follow-the-sun vs. local rotation).
  • Different tooling approaches (centralized vs. team-owned).
  • Different communication styles (Slack channels, bridges, incident rooms).

This surfaces tacit knowledge that never makes it into formal docs.

The Reliability Knowledge Wall

Dedicate a big board to:

  • "Things I wish every teammate knew about incidents."
  • "Questions I still have about on-call."
  • "Ideas to try after today."

By the end, you have a crowdsourced snapshot of your organization’s reliability culture—pain points, gaps, and aspirations.


Why This Is Surprisingly Lightweight

Running a pop-up reliability fair sounds grand, but it can be lightweight and iterative:

  • Start with 2–3 booths and a single afternoon.
  • Use simple materials: paper, markers, printouts, and volunteers.
  • Reuse games and workshop formats at team offsites, onboarding, or brown-bags.

The payoff can be outsized relative to the effort:

  • Shared language: People leave with a more aligned mental model of incidents, SLOs, and roles.
  • Cross-team understanding: Frontend engineers see how platform teams work, and vice versa.
  • Better incident culture: More psychological safety, better communication, and a sense that reliability is everyone’s job.

Most importantly, people experience reliability not as a dry checklist, but as a collaborative craft.


Conclusion: Build Your Own Street Festival

You don’t need a huge program or formal curriculum to improve reliability culture. You can start with:

  • One analog game about a fictional outage.
  • One 20-minute micro-workshop on SLOs.
  • One war story circle.

From there, you can grow it into a full Paper Incident Story Street Festival that:

  • Makes complex SRE practices understandable and memorable.
  • Highlights the human side of on-call and incident response.
  • Encourages peers to learn from each other and share what works.

Turn your reliability practices into a festival for a day, and watch the energy, curiosity, and shared understanding carry over into how your teams respond to real incidents.
