The Cardboard Failure Ferris Wheel: Turning Tiny Outage Stories into a Weekly Reliability Ritual
How a simple, analog ‘cardboard ferris wheel’ of outage stories can turn blameless postmortems into an ongoing, shared learning practice that keeps reliability visible, human, and continuously improving.
Reliability work often dies in documents.
We run postmortems, write long reports, file tickets…and then everyone moves on. A few weeks later, nobody can quite remember what actually happened, what we decided, or what we learned. The system gets a bit more complex, the people change, and the same patterns show up again.
What if instead of postmortems disappearing into a wiki, they became part of a visible, physical, ongoing ritual? Imagine a cardboard ferris wheel on the wall of your team space, filled with small “failure cabins”: short, human‑readable outage stories that rotate through a weekly review.
In this post, we’ll explore how to build that Cardboard Failure Ferris Wheel: a low‑stakes, analog, blameless, and surprisingly powerful way to keep the lessons from outages alive as shared learning.
Why a Cardboard Ferris Wheel?
The ferris wheel is a metaphor—but also a literal object you can build.
- Cardboard signals low stakes: it’s cheap, hackable, not precious.
- Ferris wheel signals rotation: stories come back around instead of being told once and forgotten.
- Cabins hold stories: each outage is a small, self‑contained narrative.
Instead of treating outages as “one‑off incidents” with a single formal report, you deliberately:
- Shrink each outage into a tiny story (1 page max, or 1 card).
- Standardize the format so stories are easy to tell, re‑tell, and compare.
- Physically rotate them through a short, weekly review.
The goal is not arts & crafts; it’s to create a visible, shared, repeatable learning loop where reliability never drifts into the background.
Tiny Outage Stories: From Reports to Narratives
Traditional postmortems can be long, dense, and hard to re‑use. For the ferris wheel, you want something more like a micro‑story than a report.
A tiny outage story should:
- Be short: a few minutes to read aloud at most, so that two or three stories fit into one short review.
- Be narrative: what happened, to whom, when, in what context.
- Be portable: it fits on one A4 sheet, index card, or printed “story ticket.”
A simple template (printed on each card) could look like this:
- Name of the outage (a human name, not just a ticket ID)
- When and how it was noticed (by whom, via what signal)
- What users experienced (in plain language)
- What actually happened technically (the core mechanics)
- Key contributing factors (more than one!)
- What made detection, diagnosis, or recovery harder
- What we changed (or plan to change)
- Open questions (things we still don’t fully understand)
Every new outage gets a card. The act of writing the card forces clarity: Can we tell this story simply enough that someone outside the immediate team can follow it?
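If you also want a digital copy of each card (for printing or archiving), the template above can be sketched as a small data structure. This is only an illustration: the class name `StoryCard`, its field names, and the sample outage are assumptions made for this example, not part of any existing tool.

```python
from dataclasses import dataclass, field


@dataclass
class StoryCard:
    """One tiny outage story; the fields mirror the eight printed prompts."""
    name: str              # a human name, not just a ticket ID
    noticed: str           # when and how it was noticed, by whom
    user_experience: str   # what users experienced, in plain language
    what_happened: str     # the core technical mechanics
    contributing_factors: list[str] = field(default_factory=list)
    what_made_it_harder: str = ""   # detection, diagnosis, or recovery
    changes: list[str] = field(default_factory=list)   # made or planned
    open_questions: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Return a plain-text version suitable for printing on one card."""
        def bullets(items: list[str]) -> str:
            return "\n".join(f"  - {item}" for item in items) or "  - (none yet)"

        return "\n".join([
            f"Outage: {self.name}",
            f"Noticed: {self.noticed}",
            f"Users experienced: {self.user_experience}",
            f"What happened: {self.what_happened}",
            "Contributing factors:",
            bullets(self.contributing_factors),
            f"What made it harder: {self.what_made_it_harder or '(not recorded)'}",
            "What we changed or plan to change:",
            bullets(self.changes),
            "Open questions:",
            bullets(self.open_questions),
            "We investigate systems and contexts, not individuals. Learning > Blame.",
        ])


# Hypothetical example data, invented purely for illustration.
card = StoryCard(
    name="The Silent Timeout",
    noticed="Tuesday afternoon, via a support ticket",
    user_experience="Checkout pages hung and then showed a blank screen",
    what_happened="A downstream call had no timeout and exhausted the worker pool",
    contributing_factors=["No default timeout", "No alert on pool saturation"],
    what_made_it_harder="Dashboards showed healthy averages",
    open_questions=["Why did retries not trigger?"],
)
print(card.render())
```

Keeping the printed reminder in `render` means every card carries the blameless framing, whoever prints it.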
Blameless by Design: Borrowing from SRE Postmortems
If your ferris wheel becomes a weekly public shaming ritual, it will die fast.
To avoid that, borrow from established Site Reliability Engineering (SRE) blameless postmortem practices:
- No naming and shaming. Focus on the system and context, not on who “messed up.”
- Assume competence. Treat every action as reasonable given the information and constraints at the time.
- Look for systemic contributors. Process, tooling, culture, documentation, team structure—not just keystrokes.
On each story card, you can make this explicit with a short printed reminder at the bottom:
"We investigate systems and contexts, not individuals. Learning > Blame."
Over time, this framing encourages people to actually share the weird commands, the confusing dashboards, the miscommunications—because they no longer have to defend themselves; they’re helping improve the system.
Root Cause Exploration (Without “Single Root Cause” Myths)
The ferris wheel is only useful if it moves your reliability forward. That means every tiny story needs to do more than say, “We had an outage and we fixed it.”
You want to emphasize root cause exploration in a realistic sense:
- Multiple contributing factors: configuration quirks, unclear ownership, stale docs, alert fatigue, brittle dependencies, etc.
- Conditions that allowed it to happen: lack of guardrails, missing tests, risky manual steps.
- Conditions that allowed it to grow: slow detection, poor observability, unclear runbooks.
Your story template should explicitly ask for three to five contributing factors. Never stop at “human error.” If you see “engineer forgot X,” ask:
- Why was it possible to forget X without guardrails?
- Why was X not visible in the tooling or workflow?
- Why did nobody notice until users complained?
Crucially, every story should have concrete follow‑up actions:
- Specific changes (merged, in progress, or explicitly declined with a rationale).
- Owners and rough timeframes.
- A way to check back in (e.g., a small checkbox on the card for "revisited").
When that card comes back around on the wheel a month later, you can quickly ask: Did we do the things we said we would? Did they work?
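For teams that keep a digital copy of their cards, the checks described in this section could be sketched as a small helper that returns gentle prompts for the author. Everything here is an assumption for illustration: the `FollowUp` type, the status values, the minimum of three factors, and the short list of phrases that hint a factor stops at a person.

```python
from dataclasses import dataclass
from typing import Optional

MIN_FACTORS = 3  # assumed lower bound for contributing factors
BLAME_PHRASES = ("human error", "forgot", "careless")  # hypothetical word list


@dataclass
class FollowUp:
    description: str
    owner: Optional[str] = None
    status: str = "planned"  # assumed values: planned, in progress, merged, declined
    rationale: Optional[str] = None  # expected when status is "declined"


def review_card(factors: list[str], follow_ups: list[FollowUp]) -> list[str]:
    """Return prompts for the card's author; an empty list means nothing to add."""
    prompts = []
    if len(factors) < MIN_FACTORS:
        prompts.append(
            f"Only {len(factors)} contributing factor(s); aim for at least {MIN_FACTORS}."
        )
    for factor in factors:
        if any(phrase in factor.lower() for phrase in BLAME_PHRASES):
            prompts.append(f"'{factor}' stops at a person; ask what made it possible.")
    if not follow_ups:
        prompts.append("No follow-up actions recorded.")
    for action in follow_ups:
        if action.status == "declined" and not action.rationale:
            prompts.append(f"'{action.description}' was declined without a rationale.")
        elif action.status != "declined" and not action.owner:
            prompts.append(f"'{action.description}' has no owner.")
    return prompts


for prompt in review_card(
    ["Engineer forgot to update the config"],
    [FollowUp("Add a config lint check")],
):
    print(prompt)
```

The helper only nudges; whether a factor really stops at a person is still a judgment for the people in the room.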
Weekly Analog Reviews: The Ride Itself
The weekly analog review is the core ritual: a short, recurring session where the team stands around the ferris wheel and rides through a handful of stories.
A simple pattern:
- 15–30 minutes, once a week. Put it on the calendar.
- Pick 2–3 cabins (cards) to “ride” that week.
- One person reads each story aloud. Others can follow along.
- Quick discussion (5–7 minutes per story):
  - What surprised us?
  - What patterns do we recognize from other outages?
  - Which follow‑ups matter most now?
- Mark the card: attendees’ initials, the date revisited, and any new notes.
Having it be analog matters more than it seems:
- Physical presence and ritual create a sense of shared ownership.
- The ferris wheel is a visual reminder: outages are part of our living history.
- It limits scope: you can’t discuss 30 reports in 30 minutes.
You are not replacing formal, detailed postmortems for major incidents. You are complementing them with an ongoing, lightweight learning loop.
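Spinning the physical wheel is the simplest way to pick cabins. If you would rather make sure no story waits too long, one possible rule is "longest since last review goes first". The sketch below assumes you keep a record of each card's last review date; the function name `pick_cabins` and the sample card names are illustrative only.

```python
from datetime import date
from typing import Optional


def pick_cabins(last_reviewed: dict[str, Optional[date]], count: int = 3) -> list[str]:
    """Pick the `count` cards that have waited longest for a review.

    Cards that were never reviewed (None) come first; ties break alphabetically.
    """
    def sort_key(item: tuple[str, Optional[date]]):
        name, reviewed = item
        # False sorts before True, so never-reviewed cards lead the list.
        return (reviewed is not None, reviewed or date.min, name)

    ordered = sorted(last_reviewed.items(), key=sort_key)
    return [name for name, _ in ordered[:count]]


# Hypothetical wheel contents, invented for illustration.
wheel = {
    "Silent Timeout": date(2024, 1, 8),
    "Tuesday Deploy Wobble": None,
    "Cache Incident": date(2024, 1, 1),
    "Flag Flip": date(2024, 1, 15),
}
print(pick_cabins(wheel, count=2))
```

A deterministic rule like this trades some of the fun of a random spin for the guarantee that every card eventually comes back around.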
Using Analogy to Connect the Dots
Much of reliability work is pattern recognition: “This feels like that other incident where…”
The ferris wheel explicitly invites analogical thinking:
- Metaphors: “This outage was like a traffic jam caused by a broken traffic light, not just a crash.”
- Metonymy: referring to “the cache incident” or “the Tuesday deploy wobble” to stand for a whole class of problems.
During the weekly review, ask questions like:
- "Which previous story does this feel most similar to?"
- "If we had to give this outage a movie genre (heist, horror, slow‑burn drama), which would it be and why?"
- "What’s the ‘family resemblance’ between these four incidents?"
This kind of analogical play does serious work:
- It helps people internalize complex ideas (like cascading failures or capacity ceilings) through concrete images.
- It highlights cross‑cutting themes: feature flags misused, dangerous manual ops, brittle integrations.
- It turns abstract reliability concepts into memorable, talkable stories.
Over time, your team will build a shared vocabulary of failure: "This proposal smells like the ‘silent timeout’ story," and everyone knows what that means.
Standardizing Without Killing the Fun
For the ferris wheel to remain useful over months and years, your outage stories need to be consistent enough to compare.
Create lightweight, standardized guidelines:
- A single, shared story template (with prompts as above).
- A time limit for writing: e.g., 20–30 minutes per card, so it doesn’t become another giant task.
- A page/size limit: one side of one sheet or card.
- A short how‑to guide posted next to the wheel.
But avoid over‑formalizing:
- Allow sketches, diagrams, or little timelines.
- Let teams decorate, color‑code, or sticker their cards.
- Encourage different voices: SRE, product, support, customer success.
The goal is repeatability, not bureaucracy.
Co‑Creation and Shared Ownership
The ferris wheel should not belong to a single hero SRE or manager. It should be a co‑creation ritual.
Some ways to make that real:
- Rotating facilitator: each week, a different person spins the wheel and picks the cabins.
- Cross‑functional authorship: any role involved in an outage can write a story card.
- Open participation: invite engineers, support, product managers, designers—whoever is impacted.
- Visible contributions: people sign the cards they wrote or significantly contributed to.
This shared authorship strengthens collective ownership of reliability:
- Incidents stop being “ops problems” or “backend problems”; they’re team problems.
- Product and leadership see firsthand how technical decisions translate into user pain.
- Everyone starts to feel responsible for breaking patterns that keep recurring.
Getting Started: A Practical Checklist
You do not need a big program to begin. Start small.
- Build the wheel (or a low‑tech version).
  - A cardboard circle with clothespins.
  - A corkboard with columns named after the cabins.
  - A literal ferris wheel cut‑out with slots.
- Define the tiny story template.
  - One page, with the 8 fields above.
  - Print a stack and keep them near the wheel.
- Seed the wheel with 3–5 recent incidents.
  - Ask the people who worked on them to write the initial cards.
- Schedule a 20‑minute weekly review.
  - Commit to doing it for at least 6 weeks before judging.
- Refine based on feedback.
  - Are the prompts right? Too detailed? Too vague?
  - Is the timebox working?
  - Are people feeling safe to share?
Measure success qualitatively at first:
- Are more people able to explain past outages clearly?
- Are you seeing recurring themes more quickly?
- Are follow‑up actions actually happening?
Later you can watch for trends in MTTR, incident frequency, or on‑call stress—but the first signal is whether your team is telling and re‑telling these stories.
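If you do start tracking MTTR later, the arithmetic is simple. The sketch below reads MTTR as mean time to recovery, measured from detection to recovery; that definition is an assumption, since teams use the acronym in several ways. The function name and the timestamps are invented for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents: list[tuple[datetime, datetime]]) -> timedelta:
    """Average (recovered - detected) over a list of (detected, recovered) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered - detected for detected, recovered in incidents),
        timedelta(),  # start value, so sum() adds timedeltas rather than ints
    )
    return total / len(incidents)


# Hypothetical incidents: one lasted 30 minutes, the other 90 minutes.
incidents = [
    (datetime(2024, 1, 2, 14, 0), datetime(2024, 1, 2, 14, 30)),
    (datetime(2024, 1, 9, 9, 0), datetime(2024, 1, 9, 10, 30)),
]
print(mean_time_to_recovery(incidents))
```

With only a handful of incidents per quarter, a single long outage dominates the average, which is one more reason to treat the storytelling signal as primary.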
Conclusion: Keep the Wheel Turning
Incidents are inevitable; wasted incidents are optional.
A Cardboard Failure Ferris Wheel is a small, almost playful intervention with serious intent: make failures visible, memorable, and reusable. By rotating tiny, blameless outage stories through a weekly analog review, you:
- Turn one‑off postmortems into an ongoing learning practice.
- Reveal deep patterns through analogy and repetition.
- Normalize honest discussion of what really happens in your systems.
- Strengthen shared ownership of reliability across the whole team.
In a digital world full of dashboards and docs, a cardboard ferris wheel might feel quaint. That’s precisely its power. It slows the team down just enough, once a week, to look back together—and ensure that every failure buys you a little more wisdom the next time the wheel comes around.