The Cardboard Incident: How Ferris Dock Turned Spinning Outages into a Reliability Breakthrough

A story-inspired deep dive into how to transform outages into lasting reliability improvements using structured postmortems, strong observability, chaos engineering, and a culture of continuous learning.

There’s a story that’s become legendary inside the fictional yet all-too-relatable company Ferris Dock.

It’s known as “The Cardboard Incident.”

On the surface, it was just another outage: users staring at a spinning loader, support tickets piling up, engineering laptops opening in a synchronized panic. But what made this incident special wasn’t the failure itself—it was what came after.

Ferris Dock used this outage to fundamentally change how they handle reliability, incidents, and learning. This is the story of how they did it—and how you can do the same.


The Cardboard Incident: A Quick Story

Ferris Dock ran a logistics and docking platform used by thousands of warehouses. One Tuesday afternoon, a code deploy triggered cascading slowdowns. Requests didn’t fail fast; they just… spun.

The status page stayed green for too long, because the monitoring thresholds weren’t tuned for partial degradation. Customers refreshed over and over, assuming their browser or Wi‑Fi was the culprit.

By the time the incident was fully recognized, operations were impacted across multiple clients. One warehouse manager, frustrated and out of patience, printed the Ferris Dock dashboard, taped it to a piece of cardboard, wrote “SYSTEM DOWN AGAIN” in red marker, and hung it in the loading bay.

A photo of that cardboard sign made its way back to Ferris Dock.

It became a rallying artifact.

Instead of hiding the incident, leadership printed that photo and stuck it in the incident review room. The cardboard sign became a symbol: no more wasted outages, no more “spinning” without learning.


Step 1: Treat Every Outage as a Learning Asset

Ferris Dock’s first change was mindset: every incident—big or small—is an opportunity to improve the system and the organization.

This meant:

  • No more blame-based reviews
  • No more skipping postmortems for “minor” incidents
  • No more hand-wavy “we’ll be more careful next time” resolutions

Instead, they adopted a simple principle:

If it paged a human, it deserves to teach us something.

To operationalize this, they created a structured, repeatable postmortem process.


Step 2: Build a Clear, Repeatable Incident Analysis Process

Ferris Dock’s incident postmortem template had four core parts:

1. Factual Timeline

A minute-by-minute (or step-by-step) sequence of what happened:

  • When issues started
  • What users experienced
  • What alerts fired (or didn’t)
  • Who did what and when

This timeline avoided interpretation. It was just facts from logs, alerts, chat transcripts, and deployment records.
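
To keep the timeline factual and easy to assemble, each entry can be reduced to a timestamped observation plus its source. Here is a minimal sketch in Python; the field names and the sample entries are illustrative, not Ferris Dock's actual schema:

    from dataclasses import dataclass
    from datetime import datetime, timezone

    @dataclass
    class TimelineEntry:
        """One factual, timestamped observation -- no interpretation, no blame."""
        at: datetime   # when it happened (UTC)
        source: str    # where the fact came from: logs, alerts, chat, deploy records
        fact: str      # what happened, stated plainly

    # Illustrative entries only; real ones come straight from tooling exports.
    timeline = [
        TimelineEntry(datetime(2024, 5, 14, 14, 2, tzinfo=timezone.utc),
                      "deploys", "Afternoon release finished rolling out to the API tier"),
        TimelineEntry(datetime(2024, 5, 14, 14, 9, tzinfo=timezone.utc),
                      "logs", "p99 latency on /check-in rose from 300 ms to 8 s"),
        TimelineEntry(datetime(2024, 5, 14, 14, 31, tzinfo=timezone.utc),
                      "chat", "Support reported customers seeing endless spinners"),
    ]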

2. Multi-Cause Root Cause Analysis

Instead of searching for “the root cause” in the singular, they looked for:

  • Triggering causes (e.g., a config change, a bad deploy)
  • Contributing causes (e.g., missing alerts, noisy dashboards, unclear runbooks)
  • Systemic causes (e.g., brittle architecture, unclear ownership, weak testing)

They often used techniques like “5 Whys” or causal trees, but with one strong rule:

Root cause analysis cannot end on a person.

Blaming individuals is a dead end. Ferris Dock focused on:

  • Process gaps
  • Tooling limitations
  • Architectural weaknesses
  • Communication failures
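
To make the “cannot end on a person” rule concrete, here is what a hypothetical 5 Whys chain and cause grouping for the Cardboard Incident might look like, written as plain Python data just to keep the shape explicit (all details invented for illustration):

    # A hypothetical "5 Whys" chain. Note that it ends on process and system
    # gaps, never on an individual.
    five_whys = [
        "Why did users see spinners?     -> Requests hung instead of failing fast.",
        "Why did requests hang?          -> A new query path exhausted database connections under load.",
        "Why did that reach production?  -> Load tests did not exercise the new query path.",
        "Why wasn't it caught sooner?    -> Alerts were tuned for hard failures, not partial degradation.",
        "Why were alerts tuned that way? -> No team owned latency objectives for this service.",
    ]

    # The same incident, grouped the way the postmortem template groups causes.
    causes = {
        "triggering":   ["New query path shipped in the Tuesday deploy"],
        "contributing": ["No alert on partial degradation", "Status page updated manually"],
        "systemic":     ["Unclear ownership of latency objectives"],
    }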

3. Concrete, Trackable Action Items

Each incident resulted in a small, prioritized list of changes, such as:

  • Architectural fixes (e.g., add caching, remove single points of failure)
  • Observability improvements (e.g., missing metrics, broken dashboards)
  • Runbook updates (e.g., clearer steps, better escalation)
  • Guardrails (e.g., deployment checks, feature flags, automated rollbacks)

Every action item was:

  • Assigned an owner
  • Given a due date
  • Tracked in the same system as normal work (not in a forgotten doc)
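
Keeping action items in the same system as normal work can be as simple as generating tracker tickets directly from the postmortem. A minimal sketch; create_ticket below is a stand-in printer, not a real tracker API:

    from dataclasses import dataclass
    from datetime import date

    @dataclass
    class ActionItem:
        title: str     # the concrete change to make
        owner: str     # a team name, so ownership survives individual moves
        due: date      # a real date, not "soon"
        category: str  # architecture | observability | runbook | guardrail

    def create_ticket(**fields) -> None:
        """Stand-in for the issue tracker's API; replace with the real client."""
        print("would create ticket:", fields)

    def file_action_items(incident_id: str, items: list[ActionItem]) -> None:
        """Turn postmortem action items into tracked, owned, dated work."""
        for item in items:
            create_ticket(
                title=f"[{incident_id}] {item.title}",
                assignee=item.owner,
                due_date=item.due.isoformat(),
                labels=["postmortem", item.category],
            )

    file_action_items("INC-cardboard", [
        ActionItem("Add timeout and cached fallback to booking lookups",
                   "dock-platform-team", date(2024, 6, 1), "architecture"),
    ])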

4. Shared, Accessible Documentation

Postmortems weren’t buried. They were:

  • Indexed and searchable
  • Easy to skim
  • Tagged by system, service, and type of failure

New engineers at Ferris Dock were encouraged to read old incidents as part of onboarding: “Learn from previous pain before you encounter it yourself.”


Step 3: Design for Failure, Not Perfection

The real lesson of the Cardboard Incident wasn’t “avoid this exact bug next time.” It was deeper:

Systems will fail. Design so that failures are expected, contained, and recoverable.

Ferris Dock invested in proactive incident prevention by focusing on architecture and failure-aware design:

  • Redundancy and graceful degradation: services could run in reduced mode instead of failing completely.
  • Timeouts and fallbacks: requests that would have spun indefinitely now failed fast with useful error messages (see the sketch below).
  • Bulkheads and rate limits: one misbehaving client couldn’t sink the entire service.
  • Clear ownership: every critical component had a directly responsible team.

Prevention wasn’t about never shipping risky changes; it was about shipping in ways that acknowledge reality: networks glitch, disks die, humans make mistakes.
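
Here is a minimal sketch of the fail-fast-with-fallback idea, in Python using only the standard library; the half-second budget, the function names, and the degraded response are illustrative assumptions rather than Ferris Dock's real code:

    import concurrent.futures
    import time

    # One shared pool, so a timed-out call does not block pool shutdown.
    _pool = concurrent.futures.ThreadPoolExecutor(max_workers=8)

    def fetch_dock_schedule(dock_id: str) -> dict:
        """Stand-in for the real downstream call; here it simulates a slow dependency."""
        time.sleep(2)
        return {"dock_id": dock_id, "slots": ["08:00", "09:00"], "stale": False}

    def cached_dock_schedule(dock_id: str) -> dict:
        """Degraded-mode answer: possibly stale, but better than an endless spinner."""
        return {"dock_id": dock_id, "slots": [], "stale": True}

    def dock_schedule_with_timeout(dock_id: str, budget_s: float = 0.5) -> dict:
        """Fail fast: give the primary lookup a fixed time budget, then fall back."""
        future = _pool.submit(fetch_dock_schedule, dock_id)
        try:
            return future.result(timeout=budget_s)
        except concurrent.futures.TimeoutError:
            # The primary path blew its budget; serve a degraded answer instead of spinning.
            return cached_dock_schedule(dock_id)

    # The slow dependency exceeds the budget, so the cached fallback is returned.
    print(dock_schedule_with_timeout("dock-42"))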


Step 4: Invest in Strong Observability

The outage revealed that Ferris Dock’s dashboards were colorful—but not useful. They measured “interesting” things, not the things that truly represented user experience.

They overhauled observability with three questions in mind:

  1. Can we detect problems early, before users send angry emails?
  2. Can we understand what’s wrong from metrics, logs, and traces alone?
  3. Can we respond quickly with confidence?

Concretely, they:

  • Implemented golden signals (latency, traffic, errors, saturation) per key service.
  • Added user-centric metrics, like successful check-ins per minute, rather than just CPU.
  • Standardized structured logging across services.
  • Introduced distributed tracing to see request flows across microservices.

The result: fewer “mystery outages” and much faster incident triage.
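
As one concrete piece of that overhaul: structured request logs are enough to derive three of the four golden signals (latency, traffic, errors), while saturation usually comes from host or queue metrics instead. A minimal sketch using Python's standard library; the service and route names are illustrative:

    import json
    import logging
    import time

    logging.basicConfig(level=logging.INFO, format="%(message)s")
    log = logging.getLogger("ferris.api")

    def log_request(service: str, route: str, status: int, duration_ms: float) -> None:
        """Emit one structured JSON line per request: raw material for golden signals."""
        log.info(json.dumps({
            "service": service,                    # which service handled it
            "route": route,                        # traffic, when counted per interval
            "status": status,                      # errors, when filtered for 5xx
            "duration_ms": round(duration_ms, 1),  # latency percentiles
        }))

    # Hypothetical usage inside a request handler:
    start = time.monotonic()
    status = 200  # ... handle the request ...
    log_request("check-in-api", "/check-in", status, (time.monotonic() - start) * 1000)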


Step 5: Use Chaos Engineering and SLOs to Find Weaknesses Early

Ferris Dock realized they had been conducting “involuntary chaos experiments” in production—also known as outages.

They decided to run intentional ones.

Chaos Engineering

In a controlled way, they started to:

  • Kill instances
  • Inject latency
  • Break network links
  • Simulate dependency failures

All while observing:

  • Does the system stay up (or degrade gracefully)?
  • Do alerts fire appropriately?
  • Do runbooks help engineers respond quickly?

Failures discovered during chaos experiments were far less painful than failures discovered by customers.
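
A latency-injection wrapper is one of the simplest experiments to start with. This is a hypothetical sketch, not a full chaos tool: it delays a fraction of calls to a dependency so you can watch whether timeouts, alerts, and runbooks hold up, and it should only run inside a controlled, announced experiment:

    import random
    import time
    from functools import wraps

    def inject_latency(probability: float, delay_s: float):
        """Chaos wrapper: delay a fraction of calls to a dependency.

        Enable only during a controlled experiment with a clear abort plan.
        """
        def decorator(fn):
            @wraps(fn)
            def wrapper(*args, **kwargs):
                if random.random() < probability:
                    time.sleep(delay_s)  # simulate a slow network link or dependency
                return fn(*args, **kwargs)
            return wrapper
        return decorator

    @inject_latency(probability=0.1, delay_s=1.5)
    def call_billing_service(payload: dict) -> dict:
        """Stand-in for a real dependency call."""
        return {"ok": True}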

SLOs: Service Level Objectives

They also defined SLOs to quantify reliability expectations. For example:

  • “99.9% of check-in API requests complete successfully within 500ms over a rolling 30-day window.”

Tracking SLOs allowed them to:

  • See reliability trends, not just isolated incidents.
  • Decide when to prioritize reliability work over new features.
  • Measure whether post-incident improvements were actually effective.
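
Given an SLO like the one above, the error budget and how much of it has been burned are simple arithmetic. A small sketch; the request counts are made up for illustration:

    def slo_report(good_events: int, total_events: int, objective: float = 0.999) -> dict:
        """Compare observed reliability against the objective and the error budget."""
        availability = good_events / total_events
        allowed_bad = (1 - objective) * total_events   # the error budget, in events
        actual_bad = total_events - good_events
        return {
            "availability": availability,
            "objective": objective,
            "error_budget_used": actual_bad / allowed_bad if allowed_bad else float("inf"),
        }

    # Over a 30-day window: 10,000,000 check-in requests, 4,000 of them slow or failed.
    print(slo_report(good_events=9_996_000, total_events=10_000_000))
    # -> availability 0.9996, with 40% of the 0.1% error budget consumed.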


Step 6: Build a Culture of Continuous Learning

Technical changes alone weren’t enough. Ferris Dock needed a cultural shift.

They adopted practices to ensure every incident—no matter the size—improved reliability:

  • Blameless postmortems: focus on system design, not individual failure.
  • Regular incident review meetings: short, recurring sessions where teams skim recent incidents together.
  • Psychological safety: engineers were explicitly encouraged to surface near-misses, weak points, and “we got lucky” stories.
  • Celebrating fixes, not just heroics: praise the quiet work of prevention and resilience.

The cardboard sign in the incident review room was a constant visual reminder:

The cost of not learning is paid by our users.


Step 7: Communicate Transparently with Users

One painful lesson from the Cardboard Incident was that users were confused for too long. They didn’t know:

  • If the system was down or slow
  • Whether their data was safe
  • If they should retry or wait

Worse, in the chaos, phishing attempts emerged: scammers emailed customers pretending to be from Ferris Dock support, asking for credentials.

Ferris Dock responded by formalizing their outage communication strategy:

  • Public status page with real-time updates and clear incident states (investigating, identified, monitoring, resolved).
  • Plain-language explanations of what users might experience (e.g., “You may see loading spinners on the bookings page. Data remains safe.”).
  • Safety and scam awareness reminders during and after incidents:
    • “We will never ask for your password or 2FA code via email or chat.”
    • “If you receive suspicious messages during this incident, forward them to security@….”
  • Post-incident summaries that explain:
    • What happened (in non-technical terms)
    • What was impacted
    • What is being done so it doesn’t happen again

This shifted incidents from being purely negative events to opportunities to build trust through honesty.


Bringing It All Together

The Cardboard Incident didn’t just teach Ferris Dock about one buggy deployment. It pushed them to overhaul how they think about reliability:

  • Incidents became learning assets, not embarrassments to hide.
  • Postmortems became structured and actionable, not blame sessions.
  • Architecture evolved to be failure-aware, not failure-denying.
  • Observability matured so that engineers could see and fix issues quickly.
  • Chaos engineering and SLOs exposed weak points before they hurt customers.
  • Culture shifted toward continuous learning, safety, and accountability.
  • User communication improved, including clear status, safety, and scam awareness guidance.

The cardboard sign still hangs in the incident review room—not as a reminder of failure, but as a reminder of commitment:

Outages will happen. Spinning is optional. Learning is not.

If your organization is still “spinning” its way through outages, take a page from Ferris Dock’s story. Start small: run one structured postmortem, define one SLO, fix one observability gap.

Then let each incident—no matter how minor—pull you a little closer to the resilient, trustworthy systems your users deserve.
