The Analog Failure Observatory Desk: Building a Tiny Paper Lab for Safe Outage Experiments

How to design a low‑stakes, analog “failure desk” that lets teams safely simulate outages, explore sociotechnical failure, and practice resilience before anything breaks in production.

Introduction

Most teams only meet failure when it’s already too late.

We gather in war rooms, spin up incident bridges, and scramble to diagnose what went wrong in a live, high‑stakes environment. Afterwards, we promise to “do better next time” and maybe add a few alerts. But what if we could practice long before things go wrong—quietly, cheaply, and safely?

Enter the Analog Failure Observatory Desk: a tiny, paper‑based lab for experimenting with outages without touching production at all.

Think of it as a tabletop simulator for sociotechnical failure. You use sticky notes, printed diagrams, index cards, and checklists to explore how systems break—and how people respond—before anyone’s pager ever goes off.

This post walks through why such a desk matters, how to build and use it, and how to connect it to Resilience Engineering, chaos engineering, compliance, and real‑world fears like power grid dependency.


Why an Analog “Failure Desk” at All?

Most failure work focuses on bits and bytes:

  • Broken code
  • Misconfigured services
  • Failing dependencies

But real outages are sociotechnical phenomena. They emerge from:

  • Humans (skills, mental models, stress, fatigue)
  • Processes (runbooks, escalation paths, approvals)
  • Technology (APIs, queues, databases, networks)
  • Context (regulation, risk appetite, customer expectations)

A paper‑based “failure desk” gives you a low‑stakes sandbox to explore these interactions.

Why analog?

  • Low risk: No production access, no real customers affected.
  • Low friction: Pens, paper, and index cards are cheaper and faster than building a full digital simulator.
  • High focus on people: When nothing is automated, you see how people actually think, decide, and adapt.
  • Accessible: Anyone—engineers, product, legal, compliance, security, support—can participate.

You’re not trying to simulate every detail of your architecture. You’re building a small, imperfect model that’s just realistic enough to reveal surprising interactions.


Designing Your Tiny Paper Lab

You don’t need a dedicated room or a fancy setup. A corner table or a rolling cart works fine. The key is intentional structure.

Core Components

Stock your Analog Failure Observatory Desk with:

  • System Maps

    • Printed architecture diagrams (current or simplified)
    • Data flow charts and dependency maps
    • External dependencies (payment processor, cloud provider, power grid)
  • Role Cards

    • On‑call engineer, SRE, product manager, incident commander
    • Security officer, compliance officer, customer support
    • External stakeholders (regulator, large enterprise customer)
  • Scenario Cards

    • “Primary database read replica lags by 30 minutes”
    • “Cloud region experiences partial network partition”
    • “Regulator issues urgent data handling directive mid‑incident”
    • “Power outage hits one major office; remote employees unaffected”
  • Constraint & Risk Cards

    • Regulatory constraints (e.g., “No customer data may leave region X”)
    • Security constraints (e.g., “All emergency access is logged and reviewed”)
    • Business constraints (SLAs, cost ceilings, contractual penalties)
  • Process Aids

    • Incident timeline sheets
    • Communication templates (status updates, internal Slack messages, exec summaries)
    • Post‑incident review templates
  • Basic Office Supplies

    • Sticky notes in multiple colors
    • Index cards
    • Thick markers
    • Tape, string (to show dependencies on the wall/whiteboard)

Your goal: create a physical environment where people can:

  1. Model systems quickly
  2. Imagine realistic breakdowns
  3. Practice coordinated responses
  4. Reflect on what helped and what hindered
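
The cards themselves can stay on paper, but it helps to keep a lightweight digital index so decks can be reprinted, shuffled, or shared between teams. Here is a minimal sketch in Python; the Card fields and the example entries are illustrative assumptions, not a required schema.

    from dataclasses import dataclass, field
    from typing import List
    import random

    @dataclass
    class Card:
        """One physical card, mirrored digitally so decks can be reprinted."""
        kind: str                 # "scenario", "role", or "constraint"
        title: str
        details: str = ""
        tags: List[str] = field(default_factory=list)

    # A few cards mirroring the examples above (illustrative content only).
    DECK = [
        Card("scenario", "Read replica lag",
             "Primary database read replica lags by 30 minutes.", ["database"]),
        Card("scenario", "Partial network partition",
             "Cloud region experiences partial network partition.", ["network"]),
        Card("constraint", "Data residency",
             "No customer data may leave region X.", ["compliance"]),
        Card("role", "Incident commander",
             "Owns coordination and the communication cadence.", ["people"]),
    ]

    def draw(kind: str, n: int = 1) -> List[Card]:
        """Draw up to n random cards of a given kind for a tabletop session."""
        pool = [c for c in DECK if c.kind == kind]
        return random.sample(pool, min(n, len(pool)))

    for card in draw("scenario") + draw("constraint"):
        print(f"[{card.kind}] {card.title}: {card.details}")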

Failure as Sociotechnical: Beyond “Bug Hunting”

Traditional incident practice often assumes failures are primarily technical: a bug, a misconfiguration, a failing server.

In reality, outages arise from:

  • Ambiguous documentation (two teams interpret a policy differently)
  • Misaligned incentives (product wants speed; infra wants stability)
  • Hidden couplings (a rarely used API depends on a fragile legacy service)
  • Cognitive overload (on‑call can’t track all moving parts under stress)

Around the desk, you can make these factors visible.

Example exercise:

  1. Draw a simplified architecture on a large sheet.
  2. Ask participants to place sticky notes for:
    • “Assumptions we rely on but never verify”
    • “We think this is ‘safe enough’ because…”
    • “If this silently failed, who would notice?”
  3. Now introduce a scenario card: e.g., “External payment gateway starts returning intermittent 5xx errors.”
  4. Discuss:
    • Who gets paged first?
    • Who should get paged first?
    • What assumptions break first?
    • Which team is dragged in late and surprised?

You’re no longer just debugging code—you’re debugging coordination, communication, and mental models.
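
If the sticky notes from step 2 are worth keeping, transcribe them into a small assumption register after the session and revisit it in later drills. A minimal sketch, with made-up example assumptions and field names:

    from dataclasses import dataclass
    from typing import List, Optional

    @dataclass
    class Assumption:
        """One "we rely on this but never verify it" sticky note."""
        text: str
        would_notice: str                    # who would notice if it silently failed
        last_verified: Optional[str] = None  # ISO date, or None if never checked

    def never_verified(register: List[Assumption]) -> List[Assumption]:
        """Return the assumptions nobody has actually checked."""
        return [a for a in register if a.last_verified is None]

    register = [
        Assumption("Payment gateway retries are idempotent", "Payments team"),
        Assumption("Read replica lag stays under one minute", "On-call engineer",
                   last_verified="2024-03-01"),
    ]

    for a in never_verified(register):
        print(f"UNVERIFIED: {a.text} (would be noticed by: {a.would_notice})")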


Safety-I and Safety-II: Studying Success, Not Just Failure

Resilience Engineering offers a crucial shift:

  • Safety-I: Focus on what goes wrong. Prevent failures, minimize errors.
  • Safety-II: Focus on what goes right. Study how people adapt and succeed under varying conditions.

At the desk, combine both.

Safety-I at the Desk

Use the lab to:

  • Reconstruct past outages on paper
  • Map out decision points and information gaps
  • Identify places where procedures or tools made things harder
  • Try alternative responses and see how outcomes might change

Safety-II at the Desk

Dedicate sessions to a single question: “How do things usually go right?”

  • Ask on‑call engineers to map everyday workarounds:
    • “What do you do when the alert is noisy but might be real?”
    • “How do you prioritize clashing pages?”
  • Capture adaptations as sticky notes:
    • Unofficial dashboards
    • Side channels for communication
    • Small hacks in runbooks

These everyday adaptations are sources of resilience, not “process violations.” Document them, understand them, and, where appropriate, formalize or support them.


Running Chaos and Incident Drills on Paper

You can bring chaos engineering and incident response practice into the analog realm.

Analog Chaos Exercises

  1. Choose a System Slice
    Pick one critical flow: login, checkout, data export, batch processing.

  2. Inject Failures with Cards
    Examples:

    • “Latency between Service A and B spikes to 2 seconds.”
    • “DNS misconfiguration breaks traffic routing for 10% of users.”
  3. Ask: What Actually Happens?

    • Which monitors fire?
    • Which team sees the first symptom?
    • What does the user experience?
  4. Adapt the System on Paper

    • Add an extra cache: what changes?
    • Add a rate limit: who is protected, who isn’t?

The point isn’t precision; it’s surfacing assumptions and weak spots before you run chaos experiments in real environments.
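
If you want to sanity-check the same exercise digitally afterwards, a toy dependency model is often enough to answer “what actually happens?” for one slice. The services, edges, and injected failure below are hypothetical examples, not a real architecture.

    # Toy dependency graph: each service lists the services it depends on.
    DEPENDS_ON = {
        "checkout": ["payments", "inventory"],
        "payments": ["payment_gateway"],
        "inventory": ["primary_db"],
        "login": ["auth", "primary_db"],
        "auth": [],
        "payment_gateway": [],
        "primary_db": [],
    }

    def affected_by(failed: str) -> set:
        """Return every service that transitively depends on the failed one."""
        impacted = {failed}
        changed = True
        while changed:
            changed = False
            for service, deps in DEPENDS_ON.items():
                if service not in impacted and impacted.intersection(deps):
                    impacted.add(service)
                    changed = True
        return impacted

    # "Inject" the failure card: the external payment gateway goes down.
    print(sorted(affected_by("payment_gateway")))
    # -> ['checkout', 'payment_gateway', 'payments']; login is unaffected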

Incident Response Roleplay

Run tabletop incident drills:

  • Assign roles using cards (incident commander, communications, technical lead, security, compliance).
  • Present a scenario gradually: start with a simple symptom, then add new findings or constraints every 5–10 minutes.
  • Track:
    • How decisions are made
    • Who is consulted when
    • How information flows (and where it gets stuck)

Afterward, do a structured debrief:

  • What helped collaboration?
  • Where did we argue or stall?
  • Which runbooks or tools were missing or confusing?

You’re training coordination muscles in a zero‑risk setting.
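
Facilitators can pre-plan the drip of new findings. A tiny scheduling sketch helps keep the “every 5–10 minutes” rhythm honest; the inject texts below are hypothetical.

    from datetime import datetime, timedelta

    # Example injects a facilitator will reveal during a tabletop drill.
    INJECTS = [
        "Symptom: checkout error rate climbs to 4%.",
        "Finding: payment gateway status page reports degradation.",
        "Constraint card: regulator requests an impact statement within 2 hours.",
        "Finding: the failover region is missing a recent config change.",
    ]

    def schedule(start: datetime, gap_minutes: int = 10) -> list:
        """Assign each inject a wall-clock reveal time, gap_minutes apart."""
        return [(start + timedelta(minutes=i * gap_minutes), text)
                for i, text in enumerate(INJECTS)]

    for when, text in schedule(datetime.now()):
        print(f"{when:%H:%M}  {text}")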


Using the Desk for Knowledge Sharing and Cross‑Functional Learning

The desk also works as a knowledge commons.

Post‑Incident Reviews

Instead of a slide deck, reconstruct incidents physically:

  • Timeline strip across the table
  • Sticky notes for events, decisions, signals, and uncertainties
  • Different colors for technical events, human decisions, external pressures

Invite people from different functions—support, product, sales, security, compliance—to walk along the timeline.

Questions to ask:

  • “What surprised you?”
  • “What context would have helped you in the moment?”
  • “What parts of this are invisible in our normal incident report?”

Turn insights into:

  • Updated runbooks
  • New training scenarios
  • Design changes to improve observability or safety
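
Before the sticky notes come down, it is worth transcribing the timeline into a structured record that can travel beyond the room. A minimal sketch; the events and categories are invented examples that mirror the color coding above.

    import csv

    # One row per sticky note on the physical timeline.
    TIMELINE = [
        ("10:02", "technical", "Checkout latency alert fires"),
        ("10:05", "human", "On-call engineer pages the payments team"),
        ("10:14", "external", "Large enterprise customer opens a support ticket"),
        ("10:20", "human", "Incident commander declares a major incident"),
    ]

    with open("incident_timeline.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["time", "category", "event"])
        writer.writerows(TIMELINE)

    print(f"Wrote {len(TIMELINE)} timeline entries to incident_timeline.csv")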

Cross‑Functional Drills

Use the desk to involve non‑engineering roles in realistic ways:

  • Legal/compliance: “A regulator might ask this; how would you respond?”
  • Security: “Is this mitigation acceptable under our threat model?”
  • Product: “What trade‑off would you recommend to leadership?”

This builds shared mental models and reduces finger‑pointing when real incidents happen.


Don’t Forget Compliance, Security, and Real‑World Constraints

Outage scenarios often ignore constraints that matter deeply in production:

  • Data residency and privacy regulations
  • Industry‑specific requirements (HIPAA, PCI DSS, SOC 2, etc.)
  • Internal risk limits and auditability

Make them first‑class citizens in your desk:

  • Create compliance and security constraint cards that are played mid‑scenario:
    • “Regulation prohibits this data from leaving region X.”
    • “Emergency access must be justified and logged.”
    • “Encryption keys may not be exported under any circumstance.”
  • Challenge teams to find response strategies that both restore service and respect constraints.

This builds realistic habits: responders learn to think in terms of safe, compliant adaptations, not just any quick fix.
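
You can even encode constraint cards as simple checks and run a proposed mitigation past them during the drill. The rules and mitigation attributes below are illustrative assumptions, not a policy engine.

    # Constraint cards expressed as predicates over a proposed mitigation.
    CONSTRAINTS = {
        "No customer data may leave region X":
            lambda m: not m.get("moves_data_out_of_region", False),
        "Emergency access must be justified and logged":
            lambda m: m.get("access_justification") is not None,
        "Encryption keys may not be exported":
            lambda m: not m.get("exports_keys", False),
    }

    def violations(mitigation: dict) -> list:
        """Return the constraint cards a proposed mitigation would break."""
        return [name for name, ok in CONSTRAINTS.items() if not ok(mitigation)]

    proposed = {
        "description": "Fail customer traffic over to another region",
        "moves_data_out_of_region": True,
        "access_justification": "Approved by incident commander",
    }

    for name in violations(proposed):
        print(f"BLOCKED BY CONSTRAINT CARD: {name}")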


Users’ Fears, Dependencies, and the Power Grid

Incidents aren’t just about uptime; they’re about people’s fears and dependencies on the systems behind the scenes.

Consider how reliance on the power grid shapes system design and incident response:

  • Users assume electricity “just works,” yet power outages can cascade into:
    • Data center failures
    • Connectivity loss
    • Payment terminal downtime
  • Teams might design only for cloud provider failures, not for upstream infrastructure fragility.

At the desk, explore:

  • Scenarios where both a cloud region and a major urban area lose power.
  • How critical services behave if backup generators fail.
  • What communication channels remain when corporate networks are down.

Prompt questions like:

  • “What does our service mean to users in a crisis?”
  • “Are there non‑technical ways to reduce harm—like clearer expectations, offline fallbacks, or manual processes?”

This shifts thinking from pure cost optimization to human‑centric resilience.
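
The dependency-card trick works one level further upstream, too. A toy sketch of power dependencies, with entirely hypothetical facilities, makes the “what if the generators also fail?” question concrete.

    # Which power sources keep each facility running. Hypothetical names only.
    POWER_SOURCES = {
        "hq_office": ["city_grid"],                              # no backup
        "primary_datacenter": ["city_grid", "diesel_generator"],
        "payment_terminals": ["city_grid"],
    }

    def still_powered(failed_sources: set) -> dict:
        """For each facility, report whether any power source survives."""
        return {
            facility: any(src not in failed_sources for src in sources)
            for facility, sources in POWER_SOURCES.items()
        }

    # Scenario 1: the city grid goes down; the generator holds.
    print(still_powered({"city_grid"}))
    # Scenario 2: the generator fails as well.
    print(still_powered({"city_grid", "diesel_generator"}))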


Conclusion: Start Small, Learn Continuously

An Analog Failure Observatory Desk won’t predict every outage or replace real‑world testing. It isn’t meant to.

Its value lies in:

  • Creating a safe, low‑stakes space to explore how systems, people, and processes interact under stress.
  • Embedding Resilience Engineering and Safety-II thinking into everyday practice.
  • Normalizing chaos exercises and incident drills as routine, not exceptional.
  • Bringing compliance, security, and user realities into the center of outage planning.

You can start this week:

  1. Claim a desk or table.
  2. Print a simple architecture diagram.
  3. Write three outage scenario cards and three constraint cards.
  4. Invite a few colleagues for a 60‑minute tabletop session.

From there, evolve the lab. Add more scenarios, refine role cards, and fold in insights from real incidents.

The more you practice failure in this tiny paper lab, the better prepared your organization will be when the real outages arrive.
