The Cardboard Incident Observatory Tram: Riding a Paper Route Through Your System’s Quietest Near‑Misses

How riding a cardboard observatory tram through your system’s weak signals and near‑misses can transform your security, resilience, and engineering culture.

Imagine building a city‑wide observatory tram out of cardboard.

It’s not designed to survive a storm, or even a strong gust of wind. It’s not the final product. It’s a paper route: a fragile, temporary way to move slowly through your system’s landscape, noticing the things you normally speed past.

That’s what a near‑miss observability practice can be for your software systems.

Most teams only stop the world for full‑blown incidents: outages, security breaches, data loss. But long before those events, your systems are constantly whispering: an odd log line here, a retry storm there, a flaky test that fails only on Tuesdays.

These are near misses—issues caught before causing real harm. They’re a goldmine for improving security, resilience, and safety culture.

This post is about building that “cardboard tram”: a lightweight, visual, and cultural system to regularly ride through your quietest near‑misses and learn from them—before they turn into headlines.


Why near misses matter more than you think

A near miss is an event where something almost went wrong but didn’t—often because of luck, partial safeguards, or a human catching it just in time.

Examples:

  • A misconfigured firewall rule that’s caught in review minutes before deployment.
  • A database migration that would have locked a table for an hour, but is rolled back after a sharp‑eyed engineer notices the plan.
  • A background job that silently retries 200 times a day, never fully failing, never fully succeeding.

None of these triggered a formal incident. No status page was updated. No customer complained.

But each quiet almost‑problem contains:

  • Evidence of brittle assumptions
  • Hidden single points of failure
  • Design gaps where safety depended on individual heroism

If you only study the few major incidents that breach your defenses, you ignore the vast body of daily weak signals that show how your system is really operating.

Aviation figured this out a long time ago.


Borrowing from aviation: near‑miss reporting as a safety engine

Commercial aviation is extraordinarily safe not because planes never fail, but because the industry is obsessed with learning from every small failure and near miss.

Pilots, crew, and controllers are encouraged—sometimes legally required—to report:

  • “I almost lined up with the wrong runway.”
  • “We briefly dipped below the minimum safe altitude.”
  • “We misheard a clearance and nearly took off without proper authorization.”

Most of these never become accidents. But they feed a systematic learning machine: databases, safety boards, cross‑airline reviews, and robust sensemaking processes.

Software engineering rarely has such rigor. Many teams:

  • Only track P1/P0 incidents formally.
  • Treat “almost issues” as noise or local quirks.
  • Rely on memory and hallway conversations instead of structured learning.

Yet our systems are complex socio‑technical environments, much like aviation. We also need systematic ways to:

  • Capture weak signals
  • Share them across teams
  • Turn them into design, tooling, and process improvements

That’s where a Cardboard Incident Observatory Tram comes in.


The cardboard tram: a metaphor for lightweight, visible learning

Think of your near‑miss practice as a tram made of cardboard:

  • It’s not a heavy governance process.
  • It’s experimental and easy to reshape.
  • It’s slow on purpose, so people can look around.
  • It runs on a visible track that everyone can see and contribute to.

In practical terms, this looks like:

  1. Deliberately capturing near‑misses and weak signals
  2. Visualizing them in shared systems
  3. Regularly riding through them as a team, making sense of patterns and acting on them

Let’s break those down.


Step 1: Make near‑misses reportable and safe

You can’t learn from what you don’t see.

Start by explicitly defining what you care about:

“A near miss is any event where something felt wrong, surprising, or risky—even if nothing broke and no customer noticed.”

Encourage reporting of:

  • Suspicious patterns: sudden latency spikes that self‑resolve.
  • Accidental saves: “I almost hit deploy to production instead of staging.”
  • Uncomfortable dependencies: “If this job fails, we have no alerting.”
  • Design unease: “Every time I touch this code, I feel like I’m handling nitroglycerin.”

Support this with:

  • Blameless framing: Focus on conditions and systems, not on who messed up.
  • Low‑friction capture: A short form, a Slack emoji, or a quick ticket template is better than a full incident report (see the sketch at the end of this step).
  • Leadership signaling: Tech leads and managers should share their own near‑misses regularly—“I almost shipped a secret; here’s what caught it.”

Your goal: make “almost problems” socially and procedurally as important as actual outages.
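
As a concrete sketch of what low‑friction capture could look like (the file path, field names, and helper below are illustrative assumptions, not any particular tool’s API), here is a tiny Python helper that a form handler or chat shortcut could call to append a near‑miss report to a shared JSON Lines file:

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Hypothetical location for the shared near-miss log; adjust to your setup.
NEAR_MISS_LOG = Path("near_misses.jsonl")

def report_near_miss(description: str, caught_by: str, tags: list[str]) -> dict:
    """Append a lightweight near-miss report to a shared JSON Lines file.

    Capture stays deliberately cheap: one sentence, where it was caught,
    and a few tags -- no severity scoring, no mandatory fields beyond these.
    """
    report = {
        "reported_at": datetime.now(timezone.utc).isoformat(),
        "description": description,
        "caught_by": caught_by,      # e.g. "code review", "monitoring", "gut feeling"
        "tags": sorted(set(tags)),   # e.g. ["security", "auth-service"]
    }
    with NEAR_MISS_LOG.open("a", encoding="utf-8") as f:
        f.write(json.dumps(report) + "\n")
    return report

# Example: the "almost deployed to production instead of staging" save.
report_near_miss(
    description="Almost hit deploy to production instead of staging",
    caught_by="human double-check",
    tags=["deployment", "tooling"],
)
```

The storage format is beside the point; what matters is that reporting takes under a minute and produces a record you can later pull onto the board.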


Step 2: Use visual systems to make weak signals observable

Once near‑misses are being reported, they must become visible artifacts, not just forgotten chat logs.

Some effective visual tools:

1. Near‑Miss Kanban Board

A simple Kanban board with columns like:

  • Spotted – Raw, untriaged near‑misses
  • Sensemaking – Under discussion and exploration
  • Decisions – What we’ll change (or consciously not change)
  • In Progress – Mitigations being implemented
  • Learned – Documented insights and updated practices

Each card:

  • Describes the near miss in plain language
  • Notes where it was caught (monitoring, review, gut feeling)
  • Tags impacted systems, teams, and risk types (security, reliability, data, UX)

The board is your cardboard tram track: a visible route through your system’s quiet anomalies.
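
If you want the cards somewhere scriptable, alongside or instead of a tracker, here is a minimal sketch of the card and column structure in Python; the class and field names are illustrative assumptions, not a prescribed schema:

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum

class Column(Enum):
    SPOTTED = "Spotted"
    SENSEMAKING = "Sensemaking"
    DECISIONS = "Decisions"
    IN_PROGRESS = "In Progress"
    LEARNED = "Learned"

@dataclass
class NearMissCard:
    description: str          # plain-language account of the near miss
    caught_by: str            # monitoring, review, gut feeling, ...
    tags: list[str] = field(default_factory=list)  # systems, teams, risk types
    column: Column = Column.SPOTTED

def render_board(cards: list[NearMissCard]) -> None:
    """Print cards grouped by column, in board order."""
    by_column = defaultdict(list)
    for card in cards:
        by_column[card.column].append(card)
    for column in Column:
        print(f"== {column.value} ==")
        for card in by_column[column]:
            print(f"- {card.description} [{', '.join(card.tags)}]")

render_board([
    NearMissCard("Firewall rule caught in review minutes before deploy",
                 caught_by="code review", tags=["security", "network"]),
    NearMissCard("Background job retries ~200x/day without failing outright",
                 caught_by="monitoring", tags=["reliability", "jobs"],
                 column=Column.SENSEMAKING),
])
```

Modeling the columns as an enum keeps the board’s vocabulary explicit, so “Learned” cards can later be queried just as easily as “Spotted” ones.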

2. Incident & Near‑Miss Map

Create a time‑ or system‑based map showing:

  • Major incidents
  • Minor incidents
  • Near‑misses and recurring weak signals

Patterns jump out:

  • “All of these cluster around our auth service.”
  • “Half our near‑misses involve data migrations.”
  • “Every quarter end we hit rate‑limit warnings but never quite fail.”

A map turns vague discomfort into concrete, discussable risk landscapes.
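
Such a map can start with almost no tooling. Assuming your incidents and near‑misses carry tags like the cards above (the records below are made up for illustration), a sketch is simply to count how often each system or risk type appears across everything, large and small:

```python
from collections import Counter

# Hypothetical records: real incidents and near-misses in one list,
# each tagged with the systems and risk types involved.
events = [
    {"kind": "incident",  "tags": ["auth-service", "reliability"]},
    {"kind": "near_miss", "tags": ["auth-service", "security"]},
    {"kind": "near_miss", "tags": ["data-migration", "reliability"]},
    {"kind": "near_miss", "tags": ["data-migration", "reliability"]},
    {"kind": "near_miss", "tags": ["rate-limits", "capacity"]},
]

# Count tag occurrences across all events, and across near-misses specifically.
all_tags = Counter(tag for e in events for tag in e["tags"])
near_miss_tags = Counter(
    tag for e in events if e["kind"] == "near_miss" for tag in e["tags"]
)

print("Hotspots (all events):")
for tag, count in all_tags.most_common(5):
    print(f"  {tag}: {count} total, {near_miss_tags[tag]} near-misses")
```

Even a crude count like this tends to surface the “everything clusters around auth” and “half of these involve migrations” patterns long before anyone has to feel them as an outage.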


Step 3: Sensemaking rituals – riding the tram together

Collecting events is not enough. The real power is in sensemaking: collectively interpreting weak signals, exploring multiple stories, and deciding what they might mean.

Add a recurring ritual:

Near‑Miss Tram Ride – 45–60 minutes, every 2–4 weeks.

Participants: engineers, SREs, security, product, maybe support—anyone who touches the real behavior of the system.

Agenda:

  1. Review the board/map

    • What’s new in Spotted and Sensemaking?
    • Are there clusters or recurring themes?
  2. Storytelling, not blaming

    • Ask: “What surprised us?” and “What made this possible?”
    • Explore context: on‑call load, tooling gaps, design constraints.
  3. Hypothesis and experiment, not decree

    • Consider small mitigations, probes, or observability improvements.
    • Explicitly note where uncertainty remains—this is valuable signal.
  4. Capture decisions and non‑decisions

    • “We’re choosing not to fix this now because…” is itself a learning artifact.

This is where preoccupation with failure comes alive: a calm, curious, ongoing attention to how things might fail next, based on the smallest hints.


Refining rituals during change and pain

The best time to refine your near‑miss practices is:

  • When your architecture is shifting (cloud migration, monolith to services, AI features).
  • When teams feel pain (on‑call burnout, repeated confusion, churn around specific services).

Use these moments to ask:

  • Which weak signals are we currently ignoring?
  • What rituals feel stale or performative?
  • Where can we create lighter‑weight, more honest forums?

You might:

  • Fold near‑miss review into existing incident review meetings.
  • Add a short “any near‑misses this week?” section to retros.
  • Start with a time‑boxed 3‑month experiment for the tram, then reassess.

The cardboard metaphor is important here: your process doesn’t need to be perfect. It just needs to exist, be visible, and be adjustable.


Turning near‑miss culture into operational advantage

When you intentionally cultivate a culture that values reporting and exploring almost‑problems, several things happen:

  • Security improves: You catch misconfigurations, unsafe defaults, and access creep before they’re exploited.
  • Resilience increases: You see stress patterns, capacity cliffs, and brittle dependencies well before they collapse.
  • Safety culture deepens: People feel safer surfacing concerns early, because they see discussion turn into action rather than blame.
  • Learning accelerates: New team members get exposed to a rich history of “how things nearly broke” rather than only polished architecture diagrams.

Over time, your quietest weak signals become a strategic asset—a live, evolving map of where your systems are most likely to hurt you next.


How to start in the next 30 days

You don’t need a full program to begin. Try this:

  1. Week 1 – Announce the experiment

    • Define “near miss” for your org.
    • Create a simple form or ticket template.
    • Share a few personal examples from leaders.
  2. Week 2 – Stand up the visual

    • Create a Near‑Miss Kanban board.
    • Seed it with 3–5 recent examples.
  3. Week 3 – Run your first tram ride

    • 45 minutes, small group.
    • Focus on storytelling and curiosity.
  4. Week 4 – Adjust the cardboard

    • Ask what felt useful vs. heavy.
    • Remove friction. Shorten templates. Tighten focus.

If it helps, explicitly call it an experiment:

“We’re building a cardboard incident observatory tram. It may be wobbly. That’s fine. We’ll reinforce the parts that help us see better.”


Conclusion: Don’t wait for the big crash

Big incidents will always demand attention. They’re loud, painful, and expensive.

But the most powerful learning opportunities often live in the quiet, ambiguous corners of your system—the checkout that almost timed out, the secret that almost leaked, the cron job that almost filled the disk.

Building a Cardboard Incident Observatory Tram is about:

  • Making those near‑misses visible.
  • Giving your teams time and space to make sense of them.
  • Turning weak signals into deliberate design and cultural improvements.

You don’t need a flawless process. You just need the willingness to ride that paper tram through your system regularly, look out the windows together, and say:

“We almost had a problem here. What is it trying to tell us?”

That’s how you turn your quietest near‑misses into your loudest source of insight.
