Rain Lag

The Brown Paper Reliability Bazaar: Building a Friction‑Full Incident Practice on Purpose

Why adding the right kind of friction to your incident practices—through deliberate reviews, checklists, and continuous learning—can make your systems and teams more reliable over the long term.

If you work in operations, SRE, or platform engineering, you’ve probably felt the pressure to make everything smoother, faster, and increasingly “automated away.” Incidents should be rare, detection should be instant, response should be streamlined, and recovery should be painless.

That all sounds good—until it quietly makes you fragile.

This post is about doing something that sounds backwards: intentionally adding friction to your incident practices. Think of it as a “Brown Paper Reliability Bazaar” where you lay everything out in the open—processes, mistakes, weird edge cases—and invite people to examine them slowly, in public, and with curiosity.

The thesis: reliability isn’t only about speed and uptime. It’s about how your organization manages uncertainty over time. And that depends as much on your culture and practices as it does on your tooling and math.


Reliability Is More Than "Keeping Things Up"

Reliability engineering is often misunderstood as “make sure it doesn’t go down.” In reality, it’s closer to:

Managing uncertainty and the risk of failure over the entire lifetime of a system.

This includes:

  • How your system behaves under weird, rare conditions
  • How people understand and operate that system
  • How you learn from both failures and near‑misses
  • How your practices evolve as the system and organization change

Mathematical and statistical models (MTBF, SLOs, reliability functions, probabilities of failure) are powerful tools. They help you reason about risks and make trade‑offs. But durable reliability lives at the intersection of math and human practice:

  • Are incidents noticed quickly—or only when customers complain?
  • Do people know what to do when an alert fires?
  • Are incident roles and authority clear under pressure?
  • Do you actually act on what you learn from incidents?

This is where a well‑designed Incident Response Plan (IRP) becomes essential.


What an Effective Incident Response Plan Really Does

An Incident Response Plan is more than a runbook on a wiki. At its best, it provides clear, actionable instructions for:

  1. Detecting an incident

    • How do we know something is wrong?
    • What are the thresholds, alarms, and signals that matter?
  2. Responding to an incident

    • Who’s in charge?
    • How do we communicate internally and externally?
    • How do we decide what action to take first?
  3. Recovering from an incident

    • How do we safely restore service?
    • How do we verify the system is stable again?
    • How do we avoid causing more harm during recovery?
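To make the three phases concrete, here is a minimal sketch of how a team might encode its IRP as structured data that both humans and tooling can read. Every phase name, goal, and step below is hypothetical, not a recommended plan:

```python
from dataclasses import dataclass, field

@dataclass
class IRPPhase:
    """One phase of a (hypothetical) Incident Response Plan."""
    name: str
    goal: str
    steps: list[str] = field(default_factory=list)

# Illustrative plan only; steps and thresholds are placeholders.
IRP = [
    IRPPhase(
        name="detect",
        goal="Know something is wrong before customers tell us",
        steps=[
            "Page on-call when the agreed SLO burn rate is exceeded",
            "Confirm the signal against a second source (logs, synthetics)",
        ],
    ),
    IRPPhase(
        name="respond",
        goal="Coordinate people and communication",
        steps=[
            "Assign incident commander, scribe, and comms lead",
            "Open a dedicated incident channel and post status updates",
        ],
    ),
    IRPPhase(
        name="recover",
        goal="Restore service without causing more harm",
        steps=[
            "Roll back or fail over using the pre-reviewed runbook",
            "Verify stability with the same signals that detected the incident",
        ],
    ),
]

def phase(name: str) -> IRPPhase:
    """Look up a phase of the plan by name."""
    return next(p for p in IRP if p.name == name)
```

Keeping the plan in version control like this makes it reviewable and testable, rather than a wiki page that drifts out of date.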

And it does this with one explicit goal:

Minimize the overall impact on the organization.

That impact isn’t just uptime. It’s:

  • Customer trust
  • Operational cost and burnout
  • Reputational damage
  • Regulatory or compliance exposure

A good IRP gives people something to hold onto when everything feels chaotic. But if you treat the IRP as a static document—or something you “set and forget”—its value decays quickly.

That’s where friction‑full practices come in.


The Case for Friction‑Full Practices

In most organizations, the instinct is to remove friction:

  • Fewer steps
  • Fewer checks
  • Fewer approvals
  • More automation

Some of that is healthy. But removing all friction is dangerous. It often means you:

  • Bypass human judgment when it’s actually needed
  • Hide complexity behind tools people don’t fully understand
  • Move too quickly through situations that deserve careful thought

Friction‑full practices are deliberate additions of thoughtful resistance into your workflows. Not bureaucracy for its own sake, but structures that force you to confront complexity and uncertainty instead of sliding past it.

Examples include:

  • Checklists for high‑risk actions (e.g., failovers, schema migrations, emergency patches)
  • Deliberate pre‑flight reviews for major changes
  • Formal incident role assignments (incident commander, scribe, comms lead)
  • Mandatory post‑incident reviews before closing out a ticket

These don’t exist to slow you down arbitrarily. They exist to:

  • Make assumptions visible
  • Encourage shared understanding under pressure
  • Catch errors at the point they’re easiest and cheapest to correct

Think of aviation: pilots are highly trained, but they still use checklists. The point isn’t to compensate for incompetence—it’s to compensate for being human in complex, high‑stakes systems.
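One lightweight way to build that checklist-style friction into tooling is a "gate" that refuses to run a high-risk action until every item has been explicitly acknowledged. This is a sketch; the checklist items and action below are hypothetical:

```python
# Hypothetical checklist for a database failover; the item text is
# illustrative, not a recommended production checklist.
FAILOVER_CHECKLIST = [
    "Confirmed the primary is actually unhealthy (not a flaky probe)",
    "Verified replica replication lag is within tolerance",
    "Announced the failover in the incident channel",
]

def run_with_checklist(action, checklist, acknowledged):
    """Run `action` only after every checklist item is acknowledged.

    Raises RuntimeError listing the missing items otherwise, which is
    exactly the "thoughtful resistance" the text describes: the error
    forces a human to confront the skipped step.
    """
    missing = [item for item in checklist if item not in acknowledged]
    if missing:
        raise RuntimeError(f"Checklist incomplete: {missing}")
    return action()
```

The point of the gate is not to block automation, but to make the acknowledgment itself an auditable, deliberate act.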


The Brown Paper Reliability Bazaar

Imagine this experiment:

  • After a major incident, you print out relevant metrics, logs, timelines, Slack threads, screenshots, and graphs.
  • You tape them all over a big wall—real “brown paper” style.
  • You invite the people who were involved (and some who weren’t) to walk through the story together.

You ask:

  • What did we notice first?
  • What were we assuming in this moment?
  • Where were we guessing—and why did that feel reasonable at the time?
  • What signals did we miss because of how our tools or alerts are designed?

This is your Reliability Bazaar: a structured, shared, friction‑full space where the organization comes to trade stories, surface assumptions, and learn.

The format doesn’t have to be literal paper. The important parts are:

  • It’s visible (not hidden in a ticket comment)
  • It’s collaborative (not one person filling out a template alone)
  • It’s structured (guided questions, not a blamey free‑for‑all)

This is what a structured post‑incident review looks like when you treat it as a learning event, not a compliance checkbox.


Turning Incidents into Learning, Not Just Fixes

A structured post‑incident review has a few non‑negotiable characteristics:

  1. Blamelessness with accountability
    You don’t focus on who to blame; you focus on how it made sense for people to act as they did in the moment, given the information, tools, and pressures they had. Accountability shows up in what the organization changes, not in who gets punished.

  2. Timeline reconstruction
    You rebuild what actually happened: alerts, decisions, actions, communications. This surfaces hidden dependencies and misunderstandings.

  3. Multiple perspectives
    Include on‑call engineers, product, customer support, and anyone who was impacted. Incidents are sociotechnical—they live at the intersection of systems and people.

  4. Concrete improvements
    You leave with actions, not platitudes: revised alerts, runbooks, training, code changes, or process tweaks.
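Timeline reconstruction in particular is easy to mechanize: merge events from each tool (alerting, chat, deploy logs) into one chronologically ordered story for the review wall. A minimal sketch, with entirely hypothetical event data:

```python
from datetime import datetime

# Hypothetical event records pulled from different tools during a review:
# (timestamp, source, description)
alerts  = [("2024-03-01T09:02:00", "alert", "checkout error rate > 5%")]
actions = [("2024-03-01T09:10:00", "action", "rolled back deploy 1234"),
           ("2024-03-01T09:04:00", "action", "incident commander assigned")]
comms   = [("2024-03-01T09:06:00", "comms", "status page updated")]

def build_timeline(*sources):
    """Merge event streams and sort chronologically."""
    events = [e for source in sources for e in source]
    return sorted(events, key=lambda e: datetime.fromisoformat(e[0]))

timeline = build_timeline(alerts, actions, comms)
```

Even this crude merge surfaces the questions that matter: what happened in the eight minutes between the alert and the rollback, and who knew what when.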

Done well, these reviews do double duty:

  • They improve your incident response next time (smoother roles, better communication, clearer signals).
  • They improve your overall system resilience (better designs, safer defaults, more realistic training).

The key is consistency. A brilliant review once a year is less valuable than a decent review after every meaningful incident and near‑miss.


Incident Practice as a Continuous, Iterative Loop

The most effective organizations treat incident management as a continuous, iterative practice, not a reactive fire drill. Lessons from each event are fed back into:

  • System design

    • Are there safer architectures or patterns we should adopt?
    • Can we isolate or limit blast radius more effectively?
  • Operations and tooling

    • Do dashboards and alerts show what humans actually need under pressure?
    • Are runbooks discoverable, accurate, and usable at 3 a.m.?
  • Training and onboarding

    • Do new engineers get to walk through old incidents?
    • Do we rehearse incident roles through drills or game days?
  • Culture and decision‑making

    • Do people feel safe to raise concerns early?
    • Do leaders reward thoughtful slow‑downs when something feels off?

Over time, this loop transforms incidents from isolated disasters into regular input signals that shape how you design, build, and operate systems. That’s how you manage uncertainty over the long term.


How to Start Adding the Right Kind of Friction

If your current incident practice is mostly ad‑hoc, here are some practical starting points:

  1. Define a minimal IRP

    • Establish incident severity levels and who gets paged.
    • Define basic roles: incident commander, comms lead, scribe.
    • Write a one‑page “How we run incidents” guide.
  2. Introduce a lightweight checklist

    • For declaring an incident.
    • For closing an incident (including scheduling a review).
  3. Run structured post‑incident reviews

    • 60–90 minutes, within a week of the incident.
    • Use the same template and facilitation style each time.
    • Focus on context, not blame.
  4. Make the learning visible

    • Share findings in a regular “reliability review” or engineering newsletter.
    • Turn repeated themes into backlog items for platform and product teams.
  5. Iterate intentionally

    • Every quarter, review your incident process itself.
    • Ask: What felt too heavy? What felt too loose? Where did we need more friction—or less?
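For step 1, severity levels and paging rules can live in version-controlled data that both humans and alerting tools consume. The levels, descriptions, and roles below are purely illustrative; tune them to your organization:

```python
# Hypothetical severity definitions; names and paging rules are examples.
SEVERITIES = {
    "sev1": {
        "description": "Customer-facing outage",
        "page": ["on-call", "incident-commander"],
    },
    "sev2": {
        "description": "Degraded service, workaround exists",
        "page": ["on-call"],
    },
    "sev3": {
        "description": "Minor issue, no customer impact",
        "page": [],
    },
}

def who_to_page(severity: str) -> list[str]:
    """Return the roles paged for a given severity level."""
    return SEVERITIES[severity]["page"]
```

Writing the levels down once, in one place, is itself a friction-full practice: every change to who gets woken up at 3 a.m. becomes a reviewable diff.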

The goal isn’t a perfect process; it’s a living practice that evolves with your systems and your people.


Conclusion: Reliability as a Long Game

Fast detection and quick recovery will always matter. But if you only optimize for speed and "smoothness," you risk eroding your organization’s ability to understand and manage the real complexity of its systems.

The Brown Paper Reliability Bazaar mindset says:

  • Lay the reality of your incidents out in the open.
  • Add deliberate, thoughtful friction where it improves understanding and reduces long‑term risk.
  • Treat every incident and near‑miss as an investment opportunity in future resilience.

In other words: don’t just fight fires—study them. Over time, this is how you build systems and teams that don’t just stay up more often, but fail more gracefully, recover more wisely, and learn more deeply every time they do.
