The Paper Reliability Streetcar: Riding a Daily Analog Route Through Your System’s Weirdest Edge Cases

How to build a repeatable, low-tech “paper streetcar” ritual that exercises your system’s strangest edge cases, exposes hidden failure modes, and steadily improves your incident readiness and system design.

Introduction: Why Your Worst Incidents Are Never the Ones You Practiced For

Most reliability programs are great at the obvious stuff: load tests, uptime dashboards, paging on error spikes. Yet the incidents that really hurt rarely come from the obvious paths.

They come from:

  • The misconfigured partner webhook that only fires once a quarter
  • The odd payment flow that only applies to one legacy region
  • The partial outage where DNS, auth, and feature flags all disagree at once

In other words: your weirdest edge cases.

These edges are exactly where your monitoring is weakest, your playbooks are vaguest, and your team is least practiced. And that’s a problem reliability tooling alone can’t fix.

This is where the Paper Reliability Streetcar comes in: a daily analog route through your system’s strangest paths. It’s a manual, structured ritual that:

  • Exercises non-obvious behaviors on purpose
  • Reveals gaps in design, monitoring, and communication
  • Turns edge cases into regular practice instead of rare surprises

Think of it as a low-tech, high-value reliability workout for your system and your team.


What Is the “Paper Reliability Streetcar”?

The metaphor comes from a simple idea: a streetcar follows the same route every day, hitting the same stops, in all weather. The "paper" part means this route is:

  • Manual – run by humans, not automation (at least initially)
  • Scripted – defined as a checklist, scenario, or runbook
  • Repeatable – done on a regular cadence (daily, weekly, or per on-call shift)

Your paper streetcar is a fixed set of edge-case scenarios that you:

  1. Walk through as if they’re happening right now
  2. Observe what breaks (or would break)
  3. Capture improvements for design, tooling, and process

This isn’t chaos engineering in production, and it’s not a synthetic test suite. It’s more like a tabletop exercise for the oddest 1% of your system behavior, done often enough that your team becomes fluent in handling the bizarre.


Step 1: Define Your Daily Analog Route

Start by designing the route your streetcar will take. The key rule: avoid the happy path.

Good sources of edge cases:

  • Past incidents that involved multiple interacting failures
  • Silent failures (e.g., dropped events, delayed jobs, partial data writes)
  • Partner or vendor dependencies (payment providers, SMS gateways, SSO, CDNs)
  • Policy- or region-specific logic (country-specific rules, legacy customer tiers)
  • Rare lifecycle moments (account closure, subscription reactivation, data migration)

Turn these into scenarios like:

  • “Customer in a deprecated plan attempts to upgrade during a partial payment-provider outage.”
  • “A third-party webhook fires twice with slightly different payloads, and our system processes both.” (A dedup sketch follows this list.)
  • “A large tenant hits a rate limit in only one microservice, causing inconsistent state across services.”
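
The double-delivery scenario, for instance, ultimately probes one small piece of code: the idempotency check. Here is a minimal sketch in Python; the names (handle_webhook, apply_side_effects) and the in-memory store are hypothetical stand-ins for whatever your system actually uses:

    # Minimal sketch of idempotent webhook handling (names hypothetical).
    # Dedup must key on the event ID, not the payload: the two deliveries
    # differ slightly, so payload comparison would treat the retry as new.
    seen_event_ids: set[str] = set()  # real systems need durable, shared storage

    def handle_webhook(event: dict) -> str:
        event_id = event["id"]            # assumed stable across redeliveries
        if event_id in seen_event_ids:
            return "duplicate: ignored"
        seen_event_ids.add(event_id)
        apply_side_effects(event)         # hypothetical business logic
        return "processed"

    def apply_side_effects(event: dict) -> None:
        print(f"applying event {event['id']} ({event.get('type', 'unknown')})")

Walking through this sketch also exposes its own weak point: if the store is per-instance or expires too early, both deliveries get processed anyway.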

For each scenario, define a minimal analog script:

  • Preconditions (what must be true in the system?)
  • Trigger (what action or event starts this scenario?)
  • Expected correct behavior
  • Likely failure modes (from your current understanding)

Your initial route might be just 3–5 scenarios. The goal is not to cover everything, but to practice something non-obvious every single time.
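
It can also help to give every scenario the same written shape. A minimal sketch of such a record in Python; the dataclass fields mirror the analog script above, and all example values are hypothetical:

    from dataclasses import dataclass, field

    # One scenario = one record, mirroring the analog script fields above.
    @dataclass
    class Scenario:
        name: str
        preconditions: list[str]   # what must be true in the system
        trigger: str               # the action or event that starts it
        expected: str              # correct behavior under the scenario
        likely_failures: list[str] = field(default_factory=list)

    upgrade_during_brownout = Scenario(
        name="Deprecated-plan upgrade during payment-provider brownout",
        preconditions=[
            "Customer is on a plan that is no longer offered",
            "Payment provider returns 5xx for roughly 30% of requests",
        ],
        trigger="Customer clicks Upgrade in the billing UI",
        expected="Upgrade is queued and retried; customer sees a clear pending state",
        likely_failures=[
            "Plan migration commits before payment confirmation",
            "No alert fires because the error rate sits below the paging threshold",
        ],
    )

Even if you never automate these, a uniform shape keeps runs comparable week to week and makes gaps, such as an empty likely_failures list, easy to spot.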


Step 2: Treat Reliability Like a Formal Program

The paper streetcar is not a vibes-based exercise. It’s part of a formal reliability program with:

  • Clear ownership (e.g., SRE, platform, or a rotating incident captain)
  • Defined cadence (daily, weekly, per on-call handoff)
  • Standard artifacts: checklists, run logs, metrics, and tickets

During each run:

  1. Pick one or more scenarios on the route.
  2. Walk through them step-by-step.
  3. Log what you discover: gaps, confusion, missing metrics, ambiguous behavior.
  4. Create follow-up work items with owners and deadlines.

You are not just “role-playing incidents.” You are:

  • Testing your mental model of the system
  • Auditing how well your processes and tools support that model
  • Predicting failure modes before they become incidents

Over time, this becomes an input to your reliability roadmap: what to monitor, what to redesign, what to automate.
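
A standard run log is what turns individual walk-throughs into program data. A minimal sketch, assuming you keep one record per run; the field names and ticket IDs are illustrative, not a standard:

    from dataclasses import dataclass
    from datetime import date

    # One streetcar run = one log entry (names and values illustrative).
    @dataclass
    class StreetcarRun:
        run_date: date
        scenario: str
        findings: list[str]    # gaps, confusion, missing metrics, ambiguity
        follow_ups: list[str]  # work items created, each with an owner

    run = StreetcarRun(
        run_date=date(2024, 5, 14),
        scenario="Deprecated-plan upgrade during payment-provider brownout",
        findings=[
            "No dashboard separates provider 5xx from our own 5xx",
            "On-call did not know billing's escalation path",
        ],
        follow_ups=["REL-412 (owner: payments)", "REL-413 (owner: on-call docs)"],
    )

Because each entry names its follow-ups and owners, the log doubles as an audit trail for whether the program is actually producing change.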


Step 3: Make Scenarios Hands-On and Realistic

The value of the streetcar depends on how concrete the scenarios are. "What if payments fail?" is too vague. Instead, aim for hands-on, specific situations that mirror real-world incidents.

Layers to include:

  • Technical reality: actual systems, actual states (or realistic sandboxes)
  • Operational process: paging, escalation, handoffs, status updates
  • Customer impact: what the user sees, what support is told, what SLAs apply

For example, a good exercise might include:

  • Simulating (or imagining with real logs) a partial outage of a partner API
  • Checking dashboards: what signals would we see?
  • Drafting the internal incident announcement
  • Writing the customer-facing status page update
  • Determining how we’d validate that we’re truly recovered

The goal is to build the muscle memory your team will need mid-incident—in context, under a bit of time pressure, but with room to think.
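
If you want the partner-API step to be hands-on rather than purely imagined, a thin fault-injecting wrapper around a sandbox client is often enough. A minimal sketch; call_partner, the error type, and the 30% failure rate are all assumptions:

    import random

    # Fault-injection wrapper for sandbox drills (all names hypothetical).
    # Failing ~30% of calls lets the team watch which dashboards, retries,
    # and fallbacks actually react to a partial (not total) outage.
    class PartnerOutageError(Exception):
        pass

    def flaky_partner_call(request: dict, failure_rate: float = 0.3) -> dict:
        if random.random() < failure_rate:
            raise PartnerOutageError("injected 503 from partner sandbox")
        return call_partner(request)

    def call_partner(request: dict) -> dict:
        # stand-in for the real sandbox client
        return {"status": "ok", "echo": request}

Run a few hundred calls through the wrapper, then walk the operational layer: which alert fires first, what the internal announcement says, and how the status page update is worded.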


Step 4: Use Edge-Case Drills to Expose Communication Gaps

Your biggest surprises won’t just be technical—they’ll be human.

These analog runs often reveal issues like:

  • No one knows who owns the integration with a critical vendor.
  • Support and engineering use different terms for the same failure.
  • Legal or compliance constraints aren’t known to the incident commander.
  • The partner’s SLA or escalation path is unclear (or wishful thinking).

Design some scenarios that explicitly involve external parties or other internal teams:

  • "Our upstream provider returns malformed data for 0.1% of requests. What do we do? Who do we call?"
  • "A major customer’s integration breaks after they change their SSO setup. How do we coordinate with them?"

As you run the route, track these friction points:

  • Missing contact lists
  • Ambiguous responsibilities
  • Conflicting priorities between teams

Then treat these as reliability work—in the same backlog as technical fixes. A system is only as reliable as the social graph that supports it.


Step 5: Turn Insights into Concrete Improvements

If your streetcar just generates interesting conversations, it’s theater, not engineering.

Each run should end with:

  • A short debrief (10–15 minutes)
  • A list of concrete outcomes, such as:
    • New or refined alerts/metrics
    • Updated runbooks and escalation paths
    • Clarified ownership and documentation
    • Design changes or technical debt tickets

Make these outcomes visible:

  • Track them in your incident management tool or reliability backlog.
  • Tag them (e.g., source:streetcar) so you can report on impact.
  • Review them in reliability reviews or post-incident retros.
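
The tag is only useful if you can report on it. A minimal sketch, assuming your tracker can export the backlog to CSV with tags and status columns (both column names are assumptions about your tool):

    import csv
    from collections import Counter

    # Count streetcar-sourced work items by status from a CSV backlog export.
    # The "tags" and "status" column names are assumptions about the export.
    def streetcar_summary(path: str) -> Counter:
        counts: Counter = Counter()
        with open(path, newline="") as f:
            for row in csv.DictReader(f):
                if "source:streetcar" in row.get("tags", ""):
                    counts[row.get("status", "unknown")] += 1
        return counts

    # e.g. streetcar_summary("backlog.csv") -> Counter({'done': 14, 'open': 6})

A count of closed versus open streetcar items is a simple, honest measure of whether the ritual is producing change or just conversation.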

Over a few weeks, you should see:

  • Fewer "unknown unknowns" during real incidents
  • Faster time to triage and mitigation
  • Better alignment between teams (engineering, ops, support, product, legal)

The streetcar is the front door for discovering weak points. The actual value comes from systematically closing them.


Step 6: Combine Systems Thinking with Design Discipline

Defining edge cases is both an architecture problem and a design problem.

Use high-level systems thinking to:

  • Map critical flows and dependencies
  • Identify where coupling, retries, and backpressure might fail
  • Understand failure domains (per region, per vendor, per tenant)

Then apply low-level design discipline to:

  • Specify exact inputs, states, and transitions
  • Document contracts: what do we promise under degraded conditions?
  • Write scenario checklists at the level of real APIs, queues, and flags

For example, don’t just say:

“Rate limiting might cause issues for large customers.”

Instead, define:

"Tenant A reaches 95% of API limit on Service X while Service Y is under maintenance, causing delayed webhooks and stale dashboards for 20 minutes. How do we detect, communicate, and recover?"

The paper streetcar is where these high-level and low-level views meet and get tested in practice.


Step 7: Make the Streetcar Permanent, Not a One-Off

The biggest failure mode of exercises like this is treating them as special events:

  • “We did an incident game day last quarter; we’re good.”

Real systems and organizations change constantly. New:

  • Features
  • Teams
  • Vendors
  • Customers

…all introduce new edges.

That’s why the streetcar must be:

  • Permanent – part of your reliability operating model
  • Predictable – on the calendar with clear expectations
  • Evolving – scenarios added, retired, and refined over time

Cement it into your workflow by:

  • Making it a standard part of on-call onboarding
  • Including it in incident commander training
  • Reporting key learnings to leadership regularly

When done well, the streetcar becomes as normal as standups or retros—a quiet, continuous force improving how you handle the weirdest 1% of reality.


Conclusion: Reliability Lives at the Edges

Your system’s real reliability isn’t measured on the happy path. It’s measured when:

  • A vendor’s SLA gets quietly violated
  • A legacy customer hits a code path no one’s touched in three years
  • Multiple “rare” things go wrong at the same time

You can’t anticipate every edge case, but you can decide to practice living at the edges.

The Paper Reliability Streetcar is a simple, analog way to do exactly that:

  • A daily (or weekly) route through your strangest scenarios
  • A structured program that surfaces technical, operational, and communication gaps
  • A repeatable engine that turns weird situations into concrete improvements

Start small: pick three edge cases, write them down, and walk through them with your team this week. Then do it again. And again.

Over time, you’ll notice something powerful: the incidents that once felt like freak accidents now feel like drills you’ve already run. And that’s what real reliability looks like.
