The Paper Reliability Streetcar: Riding a Daily Analog Route Through Your System’s Weirdest Edge Cases
How to build a repeatable, low-tech “paper streetcar” ritual that exercises your system’s strangest edge cases, exposes hidden failure modes, and permanently upgrades your incident readiness and system design.
Introduction: Why Your Worst Incidents Are Never the Ones You Practiced For
Most reliability programs are great at the obvious stuff: load tests, uptime dashboards, paging on error spikes. Yet the incidents that really hurt rarely come from the obvious paths.
They come from:
- The misconfigured partner webhook that only fires once a quarter
- The odd payment flow that only applies to one legacy region
- The partial outage where DNS, auth, and feature flags all disagree at once
In other words: your weirdest edge cases.
These edges are exactly where your monitoring is weakest, your playbooks are vaguest, and your team is least practiced. And that’s a problem reliability tooling alone can’t fix.
This is where the Paper Reliability Streetcar comes in: a daily analog route through your system’s strangest paths. It’s a manual, structured ritual that:
- Exercises non-obvious behaviors on purpose
- Reveals gaps in design, monitoring, and communication
- Turns edge cases into regular practice instead of rare surprises
Think of it as a low-tech, high-value reliability workout for your system and your team.
What Is the “Paper Reliability Streetcar”?
The metaphor comes from a simple idea: a streetcar follows the same route every day, hitting the same stops, in all weather. The "paper" part means this route is:
- Manual – run by humans, not automation (at least initially)
- Scripted – defined as a checklist, scenario, or runbook
- Repeatable – done on a regular cadence (daily, weekly, or per on-call shift)
Your paper streetcar is a fixed set of edge-case scenarios that you:
- Walk through as if they’re happening right now
- Observe what breaks (or would break)
- Capture improvements for design, tooling, and process
This isn’t chaos engineering in production, and it’s not a synthetic test suite. It’s more like a tabletop exercise for the oddest 1% of your system behavior, done often enough that your team becomes fluent in handling the bizarre.
Step 1: Define Your Daily Analog Route
Start by designing the route your streetcar will take. The key rule: avoid the happy path.
Good sources of edge cases:
- Past incidents that involved multiple interacting failures
- Silent failures (e.g., dropped events, delayed jobs, partial data writes)
- Partner or vendor dependencies (payment providers, SMS gateways, SSO, CDNs)
- Policy- or region-specific logic (country-specific rules, legacy customer tiers)
- Rare lifecycle moments (account closure, subscription reactivation, data migration)
Turn these into scenarios like:
- “Customer in a deprecated plan attempts to upgrade during a partial payment-provider outage.”
- “A webhook from a third party fires twice with slightly different payloads; our system processes both.”
- “A large tenant hits a rate limit in only one microservice, causing inconsistent state across services.”
For each scenario, define a minimal analog script:
- Preconditions (what must be true in the system?)
- Trigger (what action or event starts this scenario?)
- Expected correct behavior
- Likely failure modes (from your current understanding)
Your initial route might be just 3–5 scenarios. The goal is not to cover everything, but to practice something non-obvious every single time.
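To keep scenario cards consistent from run to run, it can help to give them a fixed shape. Here is a minimal sketch in Python, assuming you keep scenario cards in version control; the field names and the example card mirror the lists above and are illustrative, not a prescribed schema.

```python
from dataclasses import dataclass, field

@dataclass
class ScenarioCard:
    """One stop on the paper streetcar route: a minimal analog script."""
    title: str
    preconditions: list[str]       # what must be true in the system
    trigger: str                   # the action or event that starts the scenario
    expected_behavior: str         # what "correct" looks like
    likely_failure_modes: list[str] = field(default_factory=list)

# Example card based on one of the scenarios above (details are illustrative).
deprecated_plan_upgrade = ScenarioCard(
    title="Deprecated-plan upgrade during a partial payment-provider outage",
    preconditions=[
        "Customer is on a plan that is no longer sold",
        "Primary payment provider is degraded in one region",
    ],
    trigger="Customer clicks 'Upgrade' in the billing UI",
    expected_behavior="Upgrade is queued and retried; customer sees a clear pending state",
    likely_failure_modes=[
        "Upgrade silently fails and plan state diverges from billing",
        "Retry storm against the already-degraded provider",
    ],
)
```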
Step 2: Treat Reliability Like a Formal Program
The paper streetcar is not a vibes-based exercise. It’s part of a formal reliability program with:
- Clear ownership (e.g., SRE, platform, or a rotating incident captain)
- Defined cadence (daily, weekly, per on-call handoff)
- Standard artifacts: checklists, run logs, metrics, and tickets
During each run:
- Pick one or more scenarios on the route.
- Walk through them step-by-step.
- Log what you discover: gaps, confusion, missing metrics, ambiguous behavior.
- Create follow-up work items with owners and deadlines.
You are not just “role-playing incidents.” You are:
- Testing your mental model of the system
- Monitoring how your processes and tools support that model
- Predicting failure modes before they become incidents
Over time, this becomes an input to your reliability roadmap: what to monitor, what to redesign, what to automate.
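To keep those run logs and follow-ups from evaporating, it helps to capture every run in the same structured shape. A minimal sketch, assuming you append run records to a JSON Lines file; the file name and fields are illustrative, not a required format.

```python
import json
from datetime import date, datetime, timezone

def log_streetcar_run(scenario_title: str, findings: list[str],
                      follow_ups: list[dict], path: str = "streetcar_runs.jsonl") -> None:
    """Append one streetcar run to a JSON Lines log (path is illustrative)."""
    record = {
        "run_at": datetime.now(timezone.utc).isoformat(),
        "scenario": scenario_title,
        "findings": findings,        # gaps, confusion, missing metrics, ambiguity
        "follow_ups": follow_ups,    # each item needs an owner and a due date
    }
    with open(path, "a") as f:
        f.write(json.dumps(record) + "\n")

log_streetcar_run(
    scenario_title="Duplicate webhook with divergent payloads",
    findings=["No dashboard shows webhook dedupe rate", "Runbook never mentions replay"],
    follow_ups=[{"item": "Add dedupe-rate metric", "owner": "payments-team",
                 "due": str(date.today())}],
)
```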
Step 3: Make Scenarios Hands-On and Realistic
The value of the streetcar depends on how concrete the scenarios are. "What if payments fail?" is too vague. Instead, aim for hands-on, specific situations that mirror real-world incidents.
Layers to include:
- Technical reality: actual systems, actual states (or realistic sandboxes)
- Operational process: paging, escalation, handoffs, status updates
- Customer impact: what the user sees, what support is told, what SLAs apply
For example, a good exercise might include:
- Simulating (or imagining with real logs) a partial outage of a partner API
- Checking dashboards: what signals would we see?
- Drafting the internal incident announcement
- Writing the customer-facing status page update
- Determining how we’d validate that we’re truly recovered
The goal is to practice the muscle movements your team will need mid-incident—in context, under a bit of time pressure, but with room to think.
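One way to keep a drill concrete without building tooling is to script the walk-through itself. A minimal sketch, assuming someone runs it in a terminal during the exercise; the steps echo the example above and the prompts are illustrative.

```python
# Minimal interactive drill runner: walks the team through one scenario's steps
# and records what was observed at each stop. All text is illustrative.
DRILL_STEPS = [
    "Simulate (or replay logs for) a partial outage of the partner API",
    "Check dashboards: which signals would we actually see?",
    "Draft the internal incident announcement",
    "Write the customer-facing status page update",
    "Decide how we would validate that we are truly recovered",
]

def run_drill(steps: list[str]) -> list[tuple[str, str]]:
    """Prompt for an observation at each step; return (step, observation) pairs."""
    observations = []
    for i, step in enumerate(steps, start=1):
        print(f"Step {i}/{len(steps)}: {step}")
        note = input("What did we see / decide / get stuck on? > ")
        observations.append((step, note))
    return observations

if __name__ == "__main__":
    for step, note in run_drill(DRILL_STEPS):
        print(f"- {step}: {note}")
```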
Step 4: Use Edge-Case Drills to Expose Communication Gaps
Your biggest surprises won’t just be technical—they’ll be human.
These analog runs often reveal issues like:
- No one knows who owns the integration with a critical vendor.
- Support and engineering use different terms for the same failure.
- Legal or compliance constraints aren’t known to the incident commander.
- The partner’s SLA or escalation path is unclear (or wishful thinking).
Design some scenarios that explicitly involve external parties or other internal teams:
- "Our upstream provider returns malformed data for 0.1% of requests. What do we do? Who do we call?"
- "A major customer’s integration breaks after they change their SSO setup. How do we coordinate with them?"
As you run the route, track these friction points:
- Missing contact lists
- Ambiguous responsibilities
- Conflicting priorities between teams
Then treat these as reliability work—in the same backlog as technical fixes. A system is only as reliable as the social graph that supports it.
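Some of these friction points can be caught even before a drill by checking that basic ownership data exists at all. A minimal sketch, assuming a small in-repo registry of external dependencies; the dependency names and fields here are made-up placeholders.

```python
# Flag external dependencies whose ownership or escalation data is missing.
# The registry contents below are illustrative placeholders.
DEPENDENCIES = {
    "payments-provider": {"owner": "billing-team", "escalation_contact": "", "sla_doc": "link"},
    "sms-gateway": {"owner": "", "escalation_contact": "+1-555-0100", "sla_doc": ""},
}

REQUIRED_FIELDS = ("owner", "escalation_contact", "sla_doc")

def missing_ownership(deps: dict[str, dict[str, str]]) -> dict[str, list[str]]:
    """Return, per dependency, which required fields are empty or absent."""
    gaps = {}
    for name, info in deps.items():
        empty = [f for f in REQUIRED_FIELDS if not info.get(f)]
        if empty:
            gaps[name] = empty
    return gaps

for dep, fields in missing_ownership(DEPENDENCIES).items():
    print(f"{dep}: missing {', '.join(fields)}")
```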
Step 5: Turn Insights into Concrete Improvements
If your streetcar just generates interesting conversations, it’s theater, not engineering.
Each run should end with:
- A short debrief (10–15 minutes)
- A list of concrete outcomes, such as:
- New or refined alerts/metrics
- Updated runbooks and escalation paths
- Clarified ownership and documentation
- Design changes or technical debt tickets
Make these outcomes visible:
- Track them in your incident management tool or reliability backlog.
- Tag them (e.g., source:streetcar) so you can report on impact.
- Review them in reliability reviews or post-incident retros.
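If follow-up items carry a consistent tag, reporting on impact becomes a small filtering job. A minimal sketch, assuming work items are exported as dictionaries with a `tags` list and a `status` field; the field names are illustrative rather than any specific tracker's API.

```python
# Summarize streetcar-sourced work items by status. The input shape is
# illustrative; adapt the field names to whatever your tracker exports.
from collections import Counter

def streetcar_impact(items: list[dict]) -> Counter:
    """Count items tagged source:streetcar, grouped by their status."""
    return Counter(
        item.get("status", "unknown")
        for item in items
        if "source:streetcar" in item.get("tags", [])
    )

items = [
    {"title": "Add webhook dedupe metric", "tags": ["source:streetcar"], "status": "done"},
    {"title": "Clarify vendor escalation path", "tags": ["source:streetcar"], "status": "open"},
    {"title": "Unrelated refactor", "tags": [], "status": "open"},
]
print(streetcar_impact(items))  # Counter({'done': 1, 'open': 1})
```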
Over a few weeks, you should see:
- Fewer "unknown unknowns" during real incidents
- Faster time to triage and mitigation
- Better alignment between teams (engineering, ops, support, product, legal)
The streetcar is the front door for discovering weak points. The actual value comes from systematically closing them.
Step 6: Combine Systems Thinking with Design Discipline
Defining edge cases is both an architecture problem and a design problem.
Use high-level systems thinking to:
- Map critical flows and dependencies
- Identify where coupling, retries, and backpressure might fail
- Understand failure domains (per region, per vendor, per tenant)
Then apply low-level design discipline to:
- Specify exact inputs, states, and transitions
- Document contracts: what do we promise under degraded conditions?
- Write scenario checklists at the level of real APIs, queues, and flags
For example, don’t just say:
“Rate limiting might cause issues for large customers.”
Instead, define:
"Tenant A reaches 95% of API limit on Service X while Service Y is under maintenance, causing delayed webhooks and stale dashboards for 20 minutes. How do we detect, communicate, and recover?"
The paper streetcar is where these high-level and low-level views meet and get tested in practice.
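To make that low-level specificity checkable, you can encode the thresholds and the degraded-mode promise alongside the scenario. A minimal sketch in Python; the limits, service names, detection signal, and promised behavior are illustrative assumptions, not real contract values.

```python
from dataclasses import dataclass

@dataclass
class DegradedModeContract:
    """What we promise under a specific degraded condition (values illustrative)."""
    condition: str
    detection_signal: str
    max_staleness_minutes: int
    customer_communication: str

tenant_rate_limit_case = DegradedModeContract(
    condition="Tenant A at >=95% of Service X API limit while Service Y is in maintenance",
    detection_signal="alert when Service X rate-limit utilization for Tenant A >= 0.95",
    max_staleness_minutes=20,
    customer_communication="Status page note: dashboards and webhooks may lag up to 20 minutes",
)

def breaches_contract(observed_staleness_minutes: int,
                      contract: DegradedModeContract) -> bool:
    """True if observed staleness exceeds what the contract allows."""
    return observed_staleness_minutes > contract.max_staleness_minutes

print(breaches_contract(35, tenant_rate_limit_case))  # True: worse than promised
```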
Step 7: Make the Streetcar Permanent, Not a One-Off
The biggest failure mode of exercises like this is treating them as special events:
- “We did an incident game day last quarter; we’re good.”
Real systems and organizations change constantly. New:
- Features
- Teams
- Vendors
- Customers
…all introduce new edges.
That’s why the streetcar must be:
- Permanent – part of your reliability operating model
- Predictable – on the calendar with clear expectations
- Evolving – scenarios added, retired, and refined over time
Cement it into your workflow by:
- Making it a standard part of on-call onboarding
- Including it in incident commander training
- Reporting key learnings to leadership regularly
When done well, the streetcar becomes as normal as standups or retros—a quiet, continuous force improving how you handle the weirdest 1% of reality.
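Keeping the ride predictable can be as simple as letting the calendar pick the stop. A minimal sketch that rotates through the route one scenario per day, assuming the route lives in a plain list; the scenario titles are illustrative.

```python
from __future__ import annotations

from datetime import date

# The route: an ordered list of scenario titles (illustrative placeholders).
ROUTE = [
    "Deprecated-plan upgrade during partial payment-provider outage",
    "Duplicate webhook with divergent payloads",
    "Single-service rate limit causing cross-service inconsistency",
]

def todays_stop(route: list[str], today: date | None = None) -> str:
    """Rotate through the route one stop per calendar day, in order, forever."""
    today = today or date.today()
    return route[today.toordinal() % len(route)]

print(f"Today's streetcar stop: {todays_stop(ROUTE)}")
```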
Conclusion: Reliability Lives at the Edges
Your system’s real reliability isn’t measured on the happy path. It’s measured when:
- A vendor’s SLA gets quietly violated
- A legacy customer hits a code path no one’s touched in three years
- Multiple “rare” things go wrong at the same time
You can’t anticipate every edge case, but you can decide to practice living at the edges.
The Paper Reliability Streetcar is a simple, analog way to do exactly that:
- A daily (or weekly) route through your strangest scenarios
- A structured program that surfaces technical, operational, and communication gaps
- A repeatable engine that turns weird situations into concrete improvements
Start small: pick three edge cases, write them down, and walk through them with your team this week. Then do it again. And again.
Over time, you’ll notice something powerful: the incidents that once felt like freak accidents now feel like drills you’ve already run. And that’s what real reliability looks like.