The Analog Incident Streetcar Switchboard: Walking Paper Routes Through Split‑Brain Outages
How a fictional ‘Analog Incident Streetcar Switchboard’ outage helps us understand split‑brain, cascading failures, and why tabletop exercises and better quorum design matter for real distributed systems.
Introduction
Imagine this: your incident channel is on fire, dashboards are red, services are flapping in and out of life—and yet each subsystem claims “I’m fine” when you look at it in isolation. Alerts conflict, logs disagree, and your mental model of the system collapses under pressure.
That’s what a split‑brain outage feels like.
To explore this, we’ll use a fictional but realistic scenario: the Analog Incident Streetcar Switchboard—an old‑school control panel that routes “streetcars” (incidents) across your system. It’s part parody, part teaching tool, and a surprisingly effective way to think about complex, cascading failures.
We’ll walk through:
- What the “Analog Incident Streetcar Switchboard” case study looks like
- How distributed failures propagate and turn small problems into big outages
- Why “2+1” split‑brain scenarios are so dangerous
- How quorum design and tabletop exercises make real systems more resilient
- Why sharing post‑incident lessons widely is non‑optional
The Analog Incident Streetcar Switchboard (AISS)
Picture a large, physical switchboard on a wall:
- Each streetcar line = a critical service (auth, billing, search, messaging)
- Each switch = a routing decision (which region, which database, which cache)
- Each lamp = health signal or alert
- A small printed map pinned next to it = your system diagram
This is your Analog Incident Streetcar Switchboard (AISS). On a normal day, streetcars run on time. Lights blink in predictable patterns. Operators (your on‑call engineers) know the rhythms.
Now inject a serious incident.
The “Split‑Brain” Outage on the AISS
A major network event hits one of your core regions:
- Links between Region A and Region B start dropping packets.
- Monitoring in Region C begins to see inconsistent heartbeats.
- Some operators’ consoles show A as primary; others show B as primary.
On the Switchboard:
- The Region A lamp is green locally but red on the aggregated dashboard.
- The Region B lamp is flickering between amber and green.
- Streetcars for “auth” are routed to both A and B, depending on which panel you look at.
You’ve entered a split‑brain world: different parts of the system disagree on who’s alive, who’s primary, and where writes should go.
This is where real outages stop being “a bug in X” and become “a complex system accident” involving:
- Partial network partitions
- Conflicting timeouts and retries
- Inconsistent quorum decisions
- Human operators working from different mental models
How Distributed Failures Cascade Like Streetcars
Distributed systems rarely fail cleanly. They fail like tangled train lines at rush hour.
Key failure properties:
- Local becomes global, fast. A quiet local problem (a single node misbehaving, a flapping link) can:
  - Trigger retries and thundering herds (see the sketch below)
  - Blow out connection pools
  - Overload neighboring nodes
  - Knock over shared dependencies (e.g., a central auth service)
- State disagreement spreads. Different nodes see different realities:
  - Node A: “I can’t see Node B; I’ll become leader.”
  - Node B: “I can’t see Node A; I’ll become leader.”
  - Clients: “Sometimes A is leader, sometimes B is leader; I’ll just keep retrying both.”
- Observability lies by omission. When links are broken, metrics and logs from the isolated side may not make it to your central store. You see:
  - Gaps in dashboards
  - Missing logs right where you need them
  - Healthy‑looking panels hiding invisible failures
On the AISS, this translates to streetcars being routed onto tracks that look fine but are actually dead‑ends, while other lines silently pile up in a tunnel you can’t see.
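That first property, retries and thundering herds, is easy to underestimate. Here is a rough back‑of‑the‑envelope sketch in Python (the rates and retry counts are invented for illustration, not measurements from any real system) showing how a few layers of independent retries multiply the load on a struggling dependency:

```python
# Rough retry-amplification arithmetic: how a slow dependency turns a normal
# request rate into a traffic spike. All numbers are invented for illustration.
incoming_rps = 1_000     # steady client request rate
fanout_layers = 3        # e.g. edge -> app -> data layer, each retrying its own call
attempts_per_call = 3    # 1 original attempt + 2 retries on timeout

load = incoming_rps
for layer in range(1, fanout_layers + 1):
    load *= attempts_per_call
    print(f"after layer {layer}: up to {load:,} attempts/s hitting the next hop")

# after layer 1: up to 3,000 attempts/s hitting the next hop
# after layer 2: up to 9,000 attempts/s hitting the next hop
# after layer 3: up to 27,000 attempts/s hitting the next hop
```

The quiet local problem is the slow dependency; the global problem is the 27x traffic it now receives from callers that are all “helping” by retrying.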
The “2+1” Split: When Quorum Rules Hurt Availability
Most modern distributed systems use some variant of consensus (e.g., Raft, Paxos) with quorum‑based decisions:
- With 3 nodes, quorum is 2.
- With 5 nodes, quorum is 3.
That’s great for consistency, but network partitions still produce nasty patterns.
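The majority arithmetic is worth keeping at your fingertips. A minimal sketch in plain Python, not tied to any particular consensus library:

```python
# Majority-quorum arithmetic for a cluster of voting members.
def quorum_size(voters: int) -> int:
    """Smallest strict majority of the voting set."""
    return voters // 2 + 1

def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still exists."""
    return voters - quorum_size(voters)

for n in (3, 4, 5, 7):
    print(f"{n} voters -> quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")

# 3 voters -> quorum 2, tolerates 1 failure(s)
# 4 voters -> quorum 3, tolerates 1 failure(s)   (an even count buys no extra tolerance)
# 5 voters -> quorum 3, tolerates 2 failure(s)
# 7 voters -> quorum 4, tolerates 3 failure(s)
```

Note that 4 voters tolerate no more failures than 3, which is one reason odd cluster sizes are the norm.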
The 2+1 Scenario
Consider a 3‑node cluster (N1, N2, N3):
- A partial partition isolates N3.
- N1 and N2 can still communicate.
Result:
- N1 + N2 maintain quorum and can continue as the authoritative side.
- N3 loses quorum and steps down. Ideally, it becomes read‑only or fully unavailable.
This is exactly what you want for consistency. But operationally, on the AISS, it looks like this:
- The panel near N3’s racks shows that its local node is “up but lonely”.
- A regional status page that only queries N3 shows green.
- Global control planes show the cluster as degraded but serving (thanks to N1 + N2).
You get conflicting signals:
- "My node is up!" (N3)
- "The cluster is up!" (N1 + N2)
- "Some clients are timing out" (because traffic is stickied to N3’s zone)
Even though the system protected you from data corruption, the user experience still degrades and operators may spend precious time arguing with their dashboards.
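To make the 2+1 shape concrete, here is a minimal, self‑contained sketch (illustrative only, not a real Raft or Paxos implementation) in which each node decides its role purely from the peers it can currently reach:

```python
# 2+1 partition sketch: which side of a partial partition keeps quorum?
# Illustrative only; real consensus systems add terms, leases, heartbeats, etc.
CLUSTER = {"N1", "N2", "N3"}
QUORUM = len(CLUSTER) // 2 + 1  # 2 of 3

# Partial partition: N3 can no longer reach N1 or N2 (but can still see itself).
reachable = {
    "N1": {"N1", "N2"},
    "N2": {"N1", "N2"},
    "N3": {"N3"},
}

for node in sorted(CLUSTER):
    visible = reachable[node]
    if len(visible) >= QUORUM:
        print(f"{node}: sees {sorted(visible)} -> majority side, may keep serving writes")
    else:
        print(f"{node}: sees {sorted(visible)} -> minority side, steps down (read-only or unavailable)")

# N1: sees ['N1', 'N2'] -> majority side, may keep serving writes
# N2: sees ['N1', 'N2'] -> majority side, may keep serving writes
# N3: sees ['N3'] -> minority side, steps down (read-only or unavailable)
```

N3 is perfectly healthy as a machine, which is exactly why its local panel and a status page scoped to its zone can both look green while writes routed its way fail.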
Why Single‑Node Failure Should Not Kill Quorum
This is why well‑designed clusters try to ensure that a single node loss does not take down quorum:
- 3‑node clusters tolerate 1 faulty node.
- 5‑node clusters tolerate 2 faulty nodes.
In practice, that means:
- Carefully choosing replication factors and failure domains (AZs, racks, regions)
- Avoiding designs where one “special” node’s loss kills the whole system
Done right, you get better resilience and cost‑effectiveness for most workloads: you pay for a few extra nodes but avoid entire‑cluster outages when a single machine or link misbehaves.
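A useful sanity check is to ask, for every single failure domain, whether losing it still leaves a majority of replicas. A minimal sketch (the AZ names and placements below are invented for illustration):

```python
# Check whether losing any one failure domain (AZ, rack, region) still leaves
# a majority of replicas. Placements below are invented for illustration.
from collections import Counter

def survives_single_domain_loss(placement: dict[str, str]) -> bool:
    """placement maps replica name -> the failure domain it lives in."""
    total = len(placement)
    quorum = total // 2 + 1
    biggest_domain = max(Counter(placement.values()).values())
    # Losing the most heavily loaded domain must still leave a majority.
    return total - biggest_domain >= quorum

spread  = {"n1": "az-a", "n2": "az-b", "n3": "az-c"}   # one replica per AZ
stacked = {"n1": "az-a", "n2": "az-a", "n3": "az-b"}   # two replicas share az-a

print(survives_single_domain_loss(spread))   # True: any single AZ loss leaves 2 of 3
print(survives_single_domain_loss(stacked))  # False: losing az-a leaves only 1 of 3
```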
Still, no amount of good quorum math prevents confusing 2+1 half‑degraded scenarios. You need process as much as you need protocol.
Tabletop Exercises: Walking Paper Routes Through the Outage
You don’t want the first time you work through a complex incident to be in production.
A powerful practice is the tabletop exercise: a structured, low‑stress rehearsal using detailed, realistic scenarios.
For something like the AISS split‑brain outage, a tabletop might look like this:
1. Prepare the scenario
   - Network partition between Region A and Region B
   - Intermittent packet loss rather than a clean cut
   - Some health checks succeed, others fail
   - A few dependent services (e.g., queues, caches) behave differently in each region
2. Print the paper (see the sketch after this list)
   - System diagrams
   - Simplified logs and metrics snapshots at T+5, T+15, T+30
   - User reports (support tickets, synthetic checks, status page complaints)
3. Assign roles
   - Incident commander
   - Comms lead (status page, internal updates)
   - Operators for each subsystem (DB, network, app, observability)
4. Walk the “streetcar routes”
   - Follow a user request as it hits: DNS → edge → app → cache → DB
   - Show how the same request behaves differently if it lands in Region A vs Region B
   - Surface conflicting signals from different views of the system
5. Force the hard decisions
   - When do you fail traffic fully to one region?
   - When do you demote a region’s leader?
   - What signals are “authoritative” during disagreement?
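If you want those handouts to be reproducible from one run of the exercise to the next, one lightweight option (purely a sketch; the timings and wording are invented) is to keep the injects as data and print whatever the facilitator should have revealed by each timestamp:

```python
# Tabletop inject schedule kept as plain data so the same scenario can be
# replayed or tweaked later. Timings and wording are invented for illustration.
INJECTS = [
    (0,  "Links between Region A and Region B start dropping ~20% of packets."),
    (5,  "Region C monitoring reports inconsistent heartbeats for the auth cluster."),
    (15, "Region B status page is green, but support tickets mention login failures."),
    (30, "Half the operator consoles show Region A as primary, half show Region B."),
]

def handout(minute: int) -> None:
    """Print everything the facilitator has revealed up to T+<minute>."""
    print(f"--- handout at T+{minute} ---")
    for t, text in INJECTS:
        if t <= minute:
            print(f"T+{t:>2}: {text}")

handout(15)
# --- handout at T+15 ---
# T+ 0: Links between Region A and Region B start dropping ~20% of packets.
# T+ 5: Region C monitoring reports inconsistent heartbeats for the auth cluster.
# T+15: Region B status page is green, but support tickets mention login failures.
```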
Tabletops like this build muscle memory and reveal design flaws while the cost is low.
Sharing Post‑Incident Learnings: From One Switchboard to Many
The AISS outage shouldn’t just be a story told by the people who were on call that day. It should become a case study that informs:
- How you design new services
- How you configure quorums and failover
- How you build observability and status pages
That only happens if you share lessons widely.
Effective patterns:
- Blameless post‑incident reviews published internally
- Cross‑team debriefs where teams present their view of the incident timeline
- Architecture guidelines updated with examples from the outage
- Training materials (runbooks, tabletop decks, onboarding content) that reuse the same scenario
The goal is to move beyond “we fixed that service” and toward “we changed how we design systems across the organization.”
When the next team designs a new control plane, they should already be thinking:
- How will this behave in a 2+1 split?
- What happens if the observability pipeline itself is partitioned?
- Which signals are authoritative when views conflict?
That’s the real payoff of a vivid case like the Analog Incident Streetcar Switchboard.
Conclusion
The Analog Incident Streetcar Switchboard is fictional, but the patterns it illustrates are painfully real:
- Distributed failures propagate quickly and unpredictably.
- Split‑brain and 2+1 scenarios create disagreement, not just downtime.
- Quorum design should ensure that single‑node failures don’t take down the system.
- Even with good design, people and process decide how bad the outage gets.
By:
- Running tabletop exercises with realistic, messy scenarios
- Designing clusters and failure domains that tolerate single‑node loss without quorum loss
- And sharing post‑incident learnings widely across your organization
…you can turn confusing, stressful outages into powerful, reusable lessons.
In other words: don’t wait until the streetcars are piled up in the tunnel. Walk the routes on paper, rehearse the hard calls, and let one Analog Incident Streetcar Switchboard incident improve the entire city of your systems.