The Analog Incident Streetcar Switchboard: Walking Paper Routes Through Split‑Brain Outages
How a fictional ‘Analog Incident Streetcar Switchboard’ outage helps us understand split‑brain, cascading failures, and why tabletop exercises and better quorum design matter for real distributed systems.
Introduction
Imagine this: your incident channel is on fire, dashboards are red, services are flapping in and out of life—and yet each subsystem claims “I’m fine” when you look at it in isolation. Alerts conflict, logs disagree, and your mental model of the system collapses under pressure.
That’s what a split‑brain outage feels like.
To explore this, we’ll use a fictional but realistic scenario: the Analog Incident Streetcar Switchboard—an old‑school control panel that routes “streetcars” (incidents) across your system. It’s part parody, part teaching tool, and a surprisingly effective way to think about complex, cascading failures.
We’ll walk through:
- What the “Analog Incident Streetcar Switchboard” case study looks like
- How distributed failures propagate and turn small problems into big outages
- Why “2+1” split‑brain scenarios are so dangerous
- How quorum design and tabletop exercises make real systems more resilient
- Why sharing post‑incident lessons widely is non‑optional
The Analog Incident Streetcar Switchboard (AISS)
Picture a large, physical switchboard on a wall:
- Each streetcar line = a critical service (auth, billing, search, messaging)
- Each switch = a routing decision (which region, which database, which cache)
- Each lamp = health signal or alert
- A small printed map pinned next to it = your system diagram
This is your Analog Incident Streetcar Switchboard (AISS). On a normal day, streetcars run on time. Lights blink in predictable patterns. Operators (your on‑call engineers) know the rhythms.
Now inject a serious incident.
The “Split‑Brain” Outage on the AISS
A major network event hits one of your core regions:
- Links between Region A and Region B start dropping packets.
- Monitoring in Region C begins to see inconsistent heartbeats.
- Some operators’ consoles show A as primary; others show B as primary.
On the Switchboard:
- The Region A lamp is green locally but red on the aggregated dashboard.
- The Region B lamp is flickering between amber and green.
- Streetcars for “auth” are routed to both A and B, depending on which panel you look at.
You’ve entered a split‑brain world: different parts of the system disagree on who’s alive, who’s primary, and where writes should go.
This is where real outages stop being “a bug in X” and become “a complex system accident” involving:
- Partial network partitions
- Conflicting timeouts and retries
- Inconsistent quorum decisions
- Human operators working from different mental models
How Distributed Failures Cascade Like Streetcars
Distributed systems rarely fail cleanly. They fail like tangled train lines at rush hour.
Key failure properties:
- Local becomes global, fast. A quiet local problem (a single node misbehaving, a flapping link) can:
  - Trigger retries and thundering herds (see the sketch below)
  - Blow out connection pools
  - Overload neighboring nodes
  - Knock over shared dependencies (e.g., a central auth service)
- State disagreement spreads. Different nodes see different realities:
  - Node A: “I can’t see Node B; I’ll become leader.”
  - Node B: “I can’t see Node A; I’ll become leader.”
  - Clients: “Sometimes A is leader, sometimes B is leader; I’ll just keep retrying both.”
- Observability lies by omission. When links are broken, metrics and logs from the isolated side may not make it to your central store. You see:
  - Gaps in dashboards
  - Missing logs right where you need them
  - Healthy‑looking panels hiding invisible failures
On the AISS, this translates to streetcars being routed onto tracks that look fine but are actually dead‑ends, while other lines silently pile up in a tunnel you can’t see.
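That first property, retries and thundering herds, is easy to underestimate. Here is a rough back‑of‑the‑envelope sketch in Python (the rates and retry counts are invented for illustration, not measurements from any real system) showing how a few layers of independent retries multiply the load on a struggling dependency:

```python
# Rough retry-amplification arithmetic: how a slow dependency turns a normal
# request rate into a traffic spike. All numbers are invented for illustration.
incoming_rps = 1_000     # steady client request rate
fanout_layers = 3        # e.g. edge -> app -> data layer, each retrying its own call
attempts_per_call = 3    # 1 original attempt + 2 retries on timeout

load = incoming_rps
for layer in range(1, fanout_layers + 1):
    load *= attempts_per_call
    print(f"after layer {layer}: up to {load:,} attempts/s hitting the next hop")

# after layer 1: up to 3,000 attempts/s hitting the next hop
# after layer 2: up to 9,000 attempts/s hitting the next hop
# after layer 3: up to 27,000 attempts/s hitting the next hop
```

The quiet local problem is the slow dependency; the global problem is the 27x traffic it now receives from callers that are all “helping” by retrying.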
The “2+1” Split: When Quorum Rules Hurt Availability
Most modern distributed systems use some variant of consensus (e.g., Raft, Paxos) with quorum‑based decisions:
- With 3 nodes, quorum is 2.
- With 5 nodes, quorum is 3.
That’s great for consistency, but network partitions still produce nasty patterns.
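The majority arithmetic is worth keeping at your fingertips. A minimal sketch in plain Python, not tied to any particular consensus library:

```python
# Majority-quorum arithmetic for a cluster of voting members.
def quorum_size(voters: int) -> int:
    """Smallest strict majority of the voting set."""
    return voters // 2 + 1

def tolerated_failures(voters: int) -> int:
    """How many voters can be lost while a majority still exists."""
    return voters - quorum_size(voters)

for n in (3, 4, 5, 7):
    print(f"{n} voters -> quorum {quorum_size(n)}, tolerates {tolerated_failures(n)} failure(s)")

# 3 voters -> quorum 2, tolerates 1 failure(s)
# 4 voters -> quorum 3, tolerates 1 failure(s)   (an even count buys no extra tolerance)
# 5 voters -> quorum 3, tolerates 2 failure(s)
# 7 voters -> quorum 4, tolerates 3 failure(s)
```

Note that 4 voters tolerate no more failures than 3, which is one reason odd cluster sizes are the norm.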
The 2+1 Scenario
Consider a 3‑node cluster (N1, N2, N3):
- A partial partition isolates N3.
- N1 and N2 can still communicate.
Result:
- N1 + N2 maintain quorum and can continue as the authoritative side.
- N3 loses quorum and steps down. Ideally, it becomes read‑only or fully unavailable.
This is exactly what you want for consistency. But operationally, on the AISS, it looks like this:
- The panel near N3’s racks shows that its local node is “up but lonely”.
- A regional status page that only queries N3 shows green.
- Global control planes show the cluster as degraded but serving (thanks to N1 + N2).
You get conflicting signals:
- "My node is up!" (N3)
- "The cluster is up!" (N1 + N2)
- "Some clients are timing out" (because traffic is stickied to N3’s zone)
Even though the system protected you from data corruption, the user experience still degrades and operators may spend precious time arguing with their dashboards.
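To make the 2+1 shape concrete, here is a minimal, self‑contained sketch (illustrative only, not a real Raft or Paxos implementation) in which each node decides its role purely from the peers it can currently reach:

```python
# 2+1 partition sketch: which side of a partial partition keeps quorum?
# Illustrative only; real consensus systems add terms, leases, heartbeats, etc.
CLUSTER = {"N1", "N2", "N3"}
QUORUM = len(CLUSTER) // 2 + 1  # 2 of 3

# Partial partition: N3 can no longer reach N1 or N2 (but can still see itself).
reachable = {
    "N1": {"N1", "N2"},
    "N2": {"N1", "N2"},
    "N3": {"N3"},
}

for node in sorted(CLUSTER):
    visible = reachable[node]
    if len(visible) >= QUORUM:
        print(f"{node}: sees {sorted(visible)} -> majority side, may keep serving writes")
    else:
        print(f"{node}: sees {sorted(visible)} -> minority side, steps down (read-only or unavailable)")

# N1: sees ['N1', 'N2'] -> majority side, may keep serving writes
# N2: sees ['N1', 'N2'] -> majority side, may keep serving writes
# N3: sees ['N3'] -> minority side, steps down (read-only or unavailable)
```

N3 is perfectly healthy as a machine, which is exactly why its local panel and a status page scoped to its zone can both look green while writes routed its way fail.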
Why Single‑Node Failure Should Not Kill Quorum
This is why well‑designed clusters try to ensure that a single node loss does not take down quorum:
- 3‑node clusters tolerate 1 faulty node.
- 5‑node clusters tolerate 2 faulty nodes.
In practice, that means:
- Carefully choosing replication factors and failure domains (AZs, racks, regions)
- Avoiding designs where one “special” node’s loss kills the whole system
Done right, you get better resilience and cost‑effectiveness for most workloads: you pay for a few extra nodes but avoid entire‑cluster outages when a single machine or link misbehaves.
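A useful sanity check is to ask, for every single failure domain, whether losing it still leaves a majority of replicas. A minimal sketch (the AZ names and placements below are invented for illustration):

```python
# Check whether losing any one failure domain (AZ, rack, region) still leaves
# a majority of replicas. Placements below are invented for illustration.
from collections import Counter

def survives_single_domain_loss(placement: dict[str, str]) -> bool:
    """placement maps replica name -> the failure domain it lives in."""
    total = len(placement)
    quorum = total // 2 + 1
    biggest_domain = max(Counter(placement.values()).values())
    # Losing the most heavily loaded domain must still leave a majority.
    return total - biggest_domain >= quorum

spread  = {"n1": "az-a", "n2": "az-b", "n3": "az-c"}   # one replica per AZ
stacked = {"n1": "az-a", "n2": "az-a", "n3": "az-b"}   # two replicas share az-a

print(survives_single_domain_loss(spread))   # True: any single AZ loss leaves 2 of 3
print(survives_single_domain_loss(stacked))  # False: losing az-a leaves only 1 of 3
```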
Still, no amount of good quorum math prevents confusing 2+1 half‑degraded scenarios. You need process as much as you need protocol.
Tabletop Exercises: Walking Paper Routes Through the Outage
You don’t want the first time you work through a complex incident to be in production.
A powerful practice is the tabletop exercise: a structured, low‑stress rehearsal using detailed, realistic scenarios.
For something like the AISS split‑brain outage, a tabletop might look like this:
1. Prepare the scenario
   - Network partition between Region A and Region B
   - Intermittent packet loss rather than a clean cut
   - Some health checks succeed, others fail
   - A few dependent services (e.g., queues, caches) behave differently in each region
2. Print the paper (see the sketch after this list)
   - System diagrams
   - Simplified logs and metrics snapshots at T+5, T+15, T+30
   - User reports (support tickets, synthetic checks, status page complaints)
3. Assign roles
   - Incident commander
   - Comms lead (status page, internal updates)
   - Operators for each subsystem (DB, network, app, observability)
4. Walk the “streetcar routes”
   - Follow a user request as it hits: DNS → edge → app → cache → DB
   - Show how the same request behaves differently if it lands in Region A vs Region B
   - Surface conflicting signals from different views of the system
5. Force the hard decisions
   - When do you fail traffic fully to one region?
   - When do you demote a region’s leader?
   - What signals are “authoritative” during disagreement?
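If you want those handouts to be reproducible from one run of the exercise to the next, one lightweight option (purely a sketch; the timings and wording are invented) is to keep the injects as data and print whatever the facilitator should have revealed by each timestamp:

```python
# Tabletop inject schedule kept as plain data so the same scenario can be
# replayed or tweaked later. Timings and wording are invented for illustration.
INJECTS = [
    (0,  "Links between Region A and Region B start dropping ~20% of packets."),
    (5,  "Region C monitoring reports inconsistent heartbeats for the auth cluster."),
    (15, "Region B status page is green, but support tickets mention login failures."),
    (30, "Half the operator consoles show Region A as primary, half show Region B."),
]

def handout(minute: int) -> None:
    """Print everything the facilitator has revealed up to T+<minute>."""
    print(f"--- handout at T+{minute} ---")
    for t, text in INJECTS:
        if t <= minute:
            print(f"T+{t:>2}: {text}")

handout(15)
# --- handout at T+15 ---
# T+ 0: Links between Region A and Region B start dropping ~20% of packets.
# T+ 5: Region C monitoring reports inconsistent heartbeats for the auth cluster.
# T+15: Region B status page is green, but support tickets mention login failures.
```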
Tabletops like this build muscle memory and reveal design flaws while the cost is low.
Sharing Post‑Incident Learnings: From One Switchboard to Many
The AISS outage shouldn’t just be a story told by the people who were on call that day. It should become a case study that informs:
- How you design new services
- How you configure quorums and failover
- How you build observability and status pages
That only happens if you share lessons widely.
Effective patterns:
- Blameless post‑incident reviews published internally
- Cross‑team debriefs where teams present their view of the incident timeline
- Architecture guidelines updated with examples from the outage
- Training materials (runbooks, tabletop decks, onboarding content) that reuse the same scenario
The goal is to move beyond “we fixed that service” and toward “we changed how we design systems across the organization.”
When the next team designs a new control plane, they should already be thinking:
- How will this behave in a 2+1 split?
- What happens if the observability pipeline itself is partitioned?
- Which signals are authoritative when views conflict?
That’s the real payoff of a vivid case like the Analog Incident Streetcar Switchboard.
Conclusion
The Analog Incident Streetcar Switchboard is fictional, but the patterns it illustrates are painfully real:
- Distributed failures propagate quickly and unpredictably.
- Split‑brain and 2+1 scenarios create disagreement, not just downtime.
- Quorum design should ensure that single‑node failures don’t take down the system.
- Even with good design, people and process decide how bad the outage gets.
By:
- Running tabletop exercises with realistic, messy scenarios
- Designing clusters and failure domains that tolerate single‑node loss without quorum loss
- And sharing post‑incident learnings widely across your organization
…you can turn confusing, stressful outages into powerful, reusable lessons.
In other words: don’t wait until the streetcars are piled up in the tunnel. Walk the routes on paper, rehearse the hard calls, and let one Analog Incident Streetcar Switchboard incident improve the entire city of your systems.