The Analog Incident Train Station Waiting Room: Designing Slow Reliability Check‑Ins Between On‑Call Storms

Introduction: The Quiet Anxiety of the On‑Call “Waiting Room”

Being on‑call often feels less like heroic firefighting and more like sitting in a nearly empty train station late at night.

Nothing is happening. No trains are arriving. The monitors are quiet.

You pretend to read the metaphorical magazines—Slack, dashboards, email, logs—while a low‑grade anxiety hums underneath: When is the next storm coming?

Many reliability and platform engineers live in this rhythm: intense incident flares followed by deceptively calm lulls. Those lulls can feel lonely and invisible. You’re expected to be “fine,” to use the downtime “productively,” to stay sharp—but no one really asks how you are.

This post explores how to intentionally design those quiet periods as a kind of analog train station waiting room for reliability work—structured, recurring check‑ins that are:

Slow instead of frantic
Collaborative instead of lonely
Empathy‑first instead of metric‑only
Incremental instead of heroic

The goal: turn the time between incidents into a practice of slow reliability, so your team becomes more prepared, more resilient, and less burned out before the next storm arrives.

On‑Call as Emotional Weather: Naming the Reality

Before we design any rituals, we have to name the emotional landscape.

For many engineers, on‑call is:

Isolating – You’re the one holding the pager at 2 a.m. while others sleep.
Invisible – If nothing breaks, your work is “nothing happened”—which rarely gets recognition.
Performative – You’re supposed to appear calm and competent, even when you’re quietly dreading the next alert.
Cumulative – Each incident adds to a backlog of stress that rarely gets processed.

That means the quiet days are not purely “rest.” They’re often tense waiting:

“I’m supposed to be catching up on projects and writing docs, but I’m mentally braced for an incident that may or may not happen.”

If we ignore this reality, our reliability practices skew toward:

Only caring about humans during an incident
Only talking about systems after something breaks
Treating lulls as unstructured, unacknowledged background time

Instead, we can use the waiting room metaphor to design those lulls as intentional spaces for health checks—technical and human.

The Train Station Waiting Room as a Design Metaphor

Imagine a physical train station waiting room:

There are different zones: quiet corners, information desks, seating areas.
There are rituals: checking the schedule, glancing at the destination board, buying a snack, stretching your legs.
There is ambient awareness: the murmur of announcements, the presence of other travelers, the sense of shared purpose.

Now, apply this to your reliability practice between incidents.

Your organization’s “waiting room” could be:

A recurring cross‑functional check‑in meeting
A set of lightweight offline prompts and notebooks
A few micro‑rituals baked into normal work (end of day, end of week)

The key is to treat it as an environment you design, not an accident of calendar gaps.

Designing Reliability Check‑Ins as Waiting Room Rituals

Create a structured, recurring ritual that happens even when nothing is on fire. Think of it as a scheduled visit to the waiting room.

Cadence ideas:

Weekly for high‑change, high‑incident environments
Every two weeks or monthly for more stable systems

Participants:

On‑call engineers (past, current, next rotation)
SRE / platform / infra engineers
Product owners / EMs for critical services
Optional: support, customer success, incident manager

A Simple Agenda Template

Opening check‑in (5–10 minutes)
- One‑word weather report: “How’s your internal weather?” (e.g., sunny, hazy, stormy)
- Quick round: “One sentence about how on‑call has felt this week.”
System health snapshot (10–15 minutes)
- High‑level metrics: error rates, latency, availability, key SLOs
- Recent near‑misses or noisy alerts
- Any emerging risks or weird patterns
Risk and readiness discussion (15–20 minutes)
- “If an incident happened tonight, what’s the most likely cause?”
- “What’s one part of the system we least want to page us right now?”
- “What’s confusing or brittle from the last rotation?”
Concrete availability improvements (10–15 minutes)
- Choose 1–3 small, specific changes (not giant projects):
  - Tune or remove a noisy alert
  - Add a missing dashboard panel
  - Improve one runbook step
  - Add a test for a known edge case
Closing reflection and acknowledgement (5 minutes)
- “What’s one thing we did this week that future us will thank us for?”
- Explicit thanks to current/oncoming on‑call engineers

The purpose is not to generate a long backlog; it’s to convert unease into small, steady actions.
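If you want to check that the agenda above fits your calendar slot, the timeboxes can be added up in a few lines. This is a minimal sketch: the names `AgendaItem`, `total_duration`, and `fits_slot` are made up for illustration, and only the minute ranges come from the template.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgendaItem:
    """One agenda section with its timebox (hypothetical helper type)."""
    name: str
    min_minutes: int
    max_minutes: int


# Timeboxes copied from the agenda template above.
AGENDA = [
    AgendaItem("Opening check-in", 5, 10),
    AgendaItem("System health snapshot", 10, 15),
    AgendaItem("Risk and readiness discussion", 15, 20),
    AgendaItem("Concrete availability improvements", 10, 15),
    AgendaItem("Closing reflection and acknowledgement", 5, 5),
]


def total_duration(agenda):
    """Return (shortest, longest) total meeting length in minutes."""
    shortest = sum(item.min_minutes for item in agenda)
    longest = sum(item.max_minutes for item in agenda)
    return shortest, longest


def fits_slot(agenda, slot_minutes):
    """True if even the longest version of the agenda fits the slot."""
    return total_duration(agenda)[1] <= slot_minutes


if __name__ == "__main__":
    shortest, longest = total_duration(AGENDA)
    print(f"Plan for {shortest} to {longest} minutes")  # 45 to 65
```

Run at full length, the template needs just over an hour, so a 60‑minute slot only works if some sections stay at the short end of their range.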

Empathy‑First: Making It Safe to Not Be “Fine”

For these check‑ins to work, they must be emotionally safe spaces, not just extra stand‑ups with graphs.

Some design principles:

Feelings are data. Treat stress, dread, and confusion as legitimate inputs to reliability planning.
No hero worship. Avoid glorifying “I survived 20 pages last weekend.” Normalize not wanting that.
Psychological safety first. Make it clear that admitting fear, fatigue, or confusion will not be punished.
Leader vulnerability. Managers and senior engineers should model honesty: “I’m tired,” “I’m worried about X.”

You might add simple prompts like:

“What is one part of on‑call that quietly scares you?”
“Where do you feel least prepared during an incident?”
“What would make the next rotation feel meaningfully safer for you?”

Document the concerns. Treat them as reliability work, not soft side chatter.

Offline‑Friendly Tools and Micro‑Rituals

Not everything needs a meeting. Many insights about reliability and burnout surface in the small gaps: before bed, after a tough shift, on a commute.

Design offline‑friendly tools so people can capture those fleeting reflections.

Simple Tools

Pocket notebook or notes app titled “On‑Call Waiting Room”
Printed reflection cards near desks or in an ops binder
Short, recurring forms (e.g., weekly 3‑question survey)

Micro‑Ritual Prompts

Prompts should be quick, about 30–90 seconds each:

End of shift:
- “What felt fragile today?”
- “What saved you time or pain today?”
After an alert (even small):
- “What confused you for more than 2 minutes?”
- “Did the alerts tell a coherent story?”
Pre‑rotation:
- “What do you wish you’d review before going on‑call (but usually don’t)?”

These notes don’t need to be polished. Their job is to seed future waiting room check‑ins—to give you real, human data beyond graphs.
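For teams that prefer a notes file over a paper notebook, the prompts above can be captured with a small local script. The sketch below assumes a plain JSON Lines file on disk and needs no network access. The function names `record_note` and `load_notes`, the moment keys, and the file layout are all hypothetical, not a prescribed tool.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

# Prompts taken from the list above; the dictionary keys are made-up labels.
PROMPTS = {
    "end_of_shift": [
        "What felt fragile today?",
        "What saved you time or pain today?",
    ],
    "after_alert": [
        "What confused you for more than 2 minutes?",
        "Did the alerts tell a coherent story?",
    ],
    "pre_rotation": [
        "What do you wish you'd review before going on-call (but usually don't)?",
    ],
}


def record_note(path, moment, answers, now=None):
    """Append one reflection to a local file as a single JSON line."""
    if moment not in PROMPTS:
        raise ValueError(f"unknown moment: {moment}")
    prompts = PROMPTS[moment]
    if len(answers) != len(prompts):
        raise ValueError(f"expected {len(prompts)} answers, got {len(answers)}")
    entry = {
        "at": (now or datetime.now(timezone.utc)).isoformat(),
        "moment": moment,
        "answers": dict(zip(prompts, answers)),
    }
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
    return entry


def load_notes(path):
    """Read every saved reflection back, oldest first."""
    p = Path(path)
    if not p.exists():
        return []
    with p.open(encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

A call such as `record_note("waiting_room.jsonl", "end_of_shift", ["the deploy queue", "the new runbook"])` takes a few seconds, and `load_notes` gives the facilitator raw material to bring to the next check‑in.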

Slow Reliability: Turning Lulls into Practice

Many organizations do reliability work only reactively: post‑mortems, RCAs, and big initiatives after major outages.

Slow reliability is different. It’s about:

Small, repeatable improvements between events
Continuous readiness instead of crisis‑only response
Strengthening human + system resilience together

Examples of slow reliability habits:

Each waiting room session must end with one change that ships within a week.
Each on‑call rotation must include one documentation or runbook improvement.
Each quarter, pick one high‑anxiety scenario (e.g., full region failure) and practice it in a low‑stakes game day.

Over time, these habits:

Reduce surprise during incidents
Decrease cognitive load for on‑call
Build a culture where reliability is a continuous craft, not a panicked reaction
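The first habit above, one change per session that ships within a week, is easy to let slide unless someone tracks it. Here is a minimal sketch of that check. The `Session` record and the status labels are assumptions for illustration, and only the seven‑day window comes from the habit itself.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

SHIP_WINDOW = timedelta(days=7)  # "ships within a week", from the habit above


@dataclass
class Session:
    """One waiting room session (hypothetical record shape)."""
    held_on: date
    change: Optional[str] = None       # the one small change picked in the session
    shipped_on: Optional[date] = None  # stays None until the change ships


def habit_status(session, today):
    """Return 'no-change', 'shipped', 'late', 'pending', or 'overdue'."""
    if session.change is None:
        return "no-change"
    deadline = session.held_on + SHIP_WINDOW
    if session.shipped_on is not None:
        return "shipped" if session.shipped_on <= deadline else "late"
    return "pending" if today <= deadline else "overdue"


if __name__ == "__main__":
    session = Session(date(2024, 1, 1), "Tune the noisy disk alert")
    print(habit_status(session, today=date(2024, 1, 9)))  # overdue
```

An "overdue" or "no-change" result works better as a prompt for the next check‑in ("was the change too big?") than as a scorecard.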

Designing Themed Rooms in Your Waiting Area

A station waiting room isn’t a single chair; it’s a set of spaces for different needs. You can mirror that by defining a few themed modes in your reliability practice.

1. Reflection Room

Focus: Looking back without judgment.

Review recent alerts and near‑misses.
Gather anonymous or private reflections from the last rotation.
Ask: “What did we learn about ourselves and our systems?”

2. Planning Room

Focus: Looking ahead with intention.

Identify top 1–3 reliability risks.
Choose small, realistic improvements.
Align on who will do what before the next check‑in.

3. Debrief Room

Focus: Digesting incidents—even small ones.

Short, non‑blaming debriefs for any meaningful incident.
Capture both technical and emotional impact.
Ask: “What needs extra support—code, docs, or people?”

4. Emotional Check‑In Room

Focus: Tending to the humans.

Talk about on‑call load, sleep, resentment, fear, or pride.
Normalize saying “this is too much” or “I need a break.”
Feed insights into better rotations, backup policies, and staffing.

You don’t need four literal meetings. These are modes that can be woven into a single recurring session or rotated by week.

Conclusion: Designing Calm Before the Next Storm

On‑call will always have storms: late‑night alerts, cascading failures, unexpected edge cases. We invest heavily in tools and runbooks for those moments—and we should.

But the quiet in between is where culture is formed and resilience is built.

Treat that time as a designed train station waiting room, with:

Structured, recurring reliability check‑ins
Empathy‑first conversations about stress and burnout
Offline‑friendly tools and micro‑rituals
Themed spaces for reflection, planning, debrief, and emotional care

Do that, and you turn anxious waiting into intentional practice.

Your systems become more reliable. Your people feel less alone. And the next time a storm rolls in, you’re not just hoping the trains will run on time—you’ve quietly been strengthening the tracks all along.