Rain Lag

The Analog Incident Railway Dining Car: Paper Menus for Choosing What NOT To Fix During an Outage

How a simple, tactile “paper menu” and a railway dining car metaphor can transform early-incident decision-making, reduce risky fixes, and bring calm, SRE-style discipline to chaotic outages.

When production is on fire, your brain is not at its best.

Cognitive load spikes, stress rises, executives are pinging for updates, and a dozen different people start suggesting "quick" fixes. This is exactly when teams reach for the riskiest changes: restarting core databases, toggling mysterious feature flags, redeploying half the stack, or "just" rolling back everything.

That’s how minor incidents become full-blown outages.

This post introduces a deliberately low-tech countermeasure: The Analog Incident Railway Dining Car—a simple, tactile paper menu you use in the first 15 minutes of an incident to decide, together, what you will not touch.

Pair this menu with a lightweight incident playbook grounded in SRE best practices, and you get calmer decision-making, fewer panic changes, and more resilient incident response.


Why an “Incident Dining Car”?

Picture an old railway dining car.

You sit down, you get a menu. The menu doesn’t show everything in the universe; it shows curated, pre-decided options. You’re not inventing dishes from scratch under pressure—you’re choosing safely within known bounds.

Now apply that metaphor to incident response:

  • The railway is your production environment in motion.
  • The dining car is your incident command space—Zoom, Slack, war room.
  • The menu is a physical paper sheet listing common actions that are tempting but risky.

At the start of the incident, instead of rushing into random fixes, your team opens the menu and explicitly chooses what will stay off-limits in the first 15 minutes.

This tiny ritual does three powerful things:

  1. Slows you down just enough to avoid panic.
  2. Constrains your option space to safer, pre-considered moves.
  3. Aligns everyone’s expectations on what is absolutely not happening (yet).

The Paper Menu: Choosing What NOT to Touch

The paper menu is literally that: paper.

Why analog in a digital world?

  • It’s tactile and hard to ignore.
  • It works when tools or dashboards are slow or failing.
  • It changes the psychological frame: you’re not hacking, you’re ordering.

What the Menu Looks Like

A one- or two-page sheet, printed and stored at every on-call desk, with sections like:

Section A – High-Risk Actions (Default: DO NOT TOUCH in first 15 minutes)

  • Restart primary database cluster
  • Fail over to secondary region
  • Full rollback of the main monolith
  • Global feature-flag flips affecting >50% of traffic
  • Changing DNS records or traffic routing at the edge
  • Purging all caches / mass invalidation

Section B – Medium-Risk Actions (Require Explicit Approval)

  • Redeploy any core service
  • Schema migrations / rollback
  • Bulk data fixes or scripts

Section C – Safe Defaults (Encouraged Early Moves)

  • Enable additional logging / tracing
  • Add temporary alerts / dashboards
  • Reduce load (rate limits, queue pausing, partial degradation)
  • Toggle pre-vetted “safe mode” flags

At incident start, the Incident Commander (IC) reads Section A out loud and says something like:

"For the next 15 minutes, we are NOT doing any of these high-risk actions unless we unanimously decide to break glass. Agreed?"

The team checks the boxes for actions that are explicitly forbidden for now.

This small step reframes the conversation from "What can we try?" to "What will we intentionally not try yet?"
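Even an analog artifact benefits from a version-controlled source, so the sheet can be regenerated and reprinted after every post-incident review. Here is a minimal sketch of the menu as a plain Python data structure rendered into a printable checklist; the section names and actions mirror the examples above, and none of this is a standard tool or schema.

```python
# Source of truth for the paper menu. Edit this after each post-incident
# review, then reprint the rendered sheet for every on-call desk.
MENU = {
    "A - High-Risk (DO NOT TOUCH in first 15 minutes)": [
        "Restart primary database cluster",
        "Fail over to secondary region",
        "Full rollback of the main monolith",
    ],
    "B - Medium-Risk (require explicit approval)": [
        "Redeploy any core service",
        "Schema migrations / rollback",
    ],
    "C - Safe Defaults (encouraged early moves)": [
        "Enable additional logging / tracing",
        "Reduce load (rate limits, queue pausing)",
    ],
}

def render_menu(menu: dict) -> str:
    """Render the menu as a plain-text sheet with checkboxes to print."""
    lines = ["INCIDENT DINING CAR MENU", "=" * 40]
    for section, actions in menu.items():
        lines.append("")
        lines.append("Section " + section)
        lines += ["  [ ] " + action for action in actions]
    return "\n".join(lines)

print(render_menu(MENU))
```

Keeping the menu as data also makes the post-incident review step concrete: promoting an action from Section B to Section C is a one-line diff that someone reviews, rather than a verbal agreement that fades.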


Pairing the Menu with an Incident Playbook

The menu is not a standalone gimmick. It works best when paired with a clear, lightweight incident playbook that guides the first 15 minutes and beyond.

Your playbook should include at least:

1. First-15-Minutes Checklist

The goal of the first 15 minutes is stabilization and observability, not heroics.

Example checklist:

  1. Assign roles (IC, Communications, Scribe, Subject-Matter Experts).
  2. State the impact in plain language: "Users cannot X; error rate is Y; it started around time Z."
  3. Open the paper menu and decide what you will NOT touch yet.
  4. Collect facts, not theories: dashboards, logs, recent changes, alerts.
  5. Enable or improve observability: add logs, narrow probes, enable tracing.
  6. Consider containment: rate limiting, partial feature disable, shedding non-critical traffic.
  7. Announce initial status update using a comms template.

2. Recovery Decision Matrix

To avoid ad-hoc, risky fixes, define pre-agreed criteria for common recovery strategies.

For example:

  • When to restart a service
    • Only if: the service is non-responsive, stateless, and has no writes in flight.
  • When to fail over regions
    • Only if: impact is regional, automated checks mark target region healthy, and comms & rollback plans are ready.
  • When to roll back
    • Only if: the incident correlates strongly with a specific deployment and the previous version is known-good.

The paper menu can reference this matrix: "Before doing anything in Section A or B, consult the Recovery Decision Matrix."
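The value of the matrix is that the criteria are explicit and checkable, not vibes. As an illustration, the restart and failover rows above could be encoded as predicates; the field names (`responsive`, `open_writes`, and so on) are hypothetical and should be replaced with whatever your team actually agrees to check.

```python
from dataclasses import dataclass

@dataclass
class ServiceState:
    responsive: bool     # is the service answering health checks?
    stateless: bool      # does it hold no local state we could lose?
    open_writes: bool    # are writes currently in flight?

def may_restart(svc: ServiceState) -> bool:
    """Restart only if non-responsive, stateless, and no writes in flight."""
    return (not svc.responsive) and svc.stateless and (not svc.open_writes)

@dataclass
class FailoverContext:
    impact_regional: bool         # is the impact confined to one region?
    target_region_healthy: bool   # do automated checks pass on the target?
    comms_ready: bool             # is a comms plan ready to go?
    rollback_plan_ready: bool     # can we undo the failover?

def may_fail_over(ctx: FailoverContext) -> bool:
    """Fail over only when every pre-agreed condition holds."""
    return all([ctx.impact_regional, ctx.target_region_healthy,
                ctx.comms_ready, ctx.rollback_plan_ready])
```

Whether you encode this in code, a runbook table, or the paper sheet itself matters less than the discipline: each high-risk action has a written "only if" that the IC can read aloud.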

3. Comms Templates

Incidents feel worse when nobody knows what is happening. Provide templates:

  • Internal update
  • Exec summary
  • Customer status page note

Each template emphasizes honesty about uncertainty: "We’re investigating, we’ve stabilized X, we’re avoiding Y risky changes, next update in Z minutes."
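A template like that can be a literal fill-in-the-blanks string, which keeps the placeholders obvious to a stressed human. This is a minimal sketch using Python's standard-library `string.Template`; the field names are illustrative, not a standard format.

```python
from string import Template

# Internal status update: states impact, what is stabilized, what is
# deliberately deferred, and when the next update comes.
INTERNAL_UPDATE = Template(
    "Status: investigating. Impact: $impact. "
    "Stabilized so far: $stabilized. "
    "Deliberately NOT doing yet: $deferred. "
    "Next update in $next_update_min minutes."
)

msg = INTERNAL_UPDATE.substitute(
    impact="checkout error rate ~8%",
    stabilized="rate-limited batch jobs",
    deferred="database restart, region failover",
    next_update_min=15,
)
print(msg)
```

Note how the template forces the "Deliberately NOT doing yet" field: the menu's deferred actions become part of the message, so stakeholders see restraint as a decision rather than inaction.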

4. Safe Restore Steps

Document step-by-step safe restoration paths:

  • Known-good rollback sequences
  • Database failover procedures with guardrails
  • Traffic shifting with clear abort criteria

Tie these steps back to the menu: they are what you may progress to after the first 15 minutes, when you’ve gathered enough information.
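"Traffic shifting with clear abort criteria" is worth making concrete: shift in small pre-agreed steps, check a health signal after each step, and revert automatically if the signal crosses a threshold. The sketch below assumes two hypothetical hooks, `set_traffic_weight` and `current_error_rate`, standing in for whatever your load balancer and metrics stack actually expose.

```python
ABORT_ERROR_RATE = 0.02      # pre-agreed abort threshold (2% errors)
STEPS = [10, 25, 50, 100]    # percentage of traffic on the new target

def shift_traffic(set_traffic_weight, current_error_rate) -> bool:
    """Shift traffic step by step; abort and revert on elevated errors.

    Returns True if the shift completed, False if it was aborted.
    """
    for pct in STEPS:
        set_traffic_weight(pct)           # move the next slice of traffic
        if current_error_rate() > ABORT_ERROR_RATE:
            set_traffic_weight(0)         # abort: revert to the old target
            return False
    return True
```

The point is that the abort criterion is written down before the incident, so at 2 a.m. nobody has to argue about whether the error rate is "probably fine."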


SRE Best Practices Behind the Dining Car

This approach is a friendly metaphor over some serious SRE principles:

Clear Roles

During an incident, someone must be the Incident Commander. Others take roles like:

  • Operations Lead / Tech Lead
  • Communications Lead
  • Scribe

The IC holds the menu and playbook, keeping the team within the rails of agreed behavior.

Predefined Runbooks

Your playbook plus service-specific runbooks are your menu descriptions. They turn "we might try X" into "here’s exactly how we safely do X." Ad-hoc, undocumented moves are where outages get worse.

Decision Criteria over Gut Feel

Under pressure, people default to intuition or the loudest voice. The menu and decision matrix encode criteria, not opinions. That’s how you avoid the 2 a.m. "let’s just reboot everything" impulse.


Stabilize First, Fix Later

A key mindset shift: stabilization over fixing.

In the dining car, you don’t rebuild the train engine from your table; you order something that keeps you fed until the next station.

In incidents, stabilizing means:

  • Containing blast radius (rate limits, feature disabling, graceful degradation).
  • Ensuring you can see what’s happening (metrics, logs, traces).
  • Preserving data integrity over availability if you must choose.

Only once the system is not actively getting worse should you:

  • Attempt invasive recovery actions.
  • Change schemas, configs, or long-lived infrastructure.

The paper menu reinforces this by making invasive actions opt-in, not reflexive.


Training Tool and Leadership Exercise

The Incident Dining Car is also a training device and a leadership lab.

Use It in Drills

Run game days and chaos exercises where:

  • The IC must open the menu within 2 minutes.
  • The team must explain why they’re not touching certain items.
  • You simulate executive pressure: "Can’t we just fail over?" The IC practices saying, "Here’s why that’s not in the first-15-minutes menu."

Practice Calm, Explicit Decisions

The ritual of checking boxes on paper:

  • Makes leadership visible.
  • Creates psychological safety: people see there’s a plan.
  • Reduces blame later—risky actions weren’t forgotten, they were consciously deferred.

You’re training people to choose deliberately, not to react.


Keep the Menu Alive: Review After Every Incident

A static menu becomes stale or dangerously incomplete.

After each significant incident, in your post-incident review:

  1. Ask what actions were too tempting. Did anyone suggest something that should’ve been on the "do not touch early" list?
  2. Retire ineffective or harmful moves. If a standard action repeatedly makes things worse, move it up to High-Risk or remove it entirely.
  3. Promote proven safe actions. If a repeated response is both safe and helpful, add it to the Safe Defaults.
  4. Print and redistribute. Update the physical menu. Make sure every new on-call engineer has the latest version.

Over time, your menu becomes a distilled record of institutional learning: all the "never again" and "always consider" wisdom captured in one sheet.


How to Get Started This Week

You don’t need a big project. Start small:

  1. Draft a first version of the paper menu with:
    • 5–10 high-risk actions (do not touch in first 15 minutes).
    • 3–5 safe early moves.
  2. Print it and put it next to every on-call workstation.
  3. Add a simple first-15-minutes checklist to your incident docs.
  4. Run a single tabletop exercise using the menu.
  5. Refine based on feedback.

You’ll be surprised how often, in real incidents, someone says: "Let’s check the menu first."


Conclusion: A Simple Ritual for Better Incidents

Modern systems are complex, but our incident rituals don’t have to be.

The Analog Incident Railway Dining Car—a paper menu, a shared metaphor, and a first-15-minutes playbook—helps teams:

  • Avoid panicked, high-risk fixes.
  • Focus on stabilization and observability before deep surgery.
  • Practice calm, criteria-based leadership under pressure.
  • Turn hard-earned lessons into a living, tangible artifact.

In the middle of the next outage, when adrenaline is high and everyone wants to "just try something," step into the dining car, open the menu, and decide—together—what you will not fix yet.

Your future self, your uptime graphs, and your users will all be better off for it.
