Rain Lag

The Pencil-Drawn Outage Compass Trolley: Rolling a Single Paper Map Through Every On‑Call Handoff

How to design humane, resilient on-call rotations that rely on one shared ‘map’—centralized runbooks, clear ownership, and thoughtful handoffs—to cut chaos, MTTA, and MTTR.

Imagine your incident response process as a single, well‑worn paper map, mounted on a little trolley that rolls from one on‑call engineer to the next. No matter who’s holding it at 2 a.m., it shows the same routes, the same landmarks, the same “you are here” marker.

That’s the Pencil-Drawn Outage Compass Trolley: a metaphor for a centralized, shared way of navigating outages. Instead of every engineer relying on tribal knowledge, scattered dashboards, and half-remembered Slack threads, the team shares one consistent “map” of how to respond.

In this post, we’ll explore how to:

  • Design sustainable on-call rotations with clear coverage models and escalation paths.
  • Build standardized, practical incident runbooks that actually get used.
  • Minimize coordination tax and context switching during handoffs.
  • Make on-call more humane without sacrificing reliability.

1. Start with the Map: Choosing a Sustainable On-Call Coverage Model

Before you worry about runbooks, dashboards, or tooling, you need a coverage model that people can actually live with.

1.1 Common coverage patterns

A few patterns that work well in many orgs:

  • Follow-the-sun: Teams in different time zones cover their local daytime hours.

    • Pros: Less sleep disruption; better long‑term sustainability.
    • Cons: Requires multiple regions/teams and excellent cross‑regional communication.
  • Primary/secondary rotation:

    • Primary: Handles alerts, triage, customer-impacting issues.
    • Secondary: Acts as backup, handles escalations and helps with complex incidents.
    • Pros: Avoids single points of human failure; spreads load.
    • Cons: Needs clear rules for when to escalate from primary to secondary.
  • Team-wide rotation (everyone rotates through on-call):

    • Pros: Shared ownership; broad knowledge of the system.
    • Cons: Risk of overloading people if rotation is too frequent or incident load is high.

The right model depends on your headcount, time zones, and incident volume—but predictability is non‑negotiable. People should know:

  • When they will be on call.
  • What they are responsible for.
  • How they can hand off issues at the end of a shift.

1.2 Make escalation paths painfully obvious

In a crisis, ambiguity is expensive.

At a minimum, define and document:

  • Who is on point at any given time (primary on‑call).
  • Who is backup (secondary, manager, or incident commander on‑call).
  • What triggers escalation, e.g.:
    • MTTA (Mean Time To Acknowledge) > X minutes.
    • Customer-impacting severity above a certain threshold.
    • Duration beyond X minutes without clear mitigation.

This escalation path should live in your “paper map”—the shared incident documentation your team relies on. When someone “rolls the trolley” to the next shift, the path doesn’t change.
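The escalation triggers above can be captured as code rather than prose, so the rules survive handoffs unchanged. Here is a minimal sketch in Python; the thresholds and the SEV numbering are illustrative assumptions, not prescriptions:

```python
from datetime import timedelta

# Hypothetical escalation rules; all thresholds below are examples.
ACK_LIMIT = timedelta(minutes=5)          # escalate if unacknowledged past this
MITIGATION_LIMIT = timedelta(minutes=30)  # escalate if no clear mitigation yet
SEV_THRESHOLD = 1                         # SEV-1 always pages the secondary


def should_escalate(severity: int, time_since_alert: timedelta,
                    acknowledged: bool, mitigation_in_progress: bool) -> bool:
    """Return True when the primary on-call should page the secondary."""
    if severity <= SEV_THRESHOLD:
        return True  # customer-impacting severity above the threshold
    if not acknowledged and time_since_alert > ACK_LIMIT:
        return True  # MTTA breached
    if not mitigation_in_progress and time_since_alert > MITIGATION_LIMIT:
        return True  # too long without clear mitigation
    return False
```

Even if the real logic lives in your paging tool, writing it down this explicitly removes the "should I wake someone up?" debate at 2 a.m.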


2. Runbooks as the Map Legend: Standardized, Real-World Guidance

If your system goes down and your best engineer is on vacation, can a reasonably experienced teammate still navigate the outage?

That’s the core purpose of standardized incident response runbooks.

2.1 What a good runbook looks like

Every runbook should answer four questions clearly:

  1. What is this runbook for?

    • Scope (“API latency alerts for service X” or “Database read replicas unhealthy”).
  2. How do you recognize the problem?

    • Relevant alerts, dashboards, and logs.
    • Typical symptoms customers see.
  3. What are the first steps?

    • A simple, numbered checklist:
      1. Confirm the alert is real (check dashboard A and B).
      2. Notify the incident channel.
      3. Declare severity per your incident severity matrix.
  4. What are the common mitigations?

    • Concrete commands, links to related runbooks, or feature toggles to flip.
    • “If X is true, do Y” style branches.

Include copy‑paste‑able commands, screenshots, and links to observability dashboards. The more friction you remove, the more likely the runbook is to be used under stress.
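One way to keep runbooks consistent is a small lint check that verifies every runbook answers the four questions above. This is a sketch; the section headers are assumed names, so adapt them to your own template:

```python
# Required sections mirror the four questions a good runbook answers.
# The exact header strings are assumptions, not a standard.
REQUIRED_SECTIONS = [
    "## Scope",        # what is this runbook for?
    "## Symptoms",     # how do you recognize the problem?
    "## First steps",  # the numbered triage checklist
    "## Mitigations",  # the "if X is true, do Y" branches
]


def missing_sections(runbook_text: str) -> list[str]:
    """Return the required section headers absent from a runbook."""
    return [s for s in REQUIRED_SECTIONS if s not in runbook_text]
```

Run this in CI against your runbook directory and a half-finished runbook never silently ships.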

2.2 Use real incidents to refine runbooks

After each meaningful incident, ask:

  • Where did we guess?
  • Where did we argue about the next step?
  • Where did we lose time finding the right data?

Those gaps are runbook improvements waiting to happen.

For example, suppose a database failover incident took 60 minutes to mitigate:

  • You discovered that three engineers each checked different dashboards.
  • Nobody was sure who could authorize a failover.
  • The correct command existed, but only in an old wiki page.

Turn this into a better runbook:

  • Add a “Golden Dashboard” section at the top with direct links.
  • Document the authorization rule: “On‑call SRE may trigger failover for SEV‑1 without additional approval.”
  • Copy the exact failover command and safety checklist into the runbook.

This tightens both MTTA (Mean Time To Acknowledge) and MTTR (Mean Time To Recovery) because:

  • Less time is wasted finding data.
  • Fewer decisions rely on memory or folklore about who is authorized to act.

3. The Trolley Itself: Centralizing Information and Responsibilities

The “paper map on a trolley” is more than a metaphor. It’s a principle: one place, one owner, one flow.

3.1 Centralize your incident information

During an incident, people should never have to ask:

  • “Where’s the latest status?”
  • “Which doc is the right doc?”
  • “Who’s currently in charge?”

Create a single incident space that acts as your trolley:

  • A dedicated incident channel template (e.g., #inc-sev1-*).
  • A standard incident document template (Google Doc, Notion, internal tool).
  • Links at the top to:
    • Current commander.
    • Status page.
    • Relevant runbooks.

Your goal: anyone who joins mid-incident can get oriented in under 60 seconds.
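That 60-second orientation is easiest when the top of every incident doc is generated, not hand-written. A minimal sketch, assuming a plain-text doc template (the field names and URL are placeholders):

```python
from datetime import datetime, timezone


def incident_header(sev: int, title: str, commander: str,
                    status_page: str, runbooks: list[str]) -> str:
    """Render the top-of-doc block a responder needs in the first minute."""
    lines = [
        f"# [SEV-{sev}] {title}",
        f"Opened: {datetime.now(timezone.utc).isoformat(timespec='minutes')}",
        f"Commander: {commander}",
        f"Status page: {status_page}",
        "Runbooks:",
        *[f"  - {r}" for r in runbooks],
    ]
    return "\n".join(lines)
```

Wire this into whatever creates your #inc-sev1-* channel so the doc and channel are born together, already populated.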

3.2 Minimize the “coordination tax”

Coordination tax is the overhead of:

  • Figuring out who is doing what.
  • Repeating status updates.
  • Reconciling conflicting information.

Reduce this by:

  • Assigning explicit roles early:

    • Incident commander: makes decisions, manages the flow.
    • Communications lead: posts status updates.
    • Operations lead(s): runs commands and checks metrics.
  • Using short, consistent status updates, e.g., every 15 minutes:

    • Impact.
    • Current hypothesis.
    • Next actions.

Clear roles and a single source of truth turn a chaotic swarm into a coordinated response.
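The three-field status update is worth templating too, so every update reads the same regardless of who posts it. A sketch; the field names mirror the list above but are conventions, not requirements:

```python
def status_update(impact: str, hypothesis: str, next_actions: list[str]) -> str:
    """Format the recurring (e.g., every 15 minutes) status update."""
    actions = "; ".join(next_actions)
    return (f"Impact: {impact}\n"
            f"Current hypothesis: {hypothesis}\n"
            f"Next actions: {actions}")
```

A communications lead pasting this into the incident channel on a timer keeps everyone aligned without the operations leads having to stop and narrate.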


4. Protecting Focus: Batching and Reducing Context Switching

On-call work can easily become an endless stream of tiny context switches: a low‑priority alert here, a documentation request there, a tooling tweak squeezed in between.

Over time, this destroys both productivity and morale.

4.1 Batch similar tasks for the on-call engineer

Instead of expecting on-call engineers to do “a bit of everything all the time,” batch responsibilities:

  • Incident response: Primary responsibility for live issues.

  • Post-incident improvements: During low‑incident periods of the shift, focus on:

    • Updating or creating runbooks.
    • Improving alerts and thresholds.
    • Automating recurring manual steps.
  • Tooling or SRE work: Assign projects that can be paused and resumed between incidents, rather than deep architectural work.

By aligning the type of work, you lower the cognitive load of switching contexts repeatedly.

4.2 Guardrails for non-urgent work during on-call

Make it clear that on-call engineers can say no to non-urgent requests during their shift. For example:

  • Code reviews can be delayed.
  • New feature work is not mandatory.
  • Ad-hoc meeting invites can be declined.

This is not laziness; it’s how you keep your “navigator” alert and effective when the real storm hits.


5. Making On-Call More Humane and Resilient

Reliability doesn’t require burning people out. In fact, burned-out teams are less reliable.

5.1 Predictable schedules and fair rotation

Aim for:

  • Schedules published well in advance (e.g., 1–3 months).
  • Rotation lengths that are humane:
    • Long enough to avoid constant switching (e.g., 1 week) but
    • Short enough to avoid sustained sleep disruption (avoid month‑long 24/7 duties for one person).
  • Recovery time after heavy on-call weeks, such as:
    • No meetings the following Monday morning.
    • A lighter sprint commitment.
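Publishing a schedule months in advance is trivial once the rotation is code. Here is a minimal sketch of a predictable weekly rotation; the names and the "next person in the ring backs up" pairing rule are illustrative assumptions:

```python
from datetime import date, timedelta


def weekly_rotation(engineers: list[str], start: date, weeks: int):
    """Yield (shift_start, primary, secondary) for each week, round-robin."""
    n = len(engineers)
    for i in range(weeks):
        primary = engineers[i % n]
        secondary = engineers[(i + 1) % n]  # next in the ring backs up
        yield start + timedelta(weeks=i), primary, secondary
```

Because the schedule is deterministic, anyone can answer "when am I on call?" months ahead, and swaps become explicit diffs rather than Slack negotiations.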

5.2 Explicit ownership during outages

Ambiguity about ownership is stressful. Make ownership explicit:

  • “This incident is owned by the Payments on-call engineer.”
  • “This API has a dedicated on-call rotation; they lead.”

Pair that with psychological safety:

  • Blameless postmortems.
  • A culture of learning rather than punishment.

When people aren’t terrified of being “the one on duty,” they’re more likely to take ownership and improve the system.


6. Bringing It All Together: One Map, Many Hands

The Pencil-Drawn Outage Compass Trolley is a simple idea:

  • One shared map: Standardized runbooks, consistent documentation.
  • Clear routes and legends: Coverage models, escalation paths, and role definitions.
  • A sturdy trolley: Centralized communication channels and incident docs that roll smoothly from one shift to the next.

When you treat your incident process like a shared navigation tool rather than a heroic art form, you:

  • Lower MTTA and MTTR by removing guesswork.
  • Reduce coordination tax by clarifying roles and centralizing information.
  • Make on-call sustainable and humane by designing predictable, fair rotations.

Your systems will always have failures. The question is whether your team faces them with a shared, well‑marked map—or with a messy pile of half-finished sketches.

Invest in the map. Your future on-call self will thank you at 2 a.m., pencil in hand, calmly rolling the outage trolley to the next safe stop.
