The Quiet Pager: Designing Low‑Tech Rituals That Keep High‑Tech Incidents Small

How to design humane on-call practices, embed SRE into everyday development, and use simple, repeatable rituals to keep your pager quiet—but trustworthy—during high-tech incidents.

Introduction: Why the Quiet Pager Matters

Every engineering team dreams of the quiet pager: an on-call phone that almost never goes off—and when it does, it’s for something that truly matters.

The reality on many teams looks very different: noisy alerts, half-baked runbooks, exhausted SREs, and post-incident reviews that fix the symptom but not the system. The technology is advanced—distributed tracing, AI-assisted alerting, autoscaling—but the human side of incident response is often improvised.

The secret isn’t more sophisticated tooling. It’s better rituals.

The most reliable teams use a set of low-tech, repeatable practices that make their high-tech systems feel boring in the best possible way. They integrate SRE thinking deeply into development, design sustainable on-call, and treat alerts as a finely tuned signal, not a firehose.

This post explores how to design those rituals so your pager can stay quiet, and your incidents stay small.


1. Embed SRE in Development, Not Just Operations

A noisy pager is often a symptom of a deeper problem: reliability is treated as something you add on after development, rather than something you design in from day one.

From “throw over the wall” to shared ownership

Traditional patterns:

  • Developers ship features.
  • Ops or SREs “own” production.
  • Incidents become a blame game.

A better pattern:

  • SREs are embedded with product teams or work in a strong partnership model.
  • Reliability requirements are part of product requirements, not an afterthought.
  • Developers see incident data and participate in on-call, reviews, and capacity planning.

Practical ways to integrate SRE into development:

  • SRE office hours: Weekly or biweekly drop-in time where product teams bring design docs, launch plans, or monitoring questions.
  • Reliability checklists in PRs: Add simple questions to your pull request template:
    • What metrics and logs will help debug this feature in production?
    • How will we know this is broken before users do?
    • What’s the rollback or kill-switch strategy?
  • SRE in design reviews: For larger features, include at least one SRE (or reliability-minded engineer) in the design review to discuss failure modes, SLOs, and observability.

Reliability becomes a by-product of how you build software, not just how you respond when it fails.
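
If you want the checklist to have teeth, a small CI step can refuse to merge a change whose pull request description skips the reliability questions. The sketch below is only an illustration: it assumes the CI system exposes the PR description in a PR_BODY environment variable, and the required section names are placeholders to adapt to your own template.

    # check_pr_checklist.py -- fail CI if the reliability checklist is missing.
    # Assumes the CI system exports the pull request description as PR_BODY
    # (a hypothetical setup detail; adapt to your CI's actual mechanism).
    import os
    import sys

    REQUIRED_SECTIONS = [
        "Metrics and logs",        # what will help debug this in production?
        "Detection",               # how will we know it's broken before users do?
        "Rollback / kill switch",  # what is the rollback or kill-switch strategy?
    ]

    def missing_sections(pr_body: str) -> list[str]:
        return [s for s in REQUIRED_SECTIONS if s.lower() not in pr_body.lower()]

    if __name__ == "__main__":
        body = os.environ.get("PR_BODY", "")
        missing = missing_sections(body)
        if missing:
            print("Reliability checklist incomplete, missing:", ", ".join(missing))
            sys.exit(1)
        print("Reliability checklist present.")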


2. Make SRE Principles Part of Everyday Work

Core SRE principles—automation, monitoring, structured incident response—shouldn’t only appear during emergencies. They should shape your daily development workflows.

Automation as the default

  • Automate repetitive operational work (deploys, rollbacks, provisioning) so humans are free to think during incidents.
  • Require that new services include:
    • Automated build and test pipelines.
    • One-command or one-click deploy and rollback.
    • Automated health checks and basic alerts.

Automation doesn’t eliminate incidents, but it makes responses fast, consistent, and less stressful.
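
As a concrete example of "one-command rollback," the sketch below wraps the whole operation in a single script. It assumes the service runs as a Kubernetes Deployment managed with kubectl; treat it as a sketch to adapt to whatever your platform actually uses.

    # rollback.py -- one-command rollback for a service.
    # Assumes the service runs as a Kubernetes Deployment and that kubectl is
    # already configured for the right cluster; adapt for your platform.
    import subprocess
    import sys

    def rollback(deployment: str, namespace: str = "default") -> None:
        # `kubectl rollout undo` reverts the Deployment to its previous revision.
        subprocess.run(
            ["kubectl", "rollout", "undo", f"deployment/{deployment}",
             "-n", namespace],
            check=True,
        )
        # Wait for the rollback to finish so the operator gets a clear result.
        subprocess.run(
            ["kubectl", "rollout", "status", f"deployment/{deployment}",
             "-n", namespace],
            check=True,
        )

    if __name__ == "__main__":
        # Usage: python rollback.py <deployment-name> [namespace]
        rollback(sys.argv[1], *(sys.argv[2:3]))

The specific commands matter less than the property they illustrate: rolling back is a single, rehearsed action rather than something improvised at 3 a.m.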

Monitoring as a design constraint

Before shipping, ask:

  • What does “healthy” look like for this service?
  • What one or two graphs would I check first if something broke?

Bake these into your monitoring as golden signals (latency, errors, traffic, saturation). Tie them explicitly to SLOs or expectations. This makes your alerts meaningful and reduces noise later.
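
To make "tie them explicitly to SLOs" concrete, here is a minimal error-budget sketch against a hypothetical 99.9% availability target; in practice the request counts would come from your metrics system rather than hard-coded values.

    # A toy error-budget calculation for a 99.9% availability SLO.
    # The request counts are stand-ins for values pulled from your metrics system.
    SLO_TARGET = 0.999  # 99.9% of requests should succeed over the window

    def error_budget_remaining(total_requests: int, failed_requests: int) -> float:
        """Return the fraction of the error budget left (1.0 = untouched, <0 = blown)."""
        allowed_failures = total_requests * (1 - SLO_TARGET)
        if allowed_failures == 0:
            return 1.0 if failed_requests == 0 else -1.0
        return 1.0 - (failed_requests / allowed_failures)

    # Example: 2,000,000 requests this month, 1,200 failures.
    # The budget allows 2,000 failures, so 40% of the budget remains.
    print(f"{error_budget_remaining(2_000_000, 1_200):.0%} of the error budget remains")

Alerting on how fast this budget is burning, rather than on every transient blip, is one common way to keep pages tied to user impact.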

Structured incident response as muscle memory

Don’t wait for a major incident to define your process. Have a simple, written structure:

  1. Declare an incident (with clear severity levels).
  2. Assign roles (incident commander, communications, subject-matter expert).
  3. Use a shared channel/document for timelines, hypotheses, and changes.
  4. Resolve, then review (blameless post-incident review with concrete follow-ups).

Run occasional game days or tabletop exercises where you walk through this process with a fake incident. The goal is not drama; it’s calm, practiced execution.
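
Even the "declare an incident" step can be a small amount of structure rather than an ad-hoc scramble. The sketch below only models the roles and shared timeline described above; the names are invented, and any chat or paging integration is deliberately left out because it depends on your tooling.

    # declare_incident.py -- a structural sketch of steps 1-3 above.
    # Chat and paging integrations are deliberately left as placeholders.
    from dataclasses import dataclass, field
    from datetime import datetime, timezone

    SEVERITIES = ("SEV1", "SEV2", "SEV3")  # define these clearly for your org

    @dataclass
    class Incident:
        title: str
        severity: str
        commander: str
        communications: str
        subject_matter_expert: str
        started_at: str = field(
            default_factory=lambda: datetime.now(timezone.utc).isoformat()
        )
        timeline: list = field(default_factory=list)  # shared record of events

        def log(self, entry: str) -> None:
            # Keep hypotheses, changes, and observations in one ordered place.
            self.timeline.append(f"{datetime.now(timezone.utc).isoformat()} {entry}")

    incident = Incident(
        title="Checkout latency above SLO",
        severity="SEV2",
        commander="alex",
        communications="sam",
        subject_matter_expert="priya",
    )
    incident.log("Declared; dashboards linked in the incident channel.")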


3. Design Sustainable On-Call Schedules

You can’t have a quiet pager if everyone is already exhausted. Burnout leads to slower responses, sloppy fixes, and more incidents later.

Principles for humane on-call

  • Limit consecutive on-call time. Prefer shorter, more frequent rotations (e.g., one week) over long multi-week stretches.
  • Cap after-hours load. Track pages per week per person (a simple tracking sketch follows this list). If it’s consistently high, treat that as a reliability bug.
  • Always have clear escalation paths. A primary and secondary on-call, with an easy way to pull in a third if needed.
  • Pay and recognize on-call work. Treat it as real engineering, not invisible overhead.
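
Here is the tracking sketch promised above for the "cap after-hours load" principle: count pages per person per week from your paging tool's export and flag anyone over a chosen cap. The event format and the cap of two pages per week are assumptions to adjust.

    # page_load.py -- flag people whose weekly page load exceeds a cap.
    # The event format and the cap are assumptions; feed it your paging tool's export.
    from collections import Counter

    WEEKLY_PAGE_CAP = 2  # pick a number your team actually considers sustainable

    # Each event: (iso_week, person). In practice you would derive these from
    # timestamps in your paging system's export.
    page_events = [
        ("2024-W07", "alex"), ("2024-W07", "alex"), ("2024-W07", "alex"),
        ("2024-W07", "sam"),
        ("2024-W08", "alex"),
    ]

    pages_per_person_week = Counter(page_events)
    for (week, person), count in sorted(pages_per_person_week.items()):
        flag = "  <-- treat as a reliability bug" if count > WEEKLY_PAGE_CAP else ""
        print(f"{week} {person}: {count} pages{flag}")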

Build in recovery

  • Encourage a “page budget” mindset: if someone had a brutal night, they get slack the next day (no critical meetings, fewer tickets).
  • Make it normal to swap shifts when life happens, with a simple, documented process.

Sustainable on-call is not just about being nice; it’s a reliability strategy. Rested engineers spot patterns, improve systems, and prevent future incidents.


4. Low-Tech Incident Rituals for High-Tech Failures

The tools can be complex. The rituals should be simple.

Runbooks: Write for 3 a.m. You

A good runbook is:

  • Short and focused: one page is better than ten.
  • Task-oriented: “If X alert fires, do Y and Z.”
  • Opinionated: clear steps, not vague hints.

Start with:

  • How to access dashboards and logs.
  • Known quick checks (e.g., “Is the database CPU pegged?”).
  • Safe, reversible actions (e.g., “Scale this deployment to N replicas”).
  • When and how to escalate.

Treat runbooks as living documents. After each incident, update the relevant runbook with what actually worked.

Handoffs: Make the invisible visible

Incidents often suffer when context gets lost between shifts or teams. Use a simple handoff ritual:

  • Shared doc or ticket with:
    • Current status
    • What was tried
    • What’s risky
    • Next best hypothesis
  • A 10–15 minute verbal handoff when severity is high.

Simple formats like “Last, Now, Next” (“Last: what happened so far; Now: current state; Next: what we plan to try”) keep everyone aligned without a fancy tool.
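
If you want the format to stay consistent across shifts, the note can even be generated from a tiny template, as in the illustrative sketch below; a shared doc or ticket filled in by hand works just as well.

    # handoff.py -- render a "Last, Now, Next" note so every shift uses the same shape.
    def handoff_note(last: str, now: str, next_step: str, risky: str) -> str:
        return (
            f"LAST:  {last}\n"
            f"NOW:   {now}\n"
            f"NEXT:  {next_step}\n"
            f"RISKY: {risky}\n"
        )

    print(handoff_note(
        last="Rolled back the 14:02 deploy; error rate halved but is still elevated.",
        now="Primary DB CPU at 85%; read replicas healthy.",
        next_step="Shift read traffic to replicas and watch latency for 15 minutes.",
        risky="Replica lag occasionally spikes; do not disable the primary.",
    ))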

Reviews: Blameless, brief, and actionable

Post-incident reviews don’t need to be long or formal to be powerful.

Include:

  • A basic timeline.
  • User impact and duration.
  • The real contributing factors (technical and process).
  • 2–5 concrete actions that:
    • Prevent recurrence.
    • Improve detection.
    • Improve response (better runbook, tooling, or training).

Ritualize reviews: any incident above a certain severity must have one. Keep them short but consistent.


5. Fight Alert Fatigue Before It Starts

Once engineers lose trust in alerts, the pager becomes background noise—and real incidents slip through.

Test alerting in realistic environments

Before rolling out new alert rules widely:

  • Test them in staging or a realistic pre-prod environment with synthetic load.
  • Shadow them in “monitoring-only” mode, where alerts are logged or sent to a low-noise channel for a week or two.
  • Ask: would this page at 3 a.m. be:
    • Actionable?
    • Urgent?
    • Clear about what to check?

If the answer is “maybe,” it shouldn’t page a human.
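
One lightweight way to run that "monitoring-only" shadow period is a routing shim in front of your notifier: rules still on probation go to a low-noise channel instead of the pager. The rule names, channel names, and notification function below are placeholders for whatever tooling you actually use.

    # alert_router.py -- route shadow-mode alerts away from the pager.
    # `send_to_channel` is a placeholder for your real chat/paging integration.
    SHADOW_RULES = {"new-latency-p99-alert", "experimental-queue-depth-alert"}

    def send_to_channel(channel: str, message: str) -> None:
        print(f"[{channel}] {message}")  # stand-in for a real notification call

    def route_alert(rule_name: str, message: str) -> None:
        if rule_name in SHADOW_RULES:
            # Logged for review, but nobody gets woken up.
            send_to_channel("#alert-shadow", f"{rule_name}: {message}")
        else:
            send_to_channel("#oncall-pager", f"{rule_name}: {message}")

    route_alert("new-latency-p99-alert", "p99 latency 1.8s for 10m")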

Tune thresholds as a continuous process

Alert tuning is not a one-time setup.

  • Regularly review:
    • Top noisy alerts.
    • Alerts that are frequently auto-closed.
    • Alerts that never result in action.
  • Reduce or remove alerts that are:
    • Non-actionable.
    • Duplicates of better signals.
    • About long-term trends rather than immediate risk.

Push non-urgent signals into dashboards, reports, or weekly reviews instead of pages.
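
The review itself does not need a dedicated tool; a small script over an export of recent alerts can surface the worst offenders. The record fields below (rule name, whether the alert auto-closed, whether anyone acted on it) are assumptions about what your alerting system can export.

    # alert_review.py -- rank alert rules by noise for the periodic tuning review.
    # The record fields are assumptions about what your alerting system exports.
    from collections import defaultdict

    alerts = [
        {"rule": "disk-80-percent", "auto_closed": True,  "action_taken": False},
        {"rule": "disk-80-percent", "auto_closed": True,  "action_taken": False},
        {"rule": "api-error-rate",  "auto_closed": False, "action_taken": True},
        {"rule": "disk-80-percent", "auto_closed": True,  "action_taken": False},
    ]

    stats = defaultdict(lambda: {"fired": 0, "auto_closed": 0, "acted_on": 0})
    for a in alerts:
        s = stats[a["rule"]]
        s["fired"] += 1
        s["auto_closed"] += a["auto_closed"]
        s["acted_on"] += a["action_taken"]

    # Rules that fire often but never lead to action are candidates for
    # dashboards or removal rather than pages.
    for rule, s in sorted(stats.items(), key=lambda kv: -kv[1]["fired"]):
        print(f"{rule}: fired {s['fired']}x, auto-closed {s['auto_closed']}x, "
              f"acted on {s['acted_on']}x")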


6. Keep the Pager Quiet, but Trustworthy

The goal is not zero alerts; it’s high-signal alerts.

Design alerts as if you pay for each page

Imagine every page costs real money (and in terms of human energy, it does).

Ask for each alert:

  • What exact action do we expect on-call to take?
  • If they took no action, what would happen?
  • Can automation handle this instead?

If there is no clear, time-sensitive action, don’t page.
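
The "can automation handle this instead?" question often has a yes, at least for the first occurrence: attempt a known-safe fix automatically and page only if the problem does not go away. The restart, health-check, and paging functions below are placeholders; the pattern is what matters.

    # auto_remediate.py -- try a safe, reversible fix before paging a human.
    # `restart_service`, `is_healthy`, and `page_oncall` are placeholders for
    # your real tooling.
    import time

    def restart_service(name: str) -> None:
        print(f"restarting {name}")   # e.g. a deploy-tool or orchestrator call

    def is_healthy(name: str) -> bool:
        return False                  # stand-in for a real health check

    def page_oncall(message: str) -> None:
        print(f"PAGE: {message}")     # stand-in for the real paging call

    def handle_unhealthy(service: str, wait_seconds: int = 60) -> None:
        restart_service(service)      # known-safe, reversible action
        time.sleep(wait_seconds)      # give it time to recover
        if not is_healthy(service):
            # Automation could not fix it: now there is a clear, time-sensitive
            # action for a human, so a page is justified.
            page_oncall(f"{service} still unhealthy after automatic restart")

    handle_unhealthy("checkout-api", wait_seconds=1)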

Iterate based on real incidents

Treat alerting and incident processes as a product you’re building:

  • After each incident, ask:
    • Did we detect this as early as we should have?
    • Were there false alarms that distracted us?
    • Did the runbooks and dashboards actually help?
  • Adjust:
    • Thresholds.
    • Runbooks.
    • On-call expectations.
    • Escalation paths.

Over time, this feedback loop makes your pager both quieter and more trusted.


Conclusion: Calm by Design, Not by Luck

Reliable systems don’t emerge from the latest observability tool or the fanciest incident bot. They come from deliberate, low-tech rituals that wrap around your high-tech stack.

To keep incidents small and pages quiet:

  • Integrate SRE closely with developers so reliability is built in, not bolted on.
  • Apply SRE principles—automation, monitoring, structured response—to everyday work.
  • Design humane on-call schedules that prevent burnout and support fast, thoughtful responses.
  • Use simple, well-defined rituals: runbooks, handoffs, and reviews that people actually follow.
  • Treat alerting as an iterative craft: test, tune, and refine until only meaningful, actionable pages get through.

When you design for calm on purpose, the pager becomes what it should be: a rare, trusted signal that something truly needs your attention—and a reminder that your rituals are working.
