Rain Lag

The Reliability Pencil Case: Tiny Analog Habits for Taming Always‑On Engineering Work

How a simple “reliability pencil case” and a few tiny analog habits can make on‑call work saner, strengthen reliability culture, and turn incidents into lasting organizational learning.

Introduction: Reliability in a World That Never Sleeps

Always‑on systems have created always‑on humans.

If you work in reliability, SRE, platform, or production engineering, you know the pattern: constant context switching, Slack pings at odd hours, incident bridges spinning up with little warning, and a backlog of postmortems that keeps getting pushed to "later." Tools have improved, but the feeling of being on the hook never really disappears.

Digital tools are necessary—but they’re not sufficient. When the stakes are high and cognitive load is maxed out, tiny, low‑tech habits can quietly become your strongest allies. That’s where the idea of a reliability pencil case comes in.

This isn’t just about stationery. It’s a metaphor and a literal kit: a small, physical set of tools and prompts that help you turn incidents into durable learning, keep on‑call sustainable, and connect daily work to your organization’s reliability goals.


Why Analog Habits Matter in a Digital Incident World

Most incident response is powered by software: paging tools, chat, runbooks, dashboards, ticketing systems. They’re great at coordination and visibility—but terrible at helping your brain actually learn from what just happened.

Research on note‑taking suggests that writing by hand tends to:

  • Promote deeper processing of information
  • Encourage summarizing and conceptual thinking instead of verbatim copying
  • Improve long‑term retention and recall

In the middle of an incident, your working memory is overloaded. If everything stays in ephemeral chat threads and scattered tabs, most of the meaningful learning evaporates shortly after resolution.

A tiny analog habit—like grabbing a pencil and a dedicated notebook at the start of an incident—forces you to:

  • Slow down just enough to notice what’s happening
  • Capture context and decisions in your own words
  • Create raw material that will later feed a strong, blameless postmortem

The reliability pencil case is about engineering that habit into the fabric of your always‑on work.


The Reliability Pencil Case: What It Is (and Isn’t)

Think of a reliability pencil case as a small, portable kit that lives on your desk or in your backpack. It’s deliberately low‑friction and low‑tech:

  • A simple notebook dedicated to incidents and on‑call
  • A pen or pencil you actually like using
  • A few pre‑printed cards or sticky notes with tiny prompts
  • Optionally: small index cards for quick checklists or follow‑ups

This is not about aesthetic journaling. The point is reliability outcomes, not pretty notes.

The power of the pencil case comes from the tiny, repeatable analog habits it anchors—habits that reinforce core practices like blameless postmortems, sustainable on‑call, and organizational learning.


Blameless Postmortems Start During the Incident

Teams often treat postmortems as a bureaucratic afterthought: something you write because the process requires it. That leads to shallow analysis, blameful narratives, and repeat incidents.

Blameless incident postmortems are different. They:

  • Focus on systems and conditions, not individual mistakes
  • Seek root causes and contributing factors, not culprits
  • Encourage honesty, curiosity, and learning

But to do this well, you need good raw data—what people saw, thought, and tried in real time.

This is where the pencil case shines.

A Tiny Habit: The 3 Lines Per Incident Rule

At the moment you realize “this is an incident,” grab your notebook and write just three lines:

  1. Time + trigger — "10:24 UTC – PagerDuty page for elevated 500s on checkout API."
  2. First hypothesis — "Maybe new deployment? Or upstream payment provider."
  3. First action — "Rolled back last deploy; checking payment provider status page."

That’s it.

This tiny habit:

  • Captures your initial mental model, which is crucial for later learning
  • Keeps the cognitive cost low enough to be sustainable during a crisis
  • Seeds a richer, more honest postmortem later

When you come back to write the postmortem, these short notes help you reconstruct the real story: the confusion, the false leads, the human context. That supports a truly blameless analysis—because you’re looking at the system that produced those decisions, not judging people in hindsight.
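If you later transcribe those three handwritten lines into a digital postmortem, a small record type can keep the same shape. The following is only a sketch: the class name, the field names, and the calendar date are assumptions made for illustration, and the text values are the sample lines from the rule above.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FirstThreeLines:
    """Digital copy of the three handwritten lines (names are hypothetical)."""

    noticed_at: datetime    # line 1: time
    trigger: str            # line 1: trigger
    first_hypothesis: str   # line 2
    first_action: str       # line 3

    def as_timeline_entry(self) -> str:
        """Render the note as one plain-text line for a postmortem timeline."""
        stamp = self.noticed_at.astimezone(timezone.utc).strftime("%H:%M UTC")
        return (
            f"{stamp} | trigger: {self.trigger} | "
            f"first hypothesis: {self.first_hypothesis} | "
            f"first action: {self.first_action}"
        )


note = FirstThreeLines(
    # The date is invented; only the 10:24 UTC time comes from the example.
    noticed_at=datetime(2024, 1, 15, 10, 24, tzinfo=timezone.utc),
    trigger="PagerDuty page for elevated 500s on checkout API",
    first_hypothesis="Maybe new deployment? Or upstream payment provider.",
    first_action="Rolled back last deploy; checking payment provider status page.",
)
print(note.as_timeline_entry())
```

Keeping the hypothesis as its own field, rather than folding it into a free-text comment, is one way to make sure the initial mental model survives the copy from paper to document.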


Postmortems as a Regular Practice, Not an Occasional Ceremony

Treating postmortems as rare, heavyweight events sends a subtle signal: learning is exceptional, not routine.

High‑reliability teams instead treat postmortems as regular practice:

  • Low‑severity incidents still get lightweight reviews
  • Repeated patterns trigger deeper analysis
  • Learnings are fed back into runbooks, tooling, and training

Your pencil case can help make this rhythm tangible.

A Tiny Habit: One Page Per Postmortem

For every incident that crosses a pre‑defined threshold (e.g., customer impact, duration), dedicate exactly one notebook page with a consistent structure:

  • What happened (facts only)
  • What surprised us
  • What helped
  • What made it harder than it needed to be
  • One systemic improvement we’ll actually do

By constraining yourself to one page, you:

  • Avoid perfectionism and over‑editing
  • Focus on the most important insights
  • Make it realistic to keep postmortems regular

Over time, the notebook becomes a physical log of your team’s reliability journey—an artifact that reflects a culture of continuous learning.
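If you want the same threshold and page structure available next to your digital tooling, something like the sketch below could work. The 30-minute duration threshold, the `IncidentSummary` type, and the function names are all assumptions for illustration; the five prompts are the ones listed above.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumed threshold for illustration only; use whatever your team has agreed on.
MIN_DURATION = timedelta(minutes=30)

# The five prompts from the one-page structure, in order.
PAGE_SECTIONS = (
    "What happened (facts only)",
    "What surprised us",
    "What helped",
    "What made it harder than it needed to be",
    "One systemic improvement we'll actually do",
)


@dataclass
class IncidentSummary:
    """Hypothetical minimal description of a finished incident."""

    customer_impact: bool
    duration: timedelta


def needs_page(incident: IncidentSummary) -> bool:
    """Return True when the incident crosses either assumed threshold."""
    return incident.customer_impact or incident.duration >= MIN_DURATION


def blank_page(title: str) -> str:
    """Build a plain-text skeleton with one heading per prompt."""
    lines = [title, ""]
    for section in PAGE_SECTIONS:
        lines += [f"{section}:", ""]
    return "\n".join(lines)


incident = IncidentSummary(customer_impact=False, duration=timedelta(minutes=45))
if needs_page(incident):
    print(blank_page("Checkout API 500s"))
```

The skeleton is deliberately just headings: the point of the one-page limit is the constraint, so the code does nothing more than remind you of the five prompts.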


On‑Call Tools, Automation, and the Human Brain

Digital incident tooling is doing more and more of the heavy lifting:

  • Effective on‑call tools give you immediate, relevant context (recent deploys, known issues, dashboards)
  • Automated coordination handles paging, status updates, escalations, and stakeholder comms

This is essential for reducing cognitive load so humans can focus on problem‑solving instead of logistics.

But even with great tools, there’s a gap: the jump from data and events to understanding and learning.

Analog habits bridge that gap. Let your tools handle the record‑keeping:

  • Auto‑populating incident timelines
  • Capturing chat logs and decisions
  • Tracking tasks and follow‑ups

Then use your pencil case to do the human part:

  • Sketching a quick diagram of how the failure propagated
  • Summarizing "what we think the system is doing right now"
  • Noting moments of confusion or misalignment

These scribbles often reveal mental model mismatches that pure logs never show—and those mismatches are where some of the most powerful reliability improvements hide.


Fair, Transparent On‑Call: Preventing Burnout by Design

Always‑on engineering isn’t sustainable if the on‑call rotation is opaque or if its burden falls unevenly on a few people.

Fair and transparent on‑call distribution is critical to:

  • Preventing burnout and attrition
  • Maintaining trust and psychological safety
  • Preserving the energy required for learning and improvement

You’ll still need proper scheduling tools and policies, but tiny analog habits can make the human reality more visible.

A Tiny Habit: The On‑Call Fairness Snapshot

Once per quarter, use two pages in your notebook:

  • List everyone who has been on‑call
  • Next to each name, mark:
    • Weeks on‑call
    • Off‑hours incidents handled
    • High‑severity incidents handled

You’re not doing full analytics—just a hand‑drawn snapshot.

Then answer in writing:

  • “Does anything here feel unfair or unsustainable?”
  • “If yes, what’s one small change we’ll try next quarter?”

This practice:

  • Makes invisible load visible
  • Creates a concrete artifact you can bring into planning or governance discussions
  • Reinforces the idea that sustainability is part of reliability—not an afterthought
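The hand-drawn tally is the habit, but if you want to cross-check it against exported data, the arithmetic is small. The sketch below is one possible way to do that: the sample names and rows are invented, and the "more than 1.5 times the team median" flag is an assumed rule of thumb, not something your scheduling tool or this article prescribes.

```python
from collections import Counter
from statistics import median

# Each tuple is one handled incident: (engineer, off_hours, high_severity).
# Names and rows are invented sample data.
incidents = [
    ("ana", True, False),
    ("ana", True, True),
    ("ana", False, False),
    ("ben", False, False),
    ("chi", True, False),
]
weeks_on_call = {"ana": 4, "ben": 4, "chi": 3}

off_hours = Counter(name for name, is_off_hours, _ in incidents if is_off_hours)
high_sev = Counter(name for name, _, is_high in incidents if is_high)


def snapshot_rows():
    """One row per person: (name, weeks, off-hours incidents, high-severity incidents)."""
    return [
        (name, weeks, off_hours[name], high_sev[name])
        for name, weeks in sorted(weeks_on_call.items())
    ]


def flag_heavy_load(factor=1.5):
    """Names whose off-hours count exceeds factor x the team median (assumed rule)."""
    typical = median(off_hours[name] for name in weeks_on_call)
    return [name for name in sorted(weeks_on_call) if off_hours[name] > factor * typical]


for row in snapshot_rows():
    print(row)
print("Worth a conversation:", flag_heavy_load())
```

A flag here is only a prompt for the two written questions above, not a verdict; the numbers cannot tell you whether a given load felt sustainable.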


Connecting Incidents to Organizational Governance

Reliability work competes with features, deadlines, and budget constraints. If incident response and postmortems are disconnected from governance, they become "nice to have" rather than "must do."

Integrating incidents into governance means:

  • Reliability work is prioritized and funded
  • Postmortem outcomes influence roadmaps and staffing
  • Leaders see reliability as aligned with business goals, not opposed to them

Your analog habits can supply powerful, human‑readable evidence.

A Tiny Habit: Monthly Reliability Brief

Once a month, flip through your incident notes and postmortem pages. On a fresh page, create a simple brief:

  • Top 3 themes you’re seeing (e.g., “deploy friction,” “runbook gaps,” “tooling interruptions”)
  • One story that illustrates the cost or pain of an incident
  • Three concrete improvements you propose (with rough impact/effort notes)

Use this brief in whatever forum exists for governance—ops review, product planning, leadership sync.

Because it draws on handwritten observations and the patterns across them, this brief can be sharper and more compelling than a dashboard screenshot. It represents lived experience, not just metrics.
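If you tag each notebook page with a theme or two as you flip through, counting them takes only a few lines. This is a minimal sketch with invented sample tags; the `top_themes` helper is hypothetical, and the judgment about which themes matter still comes from reading the pages.

```python
from collections import Counter

# Theme tags copied from a month of notebook pages (invented sample data).
page_themes = [
    ["deploy friction", "runbook gaps"],
    ["runbook gaps"],
    ["tooling interruptions", "deploy friction"],
    ["runbook gaps", "alert noise"],
]


def top_themes(pages, n=3):
    """Return the n most frequent theme tags with their counts."""
    counts = Counter(theme for page in pages for theme in page)
    return counts.most_common(n)


for theme, count in top_themes(page_themes):
    print(f"{theme}: {count} page(s)")
```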


How to Start: Building Your Own Reliability Pencil Case

You don’t need anything fancy. To get started:

  1. Grab a small notebook and label it: "Incidents & On‑Call."
  2. Pick a writing tool you like and put both within arm’s reach of where you respond to incidents.
  3. Create 2–3 simple prompts on sticky notes or index cards, such as:
    • "Three lines when an incident starts"
    • "One page per postmortem"
    • "Monthly reliability brief"
  4. Tell your team what you’re trying and invite others to join you; share a photo or a one‑paragraph reflection after a month.

The key is not perfection; it’s consistency and smallness. Tiny habits beat elaborate systems you never use.


Conclusion: Small, Handwritten Acts of Reliability

Always‑on engineering will never be effortless. But it can be more humane and more effective.

  • Blameless postmortems become easier when you’ve captured real‑time thinking in a notebook.
  • Treating postmortems as a regular, lightweight practice embeds learning into daily work.
  • Good on‑call tools and automation reduce cognitive load; analog habits help your brain actually understand and remember.
  • Fair, transparent on‑call practices guard against burnout.
  • Simple, handwritten summaries connect incident learning to organizational governance and investment.

The reliability pencil case is a reminder that not all solutions are digital. Sometimes, the most powerful reliability tools are a notebook, a pencil, and a set of tiny habits that turn chaos into learning—one incident at a time.
