
The Paper Incident Story Ferryman: Hand-Carrying Fragile Context Across the River Between Dev and On-Call

How to design deliberate handoffs, sustainable on-call rotations, and SRE-style practices so critical incident context doesn’t sink between development and on-call teams.


Every team has that incident.

You know the one: production goes sideways at 2:17 a.m., the on-call’s pager explodes, and everyone scrambles through dashboards and logs trying to reconstruct a story no one fully remembers. Somewhere, there’s a design doc or Slack thread that explains what’s going on, but the people who wrote it are asleep, on vacation, or long gone.

In those moments, context is the most fragile thing in your system.

Think of it like a river crossing. On one side you have development: tickets, PRs, design docs, Slack discussions, architecture diagrams. On the other side you have the on-call engineer staring at a red alert and a hazy graph at 2 a.m. Between them is a river of time, shift changes, organizational boundaries, and half-remembered history.

You need a ferryman: a deliberate way to hand-carry fragile incident context safely from dev to on-call.

This post explores how to build that ferryman: standardized handoffs, sustainable on-call rotations, SRE-style runbooks and automation, and a culture that makes incident work fair, humane, and effective.


1. Why Incident Context Is So Fragile

Alerting systems are great at screaming “something is wrong.”

What they’re terrible at is screaming “here’s what that means, why it’s happening, and what’s likely to break while you fix it.”

Effective incident response depends on much more than:

  • The metric that triggered the alert
  • The raw error message
  • The name of the failing service

It depends on a richer story:

  • Who: Who owns this service? Who has touched this part of the system recently? Who knows its failure modes?
  • What: What did we just ship? What dependencies are involved? What known issues exist?
  • Why: Why does this system behave this way under load, at region failover, or with partial dependencies?
  • Current hypotheses: What do we think is happening? What did previous incidents teach us?
  • Known pitfalls: What fixes are tempting but dangerous? What switches or configs are foot-guns?

Without this story, on-call is reduced to cargo-cult debugging: flipping toggles and rolling back blindly, hoping something works. That’s unsafe for reliability—and unfair to responders.

Your goal is to move context as intentionally as you move code.


2. Standardized On-Call Handoff: Build the Ferry Schedule

On-call handoff should never be an afterthought.

A mature incident practice treats handoff like a scheduled ferry crossing: predictable, documented, and consistent. That means:

2.1 Make Handoff a Deliberate Ritual

Every shift change should include a short, structured handoff. For example:

  • Time-boxed: 10–15 minutes
  • Channel: Dedicated Slack/Teams channel, plus calendar event
  • Participants: Current on-call, next on-call, and if needed, a rotation lead

2.2 Use a Standard Handoff Template

Include, at minimum:

  • Active incidents
    • Current status
    • Severity and impact
    • Owner(s) and stakeholders
    • Current hypotheses and next steps
  • Known risk windows
    • Planned deploys
    • Large migrations or infrastructure changes
    • Regulatory deadlines or business events (launches, sales)
  • Smoldering issues
    • Degraded systems that aren’t paging (yet)
    • Temporary workarounds in place
  • Ownership boundaries
    • Which teams/jurisdictions must be involved for certain actions (e.g., data deletion, access escalations, legal constraints)

Codify this template in your runbooks or incident tooling so no one is improvising at 3 a.m.
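
As a rough sketch of what “codify the template” can mean in practice, here is a hypothetical Python structure that a Slack bot or internal tool could render at shift change. The field names are illustrative, not tied to any particular incident platform:

  from dataclasses import dataclass, field

  @dataclass
  class ActiveIncident:
      title: str
      status: str             # e.g. "mitigating", "monitoring"
      severity: str           # e.g. "SEV-2"
      impact: str
      owners: list[str]
      hypotheses: list[str]
      next_steps: list[str]

  @dataclass
  class HandoffReport:
      outgoing: str
      incoming: str
      active_incidents: list[ActiveIncident] = field(default_factory=list)
      risk_windows: list[str] = field(default_factory=list)       # deploys, migrations, launches
      smoldering_issues: list[str] = field(default_factory=list)  # degraded but not paging
      ownership_notes: list[str] = field(default_factory=list)    # approvals, jurisdictions

      def to_markdown(self) -> str:
          """Render a short handoff summary for a Slack/Teams post."""
          lines = [f"On-call handoff: {self.outgoing} -> {self.incoming}"]
          for inc in self.active_incidents:
              lines.append(f"* [{inc.severity}] {inc.title} ({inc.status}): next -> {'; '.join(inc.next_steps)}")
          lines.append("Risk windows: " + (", ".join(self.risk_windows) or "none"))
          lines.append("Smoldering: " + (", ".join(self.smoldering_issues) or "none"))
          lines.append("Ownership notes: " + (", ".join(self.ownership_notes) or "none"))
          return "\n".join(lines)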


3. Moving Context from Dev to On-Call: More Than an Alert

Good incident response starts long before an incident.

When developers ship new features or large changes, they should pre-load the on-call rotation with context:

3.1 Pre-Deployment Context Packs

For significant changes, require a short “context pack” linked in Jira, GitHub, or your deployment tool:

  • Change summary: What’s new? What did we not touch?
  • What can break: Expected failure modes and dependency impacts
  • Signals to watch: Key metrics and logs that indicate problems
  • Safe rollback or kill switches: How to safely undo the change
  • Nightmare scenarios: Changes that could cause data loss, compliance issues, or cascading outages

These should be linked directly from alerts where possible.
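
One lightweight way to enforce this is to treat the context pack as structured data and have CI refuse risky deploys without it. The field names and example values below are illustrative, not a prescribed schema:

  # A hypothetical "context pack" as structured data, plus a check a CI step
  # could run before allowing a risky deploy.
  REQUIRED_FIELDS = ("change_summary", "failure_modes", "signals_to_watch",
                     "rollback", "nightmare_scenarios")

  context_pack = {
      "change_summary": "Enable new payment-retry queue; read path untouched",
      "failure_modes": ["queue backlog grows", "duplicate retries on provider timeout"],
      "signals_to_watch": ["payment_retry_queue_depth", "provider_5xx_rate"],
      "rollback": "Set feature flag payment_retry_v2=off; no schema change to revert",
      "nightmare_scenarios": ["double-charging customers if the dedupe key is wrong"],
  }

  def validate_context_pack(pack: dict) -> list[str]:
      """Return the names of missing or empty fields so CI can fail loudly."""
      return [f for f in REQUIRED_FIELDS if not pack.get(f)]

  missing = validate_context_pack(context_pack)
  if missing:
      raise SystemExit(f"Deploy blocked: context pack missing {missing}")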

3.2 Alert Descriptions as Mini Runbooks

Alert descriptions are prime real estate for context. Include:

  • A short “first 5 minutes” checklist
  • Links to relevant runbooks and dashboards
  • Common false positives and how to recognize them
  • Known dangerous “fixes” to avoid
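
Sketched as data, such an alert might look like the following. The shape loosely mirrors a Prometheus-style rule with annotations, written here as Python for illustration; the URLs, names, and thresholds are placeholders:

  # An alert definition whose annotations carry a mini runbook.
  checkout_latency_alert = {
      "alert": "CheckoutLatencyHigh",
      "expr": 'histogram_quantile(0.99, rate(checkout_latency_bucket[5m])) > 2',
      "for": "10m",
      "labels": {"severity": "page", "team": "payments"},
      "annotations": {
          "summary": "p99 checkout latency above 2s for 10m",
          "first_5_minutes": (
              "1) Check the payments dashboard for a matching deploy marker. "
              "2) Compare provider_5xx_rate; if elevated, the problem is likely upstream. "
              "3) If a deploy landed in the last hour, prefer rollback over tuning."
          ),
          "runbook_url": "https://runbooks.example.com/checkout-latency",
          "dashboard_url": "https://grafana.example.com/d/checkout",
          "known_false_positives": "Nightly batch refunds spike latency 02:00-02:15 UTC",
          "dangerous_fixes": "Do NOT raise the provider timeout; it amplifies retries.",
      },
  }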

3.3 Pairing Dev and On-Call Roles

When rolling out risky changes:

  • Make one of the feature devs shadow on-call during the rollout
  • Ensure they’re reachable for a defined window
  • Make it explicit in the ticket or deployment: “Escalate to Alice (feature dev) if alerts X/Y fire between now and tomorrow 12:00 UTC”

This ensures the people who built the system help ferry context across the release river.
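
If your paging tool supports overrides, this window can even be encoded rather than remembered. A minimal sketch, assuming a hypothetical who_to_page hook and made-up alert names:

  # A time-boxed escalation override: during the rollout window, specific
  # alerts escalate to the feature developer first. Names and dates are invented.
  from datetime import datetime, timezone

  ROLLOUT_OVERRIDES = [
      {
          "alerts": {"CheckoutLatencyHigh", "PaymentRetryBacklog"},
          "escalate_to": "alice",            # feature dev on shadow duty
          "until": datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc),
      },
  ]

  def who_to_page(alert_name: str, default_oncall: str,
                  now: datetime | None = None) -> str:
      """Return the override contact while the rollout window is still open."""
      now = now or datetime.now(timezone.utc)
      for rule in ROLLOUT_OVERRIDES:
          if alert_name in rule["alerts"] and now < rule["until"]:
              return rule["escalate_to"]
      return default_oncall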


4. Sustainable On-Call: Protecting Humans While Protecting Uptime

An on-call system that burns people out will eventually burn your reliability down with it.

Sustainable rotations share some key traits:

4.1 Clear Escalation Tiers

Define who does what, and when:

  • Tier 1 (front-line): Handles triage, simple runbook actions, basic mitigation
  • Tier 2 (service experts): Deep debugging, complex mitigations, non-trivial config changes
  • Tier 3 (domain specialists / leadership): Cross-org decisions, major incident command, customer communication

Document clearly:

  • When to escalate
  • How to escalate (tools, channels, escalation paths)
  • What authority each tier has (e.g., rollbacks, traffic shifts, data access)
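
Writing the tiers down as data makes the authority question checkable by tooling instead of living in people’s heads. A small illustrative sketch, with made-up tier roles and action names:

  ESCALATION_TIERS = {
      1: {"role": "front-line on-call",
          "may": {"ack_page", "run_runbook_step", "restart_instance"}},
      2: {"role": "service expert",
          "may": {"rollback_deploy", "shift_traffic", "change_config"}},
      3: {"role": "domain specialist / incident commander",
          "may": {"cross_org_decision", "customer_comms", "grant_data_access"}},
  }

  def minimum_tier_for(action: str) -> int:
      """Return the lowest tier allowed to perform an action (raise if none is)."""
      for tier in sorted(ESCALATION_TIERS):
          if action in ESCALATION_TIERS[tier]["may"]:
              return tier
      raise ValueError(f"No tier is authorized for action: {action}")

  # e.g. minimum_tier_for("rollback_deploy") -> 2: page tier 2 before acting.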

4.2 Predictable Schedules and Capacity Planning

  • Publish on-call calendars at least 4–6 weeks in advance
  • Respect time zones and local labor regulations
  • Protect focus time by adjusting sprint capacity for people on call
    • Example: On-call engineers commit to 60–70% of normal sprint work
    • Treat incident response as legitimate work, not invisible overhead
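
To make that capacity adjustment explicit during planning, a tiny helper is enough. The 0.65 default below reflects the 60–70% figure above; the exact factor is a team choice:

  def sprint_capacity(points_normal: float, on_call: bool,
                      on_call_factor: float = 0.65) -> float:
      """Planned capacity for one engineer, discounted while they hold the pager."""
      return points_normal * on_call_factor if on_call else points_normal

  # e.g. sprint_capacity(10, on_call=True) -> 6.5 committed points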

4.3 Track Time, Work, and Compensation Fairly

Especially in small DevOps/SRE teams, it’s easy for on-call work to disappear into the cracks. Make it visible:

  • Track:
    • Time spent responding to incidents
    • After-hours interruptions
    • Follow-up work from incidents
  • Use this for:
    • Compensation or time-off in lieu
    • Rotation adjustments
    • Headcount and tool investment justifications

Transparency here is essential for long-term sustainability.
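
A minimal sketch of what “make it visible” could look like: aggregate paging events into per-person totals and after-hours counts. The event shape here is invented; in practice you would pull this from your paging tool’s API or export:

  from collections import defaultdict
  from datetime import datetime

  events = [
      {"who": "alice", "start": datetime(2024, 5, 3, 2, 17), "minutes": 95},
      {"who": "alice", "start": datetime(2024, 5, 6, 14, 5), "minutes": 30},
      {"who": "bob",   "start": datetime(2024, 5, 4, 23, 40), "minutes": 55},
  ]

  def summarize(events: list[dict]) -> dict:
      """Total response minutes and after-hours pages per person."""
      totals: dict = defaultdict(lambda: {"minutes": 0, "after_hours_pages": 0})
      for e in events:
          totals[e["who"]]["minutes"] += e["minutes"]
          if e["start"].hour < 8 or e["start"].hour >= 18:
              totals[e["who"]]["after_hours_pages"] += 1
      return dict(totals)

  # Feed the output into compensation, rotation, and headcount discussions.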


5. SRE Practices: Codifying Knowledge So No One Starts from Scratch

Google-style SRE practices are fundamentally about turning tribal knowledge into systems.

5.1 Runbooks as the First-Class Artifact

Runbooks should be:

  • Discoverable: Linked from alerts and dashboards
  • Specific: Not just “check logs” but which logs and what patterns to look for
  • Maintained: Updated after every incident as you learn more
  • Scoped: One runbook per symptom or incident class, not a 50-page monolith

Train your team to ask “Where is this in a runbook?” whenever they have to do something twice.
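
One way to keep runbooks discoverable and maintained is to lint for them: fail CI, or a weekly report, when a paging alert has no runbook link or its runbook has gone stale. A sketch under those assumptions, with an invented alert/runbook shape:

  from datetime import datetime, timedelta

  def lint_alerts(alerts: list[dict], max_age_days: int = 180) -> list[str]:
      """Return human-readable problems to surface in CI or a weekly report."""
      problems = []
      now = datetime.now()
      for a in alerts:
          url = a.get("annotations", {}).get("runbook_url")
          if not url:
              problems.append(f"{a['alert']}: no runbook_url annotation")
          updated = a.get("runbook_last_updated")  # a datetime, if you track it
          if updated and now - updated > timedelta(days=max_age_days):
              problems.append(f"{a['alert']}: runbook stale (last updated {updated:%Y-%m-%d})")
      return problems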

5.2 Guardrails and Automation Over Heroics

Wherever you see repeated manual steps, ask whether they can be:

  • Turned into scripts or ChatOps commands
  • Encapsulated in safe, parameterized operations (“drain traffic from region X”, “roll back service Y”)
  • Automated entirely by self-healing workflows

This reduces toil—repetitive, manual, low-value work that eats capacity and morale.

The more you automate safe, boring actions, the more energy you have for hard problems.
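
For instance, a repeated manual action like “drain traffic from region X” can be wrapped in a guarded, parameterized operation with a dry-run default. The function and allow-list below are hypothetical; the point is the shape, not a specific API:

  import logging

  ALLOWED_REGIONS = {"eu-west-1", "us-east-1", "ap-south-1"}
  log = logging.getLogger("chatops")

  def drain_region(region: str, requested_by: str, dry_run: bool = True) -> str:
      """Shift traffic away from one region, refusing obviously bad input."""
      if region not in ALLOWED_REGIONS:
          raise ValueError(f"Unknown region {region!r}; refusing to act")
      action = f"drain traffic from {region} (requested by {requested_by})"
      log.info("chatops: %s (dry_run=%s)", action, dry_run)
      if dry_run:
          return f"DRY RUN: would {action}"
      # A real implementation would call your load balancer / traffic manager here.
      return f"Executed: {action}"

  # e.g. from a Slack command handler: drain_region("eu-west-1", "alice", dry_run=False)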


6. Culture: Blameless, Curious, and Constraint-Aware

Even the best process fails in a culture that punishes mistakes or ignores real-world constraints.

6.1 Blameless, Curiosity-Driven Learning

After incidents, focus on systems, not individuals:

  • Use blameless postmortems to ask:
    • What made this failure possible?
    • What made it hard to detect or mitigate?
    • What did we assume that turned out to be wrong?
  • Reward people for surfacing near-misses and uncomfortable truths
  • Turn each learning into:
    • Runbook updates
    • New alerts or dashboards
    • Design changes that reduce risk

Punishing individuals kills reporting and learning; curiosity keeps the river of context flowing.

6.2 Respect Organizational and Jurisdictional Boundaries

Your incident playbooks must account for:

  • Locations and time zones: Who’s realistically available, when
  • Regulations: Data-residency rules, access controls, auditability
  • Ownership: Clear boundaries between teams, vendors, and cloud providers

Codify these constraints:

  • Which actions require approval (e.g., deleting customer data)
  • Who can access which environments (e.g., production DB vs. logs)
  • What must be logged for compliance (e.g., access to PII)

A responder should never have to guess whether they’re allowed to push the metaphorical big red button.
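
Those rules can live in code, too, so the answer to “am I allowed?” is a clear error message rather than a guess. A sketch with illustrative action names and roles:

  POLICY = {
      "delete_customer_data": {"roles": {"tier3", "dpo"}, "needs_approval": True},
      "access_production_db": {"roles": {"tier2", "tier3"}, "needs_approval": False},
      "read_pii_logs":        {"roles": {"tier2", "tier3"}, "needs_approval": True},
  }

  def authorize(action: str, responder_roles: set[str], approved: bool) -> None:
      """Raise with a clear reason instead of letting a responder guess."""
      rule = POLICY.get(action)
      if rule is None:
          raise PermissionError(f"{action}: not a recognized action; escalate")
      if not rule["roles"] & responder_roles:
          raise PermissionError(f"{action}: requires one of {sorted(rule['roles'])}")
      if rule["needs_approval"] and not approved:
          raise PermissionError(f"{action}: requires recorded approval before proceeding")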


7. The Ferryman’s Checklist

To keep your incident context from sinking between dev and on-call, make sure you have:

  1. Standardized on-call handoff
    • Scheduled, structured, and documented for every shift
  2. Dev-to-on-call context packs
    • For risky changes, with clear failure modes and rollback instructions
  3. Rich alert descriptions
    • First actions, key links, and known pitfalls embedded in alerts
  4. Sustainable rotations
    • Clear tiers, predictable schedules, and sprint capacity adjustments
  5. Fair tracking and compensation
    • Incident time, interruptions, and follow-up work are all visible
  6. Runbooks, guardrails, and automation
    • Codified operational knowledge that reduces toil and heroics
  7. Blameless, constraint-aware culture
    • Learning-oriented, legally compliant, and ownership-aware

Conclusion: Build the Boat Before the Flood

You can’t stop incidents from happening. But you can stop them from turning into archaeology projects where each on-call has to dig through layers of history to rediscover how your systems really work.

Your real reliability edge isn’t just your monitoring stack or deployment pipelines. It’s how well you carry context across the river between development and on-call.

If you deliberately design your handoffs, on-call structure, SRE practices, and culture, you’re no longer relying on heroism—or memory—to save you at 2 a.m.

You’ve built a ferryman.

And the next time the pager goes off, your on-call won’t just have an alert. They’ll have a story they can act on.
