The Pencil-Drawn Incident Lighthouse Map: Turning Quiet Near-Misses into Your Strongest Early-Warning System

How to build a near-miss reporting culture, borrow practices from aviation, and use structured tools and analysis to prevent outages before they grow into full-blown storms.

The Pencil-Drawn Lighthouse Map: Seeing the Storm Before It Forms

Imagine an old sea chart someone once updated by hand: a tiny lighthouse, sketched in pencil, added after a ship nearly ran aground there. No disaster, no headlines—just a quiet near-catastrophe. That pencil-drawn lighthouse becomes a warning for everyone who sails that route afterward.

In modern software systems, near misses are those pencil marks: the silent almost-outages, the internal failures caught just in time, the misconfigurations that got rolled back seconds before impact. They don’t make the status page, but they’re priceless signals—if you capture and learn from them.

This post explores how to:

  • Treat near misses as early-warning signals instead of lucky breaks.
  • Borrow near-miss reporting practices from aviation to strengthen software reliability.
  • Build a no-blame reporting culture that encourages people to speak up.
  • Use structured post-incident tools and the Five Whys to turn anecdotes into insight.
  • Apply correlation engines and configuration validation to catch issues before they become storms.

What Is a Near Miss in Software Systems?

A near miss is an event that could have led to an incident or outage but didn’t—often thanks to luck, last-minute intervention, or safety mechanisms doing their job.

Examples:

  • A config change that would have dropped 80% of traffic, caught during deployment validation.
  • A background job that silently retried for hours due to a misconfiguration, narrowly avoiding data loss.
  • A third-party dependency that slowed requests to the edge of your SLO, but recovered before breaching.

Near misses are early-warning signals that the system is closer to a boundary than you think. Ignoring them is like ignoring a ship’s log that says, “We almost hit something here.”

When systematically captured and examined, near misses:

  • Reveal latent system weaknesses before they’re exploited.
  • Help you strengthen guardrails and safety mechanisms.
  • Improve both technical resilience and operational awareness.

What Aviation Can Teach Software About Near Misses

Aviation has spent decades building a mature near-miss reporting culture. Pilots file reports about events that didn’t cause accidents—but could have.

Key practices worth borrowing:

  1. Non-punitive reporting
    Aviation safety systems work because pilots know they won’t be punished for reporting an honest mistake or scare. The goal is learning, not blame.

  2. Centralized, structured data
    Reports go into centralized systems with consistent fields (context, conditions, contributing factors), enabling pattern detection across thousands of flights.

  3. Systemic, not individual, focus
    When things go wrong (or almost do), investigators look at systems, not just people: procedures, interfaces, training, communication, automation.

Bringing this mindset into software engineering means:

  • Treating near misses as safety data, not personal failures.
  • Logging them into a shared repository—even when no customer was affected.
  • Asking, “What in our system made this likely?” instead of “Who messed up?”

Building a No-Blame Near-Miss Culture

You will never see near misses if people are afraid to talk about them.

Concrete steps to encourage reporting:

1. Redefine Success

Make it explicit that catching a problem early is a success, not an embarrassment. Celebrate:

  • Someone aborting a risky deployment after noticing a subtle metric shift.
  • A support engineer flagging a strange pattern in tickets before it snowballs.
  • A developer reporting, “I almost pushed a bad migration; here’s what stopped me.”

2. Normalize “Almost” Stories

Create regular forums:

  • “Near-Miss Roundups” in weekly reliability or platform meetings.
  • Slack channels specifically for near misses (e.g., #near-miss-log).
  • Short write-ups shared in engineering newsletters.

The more near misses are discussed openly, the less stigma they carry.

3. Remove Fear and Blame

Leadership needs to:

  • Publicly support no-blame language in incident and near-miss reviews.
  • Focus questions on conditions and systems, not judgmental “why did you…”
  • Treat honest mistakes as design problems, not performance issues.

Without this foundation, the rest of your process will never reach its full potential.


Structured Post-Incident Tools: Your Pencil-Drawn Lighthouse Map

If your incident and near-miss analyses live in random documents, chat logs, or people’s memories, you can’t build a coherent “map” of danger zones.

Use structured, customizable postmortem templates for both incidents and near misses. A good template includes:

  • Summary: What happened, and what almost happened?
  • Impact: Even if users were spared, what internal impact existed (e.g., paging, manual intervention, risk exposure)?
  • Timeline: Ordered events from trigger → detection → response → resolution.
  • Detection: How was the near miss caught? Alert? Human intuition? Random luck?
  • Contributing factors: Technical and organizational.
  • Lessons learned: What do we now know that we didn’t before?
  • Action items: Concrete follow-ups, owners, and deadlines.
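
As a concrete illustration, these fields can be captured as a structured record in a shared store. This is only a minimal sketch: the dataclass, helper function, and file path are illustrative assumptions rather than a prescribed tool, and the field names simply mirror the template above:

```python
import json
from dataclasses import dataclass, field, asdict
from datetime import datetime, timezone

@dataclass
class ReliabilityEvent:
    """A single incident or near-miss record, using the template fields above."""
    kind: str                              # "incident" or "near_miss"
    summary: str                           # what happened, and what almost happened
    impact: str                            # internal impact even if users were spared
    detection: str                         # alert, human intuition, or luck
    timeline: list[str] = field(default_factory=list)       # trigger -> detection -> response -> resolution
    contributing_factors: list[str] = field(default_factory=list)
    lessons_learned: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)   # follow-ups with owners and deadlines
    occurred_at: str = field(default_factory=lambda: datetime.now(timezone.utc).isoformat())

def record_event(event: ReliabilityEvent, path: str = "reliability_events.jsonl") -> None:
    """Append the record to a shared JSON Lines file so it can be queried later."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(event)) + "\n")

# Example: a near miss that never reached customers still gets a record.
record_event(ReliabilityEvent(
    kind="near_miss",
    summary="Config change almost routed all traffic to a non-existent backend",
    impact="No customer impact; one engineer paged, manual rollback required",
    detection="deployment validation",
    contributing_factors=["config change during peak hours", "syntax-only validation"],
    lessons_learned=["Validation checks syntax but not whether host groups exist"],
    action_items=["Add existence checks to config validation (owner: platform team)"],
))
```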

Use the same structure for both incidents and near misses so you can:

  • Run queries across all records.
  • Spot repeat patterns (e.g., “Config changes during peak hours” keeps showing up).
  • Feed data into your correlation and analytics tools.
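
For example, once records share a structure, a short script can surface repeat patterns across all of them. A minimal sketch, assuming the JSON Lines store from the previous snippet:

```python
import json
from collections import Counter

def top_contributing_factors(path: str = "reliability_events.jsonl", n: int = 5) -> list[tuple[str, int]]:
    """Count how often each contributing factor appears across incidents and near misses."""
    counts: Counter[str] = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            event = json.loads(line)
            counts.update(event.get("contributing_factors", []))
    return counts.most_common(n)

# If "config changes during peak hours" keeps showing up, that is a lighthouse worth drawing.
for factor, count in top_contributing_factors():
    print(f"{count:3d}  {factor}")
```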

These templates are your pencil-drawn lighthouse maps—each one marking a place where a future storm could have formed.


Ask Better Questions: Using the Five Whys

The Five Whys technique helps ensure you don’t stop at the first obvious cause.

Example near miss:

  • A config change almost routed all traffic to a non-existent backend.

Walkthrough:

  1. Why did the config nearly route all traffic incorrectly?
    Because a new host group name was mistyped.

  2. Why was a mistyped host group name able to pass configuration checks?
    Because the validation only checked syntax, not existence.

  3. Why does validation only check syntax?
    Because validation rules haven’t been updated since we introduced dynamic host groups.

  4. Why haven’t validation rules kept up with infrastructure changes?
    Because there’s no defined owner or process for updating validation logic.

  5. Why is there no owner/process?
    Because configuration safety is not clearly part of any team’s charter.

Now your solution is not “be more careful when typing,” but:

  • Assign clear ownership for configuration safety.
  • Expand config validation to confirm host group existence.
  • Add pre-deploy safety checks and dry runs.
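
To make the last two items concrete, a pre-deploy check can confirm that every host group a config references actually exists before anything rolls out. This is a hedged sketch: `load_known_host_groups` and the config shape are hypothetical stand-ins for whatever your routing layer actually provides:

```python
def load_known_host_groups() -> set[str]:
    """Hypothetical lookup of host groups that actually exist (e.g., from service discovery)."""
    return {"web-frontend", "api-backend", "batch-workers"}

def validate_routing_config(config: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the config is safe to deploy."""
    known = load_known_host_groups()
    errors = []
    for route, target in config.get("routes", {}).items():
        if target not in known:
            errors.append(f"route '{route}' points at unknown host group '{target}'")
    return errors

# A mistyped host group now fails validation instead of almost taking down traffic.
proposed = {"routes": {"/checkout": "api-backed"}}   # typo: "api-backed"
problems = validate_routing_config(proposed)
if problems:
    raise SystemExit("Refusing to deploy:\n" + "\n".join(problems))
```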

The Five Whys turns a “fat-finger mistake” into a system design failure you can fix.


Finding Subtle Signals: Correlation Engines and Early Detection

Humans are good at storytelling; machines are better at spotting subtle patterns buried in logs, traces, and metrics.

Advanced correlation engines can:

  • Detect recurring combinations of signals (e.g., a specific error spike + latency increase + cache miss pattern) that often precede incidents.
  • Surface “weak signals” that would be ignored by threshold-based alerts.
  • Connect today’s near miss to similar events from months ago.

How this helps near-miss detection:

  • You can define a near miss not just by human narration, but by data patterns that historically correlate with close calls.
  • When those patterns recur, the system can warn: “You’re entering known-danger territory.”
  • Over time, you build an automated early-warning layer powered by your own history.
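
As a rough sketch of that warning step: if you keep a list of signal combinations that preceded past close calls, a small check can flag when the current window matches one. The pattern definitions and signal names here are invented for illustration:

```python
# Signal combinations that historically preceded incidents or near misses (assumed, illustrative).
KNOWN_PRECURSORS = [
    {"checkout_error_spike", "cache_miss_surge", "p99_latency_up"},
    {"config_deploy", "peak_traffic_window"},
]

def danger_warnings(active_signals: set[str]) -> list[set[str]]:
    """Return every known precursor pattern fully contained in the current signal set."""
    return [pattern for pattern in KNOWN_PRECURSORS if pattern <= active_signals]

current = {"config_deploy", "peak_traffic_window", "p99_latency_up"}
for pattern in danger_warnings(current):
    print(f"Entering known-danger territory: {sorted(pattern)}")
```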

To leverage this:

  1. Feed incident and near-miss records into your observability and analytics stack.
  2. Tag events with context (services, configs, features, time windows).
  3. Use correlation tools to find common precursors across multiple events.
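
A minimal sketch of step 3, assuming each historical event has been tagged with the signals observed shortly before it: count which signal pairs most often co-occur ahead of incidents and near misses, and treat the frequent ones as candidate precursor patterns:

```python
from collections import Counter
from itertools import combinations

# Tagged historical events: the signals seen in the window before each incident or near miss (illustrative data).
events = [
    {"checkout_error_spike", "cache_miss_surge", "p99_latency_up"},
    {"config_deploy", "peak_traffic_window"},
    {"checkout_error_spike", "cache_miss_surge"},
]

def common_precursor_pairs(event_signals: list[set[str]], min_count: int = 2) -> list[tuple[frozenset[str], int]]:
    """Count co-occurring signal pairs across events and keep the ones that repeat."""
    counts: Counter[frozenset[str]] = Counter()
    for signals in event_signals:
        for a, b in combinations(sorted(signals), 2):
            counts[frozenset((a, b))] += 1
    return [(pair, n) for pair, n in counts.most_common() if n >= min_count]

for pair, n in common_precursor_pairs(events):
    print(f"{n}x  {sorted(pair)}")
```

The pairs that repeat can then seed the known-precursor list from the earlier sketch, so the warning layer grows out of your own history.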

This is how your map gets richer: each near miss adds more ink to the patterns you can see.


Configuration: The Quiet Giant Behind Many Near Misses

Across the industry, configuration errors are among the leading causes of outages—and behind an even larger share of near misses.

Why config is so risky:

  • It’s often treated as “just data,” but it can be as powerful as code.
  • It frequently bypasses the rigor of code review, testing, and staging.
  • Config changes are sometimes made under pressure (“just tweak this flag”).

To reduce config-related near misses:

1. Treat Config as Code

  • Store config in version control.
  • Require code review and approvals for changes.
  • Use pull requests with automated checks.

2. Implement Robust Validation

  • Validate syntax and semantics (e.g., not just “is this JSON valid?” but “do these references exist?”).
  • Run pre-deploy checks (dry-runs, shadow traffic, staging tests).
  • Add schema-based validation for configuration formats.
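
A minimal sketch of the “syntax plus semantics” idea: parse the config, check it against a simple schema, then confirm that the things it references actually exist. The schema, fields, and lookup are assumptions for illustration; in practice a library such as jsonschema, or your platform's own validator, would do the heavy lifting:

```python
import json

REQUIRED_FIELDS = {"service": str, "replicas": int, "host_group": str}   # toy schema
EXISTING_HOST_GROUPS = {"web-frontend", "api-backend"}                   # assumed lookup

def validate_config(raw: str) -> list[str]:
    """Validate syntax, then schema, then semantics; return a list of errors."""
    errors = []
    try:
        cfg = json.loads(raw)                                   # 1. syntax: is this valid JSON at all?
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc}"]
    for field_name, field_type in REQUIRED_FIELDS.items():     # 2. schema: required fields and types
        if not isinstance(cfg.get(field_name), field_type):
            errors.append(f"'{field_name}' missing or not {field_type.__name__}")
    if cfg.get("host_group") not in EXISTING_HOST_GROUPS:      # 3. semantics: do the references exist?
        errors.append(f"host_group '{cfg.get('host_group')}' does not exist")
    return errors

print(validate_config('{"service": "checkout", "replicas": 3, "host_group": "api-backed"}'))
```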

3. Add Guardrails and Safe Deployment Patterns

  • Progressive rollouts: canary, region-by-region, feature flags with kill switches.
  • Automatic rollback on error rate or latency regression.
  • Clear “change windows” and visibility for risky config updates.
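
As a sketch of the second item, an automatic-rollback guardrail can compare canary error rates against the stable baseline during a progressive rollout. The metric sources and rollout hand-off here are hypothetical placeholders, not a real API:

```python
import time

def canary_error_rate() -> float:
    """Hypothetical: fetch the canary's error rate from your metrics system."""
    return 0.021

def baseline_error_rate() -> float:
    """Hypothetical: fetch the stable fleet's error rate from your metrics system."""
    return 0.004

def guarded_rollout(check_interval_s: int = 60, max_regression: float = 0.01, checks: int = 5) -> bool:
    """Watch the canary for a few intervals; roll back automatically on error-rate regression."""
    for _ in range(checks):
        if canary_error_rate() - baseline_error_rate() > max_regression:
            print("Error-rate regression detected: rolling back and recording a near miss.")
            return False            # hand off to your rollback mechanism here
        time.sleep(check_interval_s)
    print("Canary healthy: promoting rollout.")
    return True

guarded_rollout(check_interval_s=0)   # interval set to 0 only so the example runs instantly
```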

Every avoided outage due to config is a near miss worth documenting—and a strong argument for investing in better validation.


Bringing It All Together: Turning Quiet Close Calls into Collective Wisdom

To build your own "pencil-drawn incident lighthouse map" for software systems:

  1. Name and value near misses. Treat them as crucial safety intelligence, not embarrassing close calls.
  2. Create a no-blame reporting culture. Reward early detection and transparent storytelling.
  3. Use structured templates for both incidents and near misses so lessons are comparable and searchable.
  4. Apply the Five Whys to identify systemic causes and avoid shallow fixes.
  5. Leverage correlation engines to mine historical events for recurring precursors and subtle patterns.
  6. Invest heavily in configuration safety—validation, ownership, and deployment guardrails.

Your systems will still face storms. But with each near miss you capture and analyze, your map gets sharper, your lighthouses brighter, and your crews better prepared. Over time, you’ll find fewer surprises—not because you’re lucky, but because you’ve turned quiet close calls into the most powerful learning engine in your reliability toolkit.
