
The Paper Incident Story Railway Clocktower: Building a Vertical Time Map of How Outages Really Unfold

How a “vertical time map” turns scattered incident data into a coherent, shared story of outages—revealing real dependencies, enabling one-click postmortems, and driving systematic reliability gains.

Introduction: Incidents Are Stories, Not Just Spikes on a Graph

Most incident reviews still look like this: a messy mix of logs, alert screenshots, half-remembered Slack threads, and a hastily written postmortem a week later. We treat outages as a pile of data points when, in reality, they are stories unfolding over time.

That’s where the idea of a vertical time map comes in—what we’ll call the "Paper Incident Story Railway Clocktower" view. Instead of a flat line of alerts, you get a tower-like, vertical visualization of what happened: systems at different layers, teams in motion, decisions made, customer impacts observed, and the invisible physics of failure all stacked over a single timeline.

In this post, we’ll explore how building a vertical time map of incidents can:

  • Turn raw logs into a coherent, shared narrative
  • Enable one-click draft postmortems
  • Improve collaboration during and after outages
  • Reveal hidden dependencies using graph models and reinforcement learning
  • Connect software symptoms to hardware and system-level failure mechanisms
  • Drive long-term reliability and resilience improvements

What Is a Vertical Time Map?

A vertical time map is a structured visualization of an incident where time runs along one axis (usually vertical) and the layers of your system, organization, and impact are laid out side by side across the other. Imagine a tall, precise clocktower that shows, at each minute:

  • Signals and alerts (metrics, logs, traces)
  • System-level states (services, databases, queues, networks)
  • Team actions (pages, escalations, runbook steps, deploys, rollbacks)
  • Customer experience (error rates, latency, availability, support tickets)
  • Root causes and contributing factors (hardware faults, config changes, capacity limits)

Instead of a flat "timeline" that just lists events, the vertical time map organizes them into layers and relationships so you can see not only when things happened, but also how effects cascaded across your stack and organization.

This transforms the incident from a chaotic blur into a readable story:

At 09:01, a disk controller starts erroring. At 09:03, the storage service retries spike. At 09:05, the primary database latency climbs. At 09:07, the API gateways time out requests. At 09:09, customers see 500s. At 09:10, on-call is paged. At 09:15, a failover is attempted and partially succeeds…

All of that becomes one coherent vertical map that everyone can see and interpret together.
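
To make the structure concrete, here is a minimal sketch of what the underlying event model might look like, assuming a hypothetical TimeMapEvent type with a timestamp, a layer, a source, and a short description. The names, dates, and layer labels are illustrative rather than any specific tool's schema; the sample events mirror the story above.

  from dataclasses import dataclass
  from datetime import datetime
  from enum import Enum

  class Layer(Enum):
      SIGNAL = "signals_and_alerts"
      SYSTEM = "system_state"
      TEAM = "team_actions"
      CUSTOMER = "customer_experience"
      CAUSE = "causes_and_factors"

  @dataclass
  class TimeMapEvent:
      timestamp: datetime   # when the event happened
      layer: Layer          # which layer of the map it belongs to
      source: str           # e.g. "metrics", "pager", "chat", "host logs"
      description: str      # human-readable summary

  # The 09:01-09:15 story above, expressed as structured events
  # (the date is a placeholder):
  events = [
      TimeMapEvent(datetime(2024, 5, 1, 9, 1), Layer.CAUSE, "host logs", "Disk controller starts erroring"),
      TimeMapEvent(datetime(2024, 5, 1, 9, 3), Layer.SYSTEM, "metrics", "Storage service retries spike"),
      TimeMapEvent(datetime(2024, 5, 1, 9, 5), Layer.SYSTEM, "metrics", "Primary database latency climbs"),
      TimeMapEvent(datetime(2024, 5, 1, 9, 7), Layer.SYSTEM, "metrics", "API gateways time out requests"),
      TimeMapEvent(datetime(2024, 5, 1, 9, 9), Layer.CUSTOMER, "metrics", "Customers see 500s"),
      TimeMapEvent(datetime(2024, 5, 1, 9, 10), Layer.TEAM, "pager", "On-call is paged"),
      TimeMapEvent(datetime(2024, 5, 1, 9, 15), Layer.TEAM, "chat", "Failover attempted, partially succeeds"),
  ]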


From Raw Logs to a Coherent Incident Story

Raw incident data is notoriously fragmented:

  • Logs and traces in observability tools
  • Alerts in paging systems
  • Chat logs in Slack or Teams
  • Tickets and status updates in separate systems
  • Manual notes in docs or wikis

A vertical time map acts as a story engine. It ingests these raw events and aligns them along a single, consistent timeline. Then, instead of a laundry list of timestamps, you get a narrative structure (sketched in code after the list):

  1. Trigger – The earliest detectable anomaly or change.
  2. Escalation – Cascading impacts across services and components.
  3. Detection – The moment humans or automation recognize the incident.
  4. Response – Actions taken, hypotheses tested, mitigations attempted.
  5. Stabilization – Systems restored, workarounds applied.
  6. Recovery & Learning – Backlog drained, follow-ups identified.
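
A minimal sketch of that alignment step, reusing the hypothetical TimeMapEvent model from the earlier sketch: merge events from every source into one ordered timeline, then tag each event with a coarse narrative phase. The phase boundaries (detection and stabilization times) are passed in here; real tooling would infer them from pages and status changes.

  from itertools import chain

  def build_timeline(*feeds):
      """Merge normalized TimeMapEvent feeds (alerts, chat, deploys, tickets, ...)
      into one consistently ordered timeline."""
      return sorted(chain.from_iterable(feeds), key=lambda e: e.timestamp)

  def tag_phases(timeline, detected_at, stabilized_at):
      """Attach a coarse narrative phase to each event, using the detection
      and stabilization timestamps as boundaries."""
      tagged = []
      for i, event in enumerate(timeline):
          if i == 0:
              phase = "trigger"
          elif event.timestamp < detected_at:
              phase = "escalation"
          elif event.timestamp < stabilized_at:
              phase = "detection_and_response"
          else:
              phase = "stabilization_and_recovery"
          tagged.append((phase, event))
      return tagged

  # Usage (feed variables are hypothetical):
  # timeline = build_timeline(alert_events, chat_events, deploy_events)
  # story = tag_phases(timeline, detected_at, stabilized_at)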

Once you have this scaffold, you can layer on annotations like:

  • "We misinterpreted this alert as a network issue."
  • "This runbook step took 20 minutes due to unclear ownership."
  • "This fallback worked surprisingly well; we should formalize it."

That’s the point: the vertical time map is not just a visualization; it’s a medium for shared narrative—what actually happened and why.


One-Click Draft Postmortems: Documentation Without the Drag

Writing postmortems is essential, but it’s also tedious. Much of the work is mechanical:

  • Collecting timestamps and event sequences
  • Copy-pasting from logs and chat
  • Reconstructing who did what and when
  • Summarizing impact and duration

When you have a vertical time map, you already have the backbone of a postmortem. With the right tooling, you can generate a one-click draft postmortem that includes (a sketch follows the list):

  • A chronological summary of key events
  • Automatically inferred incident phases (detection, mitigation, resolution)
  • Impact windows and affected services/customers
  • A first-pass hypothesis of root causes and contributing factors

Humans still review, refine, and interpret, but they no longer start from zero. This shifts the team’s energy away from rote documentation towards analysis, learning, and reliability improvements.

The result: more consistent postmortems, less friction, and faster time from incident to real improvements.


Shared Timeline, Shared Understanding: Better Collaboration

During an outage, misalignment is expensive:

  • Engineering is focused on mitigating symptoms.
  • Operations is tracking operational risk and capacity.
  • Management needs to understand impact and communicate with stakeholders.

A vertical time map becomes the single source of temporal truth:

  • Engineers see which components are failing, what’s been tried, and which hypotheses are in play.
  • Ops teams see the broader operational context—changes, rollouts, capacity conditions.
  • Leaders and customer-facing teams see when customers were affected, how severe the impact was, and what’s being done.

Because everyone is looking at the same structured timeline, communication improves:

  • Fewer status pings and duplicated explanations
  • Clearer handoffs as shifts change
  • Easier alignment on what "resolved" actually means (and when it happened)

Post-incident, the same map powers collaborative reviews. Different teams can annotate the timeline, highlight blind spots, and propose changes—around a shared, factual backbone instead of conflicting memories.


Revealing Hidden Dependencies with Graphs and Reinforcement Learning

Incidents don’t travel linearly. They propagate through complex webs of dependencies: services, queues, caches, networks, hardware, and human processes.

By treating each incident’s vertical time map as data, we can apply:

  • Graph-based models to represent services, components, and their relationships.
  • Reinforcement learning (RL) to simulate and evaluate different response strategies.

Over time, this unlocks capabilities like:

  • Non-obvious dependency discovery – The system learns that a “minor” service repeatedly precedes major outages, suggesting hidden coupling.
  • Critical path identification – Which nodes in the graph most frequently sit on the path from first signal to customer impact?
  • Policy optimization – RL agents can propose response playbooks that minimize time to mitigation under different scenarios (e.g., "failover earlier," "rate-limit sooner," "prioritize rollback over patching").

The vertical time map is the training data format that makes these techniques practical. Instead of raw logs, we feed structured, contextualized sequences of events into learning systems—and in return, we get better recommendations for future incidents.
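
As one rough sketch of what treating the maps as data can mean: represent each incident as a time-ordered sequence of per-component distress events, count how often one component’s distress precedes another’s within a short window, and keep the edges that recur across incidents as candidate hidden dependencies. The networkx calls below are standard; the window size, support threshold, and component names are arbitrary assumptions.

  from collections import Counter
  from datetime import timedelta

  import networkx as nx

  def precedence_pairs(timeline, window=timedelta(minutes=10)):
      """Yield (earlier_component, later_component) pairs whose distress events
      fall within `window` of each other. Each timeline item is assumed to be
      a (timestamp, component) tuple."""
      for i, (t_a, comp_a) in enumerate(timeline):
          for t_b, comp_b in timeline[i + 1:]:
              if comp_b != comp_a and t_b - t_a <= window:
                  yield comp_a, comp_b

  def build_dependency_graph(incident_timelines, min_support=3):
      """Aggregate precedence pairs across incidents into a directed graph,
      keeping only edges seen in at least `min_support` incidents."""
      counts = Counter()
      for timeline in incident_timelines:
          counts.update(set(precedence_pairs(timeline)))  # count once per incident
      graph = nx.DiGraph()
      for (a, b), support in counts.items():
          if support >= min_support:
              graph.add_edge(a, b, support=support)
      return graph

  # Critical-path candidates, e.g. every simple path from a frequent trigger
  # component to the customer-facing layer (names are illustrative):
  # paths = list(nx.all_simple_paths(graph, source="storage", target="api_gateway"))

The same structure is a natural starting point for the RL ideas above: the graph constrains which states and actions are plausible, and the recorded incident timelines provide the trajectories to learn from.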


The Physics of Failure: Connecting Software Symptoms to Hardware Reality

Modern incidents often appear as software glitches: elevated error rates, timeouts, stuck thread pools. But underneath, the true cause may sit at the hardware or system level:

  • Disk controller errors and latent sector faults
  • Thermal throttling on overloaded hosts
  • Network congestion or misconfigured routing
  • Power events, firmware bugs, or rack-level failures

Physics-of-failure thinking asks: what physical mechanisms could plausibly be causing this pattern of software symptoms?

Integrating that lens into the vertical time map means (a code sketch follows the list):

  • Tagging events that likely correspond to hardware or infrastructure anomalies
  • Linking software symptoms to underlying failure modes (e.g., "IO wait spikes" ↔ "degraded SSD" ↔ "aging hardware cohort")
  • Building a cross-layer narrative: from electrons and spinning disks all the way up to user-facing errors
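
A deliberately simple sketch of that tagging step, assuming a hand-maintained mapping from software symptom patterns to plausible physical failure modes. The symptom names and mappings are illustrative; a real system would also draw on SMART data, host telemetry, and fleet metadata.

  # Illustrative mapping: software symptom -> physical mechanisms worth
  # checking before concluding the problem is in the code.
  SYMPTOM_TO_FAILURE_MODES = {
      "io_wait_spike": ["degraded SSD", "latent sector faults", "disk controller errors"],
      "latency_sawtooth": ["thermal throttling", "CPU frequency scaling"],
      "packet_retransmits": ["network congestion", "misconfigured routing", "faulty NIC"],
      "correlated_host_loss": ["power event", "rack-level failure", "firmware bug"],
  }

  def physics_of_failure_tags(event, detected_symptoms):
      """Attach candidate hardware/infrastructure failure modes to a timeline
      event, based on the symptom patterns detected around its timestamp."""
      candidates = []
      for symptom in detected_symptoms:
          candidates.extend(SYMPTOM_TO_FAILURE_MODES.get(symptom, []))
      return {"event": event.description, "candidate_failure_modes": sorted(set(candidates))}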

This cross-layer visibility helps teams:

  • Avoid over-focusing on code when the real issue is infrastructure aging
  • Make smarter capacity and hardware refresh decisions
  • Design software that is inherently more resilient to known physical failure patterns

Learning Across Incidents: Patterns, Weak Points, and Systematic Gains

A single vertical time map is useful for one outage. A collection of them is transformative.

When you apply a consistent structure across incidents, patterns start to emerge:

  • The same component is often the first to show distress.
  • Certain alerts are noisy and rarely correlate with real impact.
  • Certain mitigations work reliably; others usually waste time.
  • Some teams are consistently brought in too late or too early.

By aggregating vertical time maps, you can (see the sketch after this list):

  • Build heatmaps of where incidents commonly originate or concentrate.
  • Identify chronic weak points in architecture, process, and tooling.
  • Prioritize reliability work based on observed, repeated failure paths.
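
A minimal sketch of that aggregation, assuming each incident map exposes a trigger_component and the set of components that showed distress (both hypothetical attributes). The two counters are the raw material for an origin heatmap and a weak-point ranking.

  from collections import Counter

  def aggregate_incident_maps(incident_maps):
      """Fold many vertical time maps into simple cross-incident statistics:
      where incidents start, and which components keep showing up."""
      origins = Counter()
      chronic = Counter()
      for incident in incident_maps:
          origins[incident.trigger_component] += 1
          chronic.update(set(incident.distressed_components))
      return origins, chronic

  # origins.most_common(5) -> the components that most often start incidents
  # chronic.most_common(5) -> chronic weak points to prioritize for hardening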

Over time, this leads to systematic resilience gains:

  • Better alerting tuned to early, reliable signals of real problems.
  • Hardened services along known critical paths.
  • More effective runbooks and incident playbooks, refined from real data.
  • Organizational learning that compounds from incident to incident.

Conclusion: Build the Clocktower, Not Just Another Dashboard

Outages are not just spikes on charts; they are stories that unfold across systems, people, and customers. A vertical time map—your incident story railway clocktower—turns scattered, noisy data into a coherent, shared narrative:

  • It simplifies documentation with one-click draft postmortems.
  • It improves collaboration between engineering, operations, and leadership.
  • It enables advanced analysis with graph models and reinforcement learning.
  • It connects software symptoms to the underlying physics of failure.
  • It reveals patterns and weak points across incidents, driving long-term reliability.

If your current incident reviews feel like squinting at fragments and guessing at the in-between, it might be time to build the clocktower: a vertical time map that shows not just that an outage happened, but how—and what you can do better next time.
