
The Analog Incident Story Cabinet of Clocks

Building a Wall of Time to See How Outages Really Unfold

If you only look at MTTR dashboards, you’re flying blind.

Mean Time to Resolution (MTTR) is on almost every incident report and SRE slide deck, but by itself it hides more than it reveals. It flattens a rich, messy story into a single number. Incidents don’t live in spreadsheets—they live in time.

Imagine instead a wall of clocks.

Each clock represents an incident, its hands moving through the phases: detection, assignment, engagement, acknowledgment, fix, and resolution. Side by side, they form a “wall of time”—an analog-style visualization of how outages actually unfold across teams, systems, and tools.

That’s the idea behind the Incident Story Cabinet of Clocks: using timeline-based, analog-like views to reveal where incidents really spend their time, and where your operational processes are silently failing.
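
To make the metaphor concrete, here is a minimal sketch of one “clock” as data, assuming you can capture a timestamp per phase transition. The field names are illustrative, not any particular system’s schema:

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class IncidentClock:
        """One incident's lifecycle, captured as phase-transition timestamps."""
        detected: datetime
        assigned: Optional[datetime] = None
        engaged: Optional[datetime] = None       # first meaningful action
        acknowledged: Optional[datetime] = None  # scope confirmed
        fixed: Optional[datetime] = None         # mitigation in place
        resolved: Optional[datetime] = None

        def segments(self) -> dict[str, float]:
            """Durations of the recorded clock segments, in minutes."""
            order = [("detect->assign", self.detected, self.assigned),
                     ("assign->engage", self.assigned, self.engaged),
                     ("engage->ack",    self.engaged, self.acknowledged),
                     ("ack->fix",       self.acknowledged, self.fixed),
                     ("fix->resolve",   self.fixed, self.resolved)]
            return {name: (end - start).total_seconds() / 60
                    for name, start, end in order
                    if start is not None and end is not None}

A wall of time is then just many of these records rendered side by side on a shared time axis.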


Why a “Wall of Time” Beats a Single MTTR Number

Metrics like MTTR make us feel in control. They’re tidy. They fit nicely in OKRs and quarterly reviews. But they are summaries, not explanations.

Consider this:

  • Two incidents can both have a 45-minute MTTR.
  • In one, engineers engaged immediately, but the fix was complex.
  • In the other, the fix took five minutes—but no one picked it up for 40 minutes.

On a dashboard, those incidents look identical. On a wall of time, they look completely different.
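
A few lines of arithmetic make the point; the timestamps below are invented to match the example above:

    from datetime import datetime

    def minutes(start, end):
        return (end - start).total_seconds() / 60

    # Incident A: engaged within a minute, but the fix was complex.
    a_assigned = datetime(2024, 5, 1, 9, 0)
    a_engaged  = datetime(2024, 5, 1, 9, 1)
    a_resolved = datetime(2024, 5, 1, 9, 45)

    # Incident B: a five-minute fix that sat unclaimed for 40 minutes.
    b_assigned = datetime(2024, 5, 1, 9, 0)
    b_engaged  = datetime(2024, 5, 1, 9, 40)
    b_resolved = datetime(2024, 5, 1, 9, 45)

    print(minutes(a_assigned, a_resolved), minutes(b_assigned, b_resolved))  # 45.0 45.0
    print(minutes(a_assigned, a_engaged),  minutes(b_assigned, b_engaged))   # 1.0 40.0

Identical resolution times, wildly different time-to-engage. Only the second number tells you where the pain is.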

A visual timeline makes these differences obvious:

  • Long flat segments where nothing happens
  • Dense bursts of activity
  • Overlaps between related incidents and alerts

By placing incidents on a shared temporal canvas, you stop treating outages as numbers and start treating them as stories unfolding in time.


Per-Table Timelines: Zooming in on How Incidents Really Progress

Most organizations already have lots of structured data: Incident tables, Alert tables, Change tables, Ticket queues, and so on. The problem isn’t lack of data—it’s the lack of a coherent narrative.

A powerful way to restore that narrative is a dedicated timeline view per data table. For example:

  • Incident timeline: From first detection → assignment → engagement → acknowledgment → mitigation → resolution.
  • Alert timeline: From initial alert → deduplication → suppression → correlation → escalation.
  • Change timeline: From change request → approval → deploy start → deploy end → rollback (if any).

By giving each table its own stacked sequence of events over time, you can:

  • Zoom into specific incident types (e.g., SEV-1 vs SEV-3) to see how their life cycles differ.
  • Compare how certain services or teams perform across similar incidents.
  • Spot recurring process delays, such as handoffs or approvals.

Instead of a static list of records, you get a living storyboard of your operations.
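
You can approximate such a storyboard with ordinary plotting tools. Below is a minimal sketch using matplotlib’s broken_barh to draw one incident’s phases as horizontal segments; the phase names and timestamps are invented for illustration, and in practice they would come from your Incident table’s transition records:

    from datetime import datetime

    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    # One incident's lifecycle as (phase, start, end) rows.
    phases = [
        ("detected",     datetime(2024, 5, 1, 9, 0),  datetime(2024, 5, 1, 9, 5)),
        ("assigned",     datetime(2024, 5, 1, 9, 5),  datetime(2024, 5, 1, 9, 25)),
        ("engaged",      datetime(2024, 5, 1, 9, 25), datetime(2024, 5, 1, 9, 40)),
        ("acknowledged", datetime(2024, 5, 1, 9, 40), datetime(2024, 5, 1, 9, 50)),
        ("fixing",       datetime(2024, 5, 1, 9, 50), datetime(2024, 5, 1, 10, 15)),
    ]

    fig, ax = plt.subplots(figsize=(8, 2.5))
    for row, (name, start, end) in enumerate(phases):
        # broken_barh draws horizontal segments from (x_start, width) pairs.
        x0 = mdates.date2num(start)
        ax.broken_barh([(x0, mdates.date2num(end) - x0)], (row - 0.4, 0.8))
    ax.set_yticks(range(len(phases)))
    ax.set_yticklabels([name for name, _, _ in phases])
    ax.invert_yaxis()  # read top-to-bottom in lifecycle order
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
    ax.set_title("One incident, one clock: where the time actually went")
    plt.tight_layout()
    plt.show()

Stack one of these per incident on a shared x-axis and you have a first draft of the wall.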


Why MTTR Alone Is Not Enough

MTTR (Mean Time to Resolution) is typically defined as:

The average time it takes to restore service from the moment an incident begins.

Useful? Yes. But it bundles many distinct activities into one opaque duration:

  • Time to detect and route the incident
  • Time for someone to actually start working on it
  • Time to understand and acknowledge the problem
  • Time to develop and deploy a fix

When you only track MTTR, you can’t tell which part of the process is slow. Are you bad at detection? Engagement? Fixes? Handoffs? You don’t know.

That’s where more granular metrics and timeline views come in.


MTTE: Measuring How Long It Takes Humans to Actually Start

Mean Time to Engage (MTTE) measures the delay between when an incident is assigned to a team or person and when they actually start acting on it.

Think of it as answering the question:

Once it’s on someone’s plate, how long until they really begin?

Why this matters:

  • It surfaces staffing issues: People are overloaded, multitasking, or constantly context-switching.
  • It exposes weak on-call practices: Maybe people aren’t truly reachable or alerts aren’t prioritized.
  • It reveals organizational friction: Confusion about ownership, unclear escalation paths, or approval bottlenecks.

On an analog-style incident clock, MTTE is the segment between:

  1. "Incident assigned" and
  2. "First meaningful action taken"

On a wall of time, you can instantly see where most of the delay lies: in humans starting, not in systems failing.
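
As a minimal sketch of the calculation, assuming you can export (assigned_at, first_action_at) pairs from your incident system (the field names are illustrative):

    from datetime import datetime
    from statistics import mean

    # Hypothetical (assigned_at, first_action_at) pairs per incident.
    incidents = [
        (datetime(2024, 5, 1, 9, 5),   datetime(2024, 5, 1, 9, 25)),
        (datetime(2024, 5, 2, 14, 0),  datetime(2024, 5, 2, 14, 3)),
        (datetime(2024, 5, 3, 22, 10), datetime(2024, 5, 3, 23, 0)),
    ]

    # MTTE: mean delay between assignment and the first meaningful action.
    mtte = mean(
        (first_action - assigned).total_seconds() / 60
        for assigned, first_action in incidents
    )
    print(f"MTTE: {mtte:.1f} minutes")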


MTTR Decomposed: Engage, Acknowledge, Fix

To make MTTR genuinely useful, you need to break it into components, each mapped to a visible phase on your incident clocks:

  • Time to engage – When someone starts working on the incident (the interval MTTE averages).
  • Time to acknowledge – When the person/team confirms they understand the problem scope and are actively handling it.
  • Time to fix – When the underlying issue is mitigated or resolved.

Conceptually, you can think of total resolution time as a sum of parts:

MTTR ≈ MTTE (engage) + MTTA (acknowledge) + MTTF (fix)
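
As a sketch of how the slices add up for a single incident (timestamps invented; phase names as defined above):

    from datetime import datetime

    # Captured transition timestamps for one incident (illustrative values).
    assigned     = datetime(2024, 5, 1, 9, 5)
    engaged      = datetime(2024, 5, 1, 9, 25)   # first meaningful action
    acknowledged = datetime(2024, 5, 1, 9, 40)   # scope confirmed
    resolved     = datetime(2024, 5, 1, 10, 15)

    def minutes(start, end):
        return (end - start).total_seconds() / 60

    tte = minutes(assigned, engaged)       # time to engage
    tta = minutes(engaged, acknowledged)   # time to acknowledge
    ttf = minutes(acknowledged, resolved)  # time to fix
    ttr = minutes(assigned, resolved)      # total resolution time

    # The phases are consecutive, so the slices sum exactly to the whole.
    assert tte + tta + ttf == ttr
    print(f"TTE={tte:.0f}m  TTA={tta:.0f}m  TTF={ttf:.0f}m  TTR={ttr:.0f}m")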

Breaking MTTR into these slices lets you:

  • Isolate bottlenecks – Are we slow to engage, slow to understand, or slow to fix?
  • Target process improvements – Training, runbooks, better alerting, automation, or staffing changes.
  • Communicate clearly – Instead of saying “Our MTTR is bad,” you can say “Our engagement time is fine, but acknowledging and understanding the problem is taking too long.”

On your incident story cabinet, each of these phases becomes a visually distinct segment, not just an invisible contributor to a single number.


Calculating MTTF: What Happens Between Acknowledgment and Fix

In classical reliability engineering, MTTF stands for "Mean Time to Failure." Here, it’s more useful to read it as Mean Time to Fix within the incident lifecycle.

Given:

  • MTTR – Mean Time to Resolution
  • MTTE – Mean Time to Engage
  • MTTA – Mean Time to Acknowledge

You can express your fix time approximately as:

MTTF ≈ MTTR – MTTE – MTTA
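
At the aggregate level the subtraction only holds when all three means are computed over the same set of incidents. A toy example with invented numbers:

    # Means computed over the same set of incidents (illustrative values).
    mttr = 45.0  # minutes: assignment -> resolution
    mtte = 12.0  # minutes: assignment -> first action
    mtta = 8.0   # minutes: first action -> acknowledged

    mttf = mttr - mtte - mtta
    print(f"MTTF ≈ {mttf:.0f} minutes of actual diagnosing and fixing")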

This framing does something important:

  • It highlights the stages between engagement, acknowledgment, and the actual fix.
  • It emphasizes that latency can live anywhere in the process, not just in the coding or deploying of a solution.

On your wall of clocks, MTTF is the segment where the real engineering happens—diagnosing, testing, rolling out changes. If MTTE and MTTA are short but MTTF is long, the improvement work likely lies in:

  • Better runbooks and playbooks
  • Improved observability and logging
  • More resilient architectures and safer deployment patterns

If instead MTTF is short but MTTE or MTTA dominates the clock, you have a process and communication problem, not a technical one.


Analog Timelines + Uptime Tools: A Richer Picture of Outages

Most teams already use uptime monitors, alerting tools, and dashboards. These are essential, but they often show slices of the truth:

  • Uptime graphs: "Was the service up or down? When?"
  • Alert dashboards: "How many alerts per minute? Which ones fired?"
  • Ticket systems: "How many incidents and who owns them?"

What’s missing is how these slices interact over time.

By combining your existing tools with an analog-style incident wall:

  • You overlay monitoring data, alerts, and tickets on the same visual timeline (see the sketch after this list).
  • You see exactly when a spike on your uptime graph aligns with a delayed engagement segment on an incident clock.
  • You can trace how a single deploy triggers a cascading series of alerts and follow-on incidents.
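
Here is one way such an overlay might look; the probe signal, alert times, and deploy marker are all invented for illustration:

    from datetime import datetime

    import matplotlib.dates as mdates
    import matplotlib.pyplot as plt

    # Invented data: an uptime probe's signal plus deploy, alert, and
    # engagement markers for the same window.
    probe_times = [datetime(2024, 5, 1, 9, m) for m in range(0, 60, 5)]
    probe_up    = [1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]   # 1 = up, 0 = down

    deploy  = datetime(2024, 5, 1, 9, 8)
    alerts  = [datetime(2024, 5, 1, 9, 12), datetime(2024, 5, 1, 9, 14)]
    engaged = datetime(2024, 5, 1, 9, 18)

    fig, ax = plt.subplots(figsize=(8, 2.5))
    ax.step(probe_times, probe_up, where="post", label="uptime probe")
    ax.axvline(deploy, linestyle="--", label="deploy")
    for i, t in enumerate(alerts):
        # Label only the first alert so the legend stays readable.
        ax.axvline(t, linestyle=":", label="alert" if i == 0 else "_nolegend_")
    ax.axvline(engaged, linestyle="-.", label="engaged")
    ax.xaxis.set_major_formatter(mdates.DateFormatter("%H:%M"))
    ax.set_yticks([0, 1])
    ax.set_yticklabels(["down", "up"])
    ax.legend(loc="lower right", fontsize=8)
    plt.tight_layout()
    plt.show()

Even this crude overlay answers questions a ticket list cannot: did the deploy precede the outage, and how long after the alerts did anyone engage?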

This creates a richer, more intuitive understanding of:

  • Where outages spend most of their time (waiting in queues, in human delay, or in complex fixes).
  • How multiple systems and incidents interrelate in a single outage scenario.
  • Which improvements would most reduce pain—tooling, automation, staffing, or architecture.


Building Your Own Incident Story Cabinet of Clocks

You don’t need a new product category to get started. You can approximate this approach with tools you already have:

  1. Define your phases

    • Decide which timestamps you can reliably capture: detection, assignment, engagement, acknowledgment, mitigation, resolution.
  2. Instrument your workflows

    • Ensure your ticketing/incident system records these transitions explicitly.
    • Encourage responders to update incident states as part of standard practice.
  3. Create per-table timelines

    • For Incidents, Alerts, and Changes, generate time-based visualizations rather than just tabular lists.
    • Align them on a shared time axis so you can see overlaps.
  4. Compute MTTE, MTTA, MTTF, and decomposed MTTR

    • Use your captured timestamps to calculate each component, not just MTTR (see the sketch after this list).
  5. Review incidents as stories, not just numbers

    • In post-incident reviews, walk through the timeline clock-style.
    • Ask: Where did time pool? What was preventable? What was confusion vs. complexity?
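
For step 4, a small pandas sketch can compute every slice at once. The column names and severity grouping are assumptions about your export format, not any particular tool’s schema:

    import pandas as pd

    # Illustrative incident export with one row per incident.
    df = pd.DataFrame({
        "severity":     ["SEV-1", "SEV-3", "SEV-1"],
        "assigned":     pd.to_datetime(["2024-05-01 09:05",
                                        "2024-05-02 14:00",
                                        "2024-05-03 22:10"]),
        "engaged":      pd.to_datetime(["2024-05-01 09:25",
                                        "2024-05-02 14:03",
                                        "2024-05-03 23:00"]),
        "acknowledged": pd.to_datetime(["2024-05-01 09:40",
                                        "2024-05-02 14:20",
                                        "2024-05-03 23:10"]),
        "resolved":     pd.to_datetime(["2024-05-01 10:15",
                                        "2024-05-02 14:45",
                                        "2024-05-03 23:30"]),
    })

    # Derive each slice as a duration in minutes.
    for col, start, end in [("tte", "assigned", "engaged"),
                            ("tta", "engaged", "acknowledged"),
                            ("ttf", "acknowledged", "resolved"),
                            ("ttr", "assigned", "resolved")]:
        df[col] = (df[end] - df[start]).dt.total_seconds() / 60

    # Decomposed means per severity: the numbers behind each clock segment.
    print(df.groupby("severity")[["tte", "tta", "ttf", "ttr"]].mean().round(1))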

Over time, this cabinet of clocks becomes a shared visual language for your organization’s reliability work.


Conclusion: From Numbers to Narratives

Outages are not just failures of systems; they are also reflections of how your organization thinks, acts, and communicates under stress.

A single MTTR number can’t tell you that story.

An analog incident story cabinet—a wall of time made up of per-incident clocks and per-table timelines—can. It:

  • Makes delays and bottlenecks literally visible.
  • Turns abstract metrics like MTTR into decomposed, actionable insights.
  • Connects human behavior, process design, and system behavior on one shared canvas.

If you want to truly understand how outages unfold—and how to shorten them—the first step is simple:

Stop just looking at numbers. Start looking at time.
