The Paper Incident Story Time Machine: Rewinding Outages With Hand‑Drawn Alternate Timelines
How hand‑drawn, paper-based alternate timelines can transform incident reviews into powerful counterfactual story labs—especially for distributed teams handling complex outages.
When a big outage hits, our tools explode with data: alerts, logs, traces, chats, tickets, deployment events, on‑call rotations, and more. Afterward, we dutifully open a doc and begin the post‑incident review.
But most reviews stay trapped in a single, linear narrative: “Here’s what happened, then this, then that.” Lost in that neat sequence is the messy reality of what could have happened—the roads not taken, the alternative decisions, the earlier signals we didn’t connect.
This is where the Paper Incident Story Time Machine comes in: a low‑tech, high‑insight way to rewind outages and explore hand‑drawn alternate timelines. It combines concepts from impact evaluation, counterfactual analysis, and sociotechnical systems thinking—without adding more dashboards.
Why We Need Alternate Timelines, Not Just Root Causes
Most traditional incident reviews chase a single root cause and produce a canonical story. That’s useful, but incomplete.
Real outages are:
- Sociotechnical: A mix of people, tools, alerts, and organizational structures.
- Distributed: Teams spread across time zones, communicating mostly via chat and tickets.
- Nonlinear: Multiple threads of work, partial understandings, and evolving mental models.
To learn deeply, we need to ask: “Compared to what?” What if we’d paged a different person? Rolled back sooner? Paid attention to a different signal?
This is where counterfactual analysis comes in.
Counterfactuals 101: Selecting a Good Comparison Group
In impact evaluation (e.g., public policy, medicine, economics), we rarely ask, “Did X work?” in isolation. We ask:
Compared to what would have happened without X?
That “what would have happened” is called a counterfactual. To reason about it, you need a comparison group:
- In medicine: treated patients vs. similar untreated patients.
- In product experiments: users in an A/B test.
- In incidents: one outage vs. other similar outages or alternate plausible choices.
For incidents, useful comparison groups might be:
- Similar past incidents: Same service, similar failure mode
- Parallel incidents: Happening in the same time frame on other systems
- Hypothetical branches: “What if we’d acted on this alert 30 minutes earlier?”
Choosing the right comparison group is central to impact evaluation in incidents. Without it, we just tell a story about what happened, not what might have been different.
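To make this concrete, here is a minimal Python sketch of pulling a comparison cohort of similar past incidents. The `Incident` fields and the matching rule (same service, same failure mode, within roughly the last year) are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Incident:
    id: str
    service: str        # e.g. "checkout-api"
    failure_mode: str   # e.g. "db-saturation", "bad-deploy"
    started_at: datetime
    detected_at: datetime
    mitigated_at: datetime

def comparison_group(target: Incident, history: list[Incident],
                     lookback_days: int = 365) -> list[Incident]:
    """Select past incidents that are plausible comparisons:
    same service, same failure mode, recent enough to still be relevant."""
    cutoff = target.started_at - timedelta(days=lookback_days)
    return [
        inc for inc in history
        if inc.id != target.id
        and inc.service == target.service
        and inc.failure_mode == target.failure_mode
        and inc.started_at >= cutoff
    ]
```

Even a crude filter like this forces you to state explicitly what “similar” means before you start comparing outcomes.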
Ex Ante vs Ex Post: When You Design Your Counterfactuals
Counterfactual analysis around incidents can happen at two different times:
1. Ex Ante (Prospective): Before the Incident Happens
This is when you design experiments or scenarios in advance:
- Chaos experiments: “If region A fails, here’s how we expect detection and response to unfold. We’ll compare this to what actually happens.”
- Runbooks: “If alert X fires, we’ll try approach A first, then compare to past similar incidents that used approach B.”
Here, you’re intentionally setting up comparison conditions ahead of time, so you can later evaluate which approach led to faster mitigation, better communication, or fewer regressions.
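One lightweight way to make the ex ante comparison concrete is to write down the expected response before the experiment and diff it against what actually happens. The sketch below is a minimal illustration with made-up milestone names and durations; it is not tied to any particular chaos tooling.

```python
from datetime import timedelta

# Expectations written down *before* the chaos experiment (ex ante).
expected = {
    "detection": timedelta(minutes=5),        # alert should fire within 5 min
    "first_responder": timedelta(minutes=10),
    "mitigation": timedelta(minutes=30),
}

# What was actually observed during the experiment (or the next real incident).
observed = {
    "detection": timedelta(minutes=12),
    "first_responder": timedelta(minutes=14),
    "mitigation": timedelta(minutes=41),
}

for milestone, target in expected.items():
    delta = observed[milestone] - target
    status = "on target" if delta <= timedelta(0) else f"{delta} late"
    print(f"{milestone:16s} expected {target}  observed {observed[milestone]}  -> {status}")
```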
2. Ex Post (Retrospective): After the Incident
This is where most teams start.
Post‑incident, you:
- Compare this outage to prior incidents.
- Ask, “What if we had noticed this earlier?”
- Explore alternative routing, escalation, or rollback decisions.
Both approaches are useful. Ex ante design shapes how you collect data during incidents. Ex post analysis shapes how you interpret that data later.
The Paper Time Machine technique works in both modes, but is especially powerful ex post, when you already have a messy history to untangle.
Step 1: Build a Coherent Timeline With Time‑Bound Grouping
Incidents rarely consist of a single alert and a single fix. More often, they’re a storm of:
- Alerts
- Symptom reports
- Auto‑healing events
- Partial rollbacks
- Slack messages
- Service degradations in different regions
If we treat each alert or ticket as an isolated event, we miss the story.
Instead, define a time frame and deliberately group signals and events into a coherent incident narrative:
- Pick a window: From the first user impact or major alert to well after mitigation (often +1–2 hours for lingering effects).
- Gather all relevant signals:
  - Monitoring alerts (across all related services)
  - CI/CD events, config changes, deploys
  - Customer support tickets and status‑page updates
  - Chat logs and incident channels
- Cluster related events:
  - Group alerts that spike in the same 10–15 minute windows
  - Combine “symptom” alerts (latency, errors) with “cause” candidates (deploys, resource saturation)
This lets you construct a single, richer timeline, instead of many unconnected alarms.
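If you want to pre-cluster events before the paper session, a simple time-bound grouping pass goes a long way. The sketch below assumes events arrive as (timestamp, description) pairs and starts a new cluster whenever the gap since the previous event exceeds 15 minutes; both the shape and the threshold are assumptions you would tune to your own systems.

```python
from datetime import datetime, timedelta

def group_into_windows(events, gap=timedelta(minutes=15)):
    """Cluster (timestamp, description) events: a new group starts whenever
    the gap since the previous event exceeds the threshold."""
    ordered = sorted(events, key=lambda e: e[0])
    groups, current = [], []
    for ts, desc in ordered:
        if current and ts - current[-1][0] > gap:
            groups.append(current)
            current = []
        current.append((ts, desc))
    if current:
        groups.append(current)
    return groups

# Example: a deploy, two symptoms, and a later alert collapse into two clusters.
t0 = datetime(2024, 5, 3, 14, 0)
events = [
    (t0, "deploy checkout-api v142"),
    (t0 + timedelta(minutes=6), "latency alert: checkout-api p99"),
    (t0 + timedelta(minutes=9), "support ticket: checkout timeouts"),
    (t0 + timedelta(minutes=55), "error-rate alert: payments"),
]
for i, grp in enumerate(group_into_windows(events), start=1):
    print(f"cluster {i}: {[desc for _, desc in grp]}")
```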
Step 2: Aggregate Multiple Incidents Into One Outage Narrative
Modern systems often suffer chains of related incidents:
- A database degradation on Monday
- A follow‑on cache issue on Tuesday
- A partial rollback on Wednesday that introduces a new bug
Treating these as separate, unrelated tickets obscures the underlying dynamics. For meaningful learning, post‑incident reviews should:
- Aggregate related incidents into a broader “outage narrative.”
- Examine recurring signals across days or weeks.
- Identify slow‑burning conditions (e.g., capacity limits, brittle runbooks, unclear ownership).
Your time machine isn’t just replaying one bad hour; it’s reconstructing an arc that may span multiple days and incidents.
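One rough way to surface such arcs automatically is to link incidents that touch related services and sit within a few days of each other. The sketch below uses made-up ticket records, a hypothetical dependency set, and a seven-day linking window; all of it is illustrative rather than a real integration.

```python
from datetime import datetime, timedelta

# Minimal incident records; in practice these would come from your ticketing system.
incidents = [
    {"id": "INC-101", "service": "orders-db",    "started": datetime(2024, 5, 6)},
    {"id": "INC-104", "service": "orders-cache", "started": datetime(2024, 5, 7)},
    {"id": "INC-109", "service": "orders-api",   "started": datetime(2024, 5, 8)},
    {"id": "INC-130", "service": "billing",      "started": datetime(2024, 5, 21)},
]

# Hypothetical dependency map: services that share the same "blast area".
related_services = {"orders-db", "orders-cache", "orders-api"}

def outage_narrative(incidents, services, window=timedelta(days=7)):
    """Group incidents on related services that occur within `window`
    of each other into one candidate outage narrative."""
    hits = sorted(
        (i for i in incidents if i["service"] in services),
        key=lambda i: i["started"],
    )
    narrative = []
    for inc in hits:
        if narrative and inc["started"] - narrative[-1]["started"] > window:
            break  # too far apart; treat as a separate arc
        narrative.append(inc)
    return narrative

print([i["id"] for i in outage_narrative(incidents, related_services)])
# -> ['INC-101', 'INC-104', 'INC-109']
```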
Step 3: Bring Out the Paper – Drawing the Main Timeline
Now the fun part: go analog.
Grab a large sheet of paper or a whiteboard. Draw a horizontal timeline across it and mark:
- Key times (T0 = first symptom, T+15, T+30, etc.)
- Technical events (deploys, rollbacks, failovers, capacity changes)
- Signals (alerts firing, error spikes, user reports)
- Human actions (who joined, when escalations happened, major decisions)
This is your “primary reality” timeline: what actually happened.
Underneath each major event, add short annotations like:
- “Chose rollback vs. feature flag off.”
- “Dismissed CPU alert as noisy.”
- “Assumed problem was region‑specific.”
You are not just reconstructing events—you’re starting to capture how the team was thinking.
Step 4: Draw Alternate Timelines – Your Counterfactual Branches
Now turn that single timeline into a story lab.
Identify key decision points where other choices were plausible:
- A different alert could have been taken seriously.
- A different person or team might have been paged first.
- A rollback could have happened earlier or later.
- A risky mitigation might have been avoided.
From each of these points, branch off an alternate timeline:
- Use a different color pen for “What if we had…?” paths.
- Annotate estimated impacts: “Likely 30 min faster recovery,” “May have caused wider blast radius,” etc.
You are now doing structured counterfactual analysis:
- Comparing actual outcomes to plausible alternatives.
- Identifying which choices really mattered vs. which felt dramatic but made little difference to the outcome.
- Revealing where your comparison group (other incidents, other decisions) suggests a different path has historically worked better.
This is not about blame. It’s about expanding your understanding of the space of possible actions in future incidents.
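If you want to carry the branches beyond the paper, a small structure is enough to record each decision point, the alternative considered, and a deliberately hedged impact estimate. All field names and example values below are illustrative.

```python
from dataclasses import dataclass

@dataclass
class CounterfactualBranch:
    decision_point: str    # where the timeline forks
    actual_choice: str
    alternative: str
    estimated_effect: str  # deliberately fuzzy: a hedge, not a measurement
    confidence: str        # "low" / "medium" / "high"

branches = [
    CounterfactualBranch(
        decision_point="T+20: noisy CPU alert dismissed",
        actual_choice="ignored alert, kept debugging the app layer",
        alternative="check host saturation immediately",
        estimated_effect="likely ~30 min faster recovery",
        confidence="medium",
    ),
    CounterfactualBranch(
        decision_point="T+45: rollback vs. feature flag",
        actual_choice="full rollback",
        alternative="disable the feature flag only",
        estimated_effect="unclear; may have risked a wider blast radius",
        confidence="low",
    ),
]

for b in branches:
    print(f"[{b.confidence}] {b.decision_point} -> what if: {b.alternative} ({b.estimated_effect})")
```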
Step 5: Zoom In on the Sociotechnical System – Especially for Distributed Teams
Distributed teams aren’t just “remote people with laptops.” They are sociotechnical systems:
- Work is coordinated through tools: chat, incident bots, ticketing systems.
- Decision‑making is shaped by what these tools highlight—or hide.
- Shared situational awareness is assembled in channels, threads, dashboards, and call bridges.
During your paper session, explicitly map communication onto the timeline:
- When did the first incident channel or bridge start?
- Who joined when, from where, and through which medium?
- Which messages changed the team’s understanding? (e.g., “DB graph just spiked”, “Users only affected in EU”, “Rollback completed”)
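If your chat tool can export the incident channel, you can pre-annotate the paper timeline with when people joined and which messages plausibly shifted the team's understanding. The sketch below assumes a simple export of (timestamp, author, text) tuples and a hand-picked list of keyword markers; both are rough assumptions, not a real integration.

```python
from datetime import datetime

# Hypothetical incident-channel export: (timestamp, author, text).
messages = [
    (datetime(2024, 5, 3, 14, 7),  "ana", "joining, looking at checkout latency"),
    (datetime(2024, 5, 3, 14, 18), "raj", "DB graph just spiked, might not be the deploy"),
    (datetime(2024, 5, 3, 14, 26), "mei", "users only affected in EU, rerouting traffic"),
]

# Crude heuristic: messages containing these words often mark mental-model shifts.
shift_markers = ("spiked", "only affected", "actually", "turns out", "rolled back")

first_seen = {}
for ts, author, text in messages:
    first_seen.setdefault(author, ts)  # when each responder first appears
    if any(marker in text.lower() for marker in shift_markers):
        print(f"{ts:%H:%M}  possible model shift ({author}): {text}")

for author, ts in first_seen.items():
    print(f"{ts:%H:%M}  {author} joined the channel")
```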
Notice that distributed and face‑to‑face teams process information differently:
- In co‑located war rooms, you get rapid side conversations, glances at someone’s screen, overheard remarks.
- In chat‑first environments, information must be typed, threaded, and read—which means it can be delayed, lost, or misinterpreted.
This shapes how:
- Shared mental models form: Everyone’s internal picture of “what’s going on” and “what we’re trying next.”
- Decisions are made: Who feels empowered to act? Who waits for explicit approval?
- Uncertainty is handled: Are people comfortable saying “I don’t know”? Do they propose hypotheses in chat?
On your paper timeline, mark:
- Key communication gaps (“Assumption made in DMs, never surfaced to main channel”).
- Moments where mental models shifted (“Realized traffic was rerouted weeks ago,” “Discovered canary results weren’t wired into alerts”).
These become leverage points for improving team tooling, norms, and training, not just code.
Turning Insights Into Change
Your Paper Incident Story Time Machine session should end with concrete outputs:
- Improved timelines as artifacts
  - Save photos of your annotated timelines.
  - Translate them into a more readable digital form for the post‑incident document.
- Better comparison groups
  - Identify sets of similar incidents to track as a cohort.
  - Define metrics (time to detection, time to mitigation, user impact) to compare across these groups (see the sketch after this list).
- Ex ante experiments
  - Turn promising alternate timelines into proposed runbook changes or chaos tests.
  - Example: “Next time alert X fires, we’ll try path B and measure outcomes.”
- Sociotechnical improvements
  - Changes to on‑call rotations, escalation paths, or incident channel practices.
  - Chat conventions to make critical information more visible.
  - Tooling to surface related alerts and incidents automatically within defined time frames.
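Comparing across a cohort only works if the metrics are computed the same way for every incident. Here is a minimal sketch of the comparison mentioned above, assuming each incident record carries start, detection, and mitigation timestamps (all example data is made up).

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical cohort of similar incidents with key timestamps.
cohort = [
    {"id": "INC-101", "started": datetime(2024, 5, 6, 9, 0),
     "detected": datetime(2024, 5, 6, 9, 12), "mitigated": datetime(2024, 5, 6, 10, 5)},
    {"id": "INC-076", "started": datetime(2024, 3, 2, 22, 30),
     "detected": datetime(2024, 3, 2, 22, 34), "mitigated": datetime(2024, 3, 2, 23, 1)},
    {"id": "INC-058", "started": datetime(2024, 1, 15, 4, 10),
     "detected": datetime(2024, 1, 15, 4, 40), "mitigated": datetime(2024, 1, 15, 6, 0)},
]

def minutes(delta: timedelta) -> float:
    return delta.total_seconds() / 60

time_to_detect = [minutes(i["detected"] - i["started"]) for i in cohort]
time_to_mitigate = [minutes(i["mitigated"] - i["started"]) for i in cohort]

print(f"median time to detection:  {median(time_to_detect):.0f} min")
print(f"median time to mitigation: {median(time_to_mitigate):.0f} min")
```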
Conclusion: Drawing Our Way to Better Incidents
Hand‑drawn alternate timelines might seem quaint in an era of AI‑powered observability, but that’s exactly their strength. They:
- Slow us down enough to see how incidents actually unfold across tools, teams, and time.
- Encourage explicit counterfactual thinking: not just “what happened,” but “what else could have happened?”
- Reveal the sociotechnical nature of outages, especially for distributed teams whose collaboration lives in tools, not rooms.
By deliberately selecting comparison groups, designing ex ante and ex post counterfactual analyses, grouping alerts into coherent time‑bound narratives, and examining how communication shapes team mental models, we can turn post‑incident reviews into powerful labs for learning.
The Paper Incident Story Time Machine doesn’t replace your dashboards or incident bots. It complements them—by turning complex data and human decisions into stories you can see, question, and rewrite.
And the next time the pager goes off, you’ll have more than a runbook. You’ll have a richer map of the alternate timelines you’ve already explored—on paper.