The Analog Incident Lighthouse Staircase: A Paper‑Based Guide Through Live Outages
How to design and build the Analog Incident Lighthouse Staircase—a simple, paper‑based, step‑by‑step visualization that keeps your team aligned, your data consistent, and your systems like PagerDuty and CrateDB in sync during live incidents.
Introduction
During a live production incident, teams rarely fail for lack of tools. They fail when cognitive load spikes, communication frays, and people stop sharing the same mental model of what’s happening.
The Analog Incident Lighthouse Staircase is a deliberately low‑tech answer to that problem: a paper‑based, step‑by‑step visualization of incident progress that stays visible, stable, and grounded in the business context—even as dashboards, alerts, and Slack channels explode with noise.
This post explains:
- What the Lighthouse Staircase is and why it matters
- How it fits into an incident workflow with PagerDuty and CrateDB
- How it helps keep data consistent and reduces ingestion errors
- How to build your own staircase with simple materials
- How to use it effectively, with human factors and real‑time dependencies in mind
What Is the Analog Incident Lighthouse Staircase?
The Lighthouse Staircase is a step‑by‑step, physical timeline of an incident that you assemble in real time on paper.
Think of it as:
- A vertical, stair‑shaped representation of an incident’s lifecycle, from detection through resolution and retrospective.
- A single shared reference point the whole room can see at a glance.
- A data collection scaffold that mirrors the structure of how you later ingest data into systems like CrateDB.
Each “step” on the staircase represents a discrete stage or event in the incident:
- Detection
- Triage started
- Impact assessed
- Hypothesis formed
- Mitigation applied
- Verification
- Communication updates
- Resolution declared
- Follow‑up / postmortem tasks
You build the staircase in real time with pre‑printed cards or sticky notes, arranged on a wall or large board. Each step captures:
- Time
- Owner
- System(s) / dependency involved
- Decision or action taken
- Business impact
By the end of the incident, you have a clean, ordered, human‑readable log of what happened that can be directly translated into structured records for CrateDB, PagerDuty incident notes, and your postmortem template.
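The five fields captured on each step map naturally onto a small record type. The following is a minimal sketch in Python, not a prescribed schema: the names `StairCard`, `as_record`, and `timeline` are hypothetical, and the field names are only one plausible way to spell the list above.

```python
from dataclasses import dataclass, asdict
from datetime import datetime, timezone


@dataclass(frozen=True)
class StairCard:
    """One card on the staircase (all field names are illustrative)."""
    time: datetime          # when the event happened (timezone-aware)
    step: str               # staircase step, e.g. "Detection"
    owner: str              # person responsible for the step
    system: str             # system or dependency involved
    action: str             # decision or action taken
    business_impact: str    # impact described in business terms


def as_record(card: StairCard) -> dict:
    """Flatten a card into a plain dict with an ISO 8601 timestamp."""
    record = asdict(card)
    record["time"] = card.time.isoformat()
    return record


def timeline(cards: list[StairCard]) -> list[dict]:
    """Return records ordered by time, as they would read on the wall."""
    return [as_record(c) for c in sorted(cards, key=lambda c: c.time)]
```

Keeping the record flat and string-friendly is a deliberate choice here: it makes the later transcription step a matter of copying values rather than interpreting them.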
Why Analog? And Why During a Live Outage?
During a serious outage, two things break quickly:
- Shared context: People see different slices of data, infer different stories, and talk past each other.
- Reliable data capture: Notes are scattered across Slack, Zoom, terminals, and people’s heads. Later, when you sync with CrateDB or your incident management system, you fight missing timestamps, inconsistent terminology, and gaps.
An analog, paper‑based artifact solves several of these problems:
- Low cognitive overhead: No UI to learn under pressure. Writing on a card and sticking it to a wall is trivial.
- High visibility: Everyone in the room (or on video, via a camera pointed at the wall) sees the same reality.
- Stable structure: The staircase layout enforces a consistent, chronological format.
- Data discipline: If a step doesn’t get a card, it’s probably not recorded anywhere else either—so gaps are obvious.
Most importantly, the Lighthouse Staircase becomes the authoritative analog reference when you later:
- Reconstruct the timeline
- Enrich records in CrateDB
- Cross‑check PagerDuty events
- Write the post‑incident review
How the Staircase Fits with PagerDuty and CrateDB
In a modern incident workflow, you might:
- Trigger and manage incidents in PagerDuty
- Stream or batch incident data into CrateDB for analysis, dashboards, and historical queries
The Lighthouse Staircase sits alongside these tools, not instead of them.
During the Incident
- PagerDuty handles alerts, escalation, and roles (incident commander, scribe, etc.).
- The staircase provides a business‑aligned, human‑readable timeline of what the team is doing and why.
- The scribe (or a dedicated timeline owner) keeps the staircase updated in real time.
After the Incident
You use the staircase as a ground‑truth transcript:
- Validate timestamps against PagerDuty event logs.
- Normalize terminology (service names, runbook IDs, business impacts) before ingesting into CrateDB.
- Fill in gaps where automation failed to capture context (e.g., the rationale behind a rollback attempt).
This workflow is designed to reduce:
- Ingestion errors (mismatched fields, missing steps, corrupted sequences)
- Inconsistent classification (e.g., severity vs. actual business impact)
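The post‑incident checks described above (normalizing terminology and validating card times against event logs) could be sketched as follows. This is an illustration under stated assumptions: the alias table and the two‑minute tolerance are made up, and PagerDuty's actual event format is not modeled. Both cards and events are assumed to have already been reduced to a mapping of label to `datetime`.

```python
from datetime import datetime, timedelta

# Hypothetical alias table: handwritten or spoken names -> canonical service name.
SERVICE_ALIASES = {
    "checkout": "checkout-api",
    "checkout api": "checkout-api",
    "pg": "orders-postgres",
}


def normalize_service(name: str) -> str:
    """Map a handwritten service name to its canonical form.

    Case and extra whitespace are ignored; unknown names pass through lowercased.
    """
    key = " ".join(name.lower().split())
    return SERVICE_ALIASES.get(key, key)


def timestamp_mismatches(cards, events, tolerance=timedelta(minutes=2)):
    """Return labels whose card time differs from the event time by more than tolerance.

    cards and events are dicts of label -> datetime. Labels with no matching
    event are skipped here and should be reviewed by hand.
    """
    return sorted(
        label
        for label, card_time in cards.items()
        if label in events and abs(card_time - events[label]) > tolerance
    )
```

A mismatch does not say which source is wrong; it only flags a step that someone should look at before the record is ingested.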
CrateDB then becomes the system where you:
- Analyze patterns across incidents
- Build reports on MTTR, blast radius, etc.
- Query dependencies and recurring failure modes
The staircase is what keeps those CrateDB records coherent, complete, and comparable.
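As an illustration of why complete timestamps matter for reports such as MTTR, here is the arithmetic in a few lines of Python. In practice this aggregation would more likely run as a query inside CrateDB; the function name and the `detected` / `resolved` keys are assumptions made for the example.

```python
from datetime import datetime, timedelta


def mean_time_to_resolve(incidents):
    """Average of (resolved - detected) over incidents that have both timestamps.

    Returns None when no incident has both values.
    """
    durations = [
        i["resolved"] - i["detected"]
        for i in incidents
        if i.get("detected") and i.get("resolved")
    ]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)
```

Note that incidents with a missing timestamp silently drop out of the average, which is exactly the kind of distortion a complete staircase is meant to prevent.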
Prerequisite: Know What Matters and How Things Work
The staircase is only useful if it reflects your business and your systems.
Before building it, you must:
- Clarify business priorities
  - What is truly critical? (Payment processing, order placement, patient data, etc.)
  - How is severity defined? (Revenue, customers blocked, safety, regulatory exposure.)
- Map key systems and dependencies
  - Which services are customer‑facing?
  - Which internal systems do they depend on (databases, message queues, third‑party APIs)?
  - Where do monitoring and alerts originate?
- Define recovery goals
  - RTO and RPO for critical systems.
  - Response SLAs per severity.
These inputs directly shape:
- The fields printed on your staircase cards
- The steps you include in the staircase
- Which actions are mandatory to record
Without this alignment, you’ll collect data—but not the data you actually need.
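One way to make these inputs concrete is to write them down as a small policy table before printing any cards. The sketch below is purely illustrative: the severity names, SLA durations, and required‑field sets are placeholder values, not recommendations, and `response_within_sla` is a hypothetical helper.

```python
from datetime import datetime, timedelta

# Hypothetical policy: severity -> response SLA and the fields a card must carry.
# All values are placeholders; substitute your own definitions.
SEVERITY_POLICY = {
    "sev1": {
        "response_sla": timedelta(minutes=5),
        "required_fields": {"time", "owner", "system", "action", "business_impact"},
    },
    "sev2": {
        "response_sla": timedelta(minutes=15),
        "required_fields": {"time", "owner", "system", "action"},
    },
    "sev3": {
        "response_sla": timedelta(hours=1),
        "required_fields": {"time", "owner", "action"},
    },
}


def response_within_sla(severity: str, detected_at: datetime, triage_started_at: datetime) -> bool:
    """True if triage started within the response SLA for this severity."""
    sla = SEVERITY_POLICY[severity]["response_sla"]
    return triage_started_at - detected_at <= sla
```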
How to Build the Lighthouse Staircase (Step‑By‑Step)
You don’t need anything fancy. You need clarity and consistency.
Materials
- Large wall, whiteboard, or foam board visible to everyone
- Painter’s tape or thick markers to draw the staircase shape
- Pre‑printed cards or sticky notes (ideally in multiple colors)
- Thick markers (dark colors, high contrast for cameras)
- Optional: webcam or phone stand to stream the wall into remote calls
Step 1: Design the Staircase Layout
On your board or wall, draw a big staircase going up from left to right:
- Each step = one major phase or significant event.
- The vertical rise = time passing and the response progressing toward resolution.
Label the steps across the bottom or side. A simple starting template:
- Detection & Declaration
- Triage & Assignment
- Impact & Scope Assessment
- Hypothesis & Plan
- Mitigation / Change Applied
- Verification & Monitoring
- Communication & Stakeholders
- Resolution Declared
- Follow‑Up & Postmortem Tasks
You can combine or expand steps based on your process maturity.
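If the board labels are later going to be stored as canonical lifecycle stages, it may help to fix that mapping in one place. A minimal sketch, assuming the nine‑step template above; the `Stage` enum and `stage_for` helper are hypothetical names.

```python
from enum import IntEnum


class Stage(IntEnum):
    """Canonical lifecycle stages, numbered in staircase order."""
    DETECTION = 1
    TRIAGE = 2
    IMPACT = 3
    HYPOTHESIS = 4
    MITIGATION = 5
    VERIFICATION = 6
    COMMUNICATION = 7
    RESOLUTION = 8
    FOLLOW_UP = 9


# Board labels from the template above -> canonical stage.
BOARD_LABELS = {
    "Detection & Declaration": Stage.DETECTION,
    "Triage & Assignment": Stage.TRIAGE,
    "Impact & Scope Assessment": Stage.IMPACT,
    "Hypothesis & Plan": Stage.HYPOTHESIS,
    "Mitigation / Change Applied": Stage.MITIGATION,
    "Verification & Monitoring": Stage.VERIFICATION,
    "Communication & Stakeholders": Stage.COMMUNICATION,
    "Resolution Declared": Stage.RESOLUTION,
    "Follow-Up & Postmortem Tasks": Stage.FOLLOW_UP,
}


def stage_for(label: str) -> Stage:
    """Look up the canonical stage for a board label; raise on unknown labels."""
    # Treat the non-breaking hyphen (U+2011) like a plain hyphen.
    key = label.strip().replace("\u2011", "-")
    try:
        return BOARD_LABELS[key]
    except KeyError:
        raise ValueError(f"Unknown staircase label: {label!r}") from None
```

If you combine or split steps, only the table changes; anything that consumes `Stage` values keeps working.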
Step 2: Define Card Templates
Create pre‑printed cards or sticky‑note templates for the most common event types. For example:
- Detection card (e.g., blue):
  - Time:
  - Source (PagerDuty service, monitor name):
  - Detected by (tool or person):
  - Symptom (short):
- Action / Change card (e.g., yellow):
  - Time:
  - Owner:
  - System / dependency:
  - Action taken:
  - Expected effect:
- Impact card (e.g., red):
  - Time:
  - Affected business capability:
  - Customers affected (estimate):
  - Severity:
- Communication card (e.g., green):
  - Time:
  - Audience (internal / external):
  - Channel (email, status page, Slack):
  - Summary:
Each card maps directly to fields you plan to store in CrateDB and/or PagerDuty’s notes/custom fields. That mapping is how you avoid data drift.
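The card‑to‑field mapping can be written down explicitly so that transcription is mechanical. The sketch below assumes the four card types above; the column names on the right‑hand side are hypothetical, and values are assumed to be transcribed as plain strings.

```python
# Hypothetical mapping: printed label on each card type -> column name.
CARD_FIELDS = {
    "detection": {
        "Time": "time",
        "Source": "source",
        "Detected by": "detected_by",
        "Symptom": "symptom",
    },
    "action": {
        "Time": "time",
        "Owner": "owner",
        "System / dependency": "system",
        "Action taken": "action",
        "Expected effect": "expected_effect",
    },
    "impact": {
        "Time": "time",
        "Affected business capability": "business_capability",
        "Customers affected": "customers_affected",
        "Severity": "severity",
    },
    "communication": {
        "Time": "time",
        "Audience": "audience",
        "Channel": "channel",
        "Summary": "summary",
    },
}


def card_to_record(card_type: str, written: dict) -> dict:
    """Convert handwritten label/value pairs into a column-keyed record.

    Raises ValueError if a printed field was left blank or an unexpected
    label appears, so gaps surface at transcription time.
    """
    fields = CARD_FIELDS[card_type]
    unknown = sorted(set(written) - set(fields))
    missing = [label for label in fields if not written.get(label, "").strip()]
    if unknown or missing:
        raise ValueError(f"{card_type} card: missing={missing}, unknown={unknown}")
    record = {"card_type": card_type}
    record.update({fields[label]: written[label].strip() for label in fields})
    return record
```

Rejecting blank or unexpected fields at this point is a design choice: it is cheaper to ask the scribe about one card the same day than to discover the hole during analysis.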
Step 3: Establish Roles
- Incident Commander (IC): Directs the response, calls out when a new card is needed.
- Scribe / Timeline Owner: Owns the staircase. Writes cards and places them.
- Technical Leads: Ensure their actions are represented accurately on the staircase.
Make it explicit: nothing “counts” as part of the official story unless it’s on the staircase.
Step 4: Use the Staircase in Real Time
During the incident:
- As soon as the incident is declared, the scribe places the first Detection card on the first step.
- When ownership is clear and triage begins, the scribe adds a card to the Triage & Assignment step (who’s leading, what’s the focus).
- As hypotheses form, mitigations are applied, and impact estimates change, new cards go on the corresponding steps.
- When communication goes out (e.g., public status page), the scribe adds Communication cards.
- When the IC declares Resolution, the scribe adds the final card and starts a final pass to ensure no obvious gaps.
The IC regularly “reads the staircase” aloud to:
- Keep everyone aligned
- Validate order and correctness
- Decide next actions
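The final pass for gaps and the IC's check of order can also be expressed as two small checks over the transcribed cards. This is a sketch only: it assumes the nine‑step template from Step 1 and represents each card as a hypothetical `(step, time)` tuple in the order the cards were placed on the wall.

```python
from datetime import datetime

# Expected steps, in staircase order (taken from the template in Step 1).
EXPECTED_STEPS = [
    "Detection & Declaration",
    "Triage & Assignment",
    "Impact & Scope Assessment",
    "Hypothesis & Plan",
    "Mitigation / Change Applied",
    "Verification & Monitoring",
    "Communication & Stakeholders",
    "Resolution Declared",
    "Follow-Up & Postmortem Tasks",
]


def missing_steps(cards):
    """Steps that have no card yet. cards is a list of (step, time) tuples."""
    seen = {step for step, _ in cards}
    return [s for s in EXPECTED_STEPS if s not in seen]


def out_of_order(cards):
    """Adjacent pairs of cards (in wall order) whose times run backwards."""
    return [(a, b) for a, b in zip(cards, cards[1:]) if b[1] < a[1]]
```

A reported gap is a prompt for a question, not an error in itself: the follow‑up step, for example, will usually still be empty when resolution is declared.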
Step 5: After the Incident – Sync to CrateDB and PagerDuty
Once the incident is closed:
- Photograph the staircase and attach it to the PagerDuty incident or your incident ticket.
- Transcribe each card into your incident data pipeline:
  - Map fields to your CrateDB schema (e.g., time, actor, system, action, business_impact).
  - Ensure step names map to canonical lifecycle stages.
- Use the staircase to:
  - Cross‑check automated logs and events.
  - Fill in the “why” behind decisions.
  - Identify missing instrumentation or alerting where cards exist but logs don’t.
This is where you reap the main benefit: clean, structured, context‑rich data with minimal rework.
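To show the shape of the transcription step, the sketch below loads cards through a generic DB‑API connection. Several things here are assumptions: the table name `incident_steps` is hypothetical, the driver is assumed to accept `?` placeholders, and an in‑memory SQLite database stands in for CrateDB so the example runs without a cluster. Against CrateDB you would use its own client and column types instead.

```python
import sqlite3

# Columns follow the example schema mentioned above; the table name is hypothetical.
COLUMNS = ("time", "actor", "system", "action", "business_impact")

# Column names are double-quoted because some of them may be reserved words.
INSERT_SQL = (
    "INSERT INTO incident_steps (incident_id, step, "
    + ", ".join(f'"{c}"' for c in COLUMNS)
    + ") VALUES ("
    + ", ".join(["?"] * (len(COLUMNS) + 2))
    + ")"
)


def load_cards(conn, incident_id, cards):
    """Insert transcribed cards; returns the number of rows written.

    Each card is a dict with a "step" key plus every name in COLUMNS.
    """
    rows = [
        (incident_id, card["step"], *(card[c] for c in COLUMNS))
        for card in cards
    ]
    cur = conn.cursor()
    cur.executemany(INSERT_SQL, rows)
    conn.commit()
    return len(rows)


def make_demo_connection():
    """In-memory SQLite database, used here only as a stand-in for CrateDB."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        'CREATE TABLE incident_steps (incident_id TEXT, step TEXT, "time" TEXT, '
        '"actor" TEXT, "system" TEXT, "action" TEXT, "business_impact" TEXT)'
    )
    return conn
```

Using parameterized inserts rather than string formatting keeps handwritten free text (quotes, commas, odd punctuation) from corrupting the statement.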
Designing for Human Factors
The Lighthouse Staircase only works if it reduces human friction rather than adding to it. Design it with real human behavior under stress in mind.
Minimize Cognitive Load
- Use large fonts, high contrast colors, and simple forms.
- Limit each card to a few critical fields; avoid tiny text blocks.
- Keep the number of steps manageable. Too many levels, and nobody uses them correctly.
Support Clear Communication
- Make updating the staircase a ritual: “We don’t move on until the latest step is on the wall.”
- Encourage engineers to speak in staircase terms: “We’re still in impact assessment,” or “This is a mitigation step.”
Counter Group Dynamics
- Give the scribe explicit authority to pause for accuracy: “Hold on, I need to capture that before we continue.”
- Encourage quieter participants to validate the staircase view: “Is anything missing from this step?”
Handle Stress Responses
- The physical act of standing up, writing a card, and putting it on the wall provides a short cognitive reset.
- The visual progression up the staircase reassures the team that they are making progress, reducing panic.
Conclusion
The Analog Incident Lighthouse Staircase is deliberately simple: just paper, pens, and a wall. But behind that simplicity is a powerful idea:
- Make business‑aligned progress visible.
- Capture structured data in real time, not after the fact.
- Use that structure to reduce ingestion errors and keep CrateDB and PagerDuty data consistent and meaningful.
In an age saturated with tools, the staircase is a low‑tech anchor: a single, shared visualization of what’s happening, what matters, and what comes next.
Start small. Define your steps, print a few card templates, and try it on your next medium‑severity incident. Iterate from there.
Over time, you’ll not only handle outages more calmly—you’ll also build a far richer, cleaner, and more useful incident history for your organization to learn from.