Rain Lag

The Pencil-Drawn Incident Greenhouse Tram: Moving a Single Paper Map Through Every Stage of an Outage

How to redesign your incident response and postmortems around a simple, shared “paper map” that follows cognitive load, supports stressed brains, and safely integrates AI assistance in real outages.


When something breaks in production, your brain is the first system to go into degraded mode.

You’re juggling alerts, logs, Slack threads, executives asking for ETAs, half‑remembered runbooks, and maybe a panicked customer. Under that kind of stress, your brain doesn’t yet have the raw material it needs for clear reasoning. It’s trying to build a mental map of what’s happening while you’re already driving the tram.

This post is about building that map deliberately—using a “pencil‑drawn incident greenhouse tram”: a simple, shared, evolving diagram that moves through each stage of an outage. Think of it as a single sheet of paper (or whiteboard, or digital equivalent) that everyone can point at to understand:

  • What we know
  • What we’re trying
  • Who owns what
  • Where AI fits in (and where it absolutely does not)

And later, the same map lets you reconstruct the incident for a clear, useful postmortem.


Start With the Reader’s Brain, Not the Clock

Most incident reviews are written like: "At 09:13, the alert fired. At 09:16…"

Chronology is easy to log, but it’s hard to read. When I’m trying to understand an outage—from a dashboard, from a runbook, or from a postmortem—I need a map before a timeline.

Resequence your story around cognitive load

Instead of a minute‑by‑minute replay, structure incident artifacts like this:

  1. High-level map (the big picture)

    • What broke (symptom)?
    • What was impacted (users, systems)?
    • What kind of failure was it (performance, data corruption, availability, etc.)?
    • What fixed it (high-level)?
  2. System overview (how things are wired)

    • Simple diagram of the main components involved
    • Arrows showing data/traffic flow
  3. Key decision points

    • What we thought was true
    • What we decided to do
    • What we discovered
  4. Detailed timeline

    • Events, logs, metrics, changes, chat transcripts

People read top‑down, not left‑to‑right over time. Use that: build incident documents that match how the brain makes sense of chaos.
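As a sketch, the four-layer structure above can be captured as a small template. The class and field names here are illustrative, not a standard; the only real point is that `render()` emits the high-level map first and the detailed timeline last.

```python
from dataclasses import dataclass, field

@dataclass
class HighLevelMap:
    symptom: str       # what broke
    impact: str        # users/systems affected
    failure_kind: str  # performance, availability, data corruption, ...
    fix_summary: str   # what fixed it, high-level

@dataclass
class DecisionPoint:
    believed: str      # what we thought was true
    decided: str       # what we decided to do
    discovered: str    # what we actually found

@dataclass
class IncidentDoc:
    high_level: HighLevelMap
    system_overview: str                               # link/path to the diagram
    decisions: list = field(default_factory=list)      # DecisionPoint items
    timeline: list = field(default_factory=list)       # raw events, last

    def render(self) -> str:
        """Emit sections in cognitive-load order: map first, timeline last."""
        parts = [
            f"## High-level map\n{self.high_level.symptom} -> {self.high_level.fix_summary}",
            f"## System overview\n{self.system_overview}",
            "## Key decision points\n" + "\n".join(
                f"- thought: {d.believed}; did: {d.decided}; found: {d.discovered}"
                for d in self.decisions),
            "## Detailed timeline\n" + "\n".join(self.timeline),
        ]
        return "\n\n".join(parts)
```

Whatever tool you actually use, the invariant worth enforcing is the ordering, not the field names.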


The “Single Paper Map” for an Incident

Imagine that every major outage is accompanied by a single piece of paper everyone can see: literal paper taped to a wall, a physical whiteboard, or a shared online canvas.

On that paper you draw a greenhouse tram diagram: a box for each key component, plus the lines of traffic between them. As the incident evolves, you add:

  • Red X’s on suspected or confirmed broken components
  • Sticky notes or callouts for hypotheses and tests ("If we do X, we expect to see Y")
  • Names or roles next to each active action ("API rollback – owned by On‑call A")

This map becomes your visual incident nerve center.

What goes on the map?

At minimum:

  • User entry point: Where traffic enters (browser, mobile app, external API consumer)
  • Critical path: Load balancers → API gateways → services → databases → queues, etc.
  • Current symptom: Where we see the problem (e.g., "Checkout latency > 5s," "Error 500 on /login")
  • Last known good point: Where the path was still healthy
  • Known changes: Deploys, config changes, infra updates touching that path

You update it live as you:

  • Gather data (metrics, logs, traces)
  • Change hypotheses ("Maybe it’s the database … no, it’s the cache tier")
  • Execute actions (failover, rollback, feature flag flips)

That single map is the tram line your whole team rides during the outage.
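If you want a digital twin of that paper map, its state is small enough to track in a tiny data structure. This is a minimal sketch; the statuses and method names are made up for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    status: str = "healthy"  # healthy | suspected | broken (the red X)
    owner: str = ""          # who owns the active action on this component

@dataclass
class IncidentMap:
    components: dict = field(default_factory=dict)
    hypotheses: list = field(default_factory=list)  # sticky-note callouts

    def add(self, name: str) -> None:
        """Draw a box for a component on the critical path."""
        self.components[name] = Component(name)

    def mark(self, name: str, status: str, owner: str = "") -> None:
        """Put a red X (or a suspicion) on a box and record who is acting."""
        comp = self.components[name]
        comp.status, comp.owner = status, owner

    def note(self, text: str) -> None:
        """'If we do X, we expect to see Y' style sticky notes."""
        self.hypotheses.append(text)
```

The value is not the code, it's that the map has one writer-visible state instead of being scattered across chat scroll.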


Design for Brains Under Stress

During an incident, no one has the full picture. People are:

  • Partially informed
  • Adrenaline‑spiked
  • Context‑switched

That’s not a bug in your team; that’s how humans work.

So design your incident process and documentation for low cognitive load:

1. Make runbooks visual and scannable

  • Start each runbook with:
    • "You are here if…" (symptoms)
    • Simple diagram of involved systems
    • 3–5 high‑level steps before any detailed sub‑steps
  • Use bullet points and explicit checks:
    • "Confirm X metric is above Y for Z minutes. If not, stop and re-evaluate."
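The "Confirm X is above Y for Z minutes" style of check can even be made executable. A minimal sketch, assuming you already have a list of per-minute metric samples from your monitoring system (function and parameter names are hypothetical):

```python
def metric_exceeds(samples, threshold, minutes):
    """Return True only if the last `minutes` samples all exceed `threshold`.

    `samples` is a list of per-minute values, oldest first. If we don't yet
    have enough data, be conservative: return False, i.e. "stop and
    re-evaluate" rather than guessing.
    """
    if len(samples) < minutes:
        return False
    return all(value > threshold for value in samples[-minutes:])
```

Encoding the check this way forces the runbook author to state the threshold and the window explicitly, instead of "if the metric looks high".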

2. Keep roles crystal clear

On the paper map, always show:

  • Incident commander (IC)
  • Comms lead
  • Primary responder(s) per subsystem

When ownership is explicit, people don’t waste brain cycles wondering "Who’s on this?" or "Am I supposed to do that?".

3. Limit active threads

Encourage responders to:

  • Park non‑critical ideas in a visible "later" section on the map
  • Focus on one hypothesis at a time
  • Explicitly close hypotheses ("Ruled out: network path A")

The goal: make the state of the investigation visible, so your brain doesn’t have to remember everything.
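One way to make that state concrete is a tiny "hypothesis board" that allows exactly one active thread, a visible "later" pile, and an explicit ruled-out list. This is a sketch of the discipline, not a tool recommendation:

```python
class HypothesisBoard:
    """Track investigation threads so state lives on the map, not in heads."""

    def __init__(self):
        self.active = None  # the single hypothesis being tested right now
        self.later = []     # parked, non-critical ideas
        self.closed = []    # (hypothesis, reason) pairs, explicitly ruled out

    def focus(self, hypothesis):
        """Start testing a hypothesis; refuse to multitask."""
        if self.active is not None:
            raise RuntimeError("close or park the current hypothesis first")
        self.active = hypothesis

    def park(self, idea):
        """Move a non-critical idea to the visible 'later' section."""
        self.later.append(idea)

    def rule_out(self, reason):
        """Explicitly close the active hypothesis with a reason."""
        self.closed.append((self.active, reason))
        self.active = None
```

The `focus` guard is the whole point: the structure itself pushes back when someone tries to open a second thread without closing the first.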


Where AI Fits: Assistant, Not Authority

AI tools can be incredibly helpful during incidents—but they are not reliable reasoners and are not ground truth.

You must treat them as assistants, not authorities.

Known weak spots of AI in incidents

  • Logical reasoning: AI often produces plausible but wrong chains of cause and effect.
  • Hallucination: It may invent APIs, flags, or configs that don’t exist.
  • Lack of context: It can’t see everything you see—especially real‑time system signals—unless you carefully feed it that context.

Safe uses of AI during an outage

On the paper map, you can literally mark “AI-assisted” next to ideas or code it generates.

AI is most effective when used to:

  • Summarize logs or error messages you paste in
  • Suggest search queries for logs or metrics
  • Generate boilerplate diagnostics scripts (e.g., a script to hit a health endpoint repeatedly)
  • Draft runbook updates or incident summaries after the fact

All of these still require human validation.
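The "boilerplate diagnostics script" case is a good example of AI output that's cheap to validate. A sketch of what a human-reviewed version might look like, using only the standard library (the URL and parameters are placeholders; `fetch` is injectable so the script can be tested without touching production):

```python
import time
import urllib.request

def poll_health(url, attempts=10, interval=2.0, fetch=None):
    """Hit a health endpoint repeatedly and collect per-attempt results.

    Returns a list with one entry per attempt: an HTTP status code, or an
    "error: ..." string if the request failed.
    """
    if fetch is None:
        def fetch(target):
            with urllib.request.urlopen(target, timeout=5) as resp:
                return resp.status
    results = []
    for i in range(attempts):
        try:
            results.append(fetch(url))
        except Exception as exc:
            results.append(f"error: {exc}")
        if i < attempts - 1:
            time.sleep(interval)
    return results
```

Read-only, bounded, and reviewable in a minute: exactly the shape of change where AI assistance pays off during an incident.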

Dangerous uses of AI (without guardrails)

  • Asking, "What’s causing this outage?" and acting directly on the answer
  • Letting AI modify production configs or infra definitions without review
  • Accepting AI-generated queries or migrations that touch real user data without tests

Any AI suggestion that alters code / configs / data should be treated like:

  • Code from an unknown junior engineer
  • Who might be brilliant
  • But who you must review and test rigorously

You can still track these on the map: "Hypothesis from AI: X. Human review by Y. Status: accepted/rejected." This keeps humans in charge of reasoning.


AI-Assisted Changes: Review and Test Every Time

During a crisis, the temptation is huge: "The AI says this query will fix it; let’s just run it."

Resist that. Create explicit rules:

  1. All AI-generated changes are reviewed

    • A named human reviews the diff, script, or command.
    • The reviewer is recorded on the incident map.
  2. Changes are tested in the safest available environment first

    • Staging, shadow traffic, or a limited blast radius (e.g., one shard or tenant).
  3. Rollback steps are written down before execution

    • "If this goes wrong, we will revert by doing X, Y, Z."

This might feel slow in the moment, but one AI‑induced secondary outage will convince you it’s necessary.
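The three rules can even be enforced mechanically at the point of execution. A minimal sketch (the function and its arguments are illustrative, not a real tool):

```python
def execute_change(change, reviewer, rollback_steps, run):
    """Apply an AI-generated change only if the guardrails are satisfied.

    `run` is the callable that actually applies the change; we refuse to
    invoke it unless a named human reviewer and written rollback steps exist.
    """
    if not reviewer:
        raise ValueError("AI-generated change needs a named human reviewer")
    if not rollback_steps:
        raise ValueError("write down rollback steps before execution")
    return run(change)
```

Even if you never wrap execution in code like this, the same checklist can live on the paper map next to the change itself.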


Run the Postmortem Within 48 Hours

The incident doesn’t end when the graph goes green.

The longer you wait to run the postmortem, the more your shared map decays. People forget:

  • What they actually thought at the time
  • Which paths they explored and abandoned
  • How stressed or confused they were

To keep the reconstructed map accurate, adopt a simple rule:

Run the postmortem within 48 hours of resolving the incident.

Schedule it while you’re still in the incident channel.

This tight window gives you:

  • Fresher memories
  • More reliable qualitative data
  • Better insight into how your process performed under real stress, not how it looks in hindsight.

Build the Full Picture With Qualitative Data

Metrics and logs show what happened. They rarely show how it felt or why people made the decisions they did.

A good postmortem treats the incident like a human + system event. To capture the human side, gather qualitative data:

1. Short questionnaires

After resolution, send a quick form to everyone involved:

  • What was most confusing?
  • When did you feel most stuck?
  • Which tools/docs helped or hurt?
  • What do you wish you had on your screen that you didn’t?

2. Brief interviews

For key participants (IC, primary responders, on‑call devs):

  • 10–15 minute conversations
  • Focus on decision points: "At 10:27, why did you pick path A instead of B?"

3. First-person notes

Encourage responders to keep timestamped notes during the incident (even rough bullets). These help reconstruct:

  • Actual hypotheses vs. clean‑room hindsight
  • Emotional load ("I was super unsure here")

In your written postmortem, weave these perspectives together with the data. That gives you a true map of the incident—not just the sanitized, technical version.


Bringing It All Together

The “pencil-drawn incident greenhouse tram” is a metaphor for designing outages around one evolving, shared map instead of scattered mental models and endless chat scroll.

To recap:

  • Structure docs for cognitive load: high-level map first, timeline later.
  • Use a single visual map to track systems, hypotheses, ownership, and actions.
  • Design for stressed brains with clear roles, visual runbooks, and limited parallel threads.
  • Treat AI as an assistant: helpful, fast, sometimes wrong—never the authority.
  • Review and test all AI-generated changes like untrusted code.
  • Hold postmortems within 48 hours while memories are fresh.
  • Collect qualitative data so you understand not just what broke, but how humans navigated the chaos.

If you can draw your outage on one piece of paper, keep that drawing alive through the whole incident, and then replay it together within two days, you’ll get more resilient systems—and more confident humans running them.
