The Paper Incident: How Street Map Cafe Sketches Its Way to Daily Reliability
How a small team at the fictional Street Map Cafe uses paper-first incident rituals, FRACAS, and lightweight roles to turn outages into a daily reliability practice between deploys.
The Paper Incident: Sketching Daily Reliability Rituals Between Deploys
Most teams treat incidents like storms: disruptive, exhausting, and something everyone hopes won’t happen again. At Street Map Cafe, a fictional but familiar SaaS product for city discovery, the team learned to treat incidents like stories instead—stories they sketch, map, and revisit between deploys.
This post walks through how a lean engineering team can:
- Use FRACAS (Failure Reporting, Analysis, and Corrective Action System) without drowning in process.
- Turn incident reports into structured data for reliability, safety, and logistics—not just dusty postmortems.
- Run incidents with a tiny core team and clear roles.
- Use paper-first sketching (timelines, dependency maps, story maps) to internalize lessons and build daily reliability rituals.
Think of it as a paper incident story map—a way to connect what happens during outages with how you work between deploys.
From One-Off Postmortems to a Living Reliability System
At Street Map Cafe, incidents used to follow a familiar pattern:
- Something broke.
- Slack exploded.
- People scrambled, fixed it, wrote a postmortem…
- …and nothing in their daily work really changed.
The turning point came after a payments outage that looked just like one they’d had six months earlier. Different root cause, same pattern:
- Slow detection
- Confused ownership
- Customers learning about the issue before the team
They realized the problem wasn’t just technical—it was systemic. Incidents were treated as isolated events, not datapoints in a larger reliability story.
So they adopted a lightweight FRACAS mindset.
FRACAS Without the Bureaucracy
FRACAS stands for:
- Failure Reporting – log the incident in a structured way.
- Failure Analysis – understand what actually happened and why.
- Corrective Action – implement changes to prevent or mitigate recurrence.
- System – make it repeatable and trackable across incidents.
A lot of teams avoid FRACAS because it sounds like aviation-grade bureaucracy. Street Map Cafe took a different route:
- A single, simple incident form (in a doc or ticketing system) for every event.
- The same few required fields every time.
- A habit of reviewing patterns between deploys.
No giant spreadsheets. No ten-step approval chains. Just enough structure to see trends over time.
Incident Reports as a Data Asset, Not a Graveyard
The biggest shift wasn’t the form—it was how they thought about the data.
Instead of treating incident reports as one-off postmortems, they asked:
“If we look at these as structured data over a year, what can we learn about our reliability, safety, and logistics?”
So they standardized a core incident schema:
- What failed? (service, feature, dependency)
- Failure class: data, auth, performance, configuration, dependency, etc.
- Detection source: monitoring, customer ticket, internal report.
- Impact surface: customers affected, region, internal vs external.
- Time metrics: time to detect, time to mitigate, time to fully resolve.
- Contributing factors: process, tooling, environment, human factors.
- Corrective actions: technical and process changes.
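As a rough illustration, the schema above could be captured in a small record type. This is a minimal sketch, not Street Map Cafe's actual form: the class name, field names, enum values, and example timestamps are all assumptions made for illustration, and measuring every time metric from `started_at` is one choice among several.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from enum import Enum
from typing import List


class FailureClass(Enum):
    DATA = "data"
    AUTH = "auth"
    PERFORMANCE = "performance"
    CONFIGURATION = "configuration"
    DEPENDENCY = "dependency"


class DetectionSource(Enum):
    MONITORING = "monitoring"
    CUSTOMER_TICKET = "customer_ticket"
    INTERNAL_REPORT = "internal_report"


@dataclass
class IncidentRecord:
    """One row of incident data; field names are hypothetical."""
    what_failed: str                 # service, feature, or dependency
    failure_class: FailureClass
    detection_source: DetectionSource
    impact_surface: str              # customers affected, region, internal vs external
    started_at: datetime
    detected_at: datetime
    mitigated_at: datetime
    resolved_at: datetime
    contributing_factors: List[str] = field(default_factory=list)
    corrective_actions: List[str] = field(default_factory=list)

    # Assumption: all three time metrics are measured from started_at.
    @property
    def time_to_detect(self) -> timedelta:
        return self.detected_at - self.started_at

    @property
    def time_to_mitigate(self) -> timedelta:
        return self.mitigated_at - self.started_at

    @property
    def time_to_resolve(self) -> timedelta:
        return self.resolved_at - self.started_at


# Made-up example record, for illustration only.
example = IncidentRecord(
    what_failed="payments-api",
    failure_class=FailureClass.DEPENDENCY,
    detection_source=DetectionSource.CUSTOMER_TICKET,
    impact_surface="external customers, all regions",
    started_at=datetime(2024, 3, 4, 9, 0),
    detected_at=datetime(2024, 3, 4, 9, 12),
    mitigated_at=datetime(2024, 3, 4, 9, 40),
    resolved_at=datetime(2024, 3, 4, 11, 5),
    contributing_factors=["tooling"],
    corrective_actions=["add alert on payment error rate"],
)
```

Deriving the time metrics from timestamps, rather than typing them in by hand, keeps the form short and the numbers consistent across incidents.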
This let them answer questions like:
- Which services account for most incidents?
- How often do external dependencies cause problems?
- Are incidents mostly detected by customers or monitoring?
- Do certain release windows correlate with more failures?
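Once incidents share the same fields, questions like these reduce to simple counting. The sketch below assumes each incident is a plain dictionary with hypothetical keys (`service`, `failure_class`, `detection_source`); the records themselves are invented sample data, not real results.

```python
from collections import Counter

# Invented sample records; keys and values are assumptions for illustration.
incidents = [
    {"service": "payments-api", "failure_class": "dependency", "detection_source": "customer_ticket"},
    {"service": "payments-api", "failure_class": "configuration", "detection_source": "monitoring"},
    {"service": "map-tiles", "failure_class": "performance", "detection_source": "monitoring"},
    {"service": "auth-service", "failure_class": "auth", "detection_source": "customer_ticket"},
]


def incidents_by_service(records):
    """Count incidents per service."""
    return Counter(r["service"] for r in records)


def share_by(records, key):
    """Return the fraction of incidents for each distinct value of `key`."""
    counts = Counter(r[key] for r in records)
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()} if total else {}


by_service = incidents_by_service(incidents)
detection_share = share_by(incidents, "detection_source")
class_share = share_by(incidents, "failure_class")
```

Here `by_service.most_common(1)` surfaces the service with the most incidents, and `detection_share` shows how much detection comes from customers versus monitoring.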
By treating incidents as structured reliability data, they created a feedback loop into planning:
- Reliability work added to the roadmap based on evidence, not anecdotes.
- Platform and tooling investments justified with patterns, not gut feel.
- On-call training adjusted to real-world recurring failure types.
Lean Teams Need Lightweight Incident Workflows
Street Map Cafe runs with a small engineering team. They couldn’t afford a heavyweight incident bureaucracy—but they also couldn’t afford chaos.
They designed a workflow with two constraints:
- Everything must be runnable by 3–5 people.
- No process step survives if it doesn’t measurably reduce confusion or recurrence.
The result: a small, clearly defined core incident response team.
The 3–5 Person Core Team
For any significant incident, they assign three key roles:
- Incident Commander (IC)
  - Owns decision-making and overall coordination.
  - Decides priorities: rollback vs patch, partial vs full shutdown.
  - Keeps the team focused, avoids thrash.
- Tech Lead (TL)
  - Drives root cause analysis and technical mitigation.
  - Directs engineers: where to look, what to change, how to test.
  - Tracks hypotheses and evidence.
- Communications Lead (CL)
  - Manages stakeholder updates (customers, leadership, support, sales).
  - Owns status page updates and internal announcements.
  - Translates technical status into clear, plain language.
They add 1–2 supporting engineers when an incident needs them, but the core roles stay constant. This consistency:
- Reduces coordination overhead.
- Makes training easier.
- Keeps responsibility clear even when things feel chaotic.
No war rooms with 25 people. No blurry lines of authority. Just a tiny team that knows who decides what.
Why Paper? The Street Map Cafe Sketching Ritual
The team’s most unusual reliability habit started almost by accident.
During a particularly noisy incident, one engineer closed their laptop, pulled out a notebook, and began sketching:
- A vertical timeline of key events.
- A quick dependency diagram of services and external APIs.
- A story map: users, their paths, and where errors appeared.
They noticed a change:
- Conversation became more focused.
- The team stopped chasing random Slack messages and aligned around the same picture.
- People remembered the incident details better in the following days.
They turned this into a ritual.
Handwritten Sketching During Incidents
Whenever an incident crosses a severity threshold, someone (often the TL or IC) starts a paper sketch:
- Timeline column: detection, key changes, mitigation attempts, state transitions.
- Systems map: boxes and arrows for services, data stores, 3rd-party APIs, queues.
- Impact notes: where user flows break, what symptoms appear.
Why paper instead of a fancy tool?
- Lower cognitive load: sketching is fast and forgiving—no UI friction.
- Better focus: pen and paper pull you away from tab overload.
- Stronger memory: writing by hand tends to help people recall and understand what they recorded.
- Intentional constraint: you only write what actually matters.
Photos of the sketches get attached to the incident record later, but in the moment, the team optimizes for thinking speed, not tooling.
Paper-First Mapping Between Deploys
The real magic happens not during the incident, but between deploys.
Street Map Cafe created a recurring ritual: the Paper Incident Story Map Session.
Every week (or after a major incident), a small group meets for 45–60 minutes with:
- Incident reports from the past period.
- Printed or digital metrics (SLIs/SLOs, MTTR, incident counts).
- Blank paper, pens, and markers.
They do three kinds of maps:
1. Incident Timelines at a Glance
On one page:
- Sketch mini-timelines for each recent incident.
- Mark detection time, escalation, mitigation, resolution.
- Note which signals triggered action (alerts, tickets, dashboards).
This often reveals:
- Alert rules that are too noisy or too quiet.
- Slow escalations due to unclear ownership.
- Repeated delays in the same class of incidents.
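If the milestone timestamps are already in the incident record, the gaps a paper mini-timeline makes visible can also be computed. This is a minimal sketch: the milestone names follow the list above, while the `phase_gaps` helper and the example times are assumptions for illustration.

```python
from datetime import datetime

# Milestones in the order they appear on the sketched timeline.
MILESTONES = ["detected", "escalated", "mitigated", "resolved"]


def phase_gaps(timeline):
    """Return minutes elapsed between consecutive milestones.

    `timeline` maps milestone name -> datetime. Milestones missing from
    the timeline are skipped, so the gap spans to the next one recorded.
    """
    present = [m for m in MILESTONES if m in timeline]
    gaps = {}
    for earlier, later in zip(present, present[1:]):
        delta = timeline[later] - timeline[earlier]
        gaps[f"{earlier}->{later}"] = delta.total_seconds() / 60
    return gaps


# Invented example timeline.
timeline = {
    "detected": datetime(2024, 3, 4, 10, 0),
    "escalated": datetime(2024, 3, 4, 10, 25),
    "mitigated": datetime(2024, 3, 4, 10, 40),
    "resolved": datetime(2024, 3, 4, 11, 30),
}
gaps = phase_gaps(timeline)
```

A long `detected->escalated` gap repeated across incidents is the kind of signal that points to unclear ownership rather than a technical problem.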
2. Dependency Story Maps
On another sheet, they:
- Draw the core user journeys (e.g., “browse map → pick cafe → pay”).
- Layer on top the services and dependencies each step hits.
- Annotate where recent incidents intersect those paths.
Patterns emerge:
- One backend service appears in half of all incidents.
- A single 3rd-party provider creates cascading issues.
- Certain flows have no graceful degradation path.
These maps feed directly into architectural priorities and reliability epics.
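The same overlay can be approximated in a few lines once the paper map exists. In this sketch the journey steps come from the example above, but the service names, incident IDs, and the `incidents_per_step` helper are hypothetical.

```python
# Journey steps mapped to the services each step touches.
# Service names are invented for illustration.
JOURNEY = {
    "browse map": ["map-tiles", "search-api"],
    "pick cafe": ["cafe-catalog", "reviews-provider"],
    "pay": ["payments-api", "card-processor"],
}

# Invented recent incidents, each tagged with the service that failed.
RECENT_INCIDENTS = [
    {"id": "INC-101", "service": "payments-api"},
    {"id": "INC-102", "service": "card-processor"},
    {"id": "INC-103", "service": "map-tiles"},
]


def incidents_per_step(journey, incidents):
    """List the incident IDs that intersect each journey step."""
    hits = {step: [] for step in journey}
    for incident in incidents:
        for step, services in journey.items():
            if incident["service"] in services:
                hits[step].append(incident["id"])
    return hits


hits = incidents_per_step(JOURNEY, RECENT_INCIDENTS)
# The step with the most intersecting incidents is a candidate for a reliability epic.
busiest_step = max(hits, key=lambda step: len(hits[step]))
```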
3. Daily Reliability Rituals Map
Finally, they sketch a loop of daily habits that would make incidents less likely or less painful:
- Pre-deploy checks
- Alert tuning
- Runbook updates
- Chaos tests or drills
- On-call shadowing
For each habit, they write:
- Which incident(s) motivated it.
- Who owns it.
- How often it happens.
This keeps reliability work tangible and visible—not an abstract “someday” concern.
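For teams that also want a digital copy of the rituals map, each habit can be stored with the three annotations listed above. This is only a sketch: the `ReliabilityHabit` type, the `missing_annotations` helper, and the example owner and incident ID are assumptions for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReliabilityHabit:
    """One habit on the rituals map; field names are hypothetical."""
    name: str
    owner: str = ""
    cadence: str = ""
    motivating_incidents: List[str] = field(default_factory=list)


def missing_annotations(habit):
    """Return which of the three annotations a habit still lacks."""
    missing = []
    if not habit.motivating_incidents:
        missing.append("motivating incident")
    if not habit.owner:
        missing.append("owner")
    if not habit.cadence:
        missing.append("cadence")
    return missing


# Invented examples.
habits = [
    ReliabilityHabit("Pre-deploy checks", owner="dana", cadence="every deploy",
                     motivating_incidents=["INC-101"]),
    ReliabilityHabit("Alert tuning", cadence="weekly"),
]
incomplete = {h.name: missing_annotations(h) for h in habits if missing_annotations(h)}
```

A habit with no owner or no motivating incident is a prompt for the next mapping session: either fill in the gap or drop the habit.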
From Incident Stories to Daily Practice
Over time, Street Map Cafe’s paper-first rituals changed the team’s default behavior:
- Engineers thought in failure modes when designing new features.
- Product managers understood reliability trade-offs in roadmaps.
- On-call folks used story maps and runbooks shaped by real incidents, not hypotheticals.
The biggest shift: incidents stopped feeling like rare catastrophes and became the primary raw material for improving how the team works every day.
How to Start Your Own Paper Incident Story Map
You don’t need new tools. You need a few constraints and a pen.
- Define a minimal FRACAS schema. Decide on 6–10 fields you will capture for every incident, and stick to them.
- Establish a 3–5 person core incident team. Name your Incident Commander, Tech Lead, and Communications Lead roles explicitly.
- Introduce paper sketching in your next incident. Assign one person to draw a rough timeline and systems map while others work.
- Schedule a short mapping session between deploys. Once a week or after major incidents, sketch timelines, dependency maps, and daily habits.
- Tie maps to concrete actions. Every session should end with 1–3 specific changes (alerts, docs, code, process).
Conclusion: Reliability Is a Story You Draw, Not Just a Metric You Track
Reliability isn’t only about uptime percentages and MTTR charts. It’s about how your team makes sense of failure, and what you practice between the moments of crisis.
By combining a lightweight FRACAS approach, a small and clearly defined incident response team, and paper-first sketching rituals, Street Map Cafe turned incidents into:
- A source of structured reliability insight.
- A shared visual language for complex systems.
- A foundation for daily reliability habits that survive beyond the postmortem.
If your incident process feels either chaotic or overcomplicated, try something deceptively simple:
- Shrink the team.
- Standardize the data.
- And pick up a pen.
The story your incidents are telling might finally become clear enough to change what you do every day between deploys.