The Pencil-Drawn Incident Aquarium: Floating Paper Fish That Visualize Hidden Reliability Currents
Using a whimsical “paper fish in a pencil-drawn aquarium” metaphor to explore how automation, architecture clarity, and embedded practices reveal the hidden currents of incident response and reliability in modern systems.
The Pencil-Drawn Incident Aquarium
Imagine walking into a war room during a major outage.
On the wall is a giant, hand-drawn aquarium: rough pencil lines for glass, pipes, and plants. Inside it, dozens of paper fish hang from threads. Each fish has a service name scribbled on it: payments-api, search-indexer, notifications, auth-gateway.
As the incident unfolds, some fish drift upward, some sideways, others blink red or turn upside down as engineers flip them around, scribbling new notes on the backs. Lines of colored string show data flows and dependencies: who calls whom, what breaks when something else fails. The whole thing looks whimsical—but it’s deadly serious.
This is the Pencil-Drawn Incident Aquarium: a physical metaphor for what’s happening inside your production environment. The paper fish are your services. The pencil lines are your architecture. The movement of the fish shows the hidden reliability currents—how failures propagate, how small config changes ripple across the system, how a single dependency can quietly drag down everything else.
In most organizations, those currents are invisible—until something breaks.
This post uses that aquarium as a mental model to explore how you can:
- Make incident response calmer and more effective
- Build a reliability program as a living knowledge system, not a set of static documents
- Avoid the traps of premature scale and hidden complexity
- Integrate reliability into daily work so the aquarium is always up-to-date
1. Incident Response: Calming the Water with Automation
During a major incident, humans are like divers in turbulent water: limited visibility, limited air, and high stress. Every bit of manual friction—looking up a dashboard, paging the wrong person, copying logs—steals attention from the most important task: understanding what’s going on and making good decisions quickly.
In our aquarium metaphor, incident response is what happens when a current suddenly shifts and half the fish turn belly-up. The last thing you need is to:
- Manually create conference bridges
- Rebuild the same charts and queries from scratch
- Guess who owns which fish
- Click through five tools to assemble context
Routine automation and simplified workflows are the pumps and filters that keep the water clear:
- Automated incident creation & routing: Alerts that meet a threshold automatically open an incident, assign severity, and route to the right team.
- Standardized incident rooms: Each new incident spins up a pre-templated channel (chat, docs, timelines) with checklists and links.
- Single-click context: For any affected “fish” (service), responders get instant access to:
  - Owner and on-call
  - Recent deploys and config changes
  - Health dashboards and logs
  - Downstream dependencies
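As one concrete illustration, the automated creation-and-routing step above can be sketched in a few lines. This is a minimal sketch under assumptions: the alert shape, the `ROUTING` table, and the dashboard URLs are all hypothetical, not the API of any particular incident tool.

```python
from dataclasses import dataclass, field

# Hypothetical routing table: service name -> owning on-call rotation.
ROUTING = {"payments-api": "payments-oncall", "auth-gateway": "identity-oncall"}

@dataclass
class Incident:
    service: str
    severity: str
    assigned_team: str
    links: list = field(default_factory=list)

def open_incident(alert: dict):
    """Open an incident only for alerts that cross the paging threshold."""
    if alert["value"] < alert["page_threshold"]:
        return None  # below threshold: a metric blip, not an incident
    severity = "SEV-1" if alert["value"] >= 2 * alert["page_threshold"] else "SEV-2"
    team = ROUTING.get(alert["service"], "default-oncall")
    # Pre-assemble the "single-click context" links so responders start with
    # dashboards and deploy history instead of hunting for them. (Example URLs.)
    links = [f"https://dashboards.example/{alert['service']}",
             f"https://deploys.example/{alert['service']}/recent"]
    return Incident(alert["service"], severity, team, links)

inc = open_incident({"service": "payments-api", "value": 9.0, "page_threshold": 4.0})
print(inc.severity, inc.assigned_team)  # SEV-1 payments-oncall
```

The point is not the specifics but the shape: severity assignment, routing, and context assembly happen before a human ever joins the call.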
The goal isn’t to automate decision-making; it’s to automate the friction around decision-making. Every repetitive, predictable step you automate gives responders more mental space to reason about the current itself.
When done well, your incident process starts to feel like a clear, calm aquarium where you can see each fish, each bubble, each current—rather than a murky tank you’re blindly reaching into.
2. Reliability as a Knowledge-Based System, Not a Checklist
A strong reliability program is not just a set of tools or a compliance checklist. It’s a complex, knowledge-based system built around your specific products, users, and constraints.
In the aquarium, this knowledge looks like:
- Why certain fish are grouped together
- Which pipes are double-reinforced
- Which currents are allowed to be turbulent and which must stay calm
In practice, that means:
- Context-specific SLOs and error budgets: Not everything needs five-nines. A marketing landing page and a core payment API shouldn’t be held to the same standard.
- Runbooks grounded in reality: Not generic “restart the service” steps, but knowledge rich with context: “If this metric spikes and that dependency is flaky, it’s usually a DNS issue or rate limiting. Start here.”
- Architectural patterns that match your domain: Some domains tolerate eventual consistency; others (like money movement or health records) absolutely do not.
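The “not everything needs five-nines” point is easy to make concrete with arithmetic: an availability SLO translates directly into an error budget of allowed downtime. The SLO targets below are illustrative assumptions, not recommendations.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime (minutes) in the window for a given availability SLO."""
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

# Illustrative targets: a core payment API held to 99.99% gets roughly
# four minutes of budget per month; a marketing page at 99% gets hours.
print(round(error_budget_minutes(0.9999), 1))  # ~4.3 minutes / 30 days
print(round(error_budget_minutes(0.99), 1))    # ~432.0 minutes / 30 days
```

Seeing the budgets side by side makes the trade-off discussable: holding the landing page to the payment API’s standard would cost real engineering effort for little user benefit.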
This knowledge system must be:
- Continuously updated via post-incident reviews
- Discoverable by anyone touching production
- Embedded in tooling, docs, and code comments—not stuck in someone’s head
The Pencil-Drawn Incident Aquarium, as a metaphor, reminds us that the drawing is never finished. The complexity of your environment is living and evolving, and your reliability program must grow with it.
3. Leadership: The Hand That Keeps Redrawing the Glass
Reliable systems emerge from reliable organizations, and that begins with leadership.
Leadership commitment shows up in simple, tangible ways:
- Reliability is a priority, not a side quest: Time is explicitly allocated for capacity planning, chaos experiments, runbook creation, and architecture reviews.
- Healthy trade-offs are encouraged: It’s acceptable—even expected—to slow a feature launch to protect core reliability.
- People development is intentional: Incident commanders are trained, shadowing is encouraged, and deep technical skills are nurtured as a strategic asset.
Without this commitment, the aquarium becomes a neglected school project: faded lines, missing fish, nobody sure who was supposed to feed what.
With it, leaders:
- Repeatedly ask: “How will this affect reliability?”
- Attend post-incident reviews and understand both technical and organizational root causes
- Fund work that may not have an immediate ROI but reduces future chaos
Leadership doesn’t control the currents directly—but it determines whether the organization invests in the right pumps, filters, and monitoring.
4. Reliability in the Flow of Work, Not as an Afterthought
One of the most common failure patterns is treating reliability as a separate track: a quarterly project, a special team, a thing you “get to later.” In aquarium terms, that’s like building the tank, filling it with fish, and then thinking about the filtration system next year.
Effective teams bake reliability into daily operations:
- In design reviews: Every new feature must answer, “How can this fail, and how will we know?”
- In code review: Reliability considerations—timeouts, retries, idempotency, observability—are part of the checklist.
- In sprint planning: Work on observability, resilience, and incident follow-ups is prioritized alongside product features.
- In onboarding: New engineers learn incident processes and reliability expectations from day one.
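To show what the code-review bullet looks like in practice, here is a minimal sketch of a retry helper with the properties a reliability-minded reviewer asks for: bounded attempts, an explicit timeout passed to the callee, and jittered exponential backoff. The function names and defaults are illustrative, not from any specific library.

```python
import random
import time

def call_with_retries(fn, attempts=3, timeout_s=2.0, base_backoff_s=0.1):
    """Retry a flaky call with bounded attempts and jittered backoff.

    Unbounded retries can turn a brief dependency blip into a
    self-inflicted outage; this helper caps attempts and spaces them out.
    """
    last_err = None
    for attempt in range(attempts):
        try:
            return fn(timeout=timeout_s)
        except TimeoutError as err:
            last_err = err
            # Jitter so many retrying callers don't stampede the dependency at once.
            time.sleep(base_backoff_s * (2 ** attempt) * random.uniform(0.5, 1.5))
    raise last_err

# Usage: a fake dependency that times out twice, then succeeds.
calls = {"n": 0}
def flaky(timeout):
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("dependency slow")
    return "ok"

print(call_with_retries(flaky, base_backoff_s=0.01))  # ok
```

A reviewer scanning for this shape is really asking the design-review question in code form: how can this call fail, and what happens when it does?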
When reliability is integrated this way, the aquarium is continuously updated. New fish aren’t added without labeling them, recording their dependencies, and understanding how they’ll behave when the currents change.
5. Standard Work: The Swim Lanes for Every Incident
In a crisis, ambiguity is as dangerous as the underlying bug. Standard work practices provide a shared script for how everyone moves when the alarm bell rings.
Typical elements include:
- Clear incident roles: Incident commander, communications lead, operations lead, domain experts. Everyone knows the swim lane they’re in.
- Unified severity definitions: No debates in the middle of chaos about whether this is a SEV-1 or SEV-2.
- Common playbooks: For recurring classes of incidents (degraded dependency, capacity exhaustion, configuration drift), responders have pre-agreed steps.
- Post-incident rituals: Time-boxed reviews, well-understood templates, and clear follow-up tracking.
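Unified severity definitions work best when they are encoded as data, so classification is a lookup rather than a debate mid-incident. A minimal sketch, with an entirely hypothetical severity ladder and thresholds:

```python
# Hypothetical, pre-agreed severity ladder. Rules are checked in order;
# the first match wins, so the most severe condition comes first.
SEVERITIES = [
    ("SEV-1", lambda i: i["customer_facing"] and i["pct_users_affected"] >= 10),
    ("SEV-2", lambda i: i["customer_facing"]),
    ("SEV-3", lambda i: True),  # internal-only degradation
]

def classify(impact: dict) -> str:
    """Map an impact assessment to a severity level by the agreed rules."""
    for name, rule in SEVERITIES:
        if rule(impact):
            return name
    return "SEV-3"

print(classify({"customer_facing": True, "pct_users_affected": 40}))  # SEV-1
print(classify({"customer_facing": False, "pct_users_affected": 0}))  # SEV-3
```

The exact thresholds matter less than the fact that they were agreed on calmly, in advance, and are applied mechanically under stress.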
Standard work doesn’t mean rigidity; it means predictable structure. It gives responders psychological safety and a foundation on which to improvise. Like well-marked lanes in the aquarium, it ensures that when the water gets rough, the fish don’t all smash into each other.
6. The Danger of Overbuilding: When the Aquarium Becomes a Maze
Scaling is exciting. It’s easy to reach for the most advanced patterns and tools early: service meshes, intricate event buses, deeply nested microservices, complex multi-region topologies.
But overbuilding for scale too early has a painful side effect: it hides reliability trade-offs and slows down debugging.
When the aquarium becomes a labyrinth of pipes, hidden compartments, and obscure filtration systems:
- No one can fully explain the end-to-end flow
- A small config change in a “low-risk” fish triggers a cascading failure
- New engineers need months just to build an accurate mental model
Simplicity is a reliability feature.
Principles to keep the aquarium understandable:
- Start with the simplest architecture that meets current needs, then evolve deliberately.
- Consolidate where possible: Not every function needs to be its own microservice.
- Make trade-offs explicit: If you add complexity for scale, document the operational cost and the failure modes.
You want just enough glass, just enough pipes, and just enough compartments to support your current and near-future fish—not an underwater city that nobody can navigate.
7. Architecture & Dependency Mapping: Seeing the True Currents
The heart of the Pencil-Drawn Incident Aquarium is its visual map of services, assets, and configuration items, and how they connect.
Clear architecture and dependency mapping answer questions like:
- What depends on this service, directly and indirectly?
- If this database slows down by 50%, who feels it first?
- Which configuration items (flags, secrets, routing rules) control critical paths?
In the aquarium:
- Each fish (service) is labeled with its owners, SLOs, and criticality.
- Strings between fish show real, current dependencies, not the ones from a diagram drawn three years ago.
- Color or position encodes blast radius, data sensitivity, or failure domain.
Operationally, this means:
- A living service catalog / CMDB integrated with CI/CD and observability
- Topology-aware alerting and dashboards that show not just “what’s broken” but “who this will hurt next”
- Incident tools that can reveal impact paths in seconds, not hours
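The “who feels it first?” question above is, at its core, a graph traversal over the dependency map. A minimal sketch, assuming a hypothetical service catalog where edges point from a service to its direct callers:

```python
from collections import deque

# Hypothetical dependency map: service -> services that call it directly.
DEPENDENTS = {
    "orders-db": ["payments-api", "search-indexer"],
    "payments-api": ["checkout-web"],
    "search-indexer": [],
    "checkout-web": [],
}

def blast_radius(service: str) -> set:
    """Everyone downstream who feels a failure, directly or indirectly (BFS)."""
    seen, queue = set(), deque([service])
    while queue:
        for dependent in DEPENDENTS.get(queue.popleft(), []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen

print(sorted(blast_radius("orders-db")))
# ['checkout-web', 'payments-api', 'search-indexer']
```

A real service catalog replaces the dict with live data from CI/CD and observability, but the query is the same: if this fish dies, which others get sick?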
When you can see the currents, you can reason about propagation and containment. You can say, with confidence, “If this fish dies, these three will be sick, and these others will be fine.”
Conclusion: Keep Redrawing the Aquarium
The Pencil-Drawn Incident Aquarium is intentionally low-fi. That’s the point.
Reliability isn’t about having the fanciest tools or the most elaborate diagrams. It’s about:
- Reducing friction under stress through thoughtful automation
- Treating reliability as a living, knowledge-based system tied to your domain
- Ensuring leadership invests in skills, time, and alignment
- Embedding reliability into everyday work, not bolting it on later
- Following standard work so that everyone knows how to move when the water churns
- Resisting premature complexity, keeping systems understandable
- Maintaining clear architecture and dependency maps so you can see how currents really flow
If your current production environment feels more like a dark, chaotic ocean than a clear aquarium, start small:
- Draw your own pencil-and-paper dependency map
- Label your most critical fish and their currents
- Automate one painful incident task
- Formalize one standard practice
Then keep redrawing.
Over time, you’ll build an environment where incidents are still stressful—but no longer mysterious. The paper fish will show you where the hidden currents are, and your teams will have the skills, structures, and systems to swim confidently through them.