
Systems don't fail suddenly

The 3am alert is not where the failure begins. It is where accumulated decisions become impossible to ignore.

Systems do not fail suddenly.

The 3am alert, the cascading error, the incident that takes down a product for four hours — these feel sudden. They are not. They are the visible surface of something that was already true, revealed by the particular set of conditions that made it unavoidable.

Understanding this changes how you build, how you operate, and what you do after something breaks.

The slow accumulation

Most failures have long prehistories.

A database index that was never created because the query ran fast enough in development. A timeout value set “temporarily” that was never revisited. A manual deploy process that worked fine until it did not. A runbook that was accurate in 2022 and has been wrong ever since.

None of these are dramatic. None of them feel like risk at the time they are introduced. They feel like normal work under normal constraints — a shortcut taken because the deadline was real, a decision deferred because the system was working, a document not updated because updating it seemed less urgent than everything else.

Failures are the compound interest on ordinary decisions made in ordinary times.

The trigger is not the cause

What triggers a failure is rarely the cause.

A spike in traffic. A configuration change. A dependency update. A holiday weekend with reduced on-call coverage. These are triggers. The cause was already present. The trigger is the condition that made the existing weakness visible.

This matters because post-mortems often focus on the trigger. “We deployed on Friday.” “Traffic was unusually high.” “The disk was full.” These are true, but they are not where the learning is.

The learning is in what made the system susceptible to that trigger in the first place. Why did traffic volume reveal that bug? Why did a configuration change expose that assumption? Why did reduced coverage mean no one noticed until the damage was done?

The trigger explains when. The prehistory explains why.

Maintenance as a practice

The way to avoid failures is not to respond better to them. It is to continuously reduce the surface area of accumulated risk.

This is maintenance — not in the narrow sense of keeping servers running, but as a discipline of observation and reduction. Identifying the indexes that do not exist. Revisiting the timeouts that were set temporarily. Reviewing the runbooks after the conditions they describe have changed.

Maintenance is not glamorous. It rarely produces a ticket anyone is excited to write. Its value shows up as absence: the incidents that do not happen, the failures caught before they cascade, the engineer who does not spend their Saturday undoing six months of accumulated fragility.

The best operational teams treat maintenance as a first-class practice, not as what you do when nothing else is scheduled.

What a good post-mortem actually does

A post-mortem that ends at the trigger has not done its job.

A useful post-mortem asks: where else in the system does this kind of assumption exist? What made this failure mode invisible until now? What in our processes made it normal to defer the fix?

The goal is not to document the specific incident. The goal is to understand the conditions that allowed the failure to accumulate, because those conditions are probably generating other failures that have not surfaced yet.

The most valuable output of a post-mortem is not the action items. It is the updated mental model of where the system is fragile.

Operating with honesty

Teams that operate well do not assume reliability — they measure it.

They know which parts of the system are fragile, not because they intended it that way, but because they have been honest about where the risk lives. They have run failure simulations before production ran them. They have read the code that nobody understands. They have asked what happens when this particular service is slow, not just when it is down.

This is not pessimism. It is operational honesty. The risk exists whether or not anyone has looked at it. Looking does not create the problem; it reveals it at a time when something can still be done.

On time

Time is the one variable no system escapes.

Every decision has a date on it. The schema chosen in year one is still there in year five. The service that was “temporary” is production-critical in year three. The abstraction that made sense at a hundred thousand users does not make sense at ten million.

Systems age. Not gracefully, unless someone is continuously working against entropy. The teams that build durable systems are not the ones who made better initial decisions — they are the ones who revisit decisions as the context around them changes.

Failure is often just time making visible a decision that should have been revisited earlier.

The question worth asking is not “why did this break?” It is “what else have we been too busy to look at?”