The automation that broke three months ago and nobody noticed.

There is a specific type of operational failure that is worse than a crash.

A crash is obvious. Something stops working. Someone complains. You fix it. The damage is bounded by how long it takes you to notice and respond.

The other type of failure is the silent one. The automation that keeps running but starts producing wrong outputs. The Zapier chain where one step starts failing and the rest of the chain carries on regardless. The report that has been pulling from the wrong date range for six weeks. The CRM sync that stopped updating three months ago and nobody noticed because the sales team stopped trusting it long before that and started keeping their own notes.

These failures accumulate. By the time someone finds them, the damage runs deep and the trail back to the original cause is long.

Why most automations fail silently

The honest reason is that most automations are built for the happy path.

Someone identifies a repetitive task. They connect two tools. It works. They move on. Nobody builds in what happens when one of the connected tools changes its API. Nobody defines what failure looks like. Nobody sets up monitoring to detect when the expected output stops arriving. Nobody names who is responsible for checking.

So the automation runs. And then something changes. The tool updates, the data format shifts, a field name changes, an API key expires. And the automation keeps running, but it is running on nothing. Or worse, it is running on bad data and quietly poisoning downstream systems with it.

The Zapier problem

Zapier is the tool that gets blamed for this the most, and not entirely fairly. Zapier works well for simple, stable connections between tools. The problem is that it is so easy to use that people build complex, multi-step automation chains in it and then rely on them like infrastructure without maintaining them like infrastructure.

They are not infrastructure. They are glue. And glue dries out.

A Zapier chain with eight steps has eight failure points. When a step fails, Zapier will either stop and send an error email that goes to an inbox nobody monitors, or it will continue with a null value that propagates through the rest of the chain. Either way, most teams find out weeks later when something obviously wrong surfaces in a report or a client complaint.
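The null-propagation problem is not specific to any one tool, and the defence can be sketched in a few lines of plain Python. This is an illustration only, not Zapier's API: `run_chain`, `StepFailure`, and the two example steps are hypothetical names. The idea is that every step's output is checked before the next step runs, so an empty value stops the chain with a named failure instead of travelling downstream.

```python
from typing import Any, Callable, Optional


class StepFailure(Exception):
    """Raised when a step returns nothing usable, so the chain stops loudly."""


def run_chain(steps: list, payload: Any) -> Any:
    """Run (name, function) steps in order; stop at the first empty result."""
    for name, step in steps:
        payload = step(payload)
        if payload is None or payload == "" or payload == {}:
            raise StepFailure(f"step '{name}' produced an empty value")
    return payload


# Hypothetical steps, for illustration only.
def lookup_customer(order: dict) -> Optional[dict]:
    customers = {"A-1": {"email": "buyer@example.com"}}
    return customers.get(order["customer_id"])  # None for unknown customers


def build_invoice(customer: dict) -> dict:
    return {"to": customer["email"], "status": "draft"}


CHAIN: list = [
    ("lookup_customer", lookup_customer),
    ("build_invoice", build_invoice),
]
```

With a known customer the chain returns a draft invoice. With an unknown one it raises `StepFailure` naming `lookup_customer`, which is the step someone actually needs to look at.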

What monitoring actually looks like

Building automation that stays up means building in two things from the start: observability and ownership.

Observability means you know when something changes. Not just when something breaks with an error, but when the output is outside expected bounds. If your order sync usually moves 50 records a day and today it moved zero, that should trigger an alert whether or not there was a technical error. The output being wrong is the failure, not the error message.
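A minimal sketch of that kind of check, in Python. The band of 25 to 100 records is an assumption chosen around the 50-a-day example, and `check_volume` and `ExpectedRange` are illustrative names, not part of any particular tool.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ExpectedRange:
    low: int
    high: int


def check_volume(name: str, count: int, expected: ExpectedRange) -> Optional[str]:
    """Return an alert message if count is outside the expected range, else None."""
    if count < expected.low:
        return f"{name}: moved {count} records, expected at least {expected.low}"
    if count > expected.high:
        return f"{name}: moved {count} records, expected at most {expected.high}"
    return None


# Assumed band around a typical 50 records a day.
ORDER_SYNC = ExpectedRange(low=25, high=100)
```

Note that the check never looks at whether the sync reported an error. A day with zero records and no error still produces an alert message, and so does a day with ten times the usual volume.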

Ownership means there is a named person responsible for each automation. Not a team. A person. When the automation fails, that person gets the alert, understands what it does, and knows how to fix it or escalate it. Without named ownership, alerts go to inboxes and nothing happens.
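Ownership can be made checkable rather than aspirational. The sketch below assumes a simple inventory of automations kept in code. The `Automation` fields and the list of shared mailbox names are hypothetical, and the shared-inbox test is a rough heuristic, not a rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Automation:
    name: str
    owner: str       # one named person who receives the alert
    escalation: str  # who to go to if the owner cannot fix it
    runbook: str     # where the fix procedure is documented


# Assumed heuristic: these mailbox names usually mean a team, not a person.
SHARED_MAILBOXES = {"team", "ops", "admin", "info", "support", "noreply"}


def unowned(automations: list) -> list:
    """Return names of automations whose owner is blank or a shared mailbox."""
    flagged = []
    for automation in automations:
        local_part = automation.owner.strip().lower().split("@")[0]
        if not local_part or local_part in SHARED_MAILBOXES:
            flagged.append(automation.name)
    return flagged
```

Running `unowned` over the full inventory gives a short list of automations whose alerts currently go nowhere useful.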

How we build it differently

When we build automation for clients, we build it in Odoo workflows wherever possible. Not because Odoo is perfect, but because it is the same system where the data lives. When something goes wrong, there is one place to look. The failure surface is smaller.

Where we use external automation, we build explicit monitoring. Expected output ranges. Failure notifications to a real channel that a real person watches. Documented runbooks for what to do when something breaks.

We also build reversibility. Every automation we deploy has a documented way to turn it off and a way to identify and correct the records it affected if something went wrong. If you cannot roll back an automation, you are one silent failure away from a problem you cannot fix cleanly.
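As an illustration of the two properties described here, an off switch and a record of what was touched, here is a small Python sketch. It is not how any specific platform implements this. `ReversibleAutomation` and its in-memory `records` dictionary are stand-ins for a real system and its database.

```python
from dataclasses import dataclass, field
from typing import Any

_MISSING = object()  # marks "field did not exist before the change"


@dataclass
class ReversibleAutomation:
    name: str
    enabled: bool = True                          # the documented off switch
    journal: list = field(default_factory=list)   # (record_id, field, old, new)

    def apply(self, records: dict, record_id: str, field_name: str, new: Any) -> bool:
        """Write a change only if enabled, remembering the previous value."""
        if not self.enabled:
            return False
        old = records[record_id].get(field_name, _MISSING)
        records[record_id][field_name] = new
        self.journal.append((record_id, field_name, old, new))
        return True

    def rollback(self, records: dict) -> int:
        """Undo journaled changes newest-first; return how many were reverted."""
        reverted = 0
        while self.journal:
            record_id, field_name, old, _new = self.journal.pop()
            if old is _MISSING:
                records[record_id].pop(field_name, None)
            else:
                records[record_id][field_name] = old
            reverted += 1
        return reverted
```

The journal is what makes a clean correction possible: it lists exactly which records were affected and what they held before. This sketch does not check whether a person edited a record after the automation did, which a real rollback would need to handle.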

The audit question

Here is the question to ask about every automation currently running in your business:

If this automation has been producing wrong outputs for the past 30 days, would you know?

If the answer is no, or "probably not," that automation needs monitoring. Not eventually. Now.

Go through your Zapier account. Your Make scenarios. Your connected tools. Find the ones that run silently in the background and ask whether anyone would know if they stopped working. The answer is usually uncomfortable.
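The audit can be partly mechanised once each automation records when it last produced a good output. The Python sketch below assumes those timestamps are available somewhere. `stale_automations` and the per-automation silence limits are hypothetical, and the limits themselves are a judgment call for each automation.

```python
from datetime import datetime, timedelta, timezone


def stale_automations(last_output: dict, max_silence: dict, now: datetime) -> list:
    """Return names whose last good output is older than their allowed silence.

    An automation with no recorded output at all is treated as stale.
    """
    stale = []
    for name, allowed in max_silence.items():
        seen = last_output.get(name)
        if seen is None or now - seen > allowed:
            stale.append(name)
    return sorted(stale)


# Example usage with made-up timestamps.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
last_seen = {
    "crm_sync": now - timedelta(days=90),    # silent for three months
    "order_sync": now - timedelta(hours=3),  # healthy
}
limits = {
    "crm_sync": timedelta(days=1),
    "order_sync": timedelta(days=1),
    "weekly_report": timedelta(days=7),      # never recorded an output
}
quiet = stale_automations(last_seen, limits, now)
```

Here `quiet` contains `crm_sync` and `weekly_report`: one that went silent long ago and one that was never being watched in the first place.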

The meta-lesson

The businesses that handle this well are the ones that treat automation as infrastructure, not as magic. Infrastructure needs maintenance schedules, monitoring, ownership, and documentation. Magic is what you believe in until it fails.

We learned this in our own operations. Some of our most painful operational problems were automations we built ourselves that we trusted because they had always worked. Until they didn't.

Every automation we deploy now has a health check. Someone owns it. There is documentation. There is a way to turn it off.

It is more work upfront. It is dramatically less work than the alternative.

— Qann Commerce · qann.co
