Loud failure is a gift. Quiet failure is a debt.

That's the lesson my last quarter of agent operations has taught me — and it's the one most reliability engineering doesn't take seriously enough. We obsess over error handling, retry logic, exception traces. The agents that crashed loudly got fixed quickly. The ones that lied politely left me with weeks of corrupted state to dig out by hand.

This post is the long-form argument behind a week of LinkedIn posts. If you saw any of them, you've seen the parts. Here's the whole.

Three failure modes that don't crash

The agents I run — JARVIS, Loki, Vision, Wong, Pepper, and a handful of smaller specialists — fail in three distinct ways that monitoring and error logging do not catch. None of them throw stack traces. All of them have cost me real time.

1. Silent halts

The agent stops without raising an error.

OAuth tokens expire mid-run. A rate limiter throttles to zero. A cron job misfires and skips silently. A background worker dies but the supervisor reports "healthy." A queue consumer disconnects and jobs accumulate.

In every case, the system looks fine. Logs are clean. Health checks pass. Nothing crashed. Nothing is running.

Detection isn't a logging problem. It's a liveness problem. You need a heartbeat — a positive signal that says "I am still doing the thing" — and an alert that pages someone when the heartbeat stops.

Most teams have a dashboard. Most dashboards are nobody's morning routine.
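
A minimal version of that heartbeat is two pieces: the agent writes a timestamp after each unit of real work, and a separate watchdog pages someone when the timestamp goes stale. This is a sketch under assumptions, not anyone's production code: the file path, MAX_SILENCE_SECONDS, and the alert() hook are all placeholders; a Redis key, a metrics gauge, or a dead man's switch service works the same way.

```python
# Liveness sketch: the agent emits a positive "still doing the thing" signal;
# a separate watchdog treats the absence of that signal as the alert.
# HEARTBEAT_FILE and alert() are assumptions, not part of any real agent.
import time
from pathlib import Path

HEARTBEAT_FILE = Path("/tmp/vision_agent.heartbeat")  # hypothetical path
MAX_SILENCE_SECONDS = 300  # silence longer than this becomes an incident


def beat() -> None:
    """Called by the agent after each completed unit of real work."""
    HEARTBEAT_FILE.write_text(str(time.time()))


def check_liveness(alert) -> None:
    """Run from a separate process or cron, somewhere the agent can't take down."""
    if not HEARTBEAT_FILE.exists():
        alert("agent has never reported a heartbeat")
        return
    silence = time.time() - float(HEARTBEAT_FILE.read_text())
    if silence > MAX_SILENCE_SECONDS:
        alert(f"agent silent for {silence:.0f}s; last heartbeat is stale")
```

The point isn't the file. The point is that the alert fires on the absence of a signal, not the presence of an error.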

2. Polite lies

The agent or upstream API returns success — but the semantics are subtly wrong.

This is the worst class of failure I deal with, by far. The LLM returns confident output for a hallucinated tool call. The API accepts the request and interprets it differently than intended. JSON parses but lacks expected fields. Embeddings match, but semantically nothing meaningful was retrieved. Tool args are wrong but plausible — corrupting every subsequent step.

200 OK. No exceptions. Downstream systems carry the wrong state forward.

I'll give you the canonical case from this week. My Vision agent triages my Todoist inbox every morning, rescheduling tasks I haven't gotten to into the future on my behalf. It's been doing this reliably for two months.

Last Tuesday I noticed my weekly review hadn't appeared in my queue. Then my monthly billing reconciliation. Then a couple of other recurring tasks I'd been ignoring. I dug in and found that three weeks of recurring tasks had silently disappeared from my system.

The cause: Todoist's due_string field is a full replacement of the due object. If you post {"due_string": "tomorrow"} against a recurring task, the API accepts it, returns 200, reports success — and silently strips the is_recurring flag. The task is now a one-off. Complete it, and it never comes back.

Vision did its job perfectly. Todoist honored its API contract perfectly. My recurring tasks died and nothing, anywhere, reported a failure. Each individual operation was correct. The combined behavior was wrong.

That's a polite lie. The API didn't refuse the request — it just answered a different question than the one I asked.

The fix isn't error handling, because there was no error. The fix is a pre-write guard: GET the task first, check due.is_recurring, refuse silently destructive operations. Force the agent to see the consequence before it commits.
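
Here's roughly what that guard looks like for this specific case. It's a sketch, not Vision's actual code: the endpoint paths and the due.is_recurring field follow Todoist's REST v2 API as I read the docs, and TODOIST_TOKEN is assumed to be set in the environment.

```python
# Pre-write guard for the case above: read before you write, and refuse
# updates that would silently strip recurrence. Sketch only.
import os
import requests

API = "https://api.todoist.com/rest/v2/tasks"
HEADERS = {"Authorization": f"Bearer {os.environ['TODOIST_TOKEN']}"}


class DestructiveWriteError(RuntimeError):
    """Raised instead of letting a 200 OK hide a semantic change."""


def reschedule(task_id: str, due_string: str, allow_recurrence_loss: bool = False) -> None:
    # 1. GET the task first: see the consequence before committing.
    task = requests.get(f"{API}/{task_id}", headers=HEADERS, timeout=10).json()
    due = task.get("due") or {}

    # 2. due_string replaces the whole due object, so a plain date on a
    #    recurring task turns it into a one-off. Make that explicit.
    if due.get("is_recurring") and not allow_recurrence_loss:
        raise DestructiveWriteError(
            f"task {task_id} is recurring ({due.get('string')!r}); "
            f"rescheduling to {due_string!r} would strip the recurrence"
        )

    # 3. Only now perform the write.
    requests.post(f"{API}/{task_id}", headers=HEADERS,
                  json={"due_string": due_string}, timeout=10).raise_for_status()
```

The agent can still take the destructive path, but only by saying so out loud.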

3. Drift

Nothing in your code changed. The model behind it did. Or the data did. Or the world did.

Same prompts, different answers. Output style shifts. Accuracy on edge cases erodes. Vendor "improvements" change downstream tool behavior.

You only notice drift when the answers start being wrong in the same direction — when the bias in the new model surfaces as a pattern. Random degradation looks like noise. Systematic degradation looks like a bug of your own making.

Detection requires a fixed reference. You capture a set of golden cases — known-good inputs and outputs — and you re-run them on a schedule. When the output diverges past a threshold, an alert fires.

Most teams skip this because it feels redundant. It's not. It's the only way drift becomes visible before it shows up as customer impact.
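
A first pass at this doesn't need an eval framework. The sketch below assumes a golden_cases.json file, a run_agent() callable, and an alert() hook, none of which exist outside this example; the token-overlap score is a deliberately crude stand-in for whatever comparison actually fits your outputs.

```python
# Drift detection sketch: a fixed set of golden cases re-run on a schedule,
# with an alert when the outputs diverge past a threshold.
import json
from pathlib import Path

GOLDEN_PATH = Path("golden_cases.json")   # [{"input": ..., "expected": ...}, ...]
DRIFT_THRESHOLD = 0.85                    # alert when average similarity drops below this


def similarity(a: str, b: str) -> float:
    """Crude token-overlap score; stand-in for a real semantic comparison."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)


def check_drift(run_agent, alert) -> float:
    cases = json.loads(GOLDEN_PATH.read_text())
    if not cases:
        raise ValueError("no golden cases captured yet")
    scores = [similarity(run_agent(c["input"]), c["expected"]) for c in cases]
    avg = sum(scores) / len(scores)
    if avg < DRIFT_THRESHOLD:
        alert(f"drift detected: golden-case similarity fell to {avg:.2f}")
    return avg
```

Schedule it like a cron job and treat the alert like a page, not a dashboard tile.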

Why these three are categorically different

A regular bug is loud. The system was supposed to do X, it did Y, an error fired or a test failed, you fix it. Bounded cost, immediate detection, single repair.

These three are quiet. Quiet failures cost you everything downstream of when they started — every action, decision, or correction made on top of corrupted state.

Imagine a polite-lie bug at hour zero of an eight-hour pipeline. Each subsequent step processes the corrupted state. Decisions are made. Notifications fire. Records are written. Customers see results. By hour eight, you have eight hours of polluted output, and the repair isn't a fix — it's an archaeology project.

This is the asymmetry that makes structural fixes mandatory. Not "best practice." Mandatory. The repair cost of a quiet failure grows roughly exponentially with the time it goes undetected. Every hour you don't catch it, the cleanup roughly doubles.

The Agent Reliability Audit

I built a three-dimension diagnostic mapping each detection layer to its corresponding failure mode. You score yourself 5/3/1 on each dimension — covered, partial, or not in place. Total out of 15.

Liveness — catches silent halts

  • 5: Heartbeat + alerting + recent verification (someone confirmed the alert pipeline this week)
  • 3: Heartbeat exists, alerting incomplete (you'd notice eventually)
  • 1: "We'd notice if it stopped" (no, you wouldn't)

Verification — catches polite lies

  • 5: Output validators + golden cases + pre-write guards on destructive operations
  • 3: Schema validators only (validates shape, not meaning)
  • 1: "The LLM said it was done"

Drift Detection — catches drift

  • 5: Baselines + scheduled re-runs + drift alerts
  • 3: Baselines exist, no schedule
  • 1: "We'll know if it gets worse" (no, you won't)

What the score means

  • 13–15: Production-grade. Your agents fail loud.
  • 9–12: Partial coverage. Audit your gaps.
  • 5–8: Mostly trusting silence. High risk.
  • 3–4: You're shipping silence. Stop and fix.
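
If it's easier to keep the rubric next to your agents than in a PDF, the whole thing fits in a few lines. This is just the scoring above expressed as code, nothing more; the function and variable names are mine.

```python
# The audit as data: score each dimension 5/3/1 and read the total.
READINGS = [
    (13, "Production-grade. Your agents fail loud."),
    (9,  "Partial coverage. Audit your gaps."),
    (5,  "Mostly trusting silence. High risk."),
    (3,  "You're shipping silence. Stop and fix."),
]


def audit(liveness: int, verification: int, drift: int) -> str:
    assert all(s in (1, 3, 5) for s in (liveness, verification, drift))
    total = liveness + verification + drift
    reading = next(msg for floor, msg in READINGS if total >= floor)
    return f"{total}/15: {reading}"


print(audit(liveness=5, verification=3, drift=1))  # "9/15: Partial coverage. Audit your gaps."
```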

The audit takes 90 seconds to run. The fixes take longer.

You can grab the one-page PDF version at elliotbetancourt.com/resources/agent-reliability-audit.

Why structural fixes beat reactive ones

Most reliability work is reactive. Something breaks; you add error handling, retries, dead-letter queues. The instinct is to make failure modes survivable.

For loud failures, that's right. For quiet failures, it's exactly wrong.

You can't add error handling for the case where there is no error. You can't retry an operation that returned success. You can't dead-letter a request that the API accepted. The reactive toolkit assumes there is something to react to — a signal that triggers your handling. Quiet failures don't generate signals.

Structural fixes work differently. A heartbeat doesn't react to a halt — it reports continuous liveness, and its absence is the alert. A verifier doesn't react to a polite lie — it tests every output against a meaning-level contract before the output gets used downstream. A drift detector doesn't react to drift — it re-runs golden cases on a schedule and flags divergence before it becomes systemic.

Structural fixes catch failures before they become observable as cost. Reactive fixes catch them after. For quiet failures, "after" can be very expensive.
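
To make "meaning-level contract" concrete: a schema validator would pass everything below, because the shape is fine. The invariants are the contract. This is a hypothetical verifier for a triage agent like Vision, not a general recipe; every field name here is mine, invented for the example.

```python
# Meaning-level verification sketch: check invariants that matter downstream,
# not just that the output parses. All names are hypothetical.
from dataclasses import dataclass


@dataclass
class TriageResult:
    task_id: str
    new_due: str
    was_recurring: bool
    still_recurring: bool


def verify(before_count: int, after_count: int, results: list[TriageResult]) -> list[str]:
    """Return a list of contract violations; an empty list means the output may proceed."""
    violations = []
    if after_count != before_count:
        violations.append(f"task count changed: {before_count} -> {after_count}")
    for r in results:
        if r.was_recurring and not r.still_recurring:
            violations.append(f"{r.task_id} lost its recurrence")
        if not r.new_due:
            violations.append(f"{r.task_id} rescheduled to an empty due date")
    return violations
```

It sits between the agent's output and the write. A non-empty list blocks the commit and pages someone; that's the whole job.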

What to do tomorrow

If you build agent systems in production, here's where I'd start:

  1. Run the audit against your most important agent. The one whose output you'd trust without checking. That's the one most likely to be quietly broken.
  2. Pick the lowest-scoring dimension. Fix one thing. Don't try to upgrade three dimensions at once.
  3. Capture a golden case set. Even five examples beats none. Without a reference, you can't see drift.
  4. Write a pre-write guard for any agent that mutates persistent state through a "helpful" API. The Todoist case isn't unique — it's a pattern. Stripe webhooks, Notion API blocks, GitHub issue updates, calendar invitations — they all have analogous footguns where the API "succeeds" with the wrong semantics.
  5. Find a dashboard you don't read. Replace it with an alert someone has to acknowledge.

This isn't more work. It's different work. Reliability isn't built from layers of error handling — it's built from layers of detection.

Loud failure is a gift. Quiet failure is a debt.

Pay it now or pay it later. Compound interest applies either way.