Failure Paths & Correction Loops

Deterministic recovery without drift

Most agent systems break not because the environment is difficult, but because they have no deterministic way to handle failure. When an LLM output is invalid — a wrong selector, a missing field, an impossible workflow step — typical agents retry blindly, produce new errors, or drift into different attractor regions.

Production automation systems cannot behave this way. They need structured failure semantics and deterministic correction loops that recover state without introducing noise.

This section defines failure path design, correction loops, and how they integrate with Zero-Context Architecture and operator composition.

1. Why failure handling is the weakest part of most agents

LLM-driven agents often treat failure as:

  • a reason to regenerate the whole plan,
  • a prompt to "try again" with more context,
  • something to patch with chain-of-thought,
  • a signal to produce a completely different output.

This causes:

  • divergent retries,
  • nondeterministic behavior,
  • looping browser automations,
  • oscillation between attractor states,
  • cascading corruption of state.

Without deterministic failure paths, all reliability collapses.

2. Formal structure of failure paths

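At each operator stage, Pₛ projects the state xₜ onto a domain representation zₜ, the operator f produces a candidate output uₜ, and the verifier V checks it: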

zₜ = Pₛ(xₜ)
uₜ = f(zₜ)
yₜ = V(uₜ)

If verification fails:

yₜ = ⊥      (invalid transition)

Now the system must transition into a failure path operator, not retry the same transformation.

Let the failure operator be:

F : (xₜ, uₜ) → xₜ′

Where xₜ′ is a safe, noise-free recovery state.
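
As a rough illustration, the stage and its failure operator can be sketched in Python. Everything named here (StageResult, run_stage, digits_only) is hypothetical and not part of any existing library; the verifier V is reduced to a pass/fail check and None stands in for ⊥.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass(frozen=True)
class StageResult:
    output: Optional[Any]  # verified output y_t; None stands in for ⊥
    next_state: Any        # x_t on success, recovery state x_t' on failure
    failed: bool


def run_stage(
    x: Any,
    project: Callable[[Any], Any],          # P_s
    operate: Callable[[Any], Any],          # f
    verify: Callable[[Any], bool],          # V, reduced to a pass/fail check
    on_failure: Callable[[Any, Any], Any],  # F : (x_t, u_t) -> x_t'
) -> StageResult:
    z = project(x)
    u = operate(z)
    if verify(u):
        return StageResult(output=u, next_state=x, failed=False)
    # Verification failed: do not call operate() again on the same z.
    return StageResult(output=None, next_state=on_failure(x, u), failed=True)


def digits_only(x: str, u: str) -> str:
    """Toy failure operator: rebuild the state from x, ignoring the bad output u."""
    return "".join(ch for ch in x if ch.isdigit())


# The identity lambda is a stand-in for the LLM operator.
ok = run_stage(" 42 ", str.strip, lambda z: z, str.isdigit, digits_only)
bad = run_stage(" 4x2 ", str.strip, lambda z: z, str.isdigit, digits_only)
```

In this sketch the failing stage never re-invokes the operator; it only returns the recovery state, and the caller decides which stage runs next.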

3. Types of deterministic failure paths

1. Reject & Re-extract

If the output is invalid, discard it and re-run the domain extractor:

xₜ′ = reproject(xₜ)

Used in:

  • browser flows when DOM changes mid-cycle,
  • document extraction when a field is missing.
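
A minimal sketch of reproject, assuming the state carries a snapshot of the source plus a counter. The State class and the read_source callback are illustrative assumptions, not definitions from this article.

```python
from dataclasses import dataclass, replace
from typing import Callable


@dataclass(frozen=True)
class State:
    snapshot: str  # the extracted slice of the source (DOM fragment, text line, ...)
    version: int   # how many times the slice has been re-extracted


def reproject(state: State, read_source: Callable[[], str]) -> State:
    """Discard the stale slice and extract a fresh one from the source."""
    return replace(state, snapshot=read_source().strip(), version=state.version + 1)


stale = State(snapshot="<button id='old'>", version=0)
fresh = reproject(stale, lambda: "  <button id='submit'>  ")
```

The invalid output is never passed in, so nothing from the failed attempt can leak into the new state.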

2. Reject & Canonicalize

Invalid outputs trigger a canonical rewrite:

zₜ′ = canonicalize(zₜ)

Used in:

  • inconsistent tables,
  • malformed JSON from LLM outputs.
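
For the malformed-JSON case, one possible canonicalize is a fixed sequence of rewrite rules followed by a strict parse. The rules below (trim to the outermost braces, drop trailing commas, re-serialize with sorted keys) are assumptions chosen for illustration, and canonicalize_json is a hypothetical name.

```python
import json
import re


def canonicalize_json(raw: str) -> str:
    """Apply fixed rewrite rules, then parse strictly and re-serialize."""
    # Rule 1: keep only the span from the first "{" to the last "}".
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found")
    body = raw[start:end + 1]
    # Rule 2: drop trailing commas before a closing brace or bracket.
    # Caveat: this also rewrites such sequences inside string values.
    body = re.sub(r",\s*([}\]])", r"\1", body)
    # Rule 3: strict parse, then a single canonical serialization.
    return json.dumps(json.loads(body), sort_keys=True, separators=(",", ":"))


raw = 'Result: {"b": 2, "a": [1, 2,],} Hope this helps!'
canonical = canonicalize_json(raw)
```

If the strict parse still fails, json.loads raises a ValueError subclass, which a caller could route to the next failure path instead of retrying.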

3. Fallback operator

Switch to a simpler or more constrained operator:

xₜ′ = fallback(oᵢ)(xₜ)

Used in:

  • browser actions: fallback to direct selector search,
  • document fields: fallback to regex extraction.
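
The fallback idea can be sketched as an ordered list of operators tried in a fixed sequence, each checked by the same verifier. The helper names (with_fallbacks, flaky_model, regex_date) are hypothetical, and flaky_model is only a stand-in for an LLM call that returned an invalid value.

```python
import re
from typing import Callable, Optional, Sequence

Operator = Callable[[str], Optional[str]]

ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_iso_date(value: str) -> bool:
    return ISO_DATE.fullmatch(value) is not None


def with_fallbacks(operators: Sequence[Operator],
                   verify: Callable[[str], bool]) -> Operator:
    """Return an operator that tries each candidate in order until one verifies."""
    def run(x: str) -> Optional[str]:
        for op in operators:
            u = op(x)
            if u is not None and verify(u):
                return u
        return None  # every operator failed: the caller should escalate
    return run


def flaky_model(x: str) -> Optional[str]:
    return "next Monday"  # stand-in for an invalid LLM output


def regex_date(x: str) -> Optional[str]:
    match = ISO_DATE.search(x)
    return match.group(0) if match else None


extract = with_fallbacks([flaky_model, regex_date], is_iso_date)
found = extract("Start date: 2024-03-04 (next Monday)")
not_found = extract("Start date: to be confirmed")
```

Because the order of operators is fixed, the same input always takes the same path through the chain.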

4. Escalation path

If all operators fail:

xₜ′ = escalate(xₜ)

Where escalation is fully deterministic:

  • log issue,
  • alert human reviewer,
  • store trace.

No guessing.
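
One way to keep escalation deterministic is to derive the trace identifier from the content of the failure rather than from a clock or a random value. The sketch below assumes in-memory lists as stand-ins for the logger, the alerting channel, and the trace store; EscalationSink and escalate are hypothetical names.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class EscalationSink:
    log: List[str] = field(default_factory=list)           # stand-in for a logger
    review_queue: List[str] = field(default_factory=list)  # stand-in for alerting
    traces: Dict[str, str] = field(default_factory=dict)   # stand-in for a trace store


def escalate(state: dict, attempts: List[str], sink: EscalationSink) -> dict:
    """Log the issue, alert a reviewer, store the trace; never guess an output."""
    trace = json.dumps({"state": state, "attempts": attempts}, sort_keys=True)
    trace_id = hashlib.sha256(trace.encode("utf-8")).hexdigest()[:12]
    sink.log.append(f"escalated {trace_id}")
    sink.review_queue.append(trace_id)
    sink.traces[trace_id] = trace
    # Return a new state object; the input state is left untouched.
    return {**state, "status": "needs_review", "trace_id": trace_id}


sink = EscalationSink()
state = {"step": "submit"}
first = escalate(state, ["selector missing"], sink)
second = escalate(state, ["selector missing"], sink)
```

Escalating the same failure twice yields the same trace identifier, which makes duplicate alerts easy to detect downstream.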

5. Branch to alternative graph edge

In multi-jurisdiction or multi-form workflows, a failure may indicate that a different branch of the workflow graph G is the valid one:

xₜ′ = follow_edge(G, condition)
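
A small sketch of follow_edge, assuming the graph is stored as an adjacency map whose edges carry predicates evaluated in a fixed order. The signature differs slightly from the formula above (it also takes the current node and state), and the node and form names are made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Tuple

Edge = Tuple[Callable[[dict], bool], str]  # (condition, target node)
Graph = Dict[str, List[Edge]]


def follow_edge(graph: Graph, node: str, state: dict) -> Optional[str]:
    """Return the target of the first edge whose condition holds, else None."""
    for condition, target in graph.get(node, []):
        if condition(state):
            return target
    return None  # no valid branch: hand over to escalation


G: Graph = {
    "tax_form_failed": [
        (lambda s: s.get("jurisdiction") == "A", "form_a"),
        (lambda s: s.get("jurisdiction") == "B", "form_b"),
    ]
}

branch = follow_edge(G, "tax_form_failed", {"jurisdiction": "B"})
no_branch = follow_edge(G, "tax_form_failed", {"jurisdiction": "C"})
```

Edges are checked in list order, so overlapping conditions still resolve to a single, repeatable branch.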

4. Real Example: Browser Automation Failure Loop

Situation

LLM proposes an action:

click(#submit)

Verifier sees:

  • selector missing,
  • element disabled,
  • element off-screen.

Correct failure path

1. Reject action
2. Re-extract visible DOM slice
3. Canonicalize DOM
4. Re-run operator f on new representation

This sequence is deterministic.
No drift.
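
The four steps can be sketched against a simulated DOM, here a list of dictionaries rather than a real browser. The functions verify_click, visible_slice, and correct_click are hypothetical, and pick_submit is a deterministic stand-in for the operator f.

```python
from typing import Callable, Dict, List, Optional

Element = Dict[str, object]


def verify_click(selector: str, dom: List[Element]) -> Optional[str]:
    """Return a failure reason, or None if the click is valid."""
    match = next((e for e in dom if e["selector"] == selector), None)
    if match is None:
        return "selector missing"
    if not match["enabled"]:
        return "element disabled"
    if not match["visible"]:
        return "element off-screen"
    return None


def visible_slice(dom: List[Element]) -> List[Element]:
    """Steps 2 and 3: keep clickable elements only, in a canonical order."""
    kept = [e for e in dom if e["visible"] and e["enabled"]]
    return sorted(kept, key=lambda e: str(e["selector"]))


def correct_click(proposed: str,
                  read_dom: Callable[[], List[Element]],
                  operator: Callable[[List[Element]], str],
                  max_rounds: int = 2) -> Optional[str]:
    for _ in range(max_rounds):
        dom = read_dom()
        if verify_click(proposed, dom) is None:
            return proposed
        # Step 1: reject. Step 4: re-run the operator on the new representation.
        proposed = operator(visible_slice(dom))
    # Final check; None means the bounded loop is exhausted and should escalate.
    return proposed if verify_click(proposed, read_dom()) is None else None


dom = [
    {"selector": "#submit-v2", "enabled": True, "visible": True, "text": "Submit"},
    {"selector": "#cancel", "enabled": True, "visible": False, "text": "Cancel"},
]


def pick_submit(elements: List[Element]) -> str:
    return next((str(e["selector"]) for e in elements if e["text"] == "Submit"), "")


recovered = correct_click("#submit", lambda: dom, pick_submit)
exhausted = correct_click("#submit", lambda: dom, lambda elements: "#nope")
```

The loop is bounded by max_rounds, so a persistently wrong operator ends in escalation rather than an infinite retry.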

Incorrect (typical agent) behavior

  • Retry same prompt with more context
  • Add chain-of-thought
  • Produce new wrong selectors
  • Loop infinitely

5. Real Example: Document Extraction Failure Loop

Situation

Model extracted:

start_date: "Monday"

Verifier rejects (wrong format).

Correct failure path

1. Re-extract the relevant line
2. Rewrite with format constraints
3. Ask operator for only the date token
4. Re-verify
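
The steps above can be sketched as follows, assuming the expected format is an ISO calendar date (YYYY-MM-DD), which the example does not state explicitly. A regular expression stands in for the constrained operator call in step 3, and all function names are hypothetical.

```python
import re
from datetime import date
from typing import Optional

ISO = re.compile(r"\d{4}-\d{2}-\d{2}")


def verify_date(value: str) -> bool:
    """Accept only strings that are well-formed, real calendar dates."""
    if not ISO.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def relevant_line(document: str, field_name: str) -> Optional[str]:
    """Step 1: re-extract only the line that carries the field."""
    prefix = field_name.lower() + ":"
    for line in document.splitlines():
        if line.strip().lower().startswith(prefix):
            return line
    return None


def correct_start_date(document: str) -> Optional[str]:
    line = relevant_line(document, "start date")
    if line is None:
        return None
    # Steps 2 and 3: a format-constrained request for the date token only
    # (a regex here; a constrained operator call in a real system).
    token = ISO.search(line)
    # Step 4: re-verify before anything is written to state.
    if token and verify_date(token.group(0)):
        return token.group(0)
    return None


doc = "Employee: Ada\nStart date: Monday, 2024-03-04\nSalary: confidential"
start_date = correct_start_date(doc)
```

State is only updated from the function's return value, so an unverified token such as "Monday" never reaches it.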

Why this works

The failure never corrupts state.
Given the same input, the correction loop produces the same result every time.

6. Real Example: Compliance / Payroll Workflow Failure

Situation

LLM proposes a compliance step that violates jurisdiction rules.

Correct failure path

1. Reject step
2. Re-align employee data with jurisdiction schema
3. Rebuild required step set
4. Re-run operator only for missing subset

No large regeneration. No drift.
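
A sketch of this path, using invented rule tables for two placeholder jurisdictions "A" and "B". The step names, schema fields, and function names are all hypothetical; real tables would come from the jurisdiction schema.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hypothetical rule tables, for illustration only.
REQUIRED_STEPS: Dict[str, FrozenSet[str]] = {
    "A": frozenset({"verify_identity", "register_tax_id"}),
    "B": frozenset({"verify_identity", "register_tax_id", "enroll_pension"}),
}
SCHEMA_FIELDS: Dict[str, Tuple[str, ...]] = {
    "A": ("name", "tax_id"),
    "B": ("name", "tax_id", "pension_plan"),
}


def realign(employee: dict, jurisdiction: str) -> dict:
    """Step 2: keep only the schema fields, filling absent ones with None."""
    return {key: employee.get(key) for key in SCHEMA_FIELDS[jurisdiction]}


def correct_plan(plan: List[str], jurisdiction: str) -> Tuple[List[str], List[str]]:
    required = REQUIRED_STEPS[jurisdiction]
    kept = [step for step in plan if step in required]  # step 1: reject invalid steps
    missing = sorted(required - set(kept))              # step 3: rebuild the required set
    return kept, missing                                # step 4: operator runs on `missing` only


employee = realign({"name": "Ada", "pension_plan": "P1", "nickname": "A"}, "A")
kept, missing = correct_plan(["verify_identity", "enroll_pension"], "A")
```

Only the missing subset is handed back to the operator; the steps that already verified are kept as they are.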

7. Why deterministic correction stabilizes multi-step systems

1. Localizes failure

A single invalid step doesn't poison the entire workflow.

2. Makes retries meaningful

Retries operate on a corrected domain, not the same error loop.

3. Limits entropy growth

Corrections add no extra context or narrative, so the operator input does not grow with each retry.

4. Maintains global consistency

State transitions remain valid.

5. Enables safe operator composition

Downstream operators receive verified, corrected states.

8. Correction Loops in Graph-Based Workflows

In DAG workflows, failure recovery depends on graph structure.

If operator oᵢ fails:

  • oᵢ₊₁ is skipped,
  • oᵢ feeds into a correction node,
  • the graph resumes at a safe merge point.

This is essential for:

  • multi-form onboarding,
  • jurisdiction-dependent flows,
  • payroll cycles with conditional rules.
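
The skip, correct, and merge behaviour can be sketched with a small runner over a topologically ordered main path. The run_graph function and the node names are hypothetical; a node signals failure by returning None.

```python
from typing import Callable, Dict, List, Optional, Tuple

Node = Callable[[dict], Optional[dict]]  # returns the new state, or None on failure


def run_graph(
    order: List[str],                        # topological order of the main path
    nodes: Dict[str, Node],
    correction: Dict[str, Tuple[str, str]],  # failed node -> (correction node, merge point)
    state: dict,
) -> Tuple[dict, List[str]]:
    visited: List[str] = []
    i = 0
    while i < len(order):
        name = order[i]
        visited.append(name)
        result = nodes[name](state)
        if result is not None:
            state, i = result, i + 1
            continue
        if name not in correction:
            raise RuntimeError(f"{name} failed and has no correction node")
        fix, merge = correction[name]
        merge_at = order.index(merge)
        if merge_at <= i:
            raise ValueError("merge point must come after the failed node")
        visited.append(fix)
        fixed = nodes[fix](state)
        if fixed is None:
            raise RuntimeError(f"correction node {fix} failed")
        # Resume at the merge point; nodes between i and merge_at are skipped.
        state, i = fixed, merge_at
    return state, visited


nodes: Dict[str, Node] = {
    "collect": lambda s: {**s, "collected": True},
    "tax_form": lambda s: None,                       # simulated failure of o_i
    "tax_review": lambda s: {**s, "reviewed": True},  # o_{i+1}, skipped on failure
    "manual_tax": lambda s: {**s, "tax": "manual"},   # correction node
    "summary": lambda s: {**s, "done": True},         # safe merge point
}
final, path = run_graph(
    ["collect", "tax_form", "tax_review", "summary"],
    nodes,
    {"tax_form": ("manual_tax", "summary")},
    {},
)
```

Requiring the merge point to lie after the failed node keeps the runner acyclic, so a correction can never send the workflow backwards into a loop.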

9. Designing correction loops for large-scale automation

Correction loops must be:

  • deterministic,
  • stateless or minimally stateful,
  • domain-specific,
  • schema-aware,
  • verifiable.

Most LLM agents fail because their failure behavior is:

  • non-deterministic,
  • free-form,
  • prompt-influenced,
  • context-influenced.

Deterministic correction loops eliminate this chaos.

10. Research Directions

  • automated classification of failure categories,
  • architecture for failure-dependent branching graphs,
  • domain-driven fallback operators,
  • formal stability proofs for correction systems,
  • error-bound metrics under repeated failure loops.

Failure Paths & Correction Loops are the safety layer of automation — the reason multi-step systems stay stable instead of collapsing.
