Agents fail not because models are weak, but because the input domain is wrong.
Domain extractors and problem rewriting are the machinery that converts a messy real-world situation into a solvable representation: one that places the model in the correct manifold region and eliminates ambiguity before any operator is applied.
This section formalizes the design of domain extractors, the theory behind representation rewriting, and includes concrete examples that show exactly how this works in production.
1. Why domain extraction is necessary
Real-world inputs contain:
- irrelevant data,
- inconsistent structures,
- mixed modalities,
- conflicting cues,
- high entropy,
- ambiguous semantics.
LLMs cannot reliably disentangle this on their own. They treat the entire input as one collapsed context.
Domain extractors isolate only what matters.
They reduce the problem to a tractable slice with minimal ambiguity.
2. Formal definition
Define the system state:
xₜ ∈ S
A domain extractor is a projection:
zₜ = Pₛ(xₜ)
Where zₜ is the canonical form used by the operator.
Properties:
- idempotent (extracting again does not change the result),
- structure-preserving,
- domain-purifying,
- entropy-reducing,
- invariant-respecting.
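These properties can be made concrete in a few lines. The sketch below is illustrative only: the function name `extract_payment_fields`, the dict-based system state, and the field names are assumptions, not a fixed API. It shows a projection Pₛ that keeps one domain slice and is idempotent by construction.

```python
def extract_payment_fields(state: dict) -> dict:
    """P_s: project the raw system state onto the payment-domain slice."""
    RELEVANT = {"invoice_id", "amount", "currency", "due_date"}
    return {k: v for k, v in state.items() if k in RELEVANT}

raw_state = {
    "invoice_id": "INV-7",
    "amount": 120.0,
    "currency": "USD",
    "due_date": "2025-01-31",
    "ui_theme": "dark",          # irrelevant attribute, dropped
    "session_token": "abc123",   # noise, dropped
}

z = extract_payment_fields(raw_state)

# Idempotence: extracting again does not change the result.
assert extract_payment_fields(z) == z
```

Because the projection only filters, it is automatically idempotent and structure-preserving; the entropy reduction comes from everything it refuses to pass through.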
3. Problem rewriting
After extraction, the canonical form may still not match a representation the model can solve reliably.
Problem rewriting transforms the slice into a representation that aligns with a stable manifold region:
rₜ = R(zₜ)
Where R:
- normalizes structure,
- enforces schemas,
- removes ambiguity,
- simplifies the objective,
- clarifies constraints.
This step converts an otherwise impossible prompt into a solvable one.
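A minimal sketch of a rewriter R, continuing the payment-slice example above. The target schema, unit normalization, and field names are assumptions chosen for illustration:

```python
def rewrite_to_schema(z: dict) -> dict:
    """R: normalize structure and enforce a fixed output schema."""
    return {
        "id": str(z["invoice_id"]),
        "amount_cents": int(round(z["amount"] * 100)),  # normalize units
        "currency": z["currency"].upper(),              # canonical casing
        "due": z["due_date"],
    }

z = {"invoice_id": "INV-7", "amount": 120.0,
     "currency": "usd", "due_date": "2025-01-31"}
r = rewrite_to_schema(z)
# r == {"id": "INV-7", "amount_cents": 12000,
#       "currency": "USD", "due": "2025-01-31"}
```

The operator then receives rₜ, never zₜ: every field is present, typed, and in one canonical form, so there is nothing left to disambiguate.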
4. Example: Browser automation (the most intuitive case)
Raw input
A full DOM tree with:
- hidden nodes,
- inconsistent structure,
- irrelevant sections,
- noise from scripts.
Domain extractor
zₜ = visible_interaction_region(DOM)
Examples:
- the form currently displayed,
- the active modal,
- the table row that changed.
Rewriter
rₜ = canonical_DOM_structure(zₜ)
This may:
- flatten tables into lists,
- normalize forms,
- remove dynamic attributes,
- simplify selectors,
- preserve only actionable elements.
Only then is the operator called:
actions = f(rₜ)
This eliminates the bulk of browser-agent instability.
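The browser pipeline above can be sketched over a toy dict-based DOM. Everything here is an assumption for illustration (the node shape, the `hidden` flag, the set of actionable tags); a real implementation would sit on top of an actual DOM or accessibility tree:

```python
ACTIONABLE = {"input", "button", "select", "a"}

def visible_interaction_region(node: dict) -> list:
    """Extractor: collect visible, actionable nodes from the DOM tree."""
    if node.get("hidden"):
        return []
    found = [node] if node["tag"] in ACTIONABLE else []
    for child in node.get("children", []):
        found.extend(visible_interaction_region(child))
    return found

def canonical_dom_structure(nodes: list) -> list:
    """Rewriter: keep only stable, actionable attributes per element."""
    return [{"tag": n["tag"],
             "label": n.get("label", ""),
             "selector": n.get("id", "")} for n in nodes]

dom = {"tag": "body", "children": [
    {"tag": "script", "hidden": True, "children": []},   # noise, pruned
    {"tag": "form", "children": [
        {"tag": "input", "id": "email", "label": "Email"},
        {"tag": "button", "id": "submit", "label": "Sign up"},
    ]},
]}

r = canonical_dom_structure(visible_interaction_region(dom))
# Only the two actionable elements survive, with stable selectors.
```

The operator sees two elements instead of a full DOM tree; dynamic attributes, scripts, and hidden nodes never reach it.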
5. Example: Document extraction
Raw input
A 50-page PDF with layout noise, mixed styles, and irrelevant sections.
Domain extractor
zₜ = extract_section(document, target_field)
Examples:
- only the "Payment Information" table,
- only the "Employment Start Date" line,
- only the W-4 box the workflow needs.
Rewriter
rₜ = normalize_table(zₜ)
Or:
rₜ = canonical_text_form(zₜ)
Now the operator extracts fields reliably.
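As a text-only sketch of `extract_section` and `canonical_text_form`: a real pipeline would use a PDF or layout parser, so the regex, the sample document, and the output keys below are assumptions made purely to show the shape of the two stages.

```python
import re

def extract_section(document: str, target_field: str) -> str:
    """Extractor: isolate the single line carrying the target field."""
    match = re.search(rf"{re.escape(target_field)}\s*:\s*(.+)", document)
    return match.group(0) if match else ""

def canonical_text_form(z: str) -> dict:
    """Rewriter: normalize the slice into a key-value record."""
    key, _, value = z.partition(":")
    return {key.strip().lower().replace(" ", "_"): value.strip()}

doc = """Employee Handbook ... pages of layout noise ...
Employment Start Date: 2023-04-01
... more irrelevant sections ..."""

r = canonical_text_form(extract_section(doc, "Employment Start Date"))
# r == {"employment_start_date": "2023-04-01"}
```

The operator never sees the 50 pages; it sees one normalized record per target field.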
6. Example: Compliance / HR workflows
Raw input
Employee data across:
- multiple systems,
- different formats,
- optional fields,
- irrelevant attributes,
- location-dependent rules.
Domain extractor
zₜ = jurisdiction_relevant_subset(employee_state)
Rewriter
rₜ = schema_align(zₜ, compliance_rule_schema)
Now the operator produces:
steps = f(rₜ)
And each step is correct because the domain is correct.
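The compliance pair can be sketched the same way. This is not a real compliance engine; the schema tuple, field names, and sample employee record are assumptions that only demonstrate the extractor-then-rewriter shape:

```python
COMPLIANCE_RULE_SCHEMA = ("employee_id", "state", "hire_date")

def jurisdiction_relevant_subset(employee_state: dict) -> dict:
    """Extractor: drop attributes the jurisdiction's rules never reference."""
    return {k: v for k, v in employee_state.items()
            if k in COMPLIANCE_RULE_SCHEMA}

def schema_align(z: dict, schema: tuple) -> dict:
    """Rewriter: emit every schema field in a fixed order, gaps made explicit."""
    return {field: z.get(field) for field in schema}

employee = {"employee_id": "E-9", "state": "CA", "hire_date": "2022-06-01",
            "favorite_color": "teal",   # irrelevant attribute
            "slack_handle": "@e9"}      # irrelevant attribute

r = schema_align(jurisdiction_relevant_subset(employee),
                 COMPLIANCE_RULE_SCHEMA)
# r == {"employee_id": "E-9", "state": "CA", "hire_date": "2022-06-01"}
```

Note that `schema_align` emits a `None` for any missing schema field rather than silently omitting it, so the operator can distinguish "absent" from "not asked for".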
7. Why domain extractors stabilize LLMs
Domain extractors:
- contract the manifold region,
- remove high-entropy elements,
- eliminate mixed-domain patterns,
- prevent attractor drift,
- align the input with the model's strongest internal structure.
Rewriting ensures that the operator sees only the representation it can reliably transform.
This is the hidden source of high reliability in well-designed agent systems.
8. Representation rewriting patterns
1. Canonicalization
Convert multiple possible input forms into a single standard form.
- tables → lists
- messy paragraphs → key-value blocks
- raw DOM → interaction graph
2. Constraint encoding
Bake constraints into the representation instead of describing them.
3. Goal specification through structure
Use structure, not text, to express the objective.
4. Noise pruning
Delete everything irrelevant.
5. Semantic flattening
Simplify concepts into machine-stable forms.
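Pattern 1 (canonicalization) can be sketched in a few lines. The two accepted input shapes here, a plain dict and a (headers, row) table slice, are assumptions; the point is that several surface forms collapse to one standard form:

```python
def canonicalize(record) -> dict:
    """Map a dict or a (headers, row) table slice to one canonical dict."""
    if isinstance(record, dict):
        return dict(sorted(record.items()))
    headers, row = record  # assume a (headers, row) pair otherwise
    return dict(sorted(zip(headers, row)))

a = canonicalize({"name": "Ada", "role": "eng"})
b = canonicalize((["role", "name"], ["eng", "Ada"]))
assert a == b  # two surface forms, one canonical representation
```

Downstream prompts and operators only ever need to handle the canonical form, which is what makes the other four patterns cheap to apply on top of it.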
9. Why most agent frameworks fail
They skip this entire step.
They give the model raw:
- HTML,
- user history,
- entire conversations,
- full documents,
- mixed tasks.
The model collapses all of this into one latent state → unstable trajectories → hallucination → failed workflows.
Domain extraction + rewriting solves this.
10. Link to 0-Context Architecture
0-context is essentially domain extraction plus strict rewriting with zero residue.
You isolate:
- one domain,
- one structure,
- one objective,
- one representation.
And you present only that to the operator.
This is why 0-context outperforms long-context systems on real automation tasks.
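The whole 0-context step is just the composition described above. In this sketch, `operator` stands in for the model call, and the lambdas are toy stand-ins for a real extractor and rewriter; all names are illustrative assumptions:

```python
def run_step(x: dict, extract, rewrite, operator):
    """One 0-context step: the operator sees only r_t = R(P_s(x_t))."""
    z = extract(x)      # z_t = P_s(x_t): one domain, nothing else
    r = rewrite(z)      # r_t = R(z_t): one structure, one representation
    return operator(r)  # actions = f(r_t)

actions = run_step(
    {"field": "amount", "value": " 12 ", "debug": True},
    extract=lambda x: {k: x[k] for k in ("field", "value")},
    rewrite=lambda z: {z["field"]: z["value"].strip()},
    operator=lambda r: [("set", k, v) for k, v in r.items()],
)
# actions == [("set", "amount", "12")]
```

The operator's input carries no history, no residue, and no second domain: exactly the isolation the four bullets above describe.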
11. Research directions
- automated canonicalization of enterprise schemas,
- stability analysis under different rewriting strategies,
- manifold-region mapping using domain-extracted embeddings,
- designing robust domain-projection languages,
- cross-domain extraction for multi-operator pipelines.
The extractor–rewriter architecture is the backbone of verifiable agents.