Agentic Automata Learning
Preprint · Under review · 2026

Can LLM Agents Infer World Models?

Evidence from agentic automata learning - a controlled framework for testing whether tool calling LLM agents can uncover hidden environments through interaction. Current agents can sometimes perform nontrivial interactive discovery, but remain far less robust and efficient than classic automata learning algorithms.

1The Hebrew University of Jerusalem  ·  2New York University  ·  3Google Research
Demo An agent infers a hidden DFA by issuing membership and equivalence queries against the oracle, with L* and TTT solving the same instance for reference.
The setup

Try it yourself! Can you discover a hidden DFA through two kinds of questions?

Diagram of agentic automata learning: an agent issues membership and equivalence queries to an environment holding a hidden DFA.
Figure 1 Agentic automata learning. An agent identifies a hidden DFA through repeated interaction. Membership queries ask whether a word is accepted (left). Equivalence queries submit a candidate DFA and ask whether it matches the target; the environment confirms equivalence or returns a counterexample where the two disagree (right).
Interactive challenge The hidden language is over {a,b}. Ask membership queries, then submit a regular expression as an equivalence query.

Ask membership queries

Type a word over a and b. The empty word is written as ε.

Ask a membership query to start collecting examples.
#wordanswer
No membership queries yet.

Submit a regular expression

Use a, b, parentheses, union |, Kleene star *, and plus +. Your regex is converted to a DFA, minimized, and compared with the hidden DFA.

Submit a hypothesis when you think you know the language.
Proposed hypotheses
No counterexamples yet.
TL;DR

We recast active automata learning as an agentic task. An oracle hides a deterministic finite automaton (DFA); a tool calling LLM agent must recover it using only membership queries ("is this word in the language?") and equivalence queries ("is this the DFA?"). Because instances are generated procedurally with a known number of states, the task gives precise control over difficulty, an exact measure of interaction efficiency, and provably optimal classical baselines (L* and TTT) to compare against. Reasoning models can perform nontrivial interactive discovery on small DFAs, but success degrades sharply as the number of states grows, and even their successful runs are slower and less efficient than the classic algorithms.

<25%
Best success rate any model reaches on 9 state automata, where classic learners solve 100%.
+45.8%
Extra queries the strongest model needs over TTT, counting only its successful runs.
0%
Success for nonreasoning models once automata reach 4 or more states.
480
Full evaluation runs across 6 models, totalling roughly $1,200 in compute.
Contributions

Three things this paper adds

01

A controlled framework

We introduce agentic automata learning, a controlled framework for evaluating whether tool calling LLM agents can infer latent structure through interaction.

02

Discovery that degrades with complexity

We show that advanced reasoning models can perform nontrivial interactive discovery, but degrade sharply as automata complexity increases.

03

Trajectory analysis

We analyze interaction trajectories and identify recurring limitations in planning, reasoning, and consistent use of accumulated information.

Abstract

Active learning, as an agent task

We propose agentic automata learning to evaluate the extent to which tool calling LLM agents can uncover hidden environments through interaction. An agent must identify a hidden deterministic finite automaton by querying an oracle with (1) membership queries - does this string belong to the target language? - and (2) equivalence queries - is this the target DFA? This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines from classic automata learning algorithms.

Evaluating state of the art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than nonreasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, current LLM agents can sometimes perform nontrivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.

Findings

Agentic automata learning is challenging even for state of the art models and exposes substantial differences in their ability to learn through interaction.

Two plots: success rate falling as DFA states grow, and tool calls relative to TTT rising with size.
Figure 3. Model performance as the number of states in the minimal DFA increases. The left subplot shows success rate; L* and TTT are omitted there because both achieve 100% success. The right subplot compares successful run query cost relative to TTT: values above zero require more calls and values below zero require fewer.

Figure 3 shows that the performance of all models declines as the number of states in the minimal DFA increases. For automata with 8–9 states, no model exceeds a 25% success rate, whereas the classic algorithms solve all task instances with 100% success. The gap relative to classic algorithms is also reflected in interaction efficiency: in the same 8–9 state range, when considering only successful runs, Gemini 3.1 Pro, the best performing model, uses about 45.8% more tool calls than TTT.

Reasoning mechanisms separate the models

Alongside the overall difficulty of the task, the framework also exposes substantial differences between models, particularly for automata with 4 states or more. A clear gap emerges between models that employ reasoning mechanisms and those that do not: in the 4–5 state range, Gemini 3.1 Pro reaches 85% success, while Gemini 3 Flash, which applies reasoning only in part of the interaction steps, reaches 15% and shows intermediate performance. In contrast, GPT 5.4, Gemini 3.1 Flash Lite, and Llama 3.3 70B, which do not employ reasoning mechanisms, reach 0% success for automata with 4 states or more.

Stacked bars per model showing the share of instances that are successes, reasoning failures, or planning failures.
Figure 4. Error analysis. Each row corresponds to a model, and each stacked bar shows the percentage of task instances classified as successful runs, reasoning failures, or planning failures.

Two failure modes: reasoning or planning

We classify model failures into two mutually exclusive categories: planning failures and reasoning failures. A planning failure occurs when the model fails and none of the classic passive learning algorithms, RPNI, EDSM, or Blue Fringe, can infer the hidden DFA from the observations accumulated by the model; in this case the model did not collect sufficient information during interaction. A reasoning failure occurs when the model fails, but at least one passive learning algorithm can infer the hidden DFA from the accumulated observations: the necessary information was available, but the model failed to infer the correct DFA from it. Figure 4 shows that all evaluated models suffer from both reasoning and planning failures, although stronger models have proportionally fewer planning failures than the weaker models, where planning failures are dominant.

Noninformative query rate climbing as interaction step increases, for every model.
Figure 5. Noninformative query rate by interaction step. For each model and tool call number, the value reports the percentage of queries at that step that are noninformative across all task instances.

Noninformative queries grow as interaction progresses

We measure how consistently LLMs use the information accumulated during interaction through the notion of noninformative queries: tool calls whose answers can be implied by previous oracle responses and therefore cannot provide new evidence. Figure 5 shows that noninformative queries become increasingly common as interactions grow longer. After roughly 60 steps, even the most consistent model, DeepSeek V4 Pro, issues noninformative queries about 20% of the time. This pattern appears across all evaluated LLMs, whereas classic active learning algorithms maintain 0% noninformative queries by construction. These results suggest that as evidence accumulates in context, LLM agents increasingly struggle to organize and use it consistently.

LLMs do not seem to follow classical algorithms

A natural concern is that models may have been pretrained on classical formulations of the problem, as well as on standard algorithms such as L* and TTT, so that strong performance might reflect memorization or imitation rather than the ability to solve the task interactively. We alleviate this concern in three ways. First, since classic algorithms are deterministic, exact imitation can be checked by comparing the sequence of queries up to recovery; we find no task instance in which a model produces the same query sequence as a classical algorithm, and in some cases the model recovers the target DFA using fewer queries than the classic algorithms. Second, the best performing model, Gemini 3.1 Pro, overuses equivalence queries compared to classic algorithms, in which the number of equivalence queries is bounded by the number of states in the hidden DFA; Gemini 3.1 Pro exceeds this bound in 92.5% of the task instances. Third, its behavior violates a monotonicity property satisfied by the classic algorithms: monotonicity is preserved if for every hypothesis tested by an equivalence query the number of states is larger than in all previously tested hypotheses, and Gemini 3.1 Pro produced a nonmonotonic hypothesis sequence in all interactions.

Step Gemini 3.1 Pro Llama 3.3 70B L* TTT
Tool CallResp. Tool CallResp. Tool CallResp. Tool CallResp.
1EQ(∅)bbaMQ(ε)0MQ(b)1MQ(ε)0
2EQ(Σ*)aaaMQ(a)0MQ(a)0EQ(∅)bba
3EQ(b ∈ w)MQ(b)1MQ(ε)0MQ(bba)1
4--MQ(bb)1MQ(bb)1MQ(ba)1
5--MQ(ab)1MQ(ba)1MQ(a)0
6--MQ(ba)1EQ(b ∈ w)MQ(b)1
7--EQ(¬ε)aa--MQ(aa)0
8--MQ(aa)0--MQ(baa)1
9--EQ(b ∈ w)--EQ(b ∈ w)
Table 1. Example interactions between each learner and the oracle (Resp. column), for the hidden DFA accepting all words containing at least one b; - indicates the learner has already solved the problem. Gemini 3.1 Pro solves the task using only equivalence queries, whereas Llama 3.3 70B, L*, and TTT use both membership and equivalence queries. In EQ notation, ∅ denotes the empty language, Σ* denotes all words, ¬ε denotes all nonempty words, and b ∈ w denotes the language of words containing b.
Try it live

Probe a hidden DFA yourself

The interactive demo lets you step through the same loop the agents face: issue membership and equivalence queries against a hidden DFA and watch a hypothesis take shape, query by query.

Open the interactive demo ↗
Citation

Cite this work

@misc{menaged2026agentic,
  title   = {Can LLM Agents Infer World Models?
             Evidence from Agentic Automata Learning},
  author  = {Menaged, Reef and Lior, Gili and Ravfogel, Shauli
             and Aharoni, Roee and Stanovsky, Gabriel},
  year    = {2026},
  journal = {arXiv preprint}
}