Evidence from agentic automata learning - a controlled framework for testing whether tool calling LLM agents can uncover hidden environments through interaction. Current agents can sometimes perform nontrivial interactive discovery, but remain far less robust and efficient than classic automata learning algorithms.
{a,b}. Ask membership queries, then submit a regular expression as an equivalence query.
Type a word over a and b. The empty word is written as ε.
| # | word | answer |
|---|---|---|
| No membership queries yet. | ||
Use a, b, parentheses, union |, Kleene star *, and plus +. Your regex is converted to a DFA, minimized, and compared with the hidden DFA.
We recast active automata learning as an agentic task. An oracle hides a deterministic finite automaton (DFA); a tool calling LLM agent must recover it using only membership queries ("is this word in the language?") and equivalence queries ("is this the DFA?"). Because instances are generated procedurally with a known number of states, the task gives precise control over difficulty, an exact measure of interaction efficiency, and provably optimal classical baselines (L* and TTT) to compare against. Reasoning models can perform nontrivial interactive discovery on small DFAs, but success degrades sharply as the number of states grows, and even their successful runs are slower and less efficient than the classic algorithms.
We introduce agentic automata learning, a controlled framework for evaluating whether tool calling LLM agents can infer latent structure through interaction.
We show that advanced reasoning models can perform nontrivial interactive discovery, but degrade sharply as automata complexity increases.
We analyze interaction trajectories and identify recurring limitations in planning, reasoning, and consistent use of accumulated information.
We propose agentic automata learning to evaluate the extent to which tool calling LLM agents can uncover hidden environments through interaction. An agent must identify a hidden deterministic finite automaton by querying an oracle with (1) membership queries - does this string belong to the target language? - and (2) equivalence queries - is this the target DFA? This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines from classic automata learning algorithms.
Evaluating state of the art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than nonreasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, current LLM agents can sometimes perform nontrivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.
Figure 3 shows that the performance of all models declines as the number of states in the minimal DFA increases. For automata with 8–9 states, no model exceeds a 25% success rate, whereas the classic algorithms solve all task instances with 100% success. The gap relative to classic algorithms is also reflected in interaction efficiency: in the same 8–9 state range, when considering only successful runs, Gemini 3.1 Pro, the best performing model, uses about 45.8% more tool calls than TTT.
Alongside the overall difficulty of the task, the framework also exposes substantial differences between models, particularly for automata with 4 states or more. A clear gap emerges between models that employ reasoning mechanisms and those that do not: in the 4–5 state range, Gemini 3.1 Pro reaches 85% success, while Gemini 3 Flash, which applies reasoning only in part of the interaction steps, reaches 15% and shows intermediate performance. In contrast, GPT 5.4, Gemini 3.1 Flash Lite, and Llama 3.3 70B, which do not employ reasoning mechanisms, reach 0% success for automata with 4 states or more.
We classify model failures into two mutually exclusive categories: planning failures and reasoning failures. A planning failure occurs when the model fails and none of the classic passive learning algorithms, RPNI, EDSM, or Blue Fringe, can infer the hidden DFA from the observations accumulated by the model; in this case the model did not collect sufficient information during interaction. A reasoning failure occurs when the model fails, but at least one passive learning algorithm can infer the hidden DFA from the accumulated observations: the necessary information was available, but the model failed to infer the correct DFA from it. Figure 4 shows that all evaluated models suffer from both reasoning and planning failures, although stronger models have proportionally fewer planning failures than the weaker models, where planning failures are dominant.
We measure how consistently LLMs use the information accumulated during interaction through the notion of noninformative queries: tool calls whose answers can be implied by previous oracle responses and therefore cannot provide new evidence. Figure 5 shows that noninformative queries become increasingly common as interactions grow longer. After roughly 60 steps, even the most consistent model, DeepSeek V4 Pro, issues noninformative queries about 20% of the time. This pattern appears across all evaluated LLMs, whereas classic active learning algorithms maintain 0% noninformative queries by construction. These results suggest that as evidence accumulates in context, LLM agents increasingly struggle to organize and use it consistently.
A natural concern is that models may have been pretrained on classical formulations of the problem, as well as on standard algorithms such as L* and TTT, so that strong performance might reflect memorization or imitation rather than the ability to solve the task interactively. We alleviate this concern in three ways. First, since classic algorithms are deterministic, exact imitation can be checked by comparing the sequence of queries up to recovery; we find no task instance in which a model produces the same query sequence as a classical algorithm, and in some cases the model recovers the target DFA using fewer queries than the classic algorithms. Second, the best performing model, Gemini 3.1 Pro, overuses equivalence queries compared to classic algorithms, in which the number of equivalence queries is bounded by the number of states in the hidden DFA; Gemini 3.1 Pro exceeds this bound in 92.5% of the task instances. Third, its behavior violates a monotonicity property satisfied by the classic algorithms: monotonicity is preserved if for every hypothesis tested by an equivalence query the number of states is larger than in all previously tested hypotheses, and Gemini 3.1 Pro produced a nonmonotonic hypothesis sequence in all interactions.
| Step | Gemini 3.1 Pro | Llama 3.3 70B | L* | TTT | ||||
|---|---|---|---|---|---|---|---|---|
| Tool Call | Resp. | Tool Call | Resp. | Tool Call | Resp. | Tool Call | Resp. | |
| 1 | EQ(∅) | bba | MQ(ε) | 0 | MQ(b) | 1 | MQ(ε) | 0 |
| 2 | EQ(Σ*) | aaa | MQ(a) | 0 | MQ(a) | 0 | EQ(∅) | bba |
| 3 | EQ(b ∈ w) | ⊤ | MQ(b) | 1 | MQ(ε) | 0 | MQ(bba) | 1 |
| 4 | - | - | MQ(bb) | 1 | MQ(bb) | 1 | MQ(ba) | 1 |
| 5 | - | - | MQ(ab) | 1 | MQ(ba) | 1 | MQ(a) | 0 |
| 6 | - | - | MQ(ba) | 1 | EQ(b ∈ w) | ⊤ | MQ(b) | 1 |
| 7 | - | - | EQ(¬ε) | aa | - | - | MQ(aa) | 0 |
| 8 | - | - | MQ(aa) | 0 | - | - | MQ(baa) | 1 |
| 9 | - | - | EQ(b ∈ w) | ⊤ | - | - | EQ(b ∈ w) | ⊤ |
b; - indicates the learner has already solved the problem. Gemini 3.1 Pro solves the task using only equivalence queries, whereas Llama 3.3 70B, L*, and TTT use both membership and equivalence queries. In EQ notation, ∅ denotes the empty language, Σ* denotes all words, ¬ε denotes all nonempty words, and b ∈ w denotes the language of words containing b.The interactive demo lets you step through the same loop the agents face: issue membership and equivalence queries against a hidden DFA and watch a hypothesis take shape, query by query.
Open the interactive demo ↗@misc{menaged2026agentic,
title = {Can LLM Agents Infer World Models?
Evidence from Agentic Automata Learning},
author = {Menaged, Reef and Lior, Gili and Ravfogel, Shauli
and Aharoni, Roee and Stanovsky, Gabriel},
year = {2026},
journal = {arXiv preprint}
}