Can LLM Agents Infer World Models? Evidence from Agentic Automata Learning

The setup

Try it yourself! Can you discover a hidden DFA through two kinds of questions?

Diagram of agentic automata learning: an agent issues membership and equivalence queries to an environment holding a hidden DFA.

Figure 1 Agentic automata learning. An agent identifies a hidden DFA through repeated interaction. Membership queries ask whether a word is accepted (left). Equivalence queries submit a candidate DFA and ask whether it matches the target; the environment confirms equivalence or returns a counterexample where the two disagree (right).

Interactive challenge The hidden language is over {a,b}. Ask membership queries, then submit a regular expression as an equivalence query.

Ask membership queries

Type a word over a and b. The empty word is written as ε.

Ask a membership query to start collecting examples.

#	word	answer
No membership queries yet.

Submit a regular expression

Use a, b, parentheses, union |, Kleene star *, and plus +. Your regex is converted to a DFA, minimized, and compared with the hidden DFA.

Submit a hypothesis when you think you know the language.

→ Hidden regex: a(a|b)*b

Proposed hypotheses

No counterexamples yet.

Abstract

Active learning, as an agent task

We propose agentic automata learning to evaluate the extent to which tool calling LLM agents can uncover hidden environments through interaction. An agent must identify a hidden deterministic finite automaton by querying an oracle with (1) membership queries - does this string belong to the target language? - and (2) equivalence queries - is this the target DFA? This yields a scalable testbed with controlled task complexity, measurable interaction efficiency, and strong baselines from classic automata learning algorithms.

Evaluating state of the art LLMs, we find that performance drops sharply as DFA size increases. Reasoning models are markedly stronger than nonreasoning models, yet trajectory analyses reveal recurring failures in query planning, evidence integration, and hypothesis construction. Overall, current LLM agents can sometimes perform nontrivial interactive discovery, but remain far less robust and efficient than classic algorithms for the task.

Findings

Agentic automata learning is challenging even for state of the art models and exposes substantial differences in their ability to learn through interaction.

Two plots: success rate falling as DFA states grow, and tool calls relative to TTT rising with size.

Figure 3. Model performance as the number of states in the minimal DFA increases. The left subplot shows success rate; L^* and TTT are omitted there because both achieve 100% success. The right subplot compares successful run query cost relative to TTT: values above zero require more calls and values below zero require fewer.

Figure 3 shows that the performance of all models declines as the number of states in the minimal DFA increases. For automata with 8–9 states, no model exceeds a 25% success rate, whereas the classic algorithms solve all task instances with 100% success. The gap relative to classic algorithms is also reflected in interaction efficiency: in the same 8–9 state range, when considering only successful runs, Gemini 3.1 Pro, the best performing model, uses about 45.8% more tool calls than TTT.

Reasoning mechanisms separate the models

Alongside the overall difficulty of the task, the framework also exposes substantial differences between models, particularly for automata with 4 states or more. A clear gap emerges between models that employ reasoning mechanisms and those that do not: in the 4–5 state range, Gemini 3.1 Pro reaches 85% success, while Gemini 3 Flash, which applies reasoning only in part of the interaction steps, reaches 15% and shows intermediate performance. In contrast, GPT 5.4, Gemini 3.1 Flash Lite, and Llama 3.3 70B, which do not employ reasoning mechanisms, reach 0% success for automata with 4 states or more.

Stacked bars per model showing the share of instances that are successes, reasoning failures, or planning failures.

Figure 4. Error analysis. Each row corresponds to a model, and each stacked bar shows the percentage of task instances classified as successful runs, reasoning failures, or planning failures.

Two failure modes: reasoning or planning

We classify model failures into two mutually exclusive categories: planning failures and reasoning failures. A planning failure occurs when the model fails and none of the classic passive learning algorithms, RPNI, EDSM, or Blue Fringe, can infer the hidden DFA from the observations accumulated by the model; in this case the model did not collect sufficient information during interaction. A reasoning failure occurs when the model fails, but at least one passive learning algorithm can infer the hidden DFA from the accumulated observations: the necessary information was available, but the model failed to infer the correct DFA from it. Figure 4 shows that all evaluated models suffer from both reasoning and planning failures, although stronger models have proportionally fewer planning failures than the weaker models, where planning failures are dominant.

Noninformative query rate climbing as interaction step increases, for every model.

Figure 5. Noninformative query rate by interaction step. For each model and tool call number, the value reports the percentage of queries at that step that are noninformative across all task instances.

Noninformative queries grow as interaction progresses

We measure how consistently LLMs use the information accumulated during interaction through the notion of noninformative queries: tool calls whose answers can be implied by previous oracle responses and therefore cannot provide new evidence. Figure 5 shows that noninformative queries become increasingly common as interactions grow longer. After roughly 60 steps, even the most consistent model, DeepSeek V4 Pro, issues noninformative queries about 20% of the time. This pattern appears across all evaluated LLMs, whereas classic active learning algorithms maintain 0% noninformative queries by construction. These results suggest that as evidence accumulates in context, LLM agents increasingly struggle to organize and use it consistently.

LLMs do not seem to follow classical algorithms

A natural concern is that models may have been pretrained on classical formulations of the problem, as well as on standard algorithms such as L^* and TTT, so that strong performance might reflect memorization or imitation rather than the ability to solve the task interactively. We alleviate this concern in three ways. First, since classic algorithms are deterministic, exact imitation can be checked by comparing the sequence of queries up to recovery; we find no task instance in which a model produces the same query sequence as a classical algorithm, and in some cases the model recovers the target DFA using fewer queries than the classic algorithms. Second, the best performing model, Gemini 3.1 Pro, overuses equivalence queries compared to classic algorithms, in which the number of equivalence queries is bounded by the number of states in the hidden DFA; Gemini 3.1 Pro exceeds this bound in 92.5% of the task instances. Third, its behavior violates a monotonicity property satisfied by the classic algorithms: monotonicity is preserved if for every hypothesis tested by an equivalence query the number of states is larger than in all previously tested hypotheses, and Gemini 3.1 Pro produced a nonmonotonic hypothesis sequence in all interactions.

Step	Gemini 3.1 Pro		Llama 3.3 70B		L^*		TTT
Step	Tool Call	Resp.	Tool Call	Resp.	Tool Call	Resp.	Tool Call	Resp.
1	EQ(∅)	bba	MQ(ε)	0	MQ(b)	1	MQ(ε)	0
2	EQ(Σ*)	aaa	MQ(a)	0	MQ(a)	0	EQ(∅)	bba
3	EQ(b ∈ w)	⊤	MQ(b)	1	MQ(ε)	0	MQ(bba)	1
4	-	-	MQ(bb)	1	MQ(bb)	1	MQ(ba)	1
5	-	-	MQ(ab)	1	MQ(ba)	1	MQ(a)	0
6	-	-	MQ(ba)	1	EQ(b ∈ w)	⊤	MQ(b)	1
7	-	-	EQ(¬ε)	aa	-	-	MQ(aa)	0
8	-	-	MQ(aa)	0	-	-	MQ(baa)	1
9	-	-	EQ(b ∈ w)	⊤	-	-	EQ(b ∈ w)	⊤

Table 1. Example interactions between each learner and the oracle (Resp. column), for the hidden DFA accepting all words containing at least one b; - indicates the learner has already solved the problem. Gemini 3.1 Pro solves the task using only equivalence queries, whereas Llama 3.3 70B, L^*, and TTT use both membership and equivalence queries. In EQ notation, ∅ denotes the empty language, Σ* denotes all words, ¬ε denotes all nonempty words, and b ∈ w denotes the language of words containing b.

Can LLM Agents Infer World Models?

Try it yourself! Can you discover a hidden DFA through two kinds of questions?

Ask membership queries

Submit a regular expression

Three things this paper adds

A controlled framework

Discovery that degrades with complexity

Trajectory analysis

Active learning, as an agent task

Agentic automata learning is challenging even for state of the art models and exposes substantial differences in their ability to learn through interaction.

Reasoning mechanisms separate the models

Two failure modes: reasoning or planning

Noninformative queries grow as interaction progresses

LLMs do not seem to follow classical algorithms

Probe a hidden DFA yourself

Cite this work