MIT Researchers Improve AI Questioning via Battleship Game

• MIT researchers used 'Collaborative Battleship' to improve how AI agents formulate natural language questions.
• Llama 4 Scout achieved an 82 percent win rate using Monte Carlo inference at 1 percent of GPT-5 costs.
• Converting questions into Python code boosted model accuracy by 15 percent on average in verification tasks.

MIT CSAIL and Harvard researchers developed a method to improve the information-seeking capabilities of language models by using 'Collaborative Battleship' as a testing ground. The team collected questions and answers from 40 human participants to create the 'BattleshipQA' dataset, which served as a benchmark for evaluating model performance. While frontier models like GPT-5 could already outperform humans by completing games in fewer turns, smaller models often struggled with rational questioning strategies. To address this, researchers implemented a Monte Carlo inference strategy (a technique for estimating probabilities by simulating random outcomes), allowing models to weigh the likelihood of different game states based on responses from a teammate. This approach significantly increased the win rate of the smaller Llama 4 Scout model from 8 percent to 82 percent. At this performance level, Llama 4 Scout outperformed GPT-5 in efficiency while operating at approximately 1 percent of the cost.
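To make the Monte Carlo idea concrete, here is a minimal sketch that assumes a simple rejection-sampling scheme: draw random boards, keep those that agree with everything observed so far (hits, misses, and a teammate's yes/no answers), and count how often each cell holds a ship. The board size, ship lengths, and every function name are illustrative assumptions, not the researchers' implementation.

```python
import random

def sample_board(size, ship_lengths, rng):
    """Place ships at random without overlap; return the set of occupied (row, col) cells."""
    occupied = set()
    for length in ship_lengths:
        while True:
            if rng.random() < 0.5:  # horizontal placement
                r, c = rng.randrange(size), rng.randrange(size - length + 1)
                cells = {(r, c + i) for i in range(length)}
            else:  # vertical placement
                r, c = rng.randrange(size - length + 1), rng.randrange(size)
                cells = {(r + i, c) for i in range(length)}
            if not cells & occupied:
                occupied |= cells
                break
    return occupied

def is_consistent(board, hits, misses, answers):
    """A sampled board survives only if it agrees with every observation."""
    if not hits <= board or misses & board:
        return False
    return all(question(board) == reply for question, reply in answers)

def estimate_hit_probabilities(size, ship_lengths, hits, misses,
                               answers=(), n_samples=2000, seed=0):
    """Estimate P(cell holds a ship) from the sampled boards that survive."""
    rng = random.Random(seed)
    counts, kept = {}, 0
    for _ in range(n_samples):
        board = sample_board(size, ship_lengths, rng)
        if not is_consistent(board, hits, misses, answers):
            continue
        kept += 1
        for cell in board:
            counts[cell] = counts.get(cell, 0) + 1
    if kept == 0:
        return {}  # no sample matched; more samples or a smarter sampler needed
    return {cell: n / kept for cell, n in counts.items()}

# Hypothetical teammate reply: "Is any ship in the top row?" answered "no".
top_row_question = lambda board: any(r == 0 for r, _ in board)
probs = estimate_hit_probabilities(4, [2, 3], hits={(2, 1)}, misses={(3, 3)},
                                   answers=[(top_row_question, False)])
best_cell = max((c for c in probs if c != (2, 1)), key=probs.get)
print("Most likely unrevealed ship cell:", best_cell)
```

Rejection sampling is the simplest choice here and wastes samples as constraints accumulate; a weighted or constraint-aware sampler would scale better, but the principle of weighing game states by their agreement with the evidence is the same.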

The team further enhanced model performance by using 'auto-formalization,' a process where language models convert natural language questions into Python code to verify solutions. This conversion enabled models to explicitly search areas of the game board, leading to an average accuracy boost of 15 percent in verifying ship placements. For example, the lightweight GPT-4o-mini model achieved a nearly 30 percent performance increase, while Claude 4 Opus improved by about eight percentage points. These findings suggest that providing AI agents with access to a 'world model' (a system capable of simulating environment dynamics) helps them generate more informative questions and gather data more efficiently.
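As a rough illustration of auto-formalization, the sketch below assumes a model has already translated a natural-language question into a small Python function named `answer`, which is then executed against a board. The board encoding, the `answer` convention, and the hard-coded `GENERATED_SOURCE` string are hypothetical stand-ins for what a model might emit; the article does not describe the actual interface.

```python
# Board encoding (an assumption): "." is water, a letter marks a ship cell.
BOARD = [
    "AA..",
    "....",
    "..B.",
    "..B.",
]

QUESTION = "Is any part of a ship in the top two rows?"

# Hypothetical model output for QUESTION. In a real system this string
# would come from a language model rather than being hard-coded.
GENERATED_SOURCE = '''
def answer(board):
    # True if any cell in rows 0 and 1 is not water
    return any(cell != "." for row in board[:2] for cell in row)
'''

def run_formalized_question(source, board):
    """Execute generated code and return its yes/no answer for one board."""
    namespace = {}
    # Caution: exec runs arbitrary code; sandbox anything a model generates.
    exec(source, namespace)
    result = namespace["answer"](board)
    if not isinstance(result, bool):
        raise TypeError("formalized question must return True or False")
    return result

print(QUESTION, "->", run_formalized_question(GENERATED_SOURCE, BOARD))
```

Because the question is now a program, the same function can be run against many candidate boards, which is what lets an explicit search over board regions replace guesswork.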

The researchers extended their findings to other information-gathering tasks, such as the game 'Guess Who?'. By applying similar inference and coding-based strategies, Llama 4 Scout improved its success rate from 30 percent to over 72 percent, while GPT-4o increased from 62 percent to 90 percent. Despite this progress, expert human players remained difficult to beat, and the systems still struggled to answer highly complex queries. The research team, led by Gabriel Grand and Jacob Andreas, gave an oral presentation of these findings at the International Conference on Learning Representations (ICLR) in April, highlighting potential future applications for AI agents in scientific discovery, such as identifying molecular structures.
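One common way to score how informative a yes/no question is in a game like 'Guess Who?' is expected information gain. The sketch below assumes a uniform belief over the remaining characters and truthful answers, in which case the gain equals the binary entropy of the fraction of characters for whom the answer is "yes". The characters, attributes, and function names are invented for illustration, and the article does not state that the researchers scored questions exactly this way.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a yes/no outcome with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def expected_information_gain(candidates, question):
    """Bits learned on average, assuming a uniform prior and truthful answers."""
    p_yes = sum(1 for c in candidates if question(c)) / len(candidates)
    return binary_entropy(p_yes)

def best_question(candidates, questions):
    """Return the name of the question with the highest expected gain."""
    return max(questions,
               key=lambda name: expected_information_gain(candidates, questions[name]))

# Hypothetical characters and questions.
CHARACTERS = [
    {"name": "Ada", "glasses": True, "hat": False},
    {"name": "Ben", "glasses": True, "hat": True},
    {"name": "Cy", "glasses": False, "hat": False},
    {"name": "Dee", "glasses": False, "hat": False},
]
QUESTIONS = {
    "Do they wear glasses?": lambda c: c["glasses"],
    "Do they wear a hat?": lambda c: c["hat"],
}

print(best_question(CHARACTERS, QUESTIONS))
```

A question that splits the remaining candidates evenly is worth a full bit, while one that almost everyone would answer the same way reveals little, which is why an even split is preferred.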
