Highlights:
- Introduces Multi-Agent Evolve (MAE), a self-improving framework for large language models (LLMs).
- Uses a trio of agents — Proposer, Solver, and Judge — to co-evolve reasoning abilities without human supervision.
- Achieves an average 4.54% performance improvement across reasoning benchmarks with the Qwen2.5-3B-Instruct model.
- Demonstrates scalable and data-efficient reinforcement learning without relying on curated datasets.
TLDR:
Researchers propose Multi-Agent Evolve (MAE), a reinforcement learning framework that allows large language models to autonomously improve reasoning skills through co-evolving agents, significantly reducing dependence on human-annotated datasets.
A new study titled *“Multi-Agent Evolve: LLM Self-Improve through Co-evolution”* by Yixing Chen, Yiding Wang, Siqi Zhu, Haofei Yu, Tao Feng, Muhan Zhan, Mostofa Patwary, and Jiaxuan You introduces an approach to enhancing large language models (LLMs) with reinforcement learning (RL). The research, submitted to ICLR 2026, tackles one of the most persistent challenges in artificial intelligence: enabling models to self-improve without extensive human supervision or domain-specific ground-truth environments.
Traditional reinforcement learning approaches for LLMs depend heavily on curated datasets and human-generated rewards, which limits their scalability and general applicability. Self-play RL methods have made strides in structured environments such as games and code generation, but their feedback depends on verifiable outcomes, a constraint that does not carry over to general reasoning or open-ended language tasks. The Multi-Agent Evolve (MAE) framework addresses this limitation by introducing a self-sustaining ecosystem of three interacting roles instantiated from a single LLM: the Proposer, the Solver, and the Judge. Each role plays a distinct part: the Proposer creates new problems, the Solver attempts solutions, and the Judge evaluates both, driving a continuous loop of learning and refinement, as sketched below.
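The loop can be pictured as a single propose-solve-judge round served by one model. The following is a minimal sketch, not the paper's implementation: the `generate` callable stands in for any text-in/text-out generation interface, and the prompts, 0-10 judging scale, and Proposer reward shaping are illustrative assumptions.

```python
# Minimal sketch of one MAE-style Proposer/Solver/Judge round.
# All prompts and reward details are assumptions made for illustration.
from typing import Callable

Generate = Callable[[str], str]  # any text-in/text-out LLM interface


def propose_task(generate: Generate) -> str:
    """Proposer: create a new, self-contained reasoning problem."""
    return generate(
        "Write one challenging but solvable reasoning problem. "
        "State only the problem, not the answer."
    )


def solve_task(generate: Generate, task: str) -> str:
    """Solver: attempt the problem with step-by-step reasoning."""
    return generate(
        f"Solve the following problem step by step, then state a final answer.\n\n{task}"
    )


def judge(generate: Generate, task: str, solution: str) -> float:
    """Judge: score the solution without a ground-truth answer, rescaled to [0, 1]."""
    verdict = generate(
        "Rate this solution from 0 to 10 for correctness and reasoning quality. "
        f"Reply with a single number.\n\nProblem:\n{task}\n\nSolution:\n{solution}"
    )
    try:
        return max(0.0, min(10.0, float(verdict.strip()))) / 10.0
    except ValueError:
        return 0.0  # unparsable verdicts earn no reward


def mae_round(generate: Generate) -> dict:
    """One self-play round producing rewards for both Proposer and Solver."""
    task = propose_task(generate)
    solution = solve_task(generate, task)
    solver_reward = judge(generate, task, solution)
    # Assumed shaping: the Proposer is rewarded for tasks that are neither
    # trivial nor impossible, i.e. solver rewards near the middle of the range.
    proposer_reward = 1.0 - abs(solver_reward - 0.5) * 2.0
    return {"task": task, "solution": solution,
            "solver_reward": solver_reward, "proposer_reward": proposer_reward}
```

In practice, records produced by such rounds would be accumulated and fed to a policy-gradient optimizer, so the same model improves at proposing, solving, and judging over successive iterations.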
From a technical standpoint, MAE integrates reinforcement learning into this tri-agent architecture, enabling co-evolution where agents learn optimal behaviors collaboratively. The Proposer’s objective is to generate challenging yet solvable tasks, incentivizing the Solver to produce increasingly accurate and logical responses. The Judge agent, acting as the arbiter, assesses reasoning chains and assigns rewards that guide the optimization process. Crucially, this process functions without human-crafted labels or external evaluation environments. Experiments conducted on Qwen2.5-3B-Instruct — a 3-billion-parameter model — demonstrated an average 4.54% improvement across diverse reasoning benchmarks, including mathematical problem-solving and general knowledge tasks. Such results signify that LLMs can enhance core reasoning capacities autonomously, reducing dependency on manual supervision.
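The article does not specify which policy-gradient variant MAE uses, so the snippet below only illustrates, in a generic way, how Judge scores for a group of sampled solutions to the same task could be turned into advantages using a group-mean baseline (a common choice in recent LLM RL, e.g. GRPO-style methods). It is an assumed example, not the paper's algorithm.

```python
# Illustrative only: turning Judge rewards into advantages for a
# policy-gradient update, using a group-mean baseline.
from statistics import mean, pstdev


def advantages_from_judge_scores(scores: list[float]) -> list[float]:
    """Normalize a group of Judge rewards for one task into advantages.

    Solutions scored above the group average receive positive advantages
    (their reasoning patterns are reinforced); below-average ones receive
    negative advantages (discouraged).
    """
    baseline = mean(scores)
    spread = pstdev(scores) or 1.0  # avoid division by zero when all scores match
    return [(s - baseline) / spread for s in scores]


# Example: four sampled solutions to the same proposed task,
# each scored by the Judge on a [0, 1] scale.
judge_scores = [0.9, 0.4, 0.6, 0.1]
print(advantages_from_judge_scores(judge_scores))
# The 0.9-scored solution receives the largest positive advantage.
```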
This development opens a new frontier in AI research by highlighting the scalability and efficiency of multi-agent co-evolution systems. As AI models increasingly approach human-level reasoning capabilities, frameworks like MAE may become foundational to building autonomous, self-correcting AI ecosystems. The study reinforces the potential of reinforcement learning as a robust pathway for self-improving language models across domains — a major step towards truly general intelligence.
Source:
Chen, Yixing; Wang, Yiding; Zhu, Siqi; Yu, Haofei; Feng, Tao; Zhan, Muhan; Patwary, Mostofa; You, Jiaxuan. “Multi-Agent Evolve: LLM Self-Improve through Co-evolution.” arXiv:2510.23595 [cs.AI], October 27, 2025. https://arxiv.org/abs/2510.23595
