AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning

Zhiheng Xi, Jixuan Huang, Chenyang Liao,
Baodai Huang, Honglin Guo, Jiaqi Liu, Rui Zheng, Junjie Ye,
Jiazheng Zhang, Wenxiang Chen, Wei He, Yiwen Ding, Guanyu Li,
Zehui Chen, Zhengyin Du, Xuesong Yao, Yufei Xu, Jiecao Chen,
Tao Gui, Zuxuan Wu, Qi Zhang, Xuanjing Huang, Yu-Gang Jiang

Fudan University & ByteDance Seed & Shanghai Innovation Institute
Correspondence to: zhxi22@m.fudan.edu.cn, {tgui,qz}@fudan.edu.cn

Abstract

Developing autonomous LLM agents capable of making a series of intelligent decisions to solve complex, real-world tasks is a fast-evolving frontier. Like human cognitive development, agents are expected to acquire knowledge and skills through exploration and interaction with the environment. Despite advances, the community still lacks a unified, interactive reinforcement learning (RL) framework that can effectively train such agents from scratch—without relying on supervised fine-tuning (SFT)—across diverse and realistic environments. To bridge this gap, we introduce AgentGym-RL, a new framework for training LLM agents in multi-turn interactive decision-making through RL. The framework features a modular and decoupled architecture, ensuring high flexibility and extensibility. It encompasses a wide variety of real-world scenarios and supports mainstream RL algorithms. Furthermore, we propose ScalingInter-RL, a training approach designed to balance exploration and exploitation and to stabilize RL optimization. In early stages, it emphasizes exploitation by restricting the number of interactions, then gradually shifts towards exploration with larger horizons to encourage diverse problem-solving strategies. In this way, the agent develops more diverse behaviors and is less prone to collapse under long horizons. We perform extensive experiments to validate the stability and effectiveness of both the AgentGym-RL framework and the ScalingInter-RL approach. Our agents match or surpass commercial models on 27 tasks across diverse environments. We offer key insights and will open-source the complete AgentGym-RL framework—including code and datasets—to empower the research community in developing the next generation of intelligent agents.


Figure 1: Overview of the AgentGym-RL framework. It features a decoupled, flexible, and extensible architecture, comprising three primary modules—the environment, the agent, and the training module. It supports diverse scenarios, environments, and algorithms.
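Conceptually, the decoupled design reduces training to a thin rollout loop in which the trainer drives the agent and environment modules through a uniform interface, so either side can be swapped independently. The sketch below is illustrative only: the class and method names (`Environment`, `Agent`, `reset`, `step`, `act`) are hypothetical stand-ins, not the framework's actual API.

```python
# Minimal sketch of a decoupled agent-environment rollout loop.
# All names (env, agent, reset, step, act) are hypothetical stand-ins
# for illustration; they are not AgentGym-RL's actual API.
from dataclasses import dataclass, field


@dataclass
class Trajectory:
    """One multi-turn episode collected for RL training."""
    turns: list = field(default_factory=list)
    reward: float = 0.0


def rollout(env, agent, max_turns: int) -> Trajectory:
    """Collect one episode. The training module only sees this interface,
    so environments and agents can be exchanged without touching the trainer."""
    traj = Trajectory()
    observation = env.reset()
    for _ in range(max_turns):
        action = agent.act(observation)            # LLM proposes the next action
        observation, reward, done = env.step(action)  # environment returns feedback
        traj.turns.append((action, observation, reward))
        traj.reward += reward
        if done:
            break
    return traj
```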

Diverse Scenarios and Environments

To build LLM agents capable of multi-turn sequential decision-making for complex tasks in real-world environments, AgentGym-RL covers a broad spectrum of scenarios that comprehensively evaluate and foster the agent's ability to perceive its environment, plan over long horizons towards a goal, reason in depth to make intelligent decisions, and reflect and self-correct when facing setbacks or making mistakes.

  • Web Navigation: Interacting with dynamic websites for tasks such as booking flights or extracting structured information, which requires agents to follow instructions, interpret textual and visual content, manipulate dynamic interfaces, and plan multi-step actions.
  • Deep Search: Performing multi-step, goal-directed queries with tools like browsers or Python interpreters, demanding strong information-seeking, multi-hop reasoning, long-term memory, and knowledge synthesis across sources.
  • Digital Games: Exploring and solving problems in interactive game-like environments, emphasizing real-time decision-making, strategy development, and adaptability to complex, dynamic settings.
  • Embodied Tasks: Controlling virtual or physical bodies for navigation, manipulation, and task execution, which calls for goal-directed planning, spatial reasoning, and robust perception–action grounding.
  • Scientific Tasks: Conducting experiments and solving problems in physically grounded, knowledge-intensive settings, requiring precise execution, dynamic interpretation of feedback, evidence-based reasoning, and iterative hypothesis refinement.

Figure 2: An overview of the visualized user interface of our framework, facilitating observability and analysis.

Visualization Demo Videos

ScalingInter-RL Method

Beyond relying on internal reasoning to select the next action, agents should also expand their external interactions with the environment to ensure sufficient exploration and accumulate richer context toward the final goal—capturing a form of practice-driven insight. Yet our preliminary experiments indicate that starting with a large number of interaction turns often leads the model into redundant reasoning and unproductive actions, ultimately causing training collapse and degraded performance. Conversely, keeping the number of interactions consistently small narrows exploration and limits the agent's ability to master diverse behavior patterns. This motivates our ScalingInter-RL method.


Figure 3: Illustration of the ScalingInter-RL approach. It allows the agent to adapt in stages: initially, by limiting interaction turns to prioritize exploitation, master basic skills, and solve easy tasks; later, by gradually increasing interactions to explore, avoid shortcuts, refine behavior, and tackle harder problems. Ultimately, this process trains a stronger agent.
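One simple way to realize this staged schedule is to cap the number of interaction turns per rollout and raise the cap at fixed training milestones. The snippet below is a minimal sketch under that assumption; the stage boundaries and caps are illustrative values, not the paper's reported hyperparameters.

```python
def interaction_cap(step: int,
                    stages=((0, 8), (200, 16), (400, 32))) -> int:
    """Return the maximum number of interaction turns allowed at a given
    training step. Early stages use a small cap (exploitation, basic skills,
    easy tasks); later stages enlarge the horizon to encourage exploration.
    The (start_step, cap) pairs are illustrative, not the actual schedule."""
    cap = stages[0][1]
    for start_step, stage_cap in stages:
        if step >= start_step:
            cap = stage_cap
    return cap


# Example: the rollout horizon grows as training proceeds.
for step in (0, 100, 250, 500):
    print(step, interaction_cap(step))   # -> 8, 8, 16, 32
```

The returned cap would then bound the rollout length (e.g., the `max_turns` argument in the sketch above), so the optimization sees progressively longer horizons as training stabilizes.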

Performance

We use Qwen2.5-3B and Qwen2.5-7B as our primary backbone models. We evaluate AgentGym-RL and ScalingInter-RL across five scenarios and include multiple closed-source and open-source models for comparison. The results are as follows.


Figure 4: Performance comparison of our AgentGym-RL trained agents against various baselines across five scenarios. AgentGym-RL-7B outperforms other open-source models by a large margin.

The AgentGym-RL framework and the ScalingInter-RL method substantially enhance the capabilities of open-source 7B-scale models to a level that rivals or even surpasses top-tier proprietary models. Moreover, ScalingInter-RL exhibits more stable and efficient training dynamics during RL optimization, as shown in the figure below.

Figure 5: Training dynamics of ScalingInter-RL during RL optimization.

BibTeX


@misc{xi2025agentgymrltrainingllmagents,
      title={AgentGym-RL: Training LLM Agents for Long-Horizon Decision Making through Multi-Turn Reinforcement Learning},
      author={Zhiheng Xi and Jixuan Huang and Chenyang Liao and Baodai Huang and Honglin Guo and Jiaqi Liu and Rui Zheng and Junjie Ye and Jiazheng Zhang and Wenxiang Chen and Wei He and Yiwen Ding and Guanyu Li and Zehui Chen and Zhengyin Du and Xuesong Yao and Yufei Xu and Jiecao Chen and Tao Gui and Zuxuan Wu and Qi Zhang and Xuanjing Huang and Yu-Gang Jiang},
      year={2025},
      eprint={2509.08755},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2509.08755},
}