Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning

ICLR 2026
Jingqi Tong1,2,*, Jixin Tang1,*, Hangcheng Li1,*, Yurong Mou1,*, Ming Zhang1, Jun Zhao1,†, Yanbo Wen1, Fan Song1, Jiahao Zhan1, Yuyang Lu1, Chaoran Tao1, Zhiyuan Guo1, Jizhou Yu1, Tianhao Cheng1, Zhiheng Xi1, Changhao Jiang1, Zhangyue Yin1, Yining Zheng1, Weifeng Ge1, Guanhua Chen3, Tao Gui1,2, Xipeng Qiu1,2,†, Qi Zhang1,†, Xuanjing Huang1
1Fudan University    2Shanghai Innovation Institute    3SUSTech
* Equal contribution   † Corresponding authors
Paper PDF GitHub 🤗GameQA-140K 🤗GameQA-text 🤗Models

Abstract

Vision-language reinforcement learning (RL) has primarily focused on narrow domains (e.g., geometry or chart reasoning). This leaves broader training scenarios and resources underexplored, limiting what Vision Language Models (VLMs) can explore and learn through RL. We find that video games inherently provide rich visual elements and easily verifiable mechanics. To fully leverage the multimodal and verifiable rewards in video games, we propose Game-RL, which constructs diverse game tasks for RL training to boost VLMs' general reasoning ability. To obtain training data, we propose Code2Logic, a novel approach that adapts game code to synthesize reasoning data with unlimited examples and controllable difficulty gradation, yielding the GameQA dataset of 30 games and 158 verifiable tasks. Remarkably, RL training solely on GameQA enables multiple VLMs to generalize across 7 diverse out-of-domain vision-language benchmarks, demonstrating the value of Game-RL for enhancing VLMs' general reasoning. Furthermore, game data provides improvements comparable to general multimodal reasoning datasets (e.g., geometry/chart). More importantly, scaling up game diversity or game data volume consistently improves VLMs' generalizable reasoning capabilities. Our findings highlight scaling reinforcement learning in game environments as a promising direction for enhancing generalizable multimodal reasoning in foundation models.

Code2Logic Approach

The Code2Logic approach involves three main steps:

1. Using LLMs to construct the code for a selected game (e.g., Sokoban).

2. LLM-assisted design of task templates, including question and analysis templates, based on the generated game code. Each task template condenses one type of reasoning pattern in the game.

3. Using LLMs to construct a data engine that directly reuses the core game code from Step 1, including functions such as move.

After these main steps, the data engine is executed to fill in the task templates developed in Step 2 and generate data samples.
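The pipeline above can be sketched in a few lines. The snippet below is a minimal, hypothetical illustration of a Code2Logic-style data engine for a Sokoban-like game; the `move` function, template fields, and sample schema are our own illustrative assumptions, not the released GameQA engine.

```python
# Illustrative sketch of a Code2Logic-style data engine.
# The function names, templates, and sample schema are hypothetical.
import random

# Step 1: core game logic, which the data engine reuses directly.
def move(pos, direction):
    """Apply a Sokoban-style move to a grid position (row, col)."""
    dr, dc = {"up": (-1, 0), "down": (1, 0),
              "left": (0, -1), "right": (0, 1)}[direction]
    return (pos[0] + dr, pos[1] + dc)

# Step 2: a task template condensing one reasoning pattern
# (here: tracking a position through a move sequence).
QUESTION_TEMPLATE = ("The player starts at {start}. After moving {moves}, "
                     "where is the player?")
ANALYSIS_TEMPLATE = ("Starting from {start}, each move shifts the position: "
                     "{steps}. The final position is {answer}.")

# Step 3: the data engine executes the game code to fill the templates,
# so every sample carries a program-verified ground truth.
def generate_sample(rng, n_moves=3):
    start = (rng.randrange(5), rng.randrange(5))
    moves = [rng.choice(["up", "down", "left", "right"])
             for _ in range(n_moves)]
    pos, steps = start, []
    for m in moves:
        pos = move(pos, m)
        steps.append(f"{m} -> {pos}")
    return {
        "question": QUESTION_TEMPLATE.format(start=start,
                                             moves=", ".join(moves)),
        "analysis": ANALYSIS_TEMPLATE.format(start=start,
                                             steps="; ".join(steps),
                                             answer=pos),
        "answer": pos,
    }

sample = generate_sample(random.Random(0))
print(sample["question"])
```

Because the answer is computed by the game code itself rather than annotated by a model, sample generation is unlimited and every label is verifiable; difficulty can be controlled by parameters such as the number of moves.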

Code2Logic Approach

GameQA Dataset

Our GameQA dataset provides diverse verifiable game tasks along with controllable difficulty, extending RL training scenarios for VLMs to the domain of video games.

It encompasses 30 different games classified into 4 categories based on the core capabilities required to solve game tasks. Four games from different categories and their example data samples are illustrated in the image below. The GameQA data samples are also reasonably graded by difficulty (see 🤗 GameQA-140K).
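Because every GameQA sample carries a program-generated ground truth, the RL reward can be computed by simple answer verification. The sketch below assumes the model emits its final answer in a `\boxed{...}` span; that convention and the function name are illustrative assumptions, not necessarily the paper's exact format.

```python
# Minimal sketch of a verifiable binary reward for GameQA-style tasks.
# The \boxed{...} answer convention is an assumption for illustration.
import re

def verifiable_reward(model_output: str, ground_truth: str) -> float:
    """Return 1.0 if the model's boxed answer matches the ground truth."""
    match = re.search(r"\\boxed\{([^}]*)\}", model_output)
    if match is None:
        return 0.0  # no parseable final answer
    predicted = match.group(1).strip().lower()
    return 1.0 if predicted == ground_truth.strip().lower() else 0.0
```

Exact matching like this is what makes game tasks attractive for RL: the reward is dense in signal, cheap to compute, and immune to reward-model noise.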

4 Game Example Samples from GameQA
30 Categorized Games in GameQA

Key Findings

😎 Game-RL leads to generalizable multimodal reasoning improvements

RL training solely on game data (GameQA) enables three VLMs (Qwen2.5-VL, InternVL2.5, InternVL3) to achieve consistent performance improvements across 7 diverse vision reasoning benchmarks, demonstrating strong out-of-domain generalization. These results suggest that the models have learned transferable visual understanding and reasoning abilities through Game-RL.

Evaluation Results on General Vision Benchmarks

💪 Game data is competitive with geometry datasets

Based on Qwen2.5-VL-7B, we applied the same training method to 5k GameQA samples and, for comparison, to 8k samples each from MAVIS, Multimodal-Open-R1, and MultiMath.

The GameQA-trained model is competitive with its counterparts trained on geometry or function data, for which the general vision benchmarks are much closer to in-domain. These results suggest that GameQA enables stronger out-of-domain generalization, even with less data from a mismatched domain.

GameQA Generalizes Better

📈 Scaling Effects: Game Diversity & Data Volume

Game Diversity: Scaling up game diversity (e.g., 4 games → 20 games) yields better generalization, enabling the model to acquire more robust visual understanding and reasoning abilities.

Scaling Effect of Game Diversity

Data Volume: The model's performance on the 7 general vision benchmarks shows an overall upward trend as the amount of training data increases, indicating that scaling up game data volume effectively enhances the VLM's generalizable reasoning abilities.

Scaling Effect of Data Volume

Citation

@misc{tong2025gamerlsynthesizingmultimodalverifiable,
      title={Game-RL: Synthesizing Multimodal Verifiable Game Data to Boost VLMs' General Reasoning}, 
      author={Jingqi Tong and Jixin Tang and Hangcheng Li and Yurong Mou and Ming Zhang and Jun Zhao and Yanbo Wen and Fan Song and Jiahao Zhan and Yuyang Lu and Chaoran Tao and Zhiyuan Guo and Jizhou Yu and Tianhao Cheng and Zhiheng Xi and Changhao Jiang and Zhangyue Yin and Yining Zheng and Weifeng Ge and Guanhua Chen and Tao Gui and Xipeng Qiu and Qi Zhang and Xuanjing Huang},
      year={2025},
      eprint={2505.13886},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.13886}, 
}