AI4MATH Best Paper Award
A paper by Ph.D. students Wenlong Deng and Yi Ren and their supervisors (Xiaoxiao Li and Christos Thrampoulidis for Wenlong, Danica Sutherland for Yi) won the Best Paper Award at the 2nd AI for Math Workshop at the 2025 International Conference on Machine Learning (ICML).
Title: Token Hidden Reward: Steering Exploration-Exploitation in GRPO Training
Authors: Wenlong Deng, Yi Ren, Danica J. Sutherland, Christos Thrampoulidis, Xiaoxiao Li
Abstract: Reinforcement learning (RL) has substantially advanced the reasoning capabilities of large language models (LLMs), yet how to explicitly guide training toward exploration or exploitation remains underexplored. In this work, we start from the assumption that response confidence (the likelihood the model assigns to correct responses) is a meaningful objective for reasoning tasks. To better understand and control learning under this objective, we analyze token-level dynamics in GRPO training and introduce Token Hidden Reward (THR), a novel metric that quantifies the contribution of individual tokens to response confidence. Based on THR, we propose a THR-guided reweighting strategy that modulates the learning signal to explicitly favor either high-confidence outputs (i.e., exploitation) or broader output diversity (i.e., exploration). Empirically, we find that increasing confidence mostly aligns with improved greedy decoding performance (exploitation), while encouraging smaller increases in confidence consistently boosts Pass@K performance (exploration).