LLM
AI Fundamentals
Limit of RLVR
5/28/2025 • limit-of-rlvr.github.io

Reasoning LLMs Are Just Efficient Samplers: RL Training Elicits No Transcending Capacity
C4AIL Commentary:
[…] RLVR narrows the model’s exploration, favoring known high-reward paths instead of discovering new reasoning strategies. Crucially, all correct solutions from RL-trained models already exist in the base model’s distribution, proving RLVR enhances sampling efficiency, not reasoning capacity, while inadvertently shrinking the solution space.
Project: https://limit-of-rlvr.github.io/
Paper: https://arxiv.org/abs/2504.13837
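
The sampling-efficiency claim is easiest to see through the pass@k comparison the paper builds on: at small k the RLVR-trained model solves more problems per sample, but as k grows the base model matches or overtakes it, because every correct solution the RL model produces was already reachable in the base model's distribution. Below is a minimal sketch using the standard unbiased pass@k estimator; the sample count and per-problem correct counts are hypothetical, chosen only to illustrate the crossover.

```python
# Unbiased pass@k estimator (Chen et al., 2021), commonly used to compare
# base vs. RLVR-trained models: with n samples per problem and c correct,
# pass@k = 1 - C(n-c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem correct counts out of n = 256 samples.
base_correct = [3, 0, 12, 1]    # base model: low hit rate, but broader coverage
rlvr_correct = [40, 0, 90, 0]   # RLVR model: high hit rate on fewer problems

for k in (1, 256):
    base = sum(pass_at_k(256, c, k) for c in base_correct) / len(base_correct)
    rlvr = sum(pass_at_k(256, c, k) for c in rlvr_correct) / len(rlvr_correct)
    print(f"pass@{k}:  base={base:.3f}  rlvr={rlvr:.3f}")
```

With these toy numbers, the RLVR model wins at pass@1 (≈0.13 vs ≈0.02) but loses at pass@256 (0.50 vs 0.75), mirroring the qualitative finding: RLVR concentrates probability mass on problems the base model could already solve while shrinking the overall solution space it covers.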