
Outcome-based Reinforcement Learning to Predict the Future


Source: arXiv
Abstract

Reinforcement learning with verifiable rewards (RLVR) has boosted math and coding in large language models, yet there has been little effort to extend RLVR into messier, real-world domains like forecasting. One sticking point is that outcome-based reinforcement learning for forecasting must learn from binary, delayed, and noisy rewards, a regime where standard fine-tuning is brittle. We show that outcome-only online RL on a 14B model can match frontier-scale accuracy and surpass it in calibration and hypothetical prediction market betting by adapting two leading algorithms, Group-Relative Policy Optimisation (GRPO) and ReMax, to the forecasting setting. Our adaptations remove per-question variance scaling in GRPO, apply baseline-subtracted advantages in ReMax, hydrate training with 100k temporally consistent synthetic questions, and introduce lightweight guard-rails that penalise gibberish, non-English responses and missing rationales, enabling a single stable pass over 110k events. Scaling ReMax to 110k questions and ensembling seven predictions yields a 14B model that matches frontier baseline o1 on accuracy on our holdout set (Brier = 0.193, p = 0.23) while beating it in calibration (ECE = 0.042, p < 0.001). A simple trading rule turns this calibration edge into $127 of hypothetical profit versus $92 for o1 (p = 0.037). This demonstrates that refined RLVR methods can convert small-scale LLMs into potentially economically valuable forecasting tools, with implications for scaling this to larger models.
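
The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names, the ten equal-width calibration bins, and the use of a simple group mean as the baseline are all assumptions made for illustration, since the abstract does not specify the reward definition, the binning scheme, or the exact baseline.

```python
from typing import List, Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between forecast probabilities and binary outcomes."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (the bin count is an assumption, not from the paper)."""
    bins: List[List[tuple]] = [[] for _ in range(n_bins)]
    for p, y in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    total = len(probs)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_p = sum(p for p, _ in members) / len(members)
        freq = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_p - freq)
    return ece


def mean_centered_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the group-mean baseline from each reward.

    Unlike standard GRPO, there is no division by the per-question standard
    deviation, mirroring the "remove per-question variance scaling" change
    described in the abstract. The mean baseline here is an assumption.
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def ensemble_mean(probs: Sequence[float]) -> float:
    """Average several sampled forecasts for one question (e.g. seven samples)."""
    return sum(probs) / len(probs)
```

For example, `mean_centered_advantages([1.0, 0.0, 0.0, 1.0])` centres the rewards on their mean of 0.5 without rescaling them, so questions with small reward spread contribute proportionally small updates.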

Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger

Computing technology, computer technology

Benjamin Turtel, Danny Franklin, Kris Skotheim, Luke Hewitt, Philipp Schoenegger. Outcome-based Reinforcement Learning to Predict the Future [EB/OL]. (2025-05-23) [2025-06-15]. https://arxiv.org/abs/2505.17989.
