Test-Time Scaling with Reflective Generative Model
We introduce our first reflective generative model, MetaStone-S1, which achieves performance comparable to OpenAI o3-mini through the new Reflective Generative Form. The new form focuses on selecting high-quality reasoning trajectories and contains two novelties: 1) A unified interface for the policy and process reward models: we share the backbone network and use task-specific heads for predicting and scoring reasoning trajectories respectively, introducing only 53M extra parameters for trajectory scoring. 2) No reliance on process-level annotation: we provide a self-supervised process reward model that learns high-quality reasoning trajectory selection directly from the outcome reward. Equipped with the Reflective Generative Form, MetaStone-S1 is naturally suited to test-time scaling, and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
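The trajectory selection described in the abstract can be illustrated with a minimal sketch: sample several candidate reasoning trajectories, score each step with a process reward function, aggregate the step scores, and keep the best candidate. Everything named here is an assumption for illustration and not the released implementation: the candidate counts per effort mode, the geometric-mean aggregation, and the `score_steps` callable that stands in for the scoring head.

```python
import math
from typing import Callable, List, Sequence

# Hypothetical number of sampled trajectories per effort mode (illustrative only).
EFFORT_TO_CANDIDATES = {"low": 2, "medium": 8, "high": 32}


def trajectory_score(step_scores: Sequence[float]) -> float:
    """Aggregate per-step scores in (0, 1] with a geometric mean (assumed aggregation)."""
    if not step_scores:
        return 0.0
    eps = 1e-12  # guard against log(0)
    log_sum = sum(math.log(max(s, eps)) for s in step_scores)
    return math.exp(log_sum / len(step_scores))


def select_trajectory(
    candidates: List[List[str]],
    score_steps: Callable[[List[str]], Sequence[float]],
) -> List[str]:
    """Return the candidate whose aggregated step score is highest.

    `score_steps` is a stand-in for the scoring head: it maps a trajectory
    (a list of reasoning steps) to one score per step.
    """
    return max(candidates, key=lambda c: trajectory_score(score_steps(c)))
```

In this sketch, a higher effort mode simply means more sampled candidates are passed to `select_trajectory`, so extra test-time compute is spent on wider selection and the model itself is unchanged.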
Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie
Computing technology, computer technology
Zixiao Wang, Yuxin Wang, Xiaorui Wang, Mengting Xing, Jie Gao, Jianjun Xu, Guangcan Liu, Chenhui Jin, Zhuo Wang, Shengzhuo Zhang, Hongtao Xie. Test-Time Scaling with Reflective Generative Model [EB/OL]. (2025-07-09) [2025-07-18]. https://arxiv.org/abs/2507.01951.