
Weight Ensembling Improves Reasoning in Language Models

Source: arXiv
Abstract

We investigate a failure mode that arises during the training of reasoning models, where the diversity of generations begins to collapse, leading to suboptimal test-time scaling. Notably, the Pass@1 rate reliably improves during supervised finetuning (SFT), but Pass@k rapidly deteriorates. Surprisingly, a simple intervention of interpolating the weights of the latest SFT checkpoint with an early checkpoint, otherwise known as WiSE-FT, almost completely recovers Pass@k while also improving Pass@1. The WiSE-FT variant achieves better test-time scaling (Best@k, majority vote) and, when tuned further by reinforcement learning, reaches superior results with less data. Finally, we find that WiSE-FT provides complementary performance gains that cannot be achieved through diversity-inducing decoding strategies alone, such as temperature scaling. We formalize a bias-variance tradeoff of Pass@k with respect to the expectation and variance of Pass@1 over the test distribution. We find that WiSE-FT can reduce bias and variance simultaneously, while temperature scaling inherently trades off between bias and variance.
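The intervention described in the abstract is a linear interpolation of model weights between an early and a late SFT checkpoint. Below is a minimal PyTorch sketch of that kind of checkpoint merge; the function name, file paths, and the mixing coefficient alpha are illustrative assumptions, since the abstract does not specify the exact recipe the authors use.

    import torch

    def wise_ft_merge(early_ckpt, late_ckpt, alpha=0.5):
        """Elementwise linear interpolation of two state dicts:
        (1 - alpha) * early + alpha * late, per parameter tensor."""
        assert early_ckpt.keys() == late_ckpt.keys(), "checkpoints must share parameter names"
        return {
            name: torch.lerp(early_ckpt[name].float(), late_ckpt[name].float(), alpha)
            for name in late_ckpt
        }

    # Hypothetical usage: paths and alpha are placeholders, not values from the paper.
    early = torch.load("sft_early.pt", map_location="cpu")
    late = torch.load("sft_final.pt", map_location="cpu")
    torch.save(wise_ft_merge(early, late, alpha=0.5), "wise_ft_merged.pt")

With alpha = 0 the merge returns the early checkpoint and with alpha = 1 the latest SFT checkpoint; per the abstract, an intermediate mixture recovers Pass@k while also improving Pass@1.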

Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan

Subjects: Computing Technology; Computer Technology

Xingyu Dang, Christina Baek, Kaiyue Wen, Zico Kolter, Aditi Raghunathan. Weight Ensembling Improves Reasoning in Language Models [EB/OL]. (2025-04-14) [2025-07-16]. https://arxiv.org/abs/2504.10478.
