From Emergence to Control: Probing and Modulating Self-Reflection in Language Models

Source: arXiv
Abstract

Self-reflection -- the ability of a large language model (LLM) to revisit, evaluate, and revise its own reasoning -- has recently emerged as a powerful behavior enabled by reinforcement learning with verifiable rewards (RLVR). While self-reflection correlates with improved reasoning accuracy, its origin and underlying mechanisms remain poorly understood. In this work, we first show that self-reflection is not exclusive to RLVR fine-tuned models: it already emerges, albeit rarely, in pretrained models. To probe this latent ability, we introduce Reflection-Inducing Probing, a method that injects reflection-triggering reasoning traces from fine-tuned models into pretrained models. This intervention raises the self-reflection frequency of Qwen2.5 from 0.6% to 18.6%, revealing a hidden capacity for reflection. Moreover, our analysis of internal representations shows that both pretrained and fine-tuned models maintain hidden states that distinctly separate self-reflective from non-reflective contexts. Leveraging this observation, we then construct a self-reflection vector, a direction in activation space associated with self-reflective reasoning. By manipulating this vector, we enable bidirectional control over self-reflective behavior in both pretrained and fine-tuned models. Experiments across multiple reasoning benchmarks show that enhancing this vector improves reasoning performance by up to 12%, while suppressing it reduces computational cost, providing a flexible mechanism to navigate the trade-off between reasoning quality and efficiency without requiring additional training. Our findings further our understanding of self-reflection and support a growing body of work showing that understanding model internals can enable precise behavioral control.
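The abstract outlines, but does not detail, how a self-reflection vector is built and applied. The sketch below illustrates one common activation-steering recipe consistent with that description, assuming a Hugging Face transformers causal LM; the model name, layer index, steering strength, and example contexts are illustrative assumptions, not the authors' exact procedure.

```python
# Hedged sketch: estimate a "self-reflection" direction as a difference of mean
# activations between reflective and non-reflective contexts, then add it to a
# layer's output during generation. All constants below are assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Qwen/Qwen2.5-7B"  # model family named in the abstract
LAYER = 20                      # assumed intermediate decoder layer to steer
ALPHA = 4.0                     # assumed steering strength; negative suppresses

tok = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME, torch_dtype=torch.bfloat16)
model.eval()

def mean_hidden_state(texts, layer):
    """Average the hidden state at `layer` over the last token of each text."""
    states = []
    for t in texts:
        ids = tok(t, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[0] is the embedding output, so layer L's output is index L+1.
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

# Placeholder contexts; the paper derives these from reasoning traces.
reflective = ["... Wait, let me double-check that step.",
              "... Hmm, I should re-examine my earlier claim."]
non_reflective = ["... Therefore the answer is 42.",
                  "... So the final result follows directly."]

# Self-reflection vector: difference of mean activations between context types.
v_reflect = mean_hidden_state(reflective, LAYER) - mean_hidden_state(non_reflective, LAYER)
v_reflect = v_reflect / v_reflect.norm()

def steering_hook(module, inputs, output):
    """Add (or subtract, via ALPHA's sign) the reflection direction to the layer output."""
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + ALPHA * v_reflect.to(hidden.dtype)
    return (hidden, *output[1:]) if isinstance(output, tuple) else hidden

handle = model.model.layers[LAYER].register_forward_hook(steering_hook)
ids = tok("Solve step by step: what is 17 * 23?", return_tensors="pt")
with torch.no_grad():
    gen = model.generate(**ids, max_new_tokens=128)
print(tok.decode(gen[0], skip_special_tokens=True))
handle.remove()
```

Setting ALPHA negative would push activations away from the reflective direction instead, which is the sense in which such control is bidirectional: amplify reflection for harder problems, or suppress it to shorten generations and save compute.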

Xudong Zhu, Jiachen Jiang, Mohammad Mahdi Khalili, Zhihui Zhu

Subject: Computing Technology, Computer Technology

Xudong Zhu, Jiachen Jiang, Mohammad Mahdi Khalili, Zhihui Zhu. From Emergence to Control: Probing and Modulating Self-Reflection in Language Models [EB/OL]. (2025-06-13) [2025-07-16]. https://arxiv.org/abs/2506.12217.