Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment

Source: arXiv
English Abstract

Ensuring consistent safety across multiple languages remains a significant challenge for large language models (LLMs). We introduce Soteria, a lightweight yet powerful strategy that locates and minimally adjusts the "functional heads" most responsible for harmful content generation in each language. By altering only a fraction of parameters, Soteria drastically reduces policy violations without sacrificing overall model performance, even in low-resource settings. To rigorously evaluate our approach, we also present XThreatBench, a specialized multilingual dataset capturing fine-grained harmful behaviors drawn from real policy guidelines. Experiments with leading open-source LLMs (e.g., Llama, Qwen, Mistral) show that Soteria consistently improves safety metrics across high-, mid-, and low-resource languages. These findings highlight a promising path toward scalable, linguistically attuned, and ethically aligned LLMs worldwide.
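The abstract only outlines the mechanism at a high level, so the snippet below is a minimal, self-contained sketch of the general idea: score each attention head by how differently it activates on harmful versus benign prompts in a target language, then damp only the output-projection weights of the top-scoring heads. All names here (ToyBlock, score_heads, steer_heads, the activation-gap score, the damping factor) are illustrative assumptions built around a toy model, not the authors' actual implementation or selection criterion.

import torch
import torch.nn as nn

class ToyBlock(nn.Module):
    """One multi-head self-attention block standing in for an LLM layer."""
    def __init__(self, d_model=64, n_heads=8):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = d_model // n_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out_proj = nn.Linear(d_model, d_model)  # W_O: concatenated heads -> d_model

    def forward(self, x):
        B, T, D = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        def split(t):  # (B, T, D) -> (B, n_heads, T, head_dim)
            return t.view(B, T, self.n_heads, self.head_dim).transpose(1, 2)
        q, k, v = split(q), split(k), split(v)
        att = (q @ k.transpose(-2, -1)) / self.head_dim ** 0.5
        att = att.softmax(dim=-1)
        per_head = att @ v                          # per-head outputs (B, n_heads, T, head_dim)
        concat = per_head.transpose(1, 2).reshape(B, T, D)
        return self.out_proj(concat), per_head

def score_heads(block, harmful, benign):
    """Assumed importance score: mean activation gap of each head between
    harmful and benign prompts in one language (a stand-in criterion)."""
    with torch.no_grad():
        _, h_harm = block(harmful)
        _, h_ben = block(benign)
    return (h_harm.mean(dim=(0, 2, 3)) - h_ben.mean(dim=(0, 2, 3))).abs()

def steer_heads(block, scores, top_k=2, damp=0.1):
    """Minimally adjust only the output-projection columns belonging to the
    top-k scored heads, leaving every other parameter untouched."""
    top = torch.topk(scores, top_k).indices
    with torch.no_grad():
        for h in top.tolist():
            lo, hi = h * block.head_dim, (h + 1) * block.head_dim
            block.out_proj.weight[:, lo:hi] *= damp
    return top

if __name__ == "__main__":
    torch.manual_seed(0)
    block = ToyBlock()
    harmful = torch.randn(4, 10, 64)   # placeholder for harmful-prompt inputs
    benign = torch.randn(4, 10, 64)    # placeholder for benign-prompt inputs
    scores = score_heads(block, harmful, benign)
    steered = steer_heads(block, scores)
    print("Steered heads:", steered.tolist(), "out of", block.n_heads)

Restricting the edit to the W_O columns of a few heads is what keeps the intervention to a small fraction of the model's parameters; the paper's actual head-selection and adjustment rules are described in the arXiv preprint linked below.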

Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra

Subject: Computing Technology, Computer Technology

Somnath Banerjee, Sayan Layek, Pratyush Chatterjee, Animesh Mukherjee, Rima Hazra. Soteria: Language-Specific Functional Parameter Steering for Multilingual Safety Alignment [EB/OL]. (2025-08-22) [2025-09-05]. https://arxiv.org/abs/2502.11244
