
Safety Pretraining: Toward the Next Generation of Safe AI

Source: arXiv
Abstract

As large language models (LLMs) are increasingly deployed in high-stakes settings, the risk of generating harmful or toxic content remains a central challenge. Post-hoc alignment methods are brittle: once unsafe patterns are learned during pretraining, they are hard to remove. We present a data-centric pretraining framework that builds safety into the model from the start. Our contributions include: (i) a safety classifier trained on 10,000 GPT-4 labeled examples, used to filter 600B tokens; (ii) the largest synthetic safety dataset to date (100B tokens), generated via recontextualization of harmful web data; (iii) RefuseWeb and Moral Education datasets that convert harmful prompts into refusal dialogues and web-style educational material; (iv) Harmfulness-Tag annotations injected during pretraining to flag unsafe content and steer inference away from harmful generations; and (v) safety evaluations measuring base model behavior before instruction tuning. Our safety-pretrained models reduce attack success rates from 38.8% to 8.4% with no performance degradation on standard LLM safety benchmarks.
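
The abstract gives no implementation details; as a rough illustration of how the classifier-based filtering in (i) and the Harmfulness-Tag annotation in (iv) could fit together in a data pipeline, here is a minimal Python sketch. The marker lexicon, tag strings, threshold, and all function names are assumptions for illustration, not the paper's actual classifier or tokens.

    # Illustrative sketch of safety-classifier filtering (i) and Harmfulness-Tag
    # annotation (iv). The stand-in lexical scorer, the tag strings, and the
    # threshold are hypothetical; the paper trains a classifier on GPT-4 labels.
    from typing import Iterable, Iterator

    UNSAFE_MARKERS = ("build a weapon", "bypass the filter", "synthesize a toxin")
    HARM_OPEN, HARM_CLOSE = "<harmful>", "</harmful>"  # assumed tag tokens

    def safety_score(text: str) -> float:
        """Stand-in for the learned safety classifier; returns a proxy for P(unsafe)."""
        lowered = text.lower()
        hits = sum(marker in lowered for marker in UNSAFE_MARKERS)
        return min(1.0, hits / len(UNSAFE_MARKERS))

    def prepare_corpus(docs: Iterable[str], threshold: float = 0.5) -> Iterator[str]:
        """Drop clearly unsafe documents; wrap borderline ones in harmfulness
        tags so pretraining can associate the tag with unsafe text and
        inference can later steer away from tagged continuations."""
        for doc in docs:
            score = safety_score(doc)
            if score >= threshold:
                continue                               # filtered out, as in (i)
            if score > 0.0:
                yield f"{HARM_OPEN}{doc}{HARM_CLOSE}"  # tagged but kept, as in (iv)
            else:
                yield doc                              # clean document, kept as-is

    if __name__ == "__main__":
        corpus = [
            "An introduction to sorting algorithms.",
            "How to build a weapon at home.",
        ]
        print(list(prepare_corpus(corpus)))

Tagging borderline documents rather than discarding them keeps their tokens available for pretraining while giving the model an explicit signal that decoding can later be steered away from.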

Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, J. Zico Kolter

Computing technology; computer technology

Pratyush Maini, Sachin Goyal, Dylan Sam, Alex Robey, Yash Savani, Yiding Jiang, Andy Zou, Zachary C. Lipton, J. Zico Kolter. Safety Pretraining: Toward the Next Generation of Safe AI [EB/OL]. (2025-04-23) [2025-05-18]. https://arxiv.org/abs/2504.16980