National Preprint Platform

A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content

Source: arXiv

English Abstract

Large Language Models (LLMs) have made remarkable progress, but concerns about potential biases and harmful content persist. To address these concerns, we introduce a practical solution for ensuring the safe and ethical use of LLMs. Our novel approach centers on a post-generation correction mechanism, the BART-Corrective Model, which adjusts generated content to ensure safety and security. Rather than relying solely on model fine-tuning or prompt engineering, our method provides a robust data-centric alternative for mitigating harmful content. We demonstrate the effectiveness of our approach through experiments on multiple toxic datasets, which show a significant reduction in mean toxicity and jail-breaking scores after integration. Specifically, our results show a reduction of 15% and 21% in mean toxicity and jail-breaking scores with GPT-4, a substantial reduction of 28% and 5% with PaLM2, a reduction of approximately 26% and 23% with Mistral-7B, and a reduction of 11.1% and 19% with Gemma-2b-it. These results demonstrate the potential of our approach to improve the safety and security of LLMs, making them more suitable for real-world applications.
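The abstract describes a post-generation correction pipeline: the base LLM produces a draft, the draft is checked for harmful content, and a separate corrective model rewrites it when needed. The sketch below shows only that control flow; the actual corrector in the paper is a fine-tuned BART sequence-to-sequence model, which the toy stand-ins (`toy_generate`, `toy_score`, `toy_correct`, and the 0.5 threshold) here merely represent — all names and values are illustrative assumptions, not the paper's implementation.

```python
def corrective_pipeline(generate, score_toxicity, correct, threshold=0.5):
    """Wrap a text generator with a toxicity check and a corrective rewrite pass.

    generate:       prompt -> draft text (the base LLM)
    score_toxicity: text -> score in [0, 1] (higher = more toxic)
    correct:        text -> corrected text (the corrective model, e.g. BART)
    """
    def safe_generate(prompt):
        text = generate(prompt)                # 1. draft from the base LLM
        if score_toxicity(text) > threshold:   # 2. score the draft post-generation
            text = correct(text)               # 3. rewrite only unsafe drafts
        return text
    return safe_generate


# Demo with toy stand-ins for the three components.
toy_generate = lambda p: "you are awful"
toy_score = lambda t: 0.9 if "awful" in t else 0.1
toy_correct = lambda t: "[rewritten to a neutral, safe response]"

safe = corrective_pipeline(toy_generate, toy_score, toy_correct)
print(safe("hello"))  # the toxic draft is intercepted and rewritten
```

Because the correction happens after generation, the same wrapper can sit in front of any base model (GPT-4, PaLM2, Mistral-7B, Gemma-2b-it in the paper's experiments) without retraining it.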

Chaima Njeh, Haïfa Nakouri, Fehmi Jaafar

Computing Technology; Computer Technology

Chaima Njeh, Haïfa Nakouri, Fehmi Jaafar. A Data-Centric Approach for Safe and Secure Large Language Models against Threatening and Toxic Content [EB/OL]. (2025-04-19) [2025-05-21]. https://arxiv.org/abs/2504.16120
