Multi-lingual Multi-turn Automated Red Teaming for LLMs
Multi-lingual Multi-turn Automated Red Teaming for LLMs
Language Model Models (LLMs) have improved dramatically in the past few years, increasing their adoption and the scope of their capabilities over time. A significant amount of work is dedicated to ``model alignment'', i.e., preventing LLMs to generate unsafe responses when deployed into customer-facing applications. One popular method to evaluate safety risks is \textit{red-teaming}, where agents attempt to bypass alignment by crafting elaborate prompts that trigger unsafe responses from a model. Standard human-driven red-teaming is costly, time-consuming and rarely covers all the recent features (e.g., multi-lingual, multi-modal aspects), while proposed automation methods only cover a small subset of LLMs capabilities (i.e., English or single-turn). We present Multi-lingual Multi-turn Automated Red Teaming (\textbf{MM-ART}), a method to fully automate conversational, multi-lingual red-teaming operations and quickly identify prompts leading to unsafe responses. Through extensive experiments on different languages, we show the studied LLMs are on average 71\% more vulnerable after a 5-turn conversation in English than after the initial turn. For conversations in non-English languages, models display up to 195\% more safety vulnerabilities than the standard single-turn English approach, confirming the need for automated red-teaming methods matching LLMs capabilities.
Abhishek Singhania、Christophe Dupuy、Shivam Mangale、Amani Namboori
常用外国语自动化技术、自动化技术设备计算技术、计算机技术
Abhishek Singhania,Christophe Dupuy,Shivam Mangale,Amani Namboori.Multi-lingual Multi-turn Automated Red Teaming for LLMs[EB/OL].(2025-04-04)[2025-04-26].https://arxiv.org/abs/2504.03174.点此复制
评论