
Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations


Source: arXiv
Abstract

Recent advancements in LLMs have raised significant safety concerns, particularly when dealing with code-mixed inputs and outputs. Our study systematically investigates the increased susceptibility of LLMs to producing unsafe outputs from code-mixed prompts compared to monolingual English prompts. Using explainability methods, we dissect the internal attribution shifts that cause models' harmful behaviors. In addition, we explore cultural dimensions by distinguishing between universally unsafe and culturally specific unsafe queries. This paper presents novel experimental insights that clarify the mechanisms driving this phenomenon.
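The "internal attribution shifts" the abstract refers to can be probed with standard input-attribution techniques. Below is a minimal sketch, assuming a HuggingFace causal LM and simple gradient-times-input attribution; the model name, prompts, and scoring choice are illustrative placeholders, not the authors' actual experimental setup.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; the paper's models are not specified here.
model_name = "gpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def token_attributions(prompt: str) -> list[tuple[str, float]]:
    """Gradient-times-input attribution of the top next-token logit
    to each input token."""
    enc = tokenizer(prompt, return_tensors="pt")
    # Detach so the embeddings become a leaf tensor we can differentiate w.r.t.
    embeds = model.get_input_embeddings()(enc["input_ids"]).detach()
    embeds.requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits
    # Scalar score: the logit of the model's top-ranked next token.
    logits[0, -1].max().backward()
    # Sum grad * input over the embedding dimension -> one score per token.
    scores = (embeds.grad * embeds).sum(dim=-1).squeeze(0)
    tokens = tokenizer.convert_ids_to_tokens(enc["input_ids"][0])
    return list(zip(tokens, scores.tolist()))

# Compare where attribution mass falls for an English prompt versus a
# hypothetical code-mixed (Hinglish) paraphrase of the same query.
for prompt in ["How should I respond to an insult?",
               "Insult ka jawab kaise dun?"]:
    print(prompt)
    for tok, score in token_attributions(prompt):
        print(f"  {tok!r}: {score:+.4f}")
```

Comparing the two attribution profiles shows, in miniature, the kind of monolingual-versus-code-mixed contrast the study examines at scale.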

Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, Animesh Mukherjee

Linguistics

Somnath Banerjee, Pratyush Chatterjee, Shanu Kumar, Sayan Layek, Parag Agrawal, Rima Hazra, Animesh Mukherjee. Attributional Safety Failures in Large Language Models under Code-Mixed Perturbations [EB/OL]. (2025-05-20) [2025-07-21]. https://arxiv.org/abs/2505.14469.
