Detecting Instruction Fine-tuning Attack on Language Models with Influence Function

Source: arXiv

Abstract

Instruction fine-tuning attacks pose a significant threat to large language models (LLMs) by subtly embedding poisoned data in fine-tuning datasets, which can trigger harmful or unintended responses across a range of tasks. This undermines model alignment and poses security risks in real-world deployment. In this work, we present a simple and effective approach to detect and mitigate such attacks using influence functions, a classical statistical tool adapted for machine learning interpretation. Traditionally, the high computational cost of influence functions has limited their application to large models and datasets. The recent Eigenvalue-Corrected Kronecker-Factored Approximate Curvature (EK-FAC) approximation enables efficient computation of influence scores, making large-scale analysis feasible. We are the first to apply influence functions to detect instruction fine-tuning attacks on language models at scale, as both instruction fine-tuning attacks and efficient influence approximation techniques are relatively new. Our large-scale empirical evaluation of influence functions on 50,000 fine-tuning examples and 32 tasks reveals a strong association between influence scores and sentiment. Building on this, we introduce a novel sentiment transformation combined with influence functions to detect and remove critical poisons -- poisoned data points that skew model predictions. Removing these poisons (only 1% of the total data) recovers model performance to near-clean levels, demonstrating the effectiveness and efficiency of our approach. The artifact is available at https://github.com/lijiawei20161002/Poison-Detection. WARNING: This paper contains offensive data examples.
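For context, the influence score referenced in the abstract is typically the classical influence-function quantity of Koh and Liang (2017), with EK-FAC standing in for the intractable Hessian inverse. The display below is a general sketch in assumed notation, not the paper's own formulation.

% Influence of a training example z_train on a query example z_query,
% evaluated at the fine-tuned parameters \hat{\theta}. H_{\hat{\theta}} is
% the Hessian (in practice a Gauss-Newton approximation) of the training
% loss; EK-FAC replaces it with an eigenvalue-corrected, Kronecker-factored
% estimate so that the inverse is cheap enough for LLM-scale models.
\[
  \mathcal{I}(z_{\mathrm{train}}, z_{\mathrm{query}})
  = -\,\nabla_\theta L(z_{\mathrm{query}}, \hat{\theta})^{\top}
      \, H_{\hat{\theta}}^{-1} \,
      \nabla_\theta L(z_{\mathrm{train}}, \hat{\theta}),
  \qquad
  H_{\hat{\theta}} \approx \hat{G}_{\text{EK-FAC}}.
\]

In this framing, training examples would be ranked by influence (here combined with the sentiment transformation described in the abstract), and the small fraction of top-scoring points -- about 1% of the data in the paper's experiments -- removed as critical poisons.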

Jiawei Li

Subject: Computing Technology; Computer Technology

Jiawei Li. Detecting Instruction Fine-tuning Attack on Language Models with Influence Function [EB/OL]. (2025-04-11) [2025-05-28]. https://arxiv.org/abs/2504.09026.
