NEAT: Concept driven Neuron Attribution in LLMs

Source: arXiv
Abstract

Locating the neurons responsible for final predictions is important for opening up black-box large language models and understanding their internal mechanisms. Previous studies have tried to find mechanisms that operate at the neuron level, but these methods fail to represent a concept, and there is also scope for further optimizing the compute required. In this paper, with the help of concept vectors, we propose a method for locating the significant neurons responsible for representing certain concepts, and we term those neurons concept neurons. If the number of neurons is n and the number of examples is m, we reduce the number of forward passes required from O(n*m) in previous works to just O(n), thereby optimizing the time and computation required. We also compare our method with several baselines and previous methods; our results demonstrate better performance than most of them and are more efficient than the state-of-the-art method. As part of our ablation studies, we also try to optimize the search for concept neurons using clustering methods. Finally, we apply our method to find and turn off the identified neurons and analyze the implications for hate speech and bias in LLMs, evaluating the bias portion in the Indian context. Our methodology, analysis, and explanations facilitate understanding of neuron-level responsibility for broader, human-like concepts and lay a path for future research on finding concept neurons and intervening on them.
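
The abstract does not give implementation details, so the following is only a minimal sketch of the general idea it describes: scoring a layer's neurons against a precomputed concept vector using activations collected in a single forward pass over concept examples, then zeroing the top-scoring neurons via a hook to study the effect. The model choice (GPT-2), the layer index, the random stand-in concept vector, and the activation-times-projection scoring rule are illustrative assumptions, not the paper's method.

```python
# Minimal sketch (not the paper's exact implementation): score MLP neurons in a
# GPT-2 layer by how strongly their activations align with a concept vector,
# then zero out the top-scoring "concept neurons" with a forward hook.
import torch
from transformers import GPT2Model, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2").eval()

layer = model.h[6].mlp                              # hypothetical choice of layer
concept_vector = torch.randn(model.config.n_embd)   # stand-in for a learned concept vector

# Capture the MLP hidden activations (output of the first projection + nonlinearity).
acts = {}
def save_acts(module, inp, out):
    acts["h"] = out.detach()                         # shape: (batch, seq, 4 * n_embd)
hook = layer.act.register_forward_hook(save_acts)

texts = ["Example sentence about the concept."]      # stand-in for concept examples
with torch.no_grad():
    enc = tokenizer(texts, return_tensors="pt")
    model(**enc)                                     # one forward pass per example batch
hook.remove()

# Score each neuron by (mean activation) * (its output weight row . concept vector).
w_out = layer.c_proj.weight                          # (4 * n_embd, n_embd) in HF's Conv1D layout
mean_act = acts["h"].mean(dim=(0, 1))                # (4 * n_embd,)
scores = mean_act * (w_out @ concept_vector)         # per-neuron attribution score
top_neurons = scores.topk(10).indices                # candidate concept neurons

# Ablate: zero those neurons' activations on subsequent forward passes.
def ablate(module, inp, out):
    out[..., top_neurons] = 0.0
    return out
layer.act.register_forward_hook(ablate)
```

Under these assumptions, the attribution for all neurons in the layer comes from activations gathered in the same forward passes, which is the kind of saving the abstract refers to when reducing per-neuron probing; running the model again after registering the ablation hook would show how predictions change once the candidate concept neurons are turned off.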

Rahul Mishra, Vivek Hruday Kavuri, Gargi Shroff

Subjects: Computing Technology, Computer Technology

Rahul Mishra, Vivek Hruday Kavuri, Gargi Shroff. NEAT: Concept driven Neuron Attribution in LLMs [EB/OL]. (2025-08-21) [2025-09-06]. https://arxiv.org/abs/2508.15875.
