Holes in Latent Space: Topological Signatures Under Adversarial Influence
Understanding how adversarial conditions affect language models requires techniques that capture both global structure and local detail within high-dimensional activation spaces. We propose using persistent homology (PH), a tool from topological data analysis, to systematically characterize multiscale latent space dynamics in LLMs under two distinct attack modes: backdoor fine-tuning and indirect prompt injection. By analyzing six state-of-the-art LLMs, we show that adversarial conditions consistently compress latent topologies, reducing structural diversity at smaller scales while amplifying dominant features at coarser ones. These topological signatures are statistically robust across layers, architectures, and model sizes, and they align with the emergence of adversarial effects deeper in the network. To capture finer-grained mechanisms underlying these shifts, we introduce a neuron-level PH framework that quantifies how information flows and transforms within and across layers. Together, our findings demonstrate that PH offers a principled and unifying approach to interpreting representational dynamics in LLMs, particularly under distributional shift.
Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod
Computing technology, computer technology
Aideen Fay, Inés García-Redondo, Qiquan Wang, Haim Dubossarsky, Anthea Monod. Holes in Latent Space: Topological Signatures Under Adversarial Influence [EB/OL]. (2025-05-26) [2025-06-15]. https://arxiv.org/abs/2505.20435.