Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Visual token reduction lowers the inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike prior studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparsity of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while maintaining benchmark parity.
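To illustrate the general idea, the following is a minimal sketch (not the authors' released code) of pruning the cross-attention KV cache for image tokens by keeping only the most-attended visual features. The tensor shapes, the head-and-query averaging of attention scores, and the 50% keep ratio are assumptions made for illustration; the paper's exact selection criterion may differ.

```python
import torch


def trim_visual_kv(attn_weights, key_cache, value_cache, keep_ratio=0.5):
    """
    attn_weights: [batch, heads, text_len, num_image_tokens] cross-attention map
    key_cache / value_cache: [batch, heads, num_image_tokens, head_dim]
    Returns trimmed key/value caches containing only the top-scoring image tokens.
    """
    # Aggregate the attention mass each image token receives
    # (averaged over heads and text queries) -- an illustrative scoring rule.
    token_scores = attn_weights.mean(dim=(1, 2))            # [batch, num_image_tokens]

    num_keep = max(1, int(token_scores.shape[-1] * keep_ratio))
    keep_idx = token_scores.topk(num_keep, dim=-1).indices  # [batch, num_keep]
    keep_idx, _ = keep_idx.sort(dim=-1)                      # preserve original token order

    # Gather only the retained image tokens from the KV cache.
    idx = keep_idx[:, None, :, None].expand(
        -1, key_cache.shape[1], -1, key_cache.shape[-1]
    )
    return key_cache.gather(2, idx), value_cache.gather(2, idx)
```

Because the trimmed caches contain half as many visual entries, every subsequent cross-attention call reads and writes proportionally less KV memory, which is the source of the latency and memory savings described above.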
Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
Computing Technology, Computer Technology
Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim. Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features [EB/OL]. (2025-04-01) [2025-05-09]. https://arxiv.org/abs/2504.00557