SPKLIP: Aligning Spike Video Streams with Natural Language
Spike cameras offer unique sensing capabilities, but their sparse, asynchronous output challenges semantic understanding, especially for Spike Video-Language Alignment (Spike-VLA), where models like CLIP underperform due to modality mismatch. We introduce SPKLIP, the first architecture designed specifically for Spike-VLA. SPKLIP employs a hierarchical spike feature extractor that adaptively models multi-scale temporal dynamics in event streams, and uses spike-text contrastive learning to directly align spike video with language, enabling effective few-shot learning. A full-spiking visual encoder variant, integrating SNN components into our pipeline, demonstrates enhanced energy efficiency. Experiments show state-of-the-art performance on benchmark spike datasets and strong few-shot generalization on a newly contributed real-world dataset. SPKLIP's energy efficiency highlights its potential for neuromorphic deployment, advancing event-based multimodal research. The source code and dataset are available at [link removed for anonymity].
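For context, the spike-text contrastive learning mentioned in the abstract is of the CLIP family of symmetric contrastive objectives. Below is a minimal PyTorch sketch of such a loss; the function name, temperature value, and embedding shapes are illustrative assumptions and do not reflect the authors' released implementation.

import torch
import torch.nn.functional as F

def spike_text_contrastive_loss(spike_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE-style loss between spike-video and text embeddings.

    spike_emb, text_emb: (batch, dim) tensors produced by the respective encoders.
    Matching spike/text pairs are assumed to share the same batch index.
    """
    spike_emb = F.normalize(spike_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = spike_emb @ text_emb.t() / temperature            # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_s2t = F.cross_entropy(logits, targets)                # spike-to-text direction
    loss_t2s = F.cross_entropy(logits.t(), targets)            # text-to-spike direction
    return 0.5 * (loss_s2t + loss_t2s)

# Usage with random embeddings standing in for encoder outputs:
# loss = spike_text_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))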
Yongchang Gao, Meiling Jin, Zhaofei Yu, Tiejun Huang, Guozhang Chen
Computing Technology, Computer Technology
Yongchang Gao, Meiling Jin, Zhaofei Yu, Tiejun Huang, Guozhang Chen. SPKLIP: Aligning Spike Video Streams with Natural Language [EB/OL]. (2025-05-18) [2025-06-17]. https://arxiv.org/abs/2505.12656