
VIBE: Video-Input Brain Encoder for fMRI Response Modeling


Source: arXiv
Abstract

We present VIBE, a two-stage Transformer that fuses multi-modal video, audio, and text features to predict fMRI activity. Representations from open-source models (Qwen2.5, BEATs, Whisper, SlowFast, V-JEPA) are merged by a modality-fusion transformer and temporally decoded by a prediction transformer with rotary embeddings. Trained on 65 hours of movie data from the CNeuroMod dataset and ensembled across 20 seeds, VIBE attains mean parcel-wise Pearson correlations of 0.3225 on in-distribution Friends S07 and 0.2125 on six out-of-distribution films. An earlier iteration of the same architecture obtained 0.3198 and 0.2096, respectively, winning Phase-1 and placing second overall in the Algonauts 2025 Challenge.
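The reported scores are mean parcel-wise Pearson correlations between predicted and measured fMRI time series. As a minimal stdlib-only sketch of that metric (function names and the per-parcel list layout are illustrative assumptions, not the authors' code):

```python
from math import sqrt
from statistics import fmean

def pearson(xs, ys):
    """Pearson correlation between two equal-length time series."""
    mx, my = fmean(xs), fmean(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den

def mean_parcelwise_pearson(pred, target):
    """Average the correlation over parcels.

    pred, target: per-parcel time series, shape [parcels][time].
    """
    return fmean(pearson(p, t) for p, t in zip(pred, target))
```

A perfect prediction scores 1.0 per parcel; the 0.3225 figure above is this average computed over all parcels of the in-distribution test set.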

Daniel Carlström Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski

Subject: Computing Technology, Computer Technology

Daniel Carlström Schad, Shrey Dixit, Janis Keck, Viktor Studenyak, Aleksandr Shpilevoi, Andrej Bicanski. VIBE: Video-Input Brain Encoder for fMRI Response Modeling [EB/OL]. (2025-07-25) [2025-08-10]. https://arxiv.org/abs/2507.17958.
