
When Vision-Language Model (VLM) Meets Beam Prediction: A Multimodal Contrastive Learning Framework

Source: arXiv

Abstract

As real propagation environments become increasingly complex and dynamic, millimeter-wave beam prediction faces significant challenges. Traditional methods that rely on real-time channel state information (CSI) are computationally expensive and often fail to maintain accuracy in such environments, whereas the powerful cross-modal representation capability of vision-language models (VLMs) offers a promising alternative. In this paper, we present a VLM-driven, contrastive-learning-based multimodal beam prediction framework that integrates multimodal data via modality-specific encoders. To enforce cross-modal consistency, we adopt a contrastive pretraining strategy that aligns image and LiDAR features in the latent space. We further use location information as text prompts fed to the text encoder, introducing the language modality and further improving cross-modal consistency. Experiments on the DeepSense-6G dataset show that the VLM backbone provides additional semantic grounding: compared with existing methods, our framework achieves an overall distance-based accuracy score (DBA-Score) of 0.9016, corresponding to a 1.46% average improvement.
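As an illustration of the contrastive pretraining step described in the abstract, the sketch below aligns image and LiDAR features in a shared latent space with a symmetric CLIP-style (InfoNCE) objective. It is a minimal sketch under assumed settings: the class name, feature dimensions, projection heads, and temperature are illustrative choices, not the authors' actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAligner(nn.Module):
    """Hypothetical sketch: project image and LiDAR features into a shared
    latent space and align paired samples with a symmetric InfoNCE loss."""

    def __init__(self, img_dim=512, lidar_dim=256, embed_dim=128, temperature=0.07):
        super().__init__()
        self.img_proj = nn.Linear(img_dim, embed_dim)      # head on top of an image encoder
        self.lidar_proj = nn.Linear(lidar_dim, embed_dim)  # head on top of a LiDAR encoder
        self.temperature = temperature

    def forward(self, img_feat, lidar_feat):
        # L2-normalize so that dot products are cosine similarities
        z_img = F.normalize(self.img_proj(img_feat), dim=-1)
        z_lidar = F.normalize(self.lidar_proj(lidar_feat), dim=-1)
        logits = z_img @ z_lidar.t() / self.temperature    # (B, B) similarity matrix
        targets = torch.arange(logits.size(0), device=logits.device)
        # matched image/LiDAR pairs lie on the diagonal; average both directions
        loss = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))
        return loss


if __name__ == "__main__":
    # Usage with random placeholder features for a batch of 8 paired samples
    aligner = CrossModalAligner()
    img_feat = torch.randn(8, 512)     # e.g. output of a vision encoder
    lidar_feat = torch.randn(8, 256)   # e.g. output of a point-cloud encoder
    print(aligner(img_feat, lidar_feat).item())
```

In the paper's framework, location information additionally enters as text prompts through a text encoder; the same contrastive scheme could be extended to that third modality, but this sketch covers only the image-LiDAR alignment named in the abstract.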

Ji Wang, Bin Tang, Jian Xiao, Qimei Cui, Xingwang Li, Tony Q. S. Quek

Wireless Communication

Ji Wang, Bin Tang, Jian Xiao, Qimei Cui, Xingwang Li, Tony Q. S. Quek. When Vision-Language Model (VLM) Meets Beam Prediction: A Multimodal Contrastive Learning Framework [EB/OL]. (2025-08-01) [2025-08-11]. https://arxiv.org/abs/2508.00456
