A Cross-Modal Retrieval Algorithm Based on a Multi-Head Attention Mechanism with Horizontally Grouped Queries
With the rapid growth of multimedia data, cross-modal retrieval technology has become increasingly important, yet it faces numerous challenges such as complex data processing and low training efficiency. This research focuses on a horizontally grouped query attention mechanism for cross-modal retrieval, aiming to improve both training efficiency and retrieval accuracy. The model uses Fast-RCNN to extract image features and BERT to process text; cosine similarity between the two yields a relevance matrix, and a pooling strategy is applied to score the relevance between images and sentences. The core horizontally grouped attention mechanism halves the key-head and value-head matrices before they enter the linear layer, computes attention within groups, and then fuses the results, reducing computation and memory requirements and improving computational efficiency. Residual connections, normalization, and a feed-forward network keep the model running stably and efficiently. Experiments on the Flickr30K dataset show that, although image retrieval accuracy decreases slightly, text retrieval performance improves: average accuracy rises by 0.8\% and training speed by 14.3\%.
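The halving-and-grouping step described in the abstract resembles grouped-query attention, in which several query heads share a single key/value head. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the head counts, shapes, and projection matrices (`wq`, `wk`, `wv`) are illustrative assumptions, with the key/value heads halved relative to the query heads as the abstract describes.

```python
import numpy as np

def grouped_query_attention(x, wq, wk, wv, num_q_heads, num_kv_heads):
    """Grouped-query attention sketch: queries keep num_q_heads heads,
    while keys/values use only num_kv_heads heads (here half as many),
    each shared by a contiguous group of query heads."""
    seq, d_model = x.shape
    dh = d_model // num_q_heads            # per-head dimension
    group = num_q_heads // num_kv_heads    # query heads per KV head

    # Project, then split into heads: K/V have fewer heads than Q.
    q = (x @ wq).reshape(seq, num_q_heads, dh)
    k = (x @ wk).reshape(seq, num_kv_heads, dh)
    v = (x @ wv).reshape(seq, num_kv_heads, dh)

    out = np.empty_like(q)
    for h in range(num_q_heads):
        kv = h // group                    # KV head shared by this group
        scores = q[:, h] @ k[:, kv].T / np.sqrt(dh)
        scores -= scores.max(axis=-1, keepdims=True)  # stable softmax
        w = np.exp(scores)
        w /= w.sum(axis=-1, keepdims=True)
        out[:, h] = w @ v[:, kv]
    # Fuse the per-group results back into one representation.
    return out.reshape(seq, d_model)

rng = np.random.default_rng(0)
d_model, heads, kv_heads, seq = 64, 8, 4, 5   # KV heads halved vs. query heads
x = rng.standard_normal((seq, d_model))
wq = rng.standard_normal((d_model, d_model))
wk = rng.standard_normal((d_model, d_model * kv_heads // heads))
wv = rng.standard_normal((d_model, d_model * kv_heads // heads))
out = grouped_query_attention(x, wq, wk, wv, heads, kv_heads)
print(out.shape)  # (5, 64)
```

Because the key/value projections are half-width, the K/V matrices and the associated memory traffic shrink accordingly, which is the source of the training-speed gain the abstract reports.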
Zhang Hanyu, Ou Zhonghong
Computing technology; Computer technology
Pretrained models; Cross-modal retrieval; Deep learning; Self-attention mechanism; Image-text detection
Zhang Hanyu, Ou Zhonghong. A Cross-Modal Retrieval Algorithm Based on a Multi-Head Attention Mechanism with Horizontally Grouped Queries [EB/OL]. (2025-01-21) [2025-08-02]. http://www.paper.edu.cn/releasepaper/content/202501-35.