Image Captioning via Compact Bidirectional Architecture

Source: arXiv
Abstract

Most current image captioning models generate captions from left to right. Because of this unidirectional property, they can leverage only past context, not future context. Refinement-based models can exploit both past and future context by generating a new caption in a second stage based on captions pre-retrieved or pre-generated in a first stage, but their decoder generally consists of two networks (i.e., a retriever or captioner in the first stage and a captioner in the second stage) that can only be executed sequentially. In this paper, we introduce a Compact Bidirectional Transformer model for image captioning that can leverage bidirectional context both implicitly and explicitly while its decoder can be executed in parallel. Specifically, it tightly couples left-to-right (L2R) and right-to-left (R2L) flows into a single compact model, which serves as a regularization for implicitly exploiting bidirectional context and optionally allows explicit interaction between the two flows; the final caption is chosen from either the L2R or the R2L flow in a sentence-level ensemble manner. We conduct extensive ablation studies on the MSCOCO benchmark and find that the compact bidirectional architecture and the sentence-level ensemble play more important roles than the explicit interaction mechanism. Combining seamlessly with a word-level ensemble further enlarges the effect of the sentence-level ensemble. We further extend the conventional one-flow self-critical training to a two-flow version under this architecture and achieve new state-of-the-art results among models that do not use vision-language pretraining. Finally, we verify the generality of the compact bidirectional architecture by extending it to an LSTM backbone. Source code is available at https://github.com/YuanEZhou/cbtic.
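As a rough illustration of the sentence-level ensemble mentioned in the abstract, the sketch below picks one caption from an L2R candidate and an R2L candidate. The abstract does not state the selection criterion, so scoring by (optionally length-normalized) summed token log-probability is an assumption here, and the names `sequence_score` and `sentence_level_ensemble` are hypothetical and not taken from the authors' repository.

```python
from typing import List, Tuple

# A candidate is (tokens in generation order, per-token log-probabilities).
Candidate = Tuple[List[str], List[float]]


def sequence_score(log_probs: List[float], length_normalize: bool = True) -> float:
    """Score a caption by its summed log-probability (assumed criterion)."""
    total = sum(log_probs)
    if length_normalize and log_probs:
        return total / len(log_probs)
    return total


def sentence_level_ensemble(
    l2r: Candidate, r2l: Candidate, length_normalize: bool = True
) -> List[str]:
    """Return the higher-scoring caption, always in left-to-right reading order."""
    l2r_tokens, l2r_log_probs = l2r
    r2l_tokens, r2l_log_probs = r2l
    # The R2L flow emits tokens last-word-first, so reverse it for reading.
    r2l_readable = list(reversed(r2l_tokens))
    l2r_score = sequence_score(l2r_log_probs, length_normalize)
    r2l_score = sequence_score(r2l_log_probs, length_normalize)
    return list(l2r_tokens) if l2r_score >= r2l_score else r2l_readable


if __name__ == "__main__":
    l2r_candidate = (["a", "dog", "runs"], [-0.5, -0.5, -0.5])
    r2l_candidate = (["grass", "on", "dog", "a"], [-0.1, -0.1, -0.1, -0.1])
    print(" ".join(sentence_level_ensemble(l2r_candidate, r2l_candidate)))
```

Because both flows come from one model, the two candidates can be decoded in the same batch and only this final comparison is sequential; the word-level ensemble and the explicit interaction mechanism are not covered by this sketch.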

Huixia Ben, Meng Wang, Richang Hong, Zijie Song, Daqing Liu, Yuanen Zhou, Zhenzhen Hu

Computing technology, computer technology

Huixia Ben, Meng Wang, Richang Hong, Zijie Song, Daqing Liu, Yuanen Zhou, Zhenzhen Hu. Image Captioning via Compact Bidirectional Architecture [EB/OL]. (2025-07-29) [2025-08-11]. https://arxiv.org/abs/2201.01984.