SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning

Abstract

Image captioning has emerged as a crucial task at the intersection of computer vision and natural language processing, enabling the automated generation of descriptive text from visual content. In remote sensing, image captioning plays a significant role in interpreting vast and complex satellite imagery, aiding applications such as environmental monitoring, disaster assessment, and urban planning. This motivates us to present, in this paper, a transformer-based network architecture for remote sensing image captioning (RSIC) in which three techniques, Static Expansion, Memory-Augmented Self-Attention, and Mesh Transformer, are evaluated and integrated. We evaluate our proposed models on two benchmark remote sensing image datasets, UCM-Caption and NWPU-Caption. Our best model outperforms state-of-the-art systems on most evaluation metrics, which demonstrates its potential for use in real-world remote sensing image systems.

Authors: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen

Subject classification: Remote sensing technology

Recommended citation: Khang Truong, Lam Pham, Hieu Tang, Jasmin Lampert, Martin Boyer, Son Phan, Truong Nguyen. SEMT: Static-Expansion-Mesh Transformer Network Architecture for Remote Sensing Image Captioning[EB/OL]. (2025-07-17)[2025-08-10]. https://arxiv.org/abs/2507.12845.
