Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding

Source: arXiv
Abstract

Automated waterway environment perception is crucial for enabling unmanned surface vessels (USVs) to understand their surroundings and make informed decisions. Most existing waterway perception models focus on instance-level object perception paradigms (e.g., detection, segmentation). However, due to the complexity of waterway environments, current perception datasets and models fail to achieve global semantic understanding of waterways, which limits large-scale monitoring and structured log generation. Building on advances in vision-language models (VLMs), we leverage image captioning to introduce WaterCaption, the first captioning dataset specifically designed for waterway environments. WaterCaption focuses on fine-grained, multi-region long-text descriptions, providing a new research direction for visual geo-understanding and spatial scene cognition. Specifically, it comprises 20.2k image-text pairs with a vocabulary size of 1.8 million. Additionally, we propose Da Yu, an edge-deployable multi-modal large language model for USVs, which incorporates a novel vision-to-language projector called the Nano Transformer Adaptor (NTA). NTA balances computational efficiency with the capacity for both global and fine-grained local modeling of visual features, thereby significantly enhancing the model's ability to generate long-form textual outputs. Da Yu achieves an optimal balance between performance and efficiency, surpassing state-of-the-art models on WaterCaption and several other captioning benchmarks.

Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong

Subjects: Automation technology and equipment; Computing and computer technology; Waterway transport engineering

Runwei Guan, Ningwei Ouyang, Tianhao Xu, Shaofeng Liang, Wei Dai, Yafeng Sun, Shang Gao, Songning Lai, Shanliang Yao, Xuming Hu, Ryan Wen Liu, Yutao Yue, Hui Xiong. Da Yu: Towards USV-Based Image Captioning for Waterway Surveillance and Scene Understanding [EB/OL]. (2025-07-01) [2025-07-16]. https://arxiv.org/abs/2506.19288.
