VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs?

Abstract

Visual storytelling is an interdisciplinary field combining computer vision and natural language processing to generate cohesive narratives from sequences of images. This paper presents a novel approach that leverages recent advancements in multimodal models, specifically adapting transformer-based architectures and large multimodal models, for the visual storytelling task. Leveraging the large-scale Visual Storytelling (VIST) dataset, our VIST-GPT model produces visually grounded, contextually appropriate narratives. We address the limitations of traditional evaluation metrics, such as BLEU, METEOR, ROUGE, and CIDEr, which are not suitable for this task. Instead, we utilize RoViST and GROOVIST, novel reference-free metrics designed to assess visual storytelling, focusing on visual grounding, coherence, and non-redundancy. These metrics provide a more nuanced evaluation of narrative quality, aligning closely with human judgment.

Authors: Mohamed Gado, Towhid Taliee, Muhammad Memon, Dmitry Ignatov, Radu Timofte

Subject classification: computing technology, computer technology

Recommended citation: Mohamed Gado, Towhid Taliee, Muhammad Memon, Dmitry Ignatov, Radu Timofte. VIST-GPT: Ushering in the Era of Visual Storytelling with LLMs? [EB/OL]. (2025-04-27) [2025-05-28]. https://arxiv.org/abs/2504.19267.
