
LLaVA-Video: Video Instruction Tuning With Synthetic Data

Source: arXiv

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
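
As a rough illustration of the three task types named in the abstract, the sketch below shows what individual video instruction-following records might look like. The field names, placeholder `<video>` token, and overall structure are assumptions made for illustration only, not the actual LLaVA-Video-178K schema.

```python
# Hypothetical examples of the three task types described in the abstract:
# detailed captioning, open-ended QA, and multiple-choice QA.
# Field names and structure are illustrative assumptions, not the released schema.

detailed_caption = {
    "video": "example_clip.mp4",  # assumed video path field
    "task": "detailed_captioning",
    "conversations": [
        {"role": "user", "content": "<video>\nDescribe this video in detail."},
        {"role": "assistant", "content": "A person walks into a kitchen, fills a kettle, ..."},
    ],
}

open_ended_qa = {
    "video": "example_clip.mp4",
    "task": "open_ended_qa",
    "conversations": [
        {"role": "user", "content": "<video>\nWhat does the person pick up first?"},
        {"role": "assistant", "content": "A red mug from the counter."},
    ],
}

multiple_choice_qa = {
    "video": "example_clip.mp4",
    "task": "multiple_choice_qa",
    "conversations": [
        {"role": "user",
         "content": "<video>\nWhat happens last?\nA. The door closes\nB. The light turns off\nC. The person sits down"},
        {"role": "assistant", "content": "C"},
    ],
}
```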

Jinming Wu, Wei Li, Yuanhan Zhang, Ziwei Liu, Chunyuan Li, Bo Li, Zejun Ma

Subjects: Computing Technology, Computer Technology

Jinming Wu, Wei Li, Yuanhan Zhang, Ziwei Liu, Chunyuan Li, Bo Li, Zejun Ma. LLaVA-Video: Video Instruction Tuning With Synthetic Data [EB/OL]. (2025-08-01) [2025-08-18]. https://arxiv.org/abs/2410.02713.
