
LLaVA-Video: Video Instruction Tuning With Synthetic Data

Source: arXiv

Abstract

The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
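
As a rough illustration of the three task types named in the abstract, the sketch below shows what individual video instruction-following records might look like. The field names, placeholder `<video>` token, and overall structure are assumptions made for illustration only, not the actual LLaVA-Video-178K schema.

```python
# Hypothetical examples of the three task types described in the abstract:
# detailed captioning, open-ended QA, and multiple-choice QA.
# Field names and structure are illustrative assumptions, not the released schema.

detailed_caption = {
    "video": "example_clip.mp4",  # assumed video path field
    "task": "detailed_captioning",
    "conversations": [
        {"role": "user", "content": "<video>\nDescribe this video in detail."},
        {"role": "assistant", "content": "A person walks into a kitchen, fills a kettle, ..."},
    ],
}

open_ended_qa = {
    "video": "example_clip.mp4",
    "task": "open_ended_qa",
    "conversations": [
        {"role": "user", "content": "<video>\nWhat does the person pick up first?"},
        {"role": "assistant", "content": "A red mug from the counter."},
    ],
}

multiple_choice_qa = {
    "video": "example_clip.mp4",
    "task": "multiple_choice_qa",
    "conversations": [
        {"role": "user",
         "content": "<video>\nWhat happens last?\nA. The door closes\nB. The light turns off\nC. The person sits down"},
        {"role": "assistant", "content": "C"},
    ],
}
```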

Jinming Wu, Wei Li, Yuanhan Zhang, Ziwei Liu, Chunyuan Li, Bo Li, Zejun Ma

Subjects: Computing Technology, Computer Technology

Jinming Wu, Wei Li, Yuanhan Zhang, Ziwei Liu, Chunyuan Li, Bo Li, Zejun Ma. LLaVA-Video: Video Instruction Tuning With Synthetic Data [EB/OL]. (2025-08-01) [2025-08-18]. https://arxiv.org/abs/2410.02713.
