
RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation

Source: arXiv
Abstract

Creating recipe images is a key challenge in food computing, with applications in culinary education and multimodal recipe assistants. However, existing datasets lack fine-grained alignment between recipe goals, step-wise instructions, and visual content. We present RecipeGen, the first large-scale, real-world benchmark for recipe-based Text-to-Image (T2I), Image-to-Video (I2V), and Text-to-Video (T2V) generation. RecipeGen contains 26,453 recipes, 196,724 images, and 4,491 videos, covering diverse ingredients, cooking procedures, styles, and dish types. We further propose domain-specific evaluation metrics to assess ingredient fidelity and interaction modeling, benchmark representative T2I, I2V, and T2V models, and provide insights for future recipe generation models. The project page is now available.

Ruoxuan Zhang, Jidong Gao, Bin Wen, Hongxia Xie, Chenming Zhang, Hong-Han Shuai, Wen-Huang Cheng

Subject: Food industry

Ruoxuan Zhang, Jidong Gao, Bin Wen, Hongxia Xie, Chenming Zhang, Hong-Han Shuai, Wen-Huang Cheng. RecipeGen: A Step-Aligned Multimodal Benchmark for Real-World Recipe Generation [EB/OL]. (2025-06-07) [2025-07-23]. https://arxiv.org/abs/2506.06733.
