A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering

Abstract

This paper describes the solutions of the Dianping-Trust-Safety team for the META CRAG-MM challenge. The challenge requires building a comprehensive retrieval-augmented generation system capable of multi-modal, multi-turn question answering. The competition consists of three tasks: (1) answering questions using structured data retrieved from an image-based mock knowledge graph, (2) synthesizing information from both knowledge graphs and web search results, and (3) handling multi-turn conversations that require context understanding and information aggregation from multiple sources. For Task 1, our solution is based on a vision large language model, enhanced by supervised fine-tuning with knowledge distilled from GPT-4.1. We further applied curriculum learning strategies to guide reinforcement learning, which improved answer accuracy and reduced hallucination. For Task 2 and Task 3, we additionally leveraged web search APIs to incorporate external knowledge, enabling the system to better handle complex queries and multi-turn conversations. Our approach achieved 1st place in Task 1 with a significant lead of 52.38%, and 3rd place in Task 3, demonstrating the effectiveness of integrating curriculum learning with reinforcement learning in our training pipeline.

Authors: Chenliang Zhang, Lin Wang, Yuanyuan Lu, Yusheng Qi, Kexin Wang, Peixu Hou, Wenshi Chen

Subject classification: computing technology, computer technology

Recommended citation: Chenliang Zhang, Lin Wang, Yuanyuan Lu, Yusheng Qi, Kexin Wang, Peixu Hou, Wenshi Chen. A Curriculum Learning Approach to Reinforcement Learning: Leveraging RAG for Multimodal Question Answering [EB/OL]. (2025-08-14) [2025-08-24]. https://arxiv.org/abs/2508.10337.
