National Preprint Platform

A Multimodal Implicit Harmful Content Detection Method Based on Rule Rewards and Group Policy Optimization

Rule-Guided GRPO for Multimodal Implicit Harmful Content Detection

Dong Xinqi¹, Zhang Miao¹

Author information

  • 1. School of Cyberspace Security, Beijing University of Posts and Telecommunications, Beijing 100876, China

Abstract

The rapid development of generative artificial intelligence has brought a large amount of implicit harmful content into cyberspace. Such content is concealed, semantically ambiguous, and strongly context-dependent, which poses new challenges for traditional content moderation. Existing supervised learning methods struggle to learn complex moderation rules, while general-purpose multimodal large models often show insufficient rule alignment and weak recognition of borderline cases in professional moderation scenarios. To address these problems, this paper proposes a multimodal implicit harmful content detection method based on rule rewards and Dynamic Group Relative Policy Optimization (D-GRPO). The method converts content moderation rules into multidimensional reward signals, including accuracy, format, and explanation rewards, and uses the D-GRPO algorithm to align the model efficiently with moderation specifications. A group-relative advantage optimization mechanism is used to improve training stability. Experiments are conducted on a fine-grained implicit harmful content dataset constructed for this purpose. The results show that, compared with supervised fine-tuning models and general-purpose multimodal large models, the proposed method achieves better Precision, Recall, and F1-score, and is more robust and interpretable in borderline case recognition and moderation explanation generation, offering an effective solution for multimodal content moderation tasks.
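
The abstract names two mechanisms: rule-derived rewards (accuracy, format, explanation) and group-relative advantages. The following is a minimal Python sketch of how such a pipeline could look, not the authors' implementation. Several details are assumptions made only for illustration: the `<explain>`/`<label>` output format, the `WEIGHTS` values, and the keyword check used as a stand-in for an explanation reward. The abstract does not describe the "dynamic" part of D-GRPO, so the sketch uses the standard GRPO normalization (reward minus group mean, divided by group standard deviation).

```python
import re
from statistics import mean, pstdev

# Hypothetical response format (an assumption, not taken from the paper):
#   <explain>free-text rationale</explain><label>harmful|benign</label>
RESPONSE_PATTERN = re.compile(
    r"^\s*<explain>(?P<explain>.*?)</explain>\s*"
    r"<label>(?P<label>harmful|benign)</label>\s*$",
    re.DOTALL,
)

# Hypothetical weights for the three reward dimensions named in the abstract.
WEIGHTS = {"accuracy": 1.0, "format": 0.5, "explanation": 0.5}


def rule_reward(response: str, gold_label: str, rule_keywords: list[str]) -> float:
    """Score one sampled response against simple, checkable rules."""
    match = RESPONSE_PATTERN.match(response)
    if match is None:
        # Unparseable output earns nothing on any dimension.
        return 0.0
    format_r = 1.0
    accuracy_r = 1.0 if match["label"] == gold_label else 0.0
    # Crude proxy for explanation quality: does the rationale mention
    # at least one term from the moderation rule that applies?
    explanation = match["explain"].lower()
    explanation_r = 1.0 if any(k.lower() in explanation for k in rule_keywords) else 0.0
    return (
        WEIGHTS["accuracy"] * accuracy_r
        + WEIGHTS["format"] * format_r
        + WEIGHTS["explanation"] * explanation_r
    )


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standard GRPO-style advantage: normalize rewards within one sampled group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One prompt, a group of four sampled responses (made-up examples).
group = [
    "<explain>The caption pairs a slur with a mocking image.</explain><label>harmful</label>",
    "<explain>Looks fine to me.</explain><label>harmful</label>",
    "<explain>Looks fine to me.</explain><label>benign</label>",
    "harmful",
]
rewards = [rule_reward(r, gold_label="harmful", rule_keywords=["slur"]) for r in group]
advantages = group_relative_advantages(rewards)
```

Because each advantage is measured relative to the other responses sampled for the same prompt, no separate value network is needed, and a group whose responses all earn the same reward contributes zero advantage.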

Key words

Cyberspace Security/Large Language Models/Reinforcement Learning Fine-tuning

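Cite this article

Dong Xinqi, Zhang Miao. A Multimodal Implicit Harmful Content Detection Method Based on Rule Rewards and Group Policy Optimization [EB/OL]. (2026-06-23)[2026-06-25]. http://www.paper.edu.cn/releasepaper/content/202606-65.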

Subject classification

Computing technology, computer technology
First published: 2026-06-23