Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling

Abstract

Social platforms have revolutionized information sharing, but also accelerated the dissemination of harmful and policy-violating content. To ensure safety and compliance at scale, moderation systems must go beyond efficiency and offer accuracy and interpretability. However, current approaches largely rely on noisy, label-driven learning, lacking alignment with moderation rules and producing opaque decisions that hinder human review. We therefore propose Hierarchical Guard (Hi-Guard), a multimodal moderation framework that introduces a new policy-aligned decision paradigm. The term "Hierarchical" reflects two key aspects of our system design: (1) a hierarchical moderation pipeline, where a lightweight binary model first filters safe content and a stronger model handles fine-grained risk classification; and (2) a hierarchical taxonomy in the second stage, over which the model performs path-based classification from coarse to fine-grained levels. To ensure alignment with evolving moderation policies, Hi-Guard directly incorporates rule definitions into the model prompt. To further enhance structured prediction and reasoning, we introduce a multi-level soft-margin reward and optimize with Group Relative Policy Optimization (GRPO), penalizing semantically adjacent misclassifications and improving explanation quality. Extensive experiments and real-world deployment demonstrate that Hi-Guard achieves superior classification accuracy, generalization, and interpretability, paving the way toward scalable, transparent, and trustworthy content safety systems. Code is available at: https://github.com/lianqi1008/Hi-Guard.
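
To make the idea of a level-wise reward over taxonomy paths more concrete, the following is a minimal sketch and not the authors' implementation. It assumes each label is a root-to-leaf path, that a wrong choice among siblings under a correct parent is penalized in proportion to that level's weight, and that rewards are then normalized within a sampled group in the style of GRPO. The function names, the toy label names, the level weights, and the penalty factor are all hypothetical; the actual reward in Hi-Guard may differ.

```python
from statistics import mean, pstdev


def path_reward(pred, gold, level_weights=(0.2, 0.3, 0.5), sibling_penalty=0.5):
    """Score a predicted taxonomy path against the gold path, level by level.

    Hypothetical scheme: a correct level earns its weight. At the first wrong
    level (the parent was still correct, so this is a sibling confusion), a
    fraction of that level's weight is subtracted and scoring stops.
    """
    reward = 0.0
    for weight, p, g in zip(level_weights, pred, gold):
        if p == g:
            reward += weight
        else:
            reward -= sibling_penalty * weight
            break
    return reward


def group_relative_advantages(rewards):
    """Normalize rewards within one sampled group: (r - mean) / std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        # All samples scored the same, so none is preferred over another.
        return [0.0 for _ in rewards]
    return [(r - mu) / sigma for r in rewards]


# Toy example with made-up labels: one gold path and four sampled predictions.
gold = ("risky", "violence", "weapons")
samples = [
    ("risky", "violence", "weapons"),  # fully correct
    ("risky", "violence", "gore"),     # wrong only at the leaf
    ("risky", "fraud", "phishing"),    # wrong at the middle level
    ("safe", "none", "none"),          # wrong at the root
]
rewards = [path_reward(s, gold) for s in samples]
advantages = group_relative_advantages(rewards)
```

Under these assumed weights, predictions that stay on the gold path longer receive higher rewards, so the group-normalized advantage favors the fully correct sample and ranks near-miss paths above predictions that diverge at the root.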

Authors: Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu

Subject classification: Computing technology, computer technology

Recommended citation: Anqi Li, Wenwei Jin, Jintao Tong, Pengda Qin, Weijia Li, Guo Lu. Towards Trustworthy Multimodal Moderation via Policy-Aligned Reasoning and Hierarchical Labeling [EB/OL]. (2025-08-05) [2025-08-16]. https://arxiv.org/abs/2508.03296.
