CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation
Instance segmentation demands costly per-pixel annotations and large models. We introduce CAST, a semi-supervised knowledge distillation (SSKD) framework that compresses pretrained vision foundation models (VFMs) into compact experts using limited labeled and abundant unlabeled data. CAST unfolds in three stages: (1) domain adaptation of the VFM teacher(s) via self-training with contrastive pixel calibration; (2) distillation into a compact student via a unified multi-objective loss that couples standard supervision and pseudo-labels with our instance-aware pixel-wise contrastive term; and (3) fine-tuning on labeled data to remove residual pseudo-label bias. Central to CAST is an instance-aware pixel-wise contrastive loss that fuses mask and class scores to mine informative negatives and enforce clear inter-instance margins. By maintaining this contrastive signal across both adaptation and distillation, we align teacher and student embeddings and fully leverage unlabeled images. On Cityscapes and ADE20K, our ~11x-smaller student surpasses its adapted VFM teacher(s) by +3.4 AP (33.9 vs. 30.5) and +1.5 AP (16.7 vs. 15.2), respectively, and outperforms state-of-the-art semi-supervised approaches.
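As a concrete illustration of the idea, below is a minimal PyTorch sketch of a supervised-contrastive pixel loss in which pixels of the same instance are positives and every other sampled pixel enters the denominator weighted by its fused mask/class score, so confident pixels of other instances act as hard negatives. The function name, signature, and exact weighting scheme are illustrative assumptions, not the paper's formulation.

```python
# Hypothetical sketch of an instance-aware pixel-wise contrastive loss.
# Names and the score-weighting scheme are assumptions for illustration.
import torch
import torch.nn.functional as F

def instance_contrastive_loss(emb, instance_ids, scores, temperature=0.1):
    """SupCon-style pixel loss with score-weighted negatives.

    emb:          (N, D) pixel embeddings sampled from the feature map
    instance_ids: (N,)   instance label per sampled pixel
    scores:       (N,)   fused mask*class confidence per pixel
    """
    z = F.normalize(emb, dim=1)                    # unit-norm embeddings
    sim = z @ z.t() / temperature                  # pairwise cosine logits
    n = z.size(0)
    eye = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = (instance_ids.unsqueeze(0) == instance_ids.unsqueeze(1)) & ~eye

    # Every non-self pair enters the denominator, weighted by the fused
    # mask/class score of the contrasted pixel, so confident pixels of
    # other instances dominate as informative (hard) negatives.
    weights = scores.unsqueeze(0).expand(n, n)     # weight of pixel j for anchor i
    exp_sim = torch.exp(sim) * weights
    denom = exp_sim.masked_fill(eye, 0.0).sum(dim=1, keepdim=True)
    log_prob = sim + torch.log(weights + 1e-12) - torch.log(denom + 1e-12)

    # Average the log-probability over each anchor's positives; anchors
    # whose instance has no other sampled pixel are skipped.
    n_pos = pos_mask.sum(dim=1)
    valid = n_pos > 0
    loss = -(log_prob * pos_mask.float()).sum(dim=1)[valid] / n_pos[valid]
    return loss.mean()

# Toy usage: 8 sampled pixels, 16-dim embeddings, 3 instances.
emb = torch.randn(8, 16)
ids = torch.tensor([0, 0, 1, 1, 1, 2, 2, 0])
scores = torch.rand(8)                             # fused confidence in [0, 1]
print(instance_contrastive_loss(emb, ids, scores))
```

Because pseudo-labeled pixels carry the same embeddings and scores as labeled ones, a loss of this form can be applied uniformly during both the adaptation and distillation stages, which is how the contrastive signal is kept consistent between teacher and student.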
Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu
Computing Technology; Computer Technology
Pardis Taghavi, Tian Liu, Renjie Li, Reza Langari, Zhengzhong Tu. CAST: Contrastive Adaptation and Distillation for Semi-Supervised Instance Segmentation [EB/OL]. (2025-05-27) [2025-06-06]. https://arxiv.org/abs/2505.21904