Masked Vision-Language Transformer in Fashion
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply replace the BERT component of the pre-training model with a vision transformer architecture, making MVLT the first end-to-end framework for the fashion domain. In addition, we design a masked image reconstruction (MIR) task for fine-grained fashion understanding. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT easily generalizes to various matching and generative tasks. Experimental results show clear improvements over the Fashion-Gen 2018 winner Kaleido-BERT on retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks. Code is made available at https://github.com/GewelsJI/MVLT.
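The masked image reconstruction (MIR) objective mentioned above can be illustrated with a minimal sketch: an image is split into patches, a random subset is masked, and a reconstruction loss is computed only on the masked patches. This is a generic masked-image-modeling sketch under assumed patch size and mask ratio, not the paper's actual implementation; `patchify` and `mir_loss` are hypothetical helper names.

```python
import numpy as np

def patchify(img, p):
    """Split an (H, W, C) image into non-overlapping p x p patches,
    each flattened to a vector of length p*p*C."""
    H, W, C = img.shape
    assert H % p == 0 and W % p == 0
    patches = img.reshape(H // p, p, W // p, p, C)
    return patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)

def mir_loss(pred, target, mask):
    """Mean squared error averaged over masked patches only
    (mask[i] == 1 means patch i was masked), as in typical
    masked-image-modeling objectives."""
    per_patch = ((pred - target) ** 2).mean(axis=1)
    return (per_patch * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
img = rng.standard_normal((32, 32, 3))     # toy image
patches = patchify(img, 8)                 # (16, 192)
mask = np.zeros(len(patches))
mask[rng.choice(len(patches), size=4, replace=False)] = 1.0  # mask 25%
pred = np.zeros_like(patches)              # stand-in for decoder output
loss = mir_loss(pred, patches, mask)
```

In the full model, `pred` would come from the transformer decoding the masked tokens; here a zero tensor simply makes the loss computation concrete.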
Ge-Peng Ji, Luc Van Gool, Deng-Ping Fan, Dehong Gao, Mingcheng Zhuge, Christos Sakaridis
Garment industry; footwear industry computing technology; computer technology
Ge-Peng Ji, Luc Van Gool, Deng-Ping Fan, Dehong Gao, Mingcheng Zhuge, Christos Sakaridis. Masked Vision-Language Transformer in Fashion [EB/OL]. (2022-10-26) [2025-08-18]. https://arxiv.org/abs/2210.15110.