Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space
Motion retrieval is crucial for motion acquisition, offering superior precision, realism, controllability, and editability compared to motion generation. Existing approaches leverage contrastive learning to construct a unified embedding space for motion retrieval from text or visual modalities. However, these methods lack an intuitive and user-friendly interaction mode and often overlook the sequential representation of most modalities, which could improve retrieval performance. To address these limitations, we propose a framework that aligns four modalities -- text, audio, video, and motion -- within a fine-grained joint embedding space, incorporating audio into motion retrieval for the first time to enhance user immersion and convenience. This fine-grained space is achieved through a sequence-level contrastive learning approach, which captures critical details across modalities for better alignment. To evaluate our framework, we augment existing text-motion datasets with synthetic but diverse audio recordings, creating two multi-modal motion retrieval datasets. Experimental results demonstrate superior performance over state-of-the-art methods across multiple sub-tasks, including a 10.16% improvement in R@10 for text-to-motion retrieval and a 25.43% improvement in R@1 for video-to-motion retrieval on the HumanML3D dataset. Furthermore, our results show that our 4-modal framework significantly outperforms its 3-modal counterpart, underscoring the potential of multi-modal motion retrieval for advancing motion acquisition.
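For intuition, the sketch below shows one common way a contrastive objective over sequence-level embeddings of two modalities (e.g., text and motion) can be written. It is a minimal illustration under stated assumptions, not the paper's implementation: the masked mean pooling, the temperature value, and the function name `sequence_infonce` are all placeholders chosen for clarity.

```python
# Minimal sketch of a symmetric InfoNCE-style contrastive loss between two
# modalities, computed over sequence-level embeddings. Illustrative only;
# the pooling strategy, projection, and temperature are assumptions, not the
# paper's fine-grained alignment scheme.
import torch
import torch.nn.functional as F


def sequence_infonce(seq_a: torch.Tensor,
                     seq_b: torch.Tensor,
                     mask_a: torch.Tensor,
                     mask_b: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """seq_a: (B, La, D) per-step embeddings of modality A (e.g., text tokens).
    seq_b: (B, Lb, D) per-step embeddings of modality B (e.g., motion frames).
    mask_*: (B, L*) boolean masks marking valid steps (True = valid).
    Returns a scalar symmetric contrastive loss over the batch."""

    def pool(x: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # Masked mean pooling to obtain one embedding per sequence.
        mask = mask.unsqueeze(-1).float()
        return (x * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

    za = F.normalize(pool(seq_a, mask_a), dim=-1)  # (B, D)
    zb = F.normalize(pool(seq_b, mask_b), dim=-1)  # (B, D)

    # Cosine similarity between every A/B pair in the batch.
    logits = za @ zb.t() / temperature              # (B, B)
    targets = torch.arange(za.size(0), device=za.device)

    # Symmetric InfoNCE: A-to-B and B-to-A retrieval directions.
    loss_ab = F.cross_entropy(logits, targets)
    loss_ba = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_ab + loss_ba)
```

In a four-modality setting such as the one described above, a pairwise loss of this kind would typically be applied across modality pairs (text-motion, audio-motion, video-motion, etc.); how the paper weights or refines these terms at a finer sequence granularity is not specified here.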
Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou
Computing Technology, Computer Technology
Shiyao Yu, Zi-An Wang, Kangning Yin, Zheng Tian, Mingyuan Zhang, Weixin Si, Shihao Zou. Multi-Modal Motion Retrieval by Learning a Fine-Grained Joint Embedding Space [EB/OL]. (2025-07-31) [2025-08-07]. https://arxiv.org/abs/2507.23188.