A Survey of Multimodal Deep Learning

Source:

Abstract (translated from the Chinese)

A modality is the way in which something happens or exists, such as text, language, sound, or graphics. Multimodal learning means learning the information carried by each of several modalities and enabling the exchange and conversion of information among them. Multimodal deep learning means building neural network models that can carry out multimodal learning tasks. The ubiquity of multimodal learning and the current popularity of deep learning give multimodal deep learning vitality and room to grow. Written at an early stage of the field's development, this paper summarizes current multimodal deep learning, identifies the problems shared across implementations under different modality combinations and learning objectives, classifies those shared problems, and describes the methods used to solve each class. Specifically, drawing on multimodal learning that involves natural language, vision, and hearing, it considers research on language translation, event detection, information description, emotion recognition, sound recognition and synthesis, and multimedia retrieval. It divides the shared problems into four classes: modality representation, modality interpretation, modality fusion, and modality alignment. Each problem is further subdivided and discussed, and the neural network models developed to address it are listed. Finally, the paper discusses practical multimodal systems and the datasets and evaluation criteria commonly used in multimodal deep learning research, and looks ahead to the field's development trends.

Abstract (authors' English version)

Modality refers to the way in which something happens or is experienced, such as text, language, sound, and pictures. Multimodality is a combination of two or more modalities. Multimodal learning refers to learning the information of each modality in a multimodal setting and realizing the exchange and conversion of information among the modalities. Multimodal deep learning, in turn, is the construction of neural network models that can accomplish multimodal learning tasks. The universality of multimodal learning and the rising popularity of deep learning give multimodal deep learning its vitality. Written at an early stage of the field's development, this paper aims to summarize current multimodal deep learning, find the common problems that arise when implementing it under different modality combinations and learning objectives, classify those common problems, and describe the methods for solving each class. Specifically, the paper surveys multimodal deep learning that involves natural language, vision, and hearing, and considers research directions such as language translation, event detection, information description, emotion recognition, voice recognition and synthesis, and multimedia retrieval. It concludes that there are four types of common problems: multimodal representation, multimodal interpretation, multimodal fusion, and multimodal alignment. Each common problem is sub-categorized and discussed, and the neural network models developed to solve it are listed. Finally, the paper introduces some practical multimodal systems, lists the benchmark datasets and evaluation criteria used in multimodal deep learning, and concludes with perspectives and directions for future research.

Authors: 丁熙浩 (Ding Xihao), 刘建伟 (Liu Jianwei), 罗雄麟 (Luo Xionglin)

Author affiliations:

DOI: 10.12074/201905.00048V1

Subject classification: Computing technology, computer technology

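Keywords (translated from the Chinese): multimodal; deep learning; neural networks; modality representation; modality interpretation; modality fusion; modality alignment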

Recommended citation: Ding Xihao, Liu Jianwei, Luo Xionglin. A Survey of Multimodal Deep Learning (多模态深度学习综述) [EB/OL]. (2019-05-10) [2025-08-16]. https://chinaxiv.org/abs/201905.00048.
