Elsevier

Neurocomputing

Volume 426, 22 February 2021, Pages 195-215

New Ideas and Trends in Deep Multimodal Content Understanding: A Review

https://doi.org/10.1016/j.neucom.2020.10.042
Under a Creative Commons license
open access

Abstract

This survey focuses on two modalities in multimodal deep learning: image and text. Unlike classic reviews of deep learning, in which monomodal image classifiers such as VGG, ResNet and Inception modules are the central topics, this paper examines recent multimodal deep models and structures, including auto-encoders, generative adversarial nets and their variants. These models go beyond simple image classifiers in that they can perform uni-directional (e.g. image captioning, image generation) and bi-directional (e.g. cross-modal retrieval, visual question answering) multimodal tasks. We also analyze two aspects of the challenge of achieving better content understanding in deep multimodal applications. We then introduce current ideas and trends in deep multimodal feature learning, such as feature embedding approaches and objective function design, which are crucial for overcoming these challenges. Finally, we outline several promising directions for future research.

Keywords

Multimodal deep learning
Ideas and trends
Content understanding
Literature review

Wei Chen is a doctoral candidate at the Leiden Institute of Advanced Computer Science at Leiden University. His research focuses on cross-modal retrieval with deep learning methods. Before starting his PhD study at Leiden University, he received his Master's degree from the National University of Defense Technology (NUDT), China, in 2016. He has published papers in international conferences and journals including ICPR, ICME and TMM.

Weiping Wang received the Ph.D. degree in systems engineering from the National University of Defense Technology (NUDT), Changsha, China, where he is currently a Professor. His research interests include systems engineering and simulation. He has published more than 200 papers in journals and conferences including IEEE Transactions on Vehicular Technology, Simulation, Simulation Modeling Practice and Theory, and Software and Systems Modeling.

Li Liu received the BSc degree in communication engineering, the MSc degree in photogrammetry and remote sensing and the Ph.D. degree in information and communication engineering from the National University of Defense Technology (NUDT), China, in 2003, 2005 and 2012, respectively. She joined the faculty at NUDT in 2012, where she is currently an Associate Professor with the College of System Engineering. During her PhD study, she spent more than two years as a Visiting Student at the University of Waterloo, Canada, from 2008 to 2010. From 2015 to 2016, she spent ten months visiting the Multimedia Laboratory at the Chinese University of Hong Kong. From December 2016 to November 2018, she worked as a senior researcher in the Machine Vision Group at the University of Oulu, Finland. She was a co-chair of nine international workshops at CVPR, ICCV, and ECCV. She was a guest editor of special issues for IEEE TPAMI and IJCV. She serves as Area Chair for ACCV 2020 and ICME 2020. She currently serves as Associate Editor of the Visual Computer Journal and Pattern Recognition Letters. Her current research interests include computer vision, pattern recognition and machine learning. Her papers currently have more than 2500 citations on Google Scholar.

Michael S. Lew is head of the Deep Learning and Computer Vision Research Group and Director of the LIACS Media Lab. He received his doctorate from the University of Illinois at Urbana-Champaign and then became a postdoctoral researcher at Leiden University. One year later, he became the first Leiden University Fellow, which was a pilot program for tenure track professors. In 2003, he became a tenured associate professor at Leiden University and was invited to serve as a chair full professor in computer science at Tsinghua University (the MIT of China). He has published over 100 peer reviewed papers with three best paper citations in the areas of computer vision, content-based retrieval, and machine learning. Currently (September 2014), he has the most cited paper in the history of the ACM Transactions on Multimedia. In addition, he has the most cited paper from the ACM International Conference on Multimedia Information Retrieval (MIR) 2008 and also from ACM MIR 2010. He has served on the organizing committees for over a dozen ACM and IEEE conferences. He served as the founding chair of the ACM ICMR steering committee and has served as chair of both the ACM MIR and ACM CIVR steering committees. In addition, he is the Editor-in-Chief of the International Journal of Multimedia Information Retrieval (Springer) and a member of the ACM SIGMM Executive Board, which is the highest and most influential committee of the SIGMM.