Multi-Modal Learning for Multimedia Data Analysis and Applications

A special issue of Electronics (ISSN 2079-9292). This special issue belongs to the section "Computer Science & Engineering".

Deadline for manuscript submissions: closed (15 April 2024)

Special Issue Editors


Dr. Yongyong Chen
Guest Editor
School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen 518055, China
Interests: machine learning; pattern recognition; multi-modal learning

Dr. Yongqiang Tang
Guest Editor
Research Center of Precision Sensing and Control, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China
Interests: machine learning; classification; pattern recognition

Special Issue Information

Dear Colleagues,

Advances in data collection, transmission, and storage have produced explosive growth in multi-source, multi-view, or multi-modal multimedia data, such as text, images, audio, and video. For example, magnetic resonance imaging (MRI), computed tomography (CT), and positron emission tomography (PET) have been explored for disease diagnosis in clinical practice. Face images, fingerprints, palm prints, and irises have been combined for biometric identity recognition. Object tracking exploits RGB, pseudo-depth, and thermal infrared images to improve the reliability and accuracy of autonomous driving systems. Various multi-modal learning methods have been proposed, including graph-based, kernel-based, subspace learning-based, manifold learning-based, and ensemble learning-based methods, as well as their deep extensions.

Despite these benefits, several open challenges remain. For example, most current multi-modal learning methods suffer from quadratic or even cubic complexity on large-scale datasets, which restricts their practical applicability. In disease diagnosis, patients may not undergo every medical test, leading to the incomplete multi-modal learning problem. Deep multi-modal learning has achieved promising performance, yet it often lacks interpretability. These difficulties arise because it remains challenging to fuse multimedia data across different modalities for downstream applications such as recommendation, emotion recognition, matching, and classification.
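
As a concrete illustration of the fusion and incomplete-modality issues above, the following sketch (in PyTorch, with hypothetical module names and dimensions, not taken from any specific work cited here) shows a simple late-fusion classifier that averages modality embeddings over only the modalities actually observed for each sample:

```python
# Minimal late-fusion sketch for the incomplete multi-modal setting.
# All names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class LateFusionClassifier(nn.Module):
    def __init__(self, dim_img=512, dim_txt=256, dim_hidden=128, n_classes=10):
        super().__init__()
        # One encoder per modality, projecting into a shared embedding space.
        self.img_enc = nn.Sequential(nn.Linear(dim_img, dim_hidden), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(dim_txt, dim_hidden), nn.ReLU())
        self.head = nn.Linear(dim_hidden, n_classes)

    def forward(self, x_img, x_txt, mask):
        # mask: (batch, 2), 1 where the modality is observed, 0 where missing.
        z_img = self.img_enc(x_img) * mask[:, 0:1]
        z_txt = self.txt_enc(x_txt) * mask[:, 1:2]
        # Average only over the observed modalities (avoid division by zero).
        z = (z_img + z_txt) / mask.sum(dim=1, keepdim=True).clamp(min=1)
        return self.head(z)


model = LateFusionClassifier()
x_img, x_txt = torch.randn(4, 512), torch.randn(4, 256)
# Second sample lacks text, third lacks the image.
mask = torch.tensor([[1., 1.], [1., 0.], [0., 1.], [1., 1.]])
logits = model(x_img, x_txt, mask)
print(logits.shape)  # torch.Size([4, 10])
```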

This Special Issue seeks the latest advances in novel theory, architecture, and algorithm design for multi-modal data analysis in pattern recognition and computer vision, as well as novel applications such as multi-modal outlier detection, multi-modal object tracking, and multi-modal medical analysis. We hope these advances will improve the accuracy, robustness, and efficiency of multi-modal data analysis. Papers on the above topics, as well as on other related topics, are welcome. Topics of interest include (but are not limited to):

  • Novel multi-modal learning for multimedia applications such as multi-modal outlier detection, multi-modal object tracking, and multi-modal medical analysis
  • Novel multi-modal learning theories
  • Optimization for multi-modal multimedia data analysis techniques
  • Fast solvers for large-scale multi-modal data
  • Incomplete multi-modal learning methods
  • Unsupervised/semi-supervised multi-modal learning methods
  • Deep multi-modal learning methods
  • Incorporating new mathematical techniques in multi-modal data analysis

Dr. Yongyong Chen
Dr. Yongqiang Tang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Electronics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • unsupervised/semi-supervised multi-modal learning
  • deep learning
  • multimedia data analysis

Published Papers (2 papers)


Research

17 pages, 2957 KiB  
Article
Prompt-Enhanced Generation for Multimodal Open Question Answering
by Chenhao Cui and Zhoujun Li
Electronics 2024, 13(8), 1434; https://doi.org/10.3390/electronics13081434 - 10 Apr 2024
Abstract
Multimodal open question answering involves retrieving relevant information from both images and their corresponding texts given a question and then generating the answer. The quality of the generated answer heavily depends on the quality of the retrieved image–text pairs. Existing methods encode and retrieve images and texts, inputting the retrieved results into a language model to generate answers. These methods overlook the semantic alignment of image–text pairs within the information source, which affects the encoding and retrieval performance. Furthermore, these methods are highly dependent on retrieval performance, and poor retrieval quality can lead to poor generation performance. To address these issues, we propose a prompt-enhanced generation model, PEG, which includes generating supplementary descriptions for images to provide ample material for image–text alignment while also utilizing vision–language joint encoding to improve encoding effects and thereby enhance retrieval performance. Contrastive learning is used to enhance the model’s ability to discriminate between relevant and irrelevant information sources. Moreover, we further explore the knowledge within pre-trained model parameters through prefix-tuning to generate background knowledge relevant to the questions, offering additional input for answer generation and reducing the model’s dependency on retrieval performance. Experiments conducted on the WebQA and MultimodalQA datasets demonstrate that our model outperforms other baseline models in retrieval and generation performance.
(This article belongs to the Special Issue Multi-Modal Learning for Multimedia Data Analysis and Applications)
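
As a rough, self-contained illustration of the contrastive retrieval idea described in the abstract (a sketch only, not the authors' PEG implementation; the function names, dimensions, and temperature value are assumptions), a question can be scored against candidate image–text pair embeddings with in-batch negatives as follows:

```python
# Sketch of contrastive scoring between question embeddings and image-text
# pair embeddings; hypothetical, not the PEG authors' code.
import torch
import torch.nn.functional as F


def contrastive_retrieval_loss(q_emb, pair_emb, temperature=0.07):
    """q_emb: (B, D) question embeddings; pair_emb: (B, D) embeddings of the
    matching image-text pair for each question. Off-diagonal entries act as
    in-batch negatives (irrelevant information sources)."""
    q = F.normalize(q_emb, dim=-1)
    p = F.normalize(pair_emb, dim=-1)
    logits = q @ p.t() / temperature                    # (B, B) similarities
    targets = torch.arange(q.size(0), device=q.device)  # i-th pair matches i-th question
    return F.cross_entropy(logits, targets)


def rank_candidates(q_emb, cand_emb):
    # At retrieval time, rank all candidate pairs by cosine similarity.
    scores = F.normalize(q_emb, dim=-1) @ F.normalize(cand_emb, dim=-1).t()
    return scores.argsort(dim=-1, descending=True)


loss = contrastive_retrieval_loss(torch.randn(8, 256), torch.randn(8, 256))
print(loss.item())
```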

14 pages, 10170 KiB  
Article
MSGSA: Multi-Scale Guided Self-Attention Network for Crowd Counting
by Yange Sun, Meng Li, Huaping Guo and Li Zhang
Electronics 2023, 12(12), 2631; https://doi.org/10.3390/electronics12122631 - 11 Jun 2023
Cited by 2
Abstract
The use of convolutional neural networks (CNN) for crowd counting has made significant progress in recent years; however, effectively addressing the scale variation and complex backgrounds remain challenging tasks. To address these challenges, we propose a novel Multi-Scale Guided Self-Attention (MSGSA) network that utilizes self-attention mechanisms to capture multi-scale contextual information for crowd counting. The MSGSA network consists of three key modules: a Feature Pyramid Module (FPM), a Scale Self-Attention Module (SSAM), and a Scale-aware Feature Fusion (SFA). By integrating self-attention mechanisms at multiple scales, our proposed method captures both global and local contextual information, leading to an improvement in the accuracy of crowd counting. We conducted extensive experiments on multiple benchmark datasets, and the results demonstrate that our method outperforms most existing methods in terms of counting accuracy and the quality of the generated density map. Our proposed MSGSA network provides a promising direction for efficient and accurate crowd counting in complex backgrounds.
(This article belongs to the Special Issue Multi-Modal Learning for Multimedia Data Analysis and Applications)
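
As a minimal sketch of the multi-scale self-attention idea summarized above (not the MSGSA implementation; module names, pooling scales, and sizes are assumptions), one can apply self-attention to pooled feature maps at several scales and fuse the results into a density map:

```python
# Illustrative multi-scale self-attention over pooled feature maps;
# hypothetical, not the MSGSA authors' code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiScaleSelfAttention(nn.Module):
    def __init__(self, channels=64, scales=(1, 2, 4), n_heads=4):
        super().__init__()
        self.scales = scales
        self.attn = nn.ModuleList(
            [nn.MultiheadAttention(channels, n_heads, batch_first=True)
             for _ in scales]
        )
        self.fuse = nn.Conv2d(channels * len(scales), 1, kernel_size=1)

    def forward(self, feat):                        # feat: (B, C, H, W)
        b, c, h, w = feat.shape
        outs = []
        for s, attn in zip(self.scales, self.attn):
            x = F.avg_pool2d(feat, s)               # coarser scale
            tokens = x.flatten(2).transpose(1, 2)   # (B, H'W', C)
            y, _ = attn(tokens, tokens, tokens)     # self-attention at this scale
            y = y.transpose(1, 2).reshape(b, c, h // s, w // s)
            outs.append(F.interpolate(y, size=(h, w), mode="bilinear",
                                      align_corners=False))
        return self.fuse(torch.cat(outs, dim=1))    # predicted density map


feat = torch.randn(2, 64, 32, 32)
print(MultiScaleSelfAttention()(feat).shape)  # torch.Size([2, 1, 32, 32])
```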
