
Multimodal Data Fusion Technologies and Applications in Intelligent Systems

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 31 May 2024 | Viewed by 7410

Special Issue Editors


Dr. Yasin Yılmaz
Guest Editor
Electrical Engineering Department, University of South Florida, Tampa, FL 33620, USA
Interests: machine learning; computer vision; cybersecurity; statistical signal processing; internet of things; multimodal data fusion

Dr. Mehmet Aktukmak
Guest Editor
Senior AI Software Engineer at Intel Corporation, San Jose, CA 95134, USA
Interests: probabilistic machine learning; recommender systems; Bayesian models; multimodal data fusion

Dr. Keval Doshi
Guest Editor
Machine Learning Scientist at Amazon Prime Video, Seattle, WA, USA
Interests: machine learning; computer vision; video analytics; multimodal data fusion

Dr. Yoganand Balagurunathan
Guest Editor
Department of Machine Learning, H. Lee. Moffitt Cancer Center, Tampa, FL 33612, USA
Interests: machine learning; medical image processing; multimodal data fusion; prostate cancer; lung cancer; lymphoma

Special Issue Information

Dear Colleagues,

In many real-world systems, fusing data from different sensing modalities can facilitate inference and decision making: millimeter-wave radar, lidar, and camera data for autonomous driving; synthetic aperture radar (SAR), lidar, and satellite data for remote sensing; magnetic resonance (MR), X-ray, and ultrasound imaging data for medical applications; and moisture, temperature, and chemical sensor data for environmental applications. While late-stage fusion (decision fusion) techniques for multimodal data, such as majority voting, have long been used and are attractive for their simplicity, they do not exploit the full potential of rich multimodal data. Hence, there is a growing need for, and interest in, early-stage fusion techniques that can leverage the correlations between multimodal data to improve the quality of inference and decision making. Advances in software and hardware for machine learning, in particular deep neural networks, have been increasing the capacity to process complex data, e.g., high-dimensional and heterogeneous data from multimodal sources. Focusing on this timely topic of AI-enabled multimodal data fusion, this Special Issue solicits articles presenting novel methods (architectures, algorithms) and applications.
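To make the distinction above concrete, the following minimal Python sketch contrasts decision-level (late) fusion by majority voting with a simple feature-level (early) fusion that concatenates modality features before a single classifier. The synthetic "radar" and "camera" feature matrices and the logistic-regression classifiers are illustrative assumptions, not a reference to any specific system in this Special Issue.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy stand-ins for two sensing modalities (e.g., radar and camera features).
# Shapes and class labels are illustrative assumptions only.
rng = np.random.default_rng(0)
n, d_radar, d_camera = 200, 8, 16
y = rng.integers(0, 2, size=n)
x_radar = rng.normal(size=(n, d_radar)) + y[:, None] * 0.5
x_camera = rng.normal(size=(n, d_camera)) + y[:, None] * 0.3

# Late (decision-level) fusion: train one classifier per modality,
# then combine their hard decisions by majority vote.
clf_radar = LogisticRegression().fit(x_radar, y)
clf_camera = LogisticRegression().fit(x_camera, y)
votes = np.stack([clf_radar.predict(x_radar), clf_camera.predict(x_camera)])
late_pred = (votes.mean(axis=0) >= 0.5).astype(int)  # majority vote (ties -> class 1)

# Early (feature-level) fusion: concatenate modality features so a single
# model can exploit cross-modal correlations directly.
x_fused = np.concatenate([x_radar, x_camera], axis=1)
early_pred = LogisticRegression().fit(x_fused, y).predict(x_fused)

print("late-fusion accuracy :", (late_pred == y).mean())
print("early-fusion accuracy:", (early_pred == y).mean())
```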

Dr. Yasin Yılmaz
Dr. Mehmet Aktukmak
Dr. Keval Doshi
Dr. Yoganand Balagurunathan
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multimodal data
  • data fusion
  • data integration
  • machine learning
  • deep neural networks
  • AI

Published Papers (6 papers)


Research

18 pages, 3151 KiB  
Article
Semantic Guidance Fusion Network for Cross-Modal Semantic Segmentation
by Pan Zhang, Ming Chen and Meng Gao
Sensors 2024, 24(8), 2473; https://doi.org/10.3390/s24082473 - 12 Apr 2024
Viewed by 260
Abstract
Leveraging data from various modalities to enhance multimodal segmentation tasks is a well-regarded approach. Recently, efforts have been made to incorporate an array of modalities, including depth and thermal imaging. Nevertheless, the effective amalgamation of cross-modal interactions remains a challenge, given the unique traits each modality presents. In our current research, we introduce the semantic guidance fusion network (SGFN), an innovative cross-modal fusion network adept at integrating a diverse set of modalities. In particular, the SGFN features a semantic guidance module (SGM) engineered to boost bi-modal feature extraction, which encompasses a learnable semantic guidance convolution (SGC) designed to merge intensity and gradient data from disparate modalities. Comprehensive experiments carried out on the NYU Depth V2, SUN-RGBD, Cityscapes, MFNet, and ZJU datasets underscore both the superior performance and the generalization ability of the SGFN compared to the current leading models. Moreover, when tested on the DELIVER dataset, our bi-modal SGFN achieved a mIoU comparable to that of the hitherto leading model, CMNEXT.
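The paper's exact SGC design is not reproduced here; the following sketch only illustrates the general flavor of a bi-modal guidance block: fixed Sobel filters extract gradient cues from a secondary modality (e.g., depth), which are concatenated with its intensity and fused into the primary features through a learnable 1x1 convolution. All layer names, channel sizes, and the Sobel-based gradient extraction are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BiModalGuidanceFusion(nn.Module):
    """Toy bi-modal fusion block: mixes primary-modality features with
    intensity and gradient cues from a secondary modality (e.g., depth)."""

    def __init__(self, channels: int):
        super().__init__()
        # Fixed Sobel kernels to extract horizontal/vertical gradients.
        sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        sobel_y = sobel_x.t()
        self.register_buffer("sobel", torch.stack([sobel_x, sobel_y]).unsqueeze(1))
        # Learnable 1x1 convolution that fuses primary features with
        # the secondary modality's intensity and gradient channels.
        self.fuse = nn.Conv2d(channels + 3, channels, kernel_size=1)

    def forward(self, feat_primary: torch.Tensor, secondary: torch.Tensor) -> torch.Tensor:
        # secondary: (B, 1, H, W) single-channel map (e.g., depth or thermal).
        grads = F.conv2d(secondary, self.sobel, padding=1)   # (B, 2, H, W)
        cues = torch.cat([secondary, grads], dim=1)          # intensity + gradients
        cues = F.interpolate(cues, size=feat_primary.shape[-2:],
                             mode="bilinear", align_corners=False)
        return self.fuse(torch.cat([feat_primary, cues], dim=1))

# Example: fuse 64-channel RGB features with a 1-channel depth map.
block = BiModalGuidanceFusion(channels=64)
rgb_feat = torch.randn(2, 64, 32, 32)
depth = torch.randn(2, 1, 128, 128)
print(block(rgb_feat, depth).shape)  # torch.Size([2, 64, 32, 32])
```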

29 pages, 3319 KiB  
Article
Building Flexible, Scalable, and Machine Learning-Ready Multimodal Oncology Datasets
by Aakash Tripathi, Asim Waqas, Kavya Venkatesan, Yasin Yilmaz and Ghulam Rasool
Sensors 2024, 24(5), 1634; https://doi.org/10.3390/s24051634 - 02 Mar 2024
Cited by 2 | Viewed by 994
Abstract
The advancements in data acquisition, storage, and processing techniques have resulted in the rapid growth of heterogeneous medical data. Integrating radiological scans, histopathology images, and molecular information with clinical data is essential for developing a holistic understanding of the disease and optimizing treatment. The need for integrating data from multiple sources is further pronounced in complex diseases such as cancer for enabling precision medicine and personalized treatments. This work proposes the Multimodal Integration of Oncology Data System (MINDS), a flexible, scalable, and cost-effective metadata framework for efficiently fusing disparate data from public sources such as the Cancer Research Data Commons (CRDC) into an interconnected, patient-centric framework. MINDS consolidates over 41,000 cases from across repositories while achieving a high compression ratio relative to the 3.78 PB source data size, and it offers sub-5-s query response times for interactive exploration. MINDS provides an interface for exploring relationships across data types and building cohorts for developing large-scale multimodal machine learning models. By harmonizing multimodal data, MINDS aims to empower researchers with greater analytical ability to uncover diagnostic and prognostic insights and enable evidence-based personalized care. MINDS tracks granular end-to-end data provenance, ensuring reproducibility and transparency. The cloud-native architecture of MINDS can handle exponential data growth in a secure, cost-optimized manner while ensuring substantial storage optimization, replication avoidance, and dynamic access capabilities. Auto-scaling, access controls, and other mechanisms guarantee the scalability and security of its pipelines. MINDS overcomes the limitations of existing biomedical data silos via an interoperable metadata-driven approach that represents a pivotal step toward the future of oncology data integration.
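MINDS itself is a full cloud-native system and its code is not reproduced here; the snippet below merely illustrates the patient-centric metadata idea in miniature: an SQLite index linking records from several modalities to a shared case ID so that a multimodal cohort can be assembled with a single join. The table layout, column names, and URIs are invented for the example.

```python
import sqlite3

# Minimal patient-centric metadata index (illustrative schema, not MINDS itself).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE cases     (case_id TEXT PRIMARY KEY, primary_site TEXT);
CREATE TABLE radiology (case_id TEXT, series_uid TEXT, modality TEXT, uri TEXT);
CREATE TABLE pathology (case_id TEXT, slide_id TEXT, stain TEXT, uri TEXT);
""")
con.executemany("INSERT INTO cases VALUES (?, ?)",
                [("C-0001", "lung"), ("C-0002", "prostate")])
con.executemany("INSERT INTO radiology VALUES (?, ?, ?, ?)",
                [("C-0001", "1.2.3", "CT", "s3://bucket/ct/1.2.3"),
                 ("C-0002", "1.2.4", "MR", "s3://bucket/mr/1.2.4")])
con.executemany("INSERT INTO pathology VALUES (?, ?, ?, ?)",
                [("C-0001", "S-9", "H&E", "s3://bucket/slides/S-9.svs")])

# Cohort query: lung cases that have both a radiology series and a pathology slide.
cohort = con.execute("""
    SELECT c.case_id, r.modality, p.slide_id
    FROM cases c
    JOIN radiology r ON r.case_id = c.case_id
    JOIN pathology p ON p.case_id = c.case_id
    WHERE c.primary_site = 'lung'
""").fetchall()
print(cohort)  # [('C-0001', 'CT', 'S-9')]
```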

14 pages, 4058 KiB  
Article
A Music-Driven Dance Generation Method Based on a Spatial-Temporal Refinement Model to Optimize Abnormal Frames
by Huaxin Wang, Yang Song, Wei Jiang and Tianhao Wang
Sensors 2024, 24(2), 588; https://doi.org/10.3390/s24020588 - 17 Jan 2024
Viewed by 623
Abstract
Existing music-driven dance generation methods produce abnormal motion when generating dance sequences, which leads to unnatural overall dance movements; therefore, a music-driven dance generation method based on a spatial-temporal refinement model is proposed to optimize the abnormal frames. First, a cross-modal alignment model is used to learn the correspondence between the audio and dance-video modalities, and, based on the learned correspondence, the corresponding dance segments are matched with the input music segments. Second, an abnormal-frame optimization algorithm is proposed to optimize the abnormal frames in the dance sequence. Finally, a temporal refinement model is used to constrain the music beats and dance rhythms from the temporal perspective to further strengthen the consistency between the music and the dance movements. The experimental results show that the proposed method can generate realistic and natural dance video sequences, reducing the FID index by 1.2 and improving the diversity index by 1.7.
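The abstract does not spell out the authors' optimization algorithm, so the sketch below only illustrates the generic idea of repairing abnormal frames in a pose sequence: frames whose frame-to-frame joint displacement is a statistical outlier are flagged and re-filled by interpolating between the surrounding normal frames. The pose format, z-score threshold, and linear interpolation are assumptions.

```python
import numpy as np

def smooth_abnormal_frames(poses: np.ndarray, z_thresh: float = 3.0) -> np.ndarray:
    """poses: (T, J, 3) joint positions over T frames.
    Flags frames with abnormally large frame-to-frame displacement and
    replaces them by interpolating between neighbouring normal frames."""
    disp = np.linalg.norm(np.diff(poses, axis=0), axis=-1).mean(axis=-1)  # (T-1,)
    z = (disp - disp.mean()) / (disp.std() + 1e-8)
    abnormal = np.zeros(len(poses), dtype=bool)
    abnormal[1:] = z > z_thresh        # frame t is abnormal if the jump into it is extreme

    fixed = poses.copy()
    normal_idx = np.where(~abnormal)[0]
    for t in np.where(abnormal)[0]:
        prev_t = normal_idx[normal_idx < t].max()
        later = normal_idx[normal_idx > t]
        if len(later) == 0:
            fixed[t] = fixed[prev_t]   # trailing abnormal frame: hold the last normal pose
            continue
        next_t = later.min()
        w = (t - prev_t) / (next_t - prev_t)
        fixed[t] = (1 - w) * poses[prev_t] + w * poses[next_t]
    return fixed

# Toy sequence with one corrupted frame.
seq = np.cumsum(np.random.default_rng(1).normal(scale=0.01, size=(120, 17, 3)), axis=0)
seq[60] += 5.0                                            # inject an abnormal jump
print(np.abs(smooth_abnormal_frames(seq)[60] - seq[60]).max() > 1.0)  # True: frame repaired
```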

16 pages, 4631 KiB  
Article
Hierarchical Fusion Network with Enhanced Knowledge and Contrastive Learning for Multimodal Aspect-Based Sentiment Analysis on Social Media
by Xiaoran Hu and Masayuki Yamamura
Sensors 2023, 23(17), 7330; https://doi.org/10.3390/s23177330 - 22 Aug 2023
Viewed by 1126
Abstract
Aspect-based sentiment analysis (ABSA) is a task of fine-grained sentiment analysis that aims to determine the sentiment of a given target. With the increased prevalence of smart devices and social media, diverse data modalities have become more abundant, fueling interest in multimodal ABSA (MABSA). However, most existing methods for MABSA prioritize analyzing the relationship between aspect–text and aspect–image, overlooking the semantic gap between text and image representations. Moreover, they neglect the rich information in external knowledge, e.g., image captions. To address these limitations, in this paper we propose a novel hierarchical framework for MABSA, known as HF-EKCL, which also offers perspectives on sensor development within the context of sentiment analysis. Specifically, we generate captions for images to supplement the textual and visual features. A multi-head cross-attention mechanism and a graph attention neural network are utilized to capture the interactions between modalities, enabling the construction of multi-level aspect fusion features that incorporate element-level and structure-level information. Furthermore, we integrate modality-based and label-based contrastive learning methods into our framework, making the model learn shared features that are relevant to the sentiment of the corresponding words in the multimodal data. The results, based on two Twitter datasets, demonstrate the effectiveness of our proposed model.
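HF-EKCL's full hierarchy (caption generation, graph attention, and the two contrastive objectives) is beyond a short snippet; the sketch below shows only a multi-head cross-attention step in isolation, letting text tokens attend to image-region features as one building block of text-image fusion. The feature dimensions and tensor shapes are arbitrary assumptions.

```python
import torch
import torch.nn as nn

# Cross-attention between text tokens (queries) and image region features
# (keys/values), as one building block of text-image fusion.
d_model, n_heads = 256, 8
cross_attn = nn.MultiheadAttention(embed_dim=d_model, num_heads=n_heads, batch_first=True)

text_tokens = torch.randn(4, 20, d_model)    # (batch, text length, dim)
image_regions = torch.randn(4, 49, d_model)  # (batch, 7x7 image regions, dim)

fused, attn_weights = cross_attn(query=text_tokens, key=image_regions, value=image_regions)
print(fused.shape)         # torch.Size([4, 20, 256]): image-aware text representations
print(attn_weights.shape)  # torch.Size([4, 20, 49]): which regions each token attends to
```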

28 pages, 9610 KiB  
Article
Triangle-Mesh-Rasterization-Projection (TMRP): An Algorithm to Project a Point Cloud onto a Consistent, Dense and Accurate 2D Raster Image
by Christina Junger, Benjamin Buch and Gunther Notni
Sensors 2023, 23(16), 7030; https://doi.org/10.3390/s23167030 - 08 Aug 2023
Cited by 2 | Viewed by 1549
Abstract
The projection of a point cloud onto a 2D camera image is relevant to various image analysis and enhancement tasks, e.g., (i) multimodal image processing for data fusion, (ii) robotic applications and scene analysis, and (iii) generating real datasets with ground truth for deep neural networks. We identify the challenges of current single-shot projection methods, such as simple state-of-the-art projection; conventional, polygon, and deep learning-based upsampling methods; and closed-source SDK functions of low-cost depth cameras. We developed a new way to project point clouds onto a dense, accurate 2D raster image, called Triangle-Mesh-Rasterization-Projection (TMRP). The only gaps that the 2D image still contains with our method are valid gaps that result from the physical limits of the capturing cameras. Dense accuracy is achieved by simultaneously using the 2D neighborhood information (rx, ry) of the 3D coordinates in addition to the points P(X,Y,V), which allows a fast triangulation interpolation to be performed; the interpolation weights are determined using sub-triangles. Compared to single-shot methods, our algorithm solves the following challenges: (1) no false gaps or false neighborhoods are generated, (2) the density is XYZ independent, and (3) ambiguities are eliminated. Our TMRP method is also open source, freely available on GitHub, and can be applied to almost any sensor or modality. We demonstrate the usefulness of our method with four use cases using the KITTI-2012 dataset or sensors with different modalities. Our goal is to improve recognition tasks and processing optimization in the perception of transparent objects for robotic manufacturing processes.
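TMRP's reference implementation is available on GitHub and is not reproduced here; the snippet below only illustrates the core interpolation step such a projection relies on: filling the raster pixels covered by one projected triangle with barycentric-weighted vertex values. The pixel-center convention and the toy coordinates are assumptions.

```python
import numpy as np

def rasterize_triangle(raster: np.ndarray, uv: np.ndarray, values: np.ndarray) -> None:
    """Fill pixels inside a projected triangle with barycentric-weighted vertex values.
    raster: (H, W) output image; uv: (3, 2) pixel coords; values: (3,) vertex values."""
    (x0, y0), (x1, y1), (x2, y2) = uv
    xmin, xmax = int(np.floor(uv[:, 0].min())), int(np.ceil(uv[:, 0].max()))
    ymin, ymax = int(np.floor(uv[:, 1].min())), int(np.ceil(uv[:, 1].max()))
    denom = (y1 - y2) * (x0 - x2) + (x2 - x1) * (y0 - y2)
    if abs(denom) < 1e-12:
        return  # degenerate triangle
    for y in range(max(ymin, 0), min(ymax + 1, raster.shape[0])):
        for x in range(max(xmin, 0), min(xmax + 1, raster.shape[1])):
            # Barycentric coordinates of pixel centre (x, y).
            w0 = ((y1 - y2) * (x - x2) + (x2 - x1) * (y - y2)) / denom
            w1 = ((y2 - y0) * (x - x2) + (x0 - x2) * (y - y2)) / denom
            w2 = 1.0 - w0 - w1
            if w0 >= 0 and w1 >= 0 and w2 >= 0:  # pixel lies inside the triangle
                raster[y, x] = w0 * values[0] + w1 * values[1] + w2 * values[2]

# Toy example: interpolate depth values of three projected points into a raster.
img = np.zeros((8, 8))
rasterize_triangle(img, uv=np.array([[1.0, 1.0], [6.0, 2.0], [3.0, 6.0]]),
                   values=np.array([1.0, 2.0, 3.0]))
print(np.round(img, 2))
```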

13 pages, 1249 KiB  
Article
A Dual-Path Cross-Modal Network for Video-Music Retrieval
by Xin Gu, Yinghua Shen and Chaohui Lv
Sensors 2023, 23(2), 805; https://doi.org/10.3390/s23020805 - 10 Jan 2023
Cited by 2 | Viewed by 1772
Abstract
In recent years, with the development of the internet, video has become increasingly widespread in daily life, and adding harmonious music to a video is gradually becoming an artistic task. However, manually adding music takes considerable time and effort, so we propose a method to recommend background music for videos. The emotional message of music is rarely taken into account in current work, but it is crucial for video-music retrieval. To achieve this, we design two paths to process content information and emotional information across modalities. Based on the characteristics of video and music, we design various feature extraction schemes and common representation spaces. In the content path, pre-trained networks are used for feature extraction; as these features contain some redundant information, we use an encoder-decoder structure for dimensionality reduction, where the encoder weights are shared to obtain content-sharing features for video and music. In the emotion path, an emotion key-frame scheme is used for video and a channel attention mechanism is used for music in order to obtain the emotion information effectively. We also add an emotion-distinguishing loss to guarantee that the network acquires the emotion information effectively. More importantly, we propose a way to combine content information with emotional information: content features are first concatenated with emotion features and then passed through a fused shared space structured as an MLP to obtain more effective fused shared features. In addition, a polarity penalty factor is added to the classical metric loss function to make it more suitable for this task. Experiments show that this dual-path video-music retrieval network can effectively merge information; compared with existing methods, it improves Recall@1 on the retrieval task by 3.94.
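The paper's polarity-penalized metric loss is not detailed in the abstract, so the sketch below substitutes a generic symmetric InfoNCE-style objective over a shared embedding space: separate projection heads map video and music features into one space, and matched pairs are pulled together so that retrieval reduces to ranking by cosine similarity. The network sizes, temperature, and loss choice are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy shared embedding space for cross-modal retrieval (a generic stand-in,
# not the paper's polarity-penalised metric loss).
class SharedSpace(nn.Module):
    def __init__(self, d_video=512, d_music=128, d_shared=256):
        super().__init__()
        self.video_proj = nn.Sequential(nn.Linear(d_video, d_shared), nn.ReLU(),
                                        nn.Linear(d_shared, d_shared))
        self.music_proj = nn.Sequential(nn.Linear(d_music, d_shared), nn.ReLU(),
                                        nn.Linear(d_shared, d_shared))

    def forward(self, video_feat, music_feat):
        # L2-normalise so dot products are cosine similarities.
        v = F.normalize(self.video_proj(video_feat), dim=-1)
        m = F.normalize(self.music_proj(music_feat), dim=-1)
        return v, m

def retrieval_loss(v, m, temperature=0.07):
    logits = v @ m.t() / temperature                  # pairwise similarities
    targets = torch.arange(len(v), device=v.device)   # i-th video matches i-th music clip
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

model = SharedSpace()
v, m = model(torch.randn(16, 512), torch.randn(16, 128))
print(retrieval_loss(v, m).item())
# At retrieval time, music clips are ranked for a query video by the similarity v @ m.t().
```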
