Multi-Modal Image Processing Methods, Systems, and Applications

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Sensing and Imaging".

Deadline for manuscript submissions: 1 December 2024 | Viewed by 10588

Special Issue Editors


Dr. Kechen Song
Guest Editor
School of Mechanical Engineering & Automation, Northeastern University, Shenyang 110819, China
Interests: applied surface science; vision detection for surface defects; multi-modal image analysis and application

Prof. Dr. Yunhui Yan
Guest Editor
School of Mechanical Engineering & Automation, Northeastern University, Shenyang 110819, China
Interests: feature extraction; image segmentation; object detection; image fusion; computer vision

Special Issue Information

Dear Colleagues,

Multi-sensor systems are widely deployed in real-world applications, where the multi-modal images they acquire further enhance the visual perception of machines. RGB-Depth (RGB-D) and RGB-Thermal infrared (RGB-T) images are currently the most commonly used multi-modal images: the depth (D) image provides stereoscopic information about the object, while the thermal (T) image strengthens the imaging capability of a machine vision system under insufficient illumination. Multi-modal images enable more accurate and robust performance in vision tasks such as object detection and segmentation, image fusion and enhancement, and object tracking and positioning. At the same time, many real-world applications, such as autonomous driving, industrial defect detection, robot control, and remote sensing inspection, are easier to accomplish with the support of multi-modal images. However, the perception gain brought by multi-modal images also increases the difficulty of data processing. The acquisition and balancing of information from different modalities, the combination and enhancement of multiple features, and the generalization and robustness of multi-sensor systems in complex scenes are all challenges encountered in multi-modal image processing.

To address these challenges, this Special Issue calls for original and innovative methodological contributions to multi-modal image processing. Submissions may cover all areas of multi-modal vision, from fundamental theoretical methods to the latest multi-sensor system designs. Topics of interest include the detection and recognition of multi-modal images, multi-modal industrial applications, autonomous driving, robot vision and control, and remote sensing data processing. Critical reviews and surveys of multi-modal images and multi-sensor systems are also encouraged.

Dr. Kechen Song
Prof. Dr. Yunhui Yan
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • multi-sensor systems
  • multi-modal images (RGB-D/T)
  • multi-modal object detection
  • multi-modal segmentation
  • multi-modal image fusion
  • multi-modal object tracking
  • autonomous driving
  • robotic visual perception
  • unmanned aerial vehicles (UAVs)

Published Papers (7 papers)


Research

20 pages, 6947 KiB  
Article
Fusion of Multimodal Imaging and 3D Digitization Using Photogrammetry
by Roland Ramm, Pedro de Dios Cruz, Stefan Heist, Peter Kühmstedt and Gunther Notni
Sensors 2024, 24(7), 2290; https://doi.org/10.3390/s24072290 - 03 Apr 2024
Viewed by 477
Abstract
Multimodal sensors capture and integrate diverse characteristics of a scene to maximize information gain. In optics, this may involve capturing intensity in specific spectra or polarization states to determine factors such as material properties or an individual's health condition. Combining multimodal camera data with shape data from 3D sensors is a challenging issue. Multimodal cameras, e.g., hyperspectral cameras, or cameras outside the visible light spectrum, e.g., thermal cameras, fall far short of state-of-the-art photo cameras in terms of resolution and image quality. In this article, a new method is demonstrated for superimposing multimodal image data onto a 3D model created by multi-view photogrammetry. While a high-resolution photo camera captures a set of images from varying view angles to reconstruct a detailed 3D model of the scene, low-resolution multimodal cameras simultaneously record the scene. All cameras are pre-calibrated and rigidly mounted on a rig, i.e., their imaging properties and relative positions are known. The method was realized in a laboratory setup consisting of a professional photo camera, a thermal camera, and a 12-channel multispectral camera. In our experiments, an accuracy better than one pixel was achieved for the data fusion using multimodal superimposition. Finally, application examples of multimodal 3D digitization are demonstrated, and further steps toward system realization are discussed.
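
As a rough illustration of the superimposition step described above, the sketch below projects the vertices of a photogrammetric 3D model into a pre-calibrated multimodal camera and samples its pixels. It is a minimal NumPy sketch assuming an ideal pinhole model with known intrinsics K and extrinsics R, t; lens distortion and occlusion handling are omitted, and the function names are illustrative rather than taken from the paper.

```python
import numpy as np

def project_points(points_3d, K, R, t):
    """Project Nx3 world points into pixel coordinates of a calibrated camera.

    K: 3x3 intrinsics, R: 3x3 rotation, t: 3-vector translation (world -> camera),
    all assumed known from pre-calibration of the rigid camera rig.
    """
    cam = (R @ points_3d.T + t.reshape(3, 1)).T        # world -> camera frame
    uv = (K @ cam.T).T                                  # pinhole projection
    uv = uv[:, :2] / uv[:, 2:3]                         # normalize by depth
    return uv, cam[:, 2]                                # pixel coords + depth

def sample_modality(points_3d, image, K, R, t):
    """Assign each 3D vertex the value of the multimodal pixel it projects onto."""
    uv, depth = project_points(points_3d, K, R, t)
    h, w = image.shape[:2]
    u = np.clip(np.round(uv[:, 0]).astype(int), 0, w - 1)
    v = np.clip(np.round(uv[:, 1]).astype(int), 0, h - 1)
    values = image[v, u]
    values[depth <= 0] = 0                              # points behind the camera
    return values

# Example (hypothetical file names): texture a photogrammetric mesh with thermal values
# vertices = np.load("mesh_vertices.npy")      # Nx3 mesh vertices from photogrammetry
# thermal  = np.load("thermal_frame.npy")      # HxW thermal image
# vertex_temps = sample_modality(vertices, thermal, K_thermal, R_thermal, t_thermal)
```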

18 pages, 2852 KiB  
Article
Self-Enhanced Mixed Attention Network for Three-Modal Images Few-Shot Semantic Segmentation
by Kechen Song, Yiming Zhang, Yanqi Bao, Ying Zhao and Yunhui Yan
Sensors 2023, 23(14), 6612; https://doi.org/10.3390/s23146612 - 22 Jul 2023
Cited by 1 | Viewed by 987
Abstract
As an important computer vision technique, image segmentation is widely used in various tasks. However, in extreme cases, insufficient illumination can severely degrade model performance, so an increasing number of fully supervised methods use multi-modal images as their input. Large, densely annotated datasets are difficult to obtain, whereas few-shot methods can still achieve satisfactory results with only a few pixel-annotated samples. We therefore propose a Visible-Depth-Thermal (three-modal) few-shot semantic segmentation method that exploits both the homogeneous information within three-modal images and the complementary information across modalities to improve few-shot segmentation performance. We constructed a novel indoor dataset, VDT-2048-5i, for the three-modal few-shot semantic segmentation task. We also propose a Self-Enhanced Mixed Attention Network (SEMANet), which consists of a Self-Enhanced (SE) module and a Mixed Attention (MA) module. The SE module amplifies the differences between the different kinds of features and strengthens the weak connections of foreground features. The MA module fuses the three-modal features to obtain a richer representation. Compared with previous state-of-the-art methods, our model improves mIoU by 3.8% and 3.3% in the 1-shot and 5-shot settings, respectively, achieving state-of-the-art performance. In future work, we will address failure cases by obtaining more discriminative and robust feature representations and explore achieving high performance with fewer parameters and lower computational cost.
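
The fusion idea behind the MA module can be illustrated with a generic channel-attention block that reweights and merges the three modality streams. This is a minimal PyTorch sketch assuming equal-sized feature maps; it is not the actual SEMANet implementation, whose SE and MA modules are more elaborate.

```python
import torch
import torch.nn as nn

class MixedAttentionFusion(nn.Module):
    """Fuse visible, depth, and thermal feature maps with channel attention (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                       # global context of the stacked modalities
            nn.Conv2d(3 * channels, 3 * channels // 4, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(3 * channels // 4, 3 * channels, 1),
            nn.Sigmoid(),                                  # per-channel weights in [0, 1]
        )
        self.proj = nn.Conv2d(3 * channels, channels, 1)   # project back to a single stream

    def forward(self, f_rgb, f_depth, f_thermal):
        x = torch.cat([f_rgb, f_depth, f_thermal], dim=1)  # stack the three modalities
        x = x * self.gate(x)                               # reweight channels per modality
        return self.proj(x)                                # fused feature map

# fused = MixedAttentionFusion(256)(f_rgb, f_d, f_t)       # each input: B x 256 x H x W
```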

23 pages, 6290 KiB  
Article
Semantic Retrieval of Remote Sensing Images Based on the Bag-of-Words Association Mapping Method
by Jingwen Li, Yanting Cai, Xu Gong, Jianwu Jiang, Yanling Lu, Xiaode Meng and Li Zhang
Sensors 2023, 23(13), 5807; https://doi.org/10.3390/s23135807 - 21 Jun 2023
Viewed by 1143
Abstract
With the increasing demand for remote sensing image applications, extracting the required images from a huge collection of remote sensing images has become a hot topic. Previous retrieval methods cannot guarantee efficiency, accuracy, and interpretability in the retrieval process. We therefore propose a bag-of-words association mapping method that can explain the semantic derivation process of remote sensing images. The method constructs associations between low-level features and high-level semantics through visual word bags. An improved FP-Growth method is proposed to construct strong association rules to semantics, and a feedback mechanism is established to improve the accuracy of subsequent retrievals by reducing the semantic probability of incorrect retrieval results. The public datasets AID and NWPU-RESISC45 were used for validation. The experimental results show that the average accuracies on the two datasets reach 87.5% and 90.8%, which are 22.5% and 20.3% higher than VGG16 and 17.6% and 15.6% higher than ResNet18, respectively, validating the effectiveness of the proposed method.
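
To make the bag-of-words idea concrete, the sketch below builds a visual vocabulary with k-means, quantizes local descriptors into word histograms, and estimates word-to-label confidences from co-occurrence counts. The co-occurrence confidence is a simplified stand-in for the paper's improved FP-Growth rule mining, the feedback mechanism is not reproduced, and all function names are illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, n_words=256, seed=0):
    """Cluster all local descriptors (e.g., SIFT) into n_words visual words."""
    all_desc = np.vstack(descriptors)                 # descriptors: list of (n_i x d) arrays
    return KMeans(n_clusters=n_words, random_state=seed, n_init=10).fit(all_desc)

def bag_of_words(desc, vocab):
    """Quantize one image's descriptors into a normalized visual-word histogram."""
    words = vocab.predict(desc)
    hist = np.bincount(words, minlength=vocab.n_clusters)
    return hist / max(hist.sum(), 1)

def word_label_associations(descriptors, labels, vocab, min_conf=0.6):
    """Estimate confidence P(label | word) from word/label co-occurrence.

    Simplified stand-in for the paper's improved FP-Growth rule mining.
    labels: integer semantic tag (0..n_labels-1) per image.
    """
    n_labels = len(set(labels))
    counts = np.zeros((vocab.n_clusters, n_labels))
    for desc, y in zip(descriptors, labels):
        present = np.unique(vocab.predict(desc))      # words occurring in this image
        counts[present, y] += 1
    conf = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return {(w, l): conf[w, l] for w in range(vocab.n_clusters)
            for l in range(n_labels) if conf[w, l] >= min_conf}
```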

19 pages, 5956 KiB  
Article
Bilateral Cross-Modal Fusion Network for Robot Grasp Detection
by Qiang Zhang and Xueying Sun
Sensors 2023, 23(6), 3340; https://doi.org/10.3390/s23063340 - 22 Mar 2023
Cited by 1 | Viewed by 1462
Abstract
In the field of vision-based robot grasping, effectively leveraging RGB and depth information to accurately determine the position and pose of a target is a critical issue. To address this challenge, we propose a tri-stream cross-modal fusion architecture for 2-DoF visual grasp detection. This architecture facilitates the interaction of bilateral RGB and depth information and is designed to efficiently aggregate multiscale information. Our novel modal interaction module (MIM) with a spatial-wise cross-attention algorithm adaptively captures cross-modal feature information, while the channel interaction modules (CIM) further enhance the aggregation of the different modal streams. In addition, we efficiently aggregate global multiscale information through a hierarchical structure with skip connections. To evaluate the performance of the proposed method, we conducted validation experiments on standard public datasets as well as real robot grasping experiments. We achieved image-wise detection accuracies of 99.4% and 96.7% on the Cornell and Jacquard datasets, respectively, and object-wise detection accuracies of 97.8% and 94.6% on the same datasets. Furthermore, physical experiments using the 6-DoF Elite robot demonstrated a success rate of 94.5%. These experiments highlight the superior accuracy of the proposed method.
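
The spatial-wise cross-attention used for modal interaction can be sketched generically as one modality's spatial tokens attending to the other's. The PyTorch block below is such a sketch; the actual MIM and CIM designs in the paper are assumed to differ in detail.

```python
import torch
import torch.nn as nn

class SpatialCrossAttention(nn.Module):
    """Let one modality's features attend to another's over spatial positions (illustrative)."""
    def __init__(self, channels, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.norm = nn.LayerNorm(channels)

    def forward(self, f_query, f_context):
        b, c, h, w = f_query.shape
        q = f_query.flatten(2).transpose(1, 2)      # B x HW x C spatial tokens (query modality)
        kv = f_context.flatten(2).transpose(1, 2)   # B x HW x C spatial tokens (context modality)
        out, _ = self.attn(q, kv, kv)               # cross-attention over spatial positions
        out = self.norm(out + q)                    # residual connection + normalization
        return out.transpose(1, 2).reshape(b, c, h, w)

# rgb_enhanced = SpatialCrossAttention(128)(f_rgb, f_depth)   # both inputs: B x 128 x H x W
```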

21 pages, 34240 KiB  
Article
Introduction to Door Opening Type Classification Based on Human Demonstration
by Valentin Šimundić, Matej Džijan, Petra Pejić and Robert Cupec
Sensors 2023, 23(6), 3093; https://doi.org/10.3390/s23063093 - 14 Mar 2023
Cited by 2 | Viewed by 1549
Abstract
Opening doors and drawers will be an important ability for future service robots used in domestic and industrial environments. However, in recent years, the mechanisms for opening doors and drawers have become more diverse and more difficult for robots to determine and manipulate. Doors can be divided into three distinct handling types: regular handles, hidden handles, and push mechanisms. While extensive research has been done on the detection and handling of regular handles, the other handling types have not been explored as much. In this paper, we set out to classify cabinet door handling types. To this end, we collect and label a dataset consisting of RGB-D images of cabinets in their natural environment. As part of the dataset, we provide images of humans demonstrating the handling of these doors. We detect the poses of human hands and then train a classifier to determine the type of cabinet door handling. With this research, we hope to provide a starting point for exploring the different types of cabinet door openings in real-world environments.
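
A minimal version of the final classification step might look like the sketch below, which turns detected hand keypoints into wrist-relative feature vectors and trains a random forest over the three handling types. The keypoint layout, the feature construction, and the choice of classifier are assumptions for illustration, not the paper's pipeline, and the hand-pose detector itself is not shown.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical label set matching the three handling types described in the abstract.
HANDLING_TYPES = ["regular_handle", "hidden_handle", "push_mechanism"]

def make_features(hand_keypoints):
    """Center 3D hand keypoints (e.g., 21 joints x 3 coords) on the wrist and flatten."""
    pts = np.asarray(hand_keypoints, dtype=float)      # 21 x 3 keypoints from a pose detector
    pts = pts - pts[0]                                  # wrist-relative coordinates
    return pts.ravel()

def train_classifier(keypoint_sets, labels):
    """Train a handling-type classifier from demonstrated hand poses."""
    X = np.stack([make_features(k) for k in keypoint_sets])
    y = np.asarray(labels)                              # indices into HANDLING_TYPES
    clf = RandomForestClassifier(n_estimators=200, random_state=0)
    print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
    return clf.fit(X, y)
```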

21 pages, 8910 KiB  
Article
HAAN: Learning a Hierarchical Adaptive Alignment Network for Image-Text Retrieval
by Shuhuai Wang, Zheng Liu, Xinlei Pei and Junhao Xu
Sensors 2023, 23(5), 2559; https://doi.org/10.3390/s23052559 - 25 Feb 2023
Viewed by 1442
Abstract
Image-text retrieval aims to retrieve related results in one modality by querying with another modality. As a fundamental and key problem in cross-modal retrieval, image-text retrieval remains challenging owing to the complementary and imbalanced relationship between the different modalities (i.e., image and text) and different granularities (i.e., global level and local level). However, existing works have not fully considered how to effectively mine and fuse the complementarities between images and texts at different granularities. Therefore, in this paper, we propose a hierarchical adaptive alignment network, whose contributions are as follows: (1) we propose a multi-level alignment network that simultaneously mines global-level and local-level data, thereby enhancing the semantic association between images and texts; (2) we propose an adaptive weighted loss to flexibly optimize the image-text similarity in two stages within a unified framework; and (3) we conduct extensive experiments on three public benchmark datasets (Corel 5K, Pascal Sentence, and Wiki) and compare our method with eleven state-of-the-art methods. The experimental results thoroughly verify the effectiveness of our proposed method.
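
The adaptive weighting of global-level and local-level alignment can be sketched as a learnable convex combination of two hinge-based ranking losses over image-text similarity matrices. The PyTorch block below is schematic; the similarity functions and the exact form of HAAN's adaptive weighted loss are assumptions, not the paper's formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveAlignmentLoss(nn.Module):
    """Adaptively weight global-level and local-level image-text ranking losses (illustrative)."""
    def __init__(self, margin=0.2):
        super().__init__()
        self.margin = margin
        self.log_w = nn.Parameter(torch.zeros(2))       # learnable weights for the two levels

    def ranking(self, sim):
        # sim: B x B similarity matrix with matching image-text pairs on the diagonal
        pos = sim.diag().unsqueeze(1)                   # similarity of the matched pairs
        cost = (self.margin + sim - pos).clamp(min=0)   # hinge over mismatched pairs
        mask = 1.0 - torch.eye(sim.size(0), device=sim.device)
        return (cost * mask).mean()                     # exclude the diagonal from the loss

    def forward(self, sim_global, sim_local):
        w = F.softmax(self.log_w, dim=0)                # level weights that sum to one
        return w[0] * self.ranking(sim_global) + w[1] * self.ranking(sim_local)
```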

20 pages, 9150 KiB  
Article
RGB-D-Based Stair Detection and Estimation Using Deep Learning
by Chen Wang, Zhongcai Pei, Shuang Qiu and Zhiyong Tang
Sensors 2023, 23(4), 2175; https://doi.org/10.3390/s23042175 - 15 Feb 2023
Cited by 2 | Viewed by 2384
Abstract
Stairs are common vertical traffic structures in buildings, and stair detection is an important task in environmental perception for autonomous mobile robots. Most existing algorithms have difficulty combining the visual information from binocular sensors effectively and ensuring reliable detection at night and in cases of extremely fuzzy visual clues. To solve these problems, we propose a stair detection network with red-green-blue (RGB) and depth inputs. Specifically, we design a selective module that enables the network to learn the complementary relationship between the RGB feature maps and the depth feature maps and to fuse the features effectively in different scenes. In addition, we propose several postprocessing algorithms, including a stair line clustering algorithm and a coordinate transformation algorithm, to obtain the stair geometric parameters. Experiments show that our method outperforms the existing state-of-the-art deep learning method, with accuracy, recall, and runtime improved by 5.64%, 7.97%, and 3.81 ms, respectively. The improved indexes show the effectiveness of the multimodal inputs and the selective module. The estimated stair geometric parameters have root mean square errors within 15 mm when ascending stairs and within 25 mm when descending stairs. Our method also has an extremely fast detection speed that can meet the requirements of most real-time applications.
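
The selective fusion of RGB and depth feature maps can be illustrated with a gating block that predicts per-channel weights for the two modalities and blends them accordingly. This is a generic PyTorch sketch; the paper's selective module is assumed to be more involved.

```python
import torch
import torch.nn as nn

class SelectiveFusion(nn.Module):
    """Learn per-channel weights that select between RGB and depth features (illustrative)."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),                     # global context of both modalities
            nn.Conv2d(2 * channels, channels, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, 2 * channels, 1),        # one logit per channel per modality
        )

    def forward(self, f_rgb, f_depth):
        x = torch.cat([f_rgb, f_depth], dim=1)
        logits = self.fc(x)                              # B x 2C x 1 x 1
        a, b = logits.chunk(2, dim=1)
        w = torch.softmax(torch.stack([a, b], dim=0), dim=0)  # modality weights sum to 1 per channel
        return w[0] * f_rgb + w[1] * f_depth             # scene-dependent blend of the two streams

# fused = SelectiveFusion(64)(f_rgb, f_depth)            # both inputs: B x 64 x H x W
```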
