Audio, Image, and Multimodal Sensing Techniques

A special issue of Sensors (ISSN 1424-8220). This special issue belongs to the section "Intelligent Sensors".

Deadline for manuscript submissions: 10 May 2024 | Viewed by 9945

Special Issue Editor


Dr. Alicja Wieczorkowska
Guest Editor
Polish-Japanese Academy of Information Technology, Warsaw, Poland
Interests: audio signal analysis; music information retrieval; knowledge discovery in databases; multimedia; human–computer interaction; data mining; computer science; artificial intelligence

Special Issue Information

Dear Colleagues,

This Special Issue of Sensors is focused on original research involving the use of various audio and image sensing devices, both simultaneously and separately. The goal is to collect a diverse set of papers that span a wide range of analyses and possible applications.

Specifically, we are interested in papers that address the use of environment-based sensors, i.e., sensors placed on the ground, in the air, in water, etc., and the development of software utilizing the output of these sensors. Papers focused on the construction of optimized sensors are also welcome.

This Special Issue will cover, but is not limited to, the following areas:

  • digital signal processing
  • audio signal analysis
  • image analysis
  • pattern recognition
  • audio sensors
  • image sensors
  • multimodal and single-mode sensing
  • artificial intelligence
  • sensor applications

Dr. Alicja Wieczorkowska
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Sensors is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • digital signal processing
  • audio signal analysis
  • image analysis
  • pattern recognition
  • audio sensors
  • image sensors
  • multimodal and single-mode sensing
  • artificial intelligence
  • sensor applications

Published Papers (5 papers)


Research


18 pages, 2178 KiB  
Article
Performance Optimization in Frequency Estimation of Noisy Signals: Ds-IpDTFT Estimator
by Miaomiao Wei, Yongsheng Zhu, Jun Sun, Xiangyang Lu, Xiaomin Mu and Juncai Xu
Sensors 2023, 23(17), 7461; https://doi.org/10.3390/s23177461 - 28 Aug 2023
Viewed by 679
Abstract
This research presents a comprehensive study of the dichotomous search iterative parabolic discrete time Fourier transform (Ds-IpDTFT) estimator, a novel approach for fine frequency estimation in noisy exponential signals. The proposed estimator leverages a dichotomous search process before iterative interpolation estimation, which significantly reduces computational complexity while maintaining high estimation accuracy. An in-depth exploration of the relationship between the optimal parameter p and the unknown parameter δ forms the backbone of the methodology. Through extensive simulations and real-world experiments, the Ds-IpDTFT estimator exhibits superior performance relative to other established estimators, demonstrating robustness in noisy conditions and stability across varying frequencies. This efficient and accurate estimation method is a significant contribution to the field of signal processing and offers promising potential for practical applications.
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
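As a rough illustration of the general idea behind fine frequency estimation (not the authors' exact Ds-IpDTFT algorithm, which interpolates the DTFT iteratively after a dichotomous search), the sketch below refines a coarse FFT peak with parabolic interpolation on the log-magnitude spectrum. The test signal, sampling rate, and Hann window are illustrative assumptions:

```python
import numpy as np

def estimate_frequency(x, fs):
    """Coarse FFT peak plus parabolic interpolation on the log-magnitude
    spectrum; a generic fine-frequency estimator for a noisy sinusoid."""
    n = len(x)
    spec = np.abs(np.fft.rfft(x * np.hanning(n)))
    k = int(np.argmax(spec))                      # coarse bin index
    if 0 < k < len(spec) - 1:
        a, b, c = np.log(spec[k - 1 : k + 2])     # three log-magnitudes
        delta = 0.5 * (a - c) / (a - 2 * b + c)   # parabolic vertex offset
    else:
        delta = 0.0
    return (k + delta) * fs / n                   # refined frequency in Hz

# Noisy test tone between FFT bins (bin width here is ~7.8 Hz).
fs = 8000.0
t = np.arange(1024) / fs
x = np.cos(2 * np.pi * 440.7 * t) \
    + 0.05 * np.random.default_rng(0).standard_normal(1024)
f_hat = estimate_frequency(x, fs)   # well within one bin of 440.7 Hz
```

The interpolation step recovers the fractional bin offset (the role played by the unknown parameter δ in the paper) that a plain FFT-peak estimate discards.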

35 pages, 10828 KiB  
Article
Audiovisual Tracking of Multiple Speakers in Smart Spaces
by Frank Sanabria-Macias, Marta Marron-Romera and Javier Macias-Guarasa
Sensors 2023, 23(15), 6969; https://doi.org/10.3390/s23156969 - 5 Aug 2023
Viewed by 737
Abstract
This paper presents GAVT, a highly accurate audiovisual 3D tracking system based on particle filters and a probabilistic framework, employing a single camera and a microphone array. Our first contribution is a complex visual appearance model that accurately locates the speaker’s mouth. It transforms a Viola & Jones face detector classifier kernel into a likelihood estimator, leveraging knowledge from multiple classifiers trained for different face poses. Additionally, we propose a mechanism to handle occlusions based on the new likelihood’s dispersion. The audio localization proposal utilizes a probabilistic steered response power, representing cross-correlation functions as Gaussian mixture models. Moreover, to prevent tracker interference, we introduce a novel mechanism for associating Gaussians with speakers. The evaluation is carried out using the AV16.3 and CAV3D databases for Single- and Multiple-Object Tracking tasks (SOT and MOT, respectively). GAVT significantly improves the localization performance over audio-only and video-only modalities, with up to 50.3% average relative improvement in 3D when compared with the video-only modality. When compared to the state of the art, our audiovisual system achieves up to 69.7% average relative improvement for the SOT and MOT tasks in the AV16.3 dataset (2D comparison), and up to 18.1% average relative improvement in the MOT task for the CAV3D dataset (3D comparison).
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
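The core fusion idea, weighting one particle cloud by both an audio and a video observation likelihood, can be sketched in a minimal 1D bootstrap particle filter. Everything here is a toy stand-in (Gaussian likelihoods, simulated observations, 1D state) for GAVT's far richer appearance and steered-response-power models:

```python
import numpy as np

rng = np.random.default_rng(1)

def gaussian_lik(z, particles, sigma):
    # Likelihood of observation z at each particle position.
    return np.exp(-0.5 * ((z - particles) / sigma) ** 2)

# Ground-truth 1D speaker position drifting over 50 frames.
true_pos = np.cumsum(rng.normal(0.0, 0.05, size=50)) + 2.0

particles = rng.uniform(0.0, 4.0, size=500)   # initial particle cloud
weights = np.full(500, 1 / 500)
estimates = []

for pos in true_pos:
    # Simulated coarse audio and fine video observations of the speaker.
    z_audio = pos + rng.normal(0.0, 0.30)
    z_video = pos + rng.normal(0.0, 0.05)

    particles += rng.normal(0.0, 0.05, size=500)        # motion model
    weights *= gaussian_lik(z_audio, particles, 0.30)   # audio update
    weights *= gaussian_lik(z_video, particles, 0.05)   # video update
    weights /= weights.sum()

    estimates.append(np.sum(weights * particles))       # MMSE estimate

    # Resample when the effective sample size collapses.
    if 1.0 / np.sum(weights ** 2) < 250:
        idx = rng.choice(500, size=500, p=weights)
        particles, weights = particles[idx], np.full(500, 1 / 500)

err = np.mean(np.abs(np.array(estimates) - true_pos))
```

Multiplying the two likelihoods is what lets the accurate-but-occludable video cue and the always-available audio cue compensate for each other, which is the intuition behind the reported audiovisual gains.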

13 pages, 3555 KiB  
Article
High-Level CNN and Machine Learning Methods for Speaker Recognition
by Giovanni Costantini, Valerio Cesarini and Emanuele Brenna
Sensors 2023, 23(7), 3461; https://doi.org/10.3390/s23073461 - 25 Mar 2023
Cited by 4 | Viewed by 2294
Abstract
Speaker Recognition (SR) is a common task in AI-based sound analysis, involving structurally different methodologies such as Deep Learning or “traditional” Machine Learning (ML). In this paper, we compare and explore the two methodologies on the DEMoS dataset, consisting of 8869 audio files of 58 speakers in different emotional states. A custom CNN is compared to several pre-trained nets using image inputs of spectrograms and cepstral-temporal (MFCC) graphs. An ML approach based on acoustic feature extraction, selection, and multi-class classification by means of a Naïve Bayes model is also considered. Results show how a custom, less deep CNN trained on grayscale spectrogram images obtains the most accurate results, 90.15% on grayscale spectrograms and 83.17% on colored MFCC. AlexNet provides comparable results, reaching 89.28% on spectrograms and 83.43% on MFCC. The Naïve Bayes classifier provides an 87.09% accuracy and a 0.985 average AUC while being faster to train and more interpretable. Feature selection shows how F0, MFCC, and voicing-related features are the most characterizing for this SR task. The large number of training samples and the emotional content of the DEMoS dataset better reflect a real-case scenario for speaker recognition, and account for the generalization power of the models.
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
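The "traditional ML" branch of this comparison, a Gaussian Naïve Bayes classifier over per-utterance acoustic features, can be sketched from scratch. The synthetic 13-dimensional feature clusters below merely stand in for MFCC-like features of four hypothetical speakers; this is not the DEMoS data or the authors' feature set:

```python
import numpy as np

class GaussianNB:
    """Minimal Gaussian Naive Bayes: independent per-class, per-feature
    Gaussians plus a class prior."""
    def fit(self, X, y):
        self.classes = np.unique(y)
        self.mu = np.array([X[y == c].mean(axis=0) for c in self.classes])
        self.var = np.array([X[y == c].var(axis=0) + 1e-6 for c in self.classes])
        self.logprior = np.log(np.array([np.mean(y == c) for c in self.classes]))
        return self

    def predict(self, X):
        # Sum of per-feature Gaussian log-likelihoods, plus log prior.
        ll = -0.5 * (np.log(2 * np.pi * self.var[:, None, :])
                     + (X[None, :, :] - self.mu[:, None, :]) ** 2
                     / self.var[:, None, :])
        return self.classes[np.argmax(ll.sum(axis=2)
                                      + self.logprior[:, None], axis=0)]

rng = np.random.default_rng(0)
# Four synthetic "speakers", each a cluster in a 13-dim feature space.
centers = rng.normal(0.0, 3.0, size=(4, 13))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(100, 13)) for c in centers])
y = np.repeat(np.arange(4), 100)

model = GaussianNB().fit(X, y)
acc = np.mean(model.predict(X) == y)
```

The feature-independence assumption is what makes the model fast to train and easy to inspect, the trade-off against CNN accuracy that the paper quantifies.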

17 pages, 5040 KiB  
Article
An Efficient Pest Detection Framework with a Medium-Scale Benchmark to Increase the Agricultural Productivity
by Suliman Aladhadh, Shabana Habib, Muhammad Islam, Mohammed Aloraini, Mohammed Aladhadh and Hazim Saleh Al-Rawashdeh
Sensors 2022, 22(24), 9749; https://doi.org/10.3390/s22249749 - 12 Dec 2022
Cited by 8 | Viewed by 2403
Abstract
Insect pests and crop diseases are major problems for agricultural production, as the severity and extent of their occurrence cause significant crop losses. To increase agricultural production, it is important to protect crops from harmful pests, which is possible via soft computing techniques. Soft computing techniques are based on traditional machine learning and deep learning approaches. However, in traditional methods, manual feature extraction is ineffective, inefficient, and time-consuming, while deep learning techniques are computationally expensive and require a large amount of training data. In this paper, we propose an efficient pest detection method that accurately localizes pests and classifies them according to their class labels. In the proposed work, we modify the YOLOv5s model in several ways, such as extending the cross stage partial network (CSP) module, improving the select kernel (SK) in the attention module, and modifying the multiscale feature extraction mechanism, which plays a significant role in the detection and classification of small and large pests in an image. To validate the model's performance, we develop a medium-scale pest detection dataset that includes the five pests most harmful to agricultural products: ants, grasshoppers, palm weevils, shield bugs, and wasps. To check the model's effectiveness, we compare the results of the proposed model with several variations of the YOLOv5 model; the proposed model achieved the best results in the experiments. Thus, the proposed model has the potential to be applied in real-world applications and to further motivate research on pest detection to increase agricultural production.
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
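YOLO-family detectors such as the modified YOLOv5s used here produce many overlapping candidate boxes per pest, which are pruned by standard non-maximum suppression (NMS). The sketch below shows that generic post-processing step (not the authors' modified network itself); the boxes and scores are made-up examples:

```python
import numpy as np

def iou(box, boxes):
    """IoU of one [x1, y1, x2, y2] box against an array of boxes."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area = lambda b: (b[..., 2] - b[..., 0]) * (b[..., 3] - b[..., 1])
    return inter / (area(box) + area(boxes) - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy NMS: keep the best-scoring box, drop boxes overlapping it."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        order = order[1:][iou(boxes[i], boxes[order[1:]]) < thresh]
    return keep

# Two near-duplicate detections of one pest, plus a second distinct pest.
boxes = np.array([[10, 10, 50, 50], [12, 12, 52, 52], [100, 100, 140, 140]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)   # duplicate of the first pest is suppressed
```

Detecting both small and large pests, the motivation for the paper's multiscale changes, ultimately comes down to producing good candidate boxes at several scales before this pruning step.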

Review


26 pages, 493 KiB  
Review
A Survey on Low-Latency DNN-Based Speech Enhancement
by Szymon Drgas
Sensors 2023, 23(3), 1380; https://doi.org/10.3390/s23031380 - 26 Jan 2023
Cited by 3 | Viewed by 2936
Abstract
This paper presents recent advances in low-latency, single-channel, deep neural network-based speech enhancement systems. The sources of latency and their acceptable values in different applications are described. This is followed by an analysis of the constraints imposed on neural network architectures. Specifically, the causal units used in deep neural networks are presented and discussed in the context of their properties, such as the number of parameters, the receptive field, and computational complexity. This is followed by a discussion of techniques used to reduce the computational complexity and memory requirements of the neural networks used in this task. Finally, the techniques used by the winners of the latest speech enhancement challenges (DNS, Clarity) are shown and compared.
(This article belongs to the Special Issue Audio, Image, and Multimodal Sensing Techniques)
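The receptive-field and latency properties of causal units that the survey analyzes can be made concrete with a short calculation. The TCN-style stack below (kernel size 3, doubling dilations) and the 32 ms frame / 16 ms hop are illustrative assumptions, not figures from the survey:

```python
def receptive_field(kernel_sizes, dilations):
    """Receptive field (in frames) of a stack of causal dilated
    convolutions: each layer adds (kernel - 1) * dilation past frames."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))

# A typical TCN-style stack: kernel 3, dilations doubling per layer.
dilations = [1, 2, 4, 8, 16, 32]
rf_frames = receptive_field([3] * len(dilations), dilations)

# With strictly causal convolutions, algorithmic latency is set by the
# analysis frame and hop, not by the receptive field: the network may
# look far into the past without delaying the output.
frame_ms, hop_ms = 32, 16
lookback_ms = frame_ms + (rf_frames - 1) * hop_ms   # past context used
```

This separation, long past context at fixed output delay, is why causal dilated architectures dominate the low-latency enhancement systems the survey covers.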
