Audio and Acoustic Signal Processing

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Acoustics and Vibrations".

Deadline for manuscript submissions: closed (28 February 2023) | Viewed by 31231

Special Issue Editors


Prof. Dr. Yoshinobu Kajikawa
Guest Editor
Department of Electrical, Electronic and Information Engineering, Faculty of Engineering Science, Kansai University, Osaka 564-8680, Japan
Interests: audio and acoustic signal processing; active noise control; sound reproduction

Prof. Dr. Cheng-Yuan Chang
Guest Editor
Department of Electrical Engineering, Chung Yuan Christian University, Chung Li, Taoyuan 32023, Taiwan
Interests: real-time digital signal processing; sensors and measurements; active noise control

Special Issue Information

Dear Colleagues,

Acoustic signal processing has various applications, such as source separation, reverberation suppression, microphone arrays, echo cancellation, active noise control, speech enhancement, immersive sound, and sound field reproduction. Realizing these applications requires the study of suitable signal processing methods and algorithms. Furthermore, technologies related to electro-acoustic transducers, such as loudspeakers and microphones, are also essential. Research in this field continues to advance, with recent efforts raising new issues. In this Special Issue, we invite papers on the theory and application of the signal processing techniques required in audio and acoustic systems.

Prof. Dr. Yoshinobu Kajikawa
Prof. Dr. Cheng-Yuan Chang
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • audio and acoustic signal processing
  • source separation
  • speech enhancement
  • acoustic echo cancellation
  • active noise control
  • de-reverberation
  • microphone array
  • immersive audio
  • 3D sound reproduction

Published Papers (10 papers)


Research


11 pages, 2779 KiB  
Article
Enhanced Multiple Speakers’ Separation and Identification for VOIP Applications Using Deep Learning
by Amira A. Mohamed, Amira Eltokhy and Abdelhalim A. Zekry
Appl. Sci. 2023, 13(7), 4261; https://doi.org/10.3390/app13074261 - 28 Mar 2023
Cited by 1 | Viewed by 1413
Abstract
Institutions have been adopting work/study-from-home programs since the pandemic began. They primarily utilise Voice over Internet Protocol (VoIP) software to hold online meetings. This research introduces a new method to enhance the VoIP call experience using deep learning. In this paper, two existing techniques, Speaker Separation and Speaker Identification (SSI), are integrated using deep learning methods, with effective results as reported by state-of-the-art research. This integration is applied to a VoIP system. The voice signal is fed to the speaker separation and identification system to be separated; the “main speaker voice” is then identified and verified against any other human or non-human voices around the main speaker, and only this main speaker voice is sent over IP to continue the call. Current online call systems rely on noise cancellation and call quality enhancement, which does not address multiple human voices on the call; the filters used in the call process only remove noise and interference (de-noising) from the speech signal. The presented system is tested with up to four mixed human voices; it separates only the main speaker voice and processes it prior to transmission over the VoIP call. The paper also discusses the integration of these algorithms using DNNs, the advantages and challenges of the voice signal processing involved, and the importance of computing power for real-time applications. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
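
As a rough illustration of the separate-then-identify pipeline this abstract describes, the sketch below separates a mixture into candidate voices, scores each against an enrolled embedding of the main speaker, and keeps only the best match for transmission. The helper functions `separate_sources` and `speaker_embedding` are hypothetical stand-ins for the paper's deep learning models, not actual APIs.

```python
# Hypothetical sketch of the separate-then-identify pipeline; the two helper
# functions stand in for the paper's DNN separation and identification models.
import numpy as np

def separate_sources(mixture: np.ndarray, n_sources: int) -> list:
    """Hypothetical DNN separator: returns one estimated waveform per speaker."""
    raise NotImplementedError

def speaker_embedding(waveform: np.ndarray) -> np.ndarray:
    """Hypothetical speaker-identification model: returns an embedding vector."""
    raise NotImplementedError

def main_speaker_voice(mixture, enrolled, n_sources=4):
    """Keep only the separated voice closest to the enrolled main speaker."""
    candidates = separate_sources(mixture, n_sources)
    embeddings = [speaker_embedding(c) for c in candidates]
    scores = [e @ enrolled / (np.linalg.norm(e) * np.linalg.norm(enrolled))
              for e in embeddings]                    # cosine similarity
    return candidates[int(np.argmax(scores))]         # only this signal is sent over IP
```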

18 pages, 2427 KiB  
Article
A Feature Fusion Model with Data Augmentation for Speech Emotion Recognition
by Zhongwen Tu, Bin Liu, Wei Zhao, Raoxin Yan and Yang Zou
Appl. Sci. 2023, 13(7), 4124; https://doi.org/10.3390/app13074124 - 24 Mar 2023
Cited by 5 | Viewed by 1821
Abstract
The Speech Emotion Recognition (SER) algorithm, which aims to analyze the expressed emotion from speech, has always been an important topic in speech acoustic tasks. In recent years, the application of deep-learning methods has made great progress in SER. However, the small scale of emotional speech datasets and the lack of effective emotional feature representation still limit the development of research. In this paper, a novel SER method, combining data augmentation, feature selection and feature fusion, is proposed. First, to address the problems of inadequate samples in speech emotion datasets and unbalanced numbers of samples across categories, a speech data augmentation method, Mix-wav, is proposed and applied to the audio of the same emotion category. Then, on the one hand, a Multi-Head Attention mechanism-based Convolutional Recurrent Neural Network (MHA-CRNN) model is proposed to further extract the spectrum vector from the Log-Mel spectrum. On the other hand, a Light Gradient Boosting Machine (LightGBM) is used for feature set selection and feature dimensionality reduction in four global emotion feature sets, and more effective emotion statistical features are extracted for feature fusion with the previously extracted spectrum vector. Experiments are carried out on the public dataset Interactive Emotional Dyadic Motion Capture (IEMOCAP) and the Chinese Hierarchical Speech Emotion Dataset of Broadcasting (CHSE-DB). The experiments show that the proposed method achieves unweighted average test accuracies of 66.44% and 93.47%, respectively. Our research shows that the global feature set after feature selection can supplement the features extracted by a single deep-learning model through feature fusion to achieve better classification accuracy. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
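
A minimal sketch of the selection-and-fusion step described above, assuming a precomputed matrix of global statistical features and embeddings from the deep model: LightGBM ranks the handcrafted features, and the top-ranked ones are concatenated with the deep spectrum vector. Shapes and parameter values are illustrative assumptions, not the paper's configuration.

```python
# Sketch: LightGBM-based feature selection followed by fusion with the deep
# spectrum vector. Dimensions and hyperparameters are assumptions.
import numpy as np
import lightgbm as lgb

def select_and_fuse(X_global, y, deep_embeddings, k=128):
    """X_global: (N, D) global statistical features; deep_embeddings: (N, E)."""
    model = lgb.LGBMClassifier(n_estimators=200)
    model.fit(X_global, y)
    top_k = np.argsort(model.feature_importances_)[::-1][:k]   # most informative dims
    fused = np.concatenate([X_global[:, top_k], deep_embeddings], axis=1)
    return fused, top_k
```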

15 pages, 11962 KiB  
Article
Speech Enhancement Based on Two-Stage Processing with Deep Neural Network for Laser Doppler Vibrometer
by Chengkai Cai, Kenta Iwai and Takanobu Nishiura
Appl. Sci. 2023, 13(3), 1958; https://doi.org/10.3390/app13031958 - 2 Feb 2023
Viewed by 1347
Abstract
The development of distant-talk measurement systems has been attracting attention since they can be applied to many situations such as security and disaster relief. One such system that uses a device called a laser Doppler vibrometer (LDV) to acquire sound by measuring an object’s vibration caused by the sound source has been proposed. Different from traditional microphones, an LDV can pick up the target sound from a distance even in a noisy environment. However, the acquired sounds are greatly distorted due to the object’s shape and frequency response. Due to the particularity of the degradation of observed speech, conventional methods cannot be effectively applied to LDVs. We propose two speech enhancement methods that are based on two-stage processing with deep neural networks for LDVs. With the first proposed method, the amplitude spectrum of the observed speech is first restored. The phase difference between the observed and clean speech is then estimated using the restored amplitude spectrum. With the other proposed method, the low-frequency components of the observed speech are first restored. The high-frequency components are then estimated by the restored low-frequency components. The evaluation results indicate that they improved the observed speech in sound quality, deterioration degree, and intelligibility. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
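
The sketch below illustrates the structure of the first two-stage idea (restore the amplitude spectrum, then estimate the phase difference from the restored amplitude) using small fully connected networks; the actual architectures, features, and training targets used in the paper are not reproduced here.

```python
# Minimal two-stage sketch for LDV speech enhancement (amplitude first, then
# phase difference); the network sizes below are placeholder assumptions.
import torch
import torch.nn as nn

class TwoStageLDVEnhancer(nn.Module):
    def __init__(self, n_bins=257, hidden=512):
        super().__init__()
        # Stage 1: restore the clean log-amplitude spectrum from the LDV observation.
        self.amp_net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(), nn.Linear(hidden, n_bins))
        # Stage 2: estimate the phase difference from the restored amplitude.
        self.phase_net = nn.Sequential(
            nn.Linear(n_bins, hidden), nn.ReLU(), nn.Linear(hidden, n_bins))

    def forward(self, observed_log_amp):
        restored_amp = self.amp_net(observed_log_amp)
        phase_diff = self.phase_net(restored_amp)
        return restored_amp, phase_diff
```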

12 pages, 524 KiB  
Article
Efficient Realization for Third-Order Volterra Filter Based on Singular Value Decomposition
by Yuya Nakahira, Kenta Iwai and Yoshinobu Kajikawa
Appl. Sci. 2022, 12(21), 10710; https://doi.org/10.3390/app122110710 - 22 Oct 2022
Viewed by 1176
Abstract
Nonlinear distortion in loudspeaker systems degrades sound quality and must be properly compensated for by linearization techniques. One technique to reduce nonlinear distortion is to use a Volterra filter, which approximates the nonlinearity of the target loudspeaker using the Volterra series expansion. In general, the Volterra filter is computationally very expensive, and the amount of computation needs to be reduced for real-time processing. In this paper, we propose an efficient implementation of the third-order Volterra filter based on singular value decomposition. The proposed method determines the necessary coefficients based on the symmetry of the third-order Volterra filter and applies singular value decomposition to them. In the filter structure consisting of singular values and their corresponding singular vectors, the computational complexity of the third-order Volterra filter can be reduced by eliminating the parts of the filter with small singular values. By focusing on the magnitude of the singular values, the proposed method can improve the computational efficiency of the third-order Volterra filter without decreasing its approximation accuracy. Simulation results show that the proposed method can improve the computational efficiency by 60% while maintaining a nonlinear distortion compensation performance of about 8 dB for a micro-speaker for smartphones. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
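
A simplified illustration of the truncation idea: matricize a third-order kernel, take its SVD, and keep only the singular components needed to retain most of the energy. The mode-1 unfolding and the energy threshold are assumptions for illustration; the paper's filter structure exploiting kernel symmetry is not reproduced.

```python
# Sketch: low-rank truncation of a third-order Volterra kernel via SVD of a
# matricized kernel. Unfolding and threshold are illustrative assumptions.
import numpy as np

def truncated_kernel(h3, energy_keep=0.99):
    """h3: (N, N, N) third-order Volterra kernel. Returns a low-rank factorization."""
    N = h3.shape[0]
    H = h3.reshape(N, N * N)                          # mode-1 unfolding of the kernel
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    energy = np.cumsum(s ** 2) / np.sum(s ** 2)
    keep = int(np.searchsorted(energy, energy_keep)) + 1
    return U[:, :keep], s[:keep], Vt[:keep]           # keep only the dominant components

# Example with a random, rapidly decaying 16-tap kernel
h3 = np.random.randn(16, 16, 16) * np.exp(-np.arange(16.0))[:, None, None]
U, s, Vt = truncated_kernel(h3)
print(f"retained {len(s)} of 16 singular components")
```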

14 pages, 1564 KiB  
Article
Sound Source Localization Indoors Based on Two-Level Reference Points Matching
by Shuopeng Wang, Peng Yang and Hao Sun
Appl. Sci. 2022, 12(19), 9956; https://doi.org/10.3390/app12199956 - 3 Oct 2022
Cited by 1 | Viewed by 1060
Abstract
A dense sample point layout is the conventional approach to ensuring positioning accuracy for fingerprint-based sound source localization (SSL) indoors. However, matching against a large number of reference points (RPs) in the online phase may greatly reduce positioning efficiency. In response to this compelling problem, a two-level matching strategy is adopted to shrink the search scope for adjacent RPs. In the first-level matching process, two different methods are adopted to shrink the search scope of the online phase in a simple scene and a complex scene. Because adjacent samples show high similarity over the global range in a simple scene, a greedy search method is adopted for fast searching of the sub-database that contains the adjacent RPs. In contrast, because the high similarity between adjacent samples is confined to specific local areas in a complex scene, a clustering method is used for database partitioning, and the RP search scope can be compressed by sub-database matching. Experimental results show that the two-level RP matching strategy can effectively improve the RP matching efficiency for the two different typical indoor scenes while maintaining positioning accuracy. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
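
A minimal sketch of the two-level idea for the complex-scene case: the RP fingerprints are clustered offline into sub-databases, and a query is matched first against the cluster centroids and only then against the RPs inside the selected sub-database. Fingerprint dimensionality, the number of clusters, and the distance metric are assumptions.

```python
# Two-level RP matching sketch: offline clustering into sub-databases, online
# centroid matching followed by nearest-RP search within the chosen cluster.
import numpy as np
from sklearn.cluster import KMeans

class TwoLevelMatcher:
    def __init__(self, fingerprints, positions, n_clusters=8):
        self.fp = np.asarray(fingerprints, dtype=float)   # (N, D) RP fingerprints
        self.pos = np.asarray(positions, dtype=float)     # (N, 2) RP coordinates
        self.km = KMeans(n_clusters=n_clusters, n_init=10).fit(self.fp)

    def locate(self, query):
        query = np.asarray(query, dtype=float)
        cluster = self.km.predict(query[None, :])[0]      # level 1: choose sub-database
        idx = np.where(self.km.labels_ == cluster)[0]
        d = np.linalg.norm(self.fp[idx] - query, axis=1)  # level 2: RPs in that cluster
        return self.pos[idx[np.argmin(d)]]
```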

16 pages, 7041 KiB  
Article
Signal Enhancement of Helicopter Rotor Aerodynamic Noise Based on Cyclic Wiener Filtering
by Chengfeng Wu, Chunhua Wei, Yong Wang and Yang Gao
Appl. Sci. 2022, 12(13), 6632; https://doi.org/10.3390/app12136632 - 30 Jun 2022
Cited by 1 | Viewed by 1204
Abstract
The research on helicopter rotor aerodynamic noise becomes imperative with the wide use of helicopters in civilian fields. In this study, a signal enhancement method based on cyclic Wiener filtering was proposed given the cyclostationarity of rotor aerodynamic noise. The noise was adaptively filtered out by performing a group of frequency shifts on the input signal. According to the characteristics of rotor aerodynamic noise, a detection function was constructed to realize the long-distance detection of helicopters. The flight data of the Robinson R44 helicopter was obtained through the field flight experiment and employed as the research object for analysis. The detection range of the Robinson R44 helicopter after cyclic Wiener filtering was increased from 4.114 km to 17.75 km, verifying the feasibility and effectiveness of the proposed method. The efficacy of the proposed detection method was demonstrated and compared in the far-field flight test measurements of the Robinson R44 helicopter. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
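
A rough sketch of the frequency-shift (FRESH) structure behind cyclic Wiener filtering: the input is shifted by a set of cyclic frequencies, each branch is given a short FIR filter, and the taps are fitted by least squares against a reference segment. The cyclic frequencies, reference signal, and filter length are illustrative assumptions, not the paper's configuration.

```python
# FRESH-filter sketch for enhancing a cyclostationary component in noise.
import numpy as np

def fresh_filter(x, d, cyclic_freqs, fs, n_taps=32):
    """x: noisy input; d: desired (cyclostationary) reference; both length N."""
    n = np.arange(len(x))
    # Frequency-shifted branches at the assumed cyclic frequencies of the rotor noise.
    branches = [x * np.exp(2j * np.pi * alpha * n / fs) for alpha in cyclic_freqs]
    cols = [np.roll(b, k) for b in branches for k in range(n_taps)]  # circular delays (simplification)
    A = np.stack(cols, axis=1)                                 # regression matrix
    w, *_ = np.linalg.lstsq(A, d.astype(complex), rcond=None)  # Wiener (least-squares) taps
    return (A @ w).real                                        # enhanced output
```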

17 pages, 3826 KiB  
Article
A Deep Learning Method for DOA Estimation with Covariance Matrices in Reverberant Environments
by Qinghua Huang and Weilun Fang
Appl. Sci. 2022, 12(9), 4278; https://doi.org/10.3390/app12094278 - 23 Apr 2022
Cited by 3 | Viewed by 1398
Abstract
Acoustic source localization in the spherical harmonic domain with reverberation has hitherto not been extensively investigated. Moreover, deep learning frameworks have been utilized to estimate the direction-of-arrival (DOA) with spherical microphone arrays under environments with reverberation and noise for low computational complexity and high accuracy. This paper proposes three different covariance matrices as the input features and two different learning strategies for the DOA task. There is a progressive relationship among the three covariance matrices. The second matrix can be obtained by processing the first matrix and it effectively filters out the effects of the microphone array and mode strength to some extent. The third matrix can be obtained by processing the second matrix and it further efficiently removes information irrelevant to location information. In terms of the strategies, the first strategy is a regular learning strategy, while the second strategy is to split the task into three parts to be performed in parallel. Experiments were conducted both on the simulated and real datasets to show that the proposed method has higher accuracy than the conventional methods and lower computational complexity. Thus, the proposed method can effectively resist reverberation and noise. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
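
A minimal sketch of the kind of first-level input feature described above: a per-frequency spatial covariance matrix of the array's STFT frames, split into real and imaginary parts for the network. The array size, frame layout, and normalization are assumptions; the paper's second and third matrices (which further remove array and mode-strength effects) are not reproduced here.

```python
# Sketch: spatial covariance matrices as DNN input features for DOA estimation.
import numpy as np

def covariance_feature(stft_frames):
    """stft_frames: (M, T, F) complex STFT of an M-microphone array."""
    M, T, F = stft_frames.shape
    feats = []
    for f in range(F):
        X = stft_frames[:, :, f]                    # (M, T) snapshots at this bin
        R = (X @ X.conj().T) / T                    # (M, M) spatial covariance matrix
        feats.append(np.stack([R.real, R.imag]))    # split into real/imaginary channels
    return np.stack(feats)                          # (F, 2, M, M) network input
```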

18 pages, 2571 KiB  
Article
3-D Sound Image Reproduction Method Based on Spherical Harmonic Expansion for 22.2 Multichannel Audio
by Kenta Iwai, Hiromu Suzuki and Takanobu Nishiura
Appl. Sci. 2022, 12(4), 1994; https://doi.org/10.3390/app12041994 - 14 Feb 2022
Cited by 1 | Viewed by 1631
Abstract
In this paper, we propose a three-dimensional (3-D) sound image reproduction method based on spherical harmonic (SH) expansion for 22.2 multichannel audio. 22.2 multichannel audio is a 3-D sound field reproduction system that has been developed for ultra-high definition television (UHDTV). This system can reproduce 3-D sound images by simultaneously driving 22 loudspeakers and two sub-woofers. To control the 3-D sound image, vector base amplitude panning (VBAP) is conventionally used. VBAP can control the direction of 3-D sound image by weighting the input signal and emitting it from three loudspeakers. However, VBAP cannot control the distance of the 3-D sound image because it calculates the weight by only considering the image’s direction. To solve this problem, we propose a novel 3-D sound image reconstruction method based on SH expansion. The proposed method can control both the direction and distance of the 3-D sound image by controlling the sound directivity on the basis of spherical harmonics (SHs) and mode matching. The directivity of the 3-D sound image is obtained in the SH domain. In addition, the distance of the 3-D sound image is represented by the mode strength. The signal obtained by the proposed method is then emitted from loudspeakers and the 3-D sound image can be reproduced accurately with consideration of not only the direction but also the distance. A number of experimental results show that the proposed method can control both the direction and distance of 3-D sound images. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
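
For reference, the conventional VBAP baseline discussed in the abstract computes amplitude gains for a loudspeaker triplet so that the panned direction matches the target, as sketched below; the loudspeaker directions are placeholder values, and the proposed SH-domain mode-matching method itself is not reproduced.

```python
# VBAP gain sketch: solve g1*l1 + g2*l2 + g3*l3 = p for a loudspeaker triplet,
# then normalize for constant power. Directions below are placeholders.
import numpy as np

def vbap_gains(source_dir, speaker_dirs):
    """source_dir: (3,) unit vector; speaker_dirs: rows are 3 loudspeaker unit vectors."""
    L = np.asarray(speaker_dirs, dtype=float)        # (3, 3), one loudspeaker per row
    g = np.linalg.solve(L.T, np.asarray(source_dir, dtype=float))
    return g / np.linalg.norm(g)                     # constant-power normalization

# Example triplet (placeholder directions) and a target direction between them
triplet = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
print(vbap_gains([0.6, 0.6, 0.52], triplet))
```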

Review


25 pages, 2693 KiB  
Review
Mouth Sounds: A Review of Acoustic Applications and Methodologies
by Norberto E. Naal-Ruiz, Erick A. Gonzalez-Rodriguez, Gustavo Navas-Reascos, Rebeca Romo-De Leon, Alejandro Solorio, Luz M. Alonso-Valerdi and David I. Ibarra-Zarate
Appl. Sci. 2023, 13(7), 4331; https://doi.org/10.3390/app13074331 - 29 Mar 2023
Viewed by 5668
Abstract
Mouth sounds serve several purposes, from the clinical diagnosis of diseases to emotional recognition. The following review aims to synthesize and discuss the different methods to apply, extract, analyze, and classify the acoustic features of mouth sounds. The most analyzed features were the zero-crossing rate, power/energy-based, and amplitude-based features in the time domain; and tonal-based, spectral-based, and cepstral features in the frequency domain. Regarding acoustic feature analysis, t-tests, variations of analysis of variance, and Pearson’s correlation tests were the statistical tests most used for feature evaluation, while support vector machines and Gaussian mixture models were the most-used machine learning methods for pattern recognition. Neural networks were employed according to data availability. The main applications of mouth sound research were physical and mental condition monitoring. Nonetheless, other applications, such as communication, were included in the review. Finally, the limitations of the studies are discussed, indicating the need for standard procedures for mouth sound acquisition and analysis. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
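
As a small illustration of the most-cited features in the review, the snippet below extracts the zero-crossing rate (time domain) and MFCCs (cepstral, frequency domain) with librosa; the file path is a placeholder.

```python
# Extract two of the commonly analyzed mouth-sound features with librosa.
import librosa

y, sr = librosa.load("mouth_sound.wav", sr=None)           # placeholder path
zcr = librosa.feature.zero_crossing_rate(y)                 # (1, frames) time-domain feature
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, frames) cepstral features
print(zcr.mean(), mfcc.mean(axis=1))
```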

13 pages, 591 KiB  
Review
Overview of Voice Conversion Methods Based on Deep Learning
by Tomasz Walczyna and Zbigniew Piotrowski
Appl. Sci. 2023, 13(5), 3100; https://doi.org/10.3390/app13053100 - 28 Feb 2023
Cited by 6 | Viewed by 12818
Abstract
Voice conversion is a process where the essence of a speaker’s identity is seamlessly transferred to another speaker, all while preserving the content of their speech. This is accomplished using algorithms that blend speech processing techniques, such as speech analysis, speaker classification, and vocoding. Cutting-edge voice conversion technology is characterized by deep neural networks that effectively separate a speaker’s voice from their linguistic content. This article offers a comprehensive overview of the development status of this area of science based on the current state-of-the-art voice conversion methods. Full article
(This article belongs to the Special Issue Audio and Acoustic Signal Processing)
