MiniatureVQNet: A Light-Weight Deep Neural Network for Non-Intrusive Evaluation of VoIP Speech Quality

Kumalija, Elhard James; Nakamoto, Yukikazu

doi:10.3390/app13042455

by Elhard James Kumalija * and Yukikazu Nakamoto

Graduate School of Applied Informatics, University of Hyogo, Kobe 651-2197, Hyogo, Japan

* Author to whom correspondence should be addressed.

Appl. Sci. 2023, 13(4), 2455; https://doi.org/10.3390/app13042455

Submission received: 3 January 2023 / Revised: 6 February 2023 / Accepted: 10 February 2023 / Published: 14 February 2023

(This article belongs to the Special Issue Deep Learning for Speech Processing)

Abstract

In IP audio systems, audio quality is degraded by environmental noise, poor network quality, and encoding–decoding algorithms. Therefore, there is a need for a continuous automatic quality evaluation of the transmitted audio. Speech quality monitoring in VoIP systems enables autonomous system adaptation. Furthermore, there are diverse IP audio transmitters and receivers, from high-performance computers and mobile phones to low-memory and low-computing-capacity embedded systems. This paper proposes MiniatureVQNet, a single-ended speech quality evaluation method for VoIP audio applications based on a lightweight deep neural network (DNN) model. The proposed model can predict the audio quality independent of the source of degradation, whether noise or network, and is light enough to run in embedded systems. Two variations of the proposed MiniatureVQNet model were evaluated: a MiniatureVQNet model trained on a dataset that contains environmental noise only, referred to as MiniatureVQNet–Noise, and a second model trained on both noise and network distortions, referred to as MiniatureVQNet–Noise–Network. The proposed MiniatureVQNet model outperforms the traditional P.563 method in terms of accuracy on all tested network conditions and environmental noise parameters. The mean squared error (MSE) of the models compared to the PESQ score for ITU-T P.563, MiniatureVQNet-Noise, and MiniatureVQNet–Noise–Network was 2.19, 0.34, and 0.21, respectively. The performance of both the MiniatureVQNet–Noise–Network and MiniatureVQNet-Noise model depends on the noise type for an SNR greater than 0 dB and less than 10 dB. In addition, training on a noise–network-distorted speech dataset improves the model prediction accuracy in all VoIP environment distortions compared to training the model on a noise-only dataset.

Keywords:

single-ended speech quality evaluation; noise–network-distorted speech data; embedded DNN; DNN network optimization

1. Introduction

Since the invention of the telephone, voice-based telecommunication has been widely adopted. Audio communication involves audio signal capturing, processing, and transmission. The receiving end processes the received signal to output the reconstructed captured/original audio signal. This process introduces audio degradation. The introduced degradation affects the quality of service offered to users. Therefore, there is high interest in monitoring audio quality, mainly when the audio is transmitted over an Internet protocol (IP) network, the most commonly used global network. The IP network is a best-effort delivery network whereby the network does not guarantee that data are delivered or meet any quality of service, as the network performance depends entirely on the traffic load.

Audio quality degradation in IP audio systems is introduced by environmental noise, poor network quality, and audio encoding–decoding algorithms. The quality of audio signals in IP networks depends much on the network’s performance at a given instant. In VoIP applications such as telephone and video conferences involving two-way communications between humans, when the audio signal deteriorates, the involved parties can easily notice the situation and take mitigation measures. However, in some IP audio applications, such as PA systems, Internet radio, and music streaming apps, the communication is always one way, from the transmitter to the receivers. There is no way for the transmitter to know the quality at the receiver end. Therefore, there is a need for an automatic quality evaluation of the transmitted audio. Speech quality monitoring in VoIP systems enables autonomous system adaptation, e.g., changing encoding parameters or re-announcing. Furthermore, the IP network is shared with other applications. Hence, monitoring can detect when the network is overloaded and automatically switch to low-bitrate encoding.

For a long time, voice over IP (VoIP) has been the main form of audio communication in IP networks. Increased IP network bandwidth, lower cost, and wider accessibility have enabled the introduction of more IP audio services, such as Internet radio and music streaming applications. Currently, voice/speech and music are the major categories of audio transmitted over IP networks. Music and speech share common audio signal characteristics but have different objectives for listeners. High-quality speech and music audio signals should be of the same fidelity as the source after processing and transmission. Moreover, for speech communication, how easily the uttered words can be understood in the received speech signal after processing is very important. This measure is called intelligibility.

Speech quality measures assess the voice quality of a speaker’s utterance, including attributes such as natural, raspy, hoarse, scratchy, and so on. Speech quality is highly subjective and difficult to evaluate reliably because individual listeners have different standards of what constitutes good or poor quality, resulting in large variations in rating scores among listeners listening to the same audio. Intelligibility measures assess “what words the speaker said” compared to the “words the listener heard”; that is, the meaning or the content of the spoken words understood by the listener. Unlike speech quality, speech intelligibility is not subjective. It can be easily measured by presenting speech material (sentences, words, etc.) to a group of listeners and asking them to identify the words spoken. Intelligibility is quantified by counting the number of words or phonemes identified correctly. The relationship between speech intelligibility and speech quality is not fully understood. This is partly because the acoustic correlates of quality and intelligibility have not yet been identified [1]. Speech can be highly intelligible yet be of poor quality. Speech audio quality assessment can be performed using subjective listening tests or objective measures.

Mean opinion scores (MOSs) are the most widely used direct subjective speech quality evaluation method. The MOS is a categorical judgment method in which listeners rate the quality of the speech test signal using a five-point numerical scale, with five indicating “excellent” quality and one indicating “unsatisfactory” or “bad” quality. The measured quality of the speech test signal is obtained by averaging categorical scores from all listeners, and the average score is commonly referred to as the MOS. This method is recommended by the IEEE Subcommittee on Subjective Methods [2] as well as by ITU [3]. In addition, a subjective test methodology for evaluating speech in communication systems that include a noise suppression algorithm [4], which measures the perceived speech quality, is one of the most adopted methods. However, subjective speech quality measurement is expensive as it requires the recruitment of human subjects.
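As a minimal illustration of the averaging step described above, the following sketch computes a MOS from a set of listener ratings. The function name and the example ratings are hypothetical and are not data from this study.

```python
from statistics import mean


def mean_opinion_score(ratings):
    """Average categorical ratings (1 = bad ... 5 = excellent) into a MOS."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r not in (1, 2, 3, 4, 5) for r in ratings):
        raise ValueError("ratings must be integers from 1 to 5")
    return mean(ratings)


# Hypothetical ratings from eight listeners for one speech test signal.
example_ratings = [4, 3, 4, 5, 3, 4, 4, 2]
example_mos = mean_opinion_score(example_ratings)
```

With these illustrative ratings, the sum is 29 over eight listeners, so the MOS is 3.625.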

Furthermore, subjective quality measurement is not practically applicable in the production phase for security reasons (communication privacy) and, for instance, in automatic speech quality monitoring in VoIP. Therefore, objective quality measures are developed to model subjective listening tests and mathematically estimate the audio quality perceived by human beings. There are two main categories of objective quality measurement methods: end-to-end (intrusive) and single-ended (non-intrusive) methods.

End-to-end objective audio quality evaluation uses mathematical signal processing techniques to compare the VoIP transmitted/processed speech signal to the original speech signal. Objective speech measures calculate the speech quality by measuring the numerical “distance” between the VoIP transmitted/processed speech signal and the original speech signal. Currently, the widely used end-to-end automatic speech evaluation methods are perceptual evaluation of speech quality (PESQ) [5], perceptual objective listening quality analysis (POLQA) [6], PEMO-Q [7], and perceptual objective listening quality prediction, proposed in 2018, which supersedes POLQA. Furthermore, the perceptual evaluation of audio quality (PEAQ) [8] is used to evaluate speech and other audio signals such as music. These methods have been widely applied for designing and evaluating audio encoding and decoding algorithms.

End-to-end methods for objective audio quality evaluation require a clean (origin) reference signal and received (VoIP transmitted)/processed speech signal to evaluate the speech quality. This limits end-to-end method usability in general applications, as, in most cases in VoIP applications, the original speech (reference) signal is not readily available. The performance of IP networks changes based on the network load, affecting the quality of speech transmitted through the network. In the speech quality monitoring of VoIP applications, the interest is in continuously monitoring the quality of the transmitted speech signals. We only have access to the received signal at the monitoring point, whereas the original speech signal is unavailable. Single-ended objective speech quality measures estimate the speech quality using only received/processed signals without needing the original signal.

A single-ended (non-intrusive) objective measure of speech quality is a practically viable method for continuously monitoring the quality of speech delivered to a VoIP endpoint or a particular point in the network. Based on continuous quality assessment results, network traffic can be re-routed through a less congested route, or codec parameters can be changed, improving the service quality. Different single-ended methods have been proposed based on signal processing techniques or machine learning approaches.

Single-ended (non-intrusive) quality measures are suitable for continuously monitoring the received speech quality in Internet protocol networks, as there is no need for an original/reference signal. The IP network is the widely used global network implemented in Internet networks, local area networks (LANs), and enterprise networks. The wide availability of IP networks has driven the integration of voice/audio communication into the IP network. Voice communication services such as voice calls, conference systems, music streaming, and IP public addressing speakers are offered through IP networks. Therefore, it is desirable for single-ended speech quality objective measures to correlate well with subjective listening test MOS results. However, signal-processing-based single-ended speech quality evaluation results do not correlate well with subjective MOS test results.

Signal-processing-based methods such as the single-ended method for objective speech quality assessment in narrow-band telephony applications (P.563) [9] have been outperformed by recently proposed deep learning methods [10,11,12,13]. However, deep learning models depend heavily on the training dataset used. Previously proposed deep-learning-based single-ended methods [11,12,13] were limited to acoustic characteristics. In [10], the effect of telecommunication networks was considered using speech transmitted through cellular networks and telephone networks. However, cellular and telephone networks exhibit different properties from IP networks.

Furthermore, the diversity of IP audio transmitters and receivers is high, from high-performance personal computers and mobile phones to low-memory and low-computing-capacity embedded systems. However, deep learning methods have high resource demands, which impedes their deployment on low-end devices. Ideally, the objective speech quality measure in VoIP applications should be able to estimate the quality independent of the type of speech distortions introduced by the VoIP system, whether network distortions, speech encoding–decoding, or environmental noise. Moreover, the objective quality measure should support as diverse a range of devices as possible.

In this paper, we propose MiniatureVQNet, a single-ended speech quality evaluation method for VoIP audio applications based on a lightweight deep neural network (DNN) model. The proposed model can predict the audio quality independent of the source of degradation, whether noise or network, and is light enough to run in embedded systems. We examined the proposed model performance on different VoIP speech degradation factors, including network distortion and acoustic/environmental noise. Furthermore, different post-training optimization methods were considered to improve the performance in low-resource computing environments such as embedded systems.

2. Deep-Learning-Based Single-Ended Speech Quality Measures

Several DNN models for non-intrusive speech quality evaluation have been proposed. DNN models were designed for a specific task or general tasks. This section looks into different DNN model architectures and datasets used in training these models.

2.1. Model Network Architecture

Early work on machine-learning-based non-intrusive speech quality evaluation was based on Gaussian mixture probability models (GMMs) [14]. In [15], a model based on the standard ASR system that combines a feed-forward DNN (which serves as an acoustic model) with a hidden Markov model (HMM) was presented, and the model reached an average correlation of r = 0.87. A model based on a bidirectional LSTM network was proposed in [16]; this model focuses on speech enhancement. A DNN model based on a combination of a CNN and an RNN (LSTM) [17] showed promising results in predicting the quality of super-wideband speech transmission and the impact of packet loss concealment. In [18], temporal convolutional networks (TCNs) used learned representations for quality evaluation while maintaining the temporal structure of the signal.

2.2. Training Datasets

Parametric models evaluate the quality from a set of parameters without processing the actual audio signal samples. Parametric models are commonly used in VoIP applications and are derived from models such as the E-model [19]. Although the E-Model is the most extensively applied parametric objective assessment method, it was originally designed for conventional network planning. Parametric models evaluate the quality of experience (speech quality experienced by VoIP users) in VoIP networks based on network quality of service (QoS) mapping and machine learning algorithms [20,21,22,23].
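For context, the E-model condenses transmission impairments into a single rating factor R, which ITU-T G.107 then maps to an estimated MOS. The sketch below implements only that final mapping; computing R from delay, loss, and codec impairments is omitted, and the function name is illustrative rather than taken from the cited works.

```python
def r_factor_to_mos(r):
    """Map an E-model rating factor R to an estimated MOS (ITU-T G.107 mapping)."""
    if r <= 0:
        return 1.0
    if r >= 100:
        return 4.5
    # Cubic mapping defined for 0 < R < 100.
    return 1.0 + 0.035 * r + r * (r - 60.0) * (100.0 - r) * 7e-6


# Illustrative rating factors, from heavily impaired to nearly unimpaired.
example_mos_values = [r_factor_to_mos(r) for r in (50.0, 70.0, 93.2)]
```

Because the mapping depends only on network and device parameters, it cannot reflect environmental noise captured in the audio itself, which is the limitation of parametric models discussed in this section.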

Parametric model studies are primarily concerned with the effect of network conditions on the speech quality experienced by users. The effect of network packet loss is the most studied. The method proposed in [24] evaluated the influence of packet loss on Skype quality and proposed a simplified E-Model. In [25], the authors addressed the effect of burst packet losses on VoIP. In [26], the effect of different codecs and network conditions on the speech quality was examined. In [22], the E-model was improved for wireless communication systems based on wireless parameter values, taking into account current technologies such as MIMO. Moreover, an approach that considered the interactivity of voice communications was developed by Sun and Ifeachor [21], where the E-Model and PESQ were combined to evaluate the voice quality, and a nonlinear regression model of voice quality was proposed. Parametric models can evaluate distortions due to network conditions and the devices used. However, as they do not analyze the actual audio samples, parametric models are not applicable for evaluating speech quality when the samples are distorted by environmental noise.

Both network distortions and environmental noise affect speech quality, and speech signals carry the effect of both. Models trained on noise- and network-distorted speech signals can learn to predict the speech quality regardless of whether the speech is affected by network distortion, environmental noise, or both. In [23], a model trained and evaluated on real PSTN call recordings from 80 providers in more than 50 countries was proposed. After going through a gateway, the signals were routed across one of five PSTN carrier networks that Skype and Microsoft Teams use. Background noise was artificially added to the calls to model environmental noise. In [27], an open-source PSTN speech quality test model based on a dataset with over 1000 crowdsourced real phone calls was presented. The influence of file cropping on the perceived speech quality and the influence of the number of ratings and training size on the model accuracy were analyzed. The models proposed in [23,27] were trained on public network and crowdsourced data, and the distortion parameters such as packet loss and environmental noise were not controlled. Therefore, the speech quality distortions could not be independently analyzed, and the model accuracy in different conditions could not be verified. In this work, in addition to publicly available datasets, we trained a model on a controlled dataset, where all factors are known and analyzed.

3. Dataset

A wide variety of datasets are used to train DNN-based non-intrusive speech quality prediction models. For speech enhancement algorithms, artificially added environmental noise is commonly used. In addition, noise and speech transmitted through cellular networks, or a combination of the two, are used to assess the distortions of telecommunication networks.

The training speech dataset was created by combining publicly available datasets and our specifically generated dataset to cover a wider range of speech quality degradation. The noisy speech was taken from a noisy speech database for training speech enhancement algorithms and TTS models [28]. The noisy speech dataset does not contain any network-induced speech degradations. Speech quality score labels were calculated using PESQ [5] based on the clean reference speech. Moreover, the NISQA Corpus [29], an aggregation of approximately 14,000 speech samples from different datasets, was used. It contains simulated and live phone (mobile phone, Skype, Zoom, WhatsApp) recordings under different noise and network conditions. These samples are human-rated and labeled with MOS. The NISQA Corpus includes VoIP degradations, but the speech files do not contain information on noise and network condition parameters. To understand the prediction performance on different noise types and network conditions, a simulated noise–network-distorted speech dataset containing information on environmental noise conditions and network condition parameters was used. Environmental noise parameters were different noise types at different signal-to-noise ratios, whereas network parameters were packet loss, delay, and jitter.

3.1. Noise–Network-Distorted Dataset

In VoIP applications, speech distortion combines captured environmental noise, encoding–decoding, and transmission network distortions. The dataset for the combined effect of environmental noise, encoding–decoding, and transmission network distortions used to study the performance of automatic speech recognition systems on integrated noise–network-distorted speech [30] was used in this study. The dataset was generated by artificially corrupting the clean speech dataset with different environment noise at different signal-to-noise ratios and then transmitting the noise-mixed speech signal through the emulated network. As a result, the distorted noise–network dataset contains information on environmental noise conditions such as signal-to-noise ratios and noise types, and network condition parameters such as packet loss, delay, and jitter, as the data were generated in a controlled environment.

3.1.1. Effect of VoIP Network QoS on Speech Quality

The network condition in the VoIP system affects the quality of the transmitted speech. IP network conditions change with the network load. Commonly measured IP network conditions for network QoS include delay, jitter, bit rate, loss rate, forward error correction, and loss distribution. There is a substantial body of literature on the impact of network conditions on the perceived speech quality in VoIP communications [20,21,23,31]. In this study, we examined the effect of loss rate, burst packet loss, delay, and jitter, which are the most widely studied network condition parameters in VoIP applications.
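To make the distinction between random and burst packet loss concrete, the sketch below generates a loss pattern with a two-state (Gilbert-style) Markov chain and silences the lost frames. This is an illustrative simplification under our own assumptions: the dataset in this study was produced with an emulated network, and the function names, parameters, and zero-filling of lost frames are hypothetical.

```python
import random


def simulate_packet_loss(n_packets, p_loss_start, p_loss_end, seed=0):
    """Return a list of booleans, True where a packet is lost.

    From the 'received' state a loss burst starts with probability
    p_loss_start; from the 'lost' state the burst ends with probability
    p_loss_end. A large p_loss_end gives mostly isolated losses, whereas a
    small p_loss_end gives long bursts.
    """
    rng = random.Random(seed)
    lost = False
    pattern = []
    for _ in range(n_packets):
        if lost:
            lost = rng.random() >= p_loss_end
        else:
            lost = rng.random() < p_loss_start
        pattern.append(lost)
    return pattern


def apply_loss(frames, pattern):
    """Replace lost frames with silence (zeros); no loss concealment is applied."""
    return [[0.0] * len(frame) if is_lost else list(frame)
            for frame, is_lost in zip(frames, pattern)]
```

In this chain, the long-run loss rate is p_loss_start / (p_loss_start + p_loss_end) and the mean burst length is 1 / p_loss_end, so the loss rate and burstiness can be varied independently.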

3.1.2. Environmental Noise on Speech Quality

In VoIP applications, the microphone that captures speech also captures environmental noise. The environmental noise captured along with the intended speech affects the quality of the speech received at the far end. Several single-ended algorithms for estimating the quality of speech signals corrupted by noise have been proposed [14,32]. In this study, in combination with the network condition parameters, the following noise types were examined: babble (multiple people talking), car, exhibition hall, restaurant, street, airport, train station, and train. The noise signals were artificially added to studio-recorded speech to generate signals at SNRs of 0, 5, 10, and 15 dB.
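Adding noise at a prescribed SNR amounts to scaling the noise so that the ratio of speech power to noise power matches the target. The following is a minimal sketch of that step on plain sample lists; the function names are hypothetical, and the actual tooling used to build the dataset is not described here.

```python
import math


def mix_at_snr(speech, noise, target_snr_db):
    """Scale `noise` to the target SNR relative to `speech`, then add it."""
    if len(noise) < len(speech):
        raise ValueError("noise must be at least as long as speech")
    noise = noise[:len(speech)]
    p_speech = sum(s * s for s in speech) / len(speech)
    p_noise = sum(n * n for n in noise) / len(noise)
    if p_speech == 0 or p_noise == 0:
        raise ValueError("speech and noise must have non-zero power")
    # SNR_dB = 10 * log10(p_speech / (gain^2 * p_noise)), solved for gain.
    gain = math.sqrt(p_speech / (p_noise * 10 ** (target_snr_db / 10)))
    return [s + gain * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixture):
    """Measure the SNR of a mixture, given the clean speech it contains."""
    p_speech = sum(s * s for s in speech)
    p_residual = sum((m - s) ** 2 for s, m in zip(speech, mixture))
    return 10 * math.log10(p_speech / p_residual)
```

Calling `mix_at_snr` once per noise type and per SNR in (0, 5, 10, 15) would yield one noisy version of each utterance for every noise condition.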

Finally, the dataset parameters generated from environmental noise and network distortions are summarized in Table 1.

The generated dataset comprised three female and three male speakers that read sentences selected from the IEEE database. The sentences include all phonemes in the American English language. The generated noise–network-distorted speech database was divided into training and test datasets. The training subset was used for training the single-ended speech quality prediction model in combination with the publicly available datasets, the noisy speech database for training speech enhancement algorithms and TTS models [28], and NISQA Corpus [29]. The noise–network-distorted speech testing subset was used to analyze the prediction model in different acoustic and network conditions, as this dataset contains acoustic and network condition parameters.

4. Proposed MiniatureVQNet Model

The models proposed in this study have two parts: speech signal feature extraction and quality prediction. The feature extraction part is a modified standard DeepSpeech ASR system, and the prediction part is a stack of two bidirectional GRU layers followed by a fully connected layer. DeepSpeech is an open-source speech-to-text engine that uses a model trained by machine learning techniques based on Baidu’s study [33]. The DeepSpeech-based technique does not require hand-designed features to model the background noise, reverberation, or phoneme dictionary. Instead, it depends on large amounts of data for training. This was the motivation behind modifying the DeepSpeech model for feature extraction to avoid hand-designed features, as there are no well-known speech features for capturing transmission network distortions. The major obstacles to implementing deep neural networks (DNNs) on embedded systems are the large model size and the large number of operations needed for inference. However, several model optimization techniques can be applied to DNN models to deploy them on resource-constrained systems. The commonly used techniques are weight sharing, network pruning, knowledge distillation, quantization, and designing compact network architectures. This study applied two optimization techniques: designing a compact network architecture and reducing the model precision (model parameter quantization). As a result, we achieved a very lightweight model in the proposed network configuration, with only 5689 trainable parameters.

4.1. Network Architecture and Training

The proposed MiniatureVQNet architecture comprises fully connected and bidirectional GRU neural network layers with rectified linear unit (ReLU) activation. The network has four fully connected layers with ReLU activation, followed by two bidirectional gated recurrent unit (GRU) layers. The last bidirectional GRU layer feeds into a fully connected layer connected to the output layer. The first four fully connected layers have 32, 32, 32, and 8 neurons, respectively. The number of neurons in each GRU is 8, and the last two fully connected layers have 8 and 1 neurons, respectively. The model was trained using stochastic gradient descent with an exponential-decay learning rate schedule, with an initial learning rate of 0.01, 1000 decay steps, and a decay rate of 0.9. The training batch size was 64. Figure 1 shows the proposed network architecture.
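The layer sizes above can be checked against the reported 5689 trainable parameters with a short count. The sketch below rests on two assumptions that are not stated in the text: an input feature dimension of 32, and a GRU parameterization with two bias vectors per gate (the layout used by Keras with `reset_after=True`). Under these assumptions the count matches the reported figure; other combinations may match as well.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


def gru_params(n_in, units):
    """One GRU direction: three gates, each with input weights, recurrent
    weights, and two bias vectors (assumed layout, see lead-in)."""
    return 3 * units * (n_in + units + 2)


def bidirectional_gru_params(n_in, units):
    """Forward and backward GRUs have separate parameters."""
    return 2 * gru_params(n_in, units)


ASSUMED_INPUT_DIM = 32  # assumption: the input feature size is not given in the text


def count_parameters(input_dim=ASSUMED_INPUT_DIM):
    total = 0
    n_in = input_dim
    for n_out in (32, 32, 32, 8):   # four fully connected layers
        total += dense_params(n_in, n_out)
        n_in = n_out
    for _ in range(2):              # two bidirectional GRU layers, 8 units each
        total += bidirectional_gru_params(n_in, 8)
        n_in = 2 * 8                # forward and backward outputs are concatenated
    for n_out in (8, 1):            # final fully connected layer and output layer
        total += dense_params(n_in, n_out)
        n_in = n_out
    return total
```

With the assumed input dimension, the fully connected front end contributes 3432 parameters, the two bidirectional GRU layers 2112, and the last two layers 145, for a total of 5689.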

Model Optimization

Post-training DNN model optimization improves the model performance in resource-constrained environments. Quantization is a post-training model optimization technique that reduces the precision of weights and activations. As a result, quantized models are smaller and run faster on low-powered devices. The cost of post-training quantization is a reduced model accuracy. However, the benefit generally outweighs the accuracy loss: Gysel [34] demonstrated that networks could be condensed to use 8-bit dynamic fixed point for network weights and activations with less than 1% degradation in classification accuracy. In this study, we examined two post-training quantization methods: full-integer and float16 quantization. Full-integer quantization converts all weights and activation outputs into the nearest 8-bit fixed-point numbers. Float16 quantization converts the weights and activations to float16.
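The two precision reductions can be illustrated on a handful of weights. The sketch below uses a generic affine 8-bit scheme and IEEE 754 half-precision rounding; it is not the conversion toolchain used in this study, the function names are hypothetical, and activation quantization (which needs calibration data) is omitted.

```python
import struct


def quantize_int8(values):
    """Affine 8-bit quantization: map the value range onto [-128, 127]."""
    lo, hi = min(values), max(values)
    lo, hi = min(lo, 0.0), max(hi, 0.0)   # keep 0.0 exactly representable
    scale = (hi - lo) / 255.0 or 1.0      # fall back to 1.0 if all values are zero
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point


def dequantize_int8(q, scale, zero_point):
    """Recover approximate real values from 8-bit codes."""
    return [(x - zero_point) * scale for x in q]


def to_float16(values):
    """Round each value to the nearest IEEE 754 half-precision number."""
    return [struct.unpack("<e", struct.pack("<e", v))[0] for v in values]


# Hypothetical weights, used only to show the round-trip error.
example_weights = [-0.5, -0.1, 0.0, 0.2, 0.75]
q, scale, zero_point = quantize_int8(example_weights)
restored = dequantize_int8(q, scale, zero_point)
```

For these example weights, each restored value differs from the original by at most half a quantization step, which is the source of the small accuracy loss that quantization trades for a smaller and faster model.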

5. Experimental Evaluation

Ideally, the objective speech quality measure in VoIP applications should be able to estimate the quality independent of the type of speech distortions introduced by the VoIP system, regardless of whether they are network distortions, speech encoding–decoding, environmental noise, or the speech enhancement algorithm. This is highly challenging in IP audio applications as there are many factors, and the DNN training dataset must include at least all of these factors. In [35], the impact of network degradation features was studied independently on transmitted voice quality by using their MOS_LQO values and fixing the values of the remaining features. The results showed that each feature independently has an effect and can be used as a feature in the training set. In evaluating the proposed MiniatureVQNet single-ended method, the accuracy of the MiniatureVQNet was examined on different environmental noise features and transmission network quality features. The acoustic features considered during evaluation were the noise type and signal-to-noise ratios, whereas the network factors examined were the packet loss, delay, and jitter using a G722 encoder. Two variations of the proposed MiniatureVQNet model were evaluated. First, the MiniatureVQNet model was trained on a dataset with only environmental noise, referred to as MiniatureVQNet-Noise. Then, the second model was trained on noise and network distortions, referred to as MiniatureVQNet–Noise–Network.

5.1. General Performance

Both models, MiniatureVQNet–Noise and MiniatureVQNet–Noise–Network, outperformed ITU-T P.563 in accuracy. The mean squared error (MSE) of the models compared to the PESQ score for ITU-T P.563, MiniatureVQNet–Noise, and MiniatureVQNet–Noise–Network was 2.19, 0.34, and 0.21, respectively. A further examination of the difference in the MSE between P.563 and the MiniatureVQNet variations shows that the difference is attributed to the failure of the P.563 model to accurately predict the quality when speech distortions are low or the SNR is high, as shown in Figure 2.
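The MSE values reported here compare each model's predicted score with the PESQ score of the same utterance. A minimal sketch of that computation follows; the function name and the example scores are hypothetical and are not results from this study.

```python
def mean_squared_error(predicted, reference):
    """Average squared difference between predicted and reference scores."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - r) ** 2 for p, r in zip(predicted, reference)) / len(predicted)


# Hypothetical predicted scores and PESQ reference scores for three utterances.
example_predicted = [3.0, 2.5, 4.0]
example_pesq = [3.5, 2.5, 3.0]
example_mse = mean_squared_error(example_predicted, example_pesq)
```

Because errors are squared, a few large misses, such as those of P.563 on lightly distorted speech, dominate the average.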

While both the MiniatureVQNet–Noise and MiniatureVQNet–Noise–Network models show a higher performance than P.563, the model trained on the noise–network speech dataset (MiniatureVQNet–Noise–Network) has a higher accuracy than the model trained on the noise-only dataset (MiniatureVQNet–Noise). Therefore, for single-ended speech quality monitoring in VoIP applications, models trained on noise-only datasets are outperformed by models trained on noise–network speech datasets.

5.2. The Effect of Noise and Network Distortion on Prediction Accuracy

The noise-distorted speech data without any network distortion were used to examine the individual effect of noise distortions and combined noise and network distortion on speech data. Then, the general network distortion effect for each SNR was observed. The performances of MiniatureVQNet–Noise and MiniatureVQNet–Noise-Network were compared in both cases, as shown in Figure 3.

The MSE of the MiniatureVQNet–Noise–Network model is lower than that of MiniatureVQNet–Noise. At a 0 dB signal-to-noise ratio, the MSE for the MiniatureVQNet–Noise–Network and MiniatureVQNet–Noise models was 0.237 and 0.267, respectively. However, with the increase in SNR, both models’ performances decreased. For example, at an SNR of 15 dB, the MSE of MiniatureVQNet–Noise was 0.452, whereas that of MiniatureVQNet–Noise–Network was 0.438. The performance decreases as the effect of network distortions is less at a high SNR than at a low SNR. Still, the noise–network-trained model outperforms the noise-only-trained model at all SNRs.

In the noise-only dataset, a dataset without network distortions (no packet loss, a delay of less than 3 ms, and no jitter), both models, MiniatureVQNet–Noise and MiniatureVQNet–Noise–Network, exhibited the same performance. There is no significant difference between the model trained on the noise–network dataset and the model trained on the noise-only dataset, as shown in Figure 4. This is interesting, as training on the noise–network dataset improves the general performance without a decrease in performance on the noise-only dataset. The model trained on the noise–network-distorted dataset can learn network distortion features without a loss of knowledge on noise features. Thus, the proposed model is suitable for deployment in noise and noise–network conditions.

5.3. The Effect of Noise Type and Network Distortion on Prediction Accuracy

Different noise types have different characteristics. Therefore, we intend to understand the influence of various noise types at different SNR values on the performance of MiniatureVQNet–Noise and MiniatureVQNet–Noise–Network models. Moreover, we intend to understand how network distortion impacts different noise types.

Noise type affects the model performance differently as shown in Figure 5, where a comparison of street and train station noise is plotted. Train station noise constitutes noise from different sources, such as approaching trains, broadcasting loudspeakers, and conversation from nearby people. Street noise constitutes noise from passing cars, singing birds, and other sources. The performance of both the MiniatureVQNet–Noise–Network and MiniatureVQNet–Noise model depends on the noise type for an SNR greater than 0 dB and less than 10 dB. Moreover, for all models, the performance on the station noise was lower than that of the street noise for all SNR values. Furthermore, the model trained on the noise–network dataset exhibits a higher performance than the model trained on a noisy dataset.

When network transmission errors further distorted the noise-distorted speech, there was no difference in the performance of the noise–network-trained model and noise-only-trained model on different noise types, as shown in Figure 6. Network distortion of the noisy speech does not change the effect of noise type. The DNN model prediction accuracy depends on the noise type. Nevertheless, training the model on the noise–network dataset improves the model accuracy on all SNRs and noise types. MiniatureVQNet–Noise–Network is more robust than MiniatureVQNet–Noise on all SNRs and noise types.

5.4. Effect of Jitter on Prediction Accuracy

Figure 7 shows the effect of jitter on the performance of the MiniatureVQNet–Noise–Network and MiniatureVQNet–Noise models. We examined the effect of jitter at a 200 ms delay. The largest performance difference between the models occurred at a jitter of 10% of the delay, which is 20 ms. At this jitter, the model trained on the noise-only dataset showed an MSE of 0.268, whereas the noise–network-trained model showed an MSE of 0.211. At the other jitter values, the noise–network-trained model also outperforms the noise-only-trained model, but by a smaller margin. Training the model on the noise–network speech dataset thus improves the model's accuracy under different jitter distortions.
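The arithmetic behind these figures can be checked with a minimal sketch. The helper names `jitter_ms` and `relative_mse_reduction` are illustrative assumptions and do not come from the paper; only the numeric inputs (200 ms delay, 10% jitter, MSE of 0.268 and 0.211) are taken from the text.

```python
def jitter_ms(delay_ms: float, jitter_percent: float) -> float:
    """Convert jitter given as a percentage of the delay into milliseconds."""
    return delay_ms * jitter_percent / 100.0


def relative_mse_reduction(mse_baseline: float, mse_improved: float) -> float:
    """Fractional MSE reduction of the improved model relative to the baseline."""
    return (mse_baseline - mse_improved) / mse_baseline


# Values reported in the text: 200 ms delay and a jitter of 10% of the delay.
jitter = jitter_ms(200, 10)

# MSE at that jitter: noise-only model 0.268, noise-network model 0.211.
reduction = relative_mse_reduction(0.268, 0.211)

print(f"jitter = {jitter:.0f} ms")                   # jitter = 20 ms
print(f"relative MSE reduction = {reduction:.1%}")   # about 21%
```

Under these reported values, the noise–network-trained model reduces the MSE by roughly one fifth relative to the noise-only-trained model at this jitter setting.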

5.5. Effect of Packet Loss on Prediction Accuracy

With the jitter, delay, and burst loss kept constant, the effect of packet loss on the performance of the noise-only-trained model and the noise–network-trained model was compared, as shown in Figure 8. The performance of both the MiniatureVQNet–Noise–Network and MiniatureVQNet–Noise models decreases as the packet loss increases; that is, the MSE increases with the packet loss. At all packet loss levels, the MiniatureVQNet–Noise–Network model outperforms the MiniatureVQNet–Noise model, although the difference is small except at 10% and 20% packet loss. Training the model on the noise–network dataset offers a better MOS prediction accuracy than training on the noise-only dataset under all packet loss conditions.
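A per-condition comparison of this kind amounts to grouping test utterances by the distortion parameter of interest and computing the MSE against PESQ within each group. The following is a minimal sketch under that assumption; the record layout, the function name `mse_by_condition`, and the toy values are hypothetical and are not data or code from the paper.

```python
from collections import defaultdict


def mse_by_condition(records, condition_key):
    """Return {condition value: MSE of predicted score vs. PESQ} for one key."""
    squared_errors = defaultdict(list)
    for record in records:
        error = record["predicted"] - record["pesq"]
        squared_errors[record[condition_key]].append(error ** 2)
    return {
        value: sum(errors) / len(errors)
        for value, errors in sorted(squared_errors.items())
    }


# Hypothetical toy records (NOT results from the paper).
records = [
    {"packet_loss_percent": 0, "pesq": 3.5, "predicted": 3.5},
    {"packet_loss_percent": 0, "pesq": 3.0, "predicted": 3.5},
    {"packet_loss_percent": 10, "pesq": 2.5, "predicted": 3.0},
    {"packet_loss_percent": 10, "pesq": 2.0, "predicted": 2.5},
    {"packet_loss_percent": 20, "pesq": 1.5, "predicted": 2.5},
    {"packet_loss_percent": 20, "pesq": 2.0, "predicted": 2.0},
]

per_loss_mse = mse_by_condition(records, "packet_loss_percent")
for loss, value in per_loss_mse.items():
    print(f"packet loss {loss:>2}%: MSE = {value:.3f}")
```

The same helper could be applied with a different key (for example a jitter or SNR field) to produce the other per-condition comparisons, provided the remaining parameters are held constant in the selected records.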

5.6. Model Performance after Post-Training Optimization

The MiniatureVQNet–Noise–Network model was optimized after training to reduce the model size and execution time on low-powered devices. Two optimized models, Full–Integer MiniatureVQNet–Noise–Network and Float16 MiniatureVQNet–Noise–Network, were compared with the original MiniatureVQNet–Noise–Network. Table 2 shows the Pearson correlation coefficient of MiniatureVQNet–Noise–Network, Full–Integer MiniatureVQNet–Noise–Network, and Float16 MiniatureVQNet–Noise–Network with respect to PESQ. The Pearson correlation of Full–Integer MiniatureVQNet–Noise–Network (0.670) is slightly lower than that of the original model (0.691), but this difference is not significant.
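For reference, the two metrics reported in Table 2 can be computed as in the following sketch. It uses only the standard definitions of the Pearson correlation coefficient and the MSE; the function names and the toy score lists are assumptions for illustration and are not values from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return covariance / math.sqrt(var_x * var_y)


def mse(x, y):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(x)


# Hypothetical scores: predictions offset from PESQ by a constant 0.5.
pesq_scores = [1.0, 2.0, 3.0, 4.0]
predicted = [1.5, 2.5, 3.5, 4.5]

r = pearson(pesq_scores, predicted)
error = mse(pesq_scores, predicted)
print(f"Pearson r = {r:.3f}, MSE = {error:.3f}")
```

The toy example is chosen to show why both metrics are worth reporting: a constant offset leaves the correlation at its maximum while still producing a nonzero MSE, so a model can track PESQ well in rank and trend yet be biased in absolute value.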

The MSE of Full–Integer MiniatureVQNet–Noise–Network was 0.254, which is higher than that of MiniatureVQNet–Noise–Network and Float16 MiniatureVQNet–Noise–Network, which were each 0.194. The results show that model compression to float16 can be achieved without a loss of model accuracy; that is, the representation of the proposed model's weights can be reduced from float64/float32 to float16 without a loss of accuracy. Hence, the float16-quantized model can be used in a resource-constrained environment without a loss of accuracy, whereas full-integer quantization incurs a higher MSE (0.254 versus 0.194).
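The different behavior of the two optimized models is consistent with the different rounding error of the two number formats. The sketch below illustrates this on a handful of hypothetical weights using only the Python standard library: a round trip through IEEE 754 half precision, and a simple 8-bit affine quantization. This section does not state which toolchain the authors used, so this is an illustration of the general idea rather than their procedure; in particular, full-integer schemes typically quantize activations as well as weights, which this sketch does not cover.

```python
import struct


def to_float16(value: float) -> float:
    """Round-trip a value through IEEE 754 half precision (2 bytes)."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def quantize_int8(values):
    """8-bit affine quantization; returns (codes, scale, zero_point)."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255.0
    zero_point = round(-128 - lo / scale)
    codes = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return codes, scale, zero_point


def dequantize_int8(codes, scale, zero_point):
    """Map 8-bit codes back to approximate real values."""
    return [(c - zero_point) * scale for c in codes]


def max_abs_error(original, restored):
    """Largest absolute difference between original and restored values."""
    return max(abs(a - b) for a, b in zip(original, restored))


# Hypothetical weights (NOT taken from the trained model).
weights = [-0.731, -0.25, -0.004, 0.0, 0.0173, 0.3333, 0.912]

fp16_restored = [to_float16(w) for w in weights]
codes, scale, zero_point = quantize_int8(weights)
int8_restored = dequantize_int8(codes, scale, zero_point)

fp16_err = max_abs_error(weights, fp16_restored)
int8_err = max_abs_error(weights, int8_restored)

print(f"bytes per weight: float32={struct.calcsize('<f')}, "
      f"float16={struct.calcsize('<e')}, int8={struct.calcsize('<b')}")
print(f"max abs error: float16={fp16_err:.6f}, int8={int8_err:.6f}")
```

For weights in this range, the half-precision round trip introduces a much smaller error than the 8-bit codes, at the cost of two bytes per weight instead of one.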

6. Conclusions

This paper proposes the MiniatureVQNet, a single-ended speech quality evaluation method for VoIP audio applications, based on a lightweight deep neural network (DNN) model trained on an environmental noise and network-distorted speech dataset. The proposed model can predict audio quality independent of the source of degradation, whether noise or network, and is light enough to run in embedded systems. The proposed MiniatureVQNet model outperforms the traditional P.563 method in accuracy on all tested network conditions and environmental noise parameters. Furthermore, the proposed model is compact and can easily run on various low-resource computing platforms.

Training on a noise–network-distorted speech dataset improves the model prediction accuracy under all VoIP environment distortions compared to training the model on a noise-only dataset. The noise–network dataset captures speech degradation from both environmental noise and transmission-induced distortions. For all considered environmental noise and transmission error factors, the MiniatureVQNet trained on the noise–network-degraded dataset outperformed the model trained on the noise-only dataset. Hence, the MiniatureVQNet model can learn new features without a degradation in performance. Examining the model performance across noise types, SNR, jitter, packet loss, and delay reveals the model's weak points, that is, the conditions under which its performance is low. Care should be taken when deploying single-ended speech quality prediction models, as the stated performance may not hold in all conditions.

Recently, music streaming, BGM streaming, Internet radio, and other IP audio applications have become very popular. Thus, it is important to consider not only single-ended speech quality evaluation but also music quality evaluation. Music and speech are frequently mixed, as in Internet radio, where there is usually no clear point at which only music or only speech is broadcast. Hence, a general audio quality evaluation is important. Although only speech was considered in this study, we plan to extend it to evaluate the general audio quality of IP audio applications, including both speech and music signals. Furthermore, we plan to use highly versatile codecs such as Opus [36], which scales from low-bitrate narrowband speech to high-fidelity full-band audio and supports packet loss concealment and jitter buffer algorithms.

Author Contributions

Conceptualization, E.J.K. and Y.N.; methodology, E.J.K. and Y.N.; software, E.J.K.; validation, E.J.K.; formal analysis, E.J.K.; investigation, E.J.K.; resources, Y.N. and E.J.K.; data curation, E.J.K.; writing—original draft preparation, E.J.K.; writing—review and editing, Y.N. and E.J.K.; visualization, E.J.K.; supervision, Y.N.; project administration, E.J.K. and Y.N.; funding acquisition, Y.N. All authors have read and agreed to the published version of the manuscript.

Funding

This work is partially supported by Osaka NDS.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets were used in this study. The Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models is available at https://datashare.ed.ac.uk/handle/10283/2791 (accessed on 16 June 2022) under a Creative Commons license. The NISQA Corpus is available at https://github.com/gabrielmittag/NISQA/wiki/NISQA-Corpus#nisqa_train_sim-and-nisqa_val_sim (accessed on 14 October 2022). The internally generated dataset is available on request from the corresponding author.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

1. Voiers, W. Interdependencies among measures of speech intelligibility and speech “Quality”. In Proceedings of the ICASSP ’80. IEEE International Conference on Acoustics, Speech, and Signal Processing, Denver, CO, USA, 9–11 April 1980; Volume 5, pp. 703–705.
2. IEEE. IEEE Recommended Practice for Speech Quality Measurements. IEEE Trans. Audio Electroacoust. 1969, 17, 225–246.
3. International Telecommunication Union. Methods for Subjective Determination of Transmission Quality; ITU-T Recommendation P.800; International Telecommunication Union: Geneva, Switzerland, 1996.
4. International Telecommunication Union. Subjective Test Methodology for Evaluating Speech Communication Systems That Include Noise Suppression Algorithm; ITU-T Recommendation P.835; International Telecommunication Union: Geneva, Switzerland, 2003.
5. International Telecommunication Union. Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs; ITU-T Recommendation P.862; International Telecommunication Union: Geneva, Switzerland, 2001.
6. International Telecommunication Union. Perceptual Objective Listening Quality Assessment: An Advanced Objective Perceptual Method for End-to-End Listening Speech Quality Evaluation of Fixed, Mobile, and IP-Based Networks and Speech Codecs Covering Narrowband, Wideband, and Super-Wideband; ITU-T Recommendation P.863; International Telecommunication Union: Geneva, Switzerland, 2011.
7. Huber, R.; Kollmeier, B. PEMO-Q-A new method for objective audio quality assessment using a model of auditory perception. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1902–1911.
8. Thiede, T.; Treurniet, W.C.; Bitto, R.; Schmidmer, C.; Sporer, T.; Beerends, J.G.; Colomes, C. PEAQ-The ITU standard for objective measurement of perceived audio quality. J. Audio Eng. Soc. 2000, 48, 3–29.
9. International Telecommunication Union. Single-Ended Method for Objective Speech Quality Assessment in Narrow-Band Telephony Applications; ITU-T Recommendation P.563; International Telecommunication Union: Geneva, Switzerland, 2004.
10. Sharma, D.; Wang, Y.; Naylor, P.A.; Brookes, M. A data-driven non-intrusive measure of speech quality and intelligibility. Speech Commun. 2016, 80, 84–94.
11. Gamper, H.; Reddy, C.K.A.; Cutler, R.; Tashev, I.J.; Gehrke, J. Intrusive and Non-Intrusive Perceptual Speech Quality Assessment Using a Convolutional Neural Network. In Proceedings of the 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), IEEE, New Paltz, NY, USA, 20–23 October 2019.
12. Cauchi, B.; Siedenburg, K.; Santos, J.F.; Falk, T.H.; Doclo, S.; Goetze, S. Non-Intrusive Speech Quality Prediction Using Modulation Energies and LSTM-Network. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1151–1163.
13. Catellier, A.A.; Voran, S.D. Wawenets: A No-Reference Convolutional Waveform-Based Approach to Estimating Narrowband and Wideband Speech Quality. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Barcelona, Spain, 4–8 May 2020; Institute of Electrical and Electronics Engineers Inc.: Piscataway Township, NJ, USA, 2020; Volume 2020, pp. 331–335.
14. Falk, T.H.; Chan, W.Y. Single-ended speech quality measurement using machine learning methods. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1935–1947.
15. Ooster, J.; Huber, R.; Meyer, B.T. Prediction of perceived speech quality using deep machine listening. In Proceedings of the Annual Conference of the International Speech Communication Association, INTERSPEECH, Hyderabad, India, 2–6 September 2018; Volume 2018, pp. 976–980.
16. Fu, S.W.; Tsao, Y.; Hwang, H.T.; Wang, H.M. Quality-Net: An end-to-end non-intrusive speech quality assessment model based on BLSTM. arXiv 2018, arXiv:1808.05344.
17. Mittag, G.; Möller, S. Non-intrusive Speech Quality Assessment for Super-wideband Speech Communication Networks. In Proceedings of the ICASSP, IEEE International Conference on Acoustics, Speech and Signal Processing, Brighton, UK, 12–17 May 2019; Institute of Electrical and Electronics Engineers Inc.: Piscataway Township, NJ, USA, 2019; Volume 2019, pp. 7125–7129.
18. Manocha, P.; Xu, B.; Kumar, A. NORESQA: A Framework for Speech Quality Assessment using Non-Matching References. Adv. Neural Inf. Process. Syst. 2021, 27, 22363–22378.
19. ITU-T. The E-Model: A Computational Model for Use in Transmission Planning; Recommendation ITU-T G.107; ITU-T: Geneva, Switzerland, 2015.
20. Sun, L.; Ifeachor, E. Perceived speech quality prediction for voice over IP-based networks. In Proceedings of the 2002 IEEE International Conference on Communications, New York, NY, USA, 28 April–2 May 2002; Volume 4, pp. 2573–2577.
21. Sun, L.; Ifeachor, E. Voice quality prediction models and their application in VoIP networks. IEEE Trans. Multimed. 2006, 8, 809–820.
22. Rodriguez, D.Z.; Rosa, R.L.; Almeida, F.L.; Mittag, G.; Moller, S. Speech Quality Assessment in Wireless Communications with MIMO Systems Using a Parametric Model. IEEE Access 2019, 7, 35719–35730.
23. Hu, Z.; Yan, H.; Yan, T.; Geng, H.; Liu, G. Evaluating QoE in VoIP networks with QoS mapping and machine learning algorithms. Neurocomputing 2020, 386, 63–83.
24. Wuttidittachotti, P.; Daengsi, T. Subjective MOS model and simplified E-model enhancement for Skype associated with packet loss effects: A case using conversation-like tests with Thai users. Multimed. Tools Appl. 2017, 76, 16163–16187.
25. Jelassi, S.; Rubino, G. A perception-oriented Markov model of loss incidents observed over VoIP networks. Comput. Commun. 2018, 128, 80–94.
26. Uhl, T. QoS by VoIP under Use Different Audio Codecs. In Proceedings of the 2018 Joint Conference - Acoustics, Acoustics 2018, Ustka, Poland, 11–14 September 2018; pp. 311–314.
27. Mittag, G.; Cutler, R.; Hosseinkashi, Y.; Revow, M.; Srinivasan, S.; Chande, N.; Aichner, R. DNN No-Reference PSTN Speech Quality Prediction. arXiv 2020, arXiv:2007.14598.
28. Valentini-Botinhao, C. Noisy Speech Database for Training Speech Enhancement Algorithms and TTS Models; Centre for Speech Technology Research (CSTR), School of Informatics, University of Edinburgh: Edinburgh, UK, 2017.
29. Mittag, G.; Naderi, B.; Chehadi, A.; Möller, S. NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets. arXiv 2021, arXiv:2104.09494.
30. Kumalija, E.J.; Nakamoto, Y. Performance Evaluation of Automatic Speech Recognition Systems on Integrated Noise-Network Distorted Speech. Front. Signal Process. 2022, 2, 999457.
31. da Silva, A.P.C.; Varela, M.; de Souza e Silva, E.; Leão, R.M.; Rubino, G. Quality assessment of interactive voice applications. Comput. Netw. 2008, 52, 1179–1192.
32. Soni, M.H.; Patil, H.A. Novel deep autoencoder features for non-intrusive speech quality assessment. In Proceedings of the 2016 24th European Signal Processing Conference (EUSIPCO), Budapest, Hungary, 29 August–2 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 2315–2319.
33. Hannun, A.; Case, C.; Casper, J.; Catanzaro, B.; Diamos, G.; Elsen, E.; Prenger, R.; Satheesh, S.; Sengupta, S.; Coates, A.; et al. Deep Speech: Scaling up end-to-end speech recognition. arXiv 2014, arXiv:1412.5567.
34. Gysel, P.; Pimentel, J.; Motamedi, M.; Ghiasi, S. Ristretto: A Framework for Empirical Study of Resource-Efficient Inference in Convolutional Neural Networks. IEEE Trans. Neural Netw. Learn. Syst. 2018, 29, 5784–5789.
35. Alkhawaldeh, R.S.; Khawaldeh, S.; Pervaiz, U.; Alawida, M.; Alkhawaldeh, H. NIML: Non-intrusive machine learning-based speech quality prediction on VoIP networks. IET Commun. 2019, 13, 2609–2616.
36. Valin, J.M.; Vos, K.; Terriberry, T. Definition of the Opus Audio Codec; IETF RFC 6716: Milford, MA, USA, 2012.

Figure 1. Proposed model network architecture.

Figure 2. MSE for different models.

Figure 3. Model accuracy at different signal-to-noise ratios on noise–network-distorted speech data.

Figure 4. Model accuracy at different signal-to-noise ratios on noisy distorted speech data.

Figure 5. Effect of noise type on prediction accuracy.

Figure 6. Effect of noise type on prediction accuracy.

Figure 7. The effect of jitter on prediction accuracy.

Figure 8. Effect of packet loss on prediction accuracy.

Table 1. Combined noise and network distortion parameters.

Distortion	Parameter	Values
Network	Packet loss (%)	0, 10, 15, 20, 25, 30, 35
	Delay (ms)	0, 100, 200, 300, 500
	Jitter (% delay)	0, 10, 20, 30, 40
	Codec	G.722
Noise	Noise type	Babble, car, exhibition hall, restaurant, street, airport, train station, train
Noise	SNR (dB)	0, 5, 10, 15
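The parameter values in Table 1 can be written as a small configuration and expanded into individual test conditions. The sketch below assumes a full Cartesian product of all listed values; the source does not state that every combination was actually generated, and the dictionary keys are illustrative names rather than identifiers from the paper.

```python
from itertools import product

# Parameter values copied from Table 1; key names are illustrative assumptions.
parameter_grid = {
    "packet_loss_percent": [0, 10, 15, 20, 25, 30, 35],
    "delay_ms": [0, 100, 200, 300, 500],
    "jitter_percent_of_delay": [0, 10, 20, 30, 40],
    "codec": ["G.722"],
    "noise_type": [
        "babble", "car", "exhibition hall", "restaurant",
        "street", "airport", "train station", "train",
    ],
    "snr_db": [0, 5, 10, 15],
}

# Assumption: every combination is used (full Cartesian product).
conditions = [
    dict(zip(parameter_grid, values))
    for values in product(*parameter_grid.values())
]

print(f"{len(conditions)} combined noise-network conditions")
print(conditions[0])
```

Under the full-product assumption, the grid yields 7 × 5 × 5 × 1 × 8 × 4 = 5600 conditions, which gives a sense of how quickly the combined noise and network space grows.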

Table 2. Pearson correlation and mean squared error (MSE) with respect to PESQ for the original and optimized models.

Model	Pearson correlation	MSE
MiniatureVQNet–Noise–Network	0.691	0.194
Float16 MiniatureVQNet–Noise–Network	0.690	0.194
Full-Integer MiniatureVQNet–Noise–Network	0.670	0.254


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
