Article

Cross-Domain Conv-TasNet Speech Enhancement Model with Two-Level Bi-Projection Fusion of Discrete Wavelet Transform †

Department of Electrical Engineering, National Chi Nan University, Nantou 54561, Taiwan
*
Author to whom correspondence should be addressed.
This paper is an extended version of our paper published in the Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING), Taipei, Taiwan, 21–22 November 2022.
Appl. Sci. 2023, 13(10), 5992; https://doi.org/10.3390/app13105992
Submission received: 20 March 2023 / Revised: 4 May 2023 / Accepted: 10 May 2023 / Published: 12 May 2023

Abstract

Nowadays, time-domain features are widely used in speech enhancement (SE) networks, as are frequency-domain features, to achieve excellent performance in eliminating noise from input utterances. This study primarily investigates how to extract information from time-domain utterances to create more effective features for SE. We extend our recent work by employing sub-signals that dwell in multiple acoustic frequency bands in the time domain and integrating them into a unified time-domain feature set. The discrete wavelet transform (DWT) is applied to decompose each input frame signal into sub-band signals, and a projection fusion process is performed on these signals to create the ultimate features. The corresponding fusion strategy is either bi-projection fusion (BPF) or multiple projection fusion (MPF); in short, MPF replaces the sigmoid function in BPF with the softmax function to create ratio masks for multiple feature sources. The concatenation of the fused DWT features and the time features serves as the encoder output of two celebrated SE frameworks, the fully convolutional time-domain audio separation network (Conv-TasNet) and the dual-path transformer network (DPTNet), to estimate the mask and then produce the enhanced time-domain utterances. The evaluation experiments are conducted on the VoiceBank-DEMAND and VoiceBank-QUT tasks, and the results reveal that the proposed method achieves higher speech quality and intelligibility than the original Conv-TasNet that uses time features only, indicating that fusing DWT features created from the input utterances with time features helps learn a superior Conv-TasNet/DPTNet network for SE.

1. Introduction

Speech processing technology has been successfully integrated into many networking and multimedia audio–visual services, such as voice input on smart devices, interactive voice chat, voice command control of various intelligent robots, and voice control of automobile and locomotive functions. However, there are still many critical challenges in expanding speech applications. From the perspective of signal processing, the primary problem facing speech signal transmission and processing is noise interference. As described in the 2021 best-selling book “Noise: A Flaw in Human Judgment” [1], bias and noise are two major sources of error. However, compared to bias, the randomness of noise makes it much harder to predict accurately, thus deepening the difficulty of dealing with noise. While this is primarily a narrative about human judgment, it seems equally applicable to the environment that speech signal processing inhabits.
Thanks to the rapid development of science and technology, numerous countermeasures have been presented in recent decades to deal with the distortion existing in communicated speech. One of the prevailing schools is speech enhancement (SE), which aims to improve the quality and intelligibility of received speech so that the enhanced utterances can be more readily accepted and understood by human beings or machines. Conventional SE algorithms mainly rely on statistical modeling of speech or noise, and they often perform unsatisfactorily in non-stationary noise scenarios. They include spectral subtraction [2], the Wiener filtering method [3], maximum a posteriori adaptation (MAP) [4], maximum likelihood linear regression (MLLR) [5], maximum likelihood linear transform (MLLT) [6], parallel model combination (PMC) [7], the minimum mean square error log-spectral amplitude estimator (MMSE-LSA) [8], stochastic vector mapping (SVM) [9], and multi-environment model-based linear normalization [10], just to name a few.
In recent years, thanks to the development of the theory and application of deep neural networks (DNNs), most state-of-the-art SE technologies developed by mainstream researchers contain deep-learning algorithms. These DNN-wise techniques allow the extensive use of speech data to train SE models, exhibiting better SE performance and greater generalizability than conventional methods. In short, in supervised DNN-based SE methods, a large amount of paired training data (viz., noisy speech and the corresponding clean speech component) is used to train a DNN. The DNN learns a mapping between noisy and clean utterances so that it can accurately estimate the clean speech component of any noisy speech. Furthermore, the performance of DNN-based SE methods is determined largely by three factors: the model architecture, the data used for learning, and the employed hyper-parameters.
As for the model architecture, various types of deep neural networks, such as the multi-layer perceptron (MLP), convolutional neural network (CNN), recurrent neural network (RNN), and denoising auto-encoder (DAE), together with their variants and evolutions, have been adopted to develop SE methods. For example, the works in [11,12,13,14,15] use MLP architectures, the works in [16,17,18,19,20] employ a CNN framework, and RNNs are exploited in [21,22]. Furthermore, the training target of network learning is also significantly related to the performance of SE methods. According to [23], SE methods can be divided into two categories according to the training target: mapping-based and masking-based.
Mapping-based methods directly pursue a mapping function whose ideal output is the representation (feature) of the clean speech component of the noisy input. The representation may be the time-domain signal waveform or time-frequency (T-F) representations, including spectrograms and cochleagrams. Some well-known mapping-based SE methods include, but are not limited to, the target magnitude spectrum (TMS) [24], gammatone frequency target power spectrum (GF-TPS) [24], and signal approximation (SA) [24]. On the other hand, masking-based methods seek a point-to-point multiplicative mask for the input signal or feature representation, hoping that the mask-multiplied output approaches its clean state. Some well-known masking-based methods include the ideal binary mask (IBM) [25], ideal ratio mask (IRM) [26], spectral magnitude mask (SMM) [23], complex ideal ratio mask (cIRM) [27], phase-sensitive mask (PSM) [28], and many others.

Introduction of Cross-Domain Features in DNN-Based SE Network

Many conventional statistical model-based and novel DNN-based SE methods employ the time-frequency (T-F) spectrum (spectrogram) obtained by the short-time Fourier transform (STFT) as the fundamental representation of speech signals, which is equivalent to using a fixed and known sinusoid basis set. Although sinusoids have theoretical advantages in serving as a basis for analyzing signals, they may not be the optimal choice for practical use. For example, the Fourier transform (FT) is unsuitable for analyzing non-stationary signals. Thus, the STFT is a compromise that applies the FT to segmented and windowed signals to capture the spectral properties within a short period of a signal. Alternatively, other transformations might be more suitable for directly analyzing non-stationary signals, such as the wavelet transform and the Hilbert–Huang transform [29], or a data-driven time-domain transformation with a trainable convolution network.
In particular, the operation of a convolutional layer for segmented speech is similar to the STFT mentioned above, but it uses a set of learnable basis functions. For example, Conv-TasNet [18] directly uses CNN in its encoder part to extract frame-wise features, and SincNet [30] employs the sinc functions, the parameters of which can be learned with CNN to simulate traditional band-pass filters.
According to information theory, a signal in the time, frequency, or any distortion-free transform domain contains the same information without any loss. However, we cannot guarantee that the downstream network processing will not cause information loss, or that any single-domain feature will always highlight the information salient for SE. In this respect, increasing the redundancy with multiple-domain features at the feature extraction stage might simplify the design of the subsequent network, similar to using redundant codes in transmission to detect and correct possible errors. Inevitably, introducing redundancy with multiple-domain features has certain downsides, such as increasing the input size, introducing a potential bias, and making the fusion of these features less interpretable.
In addition to directly using multiple-domain features in their original form (by concatenating them), it is likely that the mutual information among these features, if extracted in advance, can further benefit the entire SE network in terms of both efficiency and efficacy. For example, in [31], time-domain and frequency-domain features are integrated with a mechanism called bi-projection fusion (BPF), and the resulting feature set produces superior SE performance to that of Conv-TasNet [18]. BPF was initially developed in image processing, where it integrates equirectangular and cubemap representations of panoramic photography to achieve more accurate depth estimation of panoramic images.
In a nutshell, this study is motivated by the following observations:
  • Features from multiple domains may learn a more effective SE model than single-domain features, according to [31].
  • The well-known discrete wavelet transform (DWT) may split a signal into many sub-bands without losing information or introducing distortion, and it could be a good choice for producing features or as a pre-processing stage for an SE model. If the used wavelet function has a small support, DWT can be implemented quickly. Furthermore, because of the downsampling process, the total length of the DWT-wise sub-band signals is approximately equal to that of the original signal, and thus DWT does not increase data size when the number of sub-bands is increased. Additionally, our recent research [32,33,34] has shown that DWT can be applied in various aspects in SE to promote its performance.
  • The work in [31] employs BPF to fuse two feature sources. We wondered if the BPF mechanism can be extended to fuse more feature sources than just two.
Accordingly, this study extends our recent work in [35] and proposes to adopt and extend the idea of BPF to produce effective speech representations for DNN-based SE networks, primarily employing DWT to create features as an alternative to the commonly used STFT features. Like the Fourier transform (FT), DWT is a distortionless transformation; it decomposes an input signal by passing it through a series of (analysis) filters, where each level of decomposition comprises two filters, a low-pass and a high-pass filter. Since only the low-frequency (approximation) part of the signal is further decomposed, the bandwidth of the resulting sub-bands decreases as the frequency decreases. This property is similar to the behavior of the human ear in sound perception.
Furthermore, we propose a new fusion mechanism called multiple projection fusion (MPF), which can merge more than two sources. The integration of multiple features is employed to learn the DNN-based SE network, and the integration is two-fold: first, the various acoustic frequency-band signals obtained from each frame waveform with DWT are fused; second, the time-domain features are concatenated with the fused DWT-domain features from the first step. We evaluate the effectiveness of the provided feature structure on two SE networks, Conv-TasNet and DPTNet, to see whether SE performance can be further enhanced.
The remainder of this paper is organized as follows: Section 2 gives background information for the presented novel framework, and Section 3 details it. The experimental setup, results, and corresponding discussions are provided in Section 4. Finally, Section 5 gives the concluding remarks of this work.

2. Background Knowledge

This section provides the background knowledge of our presented SE framework to be detailed in the next section, and it contains two parts. The first part introduces a celebrated deep-learning-based SE architecture, Conv-TasNet. Conv-TasNet will serve as the archetype of our SE framework. The second part describes the discrete wavelet transform, which will be employed to create the speech features at the encoder end of our framework.

2.1. Conv-TasNet

An effective SE method, the fully convolutional time-domain audio separation network (Conv-TasNet), adopts and extends the ideas of two other methods: the time-domain audio separation network (TasNet) [36] and temporal convolutional network (TCN) [37]. Briefly speaking, TasNet models the input waveform with a convolutional encoder–decoder architecture. The encoder possesses a nonnegativity constraint on its output, and a linear decoder inverts the (enhanced) encoder output back to the sound waveform. In particular, a separation (enhancement) layer consisting of a deep LSTM network is located between the encoder and decoder to implement the denoising. In contrast, TCN consists of a series of 1-D convolution blocks with different dilation factors, and it is used in Conv-TasNet as the layer structure of separation (enhancement), which is shown in Figure 1.
The architecture of Conv-TasNet is depicted in Figure 2, which contains three parts: the encoder, mask-estimation network, and decoder. In the following, we describe the process of each of the three parts.

2.1.1. Encoder and Decoder

The encoder part converts the input utterance into a series of frame-wise features, which are supposed to benefit the downstream mask-estimation network more than the original raw utterance. First, the input utterance is split into $K$ overlapping frames of length $L$ in the time domain, each frame represented by a vector $x_k \in \mathbb{R}^{L \times 1}$, where $k$ is the frame index and $K$ is the total number of frames. Then, we concatenate all $x_k$ to form a matrix $X \in \mathbb{R}^{L \times K}$.
The framed raw data matrix $X$ is then transformed into a feature matrix $W_T \in \mathbb{R}^{N \times K}$ by a 1-D convolution operation, which can be formulated as a matrix multiplication:
$W_T = \mathcal{H}(UX)$, (1)
where $U \in \mathbb{R}^{N \times L}$ consists of $N$ vectors (termed encoder basis functions), each with dimension $1 \times L$, and $\mathcal{H}$ is an optional nonlinear function, such as the rectified linear unit (ReLU), that ensures the output $W_T$ is non-negative in an element-wise manner.
In contrast, the decoder employs a 1-D transposed convolution operation to convert the enhanced feature matrix $\widetilde{W}_T$ (the output of the mask-estimation network) back to the time-domain data matrix $\widetilde{X}$, which can be formulated as another matrix multiplication:
$\widetilde{X} = V \widetilde{W}_T$, (2)
where $\widetilde{X} \in \mathbb{R}^{L \times K}$ is the reconstruction of $X$, and the columns of $V \in \mathbb{R}^{L \times N}$ are $N$ vectors (termed decoder basis functions), each with dimension $L \times 1$. The overlapping reconstructed frames in $\widetilde{X}$ are summed to generate the final time-domain waveform.
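To make the encoder–decoder formulation above concrete, the following is a minimal PyTorch sketch of Equations (1) and (2) realized as a 1-D convolution and a 1-D transposed convolution. The hyperparameter values (N = 512 basis functions, frame length L = 16 samples, 50% frame overlap) and the class name are illustrative assumptions rather than the authors' exact implementation.

```python
import torch
import torch.nn as nn

class EncoderDecoderSketch(nn.Module):
    """Minimal sketch of the Conv-TasNet encoder (Equation (1)) and decoder (Equation (2))."""
    def __init__(self, N=512, L=16):
        super().__init__()
        # Encoder: N learnable basis functions of length L applied with 50% frame overlap.
        self.encoder = nn.Conv1d(1, N, kernel_size=L, stride=L // 2, bias=False)
        self.relu = nn.ReLU()              # the optional nonlinearity H
        # Decoder: a transposed convolution maps the N-channel features back to a waveform,
        # implicitly summing the overlapping reconstructed frames.
        self.decoder = nn.ConvTranspose1d(N, 1, kernel_size=L, stride=L // 2, bias=False)

    def forward(self, wav):                        # wav: (batch, 1, samples)
        W_T = self.relu(self.encoder(wav))         # (batch, N, K) frame-wise features
        # ... the mask-estimation network of Section 2.1.2 would act on W_T here ...
        return self.decoder(W_T)                   # reconstructed time-domain waveform

x = torch.randn(2, 1, 16000)                       # two one-second utterances at 16 kHz
print(EncoderDecoderSketch()(x).shape)             # torch.Size([2, 1, 16000])
```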

2.1.2. Mask Estimation Network

As shown in the middle part of Figure 2, the mask-estimation layer between the encoder and decoder layers consists of stacked 1-D dilated convolution blocks, which follows the temporal convolutional network (TCN) architecture. In particular, the employed TCN contains R groups of blocks connected in tandem, each group consisting of M 1-D convolutional blocks that have dilation factors increasing exponentially from 1 to $2^{M-1}$. The design of increasing dilation factors provides different context window sizes to catch short- and long-range dependencies of speech utterances. Finally, the output of the TCN goes through a 1-by-1 convolutional layer (viz., pointwise convolution) and a nonlinear activation (sigmoid) to produce the estimated masks for both clean speech and noise.
Referring to Figure 1, the 1-D convolutional block in the mask-estimation layer has two output paths: a residual path and a skip-connection path. The residual path acts as the input to the next block, while the skip-connection paths of all convolutional blocks are summed up as the output of the TCN. Moreover, a depth-wise separable convolution is employed in each convolution block instead of the standard convolution to reduce the computational complexity. A depth-wise separable convolution consists of two consecutive operations: a depth-wise convolution followed by a point-wise convolution. Additionally, the 1-D convolution block uses the parametric ReLU (PReLU) [38] as its activation. PReLU modifies ReLU by adding a coefficient that controls the slope of the negative part; this coefficient, along with the other model parameters, is learned from the training set. PReLU prevents the zero gradients brought on by negative channel outputs, allowing the model to be updated with these outputs. For the ImageNet 2012 benchmark with 10-view testing, substituting ReLU with PReLU reduces the top-1 error from 33.82% to 32.64%, an improvement of about 1.2 percentage points. PReLU has thus been found to be more beneficial than ReLU, boosting the performance of the learned network.
Here, we give some detail comparing the standard convolution and the depthwise separable convolution. Let the input contain $G$ channels with $M$-point data, forming a matrix $Y \in \mathbb{R}^{G \times M}$, and let the output contain $H$ channels with $M$-point data, forming a matrix of size $\mathbb{R}^{H \times M}$. The standard convolution uses a kernel tensor $K \in \mathbb{R}^{G \times H \times P}$, where $P$ is the length of the impulse response for each one-dimensional input–output convolution pair. In contrast, the depthwise separable convolution first uses $G$ impulse responses of length $P$, individually convolved with each of the $G$ channels of input data. Next, the resulting $G$ output feature maps pass through a pointwise (1-by-1) convolution layer with a kernel tensor $\hat{K} \in \mathbb{R}^{G \times H \times 1}$ to produce the output matrix of size $\mathbb{R}^{H \times M}$. Therefore, the standard convolution uses $G \times H \times P$ parameters, while the depthwise separable convolution uses $G \times P + G \times H$ parameters. Obviously, $G \times P + G \times H < G \times P \times H$ if $(P-1)(H-1) > 1$, and the model-size ratio $\frac{G \times P \times H}{G \times P + G \times H}$ increases as $H \gg P$ or $P \gg H$.
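The parameter-count comparison above can be checked with a short PyTorch sketch; the channel and kernel sizes below are illustrative and simply mirror the notation G, H, and P used in the text.

```python
import torch.nn as nn

G, H, P = 128, 256, 3   # input channels, output channels, kernel length (illustrative)

# Standard 1-D convolution: one G x P kernel per output channel -> G*H*P weights.
standard = nn.Conv1d(G, H, kernel_size=P, padding=P // 2, bias=False)

# Depth-wise separable convolution: a per-channel (depth-wise) convolution
# followed by a point-wise (1x1) convolution -> G*P + G*H weights.
separable = nn.Sequential(
    nn.Conv1d(G, G, kernel_size=P, padding=P // 2, groups=G, bias=False),  # depth-wise
    nn.Conv1d(G, H, kernel_size=1, bias=False),                            # point-wise
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 128 * 256 * 3 = 98304
print(count(separable))  # 128 * 3 + 128 * 256 = 33152
```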

2.2. An Advanced Variant of Conv-TasNet: DPTNet

The Conv-TasNet framework employs a temporal convolutional network, which consists of stacked 1-D dilated convolution blocks, to predict the mask for the encoder output. In recent years, some more advanced SE methods have followed the encoder–decoder structure of Conv-TasNet while exploiting a more sophisticated module to perform the mask estimation. Among them, the dual-path transformer network (DPTNet) [39], as the name suggests, uses an attention-based dual-path modeling network to estimate the mask and achieves superior SE performance. Here, we give a brief introduction to DPTNet, focusing on its differences from Conv-TasNet.
As depicted in Figure 3, the masking estimation layer of DPTNet consists of three stages: segmentation, dual-path modeling, and overlap-add, whose functions are described as follows:
  • Segmentation: In the segmentation stage, the input feature matrix W from the upstream encoder is split into fixed-length chunks with hopping (see the code sketch after this list). These chunks are then concatenated to comprise a 3-D tensor.
  • Dual-path modeling: The 3-D tensor passes through the stack of dual-path modeling blocks containing intra-chunk and inter-chunk processing. The intra-chunk processing models the local information of the tensor with an improved transformer defined in [40]. The respective output then passes through another improved transformer to perform inter-chunk processing, which captures the information about global dependency among different chunks.
  • Overlap-add: After completing the dual-path modeling, we apply the overlap-add method to the chunks of the ultimate tensor to generate a preliminary mask. A hyperbolic tangent function is then applied as a non-linearity to the preliminary mask to ensure the final mask values lie in the range (−1, 1).
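The following is a minimal sketch of the segmentation and overlap-add stages, assuming an encoder output of shape (N, K) and arbitrary chunk-length and hop values; it only illustrates the bookkeeping of these two stages and omits the dual-path transformer blocks in between.

```python
import torch
import torch.nn.functional as F

def segment(W, chunk_len=100, hop=50):
    """Split an encoder output W of shape (N, K) into overlapping fixed-length chunks,
    returning a 3-D tensor of shape (N, chunk_len, S) plus the original frame count K."""
    N, K = W.shape
    pad = (hop - (K - chunk_len) % hop) % hop                  # pad so the chunks tile K exactly
    W = F.pad(W, (0, pad))
    chunks = W.unfold(dimension=1, size=chunk_len, step=hop)   # (N, S, chunk_len)
    return chunks.permute(0, 2, 1), K

def overlap_add(chunks, K, hop=50):
    """Inverse bookkeeping of segment(): overlap-add the chunks back to an (N, K) matrix."""
    N, chunk_len, S = chunks.shape
    out = torch.zeros(N, hop * (S - 1) + chunk_len)
    for s in range(S):
        out[:, s * hop:s * hop + chunk_len] += chunks[:, :, s]
    return out[:, :K]

W = torch.randn(64, 999)                    # illustrative encoder output: N = 64 channels, K = 999 frames
chunks, K = segment(W)                      # the dual-path transformer blocks would process this tensor
mask = torch.tanh(overlap_add(chunks, K))   # preliminary mask -> tanh keeps values in (-1, 1)
print(chunks.shape, mask.shape)             # torch.Size([64, 100, 19]) torch.Size([64, 999])
```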

2.3. Discrete Wavelet Transform

Most SE methods employ short-time Fourier transform (STFT) to create a spectrogram for input signals and conduct de-noising on the spectrogram or its variants. However, as mentioned in previous sections, STFT is not necessarily an optimal choice for processing non-stationary signals such as speech. Comparatively, discrete wavelet transform (DWT) is one of the emerging technologies in signal processing in recent years. Relative to discrete Fourier transform (DFT), DWT can highlight some crucial information about a signal within a short-time period. We briefly review the discrete wavelet transform (DWT) in the following.
Mathematically, the DWT of a signal f [ n ] is the outcome of passing f [ n ] through a series of filters. The resulting signals represent different frequency components of f [ n ] . First, the signal f [ n ] is passed through two filters (called analysis filters) with impulse responses g [ n ] and h [ n ] in parallel to obtain two signals. The two filters are low-pass and high-pass, respectively, and are related to each other by
$h[n] = (-1)^n\, g[N-1-n]$, (3)
where N is the filter length, and Equation (3) shows that g [ n ] and h [ n ] are quadrature mirror filters to each other. Since the bandwidth of the signal f [ n ] is halved in the output signals of two filters, we can discard half the points according to Nyquist’s rule. The outputs of filters h [ n ] and g [ n ] are then down-sampled by 2:
$f_{A1}[n] = \sum_{k} f[k]\, g[2n-k]$, (4)
and
$f_{D1}[n] = \sum_{k} f[k]\, h[2n-k]$. (5)
Compared with the original signal $f[n]$, the two sub-band signals $f_{A1}[n]$ and $f_{D1}[n]$ have half the time resolution but twice the frequency resolution. In addition, $f_{A1}[n]$ and $f_{D1}[n]$ are usually termed the one-level approximation and detail parts of $f[n]$, respectively, in wavelet analysis. The operation mentioned above is called a one-level DWT, and we depict it in Figure 4.
Moreover, we can expand a one-level DWT into a multiple-level DWT by repeatedly employing sub-band filtering and factor-2 downsampling. In practice, an $L$-level DWT of $f[n]$ is created by performing a one-level DWT, as shown in Figure 5, on the approximation (low-frequency) part obtained from the preceding $(L-1)$-level DWT of $f[n]$, where $L > 1$. In the multiple-level DWT, the decomposition, which contains low-pass and high-pass filtering together with down-sampling, is repeated to further increase the frequency resolution. Therefore, the ultimate sub-band signals exhibit different time-frequency localizations. Figure 6 depicts the flowchart of a three-level DWT.
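As an illustration of Equations (3)-(5), the following NumPy sketch builds the db2 high-pass filter from the low-pass filter via the quadrature-mirror relation and performs a one-level (and then a two-level) decomposition by filtering and factor-2 downsampling. The boundary handling (plain full convolution without signal extension) is a simplifying assumption; practical implementations such as PyWavelets use periodic or symmetric extension instead.

```python
import numpy as np

# Daubechies db2 low-pass (analysis) filter g[n]; the high-pass filter h[n] follows
# the quadrature-mirror relation of Equation (3): h[n] = (-1)^n g[N-1-n], with N = 4 taps.
s3 = np.sqrt(3.0)
g = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4.0 * np.sqrt(2.0))
h = np.array([(-1) ** n * g[len(g) - 1 - n] for n in range(len(g))])

def dwt_one_level(f):
    """One-level DWT: filter with g and h, then keep every second sample (Equations (4) and (5))."""
    approx = np.convolve(f, g)[::2]   # low-pass  sub-band f_A1[n]
    detail = np.convolve(f, h)[::2]   # high-pass sub-band f_D1[n]
    return approx, detail

frame = np.random.randn(512)          # one frame of a 16 kHz utterance (illustrative length)
a1, d1 = dwt_one_level(frame)         # one-level DWT: two sub-bands of roughly half length
a2, d2 = dwt_one_level(a1)            # two-level DWT: split the approximation part again
print(len(frame), len(a1), len(a2))   # 512, 258, 131 (slightly more than half due to the filter taps)
```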

3. Proposed SE Framework with DWT Features

This section introduces a novel SE framework that exploits the short-time DWT data as one of the sources to create the encoding features for an encoder–decoder SE architecture such as Conv-TasNet and DPTNet, introduced in the previous section. This new framework is partially inspired by the cross-domain TCN (CD-TCN) [31], in which the time-domain and frequency-domain features are employed together for the following mask-estimation process, as depicted in Figure 7. We propose the use of wavelet-domain features as a substitute for frequency-domain features, hoping to bring about further SE improvement.
In the presented method, we use the one-level and two-level DWT individually to create wavelet-domain features and observe the corresponding behavior in an SE framework. The related details and variants are provided in the following two sub-sections. It is noteworthy that we could use a DWT with a level greater than 2, and a higher-level DWT can generate more sub-bands with finer frequency-resolution features at lower frequencies. However, because of the factor-2 downsampling procedure in DWT, the number of data points in the lowest two DWT sub-bands decreases exponentially as the DWT level increases.
In general, vowels have their energy primarily concentrated in the frequency range of 250–2000 Hz, while voiced consonants have their energy in the range of 250–4000 Hz. Comparatively, unvoiced consonants have varying strength with their energy concentrated in the range of 2000–8000 Hz. Since the majority of speech corpora have a sampling rate of 16 kHz, which covers the frequency range of 0–8 kHz, the sub-bands obtained through one-level DWT are located approximately at (0 kHz, 4 kHz) and (4 kHz, 8 kHz), while the sub-bands obtained through two-level DWT are located roughly at (0 kHz, 2 kHz), (2 kHz, 4 kHz), and (4 kHz, 8 kHz). The DWT features analyzed in this study, therefore, contain significant acoustic speech information. However, if a higher sampling rate is used, such as 44.1 kHz for music sampling, the high-pass DWT sub-band may not contain crucial speech components and should be disregarded during integration.

3.1. Fusion of One-Level DWT Features and Time-Domain Features

Here, we employ one-level DWT features and integrate them with the time-domain features to serve as the encoder features for the SE network. Referring to the procedures of Conv-TasNet introduced in Section 2.1, the proposed novel SE framework consists of the following steps:
  • Create time-domain features and one-level DWT features: For the encoder end of the framework, one splits the input time-domain utterance into $K$ overlapping frames of length $L$, each frame represented by a vector $x_k \in \mathbb{R}^{L \times 1}$. These $x_k$ are concatenated to comprise a data matrix $X \in \mathbb{R}^{L \times K}$. Then, the time-domain feature matrix $W_T \in \mathbb{R}^{N \times K}$ is created as in Equation (1) in Section 2.1.1.
    To create another branch of features, we apply a one-level DWT to the data matrix $X$ with respect to each of its columns $x_k$. That is, each frame signal $x_k$ passes through a one-level DWT (as introduced in Section 2.3) to produce its approximation (low-pass) and detail (high-pass) sub-band signals, denoted by $c_k^A$ and $c_k^D$, respectively. Compared with $x_k$, $c_k^A$ and $c_k^D$ have half the length and bandwidth. As such, we organize these sub-band signals into two feature matrices, $C_A$ and $C_D$, of size $\frac{L}{2} \times K$, consisting of $c_k^A$ and $c_k^D$ as columns, respectively.
    Furthermore, we process $C_A$ and $C_D$ individually with a 1-D trainable convolution (together with the nonlinear function $\mathcal{H}$) to produce two further matrices, $W_A$ and $W_D$, of size $N \times K$, the same size as the time-domain feature matrix $W_T$. The operations are formulated as
    $W_A = \mathcal{H}(U_A C_A), \quad W_D = \mathcal{H}(U_D C_D)$, (6)
    where $U_A$ and $U_D$ denote the 1-D convolution matrices for $C_A$ and $C_D$, respectively.
  • Integrate time-domain features and DWT features
    To date, we have three feature matrices: W T (time-domain features), W A (DWT-wise approximation features), and W D (DWT-wise detail features). Notably, they have the same size, and we have come up with three ways to integrate them to constitute the ultimate encoder features W E :
    • Addition:
      Here, W E is simply the weighted sum of three matrices:
      $W_E = 0.50\, W_T + 0.25\, W_A + 0.25\, W_D$, (7)
      which is illustrated by Figure 8a. Accordingly, the final feature matrix has the same size as each component matrix.
    • Concatenation:
      The other intuitive way for the integration is to concatenate the three matrices:
      $W_E = [W_T; W_A; W_D]$, (8)
      which is illustrated by Figure 8b. Here, the final feature matrix is three times the size of each component matrix.
    • Fusion and concatenation:
      To extract the information across the two DWT features $W_A$ and $W_D$ more effectively, we exploit the bi-projection fusion (BPF) method [31]. BPF has been used to integrate time-domain and frequency-domain features and has exhibited superior performance in DPTNet [39]. In addition, the two DWT features, which reflect the low-pass and high-pass half-band short-time spectra of speech, are supposed to be of unequal importance. For example, the low-pass part, $W_A$, captures more about vowels, while the high-pass part, $W_D$, might correspond better to consonants. Furthermore, $W_A$ and $W_D$ usually have different signal-to-noise ratios (SNRs) due to the embedded speech components and the background noise. Therefore, the two complementary masks (non-negative with a unity sum) of the BPF module are suitable for leveraging these two features to benefit SE. After obtaining the BPF features from the two DWT features, $W_A$ and $W_D$, as in Figure 9a, we concatenate them with the time-domain features $W_T$ as in Figure 9b. The details are as follows (a code sketch of this fusion is given after this list):
      First, we employ the concatenation of W A and W D to estimate a ratio mask matrix M:
      $M = \sigma(\psi_M([W_A; W_D], \theta_M))$, (9)
      where $\sigma$ is the sigmoid function, and $\psi_M$ is a convolutional projection layer operation with parameters $\theta_M$. Then, we multiply $M$ and $1-M$ with $W_A$ and $W_D$ in an element-wise manner and add them up to generate the BPF-wise DWT features:
      $W_{DWT} = M \odot W_A + (1-M) \odot W_D$, (10)
      where $\odot$ denotes element-wise multiplication. Finally, we concatenate the DWT features and the time-domain features:
      $W_E = [W_T; W_{DWT}]$. (11)
      Consequently, the final feature matrix has double the size of each original feature matrix.
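A minimal PyTorch sketch of the “fusion and concatenation” scheme of Equations (9)-(11) is given below. Treating the projection layer ψ_M as a single 1×1 convolution, as well as the tensor sizes, are our assumptions made for illustration only.

```python
import torch
import torch.nn as nn

class BiProjectionFusion(nn.Module):
    """Sketch of the BPF fusion of Equations (9)-(11): a learned ratio mask M weighs the
    approximation features W_A against the detail features W_D, and the fused DWT features
    are concatenated with the time-domain features W_T."""
    def __init__(self, N=256):
        super().__init__()
        # psi_M: assumed here to be a single 1x1 convolution from 2N channels to N channels
        self.proj = nn.Conv1d(2 * N, N, kernel_size=1)

    def forward(self, W_T, W_A, W_D):
        M = torch.sigmoid(self.proj(torch.cat([W_A, W_D], dim=1)))   # Equation (9)
        W_DWT = M * W_A + (1.0 - M) * W_D                            # Equation (10)
        return torch.cat([W_T, W_DWT], dim=1)                        # Equation (11)

W_T, W_A, W_D = (torch.randn(2, 256, 1999) for _ in range(3))        # (batch, N, K) examples
W_E = BiProjectionFusion()(W_T, W_A, W_D)
print(W_E.shape)   # torch.Size([2, 512, 1999]): double the size of each component matrix
```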

3.2. Fusion of Two-Level DWT Features and Time-Domain Features

As introduced in the previous section, the L-level DWT splits the lower sub-band from the ( L 1 ) -level DWT into two sub-bands. A two-level DWT can extract low-frequency information from short-time speech signals for further analysis with a higher resolution than a one-level DWT. In addition, a higher-level DWT somehow agrees more with the human perception of hearing. Therefore, we propose using a two-level DWT to build features for the frame signals and then integrate them with the time-domain features for the encoder of an SE network such as Conv-TasNet.
The flowchart of the Conv-TasNet with the two-level DWT outputs as one source of encoder features is depicted in Figure 10. Referring to the previous sub-section, each frame signal $x_k$ passes through a two-level DWT to produce its three sub-band signals, denoted by $c_k^{D1}$, $c_k^{D2}$, and $c_k^{A2}$, respectively. Then, we organize these sub-band signals into three feature matrices, $C_{D1}$, $C_{D2}$, and $C_{A2}$, consisting of $c_k^{D1}$, $c_k^{D2}$, and $c_k^{A2}$ as columns, respectively. Next, as with the one-level DWT outputs in the previous sub-section, we process the three DWT outputs, $C_{D1}$, $C_{D2}$, and $C_{A2}$, individually with a 1-D trainable convolution (together with the nonlinear function $\mathcal{H}$) to produce three further matrices, $W_{D1}$, $W_{D2}$, and $W_{A2}$, of the same size as the time-domain feature matrix $W_T$, which comes from Equation (1). The operations are formulated as
$W_{D1} = \mathcal{H}(U_{D1} C_{D1}), \quad W_{D2} = \mathcal{H}(U_{D2} C_{D2}), \quad W_{A2} = \mathcal{H}(U_{A2} C_{A2})$, (12)
where $U_{D1}$, $U_{D2}$, and $U_{A2}$ denote the 1-D convolution matrices for $C_{D1}$, $C_{D2}$, and $C_{A2}$, respectively.
As an extension of the BPF-wise integration used for the one-level DWT in the previous sub-section, we choose to directly concatenate the time-domain features $W_T$ with a fusion of the two-level DWT features to form the ultimate encoding features. However, we now have three DWT feature matrices, and we put forward three methods to create their fusion with three multiplicative masks. These three methods are as follows:
  • Use two BPF modules for two pairs of DWT feature matrices:
    This method is illustrated in Figure 11. We choose the DWT feature matrices with closer frequencies as a pair and apply a BPF module to each pair. Then, the resulting two BPF outputs are added to form the final DWT matrix. The whole process is formulated by:
    $M_1 = \sigma(\psi_{M_1}([W_{D1}; W_{D2}], \theta_{M_1}))$, (13)
    $M_2 = \sigma(\psi_{M_2}([W_{D2}; W_{A2}], \theta_{M_2}))$, (14)
    and
    $W_{DWT} = M_1 \odot W_{D1} + (1-M_1) \odot W_{D2} + M_2 \odot W_{D2} + (1-M_2) \odot W_{A2}$, (15)
    where $M_1$ and $M_2$ are the two applied BPF mask matrices.
  • Use an MPF module with intra-channel softmax for three DWT feature matrices: This method is illustrated in Figure 12. It extends the idea of BPF to linearly combine more than two sources, and we term this extension multiple projection fusion (MPF). Rather than the sigmoid function used in BPF, the softmax function is used to obtain the three multiplicative masks for the three DWT feature matrices. More precisely, we apply the softmax function to the three DWT features individually for each channel (here, a channel indicates an entry of the frame-wise feature vector), as we did in the BPF process for two DWT feature matrices. The whole process is formulated by:
    $[M_1(i,j),\, M_2(i,j),\, M_3(i,j)] = \mathrm{softmax}(\psi_{M_s}([W_{D1}; W_{D2}; W_{A2}], \theta_{M_s})), \quad i = 1, 2, \ldots, N, \; j = 1, 2, \ldots, K$, (16)
    where $i$ is the channel index, $M_k(i,j)$ denotes the $ij$-th element of the matrix $M_k$, $N$ and $K$ are, respectively, the feature size and the frame number, $\mathrm{softmax}$ is the softmax function, and $\psi_{M_s}$ is a convolutional projection layer operation with parameters $\theta_{M_s}$. Please note that the softmax function is applied to the concatenation of the three frame-wise DWT features at the same channel $i$, and thus we call it an intra-channel softmax operation.
    As such, the obtained three masks satisfy the equation:
    $M_1 + M_2 + M_3 = 1$. (17)
    After that, we generate the DWT-wise MPF features by multiplying these masks with the three DWT feature matrices:
    $W_{DWT} = M_1 \odot W_{D1} + M_2 \odot W_{D2} + M_3 \odot W_{A2}$. (18)
  • Use an MPF module with inter-channel softmax for three DWT feature matrices:
    This method is illustrated in Figure 13. The BPF and MPF modules used in the previous methods pursue the weights for the different DWT-wise sub-band feature vectors at the same channel (position in a vector) to reflect the relative importance of each sub-band feature. Therefore, the sigmoid or softmax function is used $N$ times in the BPF or MPF module, where $N$ is the number of channels (dimensionality) of each sub-band feature vector. By contrast, this method exploits the MPF module with a single softmax function over all of the channels of the three sub-band feature vectors simultaneously. Accordingly, the obtained mask values are supposed to reflect the relative importance of different channels in different sub-band feature vectors.
    The whole process is formulated by:
    $[M_1(:,j),\, M_2(:,j),\, M_3(:,j)] = \mathrm{softmax}(\psi_{M_s}([W_{D1}; W_{D2}; W_{A2}], \theta_{M_s})), \quad j = 1, 2, \ldots, K$, (19)
    where $M_k(:,j)$ denotes the $j$-th column of the matrix $M_k$, $\mathrm{softmax}$ is the softmax function, and $\psi_{M_s}$ is a convolutional projection layer operation with parameters $\theta_{M_s}$. Here, the softmax function determines the weights of all channels of the three frame-wise DWT feature vectors simultaneously, and thus we call it an inter-channel softmax operation.
    Therefore, the obtained three masks satisfy the equation:
    $\sum_{i=1}^{N} \big( M_1(i,j) + M_2(i,j) + M_3(i,j) \big) = 1, \quad j = 1, 2, \ldots, K$. (20)
    After that, the DWT-wise MPF features are generated by multiplying these masks with the three DWT feature matrices:
    $W_{DWT} = M_1 \odot W_{D1} + M_2 \odot W_{D2} + M_3 \odot W_{A2}$. (21)
After obtaining the DWT feature matrix from any of the three methods mentioned earlier, we concatenate it with the time-domain feature matrix W T to build the final encoding feature matrix as in Equation (11).
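The following PyTorch sketch illustrates the MPF step of Equations (16)-(21) with both the intra-channel and the inter-channel softmax variants. Modeling the projection ψ_Ms as a single 1×1 convolution that outputs 3N channels (one group of N channels per mask) is our assumption; the sketch only reflects the mask constraints stated above.

```python
import torch
import torch.nn as nn

class MultipleProjectionFusion(nn.Module):
    """Sketch of MPF (Equations (16)-(21)): three masks weight the three two-level DWT
    feature matrices W_D1, W_D2, W_A2. mode='intra' applies the softmax over the three
    sources separately for every channel and frame (Equation (17)); mode='inter' applies
    one softmax over all 3N values of a frame, so the masks of a frame sum to one jointly
    (Equation (20))."""
    def __init__(self, N=256, mode="intra"):
        super().__init__()
        # psi_Ms: assumed to be a single 1x1 convolution producing 3N values per frame
        self.proj = nn.Conv1d(3 * N, 3 * N, kernel_size=1)
        self.mode = mode

    def forward(self, W_D1, W_D2, W_A2):
        B, N, K = W_D1.shape
        z = self.proj(torch.cat([W_D1, W_D2, W_A2], dim=1)).view(B, 3, N, K)
        if self.mode == "intra":
            M = torch.softmax(z, dim=1)                                        # per channel i, frame j
        else:
            M = torch.softmax(z.reshape(B, 3 * N, K), dim=1).view(B, 3, N, K)  # per frame j
        M1, M2, M3 = M[:, 0], M[:, 1], M[:, 2]
        return M1 * W_D1 + M2 * W_D2 + M3 * W_A2                               # fused W_DWT

W_D1, W_D2, W_A2 = (torch.randn(2, 256, 1999) for _ in range(3))
W_DWT = MultipleProjectionFusion(mode="inter")(W_D1, W_D2, W_A2)
print(W_DWT.shape)   # torch.Size([2, 256, 1999]); then concatenated with W_T as in Equation (11)
```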

4. Experimental Setup, Results, and Discussion

This section primarily evaluates the presented novel framework in SE performance. We start with introducing experimental environments, including the speech and noise datasets and the evaluation metrics. Then, we provide comprehensive SE evaluation results of the presented methods and the corresponding analyses and discussions. Finally, we use the speech spectrogram demonstration to evaluate our approaches.

4.1. Experimental Setup

We first use the VoiceBank-DEMAND [41] task for evaluation, which combines the VoiceBank speech dataset [42] with noise sources from the DEMAND database [43]. As for the VoiceBank dataset, the training set includes 11,572 utterances from 28 speakers, and the testing set includes 824 utterances from 2 speakers. The utterances in the training set are corrupted by ten types of noise from the DEMAND database at four signal-to-noise ratios (SNRs): 0, 5, 10, and 15 dB. The test set is contaminated by five types of DEMAND noise at four SNRs: 2.5, 7.5, 12.5, and 17.5 dB. In particular, we set aside around 200 utterances from the training set for validation, which is fewer than the commonly used 742 utterances. The idea is that such a choice should make the learned model less specific to the DEMAND noise scenario, allowing it to generalize well to an unseen noise scenario (such as QUT, introduced below).
To further investigate the generalization capability of the presented methods, we built another test set. This test set uses the same (noise-free) speech utterances as the original test set, but they are corrupted by different types of non-stationary noise from the QUT-NOISE dataset [44] to simulate a more severe noisy scenario. The QUT-NOISE dataset includes 5 noise types: cafe, home, street, car, and reverb. The resulting noisy utterances are at any of −5, 0, 5, and 15 dB SNRs. Notably, all of the waveforms used in the experiments are resampled to 16 kHz.
The db2 wavelet function is used to implement the DWT in our methods. The db2 wavelet filter has only 4 taps, making the convolution (filtering) process quite simple. Wavelet db2 is also one of the most commonly used wavelet functions, as its support is small enough to separate the features of interest [45].
As for Conv-TasNet, we set the configuration of the mask-estimation network with the following hyperparameters: the number of convolutional blocks in each repeat is $X = 8$, the number of repeats is $R = 3$, the number of channels in the bottleneck and the residual paths' $1\times1$-conv blocks is $B = 128$, the number of channels in the convolutional blocks is $H = 256$, which is half the setting in [18], the number of channels in the skip-connection paths' $1\times1$-conv blocks is $S_c = 128$, and the kernel size in the convolutional blocks is $P = 3$. As for the hyperparameter settings for model training, we employ the Adam optimizer, a learning rate of $10^{-3}$, a weight decay of $10^{-5}$, a batch size of 3, and 100 epochs. The loss function used for training Conv-TasNet and DPTNet is the scale-invariant signal-to-noise ratio (SI-SNR) [46].
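For reference, the hyperparameters listed above can be collected into a simple configuration sketch (plain Python dictionaries; the key names are our own and follow the notation of [18]).

```python
# Conv-TasNet mask-estimation hyperparameters (notation of [18]); values as stated above.
convtasnet_config = {
    "X": 8,      # convolutional blocks per repeat
    "R": 3,      # number of repeats
    "B": 128,    # bottleneck / residual-path 1x1-conv channels
    "H": 256,    # channels inside each convolutional block (half the setting in [18])
    "Sc": 128,   # skip-connection-path 1x1-conv channels
    "P": 3,      # kernel size inside the convolutional blocks
}
training_config = {
    "optimizer": "Adam",
    "learning_rate": 1e-3,
    "weight_decay": 1e-5,
    "batch_size": 3,
    "epochs": 100,
    "loss": "SI-SNR",     # scale-invariant signal-to-noise ratio [46]
}
```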
To evaluate the denoising capability of the SE methods, we employ three objective metrics:
  • Perceptual evaluation of speech quality (PESQ) [47]: This metric ranks the level of enhancement of the processed utterances relative to the original noise-free ones. PESQ indicates the quality difference between the enhanced and clean speech signals, and it ranges from −0.5 to 4.5. Briefly speaking, the calculation of PESQ involves several stages: time alignment, level alignment to a calibrated listening level, time-frequency mapping, frequency warping, and compressive loudness scaling.
  • Short-time objective intelligibility (STOI) [48]: This metric measures the objective intelligibility of short-time time-frequency (TF) regions of an utterance with the discrete Fourier transform (DFT). STOI ranges from 0 to 1, and a higher STOI score corresponds to better intelligibility. Briefly speaking, the STOI of the processed utterance $\hat{s}$ with respect to its clean counterpart $s$ is calculated with the following procedure: First, we perform STFT on $\hat{s}$ and $s$ to obtain their spectrograms, $\hat{X}$ and $X$. Then, a one-third octave band analysis is performed by grouping the DFT bins of $\hat{X}$ and $X$, resulting in $\hat{Y}$ and $Y$, respectively. The octave-band energy of $\hat{Y}$ is further normalized to equal that of $Y$ and then clipped to lower-bound the signal-to-distortion ratio (SDR), producing $\bar{Y}$. Finally, the linear correlation coefficient $d_j(m)$ between $\bar{Y}$ and $Y$ for each octave band $j$ and each frame $m$ is computed, and the STOI of $\hat{s}$ is the average of the correlation coefficients over all octave bands and all frames.
  • Scale-invariant signal-to-noise ratio (SI-SNR) [46]: This metric usually reflects the degree of artifact distortion between the processed utterance $\hat{s}$ and its clean counterpart $s$, and it is formulated by
    $\mathrm{SI\text{-}SNR} = 10 \log_{10} \dfrac{\| s_{target} \|^2}{\| e_{noise} \|^2}$, (22)
    where
    $s_{target} = \dfrac{\langle \hat{s}, s \rangle\, s}{\| s \|^2}$, (23)
    $e_{noise} = \hat{s} - s_{target}$, (24)
    with $\langle \hat{s}, s \rangle$ being the inner product of $\hat{s}$ and $s$. A minimal code sketch of this computation follows this list.
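The sketch below computes SI-SNR following Equations (22)-(24); the zero-mean normalization of both signals is a common convention that we assume here, and the small constant eps guards against division by zero.

```python
import torch

def si_snr(s_hat, s, eps=1e-8):
    """SI-SNR of Equations (22)-(24); s_hat and s are 1-D tensors of equal length."""
    s_hat = s_hat - s_hat.mean()                 # zero-mean normalization (our assumption)
    s = s - s.mean()
    s_target = (torch.dot(s_hat, s) / (s.pow(2).sum() + eps)) * s                    # Equation (23)
    e_noise = s_hat - s_target                                                       # Equation (24)
    return 10.0 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))  # Equation (22)

clean = torch.randn(16000)
noisy = clean + 0.3 * torch.randn(16000)
print(si_snr(noisy, clean).item())   # higher values indicate less residual distortion
# For model training, the negative SI-SNR is minimized as the loss.
```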

4.2. Experiments on Conv-TasNet with the VoiceBank-DEMAND Task

We start with the VoiceBank-DEMAND task to evaluate the Conv-TasNet method using different features in the encoder. These features are created by integrating time-domain features with either STFT-domain, one-level DWT-domain, or two-level DWT-domain features, as described in Section 3.1 and Section 3.2. The corresponding PESQ, STOI, and SI-SNR results on the test set are listed in Table 1. From this table, we note the following:
  • Compared with the unprocessed baseline, all Conv-TasNet variants used here give significantly superior PESQ and SI-SNR scores, reflecting their excellent SE capability. By contrast, the STOI improvement they bring is moderate, probably because the baseline STOI score is already as high as 0.921.
  • The cross-domain Conv-TasNet [31] with time and STFT features outperforms the original Conv-TasNet with time-domain features only in PESQ and SI-SNR, and it provides the best SI-SNR among all of the variants. These results reveal that the addition of STFT features benefits Conv-TasNet significantly.
  • When the STFT is replaced by a one-level DWT, the corresponding three Conv-TasNet variants exhibit similar or better PESQ results. This indicates that DWT features can complement time-domain features to provide Conv-TasNet with better speech quality. Among the three one-level DWT-based Conv-TasNets, addition-wise integration performs best in PESQ, outperforming the more complicated “one BPF and concatenation” method. However, when it comes to SI-SNR, the one-level DWT-wise methods behave worse than the method with STFT.
  • As for the cases in which the two-level DWT features are involved, the achieved PESQ scores are close to or moderately better than those obtained by the method with STFT, but not necessarily superior to those with one-level DWT. However, the optimal PESQ score is obtained by concatenating time features with the two-level DWT features fused by one MPF with inter-channel softmax. This result might indicate that further weighting the different channel values in the encoding features is a promising direction for improving SE performance.
  • To further examine if the presented fusion feature sets provide statistically significant improvement in PESQ relative to the pure time-domain feature, we perform a one-tailed two-sample t-test, the details of which are shown in Appendix A. Referring to the results shown in Table A1, we see that the four types of fusion “addition”, “concatenation”, “one BPF and concatenation”, and “one MPF with inter-channel softmax” provide Conv-TasNet with significant improvement in PESQ when compared with Conv-TasNet using the time-domain feature only. The other two fusion types (“two BPFs” and “one MPF with intra-channel softmax”) do not improve PESQ significantly.

4.3. Experiments on Conv-TasNet with the VoiceBank-QUT Task

This subsection presents the experimental results of Conv-TasNet on a more challenging task, VoiceBank-QUT, in which the noise is non-stationary and unseen by the learned SE models; the results are listed in Table 2. Compared with the previous table for the VoiceBank-DEMAND task, the three evaluation metric scores here are significantly lower for both the unprocessed baseline and the various SE methods. These degraded results are to be expected in an unseen and non-stationary noise scenario. Furthermore, even though the metric scores vary when using different fusion features (STFT, one-level DWT, and two-level DWT), their differences are quite small and statistically insignificant, as shown in Table A2 in Appendix A.

4.4. Experiments on DPTNet with Two Tasks

As introduced in Section 2.2, the DPTNet method exploits a more delicate mask-estimation module than Conv-TasNet, and it is supposed to provide superior SE performance. Therefore, we further evaluate some encoding features used in the previous two sections in the DPTNet framework with the two benchmark tasks: VoiceBank-DEMAND and VoiceBank-QUT. The corresponding evaluation results are listed in Table 3 and Table 4. Observing these two tables and comparing them with the previous two tables for Conv-TasNet, we present the following findings:
  • Almost all encoding features provide better metric scores in DPTNet than in Conv-TasNet, as expected, except for the time-domain features. In our opinion, an improper model configuration might cause this performance disagreement in the case of time features.
  • When integrating with time features, STFT and various forms of DWT features provide DPTNet with similar STOI and SI-SNR results.
  • The PESQ results for DPTNet are somewhat the converse of those for Conv-TasNet. With the DPTNet framework, STFT features behave better in PESQ than DWT features for the VoiceBank-DEMAND task, while they are inferior for the VoiceBank-QUT task. However, for the VoiceBank-DEMAND task, the DWT features with the integration “one MPF with inter-channel softmax” provide a PESQ score close to that of the STFT features. In contrast, the DWT features with the integration “one BPF and concatenation” behave significantly better than the STFT features in PESQ for the more challenging VoiceBank-QUT task.
  • The results with different feature sets are not entirely consistent between DPTNet and Conv-TasNet. However, we can conclude that the presented DWT-domain features offer beneficial SE information. They are complementary to the time-domain features and improve the performance of both Conv-TasNet and DPTNet. In addition, referring to Table A3 and Table A4, we find that all three fusion types listed here (“one BPF and concatenation”, “two BPFs”, and “one MPF with inter-channel softmax”) provide DPTNet with statistically significant improvement in PESQ relative to DPTNet using the time-domain feature only. These results indicate that DWT-domain features improve DPTNet more than Conv-TasNet in both the DEMAND and QUT noise scenarios.

4.5. Spectrogram Demonstration for Various SE Methods

In addition to the objective evaluation metrics, here we use the spectrograms of a single utterance in various situations to examine the presented DWT-wise Conv-TasNet and DPTNet. First of all, Figure 14a–c depict the magnitude spectrograms of an utterance (a) without noise corruption, (b) with noise corruption (bus noise at an SNR of 17.5 dB), and (c) corrupted as in (b) and then enhanced by the time-feature-wise Conv-TasNet. Comparing Figure 14a,b, we see that the noise causes significant interference to the clean speech, while the original Conv-TasNet (with time-domain features) alleviates the noise, as shown in Figure 14c. The noise alleviation is especially significant in the region outlined by the red box. Next, the spectrograms enhanced by the Conv-TasNet variants with three versions of one-level DWT and three versions of two-level DWT are depicted in Figure 14d–i, and the spectrograms enhanced with various versions of DPTNet are depicted in Figure 15a–d. From these figures, we find that the presented DWT-based Conv-TasNet and DPTNet frameworks also help reduce the noise distortion in the utterance sample. However, it is difficult to compare the SE performance of these methods solely from these spectrogram demonstrations, so we use the sample spectrograms for illustration purposes only.

5. Concluding Remarks

In this study, we primarily pay attention to the noise issue in speech signals, discuss some celebrated deep-learning-based SE methods, and propose using the DWT to create features for an encoder–decoder SE framework. One-level and two-level DWT features are extracted and then combined with time-domain features for SE model learning. We extend the concept of bi-projection fusion (BPF) and provide a novel fusion structure, multiple projection fusion (MPF), that can integrate three (or more) feature sources adaptively. We provide two types of MPF for two-level DWT features, which use intra-channel softmax and inter-channel softmax, respectively.
Using the well-known VoiceBank-DEMAND and VoiceBank-QUT tasks, we evaluate two SE architectures, Conv-TasNet and DPTNet, with various fusions of time-domain and DWT-domain features. Preliminary results show that DWT-domain features complement time-domain features to produce higher SE performance, particularly in the PESQ metric score. Although the fusion of DWT-domain and time-domain features does not always result in a statistically significant improvement in PESQ for Conv-TasNet when compared with pure time-domain features, it consistently performs significantly better for DPTNet, an SE framework more advanced than Conv-TasNet. It is worth noting that the presented feature sets offer DPTNet excellent SE performance on the VoiceBank-QUT task, which includes unseen and non-stationary noise.
In particular, DWT features work well as a substitute for STFT features as the encoder’s fusion component to deliver comparable or superior SE results. Since we conduct the convolution using a small-support wavelet function (db2), DWT features may be produced more quickly than STFT features, making them a competitive feature type for learning the DNN-based SE framework.
As future work, we will further investigate whether the type of wavelet function used in the creation of DWT features influences the performance of the involved SE framework. Next, we will test other types of fusions for time-domain and DWT-domain features to see their SE behavior. Finally, to examine their portability, we plan to employ DWT features in other recently proposed SE frameworks, such as DEMUCS [49] and MANNER [50].

Author Contributions

Methodology, Y.-T.C. and J.-W.H.; Validation, Y.-T.C., Z.-T.W. and J.-W.H.; Investigation, Z.-T.W.; Resources, J.-W.H.; Data curation, Z.-T.W.; Writing—original draft, Y.-T.C.; Writing—review & editing, J.-W.H.; Supervision, J.-W.H.; Project administration, J.-W.H.; Funding acquisition, J.-W.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by National Science and Technology Council (grant number 111-2221-E-260-005-MY2).

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

We perform a one-tailed two-sample t-test to evaluate if the presented DWT-wise cross-domain feature method gives a significant improvement in averaged metric score over the time-domain-wise method. We first calculate the t-statistic as follows:
$t = \dfrac{m - m_r}{\sqrt{\dfrac{\sigma^2}{N} + \dfrac{\sigma_r^2}{N_r}}}$, (A1)
where $m$, $\sigma$, and $N$ are the mean, standard deviation, and number of samples of the metric scores of the DWT-wise method, respectively, and $m_r$, $\sigma_r$, and $N_r$ are the mean, standard deviation, and number of samples of the metric scores of the time-domain-wise method, respectively. Since the test set has 826 utterances, $N = N_r = 826$. The degrees of freedom ($df$) for the t-distribution are $N + N_r - 2 = 826 + 826 - 2 = 1650$. Referring to a t-table, for 1650 degrees of freedom, a t-statistic higher than 1.645 corresponds to a p-value smaller than 0.05, indicating a confidence level higher than 95% in the significance of the improvement $m - m_r$.
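For illustration, the significance test of Equation (A1) can be reproduced with a few lines of Python; the example values are taken from the “Addition” row of Table A1 and the time-domain baseline.

```python
import math

def one_tailed_t(m, sigma, m_r, sigma_r, N=826, N_r=826, threshold=1.645):
    """Two-sample t-statistic of Equation (A1) and the one-tailed significance decision."""
    t = (m - m_r) / math.sqrt(sigma ** 2 / N + sigma_r ** 2 / N_r)
    return t, t > threshold            # t > 1.645 corresponds to p < 0.05 at ~1650 df

# Example: the "Addition" row of Table A1 against the time-domain baseline.
t, significant = one_tailed_t(m=2.681, sigma=0.6311, m_r=2.618, sigma_r=0.5991)
print(round(t, 3), significant)        # about 2.08, True (consistent with Table A1)
```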
According to the aforementioned evaluation rule, we examine whether the proposed DWT-wise cross-domain features may give Conv-TasNet and DPTNet statistically improved PESQ scores compared to the time-domain features. The results are shown in Table A1, Table A2, Table A3 and Table A4.
Specifically for Conv-TasNet, Table A1 and Table A2 show that while some DWT-wise cross-domain feature arrangements significantly improve PESQ scores for the VoiceBank-DEMAND task, none of them do so for the VoiceBank-QUT task, perhaps because the VoiceBank-QUT task involves a more difficult noise scenario.
Comparatively, all three of the DWT-wise cross-domain feature arrangements can produce significantly higher PESQ scores for both the VoiceBank-DEMAND and VoiceBank-QUT tasks as shown in Table A3 and Table A4, which are specifically for DPTNet. Therefore, DPTNet outperforms Conv-TasNet in speech enhancement, and the proposed DWT-wise cross-domain features outperform the time-domain features in an advanced SE framework such as DPTNet.
Table A1. The significance test for the PESQ score of DWT-wise fusion methods relative to the time-domain method for Conv-TasNet in noise environment of “DEMAND”.

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? (t > 1.645?) |
|---|---|---|---|---|---|
| Time | – | 2.618 | 0.5991 | – | – |
| Time and one-level DWT | Addition | 2.681 | 0.6311 | 2.081 | True |
| Time and one-level DWT | Concatenation | 2.669 | 0.6216 | 1.698 | True |
| Time and one-level DWT | One BPF and concatenation | 2.668 | 0.6084 | 1.683 | True |
| Time and two-level DWT | Two BPFs | 2.654 | 0.6039 | 1.216 | False |
| Time and two-level DWT | One MPF with intra-channel softmax | 2.667 | 0.6240 | 1.628 | False |
| Time and two-level DWT | One MPF with inter-channel softmax | 2.690 | 0.6337 | 2.373 | True |
Table A2. The significance test for the PESQ score of DWT-wise fusion methods relative to the time-domain method for Conv-TasNet in noise environment of “QUT”.

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? (t > 1.645?) |
|---|---|---|---|---|---|
| Time | – | 1.908 | 0.5875 | – | – |
| Time and one-level DWT | Addition | 1.922 | 0.5897 | 0.483 | False |
| Time and one-level DWT | Concatenation | 1.932 | 0.5932 | 0.826 | False |
| Time and one-level DWT | One BPF and concatenation | 1.936 | 0.5929 | 0.964 | False |
| Time and two-level DWT | Two BPFs | 1.922 | 0.5936 | 0.482 | False |
| Time and two-level DWT | One MPF with intra-channel softmax | 1.917 | 0.5868 | 0.312 | False |
| Time and two-level DWT | One MPF with inter-channel softmax | 1.926 | 0.6009 | 0.616 | False |
Table A3. The significance test for the PESQ score of DWT-wise fusion methods relative to the time-domain method for DPTNet in noise environment of “DEMAND”.

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? (t > 1.645?) |
|---|---|---|---|---|---|
| Time | – | 2.549 | 0.5885 | – | – |
| Time and one-level DWT | One BPF and concatenation | 2.724 | 0.5962 | 6.004 | True |
| Time and two-level DWT | Two BPFs | 2.745 | 0.6061 | 6.668 | True |
| Time and two-level DWT | One MPF with inter-channel softmax | 2.779 | 0.6250 | 7.700 | True |
Table A4. The significance test for the PESQ score of DWT-wise fusion methods relative to the time-domain method for DPTNet in noise environment of “QUT”.

| Feature Type | Integration Manner | PESQ Mean | PESQ Standard Deviation | t-Statistic | Significant Improvement? (t > 1.645?) |
|---|---|---|---|---|---|
| Time | – | 1.804 | 0.5079 | – | – |
| Time and one-level DWT | One BPF and concatenation | 2.044 | 0.6158 | 8.641 | True |
| Time and two-level DWT | Two BPFs | 2.033 | 0.6372 | 8.077 | True |
| Time and two-level DWT | One MPF with inter-channel softmax | 2.034 | 0.6116 | 8.315 | True |

References

  1. Kahneman, D.; Sibony, O.; Sunstein, C.R. Noise: A Flaw in Human Judgment; Little, Brown Spark: New York, NY, USA, 2021. [Google Scholar]
  2. Boll, S. Suppression of acoustic noise in speech using spectral subtraction. IEEE Trans. Acoust. Speech Signal Process. 1979, 27, 113–120. [Google Scholar] [CrossRef]
  3. Scalart, P.; Filho, J.V. Speech enhancement based on a priori signal to noise estimation. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Atlanta, GA, USA, 9 May 1996. [Google Scholar]
  4. Gauvain, J.; Lee, C.-H. Maximum a posteriori estimation for multivariate Gaussian mixture observations of Markov chains. IEEE Trans. Speech Audio Process. 1994, 2, 291–298. [Google Scholar] [CrossRef]
  5. Leggetter, C.J.; Woodland, P.C. Maximum likelihood linear regression for speaker adaptation of continuous density hidden Markov models. Comput. Speech Lang. 1995, 9, 171–185. [Google Scholar] [CrossRef]
  6. Gales, M.J.F. Maximum likelihood linear transformations for HMM-based speech recognition. Comput. Speech Lang. 1998, 12, 75–98. [Google Scholar] [CrossRef]
  7. Gales, M.J. Model-Based Techniques for Noise Robust Speech Recognition. Ph.D. Thesis, Cambridge University, Cambridge, UK, 1995. [Google Scholar]
  8. Ephraim, Y.; Malah, D. Speech enhancement using a minimum mean square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process. 1985, 33, 443–445. [Google Scholar] [CrossRef]
  9. Wu, J.; Huo, Q. An environment-compensated minimum classification error training approach based on stochastic vector mapping. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 2147–2155. [Google Scholar] [CrossRef]
  10. Buera, L.; Lleida, E.; Miguel, A.; Ortega, A.; Saz, O. Cepstral vector normalization based on stereo data for robust speech recognition. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 1098–1113. [Google Scholar] [CrossRef]
  11. Xu, Y.; Du, J.; Dai, L.; Lee, C. A regression approach to speech enhancement based on deep neural networks. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 23, 7–19. [Google Scholar] [CrossRef]
  12. Zhao, Y.; Wang, D.; Merks, I.; Zhang, T. DNN-based enhancement of noisy and reverberant speech. In Proceedings of the 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 20–25 March 2016; pp. 6525–6529. [Google Scholar]
  13. Wang, D. Deep learning reinvents the hearing aid. IEEE Spectr. 2017, 54, 32–37. [Google Scholar] [CrossRef]
  14. Chen, J.; Wang, Y.; Yoho, S.E.; Wang, D.; Healy, E.W. Large-scale training to increase speech intelligibility for hearing impaired listeners in novel noises. J. Acoust. Soc. Am. 2016, 139, 2604–2612. [Google Scholar] [CrossRef]
  15. Karjol, P.; Kumar, M.A.; Ghosh, P.K. Speech enhancement using multiple deep neural networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018. [Google Scholar]
  16. Kounovsky, T.; Malek, J. Single channel speech enhancement using convolutional neural network. In Proceedings of the 2017 IEEE International Workshop of Electronics, Control, Measurement, Signals and their Application to Mechatronics (ECMSM), Donostia, Spain, 24–26 May 2017. [Google Scholar]
  17. Chakrabarty, S.; Wang, D.; Habets, E.A.P. Time-frequency masking based online speech enhancement with multi-channel data using convolutional neural networks. In Proceedings of the 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 17–20 September 2018. [Google Scholar]
  18. Luo, Y.; Mesgarani, N. Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 1256–1266. [Google Scholar] [CrossRef]
  19. Fu, S.; Tsao, Y.; Lu, X.; Kawai, H. Raw waveform-based speech enhancement by fully convolutional networks. In Proceedings of the Asia Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC), Kuala Lumpur, Malaysia, 12–15 December 2017. [Google Scholar]
  20. Kiranyaz, S.; Ince, T.; Abdeljaber, O.; Avci, O.; Gabbouj, M. 1-D convolutional neural networks for signal processing applications. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019. [Google Scholar]
  21. Chung, J.; Gulcehre, C.; Cho, K.; Bengio, Y. Empirical evaluation of gated recurrent neural networks on sequence modeling. In Proceedings of the NIPS Workshop on Deep Learning; Neural Information Processing Systems (NeurIPS), Montreal, QC, Canada, 8–13 December 2014. [Google Scholar]
  22. Sun, L.; Du, J.; Dai, L.; Lee, C. Multiple-target deep learning for LSTM-RNN based speech enhancement. In Proceedings of the Hands-Free Speech Communication and Microphone Arrays (HSCMA), San Francisco, CA, USA, 1–3 March 2017. [Google Scholar]
  23. Wang, Y.; Narayanan, A.; Wang, D. On training targets for supervised speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2014, 22, 1849–1858. [Google Scholar] [CrossRef] [PubMed]
  24. Wang, D.; Chen, J. Supervised speech separation based on deep learning: An overview. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1702–1726. [Google Scholar] [CrossRef] [PubMed]
  25. Roman, N.; Woodruff, J. Ideal binary masking in reverberation. In Proceedings of the 20th European Signal Processing Conference (EUSIPCO), Bucharest, Romania, 27–31 August 2012; pp. 629–633. [Google Scholar]
  26. Narayanan, A.; Wang, D. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the 2013 IEEE International Conference on Acoustics, Speech and Signal Processing, Vancouver, BC, Canada, 26–31 May 2013; pp. 7092–7096. [Google Scholar] [CrossRef]
  27. Williamson, D.S.; Wang, Y.; Wang, D. Complex ratio masking for monaural speech separation. IEEE/ACM Trans. Audio Speech Lang. Process. 2015, 24, 483–492. [Google Scholar] [CrossRef] [PubMed]
  28. Erdogan, H.; Hershey, J.R.; Watanabe, S.; Roux, J.L. Phase-sensitive and recognition-boosted speech separation using deep recurrent neural networks. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), South Brisbane, QLD, Australia, 19–24 April 2015. [Google Scholar]
  29. Vani, H.Y.; Anusuya, M.A. Hilbert Huang transform based speech recognition. In Proceedings of the 2016 Second International Conference on Cognitive Computing and Information Processing (CCIP), Mysuru, India, 12–13 August 2016; pp. 1–6. [Google Scholar] [CrossRef]
  30. Ravanelli, M.; Bengio, Y. Speech and speaker recognition from raw waveform with sincnet. In Proceedings of the 2018 IEEE Spoken Language Technology Workshop (SLT), Athens, Greece, 18–21 December 2018. [Google Scholar]
  31. Chao, F.-A.; Hung, J.-W.; Chen, B. Cross-Domain Single-Channel Speech Enhancement Model with BI-Projection Fusion Module for Noise-Robust ASR. In Proceedings of the 2021 IEEE International Conference on Multimedia and Expo (ICME), Shenzhen, China, 5–9 July 2021. [Google Scholar]
  32. Lin, J.-Y.; Chen, Y.-T.; Liu, K.-Y.; Hung, J.-W. An evaluation study of modulation-domain wavelet denoising method by alleviating different sub-band portions for speech enhancement. In Proceedings of the 2019 IEEE International Conference on Consumer Electronics-Taiwan (ICCE-TW), Yilan, Taiwan, 20–22 May 2019. [Google Scholar]
  33. Chen, Y.-T.; Lin, Z.-Q.; Hung, J.-W. Employing low-pass filtered temporal speech features for the training of ideal ratio mask in speech enhancement. In Proceedings of the Conference on Computational Linguistics and Speech Processing (ROCLING), Taoyuan, Taiwan, 15–16 October 2021. [Google Scholar]
  34. Liao, C.-W.; Wu, P.-C.; Hung, J.-W. A Preliminary Study of Employing Lowpass-Filtered and Time-Reversed Feature Sequences as Data Augmentation for Speech Enhancement Deep Networks. In Proceedings of the 2022 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Penang, Malaysia, 22–25 November 2022. [Google Scholar]
  35. Chen, Y.T.; Wu, Z.T.; Hung, J.W. A Preliminary Study of the Application of Discrete Wavelet Transform Features in Conv-TasNet Speech Enhancement Model. In Proceedings of the 34th Conference on Computational Linguistics and Speech Processing (ROCLING 2022), Taipei, Taiwan, 21–22 November 2022; pp. 92–99. [Google Scholar]
  36. Ochiai, T.; Delcroix, M.; Ikeshita, R.; Kinoshita, K.; Nakatani, T.; Araki, S. Beam-TasNet: Time-domain Audio Separation Network Meets Frequency-domain Beamformer. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6384–6388. [Google Scholar] [CrossRef]
  37. Lea, C.; Flynn, M.D.; Vidal, R.; Reiter, A.; Hager, G.D. Temporal Convolutional Networks for Action Segmentation and Detection. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 1003–1012. [Google Scholar] [CrossRef]
  38. He, K.; Zhang, X.; Ren, S.; Sun, J. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. arXiv 2015, arXiv:1502.01852v1. [Google Scholar]
  39. Chen, J.; Mao, Q.; Liu, D. Dual-Path Transformer Network: Direct Context-Aware Modeling for End-to-End Monaural Speech Separation. arXiv 2020, arXiv:2007.13975. [Google Scholar]
  40. Luo, Y.; Chen, Z.; Yoshioka, T. Dual-path rnn: Efficient long sequence modeling for time-domain single-channel speech separation. arXiv 2019, arXiv:1910.06379. [Google Scholar]
  41. Valentini-Botinhao, C.; Wang, X.; Takaki, S.; Yamagishi, J. Investigating RNN-based speech enhancement methods for noise-robust Text-to-Speech. In Proceedings of the 9th ISCA Speech Synthesis Workshop, Sunnyvale, CA, USA, 13–15 September 2016; pp. 146–152. [Google Scholar]
  42. Veaux, C.; Yamagishi, J.; King, S. The voice bank corpus: Design, collection and data analysis of a large regional accent speech database. In Proceedings of the 2013 International Conference Oriental COCOSDA Held Jointly with 2013 Conference on Asian Spoken Language Research and Evaluation (O-COCOSDA/CASLRE 2013), Gurgaon, India, 25–27 November 2013. [Google Scholar]
  43. Thiemann, J.; Ito, N.; Vincent, E. Demand: A collection of multi-channel recordings of acoustic noise in diverse environments. In Proceedings of the 21st International Congress on Acoustics (ICA 2013), Montreal, QC, Canada, 2–7 June 2013. [Google Scholar]
  44. Dean, D.B.; Sridharan, S.; Vogt, R.J.; Mason, M.W. The QUT-NOISE-TIMIT corpus for the evaluation of voice activity detection algorithms. In Proceedings of the 11th Annual Conference of the International Speech Communication Association (INTERSPEECH 2010), Chiba, Japan, 26–30 September 2010; pp. 3110–3113. [Google Scholar]
  45. Choose a Wavelet. Available online: https://www.mathworks.com/help/wavelet/gs/choose-a-wavelet.html (accessed on 27 April 2023).
  46. Isik, Y.; Roux, J.L.; Chen, Z.; Watanabe, S.; Hershey, J.R. Single-Channel Multi-Speaker Separation Using Deep Clustering. arXiv 2016, arXiv:1607.02173. [Google Scholar]
  47. ITU-T Recommendation P. 862; Perceptual Evaluation of Speech Quality (PESQ): An Objective Method for End-to-End Speech Quality Assessment of Narrow-Band Telephone Networks and Speech Codecs. ITU Telecommunication Standardization Sector: Geneva, Switzerland, 2001.
  48. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An algorithm for intelligibility prediction of time–frequency weighted noisy speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136. [Google Scholar] [CrossRef]
  49. Défossez, A.; Usunier, N.; Bottou, L.; Bach, F. Demucs: Deep Extractor for Music Sources with extra unlabeled data remixed. arXiv 2019, arXiv:1909.01174. [Google Scholar]
  50. Park, H.J.; Kang, B.H.; Shin, W.; Kim, J.S.; Han, S.W. MANNER: Multi-View Attention Network For Noise Erasure. In Proceedings of the ICASSP 2022—2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Singapore, 23–27 May 2022; pp. 7842–7846. [Google Scholar] [CrossRef]
Figure 1. 1-D convolutional block.
Figure 2. A schematic diagram of the Conv-TasNet system.
Figure 3. A schematic diagram of the DPTNet system.
Figure 4. A schematic diagram of one-level DWT.
Figure 5. A schematic diagram of L-level DWT.
Figure 6. A schematic diagram of three-level DWT.
Figure 7. A schematic diagram of the cross-domain TCN (CD-TCN), which uses time-domain and frequency-domain features together with their fusion.
Figure 8. A schematic diagram of the encoder part of the CD-TCN with time-domain and one-level DWT features fused by (a) addition and (b) concatenation.
Figure 9. (a) The BPF module for DWT features; (b) the encoder part of the CD-TCN with time-domain and one-level DWT features integrated by fusion and concatenation.
Figure 10. A schematic diagram of the CD-TCN, which uses the concatenation of time-domain and fused two-level DWT features.
Figure 11. The diagram of integrating three DWT features with two BPF modules.
Figure 12. The diagram of integrating three DWT features with an MPF module with intra-channel softmax.
Figure 13. The diagram of integrating three DWT features with an MPF module with inter-channel softmax.
Figure 14. Spectrograms of a sentence in various situations: (a) clean and noise-free; (b) noise-corrupted (with bus noise at 17.5 dB SNR); (c) noise-corrupted and enhanced by the time-feature Conv-TasNet; noise-corrupted and enhanced by the three one-level DWT-wise Conv-TasNet variants, (d) addition, (e) concatenation, and (f) one BPF and concatenation; and noise-corrupted and enhanced by the three two-level DWT-wise Conv-TasNet variants, (g) two BPFs, (h) MPF with intra-channel softmax, and (i) MPF with inter-channel softmax. The regions outlined by the red boxes help compare the denoising behavior of the different methods.
Figure 15. Spectrograms of the noisy sentence in Figure 14b enhanced by (a) the time-feature DPTNet; (b) the one-level DWT-wise DPTNet with one BPF and concatenation; and the two-level DWT-wise DPTNet with (c) two BPFs and (d) one MPF with inter-channel softmax. The regions outlined by the red boxes help compare the denoising behavior of the different methods.
Table 1. The evaluation results of Conv-TasNet using different domains of features for the test set in noise environment of "DEMAND" (VoiceBank-DEMAND metric scores).

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.970 | 0.921 | 8.445 |
| Time | – | 2.618 | 0.943 | 19.500 |
| Time and STFT [31] | – | 2.648 | 0.942 | 19.712 |
| Time and one-level DWT | Addition | 2.681 | 0.942 | 19.352 |
| Time and one-level DWT | Concatenation | 2.669 | 0.942 | 19.609 |
| Time and one-level DWT | One BPF and concatenation | 2.668 | 0.943 | 19.496 |
| Time and two-level DWT (concatenation) | Two BPFs | 2.654 | 0.943 | 19.703 |
| Time and two-level DWT (concatenation) | One MPF with intra-channel softmax | 2.667 | 0.942 | 19.540 |
| Time and two-level DWT (concatenation) | One MPF with inter-channel softmax | 2.690 | 0.942 | 19.378 |
Table 2. The evaluation results of Conv-TasNet using different domains of features for the test set in noise environment of "QUT-NOISE" (VoiceBank-QUT metric scores).

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.247 | 0.784 | 3.876 |
| Time | – | 1.908 | 0.860 | 13.694 |
| Time and STFT [31] | – | 1.936 | 0.863 | 13.779 |
| Time and one-level DWT | Addition | 1.922 | 0.858 | 13.645 |
| Time and one-level DWT | Concatenation | 1.932 | 0.861 | 13.775 |
| Time and one-level DWT | One BPF and concatenation | 1.936 | 0.862 | 13.824 |
| Time and two-level DWT (concatenation) | Two BPFs | 1.922 | 0.861 | 13.837 |
| Time and two-level DWT (concatenation) | One MPF with intra-channel softmax | 1.917 | 0.861 | 13.748 |
| Time and two-level DWT (concatenation) | One MPF with inter-channel softmax | 1.926 | 0.859 | 13.729 |
Table 3. The evaluation results of DPTNet using different domains of features for the test set in noise environment of "DEMAND" (VoiceBank-DEMAND metric scores).

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.970 | 0.921 | 8.445 |
| Time | – | 2.549 | 0.935 | 19.080 |
| Time and STFT [31] | – | 2.782 | 0.946 | 19.963 |
| Time and one-level DWT | One BPF and concatenation | 2.724 | 0.945 | 19.960 |
| Time and two-level DWT (concatenation) | Two BPFs | 2.745 | 0.944 | 19.624 |
| Time and two-level DWT (concatenation) | One MPF with inter-channel softmax | 2.779 | 0.944 | 19.470 |
Table 4. The evaluation results of DPTNet using different domains of features for the test set in noise environment of "QUT-NOISE" (VoiceBank-QUT metric scores).

| Feature Domains | Integration Manner | PESQ | STOI | SI-SNR (dB) |
|---|---|---|---|---|
| Unprocessed baseline | – | 1.247 | 0.784 | 3.876 |
| Time | – | 1.804 | 0.845 | 12.802 |
| Time and STFT [31] | – | 2.019 | 0.870 | 14.500 |
| Time and one-level DWT | One BPF and concatenation | 2.044 | 0.873 | 14.543 |
| Time and two-level DWT (concatenation) | Two BPFs | 2.033 | 0.872 | 14.549 |
| Time and two-level DWT (concatenation) | One MPF with inter-channel softmax | 2.034 | 0.871 | 14.611 |
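As a reference for interpreting the SI-SNR column in Tables 1–4, the following is a minimal sketch of the commonly used scale-invariant signal-to-noise ratio computation for a single pair of enhanced and clean utterances. It follows the standard zero-mean, scale-invariant definition and is not claimed to be the exact evaluation script behind the reported numbers; the synthetic signals in the usage example are illustrative only.

```python
# Minimal sketch of scale-invariant SNR (SI-SNR) in dB for one utterance pair.
# Standard zero-mean, scale-invariant definition; offered for reference only.
import numpy as np

def si_snr(enhanced: np.ndarray, clean: np.ndarray, eps: float = 1e-8) -> float:
    enhanced = enhanced - np.mean(enhanced)
    clean = clean - np.mean(clean)
    # Project the enhanced signal onto the clean signal to obtain the target component.
    s_target = np.dot(enhanced, clean) / (np.dot(clean, clean) + eps) * clean
    e_noise = enhanced - s_target
    return 10 * np.log10((np.sum(s_target**2) + eps) / (np.sum(e_noise**2) + eps))

# Illustrative usage with synthetic signals (one second at 16 kHz).
rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
enhanced = clean + 0.1 * rng.standard_normal(16000)
print(f"SI-SNR: {si_snr(enhanced, clean):.2f} dB")
```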