Late Reverberant Spectral Variance Estimation for Single-Channel Dereverberation Using Adaptive Parameter Estimator

Zhang, Zhaoqi; Feng, Xuelei; Shen, Yong

doi:10.3390/app11178054

Open Access Communication

by Zhaoqi Zhang ¹, Xuelei Feng ¹ and Yong Shen ¹,²,*

¹ Institute of Acoustics, Nanjing University, Nanjing 210093, China
² Shenzhen Research Institute of Nanjing University, Shenzhen 518000, China
* Author to whom correspondence should be addressed.

Appl. Sci. 2021, 11(17), 8054; https://doi.org/10.3390/app11178054

Submission received: 6 July 2021 / Revised: 21 August 2021 / Accepted: 28 August 2021 / Published: 30 August 2021

(This article belongs to the Special Issue Sound Field Control)


Abstract: The estimation of the late reverberant spectral variance (LRSV) is of paramount importance in most reverberation suppression algorithms. This letter proposes an improved single-channel LRSV estimator that builds on the Habets LRSV estimator by adding an adaptive parameter estimator. Instead of estimating the direct-to-reverberation ratio (DRR), the proposed LRSV estimator directly estimates the parameter $κ$ in a generalized statistical model, since the experimental results show that even the $κ$ calculated from the measured ground truth DRR may not be the optimal parameter for the LRSV estimator. Experimental results using synthetic reverberant signals demonstrate the superiority of the proposed estimator over conventional approaches.

Keywords:

dereverberation; single-channel; probability-based

1. Introduction

Speech signals received within a room usually contain reverberation which impairs the intelligibility of speech in communication scenarios such as mobile phones and hearing aids. Reverberation will also degrade the recognition performance of automatic speech recognition systems. Hence, speech dereverberation is still an important issue nowadays.

Dereverberation techniques can be divided into reverberation cancellation [1] and reverberation suppression [2,3], depending on whether or not the acoustic impulse response (AIR) needs to be estimated during the dereverberation [4]. The major part of most reverberation suppression methods is the estimation of the late reverberant spectral variance (LRSV), which remains a challenging task due to its high time variability [5]. Habets proposed a single-channel LRSV estimator [3] based on a generalized statistical model [6] to suppress late reverberation, and it still performs very well today [5]. However, in the Habets LRSV estimator, two parameters (i.e., the reverberation time $T_{60}$ and the parameter $κ$, which is related to the direct-to-reverberation ratio (DRR)) must be given in advance or estimated online.

To the authors’ knowledge, there are numerous reverberation time estimation methods, whereas there are few single-channel online DRR estimation methods [7]. Moreover, according to practical experience, it may be inappropriate to obtain $κ$ indirectly by estimating the DRR, because even the $κ$ calculated from the measured ground truth DRR may not be the optimal $κ$ for the Habets estimator. A detailed discussion can be found in Section 5. Therefore, unlike traditional methods that use an estimated DRR to calculate $κ$, the present work proposes a blind adaptive $κ$ estimator that improves the performance of the Habets LRSV estimator and makes it more practical. Inspired by the optimally-modified log-spectral amplitude (OM-LSA) algorithm [8], this letter differentiates between the direct sound presence and absence hypotheses and derives the conditional direct sound presence probability, which controls a time-varying recursive average of the estimated $κ$. The proposed $κ$ estimator is evaluated and compared with an existing $κ$ estimator [9] and with the $κ$ calculated using the measured ground truth DRR. The evaluation results show that the proposed $κ$ estimator performs better than the conventional $κ$ estimator or the measured $κ$ under all evaluation conditions. In addition, the quality of the dereverberated speech is evaluated and compared to a method using the recursive maximum-sparseness-power-prediction-model (MSPP) [10].

2. Problem Formulation

The reverberant signal results from the convolution of the anechoic speech signal and a causal AIR. The anechoic speech signal can be expressed in the short-time Fourier transform (STFT) domain as $S(k, l)$, where $k$ and $l$ are the frequency and frame indices, respectively. According to the convolutive transfer function (CTF) model [2], the reverberant speech signal $Z(k, l)$ can be expressed as

$$Z(k, l) = \sum_{l' = 0}^{+\infty} H(k, l') S(k, l - l'), \tag{1}$$

where $H(k, l)$ represents the AIR, which can be split into three components:

$$H(k, l) = \begin{cases} H_{d}(k), & l = 0 \\ H_{e}(k, l), & 1 \leq l \leq N_{e} \\ H_{l}(k, l), & l > N_{e} \end{cases} \tag{2}$$

where $H_{d}(k)$ is the direct sound, $H_{e}(k, l)$ consists of early reflections, $H_{l}(k, l)$ represents later reflections, and $N_{e}$ usually corresponds to approximately 20–50 ms. The late reverberant speech component $Z_{l}(k, l) = \sum_{l' = N_{e} + 1}^{+\infty} H_{l}(k, l') S(k, l - l')$ mainly decreases the speech fidelity and intelligibility [4] and needs to be suppressed. Hence, the main challenge is to derive an estimator for the spectral variance of the late reverberant speech component (i.e., the LRSV) $λ_{l}(k, l) = E[|Z_{l}(k, l)|^{2}]$, where $E[\cdot]$ denotes the expectation operator. Once $λ_{l}(k, l)$ is given, a spectral enhancement method [11] can be used to suppress the late reverberation.
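The CTF model of Equation (1) and the late component $Z_{l}(k, l)$ can be illustrated with a minimal NumPy sketch. The function name `ctf_convolve`, the array shapes and the random test data are assumptions made for illustration; they do not come from the paper.

```python
import numpy as np

def ctf_convolve(H, S, start=0):
    """Per-bin convolution along frames: Z(k,l) = sum_{l'>=start} H(k,l') S(k,l-l').

    H : (K, Lh) complex CTF coefficients of the AIR
    S : (K, L)  complex STFT of the anechoic speech
    Only taps l' >= start are used, so start = N_e + 1 yields the late part Z_l.
    """
    K, L = S.shape
    Z = np.zeros((K, L), dtype=complex)
    for lp in range(start, min(H.shape[1], L)):
        # Tap l' delays the anechoic frames by l' and scales them per bin.
        Z[:, lp:] += H[:, lp:lp + 1] * S[:, :L - lp]
    return Z

# Hypothetical toy data (not from the paper).
rng = np.random.default_rng(0)
K, L, Lh, N_e = 4, 50, 12, 3
S = rng.standard_normal((K, L)) + 1j * rng.standard_normal((K, L))
H = rng.standard_normal((K, Lh)) + 1j * rng.standard_normal((K, Lh))
Z = ctf_convolve(H, S)                        # Equation (1)
Z_late = ctf_convolve(H, S, start=N_e + 1)    # late reverberant component
```

Averaging `np.abs(Z_late) ** 2` over realizations (or smoothing it over time) would give an empirical counterpart of $λ_{l}(k, l)$.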

3. Brief Review of Habets Late Reverberant Spectral Variance Estimator

The underlying theory for the present work is based on the LRSV estimator derived by Habets [3]. The Habets method is based on a generalized statistical model, which is an improvement on Polack’s statistical model [4]. Let $H_{r}(k, l)$ denote the early and late reflections together. The corresponding spectral variance can then be written as

$$λ_{h}(k, l) = E[|H(k, l)|^{2}] = \begin{cases} λ_{h_{d}}(k), & l = 0 \\ λ_{h_{r}}(k, l), & l \geq 1 \end{cases} \tag{3}$$

$$λ_{h_{d}}(k) = E[|H_{d}(k)|^{2}], \quad λ_{h_{r}}(k, l) = κ(k) λ_{h_{d}}(k) e^{-\frac{13.8 l R}{T_{60}(k) f_{s}}},$$

where $T_{60}(k)$ is the frequency-dependent reverberation time, $f_{s}$ denotes the sampling frequency, $R$ is the discrete time shift, and $κ(k)$ is a prior parameter that is related to the DRR.

Assuming that the direct component $Z_{d}(k, l) = H_{d}(k) S(k, l)$ and the reverberant component $Z_{r}(k, l) = \sum_{l' = 1}^{+\infty} H_{r}(k, l') S(k, l - l')$ are uncorrelated, the spectral variance $λ_{z}(k, l) = E[|Z(k, l)|^{2}]$ can be expressed as the sum of the direct component spectral variance $λ_{d}(k, l) = E[|Z_{d}(k, l)|^{2}]$ and the reverberant component spectral variance $λ_{r}(k, l) = E[|Z_{r}(k, l)|^{2}]$, such that

$$λ_{z}(k, l) = \underbrace{λ_{h_{d}}(k) λ_{s}(k, l)}_{λ_{d}(k, l)} + \underbrace{\sum_{l' = 1}^{+\infty} λ_{h_{r}}(k, l') λ_{s}(k, l - l')}_{λ_{r}(k, l)}, \tag{4}$$

where $λ_{s}(k, l)$ is the spectral variance of $S(k, l)$. The reverberant component $λ_{r}(k, l)$ can be further split into early reverberation $λ_{e}(k, l)$ and late reverberation $λ_{l}(k, l)$:

$$λ_{r}(k, l) = \underbrace{\sum_{l' = 1}^{N_{e}} λ_{h_{r}}(k, l') λ_{s}(k, l - l')}_{λ_{e}(k, l)} + \underbrace{\sum_{l' = N_{e}}^{+\infty} λ_{h_{r}}(k, l') λ_{s}(k, l - l')}_{λ_{l}(k, l)}, \tag{5}$$

and the main purpose is to derive an estimator for the LRSV $λ_{l}(k, l)$. Combining Equations (3) and (4), $λ_{r}(k, l)$ can be obtained recursively as

$$λ_{r}(k, l) = \exp\left\{\frac{-13.8 R}{T_{60}(k) f_{s}}\right\} \left[(1 - κ(k)) λ_{r}(k, l - 1) + κ(k) λ_{z}(k, l - 1)\right]. \tag{6}$$

Finally, according to Equations (3) and (5), $λ_{l}(k, l)$ can be obtained from $λ_{r}(k, l)$ as

$$λ_{l}(k, l) = \exp\left\{\frac{-13.8 R (N_{e} - 1)}{T_{60}(k) f_{s}}\right\} λ_{r}(k, l - N_{e} + 1). \tag{7}$$
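The recursion of Equations (6) and (7) can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the authors' MATLAB implementation: the function name `habets_lrsv`, the zero initial condition for $λ_{r}$, and the toy input are assumptions.

```python
import numpy as np

def habets_lrsv(lam_z, kappa, T60, fs, R, N_e):
    """Recursive LRSV estimate following Eqs. (6) and (7).

    lam_z : (K, L) spectral variance of the reverberant signal
    kappa : scalar or (K,) model parameter
    T60   : scalar or (K,) reverberation time in seconds
    fs    : sampling frequency in Hz
    R     : discrete time shift in samples
    N_e   : number of frames counted as early (N_e >= 1)
    """
    lam_z = np.asarray(lam_z, dtype=float)
    K, L = lam_z.shape
    kappa = np.broadcast_to(np.asarray(kappa, dtype=float), (K,))
    T60 = np.broadcast_to(np.asarray(T60, dtype=float), (K,))
    decay = np.exp(-13.8 * R / (T60 * fs))  # exp{-13.8 R / (T60 fs)} per bin

    # Eq. (6); lam_r(k, 0) = 0 is an assumed initial condition.
    lam_r = np.zeros((K, L))
    for l in range(1, L):
        lam_r[:, l] = decay * ((1.0 - kappa) * lam_r[:, l - 1]
                               + kappa * lam_z[:, l - 1])

    # Eq. (7): delay lam_r by N_e - 1 frames and apply the matching decay.
    shift = N_e - 1
    lam_l = np.zeros((K, L))
    lam_l[:, shift:] = (decay ** shift)[:, None] * lam_r[:, : L - shift]
    return lam_r, lam_l

# Toy input (hypothetical): a single burst of energy in frame 0 of every bin.
lam_z = np.zeros((3, 10))
lam_z[:, 0] = 1.0
lam_r, lam_l = habets_lrsv(lam_z, kappa=0.3, T60=0.5, fs=16000, R=128, N_e=3)
```

In practice `lam_z` would itself be estimated, for example by recursively smoothing $|Z(k, l)|^{2}$ over time; that step is not shown here.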

4. Parameter Estimation

In the Habets LRSV estimator, two parameters (i.e., $T_{60}$ and $κ$) must be given in advance. The reverberation time $T_{60}$ can be determined by applying Schroeder’s method to the AIR. The parameter $κ$ is related to the DRR and can be calculated [3] as

$$κ = \frac{1}{DRR} \frac{1 - \exp\left\{\frac{-13.8 R}{T_{60} f_{s}}\right\}}{\exp\left\{\frac{-13.8 R}{T_{60} f_{s}}\right\}}, \tag{8}$$

where $DRR = \sum_{n = 0}^{R - 1} h^{2}(n) / \sum_{n = R}^{+\infty} h^{2}(n)$ and $h(n)$ represents the AIR. In practice, the Habets LRSV estimator is often used without knowing these two parameters. $T_{60}$ estimation has been well investigated, and numerous blind approaches can be found. However, DRR estimation is less mature, and there are few online single-channel estimation algorithms [7]. Therefore, the reverberation time $T_{60}$ is assumed to be known in the following, and the present work focuses on the $κ$ estimation. Most existing $κ$ estimators [4,9] treat $κ$ as a frequency-independent parameter. Hence, this letter also derives a fullband $κ$ estimator, which can make the LRSV estimator more practical and accurate.
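As a worked illustration of Equation (8), the sketch below computes the fullband $κ$ from a time-domain AIR. The function name `kappa_from_air` and the two-tap toy AIR are hypothetical, and the sketch assumes that `h` is aligned so that the direct sound lies within its first `R` samples.

```python
import numpy as np

def kappa_from_air(h, T60, fs, R):
    """Fullband kappa via Eq. (8).

    DRR is the energy of the first R samples of h divided by the energy of
    the remaining samples; h is assumed to start at the direct sound.
    """
    h = np.asarray(h, dtype=float)
    drr = np.sum(h[:R] ** 2) / np.sum(h[R:] ** 2)
    decay = np.exp(-13.8 * R / (T60 * fs))
    return (1.0 - decay) / (drr * decay)

# Hypothetical toy AIR: unit direct sound plus one reflection of amplitude 0.5.
R, fs, T60 = 128, 16000, 0.5
h = np.zeros(4000)
h[0] = 1.0
h[R] = 0.5                      # DRR = 1 / 0.25 = 4
kappa = kappa_from_air(h, T60, fs, R)
```

A larger DRR (stronger direct sound) gives a smaller $κ$, and for a fixed DRR a longer $T_{60}$ also gives a smaller $κ$.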

4.1. Proposed $κ$ Estimator

Inspired by the OM-LSA algorithm [8], this letter proposes an adaptive $κ$ estimator using a probability-based framework. Consider two hypotheses, $H_{0}(l)$ and $H_{1}(l)$, which indicate, respectively, direct sound absence and presence in the $l$th frame:

$$\begin{aligned} H_{0}(l) &: Z(k, l) = Z_{r}(k, l), \\ H_{1}(l) &: Z(k, l) = Z_{d}(k, l) + Z_{r}(k, l). \end{aligned} \tag{9}$$

When the direct sound is absent, the desired $κ$ can be directly estimated according to Equation (6). Accordingly, the proposed $κ$ estimation strategy is to recursively average past estimates of $κ$ during periods of direct sound absence, and to hold the estimate during direct sound presence. Specifically, the proposed $κ$ estimator is

$$\begin{aligned} H_{0}(l) &: κ(l + 1) = α_{κ} κ(l) + (1 - α_{κ}) \hat{κ}(l), \\ H_{1}(l) &: κ(l + 1) = κ(l), \end{aligned} \tag{10}$$

where $α_{κ}$ denotes a smoothing parameter, and $\hat{κ}(l)$ denotes the estimated $κ$ in the $l$th frame. Under direct sound uncertainty, the frame conditional direct sound presence probability is defined as $p(l) \triangleq P(H_{1}(l) | Z(k, l), k = 0, 1, \dots, K - 1)$, and the recursive averaging becomes

$$\begin{aligned} κ(l + 1) &= p(l) κ(l) + (1 - p(l)) [α_{κ} κ(l) + (1 - α_{κ}) \hat{κ}(l)] \\ &= \tilde{α}_{κ}(l) κ(l) + (1 - \tilde{α}_{κ}(l)) \hat{κ}(l), \end{aligned} \tag{11}$$

where $\tilde{α}_{κ}(l) = p(l) + α_{κ} (1 - p(l))$ is a time-varying smoothing parameter that is adjusted by the frame conditional direct sound presence probability $p(l)$.
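The soft-decision update of Equation (11) reduces to a one-line recursion. The sketch below uses the hypothetical function name `update_kappa`; the default $α_{κ} = 0.75$ is the value reported in Section 5.1, and the numeric inputs are toy values.

```python
def update_kappa(kappa, kappa_hat, p, alpha_kappa=0.75):
    """One step of Eq. (11).

    kappa     : current estimate kappa(l)
    kappa_hat : frame estimate kappa_hat(l)
    p         : frame direct sound presence probability p(l), in [0, 1]
    """
    alpha_tilde = p + alpha_kappa * (1.0 - p)   # time-varying smoothing factor
    return alpha_tilde * kappa + (1.0 - alpha_tilde) * kappa_hat

# Limiting cases: p = 1 holds the estimate, p = 0 gives plain recursive averaging.
held = update_kappa(1.0, 0.2, p=1.0)       # stays at 1.0
averaged = update_kappa(1.0, 0.2, p=0.0)   # 0.75 * 1.0 + 0.25 * 0.2
```

The two limiting cases reproduce the hard decisions of Equation (10), and intermediate values of $p(l)$ interpolate between them.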

Two parts of the proposed $κ$ estimator remain to be determined: (1) the frame conditional direct sound presence probability, $p(l)$; and (2) the estimated $κ$ in the $l$th frame, $\hat{κ}(l)$.

4.1.1. Frame Conditional Direct Sound Presence Probability

Let us assume that the STFT coefficients $Z_{d}(k, l)$ and $Z_{r}(k, l)$ are complex Gaussian variables. Then, applying Bayes’ rule [8], the conditional direct sound presence probability $p(k, l) \triangleq P(H_{1}(l) | Z(k, l))$ can be written as

$$p(k, l) = \left\{1 + \frac{q(l)}{1 - q(l)} [1 + ξ(k, l)] e^{-υ(k, l)}\right\}^{-1}, \tag{12}$$

where $q(l) \triangleq P(H_{0}(l))$ is the a priori probability of direct sound absence, $ξ(k, l) \triangleq \frac{λ_{d}(k, l)}{λ_{r}(k, l)}$ is the a priori signal-to-reverberation ratio (SRR), $γ(k, l) \triangleq \frac{|Z(k, l)|^{2}}{λ_{r}(k, l)}$ is the a posteriori SRR, and $υ(k, l) \triangleq \frac{γ(k, l) ξ(k, l)}{1 + ξ(k, l)}$. Note that $γ(k, l)$ can be calculated directly, whereas $q(l)$ and $ξ(k, l)$ need to be determined.

Considering that $λ_{z}(k, l)$ decays frame by frame during periods of direct sound absence $H_{0}(l)$, the a priori probability of direct sound absence $q(l)$ can be defined as

$$q(l) = \frac{1}{K} \sum_{k = 0}^{K - 1} u(λ_{z}(k, l - 1) - λ_{z}(k, l)), \tag{13}$$

where $u(\cdot)$ is the unit step function. Then, the a priori SRR $ξ(k, l)$ can be obtained via recursive averaging as

$$ξ(k, l) = α_{ξ} ξ(k, l - 1) + (1 - α_{ξ}) \max\left\{\frac{λ_{z}(k, l)}{λ_{r}(k, l)} - 1, 0\right\}, \tag{14}$$

where $α_{ξ}$ is a smoothing parameter.

After $p(k, l)$ is determined, the frame conditional direct sound presence probability $p(l)$ is taken as the average of $p(k, l)$ over all frequency bins, $p(l) = \frac{1}{K} \sum_{k = 0}^{K - 1} p(k, l)$.
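Equations (12) to (14) and the frequency average can be combined into one per-frame routine, sketched below in NumPy. The function name `presence_probability`, the choice $u(0) = 0$ for the unit step, and the clipping of $q(l)$ and $λ_{r}$ by a small `eps` (to avoid division by zero) are assumptions of this sketch and are not specified in the text; $α_{ξ} = 0.95$ is the value from Section 5.1.

```python
import numpy as np

def presence_probability(Z_abs2, lam_z, lam_z_prev, lam_r, xi_prev,
                         alpha_xi=0.95, eps=1e-6):
    """Frame direct sound presence probability for one frame l.

    All array inputs have shape (K,): |Z(k,l)|^2, lam_z(k,l), lam_z(k,l-1),
    lam_r(k,l) and xi(k,l-1). Returns (p_l, p_kl, xi).
    """
    lam_r = np.maximum(lam_r, eps)
    # Eq. (13): fraction of bins whose variance decayed (assumes u(0) = 0).
    q = np.clip(np.mean(lam_z_prev > lam_z), eps, 1.0 - eps)
    # Eq. (14): recursively smoothed a priori SRR.
    xi = alpha_xi * xi_prev + (1.0 - alpha_xi) * np.maximum(lam_z / lam_r - 1.0, 0.0)
    gamma = Z_abs2 / lam_r                     # a posteriori SRR
    upsilon = gamma * xi / (1.0 + xi)
    # Eq. (12): per-bin presence probability.
    p_kl = 1.0 / (1.0 + q / (1.0 - q) * (1.0 + xi) * np.exp(-upsilon))
    return float(p_kl.mean()), p_kl, xi

# Hypothetical frames: a strong onset versus a pure decay.
K = 8
lam_r = np.ones(K)
p_onset, _, _ = presence_probability(100 * lam_r, 100 * lam_r, lam_r, lam_r,
                                     xi_prev=50 * np.ones(K))
p_decay, _, _ = presence_probability(lam_r, lam_r, 2 * lam_r, lam_r,
                                     xi_prev=np.zeros(K))
```

With these toy inputs, `p_onset` is close to 1 (the estimate of $κ$ is held) and `p_decay` is close to 0 (the estimate is updated).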

4.1.2. Estimated $κ$ in Each Frame

Under the direct sound absence hypothesis $H_{0}(l)$, Equation (4) becomes $λ_{z}(k, l) = λ_{r}(k, l)$, and substituting it into Equation (6) yields

$$λ_{z}(k, l) = \exp\left\{\frac{-13.8 R}{T_{60}(k) f_{s}}\right\} \left[(1 - κ) λ_{r}(k, l - 1) + κ λ_{z}(k, l - 1)\right]. \tag{15}$$

After some algebra, Equation (15) can be rewritten as

$$κ = \frac{\exp\left\{\frac{13.8 R}{T_{60}(k) f_{s}}\right\} λ_{z}(k, l) - λ_{r}(k, l - 1)}{λ_{z}(k, l - 1) - λ_{r}(k, l - 1)}. \tag{16}$$

Then, the estimated $κ$ in the $l$th frame is obtained by averaging Equation (16) over frequency:

$$\hat{κ}(l) = \frac{\sum_{k = 0}^{K - 1} \left[\exp\left\{\frac{13.8 R}{T_{60}(k) f_{s}}\right\} λ_{z}(k, l) - λ_{r}(k, l - 1)\right]}{\sum_{k = 0}^{K - 1} \left[λ_{z}(k, l - 1) - λ_{r}(k, l - 1)\right]}. \tag{17}$$

Note that the numerator and the denominator of Equation (16) are averaged separately in order to avoid division by zero.
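The frame estimate of Equation (17) can be checked numerically: if the data are generated exactly by Equation (15) with a known $κ$, the estimator should return that $κ$. The sketch below does this with hypothetical random data; the function name `kappa_hat_frame` and all numeric values are assumptions for illustration.

```python
import numpy as np

def kappa_hat_frame(lam_z, lam_z_prev, lam_r_prev, T60, fs, R):
    """Frame estimate of kappa via Eq. (17).

    lam_z, lam_z_prev, lam_r_prev : (K,) arrays for lam_z(k,l), lam_z(k,l-1)
                                    and lam_r(k,l-1)
    T60 : scalar or (K,) reverberation time in seconds
    """
    growth = np.exp(13.8 * R / (np.asarray(T60, dtype=float) * fs))
    # Numerator and denominator are summed over frequency separately.
    num = np.sum(growth * lam_z - lam_r_prev)
    den = np.sum(lam_z_prev - lam_r_prev)
    return num / den

# Self-consistency check on synthetic data generated by Eq. (15).
rng = np.random.default_rng(1)
K, fs, R, kappa_true = 16, 16000, 128, 0.4
T60 = rng.uniform(0.4, 2.0, K)                    # per-bin reverberation time
lam_r_prev = rng.uniform(0.1, 1.0, K)
lam_z_prev = lam_r_prev + rng.uniform(0.1, 1.0, K)
decay = np.exp(-13.8 * R / (T60 * fs))
lam_z = decay * ((1 - kappa_true) * lam_r_prev + kappa_true * lam_z_prev)
kappa_hat = kappa_hat_frame(lam_z, lam_z_prev, lam_r_prev, T60, fs, R)
```

The summed denominator can still be zero if $λ_{z}$ and $λ_{r}$ coincide in every bin; a practical implementation would probably guard against that case, which this sketch does not.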

Equation (17) is similar to the conventional estimator [9], which is given in Equation (18). However, Equation (17) is derived under the direct sound absence hypothesis using Equation (6). Hence, the proposed estimator uses a probability-based framework to update $κ$, rather than the simple heuristic used in the conventional estimator. Further comparison can be found in Section 5.

$$\hat{κ}(l) = \frac{\exp\left\{\frac{13.8 R N_{e}}{T_{60} f_{s}}\right\} \sum_{k = 0}^{K - 1} λ_{z}(k, l) - \sum_{k = 0}^{K - 1} λ_{l}(k, l - N_{e})}{\sum_{k = 0}^{K - 1} λ_{z}(k, l - N_{e}) - \sum_{k = 0}^{K - 1} λ_{l}(k, l - N_{e})}. \tag{18}$$

5. Performance Evaluation

In this section, the performance of the LRSV estimator using the proposed $κ$ estimator is evaluated. The performance using $κ$ obtained by four other methods is also evaluated: the conventional $κ$ estimator [9], the measured ground truth $κ$ calculated from the measured DRR and $T_{60}$ according to Equation (8) (in fullband and subband variants), and the scanning-optimal $κ$ obtained by a scanning method that steps $κ$ from 0.05 to 1.5 at intervals of 0.01. In addition, the quality of the dereverberated speech using the proposed method is evaluated and compared to a recent method using recursive MSPP [10].
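The scanning method amounts to a one-dimensional grid search. The sketch below assumes a hypothetical callable `mean_lsd(kappa)` that runs the LRSV estimator with a fixed $κ$ and returns the mean LSD; the quadratic toy cost used in the demonstration stands in for that evaluation and is not taken from the paper.

```python
import numpy as np

def scan_kappa(mean_lsd, lo=0.05, hi=1.5, step=0.01):
    """Grid search for the kappa that minimizes a user-supplied cost.

    mean_lsd : callable mapping a scalar kappa to a scalar cost (hypothetical)
    Returns (best_kappa, grid, costs).
    """
    n = int(round((hi - lo) / step)) + 1        # 146 grid points for the defaults
    grid = lo + step * np.arange(n)
    costs = np.array([mean_lsd(k) for k in grid])
    best = int(np.argmin(costs))
    return float(grid[best]), grid, costs

# Toy cost with its minimum at 0.42, used only to exercise the search.
kappa_opt, grid, costs = scan_kappa(lambda k: (k - 0.42) ** 2)
```

Because the search needs the ground truth LRSV to evaluate the cost, it serves as an oracle reference rather than a blind estimator.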

5.1. Setup

The signals processed in this letter are synthetic reverberant signals created by convolving AIRs measured in a real hall with a reverberation time of 2 s (from an open database [12]) with a male speaker signal of 15 s length. Six AIRs (referred to as $AIR_{1} \sim AIR_{6}$) with different $κ$ ranging from 0.12 to 1.54 are adopted. Figure 1 shows the signal used in the experiment with and without reverberation.

As mentioned in Section 4, the ground truth $κ$ (fullband and 1/3-octave subband) is calculated using the measured DRR and $T_{60}$ via Equation (8). In addition, the reverberation time $T_{60}$ is assumed to be known. Hence, the $T_{60}$ used in this work is determined directly in 1/3-octave subbands by applying Schroeder’s method to the AIRs. For evaluation purposes, the ground truth late reverberant speech component $z_{l}(n)$ is defined as the anechoic male speaker signal convolved with the tail of the AIR starting 50 ms after the direct sound. The other parameters used in this paper are chosen empirically as $κ(0) = 1$, $α_{κ} = 0.75$, and $α_{ξ} = 0.95$, similar to reference [8]. All experiments are carried out in MATLAB.
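The construction of the ground truth late component $z_{l}(n)$ can be sketched as follows. The function name `late_reverberant_reference` and the toy AIR are hypothetical, and locating the direct sound at the strongest peak of the AIR is an assumption of this sketch; the text does not state how the direct sound was located.

```python
import numpy as np

def late_reverberant_reference(s, h, fs, late_start_s=0.05):
    """Anechoic signal convolved with the AIR tail starting 50 ms after the direct sound."""
    s = np.asarray(s, dtype=float)
    h = np.asarray(h, dtype=float)
    n_direct = int(np.argmax(np.abs(h)))        # assumed: direct sound = strongest peak
    start = n_direct + int(round(late_start_s * fs))
    h_late = np.zeros_like(h)
    h_late[start:] = h[start:]                  # keep only the late tail
    return np.convolve(s, h_late)[: len(s)]

# Toy AIR at fs = 1 kHz: direct sound, one early and one late reflection.
fs = 1000
h = np.zeros(200)
h[10], h[40], h[65] = 1.0, 0.6, 0.5             # 0 ms, +30 ms, +55 ms
s = np.zeros(300)
s[0] = 1.0                                      # unit impulse as "speech"
z_l = late_reverberant_reference(s, h, fs)
```

With an impulse as input, `z_l` contains only the reflection arriving 55 ms after the direct sound.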

The log spectral distortion (LSD) [4] is adopted to evaluate the LRSV estimator. It is the root mean square (RMS) value of the difference between the log spectra of the estimated LRSV $\hat{λ}_{l}(k, l)$ and the ground truth LRSV $λ_{l}(k, l)$, defined as

$$\begin{aligned} LSD_{late}(l) &= \sqrt{\frac{1}{K} \sum_{k = 0}^{K - 1} |e(k, l)|^{2}}, \\ e(k, l) &= L\{\hat{λ}_{l}(k, l)\} - L\{λ_{l}(k, l)\}, \end{aligned} \tag{19}$$

where $L\{\cdot\} = \max\{10 \lg |\cdot|, δ\}$ is the log spectrum confined to a 50 dB dynamic range and $δ = \max_{k, l}\{10 \lg |\cdot|\} - 50$. The mean LSD (referred to as $\overline{LSD}$) is obtained by averaging Equation (19) over all frames. In addition, the lower and upper semi-variances of the error $e(k, l)$ are calculated to evaluate the LRSV estimator [5]:

$$\begin{aligned} σ_{l} &= \sqrt{\frac{1}{|τ_{l}|} \sum_{k, l \in τ_{l}} (e(k, l) - \bar{e})^{2}}, \quad τ_{l} : e(k, l) \leq \bar{e}, \\ σ_{u} &= \sqrt{\frac{1}{|τ_{u}|} \sum_{k, l \in τ_{u}} (e(k, l) - \bar{e})^{2}}, \quad τ_{u} : e(k, l) > \bar{e}, \end{aligned} \tag{20}$$

where $\bar{e} = \operatorname{mean}_{k, l}\{e(k, l)\}$ is the mean value of $e(k, l)$.
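The evaluation measures of Equations (19) and (20) can be sketched in NumPy as below. The function names are hypothetical, and the sketch assumes that the 50 dB floor $δ$ is computed separately from the maximum of each spectrum passed to $L\{\cdot\}$, which is one reading of the definition above.

```python
import numpy as np

def log_spectrum(lam, dyn_range_db=50.0):
    """L{.}: 10*lg|.| floored at (its own maximum - 50 dB)."""
    db = 10.0 * np.log10(np.maximum(np.abs(lam), 1e-300))  # guard against log(0)
    return np.maximum(db, db.max() - dyn_range_db)

def _rms_about(x, center):
    # RMS deviation from `center`; an empty set contributes 0.
    return float(np.sqrt(np.mean((x - center) ** 2))) if x.size else 0.0

def lrsv_error_metrics(lam_est, lam_true):
    """Mean LSD (Eq. 19), mean log error and semi-variances (Eq. 20).

    lam_est, lam_true : (K, L) estimated and ground truth LRSV
    """
    e = log_spectrum(lam_est) - log_spectrum(lam_true)
    lsd_per_frame = np.sqrt(np.mean(np.abs(e) ** 2, axis=0))
    e_bar = float(e.mean())
    sigma_l = _rms_about(e[e <= e_bar], e_bar)   # lower semi-variance
    sigma_u = _rms_about(e[e > e_bar], e_bar)    # upper semi-variance
    return float(lsd_per_frame.mean()), e_bar, sigma_l, sigma_u

# Toy check: an estimate that is uniformly 10 dB too high.
rng = np.random.default_rng(0)
lam_true = rng.uniform(1.0, 100.0, size=(8, 20))
mean_lsd, e_bar, sigma_l, sigma_u = lrsv_error_metrics(10.0 * lam_true, lam_true)
```

For this toy input, the mean LSD and the mean log error both come out at about 10 dB, and both semi-variances are essentially zero because the error is constant.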

In order to evaluate the robustness of the proposed estimator to noise, white noise was added to the synthetic reverberant signals at various values of the reverberant-signal-to-noise ratio (RSNR) [5],

$$RSNR = \frac{\sum_{k, l} \left[λ_{d}(k, l) + λ_{r}(k, l)\right]}{\sum_{k, l} λ_{v}(k, l)}, \tag{21}$$

where $λ_{v}(k, l)$ is the additive noise spectral variance.

5.2. Results and Analysis

Figure 2 depicts the mean LSD for the Habets LRSV estimator using $κ$ obtained by different methods, including the measured ground truth $κ$ (fullband and subband), the proposed $κ$ estimator, the conventional $κ$ estimator and the scanning method.

As shown in Figure 2, a scanning-optimal $κ$ can be obtained for each AIR as the value at which the corresponding $\overline{LSD}_{late}$ reaches its minimum during the scanning process. This scanning-optimal $κ$ is far from the measured ground truth $κ$, which indicates that the measured $κ$ may not be the optimal $κ$ for the Habets LRSV estimator. Although the measured fullband $κ$ performs better for $AIR_{4}$ and the measured subband $κ$ performs better for $AIR_{2}$, they perform poorly for the other AIRs. For the proposed $κ$ estimator, the $\overline{LSD}_{late}$ value reaches the minimum for three AIRs and is close to the minimum for the other AIRs. This suggests that the proposed $κ$ estimator not only performs much better than the conventional $κ$ estimator and the measured ground truth $κ$ (both fullband and subband), but also performs as well as the scanning-optimal $κ$. It is worth mentioning that the scanning-optimal $κ$ may not be the true optimal $κ$, but it can still be regarded as an appropriate $κ$ given the experimental results.

Figure 3 shows the averaged log error obtained using all AIRs for varying RSNR. As the RSNR decreases, all estimators show an increasingly positive bias, which means that the LRSV estimator performs worse in background noise and should be applied after a denoising algorithm. However, the whisker bars of the proposed $κ$ estimator are always shorter than those of the other methods. In other words, the proposed $κ$ estimator yields lower variance, which suggests that it is more robust to background noise.

Figure 4 compares the measured ground truth $κ$ with the scanning-optimal $κ$. As depicted there, the scanning-optimal $κ$ is not obviously related to the measured ground truth $κ$. This rules out simply applying a bias correction to the measured $κ$, which is sometimes done in practice.

The reason for the mismatch between the measured ground truth $κ$ and the scanning-optimal $κ$ may be that the generalized statistical model is a simplified approximation of the AIR. This causes an estimation error in Equation (6), and the error varies with the spectral variance $λ_{s}(k, l)$ of the anechoic speech signal. Hence, in order to compensate for that error, the value of $κ$ needs to be modified, which means that the measured ground truth $κ$ is not the scanning-optimal $κ$ for the Habets LRSV estimator. To support this viewpoint, 13 different anechoic speech signals of 15 s length are used to obtain the corresponding scanning-optimal $κ$ for each AIR. The results are shown in Figure 5.

It can be seen that, for different speech signals, the scanning-optimal $κ$ changes randomly and can differ by up to 0.54, which reveals that the optimal $κ$ for the Habets LRSV estimator may be related not only to the DRR but also to the speech signal. In other words, it may be less effective to obtain $κ$ indirectly via a blind DRR estimation algorithm. By contrast, estimating $κ$ directly, as the proposed method does, may achieve better performance. However, this letter only uses 13 different anechoic speech signals of 15 s length each, along with six AIRs, which is not enough to prove this hypothesis; further research is needed using more speech signals and more AIRs.

5.3. Speech Dereverberation

Furthermore, the quality of the dereverberated speech using the estimated LRSV is evaluated, with the log-spectral amplitude gain function [11] adopted to suppress the late reverberant speech component. A method using recursive MSPP [10] is also evaluated as a reference. The measures are the segmental SRR and LSD (averaged over all frames) between the estimated and true early speech component [3,4], the short-time objective intelligibility (STOI) [13] and the perceptual evaluation of speech quality (PESQ) [14]. The results are averaged over all AIRs and presented in Table 1.

It can be observed that the proposed method achieves the best performance in three measures ($Δ$ SRR, $Δ$ STOI and $Δ$ PESQ) and performs only slightly worse than MSPP in $Δ$ LSD (−3.77 versus −3.82), which validates the superiority of the proposed estimator over conventional approaches. It also indicates that the LRSV estimator using the proposed method performs even better than that using the measured ground truth $κ$ (fullband and subband). Since a single measure is not convincing on its own, this letter uses four measures to judge the performance of the proposed method jointly. Hence, although MSPP achieves a slightly larger LSD reduction than the proposed method, the proposed algorithm is considered superior when all four measures are taken into account.

6. Conclusions

This work improves the Habets LRSV estimator by proposing an adaptive $κ$ estimator. We differentiate between the direct sound presence and absence hypotheses, and derive the frame conditional direct sound presence probability $p(l)$ using Bayes’ rule. Under the direct sound absence hypothesis, the estimated $κ$ in the $l$th frame, $\hat{κ}(l)$, is obtained under the assumption $\{λ_{z}(k, l) = λ_{r}(k, l)\} | H_{0}(l)$. Finally, $κ(l)$ is recursively averaged with a time-varying smoothing parameter $\tilde{α}_{κ}(l)$, which is adjusted by the frame conditional direct sound presence probability $p(l)$.

The proposed $κ$ estimator has been evaluated and compared to the conventional $κ$ estimator and to a recently proposed recursive MSPP method. Experimental results show that the LRSV estimator using the proposed $κ$ estimator outperforms the other methods. It is also found that the ground truth $κ$ calculated using the measured DRR is not the optimal $κ$ for the LRSV estimator, since the optimal $κ$ may be affected by the speech signal. This suggests estimating $κ$ directly and adaptively rather than obtaining it from a blind DRR estimation algorithm, which may be a less effective approach. However, further research is needed to prove this hypothesis.

Author Contributions

Conceptualization, Z.Z. and Y.S.; methodology, Z.Z.; software, Z.Z.; validation, Z.Z. and X.F.; formal analysis, Z.Z.; investigation, Z.Z.; resources, Y.S.; data curation, Z.Z.; writing—original draft preparation, Z.Z.; writing—review and editing, X.F. and Y.S.; visualization, Z.Z.; supervision, Y.S.; project administration, X.F.; funding acquisition, Y.S. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Yang, W.; Huang, G.; Chen, J.; Benesty, J.; Cohen, I.; Kellermann, W. Robust Dereverberation With Kronecker Product Based Multichannel Linear Prediction. IEEE Signal Process. Lett. 2021, 28, 101–105.
2. Braun, S.; Schwartz, B.; Gannot, S.; Habets, E.A.P. Late reverberation PSD estimation for single-channel dereverberation using relative convolutive transfer functions. In Proceedings of the 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Xi’an, China, 13–16 September 2016; pp. 1–5.
3. Habets, E.A.P.; Gannot, S.; Cohen, I. Late Reverberant Spectral Variance Estimation Based on a Statistical Model. IEEE Signal Process. Lett. 2009, 16, 770–773.
4. Naylor, P.A.; Gaubitch, N.D. Speech Dereverberation, 1st ed.; Springer Publishing Company Incorporated: Manhattan, NY, USA, 2010.
5. Braun, S.; Kuklasiński, A.; Schwartz, O.; Thiergart, O.; Habets, E.A.P.; Gannot, S.; Doclo, S.; Jensen, J. Evaluation and Comparison of Late Reverberation Power Spectral Density Estimators. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1056–1071.
6. Habets, E.A.P.; Gannot, S.; Cohen, I. Speech dereverberation using backward estimation of the late reverberant spectral variance. In Proceedings of the 2008 IEEE 25th Convention of Electrical and Electronics Engineers in Israel, Eilat, Israel, 3–5 December 2008; pp. 384–388.
7. Eaton, J.; Gaubitch, N.D.; Moore, A.H.; Naylor, P.A. Estimation of Room Acoustic Parameters: The ACE Challenge. IEEE/ACM Trans. Audio Speech Lang. Process. 2016, 24, 1681–1693.
8. Cohen, I. Optimal speech enhancement under signal presence uncertainty using log-spectral amplitude estimator. IEEE Signal Process. Lett. 2002, 9, 113–116.
9. Erkelens, J.S.; Heusdens, R. Noise and late-reverberation suppression in time-varying acoustical environments. In Proceedings of the 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, Dallas, TX, USA, 14–19 March 2010; pp. 4706–4709.
10. Herzog, A.; Habets, E.A.P. Blind Single-Channel Dereverberation Using a Recursive Maximum-Sparseness-Power-Prediction-Model. In Proceedings of the 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC), Tokyo, Japan, 17–20 September 2018; pp. 356–360.
11. Wolfe, P.; Godsill, S. Efficient Alternatives to the Ephraim and Malah Suppression Rule for Audio Signal Enhancement. EURASIP J. Adv. Signal Process. 2003, 2003, 910167.
12. Merimaa, J.; Peltonen, T.; Lokki, T. Concert Hall Impulse Responses Pori, Finland. 2005. Available online: http://www.acoustics.hut.fi/projects/poririrs/ (accessed on 27 August 2021).
13. Taal, C.H.; Hendriks, R.C.; Heusdens, R.; Jensen, J. An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 2125–2136.
14. Rix, A.W.; Beerends, J.G.; Hollier, M.P.; Hekstra, A.P. Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs. In Proceedings of the 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing, Proceedings (Cat. No.01CH37221), Salt Lake City, UT, USA, 7–11 May 2001; Volume 2, pp. 749–752.

Figure 1. Plot of signal used in experiment with and without reverberation. (a) with reverberation; (b) without reverberation.

Figure 2. Plot of $\overline{LSD}$ as a function of $κ$ for (a–f): $AIR_{1} \sim AIR_{6}$. $\overline{LSD}$ for the measured ground truth $κ$ (fullband and subband), the proposed $κ$ estimator and the conventional $κ$ estimator are presented as reference lines for comparison. (a) $AIR_{1}$; (b) $AIR_{2}$; (c) $AIR_{3}$; (d) $AIR_{4}$; (e) $AIR_{5}$; (f) $AIR_{6}$.

Figure 3. Mean and standard deviation of log error $e(k, l)$ for different RSNR. The means are indicated by symbols (circle, cross, etc.), and the semi-variances are indicated by whisker bars.

Figure 4. Plots of the measured ground truth $κ$ and the scanning-optimal $κ$ for each AIR.

Figure 5. Plot of scanning-optimal $κ$ using different speech signals for (a–f): $AIR_{1} \sim AIR_{6}$. The measured $κ$ is also presented as a reference line for comparison. (a) $AIR_{1}$; (b) $AIR_{2}$; (c) $AIR_{3}$; (d) $AIR_{4}$; (e) $AIR_{5}$; (f) $AIR_{6}$.

Table 1. Improvement of objective speech quality measures.

	${Measured}_{full}$	${Measured}_{sub}$	${Optimal}_{scan}$	Proposed	Conventional	MSPP
$Δ$ SRR	7.81	7.73	7.80	8.61	6.80	7.60
$Δ$ LSD	−3.47	−3.46	−3.57	−3.77	−3.29	−3.82
$Δ$ STOI	0.0678	0.0693	0.0765	0.0785	0.0676	0.0362
$Δ$ PESQ	0.18	0.17	0.20	0.24	0.16	0.04

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
