Article

An Algorithm for Generating Virtual Sources in Dynamic Virtual Auditory Display Based on Tensor Decomposition of Head-Related Impulse Responses

Acoustic Lab, School of Physics and Optoelectronics, South China University of Technology, Guangzhou 510641, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(15), 7715; https://doi.org/10.3390/app12157715
Submission received: 16 June 2022 / Revised: 24 July 2022 / Accepted: 26 July 2022 / Published: 31 July 2022
(This article belongs to the Special Issue Techniques and Applications of Augmented Reality Audio)

Abstract
Dynamic virtual auditory displays (VADs) are increasingly used for generating various auditory objects and scenes in virtual and augmented reality. A dynamic VAD is required to generate virtual sources in various directions and at various distances through head-related transfer function (HRTF)- or head-related impulse response (HRIR)-based binaural synthesis. In the present work, an algorithm for improving the efficiency and performance of binaural synthesis in dynamic VAD is proposed. Based on tensor decomposition, a full set of near-field HRIRs is decomposed into a combination of distance-, direction-, and time-related modes. Binaural synthesis in VAD can then be implemented by a common set of time-mode-related convolvers or filters with direction- and distance-related weights. Dynamic binaural signals are created by updating the weights rather than the HRIR-based convolvers, which enables independent control of virtual source distance and direction and avoids the audible artifacts caused by updating the HRIR-based convolvers. An implementation example indicates that a set of eight common convolvers or filters for each ear is enough to synthesize the binaural signals with sufficient accuracy, and the computational efficiency of simultaneously generating multiple virtual sources is improved when the number of virtual sources is larger than eight. A virtual-source localization experiment validates the algorithm.

1. Introduction

Virtual reality (VR) aims to provide users with the experience of being present in physical (natural) environments, and augmented reality (AR) aims to enhance users’ experience in physical environments, both through computer-generated environments. Recently, VR and AR have been increasingly applied in scientific research, engineering technology, and consumer products. A basic requirement of VR and AR is to generate various visual and auditory objects/scenes that make users feel immersed in the desired environments. Although various spatial audio techniques are applicable for this purpose, dynamic virtual auditory display (VAD) is most commonly used because of its convenience and simplicity in practice [1].
A VAD generates the auditory objects/scenes by synthesizing the binaural signals (pressures) in a target auditory environment and rendering the signals to the user through headphones. A dynamic VAD utilizes a head tracker to detect the current orientation of the user’s head and then updates the signal processing to simulate the dynamic variation in the binaural signals caused by head turning. Incorporating dynamic information enhances virtual source localization and other auditory experiences in VR/AR applications.
Auditory objects/scenes are composed of auditory sources (virtual sources) and auditory environments. Even the auditory scenes of a complex reflected sound field can be regarded as scenes generated by a series of direct sound sources and reflected image sources. Therefore, it is essential for a dynamic VAD to be able to generate virtual sources at different spatial positions in terms of direction and distance. How to synthesize binaural signals with high efficiency is one of the core problems of dynamic VAD.
Various signal processing algorithms have been designed for synthesizing the binaural signals of virtual sources at different spatial positions in dynamic VAD. A conventional algorithm filters the input signal with a pair of head-related transfer functions (HRTFs) in the frequency domain or, equivalently, convolves the input signal with a pair of head-related impulse responses (HRIRs) in the time domain. To simulate the dynamic variation in the binaural signals caused by head turning (or by a moving virtual source), this algorithm requires constantly loading the HRTFs from a database and constantly updating the HRTF-based filters. Constantly loading HRTFs from a large database reduces the computational efficiency, especially when both direction- and distance-dependent HRTFs are loaded from a large near-field HRTF database. Constantly updating the HRTF-based filters (or HRIR-based convolvers) easily causes audible artifacts in the resultant binaural signals. Moreover, the conventional algorithm requires a pair of HRTF filters per virtual source; thus, the required number of filters increases linearly with the number of virtual sources. Therefore, the efficiency of the conventional algorithm is low when generating multiple virtual sources simultaneously (for example, multiple direct sources and image sources for environmental reflections).
Alternatively, basis function decomposition-based algorithms have been suggested for synthesizing the binaural signals of virtual sources [2]. HRTFs vary as functions of frequency and source position (in terms of direction and distance). They can be decomposed into a linear combination of spatial (position) basis functions with frequency-dependent weights (spatial basis function decomposition), or into a linear combination of spectral shape basis functions with source position-dependent weights (spectral shape basis function decomposition). Therefore, HRTF-based filtering can be implemented by a set of fixed filters (or equivalent convolvers) with position-dependent weights or gains. One advantage of basis function decomposition-based algorithms is that the filter set is fixed: only the weights of the filters need to be changed to account for variations in head orientation or source position, which avoids the problems caused by constantly loading HRTFs and updating the filters. Another advantage is that multiple virtual sources can share a common set of filters, so the required number of filters is fixed and independent of the number of virtual sources. This feature improves the efficiency of generating multiple virtual sources simultaneously.
Spherical harmonic function (SHF) decomposition is a well-known example of spatial basis function decomposition, by which HRTFs are decomposed into a set of (source-direction-dependent) spherical harmonic functions with (frequency- and distance-dependent) weights [3,4,5]. The corresponding basis function decomposition-based algorithm is known as binaural Ambisonics [2,6,7,8]. However, binaural Ambisonics requires a large number of fixed filters to synthesize the binaural signals accurately up to 20 kHz, so its computational efficiency is not high. Binaural Ambisonics of limited order is usually used in practice, which causes errors in the resultant binaural signals; accordingly, appropriate psychoacoustic principles should be exploited to generate virtual sources in various directions and at various distances.
Principal component analysis (PCA), which is basically equivalent to singular value decomposition (SVD), is often used to derive the spectral shape basis function decomposition of HRTFs. By using PCA, the correlation among the HRTFs at different positions is removed so that the HRTFs can be decomposed into a combination of spectral shape basis functions with source position-dependent weights. Previous studies have indicated that a small set (usually 5 to 20) of spectral shape basis functions is enough to represent HRTFs at different positions with appropriate accuracy [9,10,11]. Therefore, PCA yields a highly efficient spectral shape basis function decomposition of HRTFs. PCA is also applicable for deriving the spatial basis function decomposition of HRTFs, and PCA-based algorithms for synthesizing binaural signals in dynamic VAD have accordingly been suggested [6,12]. However, HRTFs/HRIRs are multivariable functions of frequency (or time), source direction, and distance. PCA cannot separate the variations in HRTFs/HRIRs caused by each variable, and the corresponding algorithm cannot manipulate the direction and distance of virtual sources independently. Therefore, the efficiency of the PCA-based algorithm in synthesizing multiple virtual sources in different directions and at different distances should be improved.
Tensor decomposition is a multi-linear modelling technique and can be regarded as an extension of PCA and SVD [13]. Tensor decomposition removes the correlation among the HRTFs/HRIRs at different frequencies (or times), directions, and distances, and even among different individuals, and represents the HRTFs by variations in multiple independent modes. Tensor methods have been used to decompose HRTFs at a fixed (far-field, not less than 1.0 m) source distance and even HRTFs of different individuals [14,15,16].
In the present work, we further apply tensor decomposition to near-field HRTFs in various source directions and at different distances, and then propose an algorithm for generating multiple virtual sources in various directions and at various distances. The proposed algorithm enables independent manipulation of the direction and distance of virtual sources with high efficiency. The paper is organized as follows. Section 2 outlines the auditory localization cues and the conventional algorithm for synthesizing the binaural signals of virtual sources. Section 3 discusses the basic principle of tensor decomposition of near-field HRIRs. Section 4 presents the algorithm for generating multiple virtual sources in various directions and at various distances. Section 5 presents an example of analysis. Section 6 presents the psychoacoustic experiment and its results. Section 7 presents the discussion and conclusions.

2. Auditory Localization Cues and Conventional Algorithm in Dynamic VAD

In the present work, we use a spherical coordinate system with respect to the head center. The virtual source position is specified by distance $0 \le r < \infty$, elevation $-90^\circ \le \phi \le 90^\circ$, and azimuth $0^\circ \le \theta < 360^\circ$, where $\phi = -90^\circ, 0^\circ$, and $90^\circ$ represent the bottom, horizontal, and top directions, respectively; on the horizontal plane, $\theta = 0^\circ, 90^\circ$, and $180^\circ$ correspond to the front, right, and back directions, respectively. Auditory localization includes directional localization and distance perception. Multiple cues contribute to the directional localization of a real sound source in the free field [17]. The inter-aural time difference (ITD) and inter-aural level difference (ILD) are two cues for lateral localization; spectral cues caused by head and pinna scattering and dynamic cues caused by head turning contribute to vertical and front–back localization. Multiple cues also contribute to distance perception [18,19,20,21]. For a real source in the free field, distance perception cues include the distance-dependent pressure ($1/r$ law), the distance-dependent ILD, and spectral cues at proximal source distances ($r \le 1.0$ m). High-frequency attenuation caused by air absorption is also a weak distance perception cue and is ignored in the present work. Moreover, in a reflective environment, the direct-to-reverberant energy ratio is an effective distance perception cue [22], but it is also ignored here because the present work focuses on generating free-field virtual sources. The various auditory localization cues for a free-field source are encoded in the HRTFs, which are defined as the normalized acoustic transfer functions from a point source to the two ears [1]:

$$H_\alpha(r, \theta, \phi, f) = \frac{P_\alpha(r, \theta, \phi, f)}{P_0(f)}, \tag{1}$$

where $\alpha = L$ or $R$ denotes the left or right ear, respectively, and $f$ is the frequency. $P_\alpha$ is the pressure at the ear, and $P_0$ is the pressure at the position of the head center with the head absent. For simplicity, we omit the subscript $\alpha$ in the following discussion. HRTFs generally vary with frequency, source distance, and direction. At far-field distances ($r > 1.0$ m), HRTFs are asymptotically distance independent; at near-field distances ($r < 1.0$ m), HRTFs depend on the source distance.
In the conventional algorithm in VAD, the binaural signals of a virtual source are synthesized by filtering the input stimulus $E_0$ with a pair of HRTFs. If the distance-dependent pressure and delay are taken into account, the binaural signals are given by [1]:

$$E(f) = \frac{1}{r}\, H(r, \theta, \phi, f)\, \exp[-j 2\pi f \tau_1(r)]\, E_0(f), \tag{2}$$

where $\tau_1$ is the distance-dependent pure delay. In the time domain, Equation (2) can be written as the convolution of the input signal with a pair of HRIRs:

$$e(t) = \frac{1}{r}\, h(r, \theta, \phi, t) \otimes e_0[t - \tau_1(r)], \tag{3}$$

where the notation $\otimes$ denotes convolution over time. The time-domain functions in Equation (3) are related to the corresponding frequency-domain functions in Equation (2) by the inverse Fourier transform. In dynamic VAD, the HRTFs in Equation (2) or the HRIRs in Equation (3) must be updated constantly according to the current position of the virtual source with respect to the head.
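As a concrete illustration of Equation (3), the following minimal Python sketch implements the conventional per-source processing chain (pure delay, $1/r$ gain, and one HRIR convolution per ear). The function and variable names, and the integer-sample delay, are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def synthesize_conventional(e0, hrir_pair, r, tau1):
    """Conventional binaural synthesis of Equation (3) for one virtual source.

    e0        : 1-D input signal
    hrir_pair : (h_left, h_right), HRIRs for the current source position
    r         : source distance in metres (1/r pressure law)
    tau1      : distance-dependent pure delay in samples (assumed integer)
    """
    delayed = np.concatenate([np.zeros(tau1), e0]) / r    # delay, then 1/r gain
    return [np.convolve(delayed, h) for h in hrir_pair]   # one convolution per ear
```

In a dynamic VAD, hrir_pair itself must be reloaded and cross-faded whenever the head turns, which is exactly the cost and artifact source that the proposed algorithm avoids.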

3. Tensor Decomposition of Near-Field HRIRs

Tensor decomposition can be applied to HRTF magnitudes, complex-valued HRTFs, minimum-phase HRTFs, HRIRs, and minimum-phase HRIRs. Here, we discuss the tensor decomposition of HRIRs or minimum-phase HRIRs, which is equivalent to the tensor decomposition of complex-valued HRTFs or minimum-phase HRTFs, respectively, because HRIRs and HRTFs are related by the Fourier transform.
Suppose that there is a full set of near-field HRIRs for a given set of ears, comprising the data of D distances, M directions at each distance, and N discrete time samples for each distance and direction. The datum at discrete distance index d, direction m, and time n is denoted by h(d, m, n), with d = 1, 2, …, D; m = 1, 2, …, M; and n = 1, 2, …, N. For the tensor to decompose the HRIRs effectively, the mean of the HRIRs across distances, directions, and times should theoretically be subtracted from each datum, i.e., the following mean-subtracted data should substitute the original HRIR data in the analysis:

$$h'(d, m, n) = h(d, m, n) - \frac{1}{DMN} \sum_{d=1}^{D} \sum_{m=1}^{M} \sum_{n=1}^{N} h(d, m, n). \tag{4}$$
The dimensionality of the mean-subtracted dataset is $D \times M \times N$, and it constitutes a three-order tensor $\mathbf{h}_{D \times M \times N}$ with entries $h_{dmn} = h(d, m, n)$. Practical calculation in the example of the present work indicates that the mean of the HRIRs is negligible, so there is little difference between the original and mean-subtracted HRIRs. Therefore, we omit the subtraction of the mean in the following discussion.
According to the Tucker decomposition of tensors [13], the three-order tensor $\mathbf{h}_{D \times M \times N}$ can be decomposed as the product of a core tensor $\mathbf{w}_{D' \times M' \times N'}$ and three matrices $\mathbf{u}^d_{D \times D'}$, $\mathbf{u}^m_{M \times M'}$, and $\mathbf{u}^n_{N \times N'}$:

$$\mathbf{h}_{D \times M \times N} = \mathbf{w}_{D' \times M' \times N'} \times_D \mathbf{u}^d_{D \times D'} \times_M \mathbf{u}^m_{M \times M'} \times_N \mathbf{u}^n_{N \times N'}, \tag{5}$$

where the subscripts represent the dimensionality of the tensor or matrix, and the notations $\times_D$, $\times_M$, and $\times_N$ denote the tensor–matrix product with respect to the distance-related, direction-related, and time-related mode variables, respectively.

In Equation (5), the full set of HRIR data is represented by three sets of independent mode variations, i.e., distance-related, direction-related, and time-related modes. $\mathbf{u}^d_{D \times D'}$ is a $D \times D'$ matrix of distance-related modes, with each row corresponding to a distance and each column corresponding to an eigen distance-related mode. $\mathbf{u}^m_{M \times M'}$ is an $M \times M'$ matrix of direction-related modes, with each row corresponding to a direction and each column corresponding to an eigen direction-related mode. $\mathbf{u}^n_{N \times N'}$ is an $N \times N'$ matrix of time-related modes, with each row corresponding to a time and each column corresponding to an eigen time-related mode. Because the number of eigen modes associated with an independent variable cannot exceed the dimensionality of that discrete variable, in general $D' \le D$, $M' \le M$, and $N' \le N$.
The aforementioned three matrices satisfy the following orthogonality:

$$(\mathbf{u}^d)^T \mathbf{u}^d = \mathbf{I}_{D' \times D'}, \qquad (\mathbf{u}^m)^T \mathbf{u}^m = \mathbf{I}_{M' \times M'}, \qquad (\mathbf{u}^n)^T \mathbf{u}^n = \mathbf{I}_{N' \times N'}, \tag{6}$$

where the superscript “T” denotes matrix transposition and $\mathbf{I}$ is an identity matrix. The orthogonality in Equation (6) indicates that the eigen modes within each set of variable-related modes are mutually independent; for example, the orthogonality of $\mathbf{u}^n_{N \times N'}$ indicates that each eigen time-related mode is independent. The $D' \times M' \times N'$ core tensor $\mathbf{w}_{D' \times M' \times N'}$ represents the possibly complex interaction among the distance-, direction-, and time-related modes. Its entries depend on the modes but are independent of distance, direction, and time.

Equation (5) is an exact representation of the full HRIR dataset. Given the tensor $\mathbf{h}_{D \times M \times N}$ of HRIR data, the core tensor $\mathbf{w}_{D' \times M' \times N'}$ and the three matrices $\mathbf{u}^d$, $\mathbf{u}^m$, and $\mathbf{u}^n$ can be derived by the method in Appendix A, yielding the exact tensor decomposition of the near-field HRIRs in Equation (5).
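For readers who wish to reproduce the decomposition, the following NumPy sketch computes the exact Tucker decomposition of Equation (5). It obtains each mode matrix from the SVD of the corresponding mode unfolding, which yields the same orthonormal eigenvectors as the eigen-decomposition procedure of Appendix A; all helper names are our own.

```python
import numpy as np

def mode_unfold(x, mode):
    """Unfold a three-order tensor along one mode into a matrix."""
    return np.moveaxis(x, mode, 0).reshape(x.shape[mode], -1)

def mode_multiply(x, u, mode):
    """Tensor-matrix product along one mode (the x_D, x_M, x_N operator)."""
    return np.moveaxis(np.tensordot(u, x, axes=(1, mode)), 0, mode)

def tucker_decompose(h):
    """Exact Tucker decomposition of an HRIR tensor h of shape (D, M, N).

    Each mode matrix holds the left singular vectors of the corresponding
    mode unfolding; these equal the orthonormal eigenvectors of the
    correlation matrices in Appendix A, sorted by descending eigenvalue.
    """
    factors = [np.linalg.svd(mode_unfold(h, k), full_matrices=False)[0]
               for k in range(3)]
    w = h.copy()
    for k, u in enumerate(factors):
        w = mode_multiply(w, u.T, k)   # project h onto each mode basis
    return w, factors                  # core tensor and (u_d, u_m, u_n)
```

Reconstruction reverses the projection: applying mode_multiply(w, u, k) for each factor reproduces Equation (5), and truncating each factor to its leading columns before reconstruction gives the approximation of Equation (7) below.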
For each variable-related mode, different eigen modes contribute differently to the related variation in the HRIRs. A highly efficient representation of the HRIRs can be obtained by retaining the eigen modes that contribute more to the variation and omitting those that contribute less. To improve the efficiency of the virtual source generation algorithm, we are especially interested in simplifying the time- and direction-related eigen modes in the tensor decomposition of the HRIRs. Suppose that the columns of $\mathbf{u}^n$ are arranged in descending order of the contribution of the time-related eigen modes to the variation in the HRIRs. The matrix can then be truncated to its first $N' < N$ columns, resulting in an $N \times N'$ matrix of time-related modes. Similarly, $\mathbf{u}^m$ can be truncated to its first $M' < M$ columns, resulting in an $M \times M'$ matrix of direction-related modes. At the same time, the direction- and time-related dimensions of the core tensor can be truncated to $M'$ and $N'$, respectively. The distance-related eigen modes could be truncated in a similar way; however, the original (measured or calculated) HRIRs usually include data at only a few distances, which can already be represented by a few distance-related eigen modes, so further truncation of the distance-related modes is unnecessary. Equation (5) can then be approximated as:

$$\mathbf{h}_{D \times M \times N} \approx \hat{\mathbf{h}}_{D \times M \times N} = \mathbf{w}_{D' \times M' \times N'} \times_D \mathbf{u}^d_{D \times D'} \times_M \mathbf{u}^m_{M \times M'} \times_N \mathbf{u}^n_{N \times N'}, \tag{7}$$

where $\hat{\mathbf{h}}$ denotes the approximate tensor representation of the HRIRs after truncation, to distinguish it from the exact representation in Equation (5). When the order of the time-related modes is truncated to $N'$, the cumulative percentage variation in the energy of the HRIRs represented by Equation (7) is evaluated by:

$$\eta_{n'} = \frac{\sum_{n'=1}^{N'} \lambda_{n'}}{\sum_{n'=1}^{N} \lambda_{n'}} \times 100\%, \tag{8}$$

where the $\lambda_{n'}$ are the eigenvalues corresponding to the different time-related eigen modes. For the truncation of the direction-related modes, the cumulative percentage variation in the energy of the HRIRs can be evaluated by an equation similar to Equation (8).
To evaluate the accuracy of the tensor representation of the HRTFs at various frequencies, the original HRIRs and the reconstructed HRIRs in Equation (7) can be converted to frequency-domain HRTFs by the Fourier transform, and the relative reconstruction error $\mathrm{er}(d, m, k)$ is evaluated by:

$$\mathrm{er}(d, m, k) = 10 \log_{10} \frac{|H(d, m, k) - \hat{H}(d, m, k)|^2}{|H(d, m, k)|^2}, \tag{9}$$

where $H(d, m, k)$ and $\hat{H}(d, m, k)$ are the exact HRTFs and the truncated tensor representation of the HRTFs, respectively, and $k = 1, 2, \dots, N$ denotes the discrete frequency.

The cumulative percentage variation in Equation (8) and the relative error in Equation (9) are used to evaluate the performance of the truncated tensor decomposition of the HRIRs and of the virtual source generation algorithm.
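A minimal sketch of the two evaluation quantities, reusing the mode_unfold helper above. The eigenvalues $\lambda$ of Appendix A are proportional to the squared singular values of the mode unfolding, and the constant factor cancels in the ratio of Equation (8).

```python
def cumulative_variance_percent(h, mode, n_keep):
    """Cumulative percentage variation of Equation (8) for one mode of h."""
    s = np.linalg.svd(mode_unfold(h, mode), compute_uv=False)
    lam = s ** 2                       # proportional to the eigenvalues lambda
    return 100.0 * lam[:n_keep].sum() / lam.sum()

def relative_error_db(H, H_hat):
    """Relative reconstruction error of Equation (9) in dB, per (d, m, k).

    H and H_hat are frequency-domain HRTFs, e.g., obtained by applying
    np.fft.fft along the time axis of the original and truncated HRIRs.
    """
    return 10 * np.log10(np.abs(H - H_hat) ** 2 / (np.abs(H) ** 2 + 1e-20))
```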

4. Tensor Decomposition-Based Algorithm of Generating Multiple Virtual Sources

The proposed algorithm is derived from Equation (7). An HRTF can be approximated as its minimum-phase function $H_{\min}$ cascaded with a linear phase, i.e., a pure delay $\tau_2$ that depends on the source position but is independent of frequency [23]. For discrete source distance $d$, direction $m$, and frequency $k$, we have:

$$H(d, m, k) = H_{\min}(d, m, k)\, \exp[-j 2\pi f_k\, \tau_2(d, m)], \tag{10}$$

where $f_k$ is the frequency of the $k$-th discrete bin. Accordingly, the binaural signals in Equation (2) can be generated by filtering the input signal with minimum-phase HRTFs, scaling by the inverse-distance gain $1/r_d$, and supplementing a linear delay:

$$E(d, m, k) = \frac{1}{r_d} H_{\min}(d, m, k)\, \exp\{-j 2\pi f_k [\tau_1(d) + \tau_2(d, m)]\}\, E_0(k) = \frac{1}{r_d} H_{\min}(d, m, k)\, e^{-j 2\pi f_k \tau}\, E_0(k), \tag{11}$$

where $\tau = \tau_1(d) + \tau_2(d, m)$. In the time domain, Equation (11) becomes the following convolution:

$$e(d, m, n) = \frac{1}{r_d}\, h_{\min}(d, m, n) \otimes e_0(n - \tau), \tag{12}$$

where $e(d, m, n)$, $h_{\min}(d, m, n)$, and $e_0(n - \tau)$ denote the binaural signal, the minimum-phase HRIR, and the input signal in the time domain, respectively. The minimum-phase approximation of the HRTFs/HRIRs simplifies the filters for generating the binaural signals.
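The paper does not state how the minimum-phase HRIRs were computed; a standard construction is the real-cepstrum (homomorphic) method, sketched below under that assumption.

```python
import numpy as np

def minimum_phase_hrir(h):
    """Minimum-phase counterpart of an HRIR via the real-cepstrum method.

    Folding the real cepstrum of log|H| onto non-negative quefrencies keeps
    the magnitude spectrum and discards the excess (non-minimum) phase.
    """
    n = len(h)
    H = np.fft.fft(h)
    cep = np.fft.ifft(np.log(np.abs(H) + 1e-12)).real   # real cepstrum
    w = np.zeros(n)
    w[0] = 1.0
    w[1:(n + 1) // 2] = 2.0          # double the causal part of the cepstrum
    if n % 2 == 0:
        w[n // 2] = 1.0
    return np.fft.ifft(np.exp(np.fft.fft(w * cep))).real
```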
The minimum-phase HRIRs can be represented in the tensor decomposition form of Equation (7). To derive the algorithm, we express Equation (7) in a different form. Let $\mathbf{h}_{\min,d,m} = [h_{\min}(d, m, 1), h_{\min}(d, m, 2), \dots, h_{\min}(d, m, N)]^T$ be the $N \times 1$ column vector representing the $N$-point minimum-phase HRIR at a given distance $d$ and direction $m$, where the superscript “T” denotes transposition. Additionally, let $\mathbf{u}_{n'} = [u_{1n'}, u_{2n'}, \dots, u_{Nn'}]^T$ with $n' = 1, 2, \dots, N'$ be the set of $N \times 1$ column vectors formed by the columns of the matrix $\mathbf{u}^n_{N \times N'}$ in Equation (7); they represent $N'$ impulse responses (coefficients of finite impulse response filters) corresponding to the $N'$ time-related eigen modes of the minimum-phase HRIRs. Then, Equation (7) yields:

$$\mathbf{h}_{\min,d,m} = \sum_{n'=1}^{N'} \sum_{d'=1}^{D'} \sum_{m'=1}^{M'} w_{d'm'n'}\, u^d_{dd'}\, u^m_{mm'}\, \mathbf{u}_{n'} = \sum_{n'=1}^{N'} c_{n'}(d, m)\, \mathbf{u}_{n'}, \tag{13}$$

where

$$c_{n'}(d, m) = \sum_{d'=1}^{D'} \sum_{m'=1}^{M'} w_{d'm'n'}\, u^d_{dd'}\, u^m_{mm'}. \tag{14}$$

Equation (13) indicates that the HRIR at an arbitrary distance and direction can be decomposed into a weighted combination of $N'$ time eigen-mode-related impulse responses, which are independent of distance and direction, while the weights depend on the distance and direction.
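Equation (14) is a small tensor contraction; a NumPy sketch (the array names and the (D′, M′, N′) axis order are assumptions) is:

```python
def weights(w_core, u_d, u_m, d, m):
    """Weights c_n' of Equation (14) for distance index d and direction index m.

    w_core : truncated core tensor, shape (D', M', N')
    u_d    : distance-mode matrix, shape (D, D')
    u_m    : direction-mode matrix, shape (M, M')
    Returns an (N',) vector; moving the source only re-runs this contraction.
    """
    return np.einsum('abn,a,b->n', w_core, u_d[d], u_m[m])
```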
Let $e_{d,m}(n)$ be the binaural signal at distance $d$ and direction $m$, and $e_0(n)$ be the input signal. Then, Equations (12) and (13) yield:

$$e_{d,m}(n) = \sum_{n'=1}^{N'} c_{n'}(d, m)\, \mathbf{u}_{n'} \otimes \frac{1}{r_d}\, e_0(n - \tau). \tag{15}$$
Equation (15) indicates that the algorithm for generating the binaural signals can be implemented by the following steps (a code sketch follows the list):
(1)
Delay the input signal by the source distance- and direction-related delay $\tau(d, m)$, and then scale it by the distance-related gain $1/r_d$;
(2)
Multiply the signal by the $N'$ weights $c_{n'}$ and then convolve (filter) each weighted copy with $\mathbf{u}_{n'}$, respectively;
(3)
Sum the outputs of the $N'$ convolvers to form the signal of the given ear;
(4)
Generate the signal of the other ear in the same way.
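The following sketch combines steps (1)–(4) for several simultaneous sources that share the $N'$ convolvers, reusing the weights helper sketched after Equation (14). All names (including r_of, which maps a distance index to metres) are illustrative assumptions.

```python
def synthesize_ear(sources, u_n, w_core, u_d, u_m, r_of, out_len):
    """One ear's output per Equation (15) for multiple simultaneous sources.

    sources : list of dicts with keys 'e0' (signal), 'd', 'm' (position
              indices), and 'tau' (total delay in samples)
    u_n     : (N, N') matrix whose columns are the shared FIR filters
    r_of    : function mapping a distance index to metres
    """
    n_modes = u_n.shape[1]
    mixed = np.zeros((n_modes, out_len))      # one input bus per shared filter
    for s in sources:
        x = np.zeros(out_len)
        sig = s['e0'][:out_len - s['tau']]
        x[s['tau']:s['tau'] + len(sig)] = sig / r_of(s['d'])   # step (1)
        c = weights(w_core, u_d, u_m, s['d'], s['m'])          # step (2): weights
        mixed += np.outer(c, x)               # weighted copies, mixed per mode
    # steps (2)-(3): the N' shared convolutions, then the summation
    return sum(np.convolve(mixed[k], u_n[:, k])[:out_len] for k in range(n_modes))
```

Note that the per-source work is only a delay, a gain, and $N'$ multiplications; the $N'$ convolutions are performed once regardless of the number of sources.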
Figure 1 is the block diagram of the algorithm given by Equation (15). To generate multiple virtual sources simultaneously, the input signal of each virtual source is processed according to the aforementioned steps (1) to (4), and the signals of all virtual sources are automatically mixed at the outputs of the $N'$ common filters. The algorithm has the following features:
(1)
The algorithm requires $N'$ convolvers, which are equivalent to $N'$ finite impulse response (FIR) filters; therefore, its efficiency in generating a single virtual source is low. However, the filters are independent of source distance and direction, so virtual sources at different distances and directions can share a common set of filters. In other words, the number of filters is fixed and independent of the number of virtual sources. When generating multiple virtual sources simultaneously, the efficiency of the proposed algorithm exceeds that of the conventional algorithm once the number of virtual sources is larger than $N'$.
(2)
For application to dynamic VAD, when the user turns their head or the position of the virtual source changes (as for a moving virtual source), only the weights $c_{n'}$ need to be updated rather than the filters or convolvers, avoiding the possibly audible artifacts caused by updating the filters in the conventional algorithm.
(3)
The weight $c_{n'}$ is related to the source distance and direction through two sets of coefficients, $u^d_{dd'}$ and $u^m_{mm'}$, respectively, as shown in Equation (14). The source distance and direction can therefore be controlled independently by changing these two sets of coefficients.
(4)
HRIRs/HRTFs are continuous functions of the source distance and direction, whereas practical measurement or calculation usually yields data at discrete distances and directions with certain resolutions. Dynamic VAD requires that the data match the auditory resolution, which can be obtained by spatially interpolating the measured/calculated data. Spatial interpolation can be implemented online in a dynamic VAD, which increases the computational cost, or offline, which requires more data storage. Tensor decomposition results in a compact representation of the HRIRs/HRTFs, which saves the storage required for the offline implementation of spatial interpolation.

5. An Example of Analysis

5.1. HRIR Dataset and Pre-Processing

The binaural HRIRs of a KEMAR artificial head with DB-060/061 pinnae, calculated using the boundary element method, were used in the analysis and algorithm [24]. Specifically, a laser 3D scanner (UNIscan) with its companion software (VXelements) was used to acquire the geometrical surfaces of the KEMAR artificial head with DB-060/061 pinnae, and the boundary element method was then used to calculate the near-field HRTFs. The theory of the boundary element method and the details of its implementation can be found in [24]. The calculated dataset comprises HRIRs at seven source distances of r = 0.2, 0.25, 0.3, 0.4, 0.5, 0.75, and 1.0 m, with 2520 directions at each source distance. The directions range from elevation $\phi = -85^\circ$ to $85^\circ$ and azimuth $\theta = 0^\circ$ to $355^\circ$, with elevation and azimuth intervals of $5^\circ$. The length of each HRIR is 882 points at a sampling frequency of 44.1 kHz.
A bilinear directional interpolation was applied to the calculated HRIRs [25], resulting in binaural HRIRs at 7 source distances and 64,440 directions (with both elevation and azimuth resolutions of $1^\circ$). Minimum-phase approximation and truncation by a time window yield a set of 128-point minimum-phase (binaural) HRIRs at 7 source distances and 64,440 directions (with tensor representation $\mathbf{h}_{128 \times 64440 \times 7}$), which are used for the analysis and algorithm.
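The exact interpolation code of [25] is not given in the paper; a generic bilinear interpolation over the 5° grid, under an assumed array layout, would look like:

```python
def bilinear_hrir(hrirs, az, el, az_step=5.0, el_step=5.0, el_min=-85.0):
    """Bilinear interpolation of HRIRs on a regular (azimuth, elevation) grid.

    hrirs : array of shape (n_az, n_el, n_time) on the 5-degree grid;
            azimuth wraps around, elevation is clamped at the grid edge.
    """
    n_az, n_el = hrirs.shape[:2]
    a = (az % 360.0) / az_step
    e = (el - el_min) / el_step
    ia, ie = int(a), min(int(e), n_el - 2)
    fa, fe = a - ia, e - ie
    h00 = hrirs[ia % n_az, ie]
    h10 = hrirs[(ia + 1) % n_az, ie]
    h01 = hrirs[ia % n_az, ie + 1]
    h11 = hrirs[(ia + 1) % n_az, ie + 1]
    return ((1 - fa) * (1 - fe) * h00 + fa * (1 - fe) * h10
            + (1 - fa) * fe * h01 + fa * fe * h11)
```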

5.2. Results of Analysis

Tensor decomposition, as discussed in Section 3, is applied to the aforementioned HRIRs, yielding a core tensor $\mathbf{w}_{128 \times 64440 \times 7}$. Table 1 and Table 2 list the relationship between the cumulative percentage variance of the energy and the number of direction- and time-related modes for the left ear, respectively. The results indicate that the cumulative percentage variation in the energy of the HRIRs in Equation (8) increases with the number of eigen modes in the decomposition. Moreover, the first modes account for nearly 69.9% and 70.7% of the direction- and time-related variance, respectively, and the contributions of the other modes descend with order. This is a general characteristic of the tensor decomposition of HRIRs/HRTFs and is consistent with previous results [26]. For both the left- and right-ear minimum-phase HRIRs, $M' = 13$ direction-related eigen modes and $N' = 8$ time-related eigen modes account for over 99.0% of the energy variance. Thus, the orders of the direction- and time-related modes were truncated to $M' = 13$ and $N' = 8$, yielding a truncated core tensor $\mathbf{w}_{8 \times 13 \times 7}$ and two truncated orthonormal matrices $\mathbf{u}^m_{64440 \times 13}$ and $\mathbf{u}^n_{128 \times 8}$, respectively. The minimum-phase HRIRs were approximately reconstructed with the $M' = 13$ direction-related and $N' = 8$ time-related eigen modes according to Equation (7); the mean relative error across all frequencies and distances calculated from Equation (9) is −21.9 dB for the left ear and −22.0 dB for the right ear.
Figure 2 illustrates the mean relative error calculated from Equation (9) across all distances on the horizontal plane for the left ear. The error varies with direction (azimuth) and frequency; in fact, the reconstruction errors on the horizontal plane are larger than those at other elevations. As seen in Figure 2b, the mean relative error across distances is less than −10 dB in most cases, which is basically sufficient for HRTF reconstruction. Calculation from the results indicates that the mean relative error across distances is −27.7 dB in the ipsilateral direction ($\theta = 265^\circ$, $\phi = 0^\circ$) and −17.5 dB in the contralateral direction ($\theta = 85^\circ$, $\phi = 0^\circ$). Generally, the errors are relatively large at high frequencies above 10 kHz and for source azimuths contralateral to the left ear. These errors are due to the complex structure of the contralateral HRTFs at high frequencies caused by head scattering and shadowing. However, the contralateral HRTFs at high frequencies contribute less to auditory perception because their magnitudes are attenuated by the head shadow. Overall, the results of the analysis indicate that $M' = 13$ direction-related and $N' = 8$ time-related eigen modes are enough to reconstruct the minimum-phase HRIRs with high accuracy.
The reconstructed results for the right-ear HRTF are quite similar to those for the left ear; thus, these results are omitted here.

6. Experimental Validation

6.1. Implementation of the Algorithm

The algorithm was implemented on a PC-based dynamic VAD platform with software written in C++. An electromagnetic tracker (Polhemus FASTRAK) detects the orientation of the subject's (user's) head in three degrees of freedom (rotations about the left–right, front–back, and up–down axes) in real time. According to the current orientation of the subject's head, the target source position with respect to the head is calculated and the synthesized binaural signals are updated. The update rate and system latency of the dynamic VAD are 60 Hz and 25.4 ms, respectively. The details of the dynamic VAD are described in [12].
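The platform code is not published; the sketch below only illustrates the kind of computation a dynamic VAD performs per tracker update, i.e., rotating the world-frame source vector into head coordinates to obtain the direction and distance used to look up the weights of Equation (14). The yaw–pitch–roll convention and axis assignment (x forward, y right, z up) are assumptions.

```python
import numpy as np

def relative_direction(src_xyz, yaw, pitch, roll):
    """Source direction in head coordinates from the tracked head orientation.

    Rotating the world-frame source vector by the inverse (transpose) of the
    head rotation gives the source position relative to the head. Angles are
    in radians and applied in yaw-pitch-roll order.
    """
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    Rz = np.array([[cy, -sy, 0.0], [sy, cy, 0.0], [0.0, 0.0, 1.0]])
    Ry = np.array([[cp, 0.0, sp], [0.0, 1.0, 0.0], [-sp, 0.0, cp]])
    Rx = np.array([[1.0, 0.0, 0.0], [0.0, cr, -sr], [0.0, sr, cr]])
    v = (Rz @ Ry @ Rx).T @ np.asarray(src_xyz, dtype=float)
    r = np.linalg.norm(v)
    az = np.degrees(np.arctan2(v[1], v[0])) % 360.0   # 0 = front, 90 = right
    el = np.degrees(np.arcsin(v[2] / r))
    return az, el, r
```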
The binaural signals were synthesized according to the algorithm in Section 4 and the analysis in Section 5, using the HRIRs described in Section 5.1. Following the analysis in Section 5.2, the tensor decomposition of the minimum-phase HRIRs was truncated to $M' = 13$ direction-related and $N' = 8$ time-related eigen modes. Therefore, a common set of $N' = 8$ convolvers or filters was used for synthesizing the signal of each ear.
The resultant binaural signals were reproduced by an in-ear headphone (Etymotic Research ER-2). Because the ER-2 headphone exhibits a flat magnitude response (measured at the end of an occluded-ear simulator) up to 16 kHz, and the signal components above 16 kHz contribute little to localization, equalization of the headphone-to-eardrum transmission was omitted.
Non-individualized HRIRs were used in the binaural synthesis because it is difficult to use individualized HRIRs in most practical applications. Previous studies have indicated that individualized HRTFs have a significant influence on directional (front–back and vertical) localization and a slight influence on distance perception [27]. However, when dynamic localization cues are included in dynamic VAD, the influence of individualized HRIRs (spectral cues) on directional localization decreases due to the redundancy among the different directional localization cues.

6.2. Experimental Procedure

Based on the dynamic VAD platform, a virtual source localization experiment was conducted to validate the algorithm. The experiment was designed to compare the directional localization and distance perception performance of the proposed and conventional algorithms. Accordingly, two conditions were evaluated:
(1)
Binaural signals synthesized by the proposed method;
(2)
Binaural signals synthesized by the conventional method.
For each algorithm, there were five target virtual source distances of r = 0.2, 0.3, 0.5, 0.75, and 1.0 m; at each distance, there were 11 target source directions distributed in the right hemisphere, including $\theta = 0^\circ, 90^\circ, 180^\circ$ on the $\phi = \pm 45^\circ$ elevation planes and $\theta = 0^\circ, 45^\circ, 90^\circ, 135^\circ, 180^\circ$ on the $\phi = 0^\circ$ (horizontal) plane. The input signal was pink noise with a length of 15 s.
The experiment was conducted in a listening room with a background noise level of less than 30 dB. The subjects sat in the center of the room. Eight subjects (aged 23–27; six male and two female) with normal hearing participated in the experiment. Subjects judged the perceived virtual source direction and distance and reported the results by pointing an electromagnetic receiver, attached to one end of a wooden rod, at the perceived position. If inside-the-head localization occurred, the subject reported it orally. During the experiment, the subjects were asked to close their eyes and were encouraged to turn their heads during the perception process.
If no inside-the-head localization occurs, the perceived virtual source position is analyzed further (inside-the-head localization would indicate that the algorithm or experiment is invalid). Directional localization and distance perception are analyzed separately. In the analysis of directional localization, the reversal errors (front–back or up–down confusions) are first corrected in the raw localization results by reflecting the results across the corresponding symmetry plane, and the percentage of confusions is calculated. The accuracy of directional localization is evaluated by the following overall mean angular error between the perceived and target directions for each combination of algorithm, direction, and distance [28]:
$$\Delta\Omega = \frac{1}{L} \sum_{l=1}^{L} \arccos \frac{\mathbf{r}_I(l) \cdot \mathbf{r}}{|\mathbf{r}_I(l)|\,|\mathbf{r}|}, \tag{16}$$

where $\mathbf{r}$ is the vector from the origin (head center) to the target position, $\mathbf{r}_I(l)$ is the vector from the origin to the $l$-th judged position, and $L$ is the total number of judgments. The dot denotes the scalar product of two vectors.
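A direct NumPy transcription of Equation (16):

```python
import numpy as np

def mean_angular_error_deg(r_target, r_judged):
    """Mean angular error of Equation (16), in degrees.

    r_target : (3,) vector to the target position
    r_judged : (L, 3) array of judged position vectors
    """
    cosines = (r_judged @ r_target) / (
        np.linalg.norm(r_judged, axis=1) * np.linalg.norm(r_target))
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0))).mean()
```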
The performance of distance perception is evaluated simply by the mean perceived distance:

$$\bar{d}_I = \frac{1}{L} \sum_{l=1}^{L} d_I(l), \tag{17}$$

where $d_I(l)$ is the perceived distance of the $l$-th judgment. The corresponding standard deviations are also calculated.
Moreover, a multi-way ANOVA was applied to the experimental results.

6.3. Results of Directional Localization

No inside-the-head localization was reported in any of the raw localization results. For both the proposed and conventional algorithms, the percentages of front–back and up–down confusions in directional localization (calculated over all target source distances, directions, and judgments) were less than 5% and 1%, respectively. It is well established that binaural cues, i.e., the ITD and ILD, account for lateral localization. However, binaural cues alone cannot provide enough information for front–back and vertical localization, especially on the cone of confusion, on which the ITD and ILD are approximately constant. It is also well accepted that both individualized spectral cues (caused by the diffraction of the pinnae and head) and dynamic cues caused by head turning contribute to front–back and vertical localization. Because the information provided by the spectral and dynamic cues is somewhat redundant, one cue alone allows front–back and vertical localization to some extent [29]. Therefore, the low percentage of front–back and up–down confusions is due to the incorporation of dynamic localization cues in the dynamic VAD, which alleviates the dependence on individualized spectral cues in front–back and vertical localization.
Table 3 lists the mean angular errors across all directions and judgments at different target distances. The mean angular errors lie between $12.6^\circ$ and $14.3^\circ$, which is basically consistent with previous results in dynamic VAD [12]. Generally, larger angular errors occurred in the proximal region, and the mean angular errors tended to decrease as the target distance increased.
Because the perceived directions are distributed in a spherical coordinate system, a statistical graph is used to represent the directional localization performance, following the method suggested by Leong and Carlile [30]. Figure 3 is the graphical representation of the mean perceived directions across target distances and judgments. The results are shown on the surface of a sphere, viewed from the front, right, and rear, respectively. In the figures, the notation ‘+’ denotes the target virtual source direction; the red points at the centers of the ellipses are the average perceived directions; and the ellipses centered around the centroids represent the confidence regions at the significance level $\alpha = 0.05$. An ellipse drawn with a blue line indicates that the data are highly symmetric around the mean; otherwise, the ellipse is drawn with a green line. The proposed and conventional algorithms yield similar directional localization performance. The larger localization errors usually occur in the directions below and behind the subject, and the smaller errors occur in the frontal and lateral directions on the horizontal plane. This feature may reflect the fact that human directional localization accuracy is inherently lower in the downward and backward directions.
A multi-way ANOVA on the mean angular errors indicates that, at a significance level of 0.05, the difference between the mean angular errors of the proposed and conventional algorithms is insignificant. Therefore, the proposed algorithm yields directional localization performance similar to that of the conventional algorithm.

6.4. Results of Distance Perception

Figure 4 plots the mean perceived distances and standard deviations of the conventional and proposed methods for virtual sources on the horizontal plane at different target distances and azimuths. Generally, the two algorithms yield similar results. The mean perceived distance increases with the target distance; therefore, both algorithms control the perceived distance effectively.
In most cases, when the target distance does not exceed 0.5 m, the mean perceived distance is larger than the target distance; when the target distance exceeds 0.5 m, the perceived distance is smaller than the target distance. Such biased results are a basic feature of distance perception [31]. In addition, when the target source departs from the horizontal plane, the variation in ILD with source distance is reduced; the standard deviation of the perceived distance for a lateral source also increases, and thus the accuracy of distance perception decreases, which is consistent with previous research on distance perception for virtual sources in the free field [19]. Auditory distance perception in the free field arises from multiple cues and their interaction, including the distance-dependent pressure or loudness, the distance-dependent ILD, and distance-dependent spectral cues in the near field. On the median plane (i.e., $\theta = 0^\circ$ and $180^\circ$), the ILD cue vanishes due to left–right symmetry, which accounts for the reduced accuracy of distance perception in these directions.
A multi-way ANOVA on the mean perceived distances indicates that, at a significance level of 0.05, the difference between the mean perceived distances of the proposed and conventional algorithms is insignificant. Therefore, the proposed algorithm yields distance perception performance similar to that of the conventional algorithm. Moreover, the slightly larger standard deviation in the distance perception for rear sources is probably caused by errors in reporting the perceived distances in that direction. An F-test was applied to the mean perceived distances for azimuth $\theta = 180^\circ$. The result ($F(1, 7) = 2.156$, $p = 0.146$) indicates that, at a significance level of 0.05, the difference between the mean perceived distances of the conventional and proposed methods is insignificant, which further supports the validity of the proposed method.
The performance of distance perception for target sources on the $\phi = \pm 45^\circ$ elevation planes is similar to that on the horizontal plane. The detailed results and analysis are omitted here for conciseness.

7. Discussion and Conclusions

The experiment in Section 6 validates that the proposed algorithm generates directional localization and distance perception consistent with those of the conventional algorithm in dynamic VAD.
To evaluate the efficiency of the proposed algorithm, comparisons with existing algorithms, namely the conventional algorithm and the binaural Ambisonics algorithm, are instructive. The memory and computational cost required for synthesizing the signal of each ear are estimated below for the conventional and proposed algorithms; the resources required for synthesizing the two-ear binaural signals are twice those for each ear.
The conventional algorithm requires storing a full set of minimum-phase HRTFs at different distances and directions for binaural synthesis. In the example in Section 5.1, the dimensionality of the minimum-phase HRIRs for each ear is 7 distances × 64,440 directions × 128 points = 57,738,240 values. Moreover, the conventional algorithm requires a 128-point HRIR convolver, or finite impulse response HRTF filter, for each virtual source, and dynamic binaural synthesis requires updating and interpolating the 128 coefficients of the HRTF filter.
The proposed algorithm requires storing the tensor representation of the HRIRs. In the example in Section 5.1, the size of the tensor representation of the minimum-phase HRIRs is 7 × 13 × 8 (core tensor) + 7 × 7 ($\mathbf{u}^d$) + 64,440 × 13 ($\mathbf{u}^m$) + 128 × 8 ($\mathbf{u}^n$) = 839,521 values, which is about 1.5% of that of the conventional algorithm. Moreover, the proposed algorithm requires eight 128-point common HRIR convolvers or HRTF filters for synthesizing the signal of each ear. Dynamic binaural synthesis requires updating only 20 weights (7 distance-related plus 13 direction-related eigen-mode weights) for each virtual source, which is about 15.6% of the update cost of the conventional algorithm. Therefore, the memory and computational cost of the proposed algorithm are less than, or even much less than, those of the conventional algorithm when more than eight virtual sources are generated simultaneously. Moreover, the proposed algorithm synthesizes binaural signals with high accuracy, with a mean relative error of less than −20 dB (with respect to the conventional method), which is sufficient to accurately reconstruct the information required for auditory perception.
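The storage counts quoted above can be checked with a few lines of arithmetic:

```python
# Storage counts from the text, reproduced as a check of the quoted figures.
conventional = 7 * 64440 * 128                          # full minimum-phase HRIR set
proposed = 7 * 13 * 8 + 7 * 7 + 64440 * 13 + 128 * 8    # core + u_d + u_m + u_n
print(conventional, proposed, proposed / conventional)  # 57738240, 839521, ~0.015
```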
Binaural Ambisonics is another algorithm for generating virtual sources at various distances and in various directions in dynamic VAD. According to the Shannon–Nyquist spatial sampling theorem [1], the upper frequency limit for accurately synthesizing binaural signals is related to the order $L$ of binaural Ambisonics by $f_{\max} = Lc/(2\pi a)$, and $L$th-order binaural Ambisonics requires $(L+1)^2$ virtual loudspeakers or HRTF-based filters for synthesizing the signal of each ear, where $c = 343$ m/s is the speed of sound and $a = 0.0875$ m is the mean head radius. Therefore, both the upper frequency limit and the complexity of binaural Ambisonics increase with the order. A 33rd-order binaural Ambisonics with 1156 HRTF-based filters would be required to accurately synthesize the signal of each ear up to 20 kHz. In practice, binaural Ambisonics of moderate order is usually used, which causes errors in the binaural signals above the upper frequency limit. Previous analyses and experiments indicated that, by exploiting the redundancy among various auditory direction and distance perception cues, a fifth-order dynamic binaural Ambisonics is able to generate the auditory perception of virtual sources in various directions and at different distances [32]. However, fifth-order binaural Ambisonics can accurately synthesize binaural signals only up to about 3.1 kHz; above this limit, the error in the binaural signals results in timbre coloration in reproduction. Therefore, the efficiency of the proposed algorithm is higher than that of the binaural Ambisonics algorithm.
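The order/frequency trade-off quoted above follows directly from $f_{\max} = Lc/(2\pi a)$:

```python
import math

c, a = 343.0, 0.0875          # speed of sound (m/s) and mean head radius (m)
for L in (5, 33):
    f_max = L * c / (2 * math.pi * a)
    print(L, (L + 1) ** 2, f_max)   # order, filters per ear, upper limit (Hz)
# L = 5 gives 36 filters and f_max of about 3.1 kHz;
# L = 33 gives 1156 filters and about 20.6 kHz.
```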
In conclusion, the proposed algorithm can generate virtual sources in various directions and at different distances in dynamic VAD and yields localization performance similar to that of the conventional algorithm. Moreover, it has the following advantages:
(1)
It can control the direction and distance independently.
(2)
To simulate the variation in binaural pressures caused by head turning and to create moving virtual sources, it only requires changing the weights $c_{n'}$ rather than updating the filters or convolvers, avoiding the possibly audible artifacts caused by updating the HRTF-based filters in conventional dynamic VAD.
(3)
It requires only a common set of (eight) convolvers or filters, and the number of convolvers is fixed (it does not increase with the number of virtual sources). Therefore, the efficiency of the algorithm is improved when multiple (more than eight) virtual sources are synthesized simultaneously.
Future work may include generating various complex auditory scenes composed of multiple sources by using the proposed method.

Author Contributions

Conceptualization, T.Z., B.X. and J.Z.; methodology, T.Z., B.X. and J.Z.; formal analysis, T.Z. and J.Z.; validation, T.Z.; writing—original draft preparation, T.Z. and B.X.; writing—review and editing, T.Z. and B.X.; visualization, T.Z.; supervision, B.X.; project administration, B.X.; funding acquisition, B.X. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 12174118).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. The Calculation of the Tensor Decomposition of HRIRs

The three matrices $\mathbf{u}^d_{D \times D'}$, $\mathbf{u}^m_{M \times M'}$, and $\mathbf{u}^n_{N \times N'}$ in Equation (5) can be derived using a procedure similar to that used in PCA or SVD. For instance, to derive the matrix $\mathbf{u}^n_{N \times N'}$, an $N \times (DM)$ matrix $\mathbf{h}_{N \times (DM)}$ is constructed, whose rows represent the HRIRs at different times and whose columns represent the HRIRs at the various directions and distances. An $N \times N$ Hermitian matrix $\mathbf{R}_{N \times N}$ is then constructed from $\mathbf{h}_{N \times (DM)}$ as:

$$\mathbf{R}_{N \times N} = \frac{1}{DM}\, \mathbf{h}_{N \times (DM)}\, \mathbf{h}_{N \times (DM)}^T, \tag{A1}$$

where the superscript “T” denotes matrix transposition.

The $N'$ columns of the matrix $\mathbf{u}^n_{N \times N'}$ are the orthonormal eigenvectors of $\mathbf{R}_{N \times N}$ associated with its $N' \le N$ positive eigenvalues, arranged in descending order:

$$\lambda_1^n > \lambda_2^n > \dots > \lambda_{N'}^n > 0. \tag{A2}$$

The other two matrices $\mathbf{u}^d_{D \times D'}$ and $\mathbf{u}^m_{M \times M'}$ can be derived similarly.

Once the matrices $\mathbf{u}^d$, $\mathbf{u}^m$, and $\mathbf{u}^n$ are found, the core tensor $\mathbf{w}_{D' \times M' \times N'}$ in Equation (5) can be calculated using the orthonormality of Equation (6) as:

$$\mathbf{w}_{D' \times M' \times N'} = \mathbf{h}_{D \times M \times N} \times_D (\mathbf{u}^d)^T \times_M (\mathbf{u}^m)^T \times_N (\mathbf{u}^n)^T. \tag{A3}$$

References

1. Xie, B.S. Head-Related Transfer Function and Virtual Auditory Display, 2nd ed.; J. Ross Publishing: New York, NY, USA, 2013.
2. Larcher, V.; Warusfel, O.; Jot, J.M.; Guyard, J. Study and comparison of efficient methods for 3D audio spatialization based on linear decomposition of HRTF data. In Proceedings of the 108th Audio Engineering Society Convention, Paris, France, 19–22 February 2000; p. 5097.
3. Evans, M.J.; Angus, J.A.S.; Tew, A.I. Analyzing head-related transfer function measurements using surface spherical harmonics. J. Acoust. Soc. Am. 1998, 104, 2400–2411.
4. Duraiswami, R.; Zotkin, D.N.; Gumerov, N.A. Interpolation and range extrapolation of HRTFs. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Montreal, QC, Canada, 17–21 May 2004; pp. 45–48.
5. Pollow, M.; Nguyen, K.V.; Warusfel, O.; Carpentier, T.; Müller-Trapet, M.; Vorländer, M.; Noisternig, M. Calculation of head-related transfer functions for arbitrary field points using spherical harmonics decomposition. Acta Acust. Acust. 2012, 98, 72–82.
6. Jot, J.M.; Wardle, S.; Larcher, V. Approaches to binaural synthesis. In Proceedings of the 105th Audio Engineering Society Convention, San Francisco, CA, USA, 26–29 September 1998; p. 4861.
7. Noisternig, M.; Sontacchi, A.; Musil, T.; Holdrich, R. A 3D ambisonic based binaural sound reproduction system. In Proceedings of the 24th International Conference: Multichannel Audio, The New Reality, Banff, AB, Canada, 26–28 June 2003; p. 1.
8. Menzies, D.; Marwan, A.A. Nearfield binaural synthesis and ambisonics. J. Acoust. Soc. Am. 2007, 121, 1559–1563.
9. Kistler, D.J.; Wightman, F.L. A model of head-related transfer functions based on principal components analysis and minimum-phase reconstruction. J. Acoust. Soc. Am. 1992, 91, 1637–1647.
10. Chen, J.; Van Veen, B.D.; Hecox, K.E. A spatial feature extraction and regularization model for the head-related transfer function. J. Acoust. Soc. Am. 1995, 97, 439–452.
11. Xie, B.S. Recovery of individual head-related transfer functions from a small set of measurements. J. Acoust. Soc. Am. 2012, 132, 282–294.
12. Zhang, C.Y.; Xie, B.S. Platform for dynamic virtual auditory environment real-time rendering system. Chin. Sci. Bull. 2013, 58, 316–327.
13. Cichocki, A.; Mandic, D.; De Lathauwer, L.; Zhou, G.; Zhao, Q.; Caiafa, C.; Phan, H.A. Tensor decompositions for signal processing applications: From two-way to multiway component analysis. IEEE Signal Process. Mag. 2015, 32, 145–163.
14. Grindlay, G.; Vasilescu, M.A.O. A multilinear (tensor) framework for HRTF analysis and synthesis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, Honolulu, HI, USA, 15–20 April 2007; pp. 161–164.
15. Huang, Q.; Li, L. Modeling individual HRTF tensor using high-order partial least squares. EURASIP J. Adv. Signal Process. 2014, 2014, 58.
16. Wang, J.; Liu, M.; Wang, X.; Liu, T.; Xie, X. Prediction of head-related transfer function based on tensor completion. Appl. Acoust. 2020, 157, 106995.
17. Blauert, J. Spatial Hearing: The Psychophysics of Human Sound Localization, Revised ed.; MIT Press: Cambridge, MA, USA, 1997.
18. Brungart, D.S.; Rabinowitz, W.M. Auditory localization of nearby sources I, head-related transfer functions. J. Acoust. Soc. Am. 1999, 106, 1465–1479.
19. Zahorik, P.; Brungart, D.S.; Bronkhorst, A.W. Auditory distance perception in humans: A summary of past and present research. Acta Acust. Acust. 2005, 91, 409–420.
20. Kolarik, A.J.; Moore, B.C.J.; Zahorik, P.; Cirstea, S.; Pardhan, S. Auditory distance perception in humans: A review of cues, development, neuronal bases, and effects of sensory loss. Atten. Percept. Psychophys. 2016, 78, 373–395.
21. Xie, B.S.; Yu, G.Z. Psychoacoustic Principle, Methods, and Problems with Perceived Distance Control in Spatial Audio. Appl. Sci. 2021, 11, 11242.
22. Bronkhorst, A.W.; Houtgast, T. Auditory distance perception in rooms. Nature 1999, 397, 517–520.
23. Kulkarni, A.; Isabelle, S.K.; Colburn, H.S. Sensitivity of human subjects to head-related transfer-function phase spectra. J. Acoust. Soc. Am. 1999, 105, 2821–2840.
24. Rui, Y.; Yu, G.Z.; Xie, B.S.; Liu, Y. Calculation of individualized near-field head-related transfer function database using boundary element method. In Proceedings of the 134th Audio Engineering Society Convention, Rome, Italy, 4–7 May 2013; p. 8901.
25. Wightman, F.; Kistler, D.; Arruda, M. Perceptual consequences of engineering compromises in synthesis of virtual auditory objects. J. Acoust. Soc. Am. 1992, 92, 2332.
26. Zhao, T.; Xie, B.S. Independent modes and dimensionality reduction of head-related transfer functions based on tensor decomposition. In Proceedings of the 23rd International Congress on Acoustics, Aachen, Germany, 9–13 September 2019.
27. Yu, G.Z.; Wang, L.L. Effect of individualized head-related transfer functions on distance perception in virtual reproduction for a nearby sound source. Arch. Acoust. 2019, 44, 251–258.
28. Wightman, F.L.; Kistler, D.J. Headphone simulation of free-field listening. II: Psychophysical validation. J. Acoust. Soc. Am. 1989, 85, 868–878.
29. Jiang, J.L.; Xie, B.S.; Mai, H.M.; Liu, L.L.; Yi, K.L.; Zhang, C.Y. The role of dynamic cue in auditory vertical localisation. Appl. Acoust. 2019, 146, 398–408.
30. Leong, P.; Carlile, S. Methods for spherical data analysis and visualization. J. Neurosci. Methods 1998, 80, 191–200.
31. Zahorik, P. Auditory display of sound source distance. In Proceedings of the 2002 International Conference on Auditory Display, Kyoto, Japan, 2–5 July 2002; pp. 326–332.
32. Xie, B.S.; Liu, L.L.; Jiang, J.L. Dynamic binaural Ambisonics scheme for rendering distance information of free-field virtual sources. Acta Acust. 2021, 46, 1223–1233. (In Chinese)
Figure 1. The block diagram of the algorithm.
Figure 2. Mean relative reconstruction error across all distances on the horizontal plane (left ear): (a) three-dimensional surface image; (b) color-scale image.
Figure 3. The statistical graph for spherical data of the conventional and proposed methods.
Figure 4. Results of distance perception for virtual sources on the horizontal plane.
Table 1. Cumulative percentage variance of energy for various numbers of direction-related modes (left ear).

M'          1      2      4      7      11     13
η_m' (%)    69.9   81.1   90.7   95.7   98.9   99.0
Table 2. Cumulative percentage variance of energy for various numbers of time-related modes (left ear).

N'          1      2      4      6      8
η_n' (%)    70.7   82.2   91.9   97.2   99.1
Table 3. Mean angular errors (°) across all directions and judgments at different target distances.

Algorithm       r = 0.2 m   r = 0.3 m   r = 0.5 m   r = 0.75 m   r = 1.0 m
Conventional    13.3        13.4        14.3        13.3         12.6
Proposed        14.0        13.7        12.9        13.5         13.3
