Article

Blueprint Separable Subsampling and Aggregate Feature Conformer-Based End-to-End Neural Diarization

Xiaolin Jiao, Yaqi Chen, Dan Qu and Xukui Yang
1 School of Cyber Science and Engineering, Zhengzhou University, Zhengzhou 450001, China
2 School of Information Systems Engineering, Information Engineering University, Zhengzhou 450001, China
* Authors to whom correspondence should be addressed.
Electronics 2023, 12(19), 4118; https://doi.org/10.3390/electronics12194118
Submission received: 21 August 2023 / Revised: 20 September 2023 / Accepted: 26 September 2023 / Published: 2 October 2023

Abstract

At present, a prevalent approach to speaker diarization is clustering based on speaker embeddings. However, this method encounters two primary issues. Firstly, it cannot directly minimize the diarization error during training; secondly, the majority of clustering-based methods struggle to handle speaker overlap in audio. A viable way to address these issues is end-to-end neural diarization (EEND). Nevertheless, training an EEND system generally requires lengthy audio inputs, which must be downsampled for efficient model processing. In this study, we develop a novel downsampling layer that uses blueprint separable convolution (BSConv) instead of depthwise separable convolution (DSC) as the foundational convolutional unit, which effectively preserves information from the original audio. Furthermore, we incorporate multi-scale feature aggregation (MFA) into the encoder structure to combine the features extracted by each conformer block at the output layer, consequently enhancing the expressiveness of the model’s feature extraction. Lastly, we employ the conformer as the backbone network to incorporate the proposed enhancements, resulting in an EEND system named BSAC-EEND. We assess the proposed method on both simulated and real datasets. The experiments indicate that our proposed EEND system reduces the diarization error rate (DER) by an average of 17.3% on two-speaker datasets and 12.8% on three-speaker datasets compared to the baseline.

1. Introduction

Speaker diarization is the task of determining “who spoke when” from a given audio input. This technique has various applications, including information retrieval from broadcast news, generation of conference transcripts, and analysis of telephone conversations [1,2]. Moreover, it is a crucial component for implementing automatic speech recognition (ASR) in multi-speaker settings such as telephone conversations [3], conferences [4], and lectures [5]. It has been demonstrated that accurate speaker diarization results can enhance ASR performance by constraining speech masks during the construction of speech-separating beamformers [6,7].
Traditional speaker diarization systems mainly adopt a clustering-based approach [8,9], which sequentially applies the following modules to the input audio to obtain the final results: voice activity detection, voice segmentation, feature extraction, and clustering. However, despite the satisfactory performance of traditional clustering-based systems, they exhibit certain limitations. First, they depend on multiple modules, each requiring individual training. Consequently, clustering-based speaker diarization systems need to be jointly calibrated between different modules, which introduces additional complexity in training. Secondly, although some recent work has attempted to handle situations with multiple speakers at the same time, the clustering-based approach implicitly assumes that each short speech segment has only one speaker speaking, making it difficult to handle overlapping speech [10].
End-to-end speaker diarization presents a promising solution. Self-attentive end-to-end neural diarization (SA-EEND) [11] is an advanced method designed to model the joint speech activities of multiple speakers. Unlike clustering-based approaches, SA-EEND does not require clustering algorithms; instead, it directly outputs the speaking probabilities of all speakers within each time period. By utilizing a multi-label classification framework, SA-EEND naturally addresses the speaker overlap problem in both training and inference when given a multi-speaker audio recording as input. Furthermore, SA-EEND follows an end-to-end training strategy, minimizing the diarization error rate, and exhibits remarkable accuracy on a dataset of two-speaker telephone conversations [12].
In this paper, we propose an end-to-end speaker diarization model based on a blueprint separable subsampling layer and an aggregate feature conformer (BSAC-EEND), which enhances system performance by optimizing the convolutional subsampling layer and the feature encoder. Firstly, the EEND system must process long utterances. During the training phase, to increase efficiency, the system segments the input audio, which typically varies in length from a few seconds to several tens of seconds. In contrast, during the testing phase, the entire audio is assessed, leading to longer input data. A critical consideration is the model’s ability to robustly process utterances of various durations, particularly longer inputs. In the realm of speaker recognition, the multi-scale feature fusion conformer [13] has demonstrated proficiency in extracting robust global feature information from speech recordings of varying lengths. This is attributed to the model’s ability to capture both global and local features while integrating multi-scale feature representations. Significantly, it performs better on long-duration test utterances [13]. Consequently, we introduce the MFA structure into the EEND system’s feature extractor. The conformer [14] merges a transformer [15] with convolutional neural networks to effectively capture global and local features, whereas the MFA structure concatenates the output features of all conformer blocks, resulting in multi-scale representations. Secondly, to reduce redundant information in audio frame sequences and enable the model to efficiently process long-duration audio input, downsampling is necessary. Currently, frame stacking subsampling is frequently employed [11,16], although [12] has shown that convolutional subsampling can notably enhance EEND networks’ performance compared to frame stacking subsampling. Motivated by this finding, we introduce BSConv [17], an advanced version of depthwise separable convolution [18] that more effectively leverages correlations within convolution kernels to achieve efficient convolution separation [17]. We have designed a novel convolution subsampling layer utilizing both the unconstrained BSConv (BSConv-U) [17] and subspace BSConv (BSConv-S) [17] structures to optimize the subsampling architecture.
This paper’s primary contributions are threefold: (1) Incorporation of the MFA structure into the EEND model’s feature extractor, thereby enhancing its capability for extracting robust features. (2) Assessment of the performance of the convolutional subsampling layer based on BSConv-U and BSConv-S on both simulated and real data, with results demonstrating superiority over frame stacking subsampling and DSC-based convolutional subsampling. (3) Development of a novel EEND system which employs a conformer as the backbone network, integrates the MFA structure, and incorporates a convolutional subsampling layer based on BSConv. This new EEND system demonstrates significant performance improvements on both simulated and real datasets.

2. Related Work

2.1. Traditional Speaker Diarization Method

Traditional speaker diarization approaches typically comprise a cascaded system consisting of four modules: (1) speech activity detection (SAD), (2) speaker embedding extraction, (3) speaker embedding clustering, and an optional process, (4) overlap processing. Some methods also incorporate an ASR module [19,20]. The majority of research primarily focuses on the second and third modules, namely speaker embedding extraction and speaker embedding clustering. For speaker embeddings, i-vector [21,22], x-vector [23], and d-vector [24,25] have been investigated. Early works on speaker embedding clustering employed traditional algorithms such as k-means clustering [8,26], agglomerative hierarchical clustering (AHC) [9,27,28], mean-shift clustering [21], and spectral clustering [24,29]. More recently, advanced clustering techniques have been proposed, such as Bayesian HMM clustering of x-vector sequences (VBx) [30,31], automatically adjusted spectral clustering [32], and fully supervised clustering [25,33]. These approaches are typically utilized for hard clustering, wherein a sample exclusively belongs to a single class; consequently, most cascaded systems cannot address speaker overlap, with a few exceptions [34]. To handle speaker overlap, some cascaded approaches incorporate the last optional module; however, recent studies occasionally omit this step from the methods and evaluations [22,24,25,32,33]. Additionally, the first module, SAD, is often neglected when assessing cascaded methods using annotated speech activities [22,24,25,32,33].

2.2. End-to-End Speaker Diarization

One advantage of end-to-end speaker diarization approaches, which generate results directly from audio, is that they do not need additional modules to detect silence or overlapping speech. Various methods exist for obtaining speaker diarization results within speech separation [35,36]. However, these models are trained on clean speech or time–frequency masks derived from clean speech, making them unsuitable for training on complex audio datasets such as DIHARD [37,38]. In contrast, EEND-based methods, designed to output the posterior probability of multiple speakers speaking simultaneously, can be trained on real data and are better suited to practical problems. The EEND model accepts a sequence of acoustic features, such as MFCC or Fbank, $X = (x_t \in \mathbb{R}^F \mid t = 1, \ldots, T)$. The neural network produces a speaker label sequence $Y = (y_t \mid t = 1, \ldots, T)$, wherein $y_t = [y_{t,k} \in \{0, 1\} \mid k = 1, \ldots, K]$; that is, speaker $k$ is speaking at frame $t$ when $y_{t,k} = 1$, and $K$ denotes the maximum number of speakers the network can handle. For different speakers $k$ and $k'$, both $y_{t,k}$ and $y_{t,k'}$ may equal one, indicating that speakers $k$ and $k'$ are speaking simultaneously at frame $t$, i.e., overlapping speech. Assuming the outputs $y_{t,k}$ are conditionally independent, the model’s training objective is to maximize the diarization posterior probability $\log P(Y \mid X) \approx \sum_t \sum_k \log P(y_{t,k} \mid X)$ of the training data. Multiple candidate reference label sequences can be generated by permuting the speaker labels in $Y$; thus, the loss is computed for all possible reference permutations, and the permutation with the lowest loss is used for error backpropagation. This approach is influenced by the permutation-free objective used in speech separation [39]. Early EEND implementations employed bidirectional long short-term memory (BLSTM) [16], succeeded by self-attention-based EEND networks [11], which demonstrated improved DER on two-speaker data such as the CALLHOME dataset (LDC2001S97) [40] and dialogue audio in the CSJ [41].
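To make the permutation-free objective concrete, the following minimal PyTorch-style sketch (ours, not the authors’ released code) computes the binary cross-entropy loss under the best speaker permutation for hypothetical tensors `pred` and `label` of shape (T, K).

```python
from itertools import permutations

import torch
import torch.nn.functional as F


def pit_bce_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Binary cross-entropy under the best speaker permutation (minimal sketch).

    pred  : (T, K) posterior probabilities of speaker activity.
    label : (T, K) binary reference labels.
    """
    num_speakers = label.shape[1]
    losses = []
    # Try every permutation of the reference speakers and keep the lowest loss,
    # which is the permutation-free objective used for backpropagation.
    for perm in permutations(range(num_speakers)):
        losses.append(F.binary_cross_entropy(pred, label[:, list(perm)]))
    return torch.stack(losses).min()


# Toy usage: 5 frames, 2 speakers, with one overlapping frame in the reference.
pred = torch.sigmoid(torch.randn(5, 2))
label = torch.tensor([[1., 0.], [1., 0.], [1., 1.], [0., 1.], [0., 1.]])
print(pit_bce_loss(pred, label))
```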
Drawing upon the strong performance of the conformer model in end-to-end continuous speech recognition, the authors of [12] devised a conformer-based EEND system (CB-EEND), incorporating SpecAugment [42] and a convolutional subsampling layer based on depthwise separable convolution, resulting in significant improvements.

3. BSAC-EEND

This section presents the fundamental composition and structure of the proposed BSAC-EEND. As shown in Figure 1, the model’s overall structure comprises SpecAugment, a BSConv convolutional subsampling layer, conformer blocks, an MFA structure, a layer norm, a linear layer, and a sigmoid function. The subsequent sections describe the model’s critical components in detail.

3.1. Convolution Subsampling Based on BSConv

Through an in-depth analysis of CNNs, the authors of [17] concluded that convolution kernels generally show high redundancy along their depth direction, i.e., high intra-kernel correlation. Consequently, they proposed BSConv, which represents each convolution kernel by a two-dimensional blueprint scaled by a weight vector distributed along the depth axis.
DSC, by contrast, is motivated by correlations across kernels: it assumes high inter-kernel redundancy and adopts a structure opposite to that of BSConv. This assumption, however, has been shown to be less effective for separating convolutions [43].
As a result, we explore the use of BSConv as the fundamental unit of convolutional subsampling and design a novel convolutional subsampling layer to subsample the input features. BSConv encompasses two distinct structures: BSConv-U and BSConv-S. Analogous to DSC, both are composed of pointwise convolution [18] and depthwise convolution [18], but in a different order. The depthwise convolution applies a separate kernel to each channel of the input feature map, changing the feature map’s shape without altering its channel count. In contrast, the pointwise convolution applies a 1 × 1 convolution to every feature map; it exclusively modifies the number of channels, leaving the feature map’s size unaffected. Figure 2 illustrates the transformations in the shape of the input and output, along with the structural comparison to DSC.
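As a concrete illustration of the ordering just described, the following PyTorch sketch (our illustration, not the authors’ implementation) builds a DSC block (depthwise then pointwise), a BSConv-U block (pointwise then depthwise), and a BSConv-S block whose pointwise step is factorized through a smaller subspace; the channel counts and the subspace ratio are hypothetical.

```python
import torch
import torch.nn as nn


def dsc(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    """Depthwise separable convolution: depthwise conv followed by 1x1 pointwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, k, padding=k // 2, groups=in_ch),    # depthwise
        nn.Conv2d(in_ch, out_ch, 1),                                 # pointwise
    )


def bsconv_u(in_ch: int, out_ch: int, k: int = 3) -> nn.Sequential:
    """Unconstrained BSConv: pointwise conv first, then depthwise conv."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),                                 # pointwise (blueprint weights)
        nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch), # depthwise
    )


def bsconv_s(in_ch: int, out_ch: int, k: int = 3, ratio: float = 0.25) -> nn.Sequential:
    """Subspace BSConv: the pointwise step passes through a smaller subspace (ratio is hypothetical)."""
    mid_ch = max(1, int(out_ch * ratio))
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, 1),                                 # project to subspace
        nn.Conv2d(mid_ch, out_ch, 1),                                # expand back
        nn.Conv2d(out_ch, out_ch, k, padding=k // 2, groups=out_ch), # depthwise
    )


x = torch.randn(1, 1, 500, 80)  # (batch, channel, time, freq) toy input
print(dsc(1, 32)(x).shape, bsconv_u(1, 32)(x).shape, bsconv_s(1, 32)(x).shape)
```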
Drawing inspiration from these structures, we develop two convolutional subsampling modules, with their respective architectures depicted in Figure 3. We refer to the convolutional subsampling based on BSConv-U as CS-BSCU and the one based on BSConv-S as CS-BSCS. The performance of these BSConv-based convolutional subsampling modules in the EEND system will be discussed in detail in Section 4.
Assuming the input acoustic feature has a length of $T$ and a dimension of $F$, denoted as $X \in \mathbb{R}^{T \times F}$, the feature undergoes SpecAugment before being passed to the convolutional subsampling layer. This convolutional subsampling process, denoted $\mathrm{BSConv}(\cdot)$, can be described as:

$X' = \mathrm{SpecAugment}(X)$

$X'' = \mathrm{BSConv}(X')$

where $X' \in \mathbb{R}^{T \times F}$, $X'' \in \mathbb{R}^{T' \times F'}$, and $T' = T / 10$.

3.2. Conformer Block

The self-attention mechanism derived from the transformer effectively captures long-distance global context dependencies; however, it lacks local detail recognition. To model global and local features more directly and efficiently, the conformer model [14] combines CNN and transformer architectures, thereby better capturing the features necessary for speaker diarization in audio. The principal components of the conformer encoder block are the multi-head self-attention (MHSA) and convolution modules. Following the MHSA, the convolution module consists of a pointwise convolution, a one-dimensional depthwise convolution, and a batch normalization layer after the convolution, which makes the deep model easier to train.
The conformer encoder block’s structure [14] diverges from the transformer [15] by incorporating two feedforward neural network (FNN) modules with half-step residual connections. These two FNNs sandwich the MHSA and convolution modules, resembling a macaron structure. Mathematically, for the $i$-th conformer block and input feature $E_{i-1}$, the output feature $E_i$ is calculated as follows:
$\tilde{E}_i = E_{i-1} + \frac{1}{2}\mathrm{FNN}(E_{i-1})$

$E'_i = \tilde{E}_i + \mathrm{MHSA}(\tilde{E}_i)$

$E''_i = E'_i + \mathrm{Convolution}(E'_i)$

$E_i = \mathrm{LayerNorm}\left(E''_i + \frac{1}{2}\mathrm{FNN}(E''_i)\right)$
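For illustration, the macaron structure described by these equations can be sketched as a simplified PyTorch module (a pared-down sketch, not the authors’ implementation; the gating, dropout, and pre-norm details of the full conformer are omitted).

```python
import torch
import torch.nn as nn


class SimplifiedConformerBlock(nn.Module):
    """Macaron-style conformer block: FNN/2 -> MHSA -> Conv -> FNN/2 -> LayerNorm."""

    def __init__(self, d_model: int = 256, n_heads: int = 4, ff_dim: int = 1024, kernel: int = 31):
        super().__init__()
        self.ffn1 = nn.Sequential(nn.Linear(d_model, ff_dim), nn.SiLU(), nn.Linear(ff_dim, d_model))
        self.mhsa = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Convolution module: pointwise conv, depthwise 1-D conv over time, batch norm, pointwise conv.
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_model, 1),
            nn.Conv1d(d_model, d_model, kernel, padding=kernel // 2, groups=d_model),
            nn.BatchNorm1d(d_model),
            nn.SiLU(),
            nn.Conv1d(d_model, d_model, 1),
        )
        self.ffn2 = nn.Sequential(nn.Linear(d_model, ff_dim), nn.SiLU(), nn.Linear(ff_dim, d_model))
        self.norm = nn.LayerNorm(d_model)

    def forward(self, e: torch.Tensor) -> torch.Tensor:      # e: (batch, time, d_model)
        e = e + 0.5 * self.ffn1(e)                            # first half-step FNN
        e = e + self.mhsa(e, e, e, need_weights=False)[0]     # multi-head self-attention
        e = e + self.conv(e.transpose(1, 2)).transpose(1, 2)  # convolution module
        e = e + 0.5 * self.ffn2(e)                            # second half-step FNN
        return self.norm(e)


block = SimplifiedConformerBlock()
print(block(torch.randn(2, 100, 256)).shape)  # torch.Size([2, 100, 256])
```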

3.3. MFA Structure

The research in [44,45] demonstrated that low-level features can contribute to high-quality speaker feature extraction. Following this theory, the ECAPA-TDNN [46] system aggregates the output features of all SE-Res2 blocks before the final pooling layer, resulting in a significant performance improvement.
Similarly, in our approach, we concatenate the output features of every conformer block to form the features for EEND speaker diarization and then pass them to the layer norm layer. Unlike [13], however, we eliminate the pooling layer after the layer norm, along with the batch norm layer and other structures. There are two reasons for this: Firstly, the EEND system’s input audio typically contains multiple speakers, making it impossible to obtain representative speaker features using the pooling method. Secondly, the pooling method acquires segment-level speaker embeddings, whereas we require frame-level speaker activity analysis, rendering pooling inapplicable. Therefore, we directly use a linear layer and a sigmoid function to output the frame-level speaker activity posterior probabilities. The specific process can be described as:
$E = \mathrm{Concat}(E_1, \ldots, E_i)$

$O = \mathrm{LayerNorm}(E)$

$\mathrm{Output} = \sigma(\mathrm{Linear}(O))$

where $\sigma$ denotes the sigmoid function, $E_1, \ldots, E_i \in \mathbb{R}^{T' \times d}$, $T' = T / 10$, $d$ symbolizes the output dimension of the transformer or conformer encoder, $E, O \in \mathbb{R}^{T' \times D}$, $i$ represents the number of encoder blocks, and $D = d \times i$.
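A minimal sketch of this MFA output head in PyTorch (ours, with hypothetical block outputs; not the authors’ code):

```python
import torch
import torch.nn as nn

num_blocks, d_model, num_speakers = 4, 256, 2
T_sub = 500  # subsampled frame count T' = T / 10

# Hypothetical outputs E_1..E_i of the conformer blocks, each of shape (batch, T', d).
block_outputs = [torch.randn(1, T_sub, d_model) for _ in range(num_blocks)]

# MFA: concatenate along the feature axis, normalize, then project per frame.
E = torch.cat(block_outputs, dim=-1)            # (1, T', D) with D = d * i
norm = nn.LayerNorm(d_model * num_blocks)
linear = nn.Linear(d_model * num_blocks, num_speakers)
posteriors = torch.sigmoid(linear(norm(E)))     # frame-level speaker activity probabilities
print(posteriors.shape)                         # torch.Size([1, 500, 2])
```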

4. Data and Experiments

4.1. Data Preparation

To train the EEND system, we employed the simulated data generation algorithm proposed in [16], which has been used in several studies [11,12,16,47,48]. Firstly, we selected N speakers and, for each speaker, chose random speech segments. We then inserted silent parts between these segments before splicing them together to create N audio tracks. Each generated track was convolved with a room impulse response, and the N long recordings were mixed together with a noise signal at a random signal-to-noise ratio. The silence duration between speech segments was drawn from an exponential distribution with parameter β. Consequently, β can be used to control the speaker overlap rate of the generated audio, where a larger β value results in lower speaker overlap [16].
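As an illustration of how β shapes the data, the sketch below (our simplification of the procedure from [16], using toy segments and assuming β is the mean of the exponential pause length in seconds) splices one speaker’s segments with exponentially distributed pauses; larger β produces longer pauses and, once several such tracks are mixed, a lower overlap ratio.

```python
import numpy as np

rng = np.random.default_rng(0)
sample_rate = 8000


def splice_with_pauses(segments, beta):
    """Concatenate one speaker's segments, inserting silences drawn from Exp(mean=beta) seconds."""
    pieces = []
    for seg in segments:
        pause = int(rng.exponential(beta) * sample_rate)
        pieces.append(np.zeros(pause, dtype=np.float32))
        pieces.append(seg)
    return np.concatenate(pieces)


# Toy "utterances": three one-second bursts of noise for one speaker.
segments = [rng.standard_normal(sample_rate).astype(np.float32) for _ in range(3)]
track_beta2 = splice_with_pauses(segments, beta=2)
track_beta5 = splice_with_pauses(segments, beta=5)
print(len(track_beta2) / sample_rate, len(track_beta5) / sample_rate)  # beta=5 track is longer on average
```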
The corpora used for the simulation comprise Switchboard-2 (Phases I, II, and III), Switchboard Cellular (Parts 1 and 2), and the NIST Speaker Recognition Evaluations (2004, 2005, 2006, and 2008), with all audio being telephone speech sampled at 8 kHz. In total, these corpora contain 6381 speakers, divided into 5743 speakers for the training set and 638 speakers for the test set. As these corpora lack temporal annotations, similar to [16], we utilized SAD based on a time-delayed neural network (TDNN) and the Kaldi speech toolkit [49] for extraction. Noise data were obtained from 37 background noise segments of the MUSAN corpus [50], while the room impulse response (RIR) data came from the simulated RIR dataset used in [51], with 10,000 records selected. SNR values were chosen from 10, 15, and 20 dB. We generated mixed audio datasets with two and three speakers separately, using different β values to control the speaker overlap rate of the created mixed audio [16]. Each speaker had 20–40 utterances, with detailed simulated data information presented in Table 1.
We also selected audios containing two and three speakers from the real telephone conversation dataset CALLHOME to construct new datasets, referred to as CALLHOME-2spk and CALLHOME-3spk, respectively, and employed Kaldi’s script to split each into two subsets. For CALLHOME-2spk, these subsets comprised a fine-tuning set of 155 recordings and a test set of 148 recordings, whereas for CALLHOME-3spk, the subsets comprised a fine-tuning set of 61 recordings and a test set of 74 recordings. Further details are available in Table 2.

4.2. Experimental Setup

For audio features, we follow the configurations of the original papers, utilizing 23-dimensional MFCCs for SA-EEND [11] and 80-dimensional Fbank features for CB-EEND [12] and our BSAC-EEND, with a frame length of 25 ms and a frame shift of 10 ms. For SpecAugment, we employed two frequency masks and two time masks, with each frequency mask covering up to two consecutive frequency channels and each time mask covering at most 120 consecutive time steps.
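A plain NumPy sketch of this masking configuration (two frequency masks of width at most 2 and two time masks of width at most 120 frames); an illustration only, not the exact augmentation implementation used in training.

```python
import numpy as np

rng = np.random.default_rng(0)


def spec_augment(feat, n_freq_masks=2, max_freq=2, n_time_masks=2, max_time=120):
    """Zero out random frequency bands and time spans of a (T, F) feature matrix."""
    feat = feat.copy()
    T, F = feat.shape
    for _ in range(n_freq_masks):
        w = rng.integers(0, max_freq + 1)          # mask width in frequency channels
        f0 = rng.integers(0, F - w + 1)
        feat[:, f0:f0 + w] = 0.0
    for _ in range(n_time_masks):
        w = rng.integers(0, max_time + 1)          # mask width in frames
        t0 = rng.integers(0, max(T - w, 0) + 1)
        feat[t0:t0 + w, :] = 0.0
    return feat


print(spec_augment(rng.standard_normal((500, 80))).shape)  # (500, 80)
```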
In terms of subsampling, we adopted three methods: frame stacking subsampling, as used in the original SA-EEND [11]; DSC-based convolutional subsampling, as used in the original CB-EEND [12]; and our proposed BSConv-based convolutional subsampling. For frame stacking subsampling, we concatenated each feature with the seven preceding and following frames. For DSC-based convolutional subsampling, in line with [12], we implemented two layers of two-dimensional (2D) depthwise separable convolutions [18]. For BSConv-based convolutional subsampling, we utilized the CS-BSCU and CS-BSCS convolutional subsampling methods. These techniques subsampled the temporal domain of the input features by a factor of 10 (equivalent to 100 milliseconds). Following [12], we set the convolution kernel sizes for all convolutional subsampling to {(3, 3), (7, 7)}. For 23-dimensional acoustic features, the strides were {(2, 1), (5, 1)}, and for 80-dimensional acoustic features, the strides were {(2, 2), (5, 2)}.
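To make the 10× temporal reduction concrete, here is a sketch of a two-layer BSConv-U-style subsampling stack using the kernel sizes {(3, 3), (7, 7)} and the strides {(2, 2), (5, 2)} applied to 80-dimensional features; the channel counts and activation are hypothetical, and this is not the authors’ exact layer.

```python
import torch
import torch.nn as nn


def bsconv_u(in_ch, out_ch, kernel, stride):
    """BSConv-U style block: pointwise conv, then strided depthwise conv (sketch)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 1),
        nn.Conv2d(out_ch, out_ch, kernel, stride=stride,
                  padding=(kernel[0] // 2, kernel[1] // 2), groups=out_ch),
        nn.ReLU(),
    )


subsample = nn.Sequential(
    bsconv_u(1, 32, kernel=(3, 3), stride=(2, 2)),   # time /2, freq /2
    bsconv_u(32, 32, kernel=(7, 7), stride=(5, 2)),  # time /5, freq /2 -> overall time /10
)

x = torch.randn(1, 1, 5000, 80)   # 50 s of 80-dim Fbank at a 10 ms frame shift
y = subsample(x)
print(y.shape)                    # roughly (1, 32, 500, 20): temporal length reduced by a factor of 10
```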
For SA-EEND, we employed four transformer encoders, with each encoder containing four attention heads. Each attention head generated a 256-dimensional frame-by-frame embedding vector, and the feedforward layer consisted of 1024 internal units. Positional encoding was not utilized. For CB-EEND and BSAC-EEND, we used four conformer encoders, with each encoder containing four attention heads. Each attention head generated a 256-dimensional frame-by-frame embedding vector, and the feedforward layer comprised 1024 internal units. Similarly, positional encoding was not used, and the convolution module’s kernel size was set to 31. Furthermore, we applied the designed BSConv-based convolutional subsampling and MFA to CB-EEND for comparison.
In the training process with simulated data, we utilized the Adam optimizer [52] to update the neural network and the Noam scheduler [15] to adjust the learning rate, employing 100,000 warm-up steps. The Adam optimizer with a fixed learning rate was also employed for adaptation on the real datasets. For efficient batching during training, audio was divided into 50 s segments when working with simulated data and 200 s segments for the adaptation set. The training batch size was set to 64. During the inference stage, the entire audio was processed by the network without segmentation. We assessed performance using the DER, where $T_{\mathrm{Speech}}$, $T_{\mathrm{MI}}$, $T_{\mathrm{FA}}$, and $T_{\mathrm{CF}}$ denote the total speech duration, missed speech duration, false alarm speech duration, and speaker confusion duration, respectively:

$\mathrm{DER} = \dfrac{T_{\mathrm{MI}} + T_{\mathrm{FA}} + T_{\mathrm{CF}}}{T_{\mathrm{Speech}}}$

We evaluated performance on both the simulated datasets and the standard CALLHOME dataset, employing a collar tolerance of 0.25 s when comparing the hypothesized and reference speaker boundaries. Notably, speaker overlap was not excluded from the evaluation.
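For reference, the DER defined above is a straightforward ratio of durations; a minimal helper with hypothetical example numbers:

```python
def diarization_error_rate(t_speech: float, t_miss: float, t_fa: float, t_conf: float) -> float:
    """DER = (missed + false alarm + speaker confusion) / total speech duration."""
    return (t_miss + t_fa + t_conf) / t_speech


# Hypothetical durations in seconds: 600 s of speech, 12 s missed, 9 s false alarm, 6 s confused.
print(f"{diarization_error_rate(600, 12, 9, 6):.2%}")  # 4.50%
```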

5. Result and Analysis

5.1. Improved Convolutional Subsampling

To investigate the performance of the proposed enhanced convolutional subsampling layer, we incorporated the two convolutional subsampling structures depicted in Figure 3, replacing the corresponding component in CB-EEND. These improvements were assessed on simulated datasets featuring two and three speakers, as well as on real datasets. For the simulated data evaluation, the training sets of Sim2spk and Sim3spk were employed to train the EEND model, which was subsequently tested on their corresponding test sets (outlined in Table 1). For the real dataset evaluation, the model was adapted using Part 1 and tested on Part 2.
Table 3 presents the diarization error rate (DER) evaluation results. From the first three rows, it is evident that CB-EEND outperformed both the cluster-based cascade system and transformer-based EEND in terms of DER. Examining the last three rows, we observe that compared to the original CB-EEND, both CS-BSCU and CS-BSCS achieved significant improvements in DER when replacing the DSC-based convolutional subsampling layer. Specifically, the CS-BSCS layer resulted in 8.2% ( β = 2 ), 15% ( β = 3 ), and 14.9% ( β = 5 ) relative improvements on the Sim2spk test set, a 3.6% improvement on the CALLHOME-2spk Part 2, and 17.1% ( β = 5 ), 23.8% ( β = 7 ), and 27.4% ( β = 11 ) improvements on the Sim3spk test set. Additionally, a 1.2% relative improvement was observed on the CALLHOME-3spk Part 2.
The CS-BSCU layer yielded 13.4% ( β = 2 ), 8.6% ( β = 3 ), and 3.1% ( β = 5 ) relative improvements on the Sim2spk test set, and a 3.1% improvement on CALLHOME-2spk Part 2. Furthermore, 20.7% ( β = 5 ), 21.1% ( β = 7 ), and 28.7% ( β = 11 ) relative improvements were achieved on the Sim3spk test set, along with a 1.4% improvement on the CALLHOME-3spk Part 2.
Overall, within CB-EEND, the CS-BSCS exhibits slightly better performance for two-speaker datasets and slightly inferior results for three-speaker datasets in comparison to the CS-BSCU layer. As discussed in Section 3.1, the convolutional subsampling based on BSConv outperforms the DSC-based approach, primarily due to its ability to reduce correlation within the convolution kernel more effectively. This facilitates the retention of more relevant information during the downsampling process.

5.2. MFA

We evaluated the performance of the MFA structure, as outlined in Section 3.3, within both SA-EEND and CB-EEND frameworks. Employing the experimental settings discussed in Section 4.1, we implemented the MFA structure in CB-EEND and assessed its performance on simulated and real datasets containing two or three speakers. Similar to Section 5.1, for the simulated data evaluation, we used the Sim2spk and Sim3spk training sets to train the EEND model and tested it on the respective test sets provided in Table 1. For real data evaluation, we utilized Part 1 of the CALLHOME dataset for adaptation and conducted testing on Part 2.
Table 4 presents the diarization error rate (DER) evaluation results. Compared with CB-EEND, the incorporation of the MFA structure yielded noticeable improvements in both simulated and real datasets. In the Sim2spk test set, relative improvements of 8.7% ( β = 2 ), 8.3% ( β = 3 ), and 4.5% ( β = 5 ) were achieved, whereas a 3.3% improvement was observed in the CALLHOME-2spk set. Furthermore, the model attained 12.3% ( β = 5 ), 13.3% ( β = 7 ), and 17.3% ( β = 11 ) relative improvements in the Sim3spk test set. Additionally, a 2.3% relative improvement was observed on CALLHOME-3spk Part 2.
To better illustrate the impact of the MFA structure, we randomly selected a 50 s speech input from Sim2spk ( β = 2 ) and reduced the high-dimensional features obtained prior to the linear layer to a two-dimensional space using principal component analysis (PCA). Figure 4 displays the visualization results for features extracted by SA-EEND, CB-EEND, and CB-EEND+MFA. Evidently, both CB-EEND and CB-EEND+MFA outperformed SA-EEND when discerning between silence frames and single-speaker speech frames, with CB-EEND+MFA proving superior in differentiating overlapping speaker frames. This demonstrates that the model benefits substantially from this multiscale representation convergence.
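The kind of visualization shown in Figure 4 can be reproduced in outline with scikit-learn’s PCA; the sketch below uses hypothetical stand-ins for the frame-level features and their frame classes, since the actual model outputs are not included here.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Hypothetical stand-ins for the frame-level features taken before the linear layer
# and their frame classes (0 = silence, 1 = single speaker, 2 = overlapping speech).
features = rng.standard_normal((500, 1024))
frame_labels = rng.integers(0, 3, size=500)

points = PCA(n_components=2).fit_transform(features)   # reduce to two dimensions
for cls, name in enumerate(["silence", "single speaker", "overlap"]):
    mask = frame_labels == cls
    plt.scatter(points[mask, 0], points[mask, 1], s=4, label=name)
plt.legend()
plt.savefig("frame_feature_pca.png")
```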

5.3. BSAC-EEND

In this section, we evaluate the proposed BSAC-EEND model, employing the two convolutional subsampling layer structures depicted in Figure 3. The evaluation dataset includes both two- and three-speaker datasets, incorporating simulated and real datasets for training and testing. The model’s training and testing processes are consistent with those described in the previous two sections.
The evaluation results can be observed in Table 5. When compared to CB-EEND, BSAC-EEND (CS-BSCS) achieved relative improvements of 20.8% ( β = 2 ), 16.1% ( β = 3 ), and 24.8% ( β = 5 ) on the two-speaker dataset and a relative improvement of 7.8% on CALLHOME-2spk. For the three-speaker dataset, it showed relative improvements of 2.5% ( β = 5 ) and 3.5% ( β = 7 ). BSAC-EEND (CS-BSCU) demonstrated relative improvements of 16.2% ( β = 2 ), 8.3% ( β = 3 ), and 15.8% ( β = 5 ) on the two-speaker dataset, with a 6.1% relative improvement on CALLHOME-2spk. On the three-speaker dataset, relative improvements of 15.2% ( β = 5 ), 10.8% ( β = 7 ), and 21.7% ( β = 11 ) were recorded, with a 3.79% relative improvement on CALLHOME-3spk.
These findings indicate that BSAC-EEND achieves an overall DER improvement compared to SA-EEND and CB-EEND, with BSAC-EEND (CS-BSCS) performing better on the two-speaker dataset and BSAC-EEND (CS-BSCU) showing superior results on the three-speaker dataset. This outcome aligns with the analysis presented in Section 5.1. We conducted a comparative evaluation with the cutting-edge end-to-end speaker diarization system, EEND-EDA [48]. As reported in [48], EEND-EDA exhibits strong performance with fewer speakers, but there is significant room for improvement when working with more complex speaker datasets. Comparing the results in the table, our system demonstrates slightly inferior performance to EEND-EDA for the two-speaker dataset, but outperforms it in the more challenging three-speaker dataset. This substantial improvement suggests that our approach is better equipped to handle intricate audio information.

6. Conclusions

In this study, we introduce a conformer-based EEND system with an MFA structure and a novel convolutional subsampling layer. Our experimental results demonstrate that the EEND system benefits from the convergence and fusion of multi-scale representations facilitated by the MFA structure and from the new convolutional subsampling layer. Although our system surpasses SA-EEND and CB-EEND in overall performance, we observed that, even after fine-tuning on real datasets, the performance of the new EEND system on the CALLHOME dataset remains relatively limited, which is likely attributable to the domain mismatch between the simulated training data and the real test data [12,53]. As a future direction, we plan to incorporate transfer learning techniques into the EEND system’s training process to achieve better performance on real test data. With regard to system architecture, existing literature [54] has suggested that quantum convolutional neural networks may enhance speech signal processing performance by extracting more representative speech features; employing quantum convolutional neural networks in the downsampling process could therefore be a potential approach. Moreover, given the importance of privacy protection, particularly in audio information modeling [55], this emerges as an area warranting further research in the future.

Author Contributions

Conceptualization, software, writing—original draft, X.J.; validation, visualization, Y.C.; supervision, writing—review and editing, D.Q. and X.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China (No. 62171470), the Central Plains Science and Technology Innovation Leading Talent Project of Henan Province (No. 234200510019), and the General Project of the Henan Provincial Natural Science Foundation of China (No. 232300421240).

Data Availability Statement

The simulated datasets used in the current study are not publicly available as these datasets can be generated by the readers themselves based on the simulated settings in Section 4, and the real datasets used can be purchased online.

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
EEND           End-to-end neural diarization
DSC            Depthwise separable convolution
BSConv         Blueprint separable convolution
MFA            Multi-scale feature aggregation
BSAC-EEND      Blueprint separable subsampling and aggregate feature conformer-based EEND
ASR            Automatic speech recognition
SA-EEND        Self-attentive end-to-end neural diarization
BSConv-U       Unconstrained BSConv
BSConv-S       Subspace BSConv
MFA-Conformer  Multi-scale feature aggregation conformer
SAD            Speech activity detection
AHC            Agglomerative hierarchical clustering
VBx            Bayesian HMM clustering of x-vector sequences
BLSTM          Bidirectional long short-term memory
CB-EEND        Conformer-based EEND
CS-BSCU        Convolutional subsampling based on BSConv-U
CS-BSCS        Convolutional subsampling based on BSConv-S
MHSA           Multi-head self-attention
FNN            Feedforward neural network
TDNN           Time-delayed neural network
RIR            Room impulse response
DER            Diarization error rate
PCA            Principal component analysis

References

  1. Tranter, S.E.; Reynolds, D.A. An Overview of Automatic Speaker Diarization Systems. IEEE Trans. Audio Speech Lang. Process. 2006, 14, 1557–1565. [Google Scholar] [CrossRef]
  2. Anguera Miro, X.; Bozonnet, S.; Evans, N.; Fredouille, C.; Friedland, G.; Vinyals, O. Speaker Diarization: A Review of Recent Research. IEEE Trans. Audio Speech Lang. Process. 2012, 20, 356–370. [Google Scholar] [CrossRef]
  3. Kenny, P.; Reynolds, D.; Castaldo, F. Diarization of Telephone Conversations Using Factor Analysis. IEEE J. Sel. Top. Signal Process. 2010, 4, 1059–1070. [Google Scholar] [CrossRef]
  4. Anguera, X.; Wooters, C.; Hernando, J. Acoustic Beamforming for Speaker Diarization of Meetings. IEEE Trans. Audio Speech Lang. Process. 2007, 15, 2011–2022. [Google Scholar] [CrossRef]
  5. Zhu, X.; Barras, C.; Lamel, L.; Gauvain, J.L. Multi-Stage Speaker Diarization for Conference and Lecture Meetings. In Multimodal Technologies for Perception of Humans, Proceedings of the International Evaluation Workshops CLEAR 2007 and RT 2007, Baltimore, MD, USA, 8–11 May 2007; Stiefelhagen, R., Bowers, R., Fiscus, J., Eds.; Springer: Berlin, Germany, 2008; Volume 4625, pp. 533–542. [Google Scholar]
  6. Kanda, N.; Boeddeker, C.; Heitkaemper, J.; Fujita, Y.; Horiguchi, S.; Nagamatsu, K.; Haeb-Umbach, R. Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 1248–1252. [Google Scholar] [CrossRef]
  7. Zorila, C.; Boeddeker, C.; Doddipatla, R.; Haeb-Umbach, R. An investigation into the effectiveness of enhancement in ASR training and test for CHIME-5 dinner party transcription. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Singapore, 14–18 December 2019; pp. 47–53. [Google Scholar]
  8. Shum, S.H.; Dehak, N.; Dehak, R.; Glass, J.R. Unsupervised Methods for Speaker Diarization: An Integrated and Iterative Approach. IEEE Trans. Audio Speech Lang. Process. 2013, 21, 2015–2028. [Google Scholar] [CrossRef]
  9. Sell, G.; Garcia-Romero, D. Speaker diarization with PLDA I-vector scoring and unsupervised calibration. In Proceedings of the 2014 IEEE Workshop on Spoken Language Technology SLT 2014, South Lake Tahoe, NV, USA, 7–10 December 2014; pp. 413–417. [Google Scholar]
  10. Bullock, L.; Bredin, H.; Garcia-Perera, L.P. Overlap-aware diarization: Resegmentation using neural end-to-end overlapped speech detection. In Proceedings of the 2020 IEEE International Conference on Acoustics, Speech, and Signal Processing, Barcelona, Spain, 4–8 May 2020; pp. 7114–7118. [Google Scholar]
  11. Fujita, Y.; Kanda, N.; Horiguchi, S.; Xue, Y.; Nagamatsu, K.; Watanabe, S. End-to-end neural speaker diarization with self-attention. In Proceedings of the 2019 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU 2019), Singapore, 14–18 December 2019; pp. 296–303. [Google Scholar]
  12. Liu, Y.C.; Han, E.; Lee, C.; Stolcke, A. End-to-End Neural Diarization: From Transformer to Conformer. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3081–3085. [Google Scholar] [CrossRef]
  13. Zhang, Y.; Lv, Z.; Wu, H.; Zhang, S.; Hu, P.; Wu, Z.; Lee, H.Y.; Meng, H. MFA-Conformer: Multi-Scale Feature Aggregation Conformer for Automatic Speaker Verification. In Proceedings of the Interspeech 2022, ISCA, Incheon, Republic of Korea, 18–22 September 2022; pp. 306–310. [Google Scholar] [CrossRef]
  14. Gulati, A.; Qin, J.; Chiu, C.C.; Parmar, N.; Zhang, Y.; Yu, J.; Han, W.; Wang, S.; Zhang, Z.; Wu, Y.; et al. Conformer: Convolution-Augmented Transformer for Speech Recognition. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 5036–5040. [Google Scholar] [CrossRef]
  15. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. In Proceedings of the Advances in Neural Information Processing Systems 30 (NIPS 2017), Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; NIPS: La Jolla, CA, USA, 2017; Volume 30. [Google Scholar]
  16. Fujita, Y.; Kanda, N.; Horiguchi, S.; Nagamatsu, K.; Watanabe, S. End-to-End Neural Speaker Diarization with Permutation-Free Objectives. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 4300–4304. [Google Scholar] [CrossRef]
  17. Haase, D.; Amthor, M. Rethinking Depthwise Separable Convolutions: How Intra-Kernel Correlations Lead to Improved MobileNets. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 13–19 June 2020; pp. 14588–14597. [Google Scholar] [CrossRef]
  18. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications. arXiv 2017, arXiv:1704.04861. [Google Scholar]
  19. India, M.; Fonollosa, J.A.R.; Hernando, J. LSTM Neural Network-Based Speaker Segmentation Using Acoustic and Language Modelling. In Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech 2017), Vols 1–6: Situated Interaction, Stockholm, Sweden, 20–24 August 2017; pp. 2834–2838. [Google Scholar] [CrossRef]
  20. Park, T.J.; Han, K.J.; Huang, J.; He, X.; Zhou, B.; Georgiou, P.; Narayanan, S. Speaker Diarization with Lexical Information. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 391–395. [Google Scholar] [CrossRef]
  21. Senoussaoui, M.; Kenny, P.; Stafylakis, T.; Dumouchel, P. A Study of the Cosine Distance-Based Mean Shift for Telephone Speech Diarization. IEEE-ACM Trans. Audio Speech Lang. Process. 2014, 22, 217–227. [Google Scholar] [CrossRef]
  22. Sell, G.; Snyder, D.; McCree, A.; Garcia-Romero, D.; Villalba, J.; Maciejewski, M.; Manohar, V.; Dehak, N.; Povey, D.; Watanabe, S.; et al. Diarization Is Hard: Some Experiences and Lessons Learned for the JHU Team in the Inaugural DIHARD Challenge. In Proceedings of the 19th Annual Conference of the International Speech Communication Association (Interspeech 2018), Vols 1–6: Speech Research for Emerging Markets in Multilingual Societies, Hyderabad, India, 2–6 September 2018; pp. 2808–2812. [Google Scholar] [CrossRef]
  23. Snyder, D.; Garcia-Romero, D.; Sell, G.; Povey, D.; Khudanpur, S. X-Vectors: Robust dnn embeddings for speaker recognition. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5329–5333. [Google Scholar]
  24. Wang, Q.; Downey, C.; Wan, L.; Mansfield, P.A.; Moreno, I.L. Speaker Diarization With Lstm. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5239–5243. [Google Scholar]
  25. Zhang, A.; Wang, Q.; Zhu, Z.; Paisley, J.; Wang, C. Fully supervised speaker diarization. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6301–6305. [Google Scholar]
  26. Dimitriadis, D.; Fousek, P. Developing On-Line Speaker Diarization System. In Proceedings of the 18th Annual Conference of the International Speech Communication Association (Interspeech 2017), Vols 1–6: Situated Interaction, Stockholm, Sweden, 20–24 August 2017; pp. 2739–2743. [Google Scholar] [CrossRef]
  27. Garcia-Romero, D.; Snyder, D.; Sell, G.; Povey, D.; McCree, A. Speaker diarization using deep neural network embeddings. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 4930–4934. [Google Scholar]
  28. Maciejewski, M.; Snyder, D.; Manohar, V.; Dehak, N.; Khudanpur, S. Characterizing performance of speaker diarization systems on far-field speech using standard methods. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5244–5248. [Google Scholar]
  29. Raj, D.; Huang, Z.; Khudanpur, S. Multi-class spectral clustering with overlaps for speaker diarization. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 582–589. [Google Scholar] [CrossRef]
  30. Diez, M.; Burget, L.; Landini, F.; Cernocky, J. Analysis of Speaker Diarization Based on Bayesian HMM With Eigenvoice Priors. IEEE-ACM Trans. Audio Speech Lang. Process. 2020, 28, 355–368. [Google Scholar] [CrossRef]
  31. Landini, F.; Profant, J.; Diez, M.; Burget, L. Bayesian HMM Clustering of X-Vector Sequences (VBx) in Speaker Diarization: Theory, Implementation and Analysis on Standard Tasks. Comput. Speech Lang. 2022, 71, 101254. [Google Scholar] [CrossRef]
  32. Park, T.J.; Han, K.J.; Kumar, M.; Narayanan, S. Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap. IEEE Signal Process. Lett. 2020, 27, 381–385. [Google Scholar] [CrossRef]
  33. Li, Q.; Kreyssig, F.L.; Zhang, C.; Woodland, P.C. Discriminative neural clustering for speaker diarisation. In Proceedings of the 2021 IEEE Spoken Language Technology Workshop (SLT), Shenzhen, China, 19–22 January 2021; pp. 574–581. [Google Scholar] [CrossRef]
  34. Huang, Z.; Watanabe, S.; Fujita, Y.; Garcia, P.; Shao, Y.; Povey, D.; Khudanpur, S. Speaker Diarization with Region Proposal Network. In Proceedings of the ICASSP 2020—2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 4–8 May 2020; pp. 6514–6518. [Google Scholar] [CrossRef]
  35. Kinoshita, K.; Drude, L.; Delcroix, M.; Nakatani, T. Listening to each speaker one by one with recurrent selective hearing networks. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, AB, Canada, 15–20 April 2018; pp. 5064–5068. [Google Scholar]
  36. von Neumann, T.; Kinoshita, K.; Delcroix, M.; Araki, S.; Nakatani, T.; Haeb-Umbach, R. All-neural online source separation, counting, and diarization for meeting analysis. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 91–95. [Google Scholar]
  37. Ryant, N.; Church, K.; Cieri, C.; Cristia, A.; Du, J.; Ganapathy, S.; Liberman, M. The Second DIHARD Diarization Challenge: Dataset, Task, and Baselines. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 978–982. [Google Scholar] [CrossRef]
  38. Ryant, N.; Singh, P.; Krishnamohan, V.; Varma, R.; Church, K.; Cieri, C.; Du, J.; Ganapathy, S.; Liberman, M. The Third DIHARD Diarization Challenge. In Proceedings of the Interspeech 2021, Brno, Czech Republic, 30 August–3 September 2021; pp. 3570–3574. [Google Scholar] [CrossRef]
  39. Kolbaek, M.; Yu, D.; Tan, Z.H.; Jensen, J. Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. IEEE-ACM Trans. Audio Speech Lang. Process. 2017, 25, 1901–1913. [Google Scholar] [CrossRef]
  40. 2000 NIST Speaker Recognition Evaluation. Available online: https://catalog.ldc.upenn.edu/LDC2001S97 (accessed on 20 August 2023).
  41. Maekawa, K. Corpus of Spontaneous Japanese: Its design and evaluation. In Proceedings of the ISCA & IEEE Workshop on Spontaneous Speech Processing and Recognition, Tokyo, Japan, 13–16 April 2003. [Google Scholar]
  42. Park, D.S.; Chan, W.; Zhang, Y.; Chiu, C.C.; Zoph, B.; Cubuk, E.D.; Le, Q. SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 2613–2617. [Google Scholar] [CrossRef]
  43. Guo, J.; Li, Y.; Lin, W.; Chen, Y.; Li, J. Network Decoupling: From Regular to Depthwise Separable Convolutions. arXiv 2018, arXiv:1808.05517. [Google Scholar]
  44. Tang, Y.; Ding, G.; Huang, J.; He, X.; Zhou, B. Deep speaker embedding learning with multi-level pooling for text-independent speaker verification. In Proceedings of the 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6116–6120. [Google Scholar]
  45. Gao, Z.; Song, Y.; McLoughlin, I.; Li, P.; Jiang, Y.; Dai, L. Improving Aggregation and Loss Function for Better Embedding Learning in End-to-End Speaker Verification System. In Proceedings of the Interspeech 2019, Graz, Austria, 15–19 September 2019; pp. 361–365. [Google Scholar] [CrossRef]
  46. Desplanques, B.; Thienpondt, J.; Demuynck, K. ECAPA-TDNN: Emphasized Channel Attention, Propagation and Aggregation in TDNN Based Speaker Verification. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 3830–3834. [Google Scholar] [CrossRef]
  47. Horiguchi, S.; Fujita, Y.; Watanabe, S.; Xue, Y.; Nagamatsu, K. End-to-End Speaker Diarization for an Unknown Number of Speakers with Encoder-Decoder Based Attractors. In Proceedings of the Interspeech 2020, Shanghai, China, 25–29 October 2020; pp. 269–273. [Google Scholar] [CrossRef]
  48. Horiguchi, S.; Fujita, Y.; Watanabe, S.; Xue, Y.; Garcia, P. Encoder-Decoder Based Attractors for End-to-End Neural Diarization. IEEE/ACM Trans. Audio Speech Lang. Process. 2022, 30, 1493–1507. [Google Scholar] [CrossRef]
  49. Povey, D.; Ghoshal, A.; Boulianne, G.; Burget, L.; Glembek, O.; Goel, N.; Hannemann, M.; Motlíček, P.; Qian, Y.; Schwarz, P.; et al. The Kaldi Speech Recognition Toolkit. In Proceedings of the IEEE 2011 Workshop on Automatic Speech Recognition and Understanding, Waikoloa, HI, USA, 11–15 December 2011. [Google Scholar]
  50. Snyder, D.; Chen, G.; Povey, D. MUSAN: A Music, Speech, and Noise Corpus. arXiv 2015, arXiv:1510.08484. [Google Scholar]
  51. Ko, T.; Peddinti, V.; Povey, D.; Seltzer, M.L.; Khudanpur, S. A study on data augmentation of reverberant speech for robust speech recognition. In Proceedings of the 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), New Orleans, LA, USA, 5–9 March 2017; pp. 5220–5224. [Google Scholar]
  52. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv 2017, arXiv:1412.6980. [Google Scholar]
  53. Park, T.J.; Kanda, N.; Dimitriadis, D.; Han, K.J.; Watanabe, S.; Narayanan, S. A Review of Speaker Diarization: Recent Advances with Deep Learning. arXiv 2021, arXiv:2101.09624. [Google Scholar] [CrossRef]
  54. Yang, C.H.H.; Qi, J.; Chen, S.Y.C.; Chen, P.Y.; Siniscalchi, S.M.; Ma, X.; Lee, C.H. Decentralizing Feature Extraction with Quantum Convolutional Neural Network for Automatic Speech Recognition. In Proceedings of the ICASSP 2021–2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Toronto, ON, Canada, 6–11 June 2021; pp. 6523–6527. [Google Scholar] [CrossRef]
  55. Yang, C.H.H.; Qi, J.; Siniscalchi, S.M.; Lee, C.H. An Ensemble Teacher-Student Learning Approach with Poisson Sub-sampling to Differential Privacy Preserving Speech Recognition. In Proceedings of the 2022 13th International Symposium on Chinese Spoken Language Processing (ISCSLP), Singapore, 11–14 December 2022; pp. 1–5. [Google Scholar] [CrossRef]
Figure 1. BSAC-EEND model.
Figure 2. The structure of DSC and BSConv.
Figure 3. Subsampling layer based on BSConv.
Figure 4. Visualization of model extraction features.
Table 1. Simulated dataset.

Dataset   Split   Spk   Mixtures 1   β    Overlap Ratio
Sim2spk   train   2     100,000      2    34.2%
          test    2     500          2    34.4%
          test    2     500          3    27.2%
          test    2     500          5    19.1%
Sim3spk   train   3     100,000      5    34.2%
          test    3     500          5    34.6%
          test    3     500          7    27.4%
          test    3     500          11   19.2%

1 Mixtures: total number of audios.
Table 2. CALLHOME dataset.

Dataset         Split    Spk   Mixtures 1   Overlap Ratio
CALLHOME-2spk   Part 1   2     155          14.0%
                Part 2   2     148          13.1%
CALLHOME-3spk   Part 1   3     61           19.2%
                Part 2   3     74           17.0%

1 Mixtures: total number of audios.
Table 3. Improved convolutional subsampling evaluation results (DER).

Dataset                      Sim2spk                  CALLHOME-2spk   Sim3spk                  CALLHOME-3spk
Overlap Ratio                34.4%   27.3%   19.1%    13.1%           34.6%   27.4%   19.2%    17.0%
x-vector (TDNN) + AHC [47]   28.77   24.46   19.78    11.53           31.78   26.06   19.55    19.01
SA-EEND [11]                 4.56    4.50    3.85     9.54            8.69    7.64    6.92     14.00
CB-EEND                      3.88    3.59    3.54     9.32            6.31    5.62    5.35     14.23
CB-EEND + CS-BSCS            3.56    3.05    3.01     8.98            5.23    4.28    3.88     14.05
CB-EEND + CS-BSCU            3.36    3.28    3.43     9.13            5.00    4.43    3.81     14.02
Table 4. The evaluation results of the MFA structure (DER).

Dataset         Sim2spk                  CALLHOME-2spk   Sim3spk                  CALLHOME-3spk
Overlap Ratio   34.4%   27.3%   19.1%    13.1%           34.6%   27.4%   19.2%    17.0%
SA-EEND [11]    4.56    4.50    3.85     9.54            8.69    7.64    6.92     14.00
CB-EEND         3.88    3.59    3.54     9.32            6.31    5.62    5.35     14.23
CB-EEND + MFA   3.54    3.29    3.38     9.01            5.53    4.87    4.42     13.90
Table 5. Evaluation results for BSAC-EEND (DER).

Dataset                      Sim2spk                  CALLHOME-2spk   Sim3spk                  CALLHOME-3spk
Overlap Ratio                34.4%   27.3%   19.1%    13.1%           34.6%   27.4%   19.2%    17.0%
x-vector (TDNN) + AHC [47]   28.77   24.46   19.78    11.53           31.78   26.06   19.55    19.01
SA-EEND [11]                 4.56    4.50    3.85     9.54            8.69    7.64    6.92     14.00
CB-EEND                      3.88    3.59    3.54     9.32            6.31    5.62    5.35     14.23
BSAC-EEND (CS-BSCS)          3.07    3.01    2.66     8.59            6.15    5.42    5.39     15.01
BSAC-EEND (CS-BSCU)          3.25    3.29    2.98     8.75            5.21    4.38    4.22     13.69
EEND-EDA (Chronol) [47]      3.07    2.74    3.04     8.24            13.02   11.65   10.41    15.86
EEND-EDA (Shuffled) [47]     2.69    2.44    2.60     8.07            8.38    7.06    6.21     13.92
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

