Emotional Variability Analysis Based I-Vector for Speaker Verification in Under-Stress Conditions
Abstract
1. Introduction
2. Related Works
3. Materials and Methods
3.1. EVA-Based I-Vector Framework
3.1.1. Subspace Estimation
- The eigenvoice matrix V is trained by assuming E and D are zero.
- The eigenemotion matrix E is trained using a given estimate of V, and by assuming D is zero.
- The residual matrix D is trained using given estimates of V and E.
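The three-stage order above can be illustrated with a deliberately simplified toy. Real JFA/EVA training runs EM over Baum–Welch statistics, but the same alternation — fit V first, then E on what V leaves unexplained, then D on the remainder — can be mimicked with SVD-based least squares on toy supervectors (all sizes here are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
F, Rv, Re = 50, 5, 3                     # toy sizes; the paper uses 20,000 / 300 / 100
S = rng.standard_normal((200, F))        # toy centered supervectors, one per utterance

def top_directions(X, r):
    # top-r principal directions of the rows of X, as an (F, r) orthonormal basis
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return Vt[:r].T

# Stage 1: estimate the eigenvoice matrix V with E and D assumed zero.
V = top_directions(S, Rv)

# Stage 2: estimate the eigenemotion matrix E from the residual left after
# removing the speaker part V @ y (D still assumed zero).
y = S @ V                                # least-squares speaker factors (V orthonormal)
R1 = S - y @ V.T
E = top_directions(R1, Re)

# Stage 3: estimate the residual term D from what V and E leave unexplained.
x = R1 @ E                               # least-squares emotion factors
R2 = R1 - x @ E.T
D_diag = R2.std(axis=0)                  # toy stand-in for D's per-dimension scale
```

Each stage shrinks the unexplained energy, which is the point of the sequential estimation: the speaker subspace absorbs the dominant variability, the emotion subspace the next, and D mops up the rest.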
3.1.2. Linear Score Computation
3.1.3. I-Vector Technique
3.2. Deep Discriminant Analysis
3.3. Scoring Method
3.4. Decision Evaluation Metrics
4. Experimental Setup
4.1. Dataset
4.2. Proposed System Setting
- Matrix V: 20,000 by 300 (300 eigenvoice components)
- Vector y: 300 by 1 (300 speaker factors)
- Matrix E: 20,000 by 100 (100 eigenemotion components)
- Vector x: 100 by 1 (100 emotion factors)
- Matrix D: 20,000 by 20,000 (20,000 residual components)
- Vector z: 20,000 by 1 (20,000 speaker-specific residual components)
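The sizes above describe the EVA decomposition of a speaker- and emotion-dependent supervector, s = m + Vy + Ex + Dz. A minimal NumPy shape check (toy zero arrays; D is stored here as its diagonal — a common JFA convention, and a dense 20,000 × 20,000 array would be impractical in memory):

```python
import numpy as np

F = 20_000                              # supervector dimension
m = np.zeros(F)                         # UBM mean supervector (toy zeros)
V = np.zeros((F, 300)); y = np.zeros(300)   # eigenvoice matrix / speaker factors
E = np.zeros((F, 100)); x = np.zeros(100)   # eigenemotion matrix / emotion factors
d = np.zeros(F);        z = np.zeros(F)     # diag(D) / speaker-specific residuals

s = m + V @ y + E @ x + d * z           # speaker- and emotion-dependent supervector
assert s.shape == (F,)
```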
4.3. Baseline System Setting
5. Results and Discussion
5.1. Evaluation Result
5.2. Ablation Experiment
5.2.1. The Effect of the Number of Mixtures in the GMM-UBM System
5.2.2. The Effect of the Number of EVA-Based I-Vector Dimensions
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
Abbreviations
Abbreviation | Definition |
---|---|
CSS | Cosine Similarity Scoring |
DDA | Deep Discriminant Analysis |
EDS | Euclidean Distance Scoring |
EER | Equal Error Rate |
EVA | Emotional Variability Analysis |
GMM | Gaussian Mixture Model |
JFA | Joint Factor Analysis |
KL | Kullback–Leibler |
LDA | Linear Discriminant Analysis |
LDC | Linguistic Data Consortium |
LPCC | Linear Predictive Cepstral Coefficients |
MAE | Mean Absolute Error |
MDE | Minimum Divergence Estimation |
MDS | Mahalanobis Distance Scoring |
MFCC | Mel-Frequency Cepstral Coefficients |
MSE | Mean Squared Error |
PLDA | Probabilistic Linear Discriminant Analysis |
ReLU | Rectified Linear Activation Unit |
SAD | Speech Activity Detection |
SGD | Stochastic Gradient Descent |
SUSAS | Speech Under Simulated and Actual Stress |
t-SNE | t-Distributed Stochastic Neighbor Embedding |
TVM | Total-Variability Model |
UBM | Universal Background Model |
References
- Algabri, M.; Mathkour, H.; Bencherif, M.A.; Alsulaiman, A.; Mekhtiche, M.A. Automatic Speaker Recognition for Mobile Forensic Applications. Mob. Inf. Syst. 2017, 2017, 6986391. [Google Scholar] [CrossRef]
- Singh, N.; Khan, R.A.; Shree, R. Applications of Speaker Recognition. In Proceedings of the International Conference on Modelling, Optimisation and Computing (ICMOC), Procedia Engineering, Kumarakoil, India, 10–11 April 2012; Volume 38, pp. 3122–3126. [Google Scholar]
- Prasetio, B.H.; Tamura, H.; Tanno, K. Semi-Supervised Deep Time-Delay Embedded Clustering for Stress Speech Analysis. Electronics 2019, 8, 1263. [Google Scholar] [CrossRef] [Green Version]
- Prasetio, B.H.; Tamura, H.; Tanno, K. A Deep Time-delay Embedded Algorithm for Unsupervised Stress Speech Clustering. In Proceedings of the IEEE International Conference on Systems, Man, and Cybernetics (SMC), Bari, Italy, 6–9 October 2019; pp. 1193–1198. [Google Scholar]
- Baruah, U.; Laskar, R.H.; Purkayashtha, B. Speaker Verification Systems: A Comprehensive Review. In Smart Computing Paradigms: New Progresses and Challenges; Elçi, A., Sa, P., Modi, C., Olague, G., Sahoo, M., Bakshi, S., Eds.; Advances in Intelligent Systems and Computing; Springer: Singapore, 2019; Volume 766, pp. 195–207. [Google Scholar]
- Buruck, G.; Wendsche, J.; Melzer, M.; Strobel, A.; Dorfel, D. Acute psychosocial stress and emotion regulation skills modulate empathic reactions to pain in others. Front. Psychol. 2014, 5, 1–16. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Smith, R.; Lane, R.D. Unconscious emotion: A cognitive neuroscientific perspective. Neurosci. Biobehav. Rev. 2016, 69, 216–238. [Google Scholar] [CrossRef] [PubMed]
- Joels, M.; Baram, T.Z. The neuro-symphony of stress. Nat. Rev. Neurosci. 2009, 10, 459–466. [Google Scholar] [CrossRef] [PubMed]
- Gordan, R.; Gwathmey, J.K.; Lai-Hua, X. Autonomic and endocrine control of cardiovascular function. World J. Cardiol. 2015, 7, 204–214. [Google Scholar] [CrossRef] [PubMed]
- Hansen, J.H.L.; Patil, S. Speech Under Stress: Analysis, Modeling and Recognition. In Speaker Classification I. Lecture Notes in Computer Science; Müller, C., Ed.; Springer: Berlin, Germany, 2007; Volume 4343, pp. 108–137. [Google Scholar]
- Zhang, Z. Mechanics of human voice production and control. J. Acoust. Soc. Am. 2016, 140, 2614–2635. [Google Scholar] [CrossRef] [PubMed] [Green Version]
- Wu, W.; Zheng, T.F.; Xu, M.; Bao, H. Study on Speaker Verification on Emotional Speech. In Proceedings of the INTERSPEECH, Pittsburgh, PA, USA, 17–21 September 2007; pp. 2102–2105. [Google Scholar]
- Shahin, I. Employing Emotion Cues to Verify Speakers in Emotional Talking Environments. J. Intell. Syst. 2015, 25, 3–17. [Google Scholar] [CrossRef] [Green Version]
- Shahin, I.; Nassif, A.B. Three-stage speaker verification architecture in emotional talking environments. Int. J. Speech Technol. 2018, 21, 915–930. [Google Scholar] [CrossRef] [Green Version]
- Bao, H.; Zheng, T.F.; Xu, M. Emotion Attribute Projection for Speaker Recognition on Emotional Speech. In Proceedings of the INTERSPEECH, Pittsburgh, PA, USA, 17–21 September 2007; pp. 758–761. [Google Scholar]
- Dehak, N.; Dehak, R.; Kenny, P.; Brummer, N.; Ouellet, P. Support vector machines versus fast scoring in the low-dimensional total variability space for speaker verification. In Proceedings of the INTERSPEECH, Brighton, UK, 6–10 September 2009; pp. 1559–1562. [Google Scholar]
- Dehak, N.; Kenny, P.; Dehak, R.; Dumouchel, P.; Ouellet, P. Front–end factor analysis for speaker verification. IEEE Trans. Audio Speech Lang. Process. 2011, 19, 788–798. [Google Scholar] [CrossRef]
- Umesh, S. Studies on inter-speaker variability in speech and its application in automatic speech recognition. Sādhanā 2011, 36, 853–883. [Google Scholar] [CrossRef] [Green Version]
- Godin, K.W.; Hansen, J.H.L. Physical task stress and speaker variability in voice quality. EURASIP J. Audio Speech Music Process. 2015, 29, 1–13. [Google Scholar] [CrossRef] [Green Version]
- Prasetio, B.H.; Tamura, H.; Tanno, K. A Study on Speaker Identification Approach by Feature Matching Algorithm using Pitch and Mel Frequency Cepstral Coefficients. In Proceedings of the International Conference on Artificial Life and Robotics (ICAROB), Beppu, Japan, 10–13 January 2019. [Google Scholar]
- Mansour, A.; Lachiri, Z. Speaker Recognition in Emotional Context. Int. J. Comput. Sci. Commun. Inf. Technol. (CSCIT) 2015, 2, 1–4. [Google Scholar]
- Xu, S.; Liu, Y.; Liu, X. Speaker Recognition and Speech Emotion Recognition Based on GMM. In Proceedings of the International Conference on Electric and Electronics (EEIC), Hong Kong, China, 24–25 December 2013; pp. 434–436. [Google Scholar]
- Ghiurcau, M.V.; Rusu, C.; Astola, J. Speaker Recognition in an Emotional Environment. In Proceedings of the Signal Processing and Applied Mathematics for Electronics and Communications (SPAMEC), Cluj-Napoca, Romania, 26–28 August 2011; pp. 81–84. [Google Scholar]
- Bie, F.; Wang, D.; Zheng, T.F.; Chen, R. Emotional speaker verification with linear adaptation. In Proceedings of the IEEE China Summit and International Conference on Signal and Information Processing, Beijing, China, 6–10 July 2013. [Google Scholar]
- Bie, F.; Wang, D.; Zheng, T.F.; Tejedor, J.; Chen, R. Emotional Adaptive Training for Speaker Verification. In Proceedings of the Asia-Pacific Signal and Information Processing Association Annual Summit and Conference, Kaohsiung, Taiwan, 29 October–1 November 2013. [Google Scholar]
- Chen, L.; Yang, Y. Applying Emotional Factor Analysis and I-Vector to Emotional Speaker Recognition. In CCBR 2011: Biometric Recognition; Sun, Z., Lai, J., Chen, X., Tan, T., Eds.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2011; p. 27098. [Google Scholar]
- Al-Kaltakchi, M.T.S.; Woo, W.K.; Chambers, J.A. Comparison of I-vector and GMM-UBM Approaches to Speaker Identification with TIMIT and NIST 2008 Databases in Challenging Environments. In Proceedings of the 25th European Signal Processing Conference (EUSIPCO), Kos Island, Greece, 28 August–2 September 2017; pp. 274–281. [Google Scholar]
- Misra, A.; Hansen, J. Maximum Likelihood Linear Transformation for Unsupervised Domain Adaptation in Speaker Verification. IEEE/ACM Trans. Audio Speech Lang. Process. 2018, 26, 1549–1558. [Google Scholar] [CrossRef]
- Kenny, P.; Ouellet, P.; Dehak, N.; Gupta, V.; Dumouchel, P. A Study of Inter-Speaker Variability in Speaker Verification. IEEE Trans. Audio Speech Lang. Process. 2008, 16, 980–988. [Google Scholar] [CrossRef] [Green Version]
- Glembek, O.; Burget, L.; Dehak, N.; Brummer, N.; Kenny, P. Comparison of scoring methods used in speaker recognition with Joint Factor Analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing, Taipei, Taiwan, 19–24 April 2009; pp. 4057–4060. [Google Scholar]
- Rao, K.S.; Yadav, J.; Sarkar, S.; Koolagudi, S.G.; Vuppala, A.K. Neural Network based Feature Transformation for Emotional Independent Speaker Identification. Int. J. Speech Technol. 2012, 15, 335–349. [Google Scholar]
- Wang, S.; Huang, Z.; Qian, Y.; Yu, K. Deep Discriminant Analysis for i-vector Based Robust Speaker Recognition. In Proceedings of the International Symposium on Chinese Spoken Language Processing (ISCSLP), Taipei, Taiwan, 26–29 November 2018; pp. 195–199. [Google Scholar]
- Prasetio, B.H.; Tamura, H.; Tanno, K. Embedded Discriminant Analysis based Speech Activity Detection for Unsupervised Stress Speech Clustering. In Proceedings of the International Conference on Imaging, Vision & Pattern Recognition (IVPR), Kitakyushu, Japan, 26–29 August 2020. [Google Scholar]
- Huang, Z.; Wang, S.; Yu, K. Angular Softmax for Short-Duration Text-independent Speaker Verification. In Proceedings of the INTERSPEECH, Hyderabad, India, 2–6 September 2018; pp. 3623–3627. [Google Scholar]
- Wang, S.; Yang, Y.; Wang, T.; Qian, Y.; Yu, K. Knowledge Distillation for Small Foot-print Deep Speaker Embedding. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 6021–6025. [Google Scholar]
- Salmun, I.; Opher, I.; Lapidot, I. On the Use of PLDA i-vector Scoring for Clustering Short Segments. In Proceedings of the Odyssey, Bilbao, Spain, 21–24 June 2016; pp. 407–414. [Google Scholar]
- Bousquet, P.; Matrouf, D.; Bonastre, J. Intersession compensation and scoring methods in the i-vectors space for speaker recognition. In Proceedings of the INTERSPEECH, Florence, Italy, 27–31 August 2011; pp. 485–488. [Google Scholar]
- Lei, Z.; Wan, Y.; Luo, J.; Yang, Y. Mahalanobis Metric Scoring Learned from Weighted Pairwise Constraints in I-vector Speaker Recognition System. In Proceedings of the INTERSPEECH, San Francisco, CA, USA, 8–12 September 2016; pp. 1815–1819. [Google Scholar]
- Hansen, J.H.L. Composer, SUSAS LDC99S78. In Sound Recording; Web Download; Linguistic Data Consortium: Philadelphia, PA, USA, 1999. [Google Scholar]
- Hansen, J.H.L. Composer, SUSAS Transcript LDC99T33. In Sound Recording; Linguistic Data Consortium: Philadelphia, PA, USA, 1999. [Google Scholar]
Layer | Number of Neurons | Non-Linear Function |
---|---|---|
Input | 600 | ReLU |
Hidden | 600 | ReLU + BatchNorm |
Embedding | 400 | - |
Loss | - | Softmax loss + center loss |
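The 600 → 600 → 400 layout in the table can be sketched as a plain NumPy forward pass. The weights here are random toys, the softmax + center loss training objective is not shown, and batch normalization is a bare batch-statistics version without learned scale/shift:

```python
import numpy as np

rng = np.random.default_rng(1)

def relu(a):
    return np.maximum(a, 0.0)

def batch_norm(a, eps=1e-5):
    # normalize each unit over the batch (toy: no learned gamma/beta)
    return (a - a.mean(axis=0)) / np.sqrt(a.var(axis=0) + eps)

# toy weights for the 600 -> 600 -> 400 layout
W1 = rng.standard_normal((600, 600)) * 0.04
W2 = rng.standard_normal((600, 400)) * 0.04

def embed(ivectors):
    """Map a batch of 600-d i-vectors to 400-d discriminant embeddings."""
    h = relu(ivectors @ W1)     # hidden layer: 600 units, ReLU
    h = batch_norm(h)           # ... followed by BatchNorm, per the table
    return h @ W2               # embedding layer: 400-d, linear
```

At training time the 400-d embeddings would feed the softmax and center losses; at test time they are compared directly with one of the scoring methods.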
(a)

Class | Number of Utterances |
---|---|
High stress | 620 |
Low stress | 620 |
Neutral | 620 |
Angry | 436 |
Soft | 420 |

(b)

Data ID | Number of Utterances |
---|---|
1 | 98 |
2 | 102 |
3 | 94 |
4 | 118 |
5 | 107 |
6 | 97 |
Method | CSS | EDS | MDS |
---|---|---|---|
Baseline system | 6.78 | 6.19 | 7.63 |
Proposed system | 4.51 | 4.37 | 4.08 |
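The three scoring methods compared above can be sketched directly. Given two vectors a and b (i-vectors or embeddings), CSS is a similarity (higher = more alike) while EDS and MDS are distances (lower = more alike); MDS additionally needs an inverse covariance, which would be estimated from training data:

```python
import numpy as np

def css(a, b):
    # Cosine Similarity Scoring
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def eds(a, b):
    # Euclidean Distance Scoring
    return float(np.linalg.norm(a - b))

def mds(a, b, cov_inv):
    # Mahalanobis Distance Scoring, with a precomputed inverse covariance
    d = a - b
    return float(np.sqrt(d @ cov_inv @ d))
```

With an identity covariance, MDS reduces to EDS; the gain MDS shows in the table comes from whitening the comparison by the learned covariance structure.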
Scoring Method | 600 dims | 500 dims | 400 dims |
---|---|---|---|
CSS | 4.36 | 4.87 | 5.21 |
EDS | 4.53 | 4.54 | 4.56 |
MDS | 4.11 | 4.12 | 4.12 |
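The equal error rate used to report these results is the operating point where the false-acceptance rate equals the false-rejection rate. A simple threshold-sweep sketch (not the paper's implementation), assuming higher score means accept:

```python
import numpy as np

def equal_error_rate(target_scores, impostor_scores):
    """Sweep candidate thresholds and return the error rate where the
    false-acceptance and false-rejection rates are (most nearly) equal."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    best_gap, best_eer = np.inf, None
    for t in thresholds:
        far = np.mean(impostor_scores >= t)   # impostors wrongly accepted
        frr = np.mean(target_scores < t)      # targets wrongly rejected
        if abs(far - frr) < best_gap:
            best_gap, best_eer = abs(far - frr), (far + frr) / 2
    return best_eer
```

For a distance score such as EDS or MDS one would negate the scores (or flip the comparisons) before applying this, since smaller distances mean accept.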
© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
Share and Cite
Prasetio, B.H.; Tamura, H.; Tanno, K. Emotional Variability Analysis Based I-Vector for Speaker Verification in Under-Stress Conditions. Electronics 2020, 9, 1420. https://doi.org/10.3390/electronics9091420