Article
Peer-Review Record

Multi-Level Attention-Based Categorical Emotion Recognition Using Modulation-Filtered Cochleagram

Appl. Sci. 2023, 13(11), 6749; https://doi.org/10.3390/app13116749
by Zhichao Peng 1,*, Wenhua He 1, Yongwei Li 2, Yegang Du 3 and Jianwu Dang 4,5,*
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 17 April 2023 / Revised: 29 May 2023 / Accepted: 30 May 2023 / Published: 1 June 2023
(This article belongs to the Special Issue Advanced Technology in Speech and Acoustic Signal Processing)

Round 1

Reviewer 1 Report

The paper proposes a multi-level attention-based framework for categorical emotion recognition using modulation-filtered cochleagram features. The authors claim that their approach outperforms the baseline model on unweighted accuracy and addresses the variability in emotional characteristics across time. While the paper has some strengths, it also has several weaknesses that need to be addressed.

Strengths:

- The paper proposes a novel approach for categorical emotion recognition using modulation-filtered cochleagram features and multi-level attention.
- The authors provide a detailed description of their proposed framework, including the modulation-filtered cochleagram feature extraction process and the multi-level attention module.
- The paper presents experimental results on the Interactive Emotional Dyadic Motion Capture Database (IEMOCAP) and shows that their approach outperforms the baseline model on unweighted accuracy.

Weaknesses:

- The paper lacks a clear motivation for using modulation-filtered cochleagram features and multi-level attention. The authors briefly mention that modulation-filtered cochleagram features capture temporal modulation cues and that multi-level attention captures emotional saliency maps, but they do not explain why these features and attention mechanisms are necessary for categorical emotion recognition.
- The paper does not provide a thorough comparison with existing approaches for categorical emotion recognition. The authors only compare their approach with a baseline model and do not compare it with state-of-the-art approaches using cochleagrams, such as the one described in "Recognition of emotional vocalizations of canine".
- The paper lacks a discussion of the limitations of the approach. The authors do not discuss the potential drawbacks of using modulation-filtered cochleagram features and multi-level attention, or the situations in which their approach may not be effective.
- The paper does not provide a clear explanation of the significance of the results. While the authors show that their approach outperforms the baseline model on unweighted accuracy, they do not explain why this improvement is significant or how it compares to existing approaches.

Conclusion: The paper proposes a novel approach for categorical emotion recognition using modulation-filtered cochleagram features and multi-level attention, but it lacks a clear motivation for these features and attention mechanisms, a thorough comparison with existing approaches, a discussion of the limitations of the approach, and a clear explanation of the significance of the results. The authors should address these weaknesses in future revisions of the paper.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

Reviewer’s Report on the manuscript entitled:

Multi-level attention-based categorical emotion recognition using modulation-filtered cochleagram

 

The authors proposed an emotion recognition framework that applies a multi-level attention network to identify emotions from the modulation-filtered cochleagram. Using experimental datasets, they showed that the modulation-filtered cochleagram improves the prediction performance for categorical emotion compared to the other evaluated features.

 

Line 21. Please provide the full name for IEMOCAP.

 

The related work in the Introduction can be improved:

Line 89. In addition to speech signals and images, electroencephalogram (EEG) signals are also widely used for emotion classification.

Kim and Choi [https://doi.org/10.3390/s20236727] utilized a long short-term memory (LSTM) network for EEG-based emotion classification.

Ghosh et al. [https://doi.org/10.1109/JSEN.2023.3237383] utilized k-nearest neighbors and LSTM for denoising EEG signals prior to applications such as emotion classification.

A review of electrical source imaging and EEG, with applications to interpreting emotions from facial expressions, is provided by Zorzos et al. [https://doi.org/10.3390/signals2030024].

Therefore, I suggest adding a paragraph on EEG-based emotion classification that briefly discusses the papers above.

 

Please add a flowchart of your methodology to the method section. This is different from Figures 2, 3, and 4. It should resemble a graphical abstract that shows the input, the different methods and processes (using text, graphics, etc.), and the output.

 

Line 155 and Equation (3). You used two different symbols for the Hilbert transform. Please be consistent with symbols and notation. The symbol f_n on line 170 is another example of inconsistent notation (italic vs. non-italic).

 

Figure 1. Please increase the font size of the numbers on the axes and the color bar. Also, what are their labels and units?

 

Line 273. The abbreviation LSTM is not defined. LSTM also needs some recent references; for example, Ghosh et al., suggested above, describe LSTM in detail for signal processing and can be cited here. Please also add an acronym table at the end of the manuscript listing all the abbreviations used.

 

Line 300. Please mathematically describe what UA is and how you obtain it.
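For reference, UA (unweighted accuracy, also called unweighted average recall) is conventionally defined as the mean of the per-class recalls. Assuming C emotion classes, with N_c test utterances belonging to class c and TP_c of them correctly classified, a standard formulation (which the authors should confirm matches their own computation) is

\mathrm{UA} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{N_c}.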

 

Figures 7 and 8. Please increase the font size. Please improve the figure quality.

 

Table 3. What about other metrics, such as precision, recall, overall accuracy, F1 score, RMSE, MAE, etc.? I suggest including at least a couple more evaluation metrics in this table. Also, please make sure to define whichever metrics you choose to include.
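For instance, the standard per-class definitions (given here only as a reference sketch; the authors should state the exact formulas they adopt) are

\mathrm{Precision}_c = \frac{TP_c}{TP_c + FP_c}, \qquad \mathrm{Recall}_c = \frac{TP_c}{TP_c + FN_c}, \qquad \mathrm{F1}_c = \frac{2\,\mathrm{Precision}_c\,\mathrm{Recall}_c}{\mathrm{Precision}_c + \mathrm{Recall}_c},

which can then be macro-averaged over the four emotion categories.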

Please include the limitations of this study in the conclusion section.

 

Please carefully check the references to ensure they are correct and have a consistent format and style according to the MDPI guidelines.

 

Thank you for your contribution,

Regards,


Abbreviations should be defined the first time they appear, both in the abstract and in the body of the manuscript.

Please carefully check and correct mathematical symbols, punctuation, typos, etc.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 3 Report

In this study, the authors proposed a multi-level attention-based categorical emotion recognition method using the modulation-filtered cochleagram. Although the performance seems promising, some major points should be addressed, as follows:

1. There must be external validation data to evaluate the performance of the models on unseen data.

2. Uncertainties of models should be reported.

3. When comparing performance results among methods/models, the authors should perform statistical tests to determine whether the differences are significant.

4. How did the authors perform hyperparameter tuning of their models?

5. Model interpretation should be performed and analyzed.

6. More discussion should be added.

7. Deep learning is well known and has been used in previous studies, e.g., PMID: 36174933 and PMID: 34730875. Therefore, the authors are encouraged to refer to more such works in this description to attract a broader readership.

8. Quality of figures should be improved.

9. Overall, English writing should be improved.

10. Source codes should be provided for replicating the study.

11. Why did this study include only four emotional categories from the dataset?


Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 2

Reviewer 1 Report

The comparison with other works presenting emotion recognition results on the IEMOCAP dataset is insufficient. The dataset has been used widely; however, the authors have presented only a few handpicked references for comparison. The authors should focus on the most recently published references for a more comprehensive comparison, such as:

- Akalya Devi, C., Karthika Renuka, D., Pooventhiran, G., Harish, D., Yadav, S., & Thirunarayan, K. (2023). Towards enhancing emotion recognition via multimodal framework. Journal of Intelligent and Fuzzy Systems, 44(2), 2455-2470. doi:10.3233/JIFS-220280
- Bhanusree, Y., Kumar, S. S., & Rao, A. K. (2023). Time-distributed attention-layered convolution neural network with ensemble learning using random forest classifier for speech emotion recognition. Journal of Information and Communication Technology, 22(1), 49-76. doi:10.32890/jict2023.22.1.3
- Chattopadhyay, S., Dey, A., Singh, P. K., Ahmadian, A., & Sarkar, R. (2023). A feature selection model for speech emotion recognition using clustering-based population generation with hybrid of equilibrium optimizer and atom search optimization algorithm. Multimedia Tools and Applications, 82(7), 9693-9726. doi:10.1007/s11042-021-11839-3
- Chauhan, K., Sharma, K. K., & Varma, T. (2023). Improved speech emotion recognition using channel-wise global head pooling (CwGHP). Circuits, Systems, and Signal Processing. doi:10.1007/s00034-023-02367-6
- Chen, J., Sun, C., Zhang, S., & Zeng, J. (2023). Cross-modal dynamic sentiment annotation for speech sentiment analysis. Computers and Electrical Engineering, 106. doi:10.1016/j.compeleceng.2023.108598
- Feng, L., Liu, L., Liu, S., Zhou, J., Yang, H., & Yang, J. (2023). Multimodal speech emotion recognition based on multi-scale MFCCs and multi-view attention mechanism. Multimedia Tools and Applications. doi:10.1007/s11042-023-14600-0
- Kapoor, S., & Kumar, T. (2023). A novel approach to detect instant emotion change through spectral variation in single frequency filtering spectrogram of each pitch cycle. Multimedia Tools and Applications, 82(6), 9413-9429. doi:10.1007/s11042-022-13731-0
- Le, H., Lee, G., Kim, S., Kim, S., & Yang, H. (2023). Multi-label multimodal emotion recognition with transformer-based fusion and emotion-level representation learning. IEEE Access, 11, 14742-14751. doi:10.1109/ACCESS.2023.3244390
- Li, J., Wang, X., Lv, G., & Zeng, Z. (2023). GA2MIF: Graph and attention based two-stage multi-source information fusion for conversational emotion detection. IEEE Transactions on Affective Computing, 1-14. doi:10.1109/TAFFC.2023.3261279
- Liu, L., Liu, W., & Feng, L. (2023). SDTF-net: Static and dynamic time–frequency network for speech emotion recognition. Speech Communication, 148, 1-8. doi:10.1016/j.specom.2023.01.008
- Prabhakar, G. A., Basel, B., Dutta, A., & Rama Rao, C. V. (2023). Multichannel CNN-BLSTM architecture for speech emotion recognition system by fusion of magnitude and phase spectral features using DCCA for consumer applications. IEEE Transactions on Consumer Electronics, 69(2), 226-235. doi:10.1109/TCE.2023.3236972
- Tu, Z., Liu, B., Zhao, W., Yan, R., & Zou, Y. (2023). A feature fusion model with data augmentation for speech emotion recognition. Applied Sciences, 13(7). doi:10.3390/app13074124
- Wagner, J., Triantafyllopoulos, A., Wierstorf, H., Schmitt, M., Burkhardt, F., Eyben, F., & Schuller, B. W. (2023). Dawn of the transformer era in speech emotion recognition: Closing the valence gap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1-13. doi:10.1109/TPAMI.2023.3263585
- Wang, B., Dong, G., Zhao, Y., Li, R., Cao, Q., Hu, K., & Jiang, D. (2023). Hierarchically stacked graph convolution for emotion recognition in conversation. Knowledge-Based Systems, 263. doi:10.1016/j.knosys.2023.110285
- Wei, Q., Huang, X., & Zhang, Y. (2023). FV2ES: A fully End2End multimodal system for fast yet effective video emotion recognition inference. IEEE Transactions on Broadcasting, 69(1), 10-20. doi:10.1109/TBC.2022.3215245
- Wen, J., Jiang, D., Tu, G., Liu, C., & Cambria, E. (2023). Dynamic interactive multiview memory network for emotion recognition in conversation. Information Fusion, 91, 123-133. doi:10.1016/j.inffus.2022.10.009
- Yang, K., Zhang, T., Alhuzali, H., & Ananiadou, S. (2023). Cluster-level contrastive learning for emotion recognition in conversations. IEEE Transactions on Affective Computing, 1-12. doi:10.1109/TAFFC.2023.3243463
- Yi, Y., Tian, Y., He, C., Fan, Y., Hu, X., & Xu, Y. (2023). DBT: Multimodal emotion recognition based on dual-branch transformer. Journal of Supercomputing, 79(8), 8611-8633. doi:10.1007/s11227-022-05001-5

The authors should explain why their results are worse than the results achieved in other studies.

Only the unweighted accuracy (UA) metric is used for performance evaluation. For an unbiased evaluation, the authors should also add the results of other metrics, such as sensitivity, specificity, and F-score.

Evaluate the confidence intervals of your results, as was done by Chen et al. in ref. [40].

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Reviewer 2 Report

I thank the authors for addressing my comments and improving their manuscript.

Regards,

The English grammar is fine. Please carefully proofread the manuscript for any typos.

Author Response

Please see the attachment.

Author Response File: Author Response.docx

Round 3

Reviewer 1 Report

I still did not find a statistical evaluation of the results. The authors should discuss the reliability of their results from a statistical point of view.

Author Response

Please see the attachment.

Author Response File: Author Response.docx
