Peer-Review Record

Facial Emotion Recognition for Photo and Video Surveillance Based on Machine Learning and Visual Analytics

Appl. Sci. 2023, 13(17), 9890; https://doi.org/10.3390/app13179890
by Oleg Kalyta 1, Olexander Barmak 1, Pavlo Radiuk 1,* and Iurii Krak 2,3
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3:
Submission received: 28 April 2023 / Revised: 7 July 2023 / Accepted: 30 August 2023 / Published: 31 August 2023
(This article belongs to the Special Issue Advanced Technologies for Emotion Recognition)

Round 1

Reviewer 1 Report

The paper at hand proposes a facial emotion recognition (FER) system for photo and video surveillance in crowded environments based on geometrical feature extraction, hyperplane classification and visual analytics for human-in-the-loop machine learning. The idea of transparent and interpretable classification constitutes a novel and emergent requirement of contemporary computer vision solutions. Hence, the article addresses a quite hot and interesting topic in the recent bibliography.

In general, the manuscript -although quite extensive- is well-written, organized and easy to follow.

Comments

In lines 433-520, the description regarding the manifestation of each emotional state with the Facial Action Coding System (FACS) is quite dense and difficult to follow. I would suggest that the authors displayed the above coding system in a compact table or diagram.

The authors correctly state the main limitations of their work. A major limitation concerns the exploitation of geometrical features for emotion identification that is well-known for their high susceptibility to subjects, illumination, viewpoint, etc. Hence, I am not sure whether the provided feature selection and thresholds apply to different subjects. Given that the authors have tested their work only on one dataset how do they anticipate their work to generalize? Probably, the authors could add some experiments on another dataset, or elaborate more on the above expectations.

Please, provide a better description regarding the evaluation setup of the work. Did the authors test their work using a leave-one-speaker-group-out cross-validation strategy to prove the work’s ability to generalize?

Given the existence of cutting-edge, real-time, high-precision facial landmarks extractors, I do not consider the exploitation of third-party tools for this purpose as a limitation. Please refer to the exploitation of such extractors in geometrical feature extraction for emotion recognition:

Kazemi, Vahid, and Josephine Sullivan. "One millisecond face alignment with an ensemble of regression trees." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

Kansizoglou, Ioannis, et al. "Continuous Emotion Recognition for Long-Term Behavior Modeling through Recurrent Neural Networks." Technologies 10.3 (2022): 59.

Vonikakis, Vassilios, and Stefan Winkler. "Identity-invariant facial landmark frontalization for facial expression analysis." 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020.

I believe that the above exploitation constitutes a strong benefit of the authors’ approach considering the transparent nature of the obtained features. Of course, the authors can elaborate more on that.

To my point of view, another main limitation constitutes the emotion classification in a frame-by-frame manner, ignoring the temporal quality of the video. To my understanding, each image of the video is classified regarding its manifested emotion. Yet, emotional states often need a specific duration to be accordingly expressed and thus the exploitation of online recognition capacities in video streams can be proved particularly effective. Please, refer to ["An active learning paradigm for online audio-visual emotion recognition." IEEE Transactions on Affective Computing 13.2 (2019): 756-768] and discuss the above limitation or idea for future work.

It is not clear to me why the authors chose to compare their work against a Deep Learning method for violence detection, instead of common FER works. Please, discuss this selection a little further.

The Conclusion section is quite short. Please, discuss your ideas for future work based on the stated limitations.

Finally, given that the proposed work does not exceed the performance of the compared methods, please clearly state both in the contributions and the conclusions what are the main benefits and the reasons to select such work against the existing ones. In such a way, the transparent and interpretable capacities of the proposed work could become more clear and enhance the readability of the manuscript.

In general, the manuscript -although quite extensive- is well-written, organized and easy to follow.

Minor comment

In line 88, please correct “FATCS” => “FACTS”.

Author Response

Response to Reviewer 1 Comments

 

Dear Reviewer,

 

We are grateful to the Editorial Board and to the experts for considering our submitted manuscript entitled “Facial Emotion Recognition for Photo and Video Surveillance Based on Machine Learning and Visual Analytics.” We have taken all the remarks and criticisms into account, and we thank the Editorial Board and the experts for their constructive comments. You will find our answers to each of your remarks below.

 

The paper at hand proposes a facial emotion recognition (FER) system for photo and video surveillance in crowded environments based on geometrical feature extraction, hyperplane classification and visual analytics for human-in-the-loop machine learning. The idea of transparent and interpretable classification constitutes a novel and emergent requirement of contemporary computer vision solutions. Hence, the article addresses a quite hot and interesting topic in the recent bibliography.

In general, the manuscript -although quite extensive- is well-written, organized and easy to follow.

 

Point 1: In lines 433-520, the description regarding the manifestation of each emotional state with the Facial Action Coding System (FACS) is quite dense and difficult to follow. I would suggest that the authors displayed the above coding system in a compact table or diagram.

 

Response 1: Thank you for your constructive feedback regarding the clarity of comparison between our technique and the Facial Action Coding System (FACS) and its relation to each emotional state. We recognize the complexity of this section and appreciate your suggestion for a more accessible representation.

In response, we have revised lines 433-520 to provide a more lucid and concise account. In particular, we incorporated a detailed table in lines 456-458 to convey this information in a more understandable format. This table delineates the manifestation of each emotional state with the FACS, serving as a more direct, reader-friendly comparison with the proposed technique.

We hope this change may improve the readability and accessibility of our research, allowing readers to grasp better and evaluate our findings.

 

Point 2: The authors correctly state the main limitations of their work. A major limitation concerns the exploitation of geometrical features for emotion identification that is well-known for their high susceptibility to subjects, illumination, viewpoint, etc. Hence, I am not sure whether the provided feature selection and thresholds apply to different subjects. Given that the authors have tested their work only on one dataset how do they anticipate their work to generalize? Probably, the authors could add some experiments on another dataset, or elaborate more on the above expectations.

 

Response 2: Thank you for your thoughtful remarks and for highlighting the potential limitations regarding our feature selection process and generalizability. We appreciate your recommendation and agree that the robustness of any model should be tested across diverse datasets to ensure its applicability to various conditions and subjects.

In light of your comments and our own evaluation of our study's limitations, during the revision phase of our work, we incorporated additional substantive experiments on two well-recognized datasets. Specifically, we tested our technique on the Facial Expression Recognition (FER+) dataset and the Amsterdam Dynamic Facial Expression Set (ADFES). These datasets provide a diverse set of subjects, illumination conditions, and viewpoints, thereby offering a more comprehensive evaluation of our model.

The additional testing offered valuable insights and reinforced the robustness of our technique, despite the well-noted variability in geometrical features for emotion identification. The results from these two datasets have been included in the revised version of our work, further addressing the concerns related to generalizability.

To conclude, we believe these enhancements make our work more comprehensive and solid.

 

Point 3: Please, provide a better description regarding the evaluation setup of the work. Did the authors test their work using a leave-one-speaker-group-out cross-validation strategy to prove the work's ability to generalize?

 

Response 3: Thank you for your inquiry about the evaluation setup employed in our study. We appreciate your interest and understand the value of rigorous validation methods in verifying the work's ability to generalize.

To clarify, in this particular study, we did not use a leave-one-speaker-group-out cross-validation strategy. While this approach indeed has its merits, especially in speaker recognition tasks, it was not incorporated in our methodology. The primary reason was that our work didn't exclusively focus on speaker-specific characteristics, which are generally the main interest in such a cross-validation strategy.

Instead, our evaluation setup employed other robust methods, such as a visual analytics strategy (see subparagraph 3.2) and a statistical comparison (subparagraph 3.4), allowing us to ensure the model's efficiency and generalizability across two datasets that include human faces at different resolutions. We certainly acknowledge the value of your suggestion and will take it into consideration for future research, as we continue to explore diverse validation techniques to enhance our work's comprehensiveness and reliability.
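To illustrate how such an evaluation could be organized in future work, the minimal sketch below sets up a leave-one-subject-group-out scheme with scikit-learn; the arrays X, y, and subject_ids are hypothetical placeholders for geometric features, emotion labels, and per-frame subject identifiers, not our actual data.

```python
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))               # 120 frames x 10 geometric features (placeholder)
y = rng.integers(0, 5, size=120)             # 5 emotional states (placeholder labels)
subject_ids = np.repeat(np.arange(12), 10)   # 12 subjects, 10 frames each (placeholder)

# Every fold holds out all frames of one subject, so the classifier is always
# evaluated on people it has never seen during training.
logo = LeaveOneGroupOut()
scores = cross_val_score(LinearSVC(), X, y, groups=subject_ids, cv=logo)
print(f"mean accuracy over held-out subjects: {scores.mean():.3f}")
```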

We hope this answer clarifies your query, and we invite further discussion on any aspect of our work.

 

Point 4: Given the existence of cutting-edge, real-time, high-precision facial landmarks extractors, I do not consider the exploitation of third-party tools for this purpose as a limitation. Please refer to the exploitation of such extractors in geometrical feature extraction for emotion recognition:

Kazemi, Vahid, and Josephine Sullivan. “One millisecond face alignment with an ensemble of regression trees." Proceedings of the IEEE conference on computer vision and pattern recognition. 2014.

Kansizoglou, Ioannis, et al. "Continuous Emotion Recognition for Long-Term Behavior Modeling through Recurrent Neural Networks." Technologies 10.3 (2022): 59.

Vonikakis, Vassilios, and Stefan Winkler. "Identity-invariant facial landmark frontalization for facial expression analysis." 2020 IEEE International Conference on Image Processing (ICIP). IEEE, 2020.

 

Response 4: Thank you for your insightful feedback regarding our choice of feature extraction tools. Your suggestions highlight the dynamic landscape of emotion recognition and its continuing advancement, especially in facial landmarks extraction.

The works of Kazemi & Sullivan, Kansizoglou et al., and Vonikakis & Winkler, which you kindly referenced, indeed present efficient methodologies for face alignment, continuous emotion recognition, and identity-invariant facial landmark frontalization, respectively. They offer intriguing potential for our research and the broader field.

In this regard, our next steps include a thorough analysis of the suggested techniques and an evaluation of how their methodologies could be integrated into our ongoing work or inform the design of future experiments.

 

Point 5: I believe that the above exploitation constitutes a strong benefit of the authors' approach considering the transparent nature of the obtained features. Of course, the authors can elaborate more on that.

 

Response 5: Thank you for your insightful feedback regarding the transparent nature of the obtained features in our study. We agree with your perspective that this transparency is a considerable strength of our approach, as it allows for a clear understanding and interpretation of the results.

In response to your suggestion, we have further elaborated on this aspect in the updated Section 3, "Results and Discussion." In this section, we delve deeper into the intricacies of our feature extraction process and clarify how the transparency of these features contributes to our model's overall robustness and interpretability.

Thank you for your constructive comments.

 

Point 6: To my point of view, another main limitation constitutes the emotion classification in a frame-by-frame manner, ignoring the temporal quality of the video. To my understanding, each image of the video is classified regarding its manifested emotion. Yet, emotional states often need a specific duration to be accordingly expressed and thus the exploitation of online recognition capacities in video streams can be proved particularly effective. Please, refer to ["An active learning paradigm for online audio-visual emotion recognition." IEEE Transactions on Affective Computing 13.2 (2019): 756-768] and discuss the above limitation or idea for future work.

 

Response 6: Your perspective on the limitations of our approach, particularly concerning the frame-by-frame classification, is insightful and appreciated. Indeed, emotional states are dynamic and often evolve, which is an aspect our current model does not fully account for.

In our approach, we analyze each frame of the video individually, assigning an emotional state based on the facial expression in that frame. This process can overlook the temporal continuity of emotional states in the video sequences, potentially leading to inaccuracies when rapid emotional shifts occur.

The work you have cited, "An active learning paradigm for online audio-visual emotion recognition," emphasizes the importance of temporal context in emotion recognition. It proposes an active learning paradigm that considers the temporal sequences in video streams, thereby improving emotion recognition accuracy.

In light of this, a potential direction for future work could be integrating an active learning paradigm into our framework. This could involve developing a model that accounts for the temporal context of emotional states, possibly through the use of techniques like recurrent neural networks or other sequence modeling approaches. Such a modification could enhance the model's ability to understand the fluid nature of emotions, improving its overall accuracy in emotion recognition from video streams. We appreciate your thoughtful feedback and look forward to exploring these possibilities further.
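As a lightweight first step in this direction, the per-frame labels could be smoothed over a short temporal window before a final decision is reported. The sketch below illustrates the idea with a sliding-window majority vote; frame_labels is a hypothetical per-frame output, not the actual output of our classifier.

```python
from collections import Counter

def smooth_labels(frame_labels, window=15):
    """Replace each frame's label with the majority label in a centered window."""
    half = window // 2
    smoothed = []
    for i in range(len(frame_labels)):
        lo, hi = max(0, i - half), min(len(frame_labels), i + half + 1)
        smoothed.append(Counter(frame_labels[lo:hi]).most_common(1)[0][0])
    return smoothed

# A brief two-frame flicker to "Neutral" inside a longer "Fear" episode is
# absorbed by the surrounding window instead of being reported as a change.
frame_labels = ["Neutral"] * 10 + ["Fear"] * 5 + ["Neutral"] * 2 + ["Fear"] * 15
print(smooth_labels(frame_labels, window=7))
```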

 

Point 7: It is not clear to me why the authors chose to compare their work against a Deep Learning method for violence detection, instead of common FER works. Please, discuss this selection a little further.

 

Response 7: Thank you for your insightful remark regarding the comparison criteria in our work. We acknowledge that the relevance and selection of comparison studies play a crucial role in establishing the validity and performance of our proposed method.

In our original work, the comparison with a Deep Learning method for violence detection may have seemed unusual, given that our focus is on Facial Emotion Recognition (FER). The initial choice was guided by the methodological similarities and the applicability of certain aspects of violence detection methods to FER, such as the emphasis on rapid, dynamic facial changes.

However, we understand your concern about needing a more direct comparison with standard FER works. In response, we conducted additional experiments during the revision period and compared our technique with four common and well-known machine learning and deep learning techniques for FER.

These include Toisoul et al.'s work on the estimation of continuous valence and arousal levels from faces, Baltrusaitis et al.'s OpenFace 2.0 for facial behavior analysis, Serengil and Ozpinar's HyperExtended LightFace for facial attribute analysis, and Pecoraro et al.'s work on local multi-head channel self-attention for facial expression recognition.

We believe these comparisons offer a more directly relevant evaluation of our work within the context of recent advancements in FER. The revised manuscript now includes detailed discussions and results from these additional comparisons, which will provide a more satisfying context for assessing our contribution.

 

Point 8: The Conclusion section is quite short. Please, discuss your ideas for future work based on the stated limitations.

 

Response 8: Thank you for your constructive feedback regarding the brevity of our Conclusion section. We acknowledge the need for a comprehensive conclusion that summarizes the work and outlines potential avenues for future research based on our study's limitations.

In response to your suggestion, we have significantly extended the Conclusion section. In this expanded section, we show the benefits of our contribution and offer a detailed discussion of our ideas for future work, which directly address the limitations identified in our study. These proposed directions may provide valuable insights and a clear pathway for subsequent investigations in this field.

 

Point 9: Finally, given that the proposed work does not exceed the performance of the compared methods, please clearly state both in the contributions and the conclusions what are the main benefits and the reasons to select such work against the existing ones. In such a way, the transparent and interpretable capacities of the proposed work could become more clear and enhance the readability of the manuscript.

 

Response 9: Thank you for your insightful feedback on our manuscript. We agree that it is essential to clearly articulate our work's unique contributions and benefits, particularly in the context of existing methods in the field.

While our method may not exceed the performance of some compared methods on certain metrics, it brings unique advantages worth highlighting. In response to your suggestion, we have significantly extended both the Contributions and Conclusions sections of our paper to more clearly delineate these benefits.

We highlight the benefits, such as the transparency and interpretability of our proposed model. This includes its ability to provide insights into which features it considers significant for emotion recognition and how it utilizes them to make predictions. These attributes are crucial in applications where understanding the decision-making process is as important as the outcome itself.

In terms of choosing our method over existing ones, we have emphasized how our approach balances performance with the aforementioned transparency and interpretability, which may not be the case with more complex but opaque models. We believe these updates will enhance the clarity and readability of our manuscript, and we look forward to any further suggestions you may have to improve our work.

 

Point 10: In general, the manuscript -although quite extensive- is well-written, organized and easy to follow.

 

Response 10: We sincerely appreciate your kind words. We tried our best to address all the reviewers' comments.

 

Point 11: Minor comment: In line 88, please correct “FATCS” => “FACTS”.

 

Response 11: Thank you for your comment. We have corrected the acronym “FATCS” to “FACTS” throughout the text.

Author Response File: Author Response.pdf

Reviewer 2 Report

The present article is devoted to present a new comprehensive technique based of the hyperplane classification method for recognizing facial expressions of emotions in video surveillance systems in crowded areas.
The new technique is designed to quickly identify emotional state changes in images of crowds captured by video surveillance cameras.
The manuscript is clear, relevant for the field and presented in a well-structured manner.
The cited references are mostly recent publications and relevant.
The manuscript is scientifically sound, and the experimental design is appropriate to test the hypothesis.
The manuscript’s results are reproducible based on the details given in the methods section.
The data is interpreted appropriately and consistently throughout the manuscript.
The conclusions are consistent with the evidence and arguments presented.
The ethics statements and data availability statements are adequate.
The limitation of the hyperplane classification method is the need to use a data set with faces of people that are characteristic of the location in which the proposed technique will be used, namely: appropriate categories of faces by age, gender, racial and cultural characteristics, and by the characteristics of clothing and climate.

Author Response

Response to Reviewer 2 Comments

 

Dear Reviewer,

 

We are grateful to the Editorial Board and to the experts for considering our submitted manuscript entitled “Facial Emotion Recognition for Photo and Video Surveillance Based on Machine Learning and Visual Analytics.” We have taken all the remarks and criticisms into account, and we thank the Editorial Board and the experts for their constructive comments. You will find our answers to each of your remarks below.

 

Point 1: The present article is devoted to present a new comprehensive technique based of the hyperplane classification method for recognizing facial expressions of emotions in video surveillance systems in crowded areas.

 

Response 1: Thank you for your succinct summary of our article. You have correctly captured the essence of our research, which indeed presents a novel, comprehensive technique leveraging hyperplane classification methods for facial expression recognition in video surveillance, particularly in crowded areas. We appreciate your understanding of our work's objectives and its potential application in practical, real-world settings.

 

Point 2: The new technique is designed to quickly identify emotional state changes in images of crowds captured by video surveillance cameras.

 

Response 2: Thank you for your accurate encapsulation of the primary purpose of our work. Indeed, our technique is specifically engineered to swiftly identify changes in emotional states within images of crowds as captured by video surveillance cameras. The fast-paced nature of crowd dynamics necessitates an approach capable of keeping pace while delivering reliable results.

We appreciate your recognition of our technique's efficiency and potential utility in photo and video surveillance, a domain that requires robust and timely emotion recognition.

 

Point 3: The manuscript is clear, relevant for the field and presented in a well-structured manner.

 

Response 3: We sincerely appreciate your positive feedback on our manuscript. It is gratifying to hear that our work's clarity, structure, and relevance have been recognized. These are aspects we put significant emphasis on during our preparation of the manuscript in order to ensure our research findings are effectively communicated and contribute to advancing the field.

Your constructive review and recognition of our efforts provide motivation and validation for our work. We remain committed to maintaining and enhancing these standards in our current and future research endeavors.

 

Point 4: The cited references are mostly recent publications and relevant.

 

Response 4: We appreciate your positive remark about the cited references in our manuscript. We have made a concerted effort to ensure our study is grounded in the most recent and relevant literature in the field. It is crucial to us that our research reflects an understanding and acknowledgment of the existing body of work, while also identifying areas where we can contribute new knowledge.

Your recognition of our efforts in this aspect is greatly encouraging. We understand the importance of proper citation in fostering a robust academic conversation and we're glad that our selection of references resonates with this principle.

 

Point 5: The manuscript is scientifically sound, and the experimental design is appropriate to test the hypothesis.

 

Response 5: We are sincerely grateful for your positive evaluation of our manuscript. It is reassuring to hear that our work has been recognized as scientifically sound and that our experimental design is considered appropriate for testing our hypothesis.

Our aim was to ensure that the research we conducted was rigorous, relevant, and adhered to the high standards of scientific investigation. Your feedback provides valuable affirmation that we have been successful in our pursuit.

 

Point 6: The manuscript's results are reproducible based on the details given in the methods section.

 

Response 6: We greatly appreciate your confirmation that the results of our manuscript are reproducible based on the provided details in the methods section. Ensuring the reproducibility of our research is a critical aspect of our work, and we are pleased to hear that we have successfully met this standard.

We firmly believe in the importance of transparency and clarity in scientific communication. Our manuscript provided a comprehensive and detailed methods section to allow other researchers to validate and build upon our findings.

Your positive remark encourages us to maintain these high standards in future research. Please let us know if there are any further aspects of our work that you would like us to elaborate on or any suggestions you may have. Thank you for your time and valuable feedback.

 

Point 7: The data is interpreted appropriately and consistently throughout the manuscript.

 

Response 7: Thank you for your positive feedback regarding the interpretation of data in our manuscript. We have worked diligently to ensure that our data interpretation is accurate, appropriate, and consistently maintained throughout the document.

We believe that data interpretation is pivotal to conveying our research findings effectively and to substantiate our conclusions. Therefore, your recognition of our efforts in this area is greatly appreciated.

 

Point 8: The conclusions are consistent with the evidence and arguments presented.

 

Response 8: Thank you for your feedback affirming the consistency between our conclusions and the evidence and arguments we have presented. This alignment is a fundamental aspect of our research methodology, and we endeavored to ensure that our conclusions logically emerged from the data and arguments set forth.

Your recognition of this consistency underscores the careful consideration and scrutiny we applied in synthesizing our results into conclusions, and we are heartened by your positive remarks.

 

Point 9: The ethics statements and data availability statements are adequate.

 

Response 9: Thank you for your feedback on the adequacy of our ethics and data availability statements. These aspects are of great importance in our research, ensuring that our work is conducted with the utmost integrity and transparency, and that it can be scrutinized, validated, and built upon by others in the field.

We have made a conscientious effort to thoroughly address ethical considerations and ensure our data is accessible to other researchers. Your positive feedback in this regard affirms that we have met these standards.

We greatly appreciate your expertise and time in reviewing our manuscript.

 

Point 10: The limitation of the hyperplane classification method is the need to use a data set with faces of people that are characteristic of the location in which the proposed technique will be used, namely: appropriate categories of faces by age, gender, racial and cultural characteristics, and by the characteristics of clothing and climate.

 

Response 10: Thank you for your insightful observation regarding the limitations of the hyperplane classification method. Indeed, the need for location-specific datasets, characterized by factors like age, gender, race, cultural attributes, as well as clothing and climatic conditions, is a limitation we acknowledge.

This necessity poses challenges in terms of data collection and may affect the generalizability of the method. However, it also motivates us to explore more diverse datasets and strive for robust techniques across various demographics and environments.

Your constructive feedback is essential for guiding future research and improving our work. We sincerely appreciate your expertise, time, and effort in reviewing our manuscript.

 

Author Response File: Author Response.pdf

Reviewer 3 Report

This paper proposes a method to recognize facial emotion through machine learning and visual analytics to monitor photos and videos. In general, this paper is of limited theoretical contributions. And my primary concerns are as follows.

 

(1)    The authors conduct their research on the sub-dataset of 27 faces which is reduced from ADFES training dataset of 110 faces. I wonder if the conclusion could hold for a larger dataset or people of different races. Please execute more experiments with a second dataset to address this concern.

(2)    According to Fig. 12, only Fear and not Fear states could be decided by the linear classifier. How can the authors categorize the samples into three groups using one dividing line? Please make this clear.

(3)    Although the authors use a dataset of five emotional states, the final results in Fig. 13 only tell the states of Fear and not Fear, leading to lower contribution in practical scenarios.

 

Additionally, my minor concerns are as follows.

 

(4)    In the title, this method is also designed for videos, but I do not sense the main difference in your method considering different monitored objects, photos and videos.

(5)    Most of the figures are of low resolution or in an unpleasant status. Besides, please type the numbers and draw straight lines in Fig. 4(b) to Fig. 7 other than casually drawing by hand.

(6)    In Fig. 10 and Fig.11, there are six groups. Should it be five, according to your dataset? Legends will be of help to the readers.

 

(7)    Why do you leave a hyphen after Photo in your title?

I can follow your idea. However, it still needs polishing. 

Author Response

Response to Reviewer 3 Comments

 

Dear Reviewer,

 

We are grateful to the Editorial Board and to the experts for considering our submitted manuscript entitled “Facial Emotion Recognition for Photo and Video Surveillance Based on Machine Learning and Visual Analytics.” We have taken all the remarks and criticisms into account, and we thank the Editorial Board and the experts for their constructive comments. You will find our answers to each of your remarks below.

 

This paper proposes a method to recognize facial emotion through machine learning and visual analytics to monitor photos and videos. In general, this paper is of limited theoretical contributions. And my primary concerns are as follows.

 

Point 1: The authors conduct their research on the sub-dataset of 27 faces which is reduced from ADFES training dataset of 110 faces. I wonder if the conclusion could hold for a larger dataset or people of different races. Please execute more experiments with a second dataset to address this concern.

 

Response 1: Thank you for your thoughtful feedback regarding our dataset size and its potential implications on the generalizability of our conclusions. We wholeheartedly agree that experimenting with a more extensive and diverse dataset would enhance the robustness of our findings.

In response to your suggestion, during the revision process, we conducted additional experiments using two well-recognized datasets. These are the Facial Expression Recognition (FER+) dataset and the Amsterdam Dynamic Facial Expression Set (ADFES), both of which are more extensive and diverse than our original sub-dataset of 27 faces.

By expanding our experimental scope in this way, we have aimed to demonstrate the effectiveness of our technique across a broader range of facial expressions, racial backgrounds, and cultural contexts. We hope that this comprehensive testing will help assuage your concerns about our research findings' scalability and general applicability.

We appreciate your valuable comments and hope that our revisions meet your expectations. Your insights have contributed significantly to the improvement of our work.

 

Point 2: According to Fig. 12, only Fear and not Fear states could be decided by the linear classifier. How can the authors categorize the samples into three groups using one dividing line? Please make this clear.

 

Response 2: Thank you for your astute observation regarding Figure 12 and the classification strategy we used in our research. We apologize that the depiction was not sufficiently clear and caused some confusion.

In this particular instance, the dividing line is primarily used to distinguish the 'Fear' emotional state from all other states, thus essentially creating two groups – "Fear" and "Not Fear." This is a common approach in binary classification problems where the goal is to accurately identify the presence or absence of a particular characteristic, in this case, the "Fear" emotion.

The mention of three groups might have led to confusion, and we appreciate you pointing this out. We have revised the text and figures to better explain our methodology and avoid any ambiguity. We hope this clarification will make our classification strategy more understandable.

Thank you again for your constructive feedback.

 

Point 3: Although the authors use a dataset of five emotional states, the final results in Fig. 13 only tell the states of Fear and not Fear, leading to lower contribution in practical scenarios.

 

Response 3: Thank you for your constructive comments regarding Figure 13 and the representation of our results. We appreciate your perspective on the practical contribution of our work.

Our primary objective in the reported results was to demonstrate our technique's efficacy in distinguishing one particular emotional state, "Fear," from others. This binary classification (Fear vs. Not Fear) appears limited when considering the full spectrum of emotional states in the dataset.

However, it's important to note that the methodology presented is not limited to the categorization of Fear alone. The same approach can be applied iteratively or simultaneously for each emotional state in the dataset. In this particular instance, we focused on Fear due to its relevance in the surveillance context.
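To make this concrete, the sketch below shows how the binary hyperplane scheme could be repeated in a one-vs-rest fashion, fitting one hyperplane per emotional state; it assumes scikit-learn, and the features, labels, and emotion names are hypothetical placeholders rather than our experimental data.

```python
import numpy as np
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))                                            # placeholder features
y = rng.choice(["Neutral", "Joy", "Fear", "Anger", "Sadness"], size=200)  # placeholder labels

# One linear hyperplane is fitted per emotional state ("state vs. the rest").
ovr = OneVsRestClassifier(LinearSVC()).fit(X, y)

# The "Fear vs. Not Fear" case corresponds to one of these binary estimators:
fear_idx = list(ovr.classes_).index("Fear")
w, b = ovr.estimators_[fear_idx].coef_, ovr.estimators_[fear_idx].intercept_
print("predicted states for five samples:", ovr.predict(X[:5]))
```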

In response to your feedback, we have considerably revised our manuscript to emphasize this point better and to provide a more straightforward explanation of how our method can be expanded to identify all five emotional states.

Thank you once again for your keen observation and valuable feedback. It greatly aids us in enhancing the clarity and relevance of our work.

 

Additionally, my minor concerns are as follows.

 

Point 4: In the title, this method is also designed for videos, but I do not sense the main difference in your method considering different monitored objects, photos and videos.

 

Response 4: Thank you for your remark concerning the distinction between photos and videos in our work, as outlined in the title. Your observation is valid, and we understand the need for more explicit articulation in the manuscript about how our method applies differently to photos and videos.

Our method is primarily designed to process video frames in real time, extracted from the low-resolution video stream of outdoor surveillance cameras. While each frame can technically be considered a "photo," video surveillance's continuous and dynamic nature introduces additional challenges and requirements. For example, real-time processing speed and the need to track emotional changes over time are unique considerations in video surveillance.
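A minimal sketch of such a frame-by-frame processing loop is shown below; it assumes OpenCV for video capture, and classify_frame is a hypothetical placeholder for the face detection and geometric-feature classification step rather than our actual implementation.

```python
import cv2

def classify_frame(frame):
    # Hypothetical placeholder: detect faces, extract the geometric features,
    # and return one emotion label per detected face.
    return []

cap = cv2.VideoCapture("surveillance.mp4")   # or a camera index / RTSP URL
try:
    while True:
        ok, frame = cap.read()
        if not ok:                           # end of stream or read failure
            break
        labels = classify_frame(frame)
        # In a deployment, labels would be logged or used to raise alerts here.
finally:
    cap.release()
```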

We appreciate your input on this matter and, in response to your feedback, we have clarified this distinction in our manuscript. We hope this provides a more comprehensive understanding of our method and its applications. Thank you again for your valuable feedback; it significantly contributes to improving our work.

 

Point 5: Most of the figures are of low resolution or in an unpleasant status. Besides, please type the numbers and draw straight lines in Fig. 4(b) to Fig. 7 other than casually drawing by hand.

Response 5: Thank you for your valuable feedback regarding the quality and presentation of our figures. We agree that high-resolution, professionally rendered graphics significantly improve our work's readability and overall presentation.

In response to your comments, we have revised and updated all figures to enhance their resolution and clarity. For Figures 4(b) through 7, we have replaced the hand-drawn elements with typed numbers and straight lines to provide a more professional and cleaner visual representation.

We believe that these improvements will make the data more straightforward and the figures more aesthetically pleasing, thereby enhancing our manuscript's overall quality. We appreciate your keen eye for detail and your guidance in this aspect, as it significantly contributes to the refinement of our work.

 

Point 6: In Fig. 10 and Fig.11, there are six groups. Should it be five, according to your dataset? Legends will be of help to the readers.

 

Response 6: Thank you for your careful observation regarding Figures 10 and 11. We apologize for any confusion caused by the discrepancy between the number of groups represented in these figures and our dataset's actual number of groups.

You are correct; according to our dataset, there should indeed be five groups, not six. We appreciate your suggestion regarding the addition of legends for clarity. In response to your feedback, we have merged Figures 10 and 11 into a single composite Figure 10 that accurately reflects the five emotion groups in our dataset, and we have added clear legends for straightforward interpretation. This now aligns with the dataset details presented in the manuscript and enhances readability for our readers.

 

Point 7: Why do you leave a hyphen after Photo in your title?

 

Response 7: Thank you for your keen observation regarding the usage of the hyphen in our title. The hyphenation after "Photo-" in "Facial Emotion Recognition for Photo- and Video Surveillance based on Machine Learning and Visual Analytics" is actually a typographical convention known as a suspensive hyphen.

This is used to avoid repetition when two or more compound modifiers share a common base. In this case, "Photo-" and "Video Surveillance" share the common base "Surveillance." Instead of writing out "Photo Surveillance and Video Surveillance," we use the hyphen after "Photo-" to indicate that "Surveillance" applies to both "Photo" and "Video."

However, after reviewing English usage more closely, we concluded that the hyphen can be omitted in this case. Thank you very much for your kind remark and assistance.

 

Point 8: I can follow your idea. However, it still needs polishing.

 

Response 8: Thank you for this remark; we are glad that the main idea of our work comes across clearly. Alongside the specific changes described in our responses above, we have continued to refine the manuscript's language and presentation throughout the revision, and we welcome any further suggestions for improvement.

 

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

The authors have addressed all of my concerns. Therefore, I recommend the publication of the manuscript.

Author Response

Response to Reviewer 1 Comments

 

Point 1: The authors have addressed all of my concerns. Therefore, I recommend the publication of the manuscript.

 

Response 1:

 

Dear Reviewer,

We greatly appreciate your thorough review and are glad to hear that we have successfully addressed all your concerns. Your positive recommendation for the publication of our manuscript is encouraging and highly valued. Our utmost priority was to engage constructively with your insightful feedback; we are pleased it had the desired effect.

We would also like to thank you for the time and effort you have invested in the review process. Your insightful comments and suggestions undoubtedly helped improve our work's quality. We look forward to possible future opportunities to benefit from your expert perspective. Thank you once again for your positive endorsement and contribution to enhancing our manuscript's quality.

 

Author Response File: Author Response.docx

Reviewer 3 Report

After this round, most of my concerns have been addressed to a large extent. I do appreciate your responses to my concerns on a point-by-point basis. However, according to the modified version, I still have two more significant concerns.

 

According to the results in Fig. 10 and Fig. 11 of the first submission and the results in Fig. 10 of this round, the results are different. The distributions of the emotional states are different. Since you do not mention any significant modifications to the method proposed in this paper, at least in the responses to my review comments, I do not see an apparent reason why the results change differently. Please make a clear explanation. I am reluctant to question your work and professionality. Please accept my apology if this is offensive to you to some extent.

 

 

Although you claimed that Fig. 11 in the revised manuscript had been modified toward my two-group concern, I still do not understand why two categories of emotional states exist on one side of the dividing line, the Neutral and the Joy.  

Author Response

Response to Reviewer 3 Comments

 

Dear Reviewer,

 

Thank you for taking the time to review our manuscript again and for acknowledging our efforts to address your concerns. We greatly appreciate your detailed feedback, which has immensely contributed to improving our work.

We are pleased to hear that most of your concerns have been addressed. However, we understand that two significant points still need further clarification or modification. Below, you may find detailed explanations for both of your concerns.

 

Point 1: According to the results in Fig. 10 and Fig. 11 of the first submission and the results in Fig. 10 of this round, the results are different. The distributions of the emotional states are different. Since you do not mention any significant modifications to the method proposed in this paper, at least in the responses to my review comments, I do not see an apparent reason why the results change differently. Please make a clear explanation. I am reluctant to question your work and professionality. Please accept my apology if this is offensive to you to some extent.

 

Response 1: We sincerely appreciate your meticulous observation and thank you for providing us with the opportunity to clarify the disparities you identified between the results presented in our initial submission and the current round. Please rest assured that we respect and value your expert opinion, and we do not consider your query offensive in any way. On the contrary, we regard it as a constructive opportunity to elucidate the discrepancies.

In our initial submission, we trained and tested the classifier on video frames from the ADFES dataset. It was subsequently brought to our attention that these experiments unintentionally included frames depicting the same individuals expressing identical emotions in both the training and test subsets. This overlap likely introduced overfitting, shifting the classifier toward facial features peculiar to particular individuals and skewing our initial results toward specific appearances.

Upon identifying this error, we conducted comprehensive experiments with a correctly distributed selection of images from both the FER+ and ADFES datasets. Unfortunately, the classifier weights W were not updated during the first round of revision, an oversight for which we sincerely apologize. Retraining the classifier required new values of W, which were computed in our latest experiments.
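The sketch below shows how a subject-disjoint split rules out this kind of identity leakage; it assumes scikit-learn, and X, y, and subject_ids are hypothetical placeholders rather than the actual FER+ or ADFES data.

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

rng = np.random.default_rng(2)
X = rng.normal(size=(110, 10))                 # placeholder features
y = rng.integers(0, 5, size=110)               # placeholder emotion labels
subject_ids = rng.integers(0, 22, size=110)    # which person each frame depicts

splitter = GroupShuffleSplit(n_splits=1, test_size=0.25, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subject_ids))

# No person appears on both sides of the split, so the classifier cannot rely
# on features peculiar to a specific individual.
assert set(subject_ids[train_idx]).isdisjoint(subject_ids[test_idx])
```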

The variances observed in our present results compared to the initial submission stem from these crucial modifications and more rigorous experimental practices. We hope these adjustments have enhanced the authenticity of our results by eliminating previously unidentified biases. We have incorporated these significant updates in our manuscript to clarify the disparities between our initial and present results.

We apologize for any confusion caused, and we appreciate your understanding and patience as we continually strive for the highest standard of scientific accuracy in our work. We thank you for your vigilance and assistance in enhancing the integrity of our study.

 

Point 2: Although you claimed that Fig. 11 in the revised manuscript had been modified toward my two-group concern, I still do not understand why two categories of emotional states exist on one side of the dividing line, the Neutral and the Joy.

 

Response 2: Thank you for your valuable feedback and the opportunity to provide further clarification. We sincerely apologize for any confusion caused and appreciate your patience as we delve into the details.

To address your concern, we have extensively revised Section 3, specifically paragraph 3.3, to offer a more comprehensive explanation of our technique and its graphical representation in Figure 11. In the revised manuscript, Figure 11 has been redrawn to demonstrate the grouping of five emotional states to reflect the validation results obtained from our proposed model and methods.

Figure 11, in its new form, clearly depicts the stable grouping of values derived from matrix X, which are obtained through the geometric interpretation of facial expressions. Each of these groups distinctly corresponds to a different emotional state. A dividing line, manually drawn on the coordinate plane, is also visible in the figure. This line is purposefully designed to clearly demarcate the target emotional state, in this case “Fear,” from all other emotion groups. We consistently refer to the emotional state of “Fear” because it is the emotion best suited to the purpose of our work: improving the accuracy of identifying changes in a person’s emotional state from facial expressions so that abnormal behavior of individuals within a crowd can be detected by systems that meet security requirements.

The dividing line follows the “human-in-the-loop” principle and is manually constructed to simulate a hyperplane in two-dimensional space. Its principal objective is to distinguish the target class “Fear” from all other classes, thereby confirming the efficacy of our proposed model (1) for the classification of emotional states.
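To make the construction concrete, the sketch below expresses such a manually drawn line as a hyperplane w·x + b = 0 and checks that it places every “Fear” point on one side; all coefficients and points are hypothetical illustrations, not values from our experiments.

```python
import numpy as np

w = np.array([1.0, -0.5])    # normal vector of the manually drawn line (hypothetical)
b = -0.2                     # offset (hypothetical)

points = np.array([[1.2, 0.3], [0.9, 0.1], [-0.4, 0.8], [-0.7, 0.5]])   # 2D projections
labels = np.array(["Fear", "Fear", "Neutral", "Joy"])

side = np.sign(points @ w + b)               # +1 or -1: which side of the line
fear_side = side[labels == "Fear"]
other_side = side[labels != "Fear"]
print("line separates Fear from the rest:",
      set(fear_side).isdisjoint(other_side))
```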

We hope this explanation clarifies the logic behind our methods and figure representation. Please, let us know if you have more questions or concerns, as we are always ready to improve our work based on your esteemed feedback.

Author Response File: Author Response.docx
