Article
Peer-Review Record

Low Complexity Deep Learning Framework for Greek Orthodox Church Hymns Classification

by Lazaros Alexios Iliadis 1,*, Sotirios P. Sotiroudis 1, Nikolaos Tsakatanis 1, Achilles D. Boursianis 1, Konstantinos-Iraklis D. Kokkinidis 2, George K. Karagiannidis 3,* and Sotirios K. Goudos 1,*
Reviewer 1:
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Appl. Sci. 2023, 13(15), 8638; https://doi.org/10.3390/app13158638
Submission received: 28 May 2023 / Revised: 24 July 2023 / Accepted: 25 July 2023 / Published: 27 July 2023
(This article belongs to the Special Issue Algorithmic Music and Sound Computing)

Round 1

Reviewer 1 Report

To tackle the identification of Greek Orthodox Church hymns with low-complexity algorithms, the paper mainly evaluates three kinds of DL models, namely a Shallow CNN, a Deep CNN, and a Macro VGG, and also compares their performance with two kinds of state-of-the-art (SOTA) deep learning models, namely VGGish and ResNet18. The experiments, together with an overall consideration of complexity, show that the Macro VGG is the preferred model. It seems that the Macro VGG method could be accepted and applied on mobile devices as an initial version; however, some aspects could be improved, as follows.

1) In the paper, the research target is the music of Greek Orthodox Church hymns, but, on the one hand, there is hardly any analysis of this music in the text, so some introduction to its characteristics (such as rhythm, pitch contour, and accent) should be added; on the other hand, nothing in the paper shows that the modeling depends on the target (i.e., the objects to be identified), so the approach appears applicable to any target. Please provide some explanation.

2) The experiments show that the Macro VGG is the preferred model for low-complexity identification of Greek Orthodox Church hymns, so the model should be analyzed in more depth to let the reader understand why and how its architecture is more efficient.

3) As mentioned in the paper, other architectures, such as RNNs and Transformers with attention mechanisms, are also efficient DL models for voice and image classification, but they have not been evaluated in the paper, so it cannot be guaranteed that the Macro VGG is the optimal model. Verifying them further in the paper is encouraged.

Author Response

Please see the attachment

 

Author Response File: Author Response.doc

Reviewer 2 Report

In this paper, the authors present the results of their study on the automatic classification of hymns of the Greek Orthodox Church using deep learning techniques. While the problem itself is interesting and relevant, I believe that the description and approach presented in the paper lack sufficient detail and results to be considered for publication in a journal.

Starting with the form, I suggest rewriting the Abstract to improve the flow of ideas. For instance, the Abstract should begin with the second sentence since signal processing is not the main focus of the work. Additionally, the Introduction seems to present disconnected ideas, particularly between the third and fourth paragraphs.

The authors mention that one of their contributions is the creation of a database for the Greek Orthodox Church. However, if this database is not publicly available, it cannot be considered a genuine contribution to the scientific community.

Furthermore, the review of the state of the art is not extensive enough. For example, recent works such as:

- Farajzadeh, N., Sadeghzadeh, N., & Hashemzadeh, M. (2023). PMG-Net: Persian music genre classification using deep neural networks. Entertainment Computing, 44, 100518.

- Sharma, D., Taran, S., & Pandey, A. (2023). A fusion way of feature extraction for automatic categorization of music genres. Multimedia Tools and Applications, 1-24.

These recent works introduce ideas that should be reviewed and discussed in the paper as well. It is important to note that converting music to a spectrogram representation is just one of the options available in the state of the art.

Moreover, the improvement in classification achieved with one of the proposed models seems to be marginal compared to the state-of-the-art model. It is necessary to validate this improvement with a statistical test and provide a more adequate justification for selecting both the state-of-the-art methods and the methods presented in the paper.
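
For instance, with per-fold accuracies obtained on the same cross-validation splits for both models, a paired significance test could be reported. A minimal sketch in Python is given below; the fold scores are placeholders for illustration only, not values taken from the paper:

    # Paired comparison of two models evaluated on the same cross-validation folds.
    # The per-fold accuracies below are illustrative placeholders only.
    from scipy.stats import ttest_rel, wilcoxon

    proposed_acc = [0.91, 0.93, 0.90, 0.92, 0.94]   # per-fold accuracy, proposed model
    baseline_acc = [0.90, 0.92, 0.91, 0.90, 0.93]   # per-fold accuracy, SOTA baseline

    t_stat, p_t = ttest_rel(proposed_acc, baseline_acc)   # paired t-test
    w_stat, p_w = wilcoxon(proposed_acc, baseline_acc)    # non-parametric alternative
    print(f"paired t-test p = {p_t:.3f}, Wilcoxon p = {p_w:.3f}")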

In the conclusions, the authors mention that there are still many ways to improve the results. However, a journal publication should explore these avenues of improvement in greater depth.

Regarding the form, I believe that Figures 3, 4, and 5 add little value to the results and can be removed.

Given the significance of the problem addressed, the authors should reconsider the scope of their proposal to expand on the results, provide more thorough justifications, and consider publishing the database to establish a benchmark for future developments.

 

Minor suggestions:

Replace "excessive success" with "great success" in the first paragraph of the Introduction.

Present the meaning of the VGC acronym.

I do not consider it necessary to present and explain the common classification metrics.

 

I recommend proofreading and editing the work, paying particular attention to word choice and disconnected ideas.

Author Response

Please see the attachment

 

Author Response File: Author Response.doc

Reviewer 3 Report

This paper presents a use case of audio signal processing and automatic acoustic classification, applied to the classification of Greek Orthodox Church hymns.

In the paper, the authors briefly explain how they gathered the church hymns and describe the design of three basic convolutional neural networks for their classification. Finally, the authors compare the results of their custom models with two state-of-the-art deep learning architectures (VGGish and ResNet18).

While the application could be of interest to the religious community, I believe there are major flaws in this work that need to be addressed prior to its publication:

1-      The introduction mentions several scientific contributions, including a dataset of 23 hymns. However, this dataset is not made available, which raises concerns. For it to be considered a contribution, the dataset should be openly accessible. Otherwise, it should be removed from the list of contributions.

2-      The introduction lacks a proper justification for the need for this application or the societal interest in such a tool. It would be helpful to elaborate on why this classification of church hymns is necessary and how it fills a research gap. Additionally, it would be beneficial to discuss how this tool differs from existing applications like Shazam, highlighting the unique aspects of this research.

3-      The related work section is too brief and could be expanded upon. While the references provided are relevant and up-to-date, their explanation is insufficient. It would be valuable to explore related applications or techniques, particularly those related to cultural heritage and hymns classification.

4-      In the materials and methods section, the data generation process needs further elaboration. Additional information about the devices used, such as the types of professional equipment and mobile phones, would be helpful. Details about the distribution of sensors in the churches, the use of tripods, and the proximity of sensors to the walls would enhance the reader's understanding. Including pictures of different set-ups would also be beneficial.

5-      The paper lacks information about the acoustic characteristics of the data, such as the average and standard deviation of file durations and the sampling rate used. Boxplots could be included to provide a visual representation. Furthermore, it would be valuable to provide the names of the hymns in the dataset for clarity.

6-      Figure 1 would benefit from including the time and frequency axes, as well as information about the maximum frequency and the audio file durations. Identifying the hymns' classes and ensuring a uniform sample rate across the dataset should also be addressed.

7-      It would be convenient to specify the parameters of the spectrogram calculation (a minimal example of such a specification is sketched after this list).

8-      In Section 2.2.2, while augmentation techniques for computer vision are mentioned, it would be interesting to include audio augmentation techniques, particularly those related to spectrograms. Methods like audio mix-up and SpecAugment could be discussed (a simplified SpecAugment-style sketch is given after this list). Additionally, it would be valuable to specify the number of samples in the dataset after augmentation.

9-      I feel that the first part of Section 2.3 could be skipped, as it is basic information that every DL researcher would know (what an input layer, a convolutional layer, or an activation function is). And if this information is not known, the paper does not provide enough detail for a new reader to understand it; I presume that a reader who does not know what a convolutional layer is will not know what a fully connected layer is either. For this reason, I feel that this part of the section could be removed. Instead, all this space could be used to better explain the three architectures presented in the paper (e.g., size, neurons, activation function on each layer…) and the design choices behind them.

10-   The paper states that cross-validation was used in Section 3.1, but it lacks information about the size of each fold, the distribution between training and validation sets, and the size of the test set. Clarifying these details and whether the same folds were used for all experiments or randomly created would be beneficial.

11-   Since cross-validation has been carried out, are the metrics presented in Tables 3 and 4 averages (train/validation)? It would be helpful to specify this and to provide the standard deviation as well.

12-   Figures 3, 4 and 5 take up a lot of space without providing very relevant information. I assume the authors want to show that their models are not overfitting (this should be stated in the text). I would suggest combining the three graphs into a single figure.

13-   A deeper discussion of the results obtained with the three custom models would also be appreciated.

14-   In Table 4, the authors show a Time column, probably the training time (this is not specified in the text). Since the final application is aimed at classifying hymns on a mobile device, the training time is not so important, as training is usually done in advance and in the cloud. To complement this information, it would be more interesting to analyze the inference time needed to classify a hymn, especially if this time is measured on a mobile device (a rough timing sketch is given after this list).

15-   When applying transfer learning to two SOTA architectures (Section 3.2), VGGish and ResNet18 have been chosen. The paper lacks a justification of why these two models were chosen (especially VGGish, which is a large model usually used for embedding extraction and was not designed for mobile classification). Currently, there are many DL architectures that are specifically designed to perform real-time classification on mobile devices and could be compared to the ones presented in this work. Examples of such architectures are MobileNet, SqueezeNet, and EfficientNet. Using these architectures might be more beneficial for hymn classification if the network has to run in real time on a mobile device. Please include them in the paper (a transfer-learning sketch with one such backbone is given after this list).

16-   It would be a great contribution if the authors could show the confusion matrix of their system and analyze in depth the cases where their models do not perform well.

17-   Finally, I am worried about how the application will function in the real world. From what is written in the paper, I assume that every audio file (one complete hymn) is converted into a spectrogram, regardless of its duration, and then passed through the CNN. However, what happens if only a fragment of the hymn is recorded? Will the system be able to analyze it? Perhaps it would be more convenient to always use one fixed window size for all inputs (a sketch of such windowed inference is given after this list). If that is the case, it is not properly explained in the paper.
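
Regarding point 7 above, a compact way to report the spectrogram settings would be to give them in the form they are passed to the extraction routine. A minimal sketch with librosa follows; every parameter value here is an assumption for illustration, not the authors' actual configuration:

    # Illustrative (log-)mel-spectrogram extraction; all parameter values are
    # assumptions for the sake of example, not the settings used in the paper.
    import librosa
    import numpy as np

    y, sr = librosa.load("hymn.wav", sr=22050, mono=True)   # fixed, uniform sample rate
    mel = librosa.feature.melspectrogram(
        y=y, sr=sr,
        n_fft=2048,        # STFT window length
        hop_length=512,    # stride between successive frames
        n_mels=128,        # number of mel bands
        fmax=sr // 2,      # upper frequency bound
    )
    log_mel = librosa.power_to_db(mel, ref=np.max)           # convert to dB scale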
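
Regarding point 8 above, spectrogram-level augmentation can be illustrated with a simplified SpecAugment-style masking function (a sketch only, not the authors' pipeline; the mask sizes are arbitrary examples):

    # Simplified SpecAugment-style masking on a (n_mels, n_frames) spectrogram array.
    import numpy as np

    def spec_augment(spec, freq_mask=16, time_mask=32, rng=np.random.default_rng()):
        spec = spec.copy()
        n_mels, n_frames = spec.shape
        f0 = int(rng.integers(0, max(1, n_mels - freq_mask)))    # frequency-mask start
        t0 = int(rng.integers(0, max(1, n_frames - time_mask)))  # time-mask start
        spec[f0:f0 + freq_mask, :] = spec.min()                  # mask a band of mel bins
        spec[:, t0:t0 + time_mask] = spec.min()                  # mask a block of frames
        return spec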
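
Regarding points 14 and 15 above, a mobile-oriented backbone can be adapted via transfer learning and its per-sample inference time estimated roughly as follows (a sketch using torchvision's MobileNetV2; the 23-class head, the 224x224 three-channel spectrogram input, and the timing loop are assumptions, and a true on-device measurement would require deploying the exported model to the phone):

    # Transfer learning with a mobile-friendly backbone plus a rough inference timing.
    import time
    import torch
    import torchvision

    num_classes = 23
    model = torchvision.models.mobilenet_v2(weights="IMAGENET1K_V1")
    model.classifier[1] = torch.nn.Linear(model.last_channel, num_classes)  # new head
    model.eval()

    x = torch.randn(1, 3, 224, 224)   # one spectrogram rendered as a 3-channel image
    with torch.no_grad():
        model(x)                      # warm-up pass
        start = time.perf_counter()
        for _ in range(50):
            model(x)
        avg_ms = (time.perf_counter() - start) / 50 * 1000
    print(f"average CPU inference time: {avg_ms:.1f} ms per sample")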
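
Regarding point 17 above, one common convention is to slice each recording into fixed-length windows and aggregate the per-window predictions, so that fragments of any duration can be handled. A sketch of this idea follows; the 3-second window and the predict_window interface are assumptions for illustration:

    # Classify a recording of arbitrary duration by averaging class probabilities
    # over fixed-length windows; window length and model interface are assumptions.
    import numpy as np

    def classify_recording(y, sr, predict_window, win_seconds=3.0):
        win = int(win_seconds * sr)
        probs = []
        for start in range(0, max(len(y), 1), win):
            chunk = y[start:start + win]
            if len(chunk) < win:                       # zero-pad the last partial window
                chunk = np.pad(chunk, (0, win - len(chunk)))
            probs.append(predict_window(chunk, sr))    # model returns class probabilities
        return int(np.mean(probs, axis=0).argmax())    # average, then arg-max class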

 

 

Overall, I think the problem being studied has potential, but the authors need to address all the above concerns before their paper can be published. Considerable effort has been put into recording all the hymns and gathering a dataset to develop the application, and it would be a great contribution if the authors could analyze it in depth and make it available for the community to use.

Author Response

Please see the attachment

 

Author Response File: Author Response.doc

Round 2

Reviewer 2 Report

I believe that the authors have made significant improvements to the manuscript and have addressed all the comments made in the first round of review. Therefore, I consider that the manuscript can be published in its present form.

Author Response

Dear Editor-in-Chief,

Dear Academic Editor,



First, we would like to express our gratitude to you and the reviewers, whose valuable comments have improved the quality of this manuscript. We have addressed all the comments in the new version of the paper, and we have listed all the changes item by item in our responses to the comments below. We have also carefully proofread the manuscript. Please note that all modifications are highlighted in red in the revised manuscript.

Author Response File: Author Response.docx

Reviewer 3 Report

First of all, I would like to congratulate the authors on the improvement of their manuscript. Most of the comments have been properly addressed: new experiments have been carried out, and the results are now better presented and explained in the paper. However, some comments were not fully addressed, and I feel that there are still some issues that should be resolved in order to make the paper acceptable for publication.

  1. In the Introduction, lines 45 to 53, the third contribution should be removed. It is a general sentence that does not add value to the study: carrying out the experiments is the methodology used to design the DL models, not a contribution in itself. Also, couldn't the first and second contributions be combined into a single sentence? The micro VGG is one of the three DL models covered by the first contribution. In my opinion, the authors should rewrite this part to emphasize the new micro VGG architecture (first point), and then, as a second point, explain the comparison with the several SOTA pre-trained models (VGGish, ResNet18, etc.). Moreover, why are these models not mentioned earlier, after line 44, where the first experiment is explained? I think the third point of the authors' contributions belongs there as well, rather than being listed as a contribution.
  2. For the Shazam part, I feel that this reference may help the authors: Wang, A. (2003, October). An industrial strength audio search algorithm. In Ismir (Vol. 2003, pp. 7-13).
  3. I still think that the data generation process needs further elaboration. Right now, the authors state that:

"To enhance the dataset’s diversity, some of the hymns were recorded in a soundproof environment by a single chanter using professional recording equipment, while others were captured using a mobile phone in various churches during divine services, including background noise from crowds and church bells. During the recordings inside the churches, the mobile phone was placed in different places each time to achieve diversity in the audio quality."

It would be beneficial to know the proportion of audio files recorded in the soundproof environment versus those recorded in situ.

  4. I still think that it is necessary to include the names of the different hymns, even if they are in Greek. This way, readers of the paper will be able to look them up on the internet and listen to them, checking their differences and similarities. Moreover, the mean duration and standard deviation should be specified per class. A boxplot per class would provide very valuable information.
  5. Figure 1 should include the time and frequency values on the axes, as well as the hymn name of each spectrogram. Moreover, has the spectrogram been filtered? Why do the top and bottom parts appear black? A justification would be appreciated.
  6. In Table 6, the numbers of operations for MobileNet and SqueezeNet lack their units.
  7. In Table 7, VGGish achieves a mean accuracy of 100% with a ±1.34% standard deviation. How is that possible? Is it an error? The accuracy cannot be greater than 100%; therefore, the only way to obtain a mean accuracy of 100% is if the standard deviation is 0.
  8. The confusion matrix is hard to read. The authors should print the numbers in each cell instead of relying only on the color map (a sketch is given after this list).
  9. Lines 378 to 382 lack a reference.
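
Regarding point 8 above, an annotated confusion matrix with explicit counts is straightforward to produce; a minimal sketch with scikit-learn follows (the labels and predictions are placeholders, not the authors' results):

    # Confusion matrix rendered with the count printed in each cell.
    import matplotlib.pyplot as plt
    from sklearn.metrics import ConfusionMatrixDisplay

    y_true = ["hymn_a", "hymn_a", "hymn_b", "hymn_c", "hymn_b"]   # placeholder labels
    y_pred = ["hymn_a", "hymn_b", "hymn_b", "hymn_c", "hymn_b"]   # placeholder predictions

    ConfusionMatrixDisplay.from_predictions(
        y_true, y_pred, values_format="d", cmap="Blues"           # numbers in every cell
    )
    plt.tight_layout()
    plt.savefig("confusion_matrix.png", dpi=300)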

Author Response

Please see the attachment

Author Response File: Author Response.doc
