Comparative Study of Musical Timbral Variations: Crescendo and Vibrato Using FFT-Acoustic Descriptor
Round 1
Reviewer 1 Report
The authors present a study of different qualitative timbral descriptors applied to a large dataset of sounds. The results are quite interesting and in line with their previous research, and the article is well written. I have one fundamental criticism, though: how they compute the Euclidean distance between elements whose dimensions are on different scales. In particular, the authors use the frequency in Hz as one dimension, whose order of magnitude is in the hundreds, whereas the rest of the descriptors vary between 1 and 10. If the authors redo the computations with standardised dimensions, I think the comparison would be much more interesting than it is now. For that reason I recommend major revisions to the article.
"Therefore, the timbre of the instrument will present variations with respect to the sounds of a particular instrument." I don't understand this sentence: what is "the instrument" and what is "a particular instrument"?
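The scale problem described above can be illustrated with a minimal sketch. The data are hypothetical 3-tuples (one frequency in Hz and two descriptors between 1 and 10), not the manuscript's tuples, and the helper names `zscore` and `distance_matrix` are assumptions introduced here for illustration only.

```python
import numpy as np

def zscore(X):
    """Standardise each column to zero mean and unit standard deviation."""
    X = np.asarray(X, dtype=float)
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # constant columns: avoid division by zero
    return (X - mu) / sigma

def distance_matrix(X):
    """Pairwise Euclidean distances between the rows of X."""
    diff = X[:, None, :] - X[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Hypothetical rows: [frequency in Hz, descriptor_a, descriptor_b]
X = np.array([
    [220.0, 1.0, 9.0],  # sound 0
    [221.0, 9.0, 1.0],  # sound 1: almost the same pitch, opposite descriptors
    [440.0, 1.0, 9.0],  # sound 2: same descriptors, one octave higher
])

D_raw = distance_matrix(X)          # dominated by the Hz column
D_std = distance_matrix(zscore(X))  # every column contributes on the same scale
```

On the raw data, sound 0 appears closer to sound 1 than to sound 2, simply because their frequencies are close. After standardisation the ordering reverses, because the two descriptor columns now carry as much weight as the frequency column.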
I found the description of the timbral coefficients quite poor. I needed to go back to the original paper and read the definition there. I think adding it here would make more sense, and the paper would be self-contained.
Fig. 1: Can the authors present the first two dimensions of the PCA decomposition of the tuples? It would be nice to see whether the same instrument clusters in that space or whether the instruments overlap. It is rather difficult to interpret from the multidimensional raw data.
The distance matrices are all meaningless. The authors are comparing a 7-tuple in which one dimension is almost 3 orders of magnitude larger than the others. The authors need to standardise each dimension so that the mean along each dimension of the tuple is 0 and the standard deviation is 1. Only in this way can one get fair results when doing a comparison. If we look at Figs. 2, 3, 5, 8 and 11, we mostly see the difference in the fundamental frequency of the notes, nothing else.
Page 8: "quantizes" is not the correct word. Quantifies?
Author Response
We thank you for your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 2 Report
The authors propose a set of six dimensionless magnitudes derived from the Fast Fourier Transform (FFT) of audio recordings to describe and compare the timbral characteristics of musical sounds. The authors apply the FFT-acoustic descriptors to a sample of monophonic sounds from different instruments, dynamics, and techniques such as crescendo and vibrato. They show how these descriptors can capture the changes in the envelope, frequency, and harmonicity of the sounds. The authors use the Random Forest algorithm to evaluate the accuracy of the FFT-acoustic descriptors and compare them with other timbral features extracted using Librosa. They find that the FFT-acoustic descriptors perform well in classifying instruments, dynamics, and families of instruments, and better than Librosa features in classifying pitch.
However, some weaknesses of this paper are:
- The paper does not provide a clear definition of musical timbre or of how it relates to the FFT-acoustic descriptors. It also does not explain how the dimensionless magnitudes are derived from the FFT spectrum or what their physical or perceptual meanings are.
- The paper does not compare the proposed method with other existing approaches for musical timbre analysis and classification, such as Mel-frequency cepstral coefficients (MFCCs), spectral centroid, spectral flux, etc. It would be useful to see how the FFT-acoustic descriptors perform against these features in terms of accuracy, robustness, and interpretability.
- The paper does not provide any examples or applications of how the FFT-acoustic descriptors can be used for musical analysis, synthesis, or composition. It would be interesting to see how the proposed method can help musicians, composers, or researchers to understand, manipulate, or create musical sounds with different timbral characteristics.
- The paper does not discuss the limitations or challenges of the proposed method, such as the sensitivity to noise, recording quality, instrument tuning, or performance style. It also does not address how the method can handle polyphonic sounds or complex musical textures that involve multiple instruments or sources.
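Of the baseline features named in the list above, the spectral centroid is the simplest to state: it is the magnitude-weighted mean frequency of the spectrum. The following is a minimal NumPy sketch on synthetic tones, assuming a single FFT over the whole signal; the function name `spectral_centroid` is chosen here for illustration, and Librosa's own implementation works frame by frame, so its values may differ.

```python
import numpy as np

def spectral_centroid(signal, sample_rate):
    """Magnitude-weighted mean frequency (Hz) of the one-sided FFT spectrum."""
    magnitudes = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    total = magnitudes.sum()
    if total == 0:
        return 0.0  # silent input: centroid is undefined, return 0 by convention
    return float((freqs * magnitudes).sum() / total)

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate               # one second of samples
pure = np.sin(2 * np.pi * 440 * t)                     # fundamental only
bright = pure + 0.5 * np.sin(2 * np.pi * 1320 * t)     # plus a third harmonic

centroid_pure = spectral_centroid(pure, sample_rate)
centroid_bright = spectral_centroid(bright, sample_rate)
```

Adding energy at a higher harmonic raises the centroid, which is why this feature is commonly read as a rough correlate of perceived brightness.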
Author Response
We thank you for your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
This paper compares musical timbral variations, using the FFT-acoustic descriptor, for crescendo and vibrato in the sound of some musical instruments. However, some of the points in the paper are difficult to follow, so readers may not be able to fully understand it.
In general, "timbre" means the difference in sound on different instruments, and "tone (or tone color)" means the difference in sound on the same type of instrument. In this paper, these are mixed up and described, so please correct them.
If this paper is a study of music acoustics, the authors should conduct experiments using multiple sound sources and real instruments for each type of instrument. In addition, consideration should be given to the acoustic characteristics of the instrument. However, the authors state, for example in 3.4 (vibrato theory), that the issue of the resonant cavity is outside the scope of this paper.
If so, this paper is research on information processing, but the relationship between the Fourier transform formula and the matrix shown in this paper may not be fully understood by readers.
In any case, the point of the paper is not clear, and the readers will not be able to fully understand the intent of the authors. In other words, this paper only presents examples of the results of FFT of several sound sources.
 
First, it should be illustrated how the "A", "S", "MA", "MC", "H", and "M" defined in the paper are represented as features in the spectrum obtained by the FFT and in the timbre (or tone color). Please show a conceptual diagram of how these characteristics correspond to changes in timbre, and explain it in the text.
Next, please illustrate the characteristics of each of the instruments used in the experiment. Next, illustrate the characteristics of crescendo and vibrato for each instrument. In other words, classify the experimental data by the type of instrument. This will convey the author's intentions to the reader.
The reviewer hopes that these comments will be useful to the authors.
Author Response
We thank you for your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf
Round 2
Reviewer 1 Report
The authors have improved the manuscript but there is a fundamental misunderstanding still in the article concerning the distance matrices.
The authors say in their response letter: "it should be noted that for purposes of comparing musical timbres it is assumed that the audios to be compared have the same tone, that is, they correspond to the same musical sounds." Yet they keep presenting matrices that compare DIFFERENT musical sounds. It is either one or the other: if they want to compare the SAME sounds, they should not present the matrices in Figs. 2, 3, 5, 8 and 11, because those matrices are mostly about DIFFERENT sounds (i.e., all the off-diagonal elements). If they want to present the matrices, then the dimensions need to be standardised, as I said in my first comments. Please choose one option and modify the paper as needed; as it is, it is not coherent.
I maintain my recommendation of major revision, as the comment affects half of the results presented in the paper.
Author Response
The observation is appreciated.
The matrices in Figure 2 were changed to their corresponding standardized distance matrices at the request of the Reviewer.
The distance matrices in Figures 3, 5, 8 and 11 were changed to distance graphs to facilitate the interpretation of the results.
Reviewer 3 Report
The revision made the content of this paper a little easier for the reader to understand.
However, although FFT analysis is the subject of this paper, the analysis conditions are not clearly stated in the text.
Please show the audio waveform and sound spectrogram of the sound source used for FFT analysis, and clearly indicate the part used for analysis in terms of time (how many seconds?). Please also specify whether the analysis interval includes the rise and fall of the sound.
Please specify the window function used in the analysis.
Also, please explain whether changing the window function changes the results stated in this paper.
Please clearly state the absolute sound pressure level (mf, ff, etc.) using numerical values.
This is the minimum information necessary to claim the results of this paper regarding timbre.
To compare the sound quality of the same instrument, "tone (color)" should be used instead of "timbre".
Author Response
The observation is appreciated.
The databases of monophonic audio records from Goodsound (reference 9) and Tynisol (reference 10) are used. We apply the complete FFT to monophonic audio records, all of equal duration, for which no window function is required (i.e., the window is rectangular and of unit width). The records in these databases are stored through their FFT, so the spectrograms would require the deconvolution of the already digitized signal, which, as we have explained, would not add more information about the timbre than already exists in the complete FFT from which the descriptors are extracted. The databases used do not give specific information about the power levels (absolute intensity) used for each record in the musical dynamics. Keep in mind that in music the so-called pianissimo and fortissimo intensity levels are not uniquely defined and are no more than a qualitative criterion of the "intensity" with which the musical sound is executed relative to its standard performance (mezzoforte). In any case, the comparison is made between the values registered in the databases as pp and ff, so the absolute intensity values do not modify the timbre comparison.
An additional clarification was added on page 3 (second paragraph of Section 2): 'Note that all the audio records are monophonic and have the same duration (5 seconds), so the complete FFT is performed with a constant window function (unit step)'
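The reviewer's question about whether the window function changes the results can be made concrete with a small sketch. It compares a rectangular window (the case described in the response) with a Hann window on a synthetic tone whose frequency falls between two FFT bins. The signal, the 10 Hz band and the helper name `leakage_fraction` are assumptions for illustration; the sketch does not use the databases from the manuscript and says nothing about how sensitive the authors' descriptors actually are.

```python
import numpy as np

def leakage_fraction(signal, sample_rate, f0, window, half_width_hz=10.0):
    """Fraction of spectral energy lying more than half_width_hz away from f0."""
    spectrum = np.abs(np.fft.rfft(signal * window)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sample_rate)
    outside = np.abs(freqs - f0) > half_width_hz
    return float(spectrum[outside].sum() / spectrum.sum())

sample_rate = 8000
n = sample_rate                    # one second of samples, 1 Hz bin spacing
t = np.arange(n) / sample_rate
f0 = 440.5                         # deliberately halfway between two bins
tone = np.sin(2 * np.pi * f0 * t)

rect = np.ones(n)                  # rectangular window: no tapering
hann = np.hanning(n)               # Hann window: tapers both ends to zero

leak_rect = leakage_fraction(tone, sample_rate, f0, rect)
leak_hann = leakage_fraction(tone, sample_rate, f0, hann)
```

With the rectangular window a noticeably larger share of the energy spreads away from the tone's frequency than with the Hann window. Whether that spread matters for a given descriptor depends on how the descriptor is computed from the spectrum.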
Round 3
Reviewer 1 Report
The authors have followed the advice concerning the matrices, but they don't discuss the results in any meaningful way. They could have taken a bit more time and done a proper analysis of the results. For example, the high-pitch notes in the clarinet (A4 and up) are all very similar, but not in the other instruments. For the flute, the distance seems to increase with pitch distance.
Also, mention explicitly how the dimensions are standardised. I know what it means because I asked for it, but it's not necessarily clear to the reader.
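For reference, the standardisation requested in Round 1 (zero mean and unit standard deviation along each dimension) can be written out as follows. The symbols are assumptions introduced here: $x_{ij}$ is dimension $j$ of the tuple for sound $i$, $N$ is the number of sounds, and the tuple has 7 dimensions as stated in the Round 1 report. Whether the authors divided by $N$ or by $N-1$ is not stated in this record.

```latex
\mu_j = \frac{1}{N}\sum_{i=1}^{N} x_{ij}, \qquad
\sigma_j = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_{ij}-\mu_j\right)^2}, \qquad
z_{ij} = \frac{x_{ij}-\mu_j}{\sigma_j}

d(i,k) = \sqrt{\sum_{j=1}^{7}\left(z_{ij}-z_{kj}\right)^2}
```

Each entry of a standardised distance matrix is then $d(i,k)$ computed on the $z$ values rather than on the raw descriptor values.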
Author Response
We thank you for your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf
Reviewer 3 Report
Please describe the "difference in sound pressure level" between mf, ff, and pp using specific values.
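A relative level difference between two recordings can be estimated from their RMS amplitudes, as in the sketch below. This is only an illustration on synthetic signals: the function names are hypothetical, and the result is a difference between digital signal levels, not an absolute sound pressure level, which would require a calibrated recording chain. If the database recordings were gain-normalised, the figure would not reflect the played dynamics at all.

```python
import numpy as np

def rms(signal):
    """Root-mean-square amplitude of a signal."""
    return float(np.sqrt(np.mean(np.square(signal))))

def level_difference_db(signal_a, signal_b):
    """Level of signal_a relative to signal_b in dB: 20 * log10(rms_a / rms_b)."""
    return 20.0 * np.log10(rms(signal_a) / rms(signal_b))

sample_rate = 8000
t = np.arange(sample_rate) / sample_rate
soft = 0.1 * np.sin(2 * np.pi * 440 * t)   # stand-in for a quiet note
loud = 1.0 * np.sin(2 * np.pi * 440 * t)   # same note, ten times the amplitude

difference = level_difference_db(loud, soft)
```

A tenfold amplitude ratio corresponds to a 20 dB level difference, so `difference` comes out at 20 dB for these two signals.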
Author Response
We thank you for your valuable comments.
Please see the attachment.
Author Response File: Author Response.pdf