Communication
Peer-Review Record

From Phase Transition to Interdecadal Changes of ENSO, Altered by the Lower Stratospheric Ozone

Remote Sens. 2022, 14(6), 1429; https://doi.org/10.3390/rs14061429
by Natalya Andreeva Kilifarska *, Tsvetelina Plamenova Velichkova and Ekaterina Anguelova Batchvarova
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Reviewer 3: Anonymous
Submission received: 31 January 2022 / Revised: 27 February 2022 / Accepted: 11 March 2022 / Published: 15 March 2022

Round 1

Reviewer 1 Report

You have made a lot of effort and improved the statistics. Nevertheless, the manuscript still contains serious statistical errors, as follows.

A) 2-sigma significance test (line 146):

As the authors correctly write with reference to Wikipedia, for the standard normal distribution (Z-distribution) a critical value at the point "2 standard deviations to the left and right of the mean" (+/- 2) yields almost the same as a critical value for p = 0.05 (+/- 1.96). Thus, for the Z-test, one can take either 2 σ or p=0.05. But this is only valid for the two-tailed Z-test, not for t-tests, F-tests, tests with other distributions, other p-values or one-sided tests.

For the test of the null hypothesis H0: ρ = 0, one finds two different tests in the relevant literature, namely:

(a) the t-test: the test statistic t = r / sqrt(1-r²) * sqrt(n-2) is, if the null hypothesis holds, t-distributed with n-2 degrees of freedom (df), and

(b) the Z-test: the test statistic Z = 1/2 * ln ((1+r)/(1-r)) = artanh(r) (the Fisher Z-transform) is, if the null hypothesis holds, approximately normally distributed with standard deviation 1/sqrt(n-3), so that Z * sqrt(n-3) is approximately standard normal.

Disadvantages of (b): First, stronger preconditions are necessary (in (b) X- and Y-series must be stochastically independent and bivariate normally distributed, in (a) only the Y-series must be stochastically independent and univariate normally distributed), and second, the Fisher-Z-transformation yields only an approximate normal distribution, where the approximation is worse the smaller n is, while the t-distribution is exact. The Z-test is therefore less accurate.

Therefore, to test the null hypothesis H0: ρ = 0, the t-test is used today and no longer the Z-test.
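The two test statistics in (a) and (b) can be sketched numerically. This is a minimal illustration in Python, not code from the manuscript under review:

```python
import math

def corr_t_stat(r, n):
    """(a) t statistic for H0: rho = 0; t-distributed with n - 2 df under H0."""
    return r / math.sqrt(1.0 - r * r) * math.sqrt(n - 2)

def corr_z_stat(r, n):
    """(b) Fisher Z statistic; artanh(r) has standard error ~ 1/sqrt(n - 3),
    so the scaled statistic is approximately N(0, 1) under H0."""
    return math.atanh(r) * math.sqrt(n - 3)
```

For r = 0.3 and n = 102, for instance, the two statistics come out close to each other (about 3.14 and 3.08), which illustrates why the two tests agree well for large samples.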

However, the Z-test, despite its disadvantages, still has its justification in the following cases:

Another null hypothesis about ρ is to be tested, e.g., H0: ρ = c (comparison of a correlation with a fixed non-zero value), or H0: ρ_1 = ρ_2 (comparison of two pairwise correlations from three or four data series). For these cases, no test with an exact distribution has yet been developed. Also, the correction of n for autocorrelations in Afyouni et al., 2019, is based on the Z-test.
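The second of these cases, H0: ρ_1 = ρ_2 for two correlations from independent samples, can be sketched as follows. This is an illustrative Python sketch of the standard Fisher-Z comparison; the dependent-sample variants mentioned above require modified standard errors and are not shown:

```python
import math
from statistics import NormalDist

def compare_correlations(r1, n1, r2, n2):
    """Two-tailed p-value for H0: rho_1 = rho_2 (independent samples), using
    the Fisher Z-transform with standard error sqrt(1/(n1-3) + 1/(n2-3))."""
    z1, z2 = math.atanh(r1), math.atanh(r2)
    se = math.sqrt(1.0 / (n1 - 3) + 1.0 / (n2 - 3))
    return 2.0 * (1.0 - NormalDist().cdf(abs(z1 - z2) / se))
```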

Thus, if the authors want to take the Z-test for the latter reason, they can do so, but then they should determine the critical value for p = 0.05 (+/- 1.96) and not take +/- 2 σ = +/- 2. Reason: consistency within the manuscript. In other places in the same manuscript the authors use p-values and not multiples of sigma. Presumably they got the 2 σ from the old Kenny (1979).  

B) Formulas (2) and (3), line 139-140:

Formulas (2) and (3) define c_xy(k) twice, with two different meanings. This is a popular mistake of first-semester students or of those without an understanding of mathematical notation.

C) Line 180-182:

"There are several formulas suggested in scientific literature, but all of them rely on the assumption for uncorrelated (but auto-correlated) records. This requirement is not met by our time series, because they covariate in time ..."

This shows that the authors have not understood the principle of hypothesis testing. The distribution of a test statistic (i.e. a random variable t, Z, F etc.) is always set up under the assumption of the null hypothesis, in this case just "uncorrelated records". If the present value or a more extreme one in the distribution is very improbable, one rejects the H0 and speaks of significance.

Therefore it is nonsense when they write that they cannot use the formula because their time series covary! (Theoretically, it could be that they do not covary, and the detected covariation/correlation is just a coincidence ...).

D) Lines 213-214 and Table 1

"the confidence of the inferred ozone influence on the Nino3.4 index is tested using two statistical tests (i.e. the Z-test and the t-test ..."

This cannot be done in this way. Which test should then apply, and how are the resulting p-values to be reconciled with each other? One must decide on a single test.

E) Lag optimization - perhaps the most severe error (multiple testing)

The authors make the mistake here that they first look for the best r of all lags and then test only this r. This is wrong. Rather, the Bonferroni correction should be applied in this case. A numerical example: the best r is searched for 20 lags, and the best found r = 0.3 has the p value p = 0.05. However, the Bonferroni correction results in p = 0.05/20 = 0.0025. Therefore, the required r for this p must have a much higher value than 0.3!
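The numerical example can be reproduced in a short Python sketch. It is illustrative only; n = 43 is an assumed sample size, roughly the one at which r = 0.30 attains p = 0.05 under the two-tailed Fisher Z approximation used here:

```python
import math
from statistics import NormalDist

def r_critical(n, alpha):
    """Smallest |r| significant at level alpha (two-tailed Fisher Z test)."""
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.tanh(z_crit / math.sqrt(n - 3))

n, m, alpha = 43, 20, 0.05
print(round(r_critical(n, alpha), 2))      # uncorrected threshold, about 0.30
print(round(r_critical(n, alpha / m), 2))  # Bonferroni-adjusted: clearly larger
```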

To summarize: The problem is twofold: you would need a very long time to produce proper statistics. Moreover, it is not even certain that there is any evidence of significance at all for the very complex topic under consideration. Monte Carlo calculations would be a possible solution, but again they are difficult and their computation time may be too long. Therefore, as in my first review, I recommend that you dispense with significance testing altogether.

I recommend publication of the paper in this case because I think it is of high interest. I will leave it to the editor to publish the paper even if you want to stick with flawed statistics. A reason, but not an excuse, for this is a paper by Wilks in the Bulletin of the American Meteorological Society, December 2016, where he writes: "… only 3 of 281 papers considered the effects of multiple hypothesis testing on their scientific conclusions… Consequences of the widespread and continued failure to address the issue of multiple hypothesis testing are overstatement and overinterpretation of the scientific results, to the detriment of the discipline."

Author Response

Dear Editor,

Dear Reviewer 1

 

Thank you very much for the careful consideration of the manuscript “From phase transition to interdecadal changes of ENSO, altered by the lower stratospheric ozone” submitted for publication in Remote Sensing journal. The authors confirm that neither the manuscript nor any parts of its content are currently under consideration or published in another journal.

Below you will find our answers to Reviewer 1's critical comments and suggestions:

 

A) 2-sigma significance test (line 146):

In our previous answer to the reviewer's criticism we argued for the equivalence of the statements "statistically significant at the 2σ level" and "statistically significant at the α = 0.05 level".

Reviewer 1's subsequent comment is as follows:

"Thus, for the Z-test, one can take either 2σ or p = 0.05. But this is only valid for the two-tailed Z-test, not for t-tests, F-tests, tests with other distributions, other p-values or one-sided tests."

We answer this comment with a quotation from the Wikipedia article "t-statistic":

“The t-statistic is used in a t-test to determine whether to support or reject the null hypothesis. It is very similar to the Z-score but with the difference that t-statistic is used when the sample size is small or the population standard deviation is unknown.” 

Furthermore, Reviewer 1 compares the two tests, the t-test (a) and the Z-test (b), which are commonly used to assess the statistical significance of correlation coefficients. According to his/her understanding:

"Disadvantages of (b) (i.e., the usage of the Z-test): First, stronger preconditions are necessary (in (b) X- and Y-series must be stochastically independent and bivariate normally distributed, in (a) only the Y-series must be stochastically independent and univariate normally distributed) …"

It is difficult to follow the logic of this comment, but the assumptions of independence of the compared time series and of bivariately normally distributed sample pairs (Xi, Yi) apply to both tests.

 

"… second, the Fisher-Z-transformation yields only an approximate normal distribution, where the approximation is worse the smaller n is, while the t-distribution is exact."

Actually, our time series contain more than 100 values, which are considered large samples; in this case the theory recommends the use of the Z-test.

Furthermore, the reviewer concludes:

"The Z-test is therefore less accurate. Therefore, to test the null hypothesis H0: ρ = 0, the t-test is used today and no longer the Z-test."

We answer this comment of Reviewer 1 with another quotation, from the Wikipedia article "Z-test":

"Because of the central limit theorem, many test statistics are approximately normally distributed for large samples. Therefore, many statistical tests can be conveniently performed as approximate Z-tests if the sample size is large or the population variance is known. If the population variance is unknown (and therefore has to be estimated from the sample itself) and the sample size is not large (n < 30), the Student's t-test may be more appropriate."

This excerpt shows that the Z-test is not less accurate; the choice between the Z-test and the t-test depends mainly on the size of the analysed sample(s) and on whether the population variance is known.

Furthermore, Reviewer 1 insinuates that the method elaborated by Afyouni, Smith and Nichols (2019) is not accurate because it is based on the Z-test. He/she continues: "Thus, if the authors want to take the Z-test for the latter reason, they can do so, but then they should determine the critical value for p = 0.05 (+/- 1.96) and not take +/- 2 σ = +/- 2. Reason: consistency within the manuscript. In other places in the same manuscript the authors use p-values and not multiples of sigma. Presumably they got the 2 σ from the old Kenny (1979)."

We are not going to advocate for any of the above-mentioned authors (affiliated with world-recognized institutions such as the Oxford Big Data Institute, University of Oxford, UK; the Department of Statistics, University of Warwick, UK; etc.). We must simply note that we used a Z-table and a table of critical values of Pearson's correlation coefficient, with a predefined α = 0.05 in a two-tailed test, in order to refine the statistical significance of our results derived from smoothed (i.e., serially correlated) time series. So Reviewer 1's worry that we "presumably got the 2 σ from the old Kenny (1979)" is unreasonable, because:

  • the input to the Z-table is the Z-value, calculated by formula (5) in our manuscript, and the output is a p-value, which is then compared with the predefined value of 0.05;
  • the inputs to the table of critical values of Pearson's correlation coefficient are (i) the degrees of freedom and (ii) the predefined value α = 0.05, while the output is the critical value of the correlation coefficient.

So Reviewer 1's doubt that we may have applied the significance tests improperly is speculative.
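The first of the two lookups can be sketched as follows. This is a minimal illustration only, since the manuscript's formula (5) for the Z-value is not reproduced here:

```python
from statistics import NormalDist

def two_tailed_p(z):
    """Two-tailed p-value for a given Z-value, as read from a Z-table."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))
```

A Z-value of 2.0 then gives p ≈ 0.0455 < 0.05, while a Z-value of 1.5 gives p ≈ 0.134 and would not pass the predefined threshold.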

B) Formulas (2) and (3), line 139-140:

Defined c_xy(k) twice with two different meanings. Popular mistake of students in the first semester or without understanding of mathematical notation.

This comment indicates that Reviewer 1 is not familiar with lagged correlation analysis (also known as distributed-lags analysis). In our experience with peer reviewers, this is not the first time we have had to explain the principles of the lagged correlation method, so we will do our best to clarify its essence for our reviewer.

Suppose we are looking for similarities in the temporal variations of the variables X(X1, X2, …, XN) and Y(Y1, Y2, …, YN) within the period t1, t2, …, tN. Since the lagged correlation coefficients are not symmetrical around zero lag, and depend on which variable is chosen as the independent one, both possibilities are investigated. For this reason, the variable X is moved first backward and then forward along the time axis. In the first case, the variable Y is the dependent one, and the covariance cxy(k) is determined by formula (2), where k is the time lag (i.e., the number of time steps by which the two variables are shifted relative to each other). For example, the correlation coefficient for k = -1 is calculated by moving the X variable backward by one time step, that for k = -2 by moving X backward by two time steps, and so on.

Secondly, the variable X is moved forward; in this case Y is the independent and X the dependent variable, and the covariance cxy(k) is calculated by formula (3). The time lags can therefore be positive or negative, depending on which of the two variables is chosen to be moved (first backward and then forward).

cxy(k) = (1/N) Σ_{t=1}^{N+k} (X_{t-k} - X̄)(Y_t - Ȳ),          for k = -1, -2, … -(N-1)          (2)

cxy(k) = (1/N) Σ_{t=1}^{N-k} (X_t - X̄)(Y_{t+k} - Ȳ),          for k = 0, 1, 2, … N-1          (3)

We hope that these explanations make clear why the cross-covariance function is calculated by two different formulas.
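The shifting described above can be sketched in a few lines of Python. This is an illustration under one possible sign convention, not the code used for the manuscript:

```python
def lagged_corr(x, y, k):
    """Pearson correlation between x shifted by k time steps and y.
    Negative k moves x backward, positive k moves it forward."""
    if k < 0:
        xs, ys = x[-k:], y[:k]      # x moved backward by |k| steps
    elif k > 0:
        xs, ys = x[:-k], y[k:]      # x moved forward by k steps
    else:
        xs, ys = x[:], y[:]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    vx = sum((a - mx) ** 2 for a in xs)
    vy = sum((b - my) ** 2 for b in ys)
    return cov / (vx * vy) ** 0.5
```

With a series y that lags x by one time step, the coefficient peaks at k = 1 under this convention; the two series themselves remain unchanged and are only shifted relative to each other.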

C) Line 180-182:

Reviewer 1 starts with a quotation from our second manuscript revision, which reflected the comments and suggestions of the other reviewers: "There are several formulas suggested in scientific literature, but all of them rely on the assumption for uncorrelated (but auto-correlated) records. This requirement is not met by our time series, because they covariate in time ..."

This shows that the authors have not understood the principle of hypothesis testing. The distribution of a test statistic (i.e. a random variable t, Z, F etc.) is always set up under the assumption of the null hypothesis, in this case just "uncorrelated records". If the present value or a more extreme one in the distribution is very improbable, one rejects the H0 and speaks of significance.

Therefore it is nonsense when they write that they cannot use the formula because their time series covary! (Theoretically, it could be that they do not covary, and the detected covariation/correlation is just a coincidence ...).

The answer: We must stress that choosing the right formula, one able to estimate the significance of the cross-correlation between serially correlated records, is a complicated problem that cannot simply be deduced from the principles of hypothesis testing. We had to show how much the reduced degrees of freedom of our smoothed time series inflate the cross-correlation between the two records (a point noticed by the other two reviewers).

A thorough analysis of the scientific literature on this problem shows that most attempts at its solution were developed for the analysis of small samples, for which the t-test is suitable for hypothesis testing. These methods focus on recalculating the degrees of freedom and determining the critical values of the correlation coefficient above which it is statistically significant (for example, by using the table of critical values of Pearson's correlation coefficient).

The other approach, described by Afyouni, Smith and Nichols (2019), is oriented toward large samples, for which the Z-test is more appropriate. Besides the autocorrelation functions of each variable, this method takes into account the influence of the zero-lag cross-correlation coefficient (between the analysed samples) on the cross-correlation variance and, consequently, on the Z-value. The Z-value is then used to determine the p-value of the cross-correlation coefficient, indicating its significance or insignificance.

We decided to use both methods, one based on the calculation of the effective degrees of freedom and the other on the Z-test, which possibly was not understood by the reviewer. What we found is that our correlation coefficients are statistically significant in the regions of the strongest covariance between the two time series.
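The effective-degrees-of-freedom idea can be illustrated with a much-simplified, Bartlett-type sketch. This is not the Afyouni, Smith and Nichols (2019) estimator (which additionally involves the zero-lag cross-correlation), only a toy version showing why smoothing shrinks the effective sample size:

```python
def autocorr(x, k):
    """Lag-k sample autocorrelation of the series x."""
    n = len(x)
    m = sum(x) / n
    num = sum((x[t] - m) * (x[t + k] - m) for t in range(n - k))
    den = sum((v - m) ** 2 for v in x)
    return num / den

def effective_n(x, y, max_lag=10):
    """Bartlett-type approximate effective sample size for corr(x, y):
    n / (1 + 2 * sum over lags of the product of the two autocorrelations)."""
    s = sum(autocorr(x, k) * autocorr(y, k) for k in range(1, max_lag + 1))
    return len(x) / (1.0 + 2.0 * s)
```

Smoothing two independent noise series with a 5-point running mean typically cuts this approximate effective sample size to a fraction of the nominal one, which is exactly why a significance test on smoothed series needs such a correction.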

D) Lines 213-214 and Table 1

The reviewer comments on our second revision:

"the confidence of the inferred ozone influence on the Nino3.4 index is tested using two statistical tests (i.e. the Z-test and the t-test ..."

This cannot be done in this way. Which test should then apply, and how are the resulting p-values to be reconciled with each other? One must decide on a single test.

 

Answer: Unfortunately, we were not able to discern the meaning of this reviewer comment.

E) Lag optimization - perhaps the most severe error (multiple testing)

The authors make the mistake here that they first look for the best r of all lags and then test only this r. This is wrong. Rather, the Bonferroni correction should be applied in this case. A numerical example: the best r is searched for 20 lags, and the best found r = 0.3 has the p value p = 0.05. However, the Bonferroni correction results in p = 0.05/20 = 0.0025. Therefore, the required r for this p must have a much higher value than 0.3!

This comment of Reviewer 1 is obviously related to the fact that he/she is not familiar with lagged correlation analysis. The Bonferroni correction is applied in the case of multiple statistical testing (i.e., when several statistical inferences are tested simultaneously). For example, if one tries to find out how the variations of X influence the variations of Y, and whether the effect is statistically significant, one changes X and then tests which of the Y responses are statistically significant.

Our goal is much simpler: we are interested in the similarity of the temporal variations of the two variables (X and Y). The cross-correlation technique indicates the existence of a linear similarity, and we are testing a single hypothesis: is the cross-correlation coefficient statistically different from zero.

In lagged correlation analysis both time series remain unchanged; they are only shifted relative to each other in time. The goal is to detect a possible delay in the dependent variable's response to the implied forcing by the independent variable. So the Bonferroni correction is entirely out of place in this case.

 

In conclusion: The authors of this manuscript are ready to respond to all comments and criticisms of the reviewers. However, the absolute inconsistency of Reviewer 1's comments is puzzling: in each subsequent round he/she discovers new errors that were not pointed out in his/her previous reviews.

Moreover, the last report of Reviewer 1 mentions that a lot of effort and improvements in the statistics have been made, but the subsequent remarks completely negate these improvements.

On the other hand, the urgent recommendation to abandon the claims of statistical significance of our results is acceptable neither to us nor to the other two reviewers.

Despite the differing points of view with Reviewer 1, the authors are grateful for the careful consideration of our manuscript, which helped us realize that special care is required to prove the reliability of correlations between smoothed time series. Regardless of the editor's decision to publish or reject our article, it has been significantly improved thanks to the criticism of our reviewers.

 

With best regards:

On behalf of all authors:

Natalya Kilifarska

 

References:

Afyouni, S., Smith, S.M., Nichols, T.E., 2019. Effective degrees of freedom of the Pearson’s correlation coefficient under autocorrelation. NeuroImage 199, 609–625. https://doi.org/10.1016/j.neuroimage.2019.05.011

Author Response File: Author Response.pdf

Reviewer 2 Report

In the new version of the paper, the authors answered satisfactorily my questionings by providing statistical arguments to prove the robustness of their results. I do not have further questions.

Author Response

Reviewer 2 has no more questions.

Reviewer 3 Report

The authors have put in a tremendous amount of work to address the issues raised by me and the other reviewers. I am very impressed with the lengths the authors went to place their findings into proper context when considering the impact of smoothing on statistical significance. 

Author Response

Reviewer 3 has no further questions.

Round 2

Reviewer 1 Report

The statistics are still insufficient. Nevertheless, as I said in my last review, I will not reject the work because its results are too interesting. I therefore leave the decision to the editor. 

This manuscript is a resubmission of an earlier submission. The following is a list of the peer review reports and author responses from that submission.

