Tutorial

Inferential Statistics Is an Unfit Tool for Interpreting Data

Division of Speech and Language Pathology, Department of Clinical Science, Intervention and Technology, Karolinska Institutet, SE-141 86 Stockholm, Sweden
Appl. Sci. 2022, 12(15), 7691; https://doi.org/10.3390/app12157691
Submission received: 14 June 2022 / Revised: 22 July 2022 / Accepted: 26 July 2022 / Published: 30 July 2022
(This article belongs to the Special Issue Current Trends and Future Directions in Voice Acoustics Measurement)

Abstract

Null hypothesis significance testing is a commonly used tool for making statistical inferences in empirical studies, but its use has always been controversial. In this manuscript, I argue that an even more problematic practice is that significance testing, and other abstract statistical benchmarks, are often used as tools for interpreting study data. This is problematic because interpreting data requires domain knowledge of the scientific topic and sensitivity to the study context, something that significance testing and other purely statistical approaches lack. By using simple examples, I demonstrate that researchers must first use their domain knowledge—professional expertise, clinical experience, practical insight—to interpret the data in their study and then use inferential statistics to provide some reasonable estimates about what can be generalized from the study data. Moving beyond the current focus on abstract statistical benchmarks will encourage researchers to measure their phenomena in more meaningful ways, transparently convey their data, and communicate their intellectual reasons for interpreting the data as they do, a shift that will better foster a scientific forum for cumulative science.

1. Introduction

The American Statistical Association recommends that researchers stop using null-hypothesis statistical significance testing (NHST) and move beyond p-values when analyzing and reporting data [1,2]. The special volume on p-values in The American Statistician (volume 73, issue suppl. 1) is a recent interjection in a century-old debate [3], but has quickly had an impact on several journals’ guidelines on statistical reporting [4,5] and was itself influenced by other forerunner journals [6]. Some key arguments against NHST as a tool for making inferences are that statistical significance testing does not give you the probability that your hypothesis is correct [7,8], NHST obscures uncertainty and the quantitative nature of empirical data (it leads to “dichotomania”) [9], NHST does not summarize the evidential value the data carry for, or against, the researcher’s hypothesis [10,11,12], the assumptions underlying NHST are often known to be wrong a priori [13], and NHST triggers questionable research practices [14,15,16]. Here I will advance another concern with the usage of NHST, namely that NHST and other abstract statistical benchmarks are often used as tools for interpreting data.
Researchers need to separate the acts of interpreting data and making statistical inferences from the data. A researcher who has collected empirical data has two different considerations in front of them: (1) “What is the result in this study sample?”—they must interpret how the data in their sample meaningfully relate to their research question, and (2) “What can be generalized from the result of this study?”—they must describe what can be inferred from the sample to a broader population. Even defenders of NHST as a tool for making inferences will surely agree that NHST was never meant to be used as a tool for interpreting data. This is because interpreting data requires domain knowledge—expertise in the scientific topic, clinical experience, practical insight—and an understanding of the study context, something that is not incorporated into NHST, which Fisher and other early developers of NHST, of course, were aware of [3,8].

2. When Inferential Statistics Is Not Helpful

Raw data are seldom straightforward. Researchers must collate, describe, and generally “make sense” of the sample data so that it relates to the research question in a meaningful way. For example, in a study evaluating a new measurement procedure, researchers must interpret the sample data to understand whether the instrument successfully measures the phenomena of interest or not. In a study investigating the relationship between background noise and obesity, researchers must interpret the sample data to understand if the relationship is something noteworthy or not. In all empirical fields and in all study designs, researchers must measure the phenomenon of interest in a way that allows them to apply their domain knowledge to understand the result. The main example here will be that in a study evaluating a clinical intervention, researchers must interpret the individual data of the patients to understand whether, and for whom, the intervention was effective, and to what degree [17,18,19].
An understanding of research methods and an application of descriptive statistics can help researchers decode data and openly communicate it to others. For example, boxplots convey the inter-individual dispersion in a sample, graphs of individual treatment trajectories illustrate treatment outcomes on an individual level, and measures of test–retest reliability can explicate the robustness of the individual treatment outcomes [20] and the precision of the measures [21]. Clearly, an insight into research methods and descriptive statistics is practical when interpreting data.
However, after looking at the graphical presentations and various summary statistics, the researcher must still use their intellect to interpret what they see in order to meaningfully relate the results in the sample to the research question. The problem I want to highlight is that at this juncture in the analysis, inferential statistics have sometimes been used as a tool for interpreting the sample data by mechanically labeling the study result as providing a positive or negative answer to the research question. In general, statistically significant results have often been confused with a difference being important, a relationship being interesting, a treatment outcome being beneficial, and similarly [7,8,14,22]. However, what a result means and whether it is important or interesting is context sensitive and can only be evaluated in the light of domain knowledge. Abstract statistical procedures are blind to such things.

3. Two Examples of Interpreting fo Data in a Voice Training Program

To illustrate the difference between obtaining a statistically significant test result and interpreting data, my example will be an intervention study evaluating a voice training program aimed at helping transgender persons assigned male at birth, referred to here as transwomen, to have a voice more in line with their preferences and gender identity. To make the example clear, participants in this example study are recruited because they explicitly want their voice to have a fundamental frequency (the acoustic correlate of pitch) in line with typically female-sounding voices. To fairly evaluate such a training program, we would need to take many dimensions into account, not least the participants’ expectations of the program and their satisfaction with the outcome. Here, the point is not that many dimensions are needed to comprehensively evaluate an intervention. The point is just to contrast interpreting data in a purely statistical manner with interpreting data using domain knowledge. For this reason, I seek a single, simple outcome measure with which I expect most readers to have some experience. Thus, in this example, I will focus only on evaluating the training program vis-à-vis the fundamental frequency of the patients’ voices, fo. Although including a (randomized) control group or at least repeated measures is necessary for strong internal validity in intervention studies [23], for simplicity, I will compare only a measure taken immediately before training and another measure taken directly after training for one group of participants. The research question is whether the training program is effective at helping transwomen to have a fundamental frequency more in line with typically female-sounding voices.
In this made-up example, eight transwomen participated in the training program. The inter-individual variability and distribution of fo measured before and after training are illustrated in boxplots in Figure 1, panel A. Individual treatment trajectories as they relate to reference values of typically female-sounding voices are illustrated in Figure 1, panel B. Table 1 also describes the individual data in the sample. We can see that the participants, on average, increased in fundamental frequency and that seven of the eight participants increased in their fundamental frequency. Seeing all this, the researcher must now interpret whether the data support the training as effective or not. Unfortunately, what some researchers do at this juncture is to apply a statistical significance test to interpret the data. The result of a t-test applied to the pre–post mean difference is t(7) = 3.5, p = 0.01. As the p-value is below 0.05 (the set α level), the training effect is statistically significant. In some research projects, the intervention would thus be interpreted as effective because the training effect was statistically significant [17,18,19]. However, if we instead apply our domain knowledge—what we know about fundamental frequencies, what we know about typically female- and male-sounding voices [24,25,26]—to the data and research context, it is clear that none of the transwomen had any relevant changes in their fo. A relevant change would be if a person’s fo moved toward 180 Hz. The intervention was clearly not effective, whatever the p-value might be.
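To make the contrast concrete, here is a minimal sketch in Python (assuming NumPy and SciPy are available) that reproduces the t-test above from the Table 1 data and then applies the domain-knowledge criterion instead:

```python
# Minimal sketch: paired t-test on the Example 1 data (Table 1), followed by
# the domain-knowledge check of whether anyone approached 180 Hz.
import numpy as np
from scipy import stats

pre = np.array([145, 135, 125, 115, 130, 120, 110, 140])   # fo in Hz before training
post = np.array([152, 140, 129, 118, 133, 122, 111, 139])  # fo in Hz after training

t_stat, p_value = stats.ttest_rel(post, pre)  # paired-samples t-test, df = 7
print(f"t(7) = {t_stat:.1f}, p = {p_value:.2f}")  # t(7) = 3.5, p = 0.01

# Domain-knowledge criterion: did anyone reach the region around 180 Hz that
# is associated with typically female-sounding voices?
print("Participants at or above 180 Hz after training:", int(np.sum(post >= 180)))  # 0
```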
To clarify the statistical significance test, the t-test formula is $t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$, where $\bar{x}$ is the sample mean, $s$ is the sample standard deviation, $n$ is the sample size, and $\mu$ is the population mean. In this example, $\bar{x}$ and $s$ refer to the mean and standard deviation of the intra-individual difference scores. $s$ and $n$ are used to describe the dispersion in a hypothetical distribution of samples drawn from the hypothetical population described by the null hypothesis (via $\mu$). The t-value describes the distance between the hypothetical population mean and the obtained sample mean, using the standard error of the sampling distribution as its unit ($s/\sqrt{n}$). The p-value is the tail area of the hypothetical sampling distribution delimited by the t-value. The p-value describes the hypothetical probability of obtaining the sample data, or something more extreme, if we assume that the sample is drawn from the hypothetical population described by the null hypothesis. All those hypotheticals are all there is to the statistical test. There is nothing in the t-test’s formula (or those of ANOVAs, linear or logistic regressions, mixed models, etc.) that describes what the values mean or what the study context is. Nothing specifies that the values are measuring fo, what fo values are typical for male- or female-sounding voices, what the patients’ goals were, etc. Thus, the statistical tests are completely blind to domain knowledge and the study context. A statistical test is not helpful for interpreting data because “it” does not “understand” the research question. Interpreting data is simply not what statistical significance testing is about.
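As a worked illustration, plugging the Example 1 difference scores from Table 1 (mean 3.0 Hz, standard deviation 2.4 Hz, n = 8, and $\mu = 0$ under the null hypothesis of no mean pre–post change) into the formula gives

$$ t = \frac{\bar{x} - \mu}{s/\sqrt{n}} = \frac{3.0 - 0}{2.4/\sqrt{8}} \approx 3.5, $$

which matches the reported t(7) = 3.5 and, with 7 degrees of freedom, a two-tailed p-value of about 0.01.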
Further, p-values are unhelpful at answering even the more basic question of “is there an effect?”, as compared to the more sophisticated question of “is the effect of a meaningful size?”. This is because a small p-value only signals that the data do not conform with at least one of the many theoretical, auxiliary, statistical, and inferential assumptions underlying its calculation. There is no reason to assume that it is the null-hypothesis assumption that is the culprit, especially as we often know a priori that most of the many different assumptions included in the model are false to begin with [13].
To further illustrate the mismatch between statistical significance and actual data interpretation, we can use the same example study but with different example data (Figure 2 and Table 2). We can see that the inter-individual differences in treatment trajectories are large; some participants had almost no change in fo, whereas others changed quite a bit. The result of a t-test applied to the pre–post mean difference is t(7) = 1.8, p = 0.12. As the p-value is above 0.05 (the set α level), the training effect is not statistically significant. At this juncture, the training would sometimes be interpreted as not effective because the training effect was not statistically significant. However, what is striking—in the light of domain knowledge and research context—is that some of the transwomen (3/8 = 38%) did have relevant changes in their fo in the sense that they reached fo levels typically associated with female-sounding voices. Based on our domain knowledge, these data do support the training as effective for some transwomen in this (example) study, whatever the p-value might be.
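The same kind of sketch for the Example 2 data (Table 2) shows the reverse mismatch: the t-test is not significant even though, by the domain-knowledge criterion, three of the eight participants reached fo levels typically associated with female-sounding voices. The 175 Hz cut-off below is only an illustrative assumption; deciding which individual changes count as relevant is exactly the kind of judgment that requires domain knowledge rather than a mechanical rule:

```python
# Minimal sketch: paired t-test on the Example 2 data (Table 2) versus a
# simple count of individually relevant changes.
import numpy as np
from scipy import stats

pre = np.array([145, 135, 125, 115, 130, 120, 110, 140])
post = np.array([175, 180, 185, 116, 124, 116, 113, 140])

t_stat, p_value = stats.ttest_rel(post, pre)
print(f"t(7) = {t_stat:.1f}, p = {p_value:.2f}")  # t(7) = 1.8, p = 0.12 (not significant)

# Illustrative (assumed) cut-off for having reached fo levels typically
# associated with female-sounding voices; 3 of 8 participants qualify here.
reached = int(np.sum(post >= 175))
print(f"Participants with relevant individual changes: {reached}/8")
```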

4. An Example of Interpreting Visual Analog Scale Data in a Voice Training Program

Because many readers are familiar with fo and could use their domain knowledge to interpret the data in the previous examples, the inappropriateness of relying on the statistical significance tests was hopefully clear. However, and importantly, a statistical significance test is equally inappropriate as a tool for interpreting data in situations where we do not have or cannot use our domain knowledge. To continue with the voice training program example, let us say that the participating transwomen also rated their voice satisfaction on a Visual Analog scale (VA-scale). The question was “Are you satisfied with your voice?” and the answering scheme was to put a mark on a line going from No at 0 to Yes at 100. The participants answered this question before and after training, and the example data are illustrated in Figure 3 and Table 3. As before, the researcher must now interpret the data based on their domain knowledge. However, because of the nature of the outcome measure, it is much more difficult to do so in this example. Looking at the figures and tables, something happened with the ratings, but what? Can we be sure that any of the participants are satisfied with their voice after training? Do we know if a rating of 50 on the VA scale is where responses go from unsatisfied to satisfied? Can we be sure that the different participants use the scale in similar ways, or might different VA scale values mean different things for different raters?
In cases such as this, where we are not able to apply our domain knowledge, the risk that we fall into the trap of interpreting data based solely on statistical tests is much greater. For example, a researcher might interpret the training as effective simply because the statistical test applied to the VA data is statistically significant (t(7) = 8.5, p < 0.0001). However, a statistical test does not evaluate how the data meaningfully relate to the research question and cannot be used as alchemy to convert unclear data into certainties [27]. If we cannot apply our domain knowledge to interpret the data because we are too unfamiliar with the measuring procedure (or, perhaps, because the measuring procedure is sub-optimal), that is the first and fundamental problem. No statistical test can plaster over that.

5. When Inferential Statistics Can Be Helpful

When researchers have applied their domain knowledge to interpret the data in their study sample and are satisfied with understanding what it means, they often also want to make some generalizations based on the result. In the second example on fo (Figure 2 and Table 2), 38% (3 out of 8) of the participants had relevant changes in fo (reached fo frequencies around 180 Hz), and in this section, I will use “benefitted” as shorthand for those individual treatment outcomes. The study result suggests—given that we accept the representativeness of the sample, the validity of the outcome measures, the study methodology, my interpretation of the data based on my domain knowledge, and so on—that among a future group of similar transwomen going through a similar training program, some of those individuals will also benefit (reach fo frequencies of around 180 Hz). However, it is improbable that exactly 38% of future transwomen will benefit. Can we give a reasonable range of possible future outcomes based on this study? Answering this question is where some types of inferential statistics can be helpful.
However, before tackling inferential statistics, researchers should think hard about whether their study methodology, recruitment procedure, and sample composition are such that it makes sense to try to generalize the study results. The default in many research fields is to report inferential statistics even when it is obvious that the sample is not representative or that the circumstances will never exist outside of the study. It would be preferable if researchers separated cases where a study result is interesting for theoretical reasons from cases where a study is interesting because its result is to some degree generalizable, and only reported inferential calculations after having decided that they contribute meaningfully.
A meaningful application of inferential statistics is to give some reasonable ranges of what can be generalized from the study sample, when it is appropriate to do so. A confidence interval (and its cousins, likelihood intervals and credible intervals) illustrates what values outside the sample are most compatible with the data, given the statistical assumptions used to compute the interval [28]. To exemplify, we can calculate a confidence interval around the proportion of participants who “benefitted” in example 2, namely three out of eight participants. This can be achieved with different degrees of sophistication, but for simplicity, we can calculate a Wilson confidence interval to be 95% CI [0.14, 0.69]. This interval (given all assumptions underlying the interval) illustrates that the data are compatible with somewhere between 14% and 69% of similar patients going through similar training also benefiting (reaching fo frequencies around 180 Hz).
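For reference, a minimal sketch of this calculation in Python (assuming the statsmodels package is available):

```python
# Minimal sketch: Wilson 95% confidence interval for 3 "benefitted" out of 8.
from statsmodels.stats.proportion import proportion_confint

lower, upper = proportion_confint(count=3, nobs=8, alpha=0.05, method="wilson")
print(f"95% Wilson CI: [{lower:.2f}, {upper:.2f}]")  # approximately [0.14, 0.69]
```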
Researchers should try to discuss the complete interval compatible with their data and accept the uncertainty it contains [28]. Importantly, interpreting confidence intervals does not fundamentally change just because zero happens to be included in the interval. If the 95% CI in example 2 had been between 0% and 55%, we would proceed as before and say something to the effect of “If 0% of patients have ‘beneficial’ outcomes, the training is ineffective, and if 55% of patients have ‘beneficial’ outcomes, that is quite promising, and our data are equally compatible with both scenarios.” The range of reasonable parameter values does not collapse onto a point just because the confidence interval happens to overlap with zero. Using a confidence interval to declare something as “effective” or “significant” based on whether or not zero is included in the interval faces all the same issues of “dichotomania” as relying on p-values does [9,21].
To return to the main point, simply calculating a confidence interval does not by itself make the data meaningful. We can understand the confidence interval mentioned above (that somewhere between 14% and 69% of similar patients might benefit) because the numbers it is calculated on are meaningful; we understand what “3 out of 8 participants benefited from the training” means. But simply calculating a confidence interval does not make a dataset or measurement scale interpretable. We can calculate a confidence interval around the mean intra-individual difference on the VA-scale in example 3 to be 95% CI [21.6, 38.4 VA-scale units]. However, knowing this confidence interval—just as knowing the p-value—does not help us interpret the outcomes. Whatever the confidence interval is, we still do not know if any of the participants are satisfied with their voices or if any of them had relevant changes in their satisfaction. An understanding of the measurement scale is the first and fundamental issue that needs to be resolved before proceeding with inferential statistics.
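For completeness, a minimal sketch (again assuming SciPy) of how such a t-based interval around the mean VA-scale difference in Table 3 can be computed; as argued above, the number itself does not make the scale interpretable:

```python
# Minimal sketch: 95% CI around the mean pre-post difference on the VA scale
# (Table 3 difference scores: mean 30, SD 10, n = 8).
import numpy as np
from scipy import stats

diffs = np.array([20, 25, 30, 45, 45, 30, 25, 20])  # VA-scale units
ci_low, ci_high = stats.t.interval(0.95, df=len(diffs) - 1,
                                   loc=diffs.mean(), scale=stats.sem(diffs))
print(f"95% CI: [{ci_low:.1f}, {ci_high:.1f}] VA-scale units")  # ~[21.6, 38.4]
```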
Further, researchers must fear the overconfidence interval [21,28]. The confidence interval is based on the study data, the study design, the sampling procedure, the validity of the outcome measures, the statistical model, assumptions, and so on. Thus, the confidence interval gives a plausible range of population values only when we trust the study methodology and the representativeness of the sample. There is nothing in the mathematical formulae for calculating confidence intervals that reduces the uncertainty existing in the study to begin with; nothing in the calculations ensures that the population parameter is included in the interval at any degree of confidence [21]. If we believe that the study sample is not representative of the population or that compliance in the study is unreasonably high compared to clinical practice, for example, then we should not believe that the confidence interval gives a reasonable range of possible future outcomes. The Limitations section of the Discussion in a manuscript should aim to put the confidence in the confidence interval into a reasonable perspective.
Finally, any single study result might be spurious and come from random sampling or measurement noise [29]. Traditional statistical analyses—e.g., a p-value—do not give you the probability that a result is spurious or not [1,7,21]. This is so because a p-value describes the hypothetical probability of obtaining the data if we assume the null hypothesis. A p-value does not describe the probability of the null hypothesis being true or the probability of any particular cause being the one producing the data (e.g., random chance). Similarly, calculating confidence intervals does not ensure that the parameter of interest is inside the interval at any degree of confidence, and even high-precision studies may well miss the population parameter [21]. The best way to consider whether a result is likely to be spurious is to take your prior domain knowledge of the phenomenon into account: theoretically unexpected results are more likely to be spurious, and unreasonable hypotheses continue to be unreasonable even after supportive results in a single study [8]. The only way to test if a result is spurious or not is replication.

6. An Example with Real Data

The examples above have been about intervention studies, but there is nothing special about this kind of study design. Whatever the study question or design might be, researchers must measure the phenomena of interest in a way that allows them to apply their domain knowledge to understand the results. I turn now to apply the approach to interpreting data described above to some real data. In the interesting study by McNeill and colleagues [30], one thing the researchers investigated was the relationship between patient happiness and fo among transwomen. My personal prior guess regarding this relationship would be that transwomen who have a lower fo are generally less happy with their voices than transwomen with higher fo’s. However, McNeill et al. write in their abstract, “This study demonstrates that happiness with voice in male-to-female transgender clients is not directly related to F0.” This is a somewhat surprising claim. It is especially surprising considering that they report that in their sample, there was a positive correlation, r = 0.32, between the two variables. They further use an instructive scatterplot to illustrate the inter-individual dispersion in the two variables and what, to me, looks to be a positive correlation between fo and “Happiness with voice”.
As the correlation in their sample was r = 0.32, the researchers must have based their claim that happiness with voice is not related to fo on some other statistic. Indeed, their complete reporting was: ”It was not possible to demonstrate a statistically significant relationship between patient happiness and F0 (r = 0.32, p = 0.32)”. Thus, these researchers based their negative claim on p = 0.32 (which is more than 0.05, the assumed α level). That is, what these researchers happened to do (and what is quite common) is that they conflated the acts of interpreting data and making inferences from the data. The result of their study was that there was a positive correlation between fo and patient happiness with their voices (r = 0.32). Whether this correlation is large, important, or spurious is a question that requires the researchers to apply their domain knowledge, what they know about male-to-female transgender persons, what they understand of the rating scale involved, what they understand of correlation coefficients, and so on. This question cannot be resolved based on p-values, standard recommendations of effect sizes, or any other abstract statistical benchmark.
If we turn to making inferences from the data, the researchers should, according to the American Statistical Association [1,2], not focus on the p-value related to their correlation coefficient but on the confidence interval around their correlation coefficient. Calculating a confidence interval for a correlation coefficient can be achieved with different degrees of sophistication, but for simplicity, we can calculate a Fisher-transformed confidence interval around the McNeill et al. [30] result to be 95% CI [−0.31, 0.76]. This interval should be reported with something like “The correlation in our sample was r = 0.32, and our data are compatible with the population correlation being somewhere between −0.31 and 0.76 (95% CI). Note that our data are equally compatible with the population correlation being as large as 0.58 as with it being as small as 0.0. Our estimate is quite imprecise, as we had such a small sample.” Publishing small-scale studies is not a bad thing. However, we must accept the uncertainty involved in small-scale studies and encourage “meta-analytic thinking” [2]. McNeill and colleagues just provided one estimate of how large the population correlation might be (somewhere between r = −0.31 and r = 0.76), and now we should wait for more studies with perhaps more precise estimates.
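A minimal sketch of how such a Fisher-transformed interval can be computed in Python follows. The sample size is not stated in this section, so n = 12 below is an assumption, chosen only because it is roughly consistent with the interval reported above:

```python
# Minimal sketch: Fisher-transformed 95% CI for a correlation coefficient.
import numpy as np

r, n = 0.32, 12          # r from McNeill et al. [30]; n = 12 is an assumed value
z = np.arctanh(r)        # Fisher z-transform of r
se = 1 / np.sqrt(n - 3)  # standard error in z-space
lower, upper = np.tanh(z - 1.96 * se), np.tanh(z + 1.96 * se)  # back-transform
print(f"r = {r:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")  # approximately [-0.31, 0.76]
```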

7. Other Abstract Benchmark-Approaches

So far, I have argued that statistical significance testing should not be used as an objective decision rule for interpreting data. The same is true for other abstract benchmark approaches, such as standard labels for effect sizes or correlation coefficients being “large” or “small”. Such recommendations were made in the abstract and do not take domain knowledge or study context into account. “… don’t look for a magical alternative to [null hypothesis significance testing], some other objective mechanical ritual to replace it. It doesn’t exist.” [7].
To exemplify, the standardized mean difference (Cohen’s d) in the first example (Figure 1 and Table 1) is 1.25, a “large effect”, and in the second example (Figure 2 and Table 2), it is 0.64, a “medium effect”. However, when interpreting the data in light of domain knowledge, we saw that no one was helped in the first example and that three out of the eight patients had relevant changes in the second example. The standardized mean difference in the example with the VA-scale is 3.0, a “huge effect”. But calling the effect “large” or “huge” based on some archaic textbook does not help us interpret the data: it does not explain what a rating of 70 means or if any of the participants were satisfied with their voices.
The bar of scientific rigor would not be lowered if we stopped using abstract benchmarks to interpret data and instead relied on “subjective domain knowledge”; rather, quite the opposite holds [2,6]. For a particular interpretation to hold up to peer scrutiny, the sample data (not the results of statistical tests! [28]) would need to be communicated so transparently that reviewers and readers are able to use their domain knowledge to interpret the same data. Furthermore, researchers would need to communicate what background information and practical insight they based their interpretation on, so that peers can review and complement that process. Quantitative reasoning is, in essence, comparative, and researchers need to provide some valid reference values or other indications of how to interpret data. Simply reporting a measurement value or statistic in a vacuum is meaningless. Clear communication of the sample data and the intellectual reasons for interpreting the data in a particular way is a necessary basis for a scientific discussion of the results of a study. Such communication will form a basis for developing “inter-subjective”, agreed-upon standards for what can be considered a “clinically relevant change” or a “practically important result” in different research contexts, something we should be striving to achieve.

8. Conclusions

There are many arguments against using significance testing and other dichotomous statistical tests for making inferences [2]. However, there is an even clearer case against using context-blind statistical benchmarks for interpreting data. To interpret data, we need to have domain knowledge of the research topic and be able to apply that domain knowledge in the particular study context. Once we have used our domain knowledge to interpret the study data, then we can apply inferential statistics to give us some reasonable ranges of what can be generalized from the study results, at those times when it makes sense to generalize from the sample. A confidence interval is one way to summarize what population values are reasonably compatible with the study sample, given that we accept the statistical model, the research design, the sampling procedures, and the outcome measures, and find the venture of generalizing meaningful in the study context. Most often, we will realize that our studies provide imprecise estimates and that many studies and meta-analyses are necessary before we can say with any certainty what a particular effect or relationship might look like.
Moving beyond abstract statistical benchmarks for interpreting data will require more of researchers. It will require researchers (1) to use measurement scales that are transparent and understandable and for which we have some idea of the variable space (e.g., knowing “typical” or reference values and the size of measurement error), (2) to convey the study data transparently in such a way that critical readers can apply their domain knowledge to interpret the data, and (3) to clearly communicate what domain knowledge, insight into the study, and intellectual reasoning made the researchers interpret the data in the way they did. All of these practices will better foster a scientific forum for cumulative science.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

Thanks to all of my colleagues at the Division of Speech and Language Pathology, Karolinska Institutet, for your openness to new perspectives and for rekindling my interest in doing academic science. Big thanks to everyone who read, commented on, and in other ways improved this manuscript.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Wasserstein, R.L.; Lazar, N.A. The ASA Statement on P-Values: Context, Process, and Purpose. Am. Stat. 2016, 70, 129–133. [Google Scholar] [CrossRef] [Green Version]
  2. Wasserstein, R.L.; Schirm, A.L.; Lazar, N.A. Moving to a World Beyond “p < 0.05”. Am. Stat. 2019, 73, 1–19. [Google Scholar] [CrossRef] [Green Version]
  3. Good, I.J. Some Logic and History of Hypothesis Testing. In Philosophy in Economics; Pitt, J.C., Ed.; Springer: Dordrecht, The Netherlands, 1981; pp. 149–174. [Google Scholar] [CrossRef]
  4. Harrington, D.; D’Agostino, R.B.; Gatsonis, C.; Hogan, J.W.; Hunter, D.J.; Normand, S.-L.T.; Drazen, J.M.; Hamel, M.B. New Guidelines for Statistical Reporting in the Journal. N. Engl. J. Med. 2019, 38, 285–286. [Google Scholar] [CrossRef] [PubMed]
  5. Michel, M.C.; Murphy, T.J.; Motulsky, H.J. New Author Guidelines for Displaying Data and Reporting Data Analysis and Statistical Methods in Experimental Biology. Mol. Pharmacol. 2020, 97, 49–60. [Google Scholar] [CrossRef] [PubMed]
  6. Trafimow, D.; Marks, M. Editorial. Basic Appl. Soc. Psychol. 2015, 37, 1–2. [Google Scholar] [CrossRef]
  7. Cohen, J. The Earth is Round (p < 0.05). Am. Psychol. 1994, 49, 997–1003. [Google Scholar] [CrossRef]
  8. Nuzzo, R. Statistical errors: P values, the ‘gold standard’ of statistical validity, are not as reliable as many scientists assume. Nature 2014, 506, 150–152. [Google Scholar] [CrossRef] [Green Version]
  9. Greenland, S. Invited Commentary: The Need for Cognitive Science in Methodology. Am. J. Epidemiol. 2017, 186, 639–645. [Google Scholar] [CrossRef]
  10. Dienes, Z. Bayesian Versus Orthodox Statistics: Which Side Are You On? Perspect. Psychol. Sci. 2011, 6, 274–290. [Google Scholar] [CrossRef] [Green Version]
  11. Greenland, S. Null Misinterpretation in Statistical Testing and Its Impact on Health Risk Assessment. Prev. Med. 2011, 53, 225–228. [Google Scholar] [CrossRef]
  12. Sand, A.; Nilsson, M.E. Subliminal or not? Comparing Null-Hypothesis and Bayesian Methods for Testing Subliminal Priming. Conscious. Cogn. 2016, 44, 29–40. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Trafimow, D. A Taxonomy of Model Assumptions on Which P Is Based and Implications for Added Benefit in the Sciences. Int. J. Soc. Res. Methodol. 2019, 22, 571–583. [Google Scholar] [CrossRef]
  14. Amrhein, V.; Korner-Nievergelt, F.; Roth, T. The Earth is Flat (p > 0.05): Significance Thresholds and the Crisis of Unreplicable Research. PeerJ 2017, 5, e3544. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  15. John, L.K.; Loewenstein, G.; Prelec, D. Measuring the Prevalence of Questionable Research Practices with Incentives for Truth Telling. Psychol. Sci. 2012, 23, 524–532. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  16. Open Science Collaboration. Estimating the Reproducibility of Psychological Science. Science 2015, 349, aac4716. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  17. Crosby, R.D.; Kolotkin, R.L.; Williams, G.R. Defining Clinically Meaningful Change in Health-Related Quality of Life. J. Clin. Epidemiol. 2003, 56, 395–407. [Google Scholar] [CrossRef]
  18. Guyatt, G.H.; Osoba, D.; Wu, A.W.; Wyrwich, K.W.; Norman, G.R. Methods to Explain the Clinical Significance of Health Status Measures. Mayo Clin. Proc. 2002, 77, 371–383. [Google Scholar] [CrossRef]
  19. Sand, A.; Hagberg, E.; Lohmander, A. On the Benefits of Speech-Language Therapy for Individuals Born with Cleft Palate: A Systematic Review and Meta-Analysis of Individual Participant Data. J. Speech Lang. Hear. Res. 2022, 65, 555–573. [Google Scholar] [CrossRef] [PubMed]
  20. Weir, J.P. Quantifying test-retest reliability using the intraclass correlation coefficient and the SEM. J. Strength Cond. Res. 2005, 19, 231–240. [Google Scholar] [CrossRef]
  21. Trafimow, D. A Frequentist Alternative to Significance Testing, p-Values, and Confidence Intervals. Econometrics 2019, 7, 26. [Google Scholar] [CrossRef] [Green Version]
  22. Dienes, Z.; Mclatchie, N. Four Reasons to Prefer Bayesian Analyses Over Significance Testing. Psychon. Bull. Rev. 2018, 25, 207–218. [Google Scholar] [CrossRef] [Green Version]
  23. Rvachew, S.; Matthews, T. Demonstrating Treatment Efficacy Using the Single Subject Randomization Design: A Tutorial and Demonstration. J. Commun. Disord. 2017, 67, 1–13. [Google Scholar] [CrossRef] [PubMed]
  24. Gorham-Rowan, M.; Morris, R. Aerodynamic Analysis of Male-to-Female Transgender Voice. J. Voice 2006, 20, 251–262. [Google Scholar] [CrossRef]
  25. Pegoraro Krook, M.I. Speaking Fundamental Frequency Characteristics of Normal Swedish Subjects Obtained by Glottal Frequency Analysis. Folia Phoniatr. Logop. 1988, 40, 82–90. [Google Scholar] [CrossRef]
  26. Quinn, S.; Oates, J.; Dacakis, G. Perceived Gender and Client Satisfaction in Transgender Voice Work: Comparing Self and Listener Rating Scales Across a Training Program. Folia Phoniatr. Logop. 2021, 1–16. [Google Scholar] [CrossRef] [PubMed]
  27. Gelman, A. The Problems with p-Values are not Just With p-Values. Am. Stat. 2016, 70. Online Discussion. [Google Scholar]
  28. Amrhein, V.; Trafimow, D.; Greenland, S. Inferential Statistics as Descriptive Statistics: There Is No Replication Crisis if We Don’t Expect Replication. Am. Stat. 2019, 73, 262–270. [Google Scholar] [CrossRef] [Green Version]
  29. Ioannidis, J.P.A. Contradicted and Initially Stronger Effects in Highly Cited Clinical Research. JAMA 2005, 294, 218–228. [Google Scholar] [CrossRef] [Green Version]
  30. McNeill, E.J.M.; Wilson, J.A.; Clark, S.; Deakin, J. Perception of Voice in the Transgender Client. J. Voice 2008, 22, 727–733. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Example 1. Panel A (left): Boxplots of an example sample’s fundamental frequencies (in Hz) before (pre) and after (post) training. A boxplot illustrates quartiles; the central thick black line is the median, the upper and lower limits of the box are the 25th and 75th percentiles, and the lower and upper whiskers indicate minimum and maximum values. Panel B (right): Individual treatment “trajectories”. Each participant’s pre and post measures are color-coded per participant and illustrated with circles connected by a straight line. In both panels, 180 Hz is illustrated as a horizontal dashed line because it has been suggested to be the fundamental frequency range relevant for being perceived as a female speaker [24,25,26], which is the goal of the participants. Semitones are illustrated on the right-hand axis of Panel B as the number of semitones above 110 Hz.
Figure 2. Example 2. Panel A (left): Boxplots of the sample’s fundamental frequencies (in Hz) before (pre) and after (post) training. Panel B (right): Individual treatment “trajectories”. Each participant’s pre and post measures are color-coded per participant and illustrated with circles connected by a straight line. In both panels, 180 Hz is illustrated as a horizontal dashed line because it has been suggested to be the fundamental frequency range relevant for being perceived as a female speaker [24,25,26], which is the goal of the participants.
Figure 3. Panel A (left): Boxplots of the sample’s voice satisfaction ratings (on a Visual Analog scale) before (pre) and after (post) training. Panel B (right): Individual treatment “trajectories”. Each participant’s pre and post measures are color-coded per participant and illustrated with circles connected by a straight line. The scale anchors are shown on the right-hand axis of Panel B as 0 = No and 100 = Yes.
Table 1. Example 1. Each individual’s average fundamental frequency (in Hz) before (pre) and after (post) treatment. The right-most column is the post minus pre difference in Hz. Also shown are the mean and standard deviation for each column. Because the inter-individual variation in the difference scores is so small and most participants have a small, positive effect, a t-test applied to the mean difference is statistically significant, t(7) = 3.5, p = 0.01. The standardized mean difference, Cohen’s d, is 1.25.
Participant   Pre (fo Hz)   Post (fo Hz)   Diff (fo Hz) (Post − Pre)
A             145           152            7
B             135           140            5
C             125           129            4
D             115           118            3
E             130           133            3
F             120           122            2
G             110           111            1
H             140           139            −1
Mean          127.5         130.5          3.0
SD            12.2          13.3           2.4
Table 2. Example 2. Each individual’s average fundamental frequency (in Hz) before (pre) and after (post) treatment. The right-most column is the post minus pre difference in Hz. Also shown are the mean and standard deviation for each column. Because the inter-individual variation in the difference scores is so large, as only some participants have large treatment effects, a t-test applied to the mean difference is not statistically significant, t(7) = 1.8, p = 0.12. The standardized mean difference, Cohen’s d, is 0.64.
Participant   Pre (fo Hz)   Post (fo Hz)   Diff (fo Hz) (Post − Pre)
A             145           175            30
B             135           180            45
C             125           185            60
D             115           116            1
E             130           124            −6
F             120           116            −4
G             110           113            3
H             140           140            0
Mean          127.5         143.6          16.1
SD            12.2          31.4           25.4
Table 3. Each individual’s voice satisfaction rating (on a Visual Analog scale) before (pre) and after (post) treatment. The right-most column is the post minus pre difference in VA-scale units. Also shown are the mean and standard deviation for each column. Because the inter-individual variation in the difference scores is sufficiently small, a t-test applied to the mean difference is statistically significant, t(7) = 8.5, p < 0.0001. The standardized mean difference, Cohen’s d, is 3.0.
Participant   Pre (VA-Scale)   Post (VA-Scale)   Diff (VA-Scale)
A             10               30                20
B             20               45                25
C             20               50                30
D             25               70                45
E             25               70                45
F             30               60                30
G             30               55                25
H             40               60                20
Mean          25               55                30
SD            8.9              13.4              10.0
