Article

Sample Size Calculations in Simple Linear Regression: A New Approach

by Tianyuan Guan 1,2, Mohammed Khorshed Alam 2 and Marepalli Bhaskara Rao 2,*
1 College of Public Health, Kent State University, 750 Hilltop Drive, Kent, OH 44240, USA
2 Department of Environmental Health and Public Health Sciences, University of Cincinnati, 160 Panzeca Way, Cincinnati, OH 45221, USA
* Author to whom correspondence should be addressed.
Entropy 2023, 25(4), 611; https://doi.org/10.3390/e25040611
Submission received: 19 February 2023 / Revised: 25 March 2023 / Accepted: 26 March 2023 / Published: 3 April 2023
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

Abstract

The problem tackled is the determination of the sample size for a given level and power in the context of a simple linear regression model. The standard approach deals with planned experiments in which the predictor X is observed a number n of times and the corresponding observations on the response variable Y are to be drawn. The statistic that is used is built on the least squares estimator of the slope parameter. Its conditional distribution, given the data on the predictor X, is utilized for sample size calculations. This is problematic. The sample size n is already presaged and the data on X are fixed. In unplanned experiments, in which both X and Y are to be sampled simultaneously, we do not yet have data on the predictor X. This conundrum has been discussed in several papers and books with no solution proposed. We overcome the problem by determining the exact unconditional distribution of the test statistic in the unplanned case. We provide tables of critical values for given levels of significance following the exact distribution. In addition, we show that the distribution of the test statistic depends only on the effect size, which is defined precisely in the paper.

1. Introduction

Multiple regression is one of the core methodologies in statistics. Power computation and sample size determination have become an integral part of many research proposals submitted for funding. Funding agencies such as UKRI (UK Research and Innovation) and NIH (National Institutes of Health) demand sample size calculations in all prospective proposals. Regression has a long history dating back to Galton [1]. Horton and Switzer [2] reported that 51% of research articles published in the New England Journal of Medicine during May 2004 used multiple regression as one of their methods; the corresponding figure for power analysis is 39%.
In this paper, we focus on power computation in the context of simple linear regression. The current approach to power computations lacks justification. We will point out the difficulties in this setting [3].
Simple linear regression is ubiquitous in pediatric clinical diagnostics. The model sets standards for normal growth in children on several metrics [4]. As an illustration, a pediatrician wants to check whether the lung function of a 13-year-old patient is normal. Data is to be collected on healthy subjects in the age range 12–14 years with response,
Y = FEV (Forced Expiratory Volume)
and predictor,
X = Height,
which is an example of an unplanned experiment.
In order to trust the model, we need to decide on the sample size, which, in turn, depends on the level of significance, power, and effect size.
First, we investigate the setting under the simple linear regression paradigm. The model has two entities: X, the predictor, and Y, the response variable. It is stated as
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \ \sigma^2)$$
for some $\beta_0$, $\beta_1$, and $\sigma^2 > 0$. The null hypothesis of interest is $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. What should the required sample size n be for a given level of significance α, power 1 − β, and alternative value A of $\beta_1$? Let (X1, Y1), (X2, Y2), …, (Xn, Yn) be a potential sample for the testing problem. Let $\hat{\beta}_1$ be the least squares estimator of $\beta_1$, i.e.,
$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}},$$
where
$$S_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
and
$$S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2.$$
Let RSS be the residual sum of squares, i.e.,
$$\mathrm{RSS} = \sum_{i=1}^{n} \left( Y_i - \bar{Y} - \hat{\beta}_1 (X_i - \bar{X}) \right)^2.$$
For testing the null hypothesis H0, the following test statistic is used:
$$T = \hat{\beta}_1 \sqrt{S_{XX}} \Big/ \sqrt{\mathrm{RSS}/(n-2)}.$$
Under the null hypothesis, conditioned on the X-data, T has a t-distribution with n − 2 degrees of freedom. Under the alternative value $\beta_1 = A$, T has a non-central t-distribution with n − 2 degrees of freedom and non-centrality parameter $\lambda = A\sqrt{S_{XX}}/\sigma$.
We reject the null hypothesis if and only if $|T| > t_{n-2,\,1-\alpha/2}$, where $t_{n-2,\,1-\alpha/2}$ is the point such that the area to its left under Student's t-curve with n − 2 degrees of freedom is 1 − α/2.
The power formula is given by
$$\mathrm{Power}(A) = \Pr(\text{Reject } H_0 \mid \beta_1 = A) = \Pr\!\left( |T| > t_{n-2,\,1-\alpha/2} \,\middle|\, \beta_1 = A \right).$$
We can set the power equal to 1 − β and solve for n. This works as long as we know $\lambda = A\sqrt{S_{XX}}/\sigma$, which requires knowledge of the alternative value of $\beta_1$, of $\sigma^2$, and of $S_{XX}$. We will not know $S_{XX}$ prior to data collection in an unplanned experiment. Equivalently, one would have to spell out λ, which is a tall order. Adcock [5] recognized these problems. Some software packages and textbooks assume that $\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is known; for example, the software PASS [6] and nQuery [7] proceed this way. To overcome these difficulties, we derive the exact unconditional distribution of a variant of T. This requires knowledge of the distribution of X. Let $\sigma_X^2$ be the variance of X.
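For reference, the conventional conditional calculation is easy to code once $S_{XX}$ is pretended known. The R sketch below is ours, not the authors' supplementary code; the function name conditional_power and the input sx2 (an assumed per-subject spread of the X values) are exactly the PASS/nQuery-style inputs that are unavailable before data collection in an unplanned experiment.

```r
# Conventional (conditional) power for testing beta1 = 0, pretending that
# the spread of the X values is known in advance -- the very assumption
# that is unavailable in an unplanned experiment.
conditional_power <- function(n, A, sigma, sx2, alpha = 0.05) {
  Sxx    <- n * sx2                        # assumed known, PASS/nQuery style
  lambda <- A * sqrt(Sxx) / sigma          # non-centrality parameter
  tcrit  <- qt(1 - alpha / 2, df = n - 2)
  # two-sided power under the non-central t with n - 2 degrees of freedom
  pt(-tcrit, df = n - 2, ncp = lambda) + 1 - pt(tcrit, df = n - 2, ncp = lambda)
}

# Smallest n with power >= 0.80 when A = 0.3, sigma = 1, sx2 = 1 (illustrative values)
n <- 10
while (conditional_power(n, A = 0.3, sigma = 1, sx2 = 1) < 0.80) n <- n + 1
n
```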
We modify the test statistic to
$$T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}, \qquad (1)$$
where $\hat{\sigma}^2 = \mathrm{RSS}/(n-2)$ and $\hat{\sigma}_X^2 = S_{XX}/(n-1)$.
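Computing the modified statistic from data is straightforward. The R sketch below is our own illustration; the function name T_stat and the parameter values used to generate the toy data are arbitrary.

```r
# Modified test statistic T = beta1_hat * sigmaX_hat / sigma_hat from a sample (x, y)
T_stat <- function(x, y) {
  n   <- length(x)
  Sxx <- sum((x - mean(x))^2)
  Sxy <- sum((x - mean(x)) * (y - mean(y)))
  b1  <- Sxy / Sxx                                    # least squares slope
  rss <- sum((y - mean(y) - b1 * (x - mean(x)))^2)    # residual sum of squares
  sigma_hat  <- sqrt(rss / (n - 2))
  sigmaX_hat <- sqrt(Sxx / (n - 1))
  b1 * sigmaX_hat / sigma_hat
}

# Toy data from the five-parameter model (arbitrary parameter values)
set.seed(1)
x <- rnorm(100, mean = 160, sd = 10)      # X ~ N(mu_X, sigma_X^2)
y <- 1 + 0.05 * x + rnorm(100, sd = 2)    # Y | X ~ N(beta0 + beta1 X, sigma^2)
T_stat(x, y)
```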
We obtain the unconditional distribution of T under $\beta_1 = 0$ as well as under $\beta_1 = A \neq 0$. We assume $X \sim N(\mu_X, \sigma_X^2)$, with both parameters unknown. Under this assumption, the distribution of T is derived.
In due course, we will show that the distribution of T when $\beta_1 = A \neq 0$ depends only on $\delta = A\sigma_X/\sigma$, which we deem the effect size.
The five-parameter model now is:
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \ \sigma^2),$$
$$X \sim N(\mu_X, \ \sigma_X^2).$$
Note that the vector (X, Y) has a bivariate normal distribution.
The paper is organized as follows. In Section 2, we provide a literature review. In Section 3, we outline the main results. We derive the unconditional distribution of T under the null hypothesis in Section 3.1. In Section 3.2, we calculate critical values using the main results. In Section 3.3, we lay out the sample sizes required for a given level, power, and effect size $\delta = A\sigma_X/\sigma$. In Section 4, we summarize the results and draw conclusions. The computational details, along with the R code [8], are presented in the Supplementary Materials.

2. Literature Review

Ryan [3] has pointed out difficulties in power calculations in the environment of simple linear regression. The problem is how we handle the predictor X. Adcock [5] has looked at some possible scenarios. One scenario is that the investigator knows the Xi-values (deterministic) for every sample size n. In such a case, the test statistic
$$\hat{\beta}_1 \sqrt{S_{XX}} \Big/ \sqrt{\mathrm{RSS}/(n-2)} \qquad (2)$$
is eminently usable for power calculations. Its (conditional) null and non-null distributions have been worked out explicitly. The conditional approach is also followed by Dupont et al. [9], Draper et al. [10], Hsieh et al. [11], Maxwell [12], and Thigpen [13].
As an alternative to the test statistic (2), we can build a test based on the sample correlation coefficient $\hat{\rho}$ [14], under the joint normality of X and Y. The null and non-null distributions of the underlying test statistic based on $\hat{\rho}$ have been worked out explicitly. In our consulting work, many researchers prefer the test based on $\hat{\beta}_1$; it is a choice between causality and association [3,14,15,16,17,18,19,20,21]. Under bivariate normality the hypotheses $H_0: \beta_1 = 0$ and $H_0: \rho = 0$ are equivalent, but the test statistics are different. It is easy to determine the sample size in the correlation context [14]. However, this sample size cannot be carried over to the test of the hypothesis on the slope; its power there is lower. In other words, test hopping is not permissible; the two tests have distinct power functions.
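For orientation only, the sample size in the correlation context can be approximated with the familiar Fisher z transformation. The sketch below is ours, not the exact calculation of Gatsonis and Sampson [14]; the function name n_correlation is illustrative, and, per the remark above, the resulting n applies to the correlation test, not to the test based on the slope.

```r
# Approximate sample size for testing H0: rho = 0 (two-sided) with power
# 1 - beta at rho = rho1, using the Fisher z approximation
# z(rho_hat) ~ N(z(rho), 1 / (n - 3)).
n_correlation <- function(rho1, alpha = 0.05, power = 0.80) {
  z <- atanh(rho1)   # Fisher z transform of the alternative correlation
  ceiling(((qnorm(1 - alpha / 2) + qnorm(power)) / z)^2 + 3)
}
n_correlation(rho1 = 0.3, alpha = 0.05, power = 0.80)
```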

3. Outline of Results

We will now derive the unconditional distribution of $\hat{\beta}_1$, which will be instrumental in sample size calculations. We use the test statistic $T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}$.
Under the null hypothesis $\beta_1 = 0$, we show that
$$T^2 \ \sim\ \frac{n-2}{n-1} \cdot \frac{W_1 W_4}{W_2 W_3},$$
where $W_1 \sim \chi^2_1$, $W_2 \sim \chi^2_{n-1}$, $W_3 \sim \chi^2_{n-2}$, and $W_4 \sim \chi^2_{n-1}$, with the $W_i$ being mutually independent. It follows that
$$T \ \sim\ \frac{\sqrt{n-2}}{n-1} \cdot \frac{U_1 U_2}{U_3},$$
with $U_1$, $U_2$, $U_3$ independently distributed, $U_1 \sim t_{n-1}$, $U_2 \sim \chi_{n-1}$, and $U_3 \sim \chi_{n-2}$, where $\chi_{n-1}$ is the χ distribution (the square root of a $\chi^2_{n-1}$ variable) with n − 1 degrees of freedom.
We use this result to obtain the critical values of the test based on T for given levels. For power and sample size computations, we need the distribution of T for any given value of $\beta_1$. This distribution depends on the alternative value of $\beta_1$, on $\sigma_X^2$, and on $\sigma^2$. It turns out that it depends only on $\delta = \beta_1 \sigma_X / \sigma$, which we deem the effect size. The specification of δ facilitates the computation of power. Despite all these deliberations, no magic explicit formula for power surfaces. Knowing the distribution of $T^2$ once δ is spelled out eases the pain a little.

3.1. Distributional Results

In this section, we derive the distribution of the statistic T of (1) unconditionally. The following series of steps gives the desired result.
  • Given $X_1, X_2, \ldots, X_n$, $\hat{\beta}_1$ has a normal distribution with mean $\beta_1$ and variance $\sigma^2 / S_{XX}$, and $\hat{\beta}_1$ and RSS are independent.
  • Unconditionally, $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-2}$.
  • $S_{XX}/\sigma_X^2 \sim \chi^2_{n-1}$.
  • RSS and $S_{XX}$ are independent.
More generally, we obtain the distribution of $T = (\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma}$ for a given value of $\beta_1$.
The joint density function of $\hat{\beta}_1$ and $S_{XX}$ is:
$$g(\hat{\beta}_1, S_{XX}) = \frac{\sqrt{S_{XX}}}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{S_{XX}}{2\sigma^2}\,(\hat{\beta}_1 - \beta_1)^2 \right\} \cdot \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right) 2^{(n-1)/2}} \exp\!\left\{ -\frac{S_{XX}}{2\sigma_X^2} \right\} \left( \frac{S_{XX}}{\sigma_X^2} \right)^{\frac{n-1}{2}-1} \frac{1}{\sigma_X^2}, \quad -\infty < \hat{\beta}_1 < \infty,\ 0 < S_{XX} < \infty.$$
The (unconditional) marginal density of $\hat{\beta}_1$ is given by:
$$f(\hat{\beta}_1) = \frac{1}{2^{1/2}\, 2^{(n-1)/2}\, \sqrt{\pi}\, \Gamma\!\left(\frac{n-1}{2}\right) \sigma\, (\sigma_X^2)^{(n-1)/2}} \int_0^{\infty} S_{XX}^{\,n/2-1} \exp\!\left\{ -\frac{S_{XX}}{2}\left[ \frac{(\hat{\beta}_1-\beta_1)^2}{\sigma^2} + \frac{1}{\sigma_X^2} \right] \right\} dS_{XX}$$
$$= \frac{\Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma\!\left(\frac{n-1}{2}\right) \sigma\, (\sigma_X^2)^{(n-1)/2}} \left[ \frac{1}{\sigma_X^2} + \frac{(\hat{\beta}_1-\beta_1)^2}{\sigma^2} \right]^{-n/2} = \frac{\sigma_X}{B\!\left(\frac{1}{2}, \frac{n-1}{2}\right)\sigma} \left[ 1 + \frac{(\hat{\beta}_1-\beta_1)^2\, \sigma_X^2}{\sigma^2} \right]^{-n/2}, \quad -\infty < \hat{\beta}_1 < \infty.$$
Some properties of this density are easy to observe. For example, the distribution is symmetric around the true value $\beta_1$. If n = 2, the distribution is Cauchy. In addition,
$$\frac{\sigma_X}{\sigma}\,(\hat{\beta}_1 - \beta_1)\,\sqrt{n-1} \ \sim\ t_{n-1}.$$
Further, if n > 3, unconditionally,
  • $E(\hat{\beta}_1) = \beta_1$ and $\mathrm{Var}(\hat{\beta}_1) = (\sigma^2/\sigma_X^2)\,(n-3)^{-1}$;
  • In the conditional set-up,
    $E(\hat{\beta}_1 \mid X_1, X_2, \ldots, X_n) = \beta_1$,
    $\mathrm{Var}(\hat{\beta}_1 \mid X_1, X_2, \ldots, X_n) = \sigma^2 / S_{XX}$;
  • The random variable $U = (\hat{\beta}_1 - \beta_1)\,\sigma_X/\sigma$ has the probability density function:
    $$f(U) = \frac{1}{B\!\left(\frac{1}{2}, \frac{n-1}{2}\right)} \left(1 + U^2\right)^{-n/2}, \quad -\infty < U < \infty;$$
  • It follows that $U^2 \sim W_1/W_2$, where $W_1 \sim \chi^2_1$ and $W_2 \sim \chi^2_{n-1}$, with $W_1$ and $W_2$ being independent;
  • Exact distribution of $T^2$: note that $(\hat{\beta}_1 - \beta_1)/\sigma$ and $\hat{\sigma}_X^2$ are independent;
  • $$T^2 = \frac{(\hat{\beta}_1 - \beta_1)^2\, \hat{\sigma}_X^2}{\hat{\sigma}^2} = \frac{(\hat{\beta}_1 - \beta_1)^2\, \sigma_X^2}{\sigma^2} \cdot \frac{\sigma^2}{\hat{\sigma}^2} \cdot \frac{\hat{\sigma}_X^2}{\sigma_X^2} \ \sim\ \frac{W_1}{W_2} \cdot \frac{n-2}{W_3} \cdot \frac{W_4}{n-1},$$
    where $W_3 \sim \chi^2_{n-2}$ and $W_4 \sim \chi^2_{n-1}$, with $W_1$, $W_2$, $W_3$, and $W_4$ being independent.
    In short,
    $$T^2 \ \sim\ \frac{n-2}{n-1} \cdot \frac{W_1 W_4}{W_2 W_3};$$
  • It follows that:
    $$E(T^2) = \frac{n-2}{(n-3)(n-4)};$$
  • An alternative form of the distribution [22]:
    $$\frac{n-1}{n-2}\, T^2 \ \sim\ \frac{W_1 W_4}{W_2 W_3} \ \sim\ \mathrm{BetaII}\!\left(\tfrac{1}{2}, \tfrac{n-2}{2}\right)\cdot \mathrm{BetaII}\!\left(\tfrac{n-1}{2}, \tfrac{n-1}{2}\right),$$
    where BetaII signifies the beta distribution of the second kind. (A Monte Carlo check of these representations appears after this list.)
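The four-chi-square representation is convenient for simulation. The following R snippet is our own quick Monte Carlo check, not part of the paper's supplementary code; the choices n = 20 and B = 10^6 are arbitrary.

```r
# Monte Carlo check (ours) of T^2 ~ ((n-2)/(n-1)) * W1*W4 / (W2*W3) under H0
# and of E(T^2) = (n-2) / ((n-3)(n-4)).
set.seed(123)
n <- 20
B <- 1e6
W1 <- rchisq(B, df = 1)
W2 <- rchisq(B, df = n - 1)
W3 <- rchisq(B, df = n - 2)
W4 <- rchisq(B, df = n - 1)
T2 <- (n - 2) / (n - 1) * W1 * W4 / (W2 * W3)
mean(T2)                        # simulated mean of T^2
(n - 2) / ((n - 3) * (n - 4))   # theoretical mean, about 0.066 for n = 20
```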

3.2. Critical Values

We obtain the critical values of the test based on the test statistic $T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}$ for three levels of significance. We denote the critical value by $C_{n,\alpha}$; it satisfies the equation:
$$\alpha = \Pr\!\left( \left| \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} \right| > C_{n,\alpha} \,\middle|\, H_0: \beta_1 = 0 \right) = \Pr\!\left( \hat{\beta}_1^2 \hat{\sigma}_X^2 / \hat{\sigma}^2 > C_{n,\alpha}^2 \,\middle|\, H_0: \beta_1 = 0 \right).$$
Under $H_0$,
$$T^2 = \hat{\beta}_1^2 \hat{\sigma}_X^2 / \hat{\sigma}^2 \ \sim\ \frac{n-2}{n-1} \cdot \frac{W_1 W_4}{W_2 W_3},$$
where $W_1 \sim \chi^2_1$, $W_2 \sim \chi^2_{n-1}$, $W_3 \sim \chi^2_{n-2}$, and $W_4 \sim \chi^2_{n-1}$, with the $W_i$ being independent.
There are two options. The first is to use the pdf of $\frac{n-1}{n-2}\,T^2$. Following Jambunathan [22], one can write down the pdf of the product UV of the random variables U and V, with $U \sim \mathrm{BetaII}(1/2, (n-2)/2)$, $V \sim \mathrm{BetaII}((n-1)/2, (n-1)/2)$, and U and V independent. The pdf takes the form of a double integral, and its evaluation would require a quadrature formula with the attendant errors of approximation. The second option is to determine the distribution of $T^2$ by extensive Monte Carlo sampling of the components that make up $T^2$. We have pursued the second option. The critical values are tabulated in File S1.
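As an illustration of the second option, the critical value $C_{n,\alpha}$ can be approximated by simulating the four chi-square components directly. This R sketch is ours; the function name critical_value and the number of draws B are illustrative, and the published values should be taken from File S1.

```r
# Monte Carlo approximation (ours) of the critical value C_{n, alpha}:
# the test rejects H0 when |T| > C_{n, alpha}.
critical_value <- function(n, alpha, B = 1e6) {
  W1 <- rchisq(B, df = 1)
  W2 <- rchisq(B, df = n - 1)
  W3 <- rchisq(B, df = n - 2)
  W4 <- rchisq(B, df = n - 1)
  T2 <- (n - 2) / (n - 1) * W1 * W4 / (W2 * W3)
  unname(sqrt(quantile(T2, probs = 1 - alpha)))   # C such that Pr(T^2 > C^2) = alpha
}

set.seed(2023)
critical_value(n = 50, alpha = 0.10)
critical_value(n = 50, alpha = 0.05)
critical_value(n = 50, alpha = 0.01)
```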
One can also obtain the critical value $C_{n,\alpha}$ via the asymptotic distribution of T. One benefit of our derivation of the exact distribution is that, if n is large and the null hypothesis is true,
$$T \ \sim\ \mathrm{Normal}\!\left( 0, \ \frac{n-2}{(n-3)(n-4)} \right), \quad \text{approximately}.$$
There are several ways to establish the asymptotic normality of T. The exact unconditional distribution of $\sqrt{n-1}\,(\hat{\beta}_1 - \beta_1)\,\sigma_X/\sigma$ is $t_{n-1}$, which is asymptotically N(0, 1). We then use the fact that $\hat{\sigma}_X$ is consistent for $\sigma_X$ and that $\hat{\sigma}$ is consistent for σ. Since we know the variance of T exactly, we use this variance in the description of the asymptotic distribution of T. We can thus calculate critical values from the asymptotic distribution alongside those coming from the exact distribution.
In File S1, we report the average critical values Cn,α along with the critical values stemming from the asymptotic theory. A description of these asymptotic critical values is provided below.
Critical values from the normal approximation:

Level    Critical value formula                              Verbal description in File S1
10%      $1.645 \times \sqrt{(n-2)/\{(n-3)(n-4)\}}$          10% normal
5%       $1.96 \times \sqrt{(n-2)/\{(n-3)(n-4)\}}$           5% normal
1%       $2.576 \times \sqrt{(n-2)/\{(n-3)(n-4)\}}$          1% normal
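The normal-approximation values in the table above can be computed in one line. The following sketch is ours, with the quantiles 1.645, 1.96, and 2.576 expressed through qnorm; the function name is illustrative.

```r
# Normal-approximation critical value (our sketch of the formula tabulated above)
normal_critical_value <- function(n, alpha) {
  qnorm(1 - alpha / 2) * sqrt((n - 2) / ((n - 3) * (n - 4)))
}
normal_critical_value(n = 50, alpha = 0.10)   # 1.645 * sqrt(48 / (47 * 46))
normal_critical_value(n = 50, alpha = 0.05)
normal_critical_value(n = 50, alpha = 0.01)
```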
Comments on File S1: The Normal Critical Value column is explained.
  • Normal critical value 10% = critical value coming from the asymptotic distribution when α = 0.10.
  • Normal critical value 5% = critical value coming from the asymptotic distribution when α = 0.05.
  • Normal critical value 1% = critical value coming from the asymptotic distribution when α = 0.01.
  • Critical value 10% = critical value coming from the exact distribution of T when α = 0.10.
  • Critical value 5% = critical value coming from the exact distribution of T when α = 0.05.
  • Critical value 1% = critical value coming from the exact distribution of T when α = 0.01.
  • When α = 0.10, |Normal critical value 10% − Critical value 10%| ≤ 0.001 for n ≥ 50. One can enjoy the benefit of the normal approximation when n ≥ 50.
  • When α = 0.05, |Normal critical value 5% − Critical value 5%| ≤ 0.001 for n ≥ 89. One can enjoy the benefit of the normal approximation when n ≥ 89.
  • For α = 0.01, File S1 is not informative about when |Normal critical value 1% − Critical value 1%| ≤ 0.001.

3.3. Sample Size and Power

For a given level α, sample size n, and alternative value $\beta_1 = A$, the power is given by
$$\mathrm{Power}(A) = \Pr\!\left( \left| \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} \right| > C_{n,\alpha} \,\middle|\, \beta_1 = A \right).$$
Suppose 1 − β is the specified power. To determine the sample size, we set
$$1 - \beta = \Pr\!\left( \left| \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} \right| > C_{n,\alpha} \,\middle|\, \beta_1 = A \right)$$
and solve for n. We need the distribution of $\hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}$ when $\beta_1 = A$. Rewrite
$$\hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} = (\hat{\beta}_1 - \beta_1)\, \hat{\sigma}_X / \hat{\sigma} + \beta_1 \hat{\sigma}_X / \hat{\sigma}.$$
The distribution of $(\hat{\beta}_1 - \beta_1)\, \hat{\sigma}_X / \hat{\sigma}$ is described in Section 3.1, and it is free of the parameters of the regression model. Consequently, the random variables $(\hat{\beta}_1 - \beta_1)\, \hat{\sigma}_X / \hat{\sigma}$ and $\beta_1 \hat{\sigma}_X / \hat{\sigma}$ are independently distributed. Since $\hat{\sigma}$ and $\hat{\sigma}_X$ are independently distributed,
$$\left( \beta_1 \hat{\sigma}_X / \hat{\sigma} \right)^2 \ \overset{d}{=}\ \beta_1^2\, \frac{\sigma_X^2}{n-1}\, W_5 \cdot \frac{n-2}{\sigma^2} \cdot \frac{1}{W_6} = \left( \frac{\beta_1 \sigma_X}{\sigma} \right)^2 \frac{n-2}{n-1} \cdot \frac{W_5}{W_6},$$
with $W_5 \sim \chi^2_{n-1}$, $W_6 \sim \chi^2_{n-2}$, and $W_5$ and $W_6$ being independent.
An important fact emerges from these deliberations: the distribution of $(\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X/\hat{\sigma} + \beta_1 \hat{\sigma}_X/\hat{\sigma}$ depends only on $\delta = \beta_1 \sigma_X / \sigma$, which we declare to be the effect size.
In short, when $\beta_1 = A \neq 0$, the key steps are:
  • $T = (\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma} + \beta_1 \hat{\sigma}_X / \hat{\sigma}$,
  • with $\left\{ (\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma} \right\}^2 \sim \dfrac{n-2}{n-1} \cdot \dfrac{W_1 W_4}{W_2 W_3}$,
  • $\left( \beta_1 \hat{\sigma}_X / \hat{\sigma} \right)^2 \sim \left( \dfrac{A \sigma_X}{\sigma} \right)^2 \dfrac{n-2}{n-1} \cdot \dfrac{W_5}{W_6}$,
  • $(\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma}$ and $\beta_1 \hat{\sigma}_X / \hat{\sigma}$ are independent,
and the distribution of T depends only on n and the effect size $\delta = A \sigma_X / \sigma$.
In spite of all these labors, the distribution of T is not amenable to direct and simple computation of power.
We simulate the regression model for power computations; a sketch of such a simulation is given below. The simulations are greatly simplified when we exploit the key feature of the alternative distribution, namely, that it depends only on n and δ. The simulations are reported in the Supplementary Materials. Sample sizes are tabulated in Table 1, Table 2 and Table 3.
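The following R sketch is our own illustration of such a simulation; it is not the authors' supplementary code. Because the distribution of T depends only on n and δ, we may simulate, without loss of generality, with μX = 0, σX = 1, β0 = 0, σ = 1, and β1 = δ; the function names sim_T and sim_power and the Monte Carlo size B are illustrative choices.

```r
# Simulate the modified statistic T = beta1_hat * sigmaX_hat / sigma_hat
# from the regression model, standardized so that only n and delta matter.
sim_T <- function(n, delta, B = 20000) {
  replicate(B, {
    x   <- rnorm(n)                      # X ~ N(0, 1), i.e., mu_X = 0, sigma_X = 1
    y   <- delta * x + rnorm(n)          # beta0 = 0, beta1 = delta, sigma = 1
    Sxx <- sum((x - mean(x))^2)
    b1  <- sum((x - mean(x)) * (y - mean(y))) / Sxx
    rss <- sum((y - mean(y) - b1 * (x - mean(x)))^2)
    b1 * sqrt(Sxx / (n - 1)) / sqrt(rss / (n - 2))
  })
}

# Estimate power at effect size delta: compare |T| with a Monte Carlo
# critical value obtained under delta = 0.
sim_power <- function(n, delta, alpha, B = 20000) {
  C <- quantile(abs(sim_T(n, delta = 0, B)), probs = 1 - alpha)
  mean(abs(sim_T(n, delta, B)) > C)
}

# Example: Table 1 reports a validated mean power of about 0.80
# at alpha = 0.10, delta = 0.3, n = 73.
set.seed(7)
sim_power(n = 73, delta = 0.3, alpha = 0.10)
```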
Table 1. Sample Size for Given Effect Size, Power, Level of Significance 10%, Mean of Power in the Validation Step, and its Standard Deviation.
α      ES = β1·(σX/σ)    Power    n       Mean       Sd
0.1    0.1               80%      620     0.7993     0.013
0.1    0.1               90%      870     0.9027     0.0095
0.1    0.1               95%      1120    0.9546     0.0067
0.1    0.1               99%      1690    0.993      0.0027
0.1    0.2               80%      161     0.8259     0.0123
0.1    0.2               90%      219     0.90007    0.0096
0.1    0.2               95%      274     0.949      0.0069
0.1    0.2               99%      440     0.0039     0.0024
0.1    0.3               80%      73      0.8017     0.0124
0.1    0.3               90%      100     0.9006     0.00995
0.1    0.3               95%      124     0.9475     0.0071
0.1    0.3               99%      195     0.931      0.0026
0.1    0.4               80%      43      0.8031     0.0126
0.1    0.4               90%      60      0.9073     0.0093
0.1    0.4               95%      72      0.9518     0.0073
0.1    0.4               99%      105     0.9896     0.0031
0.1    0.5               80%      29      0.8045     0.0134
0.1    0.5               90%      39      0.9003     0.0099
0.1    0.5               95%      48      0.947      0.0068
0.1    0.5               99%      69      0.989      0.0034
0.1    0.6               80%      21      0.80       0.0129
0.1    0.6               90%      28      0.8961     0.0096
0.1    0.6               95%      35      0.9476     0.0069
0.1    0.6               99%      52      0.9911     0.003
Comments on Table 1, Table 2 and Table 3:
  • The first column in each table entertains three types of effect sizes: small (0.1, 0.2); medium (0.3, 0.4); and large (0.5, 0.6) [15].
  • The second column in each table lays out the powers entertained.
  • The third column in each table spells out the requisite sample size.
  • The fourth column is the fruit of our effort to validate the sample sizes. At the ascertained sample size, data are generated under the specifications and the power is calculated; this is repeated one thousand times, and the powers are averaged.
  • The fifth column records the standard deviation of the one thousand calculated powers.
  • We are satisfied that the sample sizes laid out hold up.

4. Discussion

A simple linear regression is a five-parameter model spelling out causality between two quantitative variables Y and X, typified by:
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \ \sigma^2), \qquad X \sim N(\mu_X, \ \sigma_X^2),$$
for some parameters $\beta_0$, $\beta_1$, $\mu_X$, $\sigma^2 > 0$, and $\sigma_X^2 > 0$. The goal is to sample (X, Y) for testing $H_0: \beta_1 = 0$ versus the alternative $H_1: \beta_1 \neq 0$. For determining the sample size, we need the level of significance α, the power 1 − β, and the effect size $\delta = A\sigma_X/\sigma$, where A is the given alternative value of $\beta_1$. The test statistic T used here is the one based on the least squares estimator $\hat{\beta}_1$ of $\beta_1$.
The regression model, as originally formulated, is a conditional model, i.e., $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$. In practice, in a planned experiment, the experimenter selects values $x_1, x_2, \ldots, x_n$ of X and observes one or more Ys from the conditional distribution of $Y \mid x_i$ for each i. Thus, the sample size n has already been chosen. The statistic $\hat{\beta}_1 \sqrt{S_{XX}} / \sqrt{\mathrm{RSS}/(n-2)}$ is used for testing $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. The conditional distribution of the test statistic, given the data on X, is Student's t with n − 2 degrees of freedom under $H_0$, and it is non-central Student's t with n − 2 degrees of freedom and non-centrality parameter $A\sqrt{S_{XX}}/\sigma$ under $H_1: \beta_1 = A$. The alternative distribution can be used to calculate the power of the test at $\beta_1 = A$, and nothing more. The entities n and $S_{XX}$ are already in place, and σ has to be spelled out. The value of A is provided by the experimenter as the one of clinical significance. In the consulting experience of one of the authors, the experimenter usually comes up with a value for σ from his or her pilot study.
In some statistical circles [6,7], the non-null distribution is used to calculate the sample size with the desired power, with $S_{XX}$ remaining the same. This is controversial and is discussed in [3,5,13].
We are dealing with unplanned experiments, in which both X and Y are sampled together. Unplanned experiments are very common in clinical studies [4]. The effect size, in this context, is the product of the alternative value of $\beta_1$ and the ratio $\sigma_X/\sigma$ of the two standard deviations of the model.
The current practice demands α, 1 − β, A, σ, and $S_{XX}$, which we do not have. Specification of $S_{XX}$ is avoided by determining the unconditional distribution of
$$T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}.$$
Exploiting the unconditional distribution of T, we calculated the critical values and the required sample sizes. The unconditional distribution under the alternative depends on the effect size $\delta = \beta_1 \sigma_X / \sigma$, as well as on n and α. In contrast, popular software such as PASS [6] and nQuery [7] use the conditional distribution of the test statistic T*, given the data on X, for calculating the sample size.
An additional feature of our paper is that we provide a comprehensive table of critical values and sample sizes, unlike commercial software.
The main result, that the non-null distribution of the test statistic T depends only on the effect size δ, has an echo in other inference problems. For example, when testing $\mu_1 = \mu_2$ under the assumptions of normality and a common variance $\sigma^2$, the non-null distribution of the two-sample t-statistic depends only on the effect size $\lambda = (\mu_1 - \mu_2)/\sigma$. This result, in spirit, is like ours. We have archived our findings for comments and insights [23].
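For the two-sample problem, this dependence on the effect size alone is visible directly in base R's power.t.test, which we use here only as a familiar parallel; the illustration is ours and is not part of the paper.

```r
# Two different (mu1 - mu2, sigma) pairs with the same effect size
# lambda = (mu1 - mu2) / sigma give the same power.
power.t.test(n = 50, delta = 1.0, sd = 2, sig.level = 0.05)$power  # lambda = 0.5
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power  # lambda = 0.5, same value
```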
We trust that the tables provided will help researchers calculate sample sizes in the context of simple linear regression in unplanned experiments, avoiding the controversies that have been problematic until now. We will continue to study how the required sample sizes compare between the test based on the slope parameter of the model and the test based on the correlation coefficient.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/e25040611/s1.

Author Contributions

The original sample size problem came from the consulting work of M.K.A. The project was designed by M.K.A. The bulk of derivations and computations were done by T.G. This was part of her thesis work. M.B.R. was the mentor. Conceptualization, M.K.A.; methodology, M.B.R.; software, T.G.; simulations, T.G.; writing—original draft preparation, T.G.; writing—review and editing, M.B.R.; supervision, M.B.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are immensely indebted to the four reviewers, who exhorted them to bring the paper into a sharp focus highlighting its strength. One of the reviewers identified the true pulse of the paper and commented that the paper is in the ambit of an unplanned simple linear regression domain in contrast to the traditional planned simple linear regression armory.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Galton, F. Regression towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Great Br. Irel. 1886, 15, 246–263.
  2. Horton, N.J.; Switzer, S.S. Statistical Methods in the Journal. N. Engl. J. Med. 2005, 353, 1977–1979.
  3. Ryan, T.P. Sample Size Determination and Power; John Wiley and Sons: Hoboken, NJ, USA, 2013.
  4. Gripp, K.W. Handbook of Physical Measurements, 3rd ed.; Oxford University Press: Oxford, UK, 2013.
  5. Adcock, C.J. Sample size determination: A review. J. R. Stat. Soc. D 1997, 46, 261–283.
  6. PASS. Power Analysis and Sample Size Software; NCSS, LLC.: Kaysville, UT, USA, 2021. Available online: https://www.ncss.com/software/pass/ (accessed on 1 January 2020).
  7. nQuery. Sample Size and Power Calculation; Statsols (Statistical Solutions Ltd.): Cork, Ireland, 2017.
  8. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017. Available online: https://www.R-project.org/ (accessed on 1 January 2020).
  9. Dupont, W.D.; Plummer, W.D. Power and Sample Size Calculations for Studies Involving Linear Regression. Control. Clin. Trials 1998, 19, 589–601.
  10. Draper, N.R.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: New York, NY, USA, 1981.
  11. Hsieh, F.; Bloch, D.; Larsen, M. A simple method of sample size calculation for linear and logistic regression. Stat. Med. 1998, 17, 1623–1634.
  12. Maxwell, S.E. Sample Size and Multiple Regression Analysis. Psychol. Methods 2000, 5, 434–458.
  13. Thigpen, C.C. A Sample-Size Problem in Simple Linear Regression. Am. Stat. 1987, 41, 214–215.
  14. Gatsonis, C.; Sampson, A.R. Multiple Correlation: Exact Power and Sample Size Calculations. Psychol. Bull. 1989, 106, 516–524.
  15. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; L. Erlbaum Associates: Hillsdale, NJ, USA, 1988.
  16. SAS Analytics Software and Solution—Version 9.4. Available online: https://support.sas.com/software/94/ (accessed on 1 January 2023).
  17. Krishnamoorthy, K.; Xia, Y. Sample size calculation for estimating or testing a nonzero squared multiple correlation coefficient. Multivar. Behav. Res. 2008, 43, 382–410.
  18. Mendoza, J.L.; Stafford, K.L. Confidence Intervals, Power Calculation, and Sample Size Estimation for the Squared Multiple Correlation Coefficient under the Fixed and Random Regression Models: A Computer Program and Useful Standard Tables. Educ. Psychol. Meas. 2001, 61, 650–667.
  19. Kelley, K. Sample size planning for the squared multiple correlation coefficient: Accuracy in parameter estimation via narrow confidence intervals. Multivar. Behav. Res. 2008, 43, 524–555.
  20. Shieh, G. A Unified Approach to Power Calculation and Sample Size Determination for Random Regression Models. Psychometrika 2007, 72, 347–360.
  21. Shieh, G. Sample size requirements for interval estimation of the strength of association effect sizes in multiple regression analysis. Psicothema 2013, 25, 402–407.
  22. Jambunathan, M.V. Some Properties of Beta and Gamma Distributions. Ann. Math. Stat. 1954, 25, 401–405.
  23. Guan, T.; Alam, M.K.; Rao, M.B. Sample Size Calculations in Simple Linear Regression: Trials and Tribulations. arXiv 2019, arXiv:1907.10569.
Table 2. Sample Size for Given Effect Size, Power, Level of Significance 5%, Mean of Power in the Validation Step, and its Standard Deviation.
α       ES = β1·(σX/σ)    Power    n       Mean       Sd
0.05    0.1               80%      790     0.8006     0.0129
0.05    0.1               90%      1080    0.9054     0.0088
0.05    0.1               95%      1350    0.9557     0.0067
0.05    0.1               99%      1850    0.9898     0.0032
0.05    0.2               80%      199     0.797      0.0133
0.05    0.2               90%      272     0.9039     0.0094
0.05    0.2               95%      330     0.9497     0.0069
0.05    0.2               99%      450     0.9891     0.0033
0.05    0.3               80%      91      0.7978     0.0124
0.05    0.3               90%      123     0.9028     0.0094
0.05    0.3               95%      150     0.9505     0.0067
0.05    0.3               99%      220     0.992      0.0028
0.05    0.4               80%      53      0.773      0.0128
0.05    0.4               90%      70      0.8966     0.0096
0.05    0.4               95%      87      0.9494     0.0071
0.05    0.4               99%      121     0.9891     0.0034
0.05    0.5               80%      36      0.8051     0.0124
0.05    0.5               90%      48      0.9095     0.0091
0.05    0.5               95%      58      0.95       0.0068
0.05    0.5               99%      79      0.9888     0.0033
0.05    0.6               80%      26      0.8005     0.0124
0.05    0.6               90%      34      0.8985     0.0094
0.05    0.6               95%      43      0.9547     0.0066
0.05    0.6               99%      59      0.9901     0.0031
Table 3. Sample Size for Given Effect Size, Power, Level of Significance 1%, Mean of Power in the Validation Step, and its Standard Deviation.
α       ES = β1·(σX/σ)    Power    n       Mean       Sd
0.01    0.1               80%      1180    0.8026     0.0124
0.01    0.1               90%      1500    0.9015     0.0095
0.01    0.1               95%      1760    0.946      0.0072
0.01    0.1               99%      2440    0.9906     0.0031
0.01    0.2               80%      301     0.8045     0.0121
0.01    0.2               90%      388     0.9046     0.0093
0.01    0.2               95%      458     0.9529     0.0065
0.01    0.2               99%      620     0.991      0.0031
0.01    0.3               80%      136     0.8044     0.0129
0.01    0.3               90%      172     0.9012     0.0089
0.01    0.3               95%      199     0.9432     0.0071
0.01    0.3               99%      265     0.9872     0.0034
0.01    0.4               80%      78      0.8017     0.0124
0.01    0.4               90%      95      0.8856     0.0099
0.01    0.4               95%      118     0.949      0.007
0.01    0.4               99%      158     0.9892     0.0033
0.01    0.5               80%      51      0.8042     0.0126
0.01    0.5               90%      64      0.9011     0.0099
0.01    0.5               95%      77      0.9464     0.007
0.01    0.5               99%      104     0.9891     0.0032
0.01    0.6               80%      37      0.7975     0.0125
0.01    0.6               90%      48      0.906      0.0089
0.01    0.6               95%      56      0.9485     0.007
0.01    0.6               99%      73      0.9874     0.0035
