Article

Sample Size Calculations in Simple Linear Regression: A New Approach

by Tianyuan Guan 1,2, Mohammed Khorshed Alam 2 and Marepalli Bhaskara Rao 2,*
1 College of Public Health, Kent State University, 750 Hilltop Drive, Kent, OH 44240, USA
2 Department of Environmental Health and Public Health Sciences, University of Cincinnati, 160 Panzeca Way, Cincinnati, OH 45221, USA
* Author to whom correspondence should be addressed.
Entropy 2023, 25(4), 611; https://doi.org/10.3390/e25040611
Submission received: 19 February 2023 / Revised: 25 March 2023 / Accepted: 26 March 2023 / Published: 3 April 2023
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

Abstract

The problem tackled is the determination of the sample size for a given level and power in the context of a simple linear regression model. The standard approach deals with planned experiments in which the predictor X is observed a number n of times and the corresponding observations on the response variable Y are to be drawn. The statistic that is used is built on the least squares estimator of the slope parameter. Its conditional distribution, given the data on the predictor X, is utilized for sample size calculations. This is problematic. The sample size n is already presaged and the data on X are fixed. In unplanned experiments, in which both X and Y are to be sampled simultaneously, we do not yet have data on the predictor X. This conundrum has been discussed in several papers and books with no solution proposed. We overcome the problem by determining the exact unconditional distribution of the test statistic in the unplanned case. We provide tables of critical values for given levels of significance following the exact distribution. In addition, we show that the distribution of the test statistic depends only on the effect size, which is defined precisely in the paper.

1. Introduction

Multiple regression is one of the core methodologies in statistics. Power computation and sample size determination have become an integral part of many research proposals submitted for funding. Funding agencies such as UKRI (UK Research and Innovation) and NIH (National Institutes of Health) demand sample size calculations in all prospective proposals. Regression has a long history dating back to Galton [1]. Horton and Switzer [2] reported that 51% of research articles published in the New England Journal of Medicine during May 2004 used multiple regression as one of their methods; the corresponding figure for power analysis is 39%.
In this paper, we focus on power computation in the context of simple linear regression. The current approach to power computations lacks justification. We will point out the difficulties in this setting [3].
Simple linear regression is ubiquitous in pediatric clinical diagnostics. The model sets standards for normal growth in children on several metrics [4]. As an illustration, a pediatrician wants to check whether the lung function of a 13-year-old patient is normal. Data is to be collected on healthy subjects in the age range 12–14 years with response,
Y = FEV (Forced Expiratory Volume)
and predictor,
X = Height,
which is an example of an unplanned experiment.
In order to trust the model, we need to decide on the sample size, which, in turn, depends on the level of significance, power, and effect size.
First, we investigate the setting under the simple linear regression paradigm. The model has two entities: X, the predictor, and Y, the response variable. It is stated as
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \ \sigma^2)$$
for some $\beta_0$, $\beta_1$, and $\sigma^2 > 0$. The null hypothesis of interest is $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. What should the required sample size n be for a given level of significance α, power 1 − β, and alternative value A of $\beta_1$? Let (X1, Y1), (X2, Y2), …, (Xn, Yn) be a potential sample for the testing problem. Let $\hat{\beta}_1$ be the least squares estimator of $\beta_1$, i.e.,
$$\hat{\beta}_1 = \frac{S_{XY}}{S_{XX}},$$
where
$$S_{XY} = \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})$$
and
$$S_{XX} = \sum_{i=1}^{n} (X_i - \bar{X})^2.$$
Let RSS be the residual sum of squares, i.e.,
$$\mathrm{RSS} = \sum_{i=1}^{n} \left( Y_i - \bar{Y} - \hat{\beta}_1 (X_i - \bar{X}) \right)^2.$$
For testing the null hypothesis H0, the following test statistic is used:
$$T = \hat{\beta}_1 \sqrt{S_{XX}} \Big/ \sqrt{\mathrm{RSS}/(n-2)}.$$
Under the null hypothesis, conditioned on the X-data, T has a t-distribution with n − 2 degrees of freedom. Under the alternative value $\beta_1 = A$, T has a non-central t-distribution with n − 2 degrees of freedom and non-centrality parameter $\lambda = A\sqrt{S_{XX}}/\sigma$.
We reject the null hypothesis if and only if $|T| > t_{n-2,\,1-\alpha/2}$, where $t_{n-2,\,1-\alpha/2}$ is the point such that the area to its left under Student's t-curve with n − 2 degrees of freedom is 1 − α/2.
The power formula is given by
$$\mathrm{Power}(A) = \Pr(\text{Reject } H_0 \mid \beta_1 = A) = \Pr\!\left( |T| > t_{n-2,\,1-\alpha/2} \,\middle|\, \beta_1 = A \right).$$
We can set the power equal to 1 − β and solve for n. This works as long as we know $\lambda = A\sqrt{S_{XX}}/\sigma$, which requires knowledge of the alternative value of $\beta_1$, of $\sigma^2$, and of $S_{XX}$. We will not know $S_{XX}$ prior to data collection in an unplanned experiment. Equivalently, one would have to spell out λ, which is a tall order. Adcock [5] recognized these problems. Some software packages and textbooks assume that $\frac{1}{n}\sum_{i=1}^{n}(X_i - \bar{X})^2$ is known; for example, the software PASS [6] and nQuery [7] proceed this way. To overcome these difficulties, we derive the exact unconditional distribution of a variant of T. This requires knowledge of the distribution of X. Let $\sigma_X^2$ be the variance of X.
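For reference, the conventional conditional calculation is easy to code once $S_{XX}$ is pretended known. The R sketch below is ours, not the authors' supplementary code; the function name conditional_power and the input sx2 (an assumed per-subject spread of the X values) are exactly the PASS/nQuery-style inputs that are unavailable before data collection in an unplanned experiment.

```r
# Conventional (conditional) power for testing beta1 = 0, pretending that
# the spread of the X values is known in advance -- the very assumption
# that is unavailable in an unplanned experiment.
conditional_power <- function(n, A, sigma, sx2, alpha = 0.05) {
  Sxx    <- n * sx2                        # assumed known, PASS/nQuery style
  lambda <- A * sqrt(Sxx) / sigma          # non-centrality parameter
  tcrit  <- qt(1 - alpha / 2, df = n - 2)
  # two-sided power under the non-central t with n - 2 degrees of freedom
  pt(-tcrit, df = n - 2, ncp = lambda) + 1 - pt(tcrit, df = n - 2, ncp = lambda)
}

# Smallest n with power >= 0.80 when A = 0.3, sigma = 1, sx2 = 1 (illustrative values)
n <- 10
while (conditional_power(n, A = 0.3, sigma = 1, sx2 = 1) < 0.80) n <- n + 1
n
```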
We modify the test statistic to
$$T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}, \qquad (1)$$
where $\hat{\sigma}^2 = \mathrm{RSS}/(n-2)$ and $\hat{\sigma}_X^2 = S_{XX}/(n-1)$.
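Computing the modified statistic from data is straightforward. The R sketch below is our own illustration; the function name T_stat and the parameter values used to generate the toy data are arbitrary.

```r
# Modified test statistic T = beta1_hat * sigmaX_hat / sigma_hat from a sample (x, y)
T_stat <- function(x, y) {
  n   <- length(x)
  Sxx <- sum((x - mean(x))^2)
  Sxy <- sum((x - mean(x)) * (y - mean(y)))
  b1  <- Sxy / Sxx                                    # least squares slope
  rss <- sum((y - mean(y) - b1 * (x - mean(x)))^2)    # residual sum of squares
  sigma_hat  <- sqrt(rss / (n - 2))
  sigmaX_hat <- sqrt(Sxx / (n - 1))
  b1 * sigmaX_hat / sigma_hat
}

# Toy data from the five-parameter model (arbitrary parameter values)
set.seed(1)
x <- rnorm(100, mean = 160, sd = 10)      # X ~ N(mu_X, sigma_X^2)
y <- 1 + 0.05 * x + rnorm(100, sd = 2)    # Y | X ~ N(beta0 + beta1 X, sigma^2)
T_stat(x, y)
```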
We obtain the unconditional distribution of T under $\beta_1 = 0$ as well as under $\beta_1 = A \neq 0$. We assume $X \sim N(\mu_X, \sigma_X^2)$, with both parameters unknown. Under this assumption, the distribution of T is derived.
In due course, we will show that the distribution of T when $\beta_1 = A \neq 0$ depends only on $\delta = A\sigma_X/\sigma$, which we deem the effect size.
The five-parameter model now is:
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \ \sigma^2),$$
$$X \sim N(\mu_X, \ \sigma_X^2).$$
Note that the vector (X, Y) has a bivariate normal distribution.
The paper is organized as follows. In Section 2, we provide a literature review. In Section 3, we outline the main results. We derive the unconditional distribution of T under the null hypothesis in Section 3.1. In Section 3.2, we calculate critical values using the main results. In Section 3.3, we lay out the sample sizes required for a given level, power, and effect size $\delta = A\sigma_X/\sigma$. In Section 4, we summarize the results and draw conclusions. The computational details, along with the R code [8], are presented in the Supplementary Materials.

2. Literature Review

Ryan [3] has pointed out difficulties in power calculations in the environment of simple linear regression. The problem is how we handle the predictor X. Adcock [5] has looked at some possible scenarios. One scenario is that the investigator knows the Xi-values (deterministic) for every sample size n. In such a case, the test statistic
$$\hat{\beta}_1 \sqrt{S_{XX}} \Big/ \sqrt{\mathrm{RSS}/(n-2)} \qquad (2)$$
is eminently usable for power calculations. Its (conditional) null and non-null distributions have been worked out explicitly. The conditional approach is also followed by Dupont et al. [9], Draper et al. [10], Hsieh et al. [11], Maxwell [12], and Thigpen [13].
As an alternative to the test statistic (2), we can build a test based on the sample correlation coefficient $\hat{\rho}$ [14], under the joint normality of X and Y. The null and non-null distributions of the underlying test statistic based on $\hat{\rho}$ have been worked out explicitly. In our consulting work, many researchers prefer the test based on $\hat{\beta}_1$; it is a choice between causality and association [3,14,15,16,17,18,19,20,21]. Under bivariate normality the hypotheses $H_0: \beta_1 = 0$ and $H_0: \rho = 0$ are equivalent, but the test statistics are different. It is easy to determine the sample size in the correlation context [14]. However, this sample size cannot be carried over to the test of the hypothesis on the slope; its power there is lower. In other words, test hopping is not permissible; the two tests have distinct power functions.
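For orientation only, the sample size in the correlation context can be approximated with the familiar Fisher z transformation. The sketch below is ours, not the exact calculation of Gatsonis and Sampson [14]; the function name n_correlation is illustrative, and, per the remark above, the resulting n applies to the correlation test, not to the test based on the slope.

```r
# Approximate sample size for testing H0: rho = 0 (two-sided) with power
# 1 - beta at rho = rho1, using the Fisher z approximation
# z(rho_hat) ~ N(z(rho), 1 / (n - 3)).
n_correlation <- function(rho1, alpha = 0.05, power = 0.80) {
  z <- atanh(rho1)   # Fisher z transform of the alternative correlation
  ceiling(((qnorm(1 - alpha / 2) + qnorm(power)) / z)^2 + 3)
}
n_correlation(rho1 = 0.3, alpha = 0.05, power = 0.80)
```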

3. Outline of Results

We will now derive the unconditional distribution of $\hat{\beta}_1$, which will be instrumental in sample size calculations. We use the test statistic $T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}$.
Under the null hypothesis $\beta_1 = 0$, we show that
$$T^2 \ \sim\ \frac{n-2}{n-1} \cdot \frac{W_1 W_4}{W_2 W_3},$$
where $W_1 \sim \chi^2_1$, $W_2 \sim \chi^2_{n-1}$, $W_3 \sim \chi^2_{n-2}$, and $W_4 \sim \chi^2_{n-1}$, with the $W_i$ being mutually independent. It follows that
$$T \ \sim\ \frac{\sqrt{n-2}}{n-1} \cdot \frac{U_1 U_2}{U_3},$$
with $U_1$, $U_2$, $U_3$ independently distributed, $U_1 \sim t_{n-1}$, $U_2 \sim \chi_{n-1}$, and $U_3 \sim \chi_{n-2}$, where $\chi_{n-1}$ is the χ distribution (the square root of a $\chi^2_{n-1}$ variable) with n − 1 degrees of freedom.
We use this result to obtain the critical values of the test based on T for given levels. For power and sample size computations, we need the distribution of T for any given value of $\beta_1$. This distribution depends on the alternative value of $\beta_1$, on $\sigma_X^2$, and on $\sigma^2$. It turns out that it depends only on $\delta = \beta_1 \sigma_X / \sigma$, which we deem the effect size. The specification of δ facilitates the computation of power. Despite all these deliberations, no magic explicit formula for power surfaces. Knowing the distribution of $T^2$ once δ is spelled out eases the pain a little.

3.1. Distributional Results

In this section, we derive the distribution of the statistic T of (1) unconditionally. The following series of steps gives the desired result.
  • Given $X_1, X_2, \ldots, X_n$, $\hat{\beta}_1$ has a normal distribution with mean $\beta_1$ and variance $\sigma^2 / S_{XX}$, and $\hat{\beta}_1$ and RSS are independent.
  • Unconditionally, $\mathrm{RSS}/\sigma^2 \sim \chi^2_{n-2}$.
  • $S_{XX}/\sigma_X^2 \sim \chi^2_{n-1}$.
  • RSS and $S_{XX}$ are independent.
More generally, we obtain the distribution of $T = (\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma}$ for a given value of $\beta_1$.
The joint density function of $\hat{\beta}_1$ and $S_{XX}$ is:
$$g(\hat{\beta}_1, S_{XX}) = \frac{\sqrt{S_{XX}}}{\sqrt{2\pi}\,\sigma} \exp\!\left\{ -\frac{S_{XX}}{2\sigma^2}\,(\hat{\beta}_1 - \beta_1)^2 \right\} \cdot \frac{1}{\Gamma\!\left(\frac{n-1}{2}\right) 2^{(n-1)/2}} \exp\!\left\{ -\frac{S_{XX}}{2\sigma_X^2} \right\} \left( \frac{S_{XX}}{\sigma_X^2} \right)^{\frac{n-1}{2}-1} \frac{1}{\sigma_X^2}, \quad -\infty < \hat{\beta}_1 < \infty,\ 0 < S_{XX} < \infty.$$
The (unconditional) marginal density of $\hat{\beta}_1$ is given by:
$$f(\hat{\beta}_1) = \frac{1}{2^{1/2}\, 2^{(n-1)/2}\, \sqrt{\pi}\, \Gamma\!\left(\frac{n-1}{2}\right) \sigma\, (\sigma_X^2)^{(n-1)/2}} \int_0^{\infty} S_{XX}^{\,n/2-1} \exp\!\left\{ -\frac{S_{XX}}{2}\left[ \frac{(\hat{\beta}_1-\beta_1)^2}{\sigma^2} + \frac{1}{\sigma_X^2} \right] \right\} dS_{XX}$$
$$= \frac{\Gamma\!\left(\frac{n}{2}\right)}{\sqrt{\pi}\, \Gamma\!\left(\frac{n-1}{2}\right) \sigma\, (\sigma_X^2)^{(n-1)/2}} \left[ \frac{1}{\sigma_X^2} + \frac{(\hat{\beta}_1-\beta_1)^2}{\sigma^2} \right]^{-n/2} = \frac{\sigma_X}{B\!\left(\frac{1}{2}, \frac{n-1}{2}\right)\sigma} \left[ 1 + \frac{(\hat{\beta}_1-\beta_1)^2\, \sigma_X^2}{\sigma^2} \right]^{-n/2}, \quad -\infty < \hat{\beta}_1 < \infty.$$
Some properties of this density are easy to observe. For example, the distribution is symmetric around the true value $\beta_1$. If n = 2, the distribution is Cauchy. In addition,
$$\frac{\sigma_X}{\sigma}\,(\hat{\beta}_1 - \beta_1)\,\sqrt{n-1} \ \sim\ t_{n-1}.$$
Further, if n > 3, unconditionally,
  • $E(\hat{\beta}_1) = \beta_1$ and $\mathrm{Var}(\hat{\beta}_1) = (\sigma^2/\sigma_X^2)\,(n-3)^{-1}$;
  • In the conditional set-up,
    $E(\hat{\beta}_1 \mid X_1, X_2, \ldots, X_n) = \beta_1$,
    $\mathrm{Var}(\hat{\beta}_1 \mid X_1, X_2, \ldots, X_n) = \sigma^2 / S_{XX}$;
  • The random variable $U = (\hat{\beta}_1 - \beta_1)\,\sigma_X/\sigma$ has the probability density function:
    $$f(U) = \frac{1}{B\!\left(\frac{1}{2}, \frac{n-1}{2}\right)} \left(1 + U^2\right)^{-n/2}, \quad -\infty < U < \infty;$$
  • It follows that $U^2 \sim W_1/W_2$, where $W_1 \sim \chi^2_1$ and $W_2 \sim \chi^2_{n-1}$, with $W_1$ and $W_2$ being independent;
  • Exact distribution of $T^2$: note that $(\hat{\beta}_1 - \beta_1)/\sigma$ and $\hat{\sigma}_X^2$ are independent;
  • $$T^2 = \frac{(\hat{\beta}_1 - \beta_1)^2\, \hat{\sigma}_X^2}{\hat{\sigma}^2} = \frac{(\hat{\beta}_1 - \beta_1)^2\, \sigma_X^2}{\sigma^2} \cdot \frac{\sigma^2}{\hat{\sigma}^2} \cdot \frac{\hat{\sigma}_X^2}{\sigma_X^2} \ \sim\ \frac{W_1}{W_2} \cdot \frac{n-2}{W_3} \cdot \frac{W_4}{n-1},$$
    where $W_3 \sim \chi^2_{n-2}$ and $W_4 \sim \chi^2_{n-1}$, with $W_1$, $W_2$, $W_3$, and $W_4$ being independent.
    In short,
    $$T^2 \ \sim\ \frac{n-2}{n-1} \cdot \frac{W_1 W_4}{W_2 W_3};$$
  • It follows that:
    $$E(T^2) = \frac{n-2}{(n-3)(n-4)};$$
  • An alternative form of the distribution [22]:
    $$\frac{n-1}{n-2}\, T^2 \ \sim\ \frac{W_1 W_4}{W_2 W_3} \ \sim\ \mathrm{BetaII}\!\left(\tfrac{1}{2}, \tfrac{n-2}{2}\right)\cdot \mathrm{BetaII}\!\left(\tfrac{n-1}{2}, \tfrac{n-1}{2}\right),$$
    where BetaII signifies the beta distribution of the second kind. (A Monte Carlo check of these representations appears after this list.)
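The four-chi-square representation is convenient for simulation. The following R snippet is our own quick Monte Carlo check, not part of the paper's supplementary code; the choices n = 20 and B = 10^6 are arbitrary.

```r
# Monte Carlo check (ours) of T^2 ~ ((n-2)/(n-1)) * W1*W4 / (W2*W3) under H0
# and of E(T^2) = (n-2) / ((n-3)(n-4)).
set.seed(123)
n <- 20
B <- 1e6
W1 <- rchisq(B, df = 1)
W2 <- rchisq(B, df = n - 1)
W3 <- rchisq(B, df = n - 2)
W4 <- rchisq(B, df = n - 1)
T2 <- (n - 2) / (n - 1) * W1 * W4 / (W2 * W3)
mean(T2)                        # simulated mean of T^2
(n - 2) / ((n - 3) * (n - 4))   # theoretical mean, about 0.066 for n = 20
```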

3.2. Critical Values

We obtain the critical values of the test based on the test statistic $T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}$ for three levels of significance. We denote the critical value by $C_{n,\alpha}$; it satisfies the equation:
$$\alpha = \Pr\!\left( \left| \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} \right| > C_{n,\alpha} \,\middle|\, H_0: \beta_1 = 0 \right) = \Pr\!\left( \hat{\beta}_1^2 \hat{\sigma}_X^2 / \hat{\sigma}^2 > C_{n,\alpha}^2 \,\middle|\, H_0: \beta_1 = 0 \right).$$
Under $H_0$,
$$T^2 = \hat{\beta}_1^2 \hat{\sigma}_X^2 / \hat{\sigma}^2 \ \sim\ \frac{n-2}{n-1} \cdot \frac{W_1 W_4}{W_2 W_3},$$
where $W_1 \sim \chi^2_1$, $W_2 \sim \chi^2_{n-1}$, $W_3 \sim \chi^2_{n-2}$, and $W_4 \sim \chi^2_{n-1}$, with the $W_i$ being independent.
There are two options. The first is to use the pdf of $\frac{n-1}{n-2}\,T^2$. Following Jambunathan [22], one can write down the pdf of the product UV of the random variables U and V, with $U \sim \mathrm{BetaII}(1/2, (n-2)/2)$, $V \sim \mathrm{BetaII}((n-1)/2, (n-1)/2)$, and U and V independent. The pdf takes the form of a double integral, and its evaluation would require a quadrature formula with the attendant errors of approximation. The second option is to determine the distribution of $T^2$ by extensive Monte Carlo sampling of the components that make up $T^2$. We have pursued the second option. The critical values are tabulated in File S1.
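As an illustration of the second option, the critical value $C_{n,\alpha}$ can be approximated by simulating the four chi-square components directly. This R sketch is ours; the function name critical_value and the number of draws B are illustrative, and the published values should be taken from File S1.

```r
# Monte Carlo approximation (ours) of the critical value C_{n, alpha}:
# the test rejects H0 when |T| > C_{n, alpha}.
critical_value <- function(n, alpha, B = 1e6) {
  W1 <- rchisq(B, df = 1)
  W2 <- rchisq(B, df = n - 1)
  W3 <- rchisq(B, df = n - 2)
  W4 <- rchisq(B, df = n - 1)
  T2 <- (n - 2) / (n - 1) * W1 * W4 / (W2 * W3)
  unname(sqrt(quantile(T2, probs = 1 - alpha)))   # C such that Pr(T^2 > C^2) = alpha
}

set.seed(2023)
critical_value(n = 50, alpha = 0.10)
critical_value(n = 50, alpha = 0.05)
critical_value(n = 50, alpha = 0.01)
```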
One can also obtain the critical value $C_{n,\alpha}$ via the asymptotic distribution of T. One benefit of our derivation of the exact distribution is that, if n is large and the null hypothesis is true,
$$T \ \sim\ \mathrm{Normal}\!\left( 0, \ \frac{n-2}{(n-3)(n-4)} \right), \quad \text{approximately}.$$
There are several ways to establish the asymptotic normality of T. The exact unconditional distribution of $\sqrt{n-1}\,(\hat{\beta}_1 - \beta_1)\,\sigma_X/\sigma$ is $t_{n-1}$, which is asymptotically N(0, 1). We then use the fact that $\hat{\sigma}_X$ is consistent for $\sigma_X$ and that $\hat{\sigma}$ is consistent for σ. Since we know the variance of T exactly, we use this variance in the description of the asymptotic distribution of T. We can thus calculate critical values from the asymptotic distribution alongside those coming from the exact distribution.
In File S1, we report the average critical values Cn,α along with the critical values stemming from the asymptotic theory. A description of these asymptotic critical values is provided below.
Critical values from the normal approximation:

Level    Critical value formula                              Verbal description in File S1
10%      $1.645 \times \sqrt{(n-2)/\{(n-3)(n-4)\}}$          10% normal
5%       $1.96 \times \sqrt{(n-2)/\{(n-3)(n-4)\}}$           5% normal
1%       $2.576 \times \sqrt{(n-2)/\{(n-3)(n-4)\}}$          1% normal
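The normal-approximation values in the table above can be computed in one line. The following sketch is ours, with the quantiles 1.645, 1.96, and 2.576 expressed through qnorm; the function name is illustrative.

```r
# Normal-approximation critical value (our sketch of the formula tabulated above)
normal_critical_value <- function(n, alpha) {
  qnorm(1 - alpha / 2) * sqrt((n - 2) / ((n - 3) * (n - 4)))
}
normal_critical_value(n = 50, alpha = 0.10)   # 1.645 * sqrt(48 / (47 * 46))
normal_critical_value(n = 50, alpha = 0.05)
normal_critical_value(n = 50, alpha = 0.01)
```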
Comments on File S1: The Normal Critical Value column is explained.
  • Normal critical value 10% = critical value coming from the asymptotic distribution when α = 0.10.
  • Normal critical value 5% = critical value coming from the asymptotic distribution when α = 0.05.
  • Normal critical value 1% = critical value coming from the asymptotic distribution when α = 0.01.
  • Critical value 10% = critical value coming from the exact distribution of T when α = 0.10.
  • Critical value 5% = critical value coming from the exact distribution of T when α = 0.05.
  • Critical value 1% = critical value coming from the exact distribution of T when α = 0.01.
  • When α = 0.10, |Normal critical value 10% − Critical value 10%| ≤ 0.001 for n ≥ 50. One can enjoy the benefit of the normal approximation when n ≥ 50.
  • When α = 0.05, |Normal critical value 5% − Critical value 5%| ≤ 0.001 for n ≥ 89. One can enjoy the benefit of the normal approximation when n ≥ 89.
  • For α = 0.01, File S1 is not informative about when |Normal critical value 1% − Critical value 1%| ≤ 0.001.

3.3. Sample Size and Power

For a given level α, sample size n, and alternative value $\beta_1 = A$, the power is given by
$$\mathrm{Power}(A) = \Pr\!\left( \left| \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} \right| > C_{n,\alpha} \,\middle|\, \beta_1 = A \right).$$
Suppose 1 − β is the specified power. To determine the sample size, we set
$$1 - \beta = \Pr\!\left( \left| \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} \right| > C_{n,\alpha} \,\middle|\, \beta_1 = A \right)$$
and solve for n. We need the distribution of $\hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}$ when $\beta_1 = A$. Rewrite
$$\hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma} = (\hat{\beta}_1 - \beta_1)\, \hat{\sigma}_X / \hat{\sigma} + \beta_1 \hat{\sigma}_X / \hat{\sigma}.$$
The distribution of $(\hat{\beta}_1 - \beta_1)\, \hat{\sigma}_X / \hat{\sigma}$ is described in Section 3.1, and it is free of the parameters of the regression model. Consequently, the random variables $(\hat{\beta}_1 - \beta_1)\, \hat{\sigma}_X / \hat{\sigma}$ and $\beta_1 \hat{\sigma}_X / \hat{\sigma}$ are independently distributed. Since $\hat{\sigma}$ and $\hat{\sigma}_X$ are independently distributed,
$$\left( \beta_1 \hat{\sigma}_X / \hat{\sigma} \right)^2 \ \overset{d}{=}\ \beta_1^2\, \frac{\sigma_X^2}{n-1}\, W_5 \cdot \frac{n-2}{\sigma^2} \cdot \frac{1}{W_6} = \left( \frac{\beta_1 \sigma_X}{\sigma} \right)^2 \frac{n-2}{n-1} \cdot \frac{W_5}{W_6},$$
with $W_5 \sim \chi^2_{n-1}$, $W_6 \sim \chi^2_{n-2}$, and $W_5$ and $W_6$ being independent.
An important fact emerges from these deliberations: the distribution of $(\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X/\hat{\sigma} + \beta_1 \hat{\sigma}_X/\hat{\sigma}$ depends only on $\delta = \beta_1 \sigma_X / \sigma$, which we declare to be the effect size.
In short, when $\beta_1 = A \neq 0$, the key steps are:
  • $T = (\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma} + \beta_1 \hat{\sigma}_X / \hat{\sigma}$,
  • with $\left\{ (\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma} \right\}^2 \sim \dfrac{n-2}{n-1} \cdot \dfrac{W_1 W_4}{W_2 W_3}$,
  • $\left( \beta_1 \hat{\sigma}_X / \hat{\sigma} \right)^2 \sim \left( \dfrac{A \sigma_X}{\sigma} \right)^2 \dfrac{n-2}{n-1} \cdot \dfrac{W_5}{W_6}$,
  • $(\hat{\beta}_1 - \beta_1)\,\hat{\sigma}_X / \hat{\sigma}$ and $\beta_1 \hat{\sigma}_X / \hat{\sigma}$ are independent,
and the distribution of T depends only on n and the effect size $\delta = A \sigma_X / \sigma$.
In spite of all these labors, the distribution of T is not amenable to direct and simple computation of power.
We simulate the regression model for power computations; a sketch of such a simulation is given below. The simulations are greatly simplified when we exploit the key feature of the alternative distribution, namely, that it depends only on n and δ. The simulations are reported in the Supplementary Materials. Sample sizes are tabulated in Table 1, Table 2 and Table 3.
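The following R sketch is our own illustration of such a simulation; it is not the authors' supplementary code. Because the distribution of T depends only on n and δ, we may simulate, without loss of generality, with μX = 0, σX = 1, β0 = 0, σ = 1, and β1 = δ; the function names sim_T and sim_power and the Monte Carlo size B are illustrative choices.

```r
# Simulate the modified statistic T = beta1_hat * sigmaX_hat / sigma_hat
# from the regression model, standardized so that only n and delta matter.
sim_T <- function(n, delta, B = 20000) {
  replicate(B, {
    x   <- rnorm(n)                      # X ~ N(0, 1), i.e., mu_X = 0, sigma_X = 1
    y   <- delta * x + rnorm(n)          # beta0 = 0, beta1 = delta, sigma = 1
    Sxx <- sum((x - mean(x))^2)
    b1  <- sum((x - mean(x)) * (y - mean(y))) / Sxx
    rss <- sum((y - mean(y) - b1 * (x - mean(x)))^2)
    b1 * sqrt(Sxx / (n - 1)) / sqrt(rss / (n - 2))
  })
}

# Estimate power at effect size delta: compare |T| with a Monte Carlo
# critical value obtained under delta = 0.
sim_power <- function(n, delta, alpha, B = 20000) {
  C <- quantile(abs(sim_T(n, delta = 0, B)), probs = 1 - alpha)
  mean(abs(sim_T(n, delta, B)) > C)
}

# Example: Table 1 reports a validated mean power of about 0.80
# at alpha = 0.10, delta = 0.3, n = 73.
set.seed(7)
sim_power(n = 73, delta = 0.3, alpha = 0.10)
```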
Table 1. Sample Size for Given Effect Size, Power, Level of Significance 10%, Mean of Power in the Validation Step, and its Standard Deviation.
α      ES = β1·(σX/σ)    Power    n       Mean       Sd
0.1    0.1               80%      620     0.7993     0.013
0.1    0.1               90%      870     0.9027     0.0095
0.1    0.1               95%      1120    0.9546     0.0067
0.1    0.1               99%      1690    0.993      0.0027
0.1    0.2               80%      161     0.8259     0.0123
0.1    0.2               90%      219     0.90007    0.0096
0.1    0.2               95%      274     0.949      0.0069
0.1    0.2               99%      440     0.0039     0.0024
0.1    0.3               80%      73      0.8017     0.0124
0.1    0.3               90%      100     0.9006     0.00995
0.1    0.3               95%      124     0.9475     0.0071
0.1    0.3               99%      195     0.931      0.0026
0.1    0.4               80%      43      0.8031     0.0126
0.1    0.4               90%      60      0.9073     0.0093
0.1    0.4               95%      72      0.9518     0.0073
0.1    0.4               99%      105     0.9896     0.0031
0.1    0.5               80%      29      0.8045     0.0134
0.1    0.5               90%      39      0.9003     0.0099
0.1    0.5               95%      48      0.947      0.0068
0.1    0.5               99%      69      0.989      0.0034
0.1    0.6               80%      21      0.80       0.0129
0.1    0.6               90%      28      0.8961     0.0096
0.1    0.6               95%      35      0.9476     0.0069
0.1    0.6               99%      52      0.9911     0.003
Comments on Table 1, Table 2 and Table 3:
  • The first column in each table entertains three types of effect sizes: small (0.1, 0.2); medium (0.3, 0.4); and large (0.5, 0.6) [15].
  • The second column in each table lays out the powers entertained.
  • The third column in each table spells out the requisite sample size.
  • The fourth column is the fruit of our effort to validate the sample sizes. At the ascertained sample size, data are generated under the specifications and the power is calculated; this is repeated one thousand times, and the powers are averaged.
  • The fifth column records the standard deviation of the one thousand calculated powers.
  • We are satisfied that the sample sizes laid out hold up.

4. Discussion

A simple linear regression is a five-parameter model spelling out causality between two quantitative variables Y and X, typified by:
$$Y \mid X \sim N(\beta_0 + \beta_1 X, \ \sigma^2), \qquad X \sim N(\mu_X, \ \sigma_X^2),$$
for some parameters $\beta_0$, $\beta_1$, $\mu_X$, $\sigma^2 > 0$, and $\sigma_X^2 > 0$. The goal is to sample (X, Y) for testing $H_0: \beta_1 = 0$ versus the alternative $H_1: \beta_1 \neq 0$. For determining the sample size, we need the level of significance α, the power 1 − β, and the effect size $\delta = A\sigma_X/\sigma$, where A is the given alternative value of $\beta_1$. The test statistic T used here is the one based on the least squares estimator $\hat{\beta}_1$ of $\beta_1$.
The regression model, as originally formulated, is a conditional model, i.e., $Y \mid X \sim N(\beta_0 + \beta_1 X, \sigma^2)$. In practice, in a planned experiment, the experimenter selects values $x_1, x_2, \ldots, x_n$ of X and observes one or more Ys from the conditional distribution of $Y \mid x_i$ for each i. Thus, the sample size n has already been chosen. The statistic $\hat{\beta}_1 \sqrt{S_{XX}} / \sqrt{\mathrm{RSS}/(n-2)}$ is used for testing $H_0: \beta_1 = 0$ against the alternative $H_1: \beta_1 \neq 0$. The conditional distribution of the test statistic, given the data on X, is Student's t with n − 2 degrees of freedom under $H_0$, and it is non-central Student's t with n − 2 degrees of freedom and non-centrality parameter $A\sqrt{S_{XX}}/\sigma$ under $H_1: \beta_1 = A$. The alternative distribution can be used to calculate the power of the test at $\beta_1 = A$, and nothing more. The entities n and $S_{XX}$ are already in place, and σ has to be spelled out. The value of A is provided by the experimenter as the one of clinical significance. In the consulting experience of one of the authors, the experimenter usually comes up with a value for σ from his or her pilot study.
In some statistical circles [6,7], the non-null distribution is used to calculate the sample size with the desired power, with $S_{XX}$ remaining the same. This is controversial and is discussed in [3,5,13].
We are dealing with unplanned experiments, in which both X and Y are sampled together. Unplanned experiments are very common in clinical studies [4]. The effect size, in this context, is the product of the alternative value of $\beta_1$ and the ratio $\sigma_X/\sigma$ of the two standard deviations of the model.
The current practice demands α, 1 − β, A, σ, and $S_{XX}$, which we do not have. Specification of $S_{XX}$ is avoided by determining the unconditional distribution of
$$T = \hat{\beta}_1 \hat{\sigma}_X / \hat{\sigma}.$$
Exploiting the unconditional distribution of T, we calculated the critical values and the required sample sizes. The unconditional distribution under the alternative depends on the effect size $\delta = \beta_1 \sigma_X / \sigma$, as well as on n and α. In contrast, popular software such as PASS [6] and nQuery [7] use the conditional distribution of the test statistic T*, given the data on X, for calculating the sample size.
An additional feature of our paper is that we provide a comprehensive table of critical values and sample sizes, unlike commercial software.
The main result, that the non-null distribution of the test statistic T depends only on the effect size δ, has an echo in other inference problems. For example, when testing $\mu_1 = \mu_2$ under the assumptions of normality and a common variance $\sigma^2$, the non-null distribution of the two-sample t-statistic depends only on the effect size $\lambda = (\mu_1 - \mu_2)/\sigma$. This result, in spirit, is like ours. We have archived our findings for comments and insights [23].
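For the two-sample problem, this dependence on the effect size alone is visible directly in base R's power.t.test, which we use here only as a familiar parallel; the illustration is ours and is not part of the paper.

```r
# Two different (mu1 - mu2, sigma) pairs with the same effect size
# lambda = (mu1 - mu2) / sigma give the same power.
power.t.test(n = 50, delta = 1.0, sd = 2, sig.level = 0.05)$power  # lambda = 0.5
power.t.test(n = 50, delta = 0.5, sd = 1, sig.level = 0.05)$power  # lambda = 0.5, same value
```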
We trust that the tables provided will help researchers calculate sample sizes in the context of simple linear regression in unplanned experiments, avoiding the controversies that have been problematic until now. We will continue to study how the required sample sizes compare between the test based on the slope parameter of the model and the test based on the correlation coefficient.

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/e25040611/s1.

Author Contributions

The original sample size problem came from the consulting work of M.K.A. The project was designed by M.K.A. The bulk of derivations and computations were done by T.G. This was part of her thesis work. M.B.R. was the mentor. Conceptualization, M.K.A.; methodology, M.B.R.; software, T.G.; simulations, T.G.; writing—original draft preparation, T.G.; writing—review and editing, M.B.R.; supervision, M.B.R. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors are immensely indebted to the four reviewers, who exhorted them to bring the paper into a sharp focus highlighting its strength. One of the reviewers identified the true pulse of the paper and commented that the paper is in the ambit of an unplanned simple linear regression domain in contrast to the traditional planned simple linear regression armory.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Galton, F. Regression towards Mediocrity in Hereditary Stature. J. Anthropol. Inst. Great Br. Irel. 1886, 15, 246–263.
  2. Horton, N.J.; Switzer, S.S. Statistical Methods in the Journal. N. Engl. J. Med. 2005, 353, 1977–1979.
  3. Ryan, T.P. Sample Size Determination and Power; John Wiley and Sons: Hoboken, NJ, USA, 2013.
  4. Gripp, K.W. Handbook of Physical Measurements, 3rd ed.; Oxford University Press: Oxford, UK, 2013.
  5. Adcock, C.J. Sample size determination: A review. J. R. Stat. Soc. D 1997, 46, 261–283.
  6. PASS. Power Analysis and Sample Size Software; NCSS, LLC.: Kaysville, UT, USA, 2021. Available online: https://www.ncss.com/software/pass/ (accessed on 1 January 2020).
  7. nQuery. Sample Size and Power Calculation; Statsols (Statistical Solutions Ltd.): Cork, Ireland, 2017.
  8. R Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2017. Available online: https://www.R-project.org/ (accessed on 1 January 2020).
  9. Dupont, W.D.; Plummer, W.D. Power and Sample Size Calculations for Studies Involving Linear Regression. Control. Clin. Trials 1998, 19, 589–601.
  10. Draper, N.R.; Smith, H. Applied Regression Analysis, 2nd ed.; Wiley: New York, NY, USA, 1981.
  11. Hsieh, F.; Bloch, D.; Larsen, M. A simple method of sample size calculation for linear and logistic regression. Stat. Med. 1998, 17, 1623–1634.
  12. Maxwell, S.E. Sample Size and Multiple Regression Analysis. Psychol. Methods 2000, 5, 434–458.
  13. Thigpen, C.C. A Sample-Size Problem in Simple Linear Regression. Am. Stat. 1987, 41, 214–215.
  14. Gatsonis, C.; Sampson, A.R. Multiple Correlation: Exact Power and Sample Size Calculations. Psychol. Bull. 1989, 106, 516–524.
  15. Cohen, J. Statistical Power Analysis for the Behavioral Sciences, 2nd ed.; L. Erlbaum Associates: Hillsdale, NJ, USA, 1988.
  16. SAS Analytics Software and Solution—Version 9.4. Available online: https://support.sas.com/software/94/ (accessed on 1 January 2023).
  17. Krishnamoorthy, K.; Xia, Y. Sample size calculation for estimating or testing a nonzero squared multiple correlation coefficient. Multivar. Behav. Res. 2008, 43, 382–410.
  18. Mendoza, J.L.; Stafford, K.L. Confidence Intervals, Power Calculation, and Sample Size Estimation for the Squared Multiple Correlation Coefficient under the Fixed and Random Regression Models: A Computer Program and Useful Standard Tables. Educ. Psychol. Meas. 2001, 61, 650–667.
  19. Kelley, K. Sample size planning for the squared multiple correlation coefficient: Accuracy in parameter estimation via narrow confidence intervals. Multivar. Behav. Res. 2008, 43, 524–555.
  20. Shieh, G. A Unified Approach to Power Calculation and Sample Size Determination for Random Regression Models. Psychometrika 2007, 72, 347–360.
  21. Shieh, G. Sample size requirements for interval estimation of the strength of association effect sizes in multiple regression analysis. Psicothema 2013, 25, 402–407.
  22. Jambunathan, M.V. Some Properties of Beta and Gamma Distributions. Ann. Math. Stat. 1954, 25, 401–405.
  23. Guan, T.; Alam, M.K.; Rao, M.B. Sample Size Calculations in Simple Linear Regression: Trials and Tribulations. arXiv 2019, arXiv:1907.10569.
Table 2. Sample Size for Given Effect Size, Power, Level of Significance 5%, Mean of Power in the Validation Step, and its Standard Deviation.
α       ES = β1·(σX/σ)    Power    n       Mean       Sd
0.05    0.1               80%      790     0.8006     0.0129
0.05    0.1               90%      1080    0.9054     0.0088
0.05    0.1               95%      1350    0.9557     0.0067
0.05    0.1               99%      1850    0.9898     0.0032
0.05    0.2               80%      199     0.797      0.0133
0.05    0.2               90%      272     0.9039     0.0094
0.05    0.2               95%      330     0.9497     0.0069
0.05    0.2               99%      450     0.9891     0.0033
0.05    0.3               80%      91      0.7978     0.0124
0.05    0.3               90%      123     0.9028     0.0094
0.05    0.3               95%      150     0.9505     0.0067
0.05    0.3               99%      220     0.992      0.0028
0.05    0.4               80%      53      0.773      0.0128
0.05    0.4               90%      70      0.8966     0.0096
0.05    0.4               95%      87      0.9494     0.0071
0.05    0.4               99%      121     0.9891     0.0034
0.05    0.5               80%      36      0.8051     0.0124
0.05    0.5               90%      48      0.9095     0.0091
0.05    0.5               95%      58      0.95       0.0068
0.05    0.5               99%      79      0.9888     0.0033
0.05    0.6               80%      26      0.8005     0.0124
0.05    0.6               90%      34      0.8985     0.0094
0.05    0.6               95%      43      0.9547     0.0066
0.05    0.6               99%      59      0.9901     0.0031
Table 3. Sample Size for Given Effect Size, Power, Level of Significance 1%, Mean of Power in the Validation Step, and its Standard Deviation.
α       ES = β1·(σX/σ)    Power    n       Mean       Sd
0.01    0.1               80%      1180    0.8026     0.0124
0.01    0.1               90%      1500    0.9015     0.0095
0.01    0.1               95%      1760    0.946      0.0072
0.01    0.1               99%      2440    0.9906     0.0031
0.01    0.2               80%      301     0.8045     0.0121
0.01    0.2               90%      388     0.9046     0.0093
0.01    0.2               95%      458     0.9529     0.0065
0.01    0.2               99%      620     0.991      0.0031
0.01    0.3               80%      136     0.8044     0.0129
0.01    0.3               90%      172     0.9012     0.0089
0.01    0.3               95%      199     0.9432     0.0071
0.01    0.3               99%      265     0.9872     0.0034
0.01    0.4               80%      78      0.8017     0.0124
0.01    0.4               90%      95      0.8856     0.0099
0.01    0.4               95%      118     0.949      0.007
0.01    0.4               99%      158     0.9892     0.0033
0.01    0.5               80%      51      0.8042     0.0126
0.01    0.5               90%      64      0.9011     0.0099
0.01    0.5               95%      77      0.9464     0.007
0.01    0.5               99%      104     0.9891     0.0032
0.01    0.6               80%      37      0.7975     0.0125
0.01    0.6               90%      48      0.906      0.0089
0.01    0.6               95%      56      0.9485     0.007
0.01    0.6               99%      73      0.9874     0.0035
