1. Introduction
Classical applied statistical techniques often depend heavily on the assumption that observations are normally distributed. The benefit of this assumption is that it produces exact inferences in many popular methods, such as the t-test, the F-test, chi-squared tests, analysis of variance (ANOVA) models, and multivariate analysis. In reality, however, observations often show departures from normality (or near-normality). Statistical texts address this issue and discuss remedies, such as the Box–Cox transformations of data to normality. Developing techniques that use fewer assumptions (normality or others) is an important area of statistical research. For comparing different populations, this paper proposes an alternative to the one-way ANOVA that relaxes some of these assumptions.
A previous study described experimental radar reflectivity data obtained from independent radars deployed during NASA's Tropical Rainfall Measuring Mission Kwajalein Experiment in the Republic of the Marshall Islands from 15 July to 12 September 1999 [1]. The data are skewed, and we investigate whether the data from the two radars come from identical populations. A credit limit data set [2] with three different education levels (graduate school, university, and high school) was obtained from the University of California Irvine repository. The data are skewed with high variability, and we study whether the three credit limit populations are identical. Such data have unknown statistical distributions, and standard statistical procedures may fail to work properly.
If $f_i$ is the probability density function (pdf) of $N(\mu_i, \sigma^2)$ for $i = 1, \dots, k$ [3], then one can write
$$ f_i(x) = \exp(\alpha_i + \beta_i x)\, g(x), \quad i = 1, \dots, k-1, \qquad (1) $$
where $g$ is the pdf of $N(\mu_k, \sigma^2)$ and
$$ \alpha_i = \frac{\mu_k^2 - \mu_i^2}{2\sigma^2}, \qquad \beta_i = \frac{\mu_i - \mu_k}{\sigma^2}. \qquad (2) $$
From (1), denoting $g$ as a reference distribution, one can think of $f_i$ as an exponential distortion or tilt of the reference. Furthermore, using (2), the test of equality of the $\mu_i$'s (as needed in the one-way ANOVA) reduces to the test of equality of the $\beta_i$'s to zero, or
$$ H_0:\ \beta_1 = \beta_2 = \dots = \beta_{k-1} = 0. \qquad (3) $$
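As a quick numerical illustration of (1)–(2) (a minimal sketch, assuming the normal reference and the forms of $\alpha_i$ and $\beta_i$ written above), the log-ratio of two normal densities with a common variance is linear in $x$:

```python
import numpy as np
from scipy.stats import norm

# Two normal populations with a common variance; g is the reference pdf of N(mu_k, sigma^2).
mu_i, mu_k, sigma = 1.5, 0.0, 2.0
alpha_i = (mu_k**2 - mu_i**2) / (2 * sigma**2)   # intercept of the tilt
beta_i = (mu_i - mu_k) / sigma**2                # slope of the tilt

x = np.linspace(-5.0, 5.0, 11)
log_ratio = norm.logpdf(x, mu_i, sigma) - norm.logpdf(x, mu_k, sigma)

# The log density ratio equals alpha_i + beta_i * x up to floating-point error.
assert np.allclose(log_ratio, alpha_i + beta_i * x)
print(beta_i)  # beta_i = 0 exactly when mu_i = mu_k, which is what (3) tests
```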
Motivated by (3), the same authors proposed a generalization of (1) by replacing the normal reference with any pdf $g$, and $x$ in the exponent of (1) with any known function $h(x)$, and considered all pdfs $f_1, \dots, f_{k-1}$ that can be expressed as exponential tilt pdfs of $g$. In this way, (1)–(2) are updated as
$$ f_i(x) = \exp\{\alpha_i + \beta_i\, h(x)\}\, g(x), \quad i = 1, \dots, k-1, \qquad (4) $$
where each $\alpha_i$ is the normalizing constant determined by $\int f_i(x)\,dx = 1$ for some $\beta_i$. Then, (3) can also be used to test for the equality of any $k$ pdfs satisfying (4), which are not necessarily normal.
In classical ANOVA, the parameters of the normal distributions are estimated using the maximum likelihood method. To estimate the parameters in (4), the authors of [3] considered the restricted class of distributions obtained by multiplicative exponential distortions with $g$ as a reference and, based on $k$ independent samples, used the profile maximum likelihood method to estimate the parameters in that class. This paper, instead, considers the class $\mathcal{C}$ of all distributions (restricted only by a given mean of $h$) and estimates the $\beta_i$'s by minimizing the Kullback–Leibler divergence between $g$ and the class $\mathcal{C}$.
Often the criterion of comparison between distributions is clear from the context of the data, which helps to formulate the constraint set $\mathcal{C}$. In the radar data, we are interested in whether the mean rain rate is equal for the two radars. In the credit limit data, we are interested in whether the mean credit limit is the same in the three education groups. The constraints can involve multiple criteria as well. Once the constraints are fixed in $\mathcal{C}$, only those aspects are considered in the comparison between distributions.
This approach matches the maximum entropy (ME) principle, which may be stated as follows: when selecting a model for a given situation, it is often appropriate to express the prior information in terms of constraints. However, one must be careful that no information other than these specified constraints is used in model selection. That is, apart from the constraints that we have, the uncertainty associated with the probability distribution to be selected should be kept at its maximum [4]. In this paper, we extend the ME principle to general information projection using the Kullback–Leibler divergence.
We show in Section 2 that the solution (4) is optimum under specified constraints using $h$ when $g$ is known; in this way, the proposed approach yields an optimality interpretation for the exponential tilt models. The proposed approach extends the comparisons of means in ANOVA to comparisons of means and variances for the normal and other known distributions using duality. In Section 3, we develop a semi-parametric approach when $g$ is unknown and derive asymptotic test statistics for testing the equality of populations for the cases when sample sizes are equal or different. In Section 4, we present simulation studies that evaluate the performance of the $\beta$-parameters with respect to the classical ANOVA methods. We also compare the test statistics developed in Section 3 with existing parametric and nonparametric procedures. Section 5 shows the details of the proposed methods for the applications with the radar data and credit limit data sets. Section 6 contains discussions on the choice of the function(s) $h$ for particular cases of reference distributions $g$. The Appendix contains additional results and proofs.
2. Tilt Optimality Models
Kullback–Leibler (KL) discriminant information, or divergence, is a measure of 'distance' between two probability distributions. For pdfs $f$ and $g$, the KL discriminant information for $f$ against $g$ is given by
$$ D(f\,\|\,g) = \int f(x)\,\log\frac{f(x)}{g(x)}\,dx, $$
which is always nonnegative, and $D(f\,\|\,g) = 0$ if and only if $f = g$.
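For instance (a small numerical sketch, not taken from the paper), the KL divergence between two normal densities can be evaluated by quadrature and compared with its closed form:

```python
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

# D(f || g) = integral of f(x) log(f(x)/g(x)) dx for f = N(1, 1) and g = N(0, 1).
f, g = norm(1.0, 1.0), norm(0.0, 1.0)
integrand = lambda x: f.pdf(x) * (f.logpdf(x) - g.logpdf(x))
kl_numeric, _ = quad(integrand, -np.inf, np.inf)

# Closed form for equal unit variances: (mu_f - mu_g)^2 / 2.
print(kl_numeric, 0.5 * (1.0 - 0.0) ** 2)  # both are approximately 0.5; D = 0 iff f = g
```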
Let $\mathcal{C}$ be a convex set of pdfs with $D(f\,\|\,g) < \infty$ for some $f \in \mathcal{C}$. If $f^*$ is the solution to
$$ \min_{f \in \mathcal{C}} D(f\,\|\,g), \qquad (6) $$
then $f^*$ (the information projection of $g$ onto $\mathcal{C}$) is the closest to $g$ among all pdfs in $\mathcal{C}$ with respect to the KL distance. Let $h_1(x), \dots, h_t(x)$ be arbitrary but known functions of $x$. Define $\mathcal{C}$ to be the class of all pdfs $f$ for which the means of $h_1(X), \dots, h_t(X)$ are fixed at given values $\theta_1, \dots, \theta_t$, that is,
$$ \mathcal{C} = \Big\{ f : \int h_j(x)\, f(x)\, dx = \theta_j, \ \ j = 1, \dots, t \Big\}. \qquad (7) $$
In order to solve (6), Fenchel's duality theorem can be applied as shown by [5]. The corresponding dual cone consists of the exponential tilts of $g$ generated by linear combinations of the centered constraint functions $h_j(x) - \theta_j$, $j = 1, \dots, t$. The dual problem can be shown to be equivalent to
$$ \min_{\beta_1, \dots, \beta_t} \int \exp\Big\{\sum_{j=1}^{t}\beta_j\,(h_j(x) - \theta_j)\Big\}\, g(x)\, dx. \qquad (8) $$
As the dual problem (8) is a function of scalars only, it could be substantially easier to solve than the primal problem (6), depending on the form of $g$. In particular, setting the derivative of the integral quantity in (8) with respect to $\beta_j$ equal to zero, we obtain
$$ \int (h_j(x) - \theta_j)\, \exp\Big\{\sum_{l=1}^{t}\beta_l\,(h_l(x) - \theta_l)\Big\}\, g(x)\, dx = 0, \quad j = 1, \dots, t, \qquad (9) $$
which can be solved easily for the $\beta_j$'s by the Newton–Raphson method. If the solution of (9) is $(\hat\beta_1, \dots, \hat\beta_t)$, then, from [5], the solution of the primal problem (6), say $f^*$, is of the form
$$ f^*(x) = \exp\Big\{\alpha + \sum_{j=1}^{t}\hat\beta_j\, h_j(x)\Big\}\, g(x), \qquad (10) $$
where $\alpha$ is the normalizing constant. When $t = 1$ with $h_1 = h$, setting $\beta = \hat\beta_1$, (10) can be simplified as (4). The above derivation explains the exponential structure of $f^*$, and (9) verifies that the exponential tilt model $f^*$ is in $\mathcal{C}$ (see (7)). These developments are summarized in the following theorem.
Theorem 1. When $g$ is known, the exponential tilt model (4) is the optimum model in the sense that it is the closest to $g$, among all probability distributions in $\mathcal{C}$ from (7), in the KL distance.

The model (10) will be referred to in the sequel as the tilt optimality (TO) model. Note that, when $\mathcal{C}$ in (7) specifies $h(x) = x$ and $g$ is the pdf of $N(\mu_k, \sigma^2)$, then the solution $f^*$ is found to be of the form (1), as was seen in (1)–(2). When $g$ in (6) is uniform (or Lebesgue measure), then the minimizing $f^*$ is known as the maximum entropy model (distribution) in $\mathcal{C}$ [4].
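As a standard illustration of this last point (a textbook example, not taken from the paper): if $g$ is Lebesgue measure on $(0, \infty)$ and $\mathcal{C}$ specifies the single constraint $\int x\, f(x)\, dx = \theta$, then (10) gives $f^*(x) = \exp\{\alpha + \beta x\}$ with $\beta < 0$, and matching the constraint yields
$$ f^*(x) = \frac{1}{\theta}\, e^{-x/\theta}, \qquad x > 0, $$
the maximum entropy density on $(0, \infty)$ with mean $\theta$.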
Theorem A1 in the Appendix A shows that closed form solutions for the dual problem are obtained for the normal distributions with constraints on both the mean and variance. However, this may not always be the case. The final solution depends on the form of $g$ and the restrictions in $\mathcal{C}$.
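When no closed form is available, the score Equation (9) can be solved numerically. The sketch below is our illustration only, under the notation reconstructed above, with a single constraint $h(x) = x$, a target mean $\theta$, and a standard normal reference; the function name `score` is ours, and a bracketing root finder is used in place of Newton–Raphson:

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import brentq
from scipy.stats import norm

theta = 1.2          # constrained mean for h(x) = x
g = norm(0.0, 1.0)   # reference pdf

def score(beta):
    """Left-hand side of the score equation (9) for the single constraint h(x) = x."""
    val, _ = quad(lambda x: (x - theta) * np.exp(beta * (x - theta)) * g.pdf(x),
                  -np.inf, np.inf)
    return val

# Any root finder works here; the paper suggests Newton-Raphson, we use Brent's method.
beta_hat = brentq(score, -5.0, 5.0)

# Tilted density f*(x) proportional to exp(beta_hat * x) g(x); check its mean equals theta.
num = quad(lambda x: x * np.exp(beta_hat * (x - theta)) * g.pdf(x), -np.inf, np.inf)[0]
den = quad(lambda x: np.exp(beta_hat * (x - theta)) * g.pdf(x), -np.inf, np.inf)[0]
print(beta_hat, num / den)   # for a standard normal reference, beta_hat equals theta
```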
While (1) compares each $f_i$ with the reference $g$ one-at-a-time, another approach would be to compare all $k$ distributions simultaneously. For $\mathbf{x} = (x_1, \dots, x_k)'$, let $g(\mathbf{x})$ be the pdf of a $k$-variate normal distribution with known mean vector $(\mu_1, \dots, \mu_k)'$ and covariance matrix $\Sigma$ with diagonal entries $\sigma_1^2, \dots, \sigma_k^2$ and zeroes elsewhere. Considering equality of the $k$ means (unspecified) with possibly unequal variances, we define
$$ \mathcal{C} = \Big\{ f : \int (x_i - x_k)\, f(\mathbf{x})\, d\mathbf{x} = 0, \ \ i = 1, \dots, k-1 \Big\}. \qquad (11) $$
Following (10), the solution (12) is an exponential tilt of $g$ in the contrasts $x_i - x_k$, $i = 1, \dots, k-1$. However, a closed form expression is only available in special cases.
Theorem 2 (proof in Appendix A) considers equality of means as in classical ANOVA (which deals with unknown means and unknown but equal variances), but with unrestricted variances.
Theorem 2. For $\mathbf{x} = (x_1, \dots, x_k)'$ with known reference pdf $g$ with the above $\Sigma$, consider the minimization problem
$$ \min_{f \in \mathcal{C}} D(f\,\|\,g), \qquad (13) $$
with $\mathcal{C}$ in (11). The solution to (13) is given by (14), with the parameters given in (15).

Note that the solution has the same covariance $\Sigma$ as the reference $g$; nonetheless, $\Sigma$ influences the mean in solution (15) as a weighted average of its elements. When $\sigma_1^2 = \cdots = \sigma_k^2$, as in one-way ANOVA, the weighted average reduces to a simple average.
Theorem 2 can be extended, with closed form solutions for the parameters that are tedious to derive [6]. Unique solutions exist for higher $k$, but obtaining their closed forms seems intractable. This is also the case for the extension to a general $\Sigma$ with nonzero off-diagonal entries.
Beyond the one-way ANOVA, the above approach allows us to simultaneously compare both the means and variances of k independent normal distributions (Theorem A2).
3. Semiparametric Approach
For an unknown data generating process, however, the true form of $g$ in (13) (with $\mathcal{C}$ in (11)) may not be known. Then, the solution (4) (with $x$ replaced by $h(x)$ in the exponent) is not well-defined. Note that the model (10) now becomes a 'semiparametric tilt optimality restricted model' because, along with the parametric component given by the exponential tilt, there is also the nonparametric component $g$, about which no distributional assumption is made. Using the sample, we define a discrete version of $\mathcal{C}$ expressed as moment constraints. Assuming that these sample (moment) constraints represent the corresponding population (moment) constraints efficiently and consistently, the resulting model is expected to perform well.
The dual problem corresponding to (13) is
$$ \min_{\beta_1, \dots, \beta_{k-1}} \int \exp\Big\{\sum_{i=1}^{k-1}\beta_i\,(x_i - x_k)\Big\}\, g(\mathbf{x})\, d\mathbf{x}, $$
and the relevant score equations are
$$ \int (x_i - x_k)\, \exp\Big\{\sum_{l=1}^{k-1}\beta_l\,(x_l - x_k)\Big\}\, g(\mathbf{x})\, d\mathbf{x} = 0, \quad i = 1, \dots, k-1. $$
To study the asymptotic properties of the model (similar in spirit to [7]), we consider the cases when the sample sizes are equal and when they are different.
3.1. Equal Sample Sizes
Suppose $k$ independent random samples, each of size $n$, are available from independent populations with unknown means $\mu_1, \dots, \mu_k$ and unknown variances $\sigma_1^2, \dots, \sigma_k^2$, respectively. If we rearrange all the sample values as $\mathbf{t}_j = (x_{1j}, \dots, x_{kj})'$, where $j = 1, \dots, n$, then the $\mathbf{t}_j$'s form a random sample from a multivariate distribution (say, pdf $g$), with mean $(\mu_1, \dots, \mu_k)'$ and covariance $\mathrm{diag}(\sigma_1^2, \dots, \sigma_k^2)$.
Let $\tilde{g}_n$ be the empirical distribution that has mass $1/n$ at each $\mathbf{t}_j$. The constraint of equality of $k$ means, $\mathcal{C}$ in (11), is discretized below as $\mathcal{C}_n$, appropriately, using the probability mass function (pmf) $\mathbf{p} = (p_1, \dots, p_n)$ and the sample values $\mathbf{t}_j$'s as
$$ \mathcal{C}_n = \Big\{ \mathbf{p} : \sum_{j=1}^{n} p_j\,(x_{ij} - x_{kj}) = 0, \ i = 1, \dots, k-1, \ \ p_j \ge 0, \ \sum_{j=1}^{n} p_j = 1 \Big\}. $$
Here, the primal problem (6) becomes (replacing pdfs with pmfs)
$$ \min_{\mathbf{p} \in \mathcal{C}_n} \sum_{j=1}^{n} p_j \log\frac{p_j}{1/n}, $$
and the dual problem is
$$ \min_{\beta_1, \dots, \beta_{k-1}} \frac{1}{n}\sum_{j=1}^{n} \exp\Big\{\sum_{i=1}^{k-1}\beta_i\,(x_{ij} - x_{kj})\Big\}. $$
The score equations for the dual problem are
$$ \frac{1}{n}\sum_{j=1}^{n} (x_{ij} - x_{kj})\, \exp\Big\{\sum_{l=1}^{k-1}\beta_l\,(x_{lj} - x_{kj})\Big\} = 0, \quad i = 1, \dots, k-1. \qquad (20) $$
Suppose $\hat\beta_1, \dots, \hat\beta_{k-1}$ solve the score Equation (20). Then, the primal solution $\hat{\mathbf{p}}$ is given by
$$ \hat{p}_j = \frac{\exp\big\{\sum_{i=1}^{k-1}\hat\beta_i\,(x_{ij} - x_{kj})\big\}}{\sum_{l=1}^{n}\exp\big\{\sum_{i=1}^{k-1}\hat\beta_i\,(x_{il} - x_{kl})\big\}}, \quad j = 1, \dots, n. \qquad (21) $$
For arbitrary $\boldsymbol{\beta} = (\beta_1, \dots, \beta_{k-1})'$, define the sample score vector $G_n(\boldsymbol{\beta})$ from the left-hand side of (20), together with its first and second derivative matrices. By (20) and (21), $G_n(\hat{\boldsymbol{\beta}}) = \mathbf{0}$, a vector of zeroes of length $k-1$. Furthermore, a Taylor expansion (22) of $G_n(\boldsymbol{\beta})$ around $\boldsymbol{\beta} = \mathbf{0}$ (under $H_0$) holds with an intermediate point $\boldsymbol{\beta}^*$ satisfying $\max_i |\beta_i^*| \le \max_i |\hat\beta_i|$. As $n \to \infty$, by the strong law of large numbers, the sample score and its derivatives converge, with probability 1, to their population counterparts in (23). By (21) and (23), it follows that, when $H_0$ holds, $\hat{\boldsymbol{\beta}} \to \mathbf{0}$ with probability 1. As $n \to \infty$, by the central limit theorem, $\sqrt{n}\,G_n(\mathbf{0})$ converges in distribution to a multivariate normal (24), with mean zero and covariance defined in (25) and (26). By (22)–(26), $\sqrt{n}\,\hat{\boldsymbol{\beta}}$ converges in distribution to a multivariate normal as $n \to \infty$. The above developments are summarized in the following theorem, which establishes the asymptotic normality of the parameters of the proposed model when the sample sizes are equal.
Theorem 3. For a general reference pdf $g$, assume that the solution of (13), $f^*$, exists and that the required moment conditions hold for $\boldsymbol{\beta}$ in an open neighborhood of $\mathbf{0}$. When all sample sizes are equal to $n$ and $n \to \infty$, $\sqrt{n}\,(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})$ converges in distribution to a multivariate normal whose covariance is built from the matrices defined in (24) and (26), respectively. When $H_0$ holds, or equivalently, $\boldsymbol{\beta} = \mathbf{0}$, the corresponding quadratic form in $\hat{\boldsymbol{\beta}}$ has an asymptotic chi-square distribution with $k-1$ degrees of freedom as $n \to \infty$.
Thus, the test statistic $n\,\hat{\boldsymbol{\beta}}'\,\hat{V}^{-1}\hat{\boldsymbol{\beta}}$, with $\hat{V}$ an estimate of the asymptotic covariance of $\sqrt{n}\,\hat{\boldsymbol{\beta}}$, can be used for testing the hypothesis $H_0: \beta_1 = \cdots = \beta_{k-1} = 0$, where the matrices in (24) and (26) are estimated by their sample counterparts (since $\boldsymbol{\beta} = \mathbf{0}$ under $H_0$, we replaced $\boldsymbol{\beta}$ by $\mathbf{0}$).
Clearly, the above developments can be extended for simultaneous mean and variance comparisons for $k$ populations by modifying the constraint set $\mathcal{C}$ in (11).
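To make the procedure of this subsection concrete, the following sketch is our illustration only: the function names are ours, the covariance estimate is a generic sandwich plug-in rather than the exact expressions in (24) and (26), and $h(x) = x$ is assumed. It estimates $\hat{\boldsymbol{\beta}}$ from the empirical score Equation (20) and computes a Wald-type chi-square statistic for $H_0$:

```python
import numpy as np
from scipy.optimize import minimize

def equal_means_tilt_test(samples):
    """Semiparametric tilt test for equality of k means, equal sample sizes n."""
    X = np.asarray(samples, dtype=float)          # shape (k, n)
    k, n = X.shape
    d = (X[:-1] - X[-1]).T                        # contrasts x_ij - x_kj, shape (n, k-1)

    # Dual objective: the score equation (20) is the gradient of this convex function.
    def dual(beta):
        return np.mean(np.exp(d @ beta))

    beta_hat = minimize(dual, np.zeros(k - 1), method="BFGS").x

    # Generic sandwich covariance plug-in for sqrt(n) * beta_hat (our choice).
    w = np.exp(d @ beta_hat)
    A = (d * w[:, None]).T @ d / n                # derivative of the score
    B = (d * (w**2)[:, None]).T @ d / n           # variance of the score terms
    cov = np.linalg.inv(A) @ B @ np.linalg.inv(A)

    stat = n * beta_hat @ np.linalg.solve(cov, beta_hat)   # ~ chi-square(k-1) under H0
    return beta_hat, stat

rng = np.random.default_rng(0)
samples = rng.gamma(shape=2.0, scale=1.5, size=(3, 200))   # 3 skewed populations, equal means
beta_hat, stat = equal_means_tilt_test(samples)
print(beta_hat, stat)
```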
3.2. Different Sample Sizes
The simultaneous approach developed above for equal sample sizes does not allow the sample sizes to be different. To that end, we consider $k-1$ independent one-at-a-time population optimization problems by adopting the development in Section 2, reversing the roles of $f_i$ and $g$ in (4) and fixing a common constrained mean $\theta$, and, finally, we combine the $k-1$ results. Both procedures work for equal sample sizes.
For $i = 1, \dots, k-1$, consider the $i$th problem as finding the pdf in
$$ \mathcal{C}_i = \Big\{ f : \int h(x)\, f(x)\, dx = \theta \Big\} \qquad (29) $$
which is the closest to $f_i$, assuming that $\theta$ is known. Following similar steps as in Section 2, the pdf in (29) which is the closest to $f_i$ is given by an exponential tilt of $f_i$, where $\beta_i$ solves
$$ \int (h(x) - \theta)\, \exp\{\beta_i\,(h(x) - \theta)\}\, f_i(x)\, dx = 0. \qquad (31) $$
If $\beta_i = 0$, then $f_i$ itself satisfies the constraint in (29).
To develop the corresponding sample optimization problems, suppose independent random samples of sizes $n_1, \dots, n_k$ are available from the $k$ populations with means $\mu_1, \dots, \mu_k$, respectively. Although we assume that $\theta$ is known, in reality, it may be unknown. Thus, we suggest choosing the $k$th sample as the one that is the largest in size. Let $\bar{x}_k$ be the mean of the $k$th sample. Then, take $\theta = \bar{x}_k$ (see Section 6).
For the $i$th ($i = 1, \dots, k-1$) sample optimization problem, let $\tilde{g}_i$ be an empirical distribution that has mass $1/n_i$ at each $x_{ij}$. Let the $i$th sample version of $\mathcal{C}_i$ in (29), say $\mathcal{C}_{n_i}$, containing the pmf $\mathbf{p}_i = (p_{i1}, \dots, p_{in_i})$, be defined as
$$ \mathcal{C}_{n_i} = \Big\{ \mathbf{p}_i : \sum_{j=1}^{n_i} p_{ij}\,(h(x_{ij}) - \theta) = 0, \ \ p_{ij} \ge 0, \ \sum_{j=1}^{n_i} p_{ij} = 1 \Big\}. $$
The $i$th ($i = 1, \dots, k-1$) sample version of (6) and its dual problem become
$$ \min_{\mathbf{p}_i \in \mathcal{C}_{n_i}} \sum_{j=1}^{n_i} p_{ij} \log\frac{p_{ij}}{1/n_i} \quad \text{and} \quad \min_{\beta_i} \frac{1}{n_i}\sum_{j=1}^{n_i} \exp\{\beta_i\,(h(x_{ij}) - \theta)\}, $$
respectively. The $i$th score equation is
$$ \frac{1}{n_i}\sum_{j=1}^{n_i} (h(x_{ij}) - \theta)\, \exp\{\beta_i\,(h(x_{ij}) - \theta)\} = 0. \qquad (32) $$
Suppose $\hat\beta_i$ solves (32). Then, the primal solution $\hat{\mathbf{p}}_i$ is given by
$$ \hat{p}_{ij} = \frac{\exp\{\hat\beta_i\,(h(x_{ij}) - \theta)\}}{\sum_{l=1}^{n_i}\exp\{\hat\beta_i\,(h(x_{il}) - \theta)\}}, \quad j = 1, \dots, n_i. \qquad (33) $$
For arbitrary $\beta_i$, define the sample score function from the left-hand side of (32), together with its first and second derivatives (34). By (32), the sample score vanishes at $\hat\beta_i$. A Taylor expansion (35) of the sample score around $\beta_i = 0$ holds with an intermediate point $\beta_i^*$ satisfying $|\beta_i^*| \le |\hat\beta_i|$. With $n_i \to \infty$, by the strong law of large numbers, the sample score and its derivative converge, with probability 1, to their population counterparts. From (31), the population score at $\beta_i = 0$ is zero under $H_0$. When $H_0$ holds, by (33) and (35), it follows that, as $n_i \to \infty$, $\hat\beta_i \to 0$ with probability 1. As $n_i \to \infty$, by the central limit theorem, $\sqrt{n_i}$ times the sample score at $\beta_i = 0$ converges in distribution to a normal with mean zero. Combining these results, $\sqrt{n_i}\,\hat\beta_i$ converges in distribution to a normal as $n_i \to \infty$.
Next, we combine the results from the $k-1$ sample optimization problems.
Theorem 4. For a general reference pdf $g$, assume that the solution of (13) subject to (29) exists and that the required moment conditions hold for each $\beta_i$ in an open neighborhood of $0$. Since all $k$ samples are independent, the $\hat\beta_i$'s are asymptotically independent normal as $\min_i n_i \to \infty$. Then, under $H_0$, the sum of the $k-1$ standardized quadratic terms has an asymptotic chi-square distribution with $k-1$ degrees of freedom as $\min_i n_i \to \infty$.

Thus, the test statistic (40) can be used for testing the hypothesis $H_0: \beta_1 = \cdots = \beta_{k-1} = 0$, where the unknown asymptotic variances are replaced by their sample estimates. Since the populations share a common mean under $H_0$, we replace each of the individual estimates by a common estimate and then calculate a pooled estimate of variance over the $k$ populations. With this substitution, the test statistic from (40) is simplified as (41).

The above developments can be extended for simultaneous mean and variance comparisons for $k$ populations by modifying the constraint set in (29).
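As an illustration of the one-at-a-time procedure (our sketch only; the variance plug-ins and function names follow a generic construction rather than the exact expressions in (40) and (41), $h(x) = x$ is assumed, and the estimation error in $\theta$ is ignored), each $\hat\beta_i$ is obtained from its own score equation and the independent pieces are combined into a chi-square statistic with $k-1$ degrees of freedom:

```python
import numpy as np
from scipy.optimize import brentq

def one_at_a_time_tilt_test(samples):
    """Tilt test for equality of k means with unequal sample sizes; h(x) = x."""
    samples = sorted(samples, key=len)            # put the largest sample last
    ref = np.asarray(samples[-1], dtype=float)
    theta = ref.mean()                            # reference constraint value

    stat = 0.0
    for x in samples[:-1]:
        d = np.asarray(x, dtype=float) - theta    # centered values h(x_ij) - theta
        score = lambda b: np.mean(d * np.exp(b * d))
        beta_hat = brentq(score, -5.0, 5.0)       # solve the i-th score equation (32)

        # One-dimensional sandwich variance plug-in for sqrt(n_i) * beta_hat (our choice).
        w = np.exp(beta_hat * d)
        a = np.mean(w * d**2)
        b_var = np.mean(w**2 * d**2)
        stat += len(d) * beta_hat**2 * a**2 / b_var   # contribution of population i

    return stat                                    # ~ chi-square(k-1) under H0

rng = np.random.default_rng(1)
samples = [rng.exponential(2.0, size=n) for n in (80, 120, 300)]  # skewed, equal means
print(one_at_a_time_tilt_test(samples))
```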
For more than one $h$, let $h_1, \dots, h_t$ represent $t$ constraints.

(i) For equal sample sizes, recall the quantities defined in Section 3.1. Then, the IC is built from these quantities, where $\lambda_i$ is the $i$th eigenvalue of the relevant covariance matrix; the required factor of that matrix is found by Cholesky decomposition.

(ii) For different sample sizes, define the analogous quantities from Section 3.2. Then, the IC is built in the same way, where $\lambda_l$ is the $l$th eigenvalue of the corresponding matrix, again found by Cholesky decomposition.
6. Discussions
Often prior information is known about the population. However, the sample collected may not reveal this information due to sampling variability. Hence, it is worthwhile to build a model that satisfies the prior information and is the closest to the observed data. In Theorem 1, $g$ serves as the observed data, $\mathcal{C}$ serves as the prior information, and the distance between $f \in \mathcal{C}$ and $g$ is measured using the KL distance.
This paper proposed a method to compare different populations based on a set of restrictions specified by the investigator. The restrictions were set in the form of moment constraints through one or more functions $h$. Setting different types of $h$ compares different aspects of the distributions under consideration; e.g., $h(x) = x$ in (4) compares $f_i$ and $g$ regarding their means. When $\beta_i = 0$ in (4), then $f_i = g$. However, when $\beta_i \neq 0$, then $f_i$ and $g$ might differ in aspects other than only their values of $E[h(X)]$.
For real data, one can obtain basic information from the data, including the shape. If the distributions under consideration are known to be approximately symmetric, using $h(x) = x$ and/or $h(x) = x^2$ may be the first steps to determine whether the distributions differ regarding their means and/or variances. However, if the distributions under consideration are known to be skewed, then other choices of $h$ would be more appropriate. In general, the reference distribution in (4) may be any of the $k$ distributions, leaving the exponential distortion intact but with shifted parameters. When using the one-at-a-time method for different sample sizes, we chose the sample with the largest size as the reference, considering it to be the most trusted, and used its mean as $\theta$.