Article

A Method for Detecting Outliers from the Gamma Distribution

School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
*
Author to whom correspondence should be addressed.
Axioms 2023, 12(2), 107; https://doi.org/10.3390/axioms12020107
Submission received: 30 November 2022 / Revised: 11 January 2023 / Accepted: 14 January 2023 / Published: 19 January 2023
(This article belongs to the Special Issue Computational Statistics & Data Analysis)

Abstract

Outliers often occur during data collection, which could impact the result seriously and lead to a large inference error; therefore, it is important to detect outliers before data analysis. Gamma distribution is a popular distribution in statistics; this paper proposes a method for detecting multiple upper outliers from gamma ( m , θ ). For computing the critical value of the test statistic in our method, we derive the density function for the case of a single outlier and design two algorithms based on the Monte Carlo and the kernel density estimation for the case of multiple upper outliers. A simulation study shows that the test statistic proposed in this paper outperforms some common test statistics. Finally, we propose an improved testing method to reduce the impact of the swamping effect, which is demonstrated by real data analyses.

1. Introduction

The presence of outliers in the data may have an appreciable impact on the data analysis, which often leads to erroneous conclusions, and in turn results in severe decision-making mistakes. Therefore, it is necessary to detect outliers before statistical analysis. On the other hand, outlier detection has a wide range of applications in the prevention of financial fraud, disease diagnosis, and judgment of the truth of military information, etc.
Refs. [1,2] define outliers as those observations which are surprisingly far away from the main group. In a one dimensional situation, if the observations are arranged in an ascending order of magnitude, there will be only three types of outlier detection problems: (i) only upper outliers; (ii) only lower outliers; and (iii) both upper and lower outliers.
The commonly used methods of dealing with outliers include the detection of outliers and robust statistical methods. Robust methods aim to analyze the data while retaining the outliers and minimizing the deviation of the analytical results from the theoretical results. Outlier detection identifies outliers in the sample by using a reasonable statistical procedure and then analyzes the remaining observations. In this paper, we focus on the latter approach.
In the field of statistics, there are many results on the detection of outliers, and many effective methods have been proposed. These methods include descriptive statistics, machine learning, and hypothesis testing.
Descriptive statistics are intuitive and carry almost no computational burden. Commonly used methods include the box plot, the Hampel rule, etc. The box plot requires the 3/4 and 1/4 quantiles of the sample, $Q_3$ and $Q_1$. Denote the interquartile range by $IQR = Q_3 - Q_1$; then the observations located in the interval $[Q_1 - 1.5\,IQR,\ Q_3 + 1.5\,IQR]$ are regarded as clean observations, and all other observations are flagged as outliers. According to [3], a data point is identified as an outlier if the distance between it and the sample median exceeds 4.5 times the MAD, where $MAD(X) = \mathrm{med}\,|X - \mathrm{med}(X)|$.
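The two descriptive rules can be sketched in a few lines. This is a minimal illustration rather than a reference implementation: the function names are hypothetical, the quartiles use the default convention of Python's `statistics.quantiles` (other quantile conventions give slightly different fences), and the sample is the ten-point example used later in Section 4.

```python
import statistics


def boxplot_outliers(data):
    """Flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(data, n=4)  # sample quartiles
    iqr = q3 - q1
    lo, hi = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [x for x in data if x < lo or x > hi]


def hampel_outliers(data, c=4.5):
    """Flag points whose distance from the median exceeds c * MAD."""
    med = statistics.median(data)
    mad = statistics.median([abs(x - med) for x in data])
    return [x for x in data if abs(x - med) > c * mad]


sample = [0.30, 0.62, 0.72, 0.80, 1.13, 1.42, 1.45, 2.30, 14.86, 22.01]
print(boxplot_outliers(sample))
print(hampel_outliers(sample))
```

On this sample both rules flag the two largest values and leave 2.30 in the clean group.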
Machine learning methods train on the sample to detect outliers according to the data characteristics, combined with mathematical models and statistical principles. Some common methods include one-class support vector machines (one-class SVM), minimum spanning tree (MST), etc. One-class SVM usually trains a minimal ellipsoid that contains all normal observations from historical data or other clean data. Then, the observations that fall outside the ellipsoid are treated as outliers; see [4]. The MST algorithm defines the distance between points as the Euclidean distance, considers the points as nodes, and finds a path connecting each node with the smallest sum of distances. Then, based on the given criteria, the sample is divided into different classes. The largest set is treated as inlying data, while the rest is treated as outliers; see [5].
Hypothesis testing is a basic method for outlier detection. By setting appropriate null and alternative hypotheses and constructing test statistics with certain properties, the hypothesis testing method can detect whether there are outliers in the sample with the given significance level.
In the univariate setting, the gamma distribution is less restrictive than the exponential distribution, so gamma-distributed observations arise in a wider range of applications and are easier to collect. This paper studies multiple outlier detection under the gamma distribution with a slippage model in the parameter θ. Since the 1950s, there have been many results about outlier detection based on the hypothesis testing method, but most of them aim to detect a single outlier or outliers in a normal distribution. In the 1970s, outlier detection under more general distributions such as the exponential, Pareto, and uniform distributions received much attention. Multiple outlier detection has recently drawn considerable attention in practice owing to the development of science and technology and the diversification of data collection methods. We briefly introduce five commonly used statistics, which are suitable for detecting multiple upper outliers in the gamma distribution.
Dixon’s statistic proposed in [6] is based on the idea that the dispersion of the suspect observations accounts for a large proportion of the sample dispersion. This method is further extended in [7,8,9], where [8] proposes the following statistic
$$D_k = \frac{X_{(n)} - X_{(n-k)}}{X_{(n)} - X_{(1)}}. \tag{1}$$
With the given significance level $\alpha$, $X_{(n-k+1)}, \ldots, X_{(n)}$ are identified as outliers if $D_k > d_k(\alpha)$, where $d_k(\alpha)$ is the critical value of $D_k$. Later, another Dixon type statistic for detecting outliers in a gamma distribution is proposed in [10,11], and the statistic is
$$L_k = \frac{X_{(n)} - X_{(n-k)}}{X_{(n)}}. \tag{2}$$
Ref. [10] gives the critical value $l_k(\alpha)$ for the given significance level $\alpha$; $X_{(n-k+1)}, \ldots, X_{(n)}$ are regarded as outliers if $L_k < l_k(\alpha)$. The third test statistic is $N_k$ by [10,11]:
$$N_k = \frac{X_{(n-k)} - X_{(1)}}{\sum_{j=n-k+1}^{n} \left(X_{(j)} - X_{(1)}\right)}. \tag{3}$$
Ref. [10] also obtains the corresponding critical value $n_k(\alpha)$ for the given significance level $\alpha$. $X_{(n-k+1)}, \ldots, X_{(n)}$ are regarded as outliers if $N_k < n_k(\alpha)$. The fourth test statistic is a “gap-test” ([12]), which is given by
$$Z_k = \frac{X_{(n)} - X_{(n-k)}}{\sum_{j=1}^{n} X_{(j)}}. \tag{4}$$
Ref. [12] provides the critical value $z_k(\alpha)$ for the significance level $\alpha$, and $X_{(n-k+1)}, \ldots, X_{(n)}$ are identified as outliers if $Z_k > z_k(\alpha)$. The fifth test statistic is proposed in [13], which is given by
$$V_k = \frac{\sum_{j=1}^{k} \left(X_{(n-k+j)} - X_{(n-k)}\right)}{\sum_{j=2}^{n} \left(X_{(j)} - X_{(1)}\right)}. \tag{5}$$
Ref. [13] shows that the distribution of $V_k$ and the critical value $v_k(\alpha)$ can be obtained for the given significance level $\alpha$. Thus, $X_{(n-k+1)}, \ldots, X_{(n)}$ are regarded as outliers if $V_k > v_k(\alpha)$.
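To make the five competing statistics concrete, the sketch below evaluates them on a sorted sample. It assumes the forms spelled out in the comments, which are this edition's reading of the flattened formulas (the form of L_k in particular is uncertain); the function name and the dictionary keys are hypothetical, and the critical values from the cited references are not reproduced.

```python
def classical_statistics(data, k):
    """Return D_k, L_k, N_k, Z_k, V_k for the k largest observations."""
    x = sorted(data)
    n = len(x)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < n")
    x1, xn, xnk = x[0], x[-1], x[n - k - 1]  # X_(1), X_(n), X_(n-k)
    top = x[n - k:]                          # X_(n-k+1), ..., X_(n)
    return {
        # D_k = (X_(n) - X_(n-k)) / (X_(n) - X_(1))
        "D": (xn - xnk) / (xn - x1),
        # L_k = (X_(n) - X_(n-k)) / X_(n)   (assumed reading)
        "L": (xn - xnk) / xn,
        # N_k = (X_(n-k) - X_(1)) / sum_{j>n-k} (X_(j) - X_(1))
        "N": (xnk - x1) / sum(v - x1 for v in top),
        # Z_k = (X_(n) - X_(n-k)) / sum_j X_(j)
        "Z": (xn - xnk) / sum(x),
        # V_k = sum_{j<=k} (X_(n-k+j) - X_(n-k)) / sum_{j>=2} (X_(j) - X_(1))
        "V": sum(v - xnk for v in top) / sum(v - x1 for v in x[1:]),
    }


print(classical_statistics([1, 2, 3, 4, 10], k=1))
```

All five are ratios of differences or sums of order statistics, so each is unchanged when every observation is multiplied by the same positive constant.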
The remainder of this article is organized as follows. In Section 2, we propose a test statistic to detect outliers in a gamma sample, and the density function of the proposed test statistic is derived. In order to obtain the critical values, a Monte Carlo procedure and a kernel density estimation procedure are proposed. In Section 3, the simulation results demonstrate that the proposed T k test statistic is better than others. Furthermore, an improved T k method is suggested, which can eliminate the swamping effect in multiple outliers detection in Section 4. A real data analysis is performed in Section 5. Section 6 is the conclusion. All proofs of theoretical results are presented in Appendix A, and the data of empirical applications is contained in Appendix B.

2. Model Framework and Methodology for Detecting Outliers

In this section, we propose a testing method to detect upper outliers from a gamma distribution. Both single and multiple outliers are considered. We will derive the distribution of the test statistic T k for single upper outlier detection, and design two methods—the Monte Carlo method and the kernel density method—to calculate the critical value of T k for multiple outliers.

2.1. Model Framework

Assume the null distribution is the gamma distribution, gamma($m, \theta$), with the density function given by
$$f(x \mid m, \theta) = \frac{\theta^{m}}{\Gamma(m)}\, x^{m-1} e^{-\theta x}, \quad x > 0, \tag{6}$$
where $m$ and $\theta$ are unknown, $m, \theta > 0$. The null hypothesis is
$$H: X_1, \ldots, X_n \sim f(x \mid m, \theta).$$
Then, the density function in the alternative hypothesis is
$$f(x \mid m, \theta, \lambda) = \frac{(\lambda\theta)^{m}}{\Gamma(m)}\, x^{m-1} e^{-\lambda\theta x}, \quad x > 0,\ 0 < \lambda \le 1, \tag{7}$$
where $\lambda$ denotes the contaminant factor. The slippage alternative hypothesis is
$$\bar{H}: n - k \text{ observations} \sim f(x \mid m, \theta), \text{ and } k \text{ observations} \sim f(x \mid m, \theta, \lambda).$$
Sorting $X_1, \ldots, X_n$ from small to large, we obtain the sample $S = \{X_{(1)}, \ldots, X_{(n)}\}$, where $X_{(j)}$ corresponds to the $j$th observation in $S$. When $k = 1$, $X_{(n)}$ is the suspicious point, and we propose the test statistic $T_{(n)}$ to detect an outlier in $S$,
$$T_{(n)} = \frac{X_{(n)}}{\bar{X}}. \tag{8}$$
For a given significance level $\alpha$, let $t_1(\alpha)$ be the critical value; $X_{(n)}$ is detected as an outlier if $T_{(n)} > t_1(\alpha)$. When $k > 1$, we propose the following test statistic to detect multiple outliers,
$$T_k = \frac{\sum_{j=n-k+1}^{n} X_{(j)}}{\bar{X}}. \tag{9}$$
For a given significance level $\alpha$, if we let $t_k(\alpha)$ be the critical value, $X_{(n-k+1)}, \ldots, X_{(n)}$ are detected as outliers if $T_k > t_k(\alpha)$.
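The proposed statistic is simple to compute: it is the sum of the k largest observations divided by the sample mean. The helper below is a minimal sketch (the name `t_k` is chosen here for illustration); it also shows that rescaling the data leaves the statistic unchanged, which is why the scale parameter θ does not matter when critical values are simulated.

```python
import math


def t_k(data, k):
    """Sum of the k largest observations divided by the sample mean."""
    x = sorted(data)
    mean = sum(x) / len(x)
    return sum(x[len(x) - k:]) / mean


data = [1, 2, 3, 4, 10]
print(t_k(data, 1))  # single-outlier statistic X_(n) / mean
print(t_k(data, 2))  # block statistic for the two largest values

# Multiplying every observation by a constant does not change T_k.
print(math.isclose(t_k([7.5 * v for v in data], 2), t_k(data, 2)))
```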
Theorem 1. 
$T_k$ is a test statistic that is derived from the likelihood ratio principle.
Proof of Theorem 1.
See Appendix A.2. □

2.2. Detecting Single Outlier

T ( n ) can be used for testing a single upper outlier for the gamma sample. To obtain the critical value of the test, we derive the distribution of T ( n ) under the null model, as follows.
Denote $T_j = T_{n,j} = X_j / \bar{X}$ and $T_{(j)} = T_{n,(j)} = X_{(j)} / \bar{X}$. Note that $X_1, X_2, \ldots, X_n$ are independent, so $X_j / \sum_{i=1}^{n} X_i$ follows beta($m, (n-1)m$) under the null model. Let $a = m$ and $b = (n-1)m$; for any $j$, the density function of $X_j / \sum_{i=1}^{n} X_i$ is
$$\beta_{a,b}(u) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, u^{a-1} (1-u)^{b-1}, \quad 0 < u < 1. \tag{10}$$
As $T_j = T_{n,j} = X_j / \bar{X} = n X_j / \sum_{i=1}^{n} X_i$, the change of variable $v = nu$ gives the density function of $T_j$,
$$\beta_{a,b}(v) = \frac{\Gamma(a+b)}{\Gamma(a)\Gamma(b)}\, \frac{v^{a-1} (n-v)^{b-1}}{n^{a+b-1}}, \quad 0 < v < n. \tag{11}$$
It can be shown that
Lemma 1. 
Assume that $X_1, \ldots, X_{n-1}, X_n$ are independent and identically distributed from gamma($m, \theta$); then $\max_{k \neq n} X_k / \sum_{j=1}^{n-1} X_j$ and $X_n / \sum_{j=1}^{n-1} X_j$ are independent.
Proof of Lemma 1. 
See Appendix A.1. □
Theorem 2. 
If $X_1, X_2, \ldots, X_{n-1}, X_n$ are independent from gamma($m, \theta$), then the density function of $T = X_{(n)} / \bar{X}$ is
$$n\, \beta_{m,(n-1)m}(v)\, A_{n-1}\!\left[\frac{(n-1)v}{n-v}\right], \quad 1 < v < n, \tag{12}$$
where $\beta_{m,(n-1)m}(v)$ is the density in Equation (11) and $A_n(v)$ is the cumulative distribution function (CDF) of $T_{n,(n)}$.
Proof of Theorem 2. 
See Appendix A.2. □
The density function of $T_{(n)} = X_{(n)} / \bar{X}$ under the null model is defined recursively in $n$, and the critical value of $T_{(n)}$ can be obtained from Equation (12).
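As a quick sanity check on the recursion, consider the smallest case n = 2. This worked special case is added here for illustration and is not part of the original derivation. A sample of size one always has $T_{1,(1)} = X_1/X_1 = 1$, so $A_1(x) = 1$ for $x \ge 1$, and the argument $v/(2-v)$ exceeds 1 whenever $1 < v < 2$. Here $\beta_{m,m}(v)$ denotes the rescaled beta density of $T_j$ with $n = 2$.

```latex
\begin{align*}
f_{T_{2,(2)}}(v)
  &= 2\,\beta_{m,m}(v)\,A_1\!\left[\frac{v}{2-v}\right]
   = \frac{2\,\Gamma(2m)}{\Gamma(m)^{2}}\,
     \frac{v^{m-1}(2-v)^{m-1}}{2^{2m-1}},
  && 1 < v < 2, \\
\text{for } m = 1:\quad
f_{T_{2,(2)}}(v) &= 1, && 1 < v < 2, \\
\Pr\{T_{2,(2)} > t\} &= 2 - t = \alpha
  \;\Longrightarrow\; t_1(\alpha) = 2 - \alpha .
\end{align*}
```

The exponential case (m = 1) agrees with a direct argument: with two observations, $X_{(2)}/\bar{X} = 2\max(U, 1-U)$ for a uniform $U$, which is uniform on (1, 2).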

2.3. Detecting Multiple Outliers

T k with k > 1 can be used to detect outliers in the gamma sample if there exist multiple outliers. However, deriving the distribution of T k is a difficult task. In this case, to obtain the critical value of the test, we propose two methods, the Monte Carlo method and the kernel density estimation method.

2.3.1. Monte Carlo Method

First, note that the distribution of $X_{(j)} / \bar{X}$ does not depend on $\theta$ under the null model, so samples can be simulated with $\theta = 1$. Based on this property, the Monte Carlo method for computing the critical value of the $T_k$ test is given below.
Parameter $m$ can be estimated from the sample by the Newton–Raphson algorithm, or obtained from other samples, empirical methods, and so on. We consider the outliers from a slippage model in which the parameter $\theta$ has been shifted to $\lambda\theta$, with the parameter $m$ being fixed, where $0 < \lambda \le 1$ is the contamination factor.
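For the estimation of m, one possible Newton-Raphson scheme is sketched below. It assumes the standard profile-likelihood equation ln m - ψ(m) = ln(x̄) - mean(ln x), which is what the score equations in Appendix A.2 reduce to once θ = m/x̄ is substituted. The function names, the series-based `digamma` and `trigamma` helpers, and the closed-form starting value are illustrative choices, not the authors' implementation.

```python
import math
import random


def digamma(x):
    """psi(x) via upward recurrence and an asymptotic series (x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12 - f * (1 / 120 - f / 252))


def trigamma(x):
    """psi'(x) via upward recurrence and an asymptotic series (x > 0)."""
    r = 0.0
    while x < 6.0:
        r += 1.0 / (x * x)
        x += 1.0
    f = 1.0 / (x * x)
    return r + 1.0 / x + 0.5 * f + (f / x) * (1 / 6 - f * (1 / 30 - f / 42))


def estimate_shape(data, tol=1e-10, max_iter=100):
    """Solve ln(m) - psi(m) = ln(mean) - mean(ln x) by Newton-Raphson."""
    n = len(data)
    c = math.log(sum(data) / n) - sum(math.log(v) for v in data) / n
    # Common closed-form starting value for the gamma shape (assumption).
    m = (3 - c + math.sqrt((c - 3) ** 2 + 24 * c)) / (12 * c)
    for _ in range(max_iter):
        step = (math.log(m) - digamma(m) - c) / (1 / m - trigamma(m))
        m_new = m - step
        if m_new <= 0:          # keep the iterate inside the domain
            m_new = m / 2
        if abs(m_new - m) < tol:
            return m_new
        m = m_new
    return m


rng = random.Random(0)
sample = [rng.gammavariate(5.0, 2.0) for _ in range(20000)]
print(estimate_shape(sample))  # should be close to the true shape 5
```

Note that an estimate of m computed from a contaminated sample is itself affected by the outliers, which is one reason the text also allows m to come from other samples or prior knowledge.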
The idea of the Monte Carlo method is to generate $u$ samples of size $n$ and to compute $T_k$ from each sample. Denote by $S_M$ the set that consists of all the resulting values of $T_k$. Then, based on the law of large numbers, we use the $1 - \alpha$ quantile of $S_M$ as the estimate of $t_k(\alpha)$. The pseudocode of the Monte Carlo method is given by Algorithm 1.
Algorithm 1 Monte Carlo method
Input: Parameters
    n: sample size;
    k: number of suspicious observations;
    m: shape parameter of the gamma distribution;
    α: the significance level, say, α = 0.05;
    u: number of samples, say, u = 5000.
Output: $t_k(\alpha)$.
  for j in 1 : u do
       generate n observations from gamma(m, 1);
       $T_{k,j} = \sum_{i=n-k+1}^{n} X_{(j,i)} / \bar{X}_j$, where $X_{(j,i)}$ is the $i$th order statistic of the $j$th sample;
  end for;
  get $S_M = \{T_{k,1}, \ldots, T_{k,u}\}$;
  $t_k(\alpha) \leftarrow$ $(1 - \alpha)$ quantile of $S_M$.
Using the above Monte Carlo method to compute the critical values of the T k test statistic for different n, k, and  m = 5 , the results are summarized in Table 1.
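A direct transcription of the Monte Carlo procedure into Python might look as follows. It is a sketch under stated assumptions: the function names and the fixed seed are illustrative, the quantile is taken as a single order statistic of the simulated values (other quantile conventions differ slightly), and no particular table entry is reproduced.

```python
import math
import random


def t_k(sample, k):
    """Sum of the k largest observations divided by the sample mean."""
    x = sorted(sample)
    return sum(x[-k:]) / (sum(x) / len(x))


def mc_critical_value(n, k, m, alpha=0.05, u=5000, seed=0):
    """Estimate t_k(alpha) as the (1 - alpha) quantile of simulated T_k."""
    rng = random.Random(seed)
    # gammavariate(shape, scale): the scale is set to 1 because T_k is
    # scale invariant.
    sims = sorted(
        t_k([rng.gammavariate(m, 1.0) for _ in range(n)], k)
        for _ in range(u)
    )
    idx = min(u - 1, math.ceil((1 - alpha) * u) - 1)
    return sims[idx]


print(mc_critical_value(n=20, k=4, m=3))
```

Because the estimate is a sample quantile, it varies with the seed and with u; increasing u reduces that variation at the cost of run time.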

2.3.2. Kernel Density Estimation Method

This method uses a large sample of $T_k$ values to approximate its density function; the estimated density is denoted by $f(x)$. Then, with the significance level $\alpha$, we compute $t_k(\alpha)$ from
$$\int_{t_k(\alpha)}^{+\infty} f(x)\, dx = \alpha. \tag{13}$$
Using a Gaussian kernel function, we have
$$K\!\left(\frac{x - x_j}{h}\right) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x - x_j)^2}{2h^2}}, \tag{14}$$
where $x_j = T_{k,[j]}$ is the value of $T_k$ computed from the $j$th simulated sample and $h$ is the bandwidth. Therefore, the estimated density function of $T_k$ is
$$f(x) = \frac{1}{uh} \sum_{j=1}^{u} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x - x_j)^2}{2h^2}}. \tag{15}$$
The pseudocode of the kernel density estimation method is given by Algorithm 2.
Algorithm 2 Kernel density estimation method
Input: Parameters
    n: sample size;
    k: number of suspicious observations;
    m: shape parameter of the gamma distribution;
    α: the significance level, say, α = 0.05;
    u: number of samples, say, u = 5000.
Output: $t_k(\alpha)$.
   for j in 1 : u do
       generate n observations from gamma(m, 1);
       $T_{k,j} = \sum_{i=n-k+1}^{n} X_{(j,i)} / \bar{X}_j$;
   end for;
   get $S_M = \{T_{k,1}, \ldots, T_{k,u}\}$;
   compute the bandwidth $h$ of $S_M$;
   choose the Gaussian kernel function, $K\!\left(\frac{x - x_j}{h}\right) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(x - x_j)^2}{2h^2}}$, so that the estimated density function of $T_k$ is $f(x) = \frac{1}{uh} \sum_{j=1}^{u} \frac{1}{\sqrt{2\pi}} e^{-\frac{(x - x_j)^2}{2h^2}}$;
   $t_k(\alpha) \leftarrow$ root of $\int_{t_k(\alpha)}^{+\infty} f(x)\, dx - \alpha = 0$.
Table 2 includes critical values of the T k test statistic for different n and k with m = 5 and α = 0.05 , which are calculated by the kernel density algorithm.
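The kernel density procedure can be sketched without any plotting or numerical integration, because the upper tail of a Gaussian kernel estimate is an average of normal survival functions. The text does not say how the bandwidth is computed, so the rule-of-thumb bandwidth below is an assumption, as are the function names and the use of bisection for the root.

```python
import math
import random
import statistics


def simulate_t_k(n, k, m, u, seed=0):
    """Draw u null values of T_k from gamma(m, 1) samples of size n."""
    rng = random.Random(seed)
    out = []
    for _ in range(u):
        x = sorted(rng.gammavariate(m, 1.0) for _ in range(n))
        out.append(sum(x[-k:]) / (sum(x) / n))
    return out


def kde_critical_value(n, k, m, alpha=0.05, u=5000, seed=0):
    """Solve integral_t^inf f(x) dx = alpha for the Gaussian KDE f."""
    s = simulate_t_k(n, k, m, u, seed)
    # Assumed rule-of-thumb bandwidth; the source leaves this choice open.
    h = 1.06 * statistics.stdev(s) * u ** (-0.2)

    def upper_tail(t):
        # Each kernel contributes the normal survival function at t.
        return sum(0.5 * math.erfc((t - x) / (h * math.sqrt(2))) for x in s) / u

    lo, hi = min(s) - 10 * h, max(s) + 10 * h
    for _ in range(50):          # bisection: upper_tail is decreasing in t
        mid = 0.5 * (lo + hi)
        if upper_tail(mid) > alpha:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


print(kde_critical_value(n=20, k=4, m=3, u=2000))
```

Smoothing spreads a little mass beyond the largest simulated value, so the result is typically close to, and slightly above, the plain empirical quantile.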
After comparing a large number of simulation results of the Monte Carlo method and the kernel density estimation method, we find that the difference between the results of the two methods is very small. Therefore, either method can be used.
More generally, Algorithms 1 and 2 contribute two feasible methods to calculate the critical values of any test statistics for the given significance level, sample size n, and presupposed k.

3. Simulation Study

In this section, we evaluate, by a simulation study, the performance of the proposed test statistic T k and compare it with the commonly used methods including D k , L k , N k , Z k , and V k given in Section 1.

3.1. Simulation Setting

To evaluate the performance of a test statistic in outlier detection, we consider two cases, with and without outliers. For the former, a test statistic is evaluated by its power when there exist $k$ outliers in the gamma($m, \theta$) sample, where the power is estimated by the frequency with which the outliers are identified correctly; for the latter, a test statistic is evaluated by counting the number of times that inlying observations are misjudged as outliers, which is called “false alarm”. A test statistic is better if it has higher power and lower “false alarm”.
To use a simulation setting similar to that in [10,12,13], we transform $\lambda$ and $\theta$ in Equations (6) and (7) to $1/\lambda$ and $1/\theta$, respectively.
For computing the power, we generate $n$ observations from Equation (6) and sort these points from small to large. $X_{(n-k+1)}, \ldots, X_{(n)}$ are replaced by $\lambda X_{(n-k+1)}, \ldots, \lambda X_{(n)}$, which has the same effect as producing $k$ upper outliers from Equation (7). Here, $k = 2, 5$; $\lambda \in [1:2]$ (0.055); $n = 20$; and $m = 5$. To measure the “false alarm”, denote by $k_o$ the number of outliers in the $k$ largest observations. When $k = 2\ (5)$, we have $k_o = 1\ (2)$. We generate $n - k_o$ observations from Equation (6) and $k_o$ observations from Equation (7), and then test the largest $k$ observations by using the different test statistics. Both cases are run with significance levels $\alpha = 0.01$ and 0.05. Our simulation study is carried out based on 2000 replications.
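The power computation described above can be sketched as follows for the proposed statistic only. The sketch follows the transformed parameterization, in which λ ≥ 1 multiplies the k largest observations; the function names, seeds, and the Monte Carlo critical value are illustrative assumptions, and the other five statistics are omitted.

```python
import math
import random


def t_k(x_sorted, k):
    """T_k: sum of the k largest values over the sample mean."""
    return sum(x_sorted[-k:]) / (sum(x_sorted) / len(x_sorted))


def null_critical_value(n, k, m, alpha, u=5000, seed=0):
    """(1 - alpha) empirical quantile of T_k under the null model."""
    rng = random.Random(seed)
    sims = sorted(
        t_k(sorted(rng.gammavariate(m, 1.0) for _ in range(n)), k)
        for _ in range(u)
    )
    return sims[min(u - 1, math.ceil((1 - alpha) * u) - 1)]


def estimate_power(n, k, m, lam, alpha=0.05, reps=2000, seed=1):
    """Fraction of replications in which T_k exceeds its critical value."""
    crit = null_critical_value(n, k, m, alpha)
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        x = sorted(rng.gammavariate(m, 1.0) for _ in range(n))
        x[n - k:] = [lam * v for v in x[n - k:]]  # inflate the k largest
        hits += t_k(x, k) > crit
    return hits / reps


for lam in (1.0, 1.5, 2.0):
    print(lam, estimate_power(n=20, k=2, m=5, lam=lam))
```

At λ = 1 no contamination is introduced, so the estimated rejection rate should sit near the nominal level α; it rises as λ grows.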

3.2. Results

For the case of outliers existing, the simulation results on the power of six test statistics are shown in Figure 1 and Figure 2. It can be observed from Figure 1 that when m = 5 , k = 2 and α = 0.01 , our test statistic T k has a higher power than the other five test statistics for the values of λ smaller than 1.650; and for larger λ , T k is worse than N k and V k but better than Z k , D k , and L k . For α = 0.05 , T k is worse than N k and V k but better than Z k , D k , and L k . It is clear from Figure 2 that when m = 5 and k = 5 , T k has the highest outlier detection capability for α = 0.01 ; and if α = 0.05 , T k has the highest power for almost all the λ values.
For the case of the k largest observations consisting of contaminants and some good observations, the simulation results on the swamping effect of six test statistics are shown in Figure 3 and Figure 4. It can be observed from Figure 3 that for k o = 1 , with the significance level of 0.01, T k is better than Z k and D k but worse than N k , V k , and L k . For α = 0.05 , the “false alarm” of T k is worse than that of L k , but better than those of Z k , N k , D k , and V k . It is clear that the results of Figure 3 with k o = 1 and Figure 4 with k o = 2 are similar when α = 0.01 . For α = 0.05 , Figure 4 shows that T k is worse than N k and V k but better than Z k , D k , and L k .
In summary, the simulation results show that, for $\alpha = 0.05$ and $k = 5$, $T_k$ has the highest power and a lower “false alarm” than $Z_k$, $D_k$, and $L_k$. With $k = 5$ and $\alpha = 0.01$, $T_k$ has the highest power among the test statistics, but the “false alarm” of $T_k$ is worse than those of $N_k$, $V_k$, and $L_k$. Therefore, with large $m$ and $k$, $T_k$ is generally better than $Z_k$, $N_k$, $D_k$, $V_k$, and $L_k$ for multiple outlier detection.

4. Modified T k Test-ITK

In practice, almost all test statistics used to detect multiple outliers suffer from the swamping effect. This phenomenon happens because large outliers may cause the sum of multiple observations to be too large in the block test. To reduce or eliminate the impact of the swamping effect, we suggest a modified $T_k$ test, ITK, which retains a high probability of detecting outliers and a low error probability when there is no outlier in the gamma sample.
Note that, in multiple outlier detection, some inlying observations may be falsely judged as outliers because of an improper $k$. For example, consider a sample consisting of 0.30, 0.62, 0.72, 0.80, 1.13, 1.42, 1.45, 2.30, 14.86, and 22.01, and use $T_3$ to test $X_{(10)} = 22.01$, $X_{(9)} = 14.86$, $X_{(8)} = 2.30$. Here, $T_3 = (X_{(10)} + X_{(9)} + X_{(8)}) / \bar{X} = 8.08$, and the critical value of $T_3$, obtained by using Algorithm 1 in Section 2.3.1, is $t_3(0.05) = 6.43$. Therefore, $T_3 > t_3(0.05)$, and $X_{(10)}$, $X_{(9)}$, $X_{(8)}$ are declared outliers in the sample. However, $X_{(8)}$ is in fact a genuine observation from the inlying cluster. $X_{(8)}$ is detected as an outlier because $X_{(10)}$ and $X_{(9)}$ are so large compared with the inlying sample that the sum of $X_{(10)}$, $X_{(9)}$, $X_{(8)}$ exceeds the critical bound, i.e., the swamping effect. This negative impact is eliminated if we take $k = 2$.
To deal with the swamping effect, a method for choosing a reasonable $k$ should be given. Thus, our modified test includes two stages: (1) pick a reasonable $k$, and use the $T_k$ test to detect the $k$ upper observations; (2) use stepwise forward testing for the remaining observations or stepwise backward testing for the “outliers” sample from the first stage.

4.1. Estimation of k

From [14], the number of outliers should be less than $n/2$. Later, [1] suggested that the number of outliers is usually less than $\sqrt{n}$ if the sample is collected properly.
Here, we take
$$k = \hat{k} = [\sqrt{n}], \tag{16}$$
where $[\sqrt{n}]$ is the greatest integer less than or equal to $\sqrt{n}$.

4.2. The Improvement of the T k Test-ITK

Based on Section 4.1, we propose an improved T k test procedure, as follows:
Step 1. For the significance level $\alpha$, if $T_k > t_k(\alpha)$, then $X_{(n-k+1)}, \ldots, X_{(n)}$ are preliminarily judged as outliers and form the preliminary outlier sample; otherwise, go to Step 5. The remaining observations constitute the preliminary inlying group, $S'$.
Step 2 (step forward test). Use the step forward test to detect whether $S'$ includes any outliers. For $\alpha = 0.05$, compute $T_{[k],1} = X_{(n-k)} / \bar{X}_{[k]}$, where $\bar{X}_{[k]}$ is the mean of the observations that remain after the $k$ largest are removed. $X_{(n-k)}$ is an outlier if $T_{[k],1} > t_{[k],1}(\alpha)$; otherwise, go to Step 4.
Step 3. Repeat the test process in Step 2 until no outlier can be detected in $S'$. If $X_{(j)}$ is the smallest outlier in $S'$, then $X_{(j)}, \ldots, X_{(n)}$ are outliers in the data, and the procedure stops.
Step 4 (step backward test). After the step forward test has stopped, use the step backward test to check the preliminary outlier sample from Step 1. For the significance level $\alpha = 0.05$, if $T_{[k-1],1} = X_{(n-k+1)} / \bar{X}_{[k-1]} > t_{[k-1],1}(\alpha)$, then the step backward test ends and $X_{(n-k+1)}, \ldots, X_{(n)}$ are outliers; otherwise, use the step backward test for detecting $X_{(n-k+2)}$. Repeat this step until an outlier is detected. If $X_{(n)}$ is not judged as an outlier, then there is no outlier in the sample, and the sample is inlying data.
Step 5. Let $\hat{k}_{new} = [\hat{k}/2]$, and substitute $k = \hat{k} = \hat{k}_{new}$ into Step 1. If $\hat{k}_{new} = 0$, there is no outlier in the sample, and the test procedure ends.
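The five steps can be read as the following control flow. This is a sketch under explicit assumptions, not the authors' code: the single-point statistics are taken as the largest remaining value divided by the mean of the remaining values (the reading consistent with the numbers in Section 5), their critical values are taken to be single-outlier critical values at the reduced sample size, all critical values come from a small Monte Carlo run with a fixed seed, and the function names are hypothetical.

```python
import math
import random


def t_stat(x_sorted, k):
    """Sum of the k largest values divided by the sample mean."""
    return sum(x_sorted[-k:]) / (sum(x_sorted) / len(x_sorted))


def crit(n, k, m, alpha, u=2000, seed=0):
    """Monte Carlo estimate of t_k(alpha) for sample size n."""
    rng = random.Random(seed)
    sims = sorted(
        t_stat(sorted(rng.gammavariate(m, 1.0) for _ in range(n)), k)
        for _ in range(u)
    )
    return sims[min(u - 1, math.ceil((1 - alpha) * u) - 1)]


def itk(data, m, alpha=0.05):
    """Return the observations flagged as upper outliers."""
    x = sorted(data)
    n = len(x)
    k = math.isqrt(n)  # initial k = floor(sqrt(n))
    # Steps 1 and 5: block test, halving k until it rejects or k reaches 0.
    while k > 0 and t_stat(x, k) <= crit(n, k, m, alpha):
        k //= 2
    if k == 0:
        return []
    # Steps 2-3: step forward through the preliminary inlying group.
    j = k  # number of largest points currently set aside
    while n - j >= 2 and t_stat(x[:n - j], 1) > crit(n - j, 1, m, alpha):
        j += 1
    if j > k:
        return x[n - j:]
    # Step 4: step backward through the preliminary outliers.
    for j in range(k - 1, -1, -1):
        if t_stat(x[:n - j], 1) > crit(n - j, 1, m, alpha):
            return x[n - j - 1:]
    return []


inliers = [0.6, 0.7, 0.75, 0.8, 0.85, 0.9, 0.9, 0.95, 1.0,
           1.0, 1.05, 1.1, 1.1, 1.15, 1.2, 1.25, 1.3, 1.4]
print(itk(inliers + [50.0, 50.0], m=5))
```

In this toy sample the block test with k = 4 rejects because of the two extreme values, and the backward pass then releases the two inlying points that were swept into the block, which is the swamping correction the procedure is designed for.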

5. Empirical Applications

In this section, we apply the ITK test method to two data sets: Alcohol-related mortality rates and artificial scout position data, and compare it with the other six test statistics of T k , D k , N k , Z k , V k , and L k .

5.1. Alcohol-Related Mortality Rates in Selected Countries in 2000

The dataset (see Appendix B) is selected from Office for National Statistics (ONS). The Kolmogorov-Smirnov test indicates that this data follows the gamma distribution.
Here, $n = 100$ and so $\hat{k} = 10$. We obtain $m = 1.2$ by using the Newton–Raphson algorithm. From Appendix B, it is observed that $T_{10} = \sum_{j=91}^{100} X_{(j)} / \bar{X} = 117.65/2.47 = 47.70$. Further, we compute the critical value of $T_{10}$ by using Algorithm 1 in Section 2.3.1, and obtain $t_{10}(0.05) = 21.99$. Since $T_{10} > t_{10}(0.05)$, $X_{(91)}, X_{(92)}, \ldots, X_{(100)}$ are preliminarily detected as outliers.
Then, we use the step forward test for the remaining sample. Here, $T_{(90)} = X_{(90)} / \bar{X}_{[10]} = 6.17/1.43 = 4.31 < t_{[10],1}(0.05) = 6.45$. Thus, $X_{(90)} = 6.17$ is a normal observation.
We now use the step backward test for $X_{(91)}, X_{(92)}, \ldots, X_{(100)}$. We have $X_{(91)} = 10.17$ and $T_{[9],1} = X_{(91)} / \bar{X}_{[9]} = 10.17/1.53 = 6.65$, and at the 5% significance level, $t_{[9],1}(0.05) = 6.55$. As $T_{[9],1} > t_{[9],1}(0.05)$, $X_{(91)}, X_{(92)}, \ldots, X_{(100)}$ are detected as upper outliers.
On the other hand, we utilize the T k , D k , N k , Z k , V k , and L k test statistics to detect outliers, and the results are shown in Table 3.
As we can observe from Table 3, ITK, T k , N k , and V k can identify outliers correctly without misjudgment. This phenomenon happens because k is chosen reasonably. We can also observe that D k , Z k , and L k have bad performance in multiple upper outlier detection.
Furthermore, the result from Table 3 shows that Ireland, France, Austria, Slovenia, Portugal, Denmark, the United Kingdom of Great Britain and Northern Ireland, the Republic of Korea, the Russian Federation, and Australia have higher alcohol-related mortality rates, which means that these countries need to pay more attention to alcohol-related mortality.

5.2. Artificial Scout Position Data

In military information applications, the gamma model is often used to describe the position of objects. Consider a military scenario in which 20 scouts reconnoiter a certain area during a mission. Their location components are characterized by $X_j$, $j = 1, \ldots, 20$, and the larger $X_j$ is, the further the scout is from the landing site. If $X_j$ deviates from the main group, this indicates that the $j$th soldier is separated from the troops and may not be able to obtain support in time in case of an emergency. Therefore, it is necessary to pay attention to this movement.
In our setting, the basic model is gamma (3,5) and the alternative model is gamma (3,10). The initial data are outlined in Appendix B.
Here, $m = 3$ is known. The sample size is 20; thus, $\hat{k} = [\sqrt{n}] = [\sqrt{20}] = 4$. From Appendix B, it is observed that $T_4 = \sum_{j=17}^{20} X_{(j)} / \bar{X} = (0.91 + 2.90 + 3.32 + 3.44)/0.97 = 10.89$. With the significance level of 0.05, we utilize the Monte Carlo method to calculate the critical value for $T_4$, and we obtain $t_4(0.05) = 8.71$. As $T_4 > t_4(0.05)$, $X_{(17)}$, $X_{(18)}$, $X_{(19)}$, and $X_{(20)}$ are placed into the initial outlier group.
Furthermore, we continue to test the remaining sample and carry out the step forward test for $X_{(16)} = 0.88$. Noting that $t_{[4],1}(0.05) = 3.09 > T_{[4],1} = X_{(16)} / \bar{X}_{[4]} = 0.88/0.55 = 1.59$, $X_{(16)} = 0.88$ is not an outlier.
Next, we use the step backward test for $X_{(17)} = 0.91$, $X_{(18)} = 2.90$, $X_{(19)} = 3.32$, $X_{(20)} = 3.44$. Here, $T_{[3],1} = X_{(17)} / \bar{X}_{[3]} = 0.91/0.57 = 1.60$, and with the significance level of 0.05, $t_{[3],1}(0.05) = 3.14$. Since $T_{[3],1} < t_{[3],1}(0.05)$, $X_{(17)} = 0.91$ is not an outlier. Moreover, $T_{[2],1} = X_{(18)} / \bar{X}_{[2]} = 2.90/0.70 = 4.13 > t_{[2],1}(0.05) = 3.18$, so the test procedure ends. Therefore, $X_{(18)} = 2.90$, $X_{(19)} = 3.32$, and $X_{(20)} = 3.44$ are outliers in the sample.
Meanwhile, we utilize the T k , D k , N k , Z k , V k , and L k test statistics to detect outliers, and the results are shown in Table 4.
It can be observed from Table 4 that the ITK method performs better than the other six methods (the $T_k$, $D_k$, $N_k$, $Z_k$, $V_k$, and $L_k$ test statistics) because it not only detects all outliers in the sample, but also has the lowest probability of misjudgment.
Further, from the result of the ITK method, we can obtain information that the IDs 18, 19, and 20 seem to be far away from the landing site. This means that they would be endangered in case of an emergency.

6. Concluding Remarks

It can be observed from the simulation that with the increase in k and n values, compared with other test statistics, our test statistic T k has a higher power and relatively lower “false alarm” on outlier detection, especially for a lower significance level. However, the swamping effect still exists for T k , and this phenomenon will cause the loss of information. Therefore, to reduce the impact of swamping effect, we design the ITK test. From the outlier detection results of the two real data analyses, the ITK test has the same high power as the T k test statistic and lower error probabilities than the other six test statistics ( T k , Z k , N k , D k , V k , and L k ). In conclusion, compared with other test statistics, ITK has the highest detection capability for outliers and the lowest “false alarm”. Thus, the ITK method is recommended to be used to identify multiple outliers in a sample.
In this paper, we design two algorithms based on the Monte Carlo and the kernel density estimation to obtain the critical values of $T_k$. How to derive the exact critical value of $T_k$ is an interesting problem. Further, in the case of $k$ being unknown, we take a conservative estimation of $k = \hat{k} = [\sqrt{n}]$. Thus, it is worth studying the problem of choosing a more appropriate value of $k$ in our ITK method. This article discusses only the case of multiple upper outliers existing in a gamma sample. Noting that lower outliers or both upper and lower outliers may exist in practice, it is necessary to extend our outlier detection methods to these situations. In addition, the masking effect with our methods is not discussed in this paper, which remains our future research. How to extend our approaches to other distributions is also an important topic.

Author Contributions

Conceptualization, X.L., T.W. and G.Z.; methodology, X.L.; software, X.L.; validation, X.L., T.W. and G.Z.; formal analysis, X.L. and T.W.; writing—original draft preparation, X.L.; writing—review and editing, X.L., T.W. and G.Z.; visualization, X.L. and T.W.; supervision, G.Z.; project administration, G.Z.; funding acquisition, G.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Beijing Natural Science Foundation (Grant No. Z210003).

Data Availability Statement

Open source. Data presented in the article can be obtained by visiting https://www.ons.gov.uk/ (accessed on 29 November 2022).

Acknowledgments

The authors are grateful to the anonymous referees for helpful comments and suggestions that greatly improved this paper.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Appendix A.1. Lemma 1 and Its Proof

Proof of Lemma 1. 
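For any $n > k$ ($k \neq 0$), we have $\max_{k \neq n} \frac{X_k}{\sum_{j=1}^{n-1} X_j} = \max\left(\frac{X_1}{\sum_{j=1}^{n-1} X_j}, \ldots, \frac{X_{n-1}}{\sum_{j=1}^{n-1} X_j}\right)$. Thus, $\max_{k \neq n} \frac{X_k}{\sum_{j=1}^{n-1} X_j}$ and $\frac{X_n}{\sum_{j=1}^{n-1} X_j}$ are independent if $\left(\frac{X_1}{\sum_{j=1}^{n-1} X_j}, \ldots, \frac{X_{n-1}}{\sum_{j=1}^{n-1} X_j}\right)$ and $\frac{X_n}{\sum_{j=1}^{n-1} X_j}$ are independent. Note that $\frac{X_{n-1}}{\sum_{j=1}^{n-1} X_j} = 1 - \left(\frac{X_1}{\sum_{j=1}^{n-1} X_j} + \cdots + \frac{X_{n-2}}{\sum_{j=1}^{n-1} X_j}\right)$, so $\max_{k \neq n} \frac{X_k}{\sum_{j=1}^{n-1} X_j}$ and $\frac{X_n}{\sum_{j=1}^{n-1} X_j}$ are independent if $\left(\frac{X_1}{\sum_{j=1}^{n-1} X_j}, \ldots, \frac{X_{n-2}}{\sum_{j=1}^{n-1} X_j}\right)$ is independent of $\frac{X_n}{\sum_{j=1}^{n-1} X_j}$. Observe that $X_1, \ldots, X_n$ are independent and from gamma($m, \theta$), so the joint density of $(X_1, \ldots, X_n)$ is
$$f(x_1, \ldots, x_n) = \frac{1}{\Gamma^{n}(m)}\, \theta^{nm}\, e^{-\theta \sum_{j=1}^{n} x_j} \prod_{j=1}^{n} x_j^{m-1}. \tag{A1}$$
Similar to [15], let $V_1 = \frac{X_1}{\sum_{j=1}^{n-1} X_j}, \ldots, V_{n-2} = \frac{X_{n-2}}{\sum_{j=1}^{n-1} X_j}$, $V_{n-1} = \frac{X_n}{\sum_{j=1}^{n-1} X_j}$; then the joint density of $(V_1, \ldots, V_{n-2}, V_{n-1})$ is
$$f(v_1, \ldots, v_{n-2}, v_{n-1}) = \frac{\Gamma(nm)}{\Gamma^{n}(m)}\, \theta^{nm} \left[v_1 \cdots v_{n-2} \left(1 - (v_1 + \cdots + v_{n-2})\right)\right]^{m-1} v_{n-1}^{m-1} \times \left(\theta (1 + v_{n-1})\right)^{-nm}, \quad 0 < v_1, \ldots, v_{n-2} < 1,\ 0 < \sum_{j=1}^{n-2} v_j < 1,\ v_{n-1} > 0. \tag{A2}$$
It can be observed that the marginal densities of $(V_1, \ldots, V_{n-2})$ and $V_{n-1}$ are given by
$$f_1(v_1, \ldots, v_{n-2}) = \frac{\Gamma((n-1)m)}{\Gamma^{n-1}(m)} \left[(v_1 \cdots v_{n-2}) \left(1 - (v_1 + \cdots + v_{n-2})\right)\right]^{m-1}, \quad 0 < v_1, \ldots, v_{n-2} < 1,\ 0 < \sum_{j=1}^{n-2} v_j < 1, \tag{A3}$$
and
$$f_2(v_{n-1}) = \frac{\Gamma(nm)}{\Gamma((n-1)m)\,\Gamma(m)}\, \theta^{nm}\, v_{n-1}^{m-1} \left(\theta (1 + v_{n-1})\right)^{-nm}, \quad v_{n-1} > 0, \tag{A4}$$
respectively. Clearly, $f(v_1, \ldots, v_{n-2}, v_{n-1}) = f_1(v_1, \ldots, v_{n-2})\, f_2(v_{n-1})$, so $\left(\frac{X_1}{\sum_{j=1}^{n-1} X_j}, \ldots, \frac{X_{n-2}}{\sum_{j=1}^{n-1} X_j}\right)$ is independent of $\frac{X_n}{\sum_{j=1}^{n-1} X_j}$. Therefore, Lemma 1 is proved. □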

Appendix A.2. Proofs of Theorems

Proof of Theorem 1. 
Consider the null distribution defined by Equation (6) and the alternative model defined by Equation (7). The proof of Theorem 1 extends that in [1], which treats single-outlier detection in the exponential distribution. Suppose there are $n$ observations, denoted by $X_1, X_2, \dots, X_n$; in particular, $X_j$ ($j = n-k+1, \dots, n$) is an observation from the subsample consisting of the $k$ largest points. Therefore, the alternative hypothesis is
$$X_1, X_2, \dots, X_{n-k} \sim f(x \mid m, \theta);$$
$$X_{n-k+1}, X_{n-k+2}, \dots, X_n \sim f(x \mid m, \theta, \lambda).$$
Denoting $T_k = \frac{\sum_{j=n-k+1}^{n} X_{(j)}}{\bar{X}}$ and $T = \frac{\sum_{j=n-k+1}^{n} X_j}{\bar{X}}$, we first prove that the test statistic $T$ is an MLR test statistic. Noting that under $H$, $\{X_1, \dots, X_n\}$ is a random sample from (6), the likelihood function is
$$L_H(m, \theta \mid x) = \prod_{j=1}^{n} f(m, \theta \mid x_j) = \prod_{j=1}^{n} \frac{\theta^m}{\Gamma(m)} \cdot x_j^{m-1} \cdot e^{-\theta x_j} = \frac{\theta^{nm}}{\Gamma^n(m)} \cdot \left( \prod_{j=1}^{n} x_j^{m-1} \right) \cdot e^{-n\theta\bar{x}}.$$
Denote the associated log likelihood function as
$$\ln L_H(m, \theta \mid x) = mn \ln\theta - n \ln\Gamma(m) + (m-1) \sum_{j=1}^{n} \ln x_j - n\theta\bar{x},$$
and let
$$\frac{\partial \ln L_H(m, \theta \mid x)}{\partial m} = -n \frac{\Gamma'(m)}{\Gamma(m)} + n \ln\theta + \sum_{j=1}^{n} \ln x_j = 0, \qquad \frac{\partial \ln L_H(m, \theta \mid x)}{\partial \theta} = \frac{nm}{\theta} - n\bar{x} = 0;$$
then we obtain the maximum likelihood estimates of $m$ and $\theta$, denoted by $\hat{m}$ and $\hat{\theta}$, i.e.,
$$\hat{\theta} = \frac{m}{\bar{x}},$$
and $\hat{m}$ satisfies $-n \frac{\Gamma'(m)}{\Gamma(m)} + n \ln m - n \ln\bar{x} + \sum_{j=1}^{n} \ln x_j = 0$; there is no closed-form solution for $\hat{m}$. The numerical value of $\hat{m}$ can be obtained by the Newton-Raphson algorithm or from extra-sample information. Therefore, if $m$ is known and we substitute $\hat{\theta} = \frac{m}{\bar{x}}$ into $\ln L_H(m, \theta \mid x)$, then
$$\ln L_H(m, \hat{\theta} \mid x) = -n \ln\Gamma(m) - nm \ln\bar{x} + nm \ln m + (m-1) \sum_{j=1}^{n} \ln x_j - nm.$$
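The Newton-Raphson step mentioned above for $\hat{m}$ can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the score equation is rewritten as $\ln m - \psi(m) = \ln\bar{x} - \frac{1}{n}\sum_j \ln x_j$ with $\psi = \Gamma'/\Gamma$; the helpers `digamma`, `trigamma`, and `gamma_mle` are hypothetical names; the digamma and trigamma functions are approximated by a recurrence plus a truncated asymptotic series; and the starting value $1/(2c)$ is a choice made here for convenience. The parameter $\theta$ is treated as a rate, as in the likelihood above.

```python
import math
import random

def digamma(x):
    """Approximate psi(x) = Gamma'(x)/Gamma(x) for x > 0."""
    result = 0.0
    while x < 6.0:                 # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    # truncated asymptotic expansion, accurate for x >= 6
    return result + math.log(x) - 0.5 / x - inv2 * (1/12 - inv2 * (1/120 - inv2 / 252))

def trigamma(x):
    """Approximate psi'(x) for x > 0."""
    result = 0.0
    while x < 6.0:                 # shift upward using psi'(x) = psi'(x + 1) + 1/x^2
        result += 1.0 / (x * x)
        x += 1.0
    inv2 = 1.0 / (x * x)
    return result + 1.0 / x + 0.5 * inv2 + (1.0 / x) * inv2 * (1/6 - inv2 * (1/30 - inv2 / 42))

def gamma_mle(sample, tol=1e-10, max_iter=100):
    """Return (m_hat, theta_hat) for a gamma sample, with theta a rate parameter."""
    n = len(sample)
    mean = sum(sample) / n
    # c = ln(mean) - mean(ln x) is positive unless all observations are equal
    c = math.log(mean) - sum(math.log(v) for v in sample) / n
    m = 1.0 / (2.0 * c)            # assumed starting value from ln m - psi(m) ~ 1/(2m)
    for _ in range(max_iter):
        g = math.log(m) - digamma(m) - c      # score equation divided by n
        dg = 1.0 / m - trigamma(m)            # derivative of g with respect to m
        step = g / dg
        m -= step
        if abs(step) < tol * m:
            break
    return m, m / mean             # theta_hat = m_hat / x_bar
```

For example, `gamma_mle([random.gammavariate(5.0, 0.5) for _ in range(5000)])` should return estimates near shape 5 and rate 2, since `random.gammavariate` takes a scale argument equal to the reciprocal of the rate.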
Similarly, under the alternative hypothesis $\bar{H}$, we have
$$\begin{aligned} \ln L_{\bar{H}}(m, \hat{\theta}, \hat{\lambda} \mid x) ={} & -n \ln\Gamma(m) + nm \ln(n-k) + nm \ln m - nm \ln\left( n\bar{x} - \sum_{j=n-k+1}^{n} x_j \right) - nm \\ & + km \ln k - km \ln(n-k) + km \ln\left( \frac{n\bar{x} - \sum_{j=n-k+1}^{n} x_j}{\sum_{j=n-k+1}^{n} x_j} \right) + (m-1) \sum_{j=1}^{n} \ln x_j. \end{aligned}$$
Therefore, reject $H$ if $\frac{L_{\bar{H}}(m, \hat{\theta}, \hat{\lambda} \mid x)}{L_H(m, \hat{\theta} \mid x)} \geq c$, i.e., $X_{n-k+1}, \dots, X_n$ are outliers if $\ln L_{\bar{H}}(m, \hat{\theta}, \hat{\lambda} \mid x) - \ln L_H(m, \hat{\theta} \mid x) \geq \ln c$. Thus, we consider
$$\ln L_{\bar{H}}(m, \hat{\theta}, \hat{\lambda} \mid x) - \ln L_H(m, \hat{\theta} \mid x) = nm \ln\frac{n-k}{n-T} + km \ln\left( \frac{k}{n-k} \left( \frac{n-T}{T} \right) \right), \quad T > k, \tag{A10}$$
where $T = \frac{\sum_{j=n-k+1}^{n} X_j}{\bar{X}}$. The derivative of Equation (A10) with respect to $T$ is
$$f(T) = \frac{nm}{n-T} - \frac{nkm}{(n-T)\,T}.$$
It is clear that $f(T) > 0$ for $T > k$, so $\ln L_{\bar{H}}(m, \hat{\theta}, \hat{\lambda} \mid x) - \ln L_H(m, \hat{\theta} \mid x)$ is monotonically increasing in $T$. Thus, $T = \frac{\sum_{j=n-k+1}^{n} X_j}{\bar{X}}$ is an MLR test statistic; see [1,16,17].
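The positivity of $f(T)$ follows from putting its two terms over a common denominator. The worked step below assumes $k < T < n$; the upper bound holds because the $k$ selected observations sum to less than $n\bar{X}$ whenever the remaining observations are positive.

```latex
f(T) = \frac{nm}{n-T} - \frac{nkm}{(n-T)\,T}
     = \frac{nm\,T - nkm}{(n-T)\,T}
     = \frac{nm\,(T-k)}{(n-T)\,T} > 0
\qquad \text{for } k < T < n .
```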
In practice, it is difficult to ensure that $X_j$ is not only the $j$th observation but also one of the $k$ largest observations. Therefore, to extend the test statistic $T$ to the ordered-sample setting, a multiple decision procedure is used here. The null hypothesis remains unchanged, and the $i$th alternative hypothesis is
$$\bar{H}_i : X_{i,1}, \dots, X_{i,n-k} \text{ from } f(x \mid m, \theta); \quad X_{i,n-k+1}, \dots, X_{i,n} \text{ from } f(x \mid m, \theta, \lambda).$$
The number of such alternative hypotheses is $\binom{n}{k}$, and
$$\ln L_{\bar{H}_i}(m, \hat{\theta}, \hat{\lambda} \mid x) - \ln L_H(m, \hat{\theta} \mid x) = nm \ln\frac{n-k}{n-T_i} + km \ln\left( \frac{k}{n-k} \left( \frac{n-T_i}{T_i} \right) \right), \quad T_i > k,$$
where $T_i = \frac{\sum_{j=n-k+1}^{n} X_{i,j}}{\bar{X}_i}$ with $\bar{X}_i = \frac{\sum_{j=1}^{n} X_{i,j}}{n}$. Subject to a given probability of correctly adopting the null hypothesis $H$, the decision criterion is to maximize the power of adopting the correct $\bar{H}_i$. In the present gamma model, the multiple decision procedure leads to adopting $\bar{H}_i$ if $T_i$ is the maximum and is sufficiently large. Because all observations are one-dimensional, outliers only exist at the upper end, and so the appropriate test statistic is
$$T_k = \frac{\sum_{j=n-k+1}^{n} X_{(j)}}{\bar{X}}.$$
Theorem 1 is proved. □
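The last step of the proof, that maximizing $T_i$ over all $\binom{n}{k}$ candidate outlier sets gives the statistic based on the $k$ largest order statistics, can be illustrated by brute force on a small sample. The sketch below is an illustration only; the function names are hypothetical and the enumeration is feasible only for small $n$.

```python
from itertools import combinations

def t_statistic(sample, outlier_idx):
    """T_i: sum of the observations labelled as outliers divided by the sample mean."""
    mean = sum(sample) / len(sample)
    return sum(sample[i] for i in outlier_idx) / mean

def t_k(sample, k):
    """T_k: sum of the k largest observations divided by the sample mean."""
    if not 1 <= k < len(sample):
        raise ValueError("k must satisfy 1 <= k < n")
    mean = sum(sample) / len(sample)
    return sum(sorted(sample)[-k:]) / mean

def max_over_subsets(sample, k):
    """Maximum of T_i over every choice of k indices as the outlier set."""
    indices = range(len(sample))
    return max(t_statistic(sample, idx) for idx in combinations(indices, k))
```

For the first eight positions of Table A2, `[0.88, 2.90, 0.21, 0.47, 3.44, 0.48, 0.83, 3.32]`, and `k = 3`, `max_over_subsets` and `t_k` agree up to floating-point rounding, because the mean is the same for every labelling and the numerator is largest when the three largest values are selected.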
Proof of Theorem 2. 
Similar to [18], denote $a_n(v)$ and $A_n(v)$ as the density function and the cumulative distribution function (CDF) of $T_{n,(n)}$, respectively, and we have
$$a_n(v) = \lim_{dv \to 0} \frac{P(T_{n,(n)} \in (v, v+dv))}{dv}.$$
Denote $\Omega = \bigcup_{j=1}^{n} \{T_{n,(n)} = T_{n,j}\}$, so
$$P(T_{n,(n)} \in (v, v+dv)) = P\left( \{T_{n,(n)} \in (v, v+dv)\} \cap \left\{ \bigcup_{j=1}^{n} \{T_{n,(n)} = T_{n,j}\} \right\} \right) = P\left( \bigcup_{j=1}^{n} \left( T_{n,j} \in (v, v+dv),\ T_{n,(n)} = T_{n,j} \right) \right).$$
Note that $\{T_{n,j} \in (v, v+dv),\ T_{n,(n)} = T_{n,j}\}$ is incompatible with $\{T_{n,i} \in (v, v+dv),\ T_{n,(n)} = T_{n,i}\}$ for any $i \neq j$; thus, by the additivity of probability measures,
$$\begin{aligned} P\left( \bigcup_{j=1}^{n} \left( T_{n,j} \in (v, v+dv),\ T_{n,(n)} = T_{n,j} \right) \right) &= \sum_{j=1}^{n} P\left( T_{n,j} \in (v, v+dv),\ T_{n,(n)} = T_{n,j} \right) \\ &= n P\left( T_{n,j} \in (v, v+dv),\ T_{n,(n)} = T_{n,j} \right) \\ &= n P\left( T_{n,n} \in (v, v+dv),\ T_{n,(n)} = T_{n,n} \right) \\ &= n P\left( T_{n,n} \in (v, v+dv),\ \max_{k \neq n} X_k < X_n \right) \\ &= n P\left( T_{n,n} \in (v, v+dv),\ \frac{\max_{k \neq n} X_k}{\bar{X}_{[n]}} < \frac{X_n}{\bar{X}_{[n]}} \right), \end{aligned} \tag{A17}$$
where $\bar{X}_{[n]} = \frac{1}{n-1} \sum_{j=1}^{n-1} X_j$. Note that
$$T_{n,n} = \frac{X_n}{\bar{X}} = \frac{n(n-1) \frac{X_n}{\sum_{j=1}^{n} X_j - X_n}}{(n-1) \left( 1 + \frac{X_n}{\sum_{j=1}^{n} X_j - X_n} \right)} = \frac{n \frac{X_n}{\bar{X}_{[n]}}}{(n-1) + \frac{X_n}{\bar{X}_{[n]}}}.$$
Since $X_1, \dots, X_n$ are independent and from gamma$(m, \theta)$, $\frac{X_k}{\sum_{j=1}^{n-1} X_j}$ follows beta$(m, (n-2)m)$. Therefore,
$$\begin{aligned} \text{(A17)} &= n P\left( T_{n,n} \in (v, v+dv),\ T_{n-1,(n-1)} < \frac{X_n}{\bar{X}_{[n]}} \right) \\ &= n P\left( T_{n,n} \in (v, v+dv),\ T_{n-1,(n-1)} < \frac{(n-1)\,T_{n,n}}{n - T_{n,n}} \right) \\ &= n P\left( T_{n,n} \in (v, v+dv) \right) P\left( T_{n-1,(n-1)} < \frac{(n-1)\,T_{n,n}}{n - T_{n,n}} \,\middle|\, T_{n,n} \in (v, v+dv) \right). \end{aligned}$$
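The second equality above replaces $\frac{X_n}{\bar{X}_{[n]}}$ by a function of $T_{n,n}$. A short worked step makes the inversion explicit; the shorthand $R = \frac{X_n}{\bar{X}_{[n]}}$ is introduced here only for illustration.

```latex
T_{n,n} = \frac{nR}{(n-1) + R}
\;\Longrightarrow\; (n-1)\,T_{n,n} + R\,T_{n,n} = nR
\;\Longrightarrow\; R = \frac{(n-1)\,T_{n,n}}{n - T_{n,n}},
\qquad 0 < T_{n,n} < n .
```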
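Because $T_{n-1,n} = \frac{(n-1) X_n}{\sum_{j=1}^{n-1} X_j}$ and $T_{n-1,(n-1)} = \frac{(n-1) \max_{k \neq n} X_k}{\sum_{j=1}^{n-1} X_j}$ are independent, we obtain
$$\begin{aligned} a_n(v) &= \lim_{dv \to 0} \left\{ \frac{n P\left( T_{n,n} \in (v, v+dv) \right)}{dv}\, P\left( T_{n-1,(n-1)} < \frac{(n-1)\,T_{n,n}}{n - T_{n,n}} \,\middle|\, T_{n,n} \in (v, v+dv) \right) \right\} \\ &= n\, \beta_{m,(n-1)m}(v)\, A_{n-1}\left[ \frac{(n-1)\,v}{n - v} \right], \quad 1 < v < n. \end{aligned}$$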
Theorem 2 is proved. □

Appendix B

The appendix lists the alcohol-related mortality rates in selected countries in 2000 and artificial scout position data.
Table A1. Alcohol-related mortality rates in selected countries in 2000.

| Country | Mortality |
|---|---|
| Afghanistan | 0.01 |
| Algeria | 0.25 |
| Angola | 1.85 |
| Armenia | 2.90 |
| Australia | 10.17 |
| Austria | 13.2 |
| Azerbaijan | 0.65 |
| Bahrain | 2.15 |
| Bangladesh | 0.01 |
| Benin | 1.34 |
| Bhutan | 0.17 |
| Bolivia (Plurinational State of) | 2.32 |
| Brunei Darussalam | 0.37 |
| Cambodia | 1.51 |
| Central African Republic | 1.51 |
| Chad | 0.25 |
| Colombia | 4.66 |
| Comoros | 0.09 |
| Congo | 2.26 |
| Democratic Republic of the Congo | 1.98 |
| Denmark | 11.69 |
| Djibouti | 1.34 |
| Egypt | 0.14 |
| El Salvador | 2.79 |
| Eritrea | 0.83 |
| Estonia | 0.01 |
| Ethiopia | 0.88 |
| Fiji | 2.05 |
| France | 13.63 |
| Gambia | 2.18 |
| Ghana | 1.60 |
| Guatemala | 2.63 |
| Guinea | 0.17 |
| Guinea-Bissau | 2.84 |
| Honduras | 2.61 |
| Iceland | 6.17 |
| India | 0.93 |
| Indonesia | 0.06 |
| Iran | 0.01 |
| Iraq | 0.20 |
| Ireland | 14.07 |
| Israel | 2.53 |
| Jordan | 0.49 |
| Kenya | 1.51 |
| Kiribati | 0.46 |
| Kuwait | 0.01 |
| Kyrgyzstan | 2.13 |
| Lebanon | 2.26 |
| Libya | 0.01 |
| Madagascar | 1.16 |
| Malawi | 1.18 |
| Malaysia | 0.54 |
| Maldives | 1.83 |
| Mali | 0.47 |
| Mauritania | 0.03 |
| Mexico | 4.99 |
| Micronesia (Federated States of) | 2.23 |
| Mongolia | 2.79 |
| Montenegro | 0.01 |
| Morocco | 0.45 |
| Mozambique | 1.14 |
| Myanmar | 0.35 |
| Nepal | 0.08 |
| Niger | 0.1 |
| Oman | 0.38 |
| Pakistan | 0.02 |
| Papua New Guinea | 0.73 |
| Portugal | 11.89 |
| Qatar | 0.5 |
| Republic of Korea | 10.33 |
| Russian Federation | 10.18 |
| Samoa | 3 |
| Saudi Arabia | 0.05 |
| Senegal | 0.29 |
| Singapore | 2.03 |
| Slovenia | 11.9 |
| Solomon Islands | 0.71 |
| Somalia | 0.01 |
| Sri Lanka | 1.45 |
| Sudan | 1.76 |
| Syrian Arab Republic | 1.41 |
| Tajikistan | 0.37 |
| The former Yugoslav Republic of Macedonia | 2.86 |
| Timor-Leste | 0.5 |
| Togo | 1.1 |
| Tonga | 1.24 |
| Tunisia | 1.21 |
| Turkey | 1.54 |
| Turkmenistan | 2.9 |
| United Arab Emirates | 1.64 |
| United Kingdom of Great Britain and Northern Ireland | 10.59 |
| Uzbekistan | 1.6 |
| Vanuatu | 1.21 |
| Viet Nam | 1.6 |
| Yemen | 0.07 |
| Zambia | 2.62 |
| Zimbabwe | 1.68 |
Table A2. Artificial scout position data.

| Soldier’s ID | Position |
|---|---|
| 1 | 0.88 |
| 2 | 2.90 |
| 3 | 0.21 |
| 4 | 0.47 |
| 5 | 3.44 |
| 6 | 0.48 |
| 7 | 0.83 |
| 8 | 3.32 |
| 9 | 0.58 |
| 10 | 0.35 |
| 11 | 0.31 |
| 12 | 0.53 |
| 13 | 0.91 |
| 14 | 0.65 |
| 15 | 0.70 |
| 16 | 0.80 |
| 17 | 0.52 |
| 18 | 0.13 |
| 19 | 0.55 |
| 20 | 0.85 |

References

  1. Barnett, V.; Lewis, T. Outliers in Statistical Data, 3rd ed.; Wiley and Sons: Chichester, UK, 1994; pp. 1–76.
  2. Hawkins, D.M. Identification of Outliers; Springer: Dordrecht, The Netherlands, 1980; pp. 1–67.
  3. Hampel, F.R.; Ronchetti, E.M.; Rousseeuw, P.; Stahel, W.A. Robust Statistics: The Approach Based on Influence Functions; Wiley-Interscience: New York, NY, USA, 1986; pp. 1–67.
  4. Smola, A.J.; Schölkopf, B. Learning with Kernels; GMD-Forschungszentrum Informationstechnik: Berlin, Germany, 1998; pp. 27–42.
  5. Sebert, D.M.; Montgomery, D.C.; Rollier, D.A. A clustering algorithm for identifying multiple outliers in linear regression. Comput. Stat. Data Anal. 1998, 27, 461–484.
  6. Dixon, W.J. Ratios involving extreme values. Ann. Math. Stat. 1951, 22, 68–78.
  7. Likeš, J. Distribution of Dixon’s statistics in the case of an exponential population. Metrika 1967, 11, 46–54.
  8. Singh, A.K.; Lalitha, S. Detection of upper outliers in gamma sample. J. Stat. Appl. Probab. Lett. 2018, 5, 53–62.
  9. Singh, A.K.; Singh, A.; Patawa, R. Multiple upper outlier detection procedure in generalized exponential sample. Eur. J. Stat. 2021, 1, 58–73.
  10. Nooghabi, M.J.; Nooghabi, H.J.; Nasiri, P. Detecting outliers in gamma distribution. Commun. Stat. Theory Methods 2010, 39, 698–706.
  11. Zerbet, A.; Nikulin, M. A new statistic for detecting outliers in exponential case. Commun. Stat. Theory Methods 2003, 32, 573–583.
  12. Lalitha, S.; Kumar, N. Multiple outlier test for upper outliers in an exponential sample. J. Appl. Stat. 2012, 39, 1323–1330.
  13. Kumar, N.; Lalitha, S. Testing for upper outliers in gamma sample. Commun. Stat. Theory Methods 2012, 41, 820–828.
  14. Tietjen, G.L.; Moore, R.H. Some Grubbs-type statistics for the detection of several outliers. Technometrics 1972, 14, 583–597.
  15. Mathai, A.M.; Moschopoulos, P.G. A form of multivariate gamma distribution. Ann. Inst. Stat. Math. 1992, 44, 97–106.
  16. Neyman, J.; Pearson, E.S. On the use and interpretation of certain test criteria for purposes of statistical inference: Part II. Biometrika 1928, 20A, 263–294.
  17. Domański, P.D. Study on statistical outlier detection and labelling. Int. J. Autom. Comput. 2020, 17, 788–811.
  18. Lewis, T.; Fieller, N.R.J. A recursive algorithm for null distributions for outliers: I. gamma samples. Technometrics 1979, 21, 371–376.
Figure 1. Power of test statistics for $m = 5$, $k = 2$, and $n = 20$.
Figure 2. Power of test statistics for $m = 5$, $k = 5$, and $n = 20$.
Figure 3. False alarm of statistics for $m = 5$, $k = 2$, $k_o = 1$, and $n = 20$.
Figure 4. False alarm of statistics for $m = 5$, $k = 5$, $k_o = 2$, and $n = 20$.
Table 1. The critical values of $T_k$ in the case of $m = 5$ and significance level $\alpha = 0.05$.

| k \ n | 100 | 120 | 150 | 200 |
|---|---|---|---|---|
| 10 | 20.85 | 21.43 | 22.15 | 23.08 |
| 20 | 35.78 | 37.10 | 38.65 | 40.79 |
| 30 | 48.49 | 50.54 | 53.06 | 56.19 |
| 40 | 59.50 | 62.49 | 66.06 | 70.36 |
| 50 | 69.29 | 73.23 | 77.91 | 80.50 |
Table 2. The critical values of $T_k$ in the case of $m = 5$ and significance level $\alpha = 0.05$.

| k \ n | 100 | 120 | 150 | 200 |
|---|---|---|---|---|
| 10 | 20.81 | 21.44 | 22.12 | 23.09 |
| 20 | 35.79 | 37.12 | 38.79 | 40.81 |
| 30 | 48.53 | 50.61 | 53.14 | 56.28 |
| 40 | 59.52 | 62.57 | 66.18 | 70.44 |
| 50 | 69.29 | 73.26 | 77.98 | 80.51 |
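Critical values of this kind can be approximated by simulating $T_k$ under the null hypothesis and taking an empirical upper quantile. The sketch below is a generic Monte Carlo illustration and not the authors' algorithm: the function name `mc_critical_value`, the number of replications, and the quantile rule are assumptions made here. Because $T_k$ is a ratio, it does not depend on the scale of the data, so the samples are drawn with unit scale.

```python
import random

def mc_critical_value(n, k, m, alpha=0.05, reps=2000, seed=0):
    """Approximate the upper-alpha critical value of T_k under gamma(m, .) by simulation."""
    rng = random.Random(seed)
    stats = []
    for _ in range(reps):
        # T_k is scale-invariant, so a unit scale is used without loss of generality
        x = [rng.gammavariate(m, 1.0) for _ in range(n)]
        mean = sum(x) / n
        stats.append(sum(sorted(x)[-k:]) / mean)   # sum of the k largest over the mean
    stats.sort()
    # empirical (1 - alpha) quantile of the simulated null distribution
    index = min(reps - 1, int((1.0 - alpha) * reps))
    return stats[index]
```

A call such as `mc_critical_value(n=100, k=10, m=5.0)` gives an estimate that can be compared with the corresponding tabulated entry; the estimate varies with the seed and the number of replications.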
Table 3. The outlier detection results of alcohol-related mortality rates by using various tests.

| Test Statistic | Number of Identified Observations | Number of Potentially Misjudged Observations |
|---|---|---|
| ITK | 10 | 0 |
| $T_k$ | 10 | 0 |
| $D_k$ | 0 | 0 |
| $N_k$ | 10 | 0 |
| $Z_k$ | 0 | 0 |
| $V_k$ | 10 | 0 |
| $L_k$ | 0 | 0 |
Table 4. The outlier detection results of artificial scout position data.

| Test Statistic | Number of Identified Observations | Number of Misjudged Observations |
|---|---|---|
| ITK | 3 | 0 |
| $T_k$ | 4 | 1 |
| $D_k$ | 4 | 1 |
| $N_k$ | 4 | 1 |
| $Z_k$ | 4 | 1 |
| $V_k$ | 4 | 1 |
| $L_k$ | 0 | 0 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Liao, X.; Wang, T.; Zou, G. A Method for Detecting Outliers from the Gamma Distribution. Axioms 2023, 12, 107. https://doi.org/10.3390/axioms12020107

AMA Style

Liao X, Wang T, Zou G. A Method for Detecting Outliers from the Gamma Distribution. Axioms. 2023; 12(2):107. https://doi.org/10.3390/axioms12020107

Chicago/Turabian Style

Liao, Xiou, Tongtong Wang, and Guohua Zou. 2023. "A Method for Detecting Outliers from the Gamma Distribution" Axioms 12, no. 2: 107. https://doi.org/10.3390/axioms12020107
