# A Test Detecting the Outliers for Continuous Distributions Based on the Cumulative Distribution Function of the Data Being Tested

^{1}

^{2}

## Abstract

**:**

## 1. Introduction

_{G}) when stating that the most departed values from the mean (min(x), max(x) or both) are not outliers (see Table 1). The associated probabilities of the observed (p

_{G}) are obtained from the Student t distribution [13].

## 2. Materials and Methods

_{1}, …, p

_{τ}are probabilities from τ independent tests, CDFχ

^{2}is the χ

^{2}cumulative distribution function (see also up until Equation (6) below), and p

_{FCS}is the combined probability from independent tests.

_{1}, …, x

_{n}) as n independent draws (or observations) from a (assumed known) continuous distribution defined by its probability density function, PDF (x; (π

_{j})

_{1}

_{≤j}

_{≤m}) where (π

_{j})

_{1}

_{≤j}

_{≤m}are the (assumed unknown) m statistical parameters of the distribution, by way of integration for a (assumed known) domain (D) of the distribution, we may have access to the associated cumulative density function (CDF) CDF(x; (π

_{j})

_{1}

_{≤j}

_{≤m}; PDF), simply expressed as (Equation (5)):

_{j})

_{1}

_{≤j}

_{≤m}), CDF(x; (π

_{j})

_{1}

_{≤j}

_{≤m}) is a bijective function and therefore, it is always invertible (even if we do not have an explicit formula; let “InvCDF” be its inverse, Equation (6)):

_{j})

_{1}

_{≤j}

_{≤m}), then x = InvCDF(p; (π

_{j})

_{1}

_{≤j}

_{≤m}), and vice-versa

_{j})

_{1}

_{≤j}

_{≤m}; “PDF”) is a strong tool that greatly simplifies the problem at hand: the problems of analyzing any distribution function (PDF) are translated such that only one needs to be analyzed (the continuous uniform distribution). That is, a series of observed data (x

_{i})

_{1≤i≤n}is expressed through their associated probabilities p

_{i}= CDF(x

_{i}; (π

_{j})

_{1}

_{≤j}

_{≤m}) (for 1≤i≤n) and the analysis can be conducted on the (p

_{i})

_{1≤i≤n}series instead.

_{i})

_{1≤i≤n}series of probabilities is a native case of order statistics, the discussion now turns to order statistics. The first studies in this area were by the fathers of modern statistics, Karl Pearson [20] and Ronald A. Fisher [3] while the first order statistic applicable to any distribution (not only the normal distribution) was first studied by Cramér and Von Mises (see [21,22]

_{i})

_{1≤i≤n}) will sort the values (let (q

_{i})

_{1≤i≤n}be the series of sorted (p

_{i})

_{1≤i≤n}values, Equation (7)) and will assess its departure from the continuous uniform distribution (where it is assumed that SORT is a procedure that sorts ascending the values).

^{2}[29], and the H1 Statistic [18]; see Equation (8). They remain in use today.

_{Statistic}and later tests the value (from a sample) against the threshold of a chosen significance level (usually 5%).

## 3. Proposed Outlier Detection Statistic

_{i})

_{1}

_{≤i}

_{≤n}) or (sorted probabilities, (q

_{i})

_{1}

_{≤i}

_{≤n}) associated with a series of (repeated drawing) observations ((x

_{i})

_{1}

_{≤i}

_{≤n}), the (r

_{i})

_{1}

_{≤i}

_{≤n}differences are calculated as Equation (9):

_{i})

_{1}

_{≤i}

_{≤n}) or not (then it is made on (p

_{i})

_{1}

_{≤i}

_{≤n}).

_{“min”}, G

_{“max”}, G

_{“all”}) and Equation (9) are compared, for a standard normal distribution N(x; μ=0,σ=1) the equation defining G

_{“all”}becomes much more like Equation (9), with the difference being that in Equation (2) the sample mean ($\overline{x}$) is used as an estimate for the mean of the population (μ) and the sample standard deviation (s) is used as an estimate for the standard deviation of the population (σ) while Equation (9) basically expresses the same in terms of associated probabilities (p

_{i}= P(X ≤ x

_{i}) = CDF

_{“Normal”}(x

_{i}; μ,σ), 0.5 = P(X ≤ μ) = CDF

_{“Normal”}(μ; μ,σ)).

_{“all”}value ($\overline{x}$ and s), thereby reducing the degrees of freedom associated with the value (from n to n-2) while for the g1 value (and statistic) the degrees of freedom remain unchanged (n). The major difference is actually the one that makes the proposed statistic generalizable to any distribution—the mean used in the Grubbs test is replaced by the median—the beauty of this change is that for symmetrical distributions (including a Normal distribution) these two coincide.

_{i})

_{1≤i≤n}or (q

_{i})

_{1≤i≤n}) instead of the observed space ((x

_{i})

_{1≤i≤n}), the calculation formula (Equations (9) and (10)) is slightly different (to those given in Equations (1) and (2)), and the probability associated with the departure will no longer be extracted from the Student t distribution (as in Equations (1) and (2)). The change from mean (μ for G

_{“all”}) to median (0.5 in Equation (9)) is a safe extension for any distribution type, since Equation (9) measures (or accounts for) the extreme departures from the equiprobable point—having an observation y (y ← X) with y ≤ InvCDF

_{“Any distribution”}(0.5; “parameters”) and an observation z (z ← X) with z ≥ InvCDF

_{“Any distribution”}(0.5; “parameters”) is equiprobable.

## 4. Simulation Study

^{9}in Table 2, repetitions were joined (n, p, g1) as pairs from the p·n control points, that is, where the probability was from 0.001 to 0.999 with a step of 0.001 for each n (from 2 to 12). The external repetitions (resa = 7 in Table 2) were joined together by taking the median (since the median is a sufficiency statistic [31] for any order statistic such as in the extraction of (n, p, g1) pairs from the p·n control points). The MC simulation was conducted with the configuration set as defined in Table 2. The obtained data were recorded in separate files by sample size and analyzed as such.

## 5. The Analytical Formula of CDF for g1

## 6. Simulation Results for the Distribution of the “g1” Statistic

^{−5}, never bigger than 1.5·× 10

^{−5}and tends to become smaller with the increase in sample size (n). Using Equation (11), Figure 2 depicts the shape of the CDF

_{“g1”}(x;n).

^{−1}·2

^{−1/n}(and having a left asymmetry decreasing with the increasing of n and converging (for n → ∞) to symmetry) and mean of 1/2(n+1).

_{“g1”}is easily inverted (see Equation (13)).

## 7. from “g1” Statistic to “g1” Confidence Intervals for the Extreme Values

_{“g1”}(1-α;n) = $\sqrt[n]{1-\alpha}/2$). By placing this value into Equations (9) and (10), the (extreme) probabilities can be extracted (Equation (14)).

_{“GL”}(9.603; μ = 6.47938, σ = 0.82828, k = 1.79106) = 0.9998). Due to the force of this deviation from the median, 9.603 was suspected as being an outlier and was removed (it should be noted that in a broader context, an outlier can be also seen as an atypical observation, correctly collected from the population observation, as part of the data generation process and thus it may be maintained in the sample but probably with a less weight). The same procedure (as in Table 4) can be applied to the remaining data (205 observations). Then, InvCDF

_{“g1”}(1-0.05; 205) = 0.499875, p

_{min}(n=205) = 0.0001251; and p

_{max}(n=205) = 0.9998749. The MLE estimates for the parameters of the Gauss-Laplace distribution remain unchanged (μ = 6.47938, σ = 0.82828, k = 1.79106) and the removed observation (9.603) is still not an outlier (x

_{max}= InvCDF

_{“GL”}(0.9998749; μ = 6.47938, σ = 0.82828, k = 1.79106) = 9.7166 > 9.603).

## 8. Proposed Procedure for Detecting the Outliers

## 9. Second Simulation Assessing “Grubbs” and “g1” Outlier Detection Alternatives

_{1}, …, p

_{10}} in Figure 3) associated with the series of the observations from the sample ({x

_{1}, …, x

_{10}} in Figure 3).

_{FCS}= 7% > 5% = α, see Figure 3).

## 10. Going Further with the Outlier Analysis

_{FCS}in Table 6 is 7%, while in Table 7 it is 16%) and there is no change in the accuracy of the classification ({563, 543} comparable with 500, {2341, 2333} is much greater than 500; the existing method produces type I errors by leading to false positive detection of outliers in the samples, while the proposed method does not). When comparing the results given in Table 6 with the results given in Table 7 it should be noted that both tests (Grubbs and the newly proposed g1) produce somewhat confusing results (see Table 8 for side-by-side outcomes).

## 11. Further Discussion

^{2}[29], the WU

_{Statistic}in Equation (8), uses it).

## 12. Conclusions

## Supplementary Materials

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Gauss, C.F. Theoria Motus Corporum Coelestium; (Translated in 1857 as “Theory of Motion of the Heavenly Bodies Moving about the Sun in Conic Sections” by C. H. Davis. Little, Brown: Boston. Reprinted in 1963 by Dover: New York); Perthes et Besser: Hamburg, Germany, 1809; pp. 249–259. [Google Scholar]
- Tippett, L.H.C. The extreme individuals and the range of samples taken from a normal population. Biometrika
**1925**, 17, 151–164. [Google Scholar] [CrossRef] - Fisher, R.A.; Tippett, L.H.C. Limiting forms of the frequency distribution of the largest and smallest member of a sample. Proc. Camb. Philos. Soc.
**1928**, 24, 180–190. [Google Scholar] [CrossRef] - Thompson, W.R. On a criterion for the rejection of observations and the distribution of the ratio of the deviation to the sample standard deviation. Ann. Math. Stat.
**1935**, 6, 214–219. [Google Scholar] [CrossRef] - Pearson, E.; Sekar, C.C. The efficiency of the statistical tools and a criterion for the rejection of outlying observations. Biometrika
**1936**, 28, 308–320. [Google Scholar] [CrossRef] - Grubbs, F.E. Sample criteria for testing outlying observations. Ann. Math. Stat.
**1950**, 21, 27–58. [Google Scholar] [CrossRef] - Grubbs, F.E. Procedures for detecting outlying observations in samples. Technometrics
**1969**, 11, 1–21. [Google Scholar] [CrossRef] - Nooghabi, M.; Nooghabi, H.; Nasiri, P. Detecting outliers in gamma distribution. Commun. Stat. Theory Methods
**2010**, 39, 698–706. [Google Scholar] [CrossRef] - Kumar, N.; Lalitha, S. Testing for upper outliers in gamma sample. Commun. Stat. Theory Methods
**2012**, 41, 820–828. [Google Scholar] [CrossRef] - Lucini, M.; Frery, A. Comments on “Detecting Outliers in Gamma Distribution” by M. Jabbari Nooghabi et al. (2010). Commun. Stat. Theory Methods
**2017**, 46, 5223–5227. [Google Scholar] [CrossRef] - Hartley, H. The range in random samples. Biometrika
**1942**, 32, 334–348. [Google Scholar] [CrossRef] - Bardet, J.-M.; Dimby, S.-F. A new non-parametric detector of univariate outliers for distributions with unbounded support. Extremes
**2017**, 20, 751–775. [Google Scholar] [CrossRef][Green Version] - Gosset, W. The probable error of a mean. Biometrika
**1908**, 6, 1–25. [Google Scholar] - Jäntschi, L.; Bolboacă, S.-D. Computation of probability associated with Anderson-Darling statistic. Mathematics
**2018**, 6, 88. [Google Scholar] [CrossRef] - Fisher, R. On an Absolute Criterion for Fitting Frequency Curves. Messenger Math.
**1912**, 41, 155–160. [Google Scholar] - Fisher, R. Questions and answers #14. Am. Stat.
**1948**, 2, 30–31. [Google Scholar] - Bolboacă, S.D.; Jäntschi, L.; Sestraș, A.F.; Sestraș, R.E.; Pamfil, D.C. Supplementary material of ’Pearson-Fisher chi-square statistic revisited’. Information
**2011**, 2, 528–545. [Google Scholar] [CrossRef] - Jäntschi, L.; Bolboacă, S.D. Performances of Shannon’s Entropy Statistic in Assessment of Distribution of Data. Ovidius Univ. Ann. Chem.
**2017**, 28, 30–42. [Google Scholar] [CrossRef] - Davis, P.; Rabinowitz, P. Methods of Numerical Integration; Academic Press: New York, NY, USA, 1975; pp. 51–198. [Google Scholar]
- Pearson, K. Note on Francis Gallon’s problem. Biometrika
**1902**, 1, 390–399. [Google Scholar] - Cramér, H. On the composition of elementary errors. Scand. Actuar. J.
**1928**, 1, 13–74. [Google Scholar] [CrossRef] - Von Mises, R.E. Wahrscheinlichkeit, Statistik und Wahrheit; Julius Springer: Berlin, Germany, 1928; pp. 100–138. [Google Scholar]
- Kolmogorov, A. Sulla determinazione empirica di una legge di distribuzione. Giornale dell’Istituto Italiano degli Attuari
**1933**, 4, 83–91. [Google Scholar] - Kolmogorov, A. Confidence Limits for an Unknown Distribution Function. Ann. Math. Stat.
**1941**, 12, 461–463. [Google Scholar] [CrossRef] - Smirnov, N. Table for estimating the goodness of fit of empirical distributions. Ann. Math. Stat.
**1948**, 19, 279–281. [Google Scholar] [CrossRef] - Anderson, T.W.; Darling, D.A. Asymptotic theory of certain “goodness-of-fit” criteria based on stochastic processes. Ann. Math. Stat.
**1952**, 23, 193–212. [Google Scholar] [CrossRef] - Anderson, T.W.; Darling, D.A. A Test of Goodness-of-Fit. J. Am. Stat. Assoc.
**1954**, 49, 765–769. [Google Scholar] [CrossRef] - Kuiper, N.H. Tests concerning random points on a circle. Proc. Koninklijke Nederlandse Akademie van Wetenschappen Series A
**1960**, 63, 38–47. [Google Scholar] [CrossRef][Green Version] - Watson, G.S. Goodness-Of-Fit Tests on a Circle. Biometrika
**1961**, 48, 109–114. [Google Scholar] [CrossRef] - Metropolis, N.; Ulam, S. The Monte Carlo Method. J. Am. Stat. Assoc.
**1949**, 44, 335–341. [Google Scholar] [CrossRef] [PubMed] - Fisher, R.A. On the mathematical foundations of theoretical statistics. Philos. Trans. R. Soc. A
**1922**, 222, 309–368. [Google Scholar] [CrossRef][Green Version] - Jäntschi, L. Distribution fitting 1. Parameters estimation under assumption of agreement between observation and model. Bull. UASVM Hortic.
**2009**, 66, 684–690. [Google Scholar] - Jäntschi, L.; Bolboacă, S.D. Distribution fitting 2. Pearson-Fisher, Kolmogorov-Smirnov, Anderson-Darling, Wilks-Shapiro, Kramer-von-Misses and Jarque-Bera statistics. Bull. UASVM Hortic.
**2009**, 66, 691–697. [Google Scholar] - Bolboacă, S.D.; Jäntschi, L. Distribution fitting 3. Analysis under normality assumption. Bull. UASVM Hortic.
**2009**, 66, 698–705. [Google Scholar] - Liu, K.; Chen, Y.Q.; Domański, P.D.; Zhang, X. A novel method for control performance assessment with fractional order signal processing and its application to semiconductor manufacturing. Algorithms
**2018**, 11, 90. [Google Scholar] [CrossRef] - Paiva, J.S.; Ribeiro, R.S.R.; Cunha, J.P.S.; Rosa, C.C.; Jorge, P.A.S. Single particle differentiation through 2D optical fiber trapping and back-scattered signal statistical analysis: An exploratory approach. Sensors
**2018**, 18, 710. [Google Scholar] [CrossRef] [PubMed] - Teunissen, P.J.G.; Imparato, D.; Tiberius, C.C.J.M. Does RAIM with correct exclusion produce unbiased positions? Sensors
**2017**, 17, 1508. [Google Scholar] [CrossRef] [PubMed] - Pan, Z.; Liu, L.; Qiu, X.; Lei, B. Fast vessel detection in Gaofen-3 SAR images with ultrafine strip-map mode. Sensors
**2017**, 17, 1578. [Google Scholar] [CrossRef] [PubMed] - Vergura, S.; Carpentieri, M. Statistics to detect low-intensity anomalies in PV systems. Energies
**2018**, 11, 30. [Google Scholar] [CrossRef] - Chen, L.; He, J.; Sazzed, S.; Walker, R. An investigation of atomic structures derived from X-ray crystallography and cryo-electron microscopy using distal blocks of side-chains. Molecules
**2018**, 23, 610. [Google Scholar] [CrossRef] [PubMed] - Bolboacă, S.D.; Jäntschi, L. The effect of leverage and influential on structure-activity relationships. Comb. Chem. High Throughput Screen.
**2013**, 16, 288–297. [Google Scholar] [CrossRef] [PubMed] - Faes, L.; Porta, A.; Nollo, G.; Javorka, M. Information decomposition in multivariate systems: Definitions, implementation and application to cardiovascular networks. Entropy
**2017**, 19, 5. [Google Scholar] [CrossRef] - Li, G.; Wang, J.; Liang, J.; Yue, C. Application of sliding nest window control chart in data stream anomaly detection. Symmetry
**2018**, 10, 113. [Google Scholar] [CrossRef] - Paolella, M.S. Stable-GARCH models for financial returns: Fast estimation and tests for stability. Econometrics
**2016**, 4, 25. [Google Scholar] [CrossRef]

**Figure 1.**Departures between expected and observed probabilities for g1 statistic (Equation (10) vs. Equation (11)).

Sample statistic (G) | Associated probability (p_{G} = 1-α_{G}) | Equation |
---|---|---|

${G}_{\u201cmin\u201d}=\frac{\overline{x}-\mathrm{min}(x)}{s}$ | ${\alpha}_{G}=n\xb7{\mathrm{CDF}}_{\u201c\mathrm{Student}\text{}\mathrm{t}\u201d}(-\sqrt{\frac{n(n-2)}{{(\frac{n-1}{G})}^{2}-n}},n-2)$ | (1) |

${G}_{\u201cmax\u201d}=\frac{\mathrm{max}(x)-\overline{x}}{s}$ | ||

${G}_{\u201call\u201d}=\mathrm{max}({G}_{\u201cmin\u201d},{G}_{\u201cmax\u201d})$ | ${\alpha}_{G}=2n\xb7{\mathrm{CDF}}_{\u201c\mathrm{Student}\text{}\mathrm{t}\u201d}(-\sqrt{\frac{n(n-2)}{{(\frac{n-1}{G})}^{2}-n}},n-2)$ | (2) |

Parameter | Meaning | Setting |
---|---|---|

n | sample size of the observed | from 2 to 12 |

m | sample size of the MC simulation | 10^{8} |

p | control points for the probability | 999 |

resa | internal resamples (repetitions) | 10 |

repe | external repetitions | 7 |

**Table 3.**Descriptive statistics for the agreement in the calculation of the “g1” statistic (Equation (10) vs. Equation (11)).

n | SE | $\mathbf{min}({\mathit{p}}_{\mathit{i}}-{\widehat{\mathit{p}}}_{\mathit{i}})$ | $\mathbf{max}({\mathit{p}}_{\mathit{i}}-{\widehat{\mathit{p}}}_{\mathit{i}})$ | ||
---|---|---|---|---|---|

2 | 2.9 × 10^{−6} | −7.9 × 10^{−6} | at p = 0.694 | 5.7 × 10^{−6} | at p = 0.427 |

3 | 5.6 × 10^{−6} | −1.2 × 10^{−5} | at p = 0.787 | 1.6 × 10^{−6} | at p = 0.118 |

4 | 2.2 × 10^{−6} | −5.6 × 10^{−6} | at p = 0.234 | 3.7 × 10^{−6} | at p = 0.613 |

5 | 6.0 × 10^{−6} | −1.2 × 10^{−5} | at p = 0.546 | 2.3 × 10^{−6} | at p = 0.080 |

6 | 3.5 × 10^{−6} | −5.8 × 10^{−6} | at p = 0.797 | 9.2 × 10^{−6} | at p = 0.196 |

7 | 5.0 × 10^{−6} | −9.6 × 10^{−6} | at p = 0.777 | 3.8 × 10^{−6} | at p = 0.035 |

8 | 4.2 × 10^{−6} | −8.4 × 10^{−6} | at p = 0.675 | 3.9 × 10^{−6} | at p = 0.948 |

9 | 3.3 × 10^{−6} | −9.1 × 10^{−6} | at p = 0.269 | 7.9 × 10^{−6} | at p = 0.689 |

10 | 2.8 × 10^{−6} | −6.4 × 10^{−6} | at p = 0.443 | 6.6 × 10^{−6} | at p = 0.652 |

**Table 4.**Distribution analysis for a series of 206 measurements for the octanol water partition coefficient (K

_{ow}) of polychlorinated biphenyls expressed in logarithmic scale (log

_{10}(K

_{ow}))

Step | Results |
---|---|

Dataset (given for convenience) | 4.151; 4.401; 4.421; 4.601; 4.941; 5.021; 5.023; 5.150; 5.180; 5.295; 5.301; 5.311; 5.311; 5.335; 5.343; 5.404; 5.421; 5.447; 5.452; 5.452; 5.481; 5.504; 5.517; 5.537; 5.537; 5.551; 5.561; 5.572; 5.577; 5.577; 5.627; 5.637; 5.637; 5.667; 5.667; 5.671; 5.677; 5.677; 5.691; 5.717; 5.743; 5.751; 5.757; 5.761; 5.767; 5.767; 5.787; 5.811; 5.817; 5.827; 5.867; 5.897; 5.897; 5.904; 5.943; 5.957; 5.957; 5.987; 6.041; 6.047; 6.047; 6.047; 6.057; 6.077; 6.091; 6.111; 6.117; 6.117; 6.137; 6.137; 6.137; 6.137; 6.137; 6.142; 6.167; 6.177; 6.177; 6.177; 6.204; 6.207; 6.221; 6.227; 6.227; 6.231; 6.237; 6.257; 6.267; 6.267; 6.267; 6.291; 6.304; 6.327; 6.357; 6.357; 6.367; 6.367; 6.371; 6.427; 6.457; 6.467; 6.487; 6.497; 6.511; 6.517; 6.517; 6.523; 6.532; 6.547; 6.583; 6.587; 6.587; 6.587; 6.607; 6.611; 6.647; 6.647; 6.647; 6.647; 6.647; 6.657; 6.657; 6.671; 6.671; 6.677; 6.677; 6.677; 6.697; 6.704; 6.717; 6.717; 6.737; 6.737; 6.737; 6.747; 6.767; 6.767; 6.767; 6.797; 6.827; 6.857; 6.867; 6.897; 6.897; 6.937; 6.937; 6.957; 6.961; 6.997; 7.027; 7.027; 7.027; 7.057; 7.071; 7.087; 7.087; 7.117; 7.117; 7.117; 7.121; 7.123; 7.147; 7.151; 7.177; 7.177; 7.187; 7.187; 7.207; 7.207; 7.207; 7.211; 7.247; 7.247; 7.277; 7.277; 7.277; 7.281; 7.304; 7.307; 7.307; 7.321; 7.337; 7.367; 7.391; 7.427; 7.441; 7.467; 7.516; 7.527; 7.527; 7.557; 7.567; 7.592; 7.627; 7.627; 7.657; 7.657; 7.717; 7.747; 7.751; 7.933; 8.007; 8.164; 8.423; 8.683; 9.143; 9.603 |

For n = 206 calculate the probability that the extreme values contain an outlier by using Equation (13) | At α = 5% risk being in error InvCDF_{“g1”}(1-0.05; 206) = 0.498755 |

Calculate the critical probabilities for the extreme values by using Equations (9) and (10) | g1 = 0.498755 → |0.5 - p_{min/max}| = 0.498755 → 1 - 2p_{min/max} = ± 0.99751 →p _{min} = 0.0001245; p_{max} = 0.9998755 |

Estimate the parameters of the distribution fitting the dataset (distribution: Gauss-Laplace; μ - location parameter; σ - scale parameter; k - shape parameter) | Initial estimates (from a hybrid CM & MLE method): μ = 6.4806; σ = 0.83076; k = 1.4645; MLE estimates (by applying eq.3): μ = 6.47938; σ = 0.82828; k = 1.79106; |

Calculate the lower and the upper bound for the extreme values by using InvCDF of the distribution fitting the data (Equation (15)) | InvCDF_{“GL”}(0.0001245; μ = 6.47938, σ = 0.82828, k = 1.79106) = 3.2409InvCDF _{“GL”}(0.9998755; μ = 6.47938, σ = 0.82828, k = 1.79106) = 9.7178 |

Make the conclusion regarding the outliers | Since the smallest value in the dataset is 4.151 (> 3.24) and the largest value is 9.603 (< 9.71), at 5% risk being in error there are no outliers in the dataset on the assumption that data follows the Gauss-Laplace distribution |

**Table 5.**Comparison of the steps of the analysis and simulation for extreme values confidence intervals (proposed method vs. Grubbs test)

Step | Action (step 0 is setting the dataset; α ← 0.05) |
---|---|

1 | Estimate (with MLE, Equation (3)) parameters (μ, σ) of the Normal distribution; calculate the associated CDFs (Equation (18)) |

2 | Calculate the order statistics, their associated risks being in error, FCS and p_{FCS} (Equations (6) and (4)) |

3 | For n and α calculate the confidence intervals for the extreme values by using (a) Equation (6) and (17) and (b) Equation (19) |

4 | Run the MC experiment (Figure 4) for K = 10000 (and then the expected number of outliers is 500) samples and count the samples containing outliers for the existing method (Grubbs, Equation (19); with μ and σ from CM method) and for the proposed method (g1, Equations (13)–(15) and (17); with μ and σ from the MLE method) |

Step | Results (for α = 5%) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

1 | μ = 575.2; σ = 8.256 (MLE) → CPs = {0.1916, 0.2644, 0.2644, 0.2644, 0.3492, 0.3492, 0.3492, 0.6328, 0.8568, 0.9941} | |||||||||

2 | Statistic | AD | KS | CM | KV | WU | H1 | FCS | ||

Value | 1.137 | 1.110 | 0.206 | 1.715 | 0.182 | 5.266 | 12.293 | |||

α_{Statistic} | 0.288 | 0.132 | 0.259 | 0.028 | 0.049 | 0.343 | 0.056 | |||

3 | x_{crit}(5%) = 575.2 ± 2.29·8.7025; p_{extreme}(5%) = 0.5 ± InvCDF_{“g1”}(1-0.05; 10); x_{extreme}(5%) = {552.086, 598.314} | |||||||||

4 | Number of samples containing outliers | Existing method (Grubbs) | Proposed method (g1) | |||||||

First run | 1977 (19.77%) | 510 (5.1%) | ||||||||

Second run | 2009 (20.09%) | 526 (5.26%) |

Step | Results (for α = 5%) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

1 | μ = 572.889; σ = 4.725 (MLE) → CPs = {0.1504, 0.2705, 0.2705, 0.2705, 0.4254, 0.4254, 0.4254, 0.8603, 0.9907} | |||||||||

2 | Statistic | AD | KS | CM | KV | WU | H1 | FCS | ||

Value | 0.935 | 1.057 | 0.174 | 1.535 | 0.155 | 4.678 | 9.715 | |||

α_{Statistic} | 0.389 | 0.167 | 0.327 | 0.082 | 0.088 | 0.394 | 0.137 | |||

3 | x_{crit}(5%) = 572.89 ± 2.215·5.011; p_{extreme}(5%) = 0.5 ± InvCDF_{“g1”}(1-0.05; 9); x_{extreme}(5%) = {559.822, 585.956} | |||||||||

4 | Number of samples containing outliers | Existing method (Grubbs) | Proposed method (g1) | |||||||

First run | 2341 (23.41%) | 563 (5.63%) | ||||||||

Second run | 2333 (23.33%) | 543 (5.43%) |

Sample | {568, 570, 570, 570, 572, 572, 572, 578, 584, 596} | {568, 570, 570, 570, 572, 572, 572, 578, 584} |
---|---|---|

At 5% risk being in error can the hypothesis that the sample was drawn from a normal distribution be rejected? | No (α_{FCS} = 7%) | No (α_{FCS} = 15.8%) |

Grubbs confidence interval for ‘no outliers’ at 5% risk being in error | (555.27, 595.13) 596 is detected as being outlier | (561.79, 583.99) 584 is detected as being outlier |

g1 confidence interval for ‘no outliers’ at 5% risk being in error | (552.08, 598.32) no outliers | (559.82, 585.96) no outliers |

**Table 9.**Outlier analysis results for the {568, 570, 570, 570, 572, 572, 572, 578, 584,

**601**} dataset.

Step | Results (for α = 5%) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

1 | From the CM method: μ = 575.7; σ = 10.067; from MLE method: μ = 575.7; σ = 9.550 | |||||||||

2 | Statistic | AD | KS | CM | KV | WU | H1 | FCS | ||

Value | 1.267 | 1.109 | 0.225 | 1.774 | 0.198 | 5.411 | 13.652 | |||

α_{Statistic} | 0.241 | 0.132 | 0.226 | 0.018 | 0.035 | 0.254 | 0.034 | |||

3 | Grubbs confidence interval for ’no outliers’ at 5% risk being in error: (552.647,598.753); 601 is an outlierg1 confidence interval for ’no outliers’ at 5% risk being in error: (548.963, 602.437); no outliers |

**Table 10.**Outlier analysis results for the {568, 570, 570, 570, 572, 572, 572, 578, 584, 604} dataset.

Step | Results (for α = 5%) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

1 | From the CM method: μ = 576.0; σ = 10.914; from MLE method: μ = 576.0; σ = 10.354 | |||||||||

2 | Statistic | AD | KS | CM | KV | WU | H1 | FCS | ||

Value | 1.348 | 1.108 | 0.238 | 1.803 | 0.209 | 5.481 | 14.468 | |||

α_{Statistic} | 0.216 | 0.133 | 0.206 | 0.015 | 0.028 | 0.215 | 0.025 | |||

3 | Grubbs confidence interval for ’no outliers’ at 5% risk being in error: (551.00, 601.00); 604 is an outlier g1 confidence interval for ’no outliers’ at 5% risk being in error: (547.01, 604.99); no outliers |

**Table 11.**Outlier analysis results for Table 4 dataset under the assumption of normal distribution.

Step | Results (for α = 5%) | |||||||||
---|---|---|---|---|---|---|---|---|---|---|

1 | Table 5 Dataset; Normal distribution → CM: μ = 6.481; σ = 0.831; MLE: μ = 6.481; σ = 0.829 | |||||||||

2 | Statistic | AD | KS | CM | KV | WU | H1 | FCS | ||

Value | 0.439 | 0.484 | 0.049 | 0.952 | 0.047 | 104.2 | 1.276 | |||

α_{Statistic} | 0.812 | 0.965 | 0.886 | 0.852 | 0.743 | 0.641 | 0.973 | |||

3 | Grubbs confidence interval for ’no outliers’ at 5% risk being in error: (3.492, 9.470); 9.603 is an outlier g1 confidence interval for ’no outliers’ at 5% risk being in error: (3.444, 9.517); 9.603 is an outlier | |||||||||

4 | Number of samples containing outliers | Existing method (Grubbs) | Proposed method (g1) | |||||||

First run | 637 (6.37%) | 511 (5.11%) | ||||||||

Second run | 630 (6.3%) | 481 (4.81%) |

© 2019 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Jäntschi, L.
A Test Detecting the Outliers for Continuous Distributions Based on the Cumulative Distribution Function of the Data Being Tested. *Symmetry* **2019**, *11*, 835.
https://doi.org/10.3390/sym11060835

**AMA Style**

Jäntschi L.
A Test Detecting the Outliers for Continuous Distributions Based on the Cumulative Distribution Function of the Data Being Tested. *Symmetry*. 2019; 11(6):835.
https://doi.org/10.3390/sym11060835

**Chicago/Turabian Style**

Jäntschi, Lorentz.
2019. "A Test Detecting the Outliers for Continuous Distributions Based on the Cumulative Distribution Function of the Data Being Tested" *Symmetry* 11, no. 6: 835.
https://doi.org/10.3390/sym11060835