Stats, Volume 5, Issue 4 (December 2022) – 28 articles

Cover Story: Due to the wide availability of functional data from multiple disciplines, studies on functional data analysis have become popular in the recent literature. However, the related development in censored survival data has been relatively sparse. In this work, we consider the problem of analyzing time-to-event data in the presence of functional predictors. We develop a generalized Kaplan–Meier (KM) estimator that incorporates functional predictors using kernel weights. In addition, we propose to select the optimal bandwidth based on a time-dependent Brier score. We carry out extensive numerical studies to examine the finite-sample performance of the proposed functional KM estimator and bandwidth selector. We illustrate the practical usage of our proposed method using a data set from the Alzheimer’s Disease Neuroimaging Initiative.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the tables of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view a paper in PDF format, click on the "PDF Full-text" link and use the free Adobe Reader to open it.
20 pages, 775 KiB  
Article
Statistical Analysis in the Presence of Spatial Autocorrelation: Selected Sampling Strategy Effects
by Daniel A. Griffith and Richard E. Plant
Stats 2022, 5(4), 1334-1353; https://doi.org/10.3390/stats5040081 - 16 Dec 2022
Cited by 3 | Viewed by 1527
Abstract
Fundamental to most classical data collection sampling theory development is the random drawings assumption requiring that each targeted population member has a known sample selection (i.e., inclusion) probability. Frequently, however, unrestricted random sampling of spatially autocorrelated data is impractical and/or inefficient. Instead, randomly choosing a population subset accounts for its exhibited spatial pattern by utilizing a grid, which often provides improved parameter estimates, such as the geographic landscape mean, at least via its precision. Unfortunately, spatial autocorrelation latent in these data can produce a questionable mean and/or standard error estimate because each sampled population member contains information about its nearby members, a data feature explicitly acknowledged in model-based inference, but ignored in design-based inference. This autocorrelation effect prompted the development of formulae for calculating an effective sample size (i.e., the equivalent number of sample selections from a geographically randomly distributed population that would yield the same sampling error) estimate. Some researchers recently challenged this and other aspects of spatial statistics as being incorrect/invalid/misleading. This paper seeks to address this category of misconceptions, demonstrating that the effective geographic sample size is a valid and useful concept regardless of the inferential basis invoked. Its spatial statistical methodology builds upon the preceding ingredients.
(This article belongs to the Section Statistical Methods)
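For intuition about the effective sample size idea, the sketch below evaluates one classical textbook approximation, n* = n(1 − ρ)/(1 + ρ), for a first-order autoregressive process. This simplified form is an assumption chosen for illustration; it is not the SAR-model-based formula developed in the paper.

```python
# Illustrative only: a classical AR(1)-type approximation to the effective
# sample size, n* = n(1 - rho)/(1 + rho). Not the paper's exact formula.

def effective_sample_size(n: int, rho: float) -> float:
    """Approximate number of 'independent' observations among n autocorrelated ones."""
    return n * (1 - rho) / (1 + rho)

for rho in (0.0, 0.25, 0.5, 0.75, 0.9):
    print(f"rho = {rho:.2f}: n = 100 behaves roughly like n* = {effective_sample_size(100, rho):.1f}")
```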
13 pages, 1103 KiB  
Article
Robust Testing of Paired Outcomes Incorporating Covariate Effects in Clustered Data with Informative Cluster Size
by Sandipan Dutta
Stats 2022, 5(4), 1321-1333; https://doi.org/10.3390/stats5040080 - 14 Dec 2022
Cited by 1 | Viewed by 1140
Abstract
Paired outcomes are common in correlated clustered data where the main aim is to compare the distributions of the outcomes in a pair. In such clustered paired data, informative cluster sizes can occur when the number of pairs in a cluster (i.e., a cluster size) is correlated with the paired outcomes or the paired differences. There have been some attempts to develop robust rank-based tests for comparing paired outcomes in such complex clustered data. Most of the existing rank tests developed for paired outcomes in clustered data compare the marginal distributions in a pair and ignore any covariate effect on the outcomes. However, when potentially important covariate data are available in observational studies, ignoring these covariate effects can result in flawed inference. In this article, using rank-based weighted estimating equations, we propose a robust procedure for covariate-effect-adjusted comparison of paired outcomes in clustered data that can also address the issue of informative cluster size. Through simulated scenarios and real-life neuroimaging data, we demonstrate the importance of considering covariate effects during paired testing and the robust performance of our proposed method in covariate-adjusted paired comparisons in complex clustered data settings.
(This article belongs to the Special Issue Novel Semiparametric Methods)
16 pages, 706 KiB  
Article
Extracting Proceedings Data from Court Cases with Machine Learning
by Bruno Mathis
Stats 2022, 5(4), 1305-1320; https://doi.org/10.3390/stats5040079 - 13 Dec 2022
Viewed by 2708
Abstract
France is rolling out an open data program for all court cases, but with few metadata attached. Reusers will have to use named-entity recognition (NER) within the text body of the case to extract any value from it. Any court case may include up to 26 variables, or labels, that are related to the proceeding, regardless of the case substance. These labels are of different syntactic types: some of them are rare; others are ubiquitous. This experiment compares different algorithms, namely CRF, SpaCy, Flair and DeLFT, to extract proceedings data, and uses the learning model assessment capabilities of Kairntech, an NLP platform. It shows that an NER model can be applied to this large and diverse set of labels and extract data of high quality. We achieved an 87.5% F1 measure with Flair trained on more than 27,000 manual annotations. Quality may yet be improved by combining NER models by data type.
(This article belongs to the Special Issue Machine Learning and Natural Language Processing (ML & NLP))
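The study trains custom models on annotated French court decisions; the hedged sketch below only shows the basic Flair tagging API with an off-the-shelf pretrained English model ("ner"), since the paper's 26 proceeding labels and training data are not reproduced here. The example sentence is made up.

```python
# Minimal Flair NER sketch with a generic pretrained model, not the paper's
# custom French legal model. Loading "ner" downloads a standard English tagger.
from flair.data import Sentence
from flair.models import SequenceTagger

tagger = SequenceTagger.load("ner")

sentence = Sentence("The appeal was filed in Paris on 12 March 2020 by ACME Corp.")
tagger.predict(sentence)

for span in sentence.get_spans("ner"):
    print(span)  # each detected entity span with its label and confidence
```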
11 pages, 277 KiB  
Article
Regression Models for Lifetime Data: An Overview
by Chrys Caroni
Stats 2022, 5(4), 1294-1304; https://doi.org/10.3390/stats5040078 - 07 Dec 2022
Viewed by 1704
Abstract
Two methods dominate the regression analysis of time-to-event data: the accelerated failure time model and the proportional hazards model. Broadly speaking, these predominate in reliability modelling and biomedical applications, respectively. However, many other methods have been proposed, including proportional odds, proportional mean residual life and several other “proportional” models. This paper presents an overview of the field and the concept behind each of these ideas. Multi-parameter modelling is also discussed, in which (in contrast to, say, the proportional hazards model) more than one parameter of the lifetime distribution may depend on covariates. This includes first hitting time (or threshold) regression based on an underlying latent stochastic process. Many of the methods that have been proposed have seen little or no practical use; a lack of user-friendly software is certainly a factor in this, as is the absence of diagnostic methods for most of them.
(This article belongs to the Section Survival Analysis)
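As a practical pointer, the two dominant approaches the overview contrasts can each be fit in a few lines with the lifelines Python library; its bundled Rossi recidivism dataset is used here purely as a stand-in example, not data from the paper.

```python
# Fitting the two workhorse lifetime regression models with lifelines.
from lifelines import CoxPHFitter, WeibullAFTFitter
from lifelines.datasets import load_rossi

df = load_rossi()  # columns include 'week' (time) and 'arrest' (event indicator)

cox = CoxPHFitter()          # proportional hazards: covariates scale the hazard
cox.fit(df, duration_col="week", event_col="arrest")
cox.print_summary()

aft = WeibullAFTFitter()     # accelerated failure time: covariates stretch time
aft.fit(df, duration_col="week", event_col="arrest")
aft.print_summary()
```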
23 pages, 7037 KiB  
Article
The Lookup Table Regression Model for Histogram-Valued Symbolic Data
by Manabu Ichino
Stats 2022, 5(4), 1271-1293; https://doi.org/10.3390/stats5040077 - 04 Dec 2022
Cited by 1 | Viewed by 1218
Abstract
This paper presents the Lookup Table Regression Model (LTRM) for histogram-valued symbolic data. We first transform the given symbolic data to a numerical data table by the quantile method. Then, under the selected response variable, we apply Monotone Blocks Segmentation (MBS) to the obtained numerical data table. If the selected response variable and some remaining explanatory variable(s) form a monotone structure, the MBS generates a Lookup Table composed of interval values. For a given object, we search for the nearest value of an explanatory variable; the corresponding value of the response variable then becomes the estimated value. If the response variable and the explanatory variable(s) covary but follow a non-monotonic structure, we need to divide the given data into several monotone substructures. For this purpose, we apply hierarchical conceptual clustering to the given data and obtain Multiple Lookup Tables by applying the MBS to each substructure. We show the usefulness of the proposed method using an artificial data set and real data sets.
17 pages, 309 KiB  
Article
Addressing Disparities in the Propensity Score Distributions for Treatment Comparisons from Observational Studies
by Tingting Zhou, Michael R. Elliott and Roderick J. A. Little
Stats 2022, 5(4), 1254-1270; https://doi.org/10.3390/stats5040076 - 02 Dec 2022
Cited by 1 | Viewed by 1009
Abstract
Propensity score (PS) based methods, such as matching, stratification, regression adjustment, simple and augmented inverse probability weighting, are popular for controlling for observed confounders in observational studies of causal effects. More recently, we proposed penalized spline of propensity prediction (PENCOMP), which multiply-imputes outcomes for unassigned treatments using a regression model that includes a penalized spline of the estimated selection probability and other covariates. For PS methods to work reliably, there should be sufficient overlap in the propensity score distributions between treatment groups. Limited overlap can result in fewer subjects being matched or in extreme weights causing numerical instability and bias in causal estimation. The problem of limited overlap suggests (a) defining alternative estimands that restrict inferences to subpopulations where all treatments have the potential to be assigned, and (b) excluding or down-weighting sample cases where the propensity to receive one of the compared treatments is close to zero. We compared PENCOMP and other PS methods for estimation of alternative causal estimands when limited overlap occurs. Simulations suggest that, when there are extreme weights, PENCOMP tends to outperform the weighted estimators for ATE and performs similarly to the weighted estimators for alternative estimands. We illustrate PENCOMP in two applications: the effect of antiretroviral treatments on CD4 counts using the Multicenter AIDS Cohort Study (MACS), and whether right heart catheterization (RHC) is beneficial in treating critically ill patients.
(This article belongs to the Section Biostatistics)
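A schematic illustration of the limited-overlap problem that motivates the paper, on simulated data: a logistic propensity model, the resulting extreme inverse-probability weights, and a simple Crump-type trimming rule. PENCOMP itself is not implemented here; everything below (data, cutoffs) is an assumption for demonstration.

```python
# Schematic propensity-score overlap check and trimming; simulated data only.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
x = rng.normal(size=(n, 2))
treat = rng.binomial(1, 1 / (1 + np.exp(-2.5 * x[:, 0])))  # strong confounding
y = x[:, 0] + 0.5 * treat + rng.normal(size=n)

ps = LogisticRegression(max_iter=1000).fit(x, treat).predict_proba(x)[:, 1]

# Inverse probability weights blow up near ps = 0 or ps = 1
w = treat / ps + (1 - treat) / (1 - ps)
print("max IPW weight:", w.max())

# One common remedy: keep only units with ps in [0.1, 0.9] (a Crump-type rule),
# which changes the estimand to a subpopulation with genuine overlap.
keep = (ps > 0.1) & (ps < 0.9)
print(f"retained {keep.mean():.1%} of the sample after trimming")
```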
12 pages, 383 KiB  
Article
A Bayesian One-Sample Test for Proportion
by Luai Al-Labadi, Yifan Cheng, Forough Fazeli-Asl, Kyuson Lim and Yanqing Weng
Stats 2022, 5(4), 1242-1253; https://doi.org/10.3390/stats5040075 - 01 Dec 2022
Cited by 1 | Viewed by 1510
Abstract
This paper deals with a new Bayesian approach to the one-sample test for proportion. More specifically, let x = (x1, …, xn) be an independent random sample of size n from a Bernoulli distribution with an unknown parameter θ. For a fixed value θ0, the goal is to test the null hypothesis H0: θ = θ0 against all possible alternatives. The proposed approach is based on the well-known formula for the Kullback–Leibler divergence between two binomial distributions chosen in a certain way. The change in this distance from prior to posterior is then assessed through the relative belief ratio (a measure of evidence). Some theoretical properties of the method are developed. Examples and simulation results are included.
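A Monte Carlo caricature of the idea, under assumed inputs (a uniform Beta(1,1) prior, hypothetical data, an arbitrary small cutoff ε): compare how much prior versus posterior mass lies near zero KL distance from θ0. This is a schematic reading of the approach, not the authors' exact construction.

```python
# Schematic relative-belief computation for H0: theta = theta0.
import numpy as np

rng = np.random.default_rng(1)
theta0, n = 0.5, 50
x_sum = 32                    # hypothetical observed number of successes
a, b = 1.0, 1.0               # Beta(1,1) prior (an assumption)

def kl_binom(theta):
    """KL divergence between Binomial(n, theta0) and Binomial(n, theta)."""
    return n * (theta0 * np.log(theta0 / theta)
                + (1 - theta0) * np.log((1 - theta0) / (1 - theta)))

prior = rng.beta(a, b, 100_000)
post = rng.beta(a + x_sum, b + n - x_sum, 100_000)

eps = 0.05                    # arbitrary "small distance" cutoff
rb = np.mean(kl_binom(post) < eps) / np.mean(kl_binom(prior) < eps)
print(f"relative belief ratio at H0: {rb:.2f}  (values < 1 are evidence against H0)")
```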
11 pages, 316 KiB  
Article
A Bootstrap Method for a Multiple-Imputation Variance Estimator in Survey Sampling
by Lili Yu and Yichuan Zhao
Stats 2022, 5(4), 1231-1241; https://doi.org/10.3390/stats5040074 - 29 Nov 2022
Cited by 1 | Viewed by 997
Abstract
Rubin’s variance estimator of the multiple imputation estimator for a domain mean is not asymptotically unbiased. Kim et al. derived the closed-form bias of Rubin’s variance estimator. In addition, they proposed an asymptotically unbiased variance estimator for the multiple imputation estimator when the imputed values can be written as a linear function of the observed values. However, this requires the assumption that the covariance of the imputed values in the same imputed dataset is twice that in different imputed datasets. In this study, we propose a bootstrap variance estimator that does not need this assumption. Both theoretical arguments and simulation studies show that it is unbiased and asymptotically valid. The new method is applied to the Hox pupil popularity data for illustration.
(This article belongs to the Section Statistical Methods)
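A generic sketch of the bootstrap-the-whole-procedure idea: within each bootstrap replicate, re-impute and recompute the multiple-imputation estimate, then take the variance across replicates. The normal-model imputation and mean estimand below are placeholders, not the paper's survey-sampling setting.

```python
# Bootstrapping a multiple-imputation (MI) estimator of a mean; schematic only.
import numpy as np

rng = np.random.default_rng(2)
y = rng.normal(10, 2, size=300)
miss = rng.random(300) < 0.3           # 30% missing completely at random
y_obs = np.where(miss, np.nan, y)

def mi_mean(v, m=5):
    """Average of m single imputations under a crude normal model."""
    obs = v[~np.isnan(v)]
    ests = []
    for _ in range(m):
        imputed = np.where(np.isnan(v),
                           rng.normal(obs.mean(), obs.std(), size=v.size), v)
        ests.append(imputed.mean())
    return np.mean(ests)

boot = [mi_mean(rng.choice(y_obs, size=y_obs.size, replace=True))
        for _ in range(500)]
print("bootstrap variance of the MI estimator:", np.var(boot))
```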
10 pages, 278 KiB  
Article
Assessing Regional Entrepreneurship: A Bootstrapping Approach in Data Envelopment Analysis
by Ioannis E. Tsolas
Stats 2022, 5(4), 1221-1230; https://doi.org/10.3390/stats5040073 - 28 Nov 2022
Cited by 2 | Viewed by 1014
Abstract
The aim of the present paper is to demonstrate the viability of using data envelopment analysis (DEA) in a regional context to evaluate entrepreneurial activities. DEA was used to assess regional entrepreneurship in Greece using individual measures of entrepreneurship as inputs and employment rates as outputs. In addition to point estimates, a bootstrap algorithm was used to produce bias-corrected metrics. In the light of the results of the study, the Greek regions perform differently in terms of converting entrepreneurial activity into job creation. Moreover, there is some evidence that unemployment may be a driver of entrepreneurship and thus negatively affects DEA-based inefficiency. The derived indicators can serve as diagnostic tools and can also be used for the design of various interventions at the regional level.
9 pages, 283 KiB  
Article
On the Relation between Lambert W-Function and Generalized Hypergeometric Functions
by Pushpa Narayan Rathie and Luan Carlos de Sena Monteiro Ozelim
Stats 2022, 5(4), 1212-1220; https://doi.org/10.3390/stats5040072 - 23 Nov 2022
Cited by 2 | Viewed by 1090
Abstract
In the theory of special functions, finding correlations between different types of functions is of great interest as unifying results, especially when considering issues such as analytic continuation. In the present paper, the relation between Lambert W-function and generalized hypergeometric functions is discussed. It will be shown that it is possible to link these functions by following two different strategies, namely, by means of the direct and inverse Mellin transform of Lambert W-function and by solving the trinomial equation originally studied by Lambert and Euler. The new results can be used both to numerically evaluate Lambert W-function and to study its analytic structure.
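One concrete bridge between the Lambert W function and series machinery is its classical Taylor expansion, W(x) = Σ_{k≥1} (−k)^{k−1} x^k / k!, valid for |x| < 1/e. The sketch below checks a truncation of this generic identity against SciPy's implementation; it is a sanity check, not one of the paper's new results.

```python
# Truncated Taylor series of Lambert W near 0 versus scipy.special.lambertw.
import numpy as np
from scipy.special import lambertw, factorial

def w_series(x, terms=20):
    k = np.arange(1, terms + 1)
    coeff = (-1.0) ** (k - 1) * k ** (k - 1.0) / factorial(k)  # (-k)^(k-1)/k!
    return np.sum(coeff * x ** k)

x = 0.1
print("series:", w_series(x))
print("scipy :", lambertw(x).real)
```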
17 pages, 1697 KiB  
Article
Model Validation of a Single Degree-of-Freedom Oscillator: A Case Study
by Edward Boone, Jan Hannig, Ryad Ghanam, Sujit Ghosh, Fabrizio Ruggeri and Serge Prudhomme
Stats 2022, 5(4), 1195-1211; https://doi.org/10.3390/stats5040071 - 18 Nov 2022
Cited by 1 | Viewed by 1024
Abstract
In this paper, we investigate a validation process in order to assess the predictive capabilities of a single degree-of-freedom oscillator. Model validation is understood here as the process of determining the accuracy with which a model can predict observed physical events or important features of the physical system. Assessment of the model therefore needs to be performed with respect to the conditions under which the model is used in actual simulations of the system and to the specific quantities of interest used for decision-making. Model validation also presupposes that the model be trained and tested against experimental data. In this work, virtual data are produced from a non-linear single degree-of-freedom oscillator, the so-called oracle model, which is supposed to provide an accurate representation of reality. The mathematical model to be validated is derived from the oracle model by simply neglecting the non-linear term. The model parameters are identified via Bayesian updating. This calibration process also includes a modeling error, due to model misspecification, modeled as a normal probability density function with zero mean and a standard deviation to be calibrated.
21 pages, 1344 KiB  
Article
Closed Form Bayesian Inferences for Binary Logistic Regression with Applications to American Voter Turnout
by Kevin Dayaratna, Jesse Crosson and Chandler Hubbard
Stats 2022, 5(4), 1174-1194; https://doi.org/10.3390/stats5040070 - 17 Nov 2022
Viewed by 1345
Abstract
Understanding the factors that influence voter turnout is a fundamentally important question in public policy and political science research. Bayesian logistic regression models are useful for incorporating individual level heterogeneity to answer these and many other questions. When these questions involve incorporating individual level heterogeneity for large data sets that include many demographic and ethnic subgroups, however, standard Markov Chain Monte Carlo (MCMC) sampling methods to estimate such models can be quite slow and impractical to perform in a reasonable amount of time. We present an innovative closed form Empirical Bayesian approach that is significantly faster than MCMC methods, thus enabling the estimation of voter turnout models that had previously been considered computationally infeasible. Our results shed light on factors impacting voter turnout in the 2000, 2004, and 2008 presidential elections. We conclude with a discussion of these factors and the associated policy implications. We emphasize, however, that although our application is to the social sciences, our approach is fully generalizable to the myriad other fields involving statistical models with binary dependent variables and high-dimensional parameter spaces.
(This article belongs to the Special Issue Bayes and Empirical Bayes Inference)
15 pages, 1014 KiB  
Article
A Weibull-Beta Prime Distribution to Model COVID-19 Data with the Presence of Covariates and Censored Data
by Elisângela C. Biazatti, Gauss M. Cordeiro, Gabriela M. Rodrigues, Edwin M. M. Ortega and Luís H. de Santana
Stats 2022, 5(4), 1159-1173; https://doi.org/10.3390/stats5040069 - 17 Nov 2022
Cited by 4 | Viewed by 1250
Abstract
Motivated by the recent popularization of the beta prime distribution, a more flexible generalization is presented to fit symmetrical or asymmetrical and bimodal data and to accommodate a non-monotonic failure rate. Thus, the Weibull-beta prime distribution is defined, and some of its structural properties are obtained. The parameters are estimated by maximum likelihood, and a new regression model is proposed. Some simulations reveal that the estimators are consistent, and applications to censored COVID-19 data show the adequacy of the models.
(This article belongs to the Section Regression Models)
14 pages, 1112 KiB  
Article
A New Predictive Algorithm for Time Series Forecasting Based on Machine Learning Techniques: Evidence for Decision Making in Agriculture and Tourism Sectors
by Juan D. Borrero, Jesús Mariscal and Alfonso Vargas-Sánchez
Stats 2022, 5(4), 1145-1158; https://doi.org/10.3390/stats5040068 - 16 Nov 2022
Cited by 2 | Viewed by 1841
Abstract
Accurate time series prediction techniques are becoming fundamental to modern decision support systems. As massive data processing develops in its practicality, machine learning (ML) techniques applied to time series can automate and improve prediction models. The radical novelty of this paper is the development of a hybrid model that combines a new approach to the classical Kalman filter with machine learning techniques, i.e., support vector regression (SVR) and nonlinear autoregressive (NAR) neural networks, to improve the performance of existing predictive models. The proposed hybrid model uses, on the one hand, an improved Kalman filter method that eliminates the convergence problems of time series data with large error variance and, on the other hand, an ML algorithm as a correction factor to predict the model error. The results reveal that our hybrid models obtain accurate predictions, substantially reducing the root mean square and mean absolute errors compared to the classical and alternative Kalman filter models and achieving a goodness of fit greater than 0.95. Furthermore, the generalization of this algorithm was confirmed by its validation in two different scenarios.
(This article belongs to the Special Issue Modern Time Series Analysis)
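A toy rendering of the hybrid idea, under many simplifying assumptions (a plain local-level Kalman filter, made-up data, in-sample SVR training): the ML model learns to predict the filter's one-step errors and is added back as a correction. The paper's improved filter and tuning are not reproduced.

```python
# Kalman filter plus SVR residual correction; purely illustrative.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(3)
t = np.arange(300)
y = 10 + 0.05 * t + 2 * np.sin(t / 10) + rng.normal(0, 0.5, t.size)

# Local-level Kalman filter (random-walk state, noisy observation)
q, r = 0.05, 0.5 ** 2          # assumed process / observation variances
x_hat, p = y[0], 1.0
preds = []
for obs in y:
    p = p + q                  # predict step
    preds.append(x_hat)        # one-step-ahead prediction
    k = p / (p + r)            # update step
    x_hat = x_hat + k * (obs - x_hat)
    p = (1 - k) * p

preds = np.array(preds)
resid = y - preds

# SVR learns to predict the filter's error from its own recent errors
lag = 3
X = np.column_stack([resid[i:len(resid) - lag + i] for i in range(lag)])
z = resid[lag:]
svr = SVR(kernel="rbf", C=10.0).fit(X, z)
corrected = preds[lag:] + svr.predict(X)

rmse = lambda e: np.sqrt(np.mean(e ** 2))
print("RMSE Kalman:          ", rmse(y[lag:] - preds[lag:]))
print("RMSE Kalman + SVR fix:", rmse(y[lag:] - corrected))
```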
15 pages, 1094 KiB  
Article
On the Sampling Size for Inverse Sampling
by Daniele Cuntrera, Vincenzo Falco and Ornella Giambalvo
Stats 2022, 5(4), 1130-1144; https://doi.org/10.3390/stats5040067 - 15 Nov 2022
Cited by 1 | Viewed by 1353
Abstract
In the Big Data era, sampling remains a central theme. This paper investigates the characteristics of inverse sampling on two different datasets (real and simulated) to determine how small a dataset can become before inverse sampling ceases to be useful, and to examine the impact of the sampling rate of the subsamples. Through the simulation study and a real-data application, we find that the method, using the appropriate subsample size, performs well for both the mean and proportion parameters on datasets considerably smaller than typical big data. Different settings for the severity of selection bias are considered in both the simulation study and the real application.
(This article belongs to the Special Issue Multivariate Statistics and Applications)
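For readers unfamiliar with inverse sampling, a minimal sketch: draw observations until a fixed number r of "successes" occurs, so the sample size is random; Haldane's classical estimator (r − 1)/(n − 1) is then unbiased for the underlying proportion. The parameter values below are arbitrary.

```python
# Inverse (negative-binomial) sampling of a rare trait, with Haldane's estimator.
import numpy as np

rng = np.random.default_rng(4)
p_true, r = 0.03, 30               # rare trait; stop after 30 successes

def inverse_sample(p, r):
    """Return the random sample size n needed to observe r successes."""
    n = successes = 0
    while successes < r:
        successes += rng.random() < p
        n += 1
    return n

estimates = [(r - 1) / (inverse_sample(p_true, r) - 1) for _ in range(2000)]
print("true p:", p_true, " mean of estimates:", np.mean(estimates))
```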
17 pages, 402 KiB  
Article
Conditional Kaplan–Meier Estimator with Functional Covariates for Time-to-Event Data
by Sudaraka Tholkage, Qi Zheng and Karunarathna B. Kulasekera
Stats 2022, 5(4), 1113-1129; https://doi.org/10.3390/stats5040066 - 10 Nov 2022
Cited by 1 | Viewed by 1385
Abstract
Due to the wide availability of functional data from multiple disciplines, studies on functional data analysis have become popular in the recent literature. However, the related development in censored survival data has been relatively sparse. In this work, we consider the problem of analyzing time-to-event data in the presence of functional predictors. We develop a conditional generalized Kaplan–Meier (KM) estimator that incorporates functional predictors using kernel weights, and we rigorously establish its asymptotic properties. In addition, we propose to select the optimal bandwidth based on a time-dependent Brier score. We then carry out extensive numerical studies to examine the finite-sample performance of the proposed functional KM estimator and bandwidth selector. We also illustrate the practical usage of our proposed method using a data set from the Alzheimer’s Disease Neuroimaging Initiative.
(This article belongs to the Section Survival Analysis)
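The kernel-weighted KM construction can be sketched in a few lines. Below, a scalar covariate stands in for the functional predictor (whose distance metric would replace |x_i - x0|), the Gaussian kernel and fixed bandwidth are assumptions, and the paper's Brier-score bandwidth selection is not reproduced.

```python
# A bare-bones conditional (kernel-weighted) Kaplan-Meier estimator.
import numpy as np

def conditional_km(times, events, x_cov, x0, h):
    """Estimate S(t | x0) with Gaussian kernel weights; returns times, survival."""
    w = np.exp(-0.5 * ((x_cov - x0) / h) ** 2)
    order = np.argsort(times)
    t, d, w = times[order], events[order], w[order]
    surv, s = [], 1.0
    for i in range(len(t)):
        at_risk = w[i:].sum()
        if d[i] and at_risk > 0:
            s *= 1.0 - w[i] / at_risk   # weighted KM product term
        surv.append(s)
    return t, np.array(surv)

rng = np.random.default_rng(5)
n = 400
x = rng.uniform(0, 1, n)
true_t = rng.exponential(1 + 2 * x)     # survival depends on the covariate
cens = rng.exponential(3.0, n)
obs_t = np.minimum(true_t, cens)
event = (true_t <= cens).astype(int)

t_grid, s_hat = conditional_km(obs_t, event, x, x0=0.8, h=0.1)
print("estimated S(1 | x = 0.8):", s_hat[np.searchsorted(t_grid, 1.0) - 1])
```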
16 pages, 3930 KiB  
Review
Selected Payback Statistical Contributions to Matrix/Linear Algebra: Some Counterflowing Conceptualizations
by Daniel A. Griffith
Stats 2022, 5(4), 1097-1112; https://doi.org/10.3390/stats5040065 - 09 Nov 2022
Viewed by 1148
Abstract
Matrix/linear algebra continues bestowing benefits on theoretical and applied statistics, a practice it began decades ago (Fisher used the word matrix in a 1941 publication), through a myriad of contributions, from recognition of a suite of matrix properties relevant to statistical concepts to matrix specifications of linear and nonlinear techniques. Consequently, focused parts of matrix algebra are topics of several statistics books and journal articles. Contributions mostly have been unidirectional, from matrix/linear algebra to statistics. Nevertheless, statistics offers great potential for making this interface a bidirectional exchange point, the theme of this review paper. Not surprisingly, regression, the workhorse of statistics, provides one tool for such historically based recompense. Another prominent one is the eigenfunction abstraction from mathematical matrix theory. A third is special matrix operations, such as Kronecker sums and products. A fourth is multivariable calculus linkages, especially arcane matrix/vector operators as well as the Jacobian term associated with variable transformations. A fifth, and the final idea this paper treats, is random matrices/vectors within the context of simulation, particularly for correlated data. These are the five statistics subjects prospectively reviewed here as capable of informing, inspiring, or otherwise furnishing insight to the far more general world of linear algebra.
(This article belongs to the Section Data Science)
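As a small concrete instance of the "special matrix operations" mentioned above, the Kronecker sum A ⊕ B = A ⊗ I + I ⊗ B and its standard eigenvalue property can be verified numerically; the matrices are arbitrary examples.

```python
# Kronecker sum and its eigenvalue property: eig(A ⊕ B) = {a + b}.
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 3.0]])
B = np.array([[1.0, 0.0], [4.0, 5.0]])

kron_sum = np.kron(A, np.eye(2)) + np.kron(np.eye(2), B)  # A ⊕ B
print(kron_sum)

# Eigenvalues of A ⊕ B are all pairwise sums of eigenvalues of A and B.
print(sorted(np.linalg.eigvals(kron_sum).real))
print(sorted((a + b).real for a in np.linalg.eigvals(A)
             for b in np.linalg.eigvals(B)))
```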
18 pages, 372 KiB  
Article
Bias-Corrected Maximum Likelihood Estimation and Bayesian Inference for the Process Performance Index Using Inverse Gaussian Distribution
by Tzong-Ru Tsai, Hua Xin, Ya-Yen Fan and Yuhlong Lio
Stats 2022, 5(4), 1079-1096; https://doi.org/10.3390/stats5040064 - 05 Nov 2022
Cited by 1 | Viewed by 1322
Abstract
In this study, bias-corrected maximum likelihood (BCML), bootstrap BCML (B-BCML) and Bayesian estimation using Jeffreys’ prior distribution are proposed for the inverse Gaussian distribution in small-sample cases, to obtain the ML and Bayes estimators of the model parameters and of the process performance index based on the lower specification limit. Moreover, an approximate confidence interval and the highest posterior density interval of the process performance index are established via the delta method and Bayesian inference, respectively. To overcome the computational difficulty of sampling from the posterior distribution in Bayesian inference, the Markov chain Monte Carlo approach is used to implement the proposed Bayesian inference procedures. Monte Carlo simulations are conducted to evaluate the performance of the proposed BCML, B-BCML and Bayesian estimation methods. An example of the active repair times for an airborne communication transceiver is used for illustration.
17 pages, 342 KiB  
Article
Bayesian Hierarchical Copula Models with a Dirichlet–Laplace Prior
by Paolo Onorati and Brunero Liseo
Stats 2022, 5(4), 1062-1078; https://doi.org/10.3390/stats5040063 - 01 Nov 2022
Viewed by 1034
Abstract
We discuss a Bayesian hierarchical copula model for clusters of financial time series. A similar approach has been developed in a recent paper. However, the prior distributions proposed there do not always provide a proper posterior. In order to circumvent the problem, we adopt a proper global–local shrinkage prior, which is also able to account for potential dependence structures among different clusters. The performance of the proposed model is presented via simulations and a real data analysis.
(This article belongs to the Section Bayesian Methods)
18 pages, 1668 KiB  
Article
Product Recalls in European Textile and Clothing Sector—A Macro Analysis of Risks and Geographical Patterns
by Vijay Kumar
Stats 2022, 5(4), 1044-1061; https://doi.org/10.3390/stats5040062 - 31 Oct 2022
Viewed by 1294
Abstract
Textile and clothing (T&C) products contribute to a substantial proportion of the non-food product recalls in the European Union (EU) due to various levels of associated risks. Out of the 34 listed categories for product recalls in the EU’s Rapid Exchange of Information System (RAPEX), the category ‘clothing, textiles, and fashion items’ was among the top 3 categories with the most recall cases during 2013–2019. Previous studies have attempted to highlight the issue of product recalls and their impacts from the perspective of a single company or selected companies, whereas limited attention has been paid to understanding the problem from a sector-specific perspective. However, considering the nature of product risks and the consistently high number of recall cases, it is important to analyze the issue of product recalls in the T&C sector from such a perspective. In this context, the paper investigates past recalls in the T&C sector reported in RAPEX during 2005–2021 to understand the major trends in recall occurrence and associated hazards. Correspondence Analysis (CA) and Latent Dirichlet Allocation (LDA) were applied to analyze the qualitative and quantitative recall data. The results reveal a geographical pattern in the product risks that lead to recalls: countries in the eastern part of Europe tend to have proportionately more recalls for strangulation- and choking-related issues, whereas chemical-related recalls are proportionately more frequent in countries in the western part of Europe. Further, text-mining results indicate that design-related recall issues are more prevalent in children’s clothing.
15 pages, 1521 KiB  
Article
Spatial Analysis: A Socioeconomic View on the Incidence of the New Coronavirus in Paraná-Brazil
by Elizabeth Giron Cima, Miguel Angel Uribe Opazo, Marcos Roberto Bombacini, Weimar Freire da Rocha Junior and Luciana Pagliosa Carvalho Guedes
Stats 2022, 5(4), 1029-1043; https://doi.org/10.3390/stats5040061 - 31 Oct 2022
Viewed by 1419
Abstract
This paper presents a spatial analysis of the incidence rate of COVID-19 cases in the state of Paraná, Brazil, from June to December 2020, and a study of the incidence rate of COVID-19 cases associated with socioeconomic variables, such as the Gini index, Theil-L index, and municipal human development index (MHDI). The data were provided by the Paraná State Health Department and the Paraná Institute for Economic and Social Development. For the study of spatial autocorrelation, the univariate global Moran index (I), the local univariate Moran index (LISA), the global Geary index (c), and the univariate local Geary index (ci) were calculated. For the analysis of spatial correlation, the global bivariate Moran index (Ixy), the local multivariate Geary indices (CiM), and the bivariate Lee index (Lxy) were calculated. There is significant positive spatial autocorrelation in the incidence rate of COVID-19 cases, as well as correlations between the incidence rate and the Gini index, Theil-L index, and MHDI in the regions under study. The highest-risk areas were concentrated in the east and west macro-regions. Understanding the spatial distribution of COVID-19, combined with economic and social factors, can contribute to greater efficiency in preventive actions and the control of new viral epidemics.
(This article belongs to the Section Econometric Modelling)
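Global Moran's I, the first of the indices listed above, can be computed from first principles; the toy ring of regions and binary neighbour weights below are assumptions standing in for the paper's municipal polygons and weight matrices.

```python
# Global Moran's I on a toy 1-D "map" (ring of regions).
import numpy as np

rng = np.random.default_rng(6)
n = 50
rate = np.cumsum(rng.normal(size=n))      # smoothly varying, hence autocorrelated
z = rate - rate.mean()

# Rook-style weights on a ring: each region's neighbours are i-1 and i+1
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = W[i, (i + 1) % n] = 1.0

I = (n / W.sum()) * (z @ W @ z) / (z @ z)
print(f"Moran's I: {I:.3f} (expectation under no autocorrelation: {-1 / (n - 1):.3f})")
```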
25 pages, 458 KiB  
Article
A Novel Generalization of Zero-Truncated Binomial Distribution by Lagrangian Approach with Applications for the COVID-19 Pandemic
by Muhammed Rasheed Irshad, Christophe Chesneau, Damodaran Santhamani Shibu, Mohanan Monisha and Radhakumari Maya
Stats 2022, 5(4), 1004-1028; https://doi.org/10.3390/stats5040060 - 30 Oct 2022
Cited by 3 | Viewed by 1303
Abstract
The importance of Lagrangian distributions and their applicability in real-world events have been highlighted in several studies. In light of this, we create a new zero-truncated Lagrangian distribution. It is presented as a generalization of the zero-truncated binomial distribution (ZTBD) and hence named the Lagrangian zero-truncated binomial distribution (LZTBD). The moments, probability generating function, factorial moments, as well as skewness and kurtosis measures of the LZTBD are discussed. We also show that the new model’s finite mixture is identifiable. The unknown parameters of the LZTBD are estimated using the maximum likelihood method. A broad simulation study is executed as an evaluation of the well-established performance of the maximum likelihood estimates. The likelihood ratio test is used to assess the effectiveness of the third parameter in the new model. Six COVID-19 datasets are used to demonstrate the LZTBD’s applicability, and we conclude that the LZTBD is very competitive on the fitting objective.
11 pages, 389 KiB  
Article
Comparison of Positivity in Two Epidemic Waves of COVID-19 in Colombia with FDA
by Cristhian Leonardo Urbano-Leon and Manuel Escabias
Stats 2022, 5(4), 993-1003; https://doi.org/10.3390/stats5040059 - 28 Oct 2022
Viewed by 933
Abstract
We use functional data methodology to examine whether there are significant differences between two waves of contagion by COVID-19 in Colombia between 7 July 2020 and 20 July 2021. A pointwise functional t-test is used initially; then, an alternative statistical test for paired samples is proposed, which has a theoretical distribution and performs well in small samples. As an advantage, our statistical test generates a scalar p-value that provides a global assessment of the significance of the positivity curves, complementing the existing pointwise tests.
(This article belongs to the Section Applied Stochastic Models)
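The pointwise test that the paper's global statistic complements can be sketched directly: a paired t-test at each point of the evaluation grid, yielding a curve of p-values. The simulated curves below stand in for the two positivity waves.

```python
# Pointwise paired t-test across a functional grid; simulated stand-in data.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(7)
n_subjects, grid = 20, np.linspace(0, 1, 100)
wave1 = np.sin(2 * np.pi * grid) + rng.normal(0, 0.3, (n_subjects, grid.size))
wave2 = wave1 + 0.25 * grid + rng.normal(0, 0.3, (n_subjects, grid.size))  # drift

p_values = np.array([ttest_rel(wave1[:, j], wave2[:, j]).pvalue
                     for j in range(grid.size)])
print("fraction of grid points with p < 0.05:", (p_values < 0.05).mean())
```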
8 pages, 352 KiB  
Article
Snooker Statistics and Zipf’s Law
by Wim Hordijk
Stats 2022, 5(4), 985-992; https://doi.org/10.3390/stats5040058 - 21 Oct 2022
Cited by 2 | Viewed by 1616
Abstract
Zipf’s law is well known in linguistics: the frequency of a word is inversely proportional to its rank. This is a special case of a more general power law, a common phenomenon in many kinds of real-world statistical data. Here, it is shown that snooker statistics also follow such a mathematical pattern, but with varying parameter values. Two types of rankings (prize money earned and centuries scored) and three different time frames (all-time, decade, and year) are considered. The results indicate that the power law parameter values depend on the type of ranking used, as well as the time frame considered. Furthermore, in some cases, the resulting parameter values vary significantly over time, for which a plausible explanation is provided. Finally, it is shown how individual rankings can be described somewhat more accurately using a log-normal distribution, but that the overall conclusions derived from the power law analysis remain valid.
(This article belongs to the Section Data Science)
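The rank-frequency fit underlying this kind of analysis can be reproduced in miniature with a log-log least-squares fit; the counts below are hypothetical, not snooker data.

```python
# Fitting frequency ∝ rank^(-alpha) by OLS on the log-log scale.
import numpy as np

counts = np.array([120, 64, 45, 33, 27, 22, 19, 17, 15, 14])  # hypothetical
rank = np.arange(1, counts.size + 1)

slope, intercept = np.polyfit(np.log(rank), np.log(counts), 1)
print(f"estimated exponent: {-slope:.2f} (Zipf's law corresponds to ~1)")
```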
8 pages, 734 KiB  
Article
Extreme Tail Ratios and Overrepresentation among Subpopulations with Normal Distributions
by Theodore P. Hill and Ronald F. Fox
Stats 2022, 5(4), 977-984; https://doi.org/10.3390/stats5040057 - 20 Oct 2022
Cited by 1 | Viewed by 2117
Abstract
Given several different populations, the relative proportions of each in the high (or low) end of the distribution of a given characteristic are often more important than the overall average values or standard deviations. In the case of two different normally-distributed random variables, as is shown here, one of the (right) tail ratios will not only eventually be greater than 1 from some point on, but will even become infinitely large. More generally, in every finite mixture of different normal distributions, there will always be exactly one of those distributions that is not only overrepresented in the right tail of the mixture but even completely overwhelms all other subpopulations in the rightmost tails. This property (and the analogous result for the left tails), although not unique to normal distributions, is not shared by other common continuous centrally symmetric unimodal distributions, such as Laplace, nor even by other bell-shaped distributions, such as Cauchy (Lorentz) distributions.
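The central phenomenon is easy to see numerically: for two normals differing only in scale (an assumed example), the right-tail probability ratio grows without bound.

```python
# Diverging right-tail ratio for two normal distributions.
import numpy as np
from scipy.stats import norm

a = norm(loc=0.0, scale=1.1)   # slightly more variable subpopulation
b = norm(loc=0.0, scale=1.0)

for x in (2, 4, 6, 8, 10):
    print(f"x = {x:2d}:  P(A > x)/P(B > x) = {a.sf(x) / b.sf(x):.3g}")
```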
7 pages, 231 KiB  
Communication
Ordinal Cochran-Mantel-Haenszel Testing and Nonparametric Analysis of Variance: Competing Methodologies
by J. C. W. Rayner and G. C. Livingston, Jr.
Stats 2022, 5(4), 970-976; https://doi.org/10.3390/stats5040056 - 17 Oct 2022
Viewed by 1621
Abstract
The Cochran-Mantel-Haenszel (CMH) and nonparametric analysis of variance (NP ANOVA) methodologies are both sets of tests for categorical response data. The latter are competitor tests for the ordinal CMH tests in which the response variable is necessarily ordinal; the treatment variable may be either ordinal or nominal. The CMH mean score test seeks to detect mean treatment differences, while the CMH correlation test assesses ordinary or (1, 1) generalized correlation. Since the corresponding nonparametric ANOVA tests assess arbitrary univariate and bivariate moments, the ordinal CMH tests have been extended to enable a fuller comparison. The CMH tests are conditional tests, assuming that certain marginal totals in the data table are known. They have been extended to have unconditional analogues. The NP ANOVA tests are unconditional. Here, we give a brief overview of both methodologies to address the question “which methodology is preferable?”.
(This article belongs to the Section Statistical Methods)
22 pages, 1264 KiB  
Article
On the Bivariate Composite Gumbel–Pareto Distribution
by Alexandra Badea, Catalina Bolancé and Raluca Vernic
Stats 2022, 5(4), 948-969; https://doi.org/10.3390/stats5040055 - 16 Oct 2022
Viewed by 1058
Abstract
In this paper, we propose a bivariate extension of univariate composite (two-spliced) distributions defined by a bivariate Pareto distribution for values larger than some thresholds and by a bivariate Gumbel distribution on the complementary domain. The purpose of this distribution is to capture the behavior of bivariate data consisting of mainly small and medium values but also of some extreme values. Some properties of the proposed distribution are presented. Further, two estimation procedures are discussed and illustrated on simulated data and on a real data set consisting of a bivariate sample of claims from an auto insurance portfolio. In addition, the risk of loss in this insurance portfolio is estimated by Monte Carlo simulation.
14 pages, 556 KiB  
Article
Benford Networks
by Roeland de Kok and Giulia Rotundo
Stats 2022, 5(4), 934-947; https://doi.org/10.3390/stats5040054 - 30 Sep 2022
Viewed by 972
Abstract
The Benford law applied within complex networks is an interesting area of research. This paper proposes a new algorithm for the generation of a Benford network based on priority rank, and further specifies the formal definition. The condition to be taken into account is the probability density of the node degree. In addition to this first algorithm, an iterative algorithm is proposed based on rewiring. Its development requires the introduction of an ad hoc measure for understanding how far an arbitrary network is from a Benford network. The measure is a semi-distance and does not lead to a distance in mathematical terms, instead serving to identify Benford networks as a class. The semi-distance is a function of the network; it is computationally less expensive than the degree of conformity and serves to set a descent condition for the rewiring. The algorithm stops when either the network is Benford or the maximum number of iterations is reached. The second condition is needed because only a limited set of densities allow for a Benford network. Another important topic is assortativity and the extremes that can be achieved by constraining the network topology; for this reason, we ran simulations on artificial networks and explored further theoretical settings as preliminary work on models of preferential attachment. Based on our extensive analysis, the first proposed algorithm remains the best one from a computational point of view.
(This article belongs to the Special Issue Benford's Law(s) and Applications)
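How far a network's degree sequence sits from Benford's first-digit law, the kind of discrepancy the paper's semi-distance formalizes more carefully, can be eyeballed as follows; networkx and a Barabási–Albert test graph are incidental choices, not the paper's construction.

```python
# First-digit distribution of node degrees versus Benford's law.
import numpy as np
import networkx as nx

g = nx.barabasi_albert_graph(5000, 3, seed=8)
degrees = np.array([d for _, d in g.degree()])

first_digit = np.array([int(str(d)[0]) for d in degrees if d > 0])
observed = np.array([(first_digit == k).mean() for k in range(1, 10)])
benford = np.log10(1 + 1 / np.arange(1, 10))

print("digit  observed  benford")
for k in range(9):
    print(f"  {k + 1}     {observed[k]:.3f}    {benford[k]:.3f}")
print("L1 deviation:", np.abs(observed - benford).sum())
```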