Stats, Volume 3, Issue 4 (December 2020) – 7 articles

Cover Story: The analysis of massive databases is a key issue for most applications today, and the use of parallel computing techniques is one of the suitable approaches for that. One way to perform statistical analyses over massive databases is to combine tools via the sparklyr package, which allows an R application to use Apache Spark as a framework. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP—conditional cash transfer), comprising local processing of a large data set with 1.26 billion observations totalling more than 100 GB. Our goal was to understand how this social program acts in different cities, as well as to identify variables potentially important to the BFP utilization rate. The analysis was performed with random forests (RF) and indicated the high importance of variables such as family income, education, occupation, and density of people in the homes.
16 pages, 1949 KiB  
Article
A New Biased Estimator to Combat the Multicollinearity of the Gaussian Linear Regression Model
by Issam Dawoud and B. M. Golam Kibria
Stats 2020, 3(4), 526-541; https://doi.org/10.3390/stats3040033 - 06 Nov 2020
Cited by 20 | Viewed by 3249
Abstract
In a multiple linear regression model, the ordinary least squares estimator is inefficient when multicollinearity exists. Many authors have proposed different estimators to overcome the multicollinearity problem for linear regression models. This paper introduces a new regression estimator, called the Dawoud–Kibria estimator, as an alternative to the ordinary least squares estimator. Theory and simulation results show that this estimator performs better than other regression estimators under some conditions, according to the mean squared error criterion. Real-life datasets are used to illustrate the findings of the paper.
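The Dawoud–Kibria estimator itself is defined in the paper; as a generic illustration of how a biased estimator trades a little bias for a large variance reduction under multicollinearity, here is a minimal NumPy sketch using the classical ridge estimator (not the Dawoud–Kibria estimator; the simulated data and the shrinkage constant k are made up for illustration):

```python
import numpy as np

# Two nearly collinear predictors: X'X is close to singular, so the
# OLS solution (X'X)^{-1} X'y is numerically unstable.
rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)
x2 = x1 + 0.01 * rng.normal(size=n)  # x2 is almost a copy of x1
X = np.column_stack([x1, x2])
y = X @ np.array([1.0, 1.0]) + rng.normal(size=n)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

# A classical biased alternative (ridge): add k*I before inverting.
# The bias it introduces buys a large variance reduction, which is the
# mean-squared-error trade-off the Dawoud-Kibria estimator also targets.
k = 1.0
beta_ridge = np.linalg.solve(X.T @ X + k * np.eye(2), X.T @ y)

print("OLS:  ", beta_ols)
print("ridge:", beta_ridge)
```

With near-collinear columns, the OLS coefficients can split into a large positive/negative pair while their sum stays well identified; the ridge solution pulls both coefficients back toward comparable, stable values.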
16 pages, 979 KiB  
Article
On the Number of Independent Pieces of Information in a Functional Linear Model with a Scalar Response
by Eduardo L. Montoya
Stats 2020, 3(4), 510-525; https://doi.org/10.3390/stats3040032 - 05 Nov 2020
Viewed by 1660
Abstract
In a functional linear model (FLM) with a scalar response, the parameter curve quantifies the relationship between a functional explanatory variable and a scalar response. While these models can be ill-posed, a penalized regression spline approach may be used to obtain an estimate of the parameter curve. The penalized regression spline estimate depends on the value of a smoothing parameter. However, the ability to obtain a reasonable parameter curve estimate relies on how much information is present in the covariate functions for estimating the parameter curve. We propose to quantify the information present in the covariate functions for estimating the parameter curve. In addition, we examine the influence of this information on the stability of the parameter curve estimator and on the performance of smoothing parameter selection methods in an FLM with a scalar response.
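As background for the penalized-spline machinery the abstract refers to, the following is a minimal NumPy sketch of penalized basis regression with a second-difference penalty. The truncated-power basis, knot placement, and smoothing parameter value are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

# Toy data: a smooth signal plus noise.
rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 100)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=x.size)

# Truncated-power spline basis (a simple stand-in for the B-spline
# bases typically used in penalized regression splines).
knots = np.linspace(0.1, 0.9, 8)
B = np.column_stack([np.ones_like(x), x] + [np.maximum(x - t, 0.0) for t in knots])

# Second-difference penalty matrix D and the penalized solution
#   beta = (B'B + lam * D'D)^{-1} B'y,
# where lam plays the role of the smoothing parameter whose selection
# the paper studies.
m = B.shape[1]
D = np.diff(np.eye(m), n=2, axis=0)
lam = 0.1
beta = np.linalg.solve(B.T @ B + lam * D.T @ D, B.T @ y)
fit = B @ beta
print("training MSE:", np.mean((fit - y) ** 2))
```

Varying lam traces the fit from wiggly interpolation (lam near 0) to a heavily smoothed curve (large lam), which is why data-driven smoothing parameter selection matters.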
26 pages, 491 KiB  
Article
Model Free Inference on Multivariate Time Series with Conditional Correlations
by Dimitrios Thomakos, Johannes Klepsch and Dimitris N. Politis
Stats 2020, 3(4), 484-509; https://doi.org/10.3390/stats3040031 - 03 Nov 2020
Cited by 1 | Viewed by 2428
Abstract
New results on volatility modeling and forecasting are presented based on the NoVaS transformation approach. Our main contribution is that we extend the NoVaS methodology to modeling and forecasting conditional correlation, thus allowing NoVaS to work in a multivariate setting as well. We present exact results on the use of univariate transformations and on their combination for joint modeling of the conditional correlations: we show how the NoVaS transformed series can be combined and the likelihood function of the product can be expressed explicitly, thus allowing for optimization and correlation modeling. While this keeps the original “model-free” spirit of NoVaS, it also makes the new multivariate NoVaS approach for correlations “semi-parametric”, which is why we introduce an alternative using cross-validation. We also present a number of auxiliary results regarding the empirical implementation of NoVaS based on different criteria for distributional matching. We illustrate our findings using simulated and real-world data, and evaluate our methodology in the context of portfolio management.
(This article belongs to the Special Issue Time Series Analysis and Forecasting)
9 pages, 286 KiB  
Article
A Note on the Nonparametric Estimation of the Conditional Mode by Wavelet Methods
by Salim Bouzebda and Christophe Chesneau
Stats 2020, 3(4), 475-483; https://doi.org/10.3390/stats3040030 - 31 Oct 2020
Cited by 3 | Viewed by 1637
Abstract
The purpose of this note is to introduce and investigate the nonparametric estimation of the conditional mode using wavelet methods. We propose a new linear wavelet estimator for this problem. The estimator is constructed by combining a specific ratio technique and an established wavelet estimation method. We obtain rates of almost sure convergence over compact subsets of ℝ^d. A general estimator beyond the wavelet methodology is also proposed, discussing adaptivity within this statistical framework.
10 pages, 230 KiB  
Article
Psychometric Properties of the Adult Self-Report: Data from over 11,000 American Adults
by Michelle Guerrero, Matt Hoffmann and Laura Pulkki-Råback
Stats 2020, 3(4), 465-474; https://doi.org/10.3390/stats3040029 - 29 Oct 2020
Cited by 9 | Viewed by 3468
Abstract
The first purpose of this study was to examine the factor structure of the Adult Self-Report (ASR) via traditional confirmatory factor analysis (CFA) and contemporary exploratory structural equation modeling (ESEM). The second purpose was to examine the measurement invariance of the ASR subscales across age groups. We used baseline data from the Adolescent Brain Cognitive Development study. ASR data from 11,773 participants were used to conduct the CFA and ESEM analyses, and data from 11,678 participants were used to conduct measurement invariance testing. Fit indices supported both the CFA and ESEM solutions, with the ESEM solution yielding better fit indices. However, several items in the ESEM solution did not sufficiently load on their intended factors and/or cross-loaded on unintended factors. Results from the measurement invariance analysis suggested that the ASR subscales are robust and fully invariant across subgroups of adults formed on the basis of age (18–35 years vs. 36–59 years). Future research should use both CFA and ESEM to provide a more comprehensive assessment of the ASR.
(This article belongs to the Special Issue Statistics in Epidemiology)
21 pages, 889 KiB  
Article
Local Processing of Massive Databases with R: A National Analysis of a Brazilian Social Programme
by Hellen Paz, Mateus Maia, Fernando Moraes, Ricardo Lustosa, Lilia Costa, Samuel Macêdo, Marcos E. Barreto and Anderson Ara
Stats 2020, 3(4), 444-464; https://doi.org/10.3390/stats3040028 - 19 Oct 2020
Cited by 4 | Viewed by 3742
Abstract
The analysis of massive databases is a key issue for most applications today, and the use of parallel computing techniques is one of the suitable approaches for that. Apache Spark is a widely employed tool within this context, aiming at processing large amounts of data in a distributed way. For the Statistics community, R is one of the preferred tools. Despite its growth in recent years, it still has limitations for processing large volumes of data on single local machines. In general, the data analysis community has difficulty handling massive amounts of data on local machines, often requiring high-performance computing servers. One way to perform statistical analyses over massive databases is to combine both tools (Spark and R) via the sparklyr package, which allows an R application to use Spark. This paper presents an analysis of Brazilian public data from the Bolsa Família Programme (BFP—conditional cash transfer), comprising a large data set with 1.26 billion observations. Our goal was to understand how this social program acts in different cities, as well as to identify potentially important variables reflecting its utilization rate. Statistical modeling was performed using random forests to predict the utilization rate of BFP. Variable selection was performed through a recent method based on the importance and interpretation of variables in the random forest model. Among the 89 variables initially considered, the final model, with 17 selected variables, presented high predictive capacity and indicated the high importance of variables related to income, education, job informality, and inactive youth, namely: family income, education, occupation, and density of people in the homes. In this work, using a local machine, we highlight the potential of combining Spark and R for the analysis of a large (111.6 GB) database. This can serve as a proof of concept or reference for other similar work within the Statistics community, and our case study can provide important evidence for further analysis of this important social support programme.
(This article belongs to the Section Data Science)
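The paper's pipeline is written in R with sparklyr; as a language-neutral sketch of the underlying split/aggregate idea that makes larger-than-memory analysis possible, here is a toy streaming group mean in Python (the function name, city names, and rates are all hypothetical, not taken from the BFP data):

```python
from collections import defaultdict

def mean_by_group(records):
    """Single streaming pass keeping only per-group running sums and
    counts, so the full data set never has to fit in memory -- the
    split/aggregate pattern that Spark distributes across executors."""
    sums, counts = defaultdict(float), defaultdict(int)
    for group, value in records:
        sums[group] += value
        counts[group] += 1
    return {g: sums[g] / counts[g] for g in sums}

# Hypothetical stand-in for per-municipality BFP utilization records.
records = [("Salvador", 0.8), ("Recife", 0.5), ("Salvador", 0.6), ("Recife", 0.7)]
print(mean_by_group(records))
```

Because only one (sum, count) pair per group is retained, memory use grows with the number of cities, not with the 1.26 billion observations.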
17 pages, 454 KiB  
Article
Identification of Judicial Outcomes in Judgments: A Generalized Gini-PLS Approach
by Gildas Tagny-Ngompé, Stéphane Mussard, Guillaume Zambrano, Sébastien Harispe and Jacky Montmain
Stats 2020, 3(4), 427-443; https://doi.org/10.3390/stats3040027 - 27 Sep 2020
Cited by 2 | Viewed by 2344
Abstract
This paper presents and compares several text classification models that can be used to extract the outcome of a judgment from justice decisions, i.e., legal documents summarizing the different rulings made by a judge. Such models can be used to gather important statistics about cases, e.g., success rates based on specific characteristics of cases’ parties or jurisdiction, and are therefore important for the development of judicial prediction, not to mention the study of law enforcement in general. We propose in particular the generalized Gini-PLS, which better accounts for the information in the distribution tails while attenuating, as in the simple Gini-PLS, the influence exerted by outliers. Modeling the studied task as supervised binary classification, we also introduce the LOGIT-Gini-PLS, suited to the explanation of a binary target variable. In addition, various technical aspects of the evaluated text classification approaches, which consist of combinations of representations of judgments and classification algorithms, are studied using an annotated corpus of French justice decisions.
(This article belongs to the Special Issue Interdisciplinary Research on Predictive Justice)