Statistical Methods for Modeling High-Dimensional and Complex Data

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (31 March 2023) | Viewed by 22709

Special Issue Editor


Guest Editor
Department of Mathematics and Statistics, York University, Toronto, ON M3J 1P3, Canada
Interests: statistical modeling and inference for data with a very complex structure and/or with high dimension

Special Issue Information

Dear Colleagues,

Statistical models have contributed to our understanding of the structure of systems and processes across engineering and the natural and social sciences. One of the most important tasks in statistics is to develop methodologies and theory for building a statistical model for a given data set. Such a model is generally not unique: given a set of competing models, one must select the best approximating model among them before statistical analysis can proceed.

As data often exhibit complex structures, a statistical model is expected to capture this complexity, which can further our understanding of the underlying data-generating mechanism and advance the relevant fields of science and engineering. This Special Issue calls for newly developed statistical methods for modeling high-dimensional and/or complex data.

Prof. Dr. Yuehua Wu
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and listed together on the Special Issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and written in good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • model selection
  • spatiotemporal modeling
  • cluster analysis
  • high-dimensional statistics
  • data mining
  • multiple change-point detection

Published Papers (15 papers)

Research

15 pages, 332 KiB  
Article
Detection of Interaction Effects in a Nonparametric Concurrent Regression Model
by Rui Pan, Zhanfeng Wang and Yaohua Wu
Entropy 2023, 25(9), 1327; https://doi.org/10.3390/e25091327 - 12 Sep 2023
Viewed by 801
Abstract
Many methods have been developed to study nonparametric function-on-function regression models. Nevertheless, there has been a lack of model selection approaches for the regression function when it takes functional covariates as inputs. To study interaction effects among these functional covariates, in this article we first construct a tensor product space of reproducing kernel Hilbert spaces and build an analysis of variance (ANOVA) decomposition of the tensor product space. We then use a model selection method with the L1 criterion to estimate the regression function with functional covariate inputs and to detect interaction effects among the functional covariates. The proposed method is evaluated using simulations and stroke rehabilitation data.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

12 pages, 304 KiB  
Article
Tweedie Compound Poisson Models with Covariate-Dependent Random Effects for Multilevel Semicontinuous Data
by Renjun Ma, Md. Dedarul Islam, M. Tariqul Hasan and Bent Jørgensen
Entropy 2023, 25(6), 863; https://doi.org/10.3390/e25060863 - 28 May 2023
Viewed by 829
Abstract
Multilevel semicontinuous data occur frequently in medical, environmental, insurance and financial studies. Such data are often measured with covariates at different levels; however, these data have traditionally been modelled with covariate-independent random effects. Ignoring the dependence between cluster-specific random effects and cluster-specific covariates in these traditional approaches may lead to ecological fallacy and misleading results. In this paper, we propose a Tweedie compound Poisson model with covariate-dependent random effects to analyze multilevel semicontinuous data, in which covariates at different levels are incorporated at the relevant levels. The estimation of our models is based on the orthodox best linear unbiased predictor of the random effects. Explicit expressions for the random-effects predictors facilitate the computation and interpretation of our models. Our approach is illustrated through an analysis of basic symptoms inventory study data, in which 409 adolescents from 269 families were observed between 1 and 17 times each. The performance of the proposed methodology is also examined through simulation studies.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

11 pages, 283 KiB  
Article
Sample Size Calculations in Simple Linear Regression: A New Approach
by Tianyuan Guan, Mohammed Khorshed Alam and Marepalli Bhaskara Rao
Entropy 2023, 25(4), 611; https://doi.org/10.3390/e25040611 - 03 Apr 2023
Viewed by 3603
Abstract
The problem tackled is the determination of sample size for a given level and power in the context of a simple linear regression model. The standard approach deals with planned experiments, in which the predictor X is observed a number n of times and the corresponding observations on the response variable Y are then drawn. The test statistic is built on the least squares estimator of the slope parameter, and its conditional distribution given the data on the predictor X is used for sample size calculations. This is problematic: the sample size n is already presupposed, and the data on X are fixed. In unplanned experiments, in which both X and Y are to be sampled simultaneously, we do not yet have data on the predictor X. This conundrum has been discussed in several papers and books with no solution proposed. We overcome the problem by determining the exact unconditional distribution of the test statistic in the unplanned case. We provide tables of critical values for given levels of significance following the exact distribution. In addition, we show that the distribution of the test statistic depends only on the effect size, which is defined precisely in the paper.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
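The planned-versus-unplanned distinction above can be made concrete by simulation. The sketch below is not the authors' method (which derives the exact unconditional distribution); it simply estimates the unconditional power of the standard slope t-test by drawing X and Y together, as in the unplanned setting. The function name and default parameter values are illustrative assumptions.

```python
import numpy as np
from scipy import stats


def slope_test_power(n, beta1=0.5, sigma=1.0, alpha=0.05, reps=2000):
    """Monte Carlo power of the t-test for H0: beta1 = 0 in simple linear
    regression when X and Y are sampled together (the 'unplanned' case)."""
    rng = np.random.default_rng(0)
    crit = stats.t.ppf(1 - alpha / 2, df=n - 2)  # two-sided critical value
    rejections = 0
    for _ in range(reps):
        x = rng.normal(size=n)                   # X is random, not fixed
        y = beta1 * x + rng.normal(scale=sigma, size=n)
        sxx = np.sum((x - x.mean()) ** 2)
        b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
        resid = y - y.mean() - b1 * (x - x.mean())
        s2 = np.sum(resid ** 2) / (n - 2)        # residual variance estimate
        t = b1 / np.sqrt(s2 / sxx)               # slope t-statistic
        rejections += abs(t) > crit
    return rejections / reps
```

Scanning n upward until the estimated power reaches a target then gives a crude Monte Carlo sample-size rule, in place of the exact tables provided in the paper.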
14 pages, 2644 KiB  
Article
Change-Point Detection for Multi-Way Tensor-Based Frameworks
by Shanshan Qin, Ge Zhou and Yuehua Wu
Entropy 2023, 25(4), 552; https://doi.org/10.3390/e25040552 - 23 Mar 2023
Viewed by 939
Abstract
Graph-based change-point detection methods are often applied because of their advantages with high-dimensional data. Most applications focus on extracting effective information about objects while ignoring their main features. In some applications, however, one may be interested in detecting objects with different features, such as color. We therefore propose a general graph-based change-point detection method under a multi-way tensor framework, aimed at detecting objects with different features that change in the distribution of one or more slices. Furthermore, considering that recorded tensor sequences may be vulnerable to natural disturbances, such as lighting in images or videos, we propose an improved method incorporating histogram equalization techniques to improve detection efficiency. Finally, through simulations and real data analysis, we show that the proposed methods achieve higher efficiency in detecting change-points.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
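Histogram equalization, which the improved method above incorporates to counter lighting disturbances, can be sketched in a few lines of NumPy. This is the generic CDF-based transform, not the paper's code:

```python
import numpy as np


def equalize_histogram(img, levels=256):
    """Map 8-bit-style intensities through their empirical CDF so that the
    output histogram is approximately uniform (standard equalization)."""
    flat = img.ravel()
    hist = np.bincount(flat, minlength=levels)   # intensity counts
    cdf = np.cumsum(hist) / flat.size            # empirical CDF in [0, 1]
    lut = np.round(cdf * (levels - 1)).astype(img.dtype)
    return lut[img]                              # apply lookup table
```

Applying this slice-by-slice before computing graph-based statistics makes the detection less sensitive to global brightness shifts.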

17 pages, 3007 KiB  
Article
A Robust and High-Dimensional Clustering Algorithm Based on Feature Weight and Entropy
by Xinzhi Du
Entropy 2023, 25(3), 510; https://doi.org/10.3390/e25030510 - 16 Mar 2023
Cited by 1 | Viewed by 1562
Abstract
Since the Fuzzy C-Means algorithm cannot account for the influence of different features and exponential constraints on high-dimensional and complex data, a fuzzy clustering algorithm based on a non-Euclidean distance that combines feature weights and entropy weights is proposed. The proposed algorithm builds on the Fuzzy C-Means soft clustering algorithm to handle high-dimensional and complex data. Its objective function is modified with two different entropy terms and a non-Euclidean distance, whose calculation formula improves the extraction of each feature's contribution. The first entropy term helps to minimize the clusters' dispersion and maximize the negative entropy to control the clustering process, which also promotes association between samples. The second entropy term controls the feature weights, since different features carry different weights in the clustering process. Experiments on real-world datasets indicate that the proposed algorithm gives better clustering results than competing algorithms; parameter-sensitivity analysis and comparisons of the distance formulas further demonstrate its robustness. In summary, the improved algorithm performs well under noisy interference and on high-dimensional real-world datasets, increases computational efficiency, and encourages the development of robust, noise-resistant high-dimensional fuzzy clustering algorithms.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
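To illustrate how an entropy term can regularize fuzzy memberships, here is generic maximum-entropy fuzzy clustering in NumPy. This is a simplified stand-in, not the proposed algorithm, which additionally uses feature weights and a non-Euclidean distance; the parameter `gamma` weighting the entropy term is an illustrative choice.

```python
import numpy as np


def entropy_fuzzy_clustering(X, k, gamma=0.5, iters=100):
    """Entropy-regularized fuzzy clustering: minimizing
    sum_ij u_ij * d2_ij + gamma * sum_ij u_ij * log(u_ij)
    over memberships u yields a softmax of negative squared distances;
    centers are the membership-weighted means."""
    # farthest-point initialization (deterministic, avoids degenerate starts)
    centers = [X[0]]
    for _ in range(1, k):
        d2 = np.min(((X[:, None] - np.asarray(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(X[np.argmax(d2)])
    centers = np.asarray(centers)
    for _ in range(iters):
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # n x k
        u = np.exp(-(d2 - d2.min(axis=1, keepdims=True)) / gamma)   # stabilized softmax
        u /= u.sum(axis=1, keepdims=True)
        centers = (u.T @ X) / u.sum(axis=0)[:, None]                # weighted means
    return centers, u
```

Small `gamma` gives near-hard assignments; large `gamma` spreads membership across clusters, which is exactly the dispersion-versus-entropy trade-off the abstract describes.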

20 pages, 552 KiB  
Article
Bayesian Analysis of Tweedie Compound Poisson Partial Linear Mixed Models with Nonignorable Missing Response and Covariates
by Zhenhuan Wu, Xingde Duan and Wenzhuan Zhang
Entropy 2023, 25(3), 506; https://doi.org/10.3390/e25030506 - 15 Mar 2023
Cited by 1 | Viewed by 1120
Abstract
Under the Bayesian framework, this study proposes a Tweedie compound Poisson partial linear mixed model, based on a Bayesian P-spline approximation to the nonparametric function, for longitudinal semicontinuous data with nonignorable missing covariates and responses. Logistic regression models are used to specify the missing-response and missing-covariate mechanisms. A hybrid algorithm combining the Gibbs sampler and the Metropolis–Hastings algorithm is employed to produce joint Bayesian estimates of the unknown parameters and random effects as well as the nonparametric function. Several simulation studies and a real example from the Osteoarthritis Initiative data illustrate the proposed methodologies.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

16 pages, 441 KiB  
Article
Change-Point Detection in a High-Dimensional Multinomial Sequence Based on Mutual Information
by Xinrong Xiang, Baisuo Jin and Yuehua Wu
Entropy 2023, 25(2), 355; https://doi.org/10.3390/e25020355 - 14 Feb 2023
Viewed by 1251
Abstract
Time-series data often exhibit an abrupt structural change at an unknown location. This paper proposes a new statistic to test for the existence of a change-point in a multinomial sequence in which the number of categories is comparable with the sample size as it tends to infinity. To construct this statistic, a pre-classification is implemented first; the statistic is then based on the mutual information between the data and the locations obtained from the pre-classification. This statistic can also be used to estimate the position of the change-point. Under certain conditions, the proposed statistic is asymptotically normally distributed under the null hypothesis and consistent under the alternative hypothesis. Simulation results show the high power of the test based on the proposed statistic and the high accuracy of the estimate. The proposed method is also illustrated with a real example of physical examination data.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
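For intuition, the mutual information between the category labels and a segment indicator can itself serve as a scan statistic for a single change-point. The sketch below omits the paper's pre-classification step and high-dimensional asymptotics; it simply estimates the change location as the split maximizing that mutual information.

```python
import numpy as np


def mi_change_point(seq, k, min_seg=5):
    """Estimate a change-point in a categorical sequence (values 0..k-1) as
    the split maximizing the mutual information between category and segment."""
    def H(p):
        p = p[p > 0]
        return -(p * np.log(p)).sum()

    n = len(seq)
    onehot = np.eye(k)[seq]          # n x k indicator matrix
    p_all = onehot.mean(0)           # pooled category frequencies
    best_t, best_mi = None, -np.inf
    for t in range(min_seg, n - min_seg):
        w = t / n
        # I(category; segment) = pooled entropy minus weighted segment entropies
        mi = H(p_all) - w * H(onehot[:t].mean(0)) - (1 - w) * H(onehot[t:].mean(0))
        if mi > best_mi:
            best_t, best_mi = t, mi
    return best_t, best_mi
```

Under no change the statistic stays near zero, so large values of `best_mi` indicate a change, and `best_t` estimates its position.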

20 pages, 4890 KiB  
Article
Detecting Non-Overlapping Signals with Dynamic Programming
by Mordechai Roth, Amichai Painsky and Tamir Bendory
Entropy 2023, 25(2), 250; https://doi.org/10.3390/e25020250 - 30 Jan 2023
Viewed by 1117
Abstract
This paper studies the classical problem of detecting the locations of signal occurrences in a one-dimensional noisy measurement. Assuming the signal occurrences do not overlap, we formulate the detection task as a constrained likelihood optimization problem and design a computationally efficient dynamic program that attains its optimal solution. Our proposed framework is scalable, simple to implement, and robust to model uncertainties. We show by extensive numerical experiments that our algorithm accurately estimates the locations in dense and noisy environments, and outperforms alternative methods.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
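A minimal version of such a dynamic program, for a known template and a simple correlation score, looks as follows. This is an illustrative sketch, not the paper's constrained-likelihood formulation; the threshold-based score is an assumption made here for simplicity.

```python
import numpy as np


def detect_nonoverlapping(y, template, threshold):
    """Select non-overlapping placements of `template` in `y` maximizing the
    total correlation score; placements scoring below `threshold` are skipped."""
    L, n = len(template), len(y)
    score = np.array([y[i:i + L] @ template for i in range(n - L + 1)])
    best = np.zeros(n + 1)          # best[i] = optimal total score using y[:i]
    choice = [None] * (n + 1)       # start of a placement ending at i, if any
    for i in range(1, n + 1):
        best[i], choice[i] = best[i - 1], None   # option 1: no placement ends here
        s = i - L                                # option 2: placement on y[s:i]
        if s >= 0 and score[s] > threshold and best[s] + score[s] > best[i]:
            best[i] = best[s] + score[s]
            choice[i] = s
    locs, i = [], n                 # backtrack the optimal placements
    while i > 0:
        if choice[i] is None:
            i -= 1
        else:
            locs.append(choice[i])
            i = choice[i]
    return sorted(locs), best[n]
```

The non-overlap constraint is enforced structurally: a placement ending at position i can only be combined with solutions on y[:i-L], which is what makes the recursion exact.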

23 pages, 350 KiB  
Article
Robust Variable Selection with Exponential Squared Loss for the Spatial Durbin Model
by Zhongyang Liu, Yunquan Song and Yi Cheng
Entropy 2023, 25(2), 249; https://doi.org/10.3390/e25020249 - 30 Jan 2023
Viewed by 1136
Abstract
With the continuing application of spatially dependent data in various fields, spatial econometric models have attracted increasing attention. In this paper, a robust variable selection method based on the exponential squared loss and the adaptive lasso is proposed for the spatial Durbin model. Under mild conditions, we establish the asymptotic and oracle properties of the proposed estimator. In model fitting, however, nonconvex and nondifferentiable programming problems challenge the solving algorithms. To address this, we design a block coordinate descent (BCD) algorithm and give a difference-of-convex (DC) decomposition of the exponential squared loss. Numerical simulation results show that the method is more robust and accurate than existing variable selection methods when noise is present. We also apply the model to the 1978 housing price dataset for the Baltimore area.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
18 pages, 376 KiB  
Article
Estimation of Large-Dimensional Covariance Matrices via Second-Order Stein-Type Regularization
by Bin Zhang, Hengzhen Huang and Jianbin Chen
Entropy 2023, 25(1), 53; https://doi.org/10.3390/e25010053 - 27 Dec 2022
Viewed by 1345
Abstract
This paper tackles the problem of estimating the covariance matrix in large-dimension and small-sample-size scenarios. Inspired by the well-known linear shrinkage estimation, we propose a novel second-order Stein-type regularization strategy to generate well-conditioned covariance matrix estimators. We model the second-order Stein-type regularization as a quadratic polynomial in the sample covariance matrix and a given target matrix representing prior information about the actual covariance structure. To obtain available covariance matrix estimators, we choose the spherical and diagonal target matrices and develop unbiased estimates of the theoretical mean squared errors, which measure the distances between the actual covariance matrix and its estimators. We formulate the second-order Stein-type regularization as a convex optimization problem, resulting in the optimal second-order Stein-type estimators. Numerical simulations reveal that the proposed estimators can significantly lower the Frobenius losses compared with existing Stein-type estimators. Moreover, a real data analysis in portfolio selection verifies the performance of the proposed estimators.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
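The first-order (linear) shrinkage baseline that the paper generalizes can be written in a few lines. In this sketch the shrinkage weight `rho` is taken as given, whereas Stein-type methods choose it (and, in the paper, an additional quadratic term) by minimizing an estimated risk:

```python
import numpy as np


def shrink_covariance(X, rho):
    """Linear shrinkage toward a scaled identity target:
    S_rho = (1 - rho) * S + rho * mu * I, with mu = trace(S) / p."""
    n, p = X.shape
    S = np.cov(X, rowvar=False)          # sample covariance (singular if n <= p)
    mu = np.trace(S) / p                 # scale of the spherical target
    return (1 - rho) * S + rho * mu * np.eye(p)
```

Even a fixed `rho` makes the estimator positive definite when n < p and typically reduces the Frobenius loss relative to the sample covariance, which is the effect the second-order scheme sharpens.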

12 pages, 269 KiB  
Article
Feature Selection in High-Dimensional Models via EBIC with Energy Distance Correlation
by Isaac Xoese Ocloo and Hanfeng Chen
Entropy 2023, 25(1), 14; https://doi.org/10.3390/e25010014 - 21 Dec 2022
Cited by 1 | Viewed by 1039
Abstract
In this paper, the LASSO method with the extended Bayesian information criterion (EBIC) for feature selection in high-dimensional models is studied. We propose using the energy distance correlation in place of the ordinary correlation coefficient to measure the dependence between two variables. The energy distance correlation detects both linear and non-linear association, unlike the ordinary correlation coefficient, which detects only linear association. EBIC is adopted as the stopping criterion. The new method is shown to be more powerful than Luo and Chen's method for feature selection, as demonstrated by simulation studies and illustrated by a real-life example. The new algorithm is also proved to be selection-consistent.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
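The key property, sensitivity to non-linear dependence, is easy to demonstrate. Below is a generic sample distance correlation for univariate data (Székely-style double centering), written here as a simplified stand-in for the energy distance correlation used in the paper:

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation: strictly positive for any dependence,
    unlike Pearson's r, which captures only linear association."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    a = np.abs(x[:, None] - x[None, :])          # pairwise distance matrices
    b = np.abs(y[:, None] - y[None, :])
    # double centering: subtract row and column means, add back the grand mean
    A = a - a.mean(0) - a.mean(1)[:, None] + a.mean()
    B = b - b.mean(0) - b.mean(1)[:, None] + b.mean()
    dcov2 = (A * B).mean()
    dvarx, dvary = (A * A).mean(), (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvarx * dvary))
```

For y = x^2 with x symmetric around zero, Pearson's r is essentially zero while the distance correlation is clearly positive, which is precisely why it is the better screening measure in the abstract's setting.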
18 pages, 432 KiB  
Article
Testing the Intercept of a Balanced Predictive Regression Model
by Qijun Wang, Xiaohui Liu, Yawen Fan and Ling Peng
Entropy 2022, 24(11), 1594; https://doi.org/10.3390/e24111594 - 02 Nov 2022
Viewed by 1090
Abstract
Testing predictability is an important issue for the balanced predictive regression model. Unified test statistics with desirable properties have been proposed, though their validity depends on a predefined assumption about whether an intercept term exists. In fact, most financial data exhibit endogeneity or heteroscedasticity, and the existing intercept test does not perform well in these cases. In this paper, we consider testing for the intercept of the balanced predictive regression model. An empirical-likelihood-based test statistic is developed, and its limiting distribution is derived under mild conditions. We also provide simulations and a real application to illustrate its merits in terms of both size and power.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

13 pages, 327 KiB  
Article
Analysis of Longitudinal Binomial Data with Positive Association between the Number of Successes and the Number of Failures: An Application to Stock Instability Study
by Xiaolei Zhang, Guohua Yan, Renjun Ma and Jiaxiu Li
Entropy 2022, 24(10), 1472; https://doi.org/10.3390/e24101472 - 16 Oct 2022
Viewed by 1176
Abstract
Numerous methods have been developed for longitudinal binomial data in the literature. These traditional methods are reasonable for longitudinal binomial data with a negative association between the number of successes and the number of failures over time; however, a positive association may occur between the number of successes and the number of failures over time in some behaviour, economic, disease aggregation and toxicological studies as the numbers of trials are often random. In this paper, we propose a joint Poisson mixed modelling approach to longitudinal binomial data with a positive association between longitudinal counts of successes and longitudinal counts of failures. This approach can accommodate both a random and zero number of trials. It can also accommodate overdispersion and zero inflation in the number of successes and the number of failures. An optimal estimation method for our model has been developed using the orthodox best linear unbiased predictors. Our approach not only provides robust inference against misspecified random effects distributions, but also consolidates the subject-specific and population-averaged inferences. The usefulness of our approach is illustrated with an analysis of quarterly bivariate count data of stock daily limit-ups and limit-downs.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

15 pages, 2311 KiB  
Article
Bayesian Variable Selection and Estimation in Semiparametric Simplex Mixed-Effects Models with Longitudinal Proportional Data
by Anmin Tang, Xingde Duan and Yuanying Zhao
Entropy 2022, 24(10), 1466; https://doi.org/10.3390/e24101466 - 14 Oct 2022
Cited by 2 | Viewed by 1256
Abstract
In the development of simplex mixed-effects models, the random effects are generally assumed to be normally distributed. This normality assumption may be violated when analyzing skewed and multimodal longitudinal data. In this paper, we adopt the centered Dirichlet process mixture model (CDPMM) to specify the random effects in simplex mixed-effects models. Combining the block Gibbs sampler and the Metropolis–Hastings algorithm, we extend the Bayesian Lasso (BLasso) to simultaneously estimate the unknown parameters of interest and select important covariates with nonzero effects in semiparametric simplex mixed-effects models. Several simulation studies and a real example illustrate the proposed methodologies.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)

16 pages, 636 KiB  
Article
Logistic Regression Model for a Bivariate Binomial Distribution with Applications in Baseball Data Analysis
by Yewon Han, Jaeho Kim, Hon Keung Tony Ng and Seong W. Kim
Entropy 2022, 24(8), 1138; https://doi.org/10.3390/e24081138 - 17 Aug 2022
Cited by 3 | Viewed by 2081
Abstract
There has been a considerable amount of literature on binomial regression models that utilize well-known link functions, such as the logistic, probit, and complementary log-log functions. The conventional binomial model is focused only on a single parameter representing one probability of success. However, we often encounter data for which two different success probabilities are of interest simultaneously. For instance, there are several offensive measures in baseball to predict the future performance of batters. Under these circumstances, it is meaningful to consider more than one success probability. In this article, we employ a bivariate binomial distribution that possesses two success probabilities to conduct a regression analysis with random effects incorporated under a Bayesian framework. Major League Baseball data are analyzed to demonstrate our methodologies. Extensive simulation studies are conducted to investigate model performance.
(This article belongs to the Special Issue Statistical Methods for Modeling High-Dimensional and Complex Data)
