Statistical Data Modeling and Machine Learning with Applications II

A special issue of Mathematics (ISSN 2227-7390). This special issue belongs to the section "Mathematics and Computer Science".

Deadline for manuscript submissions: closed (10 April 2023) | Viewed by 30968

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Prof. Dr. Snezhana Gocheva-Ilieva
Guest Editor
Department of Mathematical Analysis, Faculty of Mathematics and Informatics, University of Plovdiv Paisii Hilendarski, 24 Tzar Assen St., 4000 Plovdiv, Bulgaria
Interests: computational statistics; applied mathematics; data mining; computer modeling in physics and engineering

Dr. Atanas Ivanov
Guest Editor
Department of Mathematical Analysis, Faculty of Mathematics and Informatics, Paisii Hilendarski University of Plovdiv, 4000 Plovdiv, Bulgaria
Interests: predictive modeling; regression methods; time series analysis

Dr. Hristina Kulina
Guest Editor
Department of Mathematical Analysis, University of Plovdiv Paisii Hilendarski, 24 Tzar Asen St., 4000 Plovdiv, Bulgaria
Interests: data analysis; applied and computational statistics; data mining; applications of predictive data mining techniques

Special Issue Information

Dear Colleagues,

Statistics and machine learning are two intertwined fields of mathematics and computer science. In recent years, very powerful classification and predictive methods have been developed in this area. These advances in statistical data modeling and machine learning open enormous opportunities for developing new methods and approaches, and for applying them to solve practical problems more effectively than ever before.

This Special Issue aims to publish review papers, research articles, and communications that present new original methods, applications, data analyses, case studies, comparative studies, and other results. Special attention will be given, though not exclusively, to the theory and application of statistical data modeling and machine learning in diverse areas such as computer science, economics, industry, medicine, environmental sciences, forex and finance, education, engineering, marketing, agriculture, and more.

Prof. Dr. Snezhana Gocheva-Ilieva
Dr. Atanas Ivanov
Dr. Hristina Kulina
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Mathematics is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Computational statistics
  • Dimensionality reduction and variable selection
  • Nonparametric statistical modeling
  • Supervised learning (classification, regression)
  • Clustering methods
  • Financial statistics and econometrics
  • Statistical algorithms
  • Time series analysis and forecasting
  • Machine learning algorithms
  • Decision trees
  • Ensemble methods
  • Neural networks
  • Deep learning
  • Hybrid models
  • Data analysis

Published Papers (16 papers)


Editorial


4 pages, 193 KiB  
Editorial
Special Issue “Statistical Data Modeling and Machine Learning with Applications II”
by Snezhana Gocheva-Ilieva, Atanas Ivanov and Hristina Kulina
Mathematics 2023, 11(12), 2775; https://doi.org/10.3390/math11122775 - 20 Jun 2023
Viewed by 916
Abstract
Currently, we are witnessing rapid progress and synergy between mathematics and computer science [...] Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)

Research


17 pages, 5002 KiB  
Article
Detection of Anomalies in Natural Complicated Data Structures Based on a Hybrid Approach
by Oksana Mandrikova, Bogdana Mandrikova and Oleg Esikov
Mathematics 2023, 11(11), 2464; https://doi.org/10.3390/math11112464 - 26 May 2023
Cited by 2 | Viewed by 809
Abstract
A hybrid approach is proposed to detect anomalies in natural complicated data structures with high noise levels. The approach includes the application of an autoencoder neural network and singular spectrum analysis (SSA) with an adaptive anomaly detection algorithm (AADA) developed by the authors. The autoencoder is the quintessence of the representation learning algorithm, and it projects (selects) data features. Here, under-complete autoencoders are used. They are a product of the development of the principal component method and allow one to approximate complex nonlinear dependencies. Singular spectrum analysis decomposes data through the singular decomposition of matrix trajectories and makes it possible to detect the data structure in the noise. The AADA is based on the combination of wavelet transforms with threshold functions. Combinations of different constructions of wavelet transformation with threshold functions are widely applied to tasks relating to complex data processing. However, when the noise level is high and there is no complete knowledge of a useful signal, anomaly detection is not a trivial problem and requires a complex approach. This paper considers the use of adaptive threshold functions, the parameters of which are estimated on a probabilistic basis. Adaptive thresholds and a moving time window are introduced. The efficiency of the proposed method in detecting anomalies in neutron monitor data is illustrated. Neutron monitor data record cosmic ray intensities. We used neutron monitor data from ground stations. Anomalies in cosmic rays can create serious radiation hazards for people as well as for space and ground facilities. Thus, the diagnostics of anomalies in cosmic ray parameters is quite topical, and research is being carried out by teams from different countries. A comparison of the results for the autoencoder + AADA and SSA + AADA methods showed the higher efficiency of the autoencoder + AADA method. A more flexible NN apparatus provides better detection of short-period anomalies that have complicated structures. However, the combination of SSA and the AADA is efficient in the detection of long-term anomalies in cosmic rays that occur during strong magnetic storms. Thus, cosmic ray data analysis requires a more complex approach, including the use of the autoencoder and SSA with the AADA. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
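To make the reconstruction-error idea behind the autoencoder component concrete, the following minimal sketch trains an under-complete autoencoder on sliding windows of a synthetic series and flags windows whose reconstruction error exceeds a simple threshold. It assumes NumPy and scikit-learn, uses an MLP-based autoencoder in place of the authors' network, and omits the SSA and adaptive wavelet-threshold (AADA) components of the paper.

```python
# Hedged sketch: under-complete autoencoder + reconstruction-error thresholding.
# The window length, network size and 3-sigma rule are illustrative assumptions.
import numpy as np
from sklearn.neural_network import MLPRegressor
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
t = np.arange(5000)
signal = np.sin(2 * np.pi * t / 240) + 0.3 * rng.normal(size=t.size)
signal[3200:3220] += 2.5                          # injected short anomaly

win = 32                                          # sliding-window length
X = np.lib.stride_tricks.sliding_window_view(signal, win)
X = StandardScaler().fit_transform(X)

# Under-complete autoencoder: the hidden layer is much narrower than the input,
# so the network must learn a compressed representation of "normal" windows.
ae = MLPRegressor(hidden_layer_sizes=(8,), activation="tanh",
                  max_iter=500, random_state=0)
ae.fit(X, X)                                      # reconstruct the input windows

err = np.mean((ae.predict(X) - X) ** 2, axis=1)   # reconstruction error per window
threshold = err.mean() + 3 * err.std()            # simple 3-sigma decision rule
print("anomalous windows:", np.where(err > threshold)[0][:10])
```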

22 pages, 4950 KiB  
Article
Leisure Time Prediction and Influencing Factors Analysis Based on LightGBM and SHAP
by Qiyan Wang and Yuanyuan Jiang
Mathematics 2023, 11(10), 2371; https://doi.org/10.3390/math11102371 - 19 May 2023
Cited by 2 | Viewed by 1438
Abstract
Leisure time is crucial for personal development and leisure consumption. Accurate prediction of leisure time and analysis of its influencing factors creates a benefit by increasing personal leisure time. We predict leisure time and analyze its key influencing factors according to survey data of Beijing residents’ time allocation in 2011, 2016, and 2021, with an effective sample size of 3356. A Light Gradient Boosting Machine (LightGBM) model is utilized to classify and predict leisure time, and the SHapley Additive exPlanation (SHAP) approach is utilized to conduct feature importance analysis and influence mechanism analysis of influencing factors from four perspectives: time allocation, demographics, occupation, and family characteristics. The results verify that LightGBM effectively predicts personal leisure time, with the test set’s accuracy, recall, and F1 values being 0.85 and the AUC value reaching 0.91. The results of SHAP highlight that work/study time within the system is the main constraint on leisure time. Demographic factors, such as gender and age, are also of great significance for leisure time. Occupational and family heterogeneity exist in leisure time as well. The results contribute to the government improving the public holiday system, companies designing personalized leisure products for users with different leisure characteristics, and residents understanding and independently increasing their leisure time. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
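As a rough illustration of the LightGBM-plus-SHAP workflow used in the paper, the sketch below fits a gradient-boosting classifier on synthetic data and derives a global feature ranking from mean absolute SHAP values. The lightgbm and shap packages are assumed, and the data, hyperparameters and feature indices are placeholders rather than the survey variables analysed by the authors.

```python
# Hedged sketch of LightGBM classification followed by SHAP attribution.
import numpy as np
import lightgbm as lgb
import shap
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=3000, n_features=12, n_informative=6,
                           random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)

model = lgb.LGBMClassifier(n_estimators=300, learning_rate=0.05,
                           num_leaves=31, random_state=1)
model.fit(X_tr, y_tr)

pred = model.predict(X_te)
proba = model.predict_proba(X_te)[:, 1]
print("F1:", f1_score(y_te, pred), "AUC:", roc_auc_score(y_te, proba))

# TreeExplainer gives per-sample, per-feature contributions; their mean
# absolute values provide a global importance ranking.
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_te)
vals = shap_values[1] if isinstance(shap_values, list) else shap_values
importance = np.abs(vals).mean(axis=0)
print("most influential feature index:", int(importance.argmax()))
```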

21 pages, 5283 KiB  
Article
ELCT-YOLO: An Efficient One-Stage Model for Automatic Lung Tumor Detection Based on CT Images
by Zhanlin Ji, Jianyong Zhao, Jinyun Liu, Xinyi Zeng, Haiyang Zhang, Xueji Zhang and Ivan Ganchev
Mathematics 2023, 11(10), 2344; https://doi.org/10.3390/math11102344 - 17 May 2023
Cited by 5 | Viewed by 2059
Abstract
Research on lung cancer automatic detection using deep learning algorithms has achieved good results but, due to the complexity of tumor edge features and possible changes in tumor positions, it is still a great challenge to diagnose patients with lung tumors based on computed tomography (CT) images. In order to solve the problem of scales and meet the requirements of real-time detection, an efficient one-stage model for automatic lung tumor detection in CT images, called ELCT-YOLO, is presented in this paper. Instead of deepening the backbone or relying on a complex feature fusion network, ELCT-YOLO uses a specially designed neck structure, which is suited to enhancing the multi-scale representation ability of the entire feature layer. At the same time, in order to solve the problem of lacking a receptive field after decoupling, the proposed model uses a novel Cascaded Refinement Scheme (CRS), composed of two different types of receptive field enhancement modules (RFEMs), which enables expanding the effective receptive field and aggregating multi-scale context information, thus improving the tumor detection performance of the model. The experimental results show that the proposed ELCT-YOLO model has a strong ability to express multi-scale information and good robustness in detecting lung tumors of various sizes. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)

27 pages, 1371 KiB  
Article
RaKShA: A Trusted Explainable LSTM Model to Classify Fraud Patterns on Credit Card Transactions
by Jay Raval, Pronaya Bhattacharya, Nilesh Kumar Jadav, Sudeep Tanwar, Gulshan Sharma, Pitshou N. Bokoro, Mitwalli Elmorsy, Amr Tolba and Maria Simona Raboaca
Mathematics 2023, 11(8), 1901; https://doi.org/10.3390/math11081901 - 17 Apr 2023
Cited by 2 | Viewed by 2233
Abstract
Credit card (CC) fraud has been a persistent problem and has affected financial organizations. Traditional machine learning (ML) algorithms are ineffective owing to the increased attack space, and techniques such as long short-term memory (LSTM) have shown promising results in detecting CC fraud patterns. However, owing to the black box nature of the LSTM model, the decision-making process could be improved. Thus, in this paper, we propose a scheme, RaKShA, which presents explainable artificial intelligence (XAI) to help understand and interpret the behavior of black box models. XAI is formally used to interpret these black box models; however, we used XAI to extract essential features from the CC fraud dataset, consequently improving the performance of the LSTM model. The XAI was integrated with LSTM to form an explainable LSTM (X-LSTM) model. The proposed approach takes preprocessed data and feeds it to the XAI model, which computes the variable importance plot for the dataset, which simplifies the feature selection. Then, the data are presented to the LSTM model, and the output classification is stored in a smart contract (SC), ensuring no tampering with the results. The final data are stored on the blockchain (BC), which forms trusted and chronological ledger entries. We have considered two open-source CC datasets. We obtain an accuracy of 99.8% with our proposed X-LSTM model over 50 epochs compared to 85% without XAI (simple LSTM model). We present the gas fee requirements, IPFS bandwidth, and the fraud detection contract specification in blockchain metrics. The proposed results indicate the practical viability of our scheme in real-financial CC spending and lending setups. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
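A heavily simplified sketch of the "importance-guided feature selection followed by an LSTM classifier" idea is given below. TensorFlow/Keras is assumed, a random-forest importance ranking stands in for the paper's XAI step, each transaction is treated as a one-step sequence, and the blockchain/smart-contract layer is omitted entirely.

```python
# Hedged sketch: feature selection via an importance ranking, then an LSTM.
import numpy as np
import tensorflow as tf
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=4000, n_features=30, n_informative=8,
                           weights=[0.95, 0.05], random_state=7)

# 1) Score features and keep the top-k (a stand-in for the variable-importance plot).
rf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X, y)
top_k = np.argsort(rf.feature_importances_)[::-1][:10]
X_sel = X[:, top_k]

# 2) Feed the selected features to an LSTM (one "time step" per transaction).
X_tr, X_te, y_tr, y_te = train_test_split(X_sel, y, test_size=0.25,
                                          random_state=7, stratify=y)
X_tr = X_tr.reshape(-1, 1, X_tr.shape[1])
X_te = X_te.reshape(-1, 1, X_te.shape[1])

model = tf.keras.Sequential([
    tf.keras.Input(shape=(1, len(top_k))),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(X_tr, y_tr, epochs=10, batch_size=64, verbose=0)
print("test accuracy:", model.evaluate(X_te, y_te, verbose=0)[1])
```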

26 pages, 3361 KiB  
Article
Multi-Step Ahead Ex-Ante Forecasting of Air Pollutants Using Machine Learning
by Snezhana Gocheva-Ilieva, Atanas Ivanov, Hristina Kulina and Maya Stoimenova-Minova
Mathematics 2023, 11(7), 1566; https://doi.org/10.3390/math11071566 - 23 Mar 2023
Cited by 3 | Viewed by 1036
Abstract
In this study, a novel general multi-step ahead strategy is developed for forecasting time series of air pollutants. The values of the predictors at future moments are gathered from official weather forecast sites as independent ex-ante data. They are updated with new forecasted values every day. Each new sample is used to build a separate single model that simultaneously predicts future pollution levels. The sought forecasts were estimated by averaging the actual predictions of the single models. The strategy was applied to three pollutants—PM10, SO2, and NO2—in the city of Pernik, Bulgaria. Random forest (RF) and arcing (Arc-x4) machine learning algorithms were applied to the modeling. Although there are many highly changing day-to-day predictors, the proposed averaging strategy shows a promising alternative to single models. In most cases, the root mean squared errors (RMSE) of the averaging models (aRF and aAR) for the last 10 horizons are lower than those of the single models. In particular, for PM10, the aRF's RMSE is 13.1 vs. 13.8 micrograms per cubic meter for the single model; for the NO2 model, the aRF exhibits 21.5 vs. 23.8; for SO2, the aAR has 17.3 vs. 17.4; for NO2, the aAR's RMSE is 22.7 vs. 27.5, respectively. Fractional bias is within the same limits of (−0.65, 0.7) for all constructed models. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
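The averaging strategy can be pictured with a small sketch: each recent day gets its own single model trained on the data available up to that day, every single model predicts the same future horizon from ex-ante predictor forecasts, and the final forecast is the average of the single-model predictions. The code below uses scikit-learn's random forest and synthetic weather/pollutant data purely for illustration; it is not the authors' implementation.

```python
# Simplified sketch of the averaged multi-step ex-ante forecast (aRF-style).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(3)
n_days, horizon = 400, 10
weather = rng.normal(size=(n_days + horizon, 3))           # temp, wind, humidity
pollutant = (30 + 5 * weather[:n_days, 0] - 4 * weather[:n_days, 1]
             + rng.normal(scale=2, size=n_days))            # e.g. a PM10-like series

ex_ante = weather[n_days:n_days + horizon]                  # forecasted predictors

single_forecasts = []
for d in range(n_days - 5, n_days):                         # five most recent "days"
    rf = RandomForestRegressor(n_estimators=200, random_state=d)
    rf.fit(weather[:d], pollutant[:d])                      # single model for day d
    single_forecasts.append(rf.predict(ex_ante))            # same 10-step horizon

averaged = np.mean(single_forecasts, axis=0)                # averaged forecast
print(np.round(averaged, 1))
```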

30 pages, 1432 KiB  
Article
Using Machine Learning in Predicting the Impact of Meteorological Parameters on Traffic Incidents
by Aleksandar Aleksić, Milan Ranđelović and Dragan Ranđelović
Mathematics 2023, 11(2), 479; https://doi.org/10.3390/math11020479 - 16 Jan 2023
Cited by 3 | Viewed by 2196
Abstract
The availability of large amounts of open, publicly accessible data is one of the main drivers of the development of the information society at the beginning of the 21st century. In this sense, acquiring knowledge from these data using different methods of machine learning is a prerequisite for solving complex problems in many spheres of human activity, from medicine to education and the economy, including traffic as an important economic branch today. With this in mind, this paper deals with the prediction of the risk of traffic incidents using both historical and real-time data for different atmospheric factors. The main goal is to construct an ensemble model, based on several machine learning algorithms, that has better predictive characteristics than any of its constituent models applied individually. In general, the proposed model could be a multi-agent system, but in the considered case study a two-agent system is used, so that one agent solves the prediction task by learning from the historical data and the other agent uses the real-time data. The authors evaluated the obtained model based on a case study and data for the city of Niš from the Republic of Serbia and also described its implementation as a practical web application for citizens. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
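In the spirit of the two-agent design described above, the following hedged sketch trains one classifier on "historical" records and another on "recent" records and averages their predicted incident probabilities. The algorithms, the synthetic meteorological features and the equal-weight combination rule are illustrative assumptions, not the ensemble construction used in the paper.

```python
# Hedged sketch of a two-model (historical + recent) incident-risk ensemble.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Synthetic "meteorological" features: temperature, precipitation, wind, ...
X, y = make_classification(n_samples=2000, n_features=6, n_informative=4,
                           random_state=5)
X_hist, y_hist = X[:1500], y[:1500]        # historical archive
X_recent, y_recent = X[1500:], y[1500:]    # recent / near real-time records

agent_hist = GradientBoostingClassifier(random_state=5).fit(X_hist, y_hist)
agent_live = LogisticRegression(max_iter=1000).fit(X_recent, y_recent)

def incident_risk(x_now):
    """Average of the two agents' probabilities of an incident (class 1)."""
    p1 = agent_hist.predict_proba(x_now)[:, 1]
    p2 = agent_live.predict_proba(x_now)[:, 1]
    return (p1 + p2) / 2

print(incident_risk(X[-3:]))
```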

20 pages, 14462 KiB  
Article
Surface Approximation by Means of Gaussian Process Latent Variable Models and Line Element Geometry
by Ivan De Boi, Carl Henrik Ek and Rudi Penne
Mathematics 2023, 11(2), 380; https://doi.org/10.3390/math11020380 - 11 Jan 2023
Cited by 1 | Viewed by 1923
Abstract
The close relation between spatial kinematics and line geometry has been proven to be fruitful in surface detection and reconstruction. However, methods based on this approach are limited to simple geometric shapes that can be formulated as a linear subspace of line or line element space. The core of this approach is a principal component formulation to find a best-fit approximant to a possibly noisy or partial surface given as an unordered set of points or point cloud. We expand on this by introducing the Gaussian process latent variable model, a probabilistic non-linear non-parametric dimensionality reduction approach following the Bayesian paradigm. This allows us to find structure in a lower-dimensional latent space for the surfaces of interest. We show how this can be applied in surface approximation and unsupervised segmentation to the surfaces mentioned above and demonstrate its benefits on surfaces that deviate from these. Experiments are conducted on synthetic and real-world objects. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)

18 pages, 545 KiB  
Article
Efficient Monte Carlo Methods for Multidimensional Modeling of Slot Machines Jackpot
by Slavi Georgiev and Venelin Todorov
Mathematics 2023, 11(2), 266; https://doi.org/10.3390/math11020266 - 4 Jan 2023
Cited by 1 | Viewed by 1907
Abstract
Nowadays, entertainment is one of the biggest industries, which continues to expand. In this study, the problem of estimating the consolation prize as a fraction of the jackpot is dealt with, which is an important issue for each casino and gambling club. Solving the problem leads to the computation of multidimensional integrals. For that purpose, modifications of the most powerful stochastic quasi-Monte Carlo approaches are employed, in particular lattice and digital sequences, Halton and Sobol sequences, and Latin hypercube sampling. They show significant improvements to the classical Monte Carlo methods. After accurate computation of the arisen integrals, it is shown how to calculate the expectation of the real consolation prize, taking into account the distribution of time, when different numbers of players are betting. Moreover, a solution to the problem with higher dimensions is also proposed. All the suggestions are verified by computational experiments with real data. Besides gambling, the results obtained in this study have various applications in numerous areas, including finance, ecology and many others. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
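The computational core of the study, estimating multidimensional integrals with quasi-Monte Carlo point sets, can be sketched in a few lines with scipy.stats.qmc. The toy integrand below stands in for the jackpot expectations treated in the paper; its exact value 2^(-d) makes it easy to compare a scrambled Sobol estimate with plain Monte Carlo.

```python
# Hedged sketch: quasi-Monte Carlo (Sobol) vs. plain Monte Carlo on a toy integral.
import numpy as np
from scipy.stats import qmc

def integrand(u):
    # Example integrand over [0,1]^d whose exact integral is 2^(-d).
    return np.prod(u, axis=1)

d, n = 6, 2 ** 12
exact = 2.0 ** -d

sobol = qmc.Sobol(d=d, scramble=True, seed=42)
u_qmc = sobol.random_base2(m=12)            # 2^12 low-discrepancy points
qmc_estimate = integrand(u_qmc).mean()

rng = np.random.default_rng(42)
u_mc = rng.random((n, d))                   # pseudo-random points
mc_estimate = integrand(u_mc).mean()

print(f"exact={exact:.6f}  QMC={qmc_estimate:.6f}  MC={mc_estimate:.6f}")
```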

16 pages, 3499 KiB  
Article
Approximation and Analysis of Natural Data Based on NARX Neural Networks Involving Wavelet Filtering
by Oksana Mandrikova, Yuryi Polozov, Nataly Zhukova and Yulia Shichkina
Mathematics 2022, 10(22), 4345; https://doi.org/10.3390/math10224345 - 19 Nov 2022
Cited by 3 | Viewed by 1300
Abstract
Recurrent neural network (RNN) models continue the theory of the autoregressive integrated moving average (ARIMA) model class. In this paper, we consider the architecture of the RNN with embedded memory—«Process of Nonlinear Autoregressive Exogenous Model» (NARX). Though it is known that NN is a universal approximator, certain difficulties and restrictions in different NN applications are still topical and call for new approaches and methods. In particular, it is difficult for an NN to model noisy and significantly nonstationary time series. The paper suggests optimizing the modeling process for a complicated-structure time series by NARX networks involving wavelet filtering. The developed procedure of wavelet filtering includes the construction of wavelet packets and the application of stochastic thresholds. A method to estimate the thresholds to obtain a solution with a defined confidence level is also developed. We introduce the algorithm of wavelet filtering. It is shown that the proposed wavelet filtering makes it possible to obtain a more accurate NARX model and improves the efficiency of the forecasting process for a natural time series of a complicated structure. Compared to ARIMA, the suggested method allows us to obtain a more adequate model of a nonstationary time series of complex nonlinear structure. The advantage of the method, compared to RNN, is the higher quality of data approximation for smaller computational effort at the stages of network training and functioning, which provides a solution to the problem of long-term dependencies. Moreover, we develop a scheme of approach realization for the task of data modeling based on NARX and anomaly detection. The necessity of anomaly detection arises in different application areas. Anomaly detection is of particular relevance in the problems of geophysical monitoring and requires method accuracy and efficiency. The effectiveness of the suggested method is illustrated by the example of processing ionospheric parameter time series. We also present the results for the problem of ionospheric anomaly detection. The approach can be applied in space weather forecasting to predict ionospheric parameters and to detect ionospheric anomalies. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
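A compressed sketch of the two-stage idea (wavelet filtering followed by a NARX-style model on lagged inputs) is shown below. PyWavelets and scikit-learn are assumed; the universal soft threshold replaces the paper's stochastic, confidence-level-based thresholds, and a small MLP on lagged target and exogenous values stands in for the NARX network.

```python
# Hedged sketch: soft wavelet thresholding, then a NARX-style lagged regression.
import numpy as np
import pywt
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(11)
t = np.arange(2048)
exog = np.sin(2 * np.pi * t / 128)                       # exogenous input
target = (0.8 * np.roll(exog, 5) + 0.2 * np.sin(2 * np.pi * t / 30)
          + 0.5 * rng.normal(size=t.size))               # noisy observed series

# 1) Wavelet filtering: decompose, soft-threshold detail coefficients, rebuild.
coeffs = pywt.wavedec(target, "db4", level=4)
sigma = np.median(np.abs(coeffs[-1])) / 0.6745           # noise scale estimate
thr = sigma * np.sqrt(2 * np.log(target.size))           # universal threshold
coeffs = [coeffs[0]] + [pywt.threshold(c, thr, mode="soft") for c in coeffs[1:]]
filtered = pywt.waverec(coeffs, "db4")[: target.size]

# 2) NARX-style regression: predict y(t) from lagged y and lagged exogenous x.
lags = 8
X = np.column_stack([np.roll(filtered, k) for k in range(1, lags + 1)] +
                    [np.roll(exog, k) for k in range(1, lags + 1)])[lags:]
y = filtered[lags:]
model = MLPRegressor(hidden_layer_sizes=(32,), max_iter=800, random_state=11)
model.fit(X[:-200], y[:-200])
rmse = np.sqrt(np.mean((model.predict(X[-200:]) - y[-200:]) ** 2))
print("hold-out RMSE:", rmse)
```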

20 pages, 2510 KiB  
Article
Jewel 2.0: An Improved Joint Estimation Method for Multiple Gaussian Graphical Models
by Claudia Angelini, Daniela De Canditiis and Anna Plaksienko
Mathematics 2022, 10(21), 3983; https://doi.org/10.3390/math10213983 - 26 Oct 2022
Cited by 1 | Viewed by 1674
Abstract
In this paper, we consider the problem of estimating the graphs of conditional dependencies between variables (i.e., graphical models) from multiple datasets under Gaussian settings. We present jewel 2.0, which improves our previous method jewel 1.0 by modeling commonality and class-specific differences in the graph structures and better estimating graphs with hubs, making this new approach more appealing for biological data applications. We introduce these two improvements by modifying the regression-based problem formulation and the corresponding minimization algorithm. We also present, for the first time in the multiple graphs setting, a stability selection procedure to reduce the number of false positives in the estimated graphs. Finally, we illustrate the performance of jewel 2.0 through simulated and real data examples. The method is implemented in the new version of the R package jewel. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
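jewel 2.0 itself is distributed as an R package; as a loosely related, single-dataset point of comparison, the sketch below estimates one sparse Gaussian graphical model with scikit-learn's graphical lasso. It does not perform the joint multi-dataset estimation, hub modeling or stability selection introduced in the paper.

```python
# Hedged sketch: single-dataset graphical lasso baseline (not the jewel estimator).
import numpy as np
from sklearn.covariance import GraphicalLassoCV
from sklearn.datasets import make_sparse_spd_matrix

rng = np.random.default_rng(2)
precision = make_sparse_spd_matrix(15, alpha=0.9, random_state=2)  # sparse truth
cov = np.linalg.inv(precision)
X = rng.multivariate_normal(np.zeros(15), cov, size=500)

model = GraphicalLassoCV().fit(X)
est_precision = model.precision_

# Nonzero off-diagonal entries of the precision matrix = edges of the graph.
edges = np.argwhere(np.triu(np.abs(est_precision) > 1e-3, k=1))
print("estimated edges:", [tuple(e) for e in edges])
```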

28 pages, 7029 KiB  
Article
The FEDHC Bayesian Network Learning Algorithm
by Michail Tsagris
Mathematics 2022, 10(15), 2604; https://doi.org/10.3390/math10152604 - 26 Jul 2022
Cited by 2 | Viewed by 1371
Abstract
The paper proposes a new hybrid Bayesian network learning algorithm, termed Forward Early Dropping Hill Climbing (FEDHC), devised to work with either continuous or categorical variables. Further, the paper shows that the only implementation of MMHC in the statistical software R is prohibitively expensive, and a new implementation is offered. Moreover, specifically for the case of continuous data, a robust-to-outliers version of FEDHC, which can be adopted by other BN learning algorithms, is proposed. The FEDHC is tested via Monte Carlo simulations that distinctly show that it is computationally efficient, and that it produces Bayesian networks of similar or higher accuracy than MMHC and PCHC. Finally, an application of the FEDHC, PCHC and MMHC algorithms to real data, from the field of economics, is demonstrated using the statistical software R. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)

17 pages, 661 KiB  
Article
Robust Variable Selection Based on Penalized Composite Quantile Regression for High-Dimensional Single-Index Models
by Yunquan Song, Zitong Li and Minglu Fang
Mathematics 2022, 10(12), 2000; https://doi.org/10.3390/math10122000 - 10 Jun 2022
Cited by 4 | Viewed by 1346
Abstract
The single-index model is an intuitive extension of the linear regression model. It has been increasingly popular due to its flexibility in modeling. In this work, we focus on the estimators of the parameters and the unknown link function for the single-index model in a high-dimensional situation. The SCAD and Laplace error penalty (LEP)-based penalized composite quantile regression estimators, which could realize variable selection and estimation simultaneously, are proposed; a practical iterative algorithm is introduced to obtain the efficient and robust estimators. The choices of the tuning parameters, the bandwidth, and the initial values are also discussed. Furthermore, under some mild conditions, we show the large sample properties and oracle property of the SCAD and Laplace penalized composite quantile regression estimators. Finally, we evaluated the performances of the proposed estimators by two numerical simulations and a real data application. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
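As a very simplified, linear-model illustration of penalized composite quantile regression, the sketch below fits L1-penalized quantile regressions at several quantile levels with scikit-learn's QuantileRegressor and combines their slope estimates. The SCAD and Laplace-error penalties, the shared-coefficient composite objective and the single-index link estimation from the paper are not reproduced here.

```python
# Hedged sketch: L1-penalized quantile fits at several levels, combined slopes.
import numpy as np
from sklearn.linear_model import QuantileRegressor

rng = np.random.default_rng(4)
n, p = 400, 10
X = rng.normal(size=(n, p))
beta = np.array([2.0, -1.5, 1.0] + [0.0] * (p - 3))        # sparse true slopes
y = X @ beta + rng.standard_t(df=3, size=n)                 # heavy-tailed noise

taus = [0.1, 0.3, 0.5, 0.7, 0.9]
slopes = []
for tau in taus:
    qr = QuantileRegressor(quantile=tau, alpha=0.05, solver="highs")
    slopes.append(qr.fit(X, y).coef_)

composite_beta = np.mean(slopes, axis=0)                    # combined estimate
selected = np.where(np.abs(composite_beta) > 0.1)[0]        # crude variable selection
print("selected variables:", selected)
print("slope estimates:", composite_beta.round(2))
```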

23 pages, 2560 KiB  
Article
Anomaly Detection in the Internet of Vehicular Networks Using Explainable Neural Networks (xNN)
by Saddam Aziz, Muhammad Talib Faiz, Adegoke Muideen Adeniyi, Ka-Hong Loo, Kazi Nazmul Hasan, Linli Xu and Muhammad Irshad
Mathematics 2022, 10(8), 1267; https://doi.org/10.3390/math10081267 - 11 Apr 2022
Cited by 15 | Viewed by 3574
Abstract
It is increasingly difficult to identify complex cyberattacks in a wide range of industries, such as the Internet of Vehicles (IoV). The IoV is a network of vehicles that consists of sensors, actuators, network layers, and communication systems between vehicles. Communication plays an important role as an essential part of the IoV. Vehicles in a network share and deliver information based on several protocols. Due to wireless communication between vehicles, the whole network can be sensitive to cyberattacks. In these attacks, sensitive information can be shared with a malicious network or a bogus user, resulting in malicious attacks on the IoV. For the last few years, detecting attacks in the IoV has been a challenging task. It is becoming increasingly difficult for traditional Intrusion Detection Systems (IDS) to detect these newer, more sophisticated attacks, which employ unusual patterns. Attackers disguise themselves as typical users to evade detection. These problems can be solved using deep learning. Many machine-learning and deep-learning (DL) models have been implemented to detect malicious attacks; however, feature selection remains a core issue. Through the use of training empirical data, DL independently defines intrusion features. We built a DL-based intrusion model that focuses on Denial of Service (DoS) assaults in particular. We used K-Means clustering for feature scoring and ranking. After extracting the best features for anomaly detection, we applied a novel model, i.e., an Explainable Neural Network (xNN), to classify attacks in the CICIDS2019 dataset and UNSW-NB15 dataset separately. The model performed well regarding the precision, recall, F1 score, and accuracy. Comparatively, it can be seen that our proposed model xNN performed well after the feature-scoring technique. In dataset 1 (UNSW-NB15), xNN performed well, with the highest accuracy of 99.7%, while CNN scored 87%, LSTM scored 90%, and the Deep Neural Network (DNN) scored 92%. xNN achieved the highest accuracy of 99.3% while classifying attacks in the second dataset (CICIDS2019); the Convolutional Neural Network (CNN) achieved 87%, Long Short-Term Memory (LSTM) achieved 89%, and the DNN achieved 82%. The suggested solution outperformed the existing systems in terms of the detection and classification accuracy. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
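One plausible reading of the K-Means feature-scoring step, offered here only as a hedged sketch, is to cluster the records, rank the features by how strongly they separate the resulting clusters, and then train a neural classifier on the top-ranked features. The ANOVA F-score ranking and the generic MLP below are assumptions, not the xNN architecture from the paper.

```python
# Hedged sketch: K-Means-driven feature scoring followed by a neural classifier.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.feature_selection import f_classif
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

X, y = make_classification(n_samples=5000, n_features=40, n_informative=10,
                           random_state=9)
X = StandardScaler().fit_transform(X)

clusters = KMeans(n_clusters=4, n_init=10, random_state=9).fit_predict(X)
scores, _ = f_classif(X, clusters)               # feature vs. cluster separation
top = np.argsort(scores)[::-1][:15]              # keep the 15 best-scoring features

X_tr, X_te, y_tr, y_te = train_test_split(X[:, top], y, test_size=0.25,
                                          random_state=9)
clf = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=400, random_state=9)
clf.fit(X_tr, y_tr)
print("accuracy on held-out data:", clf.score(X_te, y_te))
```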

20 pages, 2747 KiB  
Article
Predicting the 305-Day Milk Yield of Holstein-Friesian Cows Depending on the Conformation Traits and Farm Using Simplified Selective Ensembles
by Snezhana Gocheva-Ilieva, Antoaneta Yordanova and Hristina Kulina
Mathematics 2022, 10(8), 1254; https://doi.org/10.3390/math10081254 - 11 Apr 2022
Cited by 6 | Viewed by 2006
Abstract
In animal husbandry, it is of great interest to determine and control the key factors that affect the production characteristics of animals, such as milk yield. In this study, simplified selective tree-based ensembles were used for modeling and forecasting the 305-day average milk yield of Holstein-Friesian cows, depending on 12 external traits and the farm as an environmental factor. The preprocessing of the initial independent variables included their transformation into rotated principal components. The resulting dataset was divided into learning (75%) and holdout test (25%) subsamples. Initially, three diverse base models were generated using Classification and Regression Trees (CART) ensembles and bagging and arcing algorithms. These models were processed using the developed simplified selective algorithm based on the index of agreement. An average reduction of 30% in the number of trees in the selective ensembles was obtained. Finally, by separately stacking the predictions from the non-selective and selective base models, two linear hybrid models were built. The hybrid model of the selective ensembles showed a 13.6% reduction in the test set prediction error compared to the hybrid model of the non-selective ensembles. The identified key factors determining milk yield include the farm, udder width, chest width, and stature of the animals. The proposed approach can be applied to improve the management of dairy farms. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
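The overall pipeline (PCA rotation of the predictors, bagging and boosting CART ensembles, and a linear stacking of their predictions) can be sketched as follows. AdaBoost stands in for the arcing (Arc-x4) algorithm, the data are synthetic, and the simplified selective pruning of ensemble members based on the index of agreement is not reproduced.

```python
# Hedged sketch: PCA-rotated predictors, two CART ensembles, linear stacking.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=600, n_features=12, noise=15, random_state=6)
X = PCA(n_components=12).fit_transform(X)             # rotated components
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=6)

bag = BaggingRegressor(DecisionTreeRegressor(max_depth=6),
                       n_estimators=100, random_state=6).fit(X_tr, y_tr)
boost = AdaBoostRegressor(DecisionTreeRegressor(max_depth=4),
                          n_estimators=100, random_state=6).fit(X_tr, y_tr)

# Stack the two base ensembles with a linear meta-model. (For brevity the
# meta-model is fit on in-sample predictions; out-of-fold predictions would
# be preferable in practice.)
meta_train = np.column_stack([bag.predict(X_tr), boost.predict(X_tr)])
meta = LinearRegression().fit(meta_train, y_tr)
meta_test = np.column_stack([bag.predict(X_te), boost.predict(X_te)])
rmse = np.sqrt(np.mean((meta.predict(meta_test) - y_te) ** 2))
print("hold-out RMSE of the stacked model:", round(rmse, 2))
```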

20 pages, 4930 KiB  
Article
Forecasting of Electrical Energy Consumption in Slovakia
by Michal Pavlicko, Mária Vojteková and Oľga Blažeková
Mathematics 2022, 10(4), 577; https://doi.org/10.3390/math10040577 - 12 Feb 2022
Cited by 21 | Viewed by 3174
Abstract
Prediction of electricity energy consumption plays a crucial role in the electric power industry. Accurate forecasting is essential for electricity supply policies. A characteristic feature of electrical energy is the need to ensure a constant balance between consumption and electricity production, whereas electricity cannot be stored in significant quantities, nor is it easy to transport. Electricity consumption generally has a stochastic behavior that makes it hard to predict. The main goal of this study is to propose the forecasting models to predict the maximum hourly electricity consumption per day that is more accurate than the official load prediction of the Slovak Distribution Company. Different models are proposed and compared. The first model group is based on the transverse set of Grey models and Nonlinear Grey Bernoulli models and the second approach is based on a multi-layer feed-forward back-propagation network. Moreover, a new potential hybrid model combining these different approaches is used to forecast the maximum hourly electricity consumption per day. Various performance metrics are adopted to evaluate the performance and effectiveness of models. All the proposed models achieved more accurate predictions than the official load prediction, while the hybrid model offered the best results according to performance metrics and supported the legitimacy of this research. Full article
(This article belongs to the Special Issue Statistical Data Modeling and Machine Learning with Applications II)
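Of the model families compared in the paper, the grey model GM(1,1) is compact enough to sketch directly. The function below follows the standard GM(1,1) formulation (accumulation, least-squares fit of the development coefficient a and grey input b, restoration by differencing) on an invented consumption series; it says nothing about the transverse-set, neural-network or hybrid constructions used by the authors.

```python
# Hedged sketch: a standard GM(1,1) grey model forecast on invented data.
import numpy as np

def gm11_forecast(x0, steps_ahead):
    x1 = np.cumsum(x0)                                  # accumulated series
    z1 = 0.5 * (x1[1:] + x1[:-1])                       # background values
    B = np.column_stack([-z1, np.ones_like(z1)])
    a, b = np.linalg.lstsq(B, x0[1:], rcond=None)[0]    # development / grey coefficients
    k = np.arange(len(x0) + steps_ahead)
    x1_hat = (x0[0] - b / a) * np.exp(-a * k) + b / a   # accumulated prediction
    x0_hat = np.diff(x1_hat, prepend=x1_hat[0])         # restore the original scale
    x0_hat[0] = x0[0]
    return x0_hat[len(x0):]

# Synthetic daily maxima of hourly consumption (MW), purely illustrative.
history = np.array([655, 661, 668, 672, 681, 690, 694, 701, 708, 715], float)
print("next 3 forecasts:", np.round(gm11_forecast(history, 3), 1))
```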
