Data Analysis and Mining

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: closed (20 November 2022) | Viewed by 35064

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Guest Editor
Department of Digital Systems, University of the Peloponnese, 23100 Kladas, Sparta, Greece
Interests: data mining; machine learning; data reduction; data streams; algorithms and data structures; web

Guest Editor
Department of Digital Systems, University of the Peloponnese, 23100 Kladas, Sparta, Greece
Interests: software; personalization; business processes; web; social networks

Special Issue Information

Dear Colleagues,

Data analysis and mining are now used in numerous everyday tasks to solve practical problems, and the field has attracted the interest of both academia and industry. The research community has contributed algorithms, techniques and tools for the prediction of future situations, the discovery of clusters of similar data, association rule mining, pattern recognition, and more, all of which are finding applications in domains such as medicine, finance, business, biology, marketing, and education. This Special Issue seeks papers that present new data mining algorithms and techniques, as well as applications of data analysis and mining in real-world domains. Papers that present data mining software tools are also welcome.

Dr. Stefanos Ougiaroglou
Dr. Dionisis Margaris
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data and web mining
  • data analytics
  • machine learning
  • pattern recognition
  • data streams
  • data reduction
  • recommender systems
  • association rules
  • time series
  • data preprocessing
  • data cleaning
  • feature selection and extraction
  • multilabel classification
  • neural networks
  • data visualization
  • tools for data analysis and mining
  • applications of data analysis and mining

Published Papers (17 papers)

Research

20 pages, 1709 KiB  
Article
A Data-Science Approach for Creation of a Comprehensive Model to Assess the Impact of Mobile Technologies on Humans
by Magdalena Garvanova, Ivan Garvanov, Vladimir Jotsov, Abdul Razaque, Bandar Alotaibi, Munif Alotaibi and Daniela Borissova
Appl. Sci. 2023, 13(6), 3600; https://doi.org/10.3390/app13063600 - 11 Mar 2023
Cited by 2 | Viewed by 1391
Abstract
Mobile technologies are an essential part of people’s everyday lives since they are utilized for a variety of purposes, such as communication, entertainment, commerce, and education. However, when these gadgets are misused, the human body is exposed to continuous radiation from the electromagnetic field created by them. The communication services available are improving as mobile technologies advance; however, the problem is becoming more severe as the frequency range of mobile devices expands. To solve this complex case, it is necessary to propose a comprehensive approach that combines and processes data obtained from different types of research and sources of information, such as thermal imaging, electroencephalograms, computer models, and surveys. In the present article, a complex model for the processing and analysis of heterogeneous data is proposed based on mathematical and statistical methods in order to study the problem of electromagnetic radiation from mobile devices in depth. Data science selection/preprocessing is one of the most important aspects of data and knowledge processing, aiming at successful and effective analysis and data fusion from many sources. Special types of logic-based binding and pointing constraints are considered for data/knowledge selection applications. The proposed logic-based statistical modeling method provides both algorithmic and data-driven realizations that can be evolutionary. As a result, non-anticipated and collateral data/features can be processed if their role in the selected/constrained area is significant. In this research, the data-driven part does not use artificial neural networks; however, this combination was successfully applied in the past. It is an independent subsystem maintaining control of both the statistical and machine-learning parts. The proposed modeling applies to a wide range of reasoning/smart systems.
(This article belongs to the Special Issue Data Analysis and Mining)

23 pages, 2315 KiB  
Article
A Flexible Session-Based Recommender System for e-Commerce
by Michail Salampasis, Alkiviadis Katsalis, Theodosios Siomos, Marina Delianidi, Dimitrios Tektonidis, Konstantinos Christantonis, Pantelis Kaplanoglou, Ifigeneia Karaveli, Chrysostomos Bourlis and Konstantinos Diamantaras
Appl. Sci. 2023, 13(5), 3347; https://doi.org/10.3390/app13053347 - 06 Mar 2023
Cited by 3 | Viewed by 2806
Abstract
Research into session-based recommendation systems (SBRS) has attracted a lot of attention, but each study focuses on a specific class of methods. This work examines and evaluates a large range of methods, from simpler statistical co-occurrence methods to embeddings and SotA deep learning methods. This paper analyzes theoretical and practical issues in developing and evaluating methods for SBRS in e-commerce applications, where user profiles and purchase data do not exist. The major tasks of SBRS are reviewed and studied, namely: prediction of next-item, next-basket and purchase intent. For physical retail shopping where no information about the current session exists, we treat the previous baskets purchased by the user as previous sessions drawn from a loyalty system. Mobile application scenarios such as push notifications and calling tune recommendations are also presented. Recommender models using graphs, embeddings and deep learning methods are studied and evaluated in all SBRS tasks using different datasets. Our work contributes a number of very interesting findings. Among all tested models, LSTMs consistently outperform other methods of SBRS in all tasks. They can be applied directly because they do not need significant fine-tuning. Additionally, they naturally model the dynamic browsing that happens in e-commerce web applications. On the other hand, another important finding of our work is that graph-based methods can be a good compromise between effectiveness and efficiency. Another important conclusion is that a “temporal locality principle” holds, implying that more recent behavior is better suited for prediction. In order to evaluate these systems further in realistic environments, several session-based recommender methods were integrated into an e-shop and an A/B testing method was applied. The results of this A/B testing are in line with the experimental results, which represents another important contribution of this paper. Finally, important parameters such as efficiency, application of business rules, re-ranking issues, and the utilization of hybrid methods are also considered and tested, providing comprehensive useful insights into SBRS and facilitating the transferability of this research work to other domains and recommendation scenarios.

16 pages, 393 KiB  
Article
Session-Based Recommendations for e-Commerce with Graph-Based Data Modeling
by Marina Delianidi, Konstantinos Diamantaras, Dimitrios Tektonidis and Michail Salampasis
Appl. Sci. 2023, 13(1), 394; https://doi.org/10.3390/app13010394 - 28 Dec 2022
Cited by 2 | Viewed by 2040
Abstract
Conventional recommendation methods such as collaborative filtering cannot be applied when long-term user models are not available. In this paper, we propose two session-based recommendation methods for anonymous browsing in a generic e-commerce framework. We represent the data using a graph where items are connected to sessions and to each other based on the order of appearance or their co-occurrence. In the first approach, called Hierarchical Sequence Probability (HSP), recommendations are produced using the probabilities of items’ appearances on certain structures in the graph. Specifically, given a current item during a session, to create a list of recommended next items, we first compute the probabilities of all possible sequential triplets ending in each candidate’s next item, then of all candidate item pairs, and finally of the proposed item. In our second method, called Recurrent Item Co-occurrence (RIC), we generate the recommendation list based on a weighted score produced by a linear recurrent mechanism using the co-occurrence probabilities between the current item and all items. We compared our approaches with three state-of-the-art Graph Neural Network (GNN) models using four session-based datasets, one of which contains data collected by us from a leather apparel e-shop. In terms of recommendation effectiveness, our methods compete favorably on a number of datasets, while the time to generate the graph and produce the recommendations is significantly lower.
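
The idea of recommending next items from co-occurrence probabilities can be illustrated with a minimal sketch. This is a plain first-order baseline, not the HSP or RIC method described in the abstract: it only estimates how often one item directly follows another. The function names and the toy sessions are hypothetical.

```python
from collections import Counter, defaultdict


def build_transition_counts(sessions):
    """Count how often item b directly follows item a across all sessions."""
    counts = defaultdict(Counter)
    for session in sessions:
        for current_item, next_item in zip(session, session[1:]):
            counts[current_item][next_item] += 1
    return counts


def recommend_next(counts, current_item, k=3):
    """Return up to k (item, probability) pairs, most probable first."""
    followers = counts.get(current_item)
    if not followers:
        return []  # item never seen with a successor
    total = sum(followers.values())
    # Sort by descending count, then by name so ties are deterministic.
    ranked = sorted(followers.items(), key=lambda pair: (-pair[1], pair[0]))
    return [(item, n / total) for item, n in ranked[:k]]


# Hypothetical anonymous browsing sessions (ordered lists of viewed items).
sessions = [
    ["jacket", "belt", "wallet"],
    ["jacket", "belt", "gloves"],
    ["jacket", "gloves"],
    ["belt", "wallet"],
]
counts = build_transition_counts(sessions)
recommendations = recommend_next(counts, "jacket", k=2)
```

With these toy sessions, "belt" follows "jacket" in two of three cases, so it is ranked first. A graph-based method would typically look at longer paths than a single transition.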

12 pages, 308 KiB  
Article
Exploiting Domain Knowledge to Address Class Imbalance in Meteorological Data Mining
by Evangelos Tsagalidis and Georgios Evangelidis
Appl. Sci. 2022, 12(23), 12402; https://doi.org/10.3390/app122312402 - 04 Dec 2022
Viewed by 755
Abstract
We deal with the problem of class imbalance in data mining and machine learning classification algorithms. This is the case where some of the class labels are represented by a small number of examples in the training dataset compared to the rest of the class labels. Usually, those minority class labels are the most important ones, implying that classifiers should primarily perform well on predicting those labels. This is a well-studied problem, and various strategies based on sampling methods are used to balance the representation of the labels in the training dataset and improve classifier performance. We explore whether expert knowledge in the field of Meteorology can enhance the quality of the training dataset when treated by pre-processing sampling strategies. We propose four new sampling strategies based on our expertise in the data domain and we compare their effectiveness against the established sampling strategies used in the literature. It turns out that our sampling strategies, which take advantage of expert knowledge from the data domain, achieve class balancing that improves the performance of most classifiers.
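
For readers unfamiliar with sampling-based balancing, the following is a minimal sketch of random oversampling, one of the established strategies in the literature. It is not one of the four domain-knowledge strategies proposed in the paper; the function name, the toy data, and the class labels are hypothetical.

```python
import random
from collections import Counter


def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen minority examples until all classes match
    the size of the largest class."""
    rng = random.Random(seed)
    by_label = {}
    for sample, label in zip(samples, labels):
        by_label.setdefault(label, []).append(sample)
    target = max(len(group) for group in by_label.values())

    out_samples, out_labels = [], []
    for label, group in by_label.items():
        # Draw with replacement to fill the gap up to the target size.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for sample in group + extra:
            out_samples.append(sample)
            out_labels.append(label)
    return out_samples, out_labels


# Hypothetical toy data: "fog" is the minority class.
samples = [[0.1], [0.2], [0.3], [0.4], [0.9]]
labels = ["clear", "clear", "clear", "clear", "fog"]
balanced_samples, balanced_labels = random_oversample(samples, labels)
class_counts = Counter(balanced_labels)
```

After the call, both classes have four examples. Oversampling should be applied to the training split only, so that duplicated examples do not leak into the evaluation data.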

22 pages, 2355 KiB  
Article
Similarity Calculation via Passage-Level Event Connection Graph
by Ming Liu, Lei Chen and Zihao Zheng
Appl. Sci. 2022, 12(19), 9887; https://doi.org/10.3390/app12199887 - 01 Oct 2022
Cited by 1 | Viewed by 1397
Abstract
Recently, many information processing applications have appeared on the web in response to user requirements. Since text is one of the most popular data formats across the web, measuring text similarity has become a key challenge for many web applications. Web text is often used to record events, especially in news. One text often mentions multiple events, while only the core event decides its main topic. This core event should take an important position when measuring text similarity. For this reason, this paper constructs a passage-level event connection graph to model the relations among events mentioned in one text. This graph is composed of many subgraphs formed by triggers and arguments extracted sentence by sentence. The subgraphs are connected via the overlapping arguments. In terms of centrality measurement, the core event can be revealed from the graph and utilized to measure text similarity. Moreover, two improvements based on vector tuning are provided to better model the relations among events. One is to find the triggers which are semantically similar. By linking them in the event connection graph, the graph can cover the relations among events more comprehensively. The other is to apply graph embedding to integrate the global information carried by the entire event connection graph into the core event to let text similarity be partially guided by the full-text content. As shown by experimental results, after measuring text similarity from a passage-level event representation perspective, our calculation achieves results superior to those of unsupervised methods and even comparable to those of some supervised neuron-based methods. In addition, our calculation is unsupervised and can be applied in many domains without the preparation of training data.

15 pages, 2011 KiB  
Article
Neural Networks for Early Diagnosis of Postpartum PTSD in Women after Cesarean Section
by Christos Orovas, Eirini Orovou, Maria Dagla, Alexandros Daponte, Nikolaos Rigas, Stefanos Ougiaroglou, Georgios Iatrakis and Evangelia Antoniou
Appl. Sci. 2022, 12(15), 7492; https://doi.org/10.3390/app12157492 - 26 Jul 2022
Cited by 5 | Viewed by 1298
Abstract
The correlation between the kind of cesarean section and post-traumatic stress disorder (PTSD) in Greek women after a traumatic birth experience has been recognized in previous studies along with other risk factors, such as perinatal conditions and traumatic life events. Data from early studies have suggested some possible links between some vulnerable factors and the potential development of postpartum PTSD. The classification of each case in three possible states (PTSD, profile PTSD, and free of symptoms) is typically performed using the guidelines and the metrics of version V of the Diagnostic and Statistical Manual of Mental Disorders (DSM-V), which requires the completion of several questionnaires during the postpartum period. The motivation in the present work is the need for a model that can detect possible PTSD cases using a minimum amount of information and produce an early diagnosis. The early PTSD diagnosis is critical since it allows the medical personnel to take the proper measures as soon as possible. Our sample consists of 469 women who underwent emergent or elective cesarean delivery in a university hospital in Greece. The methodology followed is the application of random decision forests (RDF) to detect the most suitable and easily accessible information, which is then used by an artificial neural network (ANN) for the classification. As the results demonstrate, the derived decision model can reach high levels of accuracy even when only partial and quickly available information is provided.

15 pages, 4355 KiB  
Article
Chicken Swarm-Based Feature Subset Selection with Optimal Machine Learning Enabled Data Mining Approach
by Monia Hamdi, Inès Hilali-Jaghdam, Manal M. Khayyat, Bushra M. E. Elnaim, Sayed Abdel-Khalek and Romany F. Mansour
Appl. Sci. 2022, 12(13), 6787; https://doi.org/10.3390/app12136787 - 04 Jul 2022
Cited by 5 | Viewed by 1413
Abstract
Data mining (DM) involves the process of identifying patterns, correlations, and anomalies existing in massive datasets. The applicability of DM includes several areas such as education, healthcare, business, and finance. Educational Data Mining (EDM) is an interdisciplinary domain which focuses on the applicability of DM, machine learning (ML), and statistical approaches for pattern recognition in massive quantities of educational data. This type of data suffers from the curse of dimensionality. Thus, feature selection (FS) approaches become essential. This study designs a Feature Subset Selection with an optimal machine learning model for Educational Data Mining (FSSML-EDM). The proposed method involves three major processes. At the initial stage, the presented FSSML-EDM model uses the Chicken Swarm Optimization-based Feature Selection (CSO-FS) technique for selecting feature subsets. Next, an extreme learning machine (ELM) classifier is employed for the classification of educational data. Finally, the Artificial Hummingbird (AHB) algorithm is utilized for adjusting the parameters involved in the ELM model. The performance study revealed that the FSSML-EDM model achieves better results compared with other models under several dimensions.

22 pages, 1806 KiB  
Article
A High-Level Representation of the Navigation Behavior of Website Visitors
by Alicia Huidobro, Raúl Monroy and Bárbara Cervantes
Appl. Sci. 2022, 12(13), 6711; https://doi.org/10.3390/app12136711 - 02 Jul 2022
Cited by 2 | Viewed by 1385
Abstract
Knowing how visitors navigate a website can lead to different applications, for example, providing a personalized navigation experience or identifying website failures. In this paper, we present a method for representing the navigation behavior of an entire class of website visitors in a moderately small graph, aiming to ease the task of web analysis, especially in marketing areas. Current solutions are mainly oriented to a detailed page-by-page analysis. Thus, obtaining a high-level abstraction of an entire class of visitors may involve the analysis of large amounts of data and become an overwhelming task. Our approach extracts the navigation behavior that is common among a certain class of visitors to create a graph that summarizes class navigation behavior and enables a contrast of classes. The method works by representing website sessions as the sequence of visited pages. Sub-sequences of visited pages of common occurrence are identified as “rules”. Then, we replace those rules with a symbol that is given a representative name and use it to obtain a shrunken representation of a session. Finally, this shrunken representation is used to create a graph of the navigation behavior of a visitor class (group of visitors relevant to the desired analysis). Our results show that a few rules are enough to capture a visitor class. Since each class is associated with a conversion, a marketing expert can easily find out what makes classes different.
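
The step of replacing frequent page sub-sequences ("rules") with named symbols can be sketched as follows. This is only an illustration under stated assumptions: the rules are given by hand rather than mined, longer rules take precedence at each position, and all page names, symbols, and function names are hypothetical rather than taken from the paper.

```python
from collections import Counter


def compress_session(session, rules):
    """Replace each occurrence of a rule's page sub-sequence with its symbol.

    `rules` maps a tuple of pages to a symbol; longer rules are tried first.
    """
    ordered = sorted(rules.items(), key=lambda kv: -len(kv[0]))
    compressed, i = [], 0
    while i < len(session):
        for pattern, symbol in ordered:
            if tuple(session[i:i + len(pattern)]) == pattern:
                compressed.append(symbol)
                i += len(pattern)
                break
        else:
            # No rule matches at this position: keep the page as-is.
            compressed.append(session[i])
            i += 1
    return compressed


# Hypothetical rules and session.
rules = {
    ("home", "search", "product"): "BROWSE",
    ("cart", "checkout"): "PURCHASE",
}
session = ["home", "search", "product", "reviews", "cart", "checkout"]
compressed = compress_session(session, rules)

# Edges of a navigation graph: consecutive pairs in the compressed session.
edges = Counter(zip(compressed, compressed[1:]))
```

Here the six-page session shrinks to three nodes, and the edge counts accumulated over many sessions of one visitor class would form the weighted graph for that class.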

20 pages, 6931 KiB  
Article
A Deep Neural Network Technique for Detecting Real-Time Drifted Twitter Spam
by Amira Abdelwahab and Mohamed Mostafa
Appl. Sci. 2022, 12(13), 6407; https://doi.org/10.3390/app12136407 - 23 Jun 2022
Cited by 2 | Viewed by 1689
Abstract
The social network is considered a part of most users’ lives as it contains more than a billion users, which makes it a source for spammers to spread their harmful activities. Most of the recent research focuses on detecting spammers using statistical features. However, such statistical features change over time, and spammers can defeat all detection systems by changing their behavior and using text paraphrasing. Therefore, we propose a novel technique for spam detection using a deep neural network. We combine the tweet-level detection with statistical feature detection and group their results over a meta-classifier to build a robust technique. Moreover, we embed our technique with initial text paraphrasing for each detected tweet spam. We train our model using different datasets: random, continuous, balanced, and imbalanced. The obtained experimental results showed that our model has promising results in terms of accuracy, precision, and time, which makes it applicable for use in social networks.

19 pages, 1308 KiB  
Article
Natural Time Series Parameters Forecasting: Validation of the Pattern-Sequence-Based Forecasting (PSF) Algorithm; A New Python Package
by Mayur Kishor Shende, Sinan Q. Salih, Neeraj Dhanraj Bokde, Miklas Scholz, Atheer Y. Oudah and Zaher Mundher Yaseen
Appl. Sci. 2022, 12(12), 6194; https://doi.org/10.3390/app12126194 - 17 Jun 2022
Cited by 5 | Viewed by 2195
Abstract
Climate change has contributed substantially to the weather and land characteristic phenomena. Accurate time series forecasting for climate and land parameters is highly essential in the modern era for climatologists. This paper provides a brief introduction to the pattern-sequence-based forecasting (PSF) algorithm and its implementation in Python. The PSF algorithm aims to forecast future values of a univariate time series. The algorithm is divided into two major processes: the clustering of data and prediction. The clustering part includes the selection of an optimum value for the number of clusters and labeling the time series data. The prediction part consists of the selection of a window size and the prediction of future values with reference to past patterns. The package aims to ease the use and implementation of PSF for Python users. It provides results similar to the PSF package available in R. Finally, the results of the proposed Python package are compared with results of the PSF and ARIMA methods in R. One of the issues with PSF is that forecasting performance degrades if the time series has positive or negative trends. To overcome this problem, difference pattern-sequence-based forecasting (DPSF) was proposed. The Python package also implements the DPSF method. In this method, the time series data are first differenced. Then, the PSF algorithm is applied to this differenced time series. Finally, the original and predicted values are restored by applying the reverse method of the differencing process. The proposed methodology is tested on several complex climate and land processes and its potential is evidenced.
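
The difference, forecast, restore sequence described for DPSF can be sketched in a few lines. The sketch below does not use the package presented in the paper and does not implement PSF itself: `mean_forecaster` is a hypothetical placeholder standing in for the forecasting step, and all function names are assumptions made for illustration.

```python
def difference(series):
    """First-order differencing: d[t] = x[t+1] - x[t]."""
    return [b - a for a, b in zip(series, series[1:])]


def restore(last_value, predicted_differences):
    """Invert differencing by cumulatively adding predicted differences
    to the last observed value."""
    restored, level = [], last_value
    for delta in predicted_differences:
        level += delta
        restored.append(level)
    return restored


def mean_forecaster(series, horizon):
    """Placeholder forecaster (NOT PSF): repeats the mean of its input."""
    mean = sum(series) / len(series)
    return [mean] * horizon


def forecast_with_differencing(series, horizon, forecaster=mean_forecaster):
    """Difference the series, forecast the differences, restore the scale."""
    diffs = difference(series)
    predicted_diffs = forecaster(diffs, horizon)
    return restore(series[-1], predicted_diffs)


# Hypothetical series with a steady upward trend.
series = [10.0, 12.0, 14.0, 16.0, 18.0]
forecast = forecast_with_differencing(series, horizon=3)
```

Because the differenced series is constant, the restored forecast continues the trend (20.0, 22.0, 24.0), whereas applying the same placeholder directly to the raw values would return their mean and ignore the trend. This is the effect that motivates differencing before pattern matching.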

17 pages, 4516 KiB  
Article
Identification of Mobility Patterns of Clusters of City Visitors: An Application of Artificial Intelligence Techniques to Social Media Data
by Jonathan Ayebakuro Orama, Assumpció Huertas, Joan Borràs, Antonio Moreno and Salvador Anton Clavé
Appl. Sci. 2022, 12(12), 5834; https://doi.org/10.3390/app12125834 - 08 Jun 2022
Cited by 5 | Viewed by 2314
Abstract
In order to enhance tourists’ experiences, Destination Management Organizations (DMOs) need to know who their tourists are, their travel preferences, and their flows around the destination. The study develops a methodology that, through the application of Artificial Intelligence techniques to social media data, creates clusters of tourists according to their mobility and visiting preferences at the destination. The applied method improves the knowledge about the different mobility patterns of tourists (the most visited points and the main flows between them within a destination) depending on who they are and what their preferences are. Clustering tourists by their travel mobility makes it possible to uncover much more information about them and their preferences than previous studies. This knowledge will allow DMOs and tourism service providers to offer personalized services and information, to attract specific types of tourists to certain points of interest, to create new routes, or to enhance public transport services.

21 pages, 1844 KiB  
Article
Multivariate Time Series Deep Spatiotemporal Forecasting with Graph Neural Network
by Zichao He, Chunna Zhao and Yaqun Huang
Appl. Sci. 2022, 12(11), 5731; https://doi.org/10.3390/app12115731 - 05 Jun 2022
Cited by 8 | Viewed by 3802
Abstract
Multivariate time series forecasting has long been a subject of great concern. For example, there are many valuable applications in forecasting electricity consumption, solar power generation, traffic congestion, finance, and so on. Accurately forecasting periodic data such as electricity can greatly improve the reliability of forecasting tasks in engineering applications. Time series forecasting problems are often modeled using deep learning methods. However, the deep information of sequences and dependencies among multiple variables are not fully utilized in existing methods. Therefore, this paper proposes a multivariate time series deep spatiotemporal forecasting model with a graph neural network (MDST-GNN) to address these shortcomings and improve the accuracy of periodic data prediction. This model integrates a graph neural network and deep spatiotemporal information. It comprises four modules: graph learning, temporal convolution, graph convolution, and down-sampling convolution. The graph learning module extracts dependencies between variables. The temporal convolution module abstracts the time information of each variable sequence. The graph convolution is used for the fusion of the graph structure and the information of the temporal convolution module. An attention mechanism is presented to filter information in the graph convolution module. The down-sampling convolution module extracts deep spatiotemporal information with different sparsities. To verify the effectiveness of the model, experiments are carried out on four datasets. Experimental results show that the proposed model outperforms the current state-of-the-art baseline methods. The effectiveness of the module for solving the problem of dependencies and deep information is verified by ablation experiments.

22 pages, 6634 KiB  
Article
Performance Evaluation of Sequential Rule Mining Algorithms
by Amira Abdelwahab and Nesma Youssef
Appl. Sci. 2022, 12(10), 5230; https://doi.org/10.3390/app12105230 - 21 May 2022
Viewed by 1536
Abstract
Data mining techniques are useful in discovering hidden knowledge from large databases. One of its common techniques is sequential rule mining. Sequential rule (SR) mining finds all sequential rules that meet the support and confidence thresholds, which helps in prediction. It is an alternative to sequential pattern mining in that it takes the probability of the following patterns into account. In this paper, we address the preferable utilization of sequential rule mining algorithms by applying them to databases with different features for improving the efficiency in different fields of application. The three compared algorithms are the TRuleGrowth algorithm, which is an extension sequential rule algorithm of RuleGrowth; the top-k non-redundant sequential rules algorithm (TNS); and a non-redundant dynamic bit vector (NRD-DBV). The analysis compares the three algorithms regarding the run time, the number of produced rules, and the used memory to nominate which of them is best suited for prediction. Additionally, it explores the most suitable applications for each algorithm to improve the efficiency. The experimental results showed that the performance of the algorithms is related to the dataset characteristics. It has been demonstrated that altering the window size constraint, determining the number of created rules, or changing the value of the minSup threshold can reduce execution time and control the number of valid rules generated.
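
The support and confidence thresholds mentioned above can be made concrete with a small sketch. It assumes a common definition from the sequential rule mining literature, which the abstract does not spell out: a rule X → Y occurs in a sequence when every item of X appears before every item of Y; support is the fraction of sequences containing the rule, and confidence is that count divided by the number of sequences containing all of X. Sequences are simplified to flat lists of items, the antecedent is assumed non-empty, and all names and data are hypothetical.

```python
def antecedent_end(sequence, antecedent):
    """Earliest index by which every antecedent item has appeared, or None."""
    positions = []
    for item in antecedent:
        if item not in sequence:
            return None
        positions.append(sequence.index(item))  # first occurrence
    return max(positions)


def rule_occurs(sequence, antecedent, consequent):
    """True if all antecedent items appear before all consequent items."""
    end = antecedent_end(sequence, antecedent)
    if end is None:
        return False
    tail = sequence[end + 1:]
    return all(item in tail for item in consequent)


def support_and_confidence(database, antecedent, consequent):
    """Compute (support, confidence) of antecedent -> consequent."""
    with_antecedent = sum(
        antecedent_end(s, antecedent) is not None for s in database
    )
    with_rule = sum(rule_occurs(s, antecedent, consequent) for s in database)
    support = with_rule / len(database)
    confidence = with_rule / with_antecedent if with_antecedent else 0.0
    return support, confidence


# Hypothetical sequence database.
database = [
    ["a", "b", "c"],
    ["a", "c", "b"],
    ["b", "a", "c"],
    ["c", "a"],
]
support, confidence = support_and_confidence(database, {"a", "b"}, {"c"})
```

In this toy database the rule {a, b} → {c} occurs in two of four sequences (support 0.5), and three sequences contain both a and b (confidence 2/3). A mining algorithm would keep the rule only if both values reach the chosen minimum thresholds.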
(This article belongs to the Special Issue Data Analysis and Mining)
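
The support and confidence thresholds mentioned in the abstract above can be illustrated with a minimal sketch. It assumes the commonly used definition in which a rule X ⇒ Y occurs in a sequence when every item of X appears before every item of Y; the function names and the toy database are hypothetical, and this is not an implementation of TRuleGrowth, TNS, or NRD-DBV.

```python
from typing import List, Set, Tuple

Sequence = List[Set[str]]  # a sequence is an ordered list of itemsets


def _items(itemsets: List[Set[str]]) -> Set[str]:
    """Union of all items in a list of itemsets."""
    return set().union(*itemsets)


def rule_occurs(sequence: Sequence, antecedent: Set[str], consequent: Set[str]) -> bool:
    """True if all antecedent items appear strictly before all consequent items."""
    for split in range(1, len(sequence)):
        before = _items(sequence[:split])
        after = _items(sequence[split:])
        if antecedent <= before and consequent <= after:
            return True
    return False


def support_confidence(
    database: List[Sequence], antecedent: Set[str], consequent: Set[str]
) -> Tuple[float, float]:
    """Return (support, confidence) of the rule antecedent => consequent."""
    if not database:
        raise ValueError("database must contain at least one sequence")
    n_rule = sum(rule_occurs(seq, antecedent, consequent) for seq in database)
    n_antecedent = sum(antecedent <= _items(seq) for seq in database)
    support = n_rule / len(database)
    confidence = n_rule / n_antecedent if n_antecedent else 0.0
    return support, confidence


# Toy database (hypothetical data, for illustration only).
toy_db: List[Sequence] = [
    [{"a"}, {"b"}, {"c"}],
    [{"a"}, {"c"}],
    [{"b"}, {"a"}],
    [{"c"}, {"a", "b"}],
]

print(support_confidence(toy_db, {"a"}, {"c"}))  # prints (0.5, 0.5)
```

A rule would be reported only if both values reach user-chosen minimum thresholds (minSup and a minimum confidence); raising minSup shrinks the set of valid rules.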

29 pages, 493 KiB  
Article
User Trust Inference in Online Social Networks: A Message Passing Perspective
by Yu Liu and Bai Wang
Appl. Sci. 2022, 12(10), 5186; https://doi.org/10.3390/app12105186 - 20 May 2022
Cited by 1 | Viewed by 1924
Abstract
Online social networks are vital environments for information sharing and user interactivity. To help users of online social services build, expand, and maintain their friend networks or webs of trust, trust management systems have been deployed and trust inference (or, more generally, friend recommendation) techniques have been studied in many online social networks. However, several challenging issues obstruct real-world trust inference tasks. Using only explicit yet sparse trust relationships to predict user trust is inefficient in large online social networks. In the age of the privacy-respecting Internet, certain types of user data may be unavailable, and thus existing models for trust inference may be less accurate or even defunct. Although some less interpretable models may achieve better performance in trust prediction, their lack of interpretability may prevent them from being adopted or improved for making relevant informed decisions. To tackle these problems, we propose a probabilistic graphical model for trust inference in online social networks. The proposed model is built upon the skeleton of explicit trust relationships (the web of trust) and embeds various types of available user data as comprehensively designed trust-aware features. A message passing algorithm, loopy belief propagation, is applied to the model inference, which greatly improves the interpretability of the proposed model. The performance of the proposed model is demonstrated by experiments on a real-world online social network dataset. Experimental results show that the proposed model achieves acceptable accuracy with both fully and partially available data. Comparison experiments were conducted, and the results show the proposed model’s promise for trust inference in some circumstances.
(This article belongs to the Special Issue Data Analysis and Mining)

15 pages, 1362 KiB  
Article
Parallel Frequent Subtrees Mining Method by an Effective Edge Division Strategy
by Jing Wang and Xiongfei Li
Appl. Sci. 2022, 12(9), 4778; https://doi.org/10.3390/app12094778 - 09 May 2022
Viewed by 1128
Abstract
Most data with a complicated structure can be represented by a tree structure. Parallel processing is essential for mining frequent subtrees from massive data in a timely manner. However, only a few algorithms can be ported to a parallel framework. A new parallel algorithm is proposed to mine frequent subtrees using a grouping strategy (GS) and an edge division strategy (EDS). The main idea of GS is to divide edges according to different intervals and then assign the subtrees consisting of the edges in different intervals to their corresponding groups. In addition, the compression stage in mining is optimized by avoiding all candidate subtrees of a compression tree, which reduces the mining time on the nodes. Load balancing can improve the performance of parallel computing, and an effective EDS is proposed to achieve it. EDS divides the edges with different frequencies into different intervals, which directly affects the amount of work assigned to each computing node. Experiments demonstrate that the proposed algorithm can implement parallel mining and that it outperforms the compared methods on load balancing and speedup.
(This article belongs to the Special Issue Data Analysis and Mining)

18 pages, 6488 KiB  
Article
Interested Keyframe Extraction of Commodity Video Based on Adaptive Clustering Annotation
by Guangyi Man and Xiaoyan Sun
Appl. Sci. 2022, 12(3), 1502; https://doi.org/10.3390/app12031502 - 30 Jan 2022
Cited by 4 | Viewed by 2549
Abstract
Keyframe recognition is very important for extracting pivotal information from videos. Numerous studies have successfully identified frames with motion objectives as keyframes. However, the definition of “keyframe” can differ considerably depending on the requirements. In the field of e-commerce, the keyframes of product videos should be those that interest a customer and help the customer make correct and quick decisions, which differs greatly from the setting of existing studies. Accordingly, we first define the key interested frame of a commodity video from the viewpoint of user demand. As there are no annotations on the interested frames, we develop a fast and adaptive clustering strategy that groups the preprocessed videos into several clusters according to the definition and annotates them. These annotated samples are used to train a deep neural network to obtain the features of key interested frames and to recognize them. The performance of the proposed algorithm in effectively recognizing the key interested frames is demonstrated by applying it to commodity videos fetched from an e-commerce platform.
(This article belongs to the Special Issue Data Analysis and Mining)

15 pages, 836 KiB  
Article
Optimal Tests for Combining p-Values
by Zhongxue Chen
Appl. Sci. 2022, 12(1), 322; https://doi.org/10.3390/app12010322 - 29 Dec 2021
Cited by 4 | Viewed by 2779
Abstract
Combining information (p-values) obtained from individual studies to test whether there is an overall effect is an important task in statistical data analysis. Many classical statistical tests, such as chi-square tests, can be viewed as p-value combination approaches. It remains challenging to find powerful methods to combine p-values obtained from various sources. In this paper, we study a class of p-value combination methods based on the gamma distribution. We show that this class of tests is optimal under certain conditions and that several existing popular methods are equivalent to its special cases. An asymptotically and uniformly most powerful p-value combination test based on a constrained likelihood ratio test is then studied. Numerical results from a simulation study and real data examples demonstrate that the proposed tests are robust and powerful under many conditions. They have potentially broad applications in statistical inference.
(This article belongs to the Special Issue Data Analysis and Mining)
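
The abstract above does not spell out the proposed tests, so the following is only a minimal sketch of one classical combination rule, Fisher's method, shown as a familiar example of a gamma-type test. Under the null hypothesis, with k independent p-values, the statistic -2 Σ ln p_i follows a chi-square distribution with 2k degrees of freedom, which is a special case of the gamma distribution. The function name is hypothetical, and this is not the authors' optimal test.

```python
import math
from typing import Sequence


def fisher_combined_pvalue(p_values: Sequence[float]) -> float:
    """Combine independent p-values with Fisher's method.

    Uses the closed-form survival function of a chi-square distribution
    with an even number of degrees of freedom (2k):
        P(X >= x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    """
    k = len(p_values)
    if k == 0:
        raise ValueError("at least one p-value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in the interval (0, 1]")

    statistic = -2.0 * sum(math.log(p) for p in p_values)
    half = statistic / 2.0

    # Accumulate the finite series term by term to avoid large factorials.
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


# Two moderately small p-values combine into stronger overall evidence.
combined = fisher_combined_pvalue([0.05, 0.05])
print(round(combined, 4))
```

With a single p-value the function returns that p-value unchanged, which is a quick sanity check on the series formula.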