Data Analysis and Mining: New Techniques and Applications

A special issue of Applied Sciences (ISSN 2076-3417). This special issue belongs to the section "Computing and Artificial Intelligence".

Deadline for manuscript submissions: 20 August 2024 | Viewed by 5779

Special Issue Editor

College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Interests: data mining; social network analysis; multimodal learning; graph data analysis; time series analysis

Special Issue Information

Dear Colleagues,

Learning hierarchical representations and finding useful patterns in data with differentiable models trained in an end-to-end fashion has been among the greatest developments in data mining so far. Beyond its applications in traditional research fields such as computer vision, natural language processing, and recommendation systems, this data-driven approach shows great potential at the intersection of AI and science. From protein structure prediction to quantum artificial intelligence, data mining techniques are providing remarkable insight into data and have assisted in the discovery of scientific laws in various domains, contributing to a new research paradigm called AI for science.

Even though artificial general intelligence (AGI) remains out of reach, mining scientific data still finds many intriguing applications. Recent applications include, but are not restricted to, quantum physics, computational chemistry, molecular biology, fluid dynamics, software engineering, and other disciplines. This Special Issue invites the submission of papers with innovative ideas either in data mining algorithms or in applications to a specific research field. To facilitate the adoption of data mining technology and accelerate its industrial application, papers that present data mining tools for a specific domain are also welcome.

Dr. Donghai Guan
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Applied Sciences is an international peer-reviewed open access semimonthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • data mining
  • time series analysis
  • multimodal learning
  • social network analysis
  • classification
  • clustering
  • graph data analysis

Published Papers (6 papers)


Research

16 pages, 370 KiB  
Article
Pairwise Likelihood Estimation of the 2PL Model with Locally Dependent Item Responses
by Alexander Robitzsch
Appl. Sci. 2024, 14(6), 2652; https://doi.org/10.3390/app14062652 - 21 Mar 2024
Viewed by 270
Abstract
The local independence assumption is crucial for the consistent estimation of item parameters in item response theory models. This article explores a pairwise likelihood estimation approach for the two-parameter logistic (2PL) model that treats the local dependence structure as a nuisance in the optimization function. Hence, item parameters can be consistently estimated without explicit modeling assumptions of the dependence structure. Two simulation studies demonstrate that the proposed pairwise likelihood estimation approach allows nearly unbiased and consistent item parameter estimation. Our proposed method performs similarly to the marginal maximum likelihood and pairwise likelihood estimation approaches, which also estimate the parameters for the local dependence structure. Full article
(This article belongs to the Special Issue Data Analysis and Mining: New Techniques and Applications)
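As an illustration of the pairwise-likelihood construction the abstract describes, the sketch below evaluates a pairwise log-likelihood for the 2PL model on simulated responses. This is a minimal sketch, not the authors' estimator: it uses ordinary Gauss-Hermite quadrature over a standard-normal ability prior and omits the paper's nuisance treatment of local dependence; all function and variable names are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def pairwise_loglik(params, X, nodes, weights):
    """Pairwise log-likelihood of the 2PL model over all item pairs.

    params: concatenated discriminations a and difficulties b (length 2 * items);
    X: (persons, items) 0/1 response matrix;
    nodes, weights: quadrature rule for the standard-normal ability prior.
    """
    n_items = X.shape[1]
    a, b = params[:n_items], params[n_items:]
    P = sigmoid(a[None, :] * (nodes[:, None] - b[None, :]))  # (nodes, items)
    total = 0.0
    for i in range(n_items):
        for j in range(i + 1, n_items):
            for xi in (0, 1):
                for xj in (0, 1):
                    pi = P[:, i] if xi else 1.0 - P[:, i]
                    pj = P[:, j] if xj else 1.0 - P[:, j]
                    pair_prob = np.sum(weights * pi * pj)  # marginal pair probability
                    n_cell = np.sum((X[:, i] == xi) & (X[:, j] == xj))
                    total += n_cell * np.log(pair_prob)
    return total

# Gauss-Hermite quadrature rescaled for a N(0, 1) ability distribution.
gh_x, gh_w = np.polynomial.hermite.hermgauss(21)
nodes, weights = np.sqrt(2.0) * gh_x, gh_w / np.sqrt(np.pi)

# Simulate 2PL responses for 500 persons and 4 items (illustrative parameters).
rng = np.random.default_rng(0)
a_true = np.array([1.0, 1.5, 0.8, 1.2])
b_true = np.array([-0.5, 0.0, 0.5, 1.0])
theta = rng.standard_normal(500)
X = (rng.random((500, 4)) < sigmoid(a_true * (theta[:, None] - b_true))).astype(int)

ll = pairwise_loglik(np.concatenate([a_true, b_true]), X, nodes, weights)
```

Maximizing this criterion (e.g. by minimizing its negative with `scipy.optimize.minimize`) yields pairwise-likelihood estimates; the article's contribution lies in how locally dependent item pairs are handled as a nuisance, which this sketch does not reproduce.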
27 pages, 3097 KiB  
Article
A Methodology for Knowledge Discovery in Labeled and Heterogeneous Graphs
by Víctor H. Ortega-Guzmán, Luis Gutiérrez-Preciado, Francisco Cervantes and Mildreth Alcaraz-Mejia
Appl. Sci. 2024, 14(2), 838; https://doi.org/10.3390/app14020838 - 18 Jan 2024
Viewed by 555
Abstract
Graph mining has emerged as a significant field of research with applications spanning multiple domains, including marketing, corruption analysis, business, and politics. The exploration of knowledge within graphs has garnered considerable attention due to the exponential growth of graph-modeled data and its potential in applications where data relationships are a crucial component, potentially even more important than the data themselves. However, the increasing use of graphs for storing and modeling data presents unique challenges that have prompted advancements in graph mining algorithms, data modeling and storage, query languages for graph databases, and data visualization techniques. Although various methodologies for data analysis exist, they predominantly focus on structured data and may not be optimally suited for highly connected data. Accordingly, this work introduces a novel methodology specifically tailored for knowledge discovery in labeled and heterogeneous graphs (KDG), and it presents three case studies demonstrating its successful application in addressing various challenges across different application domains. Full article
(This article belongs to the Special Issue Data Analysis and Mining: New Techniques and Applications)

20 pages, 4842 KiB  
Article
SbMBR Tree—A Spatiotemporal Data Indexing and Compression Algorithm for Data Analysis and Mining
by Runda Guan, Ziyu Wang, Xiaokang Pan, Rongjie Zhu, Biao Song and Xinchang Zhang
Appl. Sci. 2023, 13(19), 10562; https://doi.org/10.3390/app131910562 - 22 Sep 2023
Viewed by 516
Abstract
In the field of data analysis and mining, applying efficient data indexing and compression techniques to spatiotemporal data can significantly reduce computational and storage overhead, both by controlling the volume of data and by exploiting its spatiotemporal characteristics. However, traditional lossy compression techniques are hardly suitable due to their inherently random nature: they often impose unpredictable damage on scientific data, which affects the results of data mining and analysis tasks that require a certain precision. In this paper, we propose the similarity-based minimum bounding rectangle (SbMBR) tree, a tree-based indexing and compression method, to address this problem. Our method hierarchically selects appropriate minimum bounding rectangles according to the given maximum acceptable errors and uses the average value contained in each selected MBR to replace the original data, achieving data compression with multi-layer loss control. This paper also provides the corresponding tree-construction algorithm and range-query processing algorithm for this indexing structure. To evaluate data-quality preservation in cross-domain data analysis and mining scenarios, we use mutual information as the estimation metric. Experimental results emphasize the superiority of our method over several typical indexing and compression algorithms. Full article
(This article belongs to the Special Issue Data Analysis and Mining: New Techniques and Applications)
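The core loss-controlled averaging step can be illustrated in one dimension: replace runs of values with their mean whenever no point deviates from that mean by more than a given bound. The greedy sketch below is only a stand-in for the idea; the paper's hierarchical MBR selection and tree construction are not reproduced, and the function names are illustrative.

```python
def compress(values, eps):
    """Greedily extend a run while every point stays within eps of the run
    mean, then emit (start, end, mean); a 1-D stand-in for MBR averaging."""
    segments, start = [], 0
    while start < len(values):
        end = start + 1
        while end < len(values):
            seg = values[start:end + 1]
            mean = sum(seg) / len(seg)
            if max(abs(v - mean) for v in seg) > eps:
                break  # extending further would violate the error bound
            end += 1
        seg = values[start:end]
        segments.append((start, end, sum(seg) / len(seg)))
        start = end
    return segments

def decompress(segments, n):
    """Rebuild the series from (start, end, mean) segments."""
    out = [0.0] * n
    for s, e, m in segments:
        out[s:e] = [m] * (e - s)
    return out
```

Decompressing reproduces every value to within `eps`, while smooth stretches of the series collapse to a handful of segments, which is the loss-control behaviour the abstract describes.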

30 pages, 5836 KiB  
Article
Machine Learning Ensemble Modelling for Predicting Unemployment Duration
by Barbora Gabrikova, Lucia Svabova and Katarina Kramarova
Appl. Sci. 2023, 13(18), 10146; https://doi.org/10.3390/app131810146 - 8 Sep 2023
Cited by 2 | Viewed by 1468
Abstract
Predictions of the unemployment duration of the economically active population play a crucial assisting role for policymakers and employment agencies in the well-organised allocation of resources (tied to solving problems of the unemployed, on both the labour supply and demand sides) and in providing targeted support to jobseekers in their job search. This study aimed to develop an ensemble model that can serve as a reliable tool for predicting unemployment duration among jobseekers in Slovakia. The ensemble model was developed from real data from the database of jobseekers (those registered as unemployed and actively searching for a job through the Local Labour Office, Social Affairs, and Family) using the stacking method, incorporating predictions from three individual models: CART, CHAID, and discriminant analysis. The final meta-model was created using logistic regression and achieves an overall accuracy in predicting unemployment duration of almost 78%. The model demonstrated high accuracy and precision in identifying jobseekers at risk of long-term unemployment exceeding 12 months. Working with real data of a robust nature, it represents an operational tool that can be used to check the functionality of current labour market policy, to address the problem of long-term unemployed individuals in Slovakia, and to inform future government measures aimed at reducing unemployment. These state measures are financed from budget funds, and by applying an appropriate model it is possible to rationalise their financing, or to target the means intended to solve the problem of long-term unemployment in Slovakia (which, together with the regional disproportion of unemployment, is considered one of the most prominent problems in the Slovak labour market).
The model also has the potential to be adapted to other economies, taking into account country-specific conditions and variables, which is possible due to the data-mining approach used. Full article
(This article belongs to the Special Issue Data Analysis and Mining: New Techniques and Applications)
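The stacking design can be sketched with scikit-learn on synthetic data, since the jobseeker database is not public. This is only an assumed analogue: `DecisionTreeClassifier` stands in for CART and `LinearDiscriminantAnalysis` for discriminant analysis (CHAID has no scikit-learn implementation and is omitted), with logistic regression as the meta-model, as in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the (non-public) jobseeker data.
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stacking: cross-validated base-model predictions become the inputs
# of a logistic-regression meta-model.
stack = StackingClassifier(
    estimators=[
        ("cart", DecisionTreeClassifier(max_depth=5, random_state=0)),
        ("discriminant", LinearDiscriminantAnalysis()),
    ],
    final_estimator=LogisticRegression(),
)
stack.fit(X_tr, y_tr)
accuracy = stack.score(X_te, y_te)
```

The meta-model learns how much to trust each base learner, which is why stacked ensembles can outperform any single constituent model; the 78% figure in the abstract refers to the authors' real data, not to this synthetic sketch.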

14 pages, 2476 KiB  
Article
Effects of the Hybrid CRITIC–VIKOR Method on Product Aspect Ranking in Customer Reviews
by Saif Addeen Ahmad Alrababah and Keng Hoon Gan
Appl. Sci. 2023, 13(16), 9176; https://doi.org/10.3390/app13169176 - 11 Aug 2023
Cited by 2 | Viewed by 738
Abstract
Product aspect ranking is critical for prioritizing the most important aspects of a specific product or service, helping prospective customers select products that meet their needs. The growing popularity of online shopping has led to an exponential increase in the number of customer reviews on e-commerce websites, and the sheer volume of these reviews makes it nearly impossible for customers to manually extract and analyze the specific aspects of the products they are interested in. This challenge highlights the need for automated techniques that can rank product aspects by relevance and importance. Multicriteria decision-making methods address this by offering a methodical strategy for assessing and contrasting product attributes against multiple criteria; however, only a few such methods have been applied to ranking product aspects, and the subjective determination of criterion weights raises serious issues because it can lead to bias and inconsistent ranking outcomes. Since weights greatly affect the ranking results, this study used objective methods to determine the importance degree of a criteria set, overcoming the limitations of subjective weighting. The CRITIC–VIKOR method was adopted for the product aspect ranking process. Statistical findings on a benchmark dataset, evaluated with NDCG, demonstrate the superior performance of objective weighting in reasonably reproducing subjective weighting results. The results also show that product aspects ranked by CRITIC–VIKOR can serve as guidelines for prospective customers making wise purchasing decisions. Full article
(This article belongs to the Special Issue Data Analysis and Mining: New Techniques and Applications)
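The CRITIC half of the hybrid can be sketched directly: objective weights come from each criterion's contrast intensity (standard deviation after normalisation) scaled by its conflict with the other criteria (one minus the pairwise correlations). The decision matrix below is hypothetical, as are the criterion names; VIKOR would then rank the alternatives using these weights.

```python
import numpy as np

def critic_weights(D):
    """CRITIC objective weights for a decision matrix D (alternatives x criteria).
    Assumes benefit-type criteria; cost criteria would need inverted scaling."""
    X = (D - D.min(axis=0)) / (D.max(axis=0) - D.min(axis=0))  # min-max normalise
    sigma = X.std(axis=0)                 # contrast intensity of each criterion
    R = np.corrcoef(X, rowvar=False)      # inter-criteria correlations
    C = sigma * (1.0 - R).sum(axis=0)     # information content
    return C / C.sum()

# Illustrative decision matrix: 4 product aspects scored on 3 criteria
# (e.g. mention frequency, sentiment strength, coverage -- hypothetical numbers).
D = np.array([[250.0, 16.0, 12.0],
              [200.0, 16.0,  8.0],
              [300.0, 32.0, 16.0],
              [275.0, 32.0,  8.0]])
w = critic_weights(D)
```

A criterion that varies strongly and correlates little with the others carries more information and therefore receives a larger weight, which is the objectivity argument the abstract makes against subjective weighting.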

24 pages, 4667 KiB  
Article
A Dynamic Grid Index for CkNN Queries on Large-Scale Road Networks with Moving Objects
by Kailei Tang, Zhiyan Dong, Wenxiang Shi and Zhongxue Gan
Appl. Sci. 2023, 13(8), 4946; https://doi.org/10.3390/app13084946 - 14 Apr 2023
Viewed by 1013
Abstract
As Internet of Things devices are deployed on a large scale, location-based services are being increasingly utilized. Among these services, kNN (k-nearest neighbor) queries based on road network constraints have gained importance. This study focuses on CkNN (continuous k-nearest neighbor) queries for non-uniformly distributed moving objects under large-scale dynamic road network constraints, where nearest neighbors are continuously and periodically queried as the objects move. Existing high-concurrency CkNN queries under the constraints of very large road networks face problems such as high computational cost and low query efficiency. The aim of this study is to serve highly concurrent nearest-neighbor query requests while shortening the query response time and reducing global computation costs. To address this issue, we propose the DVTG-Index (Dynamic V-Tree Double-Layer Grid Index), which intelligently adjusts the index granularity by continuously merging and splitting subgraphs as the objects move, thereby filtering unnecessary vertices. Based on the DVTG-Index, we further propose the DVTG-CkNN algorithm, which computes the initial kNN query and utilizes the existing results to speed up the CkNN query. Finally, extensive experiments on real road networks confirm the superior performance of our proposed method, which has significant practical applications for non-uniformly distributed moving objects on large-scale dynamic road networks. Full article
(This article belongs to the Special Issue Data Analysis and Mining: New Techniques and Applications)
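As a much-simplified illustration of grid-based kNN indexing, the sketch below uses a single uniform grid over 2-D points and expands square rings of cells until the k-th candidate is provably closer than any unscanned cell. It assumes plain Euclidean distance, a static fixed granularity, and at least k indexed objects, unlike the paper's road-network distances and dynamically merged and split subgraphs; all names are illustrative.

```python
import math
from collections import defaultdict

class GridIndex:
    """Uniform-grid point index (a fixed-granularity, Euclidean simplification)."""

    def __init__(self, cell):
        self.cell = cell
        self.cells = defaultdict(list)  # (i, j) cell key -> [(oid, x, y), ...]

    def _key(self, x, y):
        return (math.floor(x / self.cell), math.floor(y / self.cell))

    def insert(self, oid, x, y):
        self.cells[self._key(x, y)].append((oid, x, y))

    def knn(self, x, y, k):
        cx, cy = self._key(x, y)
        found, ring = [], 0
        while True:
            # Scan the square ring of cells at Chebyshev distance `ring`.
            for i in range(cx - ring, cx + ring + 1):
                for j in range(cy - ring, cy + ring + 1):
                    if max(abs(i - cx), abs(j - cy)) == ring:
                        found.extend(self.cells.get((i, j), []))
            found.sort(key=lambda p: (p[1] - x) ** 2 + (p[2] - y) ** 2)
            # Every unscanned point is at least ring * cell away, so the answer
            # is exact once the k-th candidate is within that bound.
            if len(found) >= k and math.dist((x, y), found[k - 1][1:]) <= ring * self.cell:
                return [oid for oid, _, _ in found[:k]]
            ring += 1
```

The granularity trade-off this sketch fixes by hand (small cells mean many ring scans, large cells mean many distance checks) is precisely what the paper's index adjusts dynamically as objects move.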

Planned Papers

The list below represents only planned manuscripts. Some of these manuscripts have not yet been received by the Editorial Office. Papers submitted to MDPI journals are subject to peer review.

Title: Software Defect Prediction with Semantic and Context Features and Multi-Adaptation
Authors: Chuanqi Tao
Affiliation: taochuanqi@nuaa.edu.cn
Abstract: Many software testing methods, such as random testing, have been used extensively, but they may waste considerable resources. Software defect prediction (SDP), which predicts defective code regions, can help developers find errors and make reasonable testing plans. A cross-project defect prediction (CPDP) model learns from the sufficient data and labels of other projects and then predicts the defect labels of a new project with insufficient data and few labels. Although CPDP has great advantages when a new project has little historical data, previous methods are mainly designed around handcrafted and semantic features. However, source code contains rich information, including semantic and context features, and understanding a code fragment's context is important for diagnosing defective code. In this paper, we combine handcrafted features with semantic and context features from source code and use them in experiments on both within-project defect prediction (WPDP) and CPDP. Moreover, existing deep-learning-based CPDP methods have not fully considered the differences among projects or domain multi-adaptation methods. To solve these problems, the authors propose a model that automatically generates semantic and context features from source code and then applies joint domain adaptation with multi-layer, multi-kernel maximum mean discrepancy (MLMK-MMD) in deep transfer learning for CPDP.
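The discrepancy measure underlying the planned approach can be sketched independently of the paper: below is a biased estimate of squared maximum mean discrepancy under a sum of Gaussian kernels. This is only one common reading of "multi-kernel MMD"; the MLMK-MMD of the abstract additionally aggregates the discrepancy over multiple network layers, which is omitted here, and the bandwidths are illustrative.

```python
import numpy as np

def mk_mmd2(X, Y, gammas=(0.5, 1.0, 2.0)):
    """Biased estimate of squared MMD between samples X and Y under a
    sum of Gaussian kernels with the given bandwidth parameters."""
    def k(A, B):
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)  # pairwise sq. dists
        return sum(np.exp(-g * d2) for g in gammas)
    return k(X, X).mean() + k(Y, Y).mean() - 2.0 * k(X, Y).mean()

# Illustrative feature samples: two from the same distribution, one shifted,
# standing in for feature vectors of different software projects.
rng = np.random.default_rng(1)
source = rng.standard_normal((80, 3))
same = rng.standard_normal((80, 3))
shifted = rng.standard_normal((80, 3)) + 3.0
```

In domain-adaptation training, a term like this is added to the loss so that feature distributions of the source and target projects are pulled together; it is near zero for matching distributions and grows with distribution shift.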

Title: Pcilad: Pre-Trained Temporal-Spatial Network for UUV Anomaly Trajectory Detection
Authors: Donghai Guan
Affiliation: College of Computer Science & Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
Abstract: The recognition of abnormal trajectories of unmanned underwater vehicles (UUVs) is crucial for navigation safety and efficiency. Existing works mainly rely on machine learning and probability density methods, which struggle to learn the spatiotemporal information in trajectory data, resulting in low anomaly recognition rates and poor transferability across tasks. To address this issue, this paper proposes Pcilad, a multi-dimensional spatiotemporal fusion model that leverages pre-training techniques; Pcilad is designed to learn spatiotemporal information and enhance transferability through pre-training and fine-tuning. In the pre-training stage, a spatiotemporal encoder-decoder architecture extracts spatiotemporal features of trajectory sequences. To capture the spatiotemporal dependencies of UUV trajectories, the sequence is dynamically masked and randomly embedded through masked spatiotemporal trajectory modeling. In the fine-tuning stage, the pre-trained spatiotemporal encoder weights are loaded into classifiers for downstream tasks and fine-tuned end-to-end. Experiments on five datasets show that Pcilad significantly improves anomaly recognition rates and outperforms existing models.
