Knowledge Extraction from Data Using Machine Learning

A special issue of Data (ISSN 2306-5729). This special issue belongs to the section "Information Systems and Data Management".

Deadline for manuscript submissions: closed (30 April 2022) | Viewed by 64211

Special Issue Editor


Dr. Giuseppe Ciaburro
Guest Editor
Department of Architecture and Industrial Design, Università degli Studi della Campania "Luigi Vanvitelli", Aversa, Italy
Interests: acoustics; architecture; digital signal processing; sound; audio signal processing; acoustic signal processing; acoustic analysis; acoustics and acoustic engineering; sound analysis; noise analysis

Special Issue Information

Dear Colleagues,

Machine learning is a field of artificial intelligence that deals with the creation of algorithms and systems capable of extracting new knowledge from input data. It is widely used in disciplines including physics, mathematics, statistics, and mechanics as an alternative to classical data analysis procedures. In many fields, including computer vision, image processing, speech processing, and pattern recognition, algorithms based on machine learning have driven a progressive technological evolution that has led to the development of intelligent machines. Machine learning represents a form of adaptation of the system to the environment through experience, similar to what happens to every living being. This adaptation aims to lead to an improvement without relying on continuous human intervention. To achieve this, the system must be able to learn; that is, it must be able to extract useful information on a given problem by examining a series of examples associated with it. The constant increase in the amount of data produced daily and the rapid growth in computing capacity are two key factors that have contributed to the development of new data analysis methodologies. Machine learning is used in science to facilitate research on the collection, classification, and correlation of data. Companies, for their part, are increasingly using these algorithms to extract knowledge from data in order to develop models to support strategic decisions and create value. To obtain this result, it is essential to identify the key information that makes it possible to create knowledge.

The purpose of this Special Issue is to collect scientific contributions that demonstrate the widespread use of machine-learning-based applications to extract knowledge from data. Therefore, original research articles as well as review articles will be welcome, containing examples of works based on these technologies in the most popular fields: natural sciences, healthcare, medicine, finance, business, and economics.

Dr. Giuseppe Ciaburro
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Data is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (15 papers)

Research

21 pages, 6222 KiB  
Article
Using Transfer Learning to Train a Binary Classifier for Lorrca Ektacytometery Microscopic Images of Sickle Cells and Healthy Red Blood Cells
by Marya Butt and Ander de Keijzer
Data 2022, 7(9), 126; https://doi.org/10.3390/data7090126 - 05 Sep 2022
Viewed by 1915
Abstract
Multiple blood images of stressed and sheared cells, taken by a Lorrca Ektacytometery microscope, needed a classification for biomedical researchers to assess several treatment options for blood-related diseases. The study proposes the design of a model capable of classifying these images, with high accuracy, into healthy Red Blood Cells (RBCs) or Sickle Cells (SCs) images. The performances of five Deep Learning (DL) models with two different optimizers, namely Adam and Stochastic Gradient Descent (SGD), were compared. The first three models consisted of 1, 2 and 3 blocks of CNN, respectively, and the last two models used a transfer learning approach to extract features. The dataset was first augmented and scaled, and then used to train the models. The performance of the models was evaluated by testing on new images and was illustrated by confusion matrices, performance metrics (accuracy, recall, precision and F1 score), a receiver operating characteristic (ROC) curve and the area under the curve (AUC) value. The first, second and third models with the Adam optimizer could not achieve training, validation or testing accuracy above 50%. However, the second and third models with SGD optimizers showed good loss and accuracy scores during training and validation, but the testing accuracy did not exceed 51%. The fourth and fifth models used VGG16 and Resnet50 pre-trained models for feature extraction, respectively. VGG16 performed better than Resnet50, scoring 98% accuracy and an AUC of 0.98 with both optimizers. The study suggests that transfer learning with the VGG16 model helped to extract features from images for the classification of healthy RBCs and SCs, thus making a significant difference in performance compared with the first, second, third and fifth models.
(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)
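
The abstract above reports accuracy, recall, precision and F1 score derived from confusion matrices. As a rough illustration of how those four numbers relate to one another, here is a minimal Python sketch. The function name `binary_metrics` and the counts used below are hypothetical and are not taken from the paper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F1 from binary confusion-matrix counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives. Empty denominators yield 0.0 instead of raising.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical counts for a 100-image test set (illustration only).
metrics = binary_metrics(tp=48, fp=1, fn=2, tn=49)
```

Treating one class (for example, sickle cells) as "positive" is a convention; swapping the roles changes precision and recall but leaves accuracy unchanged.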

15 pages, 2676 KiB  
Article
Multi-Resolution Discrete Cosine Transform Fusion Technique Face Recognition Model
by Bader M. AlFawwaz, Atallah AL-Shatnawi, Faisal Al-Saqqar and Mohammad Nusir
Data 2022, 7(6), 80; https://doi.org/10.3390/data7060080 - 15 Jun 2022
Cited by 1 | Viewed by 1792
Abstract
This work presents a Multi-Resolution Discrete Cosine Transform (MDCT) fusion technique Fusion Feature-Level Face Recognition Model (FFLFRM) comprising face detection, feature extraction, feature fusion, and face classification. It detects core facial characteristics as well as local and global features utilizing Local Binary Pattern (LBP) and Principal Component Analysis (PCA) extraction. The MDCT fusion technique was applied, followed by Artificial Neural Network (ANN) classification. Model testing used 10,000 faces derived from the Olivetti Research Laboratory (ORL) library. Model performance was evaluated in comparison with three state-of-the-art models depending on Frequency Partition (FP), Laplacian Pyramid (LP) and Covariance Intersection (CI) fusion techniques, in terms of image features (low-resolution issues and occlusion) and facial characteristics (pose, and expression per se and in relation to illumination). The MDCT-based model yielded promising recognition results, with 97.70% accuracy, demonstrating effectiveness and robustness against these challenges. Furthermore, this work showed that the MDCT method used by the proposed FFLFRM is simpler, faster, and more accurate than the Discrete Fourier Transform (DFT), Fast Fourier Transform (FFT) and Discrete Wavelet Transform (DWT), and that it is an effective method for real-life facial recognition applications.

10 pages, 821 KiB  
Article
Using Twitter to Detect Hate Crimes and Their Motivations: The HateMotiv Corpus
by Noha Alnazzawi
Data 2022, 7(6), 69; https://doi.org/10.3390/data7060069 - 24 May 2022
Cited by 5 | Viewed by 3767
Abstract
With the rapidly increasing use of social media platforms, much of our lives is spent online. Despite the great advantages of using social media, unfortunately, the spread of hate, cyberbullying, harassment, and trolling can be very common online. Many extremists use social media platforms to communicate their messages of hatred and spread violence, which may result in serious psychological consequences and even contribute to real-world violence. Thus, the aim of this research was to build the HateMotiv corpus, a freely available dataset that is annotated for types of hate crimes and the motivation behind committing them. The dataset was developed using Twitter as an example of social media platforms and could provide the research community with a unique, novel, and reliable dataset. The dataset is unique as a consequence of its topic-specific nature and its detailed annotation. The corpus was annotated by two expert annotators working from unified guidelines, so they were able to produce an annotation of a high standard with F-scores for the agreement rate as high as 0.66 and 0.71 for type and motivation labels of hate crimes, respectively.

15 pages, 840 KiB  
Article
An Ensemble Model for Predicting Retail Banking Churn in the Youth Segment of Customers
by Vijayakumar Bharathi S, Dhanya Pramod and Ramakrishnan Raman
Data 2022, 7(5), 61; https://doi.org/10.3390/data7050061 - 09 May 2022
Cited by 10 | Viewed by 4604
Abstract
(1) This study aims to predict the defection of youth customers in retail banking. The sample comprised 602 young adult bank customers. (2) The study applied machine learning techniques, including ensembles, to predict the possibility of churn. (3) The absence of mobile banking, zero-interest personal loans, access to ATMs, and customer care and support were critical driving factors to churn. The ExtraTreeClassifier model resulted in an accuracy rate of 92%, and an AUC of 91.88% validated the findings. (4) Customer retention is one of the critical success factors for organizations so as to enhance the business value. It is imperative for banks to predict the drivers of churn among their young adult customers so as to create and deliver proactive, quality services.

18 pages, 4824 KiB  
Article
An Estimated-Travel-Time Data Scraping and Analysis Framework for Time-Dependent Route Planning
by Hong-Le Tee, Soung-Yue Liew, Chee-Siang Wong and Boon-Yaik Ooi
Data 2022, 7(5), 54; https://doi.org/10.3390/data7050054 - 27 Apr 2022
Cited by 1 | Viewed by 3093
Abstract
Generally, a courier company needs to employ a fleet of vehicles to travel through a number of locations in order to provide efficient parcel delivery services. The route planning of these vehicles can be formulated as a vehicle routing problem (VRP). Most existing VRP algorithms assume that the traveling durations between locations are time invariant; thus, they normally use only a set of estimated travel times (ETTs) to plan the vehicles’ routes; however, this is not realistic because the traffic pattern in a city varies over time. One solution to tackle the problem is to use different sets of ETTs for route planning in different time periods, and these data are collectively called the time-dependent estimated travel times (TD-ETTs). This paper focuses on a low-cost and robust solution to effectively scrape, process, clean, and analyze the TD-ETT data from free web-mapping services in order to gain knowledge of the traffic pattern in a city in different time periods. To achieve this goal, our proposed framework contains four phases, namely, (i) Full Data Scraping, (ii) Data Pre-Processing and Analysis, (iii) Fast Data Scraping, and (iv) Data Patching and Maintenance. In our experiment, we used the above framework to obtain the TD-ETT data across 68 locations in Penang, Malaysia, for six months. We then fed the data to a VRP algorithm for evaluation. We found that the performance of our low-cost approach is comparable with that of using expensive paid data.

22 pages, 6057 KiB  
Article
An Efficient Spark-Based Hybrid Frequent Itemset Mining Algorithm for Big Data
by Mohamed Reda Al-Bana, Marwa Salah Farhan and Nermin Abdelhakim Othman
Data 2022, 7(1), 11; https://doi.org/10.3390/data7010011 - 14 Jan 2022
Cited by 8 | Viewed by 5241
Abstract
Frequent itemset mining (FIM) is a common approach for discovering hidden frequent patterns from transactional databases used in prediction, association rules, classification, etc. Apriori is an elementary FIM algorithm with an iterative nature used to find the frequent itemsets. Apriori scans the dataset multiple times to generate large frequent itemsets with different cardinalities. Apriori performance degrades as data grows because of the multiple dataset scans needed to extract the frequent itemsets. Eclat is a scalable version of the Apriori algorithm that utilizes a vertical layout. The vertical layout has many advantages; it helps to solve the problem of multiple dataset scans and has information that helps to find each itemset's support. In a vertical layout, itemset support can be obtained by intersecting transaction ids (tidset/tids) and pruning irrelevant itemsets. However, when tids become too big for memory, the algorithm's efficiency suffers. In this paper, we introduce SHFIM (spark-based hybrid frequent itemset mining), which is a three-phase algorithm that utilizes both horizontal and vertical layouts and uses diffset instead of tidset to keep track of the differences between transaction ids rather than the intersections. Moreover, some improvements are developed to decrease the number of candidate itemsets. SHFIM is implemented and tested over the Spark framework, which utilizes the RDD (resilient distributed datasets) concept and in-memory processing to tackle the problems of the MapReduce framework. We compared the SHFIM performance with Spark-based Eclat and dEclat algorithms on four benchmark datasets. Experimental results showed that SHFIM outperforms the Spark-based Eclat and dEclat algorithms on both dense and sparse datasets in terms of execution time.
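
The difference between tidset intersection and diffsets mentioned in the abstract above can be sketched on a toy dataset. The following single-machine Python example is only an illustration of the general Eclat and dEclat idea for 2-itemsets; it is not the SHFIM algorithm, it does not use Spark, and the function names and transactions are hypothetical.

```python
def tidsets(transactions):
    """Build the vertical layout: item -> set of ids of transactions containing it."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return vertical


def support_by_tidset(vertical, x, y):
    """Eclat-style support of {x, y}: size of the intersection of the two tidsets."""
    return len(vertical[x] & vertical[y])


def support_by_diffset(vertical, x, y):
    """dEclat-style support of {x, y}.

    The diffset holds the transactions that contain x but not y, so
    support({x, y}) = support({x}) - |diffset|.
    """
    diffset = vertical[x] - vertical[y]
    return len(vertical[x]) - len(diffset)


# Hypothetical toy transactions (illustration only).
transactions = [
    {"a", "b", "c"},
    {"a", "c"},
    {"a", "d"},
    {"b", "c"},
    {"a", "b", "c"},
]
vertical = tidsets(transactions)
by_tidset = support_by_tidset(vertical, "a", "c")
by_diffset = support_by_diffset(vertical, "a", "c")
```

Both functions return the same support. On dense data the diffset is usually much smaller than the intersection, which is the memory advantage the diffset representation is meant to provide.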

20 pages, 604 KiB  
Article
The Impact of Global Structural Information in Graph Neural Networks Applications
by Davide Buffelli and Fabio Vandin
Data 2022, 7(1), 10; https://doi.org/10.3390/data7010010 - 13 Jan 2022
Cited by 3 | Viewed by 2858
Abstract
Graph Neural Networks (GNNs) rely on the graph structure to define an aggregation strategy where each node updates its representation by combining information from its neighbours. A known limitation of GNNs is that, as the number of layers increases, information gets smoothed and squashed and node embeddings become indistinguishable, negatively affecting performance. Therefore, practical GNN models employ few layers and only leverage the graph structure in terms of limited, small neighbourhoods around each node. Inevitably, practical GNNs do not capture information depending on the global structure of the graph. While there have been several works studying the limitations and expressivity of GNNs, the question of whether practical applications on graph structured data require global structural knowledge or not remains unanswered. In this work, we empirically address this question by giving access to global information to several GNN models, and observing the impact it has on downstream performance. Our results show that global information can in fact provide significant benefits for common graph-related tasks. We further identify a novel regularization strategy that leads to an average accuracy improvement of more than 5% on all considered tasks.

42 pages, 6853 KiB  
Article
Knowledge Management Model for Smart Campus in Indonesia
by Deden Sumirat Hidayat and Dana Indra Sensuse
Data 2022, 7(1), 7; https://doi.org/10.3390/data7010007 - 10 Jan 2022
Cited by 13 | Viewed by 5322
Abstract
The application of smart campuses (SC), especially at higher education institutions (HEI) in Indonesia, is very diverse, and does not yet have standards. As a result, SC practice is spread across various areas in an unstructured and uneven manner. Knowledge management (KM) is one of the critical components of SC. However, the use of KM to support SC is less clearly discussed. Most implementations and assumptions still consider the latest IT application as the SC component. As such, this study aims to identify the components of the KM model for SC. This study used a systematic literature review (SLR) technique with PRISMA procedures, an analytical hierarchy process (AHP), and expert interviews. SLR is used to identify the components of the conceptual model, and AHP is used for model priority component analysis. Interviews were used for validation and model development. The results show that KM, IoT, and big data have the highest trends. Governance, people, and smart education have the highest trends. IT is the highest priority component. The KM model for SC has five main layers grouped in phases of the system cycle. This cycle describes the organization’s intellectual ability to adapt in achieving SC indicators. The knowledge cycle at HEIs focuses on education, research, and community service.

20 pages, 953 KiB  
Article
News Monitor: A Framework for Exploring News in Real-Time
by Nikolaos Panagiotou, Antonia Saravanou and Dimitrios Gunopulos
Data 2022, 7(1), 3; https://doi.org/10.3390/data7010003 - 27 Dec 2021
Cited by 2 | Viewed by 3565
Abstract
News articles generated by online media are a major source of information. In this work, we present News Monitor, a framework that automatically collects news articles from a wide variety of online news portals and performs various analysis tasks. The framework initially identifies fresh news (first stories) and clusters articles about the same incidents. For every story, it first extracts all of the corresponding triples and then creates a knowledge base (KB) using open information extraction techniques. This knowledge base is then used to create a summary for the user. News Monitor allows users to use it as a search engine, ask their questions in natural language and receive answers that have been created by the state-of-the-art framework BERT. In addition, News Monitor crawls the Twitter stream using a dynamic set of “trending” keywords in order to retrieve all messages relevant to the news. The framework is distributed, online and performs analysis in real-time. According to the evaluation results, the fake news detection techniques utilized by News Monitor allow for an F-measure of 82% in the rumor identification task and an accuracy of 92% in the stance detection tasks. The major contribution of this work can be summarized as a novel real-time and scalable architecture that combines various effective techniques under a news analysis framework.

19 pages, 959 KiB  
Article
Shipping Accidents Dataset: Data-Driven Directions for Assessing Accident’s Impact and Improving Safety Onboard
by Panagiotis Panagiotidis, Kyriakos Giannakis, Nikolaos Angelopoulos and Angelos Liapis
Data 2021, 6(12), 129; https://doi.org/10.3390/data6120129 - 03 Dec 2021
Cited by 4 | Viewed by 5645
Abstract
Recent tragic marine incidents indicate that more efficient safety procedures and emergency management systems are needed. During the 2014–2019 period, 320 accidents cost 496 lives, and 5424 accidents caused 6210 injuries. Ideally, we need historical data from real accident cases of ships to develop data-driven solutions. According to the literature, the most critical factor to the post-incident management phase is human error. However, no structured datasets record the crew’s actions during an incident and the human factors that contributed to its occurrence. To overcome the limitations mentioned above, we decided to utilise the unstructured information from accident reports conducted by governmental organisations to create a new, well-structured dataset of maritime accidents and provide intuitions for its usage. Our dataset contains all the information that the majority of the marine datasets include, such as the place, the date, and the conditions during the post-incident phase, e.g., weather data. Additionally, the proposed dataset contains attributes related to each incident’s environmental/financial impact, as well as a concise description of the post-incident events, highlighting the crew’s actions and the human factors that contributed to the incident. We utilise this dataset to predict the incident’s impact and provide data-driven directions regarding the improvement of the post-incident safety procedures for specific types of ships.

11 pages, 1339 KiB  
Article
Learning Interpretable Mixture of Weibull Distributions—Exploratory Analysis of How Economic Development Influences the Incidence of COVID-19 Deaths
by Róbert Csalódi, Zoltán Birkner and János Abonyi
Data 2021, 6(12), 125; https://doi.org/10.3390/data6120125 - 26 Nov 2021
Viewed by 2361
Abstract
This paper presents an algorithm for learning local Weibull models, whose operating regions are represented by fuzzy rules. The applicability of the proposed method is demonstrated in estimating the mortality rate of the COVID-19 pandemic. The reproducible results show that there is a significant difference between mortality rates of countries due to their economic situation, urbanization, and the state of the health sector. The proposed method is compared with the semi-parametric Cox proportional hazard regression method. The distribution functions of these two methods are close to each other, so the proposed method can estimate efficiently.
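
For readers unfamiliar with Weibull mixtures, the sketch below shows how a mixture distribution function is assembled from individual Weibull components. It assumes fixed mixing weights chosen by hand, whereas the abstract above describes operating regions defined by fuzzy rules, so this is a simplified stand-in and not the paper's algorithm. The function names and parameter values are hypothetical.

```python
import math


def weibull_cdf(t, shape, scale):
    """Two-parameter Weibull CDF: F(t) = 1 - exp(-(t / scale) ** shape) for t > 0."""
    if t <= 0:
        return 0.0
    return 1.0 - math.exp(-((t / scale) ** shape))


def mixture_cdf(t, components):
    """CDF of a Weibull mixture.

    components is a list of (weight, shape, scale) tuples; the weights are
    assumed to be non-negative and to sum to 1.
    """
    return sum(w * weibull_cdf(t, shape, scale) for w, shape, scale in components)


# Hypothetical two-component mixture (illustration only).
components = [(0.3, 1.5, 10.0), (0.7, 3.0, 40.0)]
value_at_20 = mixture_cdf(20.0, components)
```

In a model with fuzzy operating regions, the weights would depend on covariates through membership functions instead of being constants; that step is omitted here.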

13 pages, 2919 KiB  
Article
A Principal Components Analysis-Based Method for the Detection of Cannabis Plants Using Representation Data by Remote Sensing
by Carmine Gambardella, Rosaria Parente, Alessandro Ciambrone and Marialaura Casbarra
Data 2021, 6(10), 108; https://doi.org/10.3390/data6100108 - 13 Oct 2021
Cited by 6 | Viewed by 4576
Abstract
Integrating the representation of the territory, through airborne remote sensing activities with hyperspectral and visible sensors, and managing complex data through dimensionality reduction for the identification of cannabis plantations, in Albania, is the focus of the research proposed by the multidisciplinary group of the Benecon University Consortium. In this study, principal components analysis (PCA) was used to remove redundant spectral information from multiband datasets. This makes it easier to identify the most prevalent spectral characteristics in most bands and those that are specific to only a few bands. The survey and airborne monitoring by hyperspectral sensors is carried out with an Itres CASI 1500 sensor owned by Benecon, characterized by a spectral range of 380–1050 nm and 288 configurable channels. The spectral configuration adopted for the research was developed specifically to maximize the spectral separability of cannabis. The ground resolution of the georeferenced cartographic data varies according to the flight planning, inserted in the aerial platform of an Italian Guardia di Finanza’s aircraft, in relation to the orography of the sites under investigation. The geodatabase, wherein the processing of hyperspectral and visible images converge, contains ancillary data such as digital aeronautical maps, digital terrain models, color orthophoto, topographic data and in any case a significant amount of data so that they can be processed synergistically. The goal is to create maps and predictive scenarios, through the application of the spectral angle mapper algorithm, of the cannabis plantations scattered throughout the area. The protocol consists of comparing the spectral data acquired with the CASI 1500 airborne sensor and the spectral signature of the cannabis leaves that have been acquired in the laboratory with ASD Fieldspec PRO FR spectrometers. These scientific studies have demonstrated how it is possible to achieve ex ante control of the evolution of the phenomenon itself for monitoring the cultivation of cannabis plantations.
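
The spectral angle mapper named in the abstract above compares a pixel spectrum with a reference signature by the angle between the two vectors. A minimal Python sketch of that comparison follows; the function names, the threshold of 0.1 radians and the four-band spectra are hypothetical and are not taken from the study.

```python
import math


def spectral_angle(pixel, reference):
    """Angle in radians between two spectra treated as vectors.

    A small angle means the spectra have a similar shape, independent of
    overall brightness (scaling a spectrum does not change the angle).
    """
    dot = sum(p * r for p, r in zip(pixel, reference))
    norm_p = math.sqrt(sum(p * p for p in pixel))
    norm_r = math.sqrt(sum(r * r for r in reference))
    if norm_p == 0 or norm_r == 0:
        raise ValueError("spectra must be non-zero vectors")
    # Clamp to [-1, 1] to guard against floating-point rounding.
    cos_theta = max(-1.0, min(1.0, dot / (norm_p * norm_r)))
    return math.acos(cos_theta)


def matches_reference(pixel, reference, threshold=0.1):
    """Flag a pixel as matching when its angle to the reference is below the threshold."""
    return spectral_angle(pixel, reference) <= threshold


# Hypothetical four-band spectra (illustration only).
reference = [0.05, 0.08, 0.45, 0.50]
bright_pixel = [0.10, 0.16, 0.90, 1.00]   # same shape, twice as bright
other_pixel = [0.40, 0.35, 0.30, 0.25]    # different shape
```

Because only the angle matters, the brighter pixel matches the reference exactly, which is why the method is often described as insensitive to illumination differences.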

Review

20 pages, 456 KiB  
Review
The Role of Human Knowledge in Explainable AI
by Andrea Tocchetti and Marco Brambilla
Data 2022, 7(7), 93; https://doi.org/10.3390/data7070093 - 06 Jul 2022
Cited by 9 | Viewed by 4856
Abstract
As the performance and complexity of machine learning models have grown significantly over the last years, there has been an increasing need to develop methodologies to describe their behaviour. Such a need has mainly arisen due to the widespread use of black-box models, i.e., high-performing models whose internal logic is challenging to describe and understand. Therefore, the machine learning and AI field is facing a new challenge: making models more explainable through appropriate techniques. The final goal of an explainability method is to faithfully describe the behaviour of a (black-box) model to users who can get a better understanding of its logic, thus increasing the trust and acceptance of the system. Unfortunately, state-of-the-art explainability approaches may not be enough to guarantee the full understandability of explanations from a human perspective. For this reason, human-in-the-loop methods have been widely employed to enhance and/or evaluate explanations of machine learning models. These approaches focus on collecting human knowledge that AI systems can then employ or involving humans to achieve their objectives (e.g., evaluating or improving the system). This article aims to present a literature overview on collecting and employing human knowledge to improve and evaluate the understandability of machine learning models through human-in-the-loop approaches. Furthermore, a discussion on the challenges, state-of-the-art, and future trends in explainability is also provided.
(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

30 pages, 4019 KiB  
Review
Machine Learning-Based Algorithms to Knowledge Extraction from Time Series Data: A Review
by Giuseppe Ciaburro and Gino Iannace
Data 2021, 6(6), 55; https://doi.org/10.3390/data6060055 - 25 May 2021
Cited by 18 | Viewed by 8615
Abstract
To predict the future behavior of a system, we can exploit information collected in the past, identifying recurring structures in what has happened in order to predict what could happen if the same structures repeat themselves in the future. A time series is a time-ordered sequence of numerical values of a measurable variable observed in the past. The values are sampled at equidistant time intervals, at an appropriate granularity such as the day, week, or month, and are expressed in physical units of measurement. In machine learning-based algorithms, the information underlying the knowledge is extracted from the data themselves, which are explored and analyzed in search of recurring patterns or to discover hidden causal associations or relationships. The prediction model extracts knowledge through an inductive process: the input is the data and, possibly, a first example of the expected output; the machine then learns the algorithm to follow to obtain the same result. This paper reviews the most recent work that has used machine learning-based techniques to extract knowledge from time series data.
(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)

Other

11 pages, 2166 KiB  
Data Descriptor
Dataset: Mobility Patterns of a Coastal Area Using Traffic Classification Radars
by Joaquim Ferreira, Rui Aguiar, José A. Fonseca, João Almeida, João Barraca, Diogo Gomes, Rafael Oliveira, João Rufino, Fernando Braz and Pedro Gonçalves
Data 2022, 7(7), 97; https://doi.org/10.3390/data7070097 - 13 Jul 2022
Viewed by 1538
Abstract
Monitoring road traffic is extremely important because it makes it possible to study the behavior of road users, to address road design and planning problems, and to predict future traffic. Especially on highways that connect beaches and larger urban areas, traffic is characterized by peaks that are highly dependent on weather conditions and rest periods. This paper describes a dataset of mobility patterns of a coastal area in the Aveiro region, Portugal, fully covered with traffic classification radars, over a two-year period. The sensing infrastructure was deployed in the scope of the PASMO project, an open living lab for co-operative intelligent transportation systems. The data gathered include the speed of the detected objects, their position, and their type (heavy vehicle, light vehicle, two-wheeler, or pedestrian). The dataset includes 74,305 records, corresponding to the aggregation of road information at 10 min intervals. A brief analysis of the dataset shows the highly dynamic nature of traffic during the two-year period. In addition, the availability of meteorological records from nearby stations and of daily data on COVID-19 infections makes it possible to cross-reference information and study the influence of weather conditions and infections on traffic behavior.
(This article belongs to the Special Issue Knowledge Extraction from Data Using Machine Learning)