
Data, Volume 8, Issue 12 (December 2023) – 14 articles

Cover Story: β-Galactosylceramidase (GALC) is a lysosomal enzyme involved in sphingolipid metabolism. Previous observations have shown that GALC exerts pro-oncogenic activity in human melanoma. Here, we investigated the impact of GALC overexpression on the proteomic landscape of BRAF-mutated human melanoma cell lines by mass spectrometry analysis. The data indicate that GALC overexpression causes the upregulation/downregulation of 172/99 proteins in GALC-transduced cells when compared to controls. These proteins belong to various biological processes that include RNA metabolism, cell organelle fate, and intracellular redox status. Overall, these data provide novel insights about the pro-oncogenic function of sphingolipid metabolizing enzymes in melanoma.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • You may sign up for e-mail alerts to receive the table of contents of newly released issues.
  • PDF is the official format for papers published in both HTML and PDF forms. To view a paper in PDF format, click the "PDF Full-text" link and open it with the free Adobe Reader.
4 pages, 416 KiB  
Data Descriptor
Genome Sequence of the Plant-Growth-Promoting Endophyte Curtobacterium flaccumfaciens Strain W004
by Vladimir K. Chebotar, Maria S. Gancheva, Elena P. Chizhevskaya, Maria E. Baganova, Oksana V. Keleinikova, Kharon A. Husainov and Veronika N. Pishchik
Data 2023, 8(12), 187; https://doi.org/10.3390/data8120187 - 09 Dec 2023
Viewed by 1470
Abstract
We report the whole-genome sequence of the endophyte Curtobacterium flaccumfaciens strain W004, isolated from the seeds of winter wheat, cv. Bezostaya 100. The genome was obtained using Oxford Nanopore MinION sequencing. The bacterium has a circular chromosome of 3.63 Mbp with a G+C content of 70.89%. We found that Curtobacterium flaccumfaciens strain W004 could promote the growth of spring wheat plants, resulting in an increase in grain yield of 54.3%. Sequencing the genome of this new strain can provide insights into its potential role in plant–microbe interactions.

19 pages, 11983 KiB  
Data Descriptor
A Qualitative Dataset for Coffee Bio-Aggressors Detection Based on the Ancestral Knowledge of the Cauca Coffee Farmers in Colombia
by Juan Felipe Valencia-Mosquera, David Griol, Mayra Solarte-Montoya, Cristhian Figueroa, Juan Carlos Corrales and David Camilo Corrales
Data 2023, 8(12), 186; https://doi.org/10.3390/data8120186 - 08 Dec 2023
Viewed by 1558
Abstract
This paper describes a novel qualitative dataset regarding coffee pests based on the ancestral knowledge of coffee farmers in the Department of Cauca, Colombia. The dataset has been obtained from a survey administered to coffee growers, with 432 records and 41 variables collected weekly from September 2020 to August 2021. The qualitative dataset includes climatic conditions, productive activities, external conditions, and coffee bio-aggressors. This dataset allows researchers to find patterns for coffee crop protection through ancestral knowledge not detected by real-time agricultural sensors. To the best of our knowledge, no other dataset offers comparable qualitative value in expressing the empirical knowledge that coffee farmers use to detect triggers of causal behaviors of pests and diseases in coffee crops.
(This article belongs to the Section Information Systems and Data Management)

23 pages, 6555 KiB  
Article
Land Cover Classification in the Antioquia Region of the Tropical Andes Using NICFI Satellite Data Program Imagery and Semantic Segmentation Techniques
by Luisa F. Gomez-Ossa, German Sanchez-Torres and John W. Branch-Bedoya
Data 2023, 8(12), 185; https://doi.org/10.3390/data8120185 - 04 Dec 2023
Viewed by 1937
Abstract
Land cover classification, generated from satellite imagery through semantic segmentation, has become fundamental for monitoring land use and land cover change (LULCC). The tropical Andes territory provides opportunities due to its significance in the provision of ecosystem services. However, the lack of reliable data for this region, coupled with challenges arising from its mountainous topography and diverse ecosystems, hinders the description of its coverage. Therefore, this research proposes the Tropical Andes Land Cover Dataset (TALANDCOVER). It is constructed from three sample strategies: aleatory, minimum 50%, and 70% of representation per class, which address imbalanced geographic data. Additionally, the U-Net deep learning model is applied for enhanced and tailored classification of land covers. Using high-resolution data from the NICFI program, our analysis focuses on the Department of Antioquia in Colombia. The TALANDCOVER dataset, presented in TIF format, comprises multiband R-G-B-NIR images paired with six labels (dense forest, grasslands, heterogeneous agricultural areas, bodies of water, built-up areas, and bare-degraded lands) with an estimated 0.76 F1 score compared to ground truth data by expert knowledge and surpassing the precision of existing global cover maps for the study area. To the best of our knowledge, this work is a pioneer in its release of open-source data for segmenting coverages with pixel-wise labeled NICFI imagery at a 4.77 m resolution. 
The experiments carried out with the sample strategies and models yield F1 scores of 0.70, 0.72, and 0.74 for aleatory, balanced 50%, and balanced 70%, respectively, against the expert-segmented sample (ground truth). This suggests that the tailored application of our deep learning model, together with the TALANDCOVER dataset, facilitates the training of deep architectures for classifying large-scale covers in complex areas such as the tropical Andes. This advance has significant potential for decision making, emphasizing sustainable land use and the conservation of natural resources.

12 pages, 7250 KiB  
Data Descriptor
An Urban Image Stimulus Set Generated from Social Media
by Ardaman Kaur, André Leite Rodrigues, Sarah Hoogstraten, Diego Andrés Blanco-Mora, Bruno Miranda, Paulo Morgado and Dar Meshi
Data 2023, 8(12), 184; https://doi.org/10.3390/data8120184 - 01 Dec 2023
Viewed by 1570
Abstract
Social media data, such as photos and status posts, can be tagged with location information (geotagging). This geotagged information can be used for urban spatial analysis to explore neighborhood characteristics or mobility patterns. With increasing rural-to-urban migration, there is a need for comprehensive data capturing the complexity of urban settings and their influence on human experiences. Here, we share an urban image stimulus set from the city of Lisbon that researchers can use in their experiments. The stimulus set consists of 160 geotagged urban space photographs extracted from the Flickr social media platform. We divided the city into 100 × 100 m cells to calculate the cell image density (number of images in each cell) and the cell green index (Normalized Difference Vegetation Index of each cell) and assigned these values to each geotagged image. In addition, we computed the popularity of each image (normalized views on the social network). We also categorized these images into two putative groups by photographer status (residents and tourists), with 80 images belonging to each group. With the rise in data-driven decisions in urban planning, this stimulus set helps explore human–urban environment interaction patterns, especially if complemented with survey/neuroimaging measures or machine-learning analyses.
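The two per-cell measures named in this abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: it assumes photo locations are already expressed in metres in a projected coordinate system, the sample coordinates are hypothetical, and the NDVI function uses the standard definition (NIR − Red) / (NIR + Red) rather than anything stated in the abstract.

```python
from collections import Counter

CELL_SIZE_M = 100.0  # cell edge length given in the abstract (100 x 100 m)


def cell_index(x_m, y_m, cell_size=CELL_SIZE_M):
    """Map projected coordinates (metres) to an integer (column, row) cell index."""
    return (int(x_m // cell_size), int(y_m // cell_size))


def cell_image_density(points):
    """Count images per cell; `points` is an iterable of (x_m, y_m) pairs."""
    return Counter(cell_index(x, y) for x, y in points)


def ndvi(nir, red):
    """Standard NDVI, (NIR - Red) / (NIR + Red); 0.0 when both bands are zero."""
    total = nir + red
    return 0.0 if total == 0 else (nir - red) / total


# Hypothetical photo locations (metres), for illustration only.
photos = [(12.0, 40.0), (95.0, 99.0), (130.0, 20.0), (250.0, 310.0)]
density = cell_image_density(photos)

# Assign each photo the density of the cell it falls in.
for x, y in photos:
    cell = cell_index(x, y)
    print((x, y), "cell", cell, "density", density[cell])
```

Floor division keeps the cell assignment consistent for negative coordinates as well, which matters if the grid origin is not at a corner of the study area.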

9 pages, 4934 KiB  
Data Descriptor
Spectrogram Dataset of Korean Smartphone Audio Files Forged Using the “Mix Paste” Command
by Yeongmin Son, Won Jun Kwak and Jae Wan Park
Data 2023, 8(12), 183; https://doi.org/10.3390/data8120183 - 01 Dec 2023
Cited by 1 | Viewed by 1463
Abstract
This study focuses on the field of voice forgery detection, which is increasing in importance owing to the introduction of advanced voice editing technologies and the proliferation of smartphones. This study introduces a unique dataset that was built specifically to identify forgeries created using the “Mix Paste” technique. This editing technique can overlay audio segments from similar or different environments without creating a new timeframe, making it nearly infeasible to detect forgeries using traditional methods. The dataset consists of 4665 and 45,672 spectrogram images from 1555 original audio files and 15,224 forged audio files, respectively. The original audio was recorded using iPhone and Samsung Galaxy smartphones to ensure a realistic sampling environment. The forged files were created from these recordings and subsequently converted into spectrograms. The dataset also provides the metadata of the original voice files, offering additional context and information that could be used for analysis and detection. This dataset not only fills a gap in existing research but also provides valuable support for developing more efficient deep learning models for voice forgery detection. By addressing the “Mix Paste” technique, the dataset caters to a critical need in voice authentication and forensics, potentially contributing to enhancing security in society.
(This article belongs to the Section Information Systems and Data Management)

22 pages, 2667 KiB  
Article
An Automated Big Data Quality Anomaly Correction Framework Using Predictive Analysis
by Widad Elouataoui, Saida El Mendili and Youssef Gahi
Data 2023, 8(12), 182; https://doi.org/10.3390/data8120182 - 01 Dec 2023
Viewed by 1913
Abstract
Big data has emerged as a fundamental component in various domains, enabling organizations to extract valuable insights and make informed decisions. However, ensuring data quality is crucial for effectively using big data. Thus, big data quality has been gaining more attention in recent years from researchers and practitioners due to its significant impact on decision-making processes. However, existing studies addressing data quality anomalies often have a limited scope, concentrating on specific aspects such as outliers or inconsistencies. Moreover, many approaches are context-specific, lacking a generic solution applicable across different domains. To the best of our knowledge, no existing framework automatically addresses quality anomalies comprehensively and generically, considering all aspects of data quality. To fill these gaps, we propose a framework that automatically corrects big data quality anomalies using an intelligent predictive model. The proposed framework comprehensively addresses the main aspects of data quality by considering six key quality dimensions: Accuracy, Completeness, Conformity, Uniqueness, Consistency, and Readability. Moreover, the framework is not tied to a specific field and is designed to be applicable across various areas, offering a generic approach to address data quality anomalies. The proposed framework was implemented on two datasets and achieved an accuracy of 98.22%. Moreover, the results show that the framework raised the data quality score to 99%, with an improvement rate of up to 14.76%.
(This article belongs to the Section Information Systems and Data Management)

32 pages, 5551 KiB  
Data Descriptor
Internationalization in the Baltic Regional Accounts: A NUTS 3 Region Dataset
by Rasmus Bøgh Holmen, Nicolas Gavoille, Jaan Masso and Arūnas Burinskas
Data 2023, 8(12), 181; https://doi.org/10.3390/data8120181 - 30 Nov 2023
Viewed by 1300
Abstract
Features of internationalization, such as trade, foreign direct investments, and international migration, are crucial for understanding the economic developments of small and open economies. However, studying internationalization at the country level may obscure significant heterogeneity in its relationship with economic growth and other economic and social outcomes. Regional accounts provide insights into the geography of internationalization, but collections of such disaggregated statistics are rarely provided by statistical bureaus. The purpose of this paper is twofold. First, we demonstrate how regional account data, including internationalization indicators, can be constructed to obtain consistent and homogeneous regional-level series using a combination of micro and macro data sources. Second, we aim to foster spatial research on internationalization and the spatial economy in the Baltics by providing a comprehensive collection of socio-economic variables at the NUTS 3 regional level over time. This collection encompasses trade, FDI, and migration, enabling the study of internationalization and other features of the Baltic economy. We present a series of key features, revealing noticeable correlation patterns between regional development and internationalization.

27 pages, 6888 KiB  
Article
Public Perception of ChatGPT and Transfer Learning for Tweets Sentiment Analysis Using Wolfram Mathematica
by Yankang Su and Zbigniew J. Kabala
Data 2023, 8(12), 180; https://doi.org/10.3390/data8120180 - 28 Nov 2023
Cited by 3 | Viewed by 2310
Abstract
Understanding public opinion on ChatGPT is crucial for recognizing its strengths and areas of concern. By utilizing natural language processing (NLP), this study delves into tweets regarding ChatGPT to determine temporal patterns, content features, and topic modeling and perform a sentiment analysis. Analyzing a dataset of 500,000 tweets, our research shifts from conventional data science tools like Python and R to exploit Wolfram Mathematica’s robust capabilities. Additionally, with the aim of solving the problem of ignoring semantic information in the LDA model feature extraction, a synergistic methodology entwining LDA, GloVe embeddings, and K-Nearest Neighbors (KNN) clustering is proposed to categorize topics within ChatGPT-related tweets. This comprehensive strategy ensures semantic, syntactic, and topical congruence within classified groups by utilizing the strengths of probabilistic modeling, semantic embeddings, and similarity-based clustering. While built-in sentiment classifiers often fall short in accuracy, we introduce four transfer learning techniques from the Wolfram Neural Net Repository to address this gap. Two of these techniques involve transferring static word embeddings, “GloVe” and “ConceptNet”, which are further processed using an LSTM layer. The remaining techniques center on fine-tuning pre-trained models using scantily annotated data; one refines embeddings from language models (ELMo), while the other fine-tunes bidirectional encoder representations from transformers (BERT). Our experiments on the dataset underscore the effectiveness of the four methods for the sentiment analysis of tweets. This investigation augments our comprehension of user sentiment towards ChatGPT and emphasizes the continued significance of exploration in this domain. 
Furthermore, this work serves as a pivotal reference for scholars who are accustomed to using Wolfram Mathematica in other research domains, aiding their efforts in text analytics on social media platforms.
(This article belongs to the Special Issue Sentiment Analysis in Social Media Data)

33 pages, 2391 KiB  
Article
A Tourist-Based Framework for Developing Digital Marketing for Small and Medium-Sized Enterprises in the Tourism Sector in Saudi Arabia
by Rishaa Abdulaziz Alnajim and Bahjat Fakieh
Data 2023, 8(12), 179; https://doi.org/10.3390/data8120179 - 28 Nov 2023
Viewed by 2197
Abstract
Social media has become an essential tool for travel planning, with tourists increasingly using it to research destinations, book accommodation, and make travel arrangements. However, little is known about how tourists use social media for travel planning and what factors influence their intentions to use social media for this purpose. This thesis aims to understand tourists’ intentions to use social media for travel planning. Specifically, it investigates the factors influencing tourists’ intentions to use social media for planning travel to Saudi Arabia. It develops a machine learning (ML) classification model to assist Saudi tourism SMEs in creating effective digital marketing strategies for social media platforms. A survey was conducted with 573 tourists interested in visiting Saudi Arabia, using the Design Science Research (DSR) approach. The findings support the tourist-based theoretical framework, showing that perceived usefulness (PU), perceived ease of use (PEOU), satisfaction (SAT), marketing-generated content (MGC), and user-generated content (UGC) significantly impact tourists’ intentions to use social media for travel planning. Tourists’ characteristics and visit characteristics influenced their intentions to use MGC but not UGC. The tourist-based ML classification model, developed using the LinearSVC algorithm, achieved an accuracy of 99% when evaluated using the K-Fold Cross-Validation (KF-CV) technique. The findings of this study have several implications for Saudi tourism SMEs. First, the results suggest that SMEs should focus on developing social media content that is perceived as useful, easy to use, and satisfying. Second, the findings suggest that SMEs should focus on using MGC in their social media marketing campaigns. Third, the results suggest that SMEs should tailor their social media marketing campaigns to the characteristics of their target tourists. 
This study contributes to the literature on tourism marketing and social media by providing a better understanding of how tourists use social media for travel planning. Saudi tourism SMEs can use the findings of this study to develop more effective digital marketing strategies for social media platforms.
(This article belongs to the Topic Decision-Making and Data Mining for Sustainable Computing)

9 pages, 5659 KiB  
Data Descriptor
In Vivo Drug Testing during Embryonic Wound Healing: Establishing the Avian Model
by Martin Bablok, Beate Brand-Saberi, Morris Gellisch and Gabriela Morosan-Puopolo
Data 2023, 8(12), 178; https://doi.org/10.3390/data8120178 - 25 Nov 2023
Viewed by 1370
Abstract
The relevance of identifying pathological processes in the context of embryonic development is increasingly gaining attention in terms of professionalized prenatal care. To analyze local effects of prenatally administered drugs during embryonic development, the model organism of the chicken embryo can be used in a first exploratory approach. For the examination of local dexamethasone administration—as an exemplary drug—common bead implantation protocols have been adapted to serve as an in vivo technique for local drug testing during embryonic skin regeneration. For this, acrylic beads were soaked in a dexamethasone solution and implanted into skin incisional wounds of 4-day-old chicken embryos. After further incubation, the effects of the applied substance on the process of embryonic skin regeneration were analyzed using histological and molecular biological techniques. This data descriptor contains a detailed microsurgical protocol, a representative video demonstration, and exemplary results of local glucocorticoid-induced changes during embryonic wound healing. To conclude, this method allows for the analysis of the local effects of a particular substance on a cellular level and can be extended to serve as an in vivo technique for numerous other drugs to be tested on embryonic tissue.

7 pages, 1502 KiB  
Data Descriptor
Dataset: Impact of β-Galactosylceramidase Overexpression on the Protein Profile of Braf(V600E) Mutated Melanoma Cells
by Davide Capoferri, Paola Chiodelli, Stefano Calza, Marcello Manfredi and Marco Presta
Data 2023, 8(12), 177; https://doi.org/10.3390/data8120177 - 24 Nov 2023
Cited by 1 | Viewed by 1375
Abstract
β-Galactosylceramidase (GALC) is a lysosomal enzyme involved in sphingolipid metabolism by removing β-galactosyl moieties from β-galactosyl ceramide and β-galactosyl sphingosine. Previous observations have shown that GALC exerts a pro-oncogenic activity in human melanoma. Here, the impact of GALC overexpression on the proteomic landscape of BRAF-mutated A2058 and A375 human melanoma cell lines was investigated by liquid chromatography–tandem mass spectrometry analysis of the cell extracts. The results indicate that GALC overexpression causes the upregulation/downregulation of 172/99 proteins in GALC-transduced cells when compared to control cells. Gene ontology categorization of up/down-regulated proteins indicates that GALC may modulate the protein landscape in BRAF-mutated melanoma cells by affecting various biological processes, including RNA metabolism, cell organelle fate, and intracellular redox status. Overall, these data provide further insights into the pro-oncogenic functions of the sphingolipid metabolizing enzyme GALC in human melanoma.

13 pages, 348 KiB  
Article
Model Design and Applied Methodology in Geothermal Simulations in Very Low Enthalpy for Big Data Applications
by Roberto Arranz-Revenga, María Pilar Dorrego de Luxán, Juan Herrera Herbert and Luis Enrique García Cambronero
Data 2023, 8(12), 176; https://doi.org/10.3390/data8120176 - 23 Nov 2023
Viewed by 1461
Abstract
Low-enthalpy geothermal installations for heating, air conditioning, and domestic hot water are gaining traction due to efforts towards energy decarbonization. This article is part of a broader research project aimed at employing artificial intelligence and big data techniques to develop a predictive system for the thermal behavior of the ground in very low-enthalpy geothermal applications. In this initial article, a summarized process is outlined to generate large quantities of synthetic data through a ground simulation method. The proposed theoretical model allows simulation of the soil’s thermal behavior using an electrical equivalent. The electrical circuit derived is loaded into a simulation program along with an input function representing the system’s thermal load pattern. The simulator responds with another function that calculates the values of the ground over time. Some examples of value conversion and the utility of the input function system to encode thermal loads during simulation are demonstrated. The model is not valid in the presence of underground water currents. Model validation is pending; once the model is defined, a corresponding testing plan will be proposed.

6 pages, 2139 KiB  
Data Descriptor
Long-Term Spatiotemporal Oceanographic Data from the Northeast Pacific Ocean: 1980–2022 Reconstruction Based on the Korea Oceanographic Data Center (KODC) Dataset
by Seong-Hyeon Kim and Hansoo Kim
Data 2023, 8(12), 175; https://doi.org/10.3390/data8120175 - 23 Nov 2023
Viewed by 1223
Abstract
The Korea Oceanographic Data Center (KODC), overseen by the National Institute of Fisheries Science (NIFS), is a pivotal hub for collecting, processing, and disseminating marine science data. By digitizing and subjecting observational data to rigorous quality control, the KODC ensures accurate information in line with international standards. The center actively engages in global partnerships and fosters marine data exchange. A wide array of marine information is provided through the KODC website, including observational metadata, coastal oceanographic data, real-time buoy records, and fishery environmental data. Coastal oceanographic observational data from 207 stations across various sea regions have been collected biannually since 1961. This dataset covers 14 standard water depths; includes essential parameters, such as temperature, salinity, nutrients, and pH; serves as the foundation for news, reports, and analyses by the NIFS; and is widely employed to study seasonal and regional marine variations, with researchers supplementing the limited data for comprehensive insights. The dataset offers information for each water depth at a 1 m interval over 1980–2022, facilitating research across disciplines. Data processing, including interpolation and quality control, is based on MATLAB. These data are classified by region and accessible online; hence, researchers can easily explore spatiotemporal trends in marine environments.
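As an illustration of how observations at a few standard depths could be resampled to the 1 m interval mentioned in this abstract, here is a minimal Python sketch. It assumes plain linear interpolation between neighbouring depths; the abstract only states that interpolation was done in MATLAB and does not name the method, and the depths and temperatures below are hypothetical.

```python
def interpolate_profile(depths_m, values, step_m=1):
    """Linearly interpolate a profile onto a regular depth grid.

    depths_m must be strictly increasing. No extrapolation is done beyond
    the shallowest and deepest observed depths.
    """
    if len(depths_m) != len(values) or len(depths_m) < 2:
        raise ValueError("need at least two matching depth/value pairs")
    if any(b <= a for a, b in zip(depths_m, depths_m[1:])):
        raise ValueError("depths must be strictly increasing")

    result = []
    seg = 0  # index of the segment [depths_m[seg], depths_m[seg + 1]]
    depth = depths_m[0]
    while depth <= depths_m[-1]:
        # Advance to the segment that contains the current depth.
        while depth > depths_m[seg + 1]:
            seg += 1
        d0, d1 = depths_m[seg], depths_m[seg + 1]
        v0, v1 = values[seg], values[seg + 1]
        frac = (depth - d0) / (d1 - d0)
        result.append((depth, v0 + frac * (v1 - v0)))
        depth += step_m
    return result


# Hypothetical temperature observations (deg C) at three depths (m).
observed_depths = [0, 10, 20]
observed_temps = [20.0, 18.0, 15.0]
profile = interpolate_profile(observed_depths, observed_temps)
print(len(profile), profile[5], profile[15])
```

Linear interpolation is the simplest choice and never overshoots the observed values; a spline or other scheme may suit real profiles with sharp thermoclines better.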
(This article belongs to the Collection Modern Geophysical and Climate Data Analysis: Tools and Methods)

15 pages, 1982 KiB  
Article
Machine Learning Applications to Identify Young Offenders Using Data from Cognitive Function Tests
by María Claudia Bonfante, Juan Contreras Montes, Mariana Pino, Ronald Ruiz and Gabriel González
Data 2023, 8(12), 174; https://doi.org/10.3390/data8120174 - 21 Nov 2023
Viewed by 1528
Abstract
Machine learning techniques can be used to identify whether deficits in cognitive functions contribute to antisocial and aggressive behavior. This paper initially presents the results of tests conducted on delinquent and nondelinquent youths to assess their cognitive functions. The dataset extracted from these assessments, consisting of 37 predictor variables and one target, was used to train three algorithms which aim to predict whether the data correspond to those of a young offender or a nonoffending youth. Prior to this, statistical tests were conducted on the data to identify characteristics which exhibited significant differences in order to select the most relevant features and optimize the prediction results. Additionally, other feature selection methods, such as Boruta, RFE, and filter, were applied, and their effects on the accuracy of each of the three machine learning models used (SVM, RF, and KNN) were compared. In total, 80% of the data were utilized for training, while the remaining 20% were used for validation. The best result was achieved by the KNN model, trained with 19 features selected by the Boruta method, followed by the SVM model, trained with 24 features selected by the filter method.
