Data, Volume 8, Issue 6 (June 2023) – 20 articles

Cover Story: Harnessing the power of electroencephalography (EEG) as a potential biomarker for quantifying dementias such as Alzheimer's disease or frontotemporal dementia has long been the focus of extensive research. While the exploration of dementia biomarkers and the investigation into automatic diagnoses are ongoing, progress in these areas has been hindered by the scarcity of publicly available datasets. In a groundbreaking contribution, our paper presents the first publicly accessible dataset of EEG recordings, encompassing patients with Alzheimer's disease, frontotemporal dementia, and healthy individuals. By providing this invaluable resource, we aim to accelerate research in the field and foster collaboration among diverse teams.
  • Issues are regarded as officially published after their release is announced to the table of contents alert mailing list.
  • PDF is the official format for papers published in both HTML and PDF forms. To view a paper in PDF format, click on the "PDF Full-text" link and open it with the free Adobe Reader.
10 pages, 8394 KiB  
Data Descriptor
RipSetCocoaCNCH12: Labeled Dataset for Ripeness Stage Detection, Semantic and Instance Segmentation of Cocoa Pods
by Juan Felipe Restrepo-Arias, María Isabel Salinas-Agudelo, María Isabel Hernandez-Pérez, Alejandro Marulanda-Tobón and María Camila Giraldo-Carvajal
Data 2023, 8(6), 112; https://doi.org/10.3390/data8060112 - 18 Jun 2023
Viewed by 1856
Abstract
Fruit counting and ripeness detection are computer vision applications that have gained strength in recent years due to the advancement of new algorithms, especially those based on artificial neural networks (ANNs), better known as deep learning. In agriculture, those algorithms capable of fruit counting, including information about their ripeness, are mainly applied to make production forecasts or plan different activities such as fertilization or crop harvest. This paper presents the RipSetCocoaCNCH12 dataset of cocoa pods labeled at four different ripeness stages: stage 1 (0–2 months), stage 2 (2–4 months), stage 3 (4–6 months), and harvest stage (>6 months). An additional class was also included for pods aborted by plants in the early stage of development. A total of 4116 images were labeled to train algorithms that mainly perform semantic and instance segmentation. The labeling was carried out with CVAT (Computer Vision Annotation Tool). The dataset, therefore, includes labeling in two formats: COCO 1.0 and segmentation mask 1.1. The images were taken with different mobile devices (smartphones), in field conditions, during the harvest season at different times of the day, which could allow the algorithms to be trained with data that includes many variations in lighting, colors, textures, and sizes of the cocoa pods. As far as we know, this is the first openly available dataset for cocoa pod detection with semantic segmentation for five classes, 4116 images, and 7917 instances, comprising RGB images and two different formats for labels. With the publication of this dataset, we expect that researchers in smart farming, especially in cocoa cultivation, can benefit from the quantity and variety of images it contains. Full article
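For readers who want to work with the COCO 1.0 annotations mentioned above, a minimal loading sketch with pycocotools is shown below; the annotation file name is an assumption for illustration, not the dataset's actual layout.

```python
# Minimal sketch of reading COCO 1.0 annotations such as those shipped with
# RipSetCocoaCNCH12; the file name "instances.json" is an assumption.
from pycocotools.coco import COCO

coco = COCO("instances.json")               # load the annotation index
print(coco.loadCats(coco.getCatIds()))      # ripeness classes defined in the file

img_ids = coco.getImgIds()
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[:1]))
for ann in anns:
    mask = coco.annToMask(ann)              # binary instance mask (H x W)
    print(ann["category_id"], mask.sum(), "pixels")
```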

7 pages, 224 KiB  
Data Descriptor
Self-Reported Mental Health and Psychosocial Correlates during the COVID-19 Pandemic: Data from the General Population in Italy
by Daniela Marchetti, Roberta Maiella, Rocco Palumbo, Melissa D’Ettorre, Irene Ceccato, Marco Colasanti, Adolfo Di Crosta, Pasquale La Malva, Emanuela Bartolini, Daniela Biasone, Nicola Mammarella, Piero Porcelli, Alberto Di Domenico and Maria Cristina Verrocchio
Data 2023, 8(6), 111; https://doi.org/10.3390/data8060111 - 16 Jun 2023
Viewed by 1329
Abstract
The COVID-19 pandemic tremendously impacted people’s day-to-day activities and mental health. This article describes the dataset used to investigate the psychological impact of the first national lockdown on the general Italian population. For this purpose, an online survey was disseminated via Qualtrics between 1 April and 20 April 2020, to record various socio-demographic and psychological variables. The measures included both validated (namely, the Impact of the Event Scale-Revised, the Perceived Stress Scale, the nine-item Patient Health Questionnaire, the seven-item Generalized Anxiety Disorder scale, the Big Five Inventory 10-Item, and the Whiteley Index-7) and ad hoc questionnaires (nine items to investigate in-group and out-group trust). The final sample comprised 4081 participants (18–85 years old). The dataset could be helpful to other researchers in understanding the psychological impact of the COVID-19 pandemic and its related preventive and protective measures. Furthermore, the present data might help shed some light on the role of individual differences in response to traumatic events. Finally, this dataset can increase the knowledge in investigating psychological distress, health anxiety, and personality traits. Full article
27 pages, 604 KiB  
Article
Deep Learning-Based Black Spot Identification on Greek Road Networks
by Ioannis Karamanlis, Alexandros Kokkalis, Vassilios Profillidis, George Botzoris, Chairi Kiourt, Vasileios Sevetlidis and George Pavlidis
Data 2023, 8(6), 110; https://doi.org/10.3390/data8060110 - 16 Jun 2023
Cited by 1 | Viewed by 3364
Abstract
Black spot identification, a spatiotemporal phenomenon, involves analysing the geographical location and time-based occurrence of road accidents. Typically, this analysis examines specific locations on road networks during set time periods to pinpoint areas with a higher concentration of accidents, known as black spots. By evaluating these problem areas, researchers can uncover the underlying causes and reasons for increased collision rates, such as road design, traffic volume, driver behaviour, weather, and infrastructure. However, challenges in identifying black spots include limited data availability, data quality, and assessing contributing factors. Additionally, evolving road design, infrastructure, and vehicle safety technology can affect black spot analysis and determination. This study focused on traffic accidents in Greek road networks to recognize black spots, utilizing data from police and government-issued car crash reports. The study produced a publicly available dataset called Black Spots of North Greece (BSNG) and a highly accurate identification method. Full article
(This article belongs to the Special Issue Signal Processing for Data Mining)

16 pages, 5580 KiB  
Data Descriptor
Dataset of Program Source Codes Solving Unique Programming Exercises Generated by Digital Teaching Assistant
by Liliya A. Demidova, Elena G. Andrianova, Peter N. Sovietov and Artyom V. Gorchakov
Data 2023, 8(6), 109; https://doi.org/10.3390/data8060109 - 14 Jun 2023
Cited by 6 | Viewed by 1857
Abstract
This paper presents a dataset containing automatically collected source codes solving unique programming exercises of different types. The programming exercises were automatically generated by the Digital Teaching Assistant (DTA) system that automates a massive Python programming course at MIREA—Russian Technological University (RTU MIREA). Source codes of the small programs grouped by the type of the solved task can be used for benchmarking source code classification and clustering algorithms. Moreover, the data can be used for training intelligent program synthesizers or benchmarking mutation testing frameworks, and more applications are yet to be discovered. We describe the architecture of the DTA system, aiming to provide detailed insight regarding how and why the dataset was collected. In addition, we describe the algorithms responsible for source code analysis in the DTA system. These algorithms use vector representations of programs based on Markov chains, compute pairwise Jensen–Shannon divergences of programs, and apply hierarchical clustering algorithms in order to automatically discover high-level concepts used by students while solving unique tasks. The proposed approach can be incorporated into massive programming courses when there is a need to identify approaches implemented by students. Full article
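As a rough illustration of the clustering approach summarised above, the sketch below builds transition-probability vectors from token streams (a crude stand-in for the AST-based Markov chains used by the DTA system), compares them with the Jensen–Shannon distance, and clusters them hierarchically; the toy programs and all parameters are illustrative.

```python
# Rough sketch of the clustering idea: turn each program into a probability
# vector over token-type transitions, compare vectors with the Jensen-Shannon
# distance, and cluster hierarchically.
import io
import itertools
import tokenize

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import jensenshannon

KINDS = ("NAME", "OP", "NUMBER", "STRING")

def transition_vector(source):
    toks = [tokenize.tok_name[t.type]
            for t in tokenize.generate_tokens(io.StringIO(source).readline)]
    pairs = list(itertools.product(KINDS, repeat=2))
    counts = np.array([sum(a == x and b == y for a, b in zip(toks, toks[1:]))
                       for x, y in pairs], dtype=float)
    return counts / counts.sum()

programs = ["print(1 + 2)", "x = 3\nprint(x * x)", "def f(a):\n    return a + 1\n"]
vectors = [transition_vector(p) for p in programs]
condensed = [jensenshannon(u, v) for u, v in itertools.combinations(vectors, 2)]
labels = fcluster(linkage(condensed, method="average"), t=2, criterion="maxclust")
print(labels)   # cluster label per program
```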

14 pages, 2190 KiB  
Article
How Expert Is the Crowd? Insights into Crowd Opinions on the Severity of Earthquake Damage
by Motti Zohar, Amos Salamon and Carmit Rapaport
Data 2023, 8(6), 108; https://doi.org/10.3390/data8060108 - 14 Jun 2023
Viewed by 1251
Abstract
The evaluation of earthquake damage is central to assessing its severity and damage characteristics. However, the methods of assessment encounter difficulties concerning the subjective judgments and interpretation of the evaluators. Thus, it is mainly geologists, seismologists, and engineers who perform this exhausting task. Here, we explore whether an evaluation made by semiskilled people and by the crowd is equivalent to the experts’ opinions and, thus, can be harnessed as part of the process. Therefore, we conducted surveys in which a cohort of graduate students studying natural hazards (n = 44) and an online crowd (n = 610) were asked to evaluate the level of severity of earthquake damage. The two outcome datasets were then compared with the evaluation made by two of the present authors, who are considered experts in the field. Interestingly, the evaluations of both the semiskilled cohort and the crowd were found to be fairly similar to those of the experts, thus suggesting that they can provide an interpretation close enough to an expert’s opinion on the severity level of earthquake damage. Such an understanding may indicate that although our analysis is preliminary and requires more case studies for this to be verified, there is vast potential encapsulated in crowd-sourced opinion on simple earthquake-related damage, especially if a large amount of data is to be handled. Full article

12 pages, 496 KiB  
Article
A Preliminary Investigation of a Single Shock Impact on Italian Mortality Rates Using STMF Data: A Case Study of COVID-19
by Maria Francesca Carfora and Albina Orlando
Data 2023, 8(6), 107; https://doi.org/10.3390/data8060107 - 13 Jun 2023
Viewed by 1058
Abstract
Mortality shocks, such as pandemics, threaten the consolidated longevity improvements, confirmed in the last decades for the majority of western countries. Indeed, just before the COVID-19 pandemic, mortality was falling for all ages, with a different behavior according to different ages and countries. It is indubitable that the changes in the population longevity induced by shock events, even transitory ones, affecting demographic projections, have financial implications in public spending as well as in pension plans and life insurance. The Short Term Mortality Fluctuations (STMF) data series, providing data of all-cause mortality fluctuations by week within each calendar year for 38 countries worldwide, offers a powerful tool to timely analyze the effects of the mortality shock caused by the COVID-19 pandemic on Italian mortality rates. This dataset, recently made available as a new component of the Human Mortality Database, is described and techniques for the integration of its data with the historical mortality time series are proposed. Then, to forecast mortality rates, the well-known stochastic mortality model proposed by Lee and Carter in 1992 is first considered, to be consistent with the internal processing of the Human Mortality Database, where exposures are estimated by the Lee–Carter model; empirical results are discussed both on the estimation of the model coefficients and on the forecast of the mortality rates. In detail, we show how the integration of the yearly aggregated STMF data in the HMD database allows the Lee–Carter model to capture the complex evolution of the Italian mortality rates, including the higher lethality for males and older people, in the years that follow a large shock event such as the COVID-19 pandemic. Finally, we discuss some key points concerning the improvement of existing models to take into account mortality shocks and evaluate their impact on future mortality dynamics. Full article
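For reference, the Lee–Carter model referred to above is conventionally written as follows (standard formulation, not reproduced from the paper itself):

```latex
% Standard Lee-Carter formulation: a_x is the average age profile of log
% mortality, b_x the age-specific sensitivity to the period index k_t
% (typically estimated via SVD of the centred log-rate matrix), and k_t is
% usually forecast as a random walk with drift.
\ln m_{x,t} = a_x + b_x \, k_t + \varepsilon_{x,t},
\qquad \sum_x b_x = 1, \qquad \sum_t k_t = 0 .
```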

9 pages, 4359 KiB  
Data Descriptor
Curated Dataset for Red Blood Cell Tracking from Video Sequences of Flow in Microfluidic Devices
by Ivan Cimrák, Peter Tarábek and František Kajánek
Data 2023, 8(6), 106; https://doi.org/10.3390/data8060106 - 13 Jun 2023
Viewed by 1363
Abstract
This work presents a dataset comprising images, annotations, and velocity fields for benchmarking cell detection and cell tracking algorithms. The dataset includes two video sequences captured during laboratory experiments, showcasing the flow of red blood cells (RBC) in microfluidic channels. From the first video 300 frames and from the second video 150 frames are annotated with bounding boxes around the cells, as well as tracks depicting the movement of individual cells throughout the video. The dataset encompasses approximately 20,000 bounding boxes and 350 tracks. Additionally, computational fluid dynamics simulations were utilized to generate 2D velocity fields representing the flow within the channels. These velocity fields are included in the dataset. The velocity field has been employed to improve cell tracking by predicting the positions of cells across frames. The paper also provides a comprehensive discussion on the utilization of the flow matrix in the tracking steps. Full article
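The idea of using the velocity field to predict cell positions across frames can be sketched as follows; the grid shape, pixel units, time step, and matching rule are assumptions for illustration, not the dataset's conventions.

```python
# Sketch of velocity-assisted tracking: advect each tracked cell by the local
# flow to predict where it should appear in the next frame, then match the
# prediction to the nearest detected cell centre.
import numpy as np

def predict_position(pos, vx, vy, dt=1.0):
    """pos = (row, col); vx, vy are 2D velocity components on the image grid."""
    r, c = int(round(pos[0])), int(round(pos[1]))
    return (pos[0] + vy[r, c] * dt, pos[1] + vx[r, c] * dt)

def match(prediction, detections):
    """Pick the detected cell centre closest to the flow-based prediction."""
    d = np.linalg.norm(np.asarray(detections) - np.asarray(prediction), axis=1)
    return int(np.argmin(d))

vx = np.full((100, 200), 3.0)      # toy rightward flow, 3 px/frame
vy = np.zeros((100, 200))
print(match(predict_position((50.0, 80.0), vx, vy), [(50, 83), (20, 90)]))  # -> 0
```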

22 pages, 1514 KiB  
Article
Assessing the Effectiveness of Masking and Encryption in Safeguarding the Identity of Social Media Publishers from Advanced Metadata Analysis
by Mohammed Khader and Marcel Karam
Data 2023, 8(6), 105; https://doi.org/10.3390/data8060105 - 13 Jun 2023
Cited by 1 | Viewed by 2808
Abstract
Machine learning algorithms, such as KNN, SVM, MLP, RF, and MLR, are used to extract valuable information from shared digital data on social media platforms through their APIs in an effort to identify anonymous publishers or online users. This can leave these anonymous publishers vulnerable to privacy-related attacks, as identifying information can be revealed. Twitter is an example of such a platform where identifying anonymous users/publishers is made possible by using machine learning techniques. To provide these anonymous users with stronger protection, we have examined the effectiveness of these techniques when critical fields in the metadata are masked or encrypted using tweets (text and images) from Twitter. Our results show that SVM achieved the highest accuracy rate of 95.81% without using data masking or encryption, while SVM achieved the highest identity recognition rate of 50.24% when using data masking and AES encryption algorithm. This indicates that data masking and encryption of metadata of tweets (text and images) can provide promising protection for the anonymity of users’ identities. Full article
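The two de-identification treatments discussed above, masking versus encrypting metadata fields, can be illustrated roughly as below; the field names are made up, and Fernet is used only as a convenient AES-based stand-in for the paper's encryption scheme.

```python
# Toy illustration of masking a metadata field versus encrypting it.
from cryptography.fernet import Fernet

tweet_meta = {"user_id": "123456789", "geo": "31.95,35.93", "client": "Twitter Web App"}

masked = {k: "***" for k in tweet_meta}                      # masking: value removed
key = Fernet.generate_key()
cipher = Fernet(key)                                         # AES-based symmetric cipher
encrypted = {k: cipher.encrypt(v.encode()).decode() for k, v in tweet_meta.items()}

print(masked)
print(encrypted["user_id"][:24], "...")                      # opaque without the key
```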

18 pages, 7783 KiB  
Article
Comparison of ARIMA and LSTM in Predicting Structural Deformation of Tunnels during Operation Period
by Chuangfeng Duan, Min Hu and Haozuan Zhang
Data 2023, 8(6), 104; https://doi.org/10.3390/data8060104 - 13 Jun 2023
Cited by 1 | Viewed by 1327
Abstract
Accurately predicting the structural deformation trend of tunnels during operation is significant to improve the scientificity of tunnel safety maintenance. With the development of data science, structural deformation prediction methods based on time-series data have attracted attention. Auto Regressive Integrated Moving Average model (ARIMA) is a classical statistical analysis model, which is suitable for processing non-stationary time-series data. Long- and Short-Term Memory (LSTM) is a special cyclic neural network that can learn long-term dependent information in time series. Both are widely used in the field of temporal prediction. In view of the lack of time-series prediction in the tunnel deformation field, the body of this paper uses historical data of the Xinjian Road and the Dalian Road tunnel in Shanghai to propose a new way of modeling based on single points and road sections. ARIMA and LSTM models are applied in comprehensive experiments, and the results show that: (1) Both LSTM and ARIMA models have great performance for settlement and convergence deformation. (2) The overall robustness of ARIMA is better than that of LSTM, and it is more adaptable to the datasets. (3) The model prediction performance is closely related to the data quality. ARIMA has more stable performance under the lack of data volume, while LSTM has better performance with high-quality data and higher upper limit. Full article
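A minimal ARIMA baseline of the kind compared in the paper can be set up as sketched below; the synthetic settlement series and the (p, d, q) order are placeholders rather than the paper's tuned configuration.

```python
# Minimal ARIMA forecasting baseline on a toy deformation series.
import numpy as np
from statsmodels.tsa.arima.model import ARIMA

rng = np.random.default_rng(0)
settlement = np.cumsum(rng.normal(0.02, 0.05, size=200))   # toy drifting series (mm)

model = ARIMA(settlement, order=(2, 1, 1)).fit()
print(model.forecast(steps=10))                            # next 10 monitoring steps
```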

12 pages, 945 KiB  
Data Descriptor
Physico-Chemical Quality and Physiological Profiles of Microbial Communities in Freshwater Systems of Mega Manila, Philippines
by Marie Christine M. Obusan, Arizaldo E. Castro, Ren Mark D. Villanueva, Margareth Del E. Isagan, Jamaica Ann A. Caras and Jessica F. Simbahan
Data 2023, 8(6), 103; https://doi.org/10.3390/data8060103 - 04 Jun 2023
Cited by 1 | Viewed by 2361
Abstract
Studying the quality of freshwater systems and drinking water in highly urbanized megalopolises around the world remains a challenge. This article reports data on the quality of select freshwater systems in Mega Manila, Philippines. Water samples collected between 2020 and 2021 were analyzed for physico-chemical parameters and microbial community metabolic fingerprints, i.e., carbon substrate utilization patterns (CSUPs). The detection of arsenic, lead, cadmium, mercury, polyaromatic hydrocarbons (PAHs), and organochlorine pesticides (OCPs) was carried out using standard chromatography- and spectroscopy-based protocols. Physiological profiles were determined using the Biolog EcoPlate™ system. Eight samples were free of heavy metals, and none contained PAHs or OCPs. Fourteen samples had high microbial activity, as indicated by average well color development (AWCD) and community metabolic diversity (CMD) values. Community-level physiological profiling (CLPP) revealed that (1) samples clustered as groups according to shared CSUPs, and (2) microbial communities in non-drinking samples actively utilized all six substrate classes compared to drinking samples. The data reported here can provide a baseline or a comparator for prospective quality assessments of drinking water and freshwater sources in the region. Metabolic fingerprinting using CSUPs is a simple and cheap phenotypic analysis of microbial communities and their physiological activity in aquatic environments. Full article
(This article belongs to the Section Chemoinformatics)
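The average well colour development (AWCD) metric mentioned in the abstract is commonly computed as the mean blank-corrected absorbance over the 31 EcoPlate carbon-substrate wells; a sketch with made-up readings is shown below.

```python
# AWCD as commonly computed for Biolog EcoPlate reads: the mean blank-corrected
# absorbance over the 31 carbon-substrate wells (values here are synthetic).
import numpy as np

def awcd(substrate_od, control_od):
    corrected = np.clip(np.asarray(substrate_od) - control_od, 0, None)
    return corrected.mean()

reads = np.random.default_rng(1).uniform(0.1, 1.2, size=31)  # one plate replicate
print(round(awcd(reads, control_od=0.12), 3))
```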

17 pages, 1279 KiB  
Article
A Self-Attention-Based Imputation Technique for Enhancing Tabular Data Quality
by Do-Hoon Lee and Han-joon Kim
Data 2023, 8(6), 102; https://doi.org/10.3390/data8060102 - 04 Jun 2023
Cited by 1 | Viewed by 2025
Abstract
Recently, data-driven decision-making has attracted great interest; this requires high-quality datasets. However, real-world datasets often feature missing values for unknown or intentional reasons, rendering data-driven decision-making inaccurate. If a machine learning model is trained using incomplete datasets with missing values, the inferred results may be biased. In this case, a commonly used technique is the missing value imputation (MVI), which fills missing data with possible values estimated based on observed values. Various data imputation methods using machine learning, statistical inference, and relational database theories have been developed. Among them, conventional machine learning based imputation methods that handle tabular data can deal with only numerical columns or are time-consuming and cumbersome because they create an individualized predictive model for each column. Therefore, we have developed a novel imputational neural network that we term the Denoising Self-Attention Network (DSAN). Our proposed DSAN can deal with tabular datasets containing both numerical and categorical columns; it considers discretized numerical values as categorical values for embedding and self-attention layers. Furthermore, the DSAN learns robust feature expression vectors by combining self-attention and denoising techniques, and can predict multiple, appropriate substituted values simultaneously (via multi-task learning). To verify the validity of the method, we performed data imputation experiments after arbitrarily generating missing values for several real-world tabular datasets. We evaluated both imputational and downstream task performances, and we have seen that the DSAN outperformed the other models, especially in terms of category variable imputation. Full article
(This article belongs to the Section Information Systems and Data Management)
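The general idea of attention-based tabular imputation can be sketched as below; this is not the authors' DSAN architecture, only a minimal stand-in that embeds discretised column values, mixes them with self-attention, and predicts each column with its own head.

```python
# Generic sketch of attention-based tabular imputation (not the authors' DSAN):
# training would randomly corrupt some cells (denoising) and minimise
# cross-entropy on the corrupted positions.
import torch
import torch.nn as nn

class TinyAttentionImputer(nn.Module):
    def __init__(self, n_cols, vocab_per_col, d_model=32, n_heads=4):
        super().__init__()
        self.embed = nn.ModuleList([nn.Embedding(vocab_per_col, d_model) for _ in range(n_cols)])
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.heads = nn.ModuleList([nn.Linear(d_model, vocab_per_col) for _ in range(n_cols)])

    def forward(self, x):                        # x: (batch, n_cols) integer codes
        tok = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embed)], dim=1)
        mixed, _ = self.attn(tok, tok, tok)      # (batch, n_cols, d_model)
        return [head(mixed[:, i]) for i, head in enumerate(self.heads)]

model = TinyAttentionImputer(n_cols=5, vocab_per_col=16)
logits = model(torch.randint(0, 16, (8, 5)))     # one logit vector per column
print(logits[0].shape)                           # torch.Size([8, 16])
```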

19 pages, 5079 KiB  
Data Descriptor
Labelled Indoor Point Cloud Dataset for BIM Related Applications
by Nuno Abreu, Rayssa Souza, Andry Pinto, Anibal Matos and Miguel Pires
Data 2023, 8(6), 101; https://doi.org/10.3390/data8060101 - 01 Jun 2023
Cited by 1 | Viewed by 2744
Abstract
BIM (building information modelling) has gained wider acceptance in the AEC (architecture, engineering, and construction) industry. Conversion from 3D point cloud data to vector BIM data remains a challenging and labour-intensive process, but particularly relevant during various stages of a project lifecycle. While the challenges associated with processing very large 3D point cloud datasets are widely known, there is a pressing need for intelligent geometric feature extraction and reconstruction algorithms for automated point cloud processing. Compared to outdoor scene reconstruction, indoor scenes are challenging since they usually contain high amounts of clutter. This dataset comprises the indoor point cloud obtained by scanning four different rooms (including a hallway): two office workspaces, a workshop, and a laboratory including a water tank. The scanned space is located at the Electrical and Computer Engineering department of the Faculty of Engineering of the University of Porto. The dataset is fully labelled, containing major structural elements like walls, floor, ceiling, windows, and doors, as well as furniture, movable objects, clutter, and scanning noise. The dataset also contains an as-built BIM that can be used as a reference, making it suitable for being used in Scan-to-BIM and Scan-vs-BIM applications. For demonstration purposes, a Scan-vs-BIM change detection application is described, detailing each of the main data processing steps. Full article
(This article belongs to the Section Spatial Data Science and Digital Earth)
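A labelled room scan from a dataset like this can be inspected with a few lines of Open3D; the file name and PLY format below are assumptions about how the point clouds are stored.

```python
# Minimal sketch of loading and inspecting one room scan with Open3D.
import open3d as o3d

pcd = o3d.io.read_point_cloud("room_laboratory.ply")   # points (+ colours if present)
print(pcd)                                             # e.g. "PointCloud with N points"
print(pcd.get_axis_aligned_bounding_box())             # rough room extents
```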

27 pages, 3498 KiB  
Data Descriptor
Progress in the Cost-Optimal Methodology Implementation in Europe: Datasets Insights and Perspectives in Member States
by Paolo Zangheri, Delia D’Agostino, Roberto Armani, Carmen Maduta and Paolo Bertoldi
Data 2023, 8(6), 100; https://doi.org/10.3390/data8060100 - 31 May 2023
Cited by 1 | Viewed by 1254
Abstract
This data article relates to the paper “Review of the cost-optimal methodology implementation in Member States in compliance with the Energy Performance of Buildings Directive”. Datasets linked with this article refer to the analysis of the latest national cost-optimal reports, providing an assessment of the implementation of the cost-optimal methodology, as established by the Energy Performance of Building Directive (EPBD). Based on latest national reports, the data provided a comprehensive update to the cost-optimal methodology implementation throughout Europe, which is currently lacking harmonization. Datasets allow an overall overview of the status of the cost-optimal methodology implementation in Europe with details on the calculations carried out (e.g., multi-stage, dynamic, macroeconomic, and financial perspectives, included energy uses, and full-cost approach). Data relate to the implemented methodology, reference buildings, assessed cost-optimal levels, energy performance, costs, and sensitivity analysis. Data also provide insight into energy consumption, efficiency measures for residential and non-residential buildings, nearly zero energy buildings (NZEBs) levels, and global costs. The reported data can be useful to quantify the cost-optimal levels for different building types, both residential (average cost-optimal level 80 kWh/m2y for new, 130 kWh/m2y for existing buildings) and non-residential buildings (140 kWh/m2y for new, 180 kWh/m2y for existing buildings). Data outline weak and strong points of the methodology, as well as future developments in the light of the methodology revision foreseen in 2026. The data support energy efficiency and energy policies related to buildings toward the EU building stock decarbonization goal within 2050. Full article

24 pages, 5989 KiB  
Article
Classification of Cocoa Pod Maturity Using Similarity Tools on an Image Database: Comparison of Feature Extractors and Color Spaces
by Kacoutchy Jean Ayikpa, Diarra Mamadou, Pierre Gouton and Kablan Jérôme Adou
Data 2023, 8(6), 99; https://doi.org/10.3390/data8060099 - 30 May 2023
Cited by 2 | Viewed by 1568
Abstract
Côte d’Ivoire, the world’s largest cocoa producer, faces the challenge of quality production. Immature or overripe pods cannot produce quality cocoa beans, resulting in losses and an unprofitable harvest. To help farmer cooperatives determine the maturity of cocoa pods in time, our study evaluates the use of automation tools based on similarity measures. Although standard techniques, such as visual inspection and weighing, are commonly used to identify the maturity of cocoa pods, the use of automation tools based on similarity measures can improve the efficiency and accuracy of this process. We set up a database of cocoa pod images and used two feature extractors: one based on convolutional neural networks (CNN), in particular, MobileNet, and the other based on texture analysis using a gray-level co-occurrence matrix (GLCM). We evaluated the impact of different color spaces and feature extraction methods on our database. We used mathematical similarity measurement tools, such as the Euclidean distance, correlation distance, and chi-square distance, to classify cocoa pod images. Our experiments showed that the chi-square distance measurement offered the best accuracy, with a score of 99.61%, when we used GLCM as a feature extractor and the Lab color space. Using automation tools based on similarity measures can improve the efficiency and accuracy of cocoa pod maturity determination. The results of our experiments prove that the chi-square distance is the most appropriate measure of similarity for this task. Full article
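The GLCM-plus-chi-square pipeline described above can be approximated as in the sketch below; the GLCM distances, angles, and properties are arbitrary choices, the colour-space conversion (e.g., RGB to Lab) is omitted, and graycomatrix/graycoprops assume a recent scikit-image release.

```python
# Illustrative GLCM feature extraction plus chi-square matching on toy images.
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_u8):
    glcm = graycomatrix(gray_u8, distances=[1, 2], angles=[0, np.pi / 2],
                        levels=256, symmetric=True, normed=True)
    props = ("contrast", "dissimilarity", "homogeneity", "energy")
    return np.concatenate([graycoprops(glcm, p).ravel() for p in props])

def chi_square(a, b, eps=1e-10):
    a, b = np.asarray(a, float), np.asarray(b, float)
    return 0.5 * np.sum((a - b) ** 2 / (a + b + eps))

rng = np.random.default_rng(2)
query, ref = (rng.integers(0, 256, (64, 64), dtype=np.uint8) for _ in range(2))
print(chi_square(glcm_features(query), glcm_features(ref)))
```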

14 pages, 3902 KiB  
Data Descriptor
Unmanned Aerial Vehicle (UAV) and Spectral Datasets in South Africa for Precision Agriculture
by Cilence Munghemezulu, Zinhle Mashaba-Munghemezulu, Phathutshedzo Eugene Ratshiedana, Eric Economon, George Chirima and Sipho Sibanda
Data 2023, 8(6), 98; https://doi.org/10.3390/data8060098 - 30 May 2023
Cited by 3 | Viewed by 1692
Abstract
Remote sensing data play a crucial role in precision agriculture and natural resource monitoring. The use of unmanned aerial vehicles (UAVs) can provide solutions to challenges faced by farmers and natural resource managers due to its high spatial resolution and flexibility compared to satellite remote sensing. This paper presents UAV and spectral datasets collected from different provinces in South Africa, covering different crops at the farm level as well as natural resources. UAV datasets consist of five multispectral bands corrected for atmospheric effects using the PIX4D mapper software to produce surface reflectance images. The spectral datasets are filtered using a Savitzky–Golay filter, corrected for Multiplicative Scatter Correction (MSC). The first and second derivatives and the Continuous Wavelet Transform (CWT) spectra are also calculated. These datasets can provide baseline information for developing solutions for precision agriculture and natural resource challenges. For example, UAV and spectral data of different crop fields captured at spatial and temporal resolutions can contribute towards calibrating satellite images, thus improving the accuracy of the derived satellite products. Full article
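The spectral pre-processing steps named above (Savitzky–Golay filtering, multiplicative scatter correction, derivatives) can be sketched as follows; the window length, polynomial order, and toy spectra are placeholders.

```python
# Sketch of Savitzky-Golay smoothing, MSC against the mean spectrum,
# and a first derivative on synthetic reflectance spectra.
import numpy as np
from scipy.signal import savgol_filter

def msc(spectra):
    """spectra: (n_samples, n_bands). Regress each spectrum on the mean spectrum."""
    ref = spectra.mean(axis=0)
    out = np.empty_like(spectra)
    for i, s in enumerate(spectra):
        slope, intercept = np.polyfit(ref, s, 1)
        out[i] = (s - intercept) / slope
    return out

spectra = np.random.default_rng(3).random((10, 350)) + np.linspace(0, 1, 350)
smoothed = savgol_filter(spectra, window_length=11, polyorder=2, axis=1)
corrected = msc(smoothed)
first_deriv = savgol_filter(corrected, 11, 2, deriv=1, axis=1)
print(first_deriv.shape)
```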

13 pages, 1462 KiB  
Article
A Fast Deep Learning ECG Sex Identifier Based on Wavelet RGB Image Classification
by Jose-Luis Cabra Lopez, Carlos Parra and Gonzalo Forero
Data 2023, 8(6), 97; https://doi.org/10.3390/data8060097 - 29 May 2023
Cited by 1 | Viewed by 1743
Abstract
Human sex recognition with electrocardiogram signals is an emerging area in machine learning, mostly oriented toward neural network approaches. It might be the beginning of a field of heart behavior analysis focused on sex. However, a person’s heartbeat changes during daily activities, which could compromise the classification. In this paper, with the intention of capturing heartbeat dynamics, we divided the heart rate into different intervals, creating a specialized identification model for each interval. The sexual differentiation for each model was performed with a deep convolutional neural network from images that represented the RGB wavelet transformation of ECG pseudo-orthogonal X, Y, and Z signals, using sufficient samples to train the network. Our database included 202 people, with a female-to-male population ratio of 49.5–50.5% and an observation period of 24 h per person. As our main goal, we looked for periods of time during which the classification rate of sex recognition was higher and the process was faster; in fact, we identified intervals in which only one heartbeat was required. We found that for each heart rate interval, the best accuracy score varied depending on the number of heartbeats collected. Furthermore, our findings indicated that as the heart rate increased, fewer heartbeats were needed for analysis. On average, our proposed model reached an accuracy of 94.82% ± 1.96%. The findings of this investigation provide a heartbeat acquisition procedure for ECG sex recognition systems. In addition, our results encourage future research to include sex as a soft biometric characteristic in person identification scenarios and for cardiology studies, in which the detection of specific male or female anomalies could help autonomous learning machines move toward specialized health applications. Full article
(This article belongs to the Special Issue Signal Processing for Data Mining)
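One way to build the RGB wavelet images described above is sketched below, with one continuous-wavelet scalogram per pseudo-orthogonal lead mapped to a colour channel; the wavelet, scales, sampling rate, and normalisation are illustrative rather than the authors' settings.

```python
# Sketch: CWT scalograms of X, Y, Z ECG leads stacked into one RGB image.
import numpy as np
import pywt

def lead_to_channel(signal, scales=np.arange(1, 65), wavelet="morl", fs=250):
    coeffs, _ = pywt.cwt(signal, scales, wavelet, sampling_period=1 / fs)
    mag = np.abs(coeffs)
    return (255 * (mag - mag.min()) / (np.ptp(mag) + 1e-12)).astype(np.uint8)

t = np.linspace(0, 1, 250, endpoint=False)
x, y, z = (np.sin(2 * np.pi * f * t) for f in (5, 12, 25))   # stand-in leads
rgb = np.dstack([lead_to_channel(s) for s in (x, y, z)])      # (64, 250, 3) image
print(rgb.shape, rgb.dtype)
```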

18 pages, 10780 KiB  
Article
Exploring the Evolution of Sentiment in Spanish Pandemic Tweets: A Data Analysis Based on a Fine-Tuned BERT Architecture
by Carlos Henríquez Miranda, German Sanchez-Torres and Dixon Salcedo
Data 2023, 8(6), 96; https://doi.org/10.3390/data8060096 - 29 May 2023
Cited by 1 | Viewed by 2079
Abstract
The COVID-19 pandemic has had a significant impact on various aspects of society, including economic, health, political, and work-related domains. The pandemic has also caused an emotional effect on individuals, reflected in their opinions and comments on social media platforms, such as Twitter. This study explores the evolution of sentiment in Spanish pandemic tweets through a data analysis based on a fine-tuned BERT architecture. A total of six million tweets were collected using web scraping techniques, and pre-processing was applied to filter and clean the data. The fine-tuned BERT architecture was utilized to perform sentiment analysis, which allowed for a deep-learning approach to sentiment classification. The analysis results were graphically represented based on search criteria, such as “COVID-19” and “coronavirus”. This study reveals sentiment trends, significant concerns, relationship with announced news, public reactions, and information dissemination, among other aspects. These findings provide insight into the emotional impact of the COVID-19 pandemic on individuals and the corresponding impact on social media platforms. Full article
(This article belongs to the Section Information Systems and Data Management)
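Scoring Spanish tweets with a fine-tuned BERT-family classifier can be done along the lines below; the checkpoint name is a placeholder, not the authors' fine-tuned model.

```python
# Sketch of sentiment scoring with Hugging Face transformers; the model id is
# a placeholder for any fine-tuned Spanish BERT-style checkpoint.
from transformers import pipeline

sentiment = pipeline("sentiment-analysis", model="path/or/hub-id-of-finetuned-spanish-bert")
tweets = [
    "El confinamiento por COVID-19 me tiene muy preocupado",
    "Por fin buenas noticias sobre la vacuna, qué alivio",
]
for tweet, result in zip(tweets, sentiment(tweets)):
    print(result["label"], round(result["score"], 3), "-", tweet)
```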

10 pages, 1503 KiB  
Data Descriptor
A Dataset of Scalp EEG Recordings of Alzheimer’s Disease, Frontotemporal Dementia and Healthy Subjects from Routine EEG
by Andreas Miltiadous, Katerina D. Tzimourta, Theodora Afrantou, Panagiotis Ioannidis, Nikolaos Grigoriadis, Dimitrios G. Tsalikakis, Pantelis Angelidis, Markos G. Tsipouras, Euripidis Glavas, Nikolaos Giannakeas and Alexandros T. Tzallas
Data 2023, 8(6), 95; https://doi.org/10.3390/data8060095 - 27 May 2023
Cited by 15 | Viewed by 8283
Abstract
Recently, there has been a growing research interest in utilizing the electroencephalogram (EEG) as a non-invasive diagnostic tool for neurodegenerative diseases. This article provides a detailed description of a resting-state EEG dataset of individuals with Alzheimer’s disease and frontotemporal dementia, and healthy controls. The dataset was collected using a clinical EEG system with 19 scalp electrodes while participants were in a resting state with their eyes closed. The data collection process included rigorous quality control measures to ensure data accuracy and consistency. The dataset contains recordings of 36 Alzheimer’s patients, 23 frontotemporal dementia patients, and 29 healthy age-matched subjects. For each subject, the Mini-Mental State Examination score is reported. A monopolar montage was used to collect the signals. A raw and preprocessed EEG is included in the standard BIDS format. For the preprocessed signals, established methods such as artifact subspace reconstruction and an independent component analysis have been employed for denoising. The dataset has significant reuse potential since Alzheimer’s EEG Machine Learning studies are increasing in popularity and there is a lack of publicly available EEG datasets. The resting-state EEG data can be used to explore alterations in brain activity and connectivity in these conditions, and to develop new diagnostic and treatment approaches. Additionally, the dataset can be used to compare EEG characteristics between different types of dementia, which could provide insights into the underlying mechanisms of these conditions. Full article
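Because the dataset is distributed in the standard BIDS format, a single recording can be read with MNE-BIDS roughly as below; the root path, subject label, and task name are assumptions about the folder layout rather than verified dataset fields.

```python
# Sketch of reading one EEG recording from a BIDS-formatted dataset with MNE-BIDS.
from mne_bids import BIDSPath, read_raw_bids

bids_path = BIDSPath(root="path/to/dataset", subject="001",
                     task="eyesclosed", datatype="eeg")
raw = read_raw_bids(bids_path)          # returns an mne.io.Raw object
print(raw.info["sfreq"], raw.ch_names)  # sampling rate and the 19 scalp channels
```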

7 pages, 872 KiB  
Data Descriptor
MicroRNA Profiling of Fresh Lung Adenocarcinoma and Adjacent Normal Tissues from Ten Korean Patients Using miRNA-Seq
by Jihye Park, Sae Jung Na, Jung Sook Yoon, Seoree Kim, Sang Hoon Chun, Jae Jun Kim, Young-Du Kim, Young-Ho Ahn, Keunsoo Kang and Yoon Ho Ko
Data 2023, 8(6), 94; https://doi.org/10.3390/data8060094 - 25 May 2023
Viewed by 1251
Abstract
MicroRNA transcriptomes from fresh tumors and the adjacent normal tissues were profiled in 10 Korean patients diagnosed with lung adenocarcinoma using a next-generation sequencing (NGS) technique called miRNA-seq. The sequencing quality was assessed using FastQC, and low-quality or adapter-contaminated portions of the reads were removed using Trim Galore. Quality-assured reads were analyzed using miRDeep2 and Bowtie. The abundance of known miRNAs was estimated using the reads per million (RPM) normalization method. Subsequently, using DESeq2 and Wx, we identified differentially expressed miRNAs and potential miRNA biomarkers for lung adenocarcinoma tissues compared to adjacent normal tissues, respectively. We defined reliable miRNA biomarkers for lung adenocarcinoma as those detected by both methods. The miRNA-seq data are available in the Gene Expression Omnibus (GEO) database under accession number GSE196633, and all processed data can be accessed via the Mendeley data website. Full article
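The reads-per-million (RPM) normalisation mentioned above is simply each miRNA count scaled by the library's total read count; a toy sketch with made-up sample and count values is shown below.

```python
# RPM normalisation on a toy miRNA count table (values are synthetic).
import pandas as pd

counts = pd.DataFrame(
    {"tumor_01": [1500, 30, 420], "normal_01": [900, 55, 120]},
    index=["hsa-miR-21-5p", "hsa-miR-1-3p", "hsa-let-7a-5p"],
)
rpm = counts / counts.sum(axis=0) * 1_000_000   # scale each library to one million reads
print(rpm.round(1))
```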

7 pages, 457 KiB  
Data Descriptor
Target Screening of Chemicals of Emerging Concern (CECs) in Surface Waters of the Swedish West Coast
by Pedro A. Inostroza, Eric Carmona, Åsa Arrhenius, Martin Krauss, Werner Brack and Thomas Backhaus
Data 2023, 8(6), 93; https://doi.org/10.3390/data8060093 - 25 May 2023
Cited by 4 | Viewed by 1544
Abstract
The aquatic environment faces increasing threats from a variety of unregulated organic chemicals originating from human activities, collectively known as chemicals of emerging concern (CECs). These include pharmaceuticals, personal-care products, pesticides, surfactants, industrial chemicals, and their transformation products. CECs enter aquatic environments through various sources, including effluents from wastewater treatment plants, industrial facilities, runoff from agricultural and residential areas, as well as accidental spills. Data on the occurrence of CECs in the marine environment are scarce, and more information is needed to assess the chemical and ecological status of water bodies, and to prioritize toxic chemicals for further studies or risk assessment. In this study, we describe a monitoring campaign targeting CECs in surface waters at the Swedish west coast using, for the first time, an on-site large volume solid phase extraction (LVSPE) device. We detected up to 80 and 227 CECs in marine sites and the wastewater treatment plant (WWTP) effluent, respectively. The dataset will contribute to defining pollution fingerprints and assessing the chemical status of marine and freshwater systems affected by industrial hubs, agricultural areas, and the discharge of urban wastewater. Full article