Bioinformatics Applications Based On Machine Learning

A special issue of Processes (ISSN 2227-9717). This special issue belongs to the section "Biological Processes and Systems".

Deadline for manuscript submissions: closed (15 December 2020) | Viewed by 48807

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Guest Editor
Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu 16100, Malaysia
Interests: artificial intelligence and intelligent systems; bioinformatics and computational biology

Special Issue Information

Dear Colleagues,

Research in the area of bioinformatics has always been one of the most active lines of research in the scientific community. However, it has gained even more interest thanks to the increased processing capacities of computers, which allow processing large volumes of data and analyzing them with techniques such as machine learning.

Thanks to these advances, new applications are appearing in bioinformatics, and their results generally improve on those of earlier applications that do not use these computational techniques.

In this Special Issue, we seek research and case studies that demonstrate the application of machine learning to support applied scientific research in any area of bioinformatics. Example topics, as applied to bioinformatics, include (but are not limited to) the following:

- New machine learning algorithms
- Distributed machine learning systems
- New applications on bioinformatics
- Health-care applications
- Bio imaging
- Next generation sequencing
- Data and software integration
- Visualization of biological systems and networks
- High-throughput data analysis (transcriptomics, proteomics, etc.)
- Comparison and alignment methods

Dr. Pablo Chamoso
Dr. Sara Rodríguez González
Prof. Dr. Mohd Saberi Mohamad
Dr. Alfonso González Briones
Guest editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Processes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2400 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • bioinformatics applications
  • machine learning
  • artificial intelligence

Published Papers (11 papers)


Research

14 pages, 1333 KiB  
Article
Population-Based Parameter Identification for Dynamical Models of Biological Networks with an Application to Saccharomyces cerevisiae
by Ewelina Weglarz-Tomczak, Jakub M. Tomczak, Agoston E. Eiben and Stanley Brul
Processes 2021, 9(1), 98; https://doi.org/10.3390/pr9010098 - 05 Jan 2021
Cited by 3 | Viewed by 2991
Abstract
One of the central elements in systems biology is the interaction between mathematical modeling and measured quantities. Typically, biological phenomena are represented as dynamical systems, and they are further analyzed and comprehended by identifying model parameters using experimental data. However, gradient-based optimization methods cannot find all model parameters by fitting the model to the experimental data, because the problem is non-differentiable. Here, we present POPI4SB, a Python-based framework for population-based parameter identification of dynamic models in systems biology. The code is built on top of PySCeS, which provides an engine to run dynamic simulations. The idea behind the methodology is to provide a set of derivative-free optimization methods that use a population of candidate solutions to find a better solution iteratively. Additionally, we propose two surrogate-assisted population-based methods that speed up convergence, namely, combinations of a k-nearest-neighbor regressor with Reversible Differential Evolution and with the Evolution of Distribution Algorithm. We demonstrate the optimization framework on the well-studied glycolytic pathway in Saccharomyces cerevisiae.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)
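The population-based, derivative-free idea described in this abstract can be illustrated with a minimal sketch. It uses plain differential evolution (not the Reversible Differential Evolution or the surrogate-assisted variants the authors propose), it does not use POPI4SB or PySCeS, and the loss function, bounds, and "true" parameters are hypothetical stand-ins for a real model-fitting error.

```python
import random


def differential_evolution(loss, bounds, pop_size=20, generations=100,
                           f=0.8, cr=0.9, seed=0):
    """Minimise `loss` inside the box `bounds` with a basic DE/rand/1/bin scheme."""
    rng = random.Random(seed)
    dim = len(bounds)
    # Start from a random population of candidate parameter vectors.
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [loss(x) for x in pop]
    for _ in range(generations):
        for i in range(pop_size):
            # Pick three distinct candidates other than candidate i.
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            forced = rng.randrange(dim)  # at least one coordinate is mutated
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if d == forced or rng.random() < cr:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(max(value, lo), hi)  # clip to the search box
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = loss(trial)
            # Greedy selection: keep the trial only if it is no worse.
            if trial_score <= scores[i]:
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


# Hypothetical stand-in for a simulation error: squared distance to
# made-up "true" kinetic parameters. No gradient is ever computed.
TRUE_PARAMS = [1.5, -0.5]


def toy_loss(params):
    return sum((p - t) ** 2 for p, t in zip(params, TRUE_PARAMS))


best_params, best_loss = differential_evolution(
    toy_loss, [(-5.0, 5.0), (-5.0, 5.0)]
)
print(best_params, best_loss)
```

In a real setting the loss would run a dynamic simulation and compare its output with measurements, which is why only function evaluations, and no derivatives, are assumed here.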

23 pages, 7074 KiB  
Article
A Genetic Programming Strategy to Induce Logical Rules for Clinical Data Analysis
by José A. Castellanos-Garzón, Yeray Mezquita Martín, José Luis Jaimes Sánchez, Santiago Manuel López García and Ernesto Costa
Processes 2020, 8(12), 1565; https://doi.org/10.3390/pr8121565 - 27 Nov 2020
Viewed by 2064
Abstract
This paper proposes a machine learning approach based on genetic programming to build classifiers through logical rule induction. In this context, we define and test a set of mutation operators across different clinical datasets to improve the performance of the proposal for each dataset. The use of genetic programming for rule induction has generated interesting results in machine learning problems. Hence, genetic programming represents a flexible and powerful evolutionary technique for automatic generation of classifiers. Since logical rules disclose knowledge from the analyzed data, we use such knowledge to interpret the results and filter the most important features from clinical data as a process of knowledge discovery. The ultimate goal of this proposal is to provide the experts in the data domain with prior knowledge (as a guide) about the structure of the data and the rules found for each class, especially to track dichotomies and inequality. The results reached by our proposal on the involved datasets have been very promising when used in classification tasks and compared with other methods.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

12 pages, 1297 KiB  
Article
A Hybrid of Particle Swarm Optimization and Harmony Search to Estimate Kinetic Parameters in Arabidopsis thaliana
by Mohamad Saufie Rosle, Mohd Saberi Mohamad, Yee Wen Choon, Zuwairie Ibrahim, Alfonso González-Briones, Pablo Chamoso and Juan Manuel Corchado
Processes 2020, 8(8), 921; https://doi.org/10.3390/pr8080921 - 02 Aug 2020
Cited by 5 | Viewed by 2202
Abstract
Recently, modelling and simulation have been used and applied to understand biological systems better. Therefore, the development of precise computational models of a biological system is essential. This model is a mathematical expression derived from a series of parameters of the system. The measurement of parameter values through experimentation is often expensive and time-consuming. However, if a simulation is used, the manipulation of computational parameters is easy, and thus the behaviour of a biological system model can be altered for a better understanding. The complexity and nonlinearity of a biological system make parameter estimation the most challenging task in modelling. Therefore, this paper proposes a hybrid of Particle Swarm Optimization (PSO) and Harmony Search (HS), also known as PSOHS, designed to determine the kinetic parameter values of essential amino acids, mainly aspartate metabolism, in Arabidopsis thaliana. Three performance measurements are used in this paper to evaluate the proposed PSOHS: the standard deviation, nonlinear least squared error, and computational time. The proposed algorithm outperformed the other two methods, namely Simulated Annealing and the downhill simplex method, and proved that PSOHS is a more suitable algorithm for estimating kinetic parameter values.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

15 pages, 2206 KiB  
Article
MPPIF-Net: Identification of Plasmodium Falciparum Parasite Mitochondrial Proteins Using Deep Features with Multilayer Bi-directional LSTM
by Samee Ullah Khan and Ran Baik
Processes 2020, 8(6), 725; https://doi.org/10.3390/pr8060725 - 22 Jun 2020
Cited by 30 | Viewed by 3286
Abstract
Mitochondrial proteins of Plasmodium falciparum (MPPF) are an important target for anti-malarial drugs, but their identification through manual experimentation is costly, and in turn, the production of the related drugs by pharmaceutical institutions takes a long time. Therefore, it is highly desirable for pharmaceutical companies to develop a computationally automated and reliable approach to identify proteins precisely, resulting in appropriate drug production in a timely manner. In this direction, several computationally intelligent techniques have been developed to extract local features from biological sequences using machine learning methods followed by various classifiers to discriminate the nature of proteins. Unfortunately, these techniques perform poorly when capturing contextual features from sequence patterns, yielding non-representative classifiers. In this paper, we propose a sequence-based framework to extract deep and representative features that are trustworthy for identifying Plasmodium mitochondrial proteins. The backbone of the proposed framework is MPPF identification-net (MPPFI-Net), which is based on a convolutional neural network (CNN) with multilayer bi-directional long short-term memory (MBD-LSTM). MPPIF-Net takes protein sequences as input and passes them through various convolution and pooling layers to optimally extract learned features. We pass these features into our sequence learning mechanism, MBD-LSTM, which is specifically trained to classify them into their relevant classes. Our proposed model is experimentally evaluated on a newly prepared dataset, PF2095, and two existing benchmark datasets, i.e., PF175 and MPD, using the holdout method. The proposed method achieved 97.6%, 97.1%, and 99.5% testing accuracy on the PF2095, PF175, and MPD datasets, respectively, outperforming the state-of-the-art approaches.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

19 pages, 2802 KiB  
Article
Measuring Performance Metrics of Machine Learning Algorithms for Detecting and Classifying Transposable Elements
by Simon Orozco-Arias, Johan S. Piña, Reinel Tabares-Soto, Luis F. Castillo-Ossa, Romain Guyot and Gustavo Isaza
Processes 2020, 8(6), 638; https://doi.org/10.3390/pr8060638 - 27 May 2020
Cited by 23 | Viewed by 6294
Abstract
Because of the promising results obtained by machine learning (ML) approaches in several fields, the use of ML to solve problems in bioinformatics is becoming more common every day. In genomics, a current issue is to detect and classify transposable elements (TEs) because of the tedious tasks involved in bioinformatics methods. Thus, ML was recently evaluated for TE datasets, demonstrating better results than bioinformatics applications. A crucial step for ML approaches is the selection of metrics that measure the realistic performance of algorithms. Each metric has specific characteristics and measures properties that may be different from the predicted results. Although the most commonly used way to compare measures is by using empirical analysis, a non-result-based methodology has been proposed, called measure invariance properties. These properties are calculated on the basis of whether a given measure changes its value under certain modifications in the confusion matrix, giving comparative parameters independent of the datasets. Measure invariance properties make metrics more or less informative, particularly on unbalanced, monomodal, or multimodal negative class datasets and for real or simulated datasets. Although several studies applied ML to detect and classify TEs, there are no works evaluating performance metrics in TE tasks. Here, we analyzed 26 different metrics utilized in binary, multiclass, and hierarchical classifications, through bibliographic sources, and their invariance properties. Then, we corroborated our findings utilizing freely available TE datasets and commonly used ML algorithms. Based on our analysis, the most suitable metrics for TE tasks must be stable, even using highly unbalanced datasets, multimodal negative class, and training datasets with errors or outliers. Based on these parameters, we conclude that the F1-score and the area under the precision-recall curve are the most informative metrics since they are calculated based on other metrics, providing insight into the development of an ML application.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)
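As a rough illustration of what a measure invariance property means, the sketch below checks how two metrics react when only the number of true negatives in a binary confusion matrix changes. The counts are hypothetical and are not taken from the article; the formulas are the standard definitions of precision, recall, F1-score, and accuracy.

```python
def f1_score(tp, fp, fn, tn):
    """F1 is the harmonic mean of precision and recall; `tn` is unused by design."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical confusion-matrix counts: the positive class is identical in
# both cases, only the size of the negative class changes.
balanced = dict(tp=80, fp=20, fn=20, tn=80)
unbalanced = dict(tp=80, fp=20, fn=20, tn=8000)

f1_small, f1_large = f1_score(**balanced), f1_score(**unbalanced)
acc_small, acc_large = accuracy(**balanced), accuracy(**unbalanced)

print(f"F1:       {f1_small:.3f} -> {f1_large:.3f}")    # unchanged
print(f"accuracy: {acc_small:.3f} -> {acc_large:.3f}")  # inflated by true negatives
```

Under these assumed counts, accuracy rises simply because the negative class grows, while the F1-score stays the same, which is the kind of stability on unbalanced data that the abstract asks of a metric.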

24 pages, 800 KiB  
Article
An Adjective Selection Personality Assessment Method Using Gradient Boosting Machine Learning
by Bruno Fernandes, Alfonso González-Briones, Paulo Novais, Miguel Calafate, Cesar Analide and José Neves
Processes 2020, 8(5), 618; https://doi.org/10.3390/pr8050618 - 21 May 2020
Cited by 6 | Viewed by 5793
Abstract
Goldberg’s 100 Unipolar Markers remains one of the most popular ways to measure personality traits, in particular, the Big Five. An important reduction was later performed by Saucier, using a subset of 40 markers. Both assessments are performed by presenting a set of markers, or adjectives, to the subject, requesting them to quantify each marker using a 9-point rating scale. Consequently, the goal of this study is to conduct experiments and propose a shorter alternative where the subject is only required to identify which adjectives describe them the most. Hence, a web platform was developed for data collection, requesting subjects to rate each adjective and select those describing them the most. Based on a Gradient Boosting approach, two distinct Machine Learning architectures were conceived, tuned and evaluated. The first makes use of regressors to provide an exact score of the Big Five while the second uses classifiers to provide a binned output. As input, both receive the one-hot encoded selection of adjectives. Both architectures performed well. The first is able to quantify the Big Five with an approximate error of 5 units of measure, while the second shows a micro-averaged F1-score of 83%. Since all adjectives are used to compute all traits, the models are able to harness inter-trait relationships, making it possible to further reduce the set of adjectives by removing those of lesser importance.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)
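The "one-hot encoded selection of adjectives" mentioned in this abstract can be pictured with a small sketch. The five-word vocabulary and the function name are hypothetical and chosen only for illustration; the study itself works with a 40-marker set, and its exact encoding pipeline is not described here.

```python
# Hypothetical, shortened vocabulary; a real assessment would list every marker.
ADJECTIVES = ["bold", "kind", "organized", "moody", "creative"]


def one_hot_selection(selected, vocabulary=ADJECTIVES):
    """Return one binary indicator per adjective: 1 if the subject selected it."""
    unknown = set(selected) - set(vocabulary)
    if unknown:
        raise ValueError(f"adjectives not in vocabulary: {sorted(unknown)}")
    chosen = set(selected)
    return [1 if adjective in chosen else 0 for adjective in vocabulary]


# A subject who picked "kind" and "creative" as most descriptive.
features = one_hot_selection(["kind", "creative"])
print(features)  # [0, 1, 0, 0, 1]
```

A fixed-length vector like this is what a gradient boosting regressor or classifier would typically receive as input, one row per subject.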

14 pages, 660 KiB  
Article
Bioinspired Hybrid Model to Predict the Hydrogen Inlet Fuel Cell Flow Change of an Energy Storage System
by Héctor Alaiz-Moretón, Esteban Jove, José-Luis Casteleiro-Roca, Héctor Quintián, Hilario López García, José Alberto Benítez-Andrades, Paulo Novais and Jose Luis Calvo-Rolle
Processes 2019, 7(11), 825; https://doi.org/10.3390/pr7110825 - 07 Nov 2019
Cited by 9 | Viewed by 3151
Abstract
The present research work deals with prediction of hydrogen consumption of a fuel cell in an energy storage system. Because these kinds of systems have a very nonlinear behaviour, traditional techniques based on parametric models, and other more sophisticated techniques such as soft computing methods, do not seem accurate enough to generate good models of the system under study. Therefore, a hybrid intelligent system, based on clustering and regression techniques, has been developed and implemented to predict the variation in hydrogen flow consumption needed to satisfy the variation in power demanded from the fuel cell. In this research, a hybrid intelligent model was created and validated over a dataset from a fuel cell energy storage system. The results validate the proposal, achieving better performance than other well-known classical regression methods and allowing us to predict the hydrogen consumption with a Mean Absolute Error (MAE) of 3.73 on the validation dataset.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

23 pages, 4111 KiB  
Article
Ear Detection and Localization with Convolutional Neural Networks in Natural Images and Videos
by William Raveane, Pedro Luis Galdámez and María Angélica González Arrieta
Processes 2019, 7(7), 457; https://doi.org/10.3390/pr7070457 - 17 Jul 2019
Cited by 18 | Viewed by 7252
Abstract
Precisely detecting and locating an ear within an image is the first step to tackle in an ear-based biometric recognition system, a challenge that becomes harder under variable photographic conditions. This is in part due to the irregular shapes of human ears, but also because of variable lighting conditions and the ever-changing profile shape of an ear’s projection when photographed. An ear detection system involving multiple convolutional neural networks and a detection grouping algorithm is proposed to identify the presence and location of an ear in a given input image. The proposed method matches the performance of other methods when analyzed against clean and purpose-shot photographs, reaching an accuracy of upwards of 98%, but clearly outperforms them with a rate of over 86% when the system is subjected to non-cooperative natural images where the subject appears in challenging orientations and photographic conditions.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

18 pages, 3496 KiB  
Article
An Accurate Clinical Implication Assessment for Diabetes Mellitus Prevalence Based on a Study from Nigeria
by Muhammad Noman Sohail, Ren Jiadong, Musa Uba Muhammad, Sohaib Tahir Chauhdary, Jehangir Arshad and Antony John Verghese
Processes 2019, 7(5), 289; https://doi.org/10.3390/pr7050289 - 15 May 2019
Cited by 10 | Viewed by 4013
Abstract
The rate of diabetes is increasing across the planet. Therefore, the diagnosis of pre-diabetes and diabetes is important in populations with extreme diabetes risk. In this study, a machine learning technique was implemented over a data mining platform by employing Rule classifiers (PART and Decision table) to measure the accuracy and logistic regression on the classification results for forecasting the prevalence in diabetes mellitus patients suffering simultaneously from other chronic disease symptoms. The real-life data was collected in Nigeria between December 2017 and February 2019 by applying ten non-intrusive and easily available clinical variables. The results disclosed that the Rule classifiers achieved a mean accuracy of 98.75%. The error rate, precision, recall, F-measure, and Matthews correlation coefficient (MCC) were 0.02%, 0.98%, 0.98%, 0.98%, and 0.97%, respectively. The forecast decision, achieved by employing a set of 23 decision rules (DR), indicates that age, gender, glucose level, and body mass are fundamental reasons for diabetes, followed by work stress, diet, family diabetes history, physical exercise, and cardiovascular stroke history. The study validated that the proposed set of DR is practical for quick screening of diabetes mellitus patients at the initial stage without intrusive medical tests and was found to be effective in the initial diagnosis of diabetes.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

11 pages, 2919 KiB  
Article
A Machine Learning-based Pipeline for the Classification of CTX-M in Metagenomics Samples
by Diego Ceballos, Diana López-Álvarez, Gustavo Isaza, Reinel Tabares-Soto, Simón Orozco-Arias and Carlos D. Ferrin
Processes 2019, 7(4), 235; https://doi.org/10.3390/pr7040235 - 24 Apr 2019
Cited by 5 | Viewed by 4908
Abstract
Bacterial infections are a major global concern, since they can lead to public health problems. To address this issue, bioinformatics contributes extensively with the analysis and interpretation of in silico data by enabling the genetic characterization of different individuals/strains, such as in bacteria. However, the growing volume of metagenomic data requires new infrastructure, technologies, and methodologies that support the analysis and prediction of this information from a clinical point of view, as intended in this work. On the other hand, distributed computational environments allow the management of these large volumes of data, due to significant advances in processing architectures, such as multicore CPU (Central Processing Unit) and GPGPU (General-Purpose Graphics Processing Unit). For this purpose, we developed a bioinformatics workflow based on metagenomic data filtered with the Duk tool. Data formatting was done through Emboss software and a prototype of a workflow. A pipeline was also designed and implemented in bash script based on machine learning. Further, the Python 3 programming language was used to normalize the training data of the artificial neural network, which was implemented in the TensorFlow framework, and its behavior was visualized in TensorBoard. Finally, the values from the initial bioinformatics process and the data generated during the parameterization and optimization of the Artificial Neural Network are presented and validated based on the best result for the identification of the CTX-M gene group.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)

Review

18 pages, 527 KiB  
Review
A Review of Computational Methods for Clustering Genes with Similar Biological Functions
by Hui Wen Nies, Zalmiyah Zakaria, Mohd Saberi Mohamad, Weng Howe Chan, Nazar Zaki, Richard O. Sinnott, Suhaimi Napis, Pablo Chamoso, Sigeru Omatu and Juan Manuel Corchado
Processes 2019, 7(9), 550; https://doi.org/10.3390/pr7090550 - 21 Aug 2019
Cited by 12 | Viewed by 5091
Abstract
Clustering techniques can group genes based on similarity in biological functions. However, the drawback of using clustering techniques is the inability to identify an optimal number of potential clusters beforehand. Several existing optimization techniques can address the issue. Besides, clustering validation can predict the possible number of potential clusters and hence increase the chances of identifying biologically informative genes. This paper reviews and provides examples of existing methods for clustering genes, optimization of the objective function, and clustering validation. Clustering techniques can be categorized into partitioning, hierarchical, grid-based, and density-based techniques. We also highlight the advantages and the disadvantages of each category. To optimize the objective function, here we introduce the swarm intelligence technique and compare the performances of other methods. Moreover, we discuss the differences in measurements between internal and external criteria to validate cluster quality. We also investigate the performance of several clustering techniques by applying them to a leukemia dataset. The results show that grid-based clustering techniques provide better classification accuracy; however, partitioning clustering techniques are superior in identifying prognostic markers of leukemia. Therefore, this review suggests combining clustering techniques such as CLIQUE and k-means to yield high-quality gene clusters.
(This article belongs to the Special Issue Bioinformatics Applications Based On Machine Learning)
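For readers unfamiliar with partitioning techniques such as k-means, a minimal sketch follows. The gene expression values and starting centroids are made up for illustration, the number of clusters is fixed in advance (the limitation this abstract points out), and the code is a generic textbook k-means rather than any implementation evaluated in the review.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def nearest(point, centroids):
    """Index of the centroid closest to `point`."""
    return min(range(len(centroids)),
               key=lambda k: squared_distance(point, centroids[k]))


def kmeans(points, centroids, iterations=10):
    """Alternate assignment and centroid update from given starting centroids."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(iterations):
        assignment = [nearest(p, centroids) for p in points]
        updated = []
        for k, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignment) if a == k]
            # Keep the old centroid if a cluster happens to be empty.
            updated.append(
                tuple(sum(axis) / len(members) for axis in zip(*members))
                if members else old
            )
        if updated == centroids:  # converged
            break
        centroids = updated
    return [nearest(p, centroids) for p in points], centroids


# Hypothetical expression levels of six genes under two conditions.
genes = [(1.0, 1.2), (0.8, 1.0), (1.2, 0.9),
         (5.0, 5.2), (5.1, 4.8), (4.9, 5.0)]
labels, centers = kmeans(genes, centroids=[(0.0, 0.0), (6.0, 6.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```

Choosing the number of clusters and the starting centroids is left to the caller here, which is exactly where the optimization and validation methods surveyed in the review would come in.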