Bioinformatics of Sequencing Data: A Machine Learning Approach

A special issue of Genes (ISSN 2073-4425). This special issue belongs to the section "Bioinformatics".

Deadline for manuscript submissions: closed (15 March 2023) | Viewed by 7738

Special Issue Editor

Dr. Irina Mohorianu
Bioinformatics/Scientific Computing, Core Bioinformatics Group, Wellcome-MRC Cambridge Stem Cell Institute, Cambridge CB2 0AW, UK
Interests: machine learning; bioinformatics; sequencing; multi-omics; spatial transcriptomics; clustering; classification; gene regulatory networks; small RNAs

Special Issue Information

Dear Colleagues,

Over the past few years, the deluge of sequencing data (bulk and single-cell, covering one or more modalities and sometimes retaining spatial information) has prompted a search for more efficient approaches to summarizing and synthesizing biological signals. Machine learning methods, diverse and flexible yet robust, offer optimized processing of large amounts of data and can uncover underlying signals that remain hidden (e.g., masked by noise) under traditional approaches.

This Special Issue is open to cutting-edge research spanning the wide range of bioinformatics interests, from purely algorithmic work to methods tightly embedded in the particularities of a data modality. Bold applications of machine learning approaches are welcome: unsupervised (e.g., clustering in single-cell data); supervised (e.g., classification or regression approaches aimed at improving predictions); semi-supervised (e.g., methods that illustrate the handling of missing labels); or reinforcement learning (when the online processing of information is required). Studies that apply the signals quantified by sequencing experiments to the interpretation of biological phenomena (e.g., detailing gene regulatory networks) are also invited.

This Special Issue will both underline recent developments in the field (research papers) and summarize the next set of data-processing challenges which may be tackled using machine learning methods (review papers). Case studies are also welcome but should specifically address limitations/shortcomings of current computational approaches.

Dr. Irina Mohorianu
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, proceed to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Genes is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • machine learning
  • high-throughput sequencing
  • multi-omics
  • (standard/spatial) transcriptomics
  • unsupervised learning (clustering)
  • supervised learning (classification/regression)
  • semi-supervised learning
  • reinforcement learning
  • gene regulatory networks

Published Papers (4 papers)

Research

20 pages, 7378 KiB  
Article
EDLM: Ensemble Deep Learning Model to Detect Mutation for the Early Detection of Cholangiocarcinoma
by Asghar Ali Shah, Fahad Alturise, Tamim Alkhalifah, Amna Faisal and Yaser Daanial Khan
Genes 2023, 14(5), 1104; https://doi.org/10.3390/genes14051104 - 18 May 2023
Cited by 2 | Viewed by 1615
Abstract
The most common cause of mortality and disability globally right now is cholangiocarcinoma, one of the worst forms of cancer that may affect people. When cholangiocarcinoma develops, the DNA of the bile duct cells is altered. Cholangiocarcinoma claims the lives of about 7000 individuals annually. Women pass away less often than men. Asians have the greatest fatality rate. Following Whites (20%) and Asians (22%), African Americans (45%) saw the greatest increase in cholangiocarcinoma mortality between 2021 and 2022. For instance, 60–70% of cholangiocarcinoma patients have local infiltration or distant metastases, which makes them unable to receive a curative surgical procedure. Across the board, the median survival time is less than a year. Many researchers work hard to detect cholangiocarcinoma, but this is after the appearance of symptoms, which is late detection. If cholangiocarcinoma progression is detected at an earlier stage, then it will help doctors and patients in treatment. Therefore, an ensemble deep learning model (EDLM) combining three deep learning algorithms, namely long short-term memory (LSTM), gated recurrent units (GRUs), and bi-directional LSTM (BLSTM), is developed for the early identification of cholangiocarcinoma. Several tests are presented, such as a 10-fold cross-validation test (10-FCVT), an independent set test (IST), and a self-consistency test (SCT). Several statistical techniques are used to evaluate the proposed model, such as accuracy (Acc), sensitivity (Sn), specificity (Sp), and Matthews correlation coefficient (MCC). There are 672 mutations in 45 distinct cholangiocarcinoma genes among the 516 human samples included in the proposed study. The IST has the highest Acc at 98%, outperforming all other validation approaches.
(This article belongs to the Special Issue Bioinformatics of Sequencing Data: A Machine Learning Approach)
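
The four evaluation measures named in the abstract above (Acc, Sn, Sp, MCC) can all be derived from the counts of a binary confusion matrix. The following is a minimal sketch under that assumption, not the authors' evaluation code; the function name `binary_metrics` and the example counts are hypothetical and are not taken from the paper.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return Acc, Sn, Sp and MCC computed from binary confusion-matrix counts."""
    total = tp + fp + tn + fn
    acc = (tp + tn) / total
    # Sensitivity: fraction of true positives among all actual positives.
    sn = tp / (tp + fn) if (tp + fn) else 0.0
    # Specificity: fraction of true negatives among all actual negatives.
    sp = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews correlation coefficient; defined as 0 when the denominator vanishes.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Acc": acc, "Sn": sn, "Sp": sp, "MCC": mcc}


# Hypothetical counts, for illustration only.
example = binary_metrics(tp=45, fp=5, tn=40, fn=10)
```

Unlike accuracy, MCC uses all four cells of the confusion matrix, which is one reason it is often reported alongside accuracy when the classes are imbalanced.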

15 pages, 17686 KiB  
Article
Unraveling the Dysbiosis of Vaginal Microbiome to Understand Cervical Cancer Disease Etiology—An Explainable AI Approach
by Karthik Sekaran, Rinku Polachirakkal Varghese, Mohanraj Gopikrishnan, Alsamman M. Alsamman, Achraf El Allali, Hatem Zayed and George Priya Doss C
Genes 2023, 14(4), 936; https://doi.org/10.3390/genes14040936 - 18 Apr 2023
Cited by 1 | Viewed by 1759
Abstract
Microbial dysbiosis is associated with the etiology and pathogenesis of diseases. Studies of the vaginal microbiome in cervical cancer are essential to discern the cause and effect of the condition. The present study characterizes the microbial pathogenesis involved in developing cervical cancer. Relative species abundance assessment identified Firmicutes, Actinobacteria, and Proteobacteria as dominant at the phylum level. A significant increase in Lactobacillus iners and Prevotella timonensis at the species level revealed their pathogenic influence on cervical cancer progression. The diversity, richness, and dominance analyses reveal a substantial decline in cervical cancer compared to control samples. The β diversity index proves the homogeneity in the subgroups’ microbial composition. The association between enriched Lactobacillus iners at the species level, Lactobacillus, Pseudomonas, and Enterococcus genera with cervical cancer is identified by Linear discriminant analysis Effect Size (LEfSe) prediction. The functional enrichment corroborates the microbial disease association with pathogenic infections such as aerobic vaginitis, bacterial vaginosis, and chlamydia. A random forest model is trained and validated on the dataset with repeated k-fold cross-validation to determine the discriminative pattern from the samples. SHapley Additive exPlanations (SHAP), a game theoretic approach, is employed to analyze the results predicted by the model. Interestingly, SHAP identified that the increase in Ralstonia has a higher probability of predicting the sample as cervical cancer. New evidential microbiomes identified in the experiment confirm the presence of pathogenic microbiomes in cervical cancer vaginal samples and their mutuality with microbial imbalance.
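
The repeated k-fold cross-validation mentioned in the abstract above reshuffles the samples and re-partitions them into k folds several times, so that each sample is held out once per repeat. The sketch below only illustrates the index bookkeeping; the values of `k`, `repeats`, and `seed` are assumptions (the abstract does not state them), the function name `repeated_kfold` is hypothetical, and the model fitting itself is omitted.

```python
import random


def repeated_kfold(n_samples, k=5, repeats=3, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for r in range(repeats):
        # A fresh shuffle per repeat gives a different partition each time.
        idx = list(range(n_samples))
        rng.shuffle(idx)
        # Deal the shuffled indices into k folds of near-equal size.
        folds = [idx[i::k] for i in range(k)]
        for f, test_idx in enumerate(folds):
            train_idx = [i for j, fold in enumerate(folds) if j != f for i in fold]
            yield r, f, train_idx, test_idx


# Hypothetical sizes, for illustration only.
splits = list(repeated_kfold(n_samples=20, k=5, repeats=3))
```

A classifier would be fitted on `train_idx` and scored on `test_idx` for every split, with the scores then averaged across all folds and repeats.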

19 pages, 14683 KiB  
Article
The Sum of Two Halves May Be Different from the Whole—Effects of Splitting Sequencing Samples Across Lanes
by Eleanor C. Williams, Ruben Chazarra-Gil, Arash Shahsavari and Irina Mohorianu
Genes 2022, 13(12), 2265; https://doi.org/10.3390/genes13122265 - 01 Dec 2022
Viewed by 1605
Abstract
The advances in high-throughput sequencing (HTS) have enabled the characterisation of biological processes at an unprecedented level of detail; most hypotheses in molecular biology rely on analyses of HTS data. However, achieving increased robustness and reproducibility of results remains a main challenge. Although variability in results may be introduced at various stages, e.g., alignment, summarisation or detection of differential expression, one source of variability was systematically omitted: the sequencing design, which propagates through analyses and may introduce an additional layer of technical variation. We illustrate qualitative and quantitative differences arising from splitting samples across lanes on bulk and single-cell sequencing. For bulk mRNAseq data, we focus on differential expression and enrichment analyses; for bulk ChIPseq data, we investigate the effect on peak calling and the peaks’ properties. At the single-cell level, we concentrate on identifying cell subpopulations. We rely on markers used for assigning cell identities; both smartSeq and 10× data are presented. The observed reduction in the number of unique sequenced fragments limits the level of detail on which the different prediction approaches depend. Furthermore, the sequencing stochasticity adds in a weighting bias corroborated with variable sequencing depths and (yet unexplained) sequencing bias. Subsequently, we observe an overall reduction in sequencing complexity and a distortion in the biological signal across technologies, experimental contexts, organisms and tissues.

11 pages, 1169 KiB  
Article
Unexpected Actors in Inflammatory Bowel Disease Revealed by Machine Learning from Whole-Blood Transcriptomic Data
by Jan K. Nowak, Cyntia J. Szymańska, Aleksandra Glapa-Nowak, Rémi Duclaux-Loras, Emilia Dybska, Jerzy Ostrowski, Jarosław Walkowiak and Alex T. Adams
Genes 2022, 13(9), 1570; https://doi.org/10.3390/genes13091570 - 01 Sep 2022
Viewed by 1635
Abstract
Although big data from transcriptomic analyses have helped transform our understanding of inflammatory bowel disease (IBD), they remain underexploited. We hypothesized that the application of machine learning using lasso regression to transcriptomic data from IBD patients and controls can help identify previously overlooked genes. Transcriptomic data provided by Ostrowski et al. (ENA PRJEB28822) were subjected to a two-stage process of feature selection to discriminate between IBD and controls. First, a principal component analysis was used for dimensionality reduction. Second, the least absolute shrinkage and selection operator (lasso) regression was employed to identify genes potentially involved in the pathobiology of IBD. The study included data from 294 participants: 100 with ulcerative colitis (48 adults and 52 children), 99 with Crohn’s disease (45 adults and 54 children), and 95 controls (46 adults and 49 children). IBD patients presented a wide range of disease severity. Lasso regression preceded by principal component analysis successfully selected interesting features in the IBD transcriptomic data and yielded 12 models. The models achieved high discriminatory value (range of the area under the receiver operating characteristic curve 0.61–0.95) and identified over 100 genes as potentially associated with IBD. PURA, GALNT14, and FCGR1A were the most consistently selected, highlighting the role of the cell cycle, glycosylation, and immunoglobulin binding. Several known IBD-related genes were among the results. The results included genes involved in the TGF-beta pathway, expressed in NK cells, and they were enriched in ontology terms related to immunity. Future IBD research should emphasize the TGF-beta pathway, immunoglobulins, NK cells, and the role of glycosylation.
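
Lasso regression selects features because its L1 penalty can shrink coefficients to exactly zero, leaving only a subset with non-zero weights. The sketch below illustrates that mechanism with cyclic coordinate descent on a squared-error loss. It is an assumption-laden toy, not the study's pipeline: the loss, the value of `alpha`, the function names, and the data are all hypothetical, and the principal component analysis stage described in the abstract is omitted.

```python
def soft_threshold(value, alpha):
    """Shrink value toward zero by alpha; values within [-alpha, alpha] become 0."""
    if value > alpha:
        return value - alpha
    if value < -alpha:
        return value + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimise (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that ignores feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z if z else 0.0
    return w


# Toy data (hypothetical): the response depends on feature 0 only.
X = [[1.0, 0.5], [2.0, -0.3], [3.0, 0.1], [4.0, -0.2], [5.0, 0.4]]
y = [2.0 * row[0] for row in X]
weights = lasso_coordinate_descent(X, y, alpha=0.1)
selected = [j for j, wj in enumerate(weights) if wj != 0.0]
```

On this toy input the coefficient of the uninformative second feature is shrunk to exactly zero, so `selected` contains only feature 0. In a transcriptomic setting the surviving features would correspond to candidate genes.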