Identification of Colon Immune Cell Marker Genes Using Machine Learning Methods

Yang, Yong; Zhang, Yuhang; Ren, Jingxin; Feng, Kaiyan; Li, Zhandong; Huang, Tao; Cai, Yudong

doi:10.3390/life13091876

by Yong Yang ^1,†, Yuhang Zhang ^2,†, Jingxin Ren ^3, Kaiyan Feng ^4, Zhandong Li ^5, Tao Huang ^6,7,* and Yudong Cai ^3,*

^1 Qianwei Hospital of Jilin Province, Changchun 130012, China
^2 Channing Division of Network Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston, MA 02115, USA
^3 School of Life Sciences, Shanghai University, Shanghai 200444, China
^4 Department of Computer Science, Guangdong AIB Polytechnic College, Guangzhou 510507, China
^5 College of Biological and Food Engineering, Jilin Engineering Normal University, Changchun 130052, China
^6 Bio-Med Big Data Center, CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
^7 CAS Key Laboratory of Tissue Microenvironment and Tumor, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China

^* Authors to whom correspondence should be addressed.
^† These authors contributed equally to this work.

Life 2023, 13(9), 1876; https://doi.org/10.3390/life13091876

Submission received: 30 July 2023 / Revised: 24 August 2023 / Accepted: 4 September 2023 / Published: 7 September 2023

(This article belongs to the Section Genetics and Genomics)

Abstract

Immune cell infiltration at the site of colon tumors influences the course of cancer. Different immune cell compositions in the microenvironment lead to different immune responses and different therapeutic effects. This study analyzed single-cell RNA sequencing data from normal colon with the aim of screening genetic markers of 25 candidate immune cell types and revealing quantitative differences between them. The dataset contains 25 classes of immune cells and 41,650 cells in total, and each cell is represented by the expression levels of 22,164 genes. These data were fed into a machine learning-based pipeline. Five feature ranking algorithms (least absolute shrinkage and selection operator, light gradient boosting machine, Monte Carlo feature selection, minimum redundancy maximum relevance, and random forest) were first used to analyze the importance of gene features, yielding five feature lists. Then, incremental feature selection and two classification algorithms (decision tree and random forest) were combined to filter the most important genetic markers from each list. For different immune cell subtypes, marker genes, such as KLRB1 in CD4 T cells, RPL30 in IgA plasma B cells, and JCHAIN in IgG-producing B cells, were identified. They were confirmed to be differentially expressed in different immune cells and involved in immune processes. In addition, quantitative rules were summarized by using the decision tree algorithm to distinguish candidate immune cell types. These results provide a reference for exploring the cell composition of the colon cancer microenvironment and for clinical immunotherapy.

Keywords: colon immune cell; marker gene; machine learning; feature selection

1. Introduction

The large intestine is the last section of the gastrointestinal tract, ending with the anal canal [1]. It is composed of four parts, the cecum, colon, rectum, and anal canal [2], and its main function is to receive food from the small intestine and dehydrate it to form stool. The colon, as the most important section of the large intestine, is responsible for absorbing remaining nutrients and water and delivering feces for excretion [3,4]. As a unique environment with immune tolerance, the colon contains a diverse community of microbes [5] that are associated with various disease states. The microbial community, also known as the microbiome, interacts with the host immune system and has been shown to remodel the immune microenvironment of the human colon [6].

Immune cell profiling may be associated with the physiology and pathology of colon tissue. However, the heterogeneous and dynamic nature of immune cells often hinders the accurate analysis of immune cell profiles using traditional experimental methods. Moreover, the limited number of markers and the low resolution of traditional experimental methods complicate the identification process. With the development of single-cell sequencing, we can now profile the transcriptome of individual immune cells. Traditionally, different cell types are identified qualitatively using specific biomarkers. For instance, CD4 and CD8 are used to differentiate mature T cells [7,8]. However, such biomarkers cannot (1) quantitatively reveal the differences between different immune cell types or (2) recognize novel biomarkers with great biological significance.

In this study, we utilized single-cell RNA sequencing data of a normal gut from the Gut Cell Atlas (https://www.gutcellatlas.org/, accessed on 28 December 2020) [9,10]. We profiled immune cells from 25 candidate immune cell types. Each cell is represented by more than 22,000 genes. We applied several machine learning algorithms to these data to mine important information. The algorithms included five feature ranking algorithms (least absolute shrinkage and selection operator (LASSO) [11], light gradient boosting machine (LightGBM) [12], Monte Carlo feature selection (MCFS) [13], minimum redundancy maximum relevance (mRMR) [14], and random forest (RF) [15]), an incremental feature selection (IFS) method [16], a synthetic minority oversampling technique (SMOTE) [17], and two classification algorithms (decision tree (DT) [18] and random forest (RF) [19]). The purpose was to (1) identify potential biomarkers to distinguish different cell types qualitatively and (2) quantify the differences between different cell types. Machine learning models have been summarized to be effective for the prediction and reconstruction of metabolic pathways [20], including a series of reference-based approaches like BlastKOALA [21], KAAS [22], GhostKOALA [21] and RAST [23]. The utilization of machine learning models for metabolic pathway prediction and annotation can not only enable low-cost automatic biological function annotation but also reveal the internal relationships between different biological pathways. Overall, our study reveals cell-type specific biomarkers for immune cell profiling, providing a reference for further monitoring of pathological cell-type composition and exploration of molecular alterations.

2. Materials and Methods

Figure 1 demonstrates the machine learning-based pipeline designed in this study. The expression levels of genes from immune cells in normal colon were used as features, which were ranked by five feature-ranking algorithms based on the degree of correlation between genes and cell types. The resulting feature lists were fed into the IFS framework one by one to filter out the most critical subset of genes that can efficiently classify cells into different types. As a result, efficient classifiers and rules were constructed.

2.1. Data

The scRNA-seq data of normal colon obtained by James KR et al. [9] were divided into 25 groups based on immune cell subtypes. The dataset comprised a total of 41,650 cells, each with expression levels of 22,164 genes. Table 1 shows the 25 cell subtypes and sample sizes for each immune cell subtype. The sample size of each immune cell type varied greatly, and the dataset needed to be balanced.

2.2. Feature Ranking Algorithms

Each cell was represented by the expression levels of 22,164 genes, which served as features. A large number of these genes were not associated with immune cell-specific expression, so feature selection was performed using five feature ranking algorithms. The genes were sorted on the basis of their degree of association with immune cell subtypes. The five ranking methods were LASSO [11], LightGBM [12], MCFS [13], mRMR [14], and RF [15]. These methods have been successfully applied in machine learning applications in the life sciences [24,25,26,27,28,29,30,31].

2.2.1. Least Absolute Shrinkage and Selection Operator

LASSO [11] generates a sparse model in which the contribution of each feature is easier to interpret. It optimizes a loss function with an L1 regularization term to address the overfitting problem that tends to arise in linear models. Irrelevant features are removed by shrinking the coefficients of unimportant features to zero, saving computational effort. Features with comparable correlation are assigned similar coefficients, which can alleviate the covariance problem. Features with larger absolute coefficients are considered more important. Thus, features can be ranked in a list in decreasing order of the absolute values of their coefficients. Here, the LASSO program in Scikit-learn (version 1.2.0) [32] (accessed on 20 January 2023), implemented in Python (version 3.10), was adopted. It was run with default parameters.
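The ranking idea can be illustrated with a minimal from-scratch sketch. It is not the Scikit-learn program used in the study: it minimizes the objective that scikit-learn documents for its Lasso estimator, (1/(2n))·||y − Xw||² + alpha·||w||₁, by cyclic coordinate descent and then orders features by absolute coefficient. The toy data, the value of `alpha`, and all function names are assumptions made for illustration. How the 25 cell-type labels were encoded as a regression target is not stated in the text and is not modeled here.

```python
def soft_threshold(value, penalty):
    """Shrink value toward zero by penalty; return 0.0 inside the dead zone."""
    if value > penalty:
        return value - penalty
    if value < -penalty:
        return value + penalty
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimise (1/(2n)) * ||y - Xw||^2 + alpha * ||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0   # correlation of feature j with the residual that ignores feature j
            norm = 0.0  # squared norm of feature j
            for i in range(n):
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                norm += X[i][j] ** 2
            w[j] = soft_threshold(rho / n, alpha) / (norm / n) if norm else 0.0
    return w


def rank_by_coefficient(w):
    """Feature indices in decreasing order of absolute coefficient."""
    return sorted(range(len(w)), key=lambda j: -abs(w[j]))


# Toy data (hypothetical): feature 0 drives the target, feature 1 is unrelated to it.
X = [[1.0, 1.0], [-1.0, 1.0], [1.0, -1.0], [-1.0, -1.0]]
y = [2.0, -2.0, 2.0, -2.0]
coefficients = lasso_coordinate_descent(X, y, alpha=0.5)
ranking = rank_by_coefficient(coefficients)
```

In this toy case the unrelated feature receives a coefficient of exactly zero, so it falls to the bottom of the ranking, which is the behavior the paragraph above describes.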

2.2.2. Light Gradient Boosting Machine

LightGBM [12] is based on the principles of the gradient-boosted DT method. It introduces a histogram algorithm that bins feature values into intervals rather than traversing the entire dataset. Gradient-based one-side sampling and exclusive feature bundling are used in constructing the trees. It grows each tree leaf-wise, continuing to split only the leaf with the greatest gain. All these features make LightGBM more efficient to train and less memory-intensive than other methods. The more a feature contributes to constructing the trees, the more important it is to the classification task. The features are ranked in a list according to the number of times they are used in the constructed trees. The current study adopted the LightGBM Python program available at https://lightgbm.readthedocs.io/en/latest/ (accessed on 10 May 2020). It was also executed with default parameters.

2.2.3. Monte Carlo Feature Selection

MCFS [13] samples the feature set and the training sample set in a repeated random sampling fashion. For each feature subset, its performance on different training sample sets is evaluated. The relative importance score of each feature is calculated by combining its contributions to the different models. The features are ranked on the basis of this score. The MCFS program used in this study was obtained at http://www.ipipan.eu/staff/m.draminski/mcfs.html (accessed on 4 June 2019). It was implemented in the Java software dmLab (version 2.1.1). Default parameters were used.

2.2.4. Minimum Redundancy Maximum Relevance

mRMR [14] considers both the degree of overlap between features and the degree of correlation between features and target variables. Because similar features carry overlapping information, the algorithm holds that they should not be selected simultaneously. mRMR selects the features that correlate most with the target variables and are least redundant with the already selected features. This principle improves the efficiency and accuracy of selecting the best feature set. mRMR selects features one by one. The feature selected in each round must satisfy the following conditions: (1) maximum correlation to the target variable; and (2) minimum redundancy with the already-selected features. Based on the selection order, a feature list can be built. Here, the mRMR program available at http://home.penglab.com/proj/mRMR/ (accessed on 2 May 2018), implemented in C++, was used. It was also run with default parameters.
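The round-by-round selection can be sketched as a greedy loop. This is an illustration only, not the C++ program used in the study. It assumes discrete feature values, measures relevance and redundancy with mutual information, and combines them as relevance minus mean redundancy, which is one common variant. The feature names and toy values are hypothetical.

```python
from collections import Counter
from math import log


def mutual_information(a, b):
    """Mutual information (in nats) between two equally long discrete sequences."""
    n = len(a)
    count_a, count_b, count_ab = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (c / n) * log(c * n / (count_a[x] * count_b[y]))
        for (x, y), c in count_ab.items()
    )


def mrmr_rank(features, target):
    """Greedy mRMR: each round picks the feature maximising relevance minus mean redundancy."""
    relevance = {name: mutual_information(col, target) for name, col in features.items()}
    selected, remaining = [], list(features)
    while remaining:
        def score(name):
            if not selected:
                return relevance[name]
            redundancy = sum(
                mutual_information(features[name], features[s]) for s in selected
            ) / len(selected)
            return relevance[name] - redundancy

        best = max(remaining, key=score)  # ties go to the earlier feature
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data (hypothetical): four cell types, two samples each.
target = [0, 0, 1, 1, 2, 2, 3, 3]
features = {
    "gene_a": [0, 0, 1, 1, 2, 2, 2, 2],       # highly informative
    "gene_a_copy": [0, 0, 1, 1, 2, 2, 2, 2],  # fully redundant with gene_a
    "gene_b": [0, 0, 0, 0, 0, 0, 1, 1],       # less informative but complementary
}
ranking = mrmr_rank(features, target)
```

Here the exact copy of the first selected feature is pushed behind the weaker but complementary feature, which illustrates the minimum-redundancy condition.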

2.2.5. Random Forest

RF uses the permutation feature importance measure to rank features. This concept was first introduced by Breiman for RF [19] and later extended to other algorithms by Fisher, Rudin, and Dominici [15]. RF constructs each tree from a randomly selected subset of features. For each feature whose values are permuted, the algorithm compares the performance of the model before and after the permutation, and features that have a greater effect on accuracy are considered more important. The present study used the corresponding package in Scikit-learn (version 1.2.0) [32], implemented in Python (version 3.10). Default parameters were also adopted.
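One common reading of the before-and-after comparison described above is permutation importance, which can be sketched for any fitted model. The sketch below is an assumption-laden illustration and does not reproduce the Scikit-learn package used in the study. The `toy_model`, the toy data, and the number of repeats are hypothetical.

```python
import random


def accuracy(predict, rows, labels):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(r) == t for r, t in zip(rows, labels)) / len(rows)


def permutation_importance(predict, rows, labels, n_repeats=5, seed=0):
    """Mean drop in accuracy after shuffling each feature column in turn."""
    rng = random.Random(seed)
    baseline = accuracy(predict, rows, labels)
    importances = []
    for j in range(len(rows[0])):
        drops = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)  # break the link between feature j and the labels
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            drops.append(baseline - accuracy(predict, shuffled, labels))
        importances.append(sum(drops) / n_repeats)
    return importances


def toy_model(row):
    """Hypothetical fitted model: it looks only at feature 0."""
    return int(row[0] > 0.5)


# Toy data (hypothetical): the label equals feature 0, feature 1 is noise.
rows = [[i % 2, (i // 2) % 2] for i in range(20)]
labels = [r[0] for r in rows]
importances = permutation_importance(toy_model, rows, labels)
ranking = sorted(range(len(importances)), key=lambda j: -importances[j])
```

Shuffling a feature the model ignores leaves accuracy unchanged, so its importance is zero, whereas shuffling the feature the model relies on lowers accuracy.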

The above five feature ranking algorithms were applied to the scRNA-seq data mentioned in Section 2.1. Accordingly, five feature lists were obtained, which were called LASSO, LightGBM, MCFS, mRMR, and RF feature lists, respectively.

2.3. Incremental Feature Selection

From the five feature lists yielded by the five feature ranking algorithms, it was still difficult to extract essential gene features because the rank threshold was not easy to determine. In view of this, the IFS method was employed [16], which is a widely used method to determine the optimal features for a given classification algorithm. For a feature list, it creates a series of feature subsets with an equal interval s. In detail, the first s features in the list comprise the first feature subset. Then, the following s features are added to the first subset to build the second feature subset. In general, the next s features are added to the current subset to construct the next subset. On each constructed feature subset, a classifier is built with a given classification algorithm. All classifiers are evaluated using a cross-validation method [33]. After assessing the cross-validation results, the classifier with the best performance can be obtained. This classifier is called the optimal classifier, and the features it uses are called optimal features, which together make up the optimal feature subset.
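The subset construction and the search for the best subset can be sketched as follows. This is a minimal illustration, not the authors' code. The `evaluate` callback stands in for building a classifier and scoring it by cross-validation, and `toy_evaluate` is a hypothetical score that peaks at 300 features. The values `step=10` and `limit=2000` mirror the settings reported in the Results.

```python
def incremental_subsets(ranked_features, step, limit=None):
    """Yield nested subsets holding the top step, 2*step, ... ranked features."""
    top = ranked_features[:limit] if limit is not None else ranked_features
    for size in range(step, len(top) + 1, step):
        yield top[:size]


def incremental_feature_selection(ranked_features, evaluate, step, limit=None):
    """Return (best_score, best_subset) over all incremental subsets."""
    best_score, best_subset = float("-inf"), []
    for subset in incremental_subsets(ranked_features, step, limit):
        score = evaluate(subset)  # e.g. cross-validated weighted F1
        if score > best_score:
            best_score, best_subset = score, subset
    return best_score, best_subset


def toy_evaluate(subset):
    """Hypothetical stand-in for a cross-validated score; peaks at 300 features."""
    return 1.0 - abs(len(subset) - 300) / 2000


# Placeholder gene names for a ranked list of 22,164 features.
ranked = [f"gene_{i}" for i in range(22164)]
subsets = list(incremental_subsets(ranked, step=10, limit=2000))
best_score, best_subset = incremental_feature_selection(
    ranked, toy_evaluate, step=10, limit=2000
)
```

With an interval of 10 over the top 2000 features, the generator produces 200 nested subsets, matching the count given in Section 3.1.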

2.4. Synthetic Minority Oversampling Technique

As listed in Table 1, the sample sizes of the 25 immune cell types varied dramatically, with B cell (IgA plasma) having 1252 times the sample size of lymphoid DC. This severe imbalance can skew the predictions of the model toward the categories with larger sample sizes. This study used the SMOTE method [17] to tackle this problem by increasing the samples of minority categories. SMOTE maps the training samples into the feature space and generates new samples for a minority category by interpolating between a sample and one of its nearest neighbors in the same category. For each minority category, this procedure was repeated until the category contained as many samples as the largest category. This operation was conducted on each category except the largest one, thereby balancing the dataset. The SMOTE Python package used in this study was sourced from https://github.com/scikitlearn-contrib/imbalanced-learn (accessed on 24 March 2020) and was run with its default parameters.
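The interpolation step can be sketched in a few lines. This is a simplified illustration for a single minority class, not the imbalanced-learn implementation used in the study. The function names, the neighbor count `k=5`, and the toy points are assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def smote_oversample(samples, n_target, k=5, seed=0):
    """Add synthetic points to one minority class until it holds n_target samples."""
    if len(samples) < 2:
        raise ValueError("at least two samples are needed to interpolate")
    rng = random.Random(seed)
    synthetic = []
    while len(samples) + len(synthetic) < n_target:
        base = rng.choice(samples)
        others = [s for s in samples if s is not base]
        neighbours = sorted(others, key=lambda s: squared_distance(base, s))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to neighbour
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return samples + synthetic


# Toy minority class (hypothetical) with three cells and two features.
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = smote_oversample(minority, n_target=10)
```

Every synthetic point lies on a segment between two real samples of the same class, so the new samples stay inside the region already occupied by that class.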

2.5. Classification Algorithm

In the IFS method, one classification algorithm is necessary. Its optimal features can be extracted through IFS. Here, two classification algorithms were attempted, DT [18] and RF [19].

2.5.1. Decision Tree

The DT algorithm [18] is a basic, transparent classification algorithm that constructs a tree structure. The algorithm starts at the root node, and a series of recursive operations are performed to reach the leaf node, which contains the category labels. Several attribute judgments are made on the instances between the leaf node and the root node, and the instances are assigned to the downstream subsets based on the judgment results. In this study, the attribute judgments are based on the expression levels of key genes of immune cells.

2.5.2. Random Forest

The RF algorithm [19] judges the class of instances by constructing several DT classifiers and combining the classification results of all DT classifiers in a voting manner. Generally, the RF algorithm has better generalization ability and robustness than the DT algorithm and is more inclusive of data containing noise.

The above DT and RF algorithms were implemented using the corresponding packages in Scikit-learn (version 1.2.0) [32] (accessed on 20 January 2023) in Python (version 3.10). For convenience, they were run with default parameters.

2.6. Performance Evaluation

In binary classification, precision, recall, and F1-measure [34,35,36,37,38,39,40] are important metrics to evaluate the performance of classifiers. For multi-class classification, these measurements can be defined on each class. The precision, recall, and F1-measure for the i-th class can be computed by

\mathrm{Precision}_{i} = \frac{TP_{i}}{TP_{i} + FP_{i}}, \quad (1)

\mathrm{Recall}_{i} = \frac{TP_{i}}{TP_{i} + FN_{i}}, \quad (2)

\mathrm{F1\text{-}measure}_{i} = \frac{2 \times \mathrm{Recall}_{i} \times \mathrm{Precision}_{i}}{\mathrm{Recall}_{i} + \mathrm{Precision}_{i}}, \quad (3)

where TP_{i}, FP_{i}, and FN_{i} represent true positives, false positives, and false negatives for the i-th class. The average of F1-measure on all classes can be used to evaluate the overall performance of classifiers, which is generally called macro F1. However, such measurement may lead to biased results when dealing with imbalanced datasets, where the sample sizes of different classes vary widely. Therefore, a weighted version of F1-measure, termed weighted F1, is designed, which can be computed by

\mathrm{Weighted\ F1} = \sum_{i=1}^{L} \mathrm{F1\text{-}measure}_{i} \times w_{i}, \quad (4)

where L represents the number of classes (L = 25 in this study) and w_{i} represents the proportion of samples in that category to the overall sample. As weighted F1 is more accurate than macro F1, it was selected as the key measurement in this study.
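Equations (1) to (4) can be checked on a small example. The sketch below is an illustration with hypothetical labels, not the evaluation code used in the study. It assigns a score of zero whenever a denominator is zero, which is an assumption.

```python
def per_class_f1(y_true, y_pred):
    """Return {class: (precision, recall, f1)} following Equations (1)-(3)."""
    result = {}
    for c in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        result[c] = (precision, recall, f1)
    return result


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1-measures."""
    scores = per_class_f1(y_true, y_pred)
    return sum(f1 for _, _, f1 in scores.values()) / len(scores)


def weighted_f1(y_true, y_pred):
    """Equation (4): per-class F1 weighted by the share of true samples in each class."""
    scores = per_class_f1(y_true, y_pred)
    n = len(y_true)
    return sum(scores[c][2] * y_true.count(c) / n for c in scores)


# Toy labels (hypothetical): four B cells and two T cells, one B cell misclassified.
y_true = ["B", "B", "B", "B", "T", "T"]
y_pred = ["B", "B", "B", "T", "T", "T"]
macro = macro_f1(y_true, y_pred)
weighted = weighted_f1(y_true, y_pred)
```

In this example the per-class F1 values are 6/7 for the larger class and 4/5 for the smaller one, so macro F1 is 58/70 (about 0.829) and weighted F1 is 88/105 (about 0.838). The weighted value sits closer to the score of the larger class.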

In addition, accuracy (ACC) and the Matthews correlation coefficient (MCC) [41] were also employed, which are widely used to measure the performance of classifiers. ACC is defined as the proportion of correctly predicted samples among all samples. The calculation of MCC is more complicated. Two matrices X and Y should be constructed first, which store the true and predicted class of each sample, respectively. Then, MCC can be computed by

\mathrm{MCC} = \frac{\mathrm{cov}(X, Y)}{\sqrt{\mathrm{cov}(X, X)\,\mathrm{cov}(Y, Y)}}, \quad (5)

where cov(X,Y) stands for the covariance of the two matrices.
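Equation (5) can be made concrete with a short sketch. It assumes that X and Y are one-hot indicator matrices with one row per sample and one column per class, and that the covariance of two matrices is the sum of the covariances of their matching columns. These assumptions follow the usual multi-class formulation and are not spelled out in the text. The toy labels are hypothetical.

```python
from math import sqrt


def one_hot(labels, classes):
    """Indicator matrix with one row per sample and one column per class."""
    return [[1.0 if label == c else 0.0 for c in classes] for label in labels]


def matrix_cov(A, B):
    """Sum over class columns of the covariance between matching columns of A and B."""
    n, k = len(A), len(A[0])
    total = 0.0
    for j in range(k):
        mean_a = sum(row[j] for row in A) / n
        mean_b = sum(row[j] for row in B) / n
        total += sum((a[j] - mean_a) * (b[j] - mean_b) for a, b in zip(A, B)) / n
    return total


def multiclass_mcc(y_true, y_pred):
    """Equation (5) with X built from true labels and Y from predicted labels."""
    classes = sorted(set(y_true) | set(y_pred))
    X, Y = one_hot(y_true, classes), one_hot(y_pred, classes)
    denom = sqrt(matrix_cov(X, X) * matrix_cov(Y, Y))
    return matrix_cov(X, Y) / denom if denom else 0.0


# Toy labels (hypothetical): four B cells and two T cells, one B cell misclassified.
y_true = ["B", "B", "B", "B", "T", "T"]
y_pred = ["B", "B", "B", "T", "T", "T"]
mcc = multiclass_mcc(y_true, y_pred)
acc = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

For two classes this reduces to the familiar binary MCC. Here it gives 6/√72 (about 0.707), while ACC is 5/6.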

3. Results

The scRNA-seq data from 25 classes of immune cells in normal colon were investigated in depth in this study, where each cell was represented by the expression levels of 22,164 genes. Five feature ranking algorithms were used to rank gene features according to their degree of association with subtypes of immune cells, and the algorithms output five feature lists (Table S1). As these algorithms are designed following different principles, they examine the scRNA-seq data from different points of view. Thus, the five lists were quite different. Generally, the higher-ranked gene features show greater variation in expression levels across immune cell subtypes from a certain aspect, implying a greater contribution when making judgments about unknown cells.

3.1. Dynamics of Classifier Performance

As listed in Table S1, each list contained a large number of gene features. If all features were considered, the running time of the IFS method would be very long. On the other hand, the essential genes that are highly related to the classification of immune cells generally occupy a small proportion of all genes. Thus, to save time, only the top 2000 gene features in each list were considered and fed into the IFS method. With the interval set to 10, 200 gene subsets were constructed from each list. For each gene subset, DT and RF were adopted to construct classifiers, and each classifier was evaluated by 10-fold cross-validation. The cross-validation results were summarized as macro F1, weighted F1, ACC, and MCC, which are available in Table S2. Furthermore, the ACC for each fold is also provided in Table S2. To display how classifier performance changes with different feature subsets, IFS curves were plotted, as shown in Figure 2, where weighted F1 is on the Y-axis and the size of the subset is on the X-axis.

When DT was used in the IFS method, its highest weighted F1 values under the five feature lists were 0.827, 0.895, 0.893, 0.875, and 0.892, respectively. Such performance was obtained using the top 1240, 300, 1620, 350, and 170 features in five feature lists, respectively. Accordingly, the optimal DT classifiers can be constructed on each feature list using the above top features in the corresponding feature lists. Their detailed performance is listed in Table 2. These classifiers all have good performance for classification of immune cells.

Additionally, RF was also attempted in the IFS method. According to Figure 2, the highest weighted F1 values under the five feature lists were 0.959, 0.978, 0.976, 0.972, and 0.978, respectively. These values were achieved using the top 1230, 1200, 1720, 560, and 1510 gene features in the five feature lists, respectively. Evidently, such performance was higher than that of DT. Likewise, with the above features, five optimal RF classifiers can be set up. The detailed performance of these five classifiers is also listed in Table 2. Clearly, their performance was extremely high, and all measurements were higher than 0.9.

Based on the above results, we can find that the optimal RF classifiers on the LightGBM and RF feature lists were better than other optimal RF/DT classifiers. These two classifiers can be efficient tools for classifying immune cells.

3.2. Relationships between the Most Essential Genes Extracted from Five Lists

From the results in Section 3.1, the optimal RF classifier was much better than the optimal DT classifier on the same feature list. Thus, we selected the optimal features for RF in the five lists for further analysis. However, the number of optimal features for RF was always too large to allow a detailed investigation. For example, there were 1230 optimal features for the optimal RF classifier on the LASSO feature list. In view of this, we extracted the most essential genes from each optimal feature subset. By checking the IFS results (Table S2), it can be found that some RF classifiers provided slightly lower performance than the optimal RF classifiers but needed far fewer gene features. We refer to these classifiers as feasible classifiers. The weighted F1 values of these feasible classifiers are marked on the IFS curves, as shown in Figure 2, and their detailed performance is listed in Table 2. The feasible RF classifiers on the five feature lists used the top 50, 60, 90, 90, and 70 features in the corresponding lists. The relationship between these five gene subsets is illustrated by a Venn diagram, as shown in Figure 3. The detailed intersection results can be found in Table S3. It can be observed that some genes were identified as the most essential genes by multiple feature ranking algorithms, implying that they were more likely to be biomarker genes. More attention should be paid to these genes. As for those identified by only one feature ranking algorithm, we cannot rule out the possibility that they are latent biomarker genes. Thus, we listed all of them in Table S3, which may give useful insights for other investigators.

Furthermore, enrichment analysis was conducted on the above five gene subsets. The enriched gene ontology (GO) terms and KEGG pathways for each subset are illustrated in Figure 4. The GO enrichment results showed that the genes of different subsets shared the same molecular functions. The most frequently enriched GO terms, including GO:0003735, GO:0003823, and GO:0023023, were enriched in at least four gene subsets. GO:0003735 describes the structural constituent of ribosomes. In colorectal cancer, increased protein synthesis may be a marker of rapid proliferation and growth of cancer cells, and abnormalities in ribosome biosynthesis and function may be associated with tumorigenesis [42,43]. GO:0003823 describes antigen binding. Antigen binding is directly related to the immune response. Colon and colorectal cancer cells may express different tumor antigens that can be recognized by the immune system. Therefore, antigen binding may be associated with tumor immune escape or immunotherapy [44]. GO:0023023 describes MHC protein complex binding. The major histocompatibility complex (MHC) plays a key role in the immune response, especially in the cell-mediated immune response. Aberrant expression or modification of MHC proteins may be associated with immune escape of tumor cells, especially in colorectal cancer [45]. MHC proteins are also involved in regulating the function of tumor-infiltrating lymphocytes during antigen processing [46]. The KEGG results likewise showed that different sets of genes shared multiple pathways. The frequently shared pathways included hsa03010 and hsa04612, among others. hsa03010 describes the ribosome, and hsa04612 describes antigen processing and presentation. The KEGG results are highly similar to the GO enrichment results, emphasizing the role of the ribosome and of antigen processing and presentation in the colon immune response.

3.3. Classification Rules

Based on the IFS results, the performance of DT was generally lower than that of RF. However, DT has a special merit that is not shared by RF. From the constructed DT, a group of classification rules can be obtained, each of which corresponds to a path from the root to one leaf. As mentioned above, the optimal DT classifiers used the top 1240, 300, 1620, 350, and 170 features in the five lists. Based on these features, we constructed five DTs using all cell samples, thereby obtaining five rule groups. Table S4 shows these five rule groups, which contained 4660, 3457, 3143, 3761, and 3665 rules. Any rule group can be used to classify immune cells. Furthermore, as each rule contains a number of genes and thresholds on the expression levels of the involved genes, it can indicate the expression pattern of genes for one cell type. In each rule group, every cell type received one or more rules. The number of rules for each cell type in the five groups is illustrated in Figure 5. It can be observed that some cell types (e.g., Treg, Th17, Th1, and Activated CD4 T) were assigned more rules than other cell types. To compare the classification rules generated on different feature lists for different cell types, we summarized the gene markers present in these rules, termed rule-associated genes (Figure 6 and Figure 7). Cell types were further clustered on the basis of the presence or absence of these gene markers. The clustering results revealed the molecular similarity between different cell types. As each rule group was generated on the list yielded by a different feature ranking algorithm, each offering a distinct perspective, the identified cell-type clusters were different.
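How a tree yields one rule per root-to-leaf path can be sketched with a small hand-built tree. The nested-dictionary layout, the placeholder gene names, the thresholds, and the cell-type labels below are all hypothetical and are not taken from Table S4.

```python
def extract_rules(node, conditions=()):
    """Return one (conditions, label) rule per root-to-leaf path of a nested-dict tree."""
    if "label" in node:  # leaf node: the accumulated path becomes one rule
        return [(list(conditions), node["label"])]
    gene, threshold = node["gene"], node["threshold"]
    low = extract_rules(node["low"], conditions + ((gene, "<=", threshold),))
    high = extract_rules(node["high"], conditions + ((gene, ">", threshold),))
    return low + high


def classify(rules, expression):
    """Return the label of the first rule whose conditions all hold."""
    for conditions, label in rules:
        if all(
            expression[g] <= t if op == "<=" else expression[g] > t
            for g, op, t in conditions
        ):
            return label
    return None


# Hypothetical tree: placeholder genes, thresholds, and cell-type labels.
tree = {
    "gene": "gene_1",
    "threshold": 1.0,
    "low": {
        "gene": "gene_2",
        "threshold": 0.5,
        "low": {"label": "type_A"},
        "high": {"label": "type_B"},
    },
    "high": {"label": "type_C"},
}
rules = extract_rules(tree)
predicted = classify(rules, {"gene_1": 0.2, "gene_2": 0.9})
```

The number of rules equals the number of leaves, and each rule is a conjunction of expression thresholds. A cell type that occupies several leaves therefore receives several rules.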

4. Discussion

Predicting and annotating metabolic pathways has been a time-consuming and expensive task for biological studies. The utilization of machine learning models to reconstruct metabolic pathways provides a novel way to understand the molecular mechanisms for biological processes and reveals internal correlations between different biological functions. Compared to traditional experimental or computational methods, we summarized molecular signatures for specific cell types or the cell-type specific biological functions, providing a higher and quantitative level approach for cell clustering and functional exploration. In this study, we applied several machine learning algorithms to (1) identify biomarkers for distinguishing 25 cell types and (2) establish quantitative classification rules for cell-type prediction. Recent publications have supported the predicted biomarkers and rules, validating the efficacy and accuracy of using machine learning algorithms for cell-type feature recognition. Our analytic results for the 25 cell types were summarized into three major categories: T cell family identification, B cell family identification, and other cells, including innate lymphoid cells. The discussed genes are listed in Table 3.

4.1. T Cell Family

The first predicted gene is KLRB1, identified by LASSO. The receptor encoded by KLRB1 has been reported to interact with and be correlated with CD69 [47], a well-known immune cell biomarker for cultured peripheral T cells [48]. During the activation of T cells, the expression of CD69 is upregulated [49,50]. Therefore, the interacting and correlated receptor KLRB1 can help us distinguish activated CD4 T cells from other cells, supporting our prediction.

The next feature ranking algorithm, LightGBM, identified specific genes, such as CST3 and AIF1, which have been predicted to distinguish cell types from the colon transcriptome. CST3 has been widely observed in mast cells [51]. Therefore, this gene may help distinguish mast cells from other cell types. AIF1 is a regulatory allograft inflammation initiator [52,53]. As for its cell specificity, CST3 has been shown to be specifically expressed in T cells [54] and fibroblasts [55] to degrade the extracellular matrix, remodel immune responses, and regulate tissue development [56]. AIF1 has been shown to regulate trophoblast differentiation and participate in FasL/CD95L-mediated cell death [57]. Considering that FasL/CD95L is mediated by T cells, the identification of AIF1 may help recognize T cells, including cytotoxic T cells and T helper cells.

The next algorithm, MCFS, recognized JCHAIN as a cell-type specific biomarker for prediction, validating the efficacy and accuracy of the newly presented methods. HLA-DRA, another predicted biomarker, has been shown to be specifically expressed in IFN-γ-stimulated cells [58], including CD8 T cell and NK T cells. Therefore, such a gene is definitely a biomarker for cell-type recognition.

The mRMR identified TIGIT, which has been widely recognized in various T cell subtypes [59,60]. Differentially expressed TIGIT has been observed in central memory T cells, even compared to other cell types [59,61]. Therefore, regarding TIGIT as a potential cell-specific biomarker to distinguish different cell types in colon tissues is reasonable.

For RF, in the summarized quantitative rules, RPS12 has been predicted to be highly expressed in T follicular helper cells. In accordance with recent publications, a dynamic atlas of immune cells recognizes a high expression of RPS12 in T follicular helper cells [62]. MS4A1 is predicted to be a quantitative biomarker for Th1 cells in our newly established rules, which has also been validated by recent publications [63].

4.2. B Cell Family

In the first rule generated on the LASSO feature list, a low expression level of RPL30 helps us recognize IgA plasma B cells. In accordance with recent publications, RPL13 participates in the reprogramming of immune cells, including B cells, during immune responses [64,65]. A lower expression level of RPL30 has been observed in plasma cells compared with immature B cells, supporting our quantitative rule [64,66]. Similarly, a low expression level of ANXA1 helps recognize IgA plasma B cells and memory B cells. ANXA1 is shown to be highly expressed during cell proliferation [67]. Therefore, using ANXA1 to eliminate highly proliferative cells from candidates is reasonable.

Among the quantitative rules generated on the LightGBM feature list, JCHAIN has been shown to be negatively correlated with the production of antibodies, including IgG [68,69,70]. Therefore, establishing the rule that low expression of JCHAIN helps to recognize IgG-producing B cells is reasonable.

For the results using the MCFS algorithm, we also recognized a low expression of JCHAIN in B cells (IgA, IgG, and memory B cells) [68,69,70]. Another gene, ICA1, as a hub gene, has been shown to be stably expressed in T regulatory cells and T helper cells [71,72]. In addition, this gene has been shown to be associated with exhausted CD4 T cell activation during the progression of colorectal cancer [73]. Therefore, recognizing this gene as a biomarker for T helper cells is reasonable. Another gene, TYROBP, is also a widely reported immune-associated gene. In accordance with recent publications, TYROBP has been shown to be downregulated in a specific cell type, Lyve1 M2-like macrophage [74], corresponding with the quantitative rule generated on the MCFS feature list.

RPS27, a gene specifically identified by mRMR, has been widely reported to be a housekeeping gene in the colon [75]. However, further studies showed that RPS27 is abnormally regulated during pathological immune activation [75]. Therefore, this gene may help distinguish activated immune cells, including plasma B cells and activated CD4 T cells, among our candidates. We also recognized some genes discussed above, including JCHAIN and HLA-DRA, as candidate biomarkers for cell-type classification.

As for the quantitative rules, genes such as JCHAIN and HLA-DRA were again identified as rule biomarkers. IFITM1, which encodes an interferon-induced transmembrane protein, has been widely reported to be associated with follicular B cells [76,77,78]. Consistent with the quantitative rule we established, a low expression of IFITM1 has been shown in mature follicular B cells during virus infection [79].

The composition of immune cells in the colon has been reported to be regulated by IGHA1 during ulcerative colitis [80]. B cells uniquely express IGHA1 during immune activation. Therefore, using IGHA1 as a biomarker to distinguish B cells from other cell types is reasonable. Similarly, CD37 regulates T-cell-dependent B-cell activation [81]. Therefore, T cells and B cells can be distinguished from other cell types using CD37, supporting the efficacy and accuracy of the RF feature ranking algorithm in cell-type prediction.

4.3. Other Cells

RPS13 and RPL34 are both ribosome-associated genes. Ribosome-associated genes are generally expressed without cell specificity or tissue specificity [82]. However, ribosome-associated proteins have been reported to be activated by natural humoral immune regulation under the pathogenic condition of colon tissue [83]. Therefore, a high expression of ribosome-associated genes may indicate the activation of natural humoral immune responses, which aids the recognition of specific cell types, including macrophages, monocytes, and mast cells. Genes such as MALAT1 and TMSB4X have been reported to be cell-type specific. MALAT1 is reported to be expressed in dendritic cells [84], helping us identify lymphoid dendritic cells in the colon. TMSB4X has been reported to be expressed in monocytes and dendritic cells. It can therefore also help distinguish different cell types, although with less specificity.

SOCS3, a gene specifically associated with monocytes and macrophages, has been shown to participate in modulating the immune microenvironment of the colon [85,86]. A low expression of SOCS3 reflects the physiological status of monocytes and macrophages [86], and its expression is strongly upregulated after immune activation. In addition, IL7R has been predicted to help recognize gd T cells. According to a recent publication, the inhibition of IL-17A protects against thyroid immune-related adverse events mediated by gd T cells [87]. Therefore, this gene can serve as a quantitative biomarker to recognize gd T cells.

CD74 has been recognized in the colon as a potential biomarker for colon cancer [88,89]. Although most functional studies of CD74 have focused on epithelial cells, this gene has also been shown to participate in antigen presentation and inflammation regulation [90]. Therefore, using CD74 as a biomarker for dendritic cells (antigen presentation), macrophages (antigen presentation), and T cells (inflammation regulation) is reasonable [89,90].

RPL28 has been shown to regulate inflammatory and tumorigenic networks in colon tissue [91,92,93]. RPL28 is related to the prognosis of colorectal cancer through its participation in immune remodeling [92,94]. Based on the predicted rules, innate lymphoid cells can be quantitatively recognized by high expression of ANKRD28 and TYROBP. The expression patterns of these two genes in innate lymphoid cells have been validated by recent publications [95,96,97].

As discussed above, we identified a series of biomarkers and several quantitative rules for cell-type classification. Additional features identified by the RF feature ranking algorithm have been validated by other algorithms. Recent publications support the features and rules identified by RF, suggesting that it may be the most effective approach for predicting cell types and identifying potential cell biomarkers. Therefore, the machine learning models employed in this study enabled a comprehensive analysis of cell classification at the single-cell level and provide a novel approach for identifying cell-type biomarkers.

4.4. Limitations of This Study

In this study, several feature ranking algorithms were employed to analyze the scRNA-seq data. It is not certain whether these algorithms can mine all essential genes. Some biomarker genes that can be identified only by a particular algorithm may remain undiscovered. In the future, we will continue this study to design a more complete method.

5. Conclusions

Based on single-cell RNA sequencing data of 25 different types of immune cells from the normal intestine, a machine learning-based pipeline was designed. It identified genetic markers that can qualitatively and quantitatively distinguish between these cell subtypes. Our analysis revealed key genes representing different immune cell subtypes, and we constructed several multi-class classifiers to classify these cell types. We further generated rules for quantifying gene expression differences between immune cell subtypes using the DT algorithm. These results were supported by previously published studies, and they may serve as a useful reference for clinical diagnosis and targeted therapy of colon cancer. The code used in this study can be accessed at https://github.com/chenlei1982/ColonImmune/.
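As a minimal illustration of how a decision tree yields a quantitative rule of the kind summarized above, the following Python sketch fits a shallow tree on synthetic data with scikit-learn and prints its thresholds. The gene names GENE_A and GENE_B, the labels type_1 and type_2, and all expression values are hypothetical. This is not the code from the repository and does not reproduce the rules in Table S4.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(0)

# Synthetic expression values (illustrative only, not the study's data).
n = 100
low_expr = rng.normal(loc=1.0, scale=0.3, size=n)    # hypothetical cells with low GENE_A
high_expr = rng.normal(loc=4.0, scale=0.3, size=n)   # hypothetical cells with high GENE_A
noise = rng.normal(loc=2.0, scale=1.0, size=2 * n)   # uninformative GENE_B

# Rows are cells, columns are genes.
X = np.column_stack([np.concatenate([low_expr, high_expr]), noise])
y = np.array(["type_1"] * n + ["type_2"] * n)

# A shallow tree keeps the resulting rules short and readable.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Each root-to-leaf path is one rule: a conjunction of expression thresholds.
rules = export_text(tree, feature_names=["GENE_A", "GENE_B"])
print(rules)
```

Each printed path from the root to a leaf reads as a rule of the form "GENE_A below a threshold, then type_1", which is the sense in which a low or high expression level of a gene helps recognize a cell type.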

Supplementary Materials

The following supporting information can be downloaded at: https://www.mdpi.com/article/10.3390/life13091876/s1, Table S1: Feature ranking results obtained by LASSO, LightGBM, MCFS, mRMR, and RF. Table S2: IFS results with different classification algorithms. Table S3: Intersection of the most essential genes extracted from five lists yielded by LASSO, LightGBM, MCFS, mRMR, and RF. The features that appear in the five, four, three, two, and one feature subsets are shown. Table S4: Classification rules generated by the DT classifier on its optimal features in each list.

Author Contributions

Conceptualization, T.H. and Y.C.; methodology, J.R. and K.F.; validation, T.H.; formal analysis, Y.Y., Y.Z. and Z.L.; data curation, T.H.; writing—original draft preparation, Y.Y., Y.Z. and J.R.; writing—review and editing, T.H. and Y.C.; supervision, Y.C.; funding acquisition, T.H. and Y.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Key R&D Program of China [2022YFF1203202], the Strategic Priority Research Program of Chinese Academy of Sciences [XDA26040304, XDB38050200], the Fund of the Key Laboratory of Tissue Microenvironment and Tumor of Chinese Academy of Sciences [202002], the Shandong Provincial Natural Science Foundation [ZR2022MC072], the Natural Science Foundation of Jilin Province [20210101220JC], and the Health Commission Project of Jilin Province [2021LC042].

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are openly available at https://www.nature.com/articles/s41590-020-0602-z (accessed on 28 December 2020), reference number [9].

Conflicts of Interest

The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

References

1. Azzouz, L.L.; Sharma, S. Physiology, Large Intestine; StatPearls Publishing: St. Petersburg, FL, USA, 2018.
2. Kahai, P.; Mandiga, P.; Wehrle, C.J.; Lobo, S. Anatomy, abdomen and pelvis, large intestine. In StatPearls; StatPearls Publishing: St. Petersburg, FL, USA, 2021.
3. Nigam, Y.; Knight, J.; Williams, N. Gastrointestinal tract 5: The anatomy and functions of the large intestine. Nurs. Times 2019, 115, 50–53.
4. Louis, P.; Scott, K.P.; Duncan, S.H.; Flint, H.J. Understanding the effects of diet on bacterial metabolism in the large intestine. J. Appl. Microbiol. 2007, 102, 1197–1208.
5. Cho, I.; Blaser, M.J. The human microbiome: At the interface of health and disease. Nat. Rev. Genet. 2012, 13, 260–270.
6. Börnigen, D.; Morgan, X.C.; Franzosa, E.A.; Ren, B.; Xavier, R.J.; Garrett, W.S.; Huttenhower, C. Functional profiling of the gut microbiome in disease-associated inflammation. Genome Med. 2013, 5, 65.
7. Germain, R.N. T-cell development and the CD4–CD8 lineage decision. Nat. Rev. Immunol. 2002, 2, 309–322.
8. Amadori, A.; Zamarchi, R.; De Silvestro, G.; Forza, G.; Cavatton, G.; Danieli, G.A.; Clementi, M.; Chieco-Bianchi, L. Genetic control of the CD4/CD8 T-cell ratio in humans. Nat. Med. 1995, 1, 1279–1283.
9. James, K.R.; Gomes, T.; Elmentaite, R.; Kumar, N.; Gulliver, E.L.; King, H.W.; Stares, M.D.; Bareham, B.R.; Ferdinand, J.R.; Petrova, V.N.; et al. Distinct microbial and immune niches of the human colon. Nat. Immunol. 2020, 21, 343–353.
10. Elmentaite, R.; Kumasaka, N.; Roberts, K.; Fleming, A.; Dann, E.; King, H.W.; Kleshchevnikov, V.; Dabrowska, M.; Pritchard, S.; Bolt, L.; et al. Cells of the human intestinal tract mapped across space and time. Nature 2021, 597, 250–255.
11. Ranstam, J.; Cook, J. Lasso regression. J. Br. Surg. 2018, 105, 1348.
12. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T.-Y. Lightgbm: A highly efficient gradient boosting decision tree. Adv. Neural Inf. Process. Syst. 2017, 30, 3146–3154.
13. Draminski, M.; Rada-Iglesias, A.; Enroth, S.; Wadelius, C.; Koronacki, J.; Komorowski, J. Monte carlo feature selection for supervised classification. Bioinformatics 2008, 24, 110–117.
14. Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. 2005, 27, 1226–1238.
15. Fisher, A.; Rudin, C.; Dominici, F. All models are wrong, but many are useful: Learning a variable’s importance by studying an entire class of prediction models simultaneously. J. Mach. Learn. Res. 2019, 20, 1–81.
16. Liu, H.; Setiono, R. Incremental feature selection. Appl. Intell. 1998, 9, 217–230.
17. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. Smote: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
18. Safavian, S.R.; Landgrebe, D. A survey of decision tree classifier methodology. IEEE Trans. Syst. Man Cybern. 1991, 21, 660–674.
19. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32.
20. Shah, H.A.; Liu, J.; Yang, Z.; Feng, J. Review of machine learning methods for the prediction and reconstruction of metabolic pathways. Front. Mol. Biosci. 2021, 8, 634141.
21. Kanehisa, M.; Sato, Y.; Morishima, K. Blastkoala and ghostkoala: Kegg tools for functional characterization of genome and metagenome sequences. J. Mol. Biol. 2016, 428, 726–731.
22. Moriya, Y.; Itoh, M.; Okuda, S.; Yoshizawa, A.C.; Kanehisa, M. Kaas: An automatic genome annotation and pathway reconstruction server. Nucleic Acids Res. 2007, 35, W182–W185.
23. Aziz, R.K.; Bartels, D.; Best, A.A.; DeJongh, M.; Disz, T.; Edwards, R.A.; Formsma, K.; Gerdes, S.; Glass, E.M.; Kubal, M. The rast server: Rapid annotations using subsystems technology. BMC Genom. 2008, 9, 75.
24. Li, H.; Zhang, S.; Chen, L.; Pan, X.; Li, Z.; Huang, T.; Cai, Y.D. Identifying functions of proteins in mice with functional embedding features. Front. Genet. 2022, 13, 909040.
25. Li, H.; Huang, F.; Liao, H.; Li, Z.; Feng, K.; Huang, T.; Cai, Y.D. Identification of covid-19-specific immune markers using a machine learning method. Front. Mol. Biosci. 2022, 9, 952626.
26. Li, Z.; Guo, W.; Ding, S.; Chen, L.; Feng, K.; Huang, T.; Cai, Y.D. Identifying key microrna signatures for neurodegenerative diseases with machine learning methods. Front. Genet. 2022, 13, 880997.
27. Lu, J.; Meng, M.; Zhou, X.; Ding, S.; Feng, K.; Zeng, Z.; Huang, T.; Cai, Y.D. Identification of Covid-19 severity biomarkers based on feature selection on single-cell rna-seq data of CD8⁺ T cells. Front. Genet. 2022, 13, 1053772.
28. Huang, F.; Fu, M.; Li, J.; Chen, L.; Feng, K.; Huang, T.; Cai, Y.-D. Analysis and prediction of protein stability based on interaction network, gene ontology, and kegg pathway enrichment scores. BBA Proteins Proteom. 2023, 1871, 140889.
29. Huang, F.; Ma, Q.; Ren, J.; Li, J.; Wang, F.; Huang, T.; Cai, Y.-D. Identification of smoking associated transcriptome aberration in blood with machine learning methods. BioMed Res. Int. 2023, 2023, 5333361.
30. Ren, J.; Zhang, Y.; Guo, W.; Feng, K.; Yuan, Y.; Huang, T.; Cai, Y.-D. Identification of genes associated with the impairment of olfactory and gustatory functions in covid-19 via machine-learning methods. Life 2023, 13, 798.
31. Zhao, X.; Chen, L.; Lu, J. A similarity-based method for prediction of drug side effects with heterogeneous information. Math. Biosci. 2018, 306, 136–144.
32. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
33. Kohavi, R. A study of Cross-Validation and Bootstrap for Accuracy Estimation and Model Selection. In Proceedings of the 14th International joint Conference on artificial intelligence, Montreal, QC, Canada, 20–25 August 1995; Lawrence Erlbaum Associates Ltd.: Mahwah, NJ, USA, 1995; pp. 1137–1145.
34. Powers, D. Evaluation: From precision, recall and f-measure to roc, informedness, markedness & correlation. J. Mach. Learn. Technol. 2011, 2, 37–63.
35. Tang, S.; Chen, L. iATC-NFMLP: Identifying classes of anatomical therapeutic chemicals based on drug networks, fingerprints and multilayer perceptron. Curr. Bioinform. 2022, 17, 814–824.
36. Wu, C.; Chen, L. A model with deep analysis on a large drug network for drug classification. Math. Biosci. Eng. 2023, 20, 383–401.
37. Wu, Z.; Chen, L. Similarity-based method with multiple-feature sampling for predicting drug side effects. Comput. Math. Methods Med. 2022, 2022, 9547317.
38. Liang, H.; Chen, L.; Zhao, X.; Zhang, X. Prediction of drug side effects with a refined negative sample selection strategy. Comput. Math. Methods Med. 2020, 2020, 1573543.
39. Zhao, X.; Chen, L.; Guo, Z.-H.; Liu, T. Predicting drug side effects with compact integration of heterogeneous networks. Curr. Bioinform. 2019, 14, 709–720.
40. Chen, L.; Chen, K.; Zhou, B. Inferring drug-disease associations by a deep analysis on drug and disease networks. Math. Biosci. Eng. 2023, 20, 14136–14157.
41. Gorodkin, J. Comparing two k-category assignments by a k-category correlation coefficient. Comput. Biol. Chem. 2004, 28, 367–374.
42. Pelletier, J.; Thomas, G.; Volarević, S. Ribosome biogenesis in cancer: New players and therapeutic avenues. Nat. Rev. Cancer 2018, 18, 51–63.
43. Pecoraro, A.; Pagano, M.; Russo, G.; Russo, A. Ribosome biogenesis and cancer: Overview on ribosomal proteins. Int. J. Mol. Sci. 2021, 22, 5496.
44. Lee, M.Y.; Jeon, J.W.; Sievers, C.; Allen, C.T. Antigen processing and presentation in cancer immunotherapy. J. Immunother. Cancer 2020, 8, e001111.
45. Zhang, B.; Li, J.; Hua, Q.; Wang, H.; Xu, G.; Chen, J.; Zhu, Y.; Li, R.; Liang, Q.; Wang, L.; et al. Tumor cemip drives immune evasion of colorectal cancer via mhc-i internalization and degradation. J. Immunother. Cancer 2023, 11, e005592.
46. Kasajima, A.; Sers, C.; Sasano, H.; Jöhrens, K.; Stenzinger, A.; Noske, A.; Buckendahl, A.C.; Darb-Esfahani, S.; Müller, B.M.; Budczies, J.; et al. Down-regulation of the antigen processing machinery is linked to a loss of inflammatory response in colorectal cancer. Hum. Pathol. 2010, 41, 1758–1769.
47. Gupta, R.K.; Gupta, G. Klrb receptor family and human early activation antigen (CD69). In Animal Lectins: Form, Function and Clinical Applications; Springer: Berlin/Heidelberg, Germany, 2012; pp. 619–638.
48. Boix, F.; Millan, O.; San Segundo, D.; Mancebo, E.; Rimola, A.; Fabrega, E.; Fortuna, V.; Mrowiec, A.; Castro-Panete, M.J.; de la Pena, J. High expression of CD38, CD69, CD95 AND CD154 biomarkers in cultured peripheral T lymphocytes correlates with an increased risk of acute rejection in liver allograft recipients. Immunobiology 2016, 221, 595–603.
49. Testi, R.; Phillips, J.; Lanier, L. T cell activation via leu-23 (CD69). J. Immunol. 1989, 143, 1123–1128.
50. Ziegler, S.F.; Ramsdell, F.; Alderson, M.R. The activation antigen CD69. Stem Cells 1994, 12, 456–465.
51. Song, F.; Zhang, Y.; Chen, Q.; Bi, D.; Yang, M.; Lu, L.; Li, M.; Zhu, H.; Liu, Y.; Wei, Q. Mast cells inhibit colorectal cancer development by inducing er stress through secreting cystatin c. Oncogene 2022, 42, 209–223.
52. Utans, U.; Arceci, R.J.; Yamashita, Y.; Russell, M.E. Cloning and characterization of allograft inflammatory factor-1: A novel macrophage factor identified in rat cardiac allografts with chronic rejection. J. Clin. Investig. 1995, 95, 2954–2962.
53. Vu, D.; Tellez-Corrales, E.; Shah, T.; Hutchinson, I.; Min, D.I. Influence of cyclooxygenase-2 (cox-2) gene promoter-1195 and allograft inflammatory factor-1 (aif-1) polymorphisms on allograft outcome in hispanic kidney transplant recipients. Hum. Immunol. 2013, 74, 1386–1391.
54. Sinigaglia, F.; Guttinger, M.; Kilgus, J.; Doran, D.; Matile, H.; Etlinger, H.; Trzeciak, A.; Gillessen, D.; Pink, J. A malaria T-cell epitope recognized in association with most mouse and human mhc class ii molecules. Nature 1988, 336, 778–780.
55. Kim, Y.-I.; Shin, H.-W.; Chun, Y.-S.; Park, J.-W. Cst3 and gdf15 ameliorate renal fibrosis by inhibiting fibroblast growth and activation. Biochem. Biophys. Res. Commun. 2018, 500, 288–295.
56. Burnside, E.; Bradbury, E. Manipulating the extracellular matrix and its role in brain and spinal cord plasticity and repair. Neuropathol. Appl. Neurobiol. 2014, 40, 26–59.
57. Zhao, Y.-Y.; Yan, D.-J.; Chen, Z.-W. Role of aif-1 in the regulation of inflammatory activation and diverse disease processes. Cell Immunol. 2013, 284, 75–83.
58. Tau, G.; Rothman, P. Biologic functions of the ifn-γ receptors. Allergy 1999, 54, 1233.
59. Sun, H.; Hartigan, C.R.; Chen, C.W.; Sun, Y.; Tariq, M.; Robertson, J.M.; Krummey, S.M.; Mehta, A.K.; Ford, M.L. Tigit regulates apoptosis of risky memory T cell subsets implicated in belatacept-resistant rejection. Am. J. Transplant. 2021, 21, 3256–3267.
60. Fuhrman, C.A.; Yeh, W.-I.; Seay, H.R.; Lakshmi, P.S.; Chopra, G.; Zhang, L.; Perry, D.J.; McClymont, S.A.; Yadav, M.; Lopez, M.-C. Divergent phenotypes of human regulatory T cells expressing the receptors tigit and CD226. J. Immunol. 2015, 195, 145–155.
61. Milcent, B.; Josseaume, N.; Petitprez, F.; Riller, Q.; Amorim, S.; Loiseau, P.; Toubert, A.; Brice, P.; Thieblemont, C.; Teillaud, J.-L. Recovery of central memory and naive peripheral T cells in follicular lymphoma patients receiving rituximab-chemotherapy based regimen. Sci. Rep. 2019, 9, 13471.
62. Masuda, K.; Kornberg, A.; Miller, J.; Lin, S.; Suek, N.; Botella, T.; Secener, K.; Bacarella, A.M.; Cheng, L.; Ingham, M. Multiplexed single-cell analysis reveals prognostic and non-prognostic T cell types in human colorectal cancer. JCI Insight 2022, 7, e154646.
63. Morille, J.; Mandon, M.; Rodriguez, S.; Roulois, D.; Leonard, S.; Garcia, A.; Wiertlewski, S.; Le Page, E.; Berthelot, L.; Nicot, A. Multiple sclerosis csf is enriched with follicular T cells displaying a th1/eomes signature. Neurol. Neuroimmunol. Neuroinflammation 2022, 9, e200033.
64. Béguelin, W.; Teater, M.; Meydan, C.; Hoehn, K.B.; Phillip, J.M.; Soshnev, A.A.; Venturutti, L.; Rivas, M.A.; Calvo-Fernández, M.T.; Gutierrez, J. Mutant ezh2 induces a pre-malignant lymphoma niche by reprogramming the immune response. Cancer Cell 2020, 37, 655–673.e611.
65. Arjunaraja, S.; Nosé, B.D.; Sukumar, G.; Lott, N.M.; Dalgard, C.L.; Snow, A.L. Intrinsic plasma cell differentiation defects in b cell expansion with nf-κb and T cell anergy patient b cells. Front. Immunol. 2017, 8, 913.
66. Pan, X.; Jones, M.; Jiang, J.; Zaprazna, K.; Yu, D.; Pear, W.; Maillard, I.; Atchison, M.L. Increased expression of PcG protein YY1 negatively regulates b cell development while allowing accumulation of myeloid cells and LT-HSC cells. PLoS ONE 2012, 7, e30656.
67. Snir, O.; Kanduri, C.; Lundin, K.E.; Sandve, G.K.; Sollid, L.M. Transcriptional profiling of human intestinal plasma cells reveals effector functions beyond antibody production. United Eur. Gastroenterol. J. 2019, 7, 1399–1407.
68. Johansen, F.; Braathen, R.; Brandtzaeg, P. Role of j chain in secretory immunoglobulin formation. Scand. J. Immunol. 2000, 52, 240–248.
69. Brandtzaeg, P.; Korsrud, F. Significance of different J chain profiles in human tissues: Generation of IgA and IgM with binding site for secretory component is related to the J chain expressing capacity of the total local immunocyte population, including IgG and IgD producing cells, and depends on the clinical state of the tissue. Clin. Exp. Immunol. 1984, 58, 709.
70. Bjerke, K.; Brandtzaeg, P. Terminally differentiated human intestinal B cells. J chain expression of IgA and IgG subclass-producing immunocytes in the distal ileum compared with mesenteric and peripheral lymph nodes. Clin. Exp. Immunol. 1990, 82, 411–415.
71. Stockis, J.; Fink, W.; François, V.; Connerotte, T.; De Smet, C.; Knoops, L.; van der Bruggen, P.; Boon, T.; Coulie, P.G.; Lucas, S. Comparison of stable human treg and th clones by transcriptional profiling. Eur. J. Immunol. 2009, 39, 869–882.
72. Höllbacher, B.; Duhen, T.; Motley, S.; Klicznik, M.M.; Gratz, I.K.; Campbell, D.J. Transcriptomic profiling of human effector and regulatory T cell subsets identifies predictive population signatures. Immunohorizons 2020, 4, 585–596.
73. Lin, C.; Yang, H.; Zhao, W.; Wang, W. Ctsb+ macrophage repress memory immune hub in the liver metastasis site of colorectal cancer patient revealed by multi-omics analysis. Biochem. Biophys. Res. Commun. 2022, 626, 8–14.
74. Duan, R.; Liu, Y.; Tang, D.; Xiao, S.; Lin, R.; Zhao, M. Single-cell rna-seq reveals collagen vi antibody-induced expressing lyve1 m2-like macrophages reduce atherosclerotic plaque area on apoe-/-mice. Int. Immunopharmacol. 2023, 116, 109794.
75. Wan, J.; Lv, J.; Wang, C.; Zhang, L. Rps27 selectively regulates the expression and alternative splicing of inflammatory and immune response genes in thyroid cancer cells. Adv. Clin. Exp. Med. 2022, 31, 889–901.
76. Bao, Y.; Liu, X.; Han, C.; Xu, S.; Xie, B.; Zhang, Q.; Gu, Y.; Hou, J.; Qian, L.; Qian, C. Identification of ifn-γ-producing innate b cells. Cell Res. 2014, 24, 161–176.
77. Gomes, A.Q.; Correia, D.V.; Grosso, A.R.; Lança, T.; Ferreira, C.; Lacerda, J.F.; Barata, J.T.; da Silva, M.G.; Silva-Santos, B. Identification of a panel of ten cell surface protein antigens associated with immunotargeting of leukemias and lymphomas by peripheral blood γδ T cells. Haematologica 2010, 95, 1397.
78. Gottenberg, J.-E.; Cagnard, N.; Lucchesi, C.; Letourneur, F.; Mistou, S.; Lazure, T.; Jacques, S.; Ba, N.; Ittah, M.; Lepajolec, C. Activation of ifn pathways and plasmacytoid dendritic cell recruitment in target organs of primary sjögren’s syndrome. Proc. Natl. Acad. Sci. USA 2006, 103, 2770–2775.
79. Ascough, S.; Paterson, S.; Chiu, C. Induction and subversion of human protective immunity: Contrasting influenza and respiratory syncytial virus. Front. Immunol. 2018, 9, 323.
80. Smillie, C.S.; Biton, M.; Ordovas-Montanes, J.; Sullivan, K.M.; Burgin, G.; Graham, D.B.; Herbst, R.H.; Rogel, N.; Slyper, M.; Waldman, J. Intra-and inter-cellular rewiring of the human colon during ulcerative colitis. Cell 2019, 178, 714–730.e722.
81. Knobeloch, K.-P.; Wright, M.D.; Ochsenbein, A.F.; Liesenfeld, O.; Löhler, J.; Zinkernagel, R.M.; Horak, I.; Orinska, Z. Targeted inactivation of the tetraspanin CD37 impairs T-cell-dependent b-cell response under suboptimal costimulatory conditions. Mol. Cell. Biol. 2000, 20, 5363–5369.
82. Komili, S.; Farny, N.G.; Roth, F.P.; Silver, P.A. Functional specificity among ribosomal proteins regulates gene expression. Cell 2007, 131, 557–571.
83. Benvenuto, M.; Sileri, P.; Rossi, P.; Masuelli, L.; Fantini, M.; Nanni, M.; Franceschilli, L.; Sconocchia, G.; Lanzilli, G.; Arriga, R. Natural humoral immune response to ribosomal p0 protein in colorectal cancer patients. J. Transl. Med. 2015, 13, 101.
84. Zhang, Y.; Zhang, G.; Liu, Y.; Chen, R.; Zhao, D.; McAlister, V.; Mele, T.; Liu, K.; Zheng, X. Gdf15 regulates malat-1 circular rna and inactivates nfκb signaling leading to immune tolerogenic dcs for preventing alloimmune rejection in heart transplantation. Front. Immunol. 2018, 9, 2407.
85. Kolseth, I.B.; Førland, D.T.; Risøe, P.K.; Flood-Kjeldsen, S.; Ågren, J.; Reseland, J.E.; Lyngstadaas, S.P.; Johnson, E.; Dahle, M.K. Human monocyte responses to lipopolysaccharide and 9-cis retinoic acid after laparoscopic surgery for colon cancer. Scand. J. Clin. Lab. Investig. 2012, 72, 593–601.
86. Zhang, X.; Wang, Y.; Yuan, J.; Li, N.; Pei, S.; Xu, J.; Luo, X.; Mao, C.; Liu, J.; Yu, T. Macrophage/microglial ezh2 facilitates autoimmune inflammation through inhibition of socs3. J. Exp. Med. 2018, 215, 1365–1382.
87. Lechner, M.G.; Cheng, M.I.; Patel, A.Y.; Hoang, A.T.; Yakobian, N.; Astourian, M.; Pioso, M.S.; Rodriguez, E.D.; McCarthy, E.C.; Hugo, W. Inhibition of il-17a protects against thyroid immune-related adverse events while preserving checkpoint inhibitor antitumor efficacy. J. Immunol. 2022, 209, 696–709.
88. Xia, P.; Xu, X.-Y. Prognostic significance of CD44 in human colon cancer and gastric cancer: Evidence from bioinformatic analyses. Oncotarget 2016, 7, 45538.
89. Gold, D.V.; Stein, R.; Burton, J.; Goldenberg, D.M. Enhanced expression of CD74 in gastrointestinal cancers and benign tissues. Int. J. Clin. Exp. Pathol. 2011, 4, 1–12.
90. Beswick, E.J.; Reyes, V.E. Cd74 in antigen presentation, inflammation, and cancers of the gastrointestinal tract. World J. Gastroenterol. WJG 2009, 15, 2855.
91. Edvardsson, K.; Ström, A.; Jonsson, P.; Gustafsson, J.-Å.; Williams, C. Estrogen receptor β induces antiinflammatory and antitumorigenic networks in colon cancer cells. Mol. Endocrinol. 2011, 25, 969–979.
92. Labriet, A.; Lévesque, É.; Cecchin, E.; De Mattia, E.; Villeneuve, L.; Rouleau, M.; Jonker, D.; Couture, F.; Simonyan, D.; Allain, E.P. Germline variability and tumor expression level of ribosomal protein gene rpl28 are associated with survival of metastatic colorectal cancer patients. Sci. Rep. 2019, 9, 13008.
93. Labriet, A.; Lévesque, É.; Mattia, E.D.; Cecchin, E.; Jonker, D.; Couture, F.; Simonyan, D.; Buonadonna, A.; D’Andrea, M.; Villeneuve, L. Rpl28 promoter polymorphism rs4806668 is associated with reduced survival in folfiri-treated metastatic colorectal cancer patients. Cancer Res. 2018, 78, 3889.
94. Nirmal, A.J.; Regan, T.; Shih, B.B.; Hume, D.A.; Sims, A.H.; Freeman, T.C. Immune cell gene signatures for profiling the microenvironment of solid tumors. Cancer Immunol. Res. 2018, 6, 1388–1400.
95. Martin, J.C.; Chang, C.; Boschetti, G.; Ungaro, R.; Giri, M.; Grout, J.A.; Gettler, K.; Chuang, L.-S.; Nayar, S.; Greenstein, A.J. Single-cell analysis of crohn’s disease lesions identifies a pathogenic cellular module associated with resistance to anti-tnf therapy. Cell 2019, 178, 1493–1508.e1420.
96. Ziembik, M.A.; Bender, T.P.; Larner, J.M.; Brautigan, D.L. Functions of protein phosphatase-6 in nf-κb signaling and in lymphocytes. Biochem. Soc. Trans. 2017, 45, 693–701.
97. Björklund, Å.K.; Forkel, M.; Picelli, S.; Konya, V.; Theorell, J.; Friberg, D.; Sandberg, R.; Mjösberg, J. The heterogeneity of human CD127+ innate lymphoid cells revealed by single-cell rna sequencing. Nat. Immunol. 2016, 17, 451–460.

Figure 1. Flow chart of the entire analysis process. scRNA-seq data from 25 immune cell types in the normal colon were analyzed by using a machine learning-based approach. The dataset was grouped in accordance with the subtype of immune cells. Gene expression levels were analyzed by five feature ranking methods, namely, LASSO, LightGBM, MCFS, mRMR, and RF. The obtained feature lists were fed into the IFS method, which combines DT and RF algorithms, to extract key genes and construct efficient classifiers and classification rules.
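The ranking and IFS steps in this flow chart can be sketched in a few lines of Python with scikit-learn. The sketch below is only illustrative. It uses a synthetic dataset from make_classification, random forest importance as the single ranker, a decision tree as the classifier, a hypothetical step size of 10, and five-fold cross-validation. None of these settings are taken from the study. Ranking on the full dataset before cross-validation is also a simplification.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for an expression matrix (cells x genes); not the study's data.
X, y = make_classification(
    n_samples=300, n_features=40, n_informative=8, n_redundant=0,
    n_classes=3, n_clusters_per_class=1, random_state=0,
)

# Step 1: rank features, here by random forest importance (one possible ranker).
ranker = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
ranked = np.argsort(ranker.feature_importances_)[::-1]  # most important first

# Step 2: incremental feature selection over growing top-k feature subsets.
step = 10  # hypothetical step size
curve = {}
for k in range(step, X.shape[1] + 1, step):
    clf = DecisionTreeClassifier(random_state=0)
    scores = cross_val_score(clf, X[:, ranked[:k]], y, cv=5, scoring="f1_weighted")
    curve[k] = scores.mean()

# The subset size with the highest mean weighted F1 defines the selected features.
best_k = max(curve, key=curve.get)
print(curve, best_k)
```

Plotting the values of curve against k gives an IFS curve. The same loop can be repeated with a random forest classifier or with a different ranked list.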

Figure 2. IFS curves for evaluating the performance of the two classification algorithms based on weighted F1. (A). IFS curves based on the LASSO feature list. (B). IFS curves based on the LightGBM feature list. (C). IFS curves based on the MCFS feature list. (D). IFS curves based on the mRMR feature list. (E). IFS curves based on the RF feature list. On each curve, the best performance is marked, and on the RF curve, a relatively high performance achieved with only a few features is also marked.
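Because the cell types differ greatly in sample size, weighted F1 and macro F1 can diverge. The following toy Python example, using scikit-learn metric functions on made-up labels (classes A and B are hypothetical), shows how the four measures reported in this study are computed and why the two F1 variants differ under class imbalance.

```python
from sklearn.metrics import accuracy_score, f1_score, matthews_corrcoef

# Made-up labels: class A is frequent, class B is rare (illustrative only).
y_true = ["A"] * 8 + ["B"] * 2
y_pred = ["A"] * 8 + ["A", "B"]  # one of the two rare cells is misclassified

acc = accuracy_score(y_true, y_pred)
mcc = matthews_corrcoef(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")        # every class counts equally
weighted_f1 = f1_score(y_true, y_pred, average="weighted")  # classes weighted by size

print(f"ACC={acc:.3f} MCC={mcc:.3f} macroF1={macro_f1:.3f} weightedF1={weighted_f1:.3f}")
```

In this toy case, the weighted F1 (about 0.886) is higher than the macro F1 (about 0.804) because the error falls on the rare class, which contributes only 20% of the weight in the weighted average.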

Figure 3. Venn diagram of the most essential features extracted from the lists yielded by LASSO, LightGBM, MCFS, mRMR, and RF. The overlapping circles indicate genes that are identified as the most essential genes by multiple ranking algorithms.
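The overlap summarized in such a Venn diagram amounts to counting, for each gene, how many of the five lists select it. A minimal Python sketch with placeholder gene names (G1 to G6 are hypothetical and carry no biological meaning) is given below.

```python
from collections import Counter

# Placeholder top-gene sets per ranking algorithm (hypothetical, not study results).
top_genes = {
    "LASSO": {"G1", "G2", "G3"},
    "LightGBM": {"G1", "G2", "G4"},
    "MCFS": {"G1", "G3", "G5"},
    "mRMR": {"G1", "G2", "G6"},
    "RF": {"G1", "G4", "G5"},
}

# Count in how many lists each gene appears.
counts = Counter(gene for genes in top_genes.values() for gene in genes)

# Group genes by the number of lists that contain them.
by_support = {}
for gene, n_lists in counts.items():
    by_support.setdefault(n_lists, set()).add(gene)

for n_lists in sorted(by_support, reverse=True):
    print(n_lists, sorted(by_support[n_lists]))
```

Genes supported by all five lists (here only G1) are the most robust candidates, whereas genes found in a single list depend on the choice of ranking algorithm.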

Figure 4. Enrichment results on essential genes extracted from five lists. The upper panel shows the enriched gene ontology terms and the lower panel shows the enriched KEGG pathways.

Figure 5. Bar chart showing the number of rules for each immune cell type in the five rule groups derived from the five feature lists.

Figure 6. Heatmap for rule-associated genes identified on two feature lists (red for genes selected to predict the cell type in the rule and sky blue for other genes). (A). Heatmap based on rules on the LASSO feature list. (B). Heatmap based on rules on the LightGBM feature list.

Figure 7. Heatmap for rule-associated genes identified on three feature lists (red for genes selected to predict the cell type in the rule and sky blue for other genes). (A). Heatmap based on rules on the MCFS feature list. (B). Heatmap based on rules on the mRMR feature list. (C). Heatmap based on rules on the RF feature list.

Table 1. Twenty-five immune cell subtypes and sample sizes.

Index	Immune Cell Subtypes	Sample Size
1	Activated CD4 T	1531
2	B cell (cycling)	15
3	B cell (IgA Plasma)	12,522
4	B cell (IgG Plasma)	342
5	B cell (memory)	4508
6	CD8 T	3145
7	cDC1	38
8	cDC2	107
9	cycling DCs	47
10	cycling gd T	25
11	Follicular B cell	2582
12	gd T	548
13	ILC	832
14	Lymphoid DC	10
15	LYVE1 Macrophage	91
16	Macrophage	268
17	Mast	1151
18	Monocyte	98
19	NK	452
20	pDC	13
21	Tcm	3042
22	Tfh	1786
23	Th1	2833
24	Th17	3432
25	Treg	2232
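
The sample sizes in Table 1 are strongly imbalanced, which is the situation the oversampling step (Section 2.4) is meant to address. The short sketch below quantifies this from the table values alone; the cutoff of 50 cells used to flag rare subtypes is an arbitrary assumption for illustration, not a threshold used in the study.

```python
# Sample sizes copied from Table 1.
sample_sizes = {
    "Activated CD4 T": 1531, "B cell (cycling)": 15, "B cell (IgA Plasma)": 12522,
    "B cell (IgG Plasma)": 342, "B cell (memory)": 4508, "CD8 T": 3145,
    "cDC1": 38, "cDC2": 107, "cycling DCs": 47, "cycling gd T": 25,
    "Follicular B cell": 2582, "gd T": 548, "ILC": 832, "Lymphoid DC": 10,
    "LYVE1 Macrophage": 91, "Macrophage": 268, "Mast": 1151, "Monocyte": 98,
    "NK": 452, "pDC": 13, "Tcm": 3042, "Tfh": 1786, "Th1": 2833,
    "Th17": 3432, "Treg": 2232,
}

total = sum(sample_sizes.values())
largest = max(sample_sizes.items(), key=lambda kv: kv[1])
smallest = min(sample_sizes.items(), key=lambda kv: kv[1])
imbalance_ratio = largest[1] / smallest[1]

# Assumed cutoff (not from the paper) for flagging very small classes.
RARE_CUTOFF = 50
rare = sorted(name for name, n in sample_sizes.items() if n < RARE_CUTOFF)

print(f"total cells: {total}")
print(f"largest: {largest}, smallest: {smallest}, ratio: {imbalance_ratio:.1f}")
print(f"subtypes with fewer than {RARE_CUTOFF} cells: {rare}")
```

The total agrees with the 41,650 cells reported in the abstract, and the largest class outnumbers the smallest by more than a thousandfold.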

Table 2. Performance of some key classifiers under different feature lists and classification algorithms.

Feature List	Classification Algorithm	Number of Features	ACC	MCC	Macro F1	Weighted F1
LASSO feature list	Decision tree	1240	0.823	0.796	0.775	0.827
	Random forest	1230	0.960	0.953	0.953	0.959
	Random forest	50	0.933	0.923	0.937	0.933
LightGBM feature list	Decision tree	300	0.895	0.878	0.871	0.895
	Random forest	1200	0.978	0.974	0.985	0.978
	Random forest	60	0.943	0.934	0.949	0.943
MCFS feature list	Decision tree	1620	0.892	0.875	0.881	0.893
	Random forest	1720	0.976	0.972	0.983	0.976
	Random forest	90	0.944	0.935	0.954	0.944
mRMR feature list	Decision tree	350	0.873	0.853	0.855	0.875
	Random forest	560	0.972	0.968	0.973	0.972
	Random forest	90	0.961	0.955	0.969	0.961
RF feature list	Decision tree	170	0.891	0.874	0.863	0.892
	Random forest	1510	0.977	0.974	0.984	0.978
	Random forest	70	0.962	0.956	0.970	0.962
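
The four measures reported in Table 2 can be computed from true and predicted labels as sketched below. The sketch assumes the standard multiclass definitions (per-class F1 averaged with equal or support-based weights, and the multiclass generalization of MCC); the function name and the toy labels are hypothetical and do not reproduce the study's data.

```python
from collections import Counter
from math import sqrt
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[str], y_pred: Sequence[str]) -> Dict[str, float]:
    """Return ACC, MCC, macro F1, and weighted F1 for multiclass labels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    n = len(y_true)
    labels = sorted(set(y_true) | set(y_pred))
    true_count = Counter(y_true)   # support of each class (tp + fn)
    pred_count = Counter(y_pred)   # predictions per class (tp + fp)
    tp = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    correct = sum(tp.values())

    # Per-class F1 = 2*tp / (2*tp + fp + fn) = 2*tp / (support + predicted).
    f1 = {}
    for lab in labels:
        denom = true_count[lab] + pred_count[lab]
        f1[lab] = 2 * tp[lab] / denom if denom else 0.0
    macro_f1 = sum(f1.values()) / len(labels)
    weighted_f1 = sum(f1[lab] * true_count[lab] for lab in labels) / n

    # Multiclass MCC computed from the confusion-matrix marginals.
    cov_tp = correct * n - sum(pred_count[lab] * true_count[lab] for lab in labels)
    cov_pp = n * n - sum(pred_count[lab] ** 2 for lab in labels)
    cov_tt = n * n - sum(true_count[lab] ** 2 for lab in labels)
    mcc_denom = sqrt(cov_pp * cov_tt)
    mcc = cov_tp / mcc_denom if mcc_denom else 0.0

    return {"ACC": correct / n, "MCC": mcc, "Macro F1": macro_f1, "Weighted F1": weighted_f1}


# Toy labels for illustration only.
y_true = ["B", "B", "B", "T", "T", "NK"]
y_pred = ["B", "B", "T", "T", "T", "NK"]
metrics = classification_metrics(y_true, y_pred)
print({name: round(value, 3) for name, value in metrics.items()})
```

Because weighted F1 weights each class by its sample size, it is dominated by the large classes in Table 1, whereas macro F1 treats the rare subtypes equally; reporting both, as Table 2 does, shows performance from both angles.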

Table 3. Latent biomarker genes identified in this study.

Category	Gene Symbol	Description
T cell family	KLRB1	Killer Cell Lectin Like Receptor B1
	CST3	Cystatin C
	AIF1	Allograft Inflammatory Factor 1
	JCHAIN	Joining Chain of Multimeric IgA and IgM
	HLA-DRA	Major Histocompatibility Complex, Class II, DR Alpha
	TIGIT	T Cell Immunoreceptor with Ig and ITIM Domains
	RPS12	Ribosomal Protein S12
	MS4A1	Membrane Spanning 4-Domains A1
B cell family	RPL30	Ribosomal Protein L30
	ANXA1	Annexin A1
	JCHAIN	Joining Chain of Multimeric IgA and IgM
	ICA1	Islet Cell Autoantigen 1
	TYROBP	Transmembrane Immune Signaling Adaptor TYROBP
	RPS27	Ribosomal Protein S27
	HLA-DRA	Major Histocompatibility Complex, Class II, DR Alpha
	IFITM1	Interferon Induced Transmembrane Protein 1
	IGHA1	Immunoglobulin Heavy Constant Alpha 1
	CD37	CD37 Molecule
Other cells	RPS13	Ribosomal Protein S13
	RPL34	Ribosomal Protein L34
	MALAT1	Metastasis Associated Lung Adenocarcinoma Transcript 1
	TMSB4X	Thymosin Beta 4 X-Linked
	SOCS3	Suppressor Of Cytokine Signaling 3
	IL7R	Interleukin 7 Receptor
	CD74	CD74 Molecule
	RPL28	Ribosomal Protein L28
	ANKRD28	Ankyrin Repeat Domain 28
	TYROBP	Transmembrane Immune Signaling Adaptor TYROBP

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

