# On the Role of Clustering and Visualization Techniques in Gene Microarray Data


## Abstract


## 1. Introduction

#### 1.1. A Biological Introduction: Microarray Gene Expression Technology

- **Chip manufacture:** A microarray is a small chip (made of chemically coated glass, nylon membrane, or silicon) onto which tens of thousands of DNA molecules (probes) are attached in fixed grids.
- **Target preparation, labeling, and hybridization:** Usually, two samples of mRNA (i.e., test and control samples) are reverse-transcribed into cDNA (targets), labeled with either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the chip surface.
- **The scanning process:** Chips are scanned to read the signal intensity emitted from the labeled and hybridized targets.

#### Microarray Experimental Data

- ALLAML [20] contains two classes of samples, namely ALL and AML, with 47 and 25 samples, respectively, for an overall number of 72 samples. Each sample is formed by 7129 gene expression values.
- LEUKEMIA [20] contains in total 72 samples in two classes: acute lymphoblastic and acute myeloid corresponding to 7129 genes.
- CLL_SUB_111 [21] contains gene expression data from high-density oligonucleotide arrays covering genetically and clinically distinct subgroups of B-cell chronic lymphocytic leukemia (B-CLL). The dataset consists of 11,340 gene expression levels, 111 instances and three classes.
- GLIOMA [22] contains in total 50 samples in four classes: cancer glioblastomas, non-cancer glioblastomas, cancer oligodendrogliomas and non-cancer oligodendrogliomas, which have 14, 14, 7, 15 samples, respectively. Each sample is formed by 12,625 genes.
- LUNG [23] contains in total 203 samples in five classes, adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, with 139, 21, 20, 6, 17 samples, respectively.
- LUNG DISCRETE [24] contains 73 samples in seven classes where each sample consists of 325 gene expressions.
- DLBCL [25] is a modified version of the original DLBCL dataset. It consists of 96 samples in nine classes, where each sample is defined by the expression of 4026 genes.
- CARCINOM [26] contains 174 samples in 11 classes, prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas and lung squamous cell carcinoma.

## 2. Clustering for Microarray Gene Expression Data

#### 2.1. Partitive Clustering Algorithms

- for set-oriented clustering, ${u}_{ij}\in \{0,1\}$, $0<{\sum}_{j=1}^{N}{u}_{ij}<N$, for $i=1,2,\dots ,K$, ${\sum}_{i=1}^{K}{u}_{ij}=1$, for $j=1,2,\dots ,N$,
- for fuzzy-oriented clustering, we get ${u}_{ij}\in [0,1]$ with the same two requirements as stated above.

- each cluster is nonempty;
- each point belongs exactly to one cluster (set-oriented clustering) or might belong to more than one cluster simultaneously, i.e., the sum of its membership values for all clusters is 1 (fuzzy set-oriented clustering).
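
The constraints above can be checked mechanically. The following is a minimal sketch, assuming the membership values are stored as a $K\times N$ NumPy array `U` (clusters on rows, samples on columns); the function name `is_valid_partition` and the tolerance `tol` are illustrative choices, not part of the original formulation.

```python
import numpy as np


def is_valid_partition(U, fuzzy=False, tol=1e-9):
    """Check a K x N membership matrix against the partition constraints.

    U[i, j] is the membership of sample j in cluster i (assumed layout).
    """
    U = np.asarray(U, dtype=float)
    _, n_samples = U.shape

    if fuzzy:
        # Fuzzy clustering: memberships lie in the interval [0, 1].
        in_range = np.all((U >= -tol) & (U <= 1.0 + tol))
    else:
        # Set-oriented clustering: memberships are exactly 0 or 1.
        in_range = np.all(np.isclose(U, 0.0, atol=tol) | np.isclose(U, 1.0, atol=tol))

    # 0 < sum_j u_ij < N: no cluster is empty and no cluster absorbs everything.
    row_sums = U.sum(axis=1)
    rows_ok = np.all((row_sums > tol) & (row_sums < n_samples - tol))

    # sum_i u_ij = 1: the memberships of each sample add up to one.
    cols_ok = np.allclose(U.sum(axis=0), 1.0, atol=tol)

    return bool(in_range and rows_ok and cols_ok)


hard = [[1, 1, 0], [0, 0, 1]]
soft = [[0.8, 0.5, 0.1], [0.2, 0.5, 0.9]]
print(is_valid_partition(hard))              # hard partition of 3 samples
print(is_valid_partition(soft, fuzzy=True))  # fuzzy partition of 3 samples
```

Note that a matrix with fractional memberships satisfies the fuzzy constraints but is rejected when `fuzzy=False`, which mirrors the only difference between the two formulations.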

#### 2.2. Hierarchical Clustering

- single linkage method: The closeness between H and T is calculated on the basis of the minimum distance between the samples that belong to the respective clusters.
- complete linkage method: The closeness between H and T is calculated as the distance between the two most distant samples, one from each cluster.
- average linkage method: The closeness between H and T is calculated as the average of all the distances between pairs of samples, one from each cluster. The criterion considers all possible pairs of samples across the two clusters, and is thus generally less sensitive to outliers than the single and complete linkage criteria.
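
The three criteria differ only in how they summarize the matrix of between-cluster distances. A minimal sketch follows, assuming Euclidean distance and clusters H and T given as arrays with one sample per row; the helper names are illustrative.

```python
import numpy as np


def pairwise_distances(H, T):
    """Euclidean distances between every sample in H and every sample in T."""
    H = np.asarray(H, dtype=float)
    T = np.asarray(T, dtype=float)
    diff = H[:, None, :] - T[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


def linkage_distance(H, T, method="average"):
    """Closeness between clusters H and T under the chosen linkage criterion."""
    D = pairwise_distances(H, T)
    if method == "single":      # minimum distance over all cross-cluster pairs
        return float(D.min())
    if method == "complete":    # distance between the two most distant samples
        return float(D.max())
    if method == "average":     # mean over all cross-cluster pairs
        return float(D.mean())
    raise ValueError(f"unknown linkage method: {method}")


# Two small one-dimensional clusters: cross-cluster distances are 3, 5, 2, 4.
H = [[0.0], [1.0]]
T = [[3.0], [5.0]]
for m in ("single", "complete", "average"):
    print(m, linkage_distance(H, T, m))
```

For these two clusters the single, complete, and average linkage values are 2, 5, and 3.5, respectively.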

#### 2.3. Model-Based Clustering

#### 2.4. Other Approaches

**Self-Organizing Maps**. The k-means method is a well-known centroid approach. A neural variation that allows samples to influence the location of neighboring clusters is known as the self-organizing map (SOM) or Kohonen map [48]. Such maps, usually a 2D rectangular grid of neurons, are particularly valuable for describing the relationships between clusters. Each neuron of the network is associated with its own reference vector, and each data point is mapped to the neuron with the closest reference vector. During training, each data point moves the reference vectors towards the denser areas of the input space, so that the reference vectors come to fit the distribution of the input data set. When training is complete, clusters are identified by mapping all data points to the output neurons. SOM was applied to gene expression data in [41], with good results compared with the k-means approach; nonetheless, it requires the number of clusters and the lattice structure of the neural network to be specified a priori.

**Graph-Theoretical Clustering**. Graph-theoretical clustering techniques are explicitly formulated in terms of a graph, thus converting the problem of clustering a data set into graph-theoretical problems such as finding minimum cuts or maximal cliques in the proximity graph G. In [49], the CLuster Identification via Connectivity Kernels (CLICK) method was proposed. CLICK tries to discover highly connected components in the proximity graph and reports them as clusters. The authors applied CLICK to public gene expression data, demonstrating better quality in terms of homogeneity and separation compared with other methods. In [50], both a theoretical algorithm and a practical heuristic called CAST (Cluster Affinity Search Technique) are presented. CAST takes as input a real, symmetric, $N\times N$ similarity matrix $Sim$ (with $Sim(i,j)\in [0,1]$) and an affinity threshold $\epsilon$. CAST alternates between adding high-affinity samples to a given cluster and removing low-affinity samples from it. CAST does not require the number of clusters and is effective in handling outliers. Nevertheless, it is difficult to determine a proper value for the global parameter $\epsilon$.

**Density-Based Hierarchical Approaches**. In [51], a density-based hierarchical clustering method (DHC) was proposed in order to identify co-expressed gene groups from gene expression data. As the name suggests, DHC combines the density-based and hierarchical clustering approaches. DHC is effective in separating co-expressed genes (which have relatively higher density) from noise (which has relatively lower density), and is thus robust in noisy environments. However, DHC is computationally expensive and shares the typical difficulty of determining appropriate values for its parameters. A different approach, called NEC, was defined in [52]. The authors argued that most clustering algorithms proposed in the literature are based on the Euclidean metric, even though this metric is often limited and inadequate. NEC is a clustering method accomplished in two steps: first, a Probabilistic Principal Surfaces approach (i.e., a density-based modelling) [2] is used to find an initial rough clustering (with a large number of clusters); second, an agglomeration phase is carried out on the basis of a specific non-Euclidean metric defined in terms of Fisher's and Negentropy information. The computational burden of NEC is limited due to Fisher information and Negentropy, thus the technique can be applied efficiently and effectively to gene expression data [53,54] and, with some generalization, to other kinds of data [55].

**Biclustering**. Biclustering [56,57], also named subspace clustering, aims at finding a subgroup of genes with similar expression across a subset of samples. Rows (genes) and columns (samples) of a gene expression matrix are clustered simultaneously. The rationale for using biclustering is that, among the large number of genes, only a subset contributes to the target a researcher is interested in, while the remaining ones might mask the role of the relevant genes in pursuing that target. Furthermore, it has been argued that coexpressed genes might behave independently.

**Multi-objective evolutionary clustering algorithms.** A very recent trend in clustering gene expression data tries to overcome two main deficiencies of clustering techniques when facing different molecular data sets: (i) the inability to weigh the importance of individual features during cluster formation, although different features could have, and frequently do have, different effects on clustering; and (ii) the lack of multiple internal evaluation functions. In [19], a multiobjective framework was proposed in order to gain robustness across several different kinds of molecular data: the authors select five diverse cluster validity indices as objective functions to be optimized simultaneously, in order to properly capture multiple characteristics of the evolving clusters.
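
The self-organizing map training loop described above (find the closest reference vector, then pull it and its grid neighbors towards the data point) can be sketched as follows. This is a minimal illustration under stated assumptions rather than the procedure used in [41] or [48]: the Gaussian neighborhood, the linear decay of the learning rate `lr0` and of the neighborhood width `sigma0`, and the initialization from random data points are all illustrative choices.

```python
import numpy as np


def train_som(X, grid_shape=(3, 3), n_iter=500, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small rectangular SOM and return its reference vectors."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    rows, cols = grid_shape

    # Position of each neuron on the 2D lattice.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)

    # Assumption: initialize reference vectors from randomly chosen data points.
    W = X[rng.integers(0, len(X), size=rows * cols)].copy()

    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                   # learning rate decays to zero
        sigma = sigma0 * (1.0 - frac) + 1e-3      # neighborhood shrinks over time

        x = X[rng.integers(0, len(X))]            # pick one data point at random
        bmu = np.argmin(((W - x) ** 2).sum(axis=1))  # neuron with closest reference vector

        # Neurons close to the winner on the lattice are updated more strongly.
        grid_d2 = ((coords - coords[bmu]) ** 2).sum(axis=1)
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
        W += lr * h[:, None] * (x - W)

    return W


def map_to_neurons(X, W):
    """Assign each data point to the neuron with the closest reference vector."""
    X = np.asarray(X, dtype=float)
    d2 = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)), rng.normal(10.0, 0.1, (20, 2))])
W = train_som(X, grid_shape=(2, 2), n_iter=300)
labels = map_to_neurons(X, W)
print(W.shape, np.unique(labels))
```

The sketch also makes the limitation noted above concrete: `grid_shape` fixes both the number of output neurons and the lattice structure before any data are seen.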

## 3. Visualization Techniques for Microarray Gene Expression Data

#### 3.1. Visualization Methods: A Brief Review

**Principal Component Analysis (PCA)** [59]. PCA is a well-established linear projection technique used to map data from higher to lower dimensional spaces. PCA linearly transforms the data, preserving as much of its variance as possible.

**Probabilistic PCA** [60]. The lack of a generative model in PCA gives no means to interpret its error function in a principled way. Probabilistic Principal Component Analysis (PPCA) was introduced to enhance PCA, i.e., to turn PCA into a generative model by using a latent variable approach. PPCA consists of a Gaussian mixture model where each component has a diagonal covariance matrix with a single variance parameter describing the variance of each Gaussian in the mixture.

**Mixture of PPCA** [61]. Because PCA describes only a linear data projection, it is a rather restricted method. Using a set of local linear models is one way around this. This is attractive since each model is easier to comprehend and generally easier to fit.

**Multidimensional Scaling (MDS)** [62]. Like several other methods, MDS maps data points from an original high dimensional space to a space of lower dimension, but it approaches the problem differently, i.e., on the basis of the dissimilarities between data points rather than the points themselves. In particular, MDS tries to find a lower-dimensional representation of the data that preserves the pairwise distances as much as possible. A variant of MDS is the so-called Sammon mapping [62,63].
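
As a concrete illustration of the linear projection performed by PCA, the following minimal sketch computes the principal directions from the singular value decomposition of the centered data matrix. The function name `pca_project` and the samples-by-features layout of `X` are assumptions made for the example.

```python
import numpy as np


def pca_project(X, n_components=2):
    """Project samples (rows of X) onto the leading principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                       # center each feature

    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]

    Z = Xc @ components.T                         # low-dimensional coordinates
    explained = (S ** 2) / (S ** 2).sum()         # fraction of variance per component
    return Z, components, explained[:n_components]


# Ten 3D samples lying on a straight line: one component captures all the variance.
t = np.arange(10.0)
X = np.outer(t, [1.0, 2.0, 2.0])
Z, components, explained = pca_project(X, n_components=2)
print(Z.shape, np.round(explained, 6))
```

For visualization, the first two or three columns of `Z` would be plotted; the `explained` ratios indicate how much of the data variance such a plot retains.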

#### 3.2. Visualization Based on Probabilistic Principal Surfaces

#### Cluster Visualization

## 4. Conclusions

## Author Contributions

## Funding

## Acknowledgments

## Conflicts of Interest

## References

- Hand, D.; Mannila, H.; Smyth, P. Principles of Data Mining; The MIT Press: Cambridge, MA, USA, 2001.
- Staiano, A.; De Vinco, L.; Ciaramella, A.; Raiconi, G.; Tagliaferri, R.; Longo, G.; Miele, G.; Amato, R.; Del Mondo, C.; Donalek, C.; et al. Probabilistic principal surfaces for yeast gene microarray data-mining. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM’04), Brighton, UK, 1–4 November 2004; pp. 202–209.
- Calcagno, G.; Staiano, A.; Fortunato, G.; Brescia-Morra, V.; Salvatore, E.; Liguori, R.; Capone, S.; Filla, A.; Longo, G.; Sacchetti, L. A multilayer perceptron neural network-based approach for the identification of responsiveness to interferon therapy in multiple sclerosis patients. Inf. Sci. **2010**, 180, 4153–4163.
- Camastra, F.; Di Taranto, M.D.; Staiano, A. Statistical and computational methods for genetic diseases: An overview. Comput. Math. Methods Med. **2015**, 2015, 954598.
- Di Taranto, M.D.; Staiano, A.; D’Agostino, M.N.; D’Angelo, A.; Bloise, E.; Morgante, A.; Marotta, G.; Gentile, M.; Rubba, P.; Fortunato, G. Association of USF1 and APOA5 polymorphisms with familial combined hyperlipidemia in an Italian population. Mol. Cell. Probes **2015**, 29, 19–24.
- Staiano, A.; Di Taranto, M.D.; Bloise, E.; D’Agostino, M.N.; D’Angelo, A.; Marotta, G.; Gentile, M.; Jossa, F.; Iannuzzi, A.; Rubba, P.; et al. Investigation of single nucleotide polymorphisms associated with familial combined hyperlipidemia with random forests. Neural Nets Surround. **2013**, 19, 169–178.
- Pirim, H.; Ekşioğlu, B.; Perkins, A.; Yüceer, Ç. Clustering of High Throughput Gene Expression Data. Comput. Oper. Res. **2012**, 39, 3046–3061.
- Heath, L.S.; Ramakrishnan, N.; Sederoff, R.R.; Whetten, R.W.; Chevone, B.I.; Struble, C.A.; Jouenne, V.Y.; Chen, D.; van Zyl, L.; Grene, R. Studying the Functional Genomics of Stress Responses in Loblolly Pine with the Expresso Microarray Experiment Management System. Comp. Funct. Genom. **2002**, 3, 226–243.
- Lockhart, D.J.; Dong, H.; Byrne, M.C.; Follettie, M.T.; Gallo, M.V.; Chee, M.S.; Mittmann, M.; Wang, C.; Kobayashi, M.; Horton, H.; et al. Expression Monitoring by Hybridization to High-Density Oligonucleotide Arrays. Nat. Biotechnol. **1996**, 14, 1675–1680.
- Schena, M.D.; Shalon, R.; Davis, R.; Brown, P. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science **1995**, 270, 467–470.
- Tefferi, A.; Bolander, E.; Ansell, M.; Wieben, D.; Spelsberg, C. Primer on Medical Genomics Part III: Microarray Experiments and Data Analysis. Mayo Clin. Proc. **2002**, 77, 927–940.
- Jiang, D.; Tang, C.; Zhang, A. Cluster Analysis for Gene Expression Data: A Survey. IEEE Trans. Knowl. Data Eng. **2004**, 18, 1370–1386.
- Amato, R.; Ciaramella, A.; Deniskina, N.; del Mondo, C.; di Bernardo, D.; Donalek, C.; Longo, G.; Mangano, G.; Miele, G.; Raiconi, G.; et al. A Multi-Step Approach to Time Series Analysis and Gene Expression Clusterings. Bioinformatics **2006**, 22, 589–596.
- Troyanskaya, O.; Cantor, M.; Sherlock, G.; Brown, P.; Hastie, T.; Tibshirani, R.; Botstein, D.; Altman, R. Missing Value Estimation Methods for DNA Microarrays. Bioinformatics **2019**, in press.
- Hill, A.; Brown, E.; Whitley, M.; Tucker-Kellogg, G.; Hunter, C.; Slonim, A. Evaluation of Normalization Procedures for Oligonucleotide Array Data Based on Spiked cRNA Controls. Genome Biol. **2001**, 2, research0055.1–research0055.13.
- Schuchhardt, J.; Beule, D.; Malik, A.; Wolski, E.; Eickhoff, H.; Lehrach, H.; Herzel, H. Normalization Strategies for cDNA Microarrays. Nucleic Acids Res. **2000**, 28, e47.
- Ciaramella, A.; Gianfico, M.; Giunta, G. Compressive sampling and adaptive dictionary learning for the packet loss recovery in audio multimedia streaming. Multimed. Tools Appl. **2016**, 75, 17375–17392.
- Ciaramella, A.; Giunta, G. Packet loss recovery in audio multimedia streaming by using compressive sensing. IET Commun. **2016**, 10, 387–392.
- Li, X.; Wong, K.-C. Evolutionary Multiobjective Clustering and Its Applications to Patient Stratification. IEEE Trans. Cybern. **2019**, 45, 1680–1693.
- Golub, T.R.; Slonim, D.K.; Tamayo, P.; Huard, C.; Gaasenbeek, M.; Mesirov, J.P.; Coller, H.; Loh, M.L.; Downing, J.R.; Caligiuri, M.A.; et al. Molecular classification of cancer: Class discovery and class prediction by gene expression monitoring. Science **1999**, 286, 531–537.
- Haslinger, C.; Schweifer, N.; Stilgenbauer, S.; Döhner, H.; Lichter, P.; Kraut, N.; Stratowa, C.; Abseher, R. Microarray gene expression profiling of B-cell chronic lymphocytic leukemia subgroups defined by genomic aberrations and VH mutation status. J. Clin. Oncol. **2004**, 22, 3937–3949.
- Nutt, C.L.; Mani, D.R.; Betensky, R.A.; Tamayo, P.; Cairncross, J.G.; Ladd, C.; Pohl, U.; Hartmann, C.; McLaughlin, M.E.; Batchelor, T.T.; et al. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res. **2003**, 63, 1602–1607.
- Bhattacharjee, A.; Richards, W.G.; Staunton, J.; Li, C.; Monti, S.; Vasa, P.; Ladd, C.; Beheshti, J.; Bueno, R.; Gillette, M.; et al. Classification of human lung carcinomas by mRNA expression profiling reveals distinct adenocarcinoma subclasses. Proc. Natl. Acad. Sci. USA **2001**, 98, 13790–13795.
- Peng, H.; Long, F.; Ding, C. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans. Pattern Anal. Mach. Intell. **2005**, 27, 1226–1238.
- Alizadeh, A.A.; Eisen, M.B.; Davis, R.E.; Ma, C.; Lossos, I.S.; Rosenwald, A.; Boldrick, J.C.; Sabet, H.; Tran, T.; Yu, X.; et al. Distinct Types of Diffuse Large B-Cell Lymphoma Identified by Gene Expression Profiling. Nature **2000**, 403, 503–511.
- Su, A.I.; Welsh, J.B.; Sapinoso, L.M.; Kern, S.G.; Dimitrov, P.; Lapp, H.; Schultz, P.G.; Powell, S.M.; Moskaluk, C.A.; Frierson, H.F., Jr.; et al. Molecular classification of human carcinomas by use of gene expression signatures. Cancer Res. **2001**, 61, 7388–7393.
- Liew, A.W.; Yan, H.; Yang, M. Pattern Recognition Techniques for the Emerging Field of Bioinformatics: A review. Pattern Recognit. **2005**, 38, 2055–2073.
- Bezdek, J.C.; Keller, J.; Krisnapuram, R.; Pal, N.R. Fuzzy Models and Algorithms for Pattern Recognition and Image Processing; Kluwer Academic Publisher: Norwell, MA, USA, 1999.
- McQueen, J.B. Some Methods for Classification and Analysis of Multivariate Observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, Berkeley, CA, USA, 7 January 1966.
- Sherlock, G. Analysis of Large-Scale Gene Expression Data. Curr. Opin. Immunol. **2000**, 12, 201–205.
- Smet, F.D.; Mathys, J.; Marchal, K.; Thijs, G.; Moor, M.; Bart, D.; Moreau, A. Adaptive Quality-Based Clustering of Gene Expression Profiles. Bioinformatics **2002**, 18, 735–746.
- Heyer, L.J.; Kruglyak, S.; Yooseph, S. Exploring Expression Data: Identification and Analysis of Coexpressed Genes. Genome Res. **1999**, 9, 1106–1115.
- Ralf-Herwig, P.A.; Muller, C.; Bull, C.; Lehrach, H.; Brien, J.O. Large-Scale Clustering of cDNA-Fingerprinting Data. Genome Res. **1999**, 9, 1093–1105.
- Dubes, R.; Jain, A. Algorithms for Clustering Data; Prentice Hall: Upper Saddle River, NJ, USA, 1988.
- Duda, R.O.; Hart, P.E.; Stork, D.G. Pattern Classification, 2nd ed.; John Wiley & Sons Inc.: Hoboken, NJ, USA, 2001.
- Kaufman, L.; Rousseeuw, P.J. Finding Groups in Data: An Introduction to Cluster Analysis; John Wiley and Sons: Hoboken, NJ, USA, 1990.
- Eisen, M.B.; Spellman, P.T.; Brown, P.O.; Botstein, D. Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc. Natl. Acad. Sci. USA **1998**, 95, 14863–14868.
- Iyer, V.R.; Eisen, M.B.; Ross, D.T.; Schuler, G.; Moore, T.; Lee, J.C.F.; Trent, J.M.; Staudt, L.M.; Hudson, J., Jr.; Boguski, M.S.; et al. The Transcriptional Program in the Response of Human Fibroblasts to Serum. Science **1999**, 283, 83–87.
- Perou, C.M.; Jeffrey, S.S.; Rijn, M.V.D.; Rees, C.A.; Eisen, M.B.; Ross, D.T.; Pergamenschikov, A.; Williams, C.F.; Zhu, S.X.; Lee, J.C.F.; et al. Distinctive Gene Expression Patterns in Human Mammary Epithelial Cells and Breast Cancers. Proc. Natl. Acad. Sci. USA **1999**, 96, 9212–9217.
- Liang, F.; Wang, N. Dynamic agglomerative clustering of gene expression profiles. Pattern Recognit. Lett. **2007**, 28, 1062–1076.
- Tamayo, P.; Slonim, D.; Mesirov, J.; Zhu, Q.; Kitareewan, S.; Dmitrovsky, E.; Lander, E.S.; Golub, T.R. Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Application to Hematopoietic Differentiation. Proc. Natl. Acad. Sci. USA **1999**, 96, 2907–2912.
- Jain, A.K.; Murty, M.N.; Flynn, P.J. Data Clustering: A Review. ACM Comput. Surv. **1999**, 31, 254–323.
- Fraley, C.; Raftery, A.E. How Many Clusters? Which Clustering Method? Answers Via Model-Based Cluster Analysis. Comput. J. **1998**, 41, 578–588.
- McLachlan, G.J.; Bean, R.W.; Peel, D. A Mixture Model-Based Approach to the Clustering of Microarray Expression Data. Bioinformatics **2002**, 18, 413–422.
- McLachlan, G.J.; Peel, D. Finite Mixture Models; John Wiley & Sons, Inc.: Hoboken, NJ, USA, 2000.
- Yeung, K.Y.; Fraley, C.; Murua, A.; Raftery, A.E.; Ruzz, A.L. Model-Based Clustering and Data Transformations for Gene Expression Data. Bioinformatics **2001**, 17, 977–987.
- Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum-Likelihood from Incomplete Data Via the EM Algorithm. J. R. Stat. Soc. **1977**, 39, 1–38.
- Kohonen, T. Self Organizing Maps; Springer: Berlin/Heidelberg, Germany, 1995.
- Shamir, R.; Sharan, R. Click: A Clustering Algorithm for Gene Expression Analysis. In Proceedings of the Eighth International Conference on Intelligent Systems for Molecular Biology, La Jolla/San Diego, CA, USA, 19–23 August 2000.
- Ben-Dor, A.; Shamir, R.; Yakhini, Z. Clustering Gene Expression Patterns. J. Comput. Biol. **1999**, 6, 281–297.
- Jiang, D.; Pei, J.; Zhang, A. DHC: A Density-Based Hierarchical Clustering Method for Time-Series Gene Expression Data. In Proceedings of the Third IEEE Symposium on Bioinformatics and Bioengineering, Bethesda, MD, USA, 12 March 2003.
- Ciaramella, A.; Staiano, A.; Tagliaferri, R.; Longo, G. NEC: A Hierarchical Agglomerative Clustering based on Fischer and Negentropy Information. In Neural Nets; Springer: Berlin/Heidelberg, Germany, 2005; pp. 49–56.
- Napolitano, F.; Raiconi, G.; Tagliaferri, R.; Ciaramella, A.; Staiano, A.; Miele, G. Clustering and visualization approaches for human cell cycle gene expression data analysis. Int. J. Approx. Reason. **2008**, 47, 70–84.
- Ciaramella, A.; Cocozza, S.; Iorio, F.; Miele, G.; Napolitano, F.; Pinelli, M.; Raiconi, G.; Tagliaferri, R. Interactive data analysis and clustering of genomic data. Neural Netw. **2008**, 21, 368–378.
- Camastra, F.; Ciaramella, A.; Son, L.H.; Riccio, A.; Staiano, A. Fuzzy Similarity-Based Hierarchical Clustering for Atmospheric Pollutants Prediction, Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2019; Volume 11291.
- Mitra, S.; Das, R.; Banka, H.; Mukhopadhyay, S. Gene Interaction – An evolutionary biclustering approach. Inf. Fusion **2009**, 10, 242–249.
- Pontes, B.; Giráldez, R.; Aguilar-Ruiz, J.S. Biclustering on expression data: A review. J. Biomed. Informat. **2015**, 57, 163–180.
- Staiano, A.; Tagliaferri, R. Visualization of High Dimensional Scientific Data, Book of Tutorials. In Proceedings of the International Joint Conference on Neural Networks, Montreal, QC, Canada, 31 July–4 August 2005.
- Bishop, C.M. Neural Networks for Pattern Recognition; Oxford University Press: Oxford, UK, 1995.
- Tipping, M.E.; Bishop, C.M. Probabilistic principal component analysis. J. R. Stat. Soc. **1999**, 21, 611–622.
- Tipping, M.E.; Bishop, C.M. Mixtures of probabilistic principal component analyzers. Neural Comput. **1999**, 11, 443–482.
- Hastie, T.; Tibshirani, R.; Friedman, J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 2009.
- Vesanto, J. SOM-Based Data Visualization Methods. Intell. Data Anal. J. **1999**, 3, 111–126.
- Kaski, S. Data Exploration Using Self Organizing Maps. Ph.D. Thesis, Helsinki Institute of Technology, Espoo, Finland, 1997.
- Bishop, C.M.; Svensen, M.; Williams, C.K.I. GTM: The Generative Topographic Mapping. Neural Comput. **1998**, 10, 215–234.
- Bishop, C.M.; Tipping, M.E. A hierarchical latent variable model for data visualization. IEEE Trans. Pattern Anal. Mach. Intell. **1998**, 20, 281–293.
- Tino, P.; Nabney, I. Hierarchical GTM: Constructing localized nonlinear projection manifolds in a principled way. IEEE Trans. Pattern Anal. Mach. Intell. **2002**, 24, 639–656.
- Bishop, C.M. Latent variable models. In Learning in Graphical Models; Jordan, M.I., Ed.; MIT Press: Cambridge, MA, USA, 1999; pp. 371–403.
- Chang, K. Nonlinear Dimensionality Reduction Using Probabilistic Principal Surfaces. Ph.D. Thesis, The University of Texas at Austin, Austin, TX, USA, 2000.
- Whitfield, M.L.; Sherlock, G.; Saldanha, A.J.; Murray, J.I.; Ball, C.A.; Alexander, K.E.; Matese, J.C.; Perou, C.M.; Hurt, M.M.; Brown, P.O.; et al. Identification of genes periodically expressed in the human cell cycle and their expression in tumors. Mol. Biol. Cell **2002**, 13, 1977–2000.
- Spellman, P.T.; Sherlock, G.; Zhang, M.Q.; Iyer, V.R.; Anders, K.; Eisen, B.; Brown, P.O.; Botstein, D.; Futcher, B. Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol. Biol. Cell **1998**, 9, 3273–3297.
- Domingos, P. The Master Algorithms. How the Quest for the Ultimate Learning Machine Will Remake Our World; Basic Books; Hachette Book Group: New York, NY, USA, 2015.
- Camastra, F.; Staiano, A. Intrinsic dimension estimation: Advances and open problems. Inf. Sci. **2016**, 328, 26–41.
- Satija, R.; Farrell, J.A.; Gennert, D.; Schier, A.F.; Regev, A. Spatial reconstruction of single-cell gene expression data. Nat. Biotechnol. **2015**, 33, 495–502.
- Wolf, F.A.; Angerer, P.; Theis, F.J. SCANPY: Large-scale single-cell gene expression data analysis. Genome Biol. **2018**, 19, 15.

**Figure 1.** Latent variable responsibilities onto the spherical latent space. Note how the red areas correspond to higher density locations.

Dataset | Size | # of Features | # of Classes |
---|---|---|---|
ALLAML | 72 | 7129 | 2 |
LEUKEMIA | 72 | 7070 | 2 |
CLL_SUB_111 | 111 | 11,340 | 3 |
GLIOMA | 50 | 4434 | 4 |
LUNG_C | 203 | 3312 | 5 |
LUNG_D | 73 | 325 | 7 |
DLBCL | 96 | 4026 | 9 |
CAR | 174 | 9182 | 11 |

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Ciaramella, A.; Staiano, A.
On the Role of Clustering and Visualization Techniques in Gene Microarray Data. *Algorithms* **2019**, *12*, 123.
https://doi.org/10.3390/a12060123
