Article
Peer-Review Record

Deciphering Genomic Heterogeneity and the Internal Composition of Tumour Activities through a Hierarchical Factorisation Model

Mathematics 2021, 9(21), 2833; https://doi.org/10.3390/math9212833
by José Carbonell-Caballero 1,*, Antonio López-Quílez 2, David Conesa 2 and Joaquín Dopazo 3,4,5
Reviewer 1: Anonymous
Reviewer 2: Anonymous
Submission received: 29 September 2021 / Revised: 3 November 2021 / Accepted: 4 November 2021 / Published: 8 November 2021
(This article belongs to the Special Issue Models and Methods in Bioinformatics: Theory and Applications)

Round 1

Reviewer 1 Report

Title: “Deciphering Genomic Heterogeneity and the Internal Composition of Tumour Activities through a Hierarchical Factorization Model”

In this work, the authors present a hierarchical factorization model conceived from a systems biology point of view. This model integrates the topology of molecular pathways, allowing gene and pathway activity matrices to be factorized simultaneously. The protocol was evaluated by using simulations, showing a high degree of accuracy. Furthermore, the analysis with a real cohort of breast cancer patients depicted the internal composition of some of the most relevant altered biological processes in the disease, describing gene and pathway level strategies and their observed combinations in the population of patients. The authors claim that this kind of approach could be used to better understand the hallmarks of cancer.

General comment: The aim of this work seems interesting. However, the overall style is suboptimal, so the current version of the main text does not allow interested readers to clearly understand its value. In particular, the main text does not establish a clear relationship between the mathematical approach presented by the authors and the biological results provided. Readers with a more general (non-specialist) mathematical background should be better guided to understand how the presented results were obtained. The supplementary material also does not seem particularly helpful. Consequently, this work should be thoroughly reworked in order to increase its relevance and impact.

 

Some detailed comments:

 

Section “2. Materials and Methods”

 

*) This section should be thoroughly reworked to provide a better logical flow.

 

Lines: “The protocol starts estimating the matrix $X_g \in \mathbb{R}^{m_g \times n}$. With $m_g$ genes and $n$ individuals this matrix stores the gene activity in the set of selected individuals. To build $X_g$ we need to combine the gene expression matrix $X_{ge}$, and the mutation effect matrix $X^v_g$, that describes the effect of somatic mutations on the structure of measured genes (see suppl. Section [...] $X_g$ and $X_p$ matrices describe tumour activities from two different biological levels, and constitute the input variables for the hierarchical factorization model following described. In practice, the model is not applied to the whole activity matrices, as this would provide a too complex description of the system. Instead, the $m_p$ signalling cascades, and their $m_g$ implicated genes, are stratified according to the biological processes in which they are involved (see suppl. Section”

*) The authors should improve this section to clarify the construction of the Xg and Xp matrices. The meaning of the following words is not clear: “the model is not applied to the whole activity matrices, as this would provide a too complex description of the system. Instead, the mp signalling cascades, and their mg implicated genes, are stratified according to the biological processes in which they are involved”. Please clarify.

In addition, the words “(see suppl. Section” are scattered throughout the main text with no section reference. What are they meant to refer to? Please correct.

 

2.1. Hierarchical model of factorization

*) This subsection is not clear. The authors should better present their model and further explain the biological value of their assumptions.

 

Lines: “In this case, as suggested by Lee and Seung [9], we use multiplicative updates, hence avoiding possible negative values that could be obtained from the standard gradient descent rule:”

 

*) Please provide more details about the matrix product used in this work.
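For context, the multiplicative update rules of Lee and Seung for standard NMF with a Frobenius-norm loss combine ordinary matrix products with element-wise (Hadamard) multiplication and division. The following is a minimal NumPy sketch of plain NMF only, not of the authors' hierarchical model; the function name `nmf_multiplicative` and the parameters `n_iter`, `eps` and `seed` are illustrative assumptions.

```python
import numpy as np

def nmf_multiplicative(X, k, n_iter=200, eps=1e-9, seed=0):
    """Plain NMF, X ~ W @ H, via Lee-Seung multiplicative updates (Frobenius loss)."""
    rng = np.random.default_rng(seed)
    m, n = X.shape
    # Strictly positive random initialisation (an assumption of this sketch).
    W = rng.random((m, k)) + eps
    H = rng.random((k, n)) + eps
    for _ in range(n_iter):
        # "@" is the ordinary matrix product; "*" and "/" act element-wise.
        # Each factor is rescaled by a non-negative ratio, so it never turns negative.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Because every update multiplies the current entries by a non-negative ratio, no projection step is needed to keep W and H non-negative, which is the property the quoted passage refers to.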

 

Lines: “This term allows to evaluate the difference between both sets of components. In particular, the components at pathway level ($W_p$) are compared against the gene level component after applying the function $\wp$ that represents a customized Hipathia function (see suppl. Section”

 

*) Please clarify the meaning of these lines, which are definitely not clear. The supporting material is also not very useful. What are the main characteristics of the “customized Hipathia function”?

 

Lines: “Given that the two levels could use a different number of components, to perform this comparison we need to introduce the auxiliary matrix $S \in \mathbb{R}^{k_g \times k_p}$, defined as a binary matrix. In practice, S concentrates the essence of the hierarchical model, as it directly describes which gene-level components are associated with the same pathway-level component, representing different gene-level alternatives to obtain the same response at the pathway level. Because of its binary structure, to produce a smooth convergence, S is approximated during optimization by using a sigmoid expression such as”

*) The matrix S is defined only after it has already been used in Eq. (9), and the definition through Eq. (10) is not clear. Why this particular choice?
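To illustrate the general idea behind a sigmoid relaxation of a binary matrix, the sketch below maps unconstrained real scores to values in (0, 1) during optimization and thresholds them afterwards. The manuscript's actual expression is not reproduced in this record, so the score matrix `Z`, the `steepness` parameter and the 0.5 threshold are all assumptions made for illustration.

```python
import numpy as np

def soft_binary(Z, steepness=10.0):
    """Map unconstrained scores Z into (0, 1); larger steepness pushes values toward 0 or 1."""
    return 1.0 / (1.0 + np.exp(-steepness * Z))

# Hypothetical k_g x k_p score matrix (3 gene-level, 2 pathway-level components).
Z = np.array([[2.0, -1.5],
              [-0.3, 0.8],
              [1.2, -2.0]])

S_soft = soft_binary(Z)              # smooth and differentiable, usable in gradient steps
S_hard = (S_soft > 0.5).astype(int)  # binarised once optimization has finished
```

The smooth version keeps the cost function differentiable with respect to the scores, whereas a hard 0/1 matrix would make gradient-based or multiplicative updates ill-defined.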

Lines: “where $O_{k_p}$ y $O_{k_g}$ correspond to column vectors of ones, with a $k_p$ y $k_p$ size, respectively. On the other hand, it is important to converge to S solutions with balanced correspondence between gene and pathway components, avoiding cases in which a reduced number of pathway components attract the majority of gene components. To solve this limitation, we add this term”

 

*) Please clarify the meaning of these lines. What is the meaning of “where $O_{k_p}$ y $O_{k_g}$”?
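As general background, an all-ones column vector is a standard device for writing row and column sums as matrix products. The sketch below shows this for a hypothetical binary assignment matrix S of size kg × kp. The entries of `S` are invented for illustration, and whether the manuscript uses its vectors of ones in exactly this way is an assumption.

```python
import numpy as np

k_g, k_p = 4, 2
# Hypothetical binary assignment: row i marks the pathway component of gene component i.
S = np.array([[1, 0],
              [1, 0],
              [0, 1],
              [1, 0]])

ones_kp = np.ones((k_p, 1))  # column vector of ones, length k_p
ones_kg = np.ones((k_g, 1))  # column vector of ones, length k_g

row_sums = S @ ones_kp       # number of pathway components assigned to each gene component
col_sums = S.T @ ones_kg     # number of gene components attracted by each pathway component
```

In this toy example every gene-level component maps to exactly one pathway-level component, while the column sums (3 and 1) reveal the kind of imbalance that the quoted balancing term is meant to discourage.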

 

*) Eq. (12) and Eq. (13) are not clearly explained. Please rework accordingly.

 

Lines: “From this equation, we derive the following partial derivatives with respect to each matrix: “

*) The following calculations are not clear. Please explain better and check.

 

*) Eq. (14) and Eq. (15) should be better presented and explained to the readers.

 

Lines: “Finally, each weight term in the cost function was optimised by using the R package DEoptim [28], which, through a genetic algorithm, explores the error space. In this case, each model execution used 100 iterations, with a total of 50 generations”

 

*) Please explain this in full detail.

 

2.2. Estimating the optimal number of components

*) Please rework to make this paragraph fully understandable to the interested readers.

 

Figure 1. Followed protocol to estimate the optimal number of components. On the
left, the error curve exponential fitting is described for pathways (top) and genes
(bottom), taking as reference points the 5 performed factorizations. On the right, the
result obtained by the hierarchical model for the 9 selected alternatives is described,
highlighting in red the selected solution.

 

*) Please improve this caption and describe more effectively all details related to this figure.

 

2.3. Model validation

*) This crucial paragraph is not clear. Please explain it in all the necessary detail.

 

Figure 2. Error distribution obtained estimating the optimal number of components
for the cophenetic correlation coefficient, the silhouette method and the hierarchical
factorization model (HFM), for genes and pathways respectively

*) The value of this figure is not clear. Please explain and describe in a better way.

 

Lines: “Figure 3. Distribution of correlation values obtained between the simulated and the optimized matrices obtained by the original method proposed by Lee and Seung, the alternating non-negative least squares (NNLS) method and the hierarchical factorization model (HFM).”

*) The value of this figure is not clear. Please explain and describe in a better way.

 

Lines: “Figure 4. Graphical representation of the hierarchical model obtained for the biological function cellular response to epidermal growth factor stimulus, applied to the Her2 subtype”

 

*) The value of this figure is not clear. Please explain and describe in a better way.

 

Lines: “Figure 5. Graphical representation of the hierarchical model obtained for the biological function Notch signalling, applied to the Basal subtype.”

 

Lines: “Figure 6. Graphical representation of the hierarchical model obtained in the biological function response to estrogen, applied to the subtypes Luminal A and Luminal B.

Figure 7. Graphical representation of the hierarchical model obtained in the biological function G2 M transition of mitotic cell cycle, applied to the subtypes Luminal A and Luminal B”

*) All these figures should be explained in a detailed way also within their captions. Please provide fully informative panels.

 

Lines: “4. Discussion
The proposed methodology has been used for revealing the internal composition of patients, addressing the statistical modeling of cell function both at the gene and pathway level, establishing specific connections between both spaces. For this purpose, the model is based on a set of equations derived from the Hipathia tool, implicitly containing the structure and topology of signaling networks, while describing the contribution of each gene to the level of activity of each pathway in which it participates.”

 

*) The authors should rework this “Discussion” section in order to better underline the value of this work also in comparison to the state of the art.

 

*) What is the meaning of “The proposed methodology has been used for revealing the internal composition of patients”? Please clarify.

 

Lines: “5. Conclusions
The model presented in this work has been designed to address the study of genomic heterogeneity in a group of patients with cancer disease, representing one of the most inherent aspects of tumours. The model, conceived from a systems biology point of view, has provided a portrait of the internal composition of patients, describing in detail a set of cellular strategies that individual tumours implement to regulate a certain biological function altered in the disease. This approach renders a quantitative description of the differences and similarities”

 

*) Please improve this section.

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Reviewer 2 Report

Article Review Consideration

Article: Deciphering Genomic Heterogeneity and the Internal Composition of Tumour Activities through a Hierarchical Factorization Model

Authors: José Carbonell-Caballero, Antonio López-Quílez, David Conesa, Joaquín Dopazo

 

In this paper, Carbonell-Caballero et al. propose a hierarchical version of the Non-Negative Matrix Factorization (NMF) method for integration of two levels of heterogeneity (genomic and molecular pathways) involved in subtypes of tumors. In order to calculate, at each level of heterogeneity, hierarchically compatible components, the authors consider a model that imposes a series of restrictions in the optimization procedure. Identification of the optimal number of latent components is also included in the optimization. The performance of the proposed hierarchical method was evaluated by a specially designed simulation study and using a cohort of breast cancer patients obtained from the International Cancer Genome Consortium. R package libraries, customized R scripts, as well as resources from SnpEff, the Gene Ontology database and Hipathia, were used in the optimization steps and to obtain normalized genomic and molecular pathways data.

The article offers interesting contributions to understanding the internal structure of tumor subtypes, allowing the identification of strategies that individual tumors use to alter certain biological functions. However, the success of the model strongly depends on the information contained in the input matrices (Xg and Xp).

From the analytical viewpoint, the proposed methodology involves several objective functions to be optimized, whose steps are well formulated in the article. However, the computational implementation could be better understood and reproduced if the R code were made available.

Following are some comments:

  1. My main concern is to understand how to build input matrices, Xg and Xp. How can the collection of this data be planned? Can the authors provide a pipeline for this purpose? Furthermore, it is not clear how robust the factorization results are for deviations that may occur in data collection (in setting matrices Xg and Xp). Can the authors clarify this further?
  2. In the Results Section (Figures 4, 5, 6 and 7), it is not clear how the binarized version of the mixing matrices was defined for gene and pathway levels, Hg and Hp, respectively. Can the authors clarify this?
  3. In the Results Section (Figures 4, 5, 6 and 7), it is not clear how to relate the latent components, kp and kg, of the two different heterogeneity levels of the model. Is this done by patient frequency distribution? Is there an objective criterion for that?
  4. In Figures 4-7 the meaning of the numbers used for the kg and kp components is not clear. For instance, is kp4 in Figures 4 and 5 the same component?
  5. In the simulation study for model validation, why was the proposed methodology not compared with alternative methodologies, such as the Adaptive Multiview NMF Algorithm (Ray, Liu and Fenyö, Cancer Informatics 16, 1-12, 2017)? My question is whether equally good, or even better, results could be obtained by simply imposing that the mixing matrices, Hg and Hp, have the same projected direction.
  6. Can the proposed hierarchical method be extended to penalized solutions, as is the case in big-data with n>>max(mg,mp)?

Some minor revisions to be made in the text:

  • In the title, change "tumour" to "tumor"
  • Page 5: Expression below (13) must end with a comma and not a period
  • Page 8, Figure 3: MJF is HFM? Wgp is Wg?

Author Response

Please see the attachment

Author Response File: Author Response.pdf

Round 2

Reviewer 1 Report

Title: “Deciphering Genomic Heterogeneity and the Internal Composition of Tumor Activities through a Hierarchical Factorization Model”


In this work, the authors present a hierarchical factorization model conceived from a systems biology point of view. This model integrates the topology of molecular pathways, allowing gene and pathway activity matrices to be factorized simultaneously. The protocol was evaluated by using simulations, showing a high degree of accuracy. Furthermore, the analysis with a real cohort of breast cancer patients depicted the internal composition of some of the most relevant altered biological processes in the disease, describing gene and pathway level strategies and their observed combinations in the population of patients. The authors claim that this kind of approach could be used to better understand the hallmarks of cancer.

General comment: The authors provided a revised version of their work, which is clearer than the previous one.
However, some difficult passages remain for readers without a specialist background. This could limit the impact of this work.
