Author Contributions
Conceptualization, D.B.H., S.P. and D.C.W.II; methodology, S.P. and D.B.H.; software, S.P. and D.B.H.; validation, S.P.; formal analysis, S.P. and D.B.H.; investigation, S.P. and D.B.H.; resources, D.B.H. and D.C.W.II; data curation, D.B.H.; writing—original draft preparation, S.P. and D.B.H.; writing—review and editing, S.P., D.B.H., T.O.-A., M.A.B., E.J.T., W.E.M., M.S. and D.C.W.II; visualization, S.P. and D.B.H.; supervision, D.C.W.II; project administration, D.C.W.II and E.J.T.; funding acquisition, S.P. and D.C.W.II. All authors have read and agreed to the published version of the manuscript.
Figure 1.
Comparison of the constituency relation parse trees of START (a) to the dependency parsing syntax trees of Gram-ART (b) for the simple algebraic statement . START TreeNodes are full constituency relation parse trees containing terminal symbols at the leaves of the tree, while START ProtoNodes contain only non-terminal symbols at non-terminal positions on the parse tree. As in Grammar Listing 1, non-terminal symbols are enclosed in angle brackets <·> and terminal symbols are in single quotation marks. Here, <> denotes “operation,” <> denotes “operator,” and <> and <> denote the two “arguments” of the operator.
Figure 2.
A set of figures demonstrating the evaluation and update of a START prototype on a new sample. (a (left)) demonstrates a START prototype as a rooted tree of ProtoNodes instantiated on the algebraic statement (a). ProtoNodes are labeled by a non-terminal symbol, and they contain a probability mass function (PMF) of the terminal symbols generated both by that non-terminal and by any descendant non-terminals, where is shorthand for the PMF of the set of outcomes that gives . (b (center)) demonstrates the relation parse tree of a new algebraic statement . The rooted trees of the prototype and parsed statement are aligned and compared as a graph intersection at the non-terminal positions. The START match rule (Section 3.1.3) then determines the activation and match values of this graph intersection as a function of the PMFs at each non-terminal position and the terminal symbols at the leaf nodes of the sample, and the hypothetical prototype of (a) is selected from a pool of other candidate prototypes. (c (right)) demonstrates the prototype after update, accommodating the new non-terminal symbol positions of the sample and updating the PMFs at each non-terminal position according to the START weight update rule (Equations (3) and (4)).
Figure 3.
Effect of the vigilance parameter on the number of clusters. A Monte Carlo simulation over shuffled sample presentation orders was run to generate intervals for the results at each vigilance parameter value. As the vigilance parameter was increased from to , the maximum cluster size decreased, the number of clusters increased, and the number of singleton clusters increased. A value of (yellow dashed line) was selected to yield 9 clusters with only two singleton clusters. Larger values gave too many singleton clusters, and smaller values put too many cases into one cluster.
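The three quantities tracked in this figure (number of clusters, maximum cluster size, and number of singleton clusters) can be computed from cluster labels over repeated shuffles. The following is a minimal sketch only: `leader_cluster` is a toy one-dimensional stand-in for START, and the names `rho`, `cluster_stats`, and `monte_carlo` are illustrative assumptions rather than the authors' implementation.

```python
import random
from collections import Counter
from typing import Dict, List, Sequence


def cluster_stats(labels: Sequence[int]) -> Dict[str, int]:
    """Summarize one clustering by the three quantities plotted in the figure."""
    sizes = Counter(labels)
    return {
        "n_clusters": len(sizes),
        "max_size": max(sizes.values()),
        "n_singletons": sum(1 for size in sizes.values() if size == 1),
    }


def leader_cluster(samples: Sequence[float], rho: float) -> List[int]:
    """Toy stand-in for START (an assumption): a sample joins the first
    cluster whose leader passes a vigilance-style similarity test."""
    leaders: List[float] = []
    labels: List[int] = []
    for x in samples:
        for j, leader in enumerate(leaders):
            if 1.0 - abs(x - leader) >= rho:
                labels.append(j)
                break
        else:
            # No existing cluster passed the test: start a new one.
            leaders.append(x)
            labels.append(len(leaders) - 1)
    return labels


def monte_carlo(samples: Sequence[float], rho: float,
                n_runs: int, seed: int = 0) -> List[Dict[str, int]]:
    """Repeat the clustering over shuffled presentation orders."""
    rng = random.Random(seed)
    runs = []
    for _ in range(n_runs):
        order = list(samples)
        rng.shuffle(order)
        runs.append(cluster_stats(leader_cluster(order, rho)))
    return runs


data = [0.0, 0.05, 0.1, 0.5, 0.55, 0.9]
for rho in (0.0, 0.8, 1.0):
    runs = monte_carlo(data, rho, n_runs=20)
    counts = [run["n_clusters"] for run in runs]
    print(rho, min(counts), max(counts))
```

Because the toy clusterer is order dependent, the spread of `n_clusters` across runs at a fixed `rho` plays the role of the intervals shown in the figure.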
Figure 4.
With , clustering by START yielded nine clusters from 81 variants of CMT. Each cluster is shown in a different color on the heat map. The clusters are ordered on the heat map as 7, 9, 8, 2, 1, 6, 5, 3, 4, based on the Euclidean distance between cluster centroids [69]. The largest cluster is 4 (dark green), with 53 members. The singleton clusters are 9 (white) and 6 (pea green). A shortened variant name is shown in the right margin. Dejerine–Sottas disease appears four times in the heat map because it is caused by mutations in four distinct genes: MPZ, PMP22, PRX, and EGR2.
Figure 5.
Heat map of molecular function for proteins in CMT clusters. Kinase function is associated with cluster 9, hydrolase function with clusters 1 and 8, DNA binding and activator function with cluster 7, and transferase function with cluster 9.
Figure 6.
Heat map of biological process for proteins by CMT cluster. Cluster 1 is associated with apoptosis, cluster 8 with autophagy and apoptosis, cluster 3 with protein synthesis, cluster 6 with transcription and immunity, and cluster 7 with UBL protein conjugation and transcription.
Figure 7.
Heat map of protein locations by CMT cluster. Cluster 2 is associated with the cytoplasm, clusters 5, 6, and 7 with the plasma membrane, cluster 8 with the nucleus, and cluster 9 with the mitochondrion.
Figure 8.
Heat map showing protein motifs and domains by CMT cluster. Motifs and domains are characteristic configurations of the amino acid chains that make up proteins and are often associated with a specific function. Note the over-representation of the transmembrane (TM) domains in clusters 5, 6, and 9 (red arrow). The CC motif is found in the proteins of most clusters, with the exception of cluster 7.
Figure 9.
Heat map of molecular weights and amino acid chain lengths for proteins for CMT clusters.
Figure 10.
Phenotype scores for each of the nine clusters for the 81 variants of CMT. Scores have been normalized to the interval , where 1 indicates and 0 indicates . Note that, as expected, gait, atrophy, deformity, hyporeflexia, weakness, and sensory loss are common features in most cases (red bracket). Clusters 6 and 9, each with a single case, differ from the others: cluster 9 manifests auditory and cognitive symptoms, while cluster 6 manifests ataxia, cognitive, hyperreflexia, hypertonia, seizure, and speech symptoms. Cluster 6 is also of interest because it lacks weakness and atrophy, two of the core symptoms of CMT. Cluster 2 (3 cases) is notable because its subjects have hypertonia. Cluster 4, with 53 cases, is the most common pattern and shows the typical CMT phenotype of gait, atrophy, deformity, hyporeflexia, weakness, and sensory symptoms.
Figure 11.
Modes of inheritance for the nine CMT clusters. Cluster 8 is largely autosomal recessive. Cluster 9 is X-linked recessive. Clusters 5, 6, and 7 show autosomal dominant inheritance.
Figure 12.
SHAP cluster summary plot for the 9 clusters derived from the CMT dataset with . The SHAP plot shows, for each cluster, which features contributed the most to the cluster configuration. Important features are protein length, chromosome, mode of inheritance (autosomal dominant and recessive), protein location (cytoplasm and plasma membrane), and certain phenotypes (auditory, cognitive, and hypertonia). The domain expert rated these features as highly biologically plausible. SHAP plots were created using the method of Lundberg et al. [68].
Table 1.
Shared START notation. The learning dynamics of START and its variants follow the activation, competition, match, update, and initialization rules of unsupervised ART algorithms, so the notation here largely adheres to the elementary ART algorithm notation outlined in [19]. Dual-vigilance lower bound and upper bound follow the notation in DVFA [25] and DDVFA [26].
: set of prototype nodes. |
R: a single prototype node. |
: set of prototype node indices. |
: subset of active ART module node indices . |
: START vigilance threshold, . |
: dual-vigilance lower-bound vigilance threshold . |
: dual-vigilance upper-bound vigilance threshold . |
n: number of input dataset statements. |
: statements parsed as syntax trees with terminal metadata. |
: syntactic parsing algorithm taking a set of statements and a grammar and producing rooted constituency parse trees. |
fT: activation function. |
fM: match function. |
fN: node initialization function. |
fL: node weight update function. |
fV: the vigilance test function. |
: internal supervised category indices. |
: set of cluster indices. |
Table 2.
A simple UML diagram of the stateful information of one START TreeNode [24]. A symbol in a TreeNode in START is realized by either a terminal or non-terminal symbol at the syntax tree position of the node. A rooted tree of TreeNodes in this regard contains the minimum information necessary to describe the syntax tree of a statement parsed with a prescribed grammar.
TreeNode |
---|
Symbol: GrammarSymbol |
Children: Vector{TreeNode} |
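The TreeNode record above maps directly onto a small recursive data type. The following is a minimal sketch, assuming that grammar symbols can be represented as plain strings; the helper methods and the example symbols (`<statement>`, `<arg>`, `<op>`) are hypothetical and are not taken from the grammar used in the paper.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class TreeNode:
    """One syntax tree position: a grammar symbol plus ordered children."""
    symbol: str  # terminal or non-terminal symbol (a plain string here)
    children: List["TreeNode"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # In a constituency parse tree, the leaves hold the terminal symbols.
        return not self.children

    def terminals(self) -> List[str]:
        # Collect terminal symbols left to right by depth-first traversal.
        if self.is_leaf():
            return [self.symbol]
        found: List[str] = []
        for child in self.children:
            found.extend(child.terminals())
        return found


# Hypothetical parse of a three-token statement; all symbols are invented.
tree = TreeNode("<statement>", [
    TreeNode("<arg>", [TreeNode("'x'")]),
    TreeNode("<op>", [TreeNode("'+'")]),
    TreeNode("<arg>", [TreeNode("'y'")]),
])
print(tree.terminals())
```

Only the symbol and the child list are stored, matching the two fields of the UML diagram; everything else is derived by traversal.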
Table 3.
A simple UML diagram of the stateful information of one START ProtoNode, which is the basic element of the rooted trees constituting the prototypes of START [24]. A rooted tree of START ProtoNodes encodes only the non-terminal positions of the syntax tree of a TreeNode tree. Each ProtoNode encodes a PMF of the terminal symbols encountered at and below its own non-terminal position, together with instance counts of each terminal, which are used to renormalize the PMF when learning occurs at the node.
ProtoNode |
---|
Symbol: NonTerminalGrammarSymbol |
Distribution: Dictionary{TerminalGrammarSymbol, Float} |
InstanceCount: Dictionary{TerminalGrammarSymbol, Integer} |
Children: Vector{ProtoNode} |
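The interplay between the instance counts and the PMF can be illustrated with a short sketch. It assumes that grammar symbols are plain strings and that learning simply increments a count and renormalizes; the START weight update rule itself is defined elsewhere in the paper and may differ, so the `observe` method below is a hypothetical illustration rather than that rule.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ProtoNode:
    """One prototype position, labeled by a non-terminal symbol."""
    symbol: str
    distribution: Dict[str, float] = field(default_factory=dict)   # PMF over terminals
    instance_count: Dict[str, int] = field(default_factory=dict)   # raw counts per terminal
    children: List["ProtoNode"] = field(default_factory=list)

    def observe(self, terminal: str) -> None:
        # Assumed update: bump the count, then renormalize counts into a PMF.
        self.instance_count[terminal] = self.instance_count.get(terminal, 0) + 1
        total = sum(self.instance_count.values())
        self.distribution = {
            t: count / total for t, count in self.instance_count.items()
        }


# Hypothetical non-terminal that has seen the terminal 'x' three times and 'y' once.
node = ProtoNode("<arg>")
for terminal in ["x", "y", "x", "x"]:
    node.observe(terminal)
print(node.instance_count, node.distribution)
```

Keeping the counts alongside the PMF means each update is exact, rather than an approximation computed from the previous probabilities alone.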
Table 4.
Distributed dual-vigilance START activation and match linkage methods where hierarchical agglomerative clustering (HAC) functions and distributed dual-vigilance notation are shared with DDVFA [26]. Global activation and match functions are defined via the generic function for the global F2 node index i as a function of inner node indices , where k is the number of nodes in the local START module i. Each HAC method then is a “function of functions” evaluated at each F2 node in the global module to determine either the match or activation value in the global module match rule dynamics.
HAC Method | |
---|
Single | |
Complete | |
Median | |
Average | |
Weighted 1 | |
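The “function of functions” idea can be sketched as a reduction over the activation or match values of the inner nodes of one local module. The formulas below are assumptions based on the usual similarity-based reading of HAC linkages (single as maximum, complete as minimum, weighted as an instance-count-weighted mean) and are not reproduced from this table; the names `LINKAGES` and `global_value` are hypothetical.

```python
from statistics import mean, median
from typing import Callable, Dict, Optional, Sequence

# Assumed reductions over inner-node values, treated as similarities.
LINKAGES: Dict[str, Callable[[Sequence[float]], float]] = {
    "single": max,      # most similar inner node
    "complete": min,    # least similar inner node
    "median": median,
    "average": mean,
}


def global_value(method: str,
                 inner_values: Sequence[float],
                 counts: Optional[Sequence[int]] = None) -> float:
    """Reduce the inner-node values of one local module to a single
    global activation or match value."""
    if method == "weighted":
        if counts is None:
            raise ValueError("weighted linkage needs per-node instance counts")
        total = sum(counts)
        # Assumed weighting: each inner node counts in proportion to its instances.
        return sum(v * c for v, c in zip(inner_values, counts)) / total
    return LINKAGES[method](inner_values)


inner = [0.2, 0.5, 0.8]   # hypothetical inner-node match values
counts = [1, 1, 2]        # hypothetical instance counts per inner node
for method in ("single", "complete", "median", "average", "weighted"):
    print(method, global_value(method, inner, counts))
```

The same reduction serves for both activation and match, since only the inner values passed in change.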
Table 5.
A summary of the START variants and their abbreviations. Three vigilance formulations are developed, starting with a core START algorithm and extending it with dual-vigilance and distributed dual-vigilance variants (Section 3.1.5). These three variants are intrinsically incremental, unsupervised clustering algorithms, but a supervised procedure in the vein of Simplified FuzzyARTMAP (summarized in Section 3.1.6) generates a supervised variant for each of these three algorithms as well.
Vigilance Formulation | Unsupervised | Supervised |
---|
Single-Vigilance | START | Simplified STARTMAP |
Dual-Vigilance | DV-START | Simplified DV-STARTMAP |
Distributed Dual-Vigilance | DDV-START | Simplified DDV-STARTMAP |
Table 6.
Table of features and their characteristics in the CMT flat file. Protein numbers were from UniProtKB [79]. Variant and gene numbers were from OMIM [77]. The phenotype numbers were from HPO [1,81]. Since genes, proteins, and diseases have multiple names, the names were normalized to the standard form. Most of the features were categorical, and some were multi-categorical. The features were formatted as integers or strings of variable or fixed length.
Feature | Type | Format | Length | Multi-Category |
---|
variant name | categorical | string | variable | no |
variant number | categorical | string | fixed | no |
gene name | categorical | string | variable | no |
gene number | categorical | integer | fixed | no |
protein name | categorical | string | variable | no |
protein number | categorical | string | fixed | no |
protein length | numerical | integer | variable | no |
protein weight | numerical | integer | variable | no |
protein location | categorical | string | variable | yes |
protein molecular function | categorical | string | variable | yes |
protein biological process | categorical | string | variable | yes |
protein class | categorical | string | variable | yes |
mode of inheritance | categorical | string | variable | yes |
phenotype | categorical | string | variable | yes |
phenotype number | categorical | string | variable | yes |
chromosome | categorical | string | variable | no |
chromosome location | categorical | string | variable | no |
Table 7.
Summary of features that characterize CMT clusters. k is the cluster number and N is the count of members in each cluster. Phenotype Plus lists signs and symptoms in addition to weakness, atrophy, deformities, sensory loss, and hyporeflexia that characterize most cases of CMT. AD is autosomal dominant inheritance; AR is autosomal recessive; XLR is X-linked recessive. TM is the transmembrane protein domain. GNRF is the guanine nucleotide-releasing factor. Note that some of the characteristics identified by the SHAP analysis, including cognitive, hypertonia, auditory, plasma membrane, autosomal recessive, and autosomal dominant (Figure 12), recur in this summary table.
k | N | Process | Function | Location | Domain | Inheritance | Phenotype Plus |
---|
1 | 6 | apoptosis | hydrolase | | | AD | auditory, visual |
2 | 3 | | | cytoplasm | | AD | hypertonia |
3 | 7 | protein synthesis | transferase | | | AD, AR | |
4 | 53 | | | plasma membrane | TM | AD, AR | |
5 | 4 | | | plasma membrane | TM | AD | cognitive, auditory |
6 | 1 | immunity, transcription | transferase | plasma membrane | | AD | cognitive, ataxia, seizure, hypertonia, speech, hyperreflexia |
7 | 4 | transcription | DNA binding, transferase | plasma membrane | | AD, AR | cognitive, hypotonia |
8 | 2 | autophagy, apoptosis | hydrolase, GNRF | nucleus | | AR | cognitive, auditory, hypertonia |
9 | 1 | | transferase | mitochondrion | TM | XLR | cognitive, auditory |