# Clustering of Monolingual Embedding Spaces

## Abstract

## 1. Introduction

## 2. Materials and Methods

#### 2.1. FastText Word Vectors

#### 2.1.1. Embedding Size

#### 2.1.2. Language Families

#### 2.2. Word Level Alignment of Cross-Lingual Word Embeddings

#### 2.2.1. Regression Method

#### 2.2.2. Orthogonal Methods

#### 2.3. Degree of Isomorphism

#### 2.3.1. Eigensimilarity

#### 2.3.2. Gromov–Hausdorff Distance

#### 2.3.3. Relational Similarity

#### 2.4. Clustering of Embedding Spaces

#### 2.4.1. Hierarchical Clustering

#### 2.4.2. Fuzzy C-Means Clustering

## 3. Results

#### 3.1. Hierarchical Clustering: Dendrogram

#### 3.1.1. Hierarchical Clustering: Eigensimilarity

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.1.2. Hierarchical Clustering: Gromov–Hausdorff Distance

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.1.3. Hierarchical Clustering: Relational Similarity

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.2. Fuzzy C-Means Clustering Algorithm

#### 3.2.1. FCM: Eigensimilarity

**Impact of Typological Similarity**

**Impact of Embedding Size**

#### 3.2.2. FCM: Gromov–Hausdorff Distance

**Impact of Embedding Size**

**Impact of Typological Similarity**

#### 3.2.3. FCM: Relational Similarity

**Impact of Embedding Size**

**Impact of Typological Similarity**

## 4. Discussion

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

| Language | Cluster Label | Cluster 0 Membership | Cluster 1 Membership | Cluster 2 Membership |
|---|---|---|---|---|
| Latin | 0 | 0.816125 | 0.066704 | 0.11717 |
| Galician | 0 | 0.947083 | 0.015154 | 0.037764 |
| Azerbaijani | 0 | 0.74006 | 0.061298 | 0.198643 |
| Greek | 0 | 0.798175 | 0.052795 | 0.14903 |
| Portuguese | 1 | 0.010361 | 0.951072 | 0.038567 |
| Italian | 1 | 0.04371 | 0.742565 | 0.213725 |
| Spanish | 1 | 0.00419 | 0.980757 | 0.015053 |
| Russian | 1 | 0.058801 | 0.816661 | 0.124538 |
| English | 1 | 0.006692 | 0.971533 | 0.021775 |
| Belarusian | 2 | 0.075486 | 0.063486 | 0.861028 |
| Slovak | 2 | 0.021567 | 0.022205 | 0.956228 |
| Romanian | 2 | 0.014296 | 0.020679 | 0.965025 |
| Turkish | 2 | 0.089349 | 0.330027 | 0.580625 |
| Czech | 2 | 0.054219 | 0.417626 | 0.528155 |
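
As a reading aid for the table above, the following minimal sketch assumes that the three numeric columns are fuzzy c-means membership degrees (each row sums to roughly 1) and that the cluster label is the column with the largest value. The helper names `hard_label` and `top_two_margin` are hypothetical and do not come from the article. The margin between the two largest memberships is one informal way to see how ambiguous an assignment is. A handful of rows are copied from the table to check both assumptions.

```python
# Rows copied from the table above:
# (language, cluster label, values of the three numeric columns).
ROWS = [
    ("Latin", 0, (0.816125, 0.066704, 0.11717)),
    ("Portuguese", 1, (0.010361, 0.951072, 0.038567)),
    ("Turkish", 2, (0.089349, 0.330027, 0.580625)),
    ("Czech", 2, (0.054219, 0.417626, 0.528155)),
]


def hard_label(memberships):
    """Return the index of the largest membership value (assumed labeling rule)."""
    return max(range(len(memberships)), key=lambda k: memberships[k])


def top_two_margin(memberships):
    """Return the gap between the largest and second-largest membership."""
    ordered = sorted(memberships, reverse=True)
    return ordered[0] - ordered[1]


for language, label, u in ROWS:
    # Assumption 1: memberships of one language sum to 1 (up to rounding).
    assert abs(sum(u) - 1.0) < 1e-4
    # Assumption 2: the reported label is the arg-max of the memberships.
    assert hard_label(u) == label
    print(f"{language}: label={hard_label(u)}, margin={top_two_margin(u):.3f}")
```

Under these assumptions, a small margin (as for Czech and Turkish in this table) marks a language that sits between two clusters, while a large margin (as for Portuguese) marks a clear assignment.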

| Language | Cluster Label | Cluster 0 Membership | Cluster 1 Membership | Cluster 2 Membership |
|---|---|---|---|---|
| Azerbaijani | 0 | 0.495611 | 0.049662 | 0.454728 |
| Belarusian | 0 | 0.901068 | 0.012263 | 0.086669 |
| Slovak | 0 | 0.852028 | 0.023885 | 0.124087 |
| Romanian | 0 | 0.689999 | 0.070371 | 0.239631 |
| Turkish | 0 | 0.661249 | 0.024543 | 0.314208 |
| Czech | 0 | 0.80863 | 0.01881 | 0.172561 |
| Russian | 0 | 0.547999 | 0.056037 | 0.395964 |
| Galician | 1 | 0.051071 | 0.867585 | 0.081343 |
| Italian | 1 | 0.02978 | 0.926727 | 0.043492 |
| Latin | 2 | 0.341101 | 0.03313 | 0.625769 |
| Greek | 2 | 0.129552 | 0.035976 | 0.834472 |
| Portuguese | 2 | 0.160704 | 0.074351 | 0.764946 |
| Spanish | 2 | 0.323441 | 0.032605 | 0.643954 |
| English | 2 | 0.157306 | 0.090714 | 0.75198 |

| Language | Cluster Label | Cluster 0 Membership | Cluster 1 Membership | Cluster 2 Membership |
|---|---|---|---|---|
| Galician | 0 | 0.441000 | 0.266008 | 0.292991 |
| Azerbaijani | 0 | 0.457129 | 0.224005 | 0.318866 |
| Slovak | 0 | 0.496797 | 0.127349 | 0.375854 |
| Romanian | 0 | 0.487956 | 0.239847 | 0.272196 |
| Turkish | 0 | 0.547817 | 0.167559 | 0.284624 |
| Czech | 0 | 0.515586 | 0.178321 | 0.306093 |
| Portuguese | 1 | 0.158451 | 0.732514 | 0.109034 |
| Italian | 1 | 0.155662 | 0.723955 | 0.120383 |
| Spanish | 1 | 0.110976 | 0.810494 | 0.078530 |
| English | 1 | 0.247609 | 0.535148 | 0.217243 |
| Latin | 2 | 0.402406 | 0.152284 | 0.445310 |
| Belarusian | 2 | 0.301698 | 0.154456 | 0.543846 |
| Greek | 2 | 0.341317 | 0.102445 | 0.556238 |
| Russian | 2 | 0.298074 | 0.207581 | 0.494345 |

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Bhowmik, K.; Ralescu, A.
Clustering of Monolingual Embedding Spaces. *Digital* **2023**, *3*, 48-66.
https://doi.org/10.3390/digital3010004
