# Recurrence Networks in Natural Languages

## Abstract

## 1. Introduction

## 2. Methods

#### 2.1. Recurrence Networks

#### 2.2. Network Metrics

- Density ($\rho $): The density of a network is defined as: $$\rho =\frac{2g}{n(n-1)},$$ where g is the number of edges and n is the number of nodes.
- Closeness centrality (${K}_{c}$): Measures the centrality of a given node i in the network, defined as the reciprocal of the sum of the lengths of the shortest paths between the node and all other nodes in the graph, normalized by the number of nodes n [29]: $${K}_{c}=\frac{n}{{\Sigma}_{j}{d}_{ij}},$$ where ${d}_{ij}$ is the length of the shortest path between nodes i and j.
- Clustering coefficient (${C}_{i}$): Measures the degree of transitivity in connectivity among the nearest neighbors of a node i [29]. In recurrence terms, ${C}_{i}$ represents the extent to which neighbors of a node (pattern) i are also recurrent among themselves. Specifically, ${C}_{i}$ is given by $${C}_{i}=\frac{2{E}_{i}}{{k}_{i}({k}_{i}-1)},$$ where ${k}_{i}$ is the degree of node i and ${E}_{i}$ is the number of edges among its neighbors.
- Average nearest-neighbor degree (${\overline{k}}_{nn,i}$): This measure captures the mean connectivity preference of a given node [30,31,32]. Its behavior as a function of the node’s degree reveals whether high-degree nodes connect with other high-degree nodes (assortativity) or preferentially connect to low-degree ones (disassortativity) [29]. For unweighted networks, ${\overline{k}}_{nn,i}$ is calculated as: $${\overline{k}}_{nn,i}=\frac{1}{{k}_{i}}\sum _{j=1}^{N}{A}_{ij}{k}_{j},$$ where ${A}_{ij}$ is the adjacency matrix, ${k}_{j}$ is the degree of node j, and N is the number of nodes.
- Assortative mixing coefficient by degree (${A}_{r}$): This measure quantifies the tendency of nodes with many connections to be connected to other nodes with many (or few) connections [33]. Formally, the coefficient is given by $${A}_{r}=\frac{{\sum}_{ij}{A}_{ij}({k}_{i}-\mu )({k}_{j}-\mu )}{{\sum}_{ij}{A}_{ij}{({k}_{i}-\mu )}^{2}},$$ where ${A}_{ij}$ is the adjacency matrix, ${k}_{i}$ is the degree of node i, and $\mu $ is the mean degree of the node found at the end of an edge.
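
The five metrics above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not the implementation used in this work. The helper names and the four-node toy graph are hypothetical, and the graph is assumed to be connected, undirected, and unweighted. Two symbols are interpreted here by assumption: ${E}_{i}$ as the number of edges among the neighbors of node i, and $\mu $ as the mean degree found at the end of an edge.

```python
from collections import deque


def make_graph(edges):
    """Build an undirected, unweighted adjacency map {node: set(neighbors)}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def density(adj):
    """rho = 2g / (n (n - 1)), with g edges and n nodes."""
    n = len(adj)
    g = sum(len(nbrs) for nbrs in adj.values()) // 2  # each edge is counted twice
    return 2 * g / (n * (n - 1))


def closeness(adj, i):
    """K_c = n / sum_j d_ij, with d_ij from breadth-first search.

    Assumes the graph is connected, so every node is reachable from i.
    """
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return len(adj) / sum(dist.values())


def clustering(adj, i):
    """C_i = 2 E_i / (k_i (k_i - 1)); returns 0.0 when k_i < 2."""
    nbrs = adj[i]
    k = len(nbrs)
    if k < 2:
        return 0.0
    # Assumption: E_i counts the edges that join two neighbors of i.
    e_i = sum(1 for u in nbrs for v in adj[u] if v in nbrs) // 2
    return 2 * e_i / (k * (k - 1))


def mean_neighbor_degree(adj, i):
    """k_nn,i = (1 / k_i) * sum_j A_ij k_j."""
    return sum(len(adj[j]) for j in adj[i]) / len(adj[i])


def assortativity(adj):
    """A_r as in the formula above, summing over ordered adjacent pairs (i, j).

    Assumption: mu is the mean degree at the end of an edge.
    The denominator vanishes for regular graphs, which are not handled here.
    """
    ends = [(len(adj[i]), len(adj[j])) for i in adj for j in adj[i]]
    mu = sum(ki for ki, _ in ends) / len(ends)
    num = sum((ki - mu) * (kj - mu) for ki, kj in ends)
    den = sum((ki - mu) ** 2 for ki, _ in ends)
    return num / den


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3 attached to 2.
G = make_graph([(0, 1), (0, 2), (1, 2), (2, 3)])
```

On this toy graph, `density(G)` evaluates to 2/3 and `assortativity(G)` is negative, because the pendant node attaches to the highest-degree node.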

## 3. Results

## 4. Discussion and Conclusions

## References

**Figure 1.** Log-linear plot of density $\rho $ vs. the distance r for several values of the pattern length m. We show the cases $m=6,7,8,9,10$, with r running from 1 to ${r}_{max}$, where ${r}_{max}=m-1$. The fit corresponds to the case $m=10$ and yields $d\approx 1.3$.

**Figure 2.** Representative metrics of recurrence-pattern networks for different languages. (**a**) Density for languages grouped by linguistic families. (**b**) Average clustering coefficient $\mathcal{C}$. (**c**) Closeness centrality. (**d**) Assortativity coefficient. Vertical bars indicate the standard deviation of the data.

**Figure 3.** Mean nearest-neighbor connectivity as a function of the degree for (**a**) Germanic, (**b**) Romance, (**c**) Slavic, and (**d**) Uralic linguistic families. For each language, we also show the values of ${k}_{nn}$ corresponding to shuffled texts. A scaling behavior of the form ${k}_{nn}\sim {k}^{\delta}$ is observed in all cases. We estimate the scaling exponent for degree values $10<k<500$, yielding the average values $\overline{\delta}\approx 0.49$ and $\overline{{\delta}_{r}}\approx 0.47$ for the original and random data, respectively. As a guide for the eye, the dashed line corresponds to a slope of $0.5$.
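
The exponent $\delta $ in ${k}_{nn}\sim {k}^{\delta}$ can be estimated in several ways, and the caption does not say which one was used. The sketch below assumes an ordinary least-squares fit of $\log {k}_{nn}$ against $\log k$, restricted to the window $10<k<500$. The function name `scaling_exponent` and the synthetic input are hypothetical; the synthetic data only check that the fit recovers a known exponent and are not the data of this work.

```python
import math


def scaling_exponent(degrees, knn_values, k_min=10, k_max=500):
    """Estimate delta in k_nn ~ k**delta by least squares in log-log space.

    Only points with k_min < k < k_max (strict, as in the caption) are used.
    Assumption: a plain least-squares line is an adequate estimator here.
    """
    points = [
        (math.log(k), math.log(y))
        for k, y in zip(degrees, knn_values)
        if k_min < k < k_max and y > 0
    ]
    if len(points) < 2:
        raise ValueError("need at least two points inside the fitting window")
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    return sxy / sxx  # slope of the fitted line, i.e. the exponent delta


# Synthetic check: an exact power law with exponent 0.5 and arbitrary prefactor.
ks = list(range(1, 1000))
knn = [3.0 * k ** 0.5 for k in ks]
delta = scaling_exponent(ks, knn)
```

Applying the same function to the shuffled-text curve would give the corresponding estimate of ${\delta}_{r}$ under the same assumption.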

**Figure 4.** Results of classification analysis applied to European languages. (**a**) Results of the linear discriminant method. Here we show the projection of density values from pattern lengths $m=5,6,7$. For each m-value and each language, we considered ten segments of length ${10}^{4}$ to obtain ten $\rho $ values. Languages were then labeled according to the linguistic family to which they belong (Romance, Germanic, Slavic, Uralic). (**b**) Results of applying the k-nearest-neighbor classification method to the data in panel (a), assigning the same label to languages of the same family. We used $k=20$ neighbors in the classifier. The families are segregated, except for the Uralic family, which leads to two disjoint regions. (**c**) Confusion matrix. The classifier clearly distinguishes almost all language families, except Uralic, which it has trouble distinguishing from Slavic and Romance.

**Table 1.** Symmetric recurrence matrix for the beginning of Hamlet’s famous soliloquy: To-be-or-not-to-be. Here $N=18$ and we set $m=3$. The resulting matrix has $N-m+1=16$ rows and columns.

$\mathit{r}=2$ | To_ | o_b | _be | be_ | e_o | _or | or_ | r_n | _no | not | ot_ | t_t | _to | to_ | ob_ | _be |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

To_ | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |

o_b | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |

_be | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |

be_ | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |

e_o | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 | 0 |

_or | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 1 |

or | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |

r_n | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 |

_no | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |

not | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 |

ot_ | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |

t_t | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 |

_to | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |

to_ | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |

o_b | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 |

_be | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
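
A matrix of this kind can be built with a short standard-library Python sketch. The recurrence rule used here is an assumption, since the definition is not given alongside the table: two patterns of length m are taken to be recurrent when their Hamming distance is at most r, with letters compared case-insensitively. This rule was chosen because it reproduces the first row of Table 1. The function names are hypothetical.

```python
def patterns(text, m):
    """Return the N - m + 1 overlapping patterns of length m in text."""
    return [text[i:i + m] for i in range(len(text) - m + 1)]


def hamming(a, b):
    """Number of positions at which two equal-length patterns differ."""
    return sum(x != y for x, y in zip(a, b))


def recurrence_matrix(text, m, r):
    """Binary matrix R with R[i][j] = 1 when patterns i and j are recurrent.

    Assumption: recurrence means Hamming distance <= r, ignoring letter case.
    """
    pats = patterns(text.lower(), m)
    return [[1 if hamming(p, q) <= r else 0 for q in pats] for p in pats]


# Spaces are written as underscores, as in the row and column labels of Table 1.
text = "To_be_or_not_to_be"
R = recurrence_matrix(text, m=3, r=2)
```

Because the Hamming distance is symmetric, any matrix produced this way is symmetric with ones on the diagonal. Dropping the diagonal gives an adjacency matrix from which the network metrics of Section 2.2 can be computed.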

© 2019 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

