# Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture

## Abstract


## 1. Introduction

## 2. Mathematical Models

**Theorem 1.**

**Theorem 2.** For a finite energy process ${\left({X}_{i}\right)}_{i\in \mathbb{Z}}$, there exists a constant $C>0$ such that the maximal repetition satisfies:

**Theorem 3.**

## 3. Empirical Data

Text | Maximal Repeated Substring |
---|---|
Mark Twain, Die Abenteuer Tom Sawyers | “selig sind, die da arm sind im Geiste, denn” (9 words, close) |
Lewis Carroll, Alice’s Abenteuer im Wunderland | “Edwin und Morcar, Grafen von Mercia und” (7 words) |
Friedrich Nietzsche, Also sprach Zarathustra | “stiftete mehr Leid, als die Thorheiten der Mitleidigen? Wehe allen Liebenden, die nicht noch eine Höhe haben, welche über ihrem Mitleiden ist! Also sprach der Teufel einst zu mir: «auch Gott hat seine Hölle: das ist seine Liebe zu den Menschen.» Und jüngst hörte ich ihn diess Wort sagen: «Gott ist todt; an seinem Mitleiden mit den Menschen ist Gott gestorben.»” (61 words) |
Thomas Mann, Buddenbrooks | “Mit raschen Schritten, die Arme ausgebreitet und den Kopf zur Seite geneigt, in der Haltung eines Mannes, welcher sagen will: Hier bin ich! Töte mich, wenn du willst!” (28 words) |
Goethe, Faust | “Kühn ist das Mühen, Herrlich der Lohn! Und die” (9 words) |
Dante Alighieri, Die Göttliche Komödie | “Da kehrt er sich zu mir” (6 words) |
Immanuel Kant, Kritik der reinen Vernunft | “als solche, selbst ein von ihnen unterschiedenes Beharrliches, worauf in Beziehung der Wechsel derselben, mithin mein Dasein in der Zeit, darin sie wechseln, bestimmt werden” (25 words, in quotes) |
Thomas Mann, Der Tod in Venedig | “diesem Augenblick dachte er an” (6 words) |
Sigmund Freud, Die Traumdeutung | “ich muß auch auf einen anderen im sprachlichen Ausdruck enthaltenen Zusammenhang hinweisen. In unseren Landen existiert eine unfeine Bezeichnung für den masturbatorischen Akt: sich einen ausreißen oder sich einen” (29 words, in footnote) |
Franz Kafka, Die Verwandlung | “daß sein Körper zu breit war, um” (7 words) |

Text | Maximal Repeated Substring |
---|---|
Voltaire, Candide ou l’optimisme | “voyez, tome XXI, le chapitre XXXI du Précis du Siècle de Louis XV. B.” (14 words, in footnote) |
Alexandre Dumas, Le comte de Monte-Cristo, Tome I | “le procureur du roi est prévenu, par un ami du trône et de la religion, que le nommé Edmond Dantès, second du navire le Pharaon, arrivé ce matin de Smyrne, après avoir touché à Naples et à Porto-Ferrajo, a été chargé, par Murat, d’une lettre pour l’usurpateur, et, par” (49 words, in quotes) |
Victor Hugo, L’homme qui rit | “trois hommes d’équipage, le patron ayant été enlevé par un coup de mer, il ne reste que” (17 words, in quotes) |
Gustave Flaubert, Madame Bovary | “et madame Tuvache, la femme du maire,” (7 words) |
Victor Hugo, Les miserables, Tome I | “livres Pour la société de charité maternelle” (7 words, close) |
Descartes, Oeuvres. Tome Premier | “que toutes les choses que nous concevons fort clairement et fort distinctement sont toutes” (14 words, many times in paraphrases) |
François Villon, Oeuvres completes | “mes lubres sentemens, Esguisez comme une pelote, M’ouvrist plus que tous les Commens D’Averroys sur” (15 words, in quotes in footnote in preface) |
Stendhal, Le Rouge et le Noir | “Which now shows all the beauty of the sun And by and by a cloud takes all away!” (18 words, in quotes) |
Alexandre Dumas, Les trois mousquetaires | “murmura Mme Bonacieux. «Silence!» dit d’Artagnan en lui” (9 words, close) |
Jules Verne, Vingt mille lieues sous les mers | “à la partie supérieure de la coque du «Nautilus», et” (10 words) |
Jules Verne, Voyage au centre de la terre | “D0 E6 B3 C5 BC D0 B4 B3 A2 BC BC C5 EF «Arne” (14 words, in quotes) |

Text | Maximal Repeated Substring |
---|---|
Jacques Casanova de Seingalt, Complete Memoirs | “but not deaf. I am come from the Rhone to bathe you. The hour of Oromasis has begun.»” (18 words, in quotes) |
Thomas Babington Macaulay, Critical and Historical Essays, Volume II | “therefore there must be attached to this agency, as that without which none of our responsibilities can be met, a religion. And this religion must be that of the conscience of the” (32 words, in quotes, close) |
Charles Darwin, The Descent of Man and Selection in Relation to Sex | “Variability of body and mind in man-Inheritance-Causes of variability-Laws of variation the same in man as in the lower animals–Direct action of the conditions of life-Effects of the increased use and disuse of parts-Arrested development-Reversion-Correlated variation-Rate of increase-Checks to increase-Natural selection-Man the most dominant animal in the world-Importance of his corporeal structure-The causes which have led to his becoming erect-Consequent changes of structure-Decrease in size of the canine teeth-Increased size and altered shape of the skull-Nakedness-Absence of a tail-Defenceless condition of man.” (86 words, in the table of contents, undeleted by omission) |
Jules Verne, Eight Hundred Leagues on the Amazon | “After catching a glimpse of the hamlet of Tahua-Miri, mounted on its piles as on stilts, as a protection against inundation from the floods, which often sweep up” (28 words, close, probably by mistake) |
William Shakespeare, First Folio/35 Plays | “And so am I for Phebe Phe. And I for Ganimed Orl. And I for Rosalind Ros. And I for no woman Sil. It is to be all made of” (30 words, close) |
Jules Verne, Five Weeks in a Balloon | “forty-four thousand eight hundred and forty-seven cubic feet of” (9 words, close) |
Jonathan Swift, Gulliver’s Travels | “of meat and drink sufficient for the support of 1724” (10 words, in quotes, close) |
Jonathan Swift, The Journal to Stella | “chocolate is a present, madam, for Stella. Don’t read this, you little rogue, with your little eyes; but give it to Dingley, pray now; and I will write as plain as the” (32 words, in quotes in preface) |
George Smith, The Life of William Carey, Shoemaker & Missionary | “I would not go, that I was determined to stay and see the murder, and that I should certainly bear witness of it at the tribunal of” (27 words, in quotes) |
Albert Bigelow Paine, Mark Twain. A Biography | “going to kill the church thus with bad smells I will have nothing to do with this work of” (19 words, in quotes, close) |
Etienne Leon Lamothe-Langon, Memoirs of the Comtesse du Barry | “M. de Maupeou, the duc de la Vrilliere, and the” (10 words) |
Jules Verne, The Mysterious Island | “we will try to get out of the scrape” (9 words, in the same sentence) |
Willa Cather, One of Ours | “big type on the front page of the” (8 words, close) |
Jules Verne, Twenty Thousand Leagues under the Sea | “variety of sites and landscapes along these sandbanks and” (9 words, close) |

Text | Maximal Repeated Substring |
---|---|
Character-based unigram Text 1 | “ u ti t r ” |
Character-based unigram Text 2 | “e t tloeu ” |
Character-based unigram Text 3 | “o d t eie” |
Character-based unigram Text 4 | “s ei er e” |
Word-based unigram Text 1 | “of that A for” |
Word-based unigram Text 2 | “was the of in” |
Word-based unigram Text 3 | “in of of the” |
Word-based unigram Text 4 | “of of of of a” |

**Figure 1.** Character-based maximal repetition on the logarithmic-linear scale. The lines are the regression lines.

**Figure 2.** Character-based maximal repetition on the doubly-logarithmic scale. The lines are the regression lines.

**Figure 3.** Word-based maximal repetition on the logarithmic-linear scale. The lines are the regression lines.

**Figure 4.** Word-based maximal repetition on the doubly-logarithmic scale. The lines are the regression lines.
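
The quantity plotted in Figures 1 to 4 can be illustrated with a minimal sketch. It assumes that the maximal repetition of a sequence is the length of the longest block that occurs at two distinct, possibly overlapping, positions; this definition, the function names, and the whitespace tokenization are assumptions made for illustration and are not taken from the text above. The same routine serves both levels of description: pass a string for characters or a list of tokens for words.

```python
def find_repeat(seq, k):
    """Return some length-k block that occurs twice in seq, or None."""
    seen = set()
    for i in range(len(seq) - k + 1):
        block = tuple(seq[i:i + k])
        if block in seen:
            return block
        seen.add(block)
    return None


def maximal_repetition(seq):
    """Return (length, block) for the longest block occurring at two
    distinct positions of seq; overlapping occurrences are allowed."""
    # If a block of length k repeats, its prefix of length k-1 repeats too,
    # so the existence of a repeat is monotone in k and we can bisect.
    lo, hi = 0, len(seq) - 1
    best = ()
    while lo < hi:
        mid = (lo + hi + 1) // 2
        block = find_repeat(seq, mid)
        if block is not None:
            lo, best = mid, block
        else:
            hi = mid - 1
    return lo, best


# Character level: pass the string itself.
char_len, char_block = maximal_repetition("abracadabra")
# Word level: split on whitespace (a simplifying assumption).
word_len, word_block = maximal_repetition("to be or not to be".split())
print(char_len, "".join(char_block))
print(word_len, " ".join(word_block))
```

Hashing every block makes this sketch roughly quadratic in the worst case. For book-length inputs, a suffix array or suffix tree would be the usual choice.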

**Table 5.** The fitted parameters of Model (13). The values after the ± sign are the standard errors.

Level of Description | Class of Texts | $A$ | $\alpha$ |
---|---|---|---|
characters | German | $0.076\pm 0.011$ | $2.71\pm 0.07$ |
characters | English | $0.093\pm 0.012$ | $2.64\pm 0.06$ |
characters | French | $0.074\pm 0.009$ | $2.69\pm 0.06$ |
characters | unigram | $0.42\pm 0.03$ | $1.21\pm 0.03$ |
words | German | $0.059\pm 0.014$ | $2.18\pm 0.13$ |
words | English | $0.086\pm 0.019$ | $2.09\pm 0.11$ |
words | French | $0.069\pm 0.010$ | $2.08\pm 0.08$ |
words | unigram | $0.24\pm 0.03$ | $1.14\pm 0.06$ |
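
Model (13) is not reproduced in this section. The sketch below assumes it has the form $L = A(\log n)^{\alpha}$, where $L$ is the maximal repetition and $n$ the text length. That form is consistent with straight regression lines on the doubly-logarithmic scale, but it is an assumption here, as are the natural logarithm and the ordinary least-squares procedure. The function name and the numbers used for the check are hypothetical, not data from the tables.

```python
import math


def fit_power_of_log(lengths, repetitions):
    """Least-squares fit of log L = log A + alpha * log(log n).

    lengths     -- text lengths n (each must exceed 1)
    repetitions -- maximal repetition L observed at each length
    Returns the pair (A, alpha).
    """
    xs = [math.log(math.log(n)) for n in lengths]
    ys = [math.log(L) for L in repetitions]
    m = len(xs)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope on the log-log(log) scale
    A = math.exp(mean_y - alpha * mean_x)  # intercept, mapped back
    return A, alpha


# Synthetic, noise-free points generated from the assumed model itself,
# used only to check that the fit recovers the parameters.
true_A, true_alpha = 0.1, 2.5
ns = [10**3, 10**4, 10**5, 10**6]
Ls = [true_A * math.log(n) ** true_alpha for n in ns]
A_hat, alpha_hat = fit_power_of_log(ns, Ls)
print(round(A_hat, 6), round(alpha_hat, 6))
```

Changing the base of the logarithm rescales $A$ but leaves $\alpha$ unchanged, so only the exponent is comparable across conventions. The standard errors reported in Table 5 would need a residual-based estimate, which this sketch omits.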

## 4. Conclusions

## Acknowledgments

## Conflicts of Interest

## Appendix

## A. Proof of Theorem 1

## B. Proof of Theorem 3

**Lemma 1.**

**Proof.**


© 2015 by the author; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Dębowski, Ł.
Maximal Repetitions in Written Texts: Finite Energy Hypothesis vs. Strong Hilberg Conjecture. *Entropy* **2015**, *17*, 5903-5919.
https://doi.org/10.3390/e17085903
