Article

Readability Indices Do Not Say It All on a Text Readability

Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, 20133 Milan, Italy
Analytics 2023, 2(2), 296-314; https://doi.org/10.3390/analytics2020016
Submission received: 2 February 2023 / Revised: 2 March 2023 / Accepted: 21 March 2023 / Published: 30 March 2023

Abstract

We propose a universal readability index, $G_U$, applicable to any alphabetical language and related to cognitive psychology, the theory of communication, phonics and linguistics. This index also considers readers’ short-term-memory processing capacity, here modeled by the word interval $I_P$, namely, the number of words between two interpunctions. No current readability formula considers $I_P$, but scatterplots of $I_P$ versus a readability index show that texts with the same readability index can have very different $I_P$, ranging from 4 to 9, practically Miller’s range, which refers to 95% of readers. It is unlikely that $I_P$ has no impact on reading difficulty. The examples shown are taken from Italian and English literature, and from the translations of the New Testament in Latin and in contemporary languages. We also propose an extremely compact formula relating the capacity of human short-term memory to the difficulty of reading a text. It should synthetically model human reading difficulty, a kind of “footprint” of humans. However, further experimental and multidisciplinary work is necessary to confirm our conjecture about the dependence of a readability index on a reader’s short-term-memory capacity.

1. Introduction

First developed in the United States [1,2,3,4,5,6,7,8,9], readability formulae are applicable to any alphabetical language. They are based on the length of words and sentences, and therefore they allow the comparison of different texts automatically and objectively to assess the difficulty that readers may find in reading them. From the point of view of the writer, a readability formula allows the design of the best possible match between readers and texts. Many readability formulae have been proposed for English [6], and only some for very few languages [10].
In Reference [11] we have defined a global readability formula applicable to any alphabetical language, based on a calque of the readability formula used in Italian [12], both for providing it for languages that have none, and also for estimating, on common grounds, the readability of texts belonging to different languages/translations.
In fact, because an “absolute” readability formula—i.e., a formula that provides numerical indices related to a universal origin, such as “zero”—might not exist at all, the readability formula proposed in Reference [11] can be used to compare different texts, because what counts, in this comparison, is the difference between numerical values. In other words, differences give more insight than absolute values for the purpose of comparing texts [11].
As the title of this article claims, however, no current readability formula says everything about the readability of a text, because it neglects the response of readers’ short-term memory to the partial stimuli contained in a sentence, i.e., to how the words of a sentence are punctuated, a process described by the word interval $I_P$ [13]. All readability formulae neglect, in fact, the connection between the short-term memory capacity of readers (approximately described by Miller’s 7 ± 2 law [14]) and the word interval $I_P$, a connection which appears, at least empirically, justified and natural [11,13,15,16,17].
The purpose of this article is to propose a universal readability formula, applicable to any alphabetical language, which includes the effect of short-term memory capacity. We base this formula on the global readability formula defined in Reference [11], which we will modify by including the word interval $I_P$.
After this Introduction, Section 2 revisits the classical readability formula of Italian and its relationship with the Flesch Reading Ease Index and the Automated Readability Index, largely used in English texts; Section 3 summarizes the relationship between the word interval (number of words between two interpunctions, modeling the short-term memory capacity [13]) and the number of words per sentence; examples are drawn from Italian [13] and English literature [17]; Section 4 defines and discusses our proposal of a universal readability index; Section 5 proposes a synthetic readability index of humans, a kind of “footprint” that links human short-term memory to reading difficulty; finally Section 6 draws a conclusion and suggests future work.

2. A Readability Formula for Alphabetical Languages

The observation that differences are more important than absolute values in using readability formulae [13] justifies the development of a readability formula that can be used to compare texts, even those written in different languages [15]. For most languages, in fact, no readability formula has been defined, and only a few adapt English formulae to their texts [10,18]. The proposed formula, of course, does not exclude using other readability formulae specifically devised for a language (e.g., the large choice for English [4,6]), but it allows the comparison, on the same ground, of the readability of texts written in any language and in translation.
For this purpose, we have proposed in Reference [11] to adopt, as a reference, the readability formula developed for Italian, known by the acronym GULPEASE [12]:
$$G = 89 - 10\,C_P + \frac{300}{P_F} \tag{1}$$
In Equation (1), $C_P$ is the number of characters per word, and $P_F$ is the number of words per sentence. Notice that, like all readability formulae, Equation (1) does not contain any reference to interpunctions (besides, of course, full stops, question marks and exclamation marks, which determine the length of sentences), and therefore it does not consider the parameter very likely linked to the short-term memory capacity, namely the word interval $I_P$ [13].
$G$ can be interpreted as a readability index by considering the number of years of school attended in Italy’s school system (see Reference [12]), as shown in Figure 1. The larger $G$, the more readable the text for any number of school years.
The continuous lines shown in Figure 1 divide the quadrant into areas of the same performance of texts, such as “almost unintelligible”, “very difficult”, etc. For example, the area labelled “easy” indicates all combinations of values of G and school years required to declare a text “easy” to read. In all cases, it is shown that, as the number of school years of the reader increases, the readability index he/she can tolerate decreases.
In Reference [11] we have shown, for Italian literature, that the term $10\,C_P$ varies very little from text to text and across seven centuries, while the term $300/P_F$ varies very much and, in practice, determines the value of the readability index.
Equation (1) says that a text is more difficult to read if $P_F$ is large, i.e., if sentences are long, and if $C_P$ is large, i.e., if words are long. In other words, a text is easier to read if it contains short words and short sentences, a result that is predicted by any known readability formula and should be true, of course, in any language.
In Reference [11], we have proposed the adoption of Equation (1) also for other languages, such as those listed in Table 1, by scaling the constant 10 according to the ratio between the average number of characters per word in Italian, $\langle C_{P,ITA} \rangle = 4.48$, and the average number of characters per word in another language, e.g., $\langle C_{P,ENG} \rangle = 4.24$ for English. The rationale for this choice is that $C_P$ is a parameter typical of a language which, if not scaled, would bias $G$ without really quantifying the change in reading difficulty for readers, who are surely accustomed to reading, in their language, words that are on average shorter or longer than those found in Italian. This scaling, therefore, avoids changing $G$ for the sole reason that a language has, on average, words shorter or longer than Italian. In any case, as recalled above, $C_P$ affects a readability formula much less than $P_F$ [13].
On the other hand, we have maintained the constant 300 because $P_F$ depends significantly on the author’s style [13,15], not on language. Finally, notice that the constant 89 just sets the absolute ordinate scale, and therefore it has no impact on comparisons [13].
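As a minimal sketch of Equation (1), the following Python snippet computes $G$ from a raw text. The tokenization rules are assumptions made only for illustration (words are runs of letters; sentences end at full stops, question marks or exclamation marks), and the function names are hypothetical, not taken from Reference [12].

```python
import re

def gulpease(chars_per_word, words_per_sentence):
    """Equation (1): G = 89 - 10*C_P + 300/P_F."""
    return 89.0 - 10.0 * chars_per_word + 300.0 / words_per_sentence

def text_statistics(text):
    """Return (C_P, P_F) under the simple tokenization assumed here."""
    words = re.findall(r"[^\W\d_]+", text)  # runs of letters only
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    c_p = sum(len(w) for w in words) / len(words)  # characters per word
    p_f = len(words) / len(sentences)              # words per sentence
    return c_p, p_f

c_p, p_f = text_statistics("The cat sat. The dog ran away!")
print(round(c_p, 2), round(p_f, 2), round(gulpease(c_p, p_f), 2))
```

With these definitions, shortening the sentences raises the term $300/P_F$ and therefore $G$, while lengthening the words lowers $G$, as stated above.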
In conclusion, in Reference [11] we have defined a global readability index applicable to texts written in a language as:
$$G = 89 - 10\,k\,C_P + \frac{300}{P_F} \tag{2}$$
with
$$k = \frac{\langle C_{P,ITA} \rangle}{\langle C_P \rangle} \tag{3}$$
By using Equations (2) and (3), we force the average value of the term $10\,k\,C_P$ of any language to be equal to that found in Italian, namely $10 \times 4.48$. Table 1 reports, for Greek, Latin and 35 contemporary languages, the average values of $C_P$ [11] and the calculated values of the constant $k$ of Equation (3). For example, for English texts, $C_P$ of a sample text is multiplied by 10.6 instead of 10; for Nahuatl (longer words), $C_P$ is multiplied by 6.7, and for Haitian (shorter words) by 13.3.
Notice that $k$ seems to be a stable factor. For example, in the sample of English literature studied in Reference [17], we have found $\langle C_{P,ENG} \rangle = 4.23$ (instead of the 4.24 of Table 1). Because the value found in Italian literature [13] is $\langle C_{P,ITA} \rangle = 4.67$, we obtain $k = 4.67/4.23 = 1.10$, instead of the $k = 4.48/4.24 = 1.06$ of Table 1.
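The scaling of Equations (2) and (3) can be sketched as follows. This is only an illustration under the averages quoted in the text; the function names are hypothetical.

```python
C_P_ITALIAN = 4.48  # average characters per word in Italian, as quoted in the text

def scaling_factor(mean_cp_language, mean_cp_italian=C_P_ITALIAN):
    """Equation (3): k = <C_P,ITA> / <C_P>."""
    return mean_cp_italian / mean_cp_language

def global_index(c_p, p_f, k):
    """Equation (2): G = 89 - 10*k*C_P + 300/P_F."""
    return 89.0 - 10.0 * k * c_p + 300.0 / p_f

k_table = scaling_factor(4.24)             # English average quoted from Table 1
k_literature = scaling_factor(4.23, 4.67)  # literary corpora of References [17] and [13]
print(round(k_table, 2), round(k_literature, 2))
```

A text whose $C_P$ equals the average of its own language receives the same word-length term, $10 \times 4.48$, whatever the language, which is the purpose of the scaling.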
As recalled above, all readability formulae substantially tell the same story, and therefore they should be very similar and it is very likely that any one of them can be obtained from another. We illustrate this fact with an example.
Because English has more readability formulae than any other language, let us compare $G$ to the most classical English readability formula, proposed and amply discussed by Flesch [1,2] and known as the Flesch Reading Ease ($RE$) formula:
$$RE = 206.8 - 1.015\,w - 84.6\,s \tag{4}$$
In Equation (4), $w$ is the average number of words per sentence, and $s$ is the average number of syllables per word. Because the number of characters per word is, on average, proportional to the number of syllables per word, the parameter $s$ parallels $C_P$ and, of course, $w = P_F$.
How Equation (4) quantifies the degree of difficulty was defined by Flesch himself [1,2], and its values are reported on the right ordinate scale of Figure 1, for comparison with $G$ (left ordinate scale). Figure 2 shows the scatterplot of the values calculated with the global readability index $G$, Equation (2), versus those calculated with $RE$, Equation (4), according to WinWord, for the novels from English literature [17] listed in Table 2.
We can notice a fair agreement between the two indices, with a correlation coefficient of 0.850. The bias could be compensated by downscaling $RE$.
The attribution of the grade level $GL$ in the USA school system was defined by Kincaid et al. [3], by using the same parameters $w$ and $s$. The grade level is similar to that attributed to $G$.
Another readability formula, the Automated Readability Index ($ARI$), was also defined by Kincaid et al. for specific military documents [3]. It is directly related to $G$ because it depends on the same parameters, $C_P$ and $P_F$:
$$ARI = 4.71\,C_P + 0.5\,P_F - 21.43 \tag{5}$$
As $ARI$ increases, the age required of readers increases too. Figure 3 shows the scatterplot between the global $G$, Equation (2), and $ARI$, for the same English novels considered in Figure 2. We can see a very tight relationship for fixed $C_P$.
In conclusion, the global readability formula, Equation (2), provides a readability index that can be directly scaled to $ARI$ and approximately also to $RE$. For this reason, we continue studying $G$, which we will modify by introducing the word interval $I_P$ to obtain the universal readability formula/index mentioned above. To do so we need to recall, in the next section, some fundamental knowledge on $I_P$.
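The tight relationship between $G$ and $ARI$ for fixed $C_P$ can be made explicit by eliminating $P_F$ between Equations (2) and (5). The sketch below is only an illustration of that algebra, with hypothetical function names; it also includes Equation (4) for completeness.

```python
def flesch_reading_ease(w, s):
    """Equation (4): RE = 206.8 - 1.015*w - 84.6*s."""
    return 206.8 - 1.015 * w - 84.6 * s

def automated_readability_index(c_p, p_f):
    """Equation (5): ARI = 4.71*C_P + 0.5*P_F - 21.43."""
    return 4.71 * c_p + 0.5 * p_f - 21.43

def global_index(c_p, p_f, k=1.0):
    """Equation (2): G = 89 - 10*k*C_P + 300/P_F."""
    return 89.0 - 10.0 * k * c_p + 300.0 / p_f

def g_from_ari(ari, c_p, k=1.0):
    """Solve Equation (5) for P_F at fixed C_P, then apply Equation (2)."""
    p_f = (ari + 21.43 - 4.71 * c_p) / 0.5
    return global_index(c_p, p_f, k)

# For a fixed C_P, G decreases monotonically as ARI increases.
for p_f in (10.0, 20.0, 30.0):
    ari = automated_readability_index(4.24, p_f)
    print(round(ari, 2), round(g_from_ari(ari, 4.24, k=1.06), 2))
```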

3. Word Interval and Short-Term Memory

As we have discussed in References [11,13,15], the word interval $I_P$, namely the number of words per interpunction, varies in the same range as the short-term memory capacity given by Miller’s 7 ± 2 law [14], a range that includes 95% of all cases. Very likely the two ranges are deeply related, because interpunctions organize small portions of more complex arguments (which make a sentence) in short chunks of text, which are the natural input to short-term memory [19,20,21,22,23,24,25,26,27]. Moreover, $I_P$, drawn against the number of words per sentence, $P_F$, tends to approach a horizontal asymptote as $P_F$ increases, and this occurs both in ancient classical languages (Greek and Latin) and in contemporary languages, as shown in References [11,13] by studying translations of the New Testament books from Greek. In other words, even if sentences get longer, $I_P$ cannot get larger than about the upper limit of Miller’s law (namely 9), because of the constraints imposed by the short-term memory capacity of readers and writers alike.
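A minimal sketch of how the word interval and the words per sentence could be counted in a text is given below. The set of interpunctions is an assumption made for illustration (full stop, comma, semicolon, colon, question mark, exclamation mark), as is the definition of a word as a run of letters; the function names are hypothetical.

```python
import re

INTERPUNCTIONS = ".,;:!?"  # assumed set of interpunctions
SENTENCE_ENDS = ".!?"      # marks that close a sentence

def count_words(text):
    """Count runs of letters as words."""
    return len(re.findall(r"[^\W\d_]+", text))

def word_interval(text):
    """I_P: number of words per interpunction."""
    marks = sum(1 for ch in text if ch in INTERPUNCTIONS)
    if marks == 0:
        raise ValueError("text contains no interpunctions")
    return count_words(text) / marks

def words_per_sentence(text):
    """P_F: number of words per sentence."""
    ends = sum(1 for ch in text if ch in SENTENCE_ENDS)
    if ends == 0:
        raise ValueError("text contains no sentence-ending marks")
    return count_words(text) / ends

sample = "When the rain stopped, we left the house; the road was empty. Nobody spoke!"
print(word_interval(sample), words_per_sentence(sample))
```

Note that this simple counter treats every mark separately, so an ellipsis written as three full stops would be counted three times; a production tool would need a more careful rule.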
The average value of $I_P$ can be empirically related to the average value of $P_F$ according to the non-linear relationship [13]:
$$\langle I_P \rangle = \left(I_{P\infty} - 1\right)\left[1 - e^{-\frac{\langle P_F \rangle - 1}{P_{Fo} - 1}}\right] + 1 \tag{6}$$
where $I_{P\infty}$ gives the horizontal asymptote, and $P_{Fo}$ gives the value of $\langle P_F \rangle$ at which the exponential falls to $1/e$ of its maximum value.
Equation (6) is a good average mathematical model for Italian literature [13] and also for Greek, Latin and contemporary languages [11,15]. Reference [11] reports the values of $I_{P\infty}$ and $P_{Fo}$ for each language considered.
Here, for a smaller but useful corpus of English literature recently studied in Reference [17], we have carried out the same analysis as for the large corpus of Italian literature [13], and have calculated the best-fit values of Equation (6). Figure 4 shows the scatterplot of $I_P$ versus $P_F$ (values calculated for each chapter) and the best-fit curve, with $I_{P\infty} = 6.70$ and $P_{Fo} = 6.78$, to be compared with $I_{P\infty} = 7.37$ and $P_{Fo} = 10.22$ for Italian literature, whose curve is also drawn.
Notice that the constants for English literature differ from those reported in Reference [17] ($I_{P\infty} = 6.57$, $P_{Fo} = 4.16$) for the same literary corpus, because the latter were obtained by fitting Equation (6) to the average values of $I_P$ and $P_F$, not to the per-chapter samples that give the scatterplot drawn in Figure 4. The different values are due, of course, to the non-linear best fit.
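The saturating behaviour of Equation (6) can be checked numerically with the best-fit constants quoted above. The sketch assumes the model has the form $\langle I_P \rangle = (I_{P\infty} - 1)\,[1 - e^{-(\langle P_F \rangle - 1)/(P_{Fo} - 1)}] + 1$, with $I_{P\infty}$ the asymptote; the function name is hypothetical.

```python
import math

def mean_word_interval(p_f, ip_inf, p_fo):
    """Assumed form of Equation (6): starts at 1 for P_F = 1, saturates at ip_inf."""
    return (ip_inf - 1.0) * (1.0 - math.exp(-(p_f - 1.0) / (p_fo - 1.0))) + 1.0

ENGLISH = (6.70, 6.78)   # (I_P asymptote, P_Fo), English literature, per-chapter fit
ITALIAN = (7.37, 10.22)  # (I_P asymptote, P_Fo), Italian literature

for p_f in (5, 10, 20, 40, 80):
    print(p_f,
          round(mean_word_interval(p_f, *ENGLISH), 2),
          round(mean_word_interval(p_f, *ITALIAN), 2))
```

For long sentences both curves flatten below Miller’s upper limit of 9, in agreement with the asymptotes 6.70 and 7.37.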
Now, as we have recalled in Section 2, any readability index is practically a function only of $P_F$. Readability formulae do not consider $I_P$, but the scatterplots of $I_P$ versus $G$ tell an interesting story: texts with the same $G$ do not show the same $I_P$. In other words, according to the theory of readability formulae, a text with a given index should be readable with the same effort both by readers who display a powerful short-term memory processing capacity (large $I_P$) and by readers who do not (small $I_P$). For example, for $G = 60$ (“easy/standard” texts for readers with 8 years of school, Figure 1), Figure 5 shows that $I_P$ can vary from 4 to 9. This is practically Miller’s range, which refers to 95% of readers [14]. We think that these readers should be distinguished, and therefore our aim is to propose, in the next section, a possible “universal” readability index, $G_U$, based on $G$, which includes $I_P$.

4. A Universal Readability Formula

We suppose that the global readability index $G$ should be modified by introducing a function that depends linearly on $I_P$. Our hypothesis is based on Miller’s law, which quantifies linearly the processing capacity of the short-term memory. Moreover, the function should not change the global value for a reader with an “average” short-term memory processing capacity. For words, this average is not 7, but about 6 [1,28]; therefore, in the following we assume this latter value. Notice that 6.03 is the average value of $I_P$ (standard deviation 1.11) of the data listed in Table 2 of Reference [11], a further indication of its barycentric value.
We write our proposed universal readability formula as:
$$G_U = G - \frac{\Delta G}{\Delta I_P}\left(I_P - 6\right) \tag{7}$$
where G is given by Equation (2).
We assume that the numerical value of the discrete derivative $\Delta G / \Delta I_P$ is given by:
$$\frac{\Delta G}{\Delta I_P} = \frac{G_{max} - G_{min}}{I_{P,max} - I_{P,min}} \tag{8}$$
In Equation (8), the numerical values are the maximum and minimum averages found in the Italian literature—see Reference [13], whose oldest texts (seven centuries old, e.g., Boccaccio’s Decameron) are still read today in Italian high schools with a reasonable effort, a possibility not available in other Western languages.
From [13], we calculate:
$$\frac{\Delta G}{\Delta I_P} = \frac{69.84 - 49.54}{8.24 - 4.94} = 6.15 \approx 6.00 \tag{9}$$
Therefore, the proposed universal readability formula is given by
$$G_U = G - 6\left(I_P - 6\right) \tag{10}$$
Equation (10) sets $G_U = G$ for $I_P = 6$, $G_U < G$ for $I_P > 6$ and $G_U > G$ for $I_P < 6$. In other words, if a text with a given $G$ has a small word interval $I_P$, then it should be read more easily than a text with the same $G$ but a larger $I_P$. For example, within Miller’s range of 5 to 9, texts with $G = 60$ would be transformed to $G_U = 66$ for $I_P = 5$ and to $G_U = 42$ for $I_P = 9$. Therefore, in the first case, the text considered “easy” after 8 years of school (Figure 1) is considered “easy” to read already after 7.2 years of school; in the second case, the text would be considered “easy” only after about 13.2 years of school. The difference between the two indices is therefore very large: $66 - 42 = 24$, corresponding to $13.2 - 7.2 = 6$ years of school. This significant difference would be lost in the original formula of Equation (2), or in any other readability formula.
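The arithmetic of Equations (8) to (10) can be reproduced with a few lines of Python. This is a sketch with hypothetical function names; the numerical inputs are those quoted in the text.

```python
def slope_from_extremes(g_max, g_min, ip_max, ip_min):
    """Equation (8): discrete derivative of G with respect to I_P."""
    return (g_max - g_min) / (ip_max - ip_min)

def universal_index(g, i_p, slope=6.0, ip_ref=6.0):
    """Equation (10): G_U = G - 6*(I_P - 6), with the slope rounded to 6."""
    return g - slope * (i_p - ip_ref)

# Equation (9), from the extreme averages found in Italian literature [13].
print(round(slope_from_extremes(69.84, 49.54, 8.24, 4.94), 2))

# Texts sharing G = 60 but with different word intervals.
for i_p in (5, 6, 9):
    print(i_p, universal_index(60.0, i_p))
```

Keeping the slope and the reference value as parameters makes it easy to test how sensitive $G_U$ is to the rounding from 6.15 to 6.00.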
Figure 6 shows the scatterplots between $G_U$ and $I_P$ (blue circles) for the samples concerning the literary texts considered in Italian [13] and in English literature (Table 2). Compared to the scatterplots of Figure 5 (redrawn in Figure 6 with red circles), the difference between $G_U$ and $G$ is evident: the linear dependence of $G_U$ on $I_P$, according to Equation (10), spreads the values around a line and introduces significant correlation coefficients, −0.9016 for Italian and −0.7730 for English. The regression line:
$$G_U = -a\,I_P + b \tag{11}$$
is very similar in the two languages:
$$G_{U,ITA} = -9.47\,I_P + 115.71 \tag{12}$$
$$G_{U,ENG} = -8.88\,I_P + 111.64 \tag{13}$$
This result indicates that Equation (11) might be “universal”.
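How close the two regression lines are can be seen by evaluating them over Miller’s range. The sketch assumes that both slopes are negative, as implied by Equation (10), where $G_U$ decreases as $I_P$ increases; the function names are hypothetical.

```python
def gu_italian(i_p):
    """Regression line for Italian literature: G_U = -9.47*I_P + 115.71."""
    return -9.47 * i_p + 115.71

def gu_english(i_p):
    """Regression line for English literature: G_U = -8.88*I_P + 111.64."""
    return -8.88 * i_p + 111.64

for i_p in range(5, 10):  # Miller's range, 5 to 9 words
    ita, eng = gu_italian(i_p), gu_english(i_p)
    print(i_p, round(ita, 2), round(eng, 2), round(ita - eng, 2))
```

Under this assumption the two lines differ by less than 1.5 units of $G_U$ everywhere between $I_P = 5$ and $I_P = 9$.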
Finally, some specific examples concerning novels taken from Italian and English literatures will further illustrate the relationship between G and G U .
Table 3 shows how the readability index is modified from $G$ to $G_U$ for some Italian novels written from the XIV to the XX century [13]. For example, it is interesting to notice how $G$ is transformed into $G_U$ for the two novels written by Alessandro Manzoni.
Alessandro Manzoni (Milan 1785, Milan 1873), one of the most studied Italian novelists in Italian high schools (Licei) and universities, published in 1827 Fermo e Lucia (Fermo and Lucia), a text that scholars of Italian literature, and Manzoni himself, consider the “first” version of his masterpiece I Promessi Sposi (The Betrothed, available in a new English translation [29]), published in the years 1840–1842. According to scholars of Italian literature [30,31,32,33], the two versions differ very much, both in story structure and characters and, as far as we are concerned here, also in style and language; therefore, it is interesting to see how much the author transformed (mathematically) Fermo e Lucia into I Promessi Sposi, a study partially carried out in References [13,15].
As far as readability is concerned, from Table 3 we notice a large improvement in I Promessi Sposi, compared to Fermo e Lucia, if differences are considered. In fact, $G = 51.72$ in Fermo e Lucia and $G = 56.00$ in I Promessi Sposi, a difference of only 4.28 units, leading to a decrease in school years (for “easy” reading, Figure 1) of only about 0.8 years. This difference does not account for the different reading difficulty of the two texts discussed by scholars of Italian literature [30,31,32,33]. However, if we consider $G_U$, then the difference is quite large, very likely measuring the relative reading difficulty, because $G_U$ ranges from 44.70 to 60.20, a difference of 15.5 units leading to a decrease in school years (for “easy” reading, Figure 1) from 11.8 (Fermo e Lucia) to only 8 (I Promessi Sposi), in agreement with the judgment of scholars of Italian literature [30,31,32,33]. In conclusion, $G_U$ is a better estimate than $G$ in assessing the difference in reading difficulty between these two much-studied novels.
Table 2 also shows how the readability index is modified from $G$ to $G_U$ in some English novels.
As we can read from Table 2, in Robinson Crusoe the readability index decreases from 50.84 to 42.22, therefore passing from about 10.3 to 12.4 years of school for “easy” reading (Figure 1). For Hemingway’s novels, The Sun Also Rises is more readable ($G_U = 72.45$) than A Farewell to Arms ($G_U = 66.99$); the order given by $G$, i.e., 72.58 and 73.17, respectively, is reversed, therefore reducing the number of years of school required for “easy” reading by 1 (Figure 1). The Hound of The Baskervilles changes its readability index from 60.27 to 46.16, therefore passing from 8 to 11.5 years of school for “easy” reading (Figure 1).
In conclusion, by introducing the word interval $I_P$ in the definition of a readability index, as in Equation (10), readability differences in texts are more “fine-tuned” for readers.

5. A “Footprint” of Humans

As already recalled, in Reference [11] we have studied the translation of the New Testament from Greek to Latin and to contemporary languages. For all these translations, we have recently calculated the scatterplots between $G$ and $I_P$, and between $G_U$ and $I_P$, with results very similar to those shown in Figure 6. Some specific examples are reported in Appendix A. Similarly, we have calculated the linear best fit between $G_U$ and $I_P$. Appendix B lists the values of the constants $a$ and $b$ of Equation (11) for each translation/language.
This set of values is useful because it could be used to compare texts written in any language. For example, in David Copperfield $G_U$ is estimated to be 61.82 with Equation (13) and 66.02 with the values of Appendix B (the experimental average value is 59.66, Table 2); in The Hound of The Baskervilles, $G_U$ is estimated to be 42.11 with Equation (13) and 48.53 with the values of Appendix B (the experimental average value is 46.16, Table 2). In the first case, the difference in the readability of the two novels is 19.71, and in the second case it is 17.49, which implies an “error” of about 0.25 years of school (Figure 1).
It may be interesting to consider the most compact relationship between $G_U$ and $I_P$, given by the overall average values of the constants reported in Appendix B:
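The two pairs of estimates quoted above can be cross-checked numerically. The text does not state the word intervals used, so the sketch below recovers them by inverting the English regression line with slope 8.88 and intercept 111.64, and then applies the English constants of Appendix B (7.88 and 110.23). Slopes are taken as negative, and the function names are hypothetical.

```python
A_LIT, B_LIT = 8.88, 111.64  # English literature regression (slope magnitude, intercept)
A_NT, B_NT = 7.88, 110.23    # English constants listed in Appendix B

def implied_word_interval(g_u, a, b):
    """Invert G_U = -a*I_P + b to recover I_P."""
    return (b - g_u) / a

def regression_estimate(i_p, a, b):
    """Evaluate G_U = -a*I_P + b."""
    return -a * i_p + b

estimates = {}
for title, gu_lit in [("David Copperfield", 61.82),
                      ("The Hound of The Baskervilles", 42.11)]:
    i_p = implied_word_interval(gu_lit, A_LIT, B_LIT)
    estimates[title] = round(regression_estimate(i_p, A_NT, B_NT), 2)
    print(title, round(i_p, 2), estimates[title])
```

The recovered word intervals (about 5.6 and 7.8 words) reproduce the second estimate of each pair, which supports reading both regression lines with a negative slope.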
$$G_U = -8.94\,I_P + 116 \tag{14}$$
Figure 7 shows this average relationship together with the ±1 standard deviation bounds. These extremely compact curves can synthetically represent how the capacity of human short-term memory (modelled by $I_P$) is related to the difficulty of reading a text, in any alphabetical language; therefore, they may be considered a kind of “footprint” of humans.
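As an illustration, the average relationship can be tabulated over Miller’s range. The sketch assumes the compact formula reads $G_U = -8.94\,I_P + 116$ (negative slope, overall averages of Appendix B); the function name is hypothetical, and the standard deviation bounds of Figure 7 are not reproduced because the text does not specify how they are computed.

```python
def footprint(i_p):
    """Compact average relationship: G_U = -8.94*I_P + 116."""
    return -8.94 * i_p + 116.0

for i_p in range(5, 10):  # Miller's range, 5 to 9 words
    print(i_p, round(footprint(i_p), 2))
```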

6. Conclusions

We have proposed a universal readability index, $G_U$, Equation (10). Compared to the current readability indices, this index also considers readers’ short-term memory processing capacity, here described by the word interval $I_P$, namely, the number of words between two interpunctions. The observation that differences give more insight than absolute values has justified, we think, the development of a universal readability formula which is useful for comparing texts written even in different languages, is applicable to alphabetical languages, and is related to cognitive psychology, the theory of communication, phonics and linguistics.
Scholars have never considered including the word interval $I_P$ in the current readability formulae, but the scatterplots of $I_P$ versus any readability index show that texts with the same readability index can have very different values of $I_P$. Now, it is unlikely that $I_P$ has no impact on reading difficulty. By introducing $I_P$ in the definition of a readability index, readability differences in texts are better “fine-tuned” for readers, e.g., to their school years as a reference. We have used the global readability index developed for Italian [11], after showing that Flesch’s index and ARI are connected to this index because they depend on the same variables.
We have calculated an extremely compact formula, Equation (14), which can measure how the capacity of human short-term memory (modelled by $I_P$) is likely related to the difficulty of reading a text, measured by the universal readability index $G_U$ defined here. We think that it synthetically models human reading difficulty, i.e., it might be considered a “footprint” of humans.
However, there is an important aspect to be considered. Because, as far as we know, there are no direct experiments on the relationship between readability and short-term memory capacity, the universal index here proposed, Equation (10), should be considered a first step in researching this important relationship. Therefore, further work needs to be carried out by a multidisciplinary team of researchers to fully validate Equation (10).

Funding

This research received no external funding.

Data Availability Statement

Data are available, on request, from the author.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

Scatterplots between $G_U$ and $I_P$ for selected languages. We show the scatterplots between $G$ and $I_P$ (red circles), and between $G_U$ and $I_P$ (blue circles), for some selected languages.
Figure A1. Scatterplot of the readability index $G_U$ versus the word interval $I_P$ (blue circles). The red circles refer to the scatterplot of $G$ versus $I_P$. (a): Greek; (b): Latin. Miller’s bounds are given by $I_P = 7 \pm 2$. The values of $a$ and $b$ of Equation (11) and the correlation coefficient between $G_U$ and $I_P$ are reported in Table A1.
Figure A2. Scatterplot of the readability index $G_U$ versus the word interval $I_P$ (blue circles). The red circles refer to the scatterplot of $G$ versus $I_P$. (a): Spanish; (b): Portuguese. Miller’s bounds are given by $I_P = 7 \pm 2$. The values of $a$ and $b$ of Equation (11) and the correlation coefficient between $G_U$ and $I_P$ are reported in Table A1.
Figure A3. Scatterplot of the readability index $G_U$ versus the word interval $I_P$ (blue circles). The red circles refer to the scatterplot of $G$ versus $I_P$. (a): French; (b): German. Miller’s bounds are given by $I_P = 7 \pm 2$. The values of $a$ and $b$ of Equation (11) and the correlation coefficient between $G_U$ and $I_P$ are reported in Table A1.

Appendix B

Table A1. Values of $a$ and $b$ of Equation (11), and correlation coefficient between $G_U$ and $I_P$, for the indicated languages [11].

Language      a               b                Correlation coefficient
Greek         8.62            113.66           −0.9477
Latin         10.59           120.82           −0.8666
Esperanto     9.87            114.20           −0.8803
French        7.46            107.51           −0.9311
Italian       7.80            108.54           −0.9065
Portuguese    8.34            112.33           −0.8261
Romanian      8.08            111.11           −0.8163
Spanish       8.46            112.60           −0.9061
Danish        9.46            120.71           −0.9182
English       7.88            110.23           −0.9129
Finnish       10.06           118.22           −0.8057
German        8.68            113.79           −0.8563
Icelandic     8.68            114.98           −0.8848
Norwegian     7.32            110.28           −0.9426
Swedish       7.32            109.98           −0.9546
Bulgarian     9.00            117.63           −0.8697
Czech         10.41           125.50           −0.8269
Croatian      9.86            122.33           −0.8868
Polish        9.98            123.60           −0.7160
Russian       10.70           118.04           −0.7326
Serbian       8.71            117.24           −0.8312
Slovak        10.03           124.83           −0.8417
Ukrainian     8.34            113.42           −0.7092
Estonian      9.97            120.11           −0.8643
Hungarian     10.83           118.91           −0.8034
Albanian      8.01            107.04           −0.8776
Armenian      12.11           133.87           −0.7805
Welsh         7.74            103.12           −0.7828
Basque        9.99            117.48           −0.8361
Hebrew        10.27           129.58           −0.8163
Cebuano       6.97            107.50           −0.9683
Tagalog       7.78            112.54           −0.9188
Chichewa      8.40            118.76           −0.9325
Luganda       8.69            118.42           −0.8713
Somali        8.65            113.41           −0.9492
Haitian       8.25            115.41           −0.9132
Nahuatl       7.55            113.02           −0.9420
Overall       8.94 ± 1.22     116.00 ± 6.49    0.8681 ± 0.0661

References

  1. Flesch, R. A New Readability Yardstick. J. Appl. Psychol. 1948, 32, 222–233. [Google Scholar] [CrossRef] [PubMed]
  2. Flesch, R. The Art of Readable Writing; revised and enlarged edition; Harper & Row: New York, NY, USA, 1974. [Google Scholar]
  3. Kincaid, J.P.; Fishburne, R.P.; Rogers, R.L.; Chissom, B.S. Derivation of New Readability Formulas (Automated Readability Index, Fog Count and Flesch Reading Ease Formula) For Navy Enlisted Personnel; Research Branch Report 8-75; Chief of Naval Technical Training, Naval Air Station: Memphis, TN, USA, 1975. [Google Scholar]
  4. DuBay, W.H. The Principles of Readability; Impact Information: Costa Mesa, CA, USA, 2004. [Google Scholar]
  5. Bailin, A.; Graftstein, A. The linguistic assumptions underlying readability formulae: A critique. Lang. Commun. 2001, 21, 285–301. [Google Scholar] [CrossRef]
  6. DuBay, W.H. (Ed.) The Classic Readability Studies; Impact Information: Costa Mesa, CA, USA, 2006. [Google Scholar]
  7. Zamanian, M.; Heydari, P. Readability of Texts: State of the Art. Theory Pract. Lang. Stud. 2012, 2, 43–53. [Google Scholar] [CrossRef]
  8. Benjamin, R.G. Reconstructing Readability: Recent Developments and Recommendations in the Analysis of Text Difficulty. Educ. Psychol. Rev. 2011, 24, 63–88. [Google Scholar] [CrossRef]
  9. Collins-Thompson, K. Computational Assessment of Text Readability: A Survey of Past, in Present and Future Research, Recent Advances in Automatic Readability Assessment and Text Simplification. ITL Int. J. Appl. Linguist. 2014, 165, 97–135. [Google Scholar] [CrossRef]
  10. Kandel, L.; Moles, A. Application de l’indice de Flesch à la langue française. Cah. Etudes Radio-Télévis. 1958, 19, 253–274. [Google Scholar]
  11. Matricciani, E. A Statistical Theory of Language Translation Based on Communication Theory. Open J. Stat. 2020, 10, 936–997. [Google Scholar] [CrossRef]
  12. Lucisano, P.; Piemontese, M.E. GULPEASE: Una formula per la predizione della difficoltà dei testi in lingua italiana. Sc. Città 1988, 3, 110–124. [Google Scholar]
  13. Matricciani, E. Deep Language Statistics of Italian throughout Seven Centuries of Literature and Empirical Connections with Miller’s 7 ∓ 2 Law and Short-Term Memory. Open J. Stat. 2019, 09, 373–406. [Google Scholar] [CrossRef] [Green Version]
  14. Miller, G.A. The Magical Number Seven, Plus or Minus Two. Some Limits on Our Capacity for Processing Information. Psychol. Rev. 1955, 62, 343–352. [Google Scholar]
  15. Matricciani, E. Linguistic Mathematical Relationships Saved or Lost in Translating Texts: Extension of the Statistical Theory of Translation and Its Application to the New Testament. Information 2022, 13, 20.
  16. Matricciani, E. Multiple Communication Channels in Literary Texts. Open J. Stat. 2022, 12, 486–520.
  17. Matricciani, E. Capacity of Linguistic Communication Channels in Literary Texts: Application to Charles Dickens’ Novels. Information 2023, 14, 68.
  18. François, T. An analysis of a French as Foreign language corpus for readability assessment. In Proceedings of the 3rd Workshop on NLP for CALL; NEALT Proceedings Series 22; Linköping 2014 Electronic Conference Proceedings; Linköping University Electronic Press: Linköping, Sweden, 2014; Volume 107, pp. 13–32.
  19. Baddeley, A.D.; Thomson, N.; Buchanan, M. Word Length and the Structure of Short-Term Memory. J. Verbal Learn. Verbal Behav. 1975, 14, 575–589.
  20. Cowan, N. The magical number 4 in short-term memory: A reconsideration of mental storage capacity. Behav. Brain Sci. 2000, 24, 87–114.
  21. Pothos, E.M.; Juola, P. Linguistic structure and short-term memory. Behav. Brain Sci. 2000, 24, 138–139.
  22. Jones, G.; Macken, B. Questioning short-term memory and its measurements: Why digit span measures long-term associative learning. Cognition 2015, 144, 1–13.
  23. Saaty, T.L.; Ozdemir, M.S. Why the Magic Number Seven Plus or Minus Two. Math. Comput. Model. 2003, 38, 233–244.
  24. Mathy, F.; Feldman, J. What’s magic about magic numbers? Chunking and data compression in short-term memory. Cognition 2012, 122, 346–362.
  25. Chen, Z.; Cowan, N. Chunk Limits and Length Limits in Immediate Recall: A Reconciliation. J. Exp. Psychol. Learn. Mem. Cogn. 2005, 31, 1235–1249.
  26. Chekaf, M.; Cowan, N.; Mathy, F. Chunk formation in immediate memory and how it relates to data compression. Cognition 2016, 155, 96–107.
  27. Barrouillet, P.; Camos, V. As Time Goes By: Temporal Constraints in Working Memory. Curr. Dir. Psychol. Sci. 2012, 21, 413–419.
  28. Conway, A.R.A.; Cowan, N.; Bunting, M.F.; Therriault, D.J.; Minkoff, S.R.B. A latent variable analysis of working memory capacity, short-term memory capacity, processing speed, and general fluid intelligence. Intelligence 2002, 30, 163–183.
  29. Manzoni, A. The Betrothed; Moore, M.F., Translator; The Modern Library: New York, NY, USA, 2022.
  30. Mazza, A. Studi Sulle Redazioni de I Promessi Sposi; Edizioni Paoline: Milan, Italy, 1968.
  31. Nencioni, G. La Lingua di Manzoni. Avviamento Alle Prose Manzoniane; Il Mulino: Bologna, Italy, 1993.
  32. Guntert, G. Manzoni Romanziere: Dalla Scrittura Ideologica Alla Rappresentazione Poetica; Franco Cesati Editore: Firenze, Italy, 2000.
  33. Frare, P. Leggere I Promessi Sposi; Il Mulino: Bologna, Italy, 2016.
Figure 1. Readability index, $G$, of Italian (GULPEASE, see Reference [12]), as a function of the number of school years attended in Italy. The continuous lines divide the quadrant into areas of the same performance of texts. Elementary school lasts 5 years, junior high school lasts 3 years, and high school lasts 5 years. Students stay at school until they are 19 years old. For comparison, the green vertical axis on the right refers to the Flesch Reading Ease index.
Figure 2. Flesch Reading Ease (RE) index, Equation (4), versus the global index $G$, Equation (2), for the novels of English literature listed in Table 2. Robinson Crusoe, cyan “o”; Pride and Prejudice, black “o”; Vanity Fair, blue “o”; Alice’s Adventures in Wonderland, magenta “o”; Treasure Island, green “o”; Adventures of Huckleberry Finn, red “+”; Peter Pan, blue “+”; The Sun Also Rises, green “+”; A Farewell to Arms, black “+”.
Figure 3. Automated Readability Index (ARI), Equation (5), versus the global index $G$, Equation (3), for the novels of English literature listed in Table 2. The continuous lines assume constant values of $C_P$. Robinson Crusoe, cyan “o”; Pride and Prejudice, black “o”; Vanity Fair, blue “o”; Alice’s Adventures in Wonderland, magenta “o”; Treasure Island, green “o”; Adventures of Huckleberry Finn, red “+”; Peter Pan, blue “+”; The Sun Also Rises, green “+”; A Farewell to Arms, black “+”.
Figure 4. Scatterplot of the word interval, $I_P$, versus the number of words per sentence, $P_F$, for all samples of the English literature novels listed in Table 2. The continuous black line is the best fit given by Equation (6). The green line refers to Italian literature [13]. Miller’s bounds are given by $I_P = 7 \pm 2$.
Figure 5. Scatterplots of the readability index, $G$, versus the word interval, $I_P$. (a): Italian Literature [11]; (b): English Literature, Table 2. Miller’s bounds are given by $I_P = 7 \pm 2$.
Figure 6. Scatterplots of the universal readability index, $G_U$, versus the word interval, $I_P$ (blue circles). The red circles refer to the scatterplots of Figure 5. (a): Italian Literature [13]; (b): English Literature, Table 2. Miller’s bounds are given by $I_P = 7 \pm 2$.
Figure 7. Average (blue line) and $\pm 1$ standard deviation lines (cyan lines) of the universal readability index, $G_U$, versus the word interval, $I_P$, from Table A1.
Table 1. Values of $C_P$ and $k$ of Equations (2) and (3) in the New Testament texts in the indicated languages. Languages are listed according to their language family (see Reference [11]).

| Language | Language Family | $C_P$ | $k$ |
|---|---|---|---|
| Greek | Hellenic | 4.86 | 0.92 |
| Latin | Italic | 5.16 | 0.87 |
| Esperanto | Constructed | 4.43 | 1.01 |
| French | Romance | 4.20 | 1.07 |
| Italian | Romance | 4.48 | 1.00 |
| Portuguese | Romance | 4.43 | 1.01 |
| Romanian | Romance | 4.34 | 1.03 |
| Spanish | Romance | 4.30 | 1.04 |
| Danish | Germanic | 4.14 | 1.08 |
| English | Germanic | 4.24 | 1.06 |
| Finnish | Germanic | 5.90 | 0.76 |
| German | Germanic | 4.68 | 0.96 |
| Icelandic | Germanic | 4.34 | 1.03 |
| Norwegian | Germanic | 4.08 | 1.10 |
| Swedish | Germanic | 4.23 | 1.06 |
| Bulgarian | Balto-Slavic | 4.41 | 1.02 |
| Czech | Balto-Slavic | 4.51 | 0.99 |
| Croatian | Balto-Slavic | 4.39 | 1.02 |
| Polish | Balto-Slavic | 5.10 | 0.88 |
| Russian | Balto-Slavic | 4.67 | 0.96 |
| Serbian | Balto-Slavic | 4.24 | 1.06 |
| Slovak | Balto-Slavic | 4.65 | 0.96 |
| Ukrainian | Balto-Slavic | 4.56 | 0.98 |
| Estonian | Uralic | 4.89 | 0.92 |
| Hungarian | Uralic | 5.31 | 0.84 |
| Albanian | Albanian | 4.07 | 1.10 |
| Armenian | Armenian | 4.75 | 0.94 |
| Welsh | Celtic | 4.04 | 1.11 |
| Basque | Isolate | 6.22 | 0.72 |
| Hebrew | Semitic | 4.22 | 1.06 |
| Cebuano | Austronesian | 4.65 | 0.96 |
| Tagalog | Austronesian | 4.83 | 0.93 |
| Chichewa | Niger-Congo | 6.08 | 0.74 |
| Luganda | Niger-Congo | 6.23 | 0.72 |
| Somali | Afro-Asiatic | 5.32 | 0.84 |
| Haitian | French Creole | 3.37 | 1.33 |
| Nahuatl | Uto-Aztecan | 6.71 | 0.67 |
Table 2. Novels from English literature. Deep-language parameters $C_P$, $P_F$, $I_P$, $G$ and universal readability index $G_U$, the latter discussed in Section 4. Novels are listed according to the year of publication.

| Literary Work | $C_P$ | $P_F$ | $I_P$ | $G$ | $G_U$ |
|---|---|---|---|---|---|
| Matthew King James translation (1611) | 4.27 | 23.51 | 5.91 | 55.14 | 55.86 |
| Robinson Crusoe (D. Defoe, 1719) | 3.94 | 57.75 | 7.12 | 50.84 | 42.22 |
| Pride and Prejudice (J. Austen, 1813) | 4.40 | 24.86 | 7.16 | 52.79 | 43.89 |
| Wuthering Heights (E. Brontë, 1845–1846) | 4.27 | 25.82 | 5.97 | 53.65 | 53.89 |
| Vanity Fair (W. Thackeray, 1847–1848) | 4.63 | 25.74 | 6.73 | 49.75 | 44.10 |
| David Copperfield (C. Dickens, 1849–1850) | 4.04 | 24.40 | 5.61 | 56.68 | 59.66 |
| Moby Dick (H. Melville, 1851) | 4.52 | 31.18 | 6.45 | 49.11 | 45.66 |
| The Mill on The Floss (G. Eliot, 1860) | 4.29 | 28.03 | 7.09 | 52.70 | 44.32 |
| Alice’s Adventures in Wonderland (L. Carroll, 1865) | 3.96 | 30.92 | 5.79 | 56.14 | 57.76 |
| Little Women (L.M. Alcott, 1868–1869) | 4.18 | 21.08 | 6.30 | 57.31 | 54.99 |
| Treasure Island (R.L. Stevenson, 1881–1882) | 4.02 | 21.89 | 6.05 | 58.78 | 58.39 |
| Adventures of Huckleberry Finn (M. Twain, 1884) | 3.85 | 24.89 | 6.63 | 59.01 | 54.14 |
| Three Men in a Boat (J.K. Jerome, 1889) | 4.25 | 13.71 | 6.14 | 64.19 | 63.13 |
| The Picture of Dorian Gray (O. Wilde, 1890) | 4.19 | 16.56 | 6.29 | 62.83 | 60.58 |
| The Jungle Book (R. Kipling, 1894) | 4.11 | 21.52 | 7.15 | 57.95 | 49.14 |
| The War of the Worlds (H.G. Wells, 1897) | 4.38 | 20.85 | 7.67 | 55.31 | 42.48 |
| The Wonderful Wizard of Oz (L.F. Baum, 1900) | 4.02 | 20.55 | 7.63 | 59.38 | 46.85 |
| The Hound of The Baskervilles (A.C. Doyle, 1901–1902) | 4.15 | 17.79 | 7.83 | 60.27 | 46.16 |
| Peter Pan (J.M. Barrie, 1902) | 4.12 | 18.20 | 6.35 | 60.53 | 57.85 |
| A Little Princess (F.H. Burnett, 1902–1905) | 4.18 | 16.38 | 6.80 | 61.57 | 55.45 |
| Martin Eden (J. London, 1908–1909) | 4.32 | 16.94 | 6.76 | 59.38 | 53.50 |
| Women in Love (D.H. Lawrence, 1920) | 4.26 | 13.71 | 5.22 | 63.98 | 70.02 |
| The Secret Adversary (A. Christie, 1922) | 4.28 | 11.02 | 5.52 | 69.08 | 72.76 |
| The Sun Also Rises (E. Hemingway, 1926) | 3.92 | 10.70 | 6.02 | 72.58 | 72.45 |
| A Farewell to Arms (E. Hemingway, 1929) | 3.94 | 10.12 | 6.80 | 73.17 | 66.99 |
| Of Mice and Men (J. Steinbeck, 1937) | 4.02 | 9.67 | 5.61 | 74.20 | 77.24 |
Table 3. Novels from Italian Literature [13]. Average deep-language parameters $C_P$, $P_F$, $I_P$, and $G$ and corresponding universal readability index, $G_U$. Novels are listed in alphabetical order of the author’s name.

| Novel | $C_P$ | $P_F$ | $I_P$ | $G$ | $G_U$ |
|---|---|---|---|---|---|
| Anonymous (I Fioretti di San Francesco, XIV Century) | 4.65 | 37.70 | 8.24 | 50.70 | 37.26 |
| Boccaccio Giovanni (Decameron, XIV) | 4.48 | 44.27 | 7.79 | 51.18 | 40.44 |
| Buzzati Dino (Il deserto dei tartari, XX) | 5.10 | 17.75 | 6.63 | 55.27 | 51.49 |
| Calvino Italo (Marcovaldo, XX) | 4.74 | 17.60 | 6.59 | 59.19 | 55.65 |
| Cassola Carlo (La ragazza di Bube, XX) | 4.48 | 11.93 | 5.64 | 69.84 | 72.00 |
| Collodi Carlo (Pinocchio, XIX) | 4.60 | 16.92 | 6.19 | 61.57 | 60.43 |
| Deledda Grazia (Canne al vento, XX) | 4.51 | 15.08 | 6.06 | 64.39 | 64.03 |
| D’Annunzio Gabriele (Le novelle della Pescara, XX) | 4.91 | 17.99 | 6.38 | 58.16 | 55.88 |
| Eco Umberto (Il nome della rosa, XX) | 4.81 | 21.08 | 7.46 | 55.78 | 47.02 |
| Fogazzaro (Piccolo mondo antico, XIX-XX) | 4.79 | 16.08 | 6.10 | 61.46 | 60.86 |
| Gadda (Quer pasticciaccio brutto… XX) | 4.76 | 18.43 | 4.98 | 58.24 | 64.36 |
| Machiavelli Niccolò (Il principe, XV-XVI) | 4.71 | 40.17 | 6.45 | 49.54 | 46.84 |
| Manzoni Alessandro (Fermo e Lucia, XIX) | 4.75 | 30.98 | 7.17 | 51.72 | 44.70 |
| Manzoni Alessandro (I promessi sposi, XIX) | 4.60 | 24.83 | 5.30 | 56.00 | 60.20 |
| Moravia Alberto (La ciociara, XX) | 4.56 | 29.93 | 7.28 | 53.52 | 45.84 |
| Pavese Cesare (La luna e i falò, XX) | 4.47 | 17.83 | 6.83 | 61.90 | 56.92 |
| Pirandello Luigi (Il fu Mattia Pascal) | 4.63 | 14.57 | 4.94 | 63.94 | 70.30 |
| Svevo Italo (Senilità, XX) | 4.86 | 16.04 | 7.75 | 59.39 | 48.89 |
| Tomasi di Lampedusa (Il gattopardo, XX) | 4.99 | 26.42 | 7.90 | 50.72 | 39.32 |
| Verga (I Malavoglia, XIX-XX) | 4.46 | 20.45 | 6.82 | 59.34 | 54.42 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Matricciani, E. Readability Indices Do Not Say It All on a Text Readability. Analytics 2023, 2, 296-314. https://doi.org/10.3390/analytics2020016
