Article

Palindromic Vectors, Symmetropy and Symmentropy as Symmetry Descriptors of Binary Data

by Jean-Marc Girault 1,2,* and Sébastien Ménigot 1,2

1 Groupe ESEO, 49000 Angers, France
2 Laboratoire d’Acoustique de l’Université du Mans (LAUM), UMR 6613, Institut d’Acoustique-Graduate School (IA-GS), CNRS, Le Mans Université, 72085 Le Mans, France
* Author to whom correspondence should be addressed.
Entropy 2022, 24(1), 82; https://doi.org/10.3390/e24010082
Submission received: 24 November 2021 / Revised: 30 December 2021 / Accepted: 31 December 2021 / Published: 3 January 2022
(This article belongs to the Section Entropy and Biology)

Abstract

Today, the palindromic analysis of biological sequences, based exclusively on the study of “mirror” symmetry properties, is almost unavoidable. However, other types of symmetry, such as those present in friezes, could allow us to analyze binary sequences from another point of view. New tools, such as symmetropy and symmentropy, based on new types of palindromes allow us to discriminate binarized 1/f noise sequences better than Lempel–Ziv complexity. These new palindromes with new types of symmetry also allow for better discrimination of binarized DNA sequences. A relative error of 6% of symmetropy is obtained from the HUMHBB and YEAST1 DNA sequences. A factor of 4 between the slopes obtained from the linear fits of the local symmentropies for the two DNA sequences shows the discriminative capacity of the local symmentropy. Moreover, it is highlighted that a certain number of these new palindromes of sizes greater than 30 bits are more discriminating than those of smaller sizes, which resemble those from an independent and identically distributed random variable.

1. Introduction

The palindromic analysis of discrete sequences has partly revolutionized molecular biology and is widely used as shown by the following work [1,2,3,4,5,6,7,8], to name a few. Very recently, the study of quantum behavior [9], encountered in palindromes within the DNA structure, revealed that the symmetry properties of the unitary structure, other than those present in classical palindromes, play an important role in the origin and cause of mutations.
In the continuity of the work carried out by Tibatan and Sarisaman [9], our article aims to highlight the symmetry links between the concept of frieze and the concept of palindrome, which have been insufficiently exploited until now in the analysis of binary data.
The “mirror” symmetry on which the concept of palindrome is based is certainly the basis of the oldest symmetry descriptors. Its greatest success undoubtedly comes from the analysis of biological sequences (DNA, RNA and proteins), even if in this case the definition of DNA palindromes is slightly different from the classical definition. (Consider the sequence of characters ‘ATGGCCAT’. It is an 8-palindrome: its right 4-pattern (‘CCAT’) is obtained by a mirror reflection of the complement of its left 4-pattern (‘ATGG’), i.e., ‘ATGG’|‘CCAT’, where | marks the axis of the mirror reflection, which is applied here to the complemented pattern. Note that ‘T’ is the complement of ‘A’, and ‘C’ is the complement of ‘G’.)
To fix ideas, a palindrome of size m, called an “m-palindrome”, is a discrete sequence composed of two contiguous mirror-symmetric sub-sequences, each a k-pattern with k = m/2. For example, the alphabetic character sequence ddddddddbbbbbbbb is a 16-palindrome composed of two 8-patterns: dddddddd and bbbbbbbb. Even if theoretical research on palindromes is still ongoing, as shown by the recent article by Gabric and Shallit [10], it is the older work of Allouche et al. [11,12], and in particular the notion of palindromic complexity, that is used as the starting point of this work.
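The DNA palindrome definition in the parenthetical above can be checked mechanically. The following is a minimal Python sketch, not the authors' code; the names `reverse_complement` and `is_dna_palindrome` are hypothetical and chosen only for illustration, and only even sizes are handled.

```python
# Complement pairs stated in the text: A <-> T and C <-> G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(pattern: str) -> str:
    """Complement every base, then mirror (reverse) the pattern."""
    return "".join(COMPLEMENT[base] for base in reversed(pattern))


def is_dna_palindrome(seq: str) -> bool:
    """True if the right half is the mirrored complement of the left half.

    Assumption of this sketch: only even-size sequences are considered.
    """
    if len(seq) % 2:
        return False
    k = len(seq) // 2
    return seq[k:] == reverse_complement(seq[:k])
```

With these definitions, the left pattern ‘ATGG’ maps to ‘CCAT’, so ‘ATGGCCAT’ is accepted as a DNA palindrome.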
Today, when studying a word or a discrete sequence, its analysis is still limited to only one type of symmetry: the “mirror” symmetry. Wanting to extract many more intrinsic features in the discrete sequences studied can consist of looking for other types of symmetries, as it is explicitly the case in friezes.
A frieze is a horizontal strip composed of an infinite number of symmetrical patterns, i.e., a periodic geometric object. As an illustration, five types of alphabetical sequences of 16 characters, having the same symmetries as friezes, are presented as follows: bbbbbbbbbbbbbbbb, dbdbdbdbdbdbdbdb, bpbpbpbpbpbpbpbp, bqbqbqbqbqbqbqbq, bqpdbqpdbqpdbqpd.
If the objective is indeed to extend the analysis of discrete periodic sequences to other types of sequences, then the search for all symmetric patterns is the next step. To reach this goal, the concept of palindrome and then that of frieze are presented in Section 2. Then, the concept of palindromes is extended and new tools such as symmetropy and symmentropy are proposed in Section 3. Finally, the set of symmetry descriptors is tested on binarized 1/f noises and binarized DNA sequences in Section 4; then, the results are discussed in Section 5.

2. Palindromes and Friezes

In this section, we recall the concept of palindromes [11,12,13] and the concept of friezes [14,15].

2.1. Palindromes

For a binary sequence, an m-palindrome is, by definition, a grouping of m bits that forms an m-pattern with mirror symmetry. In other words, for a binary sequence $X = \{x(1), x(2), \ldots, x(M)\}$ composed of M bits, an m-palindrome can be defined as the concatenation of two k-patterns: $X_m(i) = [X_k(i)\ \Gamma_R[X_k(i)]]$, with $k = m/2$ being the order of the palindrome. The first k-pattern $X_k(i) = \{x(i), x(i+1), \ldots, x(i+k-1)\}$, $1 \le i \le M-k+1$, is the reference pattern, and the second k-pattern, obtained by $\Gamma_R[X_k(i)]$, is the symmetric pattern, where $\Gamma_R[\cdot]$ is the transformation corresponding to the mirror symmetry, a reflection.
For example, for the binary sequence $X = \{01100110\}$ of 8 bits, the first 4-palindrome of X, of order 2, is written as $X_4(1) = [X_2(1)\ \Gamma_R[X_2(1)]] = [\{01\}\{10\}] = \{0110\}$, with $X_2(1) = \{01\}$ and $\Gamma_R[X_2(1)] = \{10\}$. In the same way, the 8-palindrome of X of order 4 is written as $X_8(1) = [X_4(1)\ \Gamma_R[X_4(1)]] = [\{0110\}\{0110\}] = \{01100110\}$, with $X_4(1) = \{0110\}$ and $\Gamma_R[X_4(1)] = \{0110\}$.
A palindrome of odd length can be seen as the concatenation of a pattern of size $(m-1)$ and its mirror, for which the rightmost bit of the $(m-1)$-reference pattern and the leftmost bit of the $(m-1)$-mirror pattern (both in bold in the following example) are merged into a single bit. Example: $[\{0\mathbf{1}\}\{\mathbf{1}0\}] = \{0\mathbf{1}\,\mathbf{1}0\}$ becomes $\{0\mathbf{1}0\}$.
Although there is a plethora of scalar descriptors, such as those indicated in [11,12,13] to name but a few, here we limit ourselves to the concept of palindromic complexity $\tilde{c}$ computed from $D$, which lists, from the palindromic dictionary, the cardinal of the different palindromic words of size m:
$$D = [d(0), d(1), \ldots, d(m), \ldots, d(M)]^t,$$
where $d(m)$ is the cardinal of “palindrome words” of size m [11] present in the binary sequence. The empty palindrome obtained for $m = 0$ is $e$, and $\{e, 0, 1\}$ are the trivial palindromes. The palindromic complexity $\tilde{c}$, which corresponds to the cardinal of $D$, is defined by the following:
$$\tilde{c} = \operatorname{card} D.$$
In order to measure the level of mirror symmetry present in a binary sequence, we propose to count the frequency of occurrence of m-palindromic patterns in the binary sequence studied by the following:
$$V = [v(0), v(1), \ldots, v(m), \ldots, v(M)]^t,$$
where $v(m)$ is the frequency of occurrence of a palindromic pattern of size m. The “mirror” symmetry level $\tilde{\sigma}$ is the sum of all occurrences for non-trivial palindromes:
$$\tilde{\sigma} = \sum_{m=2}^{M} v(m).$$
An illustration given in Table 1 for the binary sequence $X = \{01101001\}$ specifies the value of the palindromic complexity, $\tilde{c} = 5$: there are, in all, five non-zero elements in $D = \{1, 2, 2, 2, 2, 0, 0, 0, 0\}$, which is itself computed from the empirical palindromic dictionary $Dict = \{e, 0, 1, 00, 11, 010, 101, 0110, 1001\}$.
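The dictionary, the vector $D$ and the complexity $\tilde{c}$ of this example can be reproduced by brute force. The sketch below is only illustrative and is not the authors' implementation: the function names are hypothetical, the empty palindrome $e$ is represented by the empty string, and `mirror_symmetry_level` assumes that $v(m)$ counts every position at which a mirror palindrome of size m occurs.

```python
def palindromic_dictionary(bits: str) -> set:
    """Distinct mirror-palindromic words found in `bits` ("" stands for e)."""
    words = {""}
    M = len(bits)
    for m in range(1, M + 1):
        for i in range(M - m + 1):
            word = bits[i:i + m]
            if word == word[::-1]:
                words.add(word)
    return words


def palindromic_vector(bits: str) -> list:
    """D: number of distinct palindromic words for each size m = 0..M."""
    D = [0] * (len(bits) + 1)
    for word in palindromic_dictionary(bits):
        D[len(word)] += 1
    return D


def palindromic_complexity(bits: str) -> int:
    """Number of non-zero elements of D, as in the example of the text."""
    return sum(1 for d in palindromic_vector(bits) if d > 0)


def mirror_symmetry_level(bits: str) -> int:
    """Sum of occurrences of mirror palindromes of size m >= 2 (assumed reading of v(m))."""
    M = len(bits)
    return sum(
        1
        for m in range(2, M + 1)
        for i in range(M - m + 1)
        if bits[i:i + m] == bits[i:i + m][::-1]
    )
```

For the sequence `"01101001"`, `palindromic_vector` returns the nine values listed above and `palindromic_complexity` returns 5.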

2.2. Friezes

As stated in the Introduction, a frieze is a periodic horizontal band composed of a few basic symmetrical patterns repeated ad infinitum. There are only seven different types of friezes [14,15] (see Figure 1), obtained from five types of isometries (TRIGH: Translation, vertical Reflection, Inversion, Glide reflection and Horizontal reflection). An isometry is a geometrical transformation that preserves distances and therefore leaves the shape of the transformed objects unchanged, which is the case for these five operations. There are only five possible types of periodic discrete sequences, obtained from four types of isometries (TRIG: Translation, vertical Reflection, Inversion and Glide reflection), since horizontal reflection does not yield a 1D sequence.
For example, from the friezes in Figure 1 and replacing ⌜ by { 10 } , we can construct five types of periodic discrete sequences, all having different types of symmetry:
  • sequence X = { 10101010 } obtained with translations;
  • sequence X = { 10011001 } obtained with vertical reflections (mirror);
  • sequence X = { 10011001 } obtained with glide reflections and translations;
  • sequence X = { 10101010 } obtained with inversions and translations;
  • sequence X = { 10100101 } obtained with inversions and vertical reflections.
Among the five previous sequences, two are composed of mirror palindromes (the second and the last). By no longer limiting the search to mirror palindromes, it should be possible to describe binary sequences more precisely; this is the subject of the next section.

3. Methods

In this section, we propose to extend the different palindromic vector and scalar descriptors by integrating the different types of symmetry revealed in the friezes. Then, new palindromic descriptors such as the notions of symmetropy and symmentropy are proposed.
As mentioned earlier, through the notion of friezes, several types of symmetries can be considered using the combination of only four isometries (TRIG). On this basis, we propose to generalize the notion of palindromes by taking into account all types of symmetries.
For a binary sequence $X = \{x(1), x(2), \ldots, x(M)\}$ composed of M bits, an m-palindrome of type $j \in \{T, R, I, G\}$ can be defined as the concatenation of two k-patterns: $X_m(i) = [X_k(i)\ \Gamma_j[X_k(i)]]$ with $k = m/2$. The first k-pattern $X_k(i) = \{x(i), x(i+1), \ldots, x(i+k-1)\}$, $1 \le i \le M-k+1$, is the reference pattern, and the second k-pattern is the one obtained by one of the four isometries $\Gamma_j[X_k(i)]$ with $j \in \{T, R, I, G\}$:
- $\Gamma_T[X_k(i)] = \{x(i), x(i+1), \ldots, x(i+k-1)\}$. A translation is simply a “copy and paste”;
- $\Gamma_R[X_k(i)] = \{x(i+k-1), \ldots, x(i+1), x(i)\}$. A vertical reflection is simply a “copy, return and paste”;
- $\Gamma_I[X_k(i)] = \overline{\{x(i+k-1), \ldots, x(i+1), x(i)\}}$. An inversion is simply a “copy-complement-return-paste”;
- $\Gamma_G[X_k(i)] = \overline{\{x(i), x(i+1), \ldots, x(i+k-1)\}}$. A glide reflection is simply a “copy-complement-paste”;
where the overline $\overline{\,\cdot\,}$ is the logical function NOT, also called a complement. For example, with the binary sequence $X = \{01010101\}$, the first 4-palindrome of type ‘T’ is written as $X_4(1) = \{0101\}$, with $X_2(1) = \{01\}$ and $\Gamma_T[X_2(1)] = \Gamma_I[X_2(1)] = \{01\}$.
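The four isometries translate directly into string operations. The following Python sketch is one possible reading of the definitions above, with bit patterns stored as strings of '0' and '1'; the names `complement`, `gamma` and `make_palindrome` are hypothetical and introduced only for illustration.

```python
_FLIP = str.maketrans("01", "10")


def complement(pattern: str) -> str:
    """Logical NOT applied to every bit of the pattern."""
    return pattern.translate(_FLIP)


def gamma(pattern: str, j: str) -> str:
    """Apply the isometry of type j in {'T', 'R', 'I', 'G'} to a k-pattern."""
    if j == "T":                       # translation: copy and paste
        return pattern
    if j == "R":                       # vertical reflection: copy, return, paste
        return pattern[::-1]
    if j == "I":                       # inversion: copy, complement, return, paste
        return complement(pattern[::-1])
    if j == "G":                       # glide reflection: copy, complement, paste
        return complement(pattern)
    raise ValueError(f"unknown isometry type: {j!r}")


def make_palindrome(pattern: str, j: str) -> str:
    """Concatenate the reference k-pattern and its image, giving an m-palindrome (m = 2k)."""
    return pattern + gamma(pattern, j)
```

Applied to the reference pattern `"01"`, the types ‘T’ and ‘I’ both return `"01"`, which is why `"0101"` is at the same time a 4-palindrome of type ‘T’ and of type ‘I’.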
If the objective is to measure the level of symmetry of a binary sequence through the presence of palindromes of type $j \in \{T, R, I, G\}$, then we can define the following normalized measure:
$$v_j^*(m) = \frac{v_j(m)}{2\,(M-m+1)\,(M-1)},$$
with $v_j(m)$ being the total number of palindromes of type j and size m, and $v_j(0) = M$. The palindrome vector of type j is then given by the following:
$$V_j^* = [v_j^*(0), v_j^*(1), \ldots, v_j^*(m), \ldots, v_j^*(M)]^t.$$
In order to propose a scalar measure of the level of symmetry of a given type, it seems judicious not to take into account the trivial palindromes because they could mask, for very long sequences, the presence of larger palindromes in smaller numbers. The normalized total number of non-trivial palindromes $\sigma_j^*$ of type $j \in \{T, R, I, G\}$, for the whole range of sizes m, is obtained by computing
$$\sigma_j^* = \sum_{m=2}^{M} v_j^*(m).$$
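As a rough illustration of how $v_j(m)$ and $v_j^*(m)$ might be obtained, the sketch below counts palindromes of a given type by sliding a window over the sequence. It rests on several assumptions that are not taken from the text: only even sizes m are handled (the treatment of odd sizes for every type is not spelled out here), overlapping occurrences are all counted, the normalization is read as $v_j(m) / (2(M-m+1)(M-1))$, and all function names are hypothetical.

```python
_FLIP = str.maketrans("01", "10")


def gamma(pattern: str, j: str) -> str:
    """Isometry of type j in {'T', 'R', 'I', 'G'} applied to a bit string."""
    if j == "T":
        return pattern
    if j == "R":
        return pattern[::-1]
    if j == "I":
        return pattern[::-1].translate(_FLIP)
    if j == "G":
        return pattern.translate(_FLIP)
    raise ValueError(f"unknown isometry type: {j!r}")


def count_even_palindromes(bits: str, m: int, j: str) -> int:
    """v_j(m) for even m: windows whose second half is the image of the first."""
    if m < 2 or m % 2 or m > len(bits):
        raise ValueError("this sketch needs an even m with 2 <= m <= M")
    k = m // 2
    return sum(
        1
        for i in range(len(bits) - m + 1)
        if bits[i + k:i + m] == gamma(bits[i:i + k], j)
    )


def normalized_count(bits: str, m: int, j: str) -> float:
    """v_j*(m) under the assumed normalization 2 (M - m + 1) (M - 1)."""
    M = len(bits)
    return count_even_palindromes(bits, m, j) / (2 * (M - m + 1) * (M - 1))
```

For `"01010101"` and m = 4, this counting finds five palindromes of type ‘T’, five of type ‘I’ and none of types ‘R’ and ‘G’.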
To obtain the global level of symmetry present in a binary sequence, the global palindromic symmetropy $\sigma^*$ is defined as follows:
$$\sigma^* = \sigma_T^* + \sigma_R^* + \sigma_I^* + \sigma_G^*,$$
where $\sigma_R^* = \tilde{\sigma}$ is defined in Section 2. Note that, for binary sequences where the level of symmetry is maximal, as for example for the sequences $X = \{01010101\}$ and $X = \{111111\}$, the symmetropy is maximal with $\sigma^* = 1$.
To quantify the “diversity” of different types of palindromes, the overall palindromic symmentropy $E$ can be defined as follows:
$$E = -P^t \log_4 P,$$
where $P$ is the quarte probability defined as follows:
$$P = [p_T, p_R, p_I, p_G]^t,$$
with $p_j = \sigma_j^* / \sigma^*$. Note that the values of the symmentropy lie between 1/2 and 1. When there is equi-probability, then $E = 1$. For example, for the sequence $X = \{01010101\}$ of $M = 8$ bits, the symmentropy is maximal at $E = 0.99$, and $\lim_{M \to \infty} E = 1$. When two probabilities out of four are null with $P = [1/2, 1/2, 0, 0]^t$, as is the case for the 8-bit sequence $X = \{11111111\}$, then the symmentropy is minimal and is $E = 1/2$. This means that, when the symmentropy is minimal, there is always a minimum symmetric information content in the binary sequences.
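The two limiting values quoted above follow directly from the base-4 entropy formula. The short Python sketch below evaluates it for a given quarte probability; the function names are hypothetical, and the usual convention $0 \log 0 = 0$ is assumed for null probabilities.

```python
import math


def quarte_probability(levels):
    """Normalize four symmetry levels [T, R, I, G] so that they sum to one."""
    total = sum(levels)
    return [level / total for level in levels]


def symmentropy(probabilities):
    """Base-4 entropy of a probability vector, with 0 * log(0) taken as 0."""
    return -sum(p * math.log(p, 4) for p in probabilities if p > 0)
```

With `[0.5, 0.5, 0, 0]` the function gives one half, and with four equal probabilities it gives one (up to floating-point rounding), matching the bounds stated in the text. The same function can be applied to the probabilities computed at a single scale m.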
Finally, it seems appropriate to compute a local palindromic symmentropy $\epsilon(m)$ for each scale m:
$$\epsilon(m) = -Q^t(m) \log_4 Q(m),$$
where $Q(m) = [q_T(m), q_R(m), q_I(m), q_G(m)]^t$ is the quarte probability at scale m, with $q_j(m) = v_j(m) / \sigma(m)$ and $\sigma(m) = v_T(m) + v_R(m) + v_I(m) + v_G(m)$.
To illustrate, let us consider the binary sequence $X = \{01101001\}$ of 8 bits. We report in Table 2 $Dict_j$, $v_j(m)$, $v_j^*(m)$ and $q_j$ with $j \in \{T, R, I, G\}$.
Remark: This measure of symmentropy is similar in idea to the one proposed by Yodogawa [16], who proposed an entropic measure of the level of symmetry present in images via a decomposition in the Walsh–Hadamard basis. (The method of Yodogawa that measures the entropy of symmetric patterns is called symmetropy. From our point of view, it is rather a symmentropy since it is derived from an entropy measure, which is not the case of symmetropy as we define it in Section 3. On the other hand, in Yodogawa’s approach, the probabilities used to compute the entropy in base 2 are obtained from a decomposition in the Walsh–Hadamard basis. In Yodogawa’s paper, it is clearly stated that not all symmetries are considered, which is not the case for our approach based on symmetry friezes.) Here, the proposed definition is different.

4. Results

In this section, we wish to show the interest of these new scalar and vector descriptors in the study of binarized sequences. We propose to compute the different proposed descriptors (palindromic vectors, symmetropy and symmentropy) for binarized sequences taken from 1/f noises and two DNA sequences.

4.1. Binarized 1/f Noise

One way to study complexity, whose meaning is reduced here to that of irregularity as reported in [17], is to vary the exponent $\beta$ of an $f^{\beta}$ noise. For $\beta = 0$, the generated noise is white noise, and for $\beta = -2$, the generated noise is a Brownian motion, the integral of a white noise being a Brownian motion.
Here, in order to stay within the framework of our study, the time series are binarized. All values above the median are replaced by ‘1’, otherwise ‘0’. Moreover, in order to compare the different scalar and vector descriptors, the Lempel–Ziv complexity $C_{lz}$ is proposed as a reference and is computed as presented in [18]. This normalized complexity is almost zero for periodic binary sequences and close to unity for random sequences such as white noise.
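The median binarization rule described above can be sketched in a few lines of Python. This is an illustrative sketch only; the function name `binarize` is hypothetical, and values equal to the median are coded ‘0’, following the “otherwise” branch of the rule.

```python
from statistics import median


def binarize(series) -> str:
    """Code each sample as '1' if it is strictly above the median, else '0'."""
    threshold = median(series)
    return "".join("1" if value > threshold else "0" for value in series)
```

For instance, the series `[3, 1, 4, 1, 5]` has median 3, so only the samples 4 and 5 are coded ‘1’.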
In Figure 2, Figure 3 and Figure 4, the scalar and vector descriptors obtained for $f^{\beta}$ noises with $\beta \in \{-2 : 0\}$ in steps of 0.2 are presented. For each value of $\beta$, 300 binarized noises composed of 1000 bits are generated.
In Figure 2, the different scalar palindromic descriptors are computed and plotted as box-and-whisker plots. From Figure 2, we observe that all scalar descriptors follow monotonic curves, increasing for the Lempel–Ziv complexity and the symmentropy and decreasing for the symmetropy (as well as for its components through the quarte probability $P$). This monotonicity property is useful for recovering the value of $\beta$ from the value of the descriptor. Indeed, it is possible to discriminate binarized $f^{\beta}$ noises over larger or smaller regions depending on the descriptor considered. For example, for the Lempel–Ziv complexity, the body (second and third quartiles) of the non-overlapping boxes in the region $-1.2 < \beta < -0.6$ allows us, from a Lempel–Ziv complexity of 0.62, to go back to a value of $\beta = -1.2$ without much error. When $\beta = 0$, the complexity is maximal and tends to unity; when $\beta = -2$, the complexity is lower and is 0.2 for a Brownian motion. For symmetropy, the non-overlapping boxes for $-1.2 < \beta < -0.4$ also allow us to find the value of $\beta$ from the symmetropy measures. Note that the discrimination range ($\beta > -1.2$) of symmentropy is much larger than those obtained by Lempel–Ziv complexity and symmetropy. We also check that the values of the symmentropy are indeed between 1/2 and 1. Finally, in Figure 2, we observe how the probabilities drawn from the quarte $P$ evolve: as $\beta$ approaches zero, they decrease for types ‘T’ and ‘R’, going from 50% to 25% and 33%, respectively, and they increase progressively for types ‘I’ and ‘G’, going from 0% to 17% and 25%, respectively. At the maximum complexity $\beta = 0$, we observe that the reflection symmetry level is always higher than the translation/glide reflection and inversion levels: $\sigma_R^* > \sigma_T^* = \sigma_G^* > \sigma_I^*$.
In Figure 3, the palindromic vectors obtained for $\beta = -2, -1, 0$, which correspond to Brownian motion, pink noise and white noise, respectively, are presented. From Figure 3, we observe that all of the average palindromic vectors (obtained by averaging 300 palindromic vectors) decrease as the palindromic size m increases, and this decrease is all the more marked as $\beta$ approaches zero, i.e., when the correlations between samples are almost non-existent. Note that, for Brownian motions ($\beta = -2$), there are large palindromes of sizes up to about 450. On the contrary, for white noise, the size of the palindromes does not exceed 20 bits. Moreover, the palindromic vector obtained for $\beta = 0$ is very similar to the one obtained in the case of binary iid (independent and identically distributed) sequences, as shown in Figure 5.
In Figure 4, the local symmentropy $\epsilon(m)$ (averaged over 300 trials) computed for three different types of noise (Brownian motion, pink noise and white noise) is plotted. As for the palindromic vectors, the symmentropy decreases as the size of the palindromes increases. The spread of the symmentropy depends on the type of noise and thus on the correlations between samples. The range in size is very small for white noise, which has no correlation between samples/bits, compared with Brownian motion. Moreover, the value of the symmentropy is close to unity for the white noise and close to one half for the Brownian motion.
In Figure 5, the palindromic vectors obtained from independent and identically distributed binary sequences are plotted. We observe in Figure 5 a different distribution between even and odd palindromes. There is an equi-distribution between the different types of symmetry for the even palindromes. For odd palindromes, we also note the absence of palindromes of type ‘I’. Note that there are no palindromes with sizes exceeding 40 bits. On average, the proportion of palindromes is $P_T = 25\%$, $P_R = 33\%$, $P_I = 17\%$, $P_G = 25\%$. We notice a decrease in the symmetry levels as the size of the palindromes increases. In logarithmic scale, the decrease in the symmetry level (and thus of the number of symmetrical palindromes) is linear. Indeed, for a fixed length of the binary sequence, the more the palindrome size increases, the smaller the number of palindromes composing the binary sequence. For example, a sequence of 8 bits can only be composed of one palindrome of size $m = 8$, of two palindromes of size $m = 4$, of four palindromes of size $m = 2$ and of eight palindromes of size $m = 1$. This decrease is therefore inversely proportional to the size m. If we suppose that, for a given type of symmetry, the palindrome vector is expressed by $V_j(m) = K_j / m$, then $\log(V_j(m)) = -1 \times \log(m) + \log(K_j)$. This is indeed the affine line observed in Figure 5.
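The slope of −1 in logarithmic scale can be illustrated numerically. The sketch below fits a straight line to $\log V_j(m)$ versus $\log m$ by ordinary least squares; it uses synthetic values $V_j(m) = K_j / m$ with an arbitrary constant, not data from the article, and the function name `loglog_fit` is hypothetical.

```python
import math


def loglog_fit(sizes, values):
    """Least-squares line through (log10 m, log10 V); returns (slope, intercept)."""
    xs = [math.log10(m) for m in sizes]
    ys = [math.log10(v) for v in values]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Synthetic illustration: V(m) = K / m with an arbitrary constant K = 3.
sizes = list(range(1, 41))
values = [3.0 / m for m in sizes]
slope, intercept = loglog_fit(sizes, values)
```

On such data the fitted slope is −1 and the intercept is $\log_{10} K$, up to rounding; on measured palindromic vectors, only sizes with a non-zero count can enter the fit, since the logarithm of zero is undefined.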

4.2. Biological Sequences: DNA

To show the relevance of the different symmetry descriptors proposed in a practical case, let us consider two DNA sequences. The objective is to identify descriptors that allow us to differentiate the two sequences: HUMHBB (human β-region, chromosome 11) with 73,308 bases and YEAST1 (Saccharomyces cerevisiae yeast, chromosome 1) with 230,209 bases, obtained from http://ncbi.nlm.nih.gov (accessed on 30 December 2021). The DNA sequences are binarized: ‘A’ and ‘G’ are coded by 1, and ‘T’ and ‘C’ are coded by 0. For example, the sequence ‘ATATGCATTTCC’ is coded ‘101010100000’.
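The coding rule for DNA sequences is a simple character mapping. A minimal Python sketch follows; the names `BASE_CODE` and `binarize_dna` are hypothetical, and upper-case input restricted to the four bases A, C, G and T is assumed.

```python
# Coding stated in the text: 'A' and 'G' -> 1, 'T' and 'C' -> 0.
BASE_CODE = {"A": "1", "G": "1", "T": "0", "C": "0"}


def binarize_dna(sequence: str) -> str:
    """Map a DNA string over {A, C, G, T} to a string of bits."""
    return "".join(BASE_CODE[base] for base in sequence)
```

Applied to ‘ATATGCATTTCC’, this mapping reproduces the code ‘101010100000’ given in the text.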
First, it is worth noting that, although the sequence “YEAST1” is 3.14 times larger than the sequence “HUMHBB”, the total number of palindromes in the sequence “YEAST1” is 2.95 times larger than that of the sequence “HUMHBB”, as indicated in Table 3.
Moreover, we notice in Table 3 that the proportion of palindromes of type “mirror” (i.e., ‘R’ type) is much higher than that of the other types regardless of the DNA sequence considered. This corroborates what has been observed for 1/f noises, namely $P_R > P_T > P_G > P_I$, where $P_j$ is the palindromic probability of type j.
In Table 4, the Lempel–Ziv complexity $C_{lz}$, the symmentropy $E$ and the symmetropy $\sigma^*$ are reported. From Table 4, we notice that the scalar descriptors are slightly different for the two DNA sequences. We note a relative difference of 4% for the Lempel–Ziv complexity ($4\% = (0.98 - 0.94)/0.94$), of 1% for the symmentropy ($1\% = (0.97 - 0.96)/0.96$) and of 6% for the symmetropy ($6\% = (0.85 - 0.80)/0.80$).
To go further in the analysis of DNA sequences, in Figure 6, the palindromic vector descriptors for each type j for $m \in (0, 100)$ are reported, even if the calculation has been made with $m_{max} = 500$. We notice that the palindromic vectors are rather concentrated in the 0–100 band, with some peaks (not shown here) beyond $m = 100$, located at $m = 270, 192$ for “YEAST1” and $m = 124$ for “HUMHBB”. As for the 1/f noises, we notice a different distribution of the types of palindromes. For example, there are no more palindromes of type ‘R’ for the sequence “YEAST1” beyond $m = 60$, and likewise no more palindromes of type ‘I’ for the sequence “YEAST1” beyond $m = 40$. Note also that there are no odd palindromes of type ‘I’, as already observed for independent and identically distributed sequences. This shows the importance of taking into account all types of palindromes and not only the “mirror” palindromes of type ‘R’. By superimposing the palindromic vectors obtained after randomization, we can better see the “useful” information. The signature after randomization, being similar to that of an independent and identically distributed random variable, appears to carry less important information and is therefore of little use for DNA sequence discrimination.
Finally, it seems interesting to show how local symmentropies allow us to differentiate each DNA sequence. In Figure 7, the local symmentropies calculated from “HUMHBB”, randomized “HUMHBB”, “YEAST1” and randomized “YEAST1” are reported. Straight lines obtained by linear fitting of the symmentropies show slopes that differ significantly between the two DNA sequences. Indeed, for odd palindromes, the slope of the linear fit for YEAST1 is 4.41 times the slope obtained for HUMHBB. For even palindromes, the slope of the linear fit for YEAST1 is 3.61 times the slope obtained for HUMHBB. As expected, the symmentropies obtained from the randomized DNA sequences are similar to each other and close to unity for $m < 20$. Indeed, the binary sequence obtained after randomization is very similar to an independent and identically distributed random variable, for which the symmentropy is maximal and equal to unity. For $m > 30$, as shown in Figure 7, the symmentropies of the two DNA sequences are different.

5. Discussion and Conclusions

In this work, we proposed new palindromic descriptors (scalar and vector). The notions of palindromic vectors, palindromic symmetropy and palindromic symmentropy have been tested with binarized 1/f noises and two DNA sequences. For $f^{\beta}$ noises, for which the “complexity” level is adjustable via $\beta$, we showed that palindromic symmetropy as well as palindromic symmentropy allow us to discriminate the different $f^{\beta}$ noises over a larger range than the Lempel–Ziv complexity. Moreover, we showed that symmentropy is a complexity descriptor very similar to the Lempel–Ziv complexity. However, the palindromic symmetropy indicates the level of symmetry and is a descriptor of “anti-complexity”.
From this preliminary study, we notice that the “mirror” symmetry is more present than the other types of symmetries regardless of the level of complexity (see Figure 2). This is probably why only the “mirror” symmetry, through the classical notion of palindrome, has been considered so far. However, we showed (see Figure 6) that the four types of palindromes are necessary to better discriminate the binary sequences. Moreover, we showed that the distribution of the types of palindrome evolves with complexity. It goes from 50% for ‘T’ and ‘R’ types and 0% for ‘I’ and ‘G’ types when the complexity is low to 25%, 33%, 17% and 25% for ‘T’, ‘R’, ‘I’ and ‘G’ types when the complexity is maximal. These values are found when the binarized DNA sequences have been randomized.
Multiscale palindromic exploration, i.e., over the whole size range m of palindromes, through palindromic vectors and local symmentropy, allows us to go further in the analysis of binary sequences. In particular, it allows us to highlight a particular signature of independent and identically distributed random binary sequences, found for white noise ($\beta = 0$) and in the two DNA sequences. This exploration also allows us to clearly identify regions in which the two DNA sequences can be discriminated. Furthermore, a factor of 4 between the slopes of the linear fits of the local symmentropies calculated from the two DNA sequences shows the discriminative capacity of the local symmentropy.
It seems obvious, as in the article by Tibatan and Sarisaman [9], that symmetry properties, insufficiently exploited to date, play a more important role in the exploration of biological sequences, both at the molecular and sub-molecular levels. The new palindromic descriptors presented in this work should contribute in a non-negligible way and should be widely applied in the study of biological sequences.

Author Contributions

Writing—original draft, J.-M.G. and S.M. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Berry, M.; Nunez, A.M.; Chambon, P. Estrogen-responsive element of the human pS2 gene is an imperfectly palindromic sequence. Proc. Natl. Acad. Sci. USA 1989, 86, 1218–1222.
  2. Ohno, S. Intrinsic evolution of proteins. The role of peptidic palindromes. Riv. Biol. 1990, 83, 405–410.
  3. Cain, D.; Erlwein, O.; Grigg, A.; Russell, R.A.; McClure, M.O. Palindromic sequence plays a critical role in human foamy virus dimerization. J. Virol. 2001, 75, 3731–3739.
  4. Giel-Pietraszuk, M.; Hoffmann, M.; Dolecka, S.; Rychlewski, J.; Barciszewsk, J. Palindromes in Proteins. J. Protein Chem. 2003, 109–113.
  5. Lisnic, B.; Svetec, I.-K.; Saric, H.; Nikolic, I.; Zgaga, Z. Palindrome content of the yeast Saccharomyces cerevisiae genome. Curr. Genet. 2005, 47, 289–297.
  6. Pinotsis, N.; Wilmanns, M. Protein assemblies with palindromic structure motifs. Cell. Mol. Life Sci. 2008, 65, 2953–2956.
  7. Lamprea-Burgunder, E.; Ludin, P.; Mäser, P. Species-specific Typing of DNA Based on Palindrome Frequency Patterns. DNA Res. 2011, 18, 117–124.
  8. Raykov, V.; Marvin, M.E.; Louis, E.J.; Maringele, L. Telomere dysfunction palindrome formation independently of double-strand break repair mechanisms. Genetics 2016, 203, 1659–1668.
  9. Tibatan, M.A.; Sarisaman, M. Unitary structure of palindromes in DNA. BioSystems 2022, 211, 104565.
  10. Gabric, D.; Shallit, J. Borders, palindrome prefixes, and square prefixes. Inf. Process. Lett. 2021, 165, 106027.
  11. Allouche, J.-P. Sur la complexité des suites infinies. Bull. Belg. Math. 1994, 1, 133–143.
  12. Allouche, J.-P.; Baake, M.; Cassaigne, J.; Damanik, D. Palindrome complexity. Theor. Comput. Sci. 2003, 292, 9–31.
  13. Brlek, S.; Reutenauer, C. Complexity and palindromic defect of infinite words. Theor. Comput. Sci. 2011, 412, 493–497.
  14. Cederberg, J.N. A Course in Modern Geometries; Springer: New York, NY, USA, 2001.
  15. Grunbaum, B.; Shephard, G.C. Tilings and Patterns; W.H. Freeman and Company: New York, NY, USA, 1989.
  16. Yodogawa, E. Symmetropy an entropy-like measure of visual symmetry. Percept. Psychophys. 1982, 32, 230–240.
  17. Girault, J.-M.; Humeau-Heurtier, A. Centered and Averaged Fuzzy Entropy to Improve Fuzzy Entropy Precision. Entropy 2018, 20, 287.
  18. Kaspar, F.; Schuster, H.G. Easily Calculable Measure for the Complexity of Spatiotemporal Patterns. Phys. Rev. A 1987, 36, 843–848.
Figure 1. The seven types of friezes with a ⌜ pattern. The friezes 1, 2, 4, 5 and 6 can constitute periodic discrete sequences because no pattern appears with the same abscissa. This is not the case for friezes 3 and 7, which cannot constitute a discrete sequence. Among the five periodic sequences, friezes 2 and 6 are composed of palindromes.
Figure 2. Scalar palindromic descriptors obtained from binarized $f^{\beta}$ noises for different values of $\beta$. Left, Lempel–Ziv complexity $C_{lz}$, in which the boxes do not overlap for $-1.2 < \beta < -0.6$. Left middle, symmetropy $\sigma^{*}$, in which the boxes do not overlap for $-1.2 < \beta < -0.4$. Right middle, symmentropy $E$, in which the boxes do not overlap for $\beta > -1.2$. Right, quarte probability $P$ versus $\beta$. When $\beta = -2$, the quarte probability is $P = [0.50, 0.50, 0.00, 0.00]$. When $\beta = 0$, the quarte probability is $P = [0.25, 0.33, 0.17, 0.25]$ and the level of reflection symmetry is higher than that of translation and glide reflection, which is in turn higher than that of inversion: $\sigma_R^{*} > \sigma_T^{*} = \sigma_G^{*} > \sigma_I^{*}$. The closer $\beta$ is to zero, the higher the complexity. Both $C_{lz}$ and $E$ increase as the complexity increases, whereas $\sigma^{*}$ decreases.
Figure 3. Average palindromic vectors obtained for binarized $f^{\beta}$ noises of length 1000, with $\beta = -2.0, -1.0, 0.0$ (from top to bottom) and $m \in \{1:500\}$. Each panel shows the average of 300 vectors: top, $\beta = -2.0$; middle, $\beta = -1.0$; bottom, $\beta = 0.0$. The more irregular the sequence (strongly negative value of $\beta$), the larger the spread of the palindromic vector descriptors.
Figure 4. Average local symmentropy (over 300 trials) computed for three types of noise. Top, local symmentropy of a Brownian motion ($\beta = -2$). Middle, local symmentropy of a pink noise ($\beta = -1$). Bottom, local symmentropy of a white noise ($\beta = 0$). The sawtooth fluctuation arises because the symmentropy values differ slightly between even and odd palindromes. The local symmentropy condenses the information carried by the four palindromic vectors into a single vector.
Figure 5. Logarithm of the four average palindromic vectors computed from 100 iid (independent and identically distributed) binary sequences of 5000 bits. Even and odd palindromes are distributed differently. For even palindromes, the different types of symmetry are equally distributed. For odd palindromes, there are no palindromes of type ‘I’. The symmetry levels decrease as the size of the palindromes increases, and no palindrome exceeds size 40. Finally, on average, the proportions of palindromes are $P_T = 25\%$, $P_R = 33\%$, $P_I = 17\%$ and $P_G = 25\%$.
Figure 6. Logarithm of the palindromic vectors obtained from the entirety of the two DNA sequences for $m_{max} = 500$. Zoom for $m \in (1, 100)$. In green, logarithm of the palindromic vectors obtained after randomization of the DNA sequences.
Figure 7. Local symmentropies obtained from binarized DNA sequences in the scale range $m \in \{1, 60\}$. Top, odd palindromes. In blue, local symmentropy obtained from HUMHBB and its straight-line fit. In orange, local symmentropy obtained from YEAST1 and its straight-line fit. In magenta, local symmentropy obtained from randomized HUMHBB. In red, local symmentropy obtained from randomized YEAST1. The slope $\alpha_y$ derived from the linear fit for YEAST1 is 4.41 times the slope $\alpha_h$ obtained from HUMHBB. Bottom, even palindromes. The slope $\alpha_y$ derived from the linear fit for YEAST1 is 3.61 times the slope $\alpha_h$ obtained from HUMHBB.
Table 1. $Dict$, $d(m)$ and $v(m)$ calculated from the binary sequence $X = \{01101001\}$ composed of $M = 8$ bits. There are in total $\tilde{c} = 5$ sizes of palindromes $(0, 1, 2, 3, 4)$ derived from the dictionary and used in the binary sequence $X$. There are two palindromes of size 2, two palindromes of size 3, and two palindromes of size 4, so a total of $\tilde{\sigma} = 6 = 2 + 2 + 2$ palindromes composing the binary sequence.

| $m$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| $Dict$ | e | 0, 1 | 00, 11 | 101, 010 | 0110, 1001 | - | - | - | - |
| $d(m)$ | 1 | 2 | 2 | 2 | 2 | 0 | 0 | 0 | 0 |
| $v(m)$ | 8 | 8 | 2 | 2 | 2 | 0 | 0 | 0 | 0 |
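As a rough cross-check of Table 1, the sketch below counts the mirror (reflection) palindromes of $X$ with a sliding window. It assumes that $d(m)$ is the number of distinct palindromes of size $m$ and that $v(m)$ is their number of occurrences in $X$, which is how the table reads. The function name `palindromic_vectors` is illustrative, and the $m = 0$ column (empty word) is left out because its values are a convention rather than a count.

```python
def palindromic_vectors(x: str):
    """Return (dictionary, d, v) for mirror palindromes of sizes 1..len(x)."""
    M = len(x)
    dictionary, d, v = {}, {}, {}
    for m in range(1, M + 1):
        # All windows of size m, then those equal to their own reversal
        windows = [x[i:i + m] for i in range(M - m + 1)]
        hits = [w for w in windows if w == w[::-1]]
        dictionary[m] = sorted(set(hits))  # distinct palindromes of size m
        d[m] = len(dictionary[m])          # dictionary size
        v[m] = len(hits)                   # number of occurrences in x
    return dictionary, d, v


dictionary, d, v = palindromic_vectors("01101001")
print([d[m] for m in range(1, 9)])
print([v[m] for m in range(1, 9)])
# Non-trivial palindromes (size 2 and above) composing the sequence
print(sum(v[m] for m in range(2, 9)))
```

For this sequence the counts for $m \geq 1$ agree with the $d(m)$ and $v(m)$ rows of Table 1, and the last line gives the total of 6 non-trivial palindromes quoted in the caption.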
Table 2. $Dict_j$, $v_j(m)$, $q_j$ and $\epsilon(m)$ with $j \in \{T, R, I, G\}$ computed from the 8-bit binary sequence $X = \{01101001\}$. The non-trivial palindromic symmetropy is $\sigma^{*} = 0.44 = 1302/2940$ with $\sigma_T^{*} = 102/2940$, $\sigma_R^{*} = 214/2940$, $\sigma_I^{*} = 472/2940$ and $\sigma_G^{*} = 514/2940$, and the global palindromic symmentropy is $E = 0.89 = -\left[\frac{102}{1302}\log_4\left(\frac{102}{1302}\right) + \frac{214}{1302}\log_4\left(\frac{214}{1302}\right) + \frac{472}{1302}\log_4\left(\frac{472}{1302}\right) + \frac{514}{1302}\log_4\left(\frac{514}{1302}\right)\right]$.

| $m$ | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| $Dict_T$ | e | 0, 1 | 00, 11 | - | 1010 | - | - | - | - |
| $v_T(m)$ | 8 | 8 | 2 | 0 | 1 | 0 | 0 | 0 | 0 |
| $v_T^{*}(m)$ | - | - | 2/(2×7×7) | 0/(2×6×7) | 1/(2×5×7) | 0/(2×4×7) | 0/(2×3×7) | 0/(2×2×7) | 0/(2×1×7) |
| $q_T(m)$ | - | - | 2/14 | 0 | 1/6 | - | 0 | - | 0 |
| $Dict_R$ | e | 0, 1 | 00, 11 | 101, 010 | 0110, 1001 | - | - | - | - |
| $v_R(m)$ | 8 | 8 | 2 | 2 | 2 | 0 | 0 | 0 | 0 |
| $v_R^{*}(m)$ | - | - | 2/(2×7×7) | 2/(2×6×7) | 2/(2×5×7) | 0/(2×4×7) | 0/(2×3×7) | 0/(2×2×7) | 0/(2×1×7) |
| $q_R(m)$ | - | - | 2/14 | 1/2 | 2/6 | - | 0 | - | 0 |
| $Dict_I$ | e | 0, 1 | 01, 10 | - | 1010 | - | 110100 | - | 01101001 |
| $v_I(m)$ | 8 | 8 | 5 | 0 | 1 | 0 | 1 | 0 | 1 |
| $v_I^{*}(m)$ | - | - | 5/(2×7×7) | 0/(2×6×7) | 1/(2×5×7) | 0/(2×4×7) | 1/(2×3×7) | 0/(2×2×7) | 1/(2×1×7) |
| $q_I(m)$ | - | - | 5/14 | 0 | 1/6 | - | 1 | 0 | 1/2 |
| $Dict_G$ | e | 0, 1 | 01, 10 | 010, 101 | 0110, 1001 | - | - | - | 01101001 |
| $v_G(m)$ | 8 | 8 | 5 | 2 | 2 | 0 | 0 | 0 | 1 |
| $v_G^{*}(m)$ | - | - | 5/(2×7×7) | 2/(2×6×7) | 2/(2×5×7) | 0/(2×4×7) | 0/(2×3×7) | 0/(2×2×7) | 1/(2×1×7) |
| $q_G(m)$ | - | - | 5/14 | 1/2 | 2/6 | - | 0 | - | 1/2 |
| $\epsilon(m)$ | - | - | 0.93 | 0.50 | 0.96 | - | 0 | - | 0.50 |
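The arithmetic behind Table 2 can be retraced with exact fractions. The sketch below takes the occurrence counts $v_j(m)$ for $m = 2, \dots, 8$ directly from the table and assumes, from the denominators shown there, that each count is normalized by $2 \times (M - m + 1) \times (M - 1)$. The symmentropy is taken as a base-4 Shannon entropy of the four type proportions, as in the caption. Variable and function names are illustrative.

```python
from fractions import Fraction
from math import log

M = 8  # length of X = 01101001
sizes = range(2, M + 1)  # non-trivial palindrome sizes

# Occurrence counts v_j(m) for m = 2..8, copied from Table 2
v = {
    "T": [2, 0, 1, 0, 0, 0, 0],
    "R": [2, 2, 2, 0, 0, 0, 0],
    "I": [5, 0, 1, 0, 1, 0, 1],
    "G": [5, 2, 2, 0, 0, 0, 1],
}


def entropy4(probs):
    """Shannon entropy in base 4, skipping zero probabilities."""
    return sum(float(p) * log(1 / float(p), 4) for p in probs if p > 0)


# Normalization inferred from the table denominators (an assumption)
sigma = {
    j: sum(Fraction(c, 2 * (M - m + 1) * (M - 1)) for m, c in zip(sizes, v[j]))
    for j in v
}
sigma_total = sum(sigma.values())  # non-trivial symmetropy
E = entropy4([s / sigma_total for s in sigma.values()])  # global symmentropy

# Local symmentropy: entropy of the type proportions q_j(m) at each size m
epsilon = {}
for k, m in enumerate(sizes):
    total = sum(v[j][k] for j in v)
    if total:  # sizes without any palindrome stay undefined ("-" in the table)
        epsilon[m] = entropy4([Fraction(v[j][k], total) for j in v])

print(round(float(sigma_total), 2), round(E, 2))
print({m: round(e, 2) for m, e in epsilon.items()})
```

Under these assumptions the sums reproduce the per-type values $102/2940$, $214/2940$, $472/2940$ and $514/2940$, and the rounded results agree with $\sigma^{*} = 0.44$, $E = 0.89$ and the $\epsilon(m)$ row.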
Table 3. Distribution in % of the total number of palindromes of different types present in each of the two non-randomized and randomized DNA sequences, $m \in [1, 500]$. For the non-randomized sequences, the most frequent palindromes are reflection palindromes with $N_R > N_T > N_G > N_I$, while for the randomized sequences, the distribution is $N_R > N_T = N_G > N_I$. The distribution of the different types of palindromes is very similar regardless of the type of DNA sequence. The differences between the total number of palindromes from non-randomized and randomized HUMHBB and YEAST1 sequences are 496,028 − 441,299 = 54,729 and 1,463,633 − 1,384,396 = 79,237, respectively.

| DNA seq | $N_T / N_{Total}$ | $N_R / N_{Total}$ | $N_I / N_{Total}$ | $N_G / N_{Total}$ | $N_{Total}$ |
|---|---|---|---|---|---|
| HUMHBB | 29.5% | 36.8% | 13.5% | 20.2% | 496,028 |
| randomized HUMHBB | 24.9% | 33.3% | 16.7% | 25.1% | 441,299 |
| YEAST1 | 27.9% | 35.5% | 14.5% | 22.1% | 1,463,633 |
| randomized YEAST1 | 25.0% | 33.3% | 16.7% | 25.0% | 1,384,396 |
Table 4. Scalar palindromic descriptors of binarized DNA sequences: Lempel–Ziv complexity $C_{lz}$, symmentropy $E$ and symmetropy $\sigma^{*}$ with $m \in \{0, 500\}$. From the scalar palindromic descriptors, it seems possible to differentiate the two DNA sequences. The values of Lempel–Ziv complexity and symmentropy are close to unity, indicating a high level of complexity. For randomized DNA sequences, Lempel–Ziv complexity and symmentropy tend toward unity.

| DNA seq | $C_{lz}$ | $E$ | $100 \times \sigma^{*}$ |
|---|---|---|---|
| HUMHBB | 0.94 | 0.96 | 0.85 |
| randomized HUMHBB | 1.02 | 0.98 | 0.75 |
| YEAST1 | 0.98 | 0.97 | 0.80 |
| randomized YEAST1 | 1.01 | 0.98 | 0.75 |
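For readers who want to reproduce the order of magnitude of the $C_{lz}$ column, here is a minimal sketch of a Lempel–Ziv (1976) phrase count for a binary string. It is not the authors' code: the exhaustive parsing rule and the normalization by $n / \log_2 n$ are assumptions (the normalization is a common choice that gives values near 1 for random sequences), and the function names are hypothetical.

```python
from math import log2


def lz76_phrases(s: str) -> int:
    """Number of phrases in the Lempel-Ziv (1976) exhaustive parsing of s."""
    n, i, count = len(s), 0, 0
    while i < n:
        length = 1
        # Grow the candidate phrase while it can still be copied from the
        # text that precedes its last symbol
        while i + length <= n and s[i:i + length] in s[:i + length - 1]:
            length += 1
        count += 1   # the shortest non-reproducible extension closes a phrase
        i += length  # a trailing, fully reproducible remainder counts once
    return count


def lz_complexity(s: str) -> float:
    """Phrase count normalized by n / log2(n) (assumed normalization, n > 1)."""
    n = len(s)
    return lz76_phrases(s) * log2(n) / n


# Classic worked example: parsed as 0 . 001 . 10 . 100 . 1000 . 101
print(lz76_phrases("0001101001000101"))
print(lz_complexity("01101001"))
```

The substring search makes this sketch quadratic or worse in the sequence length, so it is only meant for short sequences; faster scanning schemes exist for sequences the size of HUMHBB or YEAST1.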
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Girault, J.-M.; Ménigot, S. Palindromic Vectors, Symmetropy and Symmentropy as Symmetry Descriptors of Binary Data. Entropy 2022, 24, 82. https://doi.org/10.3390/e24010082
