Article

An Efficient Coding Technique for Stochastic Processes

by
Jesús E. García
1,*,
Verónica A. González-López
1,
Gustavo H. Tasca
1 and
Karina Y. Yaginuma
2
1
Department of Statistics, University of Campinas, Rua Sérgio Buarque de Holanda, 651, Campinas, São Paulo 13083-859, Brazil
2
Department of Statistics, Fluminense Federal University, Rua Professor Marcos Waldemar de Freitas Reis, s/n, Niterói, Rio de Janeiro 24210-201, Brazil
*
Author to whom correspondence should be addressed.
Entropy 2022, 24(1), 65; https://doi.org/10.3390/e24010065
Submission received: 30 October 2021 / Revised: 18 December 2021 / Accepted: 25 December 2021 / Published: 30 December 2021
(This article belongs to the Section Information Theory, Probability and Statistics)

Abstract

In the framework of coding theory, under the assumption of a Markov process (X_t) on a finite alphabet A, the compressed representation of the data is composed of a description of the model used to code the data and of the encoded data. Given the model, Huffman's algorithm is optimal for the number of bits needed to encode the data. On the other hand, modeling (X_t) through a Partition Markov Model (PMM) promotes a reduction in the number of transition probabilities needed to define the model. This paper shows how the use of the Huffman code with a PMM reduces the number of bits needed in this process. We prove that the estimation of a PMM allows for estimating the entropy of (X_t), providing an estimator of the minimum expected codeword length per symbol. We show the efficiency of the new methodology in a simulation study and in a real problem of compression of DNA sequences of SARS-CoV-2, obtaining in the real data a reduction of at least 10.4%.

1. Introduction

The research and development of data compression methods based on stochastic processes considers the frequencies of occurrence of the elements of the underlying alphabet to produce a code. Data compression methods are designed to transmit and store data, see [1]. The Huffman coding method is an icon in this field, since it is optimal for compressing genetic data, text files, etc. Due to the speed with which knowledge about DNA-related structures evolves, the volume of information has grown dramatically, making it impossible to cover all the information on relevant phenomena without a strategy for compressing the large number of sequences currently available in databases, such as the free source The National Center for Biotechnology Information Advances Science and Health (NCBI, accessed on 24 December 2021, https://www.ncbi.nlm.nih.gov/).
The data compression process built by Huffman coding incorporates the use of probabilities to produce an optimal code. This article seeks to improve the Huffman code’s performance regarding (i) how many probabilities are needed and (ii) how the probabilities are estimated. Obtaining an efficient compression method requires a deep understanding of the behavior of the database. Then, the representation that promotes compression must be generated from an appropriate stochastic model that accurately represents the database. Such notions run in parallel, since they appeal to the minimum information necessary to retrieve a message or to the minimum description of a model necessary to describe the behavior of a database. While in the field of compression the goal is to use a reduced number of bits to encode the message, in modeling the goal is to identify an accurate model for the representation of the message. Then, we can think about the premise of identifying an optimal method for compression that uses an optimal stochastic model to produce the data compression. A series of aspects arises when classifying a stochastic model as optimal: plausibility, complexity, etc. In this paper, we consider these aspects and treat them from the perspective of model selection and data compression. This problem is also related to the Minimum Description Length principle of model selection (see [2]). This principle establishes that the best model for a given dataset is the one that produces the shortest encoding when considering the encoding of the model plus the encoding of the dataset using the model.
A message or a DNA sequence, among other data structures, can be seen as a realization of a stochastic process, subject to a certain alphabet and with a certain memory. For example, in [3,4], it is shown that the genetic structures of SARS-CoV-2, in FASTA format, can be modeled as a Markovian process. In this case, a model of partitions with interstice was used, which is a variant of the model initially proposed in [5] (and investigated in [6]). In those cases, the alphabet is composed of bases, A = {a, c, g, t}, where the bases are adenine (a), thymine (t), guanine (g), and cytosine (c). The dependence structure that exists in genetic sequences is described by (i) a representation of the state space, obtained by identifying genetic units, which are groups of states that form a partition of the whole state space, and (ii) the set of transition probabilities from each genetic unit to each base of A.
Throughout this paper, we assume that the true model generating the message is unknown and that it will be represented by a Markov chain. According to the Minimum Description Length principle, it is necessary to identify a parsimonious model to represent the Markov chain. The goal of this paper is to obtain an improvement in the efficiency of Huffman coding for Markov chains. Huffman’s algorithm needs to build a code for each transition probability function of the Markov process. The Partition Markov models ([6]) reduce the total number of transition probabilities needed to describe the Markov process. Then, using partition Markov models together with the Huffman code produces a more efficient description of the model. The motivation is that we can consistently estimate the Partition Markov model to efficiently describe the message generation mechanism. We also look for an estimate of the minimum expected codeword length per symbol, which allows us to validate the combined use of Huffman coding and partition Markov models as a cost-effective strategy.
The content of this article is as follows: Section 2 introduces the concepts and classification of coding systems. Section 3 gives the main notions and the consistent estimation of the stochastic strategy that we propose as an option to improve the performance of the Huffman coding process. In addition, in this section, we show results establishing the relation between an optimal coding system and the estimation of a stochastic process. Section 4 presents the stages of a message transmission process under the new strategy introduced in the previous section. Section 5 shows the advantages and disadvantages of the new strategy, compared with the application of the traditional Huffman code, and concludes with the application of the compression system to three SARS-CoV-2 sequences. The conclusions and details of the contributions of this paper follow in Section 6.

2. Theoretical Background and Optimal Codes

Among the qualities expected of a coding method, we note that an economy in the length of the coded message and an instantaneous method for decoding are desired, so that it is not necessary to receive the entire message to start the decoding process. Another aspect of coding is the complexity of the elements used for such a message since, if the complexity is high, the procedure can require considerable computational and memory resources. Coding theory, see [7], organizes the problem by giving specific attributes to the encoding function; these attributes are introduced below for a random variable and then easily extended to stochastic processes.
Definition 1.
Let X be a random variable taking values in A, and let D* be the set of finite concatenations of elements of {0, 1, 2, …, D − 1}. The source code of X, denoted by C, is a function C: A → D* such that, for each x ∈ A, C(x) is the codeword of x. In addition, l(x) denotes the length of the codeword C(x).
Definition 2.
Let (X_1, …, X_b) be a size b random vector taking values in A^b. A block source code of (X_1, …, X_b) (with block size b), denoted by C_b, is a function C_b: A^b → D* such that, for each (x_1, …, x_b) ∈ A^b, C_b(x_1, …, x_b) is the codeword of (x_1, …, x_b), and l_b(x_1, …, x_b) is the codeword length for the block x_1^b.
For simplicity, in this section, all the properties and algorithms will be given for a source code of a random variable. However, they are valid or trivially extended to the block source code of a random vector as in Definition 2.
For the case D = 2 (see Definition 1), l(x) records the number of bits used to represent C(x). Since X is a random variable, it is possible to determine the expected length of the code C. Moreover, since we wish to use the fewest elements to encode a message, it is necessary to adopt the criterion given by the next definition.
Definition 3.
The expected length of the source code C for a random variable X with probability mass function p(x), x ∈ A, is given by
E(l(X)) = Σ_{x∈A} p(x) l(x).
As expected, we want a code that provides a minimum expected length E(l(X)).
Necessarily, a code C that does not lead to ambiguity is one that verifies the following definition; such a code is called nonsingular.
Definition 4.
A source code C, under Definition 1, is a nonsingular code if, given x, x′ ∈ A, x ≠ x′ ⇒ C(x) ≠ C(x′).
Sending sequences of realizations of X imposes certain conditions on a code composed by the concatenation of the codes of the values of a random variable.
Definition 5.
Given a source code under Definition 1 and a sequence of realizations of X, say x_1, x_2, …, x_n, the extension of C on the sequence is C(x_1 x_2 ⋯ x_n) = C(x_1)C(x_2)⋯C(x_n), the right-hand side being the concatenation of the codes C(x_1), C(x_2), …, C(x_n).
Under Definition 5, we set the conditions needed to consider a source code ideal.
Definition 6.
Under Definition 1, the code C is said to be uniquely decodable if its extension, given by Definition 5, is nonsingular (see Definition 4).
As we can see, a code following Definition 6 can be ideal, but such a code is not necessarily instantaneous; it may be necessary to read the whole coded message before starting to decode it. This brings us to the notion of a prefix (or instantaneous) code.
Definition 7.
Under Definition 1, the code C is said to be a prefix code if no codeword is a prefix of any other codeword.
According to the various classifications attributed to source codes, the prefix codes form the strictest class (see [7]); in particular, this class verifies Definition 6. An example is the Huffman code, as shown in [7]. Furthermore, prefix codes are related to Kraft’s inequality, via Theorem 5.2.1 (see [7]): a prefix code (under Definition 1) is such that Σ_{x∈A} D^{−l(x)} ≤ 1. That is, there is a conditioning on the possible values of l(x), x ∈ A. Thus, Kraft’s inequality characterizes the prefix condition. Finding the ideal code (which must be a prefix code) is related to the minimization of
Σ_{x∈A} p(x) l(x),
under the constraint
Σ_{x∈A} D^{−l(x)} ≤ 1 (Kraft’s inequality);
the solution is given by l(x) = −log_D p(x), which may be a non-integer solution; thereby, the optimal size will be approximated. The first approximation is to take, for each codeword, the smallest integer no less than −log_D p(x), say
l*(x) = ⌈−log_D p(x)⌉.    (3)
Consider the entropy of X,
H_D(X) = −Σ_{x∈A} p(x) log_D(p(x)).
If E(l*(X)) is the expected length of a code following Equation (3), it verifies Kraft’s inequality, guaranteeing H_D(X) ≤ E(l*(X)) ≤ H_D(X) + 1, and Theorem 5.2.1 (from [7]) ensures that there is an instantaneous code with these codeword lengths.
Any instantaneous code of expected length E(l(X)) satisfies H_D(X) ≤ E(l(X)), as demonstrated by Theorem 5.3.1 (from [7]). This includes the optimal code, with expected length E(l_opt(X)), which, in the set of instantaneous codes, verifies E(l_opt(X)) ≤ E(l(X)) for all E(l(X)) coming from an instantaneous code, while, by Theorem 5.4.1 (from [7]),
H_D(X) ≤ E(l_opt(X)) ≤ H_D(X) + 1,    (5)
where E(l_opt(X)) = Σ_x l_opt(x) p(x).
For a given distribution, an optimal prefix source code can be constructed using the Huffman algorithm, reproduced here as Algorithm 1; see also [8]. For simplicity, the algorithm is presented for a variable taking values from 1 to m, A = {1, 2, …, m}, with D = 2; then, the code is composed by concatenation of elements in {0, 1}. These impositions can be easily eliminated in practice.
Algorithm 1: Huffman Code
Input: P = {p(1), p(2), …, p(m)}, G = {1, 2, …, m}
 1  If |P| = 1, C(1) := 0
 2  Otherwise, while |P| > 1
   –   define
      ∗   I := argmin{p(i) ∈ P : i ∈ G}
      ∗   J := argmin{p(i) ∈ P ∖ {p(k) : k ∈ I} : i ∈ G}
      ∗   H := I ∪ J
   –   If |H| = 2, set l(h) := 0 ∀ h ∈ H, C(I) := 0 and C(J) := 1
   –   Otherwise, if h ∈ I, C(h) := 0C(h) and, if h ∈ J, C(h) := 1C(h).
   –   Set I := H, P′ := {p(k) : k ∈ I}, p(I) := Σ_{k∈I} p(k), G′ := {k : k ∈ I}
   –   Define P := (P ∖ P′) ∪ {p(I)}, G := (G ∖ G′) ∪ {I} and return to 1.
Output: {C(1), C(2), …, C(m)}
In Algorithm 1, |M| denotes the cardinality of the set M. The canonical Huffman code (CHC) is a modification of the Huffman code (Algorithm 1) such that the only information needed to decode it is the set of codelengths.
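To make Algorithm 1 concrete, the following Python sketch builds a binary Huffman code with a standard heap-based formulation of the same merging idea; the function name huffman_code and its dictionary interface are our own choices, and ties may be resolved differently than in Algorithm 1, which can yield a different (but equally optimal) set of codewords.

```python
import heapq

def huffman_code(probs):
    """Binary (D = 2) Huffman code for a dict {symbol: probability}.

    At each step, the two least probable groups I and J are merged; the
    symbols in I receive a leading '0' and those in J a leading '1',
    mirroring the merging step of Algorithm 1.
    """
    if len(probs) == 1:
        return {next(iter(probs)): "0"}
    # Heap entries: (group probability, tie-breaker, symbols in the group).
    heap = [(p, i, [x]) for i, (x, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    code = {x: "" for x in probs}
    tie = len(heap)
    while len(heap) > 1:
        p_i, _, group_i = heapq.heappop(heap)   # least probable group, I
        p_j, _, group_j = heapq.heappop(heap)   # next least probable group, J
        for x in group_i:
            code[x] = "0" + code[x]
        for x in group_j:
            code[x] = "1" + code[x]
        heapq.heappush(heap, (p_i + p_j, tie, group_i + group_j))
        tie += 1
    return code
```

For instance, huffman_code({'a': 1/11, 'b': 4/11, 'c': 3/11, 'd': 2/11, 'e': 1/11}) produces a prefix code whose expected length is 24/11 bits, the same expected length as the code obtained in Example 1 below, even though the individual codewords may differ because of how ties are broken.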
Consider, again, the random variable X taking values in the alphabet A = {1, 2, …, m}. Assume that the probabilities are nonincreasing, such that, if i < j, then P(X = i) ≥ P(X = j). This assumption can be made without loss of generality because, if it does not hold for X, there is a permutation π of A such that Y = π(X) has nonincreasing probabilities.
Let {C(i), i ∈ A} be a Huffman code for X, with codelengths l_i = l(C(i)), i = 1, …, m. Letting 𝒞 denote the corresponding CHC, 𝒞 is defined from {l_i}, i = 1, …, m, in the following way:
(a)
For i = 1, 𝒞(i) is the concatenation of l_1 zeroes.
(b)
For j = i + 1, assign 𝒞(j) as the next (larger) binary number. Do this until the first k ∈ A such that the codelength increases, and append zeros to the right of 𝒞(k) until l(𝒞(k)) = l_k.
(c)
Repeat (b) until i = m.
For more information about the canonical Huffman code, see Section 5.8 in [7].
The following example illustrates the stages followed for the construction of the codes.
Example 1.
Consider X a random variable with five values, listed in the alphabet A = {a, b, c, d, e}, with probabilities p(a) = 1/11 = p(e), p(b) = 4/11, p(c) = 3/11, p(d) = 2/11. We describe below the stages (i–iv) through which the coding process goes (using Algorithm 1).
i.
P = {p(a) = 1/11, p(b) = 4/11, p(c) = 3/11, p(d) = 2/11, p(e) = 1/11}, G = {a, b, c, d, e}
I = a, J = e, H = {a, e}, and C(a) = 0, C(e) = 1
I := {a, e}, p(I) = 2/11, G′ = {a, e}, P = {p(I) = 2/11, p(b) = 4/11, p(c) = 3/11, p(d) = 2/11}, G = {{a, e}, b, c, d}
ii.
P = {p({a, e}) = 2/11, p(b) = 4/11, p(c) = 3/11, p(d) = 2/11}, G = {{a, e}, b, c, d}
I = {a, e}, J = d, H = {{a, e}, d}, and C(a) = 00, C(e) = 01, C(d) = 1
I := {a, e, d}, p(I) = 4/11, G′ = {a, e, d}, P = {p(I) = 4/11, p(b) = 4/11, p(c) = 3/11}, G = {{a, e, d}, b, c}
iii.
P = {p({a, e, d}) = 4/11, p(b) = 4/11, p(c) = 3/11}, G = {{a, e, d}, b, c}
I = c, J = {a, e, d}, H = {c, {a, e, d}}, and C(c) = 0, C(a) = 100, C(e) = 101, C(d) = 11
I := {a, e, d, c}, p(I) = 7/11, G′ = {a, e, d, c}, P = {p(I) = 7/11, p(b) = 4/11}, G = {{a, e, d, c}, b}
iv.
P = {p({a, e, d, c}) = 7/11, p(b) = 4/11}, G = {{a, e, d, c}, b}
I = b, J = {a, e, d, c}, H = {b, {a, e, d, c}}.
Obtaining the following Huffman code for X:
C(b) = 0, C(c) = 10, C(a) = 1100, C(e) = 1101, C(d) = 111.
To find the canonical Huffman code, we need to re-code X (column 1 of Table 1) so that the probabilities are not increasing in lexicographic order, obtaining Y . The Huffman code for Y and the corresponding codelengths are obtained directly from C (see columns 5 and 6 of Table 1).
From the codelengths of C , we obtain the CHC, see column 7 of Table 1. Observe that the Huffman code and the canonical Huffman code share the same codelengths.
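As an illustration of steps (a)–(c), here is a small Python sketch (names our own) that produces the canonical codewords from a list of codelengths sorted in nondecreasing order, i.e., corresponding to nonincreasing probabilities as assumed above.

```python
def canonical_huffman(lengths):
    """Canonical Huffman codewords from sorted codelengths l_1 <= ... <= l_m.

    Step (a): the first codeword is l_1 zeros.  Step (b): each subsequent
    codeword is the previous one plus one, read as a binary number, with
    zeros appended whenever the codelength increases.  Step (c): repeat.
    """
    value = 0
    prev_len = lengths[0]
    codes = ["0" * prev_len]
    for l in lengths[1:]:
        value = (value + 1) << (l - prev_len)   # next binary number, padded
        codes.append(format(value, "0{}b".format(l)))
        prev_len = l
    return codes
```

For the codelengths (1, 2, 3, 4, 4) obtained in Example 1, canonical_huffman([1, 2, 3, 4, 4]) returns ['0', '10', '110', '1110', '1111'], a prefix code with the same codelengths as the Huffman code C.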
The algorithm strategy is optimal since, according to Theorem 5.8.1 (see [7]), if C is the Huffman code, or the CHC, and Ĉ is another code that verifies Definition 6 (uniquely decodable),
E(l_C(X)) ≤ E(l_Ĉ(X)),
where l_C and l_Ĉ are the lengths of C and Ĉ, respectively. That is, C offers the smallest expected size in the class of decodable codes. This means that elements used less frequently in the message are assigned a greater number of bits than those used more frequently. It should be noted that optimality is not a guarantee of uniqueness; what Theorem 5.8.1 of [7] tells us is the first condition and not the second one.
In order to extend the idea to stochastic processes, we first introduce the appropriate notation and, after that, we adapt the coding process. Let (X_t) be a discrete time, order o Markov chain on a finite alphabet A, with o < ∞. S = A^o is the state space of the process (X_t). Denote the string a_k a_{k+1} ⋯ a_n by a_k^n, where a_i ∈ A, i: k ≤ i ≤ n. In this context, the elements that define the process are the transition probabilities: for each a ∈ A and s ∈ S, define P(a|s) = Prob(X_t = a | X_{t−o}^{t−1} = s). Then, we have a set of probabilities to be identified, say
{P(a|s), a ∈ A, s ∈ S}.
According to this new context, it is necessary to implement the coding for each s ∈ S, as we show in the next example.
Example 2.
Consider a process with memory o = 1 and alphabet A = { a , b , c } , with transition probabilities given by Table 2 (left):
Suppose that the message to be coded is ‘ c a a b a c b a ’. As the process has memory o = 1 , we do not have (transition) probabilities to use the canonical Huffman code on the first o elements. The first o elements are codified with the binary number corresponding to their lexicographic positions in the alphabet. In this case, o = 1 and the first letter in the sequence is “c”, which corresponds to the third letter in the alphabet; as a result, the first element in the sequence has code 11 corresponding to 3 in binary.
The result is the string 11 10 11 10 10 0 11 10, which follows the strategy of reading the string c a a b a c b a from left to right: c is coded by 11 (3 in binary); c a means that c is followed by a, so, using Table 2, a is coded by 10; a a means that a is followed by a, so a is coded by 11, etc. To decode the sequence 11 10 11 10 10 0 11 10, we need to consider the order o = 1; then, the first element is decoded as c (third letter of the alphabet); now, we use the conditional line associated with c given in Table 2, so, if the symbol to decode has code 10 and previously we had c, then the original letter in the second position was a. In this stage, we use the conditional line associated with a; then, because the next symbol is 11, the letter with that conditional code is a, and so on.
The example shows that the instantaneous coding is applicable to each conditional space of the process (X_t). That is, the process’ code is {C(x|s), s ∈ S, x ∈ A}, a collection of conditional codes. In addition, {l(x|s), s ∈ S, x ∈ A} is the collection of conditional lengths for the set of codes. That is, to identify the conditional codes C(a|a), C(b|a), and C(c|a), we consider the transition probabilities P(a|a) = 1/6, P(b|a) = 2/6, and P(c|a) = 3/6 and, proceeding as in the previous example, we obtain C(a|a) = 00, C(b|a) = 01, and C(c|a) = 1. Furthermore, this example shows how the initial part of the message, which has not yet reached the minimum memory of the process, can be treated. There are other ways to deal with the initial elements of a message, not necessarily like the one shown in the example, but such a question does not impact the contributions of the article. The procedure encodes one element at a time, with an optimal code depending on the past. It can easily be extended to b consecutive elements of the sample at a time, with an optimal Huffman code depending on the past. To calculate the conditional Huffman code for a block of b symbols, we first need to calculate the conditional transition probabilities for sequences of b letters of the alphabet.
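Before moving to blocks, the per-state coding of Example 2 can be sketched in a few lines of Python; the function below is our own illustration, it assumes the conditional codes C(·|s) have already been built (for instance, with a Huffman routine applied to each row of transition probabilities), and it sends the first o symbols as their 1-based position in the alphabet written with a fixed ⌈log_2 |A|⌉ bits, which is consistent with ‘c’ being sent as 11 in Example 2.

```python
import math

def encode_with_state_codes(seq, alphabet, cond_codes, o=1):
    """Encode seq using conditional codes {state string: {symbol: codeword}}.

    The first o symbols are sent as fixed-width binary positions in the
    alphabet; every later symbol x_t is encoded with the code of its state
    s = x_{t-o} ... x_{t-1}.
    """
    width = math.ceil(math.log2(len(alphabet)))
    pieces = [format(alphabet.index(x) + 1, "0{}b".format(width)) for x in seq[:o]]
    for t in range(o, len(seq)):
        state = seq[t - o:t]
        pieces.append(cond_codes[state][seq[t]])
    return " ".join(pieces)
```

With conditional codes consistent with the encoded string of Example 2 (for instance, C(a|c) = 10, as can be read off that string), the call encode_with_state_codes('caabacba', 'abc', cond_codes, o=1) reproduces the string 11 10 11 10 10 0 11 10.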
Let (X_t) be a discrete time, order o Markov chain on a finite alphabet A and state space S = A^o, with transition probabilities {P(x|s), x ∈ A, s ∈ S}. Consider now x_1^b ∈ A^b and the transition probability P(x_1^b|s) of the block x_1^b given the state s ∈ S. This probability can be calculated from the set of transition probabilities {P(x|s), x ∈ A, s ∈ S} in the following way:
P(x_1^b|s) = ∏_{i=1}^{b} P(x_i | (s x_1^b)_i^{i+o−1}),
where (s x_1^b)_k^j, k ≤ j, is the concatenation of the elements in positions k, k + 1, …, j of the sequence resulting from the concatenation of s and x_1^b. This is done for each x_1^b ∈ A^b and s ∈ A^o to obtain {P(x_1^b|s), x_1^b ∈ A^b}. After that, the set of transition probabilities for size b blocks is used to calculate {C_b(·|s), s ∈ A^o}, the conditional block Huffman code for each state s ∈ A^o, applying the Huffman algorithm to the set of probabilities {P(x_1^b|s), x_1^b ∈ A^b}.
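A direct way to compute these block probabilities is sketched below (our own illustration); trans is assumed to map (state string, symbol) pairs to the one-step transition probabilities P(x|s).

```python
from itertools import product

def block_transition_probs(trans, alphabet, s, b):
    """{x_1^b: P(x_1^b | s)} for all blocks of length b, given the state s.

    Implements P(x_1^b | s) = prod_{i=1}^{b} P(x_i | (s x_1^b)_i^{i+o-1}),
    where o = len(s).
    """
    o = len(s)
    probs = {}
    for block in product(alphabet, repeat=b):
        seq = s + "".join(block)                    # the concatenation s x_1^b
        p = 1.0
        for i in range(b):
            p *= trans[(seq[i:i + o], block[i])]    # state preceding x_{i+1}
        probs["".join(block)] = p
    return probs
```

Feeding the resulting dictionary to a Huffman routine (such as the huffman_code sketch above) yields the conditional block code C_b(·|s) for the state s.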
The Huffman code for blocks is relevant since Theorem 5.4.2 ([7]) claims that the expected code length per symbol, for a size b block Huffman code, converges to the entropy of the process, which is the theoretical minimal code length per symbol. For more information, see [9,10,11], and Section 5 in [7]. In our simulations, we will use blocks of size 4 to encode the data.
In the next section, we present the strategy for estimating transition probabilities, in a real message or data set. We also present the Partition Markov Model (PMM) that allows us to design the proposal for reducing the number of bits in the transmission of a message, using canonical Huffman encoding.

3. Stochastic Estimation Strategy and Optimal Coding

In practice, we need to estimate the transition probabilities; then, let x_1^n be a sample of the process (X_t), s ∈ S, a ∈ A and n > o. We denote by N_n(s, a) the number of occurrences of the state s followed by a in the sample x_1^n, that is, N_n(s, a) = |{t : o < t ≤ n, x_{t−o}^{t−1} = s, x_t = a}|. In addition, the number of occurrences of s in the sample x_1^n is denoted by N_n(s), with N_n(s) = |{t : o < t ≤ n, x_{t−o}^{t−1} = s}|. Thus, for each a ∈ A and s ∈ S, N_n(s, a)/N_n(s) is the estimator of P(a|s).
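These counts are straightforward to obtain from a sample; a minimal sketch (names our own) is given below.

```python
from collections import defaultdict

def transition_counts(x, o):
    """N_n(s, a) and N_n(s) for every state s of length o in the sample string x."""
    N_sa = defaultdict(int)
    N_s = defaultdict(int)
    for t in range(o, len(x)):      # 0-based positions, i.e., o < t <= n in the text
        s, a = x[t - o:t], x[t]
        N_sa[(s, a)] += 1
        N_s[s] += 1
    return N_sa, N_s
```

The estimator of P(a|s) is then N_sa[(s, a)] / N_s[s] whenever N_s[s] > 0.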
A reliable estimation of the probabilities allows a precise coding, since the proposal of Algorithm 1 depends on an adequate quantification of the probabilities to develop the method of attribution of codes in an accurate way. In search of such precision, the idea of partitioning the state space associated with a stochastic process is incorporated. The proposal aims to use different states, all of them elements of one specific part of that partition, to estimate the same probability. Let ℒ = {L_1, L_2, …, L_{|ℒ|}} be a partition of S; for a ∈ A and L ∈ ℒ, define P(L, a) = Σ_{s∈L} Prob(X_{t−o}^{t−1} = s, X_t = a) and P(L) = Σ_{s∈L} Prob(X_{t−o}^{t−1} = s). If P(L) > 0, we define P(a|L) = P(L, a)/P(L). Then, ∀ a ∈ A, P(a|L) = P(a|s) ∀ s ∈ L, meaning that we use all the states s ∈ L to estimate the same parameter. However, which is the strategic partition? We formalize the notion in the next definition.
Definition 8.
Let (X_t) be a discrete time, order o (o < ∞) Markov chain on a finite alphabet A, with state space S = A^o.
1.
s, r ∈ S are equivalent if P(a|s) = P(a|r) ∀ a ∈ A.
2.
(X_t) is a Markov chain with partition ℒ = {L_1, L_2, …, L_{|ℒ|}} if this partition is the one defined by the equivalence relationship introduced in item 1.
This way of representing a stochastic process (X_t) is called a Partition Markov Model.
Given Definition 8, it is clear that there will be a reduction in the number of probabilities necessary to describe the Markov chain in comparison with the full Markov chain; that is to say, the cardinality |{P(a|L), a ∈ A, L ∈ ℒ}| is smaller than or equal to the cardinality |{P(a|s), a ∈ A, s ∈ S}|. In order to obtain {P(a|L), a ∈ A, L ∈ ℒ}, it is first necessary to identify ℒ, which leads us to the estimation process detailed next. Once the structure ℒ is identified, we can calculate the probabilities of the parts L ∈ ℒ, using them in Algorithm 1 for encoding the message.
In a similar way to the one mentioned at the beginning of this section, the number of occurrences of elements of L followed by a and the total number of occurrences of elements of L are given by N_n(L, a) = Σ_{s∈L} N_n(s, a) and N_n(L) = Σ_{s∈L} N_n(s), L ∈ ℒ. Then, the estimation of P(a|L) is given by N_n(L, a)/N_n(L), which shows the benefit, in terms of precision, of calculating the probability in this way.
The model given by Definition 8 is studied in [6]. The challenge is to find a way to produce a strongly consistent estimation of ℒ. In that paper, the proposal is to use the Bayesian Information Criterion (BIC), see [12], to achieve this goal. The BIC is formulated as follows:
BIC(ℒ, x_1^n) = ln(ML(ℒ, x_1^n)) − ((|A| − 1)|ℒ|/2) ln(n),    (6)
with ML(ℒ, x_1^n) = ∏_{L∈ℒ, a∈A} (N_n(L, a)/N_n(L))^{N_n(L, a)}, for N_n(L) ≠ 0, L ∈ ℒ. ML(ℒ, x_1^n) is the pseudo maximum likelihood given a partition ℒ with sample x_1^n. The logarithm (ln) of the pseudo maximum likelihood term can also be written as
ln(∏_{L∈ℒ, a∈A} (N_n(L, a)/N_n(L))^{N_n(L, a)}) = Σ_{L∈ℒ, a∈A} ln((N_n(L, a)/N_n(L))^{N_n(L, a)}).
It is shown in [6] that, by maximizing the BIC, it is possible to obtain a strongly consistent estimate of the partition ℒ. In practice, this maximization is attainable because of a characteristic of the BIC criterion that allows the definition of a BIC-based metric between candidate parts (see [6]). The metric is used in a clustering algorithm, like the one proposed in [5], to produce a consistent estimation of ℒ. It is relevant to note that the BIC criterion also allows consistent estimation in other model classes in the area of inference for processes, see [13,14].
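The clustering algorithm of [6] is beyond the scope of this sketch, but the BIC value of Equation (6) for a candidate partition is easy to compute from the counts; the Python code below (our own naming, reusing the transition_counts sketch above) represents a partition as a list of lists of states.

```python
import math

def bic(partition, N_sa, N_s, alphabet, n):
    """BIC(L, x_1^n) of Equation (6) for a partition given as a list of parts."""
    log_ml = 0.0
    for part in partition:
        N_L = sum(N_s[s] for s in part)
        for a in alphabet:
            N_La = sum(N_sa.get((s, a), 0) for s in part)
            if N_La > 0:
                log_ml += N_La * math.log(N_La / N_L)   # ln of the pseudo likelihood
    penalty = 0.5 * (len(alphabet) - 1) * len(partition) * math.log(n)
    return log_ml - penalty
```

Maximizing this value over candidate partitions, as done in [6], yields the strongly consistent estimate of ℒ mentioned above.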
Remark 1.
The partition estimation problem studied in [6] is beyond the scope of this paper, but it is important to note that the estimation algorithms require computing the frequency ratios
N_n(s, a)/N_n(s), a ∈ A, s ∈ S.
The number of frequency ratios, (|A| − 1)|S|, should be much smaller than n to have confidence in the calculations. In practice, this restricts the maximal memory o admissible for the model used to compress the data, o < log_{|A|}(n) − 1. In Section 5 (situation 6), with n = 1000, we see an example of a model too large for the sample size.
Now, and within the framework of stochastic processes, we establish the link between the length of a code and the entropy of the process, as established by Equation (5), for random variables.
The notion of entropy plays a fundamental role in the theory of codes and, as we will show below, it allows us to identify the expected codeword length per symbol of a message. For a stochastic process ( X t ) , the entropy is given by
H_D(X_t) = lim_{n→∞} H_D(X_1, X_2, …, X_n)/n.
Using some basic properties of entropy, see [7], it is possible to prove the following result. The result allows for identifying the entropy of the process, under the partition structure given by Definition 8.
Theorem 1.
Let (X_t) be a discrete time, order o (o < ∞) Markov chain on a finite alphabet A, with state space S = A^o and partition ℒ following Definition 8. Then,
H_D(X_t) = −Σ_{L∈ℒ} P(L) Σ_{a∈A} P(a|L) log_D(P(a|L)).
Proof. 
By Theorem 4.2.1 in [7],
H_D(X_t) = lim_{n→∞} H_D(X_n | X_1, X_2, …, X_{n−1})    (8)
= lim_{n→∞} H_D(X_n | X_{n−o}, X_{n−o+1}, …, X_{n−1})    (9)
= H_D(X_{o+1} | X_1^o)    (10)
= Σ_{s∈A^o} P(s) H_D(X_{o+1} | X_1^o = s)    (11)
= −Σ_{s∈A^o} P(s) Σ_{a∈A} P(a|s) log_D(P(a|s))
= −Σ_{L∈ℒ} P(L) Σ_{a∈A} P(a|L) log_D(P(a|L)).
(8) follows from Theorem 4.2.1 in [7]; (9) and (10) are consequences of the process having order o and being stationary (see Equation (4.25) of [7]); (11) follows from the definition of conditional entropy. □
Under the assumptions of Theorem 1, and given a sample x_1^n of the process (X_t), we can estimate the entropy of the process by
Ĥ_D(X_t) = −(1/(n ln(D))) ln(ML(ℒ̂, x_1^n)),    (12)
where ℒ̂ is the estimation of ℒ.
Define H_L = −Σ_{a∈A} P(a|L) log_D(P(a|L)), L ∈ ℒ; we obtain from Theorem 1
H_D(X_t) = Σ_{L∈ℒ} P(L) × H_L.    (13)
Defining Ĥ_L = −Σ_{a∈A} (N_n(L, a)/N_n(L)) log_D(N_n(L, a)/N_n(L)), L ∈ ℒ̂, we build the plug-in estimator of H_D(X_t):
Ĥ_D(X_t) = Σ_{L∈ℒ̂} (N_n(L)/n) × Ĥ_L
= −(1/n) Σ_{L∈ℒ̂} N_n(L) Σ_{a∈A} (N_n(L, a)/N_n(L)) log_D(N_n(L, a)/N_n(L))
= −(1/n) Σ_{L∈ℒ̂, a∈A} log_D((N_n(L, a)/N_n(L))^{N_n(L, a)})
= −(1/(n ln(D))) Σ_{L∈ℒ̂, a∈A} ln((N_n(L, a)/N_n(L))^{N_n(L, a)})
= −(1/(n ln(D))) ln(∏_{L∈ℒ̂, a∈A} (N_n(L, a)/N_n(L))^{N_n(L, a)})
= −(1/(n ln(D))) ln(ML(ℒ̂, x_1^n)).
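A direct implementation of the plug-in estimator (12) from the counts of the estimated parts, again with our own naming and reusing the conventions of the earlier sketches, is the following.

```python
import math

def entropy_estimate(partition, N_sa, N_s, alphabet, n, D=2):
    """Plug-in estimator of H_D(X_t), Equation (12), in D-ary units per symbol."""
    h = 0.0
    for part in partition:
        N_L = sum(N_s[s] for s in part)
        for a in alphabet:
            N_La = sum(N_sa.get((s, a), 0) for s in part)
            if N_La > 0:
                h -= (N_La / n) * math.log(N_La / N_L, D)
    return h
```

Up to the factor 1/(n ln(D)), this is the negative of the log pseudo maximum likelihood used in the BIC sketch above, matching the last line of the derivation.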
In summary, the estimation ℒ̂ of the partition, obtained from x_1^n, allows us to identify an estimator of the entropy of the process, which leads us to ask whether we can estimate an expected code size. For random variables, there is a relationship between H_D and the expected length of the optimal code, see Equation (5). For such an adaptation, consider an instantaneous block source code for the random vector (X_1, X_2, …, X_n) (Definition 2), with codelength function l. Let R_n be the expected codeword length per symbol,
R_n = E(l(X_1, X_2, …, X_n))/n.    (14)
According to Theorem 5.4.2 ([7]), the minimum expected codeword length per symbol for an optimal block code, R_n^opt, verifies R_n^opt → H_D(X_t) as n → ∞. Then, we obtain the next corollary.
Corollary 1.
Under the assumptions of Theorem 1, the minimum expected codeword length per symbol R_n^opt satisfies
R_n^opt → Σ_{L∈ℒ} P(L) Σ_{a∈A} P(a|L) log_D(1/P(a|L)),
when n → ∞, where R_n^opt is given by Equation (14) and computed for an optimal block source code.
Remark 2.
If we consider the estimator of H_D(X_t) given by Equation (12), R_n^opt can be approximated by −ln(ML(ℒ̂, x_1^n))/ln(D^n), which is proportional to the first term of the BIC value of Equation (6), computed with the estimated partition ℒ̂, and with a negative sign.
The compressed representation of the dataset will be composed of two parts: first, a description of the model used to code the data, and, second, the coded data. The minimum expected codeword length per symbol refers to the encoding of the data. Remark 2 shows that, for given data, the minimum expected codeword length is smaller for models with larger values of the maximum likelihood.
Remark 3.
A model with redundant parameters will require more bits for its description than the correct model. On the other hand, a model with redundant parameters usually has a maximum likelihood value larger than that of the correct model and, because of Remark 2, will produce a shorter expected code length for the data than the correct model. We will see in Section 5 that using models with redundant parameters for compression produces larger files, because the decrease in the size of the portion of the compressed output corresponding to the encoding of the data is smaller than the increase in the size of the part of the output with the specification of the model.
The strategy that we investigate in this paper is supported by Definition 8 and the BIC criterion, whose tendency is to decrease the number of probabilities necessary for the encoding of a message compared to the use of a full Markov chain. While in the case of a full Markov chain the number of such probabilities is (|A| − 1)|S|, the same does not happen when approaching the problem from the perspective of a PMM, since the total number of probabilities depends on the cardinality of the estimated ℒ, giving a total equal to (|A| − 1)|ℒ|, with |ℒ| ≤ |S|. In practice, if a full Markov chain is used, the encoding of the message occurs through Algorithm 1, provided as inputs {N_n(s, a), a ∈ A, s ∈ S}. However, if a PMM is used, the encoding of the message is given by Algorithm 1, provided as inputs {N_n(L, a), a ∈ A, L ∈ ℒ̂}, where ℒ̂ is the estimate of ℒ.
In the next section, the purpose is to describe the process of transmitting a message. The first subsection shows the reader the steps that need to be considered. Later, we present the advantages of using the counts {N_n(L, a), a ∈ A, L ∈ ℒ̂}, related to a PMM, instead of the counts {N_n(s, a), a ∈ A, s ∈ S}, related to a full Markov process.

4. Codification of the Model Information

In this section, we show how we encode the model and parameters. For simplicity and clarity of exposition, we will assume that the receiver knows the alphabet A, so there is no need to encode A in the message to be transmitted. In the case of an order o Markov chain, the model is determined by the order o. In the case of a PMM, the model is determined by the partition ℒ̂ of the state space. In both cases, the parameters of the model are the estimated transition probabilities. In our case, as the transition probabilities are estimated from the message to encode, the parameters will be the transition frequencies, {N_n(s, a), a ∈ A, s ∈ S} in the case of an order o Markov chain and {N_n(L, a), a ∈ A, L ∈ ℒ̂} in the case of a PMM. Note that this is not a review of the ways to implement the coding/decoding process, since our goal is to show how we can incorporate and take advantage, in the compression process, of the notion of partition (Definition 8).

4.1. Codification of the Transition Frequencies

To send a coded sequence of numbers, we need a protocol so that the receiver knows where the code of each number starts and finishes. If we send the integer k in binary, the receiver has to know how many bits we are using to encode that number. For example, the integer 5 can be encoded as 101 or as 00101. For any integer k, the minimal number of bits needed to represent k in binary is ⌊log_2(k)⌋ + 1. We know that N_n(s, a) ≤ n, a ∈ A, s ∈ S, and N_n(L, a) ≤ n, a ∈ A, L ∈ ℒ̂; this means that each frequency, in any possible model for the data, can be encoded in ⌊log_2(n)⌋ + 1 bits. In what follows, all the frequencies will be encoded with this fixed number, ⌊log_2(n)⌋ + 1, of bits. For example, to send 30 frequencies, 30 × (⌊log_2(n)⌋ + 1) bits will be used.
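A sketch of this fixed-width scheme (function names our own) is given below.

```python
import math

def encode_frequencies(freqs, n):
    """Concatenate the counts in binary, each with floor(log2(n)) + 1 bits."""
    width = math.floor(math.log2(n)) + 1
    return "".join(format(k, "0{}b".format(width)) for k in freqs)

def decode_frequencies(bits, n):
    """Split the received bit string back into the list of counts."""
    width = math.floor(math.log2(n)) + 1
    return [int(bits[i:i + width], 2) for i in range(0, len(bits), width)]
```

For n = 393, the width is 9 bits; for instance, encode_frequencies([30, 20], 393) gives '000011110000010100', which corresponds (up to spacing) to the first two 9-bit words of the string shown in Example 4 below.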

4.2. Codification of the Model

In the case of an order o Markov chain model, the model is determined by the order o. We will assume that the order is smaller than log_{|A|}(n) − 1; this means that it can be written in ⌊log_2(log_{|A|}(n) − 1)⌋ + 1 bits or less. We will reserve ⌊log_2(log_{|A|}(n) − 1)⌋ + 1 bits for the order o of the model.
In the case of a PMM, to send the model, we need to send the partition, that is, the relationship between the states s ∈ S and the parts L ∈ ℒ. In the next example, we show how to do that through the canonical Huffman code.
Now, we detail our proposal. If ℒ̂ is the estimated partition of the state space S, with ℒ̂ = {L_1, L_2, …, L_{|ℒ̂|}}, then we associate to each state s ∈ S an index i if s ∈ L_i. The states are arranged in lexical order, and this list is replaced by the indices previously assigned. As each L_i has a cardinality, which is the total number of states contained in it, say |L_i|, each index i is associated with a probability |L_i|/|S|, and the canonical Huffman code is applied to these frequencies. Finally, the index list is replaced by the list of canonical Huffman codewords and communicated as a string (say C(Index)) in the transmission process. Example 3 illustrates the process.
Example 3.
Consider the process (X_t) over the alphabet A = {0, 1}, with state space A^o and order o = 3. Suppose that, under Definition 8, the partition ℒ = {L_1, L_2, L_3} is identified, with the composition given by Table 3.
Considering the lexical order of the elements, we associate to each state s the value i, if s ∈ L_i.
Then, the list of indices is 12211231 (see Table 4), and each index can be coded by the canonical Huffman algorithm, since we have associated a frequency with each index. See the results reported in Table 5.
As a consequence, to indicate to which part each state belongs, we need to codify the sequence 12211231 with the codes in column 4 of Table 5. The string to communicate is given by (15),
0 10 10 0 0 10 11 0    (15)
which requires 12 bits. To inform the codification, we need to send the codelengths. In this case, we send the first codelength (1) and then the increments (last column in Table 5). The string is 1 1 0, which requires 3 bits. In total, to inform the partition, we need to communicate 15 bits, plus ⌊log_2(log_2(n) − 1)⌋ + 1 bits for the order.
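The procedure of this subsection can be sketched as follows (names our own): part_index maps each state to the index of its part, and codeword maps each part index to its canonical Huffman codeword (column 4 of Table 5 in Example 3).

```python
def encode_partition(states, part_index, codeword):
    """Concatenate, over the states in lexical order, the codeword of the
    part index of each state (the string C(Index) of Section 4.2)."""
    return " ".join(codeword[part_index[s]] for s in sorted(states))
```

In Example 3, with part_index giving the index list 12211231 and with codewords 1 -> 0, 2 -> 10, 3 -> 11 (consistent with the string (15)), calling encode_partition over the states 000, …, 111 returns 0 10 10 0 0 10 11 0.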
The next example complements the previous one showing how using a PMM (see Definition 8 and Table 3) can cause a reduction in the number of bits necessary for the transmission of the model.
Example 4.
Consider that the partition of Table 3 was generated from a sample of size n = 393, with the frequencies represented in Table 6 as the information that generates the partition reported in Example 3.
First, suppose we want to transmit the model and parameters using an order 3 Markov chain. To communicate the model, we need to transmit the order using 3 = ⌊log_2(log_2(393) − 1)⌋ + 1 bits. In order to communicate the counts reported in Table 6, we use the strategy previously described. We use a fixed number of bits to represent each N_n(s, a), a = 0, 1, s ∈ S. For example, with n = 393, we can produce the binary representation of N_n(s, a) with 9 = ⌊log_2(393)⌋ + 1 bits for each a = 0, 1 and s ∈ S. Then, the string that is communicated is built from the concatenation, in lexical order, of the representations of each count, fixing first the state and appending all the representations following the lexical order of the elements of the alphabet:
000011110 000010100 000011000 000011001 000011010 000011001 000011111 000010100 000011101 000010100 000011001 000011001 000001010 000011110 000011110 000010100
Then, the string requires 16 × 9 = 144 bits. Adding the 3 bits for the order of the model, we get a total of 147 bits.
In conclusion, in the case of using an order 3 Markov chain model, the total number of bits required for the specification of the model plus the parameters is 147 bits.
Now, suppose that we want to transmit the model and parameters using a partition Markov model. Considering the partition structure of Table 3 plus the frequencies in Table 6, we obtain the frequencies in Table 7.
Then, under this new scheme that considers the counts related to the parts, that is, N(L, a) = Σ_{s∈L} N(s, a), the string that needs to be communicated is the concatenation of the binary representations of the six frequencies in Table 7, written using 9 = ⌊log_2(393)⌋ + 1 bits for each N(L, a), a = 0, 1 and L ∈ ℒ. The string is
001111000 001010000 001001011 001001011 000001010 000011110    (17)
The number of bits that the configuration needs is 6 × 9 = 54 bits.
From the concatenation of the string 011, corresponding to the order, with the string (15), corresponding to the codification of the partition, and the string (17), corresponding to the frequencies, we obtain the final string encoding the partition Markov model,
011 0 10 10 0 0 10 11 0 001111000 001010000 001001011 001001011 000001010 000011110 .
This string requires 69 bits.
Table 8 summarizes the results under two perspectives, using a regular order 3 Markov chain versus the approach by partition:
We see, then, from the previous example, that the compression produced by the approach (Definition 8) allows for reducing some costs whenever there is a reasonably small number of parts, which reduces the information to be transmitted in comparison with the model that keeps all the states separated, or with a model offering only a timid reduction.
In the next section, we show simulation studies and an application illustrating the use of Huffman code allied to partition Markov models, and we describe its performance in each situation. We compare the performance of the Huffman coding under two strategies, that is, we compare the performance of the Huffman code when it is implemented using a full order o Markov chain versus using a partition Markov model.

5. Simulation Study and Application

5.1. Simulation Study

In the next simulations, we generate samples x_1^n of sizes n = 1000, 5000, and 10,000 for each simulated setting, following a specific stochastic model given in each case by a partition Markov model, since a PMM is a generalization of full Markov models.
We implement the two strategies for comparison. The first, called by States, is the Huffman code applied to a full order o Markov chain, executed on the transition probabilities estimated from the sample, state by state. The second, called by Parts, is the Huffman code executed on the transition probabilities estimated from the sample, part by part; see Definition 8 and Equation (6).
In all the cases (we consider eight settings), the codification of the dataset was made using the Huffman code in blocks of size 4.
In the simulations, we want to check the influence of the number of parameters of the simulated model on the resulting compression. In the case of a full order o Markov chain over the alphabet A, the number of parameters is (|A| − 1)|S|. In the case of a Partition Markov model, the number of parameters is (|A| − 1)|ℒ|. We simulate eight models with the following principles. The first two models have the same number of parameters for both families of models, meaning that each part contains a single state. This could be considered the worst condition for the by Parts methodology. On the contrary, the third and fourth models have tiny partitions with only two parts; these two models seem very favorable to the PMM, as they can be described with a minimal number of parameters. Even more favorable to the PMM is a small partition with a large state space, the scenarios implemented in cases 5 and 6, with state spaces of size 64 and 256, respectively, and partitions of size 4. Finally, in models 7 and 8, we consider the case of a small state space and a small partition. In that scenario, both strategies should work well.
For each sample size and case, we performed 100 independent simulations and applied the two strategies to each sample. In Table 9, we report the mean number of bits used for the compression of the model (columns 6 and 7), the mean number of bits used for the compression of the dataset (columns 8 and 9), and the mean compression ratio, defined as
R_c = uncompressed size / compressed size.
Simulation 1. 
Situations 1 and 2, with Markov partitions ℒ having the same size as the state space S.
The first two situations shown in Table 9 could be considered, at first, unfavorable to the partition structure, since the number of parts is equal to the number of states. Nonetheless, as we can see, the rates are better for the by Parts methodology, because the BIC criterion chooses a model with a set of parameters that are considered relevant from the point of view of the likelihood and for a given sample size. Note how the number of bits reserved for the model grows with the size of the dataset; for a fixed distribution and larger datasets, the number of bits dedicated to the header can be larger. In addition, since the BIC criterion is consistent, there is a sample size such that, from that sample size on, the number of parameters will not increase, only the number of bits needed to transmit larger frequencies. In the two situations, we can observe a huge difference in the number of bits used to describe the models, columns 6 and 7 in Table 9. Again, this is caused by the BIC criterion restricting the parameters to the ones that are relevant for the likelihood of the sample. It produces the reverse effect on the size of the codified data.
We can also observe that, for situation 2, the number of parameters to be codified for the strategy by States is so large (256 × 4 transition frequencies) that it does not compensate for the compression of the data x_1^n, producing a compression ratio smaller than 1 for all three sample sizes. For the strategy by Parts, this only happens for the smallest sample size, n = 1000.
Simulation 2. 
Situations 3 and 4, with the cardinality of the Markov partitions ℒ being very small.
The situations corresponding to the models labeled as 3 and 4 in Table 9 could be considered as favorable to the partition structure. Scenarios with smaller partitions could induce a small size on the description of the estimated model. Looking at columns 6 and 7 in Table 9, we can see that, indeed, the number of bits required to encode the description of the model is small compared to situations 1 and 2, and also that it changes very little with the size of the dataset. This happens because the model selection mechanism detects and incorporates into the estimated model all the relevant parameters even for the smaller sample size n = 1000 .
As seen in Remark 3, given a set of models for the dataset, the model with the larger number of parameters will have a larger maximum likelihood value and a smaller minimum expected codeword length per symbol. We note that, even with a significant difference in the number of parameters, the differences in the length of the code corresponding to the data are relatively small. The reason is that the extra parameters on the model by States are irrelevant as they correspond to transition probabilities already considered in other states. In addition, the estimated transition probabilities are more precise for the Partition Markov model as they are drawn from a large sample corresponding to each one of the estimated parts.
Simulation 3. 
Situations 5 and 6. Small Markov partitions ℒ and large state space S.
In both situations, the size of the partition generating the models is small, |ℒ| = 4. However, the sizes of the state spaces are large: for situation 5, |S| = 64; for model 6, |S| = 256. These cases produce the largest differences in rates of compression in the whole simulation study.
We also notice that, in the case of model 6 and sample size 1000, both methods are ineffective for data compression, as the rate of compression is lower than 1. The reason is in Remark 1, where we state that, to fit a PMM, the order o should satisfy o < log_{|A|}(n) − 1 (= 3). In the case of the model by States, we have a similar problem: the number of transition probabilities to be computed from the data is 256 × 3 = 768, almost the size of the dataset, which is 1000. We can see that, for n = 1000, the number of bits for the data (column 9) is disproportionately larger for the methodology by Parts, which shows that the BIC criterion could not find a good model for the sample. The situation pointed out before is solved for sample sizes 5000 and 10,000.
Simulation 4. 
Situations 7 and 8. Small Markov partitions ℒ and small state space S.
For both models, we use | S | = 8 . The rates of compression for both approaches are similar and show the efficiency of the compression procedure. For sample size 1000 , the rates for the method by Parts are significantly better as the codification of the model is more significant for the smaller dataset. For larger sample sizes, there is a small advantage for the methodology by Parts, approaching the limit given by the entropy of the process. The difference, in favor of the methodology by Parts, is caused by the codification of the model as seen in columns 6 and 7 in Table 9. The difference in the size of the codification of the data is very small, see columns 8 and 9 in Table 9.
Two facts can be seen through Table 9. The first one is that the number of bits needed for the transmission of the model is always smaller for the method by Parts. This happens because of the ability of the Bayesian Information Criterion to incorporate parameters into the model, according to its relevance to describe the dataset.
Second, the number of bits needed to transmit the data for the by Parts structure is always larger than for the by States structure. The cause is stated in Remark 2, which says that −ln(ML(ℒ̂, x_1^n))/ln(D^n), built from the maximum likelihood estimator, approximates the optimal expected codeword length per symbol R_n^opt. As the partition model has a number of parameters smaller than or equal to that of the standard order o Markov chain, the maximum log-likelihood is smaller or equal for the partition Markov model, which, through Remark 3, produces an expected code length larger or equal for the partition Markov model. This effect is compensated by the smaller size of the model description provided by the conception of the partition Markov model, promoting a reduction in the number of bits needed by the approach by Parts in comparison to the traditional one by States (values in bold in Table 9).

5.2. SARS-CoV-2 and Compression

In this application, we use as background the current health crisis produced by SARS-CoV-2, which affects the entire world. Sites like https://www.worldometers.info/coronavirus/ (accessed on 24 December 2021) record the enormous number of deaths and daily cases that humanity faces. The magnitude of this situation has led scientists to genetic analysis to identify the alterations that are leading today to genetic mutations of the virus, which potentiate its capacity for distribution and its capacity to lead to death. Ref. [4] records that, for example, in Brazil, alterations in the genetic structure of SARS-CoV-2 were already detected in the first half of 2020. Today, such mutations are factors that jeopardize the efficacy of the vaccines against SARS-CoV-2 developed during 2020. We begin 2021 with the uncertainty of the course that this epidemic may take. Locations in an uncontrolled situation can be controlled if ways are found to inhibit the evolution of new strains of the virus.
Counting on the rapid communications that occur today between scientists around the world and the volume of genetic information available, this application aims to contribute to the development of this type of analysis, considering genetic databases such as the one available in NCBI (National Center for Biotechnology Information, accessed on 24 December 2021), via link https://www.ncbi.nlm.nih.gov/.
Specifically, in this article, we use three genetic sequences in FASTA format: one from China and the other two collected in Brazil. Our selection is based on the discoveries made in [3,4], which have allowed us to know the stochastic behavior of these sequences. The sequence MN908947.3, coming from a patient of the Central Hospital in Wuhan (Hubei, China), can be obtained from the free base NCBI. It is the first complete SARS-CoV-2 genome sequence registered in the literature, see [15]. In [3], we model this sequence (using its FASTA format), and the resulting model is a PMM with interstice on the alphabet {a, c, g, t}. As introduced in [3], the model was selected using the BIC criterion. This model is very natural for genetic behavior. To the usual parameter o (see Definition 8), we add a new parameter G, G > o, which means that a piece of extra information about the behavior of the sample further in the past needs to be incorporated into the model to better capture the stochastic behavior of the sequence. In this case, o = 3 (since we deal with the genomic alphabet, which is organized in triples) and G = 9 (see Table 6 (right) in [3]). Then, in the compression process, we need to report two parameters, o = 3 and G = 9. We have a continuous past of size o = 3 and a gap between positions 4 and 8 in the past, the value in position G = 9 being used to define the states of the state space, as illustrated in Figure 1. Each state of the state space is composed of a value (from the alphabet A = {a, c, g, t}) associated with the distant past located 9 positions behind position t and a triple of values related to positions t − 3, t − 2, and t − 1; then, each state is s ∈ A × A^3.
In [3], this model is widely investigated within the framework of the sequence MN908947.3 of SARS-CoV-2. It results in a model made up of 13 parts, where each part gathers states with the constitution described above. In the present application, we use the same model that was also addressed later in [4], for the other sequences investigated here, MT126808.1 (patient traveled from Switzerland and Italy (Milan) to Brazil) and MT350282.1 (also from a patient in Brazil).
The link http://www.ime.unicamp.br/~veronica/PMM_Covid19/ (accessed on 24 December 2021) exposes for each Brazilian sequence the content of the partitions and their transition probabilities. We see that all the sequences have approximately the same number of estimated parts; MN908947.3 and MT350282.1 have 13 parts and MT126808.1 has 12 parts.
In Table 10, we present a summary of the compression process. We do not report the bits for the alphabet, as it is the genetic alphabet { a , c , g , t } itself. For the memory parameters, o and G , we have allocated 4 and 8 bits, respectively, both for strategy by States and for strategy by Parts.
We observe that, for the three genetic sequences of SARS-CoV-2, there is a reduction in the total number of bits that exceeds 10.4%, when comparing the strategy by States with the strategy by Parts. The situation here is due to the fact that the number of bits destined to encode the model required for strategy by Parts is much lower than the number of bits destined to encode the model required for strategy by States.
One aspect that could be considered in the compression of many sequences of the same virus is to use a model that determines the parts considering a generalization of Definition 8, such as the notion introduced in [16]. That notion proposes to determine the parts using any sequence and any state, joining in the same part sequences and states that share the same transition probabilities. With such a model in hand, the bits used for the description of the model would need to be sent only once, leaving only the coding of the sequences to be transmitted.

6. Conclusions

In this paper, we propose an improvement in the Huffman coding system’s efficiency, taking into account that the method uses transition probabilities that are computed from the message to be transmitted. A message has a dependence structure in which an element a (a letter of the alphabet, a ∈ A) occurring at position t depends on the past o values of the message, s ∈ A^o (S = A^o being the state space). The Huffman coding system uses the transition probabilities P(a|s) to compress the message without loss. To improve the Huffman coding system in a Markov chain setting, we introduce the notion of partition (see Definition 8, Section 3), which is a parsimonious structure on the state space that allows using several states to identify the same probability P(a|L), where L contains several states. The proposal reduces the number of bits needed to transmit the frequencies that estimate the transition probabilities associated with the message. Section 4.2 shows and exemplifies the strategy.
On the other hand, there is a potential loss of efficiency due to the need to identify which part L each state s belongs to, and, depending on the context, the number of bits used for such transmission must be considered. These situations are investigated in Section 5. We conclude that the strategy, via Definition 8, compensates, since a model with redundant parameters will use more bits for the description of the model than the correct one. We report in Section 5 that, for all the situations studied, the partition strategy (by Parts) improves the compression with respect to a full Markov chain (by States). In addition, except for the two cases in which the sample size is similar to the number of transition probabilities that we need to estimate in order to choose a model, the strategy by Parts has a compression rate significantly larger than 1.
Using Theorem 1, we show that the entropy of a process can be expressed through the structure of a partition. By estimating such a model, we can estimate the entropy of the process generating the message, see Equation (12). A partition Markov model is identified through the Bayesian Information Criterion (see Equation (6)), which bears a clear relationship with the minimum expected codeword length per symbol (see Corollary 1); the latter can be estimated using the Bayesian Information Criterion and the estimated partition L̂, see Remark 2. These results allow us to estimate the minimum expected codeword length per symbol through the estimation of a partition Markov model.
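In the notation used here, and writing μ for the stationary distribution of the process, the entropy grouped by parts takes the form
\[ H(X) \;=\; -\sum_{L \in \mathcal{L}} \Big( \sum_{s \in L} \mu(s) \Big) \sum_{a \in A} P(a \mid L) \, \log_2 P(a \mid L). \]
A natural plug-in estimator, sketched here under the assumption that Equation (12) takes this standard form, replaces the stationary mass of each part by N_n(L)/n and P(a|L) by N_n(L,a)/N_n(L), computed over the estimated partition L̂.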
We complete this article with a compression application to SARS-CoV-2 sequences (see Section 5.2). The procedure proposed here produces a reduction of more than 10% in the total number of bits relative to the Huffman code without the partition Markov model notion. The strategy is promising at a time when extra agility is needed in the transmission and modeling of genetic sequences, as is the case in this application.
As a legacy of this article, the development of new models, such as the partition Markov model, together with estimation methods with relevant properties, such as the Bayesian Information Criterion, provides enough substance to contribute to the field of data compression, and this article aims to show this fact.

Author Contributions

All authors jointly conceived the idea for this paper, discussed the research agenda, performed the theoretical and numerical calculations, and prepared each draft of the paper. All authors have read and agreed to the published version of the manuscript.

Funding

G.H.T. gratefully acknowledges the financial support provided by CAPES through a fellowship from the PhD Program of the Department of Statistics, University of Campinas.

Data Availability Statement

The data used in this study are available from The National Center for Biotechnology Information Advances Science and Health, https://www.ncbi.nlm.nih.gov/ (accessed on 24 December 2021); https://www.worldometers.info/coronavirus/ (accessed on 24 December 2021); and http://www.ime.unicamp.br/~veronica/PMM_Covid19/ (accessed on 24 December 2021).

Acknowledgments

The authors wish to express their gratitude to two referees for their helpful comments on an earlier draft of this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Rissanen, J. A Universal Data Compression System. IEEE Trans. Inf. Theory 1983, 29, 656–664.
2. Rissanen, J. Modelling by shortest data description. Automatica 1978, 14, 465–471.
3. García, J.E.; González-López, V.A.; Tasca, G.H. Partition Markov Model for COVID-19 Virus. 4open 2020, 3, 13.
4. García, J.E.; González-López, V.A.; Tasca, G.H. A Stochastic Inspection about Genetic Variants of COVID-19 Circulating in Brazil during 2020. AIP Conf. Proc., forthcoming.
5. García, J.E.; González-López, V.A. Minimal Markov models. In Proceedings of the Fourth Workshop on Information Theoretic Methods in Science and Engineering, Helsinki, Finland, 7–10 August 2011; p. 25.
6. García, J.E.; González-López, V.A. Consistent Estimation of Partition Markov Models. Entropy 2017, 19, 160.
7. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; Wiley-Interscience: Hoboken, NJ, USA, 2006.
8. Huffman, D.A. A method for the construction of minimum-redundancy codes. Proc. IRE 1952, 40, 1098–1101.
9. McMillan, B. The basic theorems of information theory. Ann. Math. Stat. 1953, 24, 196–219. Available online: https://www.jstor.org/stable/2236328 (accessed on 24 December 2021).
10. McMillan, B. Two inequalities implied by unique decipherability. IRE Trans. Inf. Theory 1956, 2, 115–116.
11. Mannan, M.A.; Kaykobad, N. Block Huffman coding. Comput. Math. Appl. 2003, 46, 1581–1587.
12. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
13. Bühlmann, P.; Wyner, A.J. Variable length Markov chains. Ann. Stat. 1999, 27, 480–513.
14. Csiszár, I.; Talata, Z. Context tree estimation for not necessarily finite memory processes, via BIC and MDL. IEEE Trans. Inf. Theory 2006, 52, 1007–1016.
15. Wu, F.; Zhao, S.; Yu, B.; Chen, Y.M.; Wang, W.; Song, Z.G.; Hu, Y.; Tao, Z.W.; Tian, J.H.; Pei, Y.Y.; et al. A new coronavirus associated with human respiratory disease in China. Nature 2020, 579, 265–269.
16. Cordeiro, M.T.A.; García, J.E.; González-López, V.A.; Mercado Londoño, S.L. Partition Markov model for multiple processes. Math. Methods Appl. Sci. 2020, 43, 7677–7691.
Figure 1. Scheme of the past necessary to determine the partition of the state space of the SARS-CoV-2 process at time t; the dashed line indicates the irrelevant period, with its limits, [t − 8, t − 4], marked on top of the scheme.
Table 1. Huffman and canonical Huffman for X. From left to right: x, P(X = x), y, P(Y = y), Huffman code for Y, codelength for Y, CHC for Y (C_Y), and codelength increments.
x | P(X = x) | y | P(Y = y) | C_Y(y) (Huffman) | l(y) | C_Y(y) (CHC) | Increment in codelength
b | 4/11 | a | 4/11 | 0 | 1 | 0 | 1
c | 3/11 | b | 3/11 | 10 | 2 | 10 | 1
d | 2/11 | c | 2/11 | 111 | 3 | 110 | 1
a | 1/11 | d | 1/11 | 1100 | 4 | 1110 | 1
e | 1/11 | e | 1/11 | 1101 | 4 | 1111 | 0
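The CHC column of Table 1 is determined by the codelengths alone. As an illustration, the following Python sketch (an assumed standard canonical assignment, not necessarily the construction used by the authors) reproduces it from the lengths 1, 2, 3, 4, 4:

def canonical_codes(lengths):
    """lengths: dict symbol -> codelength; returns dict symbol -> canonical bitstring."""
    code, prev_len, out = 0, 0, {}
    for sym, length in sorted(lengths.items(), key=lambda kv: (kv[1], kv[0])):
        code <<= (length - prev_len)          # append zeros when the codelength grows
        out[sym] = format(code, f"0{length}b")
        code += 1
        prev_len = length
    return out

print(canonical_codes({"a": 1, "b": 2, "c": 3, "d": 4, "e": 4}))
# {'a': '0', 'b': '10', 'c': '110', 'd': '1110', 'e': '1111'}, matching the CHC column of Table 1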
Table 2. Left: transition probabilities P(x|s). Right: conditional canonical Huffman code C(x|s) and length l(x|s), x ∈ A = { a, b, c }, s ∈ S = { a, b, c }.
s | P(a|s) | P(b|s) | P(c|s) | C(a|s) | l(a|s) | C(b|s) | l(b|s) | C(c|s) | l(c|s)
s = a | 1/6 | 2/6 | 3/6 | 11 | 2 | 10 | 2 | 0 | 1
s = b | 1/3 | 1/3 | 1/3 | 10 | 2 | 11 | 2 | 0 | 1
s = c | 2/7 | 1/7 | 4/7 | 10 | 2 | 11 | 2 | 0 | 1
Table 3. List of parts of (X_t). The second column reports the composition of the part listed on the left, and the third column reports the cardinality of the part.
Part (L_i) | Composition of L_i | |L_i|
L_1 | { 000, 011, 100, 111 } | 4
L_2 | { 001, 010, 101 } | 3
L_3 | { 110 } | 1
Table 4. List of states in lexical order and the related part index.
s | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
index | 1 | 2 | 2 | 1 | 1 | 2 | 3 | 1
Table 5. List of indices that identify the states with the parts reported in Table 3. The second column reports the frequency of the index listed on the left. The third column reports the Huffman code of the index. The last three columns report the CHC, the codelengths, and the increments in codelengths, respectively.
Index | Frequency | Huffman | CHC | Codelength | Increase in Codelength
1 | 4/8 | 1 | 0 | 1 | 1
2 | 3/8 | 01 | 10 | 2 | 1
3 | 1/8 | 00 | 11 | 2 | 0
Table 6. Counts (and binary form) of each state s followed by 0 (and by 1), from a message x_1^n; the membership of each state in the parts L_1, L_2, and L_3 is given in Table 3.
s | 000 | 001 | 010 | 011 | 100 | 101 | 110 | 111
N_n(s, 0) | 30 | 24 | 26 | 31 | 29 | 25 | 10 | 30
binary | 11110 | 11000 | 11010 | 11111 | 11101 | 11001 | 1010 | 11110
N_n(s, 1) | 20 | 25 | 25 | 20 | 20 | 25 | 30 | 20
binary | 10100 | 11001 | 11001 | 10100 | 10100 | 11001 | 11110 | 10100
Table 7. Counts of each part L followed by 0 (and 1), from the message represented by the partition in Table 3.
L | L_1 | L_2 | L_3
N_n(L, 0) | 120 | 75 | 10
binary | 1111000 | 1001011 | 1010
N_n(L, 1) | 80 | 75 | 30
binary | 1010000 | 1001011 | 11110
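The counts in Table 7 are obtained by summing, within each part of Table 3, the corresponding state counts of Table 6. A minimal Python sketch of this aggregation (not the authors' code) reproduces the values above:

parts = {1: ["000", "011", "100", "111"], 2: ["001", "010", "101"], 3: ["110"]}
N0 = {"000": 30, "001": 24, "010": 26, "011": 31, "100": 29, "101": 25, "110": 10, "111": 30}
N1 = {"000": 20, "001": 25, "010": 25, "011": 20, "100": 20, "101": 25, "110": 30, "111": 20}
for i, states in parts.items():
    n0 = sum(N0[s] for s in states)    # N_n(L_i, 0)
    n1 = sum(N1[s] for s in states)    # N_n(L_i, 1)
    print(f"L{i}: N(L,0) = {n0} ({n0:b}), N(L,1) = {n1} ({n1:b})")
# L1: N(L,0) = 120 (1111000), N(L,1) = 80 (1010000)
# L2: N(L,0) = 75 (1001011),  N(L,1) = 75 (1001011)
# L3: N(L,0) = 10 (1010),     N(L,1) = 30 (11110)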
Table 8. |{N_n(s, a), s ∈ S, a ∈ A}|, |{N_n(L, a), L ∈ L, a ∈ A}|, and the number of bits necessary for the transmission in each case.
 | By States, {N_n(s, a)} | By Parts, {N_n(L, a)}
Number of counts | 16 | 6
Number of bits | 147 | 69
Table 9. Results for the simulation study. From left to right: (1) model description: label (situation), order o, number of parts |L|, and size of the alphabet |A|; (2) sample size n of x_1^n. In the next columns, we report mean values over 100 samples, on the left for the strategy by States and on the right for the strategy by Parts: (3) bits for the model: cost in bits for the codification of the model; (4) bits for the data: cost in bits for the codification of the data x_1^n; (5) bits (total); (6) compression ratio R_c. In bold type in the published table, the cases with the smallest number of bits and the higher R_c.
Label | o | Parts | Alphabet | n | Model (States) | Model (Parts) | Data (States) | Data (Parts) | Total (States) | Total (Parts) | R_c (States) | R_c (Parts)
1 | 3 | 64 | 4 | 1000 | 2326.2 | 366.6 | 831.9 | 1121.5 | 3158.1 | 1488.1 | 0.64 | 1.35
1 | 3 | 64 | 4 | 5000 | 3143.5 | 829.2 | 4673.2 | 5131.6 | 7816.6 | 5960.8 | 1.28 | 1.68
1 | 3 | 64 | 4 | 10,000 | 3401.5 | 1299.9 | 9380.3 | 9864.9 | 12,781.8 | 11,164.8 | 1.57 | 1.80
2 | 4 | 256 | 4 | 1000 | 3415.9 | 567.3 | 653.7 | 1558.2 | 4069.6 | 2125.5 | 0.49 | 0.94
2 | 4 | 256 | 4 | 5000 | 11,508.1 | 1069.4 | 4379.6 | 5842.1 | 15,887.7 | 6911.5 | 0.63 | 1.45
2 | 4 | 256 | 4 | 10,000 | 13,449.8 | 1391.1 | 9206.2 | 11,103.7 | 22,656.0 | 12,494.7 | 0.88 | 1.60
3 | 3 | 2 | 3 | 1000 | 713.0 | 116.7 | 1125.6 | 1166.1 | 1838.6 | 1282.8 | 0.82 | 1.17
3 | 3 | 2 | 3 | 5000 | 991.5 | 131.6 | 5777.3 | 5816.8 | 6768.8 | 5948.4 | 1.11 | 1.26
3 | 3 | 2 | 3 | 10,000 | 1074.7 | 138.7 | 11,572.6 | 11,620.8 | 12,647.3 | 11,759.5 | 1.19 | 1.28
4 | 3 | 2 | 3 | 1000 | 739.9 | 125.0 | 1127.9 | 1188.1 | 1867.8 | 1313.1 | 0.80 | 1.14
4 | 3 | 2 | 3 | 5000 | 989.3 | 131.5 | 5768.4 | 5807.9 | 6757.8 | 5939.4 | 1.11 | 1.26
4 | 3 | 2 | 3 | 10,000 | 1076.3 | 137.7 | 11,598.4 | 11,645.5 | 12,674.6 | 11,783.2 | 1.18 | 1.27
5 | 3 | 4 | 4 | 1000 | 2460.1 | 358.4 | 899.4 | 1060.7 | 3359.5 | 1419.1 | 0.60 | 1.41
5 | 3 | 4 | 4 | 5000 | 3144.4 | 395.6 | 5016.0 | 5168.5 | 8160.5 | 5564.1 | 1.23 | 1.80
5 | 3 | 4 | 4 | 10,000 | 3400.5 | 411.6 | 10,173.0 | 10,341.4 | 13,573.5 | 10,753.0 | 1.47 | 1.86
6 | 4 | 4 | 4 | 1000 | 3341.8 | 613.2 | 634.1 | 1842.6 | 3975.9 | 2455.8 | 0.50 | 0.81
6 | 4 | 4 | 4 | 5000 | 11,874.7 | 971.6 | 4531.5 | 5201.9 | 16,406.2 | 6173.5 | 0.61 | 1.62
6 | 4 | 4 | 4 | 10,000 | 13,484.8 | 987.6 | 9709.9 | 10,371.0 | 23,194.8 | 11,358.6 | 0.86 | 1.76
7 | 3 | 6 | 2 | 1000 | 158.6 | 99.5 | 549.9 | 555.7 | 708.5 | 655.2 | 1.41 | 1.53
7 | 3 | 6 | 2 | 5000 | 196.6 | 139.9 | 2737.0 | 2744.2 | 2933.6 | 2884.1 | 1.70 | 1.73
7 | 3 | 6 | 2 | 10,000 | 212.6 | 176.0 | 5465.6 | 5469.9 | 5678.2 | 5646.0 | 1.76 | 1.77
8 | 3 | 3 | 2 | 1000 | 157.8 | 74.3 | 564.8 | 570.2 | 722.6 | 644.5 | 1.39 | 1.56
8 | 3 | 3 | 2 | 5000 | 196.6 | 100.3 | 2804.1 | 2807.4 | 3000.7 | 2907.6 | 1.67 | 1.72
8 | 3 | 3 | 2 | 10,000 | 212.6 | 106.9 | 5618.7 | 5625.5 | 5831.3 | 5732.5 | 1.72 | 1.74
Table 10. From top to bottom: sequence (x_1^n) name, sample size n of x_1^n, country of origin. On the left, for the strategy by States; on the right, for the strategy by Parts: (1) cost in bits for the codification of the model, (2) cost in bits for the codification of the data x_1^n, (3) bits (total); in bold type in the published table, the cases with the smallest number of bits.
Sequence (x_1^n) | MN908947.3 | MT126808.1 | MT350282.1
n | 29,903 | 29,876 | 29,903
Origin | China | Brazil | Brazil
Strategy | States | Parts | States | Parts | States | Parts
Bits of Model | 9168 | 1811 | 9169 | 1711 | 9169 | 1799
Bits of x_1^n | 58,116 | 58,401 | 58,063 | 58,501 | 58,123 | 58,489
Bits (total) | 67,284 | 60,214 | 67,232 | 60,212 | 67,292 | 60,288
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
