Article

Lossy Compression of Individual Sequences Revisited: Fundamental Limits of Finite-State Encoders

Neri Merhav
The Viterbi Faculty of Electrical and Computer Engineering, Technion-Israel Institute of Technology, Technion City, Haifa 3200003, Israel
Entropy 2024, 26(2), 116; https://doi.org/10.3390/e26020116
Submission received: 3 January 2024 / Revised: 25 January 2024 / Accepted: 26 January 2024 / Published: 28 January 2024
(This article belongs to the Collection Feature Papers in Information Theory)

Abstract

We extend Ziv and Lempel’s model of finite-state encoders to the realm of lossy compression of individual sequences. In particular, the model of the encoder includes a finite-state reconstruction codebook followed by an information lossless finite-state encoder that compresses the reconstruction codeword with no additional distortion. We first derive two different lower bounds to the compression ratio, which depend on the number of states of the lossless encoder. Both bounds are asymptotically achievable by conceptually simple coding schemes. We then show that when the number of states of the lossless encoder is large enough in terms of the reconstruction block length, the performance can be improved, sometimes significantly so. In particular, the improved performance is achievable using a random-coding ensemble that is universal, not only in terms of the source sequence but also in terms of the distortion measure.

1. Introduction

We revisit the classical domain of rate-distortion coding applied to finite-alphabet sequences with a prescribed distortion function [1], [2] (Chapter 10), [3] (Chapter 9), [4], [5] (Chapters 7, 8). Specifically, our attention is directed toward encoders comprising finite-state reproduction encoders followed by information-lossless finite-state encoders that compress reproduction sequences without introducing additional distortion (see Figure 1). In essence, our principal contributions are two asymptotically achievable lower bounds for the optimal compression ratio of an individual source sequence of length n, utilizing any finite-state encoder with the aforementioned structure, where the lossless encoder possesses q states. These lower bounds can both be conceptualized as the individual-sequence counterparts of the rate-distortion function of the given source sequence, in the same way that the lossless finite-state compressibility of a source sequence serves as the individual-sequence analog of entropy. However, before delving into the intricacies of our results, a brief overview of the background is warranted.
Over the past several decades, numerous research endeavors have been spurred by the realization that source statistics are seldom, if ever, known in practical scenarios. Consequently, these efforts have been dedicated to the pursuit of universal coding strategies that remain independent of unknown statistics while asymptotically approaching lower bounds, such as entropy in lossless compression or the rate-distortion function in the case of lossy compression, as the block length extends indefinitely. Here, we offer a succinct and non-exhaustive overview of some pertinent earlier works.
In the realm of lossless compression, the field of universal source coding has achieved a high level of sophistication and maturity. Davisson’s seminal work [6] on universal-coding redundancies introduced the pivotal concepts of weak universality and strong universality, characterized by vanishing maximin and minimax redundancies, respectively. This work also elucidated the link between these notions and the capacity of the ‘channel’ defined by the family of conditional distributions of the data to be compressed, given the index or parameter of the source in the class [7,8,9]. For numerous parametric source classes encountered in practice, the minimum achievable redundancy of universal codes is well established to be dominated by $\frac{k\log n}{2n}$, where k denotes the number of degrees of freedom of the parameter and n is the block length [10,11,12,13]. Davisson’s theory gives rise to the central idea of constructing a Shannon code based on the probability distribution of the data vector with respect to a mixture, under a certain prior, of all sources within the class. Rissanen, credited with the invention of the minimum description length (MDL) principle [14], established a converse to a coding theorem in [15]. This theorem asserts that, asymptotically, no universal code can achieve redundancy below $(1-\epsilon)\frac{k\log n}{2n}$, with the possible exception of sources from a subset of the parameter space whose volume vanishes as $n \to \infty$, for every positive $\epsilon$. Merhav and Feder [16] generalized this result to more extensive classes of sources, substituting the term $\frac{k\log n}{2n}$ with the capacity of the aforementioned ‘channel’. Subsequent studies have further refined redundancy analyses and contributed to ongoing developments in the field.
In the broader domain of universal lossy compression, the theoretical landscape is regrettably not as sharply defined and well developed as in the lossless counterpart. In this study, we narrow our focus to a specific class known as d-semifaithful codes [17], namely, codes that fulfill the distortion requirement with probability one. Zhang, Yang, and Wei [18] demonstrated a notable contrast with lossless compression, establishing that, even when the source statistics are perfectly known, achieving redundancy below $\frac{\log n}{2n}$ in the lossy case is impossible, although $\frac{\log n}{n}$ is attainable. The absence of source knowledge imposes a cost in terms of enlarging the multiplicative constant associated with $\frac{\log n}{n}$. Yu and Speed [19] established weak universality, introducing a constant that grows with the cardinalities of the source and reconstruction alphabets [20]. Ornstein and Shields [17] delved into universal d-semifaithful coding for stationary and ergodic sources under the Hamming distortion measure, demonstrating convergence to the rate-distortion function with probability one. Kontoyiannis [21] made several noteworthy contributions: firstly, a central limit theorem (CLT) with an $O(1/\sqrt{n})$ redundancy term, featuring a limiting Gaussian random variable with constant variance; secondly, a law of the iterated logarithm (LIL) with redundancy proportional to $\sqrt{\log\log n/n}$ infinitely often with probability one. A counterintuitive conclusion from [21] is that, under these CLT and LIL criteria, universality comes at essentially no price. In [22], optimal compression is characterized by the negative logarithm of the probability of a sphere of radius $nD$ around the source vector with respect to the distortion measure, where D denotes the allowed per-letter distortion. The article also introduces the concept of a random coding ensemble with a probability distribution given by a mixture of all distributions in a specific class. In two recent articles, Mahmood and Wagner [23,24] studied d-semifaithful codes that are strongly universal with respect to both the source and the distortion function. The redundancy rates in [23] behave like $\frac{\log n}{n}$, but with different multiplicative constants. Other illuminating results regarding a special distortion measure can be found in [25].
A parallel path of research in the field of universal lossless and lossy compression, spearheaded by Ziv, revolves around the individual-sequence approach. In this paradigm, no assumptions are made about the statistical properties of the source. The source sequence to be compressed is treated as an arbitrary deterministic (individual) sequence, but instead, limitations are imposed on the implementability of the encoder and/or decoder using finite-state machines. This approach notably encompasses the widely celebrated Lempel–Ziv (LZ) algorithm [26,27,28], along with subsequent advancements broadening its scope to lossy compression with and without side information [29,30], as well as joint source-channel coding [31,32]. In the lossless context, the work in [33] establishes an individual-sequence analog of Rissanen’s result, where the expression $\frac{k\log n}{2n}$ continues to denote the best achievable redundancy. However, the primary term in the compression ratio is the empirical entropy of the source vector, deviating from the conventional entropy of the probabilistic setting. The converse bound presented in [33] applies to the vast majority of source sequences within each type, echoing the analogy with Rissanen’s framework concerning the majority of the parameter space. It is noteworthy that this converse result retains a semblance of the probabilistic setting, as asserting that the number of exceptional typical sequences is relatively small is equivalent to assuming a uniform distribution across the type and asserting a low probability of violating the bound. Conversely, the achievability result in [33] holds pointwise for every sequence. A similar observation applies to [34], where asymptotically pointwise lossy compression was established with respect to first-order statistics (i.e., “memoryless” statistics), emphasizing distortion-universality, akin to the focus in [23,24]. A similar fusion of the individual-sequence setting and the probabilistic framework is evident in [35] concerning universal rate-distortion coding. However, akin to the approach in [34], there is no constraint of finite-state encoders/decoders as in [33]. Notably, the converse theorem in [35] states that for any variable-rate code and any distortion function within a broad class, the vast majority of reproduction vectors representing source sequences of a given type (of any fixed order) must exhibit a code length essentially no smaller than the negative logarithm of the probability of a ball with normalized radius D (where D denotes the allowed per-letter distortion). This ball is centered at the given source sequence, and the probability is computed with respect to a universal distribution proportional to $2^{-LZ(\hat{x})}$, where $LZ(\hat{x})$ denotes the code length of the LZ encoding of the reproduction vector $\hat{x}$.
The emphasis on the term “majority” in the preceding paragraph, as highlighted earlier, necessitates clarification. It should be noted that in the absence of constraints on encoding memory resources, such as the finite-state machine model mentioned earlier, there cannot exist any meaningful lower bound that universally applies to every individual sequence. The rationale is straightforward: for any specific individual source sequence, it is always possible to devise an encoder compressing that sequence to a single bit (even losslessly). For instance, by designating the bit ‘0’ as the compressed representation of the given sequence and appending the bit ‘1’ as a header to the uncompressed binary representation of any other source sequence. In this scenario, the compression ratio for the given individual sequence would be 1 / n , dwindling to zero as n grows indefinitely. Therefore, it is clear that any non-trivial lower bound that universally applies to every individual source sequence at the same time necessitates reference to a class of encoders/decoders equipped with constrained resources, such as those featuring a finite number of states.
In this work, we consider lossy compression of individual source sequences using finite-state encoders whose structure is as follows: Owing to the fact that, without loss of optimality, every lossy encoder can be represented as a cascade of a reproduction encoder and a lossless (or “noiseless”) encoder (see, e.g., [36], particularly the discussion around Figure 1), we consider a class of lossy encoders that can be implemented as a cascade of a finite-state reproduction encoder and a finite-state lossless encoder; see Figure 1. The finite-state reproduction encoder model is a generalization of the well-known finite-state vector quantizer (FSVQ); see, e.g., [37,38] (Chapter 14). It is designed to produce reproduction vectors of dimension k in response to source vectors of dimension k, while complying with the distortion constraint for every such vector. The finite-state lossless encoder is the same as in [27]. The number of states of the reproduction encoder can be assumed to be very large (large enough to store many recent input blocks). Both the dimension, k, and the number of states, q, of the lossless encoder are assumed to be small compared to the total length, n, of the source sequence to be compressed, similarly to [27] (and other related works), where the regime $q \ll n$ is also assumed.
One of our main messages in this work is that the relationship between q and k is important, and not only how each of them relates to n. If q is large in terms of k, one can do much better than if it is small. Accordingly, we first derive two different lower bounds to the compression ratio under the assumption that $q \ll k$, both of which are asymptotically achievable by conceptually simple schemes that, within each k-block, seek the most compressible k-vector within a ball of ‘radius’ $kD$ around the source block. The motivation for deriving two different bounds is that each one of them has its own strengths, and it is not apparent that either of them always dominates the other (see the details in the sequel, and in particular, the third paragraph of the discussion in Section 3). We compare the performance of the achievability scheme to the ensemble performance of a universal coding scheme that can be implemented when q is exponential in k. The improvement can sometimes be considerably large. The universality of the coding scheme is two-fold: both in the source sequence to be compressed and in the distortion measure, in the sense that the order of codewords within the typical codebook (which affects the encoding of their indices) is asymptotically optimal no matter which distortion measure is used (see [35] for a discussion of this property). The intuition behind this improvement is that when q is exponential in k, the memory of the lossless encoder is large enough to store entire input blocks and thereby exploit the sparseness of the reproduction codebook in the space of k-dimensional vectors with components in the reproduction alphabet. The asymptotic achievability of the lower bound relies on the direct coding theorem of [35].
Bounds on both lossless and lossy compression of individual sequences using finite-state encoders and decoders have been explored in previous works, necessitating a contextualization of the present work. As previously mentioned, the cases of (almost) lossless compression were examined in [26,27,30]. In [32], the lossy case was considered, incorporating both a finite-state encoder and a finite-state decoder in the defined model. However, in the proof of the converse part, the assumption of a finite-state encoder was not essential; only a finite number of states of the decoder was required. In a subsequent work, [31], the finite number of states for both the encoder and decoder were indeed utilized. This holds true for [29] as well, where the individual-sequence analog of the Wyner–Ziv problem was investigated with more restrictive assumptions on the structure of the finite-state encoder. In contrast, the current work restricts only the encoder to be a finite-state machine, presenting a natural generalization of [27] to the lossy case. Specifically, one of our achievable lower bounds can be regarded as an extension of the compressibility bound found in [27] Corollary 1 to the lossy scenario. It is crucial to note that, particularly in the lossy case, it is more imperative to impose limitations on the encoder than the decoder, as encoding complexity serves as the practical bottleneck. Conversely, for deriving converse bounds, it is stronger and more general not to impose any constraints on the decoder.
The outline of this paper is as follows. In Section 2, we establish notation, as well as definitions, and spell out the objectives. In Section 3, we derive the main results and discuss them. Finally, in Section 4, we summarize the main contributions of this work and make some concluding remarks.

2. Notation, Definitions, and Objectives

Throughout the paper, random variables will be denoted by capital letters; specific values they may take will be denoted by the corresponding lowercase letters, and their alphabets will be denoted by calligraphic letters. Random vectors and their realizations will be denoted, respectively, by capital letters and the corresponding lowercase letters, both in boldface font. Their alphabets will be superscripted by their dimensions. The source vector of length n, $(x_1, x_2, \ldots, x_n)$, with components, $x_i$, $i = 1, 2, \ldots, n$, from a finite alphabet, $\mathcal{X}$, will be denoted by $x^n$. The set of all such n-vectors will be denoted by $\mathcal{X}^n$, which is the n-th order Cartesian power of $\mathcal{X}$. Likewise, a reproduction vector of length n, $(\hat{x}_1, \ldots, \hat{x}_n)$, with components, $\hat{x}_i$, $i = 1, \ldots, n$, from a finite alphabet, $\hat{\mathcal{X}}$, will be denoted by $\hat{x}^n \in \hat{\mathcal{X}}^n$. The notation $\hat{\mathcal{X}}^*$ will be used to designate the set of all finite-length strings of symbols from $\hat{\mathcal{X}}$.
For $i \le j$, the notation $x_i^j$ will be used to denote the substring $(x_i, x_{i+1}, \ldots, x_j)$. For i = 1, the subscript ‘1’ will be omitted, and so, the shorthand notation of $(x_1, x_2, \ldots, x_n)$ will be $x^n$. Similar conventions will apply to other sequences. Probability distributions will be denoted by the letter P or Q with possible subscripts, depending on the context. The probability of an event $\mathcal{A}$ will be denoted by $\Pr\{\mathcal{A}\}$, and the expectation operator with respect to (w.r.t.) a probability distribution P will be denoted by $E\{\cdot\}$. The logarithmic function, $\log x$, will be understood to refer to base 2. Logarithms to base e will be denoted by ln. Let $d: \mathcal{X}\times\hat{\mathcal{X}}\to\mathbb{R}$ be a given distortion function between source symbols and reproduction symbols. The distortion between vectors will be defined additively as $d(x^n, \hat{x}^n) = \sum_{i=1}^{n} d(x_i, \hat{x}_i)$ for every positive integer, n, and every $x^n\in\mathcal{X}^n$, $\hat{x}^n\in\hat{\mathcal{X}}^n$.
Consider the encoder model depicted in Figure 1, which is a cascade of a finite-state reproduction encoder (FSRE) and a finite-state lossless encoder (FSLE). This encoder is fully determined by the set $E = (\mathcal{X}, \hat{\mathcal{X}}, \mathcal{S}, \mathcal{Z}, u, v, f, g, k)$, where $\mathcal{X}$ is the source input alphabet of size $\alpha$, $\hat{\mathcal{X}}$ is the reproduction alphabet of size $\beta$, $\mathcal{S}$ is the set of FSRE states, $\mathcal{Z}$ is the set of FSLE states, of size q, u and v are functions that define the FSRE, f and g are functions that define the FSLE (both to be defined shortly), and k is a positive integer that designates the basic block length within which the distortion constraint must be kept, as will be described shortly. The number of states, $|\mathcal{S}|$, of the FSRE may be assumed arbitrarily large (as the lower bounds to be derived will actually be independent of this number). In particular, it can be assumed to be large enough to store several recent input k-blocks.
According to this encoder model, the input, $x_t \in \mathcal{X}$, $t = 1, 2, \ldots$, is fed sequentially into the FSRE, which goes through a sequence of states, $s_t \in \mathcal{S}$, and produces an output sequence, $y_t \in \hat{\mathcal{X}}^*$, of variable-length strings of symbols from $\hat{\mathcal{X}}$, with the possible inclusion of the empty symbol, $\lambda$, of length zero. Referring to Figure 1, the FSRE is defined by the recursive equations:
$$y_t = u(x_t, s_t)$$
$$s_{t+1} = v(x_t, s_t),$$
for $t = 1, 2, \ldots$, where the initial state, $s_1$, is assumed to be some fixed member of $\mathcal{S}$.
Remark 1.
The above-defined model of the FSRE has some resemblance to the well-known model of the finite-state vector quantizer (FSVQ) [37], [38] (Chapter 14), but it is, in fact, considerably more general than the FSVQ. Specifically, the FSVQ works as follows. At each time instant t, it receives a source vector $x_t$ and outputs a finite-alphabet variable, $u_t$, while updating its internal state, $s_t$. The encoding function is $u_t = a(x_t, s_t)$ and the next-state function is $s_{t+1} = \phi(u_t, s_t)$. Note that the state evolves in response to $\{u_t\}$ (and not $\{x_t\}$), so that the decoder will be able to maintain its own copy of $\{s_t\}$. At the decoder, the reproduction is generated according to $\hat{x}_t = b(u_t, s_t)$, and the state is updated again using $s_{t+1} = \phi(u_t, s_t)$. By cascading the FSVQ encoder and its decoder, one obtains a system with input $x_t$ and output $\hat{x}_t$, which is basically a special case of our FSRE with the functions u and v given by $u(x, s) = b(a(x, s), s)$ and $v(x, s) = \phi(a(x, s), s)$.
As described above, given an input block of length k, $(x_1, x_2, \ldots, x_k)$, the FSRE generates a corresponding output block, $(y_1, y_2, \ldots, y_k)$, while traversing a sequence of states, $(s_1, \ldots, s_k)$. The FSRE must be designed in such a way that the total length of the concatenation of the (non-empty) variable-length strings, $y_1, y_2, \ldots, y_k$, is equal to k as well. Accordingly, given $(y_1, y_2, \ldots, y_k)$, let $(\hat{x}_1, \hat{x}_2, \ldots, \hat{x}_k)$ denote the corresponding vector of reproduction symbols from $\hat{\mathcal{X}}$, which forms the output of the FSRE. This formal transformation from $y^k$ to $\hat{x}^k$ is designated by the expression $y_t \to \hat{x}_t$ in Figure 1.
Example 1.
Let $\mathcal{X} = \hat{\mathcal{X}} = \{a, b, c\}$, and suppose that the FSRE is a block code of length k = 5. Suppose also that $x^5 = (a, a, b, c, c)$ and $y^5 = (\lambda, \lambda, \lambda, \lambda, aabbc)$. Then, $\hat{x}^5 = (a, a, b, b, c)$. The current state, in this case, is simply the contents of the input, starting from the beginning of the current block and ending at the current input symbol. Accordingly, the encoder idles until the end of the input block, and then it produces the full output block.
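To make Example 1 concrete, here is a minimal Python sketch (not part of the paper) of this block-code FSRE, implementing the recursion $y_t = u(x_t, s_t)$, $s_{t+1} = v(x_t, s_t)$ with the state holding the portion of the current block read so far; the per-block mapping `quantize_block` is a hypothetical stand-in for whatever reproduction codebook is used.

```python
# A minimal sketch of the block-code FSRE of Example 1 (k = 5). The encoder emits
# the empty string until the block is complete, then emits a full reproduction
# block. 'quantize_block' is a hypothetical per-block quantizer, not from the paper.

K = 5

def quantize_block(block):
    # Hypothetical quantizer: reproduces the block of Example 1 and otherwise
    # acts as the identity (zero distortion).
    if block == ('a', 'a', 'b', 'c', 'c'):
        return ('a', 'a', 'b', 'b', 'c')
    return block

def u(x_t, s_t):
    """Output function: idle (empty string) until the k-th symbol of the block."""
    block = s_t + (x_t,)
    return quantize_block(block) if len(block) == K else ()

def v(x_t, s_t):
    """Next-state function: append the input symbol, reset at block boundaries."""
    block = s_t + (x_t,)
    return () if len(block) == K else block

def fsre(x):
    s, y = (), []
    for x_t in x:
        y.append(u(x_t, s))
        s = v(x_t, s)
    return y

print(fsre(('a', 'a', 'b', 'c', 'c')))
# [(), (), (), (), ('a', 'a', 'b', 'b', 'c')]  -> concatenation gives x_hat^5
```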
The parameter k of the encoder E is the length of the basic block that is associated with the distortion constraint. For a given input alphabet $\mathcal{X}$, reconstruction alphabet $\hat{\mathcal{X}}$, and distortion function d, we denote by $\mathcal{E}(q, k, D)$ the class of all finite-state encoders with the above-described structure; in this class, the number of FSLE states is q, the dimension of the FSRE is k, and $d(x^k, \hat{x}^k) \le kD$ for every above-described $x^k \in \mathcal{X}^k$. For future use, we also define the ‘ball’
$$\mathcal{B}(x^k, D) = \{\hat{x}^k:\; d(x^k, \hat{x}^k) \le kD\}.$$
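For illustration, the ball $\mathcal{B}(x^k, D)$ can be enumerated by brute force when k is small; the following sketch (illustrative only, exponential in k) does so for an arbitrary single-letter distortion function.

```python
from itertools import product

def hamming(a, b):
    """Single-letter Hamming distortion."""
    return 0 if a == b else 1

def ball(x_block, D, recon_alphabet, d=hamming):
    """Brute-force enumeration of B(x^k, D) = {x_hat^k : sum_i d(x_i, x_hat_i) <= k*D}.
    Exponential in k; intended only to illustrate the definition."""
    k = len(x_block)
    return [xh for xh in product(recon_alphabet, repeat=k)
            if sum(d(a, b) for a, b in zip(x_block, xh)) <= k * D]

# All ternary 3-vectors within Hamming distance 1 of (a, a, b) (distortion budget ~1):
print(len(ball(('a', 'a', 'b'), 0.34, ('a', 'b', 'c'))))   # 7
```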
Remark 2.
Note that the role of the state variable, $s_t$, might not be only to store information from the past of the input, but possibly also to maintain the distortion budget within each k-block. At each time instant, t, the state can be used to update the remaining distortion allowed until the end of the current k-block. For example, if the entire allowed distortion budget, $kD$, has already been exhausted before the current k-block has ended, then in the remaining part of the current block, the encoder must carry on losslessly, that is, it must produce reproduction symbols that incur zero distortion relative to the corresponding source symbols.
The FSLE is defined similarly to the description provided in [27]. Specifically, the output of the FSRE, $\hat{x}_t \in \hat{\mathcal{X}}$, $t = 1, 2, \ldots$, is fed sequentially into the FSLE, which in turn goes through a sequence of states, $z_t \in \mathcal{Z}$, and produces an output sequence, $b_t \in \{0, 1\}^*$, of variable-length binary strings, with the possible inclusion of the empty symbol, $\lambda$, of length zero. Accordingly, the FSLE implements the recursive equations,
$$b_t = f(\hat{x}_t, z_t)$$
$$z_{t+1} = g(\hat{x}_t, z_t),$$
for $t = 1, 2, \ldots$, where the initial state, $z_1$, is assumed to be some fixed member of $\mathcal{Z}$.
With a slight abuse of notation, we adopt the extended use of the encoder functions u, v, f, and g to designate output sequences and final states that result from the corresponding initial states and inputs. We use the notations $u(s_1, x^n)$, $v(s_1, x^n)$, $f(z_1, u(s_1, x^n))$, and $g(z_1, u(s_1, x^n))$ for $\hat{x}^n$, $s_{n+1}$, $b^n$, and $z_{n+1}$, respectively. We assume the FSLE to be information lossless, defined similarly as in [27], as follows: for every $(z_1, s_1) \in \mathcal{Z}\times\mathcal{S}$, every positive integer n, and every $x^n \in \mathcal{X}^n$, the triple $(z_1, f(z_1, u(s_1, x^n)), g(z_1, u(s_1, x^n)))$ uniquely determines $\hat{x}^n$.
Given an encoder $E = (\mathcal{X}, \hat{\mathcal{X}}, \mathcal{S}, \mathcal{Z}, u, v, f, g, k) \in \mathcal{E}(q, k, D)$ and a source string $x^n$, where n is divisible by k, the compression ratio of $x^n$ by E is defined as
$$\rho(x^n; E) = \frac{L(b^n)}{n},$$
where $L(b^n) = \sum_{t=1}^{n} \ell(b_t)$, $\ell(b_t)$ being the length (in bits) of the binary string $b_t$. Next, define
$$\rho(x^n; \mathcal{E}(q, k, D)) = \min_{E \in \mathcal{E}(q, k, D)} \rho(x^n; E).$$
Our main objective is to derive bounds for $\rho(x^n; \mathcal{E}(q, k, D))$ for large k and $n \gg k$, with special interest in the case where q is large enough (in terms of k), but still fixed and independent of n, so that the FSLE can take advantage of the fact that not necessarily every $\hat{x}^n \in \hat{\mathcal{X}}^n$ can be obtained as an output of the given FSRE. In particular, a good FSLE with long memory should exploit the sparseness of the reproduction codebook relative to the entire space of k-vectors in $\hat{\mathcal{X}}^k$.

3. Lower Bounds

To present both the lower bounds and the achievability, we briefly review a few terms and facts concerning the 1978 version of the Lempel–Ziv algorithm (a.k.a. the LZ78 algorithm) [27]. The incremental parsing procedure of the LZ78 algorithm is a procedure of sequentially parsing a vector, $\hat{x}^k \in \hat{\mathcal{X}}^k$, such that each new phrase is the shortest string that has not been encountered before as a parsed phrase, with the possible exception of the last phrase, which might be incomplete. For example, the incremental parsing of the vector $\hat{x}^{15} = abbabaabbaaabaa$ is $a, b, ba, baa, bb, aa, ab, aa$. Let $c(\hat{x}^k)$ denote the number of phrases in $\hat{x}^k$ resulting from the incremental parsing procedure (in the above example, $c(\hat{x}^{15}) = 8$). Let $LZ(\hat{x}^k)$ denote the length of the LZ78 binary compressed code for $\hat{x}^k$. According to [27] Theorem 2,
$$\begin{aligned}
LZ(\hat{x}^k) &\le [c(\hat{x}^k)+1]\log\{2\beta[c(\hat{x}^k)+1]\}\\
&= c(\hat{x}^k)\log[c(\hat{x}^k)+1] + c(\hat{x}^k)\log(2\beta) + \log\{2\beta[c(\hat{x}^k)+1]\}\\
&= c(\hat{x}^k)\log c(\hat{x}^k) + c(\hat{x}^k)\log\left(1+\frac{1}{c(\hat{x}^k)}\right) + c(\hat{x}^k)\log(2\beta) + \log\{2\beta[c(\hat{x}^k)+1]\}\\
&\le c(\hat{x}^k)\log c(\hat{x}^k) + \log e + \frac{k(\log\beta)\log(2\beta)}{(1-\epsilon_k)\log k} + \log[2\beta(k+1)]\\
&\triangleq c(\hat{x}^k)\log c(\hat{x}^k) + k\cdot\varepsilon(k),
\end{aligned}$$
where we note that $\beta$ is the cardinality of $\hat{\mathcal{X}}$, and where $\epsilon_k$ and $\varepsilon(k)$ tend to zero as $k \to \infty$.
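The incremental parsing procedure itself is easy to state in code; the following minimal Python sketch (not the full LZ78 encoder) computes the parsing and the phrase count $c(\cdot)$, and reproduces the eight phrases of the example above.

```python
def incremental_parsing(x):
    """LZ78 incremental parsing: each new phrase is the shortest string not yet
    seen as a phrase; the last phrase may be an (incomplete) repeat."""
    phrases, seen, current = [], set(), ''
    for symbol in x:
        current += symbol
        if current not in seen:
            seen.add(current)
            phrases.append(current)
            current = ''
    if current:                      # possibly incomplete last phrase
        phrases.append(current)
    return phrases

phrases = incremental_parsing('abbabaabbaaabaa')
print(phrases, len(phrases))
# ['a', 'b', 'ba', 'baa', 'bb', 'aa', 'ab', 'aa'] 8
```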
Our first lower bound is given in the following theorem.
Theorem 1.
Consider the setting formulated in Section 2. Then, for every $x^n \in \mathcal{X}^n$,
$$\rho(x^n; \mathcal{E}(q, k, D)) \ge \frac{1}{n}\sum_{i=0}^{n/k-1}\min_{\hat{x}^k \in \mathcal{B}(x_{ik+1}^{ik+k}, D)} c(\hat{x}^k)\log c(\hat{x}^k) - \frac{(\log\beta)\log(4q^2)}{(1-\epsilon_k)\log k} - \frac{q^2\log(4q^2)}{k}.$$
Proof of Theorem 1.
The proof is conceptually simple. Since each k-block, $\hat{x}_{ik+1}^{ik+k}$, $i = 0, 1, \ldots, n/k-1$, of the reconstruction vector, $\hat{x}^n$, is compressed by a finite-state machine with q states, then, according to [27] Theorem 1, its compression ratio is lower bounded by
$$\begin{aligned}
\frac{c(\hat{x}_{ik+1}^{ik+k})+q^2}{k}\log\frac{c(\hat{x}_{ik+1}^{ik+k})+q^2}{4q^2}
&\ge \frac{c(\hat{x}_{ik+1}^{ik+k})}{k}\log c(\hat{x}_{ik+1}^{ik+k}) - \frac{c(\hat{x}_{ik+1}^{ik+k})+q^2}{k}\log(4q^2)\\
&\ge \frac{c(\hat{x}_{ik+1}^{ik+k})}{k}\log c(\hat{x}_{ik+1}^{ik+k}) - \frac{(\log\beta)\log(4q^2)}{(1-\epsilon_k)\log k} - \frac{q^2\log(4q^2)}{k},
\end{aligned}$$
where the second inequality follows from [27] Equation (6). Since each k-block must comply with the distortion constraint, this quantity is further lower bounded by
$$\min_{\hat{x}^k \in \mathcal{B}(x_{ik+1}^{ik+k}, D)}\frac{c(\hat{x}^k)}{k}\log c(\hat{x}^k) - \frac{(\log\beta)\log(4q^2)}{(1-\epsilon_k)\log k} - \frac{q^2\log(4q^2)}{k},$$
and so, for the entire source vector $x^n$, we have
$$\rho(x^n; \mathcal{E}(q, k, D)) \ge \frac{1}{n}\sum_{i=0}^{n/k-1}\min_{\hat{x}^k \in \mathcal{B}(x_{ik+1}^{ik+k}, D)} c(\hat{x}^k)\log c(\hat{x}^k) - \frac{(\log\beta)\log(4q^2)}{(1-\epsilon_k)\log k} - \frac{q^2\log(4q^2)}{k}.$$
This completes the proof of Theorem 1. □
For large enough k, the last two terms can be made arbitrarily small, provided that $\log q \ll \log k$. Clearly, this lower bound can be asymptotically attained by seeking, within each k-block, the vector $\hat{x}^k \in \hat{\mathcal{X}}^k$ that minimizes $c(\hat{x}^k)\log c(\hat{x}^k)$ across $\mathcal{B}(x_{ik+1}^{ik+k}, D)$ and compressing it by the LZ78 compression algorithm.
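A brute-force sketch of this per-block achievability scheme, reusing the hypothetical helpers `ball` and `incremental_parsing` sketched earlier (feasible only for very small k):

```python
from math import log2

def best_in_ball(x_block, D, recon_alphabet):
    """Within one k-block, pick a reproduction vector in B(x^k, D) minimizing
    c(x_hat^k) * log c(x_hat^k); that vector would then be fed to LZ78."""
    def weighted_phrase_count(xh):
        c = len(incremental_parsing(''.join(xh)))
        return c * log2(c)
    return min(ball(x_block, D, recon_alphabet), key=weighted_phrase_count)

# A reproduction of (a, a, b, c, c) within Hamming distortion 2 that parses into
# as few (weighted) phrases as possible, e.g., (a, a, a, a, c) with 3 phrases:
print(best_in_ball(('a', 'a', 'b', 'c', 'c'), 0.4, ('a', 'b', 'c')))
```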
In order to state our second lower bound, we next define the joint empirical distribution of $\ell$-blocks of $\hat{x}_{ik+1}^{ik+k}$. Specifically, let $\ell$ divide k, which in turn divides n, and consider the empirical distribution, $\hat{P}_i = \{\hat{P}_i(\hat{x}^\ell), \hat{x}^\ell \in \hat{\mathcal{X}}^\ell\}$, of $\ell$-vectors along the i-th k-block of $\hat{x}^n$, which is $\hat{x}_{ik+1}^{ik+k}$, $i = 0, 1, \ldots, n/k-1$, that is,
$$\hat{P}_i(\hat{x}^\ell) = \frac{\ell}{k}\sum_{j=0}^{k/\ell-1} \mathcal{I}\{\hat{x}_{ik+j\ell+1}^{ik+j\ell+\ell} = \hat{x}^\ell\}, \qquad \hat{x}^\ell \in \hat{\mathcal{X}}^\ell.$$
Let $\hat{H}(\hat{X}_i^\ell)$ denote the empirical entropy of an auxiliary random $\ell$-vector, $\hat{X}_i^\ell$, induced by $\hat{P}_i$, that is,
$$\hat{H}(\hat{X}_i^\ell) = -\sum_{\hat{x}^\ell \in \hat{\mathcal{X}}^\ell} \hat{P}_i(\hat{x}^\ell)\log\hat{P}_i(\hat{x}^\ell).$$
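Computationally, $\hat{P}_i$ and $\hat{H}(\hat{X}_i^\ell)$ amount to counting non-overlapping $\ell$-blocks within the i-th k-block; a small sketch, under the stated assumption that $\ell$ divides k:

```python
from collections import Counter
from math import log2

def block_empirical_entropy(xhat_block, ell):
    """Empirical entropy (bits) of the distribution P_hat_i of non-overlapping
    ell-blocks within one k-block, assuming ell divides k."""
    k = len(xhat_block)
    assert k % ell == 0
    counts = Counter(tuple(xhat_block[j:j + ell]) for j in range(0, k, ell))
    m = k // ell
    return -sum((c / m) * log2(c / m) for c in counts.values())

print(block_empirical_entropy('abababab', 2))   # one 2-block type -> 0.0
print(block_empirical_entropy('abbaabba', 2))   # two equiprobable types -> 1.0
```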
Now, our second lower bound is given in the following theorem.
Theorem 2.
Consider the setting formulated in Section 2. Then, for every $x^n \in \mathcal{X}^n$,
$$\rho(x^n; \mathcal{E}(q, k, D)) \ge \frac{k}{n}\sum_{i=0}^{n/k-1}\min_{\hat{x}_{ik+1}^{ik+k} \in \mathcal{B}(x_{ik+1}^{ik+k}, D)}\frac{\hat{H}(\hat{X}_i^\ell)}{\ell} - \frac{1}{\ell}\log\left\{q^2\left[1+\log\left(1+\frac{\beta^\ell}{q^2}\right)\right]\right\}.$$
Discussion.
Note that both lower bounds depend on the number of states, q, of the FSLE, but not on the number of states, $|\mathcal{S}|$, of the FSRE. In this sense, no matter how large the number of states of the FSRE may be, neither of these bounds is affected. For the purpose of lower bounds, which establish fundamental limitations, we wish to consider a class of encoders that is as broad as possible, for the sake of generality. Therefore, we assume that $|\mathcal{S}|$ is arbitrarily large.
The second term on the right-hand side of (14) is small when $\log q$ is small relative to $\ell$, which is in turn smaller than k. This requirement is less restrictive than the parallel one in the first bound, which was $\log q \ll \log k$. The bound is asymptotically achievable by universal lossless coding of the vector $\hat{x}_{ik+1}^{ik+k}$ that minimizes $\hat{H}(\hat{X}_i^\ell)$ within $\mathcal{B}(x_{ik+1}^{ik+k}, D)$, using a universal lossless code that is based on two-part coding: the first part is a header that indicates the type class $\hat{P}_i$, using a number of bits that grows only logarithmically with k, and the second part is the index of the vector within the type class.
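As a rough illustration of this two-part code (a sketch under the assumptions that $\ell$ divides k and that the $\ell$-blocks are treated as symbols of a super-alphabet of size $\beta^\ell$; the bit-accounting details below are not taken from the paper):

```python
from collections import Counter
from math import ceil, comb, log2

def two_part_code_length(xhat_block, ell, beta):
    """Two-part description of one k-block, viewed as m = k/ell symbols over a
    super-alphabet of size A = beta**ell.
    Header: index of the type class (at most comb(m+A-1, A-1) possible types).
    Body:   index of the sequence within its type class (a multinomial coefficient)."""
    k = len(xhat_block)
    m, A = k // ell, beta ** ell
    counts = Counter(tuple(xhat_block[j:j + ell]) for j in range(0, k, ell))
    header_bits = ceil(log2(comb(m + A - 1, A - 1)))
    type_class_size, remaining = 1, m
    for c in counts.values():                      # multinomial coefficient
        type_class_size *= comb(remaining, c)
        remaining -= c
    return header_bits + ceil(log2(type_class_size))

print(two_part_code_length('abababab', ell=2, beta=2))   # header only: 6 + 0 bits
```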
The main term of the second bound is essentially tighter than the main term of the first bound, since $\hat{H}(\hat{X}_i^\ell)/\ell$ is essentially lower bounded by $c(\hat{x}_{ik+1}^{ik+k})\log c(\hat{x}_{ik+1}^{ik+k})/k$, minus some small terms (see, e.g., [35] Equation (26)). On the other hand, the second bound is somewhat more complicated due to the introduction of the additional parameter $\ell$. It is not clear whether either of the bounds completely dominates the other for every $x^n$. It is always possible to choose the larger of the two.
Proof of Theorem 2.
According to [27] Lemma 2, since the FSLE is an information lossless encoder with q states, it must obey the following generalized Kraft inequality:
$$\sum_{\hat{x}^\ell \in \hat{\mathcal{X}}^\ell} 2^{-\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)]} \le q^2\left[1+\log\left(1+\frac{\beta^\ell}{q^2}\right)\right].$$
This implies that the description length at the output of the encoder is lower bounded as follows:
$$\begin{aligned}
L(b^n) &= \sum_{t=1}^{n} L[f(z_t, \hat{x}_t)]\\
&= \sum_{i=0}^{n/k-1}\sum_{m=0}^{k/\ell-1}\sum_{j=1}^{\ell} L[f(z_{ik+m\ell+j}, \hat{x}_{ik+m\ell+j})]\\
&= \sum_{i=0}^{n/k-1}\sum_{m=0}^{k/\ell-1} L[f(z_{ik+m\ell+1}, \hat{x}_{ik+m\ell+1}^{ik+m\ell+\ell})]\\
&\ge \sum_{i=0}^{n/k-1}\sum_{m=0}^{k/\ell-1}\min_{z\in\mathcal{Z}} L[f(z, \hat{x}_{ik+m\ell+1}^{ik+m\ell+\ell})]\\
&= \sum_{i=0}^{n/k-1}\frac{k}{\ell}\sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell}\hat{P}_i(\hat{x}^\ell)\cdot\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)].
\end{aligned}$$
Clearly,
$$\frac{L(b^n)}{n} \ge \frac{k}{n}\sum_{i=0}^{n/k-1}\frac{1}{\ell}\sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell}\hat{P}_i(\hat{x}^\ell)\cdot\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)].$$
Now, by the generalized Kraft inequality above,
$$\begin{aligned}
q^2\left[1+\log\left(1+\frac{\beta^\ell}{q^2}\right)\right] &\ge \sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell} 2^{-\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)]}\\
&\ge \sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell}\hat{P}_i(\hat{x}^\ell)\cdot 2^{-\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)] - \log\hat{P}_i(\hat{x}^\ell)}\\
&\ge \exp_2\left\{-\sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell}\hat{P}_i(\hat{x}^\ell)\cdot\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)] + \hat{H}(\hat{X}_i^\ell)\right\},
\end{aligned}$$
where the last inequality follows from the convexity of the exponential function and Jensen’s inequality. This yields
$$\log\left\{q^2\left[1+\log\left(1+\frac{\beta^\ell}{q^2}\right)\right]\right\} \ge \hat{H}(\hat{X}_i^\ell) - \sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell}\hat{P}_i(\hat{x}^\ell)\cdot\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)],$$
implying that
$$\begin{aligned}
\frac{L(b^n)}{n} &\ge \frac{k}{n}\sum_{i=0}^{n/k-1}\frac{1}{\ell}\sum_{\hat{x}^\ell\in\hat{\mathcal{X}}^\ell}\hat{P}_i(\hat{x}^\ell)\cdot\min_{z\in\mathcal{Z}} L[f(z, \hat{x}^\ell)]\\
&\ge \frac{k}{n}\sum_{i=0}^{n/k-1}\frac{\hat{H}(\hat{X}_i^\ell)}{\ell} - \frac{1}{\ell}\log\left\{q^2\left[1+\log\left(1+\frac{\beta^\ell}{q^2}\right)\right]\right\},
\end{aligned}$$
and since each $\hat{x}_{ik+1}^{ik+k}$ must be in $\mathcal{B}(x_{ik+1}^{ik+k}, D)$, the summand of the first term on the right-hand side cannot be smaller than $\min_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} \hat{H}(\hat{X}_i^\ell)/\ell$. Since this lower bound on $L(b^n)/n$ holds for every $E \in \mathcal{E}(q, k, D)$, it also holds for $\rho(x^n; \mathcal{E}(q, k, D))$. This completes the proof of Theorem 2. □
Returning now to the first lower bound, consider the following chain of inequalities:
$$\begin{aligned}
\sum_{i=0}^{n/k-1}\min_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} c(\hat{x}^k)\log c(\hat{x}^k)
&\ge \sum_{i=0}^{n/k-1}\left[\min_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} LZ(\hat{x}^k) - k\varepsilon(k)\right]\\
&= \sum_{i=0}^{n/k-1}\left[-\log\max_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} 2^{-LZ(\hat{x}^k)} - k\varepsilon(k)\right]\\
&\ge \sum_{i=0}^{n/k-1}\left[-\log\sum_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} 2^{-LZ(\hat{x}^k)} - k\varepsilon(k)\right].
\end{aligned}$$
It is conceivable that the last inequality may contribute to most of the gap between the left-most side and the right-most side of the chain (21), since we pass from a single term in $\mathcal{B}(x_{ik+1}^{ik+k}, D)$ to the sum of all terms in $\mathcal{B}(x_{ik+1}^{ik+k}, D)$. Since
$$\max_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} 2^{-LZ(\hat{x}^k)} \le \sum_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} 2^{-LZ(\hat{x}^k)} \le |\mathcal{B}(x_{ik+1}^{ik+k}, D)|\cdot\max_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} 2^{-LZ(\hat{x}^k)},$$
the gap between the left-most side of (21) and the right-most side of (21) might take any positive value that does not exceed $\log|\mathcal{B}(x_{ik+1}^{ik+k}, D)|$, which is in turn approximately proportional to k, as $|\mathcal{B}(x_{ik+1}^{ik+k}, D)|$ is asymptotically exponential in k. Thus, the right-most side of (21) corresponds to a coding rate that might be strictly smaller than that of the left-most side. Yet, we argue that the right-most side of (21) can still be asymptotically attained by a finite-state encoder. To this end, however, its FSLE component should possess $q = \beta^k$ states, as it is actually a block code of length k. In order to see this, we need to define the following universal probability distribution (see also [35] and references therein):
$$U(\hat{x}^k) = \frac{2^{-LZ(\hat{x}^k)}}{\sum_{\tilde{x}^k\in\hat{\mathcal{X}}^k} 2^{-LZ(\tilde{x}^k)}} \triangleq \frac{2^{-LZ(\hat{x}^k)}}{Z}, \qquad \hat{x}^k\in\hat{\mathcal{X}}^k,$$
and accordingly, also define
$$U[\mathcal{B}(x^k, D)] = \sum_{\hat{x}^k\in\mathcal{B}(x^k, D)} U(\hat{x}^k).$$
Now, the first term on the right-most side of (21) can be further manipulated as follows:
$$\begin{aligned}
\sum_{i=0}^{n/k-1}\left[-\log\sum_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} 2^{-LZ(\hat{x}^k)}\right]
&= \sum_{i=0}^{n/k-1}\left[-\log\left(\sum_{\hat{x}^k\in\mathcal{B}(x_{ik+1}^{ik+k}, D)} U(\hat{x}^k)\cdot Z\right)\right]\\
&\ge \sum_{i=0}^{n/k-1}\left\{-\log U[\mathcal{B}(x_{ik+1}^{ik+k}, D)]\right\},
\end{aligned}$$
where the last inequality is due to the fact that $\log Z \le 0$, thanks to Kraft’s inequality applied to the code-length function $LZ(\cdot)$.
Now, the last expression in (25) suggests achievability using the universal distribution, U, for the independent random selection of the various codewords. The basic idea is quite standard and simple: the quantity $U[\mathcal{B}(x_{ik+1}^{ik+k}, D)]$ is the probability that a single randomly chosen reproduction vector, drawn under U, would fall within distance $kD$ from the source vector, $x_{ik+1}^{ik+k}$. If all reproduction codewords are drawn independently under U, then the typical number of random selections required before one sees the first one in $\mathcal{B}(x_{ik+1}^{ik+k}, D)$ is of the exponential order of $1/U[\mathcal{B}(x_{ik+1}^{ik+k}, D)]$. Given that the codebook is revealed to both the encoder and decoder once it has been selected, the encoder merely needs to transmit the index of the first codeword that falls within $\mathcal{B}(x_{ik+1}^{ik+k}, D)$, and the description length of that index can be made essentially as small as $\log\{1/U[\mathcal{B}(x_{ik+1}^{ik+k}, D)]\} = -\log U[\mathcal{B}(x_{ik+1}^{ik+k}, D)]$. In [35], we used this simple idea to prove achievability for an arbitrary distortion measure. More precisely, the following theorem is stated and proved in [35], with some adjustments of the notation:
Theorem 3
([35] Theorem 2). Let $d: \mathcal{X}^k\times\hat{\mathcal{X}}^k\to\mathbb{R}^+$ be an arbitrary distortion function. Then, for every $\epsilon > 0$, there exists a sequence of d-semifaithful, variable-length block codes of block length k, such that for every $x^k \in \mathcal{X}^k$, the code length for $x^k$ is upper bounded by
$$L(x_{ik+1}^{ik+k}) \le -\log U[\mathcal{B}(x_{ik+1}^{ik+k}, D)] + (2+\epsilon)\log k + c + \delta_k,$$
where $c > 0$ is a constant and $\delta_k = O\left(k\beta^k e^{-k^{1+\epsilon}}\right)$.
Through the repeated application of this code for each one of the n / k blocks of length k, the lower bound of the last line of (25) is asymptotically attained. As elaborated on in [35], the ensemble of codebooks selected under the universal distribution, U, exhibits universality in both the source sequence slated for encoding and the chosen distortion measure. This stands in contrast to the classical random coding distribution, which typically relies on both the statistics of the source and the characteristics of the distortion measure.
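To get a feel for the per-block target $-\log U[\mathcal{B}(\cdot, D)]$, the following brute-force sketch evaluates it for a toy binary example under Hamming distortion; the LZ78 code-length accounting used here (a pointer to a previous phrase plus one new symbol per phrase) is a simple stand-in for $LZ(\cdot)$ rather than the exact formula of [27], and the enumeration is exponential in k.

```python
from itertools import product
from math import ceil, log2

def lz78_length(x, beta=2):
    """A simple LZ78 bit count: the j-th phrase is described by a pointer to one
    of the j previously available phrases (including the empty root) plus one
    new symbol; an incomplete last phrase is described by a pointer alone."""
    bits, seen, current, j = 0, set(), '', 0
    for symbol in x:
        current += symbol
        if current not in seen:
            seen.add(current)
            j += 1
            bits += ceil(log2(j)) + ceil(log2(beta))
            current = ''
    if current:
        bits += ceil(log2(j + 1))
    return bits

def neg_log_U_ball(x_block, D, alphabet=('0', '1')):
    """-log U[B(x^k, D)] under Hamming distortion, with U(x_hat) proportional
    to 2**(-LZ(x_hat)); brute force over all |alphabet|**k reproduction vectors."""
    k, beta = len(x_block), len(alphabet)
    weight = {xh: 2.0 ** -lz78_length(''.join(xh), beta)
              for xh in product(alphabet, repeat=k)}
    Z = sum(weight.values())
    in_ball = sum(w for xh, w in weight.items()
                  if sum(a != b for a, b in zip(x_block, xh)) <= k * D)
    return -log2(in_ball / Z)

print(neg_log_U_ball('01101100', 0.25))   # per-block rate target (in bits) for this block
```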
Discussion.
A natural question that may arise is whether this performance is the best that can be attained given that the number of FSLE states, q, is as large as $\beta^k$. For now, this question remains open, but it is conjectured that the answer is affirmative, in view of the matching converse theorem of [35] Theorem 1, which applies to the vast majority of source sequences in every type class of any order, even without the limitation of finite-state encoders.
It is natural to think of the memory resource used by a finite-state encoder in terms of the number of bits (or, equivalently, the size of a register) needed in order to store the current state at each time instant, namely, the base-2 logarithm of the total number of states. Indeed, both lower bounds derived earlier contain terms that are proportional to $\log q$, the memory size pertaining to the FSLE. Since the memory size, $\log|\mathcal{S}|$, of the FSRE is assumed arbitrarily large, as discussed earlier, the total size of the encoder memory, $\log|\mathcal{S}| + \log q$, is dominated by $\log|\mathcal{S}|$, and so, the contribution of $\log q$ to the total memory volume can be considered negligibly small. Therefore, one of our main messages in this work is that, as far as the total memory size goes, it makes very little difference if we allow $\log q$ to be as large as $k\log\beta$ and thereby achieve better performance, rather than keeping $\log q$ smaller and ending up with the inferior compression performance of minimizing $LZ(\hat{x}_{ik+1}^{ik+k})$ within $\mathcal{B}(x_{ik+1}^{ik+k}, D)$ for each block.

4. Conclusions

In this paper, we revisited the paradigm of lossy compression of individual sequences using finite-state machines, as a natural extension of the same paradigm in the lossless case, established by Ziv and Lempel in [27] and other related works. This work can also be viewed as a revisit of [35] from the perspective of finite-state encoding of individual sequences. Our model of a finite-state encoder is a cascade of a finite-state k-dimensional reproduction encoder (with an arbitrarily large number of states) and a finite-state lossless encoder acting on the reproduction sequence. Our main contributions in this work are as follows:
  • We proposed a model of a finite-state lossy encoder, composed of a cascade of an FSRE and an FSLE.
  • We derived two different lower bounds to the compression ratio.
  • We showed that both bounds depend on the number of states, q, of the lossless encoder, but not on the number of states of the reproduction encoder.
  • We showed that for relatively small q, one cannot do better than seeking the most compressible reproduction sequence within the ’sphere’ of radius $kD$ around the source vector. Nonetheless, if we allow $q = \beta^k$, we can improve the performance significantly by using a good code from the ensemble of codes where each codeword is selected independently at random under the universal distribution, U. The resulting code is universal, not only in the source sequence, as in [27], but also in the distortion function, in the sense discussed in [35]. This passage from small q to large q does not increase the total memory resources of the entire encoder significantly, considering the large memory that may be used by the reproduction encoder anyway.
  • We suggested the conjecture that the performance described in the previous item is the best achievable for large q.
Finally, our derivations can be extended to incorporate side information, $u^n = (u_1, u_2, \ldots, u_n)$, available to both the encoder and decoder. In the model of the finite-state encoder, this amounts to allowing both the FSRE and the FSLE sequential access to $u_t$, $t = 1, 2, \ldots$. The decoder, of course, should also have access to $u^n$. Another modification needed is to replace the LZ algorithm with its conditional version in all places (see, e.g., [31,39]).

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Conflicts of Interest

The author declares no conflicts of interest.

References

  1. Berger, T. Rate Distortion Theory—A Mathematical Basis for Data Compression; Prentice-Hall Inc.: Englewood Cliffs, NJ, USA, 1971.
  2. Cover, T.M.; Thomas, J.A. Elements of Information Theory; John Wiley & Sons: Hoboken, NJ, USA, 2006.
  3. Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: New York, NY, USA, 1968.
  4. Gray, R.M. Source Coding Theory; Kluwer Academic Publishers: Boston, MA, USA, 1990.
  5. Viterbi, A.J.; Omura, J.K. Principles of Digital Communication and Coding; McGraw-Hill Inc.: New York, NY, USA, 1979.
  6. Davisson, L.D. Universal noiseless coding. IEEE Trans. Inform. Theory 1973, IT-19, 783–795.
  7. Gallager, R.G. Source Coding with Side Information and Universal Coding; Unpublished Technical Report, LIDS-P-937; M.I.T.: Cambridge, MA, USA, 1976.
  8. Ryabko, B. Coding of a source with unknown but ordered probabilities. Probl. Inf. Transm. 1979, 15, 134–138.
  9. Davisson, L.D.; Leon-Garcia, A. A source matching approach to finding minimax codes. IEEE Trans. Inform. Theory 1980, 26, 166–174.
  10. Krichevsky, R.E.; Trofimov, R.K. The performance of universal encoding. IEEE Trans. Inform. Theory 1981, 27, 199–207.
  11. Shtar’kov, Y.M. Universal sequential coding of single messages. Probl. Inf. Transm. 1987, 23, 175–186.
  12. Barron, A.R.; Rissanen, J.; Yu, B. The minimum description length principle in coding and modeling. IEEE Trans. Inf. Theory 1998, 44, 2734–2760.
  13. Yang, Y.; Barron, A.R. Information-theoretic determination of minimax rates of convergence. Ann. Stat. 1999, 27, 1564–1599.
  14. Rissanen, J. Modeling by shortest data description. Automatica 1978, 14, 465–471.
  15. Rissanen, J. Universal coding, information, prediction, and estimation. IEEE Trans. Inf. Theory 1984, IT-30, 629–636.
  16. Merhav, N.; Feder, M. A strong version of the redundancy–capacity theorem of universal coding. IEEE Trans. Inform. Theory 1995, 41, 714–722.
  17. Ornstein, D.S.; Shields, P.C. Universal almost sure data compression. Ann. Probab. 1990, 18, 441–452.
  18. Zhang, Z.; Yang, E.-H.; Wei, V. The redundancy of source coding with a fidelity criterion. I. Known statistics. IEEE Trans. Inform. Theory 1997, 43, 71–91.
  19. Yu, B.; Speed, T. A rate of convergence result for a universal d-semifaithful code. IEEE Trans. Inform. Theory 1993, 39, 813–820.
  20. Silva, J.F.; Piantanida, P. On universal d-semifaithful coding for memoryless sources with infinite alphabets. IEEE Trans. Inf. Theory 2022, 68, 2782–2800.
  21. Kontoyiannis, I. Pointwise redundancy in lossy data compression and universal lossy data compression. IEEE Trans. Inform. Theory 2000, 46, 136–152.
  22. Kontoyiannis, I.; Zhang, J. Arbitrary source models and Bayesian codebooks in rate-distortion theory. IEEE Trans. Inform. Theory 2002, 48, 2276–2290.
  23. Mahmood, A.; Wagner, A.B. Lossy compression with universal distortion. IEEE Trans. Inform. Theory 2023, 69, 3525–3543.
  24. Mahmood, A.; Wagner, A.B. Minimax rate-distortion. IEEE Trans. Inform. Theory 2023, 69, 7712–7737.
  25. Sholomov, L.A. Measure of information in fuzzy and partially defined data. Dokl. Math. 2006, 74, 775–779.
  26. Ziv, J. Coding theorems for individual sequences. IEEE Trans. Inform. Theory 1978, IT-24, 405–412.
  27. Ziv, J.; Lempel, A. Compression of individual sequences via variable-rate coding. IEEE Trans. Inform. Theory 1978, IT-24, 530–536.
  28. Potapov, V.N. Redundancy estimates for the Lempel-Ziv algorithm of data compression. Discret. Appl. Math. 2004, 135, 245–254.
  29. Merhav, N.; Ziv, J. On the Wyner-Ziv problem for individual sequences. IEEE Trans. Inform. Theory 2006, 52, 867–873.
  30. Ziv, J. Fixed-rate encoding of individual sequences with side information. IEEE Trans. Inf. Theory 1984, IT-30, 348–452.
  31. Merhav, N. Finite-state source-channel coding for individual source sequences with source side information at the decoder. IEEE Trans. Inform. Theory 2022, 68, 1532–1544.
  32. Ziv, J. Distortion-rate theory for individual sequences. IEEE Trans. Inform. Theory 1980, IT-26, 137–143.
  33. Weinberger, M.J.; Merhav, N.; Feder, M. Optimal sequential probability assignment for individual sequences. IEEE Trans. Inform. Theory 1994, 40, 384–396.
  34. Merhav, N. D-semifaithful codes that are universal over both memoryless sources and distortion measures. IEEE Trans. Inform. Theory 2023, 69, 4746–4757.
  35. Merhav, N. A universal random coding ensemble for sample-wise lossy compression. Entropy 2023, 25, 1199.
  36. Neuhoff, D.L.; Gilbert, R.K. Causal source codes. IEEE Trans. Inform. Theory 1982, IT-28, 701–713.
  37. Foster, J.; Gray, R.M.; Ostendorf Dunham, M. Finite-state vector quantization for waveform coding. IEEE Trans. Inform. Theory 1985, IT-31, 348–359.
  38. Gersho, A.; Gray, R.M. Vector Quantization and Signal Compression, 8th ed.; Springer Science+Business Media: New York, NY, USA, 2001; originally published by Kluwer Academic Publishers: New York, NY, USA, 1992.
  39. Ziv, J. Universal decoding for finite-state channels. IEEE Trans. Inform. Theory 1985, IT-31, 453–460.
Figure 1. Finite-state reproduction encoder followed by a finite-state lossless encoder.
