# Characterizing the Asymptotic Per-Symbol Redundancy of Memoryless Sources over Countable Alphabets in Terms of Single-Letter Marginals


## Abstract

We consider universal compression of strings of positive integers (ℤ_{+}), the strings being generated independently and identically distributed (i.i.d.) according to an unknown distribution over ℤ_{+} in a known collection $\mathcal{P}$. We first show that if describing a single symbol incurs finite redundancy, then $\mathcal{P}$ is tight, but that the converse does not always hold. If a single symbol can be described with finite worst-case regret (a more stringent formulation than the redundancy above), then it is known that describing length-n i.i.d. strings incurs only vanishing (to zero) redundancy per symbol as n increases. On the contrary, we show that it is possible for the description of a single symbol from an unknown distribution of $\mathcal{P}$ to incur finite redundancy, yet for the description of length-n i.i.d. strings to incur a constant (> 0) redundancy per symbol encoded. We then show a sufficient condition on single-letter marginals under which length-n i.i.d. samples incur vanishing redundancy per symbol encoded.

## 1. Introduction

The primary theme of this paper is to collect such results on the redundancy of classes over countably infinite support.

When $\mathcal{P}$ is the collection of all i.i.d. distributions over a finite alphabet of size k, it is well known that the redundancy of ${\mathcal{P}}^{n}$ scales as $\frac{k-1}{2}\text{log}\hspace{0.17em}n$. However, when $\mathcal{P}$ does not have a finite support, the above bounds are meaningless.

Let ℤ_{+} = {1, 2, 3, ...} be the set of positive integers and ℕ = {0, 1, 2, ...} be the set of non-negative integers. What about the case where the single-letter redundancy of a collection $\mathcal{P}$ over ℤ_{+} is finite? A well-known redundancy-capacity argument [4] can then be used to interpret the redundancy, equating it to the amount of information we can obtain about the source from the data. In this case, finite (respectively, infinite) redundancy of $\mathcal{P}$ implies that a single symbol contains a finite (respectively, infinite) amount of information about the model.

If a collection $\mathcal{P}$ over ℤ_{+} has finite single-letter redundancy, does it follow that the redundancy of length-n i.i.d. strings from $\mathcal{P}$ grows sublinearly? Equivalently, do finite-redundancy collections behave similarly to their fixed-alphabet counterparts? If true, roughly speaking, such a result would tell us that as the universal encoder sees more and more of the sequence, it learns less and less about the underlying model. This would be in line with our intuition: seeing more data pins down the model, so the more data we have already seen, the less there is left to learn. Yet, as we will show, that is not the case.

We also show that if the single-letter redundancy of a collection $\mathcal{P}$ over ℤ_{+} is finite, then $\mathcal{P}$ is tight. This turns out to be a useful tool for checking whether the redundancy is finite, as in [3], for example.

## 2. Notation and Background

Let ℤ_{+} = {1, 2, 3, ...} be the set of positive integers and ℕ = {0, 1, 2, ...} be the set of non-negative integers.

#### 2.1. Redundancy

Let $\mathcal{P}$ be a collection of distributions over ℤ_{+}, and let ${\mathcal{P}}^{n}$ be the set of distributions over length-n sequences obtained by i.i.d. sampling from distributions in $\mathcal{P}$.

${\mathcal{P}}^{\infty}$ is the collection of measures over infinite-length sequences of ℤ_{+} obtained by i.i.d. sampling, constructed as follows. Observe that ${\mathbb{Z}}_{+}^{n}$ is countable for every n. For simplicity of exposition, we will think of each length-n string **x** as a subset of ${\mathbb{Z}}_{+}^{\infty}$, namely the set of all semi-infinite strings of positive integers that begin with **x**. Each subset of ${\mathbb{Z}}_{+}^{n}$ is therefore a subset of ${\mathbb{Z}}_{+}^{\infty}$. Now, the collection $\mathcal{J}$ of all subsets of ${\mathbb{Z}}_{+}^{n}$, over all n ∈ ℤ_{+}, is a semi-algebra [5]. The probability i.i.d. sampling assigns to a finite union of disjoint sets in $\mathcal{J}$ is the sum of the probabilities assigned to the components of the union. Therefore, there is a sigma-algebra over the uncountable set ${\mathbb{Z}}_{+}^{\infty}$ that extends $\mathcal{J}$ and matches the probabilities assigned to sets in $\mathcal{J}$ by i.i.d. sampling. The reader can assume that each measure in ${\mathcal{P}}^{\infty}$ is the measure on the minimal sigma-algebra that extends $\mathcal{J}$ and matches the probabilities i.i.d. sampling gives to sets in $\mathcal{J}$. See, e.g., [5] for a development of elementary measure theory that lays out the above steps.

The length-n redundancy of $\mathcal{P}$ is defined as
$${R}_{n}({\mathcal{P}}^{\infty})=\underset{q}{\text{inf}}\hspace{0.17em}\underset{p\in \mathcal{P}}{\text{sup}}\hspace{0.17em}{E}_{p}\text{log}\frac{p({X}_{1}^{n})}{q({X}_{1}^{n})},\phantom{\rule{2em}{0ex}}(1)$$
where the infimum is over all distributions q on ${\mathbb{Z}}_{+}^{n}$; when n = 1, we call this the single-letter redundancy. We obtain the per-symbol redundancy by normalizing ${R}_{n}({\mathcal{P}}^{\infty})$ in (1) by the block length n. We will call ${R}_{n}({\mathcal{P}}^{\infty})/n$ the per-symbol length-n redundancy.

In this paper, we study the asymptotic behavior of ${R}_{n}({\mathcal{P}}^{\infty})/n$. The limit superior $\underset{n\to \infty}{\text{lim sup}}\hspace{0.17em}{R}_{n}({\mathcal{P}}^{\infty})/n$ is the asymptotic per-symbol redundancy. Whether the asymptotic per-symbol redundancy is zero (we will equivalently say that the asymptotic per-symbol redundancy diminishes to zero, to keep in line with prior literature) is in many ways a litmus test for compression, estimation and other related problems. Loosely speaking, if ${R}_{n}({\mathcal{P}}^{\infty})/n\to 0$, the redundancy-capacity interpretation [4] mentioned above implies that after a point, there is little further information to be learned from an additional symbol, no matter what the underlying source is. In this sense, this is the case where we can actually learn the underlying model at a uniform rate over the entire class.

#### 2.2. Patterns

The pattern of a string **x** is denoted by Ψ(**x**); loosely speaking, it is obtained by replacing each symbol of **x** with the index of its first appearance, so that, for example, Ψ(5, 9, 5, 2) = 1213. There is only one possible pattern of strings of length one (no matter what the alphabet, the pattern of a length-one string is 1), two possible patterns of strings of length two (11 and 12), and so on. The number of possible patterns of length n is the n-th Bell number [1], and we denote the set of all possible length-n patterns by ${\Psi}^{n}$. The measure induced on patterns by a corresponding measure p on infinite sequences of positive integers assigns to any pattern ψ of length n the probability:
$$p(\psi )=p\left(\left\{{x}_{1}^{n}:\Psi \left({x}_{1}^{n}\right)=\psi \right\}\right).$$

For convenience, we denote the measure induced on patterns by p as ${p}_{\Psi}$.
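As an illustrative aside (not part of the original development), the pattern map and the Bell-number count can be sketched in Python; the helper names `pattern` and `num_patterns` are ours:

```python
from itertools import product

def pattern(x):
    """Return the pattern of the sequence x: each symbol is replaced by
    the order (1, 2, 3, ...) in which it first appears."""
    first_seen = {}
    out = []
    for s in x:
        if s not in first_seen:
            first_seen[s] = len(first_seen) + 1
        out.append(first_seen[s])
    return tuple(out)

def num_patterns(n):
    """Count distinct patterns of length n by brute force; an alphabet
    of size n suffices to realize every length-n pattern.  The counts
    are the Bell numbers 1, 2, 5, 15, 52, ..."""
    return len({pattern(x) for x in product(range(n), repeat=n)})
```

For instance, `pattern((5, 9, 5, 2))` yields `(1, 2, 1, 3)`, matching the example Ψ(5, 9, 5, 2) = 1213 above.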

One can interpret a measure ${q}_{\Psi}$ on patterns as a sequential prediction procedure that estimates the probability that the symbol ${X}_{n+1}$ will be “new” (has not appeared in ${X}_{1}^{n}$) and the probability that ${X}_{n+1}$ takes a value that has been seen so far. This view of estimation also appears in the statistical literature on Bayesian nonparametrics that focuses on exchangeability. Kingman [7] advocated the use of exchangeable random partitions to accommodate the analysis of data from an alphabet that is not bounded or known in advance. A more detailed discussion of the history and philosophy of this problem can be found in the works of Zabell [8,9] collected in [10].

#### 2.3. Cumulative Distributions and Tight Collections

The cumulative distribution function of a distribution p over ℤ_{+} (ℕ, respectively) is a function ${F}_{p}:\mathbb{R}\cup \{\infty \}\to [0,1]$ defined in the following (slightly unconventional) way. We let ${F}_{p}(0)=0$ in case the support is ℤ_{+} (${F}_{p}(-1)=0$ if the support is ℕ, respectively). We then define ${F}_{p}$ on points in the support of p in the way cumulative distribution functions are normally defined; specifically, for all y in the support of p,
$${F}_{p}(y)=\sum _{x\le y}p(x).$$
In addition, ${F}_{p}(-\infty ):=0$ and ${F}_{p}(\infty ):=1$. Finally, we extend the definition of ${F}_{p}$ to all real numbers by linearly interpolating between the values defined already.
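A minimal numerical sketch of this slightly unconventional CDF (an aside; `make_cdf` is our illustrative helper, for a distribution with finite support over ℤ_{+}):

```python
import bisect

def make_cdf(probs):
    """Piecewise-linear CDF F_p for a distribution over Z_+.

    probs[k] is p(k+1).  F_p(0) = 0, F_p is defined at support points
    in the usual cumulative way, and linearly interpolated in between."""
    xs = [0.0] + [float(k + 1) for k in range(len(probs))]
    Fs = [0.0]
    total = 0.0
    for pk in probs:
        total += pk
        Fs.append(total)

    def F(y):
        if y <= xs[0]:
            return 0.0
        if y >= xs[-1]:
            return Fs[-1]
        i = bisect.bisect_right(xs, y)
        # linear interpolation between (xs[i-1], Fs[i-1]) and (xs[i], Fs[i])
        t = (y - xs[i - 1]) / (xs[i] - xs[i - 1])
        return Fs[i - 1] + t * (Fs[i] - Fs[i - 1])

    def F_inv(gamma):
        """Smallest y with F(y) >= gamma, under the same interpolation."""
        i = bisect.bisect_left(Fs, gamma)
        if i == 0:
            return xs[0]
        t = (gamma - Fs[i - 1]) / (Fs[i] - Fs[i - 1])
        return xs[i - 1] + t * (xs[i] - xs[i - 1])

    return F, F_inv
```

Because of the linear interpolation, F is continuous and strictly increasing wherever the local probabilities are positive, so the inverse is well behaved there.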

We will also need an inverse ${F}_{p}^{-1}$, defined as follows. To begin with, ${F}_{p}^{-1}(1)$ is the smallest y satisfying ${F}_{p}(y)=1$. It follows [11] then that ${F}_{p}^{-1}$ is well defined on [0, 1], and for all 0 ≤ γ ≤ 1:
$${F}_{p}\left({F}_{p}^{-1}(\gamma )\right)\ge \gamma .$$

A collection $\mathcal{P}$ of distributions over ℤ_{+} is defined to be tight if for all γ > 0, there is a finite ${y}_{\gamma}$ such that:
$$\underset{p\in \mathcal{P}}{\text{sup}}\hspace{0.17em}p\left(\left\{x:x>{y}_{\gamma}\right\}\right)\le \gamma .$$

## 3. Redundancy and Tightness

**Lemma 1.** A collection $\mathcal{P}$ over ℕ with bounded single-letter redundancy is tight. Namely, if the single-letter redundancy of $\mathcal{P}$ is finite, then for any γ > 0, there is a finite y such that:
$$\underset{p\in \mathcal{P}}{\text{sup}}\hspace{0.17em}p\left(\left\{x:x>y\right\}\right)\le \gamma .$$

**Proof.** Since $\mathcal{P}$ has bounded single-letter redundancy R, fix a distribution q over ℕ such that:
$$R\ge \underset{p\in \mathcal{P}}{\text{sup}}\hspace{0.17em}D(p\Vert q),$$
where D(p‖q) is the Kullback–Leibler divergence between p and q. We will first show that for all p ∈ $\mathcal{P}$ and any m > 0,
$$p\left(\left\{y\in \mathbb{N}:\left|\text{log}\frac{p(y)}{q(y)}\right|>m\right\}\right)\le \frac{R+(2\hspace{0.17em}\text{log}\hspace{0.17em}e)/e}{m},$$
and let ${m}^{*}$ be the smallest integer such that $(R+(2\hspace{0.17em}\text{log}\hspace{0.17em}e)/e)/{m}^{*}<\gamma /2$. Equivalently, for all γ > 0 and p ∈ $\mathcal{P}$, we show that:
$$p\left(\left\{y\in \mathbb{N}:\left|\text{log}\frac{p(y)}{q(y)}\right|>{m}^{*}\right\}\right)\le \frac{\gamma}{2}.$$
Consider:

- (i) the set ${W}_{1}=\left\{x\in \mathbb{N}:x>2{F}_{q}^{-1}\left(1-\gamma /{2}^{{m}^{*}+2}\right)\text{ and }\text{log}\frac{p(x)}{q(x)}>{m}^{*}\right\}$. Clearly: $${W}_{1}\subseteq \left\{y\in \mathbb{N}:\left|\text{log}\frac{p(y)}{q(y)}\right|>{m}^{*}\right\},$$ so that $$p({W}_{1})\le p\left(\left\{y\in \mathbb{N}:\left|\text{log}\frac{p(y)}{q(y)}\right|>{m}^{*}\right\}\right)\le \frac{\gamma}{2};$$
- (ii) the set ${W}_{2}=\left\{x\in \mathbb{N}:x>2{F}_{q}^{-1}\left(1-\gamma /{2}^{{m}^{*}+2}\right)\text{ and }\text{log}\frac{p(x)}{q(x)}\le {m}^{*}\right\}$. Clearly: $${W}_{2}\subseteq \left\{y\in \mathbb{N}:y>2{F}_{q}^{-1}\left(1-\gamma /{2}^{{m}^{*}+2}\right)\right\},$$ so that $$q\left({W}_{2}\right)\le q\left(\left\{y\in \mathbb{N}:y>2{F}_{q}^{-1}\left(1-\gamma /{2}^{{m}^{*}+2}\right)\right\}\right)\le \frac{\gamma}{{2}^{{m}^{*}+1}}.$$

All elements x ∈ ${W}_{2}$ satisfy $\text{log}\frac{p(x)}{q(x)}\le {m}^{*}$, or equivalently, $p(x)\le q(x){2}^{{m}^{*}}$. Hence, we have: $$p({W}_{2})\le q({W}_{2}){2}^{{m}^{*}}\le \frac{\gamma {2}^{{m}^{*}}}{{2}^{{m}^{*}+1}}=\frac{\gamma}{2}.$$ Since every $x>2{F}_{q}^{-1}\left(1-\gamma /{2}^{{m}^{*}+2}\right)$ belongs to ${W}_{1}\cup {W}_{2}$, we conclude that $p\left(\left\{x:x>2{F}_{q}^{-1}\left(1-\gamma /{2}^{{m}^{*}+2}\right)\right\}\right)\le \gamma $ for every p ∈ $\mathcal{P}$, establishing tightness. □

That the converse of Lemma 1 need not hold can be seen from the following collection $\mathcal{I}$ of distributions over ℤ_{+}. First, partition the set of positive integers into the sets ${T}_{i}$, i ∈ ℕ, where:
$${T}_{i}=\left\{{2}^{i},\dots ,{2}^{i+1}-1\right\},$$
so that $|{T}_{i}|={2}^{i}$. Now, $\mathcal{I}$ is the collection of all possible distributions that can be formed as follows: for all i ∈ ℕ, pick exactly one element of ${T}_{i}$ and assign probability $\frac{1}{(i+1)(i+2)}$ to the element of ${T}_{i}$ chosen. (Choosing the support as above implicitly assumes the axiom of choice.) Note that the set $\mathcal{I}$ is uncountably infinite.
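As an aside, the construction of a member of $\mathcal{I}$ is easy to simulate; `sample_member_of_I` is our illustrative helper, truncated to finitely many blocks:

```python
import random

def sample_member_of_I(levels, seed=0):
    """Construct (a truncation of) one distribution in the collection I:
    from each block T_i = {2^i, ..., 2^(i+1) - 1}, i = 0, 1, ..., pick one
    support point and give it probability 1/((i+1)(i+2))."""
    rng = random.Random(seed)
    p = {}
    for i in range(levels):
        x = rng.randrange(2**i, 2**(i + 1))  # one element of T_i
        p[x] = 1.0 / ((i + 1) * (i + 2))
    return p
```

The probabilities telescope: $\sum_{i=0}^{N-1}\frac{1}{(i+1)(i+2)}=1-\frac{1}{N+1}$, so each member of $\mathcal{I}$ is indeed a valid distribution in the limit, and the mass beyond block i is 1/(i + 2), which is the tail bound used in Corollary 2 below.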

**Corollary 2.** The set $\mathcal{I}$ of distributions is tight.

**Proof.** For all p ∈ $\mathcal{I}$ and all i ∈ ℕ,
$$p\left(\left\{x:x\ge {2}^{i+1}\right\}\right)=\sum _{j>i}\frac{1}{(j+1)(j+2)}=\frac{1}{i+2},$$
which can be made smaller than any γ > 0 by choosing i large enough. □

**Proposition 1.** The collection $\mathcal{I}$ does not have finite redundancy.

**Proof.** Suppose q is any distribution over ℤ_{+}. We will show that there exists p ∈ $\mathcal{I}$, such that:
$$D(p\Vert q)=\infty .$$
Since q is a distribution over ℤ_{+} and, for all i, $|{T}_{i}|={2}^{i}$, q assigns probability at most $1/{2}^{i}$ to the least likely element of each ${T}_{i}$. It follows that for all i, there is ${x}_{i}\in {T}_{i}$, such that:
$$q({x}_{i})\le \frac{1}{{2}^{i}}.$$
Consider the distribution ${p}^{*}\in \mathcal{I}$ that has for its support the set $\{{x}_{i}:i\in \mathbb{N}\}$ identified above. Furthermore, ${p}^{*}$ assigns:
$${p}^{*}({x}_{i})=\frac{1}{(i+1)(i+2)}.$$
Then
$$D({p}^{*}\Vert q)\ge \sum _{i\ge 0}\frac{1}{(i+1)(i+2)}\hspace{0.17em}\text{log}\frac{{2}^{i}}{(i+1)(i+2)},$$
and since the i-th term above scales as 1/i, the sum diverges. The divergence from ${p}^{*}$ to q is therefore not finite, and the Proposition follows, since q is arbitrary. □
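The divergence in the proof above can be observed numerically: the partial sums of the lower bound on $D({p}^{*}\Vert q)$ grow without bound, like the harmonic series. A small sketch (an aside; `kl_lower_bound_partial` is our illustrative helper, using bits, i.e., base-2 logarithms):

```python
import math

def kl_lower_bound_partial(N):
    """Partial sums of the lower bound on D(p*||q) from Proposition 1:
    sum over i of p*(x_i) * log2(2^i / ((i+1)(i+2))).
    The i-th term behaves like 1/i, so the sums diverge (slowly)."""
    total = 0.0
    for i in range(N):
        w = 1.0 / ((i + 1) * (i + 2))
        total += w * (i - math.log2((i + 1) * (i + 2)))
    return total
```

The first few terms are negative, but the sum eventually grows roughly like the logarithm of the number of terms, confirming that no single q can achieve finite divergence against all of $\mathcal{I}$.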

## 4. Length n Redundancy

In this section, we examine when the redundancy of ${\mathcal{P}}^{\infty}$ grows sublinearly in the block length n.

**Lemma 3.** Let $\mathcal{P}$ be a collection of distributions over a countable support $\mathcal{X}$. For some m ∈ ℤ_{+}, consider m pairwise disjoint subsets ${S}_{i}\subset \mathcal{X}$ (1 ≤ i ≤ m), and let δ > 1/2. If there exist ${p}_{1},\dots ,{p}_{m}\in \mathcal{P}$, such that ${p}_{i}({S}_{i})\ge \delta $ for all 1 ≤ i ≤ m, then the redundancy of $\mathcal{P}$ is lower bounded by a quantity that grows as $\delta \hspace{0.17em}\text{log}\hspace{0.17em}m$. If, in addition, there are infinitely many such disjoint sets ${S}_{i}$, i ∈ ℤ_{+}, and distributions ${p}_{i}\in \mathcal{P}$, such that ${p}_{i}({S}_{i})\ge \delta $, then the redundancy is infinite.

**Proof.** This is a simplified formulation of the distinguishability concept in [4]. For a proof, see, e.g., [12]. □

#### 4.1. Counterexample

We now show that a collection can have finite single-letter redundancy while the asymptotic per-symbol redundancy (the redundancy of length-n strings from the collection, normalized by n) remains bounded away from zero as the block length goes to infinity. To show this, we obtain such a collection $\mathcal{B}$.

As before, partition ℤ_{+} into ${T}_{i}=\{{2}^{i},\dots ,{2}^{i+1}-1\}$, i ∈ ℕ, and recall that ${T}_{i}$ has ${2}^{i}$ elements. For all 0 < ∊ ≤ 1, let ${n}_{\u220a}=\lfloor \frac{1}{\u220a}\rfloor $. For 1 ≤ j ≤ ${2}^{{n}_{\u220a}}$, let ${p}_{\u220a,j}$ be a distribution on ℤ_{+} that assigns probability 1 − ∊ to the number one (or, equivalently, to the set ${T}_{0}$) and ∊ to the j-th smallest element of ${T}_{{n}_{\u220a}}$, namely the number ${2}^{{n}_{\u220a}}+j-1$. $\mathcal{B}$ (mnemonic for binary, since every distribution has a support of size two) is the collection of distributions ${p}_{\u220a,j}$ for all ∊ > 0 and 1 ≤ j ≤ ${2}^{{n}_{\u220a}}$. ${\mathcal{B}}^{\infty}$ is the set of measures over infinite sequences of numbers corresponding to i.i.d. sampling from $\mathcal{B}$.
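The construction of a member of $\mathcal{B}$ is simple enough to state in code (an aside; `p_eps_j` is our illustrative helper):

```python
import math

def p_eps_j(eps, j):
    """A member of the collection B: support of size two, assigning
    1 - eps to the symbol 1 and eps to the j-th smallest element of
    T_{n_eps}, where n_eps = floor(1/eps)."""
    assert 0 < eps <= 1
    n_eps = math.floor(1 / eps)
    assert 1 <= j <= 2**n_eps
    support_point = 2**n_eps + j - 1  # lies in T_{n_eps}
    return {1: 1 - eps, support_point: eps}
```

Note the key feature exploited below: as ∊ shrinks, the ${2}^{{n}_{\u220a}}$ possible locations of the rare symbol multiply exponentially fast.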

**Proposition 2.** Let q be a distribution that assigns $q({T}_{i})=\frac{1}{(i+1)(i+2)}$, spread uniformly within each block; namely, for all j ∈ ${T}_{i}$, $q(j)=\frac{1}{{2}^{i}(i+1)(i+2)}$. Then the single-letter redundancy of $\mathcal{B}$ against q is finite.

However, as we show next, the redundancy of ${\mathcal{B}}^{\infty}$ scales linearly with n.

**Proposition 3.** For all n ∈ ℤ_{+}, the length-n redundancy of ${\mathcal{B}}^{\infty}$ grows linearly in n.

**Proof.** Let {${1}^{n}$} denote the set containing the length-n sequence of all ones. For all n, define ${2}^{n}$ pairwise disjoint subsets ${S}_{i}$ of ${\mathbb{Z}}_{+}^{n}$, 1 ≤ i ≤ ${2}^{n}$, where ${S}_{i}$ is the set of length-n strings containing only ones and the number (${2}^{n}+i-1$), with at least one occurrence of ${2}^{n}+i-1$. Clearly, for distinct i and j between one and ${2}^{n}$, ${S}_{i}$ and ${S}_{j}$ are disjoint. Furthermore, the measure ${p}_{\frac{1}{n},i}\in {\mathcal{B}}^{\infty}$ assigns ${S}_{i}$ the probability:
$${p}_{\frac{1}{n},i}({S}_{i})=1-{\left(1-\frac{1}{n}\right)}^{n}\ge 1-\frac{1}{e}>\frac{1}{2}.$$
From Lemma 3, applied with $m={2}^{n}$ and $\delta =1-1/e$, the redundancy of length-n strings from ${\mathcal{B}}^{\infty}$ is lower bounded by a quantity that grows as $\left(1-\frac{1}{e}\right)n$, namely linearly in n. □
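The probability bound used in the proof, $1-{(1-\frac{1}{n})}^{n}\ge 1-\frac{1}{e}$, holds for every n and is easy to check numerically (an aside; `prob_Si` is our illustrative helper):

```python
import math

def prob_Si(n):
    """Probability the measure p_{1/n,i} assigns to S_i in Proposition 3:
    at least one occurrence of the rare symbol among n i.i.d. draws,
    each rare with probability 1/n."""
    return 1 - (1 - 1 / n) ** n
```

Since $(1-1/n)^n$ increases to $1/e$, the probability of each ${S}_{i}$ stays above $1-1/e\approx 0.632>1/2$ uniformly in n, which is exactly what Lemma 3 requires.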

Note that, although the single-letter redundancy of $\mathcal{B}$ over ℤ_{+} is finite, the single-letter tail redundancy, as described in the equation below, does not diminish to zero; namely, for all M:

#### 4.2. Sufficient Condition

In this section, we give a sufficient condition on single-letter marginals for the redundancy of ${\mathcal{P}}^{\infty}$ to grow sublinearly with n. This condition is, however, not necessary, and the characterization of a condition that is both necessary and sufficient is as yet open.

For a distribution p and ∊ > 0, let ${A}_{p,\u220a}$ be the set of all elements in the support of p with probability ≥ ∊, and let ${T}_{p,\u220a}={\mathbb{Z}}_{+}-{A}_{p,\u220a}$. Let ${G}_{0}=\{\varphi \}$, where ϕ denotes the empty string. For all i, the sets ${G}_{i}$ are defined as follows, where we write $\{{x}_{1},\dots ,{x}_{i}\}$ to denote the set of distinct symbols in the string ${x}_{1}^{i}$:

Let ${B}_{0}=\{\}$, and let ${B}_{i}={\mathbb{Z}}_{+}^{i}-{G}_{i}$. Observe, from an argument similar to the coupon collector problem, that:

**Lemma 4.**For all i ≥ 2,

**Proof.** The proof follows from an elementary union bound:

**Theorem 5.** Suppose $\mathcal{P}$ is a collection of distributions over ℤ_{+}. Let the entropy be uniformly bounded over the entire collection and, in addition, let the redundancy of the collection be finite. Namely,

where ${T}_{p,\delta}$ denotes the subset of the support of p, all of whose elements have probability < δ. Let:

Then the length-n redundancy of $\mathcal{P}$, ${R}_{n}({\mathcal{P}}^{\infty})$, grows sublinearly:
$$\underset{n\to \infty}{\text{lim}}\frac{{R}_{n}({\mathcal{P}}^{\infty})}{n}=0.$$

**Remark.** If the conditions of the theorem are met, we can always assume, without loss of generality, that there is a distribution ${q}_{1}$ that satisfies (3) and simultaneously has finite redundancy. To see this, suppose ${q}_{1}^{\prime}$ satisfies the finite-redundancy condition and ${q}_{1}^{\u2033}$ satisfies (3). Then, for all x ∈ ℤ_{+}, ${q}_{1}(x)=\frac{{q}_{1}^{\prime}(x)+{q}_{1}^{\u2033}(x)}{2}$ satisfies both conditions simultaneously, since ${q}_{1}\ge {q}_{1}^{\prime}/2$ and ${q}_{1}\ge {q}_{1}^{\u2033}/2$, so each condition degrades by at most one bit.

**Proof.** In what follows, ${x}^{i}$ represents a string ${x}_{1},\dots ,{x}_{i}$, and ${x}^{0}$ denotes the empty string. For all n, we write $\Psi ({x}^{n})={\psi}_{1},\dots ,{\psi}_{n}$ and $\Psi ({X}^{n})={\Psi}_{1},\dots ,{\Psi}_{n}$.

Recall that ${q}_{\Psi}$ is the optimal universal pattern encoder over patterns of i.i.d. sequences defined in Section 2.2. Furthermore, recall that the redundancy of $\mathcal{P}$ is finite and that ${q}_{1}$ is the universal distribution over ℤ_{+} that attains redundancy R for $\mathcal{P}$.

For all i and all patterns ${\psi}^{i}\in {\Psi}^{i}$, such that ${\psi}^{i-1}=\Psi ({x}^{i-1})$,

For sequences distributed according to a measure in ${\mathcal{P}}^{\infty}$,

Since ${\psi}_{1}$ is always one, $p({\psi}_{1})={q}_{\Psi}({\psi}_{1})=1$. Therefore, we have:

For ease of notation, define ${R}_{p}$ as:

We split the expectation over the sets ${G}_{i-1}$ and ${B}_{i-1}$ and use separate bounds on each set that hold uniformly over the entire model collection. The last inequality above follows from Lemma 4. From Condition (3) of the Theorem, we have that:

Recall that for any sequence $\{{a}_{i},i\in {\mathbb{Z}}_{+}\}$ with ${a}_{i}<\infty $ for all i, if $\underset{i\to \infty}{\text{lim}}\hspace{0.17em}{a}_{i}$ exists, then the Cesàro means converge to the same limit:
$$\underset{n\to \infty}{\text{lim}}\frac{1}{n}\sum _{i=1}^{n}{a}_{i}=\underset{i\to \infty}{\text{lim}}\hspace{0.17em}{a}_{i}.$$
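As a quick numerical illustration of the Cesàro-mean fact above (an aside; `cesaro_means` is our illustrative helper):

```python
def cesaro_means(a, n_max):
    """Running averages (1/n) * sum_{i=1}^n a(i) of a sequence a.
    If a(i) converges, these averages converge to the same limit."""
    total = 0.0
    means = []
    for i in range(1, n_max + 1):
        total += a(i)
        means.append(total / i)
    return means
```

For example, with $a_i = 2 + 1/i$ (limit 2), the running averages equal $2 + H_n/n$, where $H_n$ is the n-th harmonic number, and so also converge to 2, just more slowly.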

We conclude with an example of a collection $\mathcal{U}$ of distributions over ℤ_{+}, such that every distribution in $\mathcal{U}$ has finite entropy, the redundancy of $\mathcal{U}$ is finite, and the length-n redundancy of ${\mathcal{U}}^{\infty}$ grows sublinearly, even though $\mathcal{U}$ does not satisfy all of the conditions of Theorem 5. This is therefore an example showing that the conditions in Theorem 5 are only sufficient, but, in fact, not necessary. It is yet open to find a condition on single-letter marginals that is both necessary and sufficient for the asymptotic per-symbol redundancy to diminish to zero.

Consider the collection $\mathcal{U}$ of distributions ${p}_{k}$, k ∈ ℤ_{+}, on ℕ where:

The entropy of ${p}_{k}\in \mathcal{U}$ is therefore $1+h\left(\frac{1}{{k}^{2}}\right)$, where h denotes the binary entropy function. Note that the redundancy of $\mathcal{U}$ is finite, too. To see this, first note that:

With ${R}^{+}≝\text{log}\left(\sum _{x\in {\mathbb{Z}}_{+}}\underset{k\in \mathbb{N}}{\text{sup}}\hspace{0.17em}{p}_{k}(x)\right)$, observe that the distribution:

satisfies, for all ${p}_{k}\in \mathcal{U}$:

so the single-letter redundancy of $\mathcal{U}$ is at most ${R}^{+}$ + 2. Furthermore, Equation (5) implies that the worst-case regret is finite, and from [2], the length-n redundancy of ${\mathcal{U}}^{\infty}$ grows sublinearly. Now, pick an integer m ∈ ℤ_{+}. We have, for all p ∈ $\mathcal{U}$,

Thus, the per-symbol redundancy of ${\mathcal{U}}^{\infty}$ diminishes to zero, while $\mathcal{U}$ does not satisfy all of the requirements of Theorem 5. Therefore, the conditions of Theorem 5 are only sufficient, not necessary.
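The entropy claim for $\mathcal{U}$ can be checked numerically. The displayed definition of ${p}_{k}$ did not survive extraction, so the sketch below assumes a shape consistent with the stated entropy $1+h(1/{k}^{2})$: mass $1-1/{k}^{2}$ on a single symbol and mass $1/{k}^{2}$ spread uniformly over ${2}^{{k}^{2}}$ symbols. This assumption, and the helper names `h` and `entropy_pk`, are ours:

```python
import math

def h(u):
    """Binary entropy in bits, for 0 < u < 1."""
    return -u * math.log2(u) - (1 - u) * math.log2(1 - u)

def entropy_pk(k):
    """Entropy of the assumed two-level distribution: mass 1 - 1/k^2 on
    one symbol, and 1/k^2 spread uniformly over M = 2^(k^2) symbols.
    Analytically this equals h(1/k^2) + (1/k^2) * log2(M) = 1 + h(1/k^2)."""
    u = 1 / k**2
    M = 2**(k**2)
    return -(1 - u) * math.log2(1 - u) - u * math.log2(u / M)
```

Under this assumption, the uniform spread over ${2}^{{k}^{2}}$ symbols contributes exactly $\frac{1}{{k}^{2}}\cdot {k}^{2}=1$ bit, so the entropy is $1+h(1/{k}^{2})$, uniformly bounded in k, while the support (and hence the tail behavior entering Condition (3)) grows without bound.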

## 5. Open Problems

## Acknowledgments

## Conflicts of Interest

## References

1. Orlitsky, A.; Santhanam, N.P.; Zhang, J. Universal compression of memoryless sources over unknown alphabets. IEEE Trans. Inf. Theory **2004**, 50, 1469–1481.
2. Boucheron, S.; Garivier, A.; Gassiat, E. Coding on countably infinite alphabets. 2008, arXiv:0801.2456.
3. Santhanam, N.; Anantharam, V.; Kavcic, A.; Szpankowski, W. Data-driven weak universal redundancy. In Proceedings of the 2014 IEEE International Symposium on Information Theory (ISIT), Honolulu, HI, USA, 29 June–4 July 2014.
4. Merhav, N.; Feder, M. Universal prediction. IEEE Trans. Inf. Theory **1998**, 44, 2124–2147.
5. Rosenthal, J.S. A First Look at Rigorous Probability Theory, 2nd ed.; World Scientific: Singapore, 2008.
6. Santhanam, N. Probability Estimation and Compression Involving Large Alphabets. Ph.D. Thesis, University of California, San Diego, CA, USA, 2006.
7. Kingman, J.F.C. The Mathematics of Genetic Diversity; SIAM: Philadelphia, PA, USA, 1980.
8. Zabell, S.L. Predicting the unpredictable. Synthese **1992**, 90, 205–232.
9. Zabell, S.L. The continuum of inductive methods revisited. In The Cosmos of Science: Essays of Exploration; Earman, J., Norton, J.D., Eds.; The University of Pittsburgh Press: Pittsburgh, PA, USA, 1997; Chapter 12.
10. Zabell, S.L. Symmetry and Its Discontents: Essays on the History of Inductive Probability; Cambridge Studies in Probability, Induction, and Decision Theory; Cambridge University Press: Cambridge, UK, 2005.
11. Santhanam, N.; Anantharam, V. Agnostic insurance of model classes. 2012, arXiv:1212.3866.
12. Orlitsky, A.; Santhanam, N. Lecture notes on universal compression. Available online: http://www-ee.eng.hawaii.edu/~prasadsn/ (accessed on 9 July 2014).

© 2014 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution license (http://creativecommons.org/licenses/by/3.0/).

Hosseini, M.; Santhanam, N. Characterizing the Asymptotic Per-Symbol Redundancy of Memoryless Sources over Countable Alphabets in Terms of Single-Letter Marginals. *Entropy* **2014**, *16*, 4168–4184. https://doi.org/10.3390/e16074168