Article

Finite-Length Analyses for Source and Channel Coding on Markov Chains †

by Masahito Hayashi 1,2,3,4,*,‡ and Shun Watanabe 5,‡
1 Shenzhen Institute for Quantum Science and Engineering, Southern University of Science and Technology, Shenzhen 518055, China
2 Graduate School of Mathematics, Nagoya University, Nagoya 464-8602, Japan
3 Center for Quantum Computing, Peng Cheng Laboratory, Shenzhen 518000, China
4 Centre for Quantum Technologies, National University of Singapore, 3 Science Drive 2, Singapore 117542, Singapore
5 Department of Computer and Information Sciences, Tokyo University of Agriculture and Technology, Koganei-shi, Tokyo 184-8588, Japan
* Author to whom correspondence should be addressed.
† This paper is an extended version of the conference paper presented at the 51st Allerton Conference and 2014 Information Theory and Applications Workshop, San Diego, CA, USA, 9–14 February 2014.
‡ These authors contributed equally to this work.
Entropy 2020, 22(4), 460; https://doi.org/10.3390/e22040460
Submission received: 9 March 2020 / Revised: 2 April 2020 / Accepted: 4 April 2020 / Published: 18 April 2020
(This article belongs to the Special Issue Finite-Length Information Theory)

Abstract: We derive finite-length bounds for two problems with Markov chains: source coding with side-information where the source and side-information are a joint Markov chain and channel coding for channels with Markovian conditional additive noise. For this purpose, we point out two important aspects of finite-length analysis that must be argued when finite-length bounds are proposed. The first is the asymptotic tightness, and the other is the efficient computability of the bound. Then, we derive finite-length upper and lower bounds for the coding length in both settings such that their computational complexity is low. We argue the first of the above-mentioned aspects by deriving the large deviation bounds, the moderate deviation bounds, and second-order bounds for these two topics and show that these finite-length bounds achieve the asymptotic optimality in these senses. Several kinds of information measures for transition matrices are introduced for the purpose of this discussion.

1. Introduction

In recent years, finite-length analyses for coding problems have been attracting considerable attention [1]. This paper focuses on finite-length analyses for two representative coding problems: One is source coding with side-information for Markov sources, i.e., the Markov–Slepian–Wolf problem on the system $X^n$ with full side-information $Y^n$ at the decoder, where only the decoder observes the side-information and the source and the side-information form a joint Markov chain. The other is channel coding for channels with Markovian conditional additive noise. Although the main purpose of this paper is finite-length analyses, we also present a unified approach that we developed to investigate these topics, including asymptotic analyses. Since this discussion is spread across a number of subtopics, we explain them separately in the Introduction.

1.1. Two Aspects of Finite-Length Analysis

We explain the motivations of this research by starting with two aspects of finite-length analysis that must be argued when finite-length bounds are proposed. For concreteness, we consider channel coding here even though the problems treated in this paper are not restricted to channel coding. To date, many types of finite-length achievability bounds have been proposed. For example, Verdú and Han derived a finite-length bound by using the information-spectrum approach in order to derive the general formula [2] (see also [3]), which we term the information-spectrum bound. One of the authors and Nagaoka derived a bound (for the classical-quantum channel) by relating the error probability to binary hypothesis testing [4] (Remark 15) (see also [5]), which we refer to as the hypothesis-testing bound. Polyanskiy et al. derived the random coding union (RCU) bound and the dependence testing (DT) bound [1] (a bound slightly looser (coefficients are worse) than the DT bound can be derived from the hypothesis-testing bound of [4]). Moreover, Gallager’s bound [6] is known as an efficient bound to derive the exponentially decreasing rate.
Here, we focus on two important aspects of finite-length analysis:
(A1) 
Computational complexity for the bound and
(A2) 
Asymptotic tightness for the bound.
Both aspects are required for the bound in finite-length analysis as follows. As the first aspect, we consider the computational complexity for the bound. For the BSC (binary symmetric channel), the computational complexity of the RCU bound is $O(n^2)$, and that of the DT bound is $O(n)$ [7]. However, the computational complexities of these bounds are much larger for general DMCs (discrete memoryless channels) or channels with memory. It is known that the hypothesis testing bound can be described as a linear programming problem (e.g., see [8,9] (in the case of a quantum channel, the bound is described as a semi-definite programming problem)) and can be efficiently computed under certain symmetry. However, the number of variables in the linear programming problem grows exponentially with the block length, and it is difficult to compute in general. The computation of the information-spectrum bound depends on the evaluation of the tail probability. The hypothesis testing bound gives a tighter bound than the information-spectrum bound, as pointed out by [8], and the computational complexity of the former is much smaller than that of the latter. However, the computation of the tail probability remains challenging unless the channel is a DMC. For DMCs, the computational complexity of Gallager’s bound is $O(1)$ since the Gallager function is an additive quantity for DMCs. However, this is not the case if there is a memory (the Gallager bound for finite-state channels was considered in [10] (Section 5.9), but a closed form expression for the exponent was not derived). Consequently, no efficiently computable bound currently exists for channel coding with Markov additive noise. The situation is the same for source coding with side-information.
Since the actual computation time may depend on the computational resource we can use for numerical experiment, it is not possible to provide a concrete requirement of computational complexity. However, in order to conduct a numerical experiment for a meaningful blocklength, it is reasonable to require the computational complexity to be, at most, a polynomial order of the blocklength n.
Next, let us consider the second aspect, i.e., asymptotic tightness. Thus far, three kinds of asymptotic regimes have been studied in information theory [1,11,12,13,14,15,16]:
• A large deviation regime in which the error probability ε asymptotically behaves as $e^{-nr}$ for some $r > 0$;
• A moderate deviation regime in which ε asymptotically behaves as $e^{-n^{1-2t} r}$ for some $r > 0$ and $t \in (0, 1/2)$; and
• A second-order regime in which ε is a constant.
We shall claim that a good finite-length bound should be asymptotically optimal for at least one of the above-mentioned three regimes. In fact, the information-spectrum bound, the hypothesis-testing bound, and the DT bound are asymptotically optimal in both the moderate deviation and second-order regimes, whereas the Gallager bound is asymptotically optimal in the large deviation regime and the RCU bound is asymptotically optimal in all three regimes (Both the Gallager and RCU bounds are asymptotically optimal in the large deviation regime only up to the critical rate). Recently, for DMCs, Yang and Meng derived an efficiently computable bound for low-density parity check (LDPC) codes [17], which is asymptotically optimal in both the moderate deviation and second-order regimes.

1.2. Main Contribution for Finite-Length Analysis

We derive the finite-length achievability bounds for these problems by basically using exponential-type bounds (for channel coding, this corresponds to the Gallager bound). In source coding with side-information, the exponential-type upper bounds on the error probability $\bar{P}_e(M_n)$ for a given message size $M_n$ are described by using the conditional Rényi entropies as follows (cf. Lemmas 14 and 15):
$\bar{P}_e(M_n) \le \inf_{-\frac{1}{2}\le\theta\le 0} M_n^{\frac{\theta}{1+\theta}} e^{-\frac{\theta}{1+\theta} H_{1+\theta}^{\uparrow}(X^n|Y^n)}$
and:
$\bar{P}_e(M_n) \le \inf_{-1\le\theta\le 0} M_n^{\theta} e^{-\theta H_{1+\theta}^{\downarrow}(X^n|Y^n)}.$
Here, $X^n$ is the information to be compressed and $Y^n$ is the side-information that can be accessed only by the decoder. $H_{1+\theta}^{\uparrow}(X^n|Y^n)$ is the conditional Rényi entropy introduced by Arimoto [18], which we shall refer to as the upper conditional Rényi entropy (cf. (12)). On the other hand, $H_{1+\theta}^{\downarrow}(X^n|Y^n)$ is the conditional Rényi entropy introduced in [19], which we shall refer to as the lower conditional Rényi entropy (cf. (7)). Although there are several other definitions of conditional Rényi entropies, we only use these two in this paper; see [20,21] for an extensive review of conditional Rényi entropies.
Although the above-mentioned conditional Rényi entropies are additive for i.i.d. random variables, they are not additive for joint Markov chains over $X^n$ and $Y^n$, which makes the derivation of finite-length bounds for Markov chains challenging. Because it is generally not easy to evaluate the conditional Rényi entropies for Markov chains, we consider two assumptions on transition matrices: the first assumption, which we refer to as non-hidden, is that the $Y$-marginal process is a Markov chain, which enables us to derive the single-letter expression of the conditional entropy rate and the lower conditional Rényi entropy rate; the second assumption, which we refer to as strongly non-hidden, enables us to derive the single-letter expression of the upper conditional Rényi entropy rate; see Assumptions 1 and 2 of Section 2 for more detail (Indeed, as explained later, our result on data compression can be converted to a result on channel coding for a specific class of channels. Under this conversion, we obtain certain assumptions for channels. As explained later, these assumptions for channels are more meaningful from a practical point of view.). Under Assumption 1, we introduce the lower conditional Rényi entropy for transition matrices $H_{1+\theta,W}^{\downarrow}(X|Y)$ (cf. (47)). Then, we evaluate the lower conditional Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive the approximation:
$H_{1+\theta}^{\downarrow}(X^n|Y^n) = n H_{1+\theta,W}^{\downarrow}(X|Y) + O(1),$
where an explicit form of the $O(1)$ term is also derived. Combining the bound (2) with this evaluation, we obtain finite-length bounds under Assumption 1. Under a more restrictive assumption, i.e., Assumption 2, we also introduce the upper conditional Rényi entropy for a transition matrix $H_{1+\theta,W}^{\uparrow}(X|Y)$ (cf. (55)). Then, we evaluate the upper conditional Rényi entropy for the Markov chain in terms of its transition matrix counterpart. More specifically, we derive the approximation:
$H_{1+\theta}^{\uparrow}(X^n|Y^n) = n H_{1+\theta,W}^{\uparrow}(X|Y) + O(1),$
where an explicit form of the $O(1)$ term is also derived. Combining the bound (1) with this evaluation, we obtain finite-length bounds that are tighter than those obtained under Assumption 1. It should be noted that, without Assumption 1, even the conditional entropy rate is difficult to evaluate. For the evaluation of the conditional entropy rate of the $X$ process given the $Y$ process, the assumption that the $X$ process is Markov does not seem helpful. This is why, in this paper, we assume that the $Y$ process, rather than the $X$ process, is Markov.
We also derive converse bounds by using the change of measure argument for Markov chains developed by the authors in the accompanying paper on information geometry [22,23]. For this purpose, we further introduce a two-parameter conditional Rényi entropy and its transition matrix counterpart (cf. (18) and (59)). This novel information measure includes the lower conditional Rényi entropy and the upper conditional Rényi entropy as special cases. We clarify the relation among the bounds based on these quantities by numerically calculating the upper and lower bounds for the optimal coding rate in source coding with a Markov source in Section 3.7. Regarding the second aspect (A2), this calculation shows that our finite-length bounds are very close to the optimal value. Although this numerical calculation contains a case with a very large size $n = 1 \times 10^5$, the calculation is not difficult because the computational complexity behaves as $O(1)$. That is, this calculation demonstrates the advantage of the first aspect (A1).
Here, we would like to remark on the terminology because there are a few ways to express exponential-type bounds. In statistics and large deviation theory, we usually use the cumulant generating function (CGF) to describe exponents. In information theory, we employ the Gallager function or the Rényi entropies. Although these three terminologies are essentially the same quantity and are related by changes of variables, the CGF and the Gallager function are convenient for some calculations because of their desirable properties such as convexity. On the other hand, the minimum entropy and collision entropy are often used as alternative information measures to the Shannon entropy in the cryptography community. Since the Rényi entropies are a generalization of the minimum entropy and collision entropy, we can regard the Rényi entropies as information measures. The information theoretic meaning of the CGF and the Gallager function is less clear. Thus, the Rényi entropies are intuitively familiar to the readers of this journal. The Rényi entropies have an additional advantage in that two types of bounds (e.g., (152) and (161)) can be expressed in a unified manner. Therefore, we state our main results in terms of the Rényi entropies, whereas we use the CGF and the Gallager function in the proofs. For the readers’ convenience, the relation between the Rényi entropies and the corresponding CGFs is summarized in Appendix A and Appendix B.

1.3. Main Contribution for Channel Coding

An intimate relationship is known to exist between channel coding and source coding with side-information (e.g., [24,25,26]). In particular, for an additive channel, the error probability of channel coding with a linear code can be related to the corresponding source coding problem with side-information [24]. Chen et al. also showed that the error probability of source coding with side-information by a linear encoder can be related to the error probability of a dual channel coding problem and vice versa [27] (see also [28]). Since these dual channels can be regarded as additive channels conditioned on state information, we refer to these channels as conditional additive channels (In [28], we termed these channels general additive channels, but we think “conditional” more suitably describes the situation.). In this paper, we mainly discuss conditional additive channels, in which the additive noise is distributed according to a distribution conditioned on additional output information. Then, we convert our results on source coding with side-information into an analysis of conditional additive channels. That is, using the aforementioned duality between channel coding and source coding with side-information enables us to evaluate the error probability of channel coding for additive channels. Then, we derive several finite-length analyses of additive channels.
For the same reason as in source coding with side-information, we make two assumptions, Assumptions 1 and 2, on the noise process of a conditional additive channel. In this context, Assumption 1 means that the marginal system $Y^n$ governing the behavior of the additive noise $X^n$ is a Markov chain. It should be noted that the Gilbert–Elliott channel [29,30] with state information available at the receiver can be regarded as a conditional additive channel such that the noise process is a Markov chain satisfying both Assumptions 1 and 2 (see Example 6). Thus, we believe that Assumptions 1 and 2 are quite reasonable assumptions.
In fact, our analysis is applicable to a broader class of channels known as regular channels [31]. The class of regular channels includes conditional additive channels as a special case and is known as a class of channels with a certain symmetry. To show this, we propose a method to convert a regular channel into a conditional additive channel, so that our treatment covers regular channels. Additionally, we show that the BPSK (binary phase shift keying)-AWGN (additive white Gaussian noise) channel is included in the conditional additive channels.

1.4. Asymptotic Bounds and Asymptotic Tightness for Finite-Length Bounds

We present asymptotic analyses of the large and moderate deviation regimes by deriving the characterizations (for the large deviation regime, we only derive the characterizations up to the critical rate) with the use of our finite-length achievability and converse bounds, which implies that our finite-length bounds are tight in both of these deviation regimes. We also derive the second-order rate. Although this rate can be derived by the application of the central limit theorem to the information-spectrum bound, the variance involves the limit with respect to the block length because of memory. In this paper, we derive a single-letter form of the variance by using the conditional Rényi entropy for transition matrices (An alternative way to derive a single-letter characterization of the variance for the Markov chain was shown in [32] (Lemma 20). It should also be noted that a single-letter characterization can be derived by using the fundamental matrix [33]. The single-letter characterization of the variance in [12] (Section VII) and [11] (Section III) contains an error, which is corrected in this paper.).
As we will see in Theorems 11–14 and 22–25, our asymptotic results have the same forms as the counterparts of the i.i.d. case (cf. [1,6,11,12,13,14]) when the information measures for distributions in the i.i.d. case are replaced by the information measures for the transition matrices introduced in this paper.
We determine the asymptotic tightness of the finite-length bounds by summarizing the relation between the asymptotic results and the finite-length bounds in Table 1. The table also describes the computational complexity of the finite-length bounds. “Solved*” indicates that those problems are solved up to the critical rates. “Ass. 1” and “Ass. 2” indicate that those problems are solved either under Assumption 1 or Assumption 2. “$O(1)$” indicates that both the achievability and converse parts of those asymptotic results are derived from our finite-length achievability bounds and converse bounds whose computational complexities are $O(1)$. “Tail” indicates that both the achievability and converse parts of those asymptotic results are derived from information-spectrum-type achievability bounds and converse bounds whose computational complexities depend on the computational complexities of the tail probabilities.
In general, the exact computation of tail probabilities is difficult, although it may be feasible for a simple case such as the i.i.d. case. One way to compute tail probabilities approximately is to use the Berry–Esséen theorem [34] (Theorem 16.5.1) or its variant [35]. This direction of research is still ongoing [36,37], and an evaluation of the constant was conducted in [37], although its tightness has not been clarified. If we could derive a tight Berry–Esséen-type bound for the Markov chain, this would enable us to derive a finite-length bound that is asymptotically tight in the second-order regime. However, the approximation errors of Berry–Esséen-type bounds converge only at the order of $1/\sqrt{n}$ and cannot be applied when ε is rather small. Even in cases in which the exact computation of tail probabilities is possible, the information-spectrum-type bounds are looser than the exponential-type bounds when ε is rather small, and we need to use appropriate bounds depending on the size of ε. In fact, this observation was explicitly clarified in [38] for random number generation with side-information. Consequently, we believe that our exponential-type finite-length bounds are very useful. It should also be noted that, for source coding with side-information and channel coding for regular channels, even the first-order results have not been revealed as far as the authors know, and they are clarified in this paper (General formulae for those problems were known [2,3], but single-letter expressions for Markov sources or channels were not clarified in the literature. For source coding without side-information, the single-letter expression for the entropy rate of a Markov source is well known (e.g., see [39]).).

1.5. Related Work on Markov Chains

Since related work concerning the finite-length analysis is reviewed in Section 1.1, we only review work related to the asymptotic analysis here. Some studies on Markov chains for the large deviation regime have been reported [40,41,42]. The derivation in [40] used the Markov-type method. A drawback of this method is that it involves a term that stems from the number of types, which does not affect the asymptotic analysis, but does hurt the finite-length analysis. Our achievability is derived by following a similar approach as in [41,42], i.e., the Perron–Frobenius theorem, but our derivation separates the single-shot part and the evaluation of the Rényi entropy, and thus is more transparent. Furthermore, the converse part of [41,42] is based on the Shannon–McMillan–Breiman limiting theorem and does not yield finite-length bounds.
For the second-order regime, Polyanskiy et al. studied the second-order rate (dispersion) of the Gilbert–Elliott channel [43]. Tomamichel and Tan studied the second-order rate of channel coding with state information such that the state information may be a general source and derived a formula for the Markov chain as a special case [32]. Kontoyiannis studied the second-order variable length source coding for the Markov chain [44]. In [45], Kontoyiannis and Verdú derived the second-order rate of lossless source coding under the overflow probability criterion.
For channel coding of the i.i.d. case, Scarlett et al. derived a saddle-point approximation, which unifies all three regimes [46,47].

1.6. Organization of the Paper

In Section 2, we introduce the information measures and their properties that will be used in Section 3 and Section 4. Then, source coding with side-information and channel coding are discussed in Section 3 and Section 4, respectively. As we mentioned above, we state our main results in terms of the Rényi entropies, and we use the CGFs and the Gallager function in the proofs. We explain how to cover the continuous case in Remarks 1 and 5. In Appendix A and Appendix B, the relation between the Rényi entropies and the corresponding CGFs is summarized. The relation between the Rényi entropies and the Gallager function is explained as necessary. Proofs of some technical results are also provided in the remaining Appendices.

1.7. Notations

For a set $\mathcal{X}$, the set of all distributions on $\mathcal{X}$ is denoted by $\mathcal{P}(\mathcal{X})$. The set of all sub-normalized non-negative functions on $\mathcal{X}$ is denoted by $\bar{\mathcal{P}}(\mathcal{X})$. The cumulative distribution function of the standard Gaussian random variable is denoted by:
$\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{x^2}{2}\right) dx.$
Throughout the paper, the base of the logarithm is the natural base e.

2. Information Measures

Since this paper discusses the second-order tightness, we need to discuss the central limit theorem for the Markov process. For this purpose, we usually employ advanced mathematical methods from probability theory. For example, the paper [48] (Theorem 4) showed the Markov version of the central limit theorem by using a martingale stopping technique. Lalley [49] employed the regular perturbation theory of operators on the infinite-dimensional space [50] (Chapter 7, #1, Chapter 4, #3, and Chapter 3, #5). The papers [51,52] and [53] (Lemma 1.5 of Chapter 1) employed the spectral measure, while it is hard to calculate the spectral measure in general even in the finite-state case. Further, the papers [36,51,54,55] showed the central limit theorem by using the asymptotic variance, but they did not give any computable expression of the asymptotic variance without the infinite sum. In summary, to derive the central limit theorem with the variance of a computable form, these papers needed to use very advanced mathematics beyond calculus and linear algebra.
To overcome the difficulty of the Markov version of the central limit theorem, we employed the method used in our recent paper [23]. The paper [23] employed the method based on the cumulant generating function for transition matrices, which is defined by the Perron eigenvalue of a specific non-negative-entry matrix. Since a Perron eigenvalue can be explained in the framework of linear algebra, the method can be described with elementary mathematics. To employ this method, we need to define the information measures in a way similar to the cumulant generating function for transition matrices. That is, we define the information measures for transition matrices, e.g., the conditional Rényi entropy for transition matrices, etc., by using Perron eigenvalues.
Fortunately, these information measures for transition matrices are very useful even for large deviation-type evaluation and finite-length bounds. For example, our recent paper [23] derived finite-length bounds for simple hypothesis testing for the Markov chain by using the cumulant generating function for transition matrices. Therefore, using these information measures for transition matrices, this paper derives finite-length bounds for source coding and channel coding with Markov chains and discusses their asymptotic bounds with large deviation, moderate deviation, and the second-order type.
Since they are natural extensions of the information measures for the single-shot setting, we first review information measures for the single-shot setting in Section 2.1. Next, we introduce information measures for transition matrices in Section 2.2. Then, we show that information measures for Markov chains can be approximated by information measures for the transition matrices generating those Markov chains in Section 2.3.

2.1. Information Measures for the Single-Shot Setting

In this section, we introduce conditional Rényi entropies for the single-shot setting. For a more detailed review of conditional Rényi entropies, see [21]. For a correlated random variable $(X, Y)$ on $\mathcal{X} \times \mathcal{Y}$ with probability distribution $P_{XY}$ and a marginal distribution $Q_Y$ on $\mathcal{Y}$, we introduce the conditional Rényi entropy of order $1+\theta$ relative to $Q_Y$ as:
$H_{1+\theta}(P_{XY}|Q_Y) := \frac{-1}{\theta}\log\sum_{x,y} P_{XY}(x,y)^{1+\theta} Q_Y(y)^{-\theta},$
where $\theta \in (-1, 0) \cup (0, \infty)$. The conditional Rényi entropy of order zero relative to $Q_Y$ is defined by the limit with respect to $\theta$. When $X$ has no side-information, it is nothing but the ordinary Rényi entropy, and it is denoted by $H_{1+\theta}(X) = H_{1+\theta}(P_X)$ throughout the paper.
One of the important special cases of $H_{1+\theta}(P_{XY}|Q_Y)$ is the case with $Q_Y = P_Y$, where $P_Y$ is the marginal of $P_{XY}$. We shall call this special case the lower conditional Rényi entropy of order $1+\theta$ and denote (this notation was first introduced in [56]):
$H_{1+\theta}^{\downarrow}(X|Y) := H_{1+\theta}(P_{XY}|P_Y)$
$= \frac{-1}{\theta}\log\sum_{x,y} P_{XY}(x,y)^{1+\theta} P_Y(y)^{-\theta}.$
When we consider the second-order analysis, the variance of the entropy density plays an important role:
$V(X|Y) := \mathrm{Var}\left[\log\frac{1}{P_{X|Y}(X|Y)}\right].$
We have the following property, which follows from the correspondence between the conditional Rényi entropy and the cumulant generating function (cf. Appendix B).
Lemma 1.
We have:
$\lim_{\theta\to 0} H_{1+\theta}^{\downarrow}(X|Y) = H(X|Y)$
and (as seen in the proof (cf. (A26)), the left-hand side of (11) corresponds to the second derivative of the cumulant generating function):
$\lim_{\theta\to 0} \frac{2\left[H(X|Y) - H_{1+\theta}^{\downarrow}(X|Y)\right]}{\theta} = V(X|Y).$
Proof. 
(10) follows from the relation in (A25) and the fact that the first-order derivative of the cumulant generating function is the expectation. (11) follows from (A25), (10) and (A26). □
The other important special case of $H_{1+\theta}(P_{XY}|Q_Y)$ is the measure maximized over $Q_Y$. We shall call this special case the upper conditional Rényi entropy of order $1+\theta$ and denote (Equation (13) for $-1 < \theta < 0$ follows from the Hölder inequality, and Equation (13) for $0 < \theta$ follows from the reverse Hölder inequality [57] (Lemma 8). Similar optimization has appeared in the context of Rényi mutual information in [58] (see also [59]).):
$H_{1+\theta}^{\uparrow}(X|Y) := \max_{Q_Y \in \mathcal{P}(\mathcal{Y})} H_{1+\theta}(P_{XY}|Q_Y)$
$= H_{1+\theta}(P_{XY}|P_Y^{(1+\theta)})$
$= \frac{-(1+\theta)}{\theta}\log\sum_y P_Y(y)\left[\sum_x P_{X|Y}(x|y)^{1+\theta}\right]^{\frac{1}{1+\theta}},$
where:
$P_Y^{(1+\theta)}(y) := \frac{\left[\sum_x P_{XY}(x,y)^{1+\theta}\right]^{\frac{1}{1+\theta}}}{\sum_{y'}\left[\sum_x P_{XY}(x,y')^{1+\theta}\right]^{\frac{1}{1+\theta}}}.$
For this measure, we also have the same properties as Lemma 1. This lemma will be proven in Appendix C.
Lemma 2.
We have:
$\lim_{\theta\to 0} H_{1+\theta}^{\uparrow}(X|Y) = H(X|Y)$
and:
$\lim_{\theta\to 0}\frac{2\left[H(X|Y) - H_{1+\theta}^{\uparrow}(X|Y)\right]}{\theta} = V(X|Y).$
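For concreteness, the following minimal Python sketch evaluates $H_{1+\theta}(P_{XY}|Q_Y)$, $H_{1+\theta}^{\downarrow}(X|Y)$, and $H_{1+\theta}^{\uparrow}(X|Y)$ for a small joint distribution; the specific pmf and parameter values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Hypothetical joint pmf P_XY, indexed as P_XY[x, y].
P_XY = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

def H_rel(P_XY, Q_Y, theta):
    # H_{1+theta}(P_XY | Q_Y), cf. (6).
    return -np.log(np.sum(P_XY ** (1 + theta) * Q_Y ** (-theta))) / theta

def H_lower(P_XY, theta):
    # Lower conditional Renyi entropy (7): the special case Q_Y = P_Y.
    return H_rel(P_XY, P_XY.sum(axis=0), theta)

def H_upper(P_XY, theta):
    # Upper conditional Renyi entropy (12), Arimoto's definition.
    P_Y = P_XY.sum(axis=0)
    P_XgY = P_XY / P_Y
    inner = np.sum(P_XgY ** (1 + theta), axis=0) ** (1.0 / (1 + theta))
    return -(1 + theta) / theta * np.log(np.sum(P_Y * inner))

theta = 0.5
assert H_lower(P_XY, theta) <= H_upper(P_XY, theta)   # Statement 7 of Lemma 3
# Both tend to H(X|Y) as theta -> 0 (Lemmas 1 and 2).
print(H_lower(P_XY, 1e-6), H_upper(P_XY, 1e-6))
```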
When we derive converse bounds, we need to consider the case such that the order of the Rényi entropy is different from the order of the conditioning distribution defined in (15). For this purpose, we introduce a two-parameter conditional Rényi entropy, which connects the two kinds of conditional Rényi entropies $H_{1+\theta}^{\downarrow}(X|Y)$ and $H_{1+\theta}^{\uparrow}(X|Y)$ as described in Statements 10 and 11 of Lemma 3:
$H_{1+\theta,1+\theta'}(X|Y)$
$:= H_{1+\theta}(P_{XY}|P_Y^{(1+\theta')})$
$= \frac{-1}{\theta}\log\sum_y P_Y(y)\left[\sum_x P_{X|Y}(x|y)^{1+\theta}\right]\left[\sum_x P_{X|Y}(x|y)^{1+\theta'}\right]^{-\frac{\theta}{1+\theta'}} + \frac{\theta'}{1+\theta'} H_{1+\theta'}^{\uparrow}(X|Y).$
Next, we investigate some properties of the measures defined above, which will be proven in Appendix D.
Lemma 3.
1. For fixed $Q_Y$, $\theta H_{1+\theta}(P_{XY}|Q_Y)$ is a concave function of $\theta$, and it is strictly concave iff $\mathrm{Var}\left[\log\frac{Q_Y(Y)}{P_{XY}(X,Y)}\right] > 0$.
2. For fixed $Q_Y$, $H_{1+\theta}(P_{XY}|Q_Y)$ is a monotonically decreasing (Technically, $H_{1+\theta}(P_{XY}|Q_Y)$ is always non-increasing, and it is monotonically decreasing iff strict concavity holds in Statement 1. Similar remarks also apply to the other information measures throughout the paper.) function of $\theta$.
3. The function $\theta H_{1+\theta}^{\downarrow}(X|Y)$ is a concave function of $\theta$, and it is strictly concave iff $V(X|Y) > 0$.
4. $H_{1+\theta}^{\downarrow}(X|Y)$ is a monotonically decreasing function of $\theta$.
5. The function $\theta H_{1+\theta}^{\uparrow}(X|Y)$ is a concave function of $\theta$, and it is strictly concave iff $V(X|Y) > 0$.
6. $H_{1+\theta}^{\uparrow}(X|Y)$ is a monotonically decreasing function of $\theta$.
7. For every $\theta \in (-1,0)\cup(0,\infty)$, we have $H_{1+\theta}^{\downarrow}(X|Y) \le H_{1+\theta}^{\uparrow}(X|Y)$.
8. For fixed $\theta'$, the function $\theta H_{1+\theta,1+\theta'}(X|Y)$ is a concave function of $\theta$, and it is strictly concave iff $V(X|Y) > 0$.
9. For fixed $\theta'$, $H_{1+\theta,1+\theta'}(X|Y)$ is a monotonically decreasing function of $\theta$.
10. We have:
$H_{1+\theta,1}(X|Y) = H_{1+\theta}^{\downarrow}(X|Y).$
11. We have:
$H_{1+\theta,1+\theta}(X|Y) = H_{1+\theta}^{\uparrow}(X|Y).$
12. For every $\theta \in (-1,0)\cup(0,\infty)$, $H_{1+\theta,1+\theta'}(X|Y)$ is maximized at $\theta' = \theta$.
The following lemma expresses explicit forms of the conditional Rényi entropies of order zero.
Lemma 4.
We have:
$\lim_{\theta\to -1} H_{1+\theta}(P_{XY}|Q_Y) = H_0(P_{XY}|Q_Y) := \log\sum_y Q_Y(y)\,|\mathrm{supp}(P_{X|Y}(\cdot|y))|,$
$\lim_{\theta\to -1} H_{1+\theta}^{\uparrow}(X|Y) = H_0^{\uparrow}(X|Y) := \log\max_{y\in\mathrm{supp}(P_Y)} |\mathrm{supp}(P_{X|Y}(\cdot|y))|,$
$\lim_{\theta\to -1} H_{1+\theta}^{\downarrow}(X|Y) = H_0^{\downarrow}(X|Y) := \log\sum_y P_Y(y)\,|\mathrm{supp}(P_{X|Y}(\cdot|y))|.$
Proof. 
See Appendix E. □
The definition (6) guarantees the existence of the derivative $\frac{d[\theta H_{1+\theta}(P_{XY}|Q_Y)]}{d\theta}$. From Statement 1 of Lemma 3, $d[\theta H_{1+\theta}(P_{XY}|Q_Y)]/d\theta$ is monotonically decreasing. Thus, the inverse function (Throughout the paper, the notations $\theta(a)$ and $a(R)$ are reused for several inverse functions. Although the meanings of those notations are obvious from the context, we occasionally put the superscript Q, ↓ or ↑ to emphasize that those inverse functions are induced from the corresponding conditional Rényi entropies. This definition is related to the Legendre transform of the concave function $\theta \mapsto \theta H_{1+\theta}(X|Y)$.) of $\theta \mapsto d[\theta H_{1+\theta}(P_{XY}|Q_Y)]/d\theta$ exists so that the function $\theta(a) = \theta^Q(a)$ is defined as:
$\left.\frac{d[\theta H_{1+\theta}(P_{XY}|Q_Y)]}{d\theta}\right|_{\theta = \theta(a)} = a$
for $\underline{a} < a \le \bar{a}$, where $\underline{a} = \underline{a}^Q := \lim_{\theta\to\infty} d[\theta H_{1+\theta}(P_{XY}|Q_Y)]/d\theta$ and $\bar{a} = \bar{a}^Q := \lim_{\theta\to -1} d[\theta H_{1+\theta}(P_{XY}|Q_Y)]/d\theta$. Let:
$R(a) = R^Q(a) := (1+\theta(a))\, a - \theta(a) H_{1+\theta(a)}(P_{XY}|Q_Y).$
Since:
$R'(a) = \frac{dR(a)}{da} = \frac{d\theta(a)}{da} a + 1 + \theta(a) - \frac{d(\theta H_{1+\theta}(P_{XY}|Q_Y))}{d\theta}\frac{d\theta(a)}{da} = \frac{d\theta(a)}{da} a + 1 + \theta(a) - a\frac{d\theta(a)}{da} = 1 + \theta(a),$
$R(a)$ is a monotonically increasing function for $\underline{a} < a \le \bar{a}$. Thus, we can define the inverse function $a(R) = a^Q(R)$ of $R(a)$ by:
$(1+\theta(a(R)))\, a(R) - \theta(a(R)) H_{1+\theta(a(R))}(P_{XY}|Q_Y) = R$
for $R(\underline{a}) < R \le H_0(P_{XY}|Q_Y)$.
For $\theta H_{1+\theta}^{\downarrow}(X|Y)$, by the same reason as above, we can define the inverse functions $\theta(a) = \theta^{\downarrow}(a)$ and $a(R) = a^{\downarrow}(R)$ by:
$\left.\frac{d[\theta H_{1+\theta}^{\downarrow}(X|Y)]}{d\theta}\right|_{\theta = \theta(a)} = a$
and:
$(1+\theta(a(R)))\, a(R) - \theta(a(R)) H_{1+\theta(a(R))}^{\downarrow}(X|Y) = R,$
for $R(\underline{a}) < R \le H_0^{\downarrow}(X|Y)$. For $\theta H_{1+\theta}^{\uparrow}(X|Y)$, we also introduce the inverse functions $\theta(a) = \theta^{\uparrow}(a)$ and $a(R) = a^{\uparrow}(R)$ by:
$\left.\frac{d[\theta H_{1+\theta}^{\uparrow}(X|Y)]}{d\theta}\right|_{\theta = \theta(a)} = a$
and:
$(1+\theta(a(R)))\, a(R) - \theta(a(R)) H_{1+\theta(a(R))}^{\uparrow}(X|Y) = R$
for $R(\underline{a}) < R \le H_0^{\uparrow}(X|Y)$.
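The inverse functions above can also be evaluated numerically. The following sketch inverts $d[\theta H_{1+\theta}^{\downarrow}(X|Y)]/d\theta$ by bisection with a central-difference derivative; the joint pmf and the target value $a$ are hypothetical choices for illustration.

```python
import numpy as np

P_XY = np.array([[0.4, 0.1],
                 [0.2, 0.3]])

def theta_H_lower(theta):
    # theta * H^down_{1+theta}(X|Y) = -log sum_{x,y} P_XY^{1+theta} P_Y^{-theta}
    P_Y = P_XY.sum(axis=0)
    return -np.log(np.sum(P_XY ** (1 + theta) * P_Y ** (-theta)))

def d_theta_H(theta, h=1e-6):
    # Central-difference approximation of d[theta H^down_{1+theta}]/d theta.
    return (theta_H_lower(theta + h) - theta_H_lower(theta - h)) / (2 * h)

def theta_of_a(a, lo=-0.999, hi=50.0, tol=1e-10):
    # Invert the monotonically decreasing derivative by bisection (cf. the definition of theta(a) above).
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if d_theta_H(mid) > a:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def R_of_a(a):
    # R(a) = (1 + theta(a)) a - theta(a) H^down_{1+theta(a)}(X|Y)  (cf. the definition of R(a) above).
    t = theta_of_a(a)
    return (1 + t) * a - theta_H_lower(t)

a = 0.7   # a hypothetical value inside the admissible range for this pmf
print(theta_of_a(a), R_of_a(a))
```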
Remark 1.
Here, we discuss the possibility of an extension to the continuous case. Since the entropy in the continuous case diverges, we cannot extend the information quantities to the case when $\mathcal{X}$ is continuous. However, it is possible to extend these quantities to the case when $\mathcal{Y}$ is continuous while $\mathcal{X}$ is a finite discrete set. In this case, we prepare a general measure $\mu$ (like the Lebesgue measure) on $\mathcal{Y}$ and probability density functions $p_Y$ and $q_Y$ such that the distributions $P_Y$ and $Q_Y$ are given as $p_Y(y)\mu(dy)$ and $q_Y(y)\mu(dy)$, respectively. Then, it is sufficient to replace $\sum_y$, $Q_Y(y)$, and $P_{XY}(x,y)$ by $\int_{\mathcal{Y}}\mu(dy)$, $q_Y(y)$, and $P_{X|Y}(x|y)p_Y(y)$, respectively. Hence, in the $n$-fold independent and identically distributed case, these information measures are given as $n$ times the original information measures.
One might consider the information quantities for transition matrices given in the next subsection for this continuous case. However, this is not so easy because it needs a continuous extension of the Perron eigenvalue.

2.2. Information Measures for the Transition Matrix

Let $\{W(x,y|x',y')\}_{((x,y),(x',y'))\in(\mathcal{X}\times\mathcal{Y})^2}$ be an ergodic and irreducible transition matrix. The purpose of this section is to introduce transition matrix counterparts of the measures in Section 2.1. For this purpose, we first need to introduce some assumptions on transition matrices:
Assumption 1
(Non-hidden). We say that a transition matrix $W$ is non-hidden (with respect to $Y$) if the $Y$-marginal process is a Markov process, i.e., (The reason for the name “non-hidden” is the following. In general, the $Y$-marginal process is a hidden Markov process. However, when the condition (37) holds, the $Y$-marginal process is a Markov process. Hence, we call the condition (37) non-hidden.):
$\sum_x W(x,y|x',y') = W(y|y')$
for every $x' \in \mathcal{X}$ and $y, y' \in \mathcal{Y}$. This condition is equivalent to the existence of the following decomposition of $W(x,y|x',y')$:
$W(x,y|x',y') = W(y|y')\, W(x|x',y,y').$
Assumption 2
(Strongly non-hidden). We say that a transition matrix W is strongly non-hidden (with respect to Y) if, for every θ ( 1 , ) and y , y Y (The reason for the name “strongly non-hidden” is the following. When we compute the upper conditional Rényi entropy rate of the Markov source, the effect of the Y process may propagate infinitely even if it is non-hidden. When (39) holds, the effect of the Y process in the computation of the upper conditional Rényi entropy rate is only one step.):
W θ ( y | y ) : = x W ( x , y | x , y ) 1 + θ
is well defined, i.e., the right-hand side of (39) is independent of x .
Assumption 1 requires (39) to hold only for $\theta = 0$, and thus, Assumption 2 implies Assumption 1. However, Assumption 2 is a strictly stronger condition than Assumption 1. For example, let us consider the case such that the transition matrix has a product form, i.e., $W(x,y|x',y') = W(x|x') W(y|y')$. In this case, Assumption 1 is obviously satisfied. However, Assumption 2 is not satisfied in general.
Assumption 2 has another expression as follows.
Lemma 5.
Assumption 2 holds if and only if, for every $x' \neq \tilde{x}'$, there exists a permutation $\pi_{x';\tilde{x}'}$ on $\mathcal{X}$ such that $W(x|x',y,y') = W(\pi_{x';\tilde{x}'}(x)|\tilde{x}',y,y')$.
Proof. 
Since the “if” part is trivial, we show the “only if” part as follows. By noting (38), Assumption 2 can be rephrased as follows:
$\sum_x W(x|x',y,y')^{1+\theta}$
does not depend on $x'$ for every $\theta \in (-1,\infty)$. Furthermore, this condition can be rephrased as follows. For $x' \neq \tilde{x}'$, if the largest values of $\{W(x|x',y,y')\}_{x\in\mathcal{X}}$ and $\{W(x|\tilde{x}',y,y')\}_{x\in\mathcal{X}}$ are different, say the former is larger, then $\sum_x W(x|x',y,y')^{1+\theta} > \sum_x W(x|\tilde{x}',y,y')^{1+\theta}$ for sufficiently large $\theta$, which contradicts the fact that (40) does not depend on $x'$. Thus, the largest values of $\{W(x|x',y,y')\}_{x\in\mathcal{X}}$ and $\{W(x|\tilde{x}',y,y')\}_{x\in\mathcal{X}}$ must coincide. By repeating this argument for the second largest values, and so on, we find that Assumption 2 implies that for every $x' \neq \tilde{x}'$, there exists a permutation $\pi_{x';\tilde{x}'}$ on $\mathcal{X}$ such that $W(x|x',y,y') = W(\pi_{x';\tilde{x}'}(x)|\tilde{x}',y,y')$. □
Now, we fix an element $x_0 \in \mathcal{X}$ and transform a sequence of random variables $(X_1, Y_1, X_2, Y_2, \ldots, X_n, Y_n)$ to the sequence of random variables $(X_1', Y_1, X_2', Y_2, \ldots, X_n', Y_n) := (X_1, Y_1, \pi_{x_0;X_1}^{-1}(X_2), Y_2, \ldots, \pi_{x_0;X_{n-1}}^{-1}(X_n), Y_n)$. Then, letting $W(x|y,y') := W(x|x_0,y,y')$, we have $P_{X_i' Y_i|X_{i-1}' Y_{i-1}} = W(y_i|y_{i-1}) W(x_i'|y_i,y_{i-1})$. That is, essentially, the transition matrix of this case can be written by the transition matrix $W(y_i|y_{i-1}) W(x_i|y_i,y_{i-1})$. Therefore, the transition matrix can be written by using the positive-entry matrix $W_{x_i}(y_i|y_{i-1}) := W(y_i|y_{i-1}) W(x_i|y_i,y_{i-1})$.
The following are non-trivial examples satisfying Assumptions 1 and 2.
Example 1.
Suppose that $\mathcal{X} = \mathcal{Y}$ is a module (an additive group). Let $P$ and $Q$ be transition matrices on $\mathcal{X}$. Then, the transition matrix given by:
$W(x,y|x',y') = Q(y|y')\, P(x-y|x'-y')$
satisfies Assumption 1. Furthermore, if the transition matrix $P(z|z')$ can be written as:
$P(z|z') = P_Z(\pi_{z'}(z))$
for a permutation $\pi_{z'}$ and a distribution $P_Z$ on $\mathcal{X}$, then the transition matrix $W$ defined by (41) satisfies Assumption 2 as well.
Example 2.
Suppose that $\mathcal{X}$ is a module and $W$ is (strongly) non-hidden with respect to $\mathcal{Y}$. Let $Q$ be a transition matrix on $\mathcal{Z} = \mathcal{X}$. Then, the transition matrix given by:
$V(x,y,z|x',y',z') = W(x-z,y|x'-z',y')\, Q(z|z')$
is (strongly) non-hidden with respect to $\mathcal{Y} \times \mathcal{Z}$.
The following is also an example satisfying Assumption 2, which describes the noise process of an important class of channels with memory (cf. the Gilbert–Elliott channel in Example 6).
Example 3.
Let $\mathcal{X} = \mathcal{Y} = \{0,1\}$. Then, let:
$W(y|y') = \begin{cases} 1 - q_{y'} & \text{if } y = y' \\ q_{y'} & \text{if } y \neq y' \end{cases}$
for some $0 < q_0, q_1 < 1$, and let:
$W(x|x',y,y') = \begin{cases} 1 - p_y & \text{if } x = 0 \\ p_y & \text{if } x = 1 \end{cases}$
for some $0 < p_0, p_1 < 1$. By choosing $\pi_{x';\tilde{x}'}$ to be the identity, this transition matrix satisfies the condition given in Lemma 5, which is equivalent to Assumption 2.
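As a quick numerical check, the following sketch builds the joint transition matrix $W(x,y|x',y')$ of Example 3 via the decomposition (38) for hypothetical parameter values and verifies that the right-hand side of (39) does not depend on $x'$, i.e., that Assumption 2 holds.

```python
import numpy as np

# Hypothetical parameters for Example 3.
q = {0: 0.1, 1: 0.1}   # state-flip probabilities q_{y'}
p = {0: 0.1, 1: 0.4}   # noise probabilities p_y

def W_Y(y, yp):
    return 1 - q[yp] if y == yp else q[yp]

def W_XgXYY(x, xp, y, yp):
    return 1 - p[y] if x == 0 else p[y]

def W_joint(x, y, xp, yp):
    # W(x,y|x',y') = W(y|y') W(x|x',y,y'), cf. (38)
    return W_Y(y, yp) * W_XgXYY(x, xp, y, yp)

def W_theta(y, yp, xp, theta):
    # Right-hand side of (39); under Assumption 2 this must not depend on x'.
    return sum(W_joint(x, y, xp, yp) ** (1 + theta) for x in (0, 1))

theta = 0.3
for y in (0, 1):
    for yp in (0, 1):
        vals = [W_theta(y, yp, xp, theta) for xp in (0, 1)]
        assert np.isclose(vals[0], vals[1])   # independent of x'
print("Assumption 2 holds for this example.")
```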
First, we introduce information measures under Assumption 1. In order to define a transition matrix counterpart of (7), let us introduce the following tilted matrix:
$\tilde{W}_\theta(x,y|x',y') := W(x,y|x',y')^{1+\theta}\, W(y|y')^{-\theta}.$
Here, we should notice that the tilted matrix $\tilde{W}_\theta$ is not normalized, i.e., is not a transition matrix. Let $\lambda_\theta$ be the Perron–Frobenius eigenvalue of $\tilde{W}_\theta$ and $\tilde{P}_{\theta,XY}$ be its normalized eigenvector. Then, we define the lower conditional Rényi entropy for $W$ by:
$H_{1+\theta,W}^{\downarrow}(X|Y) := \frac{-1}{\theta}\log\lambda_\theta,$
where $\theta \in (-1,0)\cup(0,\infty)$. For $\theta = 0$, we define the lower conditional Rényi entropy for $W$ by:
$H_W(X|Y) = H_{1,W}^{\downarrow}(X|Y)$
$:= \lim_{\theta\to 0} H_{1+\theta,W}^{\downarrow}(X|Y),$
and we just call it the conditional entropy for $W$. In fact, the definition of $H_W(X|Y)$ above coincides with:
$-\sum_{x,y} P_{0,XY}(x,y) \sum_{x',y'} W(x',y'|x,y)\,\log\frac{W(x',y'|x,y)}{W(y'|y)},$
where $P_{0,XY}$ is the stationary distribution of $W$ (cf. [60] (Equation (30))). For $\theta = -1$, $H_{0,W}^{\downarrow}(X|Y)$ is also defined by taking the limit. When $X$ has no side-information, the Rényi entropy $H_{1+\theta}^W(X)$ for $W$ is defined as a special case of $H_{1+\theta,W}^{\downarrow}(X|Y)$.
As a counterpart of (11), we also define (Since the limiting expression in (51) coincides with the second derivative of the CGF (cf. (A30)) and since the second derivative of the CGF exists (cf. [22] (Appendix D)), the variance in (51) is well defined. While the definition (51) contains the limit $\theta \to 0$, it can be calculated without this type of limit by using the fundamental matrix [61] (Theorem 4.3.1), [23] (Theorem 7.7 and Remark 7.8).):
$V_W(X|Y) := \lim_{\theta\to 0}\frac{2\left[H_W(X|Y) - H_{1+\theta,W}^{\downarrow}(X|Y)\right]}{\theta}.$
Remark 2.
When transition matrix $W$ satisfies Assumption 2, $H_{1+\theta,W}^{\downarrow}(X|Y)$ can be written as:
$H_{1+\theta,W}^{\downarrow}(X|Y) = \frac{-1}{\theta}\log\lambda_\theta,$
where $\lambda_\theta$ is the Perron–Frobenius eigenvalue of $W_\theta(y|y')\, W(y|y')^{-\theta}$. In fact, for the left Perron–Frobenius eigenvector $\hat{Q}_\theta$ of $W_\theta(y|y')\, W(y|y')^{-\theta}$, we have:
$\sum_{x,y} \hat{Q}_\theta(y)\, W(x,y|x',y')^{1+\theta}\, W(y|y')^{-\theta} = \lambda_\theta\, \hat{Q}_\theta(y'),$
which implies that $\lambda_\theta$ is the Perron–Frobenius eigenvalue of $\tilde{W}_\theta$. Consequently, we can evaluate $H_{1+\theta,W}^{\downarrow}(X|Y)$ by calculating the Perron–Frobenius eigenvalue of a $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix instead of the $|\mathcal{X}||\mathcal{Y}| \times |\mathcal{X}||\mathcal{Y}|$ matrix when $W$ satisfies Assumption 2.
Next, we introduce information measures under Assumption 2. In order to define a transition matrix counterpart of (12), let us introduce the following $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix:
$K_\theta(y|y') := W_\theta(y|y')^{\frac{1}{1+\theta}},$
where $W_\theta$ is defined by (39). Let $\kappa_\theta$ be the Perron–Frobenius eigenvalue of $K_\theta$. Then, we define the upper conditional Rényi entropy for $W$ by:
$H_{1+\theta,W}^{\uparrow}(X|Y) := \frac{-(1+\theta)}{\theta}\log\kappa_\theta,$
where $\theta \in (-1,0)\cup(0,\infty)$. For $\theta = -1$ and $\theta = 0$, $H_{1+\theta,W}^{\uparrow}(X|Y)$ is defined by taking the limit. We have the following properties, which will be proven in Appendix F.
Lemma 6.
We have:
$\lim_{\theta\to 0} H_{1+\theta,W}^{\uparrow}(X|Y) = H_W(X|Y)$
and:
$\lim_{\theta\to 0}\frac{2\left[H_W(X|Y) - H_{1+\theta,W}^{\uparrow}(X|Y)\right]}{\theta} = V_W(X|Y).$
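The transition-matrix measures above reduce to eigenvalue computations. The following sketch evaluates $H_{1+\theta,W}^{\downarrow}(X|Y)$ from the Perron–Frobenius eigenvalue of the tilted matrix (cf. (47)) and $H_{1+\theta,W}^{\uparrow}(X|Y)$ from the eigenvalue $\kappa_\theta$ of $K_\theta$ (cf. (55)), for the transition matrix of Example 3 with hypothetical parameters; the printed values at a small $\theta$ illustrate the common limit in Lemma 6.

```python
import numpy as np

q = {0: 0.1, 1: 0.1}   # hypothetical q_{y'}
p = {0: 0.1, 1: 0.4}   # hypothetical p_y

def W_Y(y, yp):
    return 1 - q[yp] if y == yp else q[yp]

def W_joint(x, y, xp, yp):
    return W_Y(y, yp) * (1 - p[y] if x == 0 else p[y])

def perron(M):
    # Perron-Frobenius eigenvalue of a non-negative matrix.
    return np.max(np.real(np.linalg.eigvals(M)))

def H_lower_W(theta):
    # -(1/theta) log of the Perron eigenvalue of the tilted matrix (cf. (47)).
    states = [(x, y) for x in (0, 1) for y in (0, 1)]
    Wt = np.array([[W_joint(x, y, xp, yp) ** (1 + theta) * W_Y(y, yp) ** (-theta)
                    for (xp, yp) in states] for (x, y) in states])
    return -np.log(perron(Wt)) / theta

def H_upper_W(theta):
    # -((1+theta)/theta) log kappa_theta with K_theta(y|y') = W_theta(y|y')^{1/(1+theta)} (cf. (55)).
    K = np.array([[sum(W_joint(x, y, 0, yp) ** (1 + theta) for x in (0, 1)) ** (1 / (1 + theta))
                   for yp in (0, 1)] for y in (0, 1)])
    return -(1 + theta) / theta * np.log(perron(K))

for theta in (0.5, 0.01):
    print(theta, H_lower_W(theta), H_upper_W(theta))   # lower <= upper (Statement 5 of Lemma 7)
```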
Now, let us introduce a transition matrix counterpart of (18). For this purpose, we introduce the following $|\mathcal{Y}| \times |\mathcal{Y}|$ matrix:
$N_{\theta,\theta'}(y|y') := W_\theta(y|y')\, W_{\theta'}(y|y')^{-\frac{\theta}{1+\theta'}}.$
Let $\nu_{\theta,\theta'}$ be the Perron–Frobenius eigenvalue of $N_{\theta,\theta'}$. Then, we define the two-parameter conditional Rényi entropy by:
$H_{1+\theta,1+\theta'}^W(X|Y) := \frac{-1}{\theta}\log\nu_{\theta,\theta'} + \frac{\theta'}{1+\theta'} H_{1+\theta',W}^{\uparrow}(X|Y).$
Remark 3.
Although we defined $H_{1+\theta,W}^{\downarrow}(X|Y)$ and $H_{1+\theta,W}^{\uparrow}(X|Y)$ by (47) and (55), respectively, we can alternatively define these measures in the same spirit as the single-shot setting by introducing a transition matrix counterpart of $H_{1+\theta}(P_{XY}|Q_Y)$ as follows. For the marginal $W(y|y')$ of $W(x,y|x',y')$, let $\mathcal{Y}_W^2 := \{(y,y') : W(y|y') > 0\}$. For another transition matrix $V$ on $\mathcal{Y}$, we define $\mathcal{Y}_V^2$ in a similar manner. For $V$ satisfying $\mathcal{Y}_W^2 \subset \mathcal{Y}_V^2$, we define (although we can also define $H_{1+\theta}^{W|V}(X|Y)$ even if $\mathcal{Y}_W^2 \subset \mathcal{Y}_V^2$ is not satisfied (see [22] for the details), for our purpose of defining $H_{1+\theta,W}^{\downarrow}(X|Y)$ and $H_{1+\theta,W}^{\uparrow}(X|Y)$, other cases are irrelevant):
$H_{1+\theta}^{W|V}(X|Y) := \frac{-1}{\theta}\log\lambda_\theta^{W|V}$
for $\theta \in (-1,0)\cup(0,\infty)$, where $\lambda_\theta^{W|V}$ is the Perron–Frobenius eigenvalue of:
$W(x,y|x',y')^{1+\theta}\, V(y|y')^{-\theta}.$
By using this measure, we obviously have:
$H_{1+\theta,W}^{\downarrow}(X|Y) = H_{1+\theta}^{W|W}(X|Y).$
Furthermore, under Assumption 2, the relation:
$H_{1+\theta,W}^{\uparrow}(X|Y) = \max_V H_{1+\theta}^{W|V}(X|Y)$
holds (see Appendix G for the proof), where the maximum is taken over all transition matrices $V$ satisfying $\mathcal{Y}_W^2 \subset \mathcal{Y}_V^2$.
Next, we investigate some properties of the information measures introduced in this section. The following lemma is proven in Appendix H.
Lemma 7.
1. The function $\theta H_{1+\theta,W}^{\downarrow}(X|Y)$ is a concave function of $\theta$, and it is strictly concave iff $V_W(X|Y) > 0$.
2. $H_{1+\theta,W}^{\downarrow}(X|Y)$ is a monotonically decreasing function of $\theta$.
3. The function $\theta H_{1+\theta,W}^{\uparrow}(X|Y)$ is a concave function of $\theta$, and it is strictly concave iff $V_W(X|Y) > 0$.
4. $H_{1+\theta,W}^{\uparrow}(X|Y)$ is a monotonically decreasing function of $\theta$.
5. For every $\theta \in (-1,0)\cup(0,\infty)$, we have $H_{1+\theta,W}^{\downarrow}(X|Y) \le H_{1+\theta,W}^{\uparrow}(X|Y)$.
6. For fixed $\theta'$, the function $\theta H_{1+\theta,1+\theta'}^W(X|Y)$ is a concave function of $\theta$, and it is strictly concave iff $V_W(X|Y) > 0$.
7. For fixed $\theta'$, $H_{1+\theta,1+\theta'}^W(X|Y)$ is a monotonically decreasing function of $\theta$.
8. We have:
$H_{1+\theta,1}^W(X|Y) = H_{1+\theta,W}^{\downarrow}(X|Y).$
9. We have:
$H_{1+\theta,1+\theta}^W(X|Y) = H_{1+\theta,W}^{\uparrow}(X|Y).$
10. For every $\theta \in (-1,0)\cup(0,\infty)$, $H_{1+\theta,1+\theta'}^W(X|Y)$ is maximized at $\theta' = \theta$, i.e.,
$\left.\frac{d[H_{1+\theta,1+\theta'}^W(X|Y)]}{d\theta'}\right|_{\theta'=\theta} = 0.$
From Statement 1 of Lemma 7, $d[\theta H_{1+\theta,W}^{\downarrow}(X|Y)]/d\theta$ is monotonically decreasing. Thus, we can define the inverse function $\theta_W(a) = \theta^{\downarrow,W}(a)$ of $d[\theta H_{1+\theta,W}^{\downarrow}(X|Y)]/d\theta$ by:
$\left.\frac{d[\theta H_{1+\theta,W}^{\downarrow}(X|Y)]}{d\theta}\right|_{\theta = \theta_W(a)} = a$
for $\underline{a} < a \le \bar{a}$, where $\underline{a} := \lim_{\theta\to\infty} d[\theta H_{1+\theta,W}^{\downarrow}(X|Y)]/d\theta$ and $\bar{a} := \lim_{\theta\to -1} d[\theta H_{1+\theta,W}^{\downarrow}(X|Y)]/d\theta$. Let:
$R_W(a) := (1+\theta_W(a))\, a - \theta_W(a) H_{1+\theta_W(a),W}^{\downarrow}(X|Y).$
Since
$R_W'(a) = 1 + \theta_W(a),$
$R_W(a)$ is a monotonically increasing function for $\underline{a} < a \le \bar{a}$. Thus, we can define the inverse function $a_W(R) = a^{\downarrow,W}(R)$ of $R_W(a)$ by:
$(1+\theta_W(a_W(R)))\, a_W(R) - \theta_W(a_W(R)) H_{1+\theta_W(a_W(R)),W}^{\downarrow}(X|Y) = R$
for $R_W(\underline{a}) < R < H_{0,W}^{\downarrow}(X|Y)$, where $H_{0,W}^{\downarrow}(X|Y) := \lim_{\theta\to -1} H_{1+\theta,W}^{\downarrow}(X|Y)$.
For $\theta H_{1+\theta,W}^{\uparrow}(X|Y)$, by the same reason, we can define the inverse function $\theta^{\uparrow,W}(a)$ by:
$\left.\frac{d[\theta H_{1+\theta,1+\theta^{\uparrow,W}(a)}^W(X|Y)]}{d\theta}\right|_{\theta = \theta^{\uparrow,W}(a)} = \left.\frac{d[\theta H_{1+\theta,W}^{\uparrow}(X|Y)]}{d\theta}\right|_{\theta = \theta^{\uparrow,W}(a)} = a,$
and the inverse function $a^{\uparrow,W}(R)$ of:
$R_W^{\uparrow}(a) := (1+\theta^{\uparrow,W}(a))\, a - \theta^{\uparrow,W}(a) H_{1+\theta^{\uparrow,W}(a),W}^{\uparrow}(X|Y)$
by:
$(1+\theta^{\uparrow,W}(a^{\uparrow,W}(R)))\, a^{\uparrow,W}(R) - \theta^{\uparrow,W}(a^{\uparrow,W}(R)) H_{1+\theta^{\uparrow,W}(a^{\uparrow,W}(R)),W}^{\uparrow}(X|Y) = R,$
for $R_W^{\uparrow}(\underline{a}) < R < H_{0,W}^{\uparrow}(X|Y)$, where $H_{0,W}^{\uparrow}(X|Y) := \lim_{\theta\to -1} H_{1+\theta,W}^{\uparrow}(X|Y)$. Here, the first equality in (71) follows from (66).
Since $\theta \mapsto \theta H_{1+\theta,W}^{\downarrow}(X|Y)$ is concave, the supremum of $[-\theta R + \theta H_{1+\theta,W}^{\downarrow}(X|Y)]$ is attained at the stationary point. Furthermore, note that $-1 \le \theta^{\downarrow,W}(R) \le 0$ for $H_W(X|Y) \le R \le H_{0,W}^{\downarrow}(X|Y)$. Thus, we have the following property.
Lemma 8.
The function $\theta_W(R)$ defined in (67) satisfies:
$\sup_{-1\le\theta\le 0}\left[-\theta R + \theta H_{1+\theta,W}^{\downarrow}(X|Y)\right] = -\theta_W(R)\, R + \theta_W(R)\, H_{1+\theta_W(R),W}^{\downarrow}(X|Y)$
for $H_W(X|Y) \le R \le H_{0,W}^{\downarrow}(X|Y)$.
Furthermore, we have the following characterization for another type of maximization.
Lemma 9.
The function $\theta_W(a_W(R))$ defined by (70) satisfies:
$\sup_{-1\le\theta\le 0}\frac{-\theta R + \theta H_{1+\theta,W}^{\downarrow}(X|Y)}{1+\theta} = -\theta_W(a_W(R))\, a_W(R) + \theta_W(a_W(R))\, H_{1+\theta_W(a_W(R)),W}^{\downarrow}(X|Y)$
for $H_W(X|Y) \le R \le H_{0,W}^{\downarrow}(X|Y)$, and the function $\theta^{\uparrow,W}(a^{\uparrow,W}(R))$ defined in (73) satisfies:
$\sup_{-1\le\theta\le 0}\frac{-\theta R + \theta H_{1+\theta,W}^{\uparrow}(X|Y)}{1+\theta} = -\theta^{\uparrow,W}(a^{\uparrow,W}(R))\, a^{\uparrow,W}(R) + \theta^{\uparrow,W}(a^{\uparrow,W}(R))\, H_{1+\theta^{\uparrow,W}(a^{\uparrow,W}(R)),W}^{\uparrow}(X|Y)$
for $H_W(X|Y) \le R \le H_{0,W}^{\uparrow}(X|Y)$.
Proof. 
See Appendix I. □
Remark 4.
The combination of (49), (51), and Lemma 6 guarantees that both the conditional Rényi entropies expand as:
$H_{1+\theta,W}^{\downarrow}(X|Y) = H_W(X|Y) - \frac{1}{2} V_W(X|Y)\,\theta + o(\theta),$
$H_{1+\theta,W}^{\uparrow}(X|Y) = H_W(X|Y) - \frac{1}{2} V_W(X|Y)\,\theta + o(\theta)$
around $\theta = 0$. Thus, the difference between these measures appears significantly only when $|\theta|$ is rather large. For the transition matrix of Example 3 with $q_0 = q_1 = 0.1$, $p_0 = 0.1$, and $p_1 = 0.4$, we plotted the values of the information measures in Figure 1. Although the values at $\theta = -1$ coincide in Figure 1, note that the values at $\theta = -1$ may differ in general.
In Example 1, we mentioned that the transition matrix $W$ in (41) satisfies Assumption 2 when the transition matrix $P$ is given by (42). By computing the conditional Rényi entropies for this special case, we have:
$H_{1+\theta,W}^{\downarrow}(X|Y) = H_{1+\theta,W}^{\uparrow}(X|Y)$
$= H_{1+\theta}(P_Z),$
i.e., the two kinds of conditional Rényi entropies coincide.
Now, let us consider the asymptotic behavior of $H_{1+\theta,W}^{\downarrow}(X|Y)$ around $\theta = 0$. When $\theta_W(a)$ is close to zero, we have:
$\theta_W(a)\, H_{1+\theta_W(a),W}^{\downarrow}(X|Y) = \theta_W(a)\, H_W(X|Y) - \frac{1}{2} V_W(X|Y)\,\theta_W(a)^2 + o(\theta_W(a)^2).$
Taking the derivative, (67) implies that:
$a = H_W(X|Y) - V_W(X|Y)\,\theta_W(a) + o(\theta_W(a)).$
Hence, when $R$ is close to $H_W(X|Y)$, we have:
$R = (1+\theta_W(a_W(R)))\, a_W(R) - \theta_W(a_W(R))\, H_{1+\theta_W(a_W(R)),W}^{\downarrow}(X|Y)$
$= H_W(X|Y) - \left(1 + \frac{\theta_W(a_W(R))}{2}\right)\theta_W(a_W(R))\, V_W(X|Y) + o(\theta_W(a_W(R))),$
i.e.,
$\theta_W(a_W(R)) = \frac{-(R - H_W(X|Y))}{V_W(X|Y)} + o\left(\frac{R - H_W(X|Y)}{V_W(X|Y)}\right).$
Furthermore, (81) and (82) imply:
$-\theta_W(a_W(R))\, a_W(R) + \theta_W(a_W(R))\, H_{1+\theta_W(a_W(R)),W}^{\downarrow}(X|Y)$
$= \frac{V_W(X|Y)\,\theta_W(a_W(R))^2}{2} + o(\theta_W(a_W(R))^2)$
$= \frac{V_W(X|Y)}{2}\left(\frac{R - H_W(X|Y)}{V_W(X|Y)}\right)^2 + o\left(\left(\frac{R - H_W(X|Y)}{V_W(X|Y)}\right)^2\right).$

2.3. Information Measures for the Markov Chain

Let $(X, Y)$ be the Markov chain induced by the transition matrix $W$ and some initial distribution $P_{X_1 Y_1}$. Now, we show how the information measures introduced in Section 2.2 are related to the conditional Rényi entropy rates. First, we introduce the following lemma, which gives finite upper and lower bounds on the lower conditional Rényi entropy.
Lemma 10.
Suppose that transition matrix $W$ satisfies Assumption 1. Let $v_\theta$ be the eigenvector of $\tilde{W}_\theta^T$ with respect to the Perron–Frobenius eigenvalue $\lambda_\theta$ such that $\min_{x,y} v_\theta(x,y) = 1$ (since the eigenvector corresponding to the Perron–Frobenius eigenvalue of an irreducible non-negative matrix always has strictly positive entries [62] (Theorem 8.4.4, p. 508), we can choose the eigenvector $v_\theta$ satisfying this condition). Let $w_\theta(x,y) := P_{X_1 Y_1}(x,y)^{1+\theta} P_{Y_1}(y)^{-\theta}$. Then, for every $n \ge 1$, we have:
$(n-1)\,\theta H_{1+\theta,W}^{\downarrow}(X|Y) + \underline{\delta}(\theta) \le \theta H_{1+\theta}^{\downarrow}(X^n|Y^n) \le (n-1)\,\theta H_{1+\theta,W}^{\downarrow}(X|Y) + \bar{\delta}(\theta),$
where:
$\bar{\delta}(\theta) := -\log\langle v_\theta | w_\theta\rangle + \log\max_{x,y} v_\theta(x,y),$
$\underline{\delta}(\theta) := -\log\langle v_\theta | w_\theta\rangle,$
and $\langle v_\theta | w_\theta\rangle$ is defined as $\sum_{x,y} v_\theta(x,y)\, w_\theta(x,y)$.
Proof. 
This follows from (A29) and Lemma A2. □
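As a sanity check of Lemma 10, the following brute-force sketch computes $\theta H_{1+\theta}^{\downarrow}(X^n|Y^n)$ exactly for a short blocklength and compares it with the bounds built from the Perron–Frobenius eigenvalue and eigenvector; the transition matrix is Example 3 with hypothetical parameters and a uniform initial distribution, both illustrative assumptions.

```python
import numpy as np
from itertools import product

q = {0: 0.1, 1: 0.1}
p = {0: 0.1, 1: 0.4}

def W_Y(y, yp):
    return 1 - q[yp] if y == yp else q[yp]

def W_joint(x, y, xp, yp):
    return W_Y(y, yp) * (1 - p[y] if x == 0 else p[y])

P1 = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}   # uniform initial distribution
theta, n = 0.4, 5

# Exact theta * H^down_{1+theta}(X^n | Y^n) by brute-force enumeration.
joint = {}
for seq in product([(x, y) for x in (0, 1) for y in (0, 1)], repeat=n):
    prob = P1[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        prob *= W_joint(cur[0], cur[1], prev[0], prev[1])
    joint[seq] = prob
PY = {}
for seq, prob in joint.items():
    ys = tuple(y for (_, y) in seq)
    PY[ys] = PY.get(ys, 0.0) + prob
exact = -np.log(sum(prob ** (1 + theta) * PY[tuple(y for (_, y) in seq)] ** (-theta)
                    for seq, prob in joint.items()))

# Transition-matrix quantities: lambda_theta, v_theta, and the correction terms of Lemma 10.
states = [(x, y) for x in (0, 1) for y in (0, 1)]
Wt = np.array([[W_joint(x, y, xp, yp) ** (1 + theta) * W_Y(y, yp) ** (-theta)
                for (xp, yp) in states] for (x, y) in states])
eigvals, eigvecs = np.linalg.eig(Wt.T)
k = np.argmax(eigvals.real)
lam = eigvals.real[k]
v = np.abs(eigvecs[:, k].real)
v = v / v.min()                        # normalize so that min v_theta = 1
P_Y1 = {y: sum(P1[(x, y)] for x in (0, 1)) for y in (0, 1)}
w = np.array([P1[s] ** (1 + theta) * P_Y1[s[1]] ** (-theta) for s in states])
term = -(n - 1) * np.log(lam)          # (n-1) * theta * H^down_{1+theta,W}(X|Y)
lower = term - np.log(v @ w)
upper = term - np.log(v @ w) + np.log(v.max())
print(lower <= exact <= upper, lower, exact, upper)
```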
From Lemma 10, we have the following.
Theorem 1.
Suppose that transition matrix $W$ satisfies Assumption 1. For any initial distribution, we have (When there is no side-information, (93) reduces to the well-known expression for the entropy rate of a Markov process [39]. Without Assumption 1, it is not clear whether (93) holds or not.):
$\lim_{n\to\infty}\frac{1}{n} H_{1+\theta}^{\downarrow}(X^n|Y^n) = H_{1+\theta,W}^{\downarrow}(X|Y),$
$\lim_{n\to\infty}\frac{1}{n} H(X^n|Y^n) = H_W(X|Y).$
We also have the following asymptotic evaluation of the variance, which follows from Lemma A3 in Appendix A.
Theorem 2.
Suppose that transition matrix W satisfies Assumption 1. For any initial distribution, we have:
$\lim_{n\to\infty}\frac{1}{n} V(X^n|Y^n) = V_W(X|Y).$
Theorem 2 is practically important since the limit of the variance can be described by a single-letter characterized quantity. A method to calculate V W ( X | Y ) can be found in [23].
Next, we show the lemma that gives the finite upper and lower bounds on the upper conditional Rényi entropy in terms of the upper conditional Rényi entropy for the transition matrix.
Lemma 11.
Suppose that transition matrix $W$ satisfies Assumption 2. Let $v_\theta$ be the eigenvector of $K_\theta^T$ with respect to the Perron–Frobenius eigenvalue $\kappa_\theta$ such that $\min_y v_\theta(y) = 1$. Let $w_\theta$ be the $|\mathcal{Y}|$-dimensional vector defined by:
$w_\theta(y) := \left[\sum_x P_{X_1 Y_1}(x,y)^{1+\theta}\right]^{\frac{1}{1+\theta}}.$
Then, we have:
$(n-1)\frac{\theta}{1+\theta} H_{1+\theta,W}^{\uparrow}(X|Y) + \underline{\xi}(\theta) \le \frac{\theta}{1+\theta} H_{1+\theta}^{\uparrow}(X^n|Y^n) \le (n-1)\frac{\theta}{1+\theta} H_{1+\theta,W}^{\uparrow}(X|Y) + \bar{\xi}(\theta),$
where:
$\bar{\xi}(\theta) := -\log\langle v_\theta|w_\theta\rangle + \log\max_y v_\theta(y),$
$\underline{\xi}(\theta) := -\log\langle v_\theta|w_\theta\rangle.$
Proof. 
See Appendix J. □
From Lemma 11, we have the following.
Theorem 3.
Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have:
$\lim_{n\to\infty}\frac{1}{n} H_{1+\theta}^{\uparrow}(X^n|Y^n) = H_{1+\theta,W}^{\uparrow}(X|Y).$
Finally, we show the lemma that gives the finite upper and lower bounds on the two-parameter conditional Rényi entropy in terms of the two-parameter conditional Rényi entropy for the transition matrix.
Lemma 12.
Suppose that transition matrix $W$ satisfies Assumption 2. Let $v_{\theta,\theta'}$ be the eigenvector of $N_{\theta,\theta'}^T$ with respect to the Perron–Frobenius eigenvalue $\nu_{\theta,\theta'}$ such that $\min_y v_{\theta,\theta'}(y) = 1$. Let $w_{\theta,\theta'}$ be the $|\mathcal{Y}|$-dimensional vector defined by:
$w_{\theta,\theta'}(y) := \left[\sum_x P_{X_1 Y_1}(x,y)^{1+\theta}\right]\left[\sum_x P_{X_1 Y_1}(x,y)^{1+\theta'}\right]^{-\frac{\theta}{1+\theta'}}.$
Then, we have:
$(n-1)\,\theta H_{1+\theta,1+\theta'}^W(X|Y) + \underline{\zeta}(\theta,\theta') \le \theta H_{1+\theta,1+\theta'}(X^n|Y^n) \le (n-1)\,\theta H_{1+\theta,1+\theta'}^W(X|Y) + \bar{\zeta}(\theta,\theta'),$
where:
$\bar{\zeta}(\theta,\theta') := -\log\langle v_{\theta,\theta'}|w_{\theta,\theta'}\rangle + \log\max_y v_{\theta,\theta'}(y) + \theta\,\bar{\xi}(\theta'),$
$\underline{\zeta}(\theta,\theta') := -\log\langle v_{\theta,\theta'}|w_{\theta,\theta'}\rangle + \theta\,\underline{\xi}(\theta')$
for $\theta > 0$ and:
$\bar{\zeta}(\theta,\theta') := -\log\langle v_{\theta,\theta'}|w_{\theta,\theta'}\rangle + \log\max_y v_{\theta,\theta'}(y) + \theta\,\underline{\xi}(\theta'),$
$\underline{\zeta}(\theta,\theta') := -\log\langle v_{\theta,\theta'}|w_{\theta,\theta'}\rangle + \theta\,\bar{\xi}(\theta')$
for $\theta < 0$.
Proof. 
By multiplying the definition of $H_{1+\theta,1+\theta'}(X^n|Y^n)$ by $\theta$, we have:
$\theta H_{1+\theta,1+\theta'}(X^n|Y^n)$
$= -\log\sum_{y^n}\left[\sum_{x^n} P_{X^nY^n}(x^n,y^n)^{1+\theta}\right]\left[\sum_{x^n} P_{X^nY^n}(x^n,y^n)^{1+\theta'}\right]^{-\frac{\theta}{1+\theta'}} + \frac{\theta\theta'}{1+\theta'} H_{1+\theta'}^{\uparrow}(X^n|Y^n).$
The second term is evaluated by Lemma 11. The first term can be evaluated almost in the same manner as Lemma 11. □
From Lemma 12, we have the following.
Theorem 4.
Suppose that transition matrix W satisfies Assumption 2. For any initial distribution, we have:
$\lim_{n\to\infty}\frac{1}{n} H_{1+\theta,1+\theta'}(X^n|Y^n) = H_{1+\theta,1+\theta'}^W(X|Y).$

3. Source Coding with Full Side-Information

In this section, we investigate source coding with side-information. We start this section by showing the problem setting in Section 3.1. Then, we review and introduce some single-shot bounds in Section 3.2. We derive finite-length bounds for the Markov chain in Section 3.3. Then, in Section 3.5 and Section 3.6, we show the asymptotic characterization for the large deviation regime and the moderate deviation regime by using those finite-length bounds. We also derive the second-order rate in Section 3.4.

3.1. Problem Formulation

A code $\Psi = (e, d)$ consists of one encoder $e : \mathcal{X} \to \{1,\ldots,M\}$ and one decoder $d : \{1,\ldots,M\} \times \mathcal{Y} \to \mathcal{X}$. The decoding error probability is defined by:
$P_s[\Psi] = P_s[\Psi|P_{XY}]$
$:= \Pr\{X \neq d(e(X), Y)\}.$
For notational convenience, we introduce the infimum of error probabilities under the condition that the message size is M:
$P_s(M) = P_s(M|P_{XY})$
$:= \inf_{\Psi} P_s[\Psi].$
For theoretical simplicity, we focus on a randomized choice of our encoder. For this purpose, we employ a randomized hash function $F$ from $\mathcal{X}$ to $\{1,\ldots,M\}$. A randomized hash function $F$ is called two-universal when $\Pr\{F(x) = F(x')\} \le \frac{1}{M}$ for any distinct $x \neq x'$ [63]; the so-called bin coding [39] is an example of a two-universal hash function. In the following, we denote the set of two-universal hash functions by $\mathcal{F}$. Given an encoder $f$ as a function from $\mathcal{X}$ to $\{1,\ldots,M\}$, we define the decoder $d_f$ as the optimal decoder $\mathrm{argmin}_d P_s[(f,d)]$, and we denote the code $(f, d_f)$ by $\Psi(f)$. Then, we bound the error probability $P_s[\Psi(F)]$ averaged over the random function $F$ by using only the property of two-universality. In order to consider the worst case of such schemes, we introduce the following quantity:
P ¯ s ( M ) = P ¯ s ( M | P X Y )
: = sup F ∈ F E F [ P s [ Ψ ( F ) ] ] .
When we consider n-fold extension, the source code and related quantities are denoted with the superscript ( n ) . For example, the quantities in (112) and (114) are written as P s ( n ) ( M ) and P ¯ s ( n ) ( M ) , respectively. Instead of evaluating them, we are often interested in evaluating:
M ( n , ε ) : = inf { M n : P s ( n ) ( M n ) ε } ,
M ¯ ( n , ε ) : = inf { M n : P ¯ s ( n ) ( M n ) ε }
for given 0 ε < 1 .

3.2. Single-Shot Bounds

In this section, we review existing single-shot bounds and also show novel converse bounds. For the information measures used below, see Section 2.
By using the standard argument on information-spectrum approach, we have the following achievability bound.
Lemma 13
(Lemma 7.2.1 of [3]). The following bound holds:
P ¯ s ( M ) inf γ 0 P X Y log 1 P X | Y ( x | y ) > γ + e γ M .
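As an illustration of how Lemma 13 can be used numerically, the following is a minimal sketch (ours, for illustration only) that evaluates the right-hand side by a grid search over γ for a small, hypothetical joint distribution P_XY; the displayed bound is read here as the infimum over γ ≥ 0 of P_XY{ −log P_{X|Y}(x|y) > γ } + e^γ / M, since the signs were lost in extraction.

```python
import numpy as np

# Hypothetical joint distribution P_XY on a 2x3 alphabet (values chosen only for illustration).
P_XY = np.array([[0.30, 0.15, 0.05],
                 [0.10, 0.25, 0.15]])
P_Y = P_XY.sum(axis=0)                 # marginal distribution of the side-information Y
P_X_given_Y = P_XY / P_Y               # P_{X|Y}(x|y), columns indexed by y

def lemma13_bound(M, gammas=np.linspace(0.0, 10.0, 2001)):
    """Grid search for inf_{gamma >= 0} { P_XY[ -log P_{X|Y}(x|y) > gamma ] + e^gamma / M }."""
    neg_log = -np.log(P_X_given_Y)     # entropy density -log P_{X|Y}(x|y)
    best = np.inf
    for g in gammas:
        tail = P_XY[neg_log > g].sum() # probability that the entropy density exceeds gamma
        best = min(best, tail + np.exp(g) / M)
    return best

print(lemma13_bound(M=4))
```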
Although Lemma 13 is useful for the second-order regime, it is known not to be tight in the large deviation regime. By using the large deviation technique of Gallager, we have the following exponential-type achievability bound.
Lemma 14
([64]). The following bound holds (note that the Gallager function and the upper conditional Rényi entropy are related by (A45)):
P ¯ s ( M ) inf 1 2 θ 0 M θ 1 + θ e θ 1 + θ H 1 + θ ( X | Y ) .
Although Lemma 14 is known to be tight in the large deviation regime for i.i.d. sources, the conditional Rényi entropy appearing in Lemma 14 can only be evaluated for Markov chains under the strongly non-hidden assumption (Assumption 2). For this reason, even though the following bound is looser than Lemma 14, it is useful to have another bound in terms of the conditional Rényi entropy H 1 + θ ( X | Y ) appearing below, which can be evaluated for Markov chains under the non-hidden assumption (Assumption 1).
Lemma 15.
The following bound holds:
P ¯ s ( M ) inf 1 θ 0 M θ e θ H 1 + θ ( X | Y ) .
Proof. 
To derive this bound, we change the variable in (118) as θ = θ 1 θ . Then, 1 θ 0 , and we have:
M θ e θ H 1 1 θ ( X | Y ) M θ e θ H 1 + θ ( X | Y ) ,
where we use Lemma A4 in Appendix C. □
For the source coding without side-information, i.e., when X has no side-information, we have the following bound, which is tighter than Lemma 14.
Lemma 16
((2.39) [65]). The following bound holds:
P s ( M ) inf 1 < θ 0 M θ 1 + θ e θ 1 + θ H 1 + θ ( X ) .
For the converse part, we first have the following bound, which is very close to the operational definition of source coding with side-information.
Lemma 17
([66]). Let { Ω y } y Y be a family of subsets Ω y X , and let Ω = y Y Ω y × { y } . Then, for any Q Y P ( Y ) , the following bound holds:
P s ( M ) min { Ω y } P X Y ( Ω c ) : y Q Y ( y ) | Ω y | M .
Since Lemma 17 is so close to the operational definition, it is not easy to evaluate directly. Thus, by slightly weakening Lemma 17, we obtain the following bound, which is more tractable for evaluation.
Lemma 18
([3,4]). For any Q Y ∈ P ( Y ) , we have the following (in fact, the special case Q Y = P Y corresponds to Lemma 7.2.2 of [3]; a bound that involves Q Y was introduced in [4] for channel coding, and the bound below can be regarded as a source coding counterpart of that result):
P s ( M ) sup γ 0 P X Y log Q Y ( y ) P X Y ( x , y ) > γ M e γ .
By using the change-of-measure argument, we also obtain the following converse bound.
Theorem 5.
For any Q Y P ( Y ) , we have:
log P s ( M ) inf s > 0 θ ˜ R , ϑ 0 [ ( 1 + s ) θ ˜ H 1 + θ ˜ ( P X Y | Q Y ) H 1 + ( 1 + s ) θ ˜ ( P X Y | Q Y )
( 1 + s ) log 1 2 e ϑ R + ( θ ˜ + ϑ ( 1 + θ ˜ ) ) H θ ˜ + ϑ ( 1 + θ ˜ ) ( P X Y | Q Y ) ( 1 + ϑ ) θ ˜ H 1 + θ ˜ ( P X Y | Q Y ) 1 + ϑ ] / s inf s > 0 1 < θ ˜ < θ ( a ( R ) ) [ ( 1 + s ) θ ˜ H 1 + θ ˜ ( P X Y | Q Y ) H 1 + ( 1 + s ) θ ˜ ( P X Y | Q Y )
( 1 + s ) log 1 2 e ( θ ( a ( R ) ) θ ˜ ) a ( R ) θ ( a ( R ) ) H 1 + θ ( a ( R ) ) ( P X Y | Q Y ) + θ ˜ H 1 + θ ˜ ( P X Y | Q Y ) ] / s ,
where R = log M , and θ ( a ) = θ Q ( a ) and a ( R ) = a Q ( R ) are the inverse functions defined in (29) and (32), respectively.
Proof. 
See Appendix K. □
In particular, by taking Q Y = P Y ( 1 + θ ( a ( R ) ) ) in Theorem 5, we have the following.
Corollary 1.
We have:
log P s ( M ) inf s > 0 1 < θ ˜ < θ ( a ( R ) ) [ ( 1 + s ) θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) ( X | Y ) H 1 + ( 1 + s ) θ ˜ , 1 + θ ( a ( R ) ) ( X | Y )
( 1 + s ) log 1 2 e ( θ ( a ( R ) ) θ ˜ ) a ( R ) θ ( a ( R ) ) H 1 + θ ( a ( R ) ) ( X | Y ) + θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) ( X | Y ) ] / s ,
where θ ( a ) = θ ( a ) and a ( R ) = a ( R ) are the inverse functions defined in (35) and (36).
Remark 5.
Here, we discuss the possibility for extension to the continuous case. As explained in Remark 1, we can define the information quantities for the case when Y is continuous, but X is a discrete finite set. The discussions in this subsection still hold even in this continuous case. In particular, in the n-i.i.d. extension case with this continuous setting, Lemma 14 and Corollary 1 hold when the information measures are replaced by n times the single-shot information measures.

3.3. Finite-Length Bounds for Markov Source

In this subsection, we derive several finite-length bounds for the Markov source in a computable form. Unfortunately, it is not easy to judge how tight these bounds are from their formulas alone; their tightness will be discussed by taking asymptotic limits in the remaining subsections of this section. Since we assume irreducibility of the transition matrix describing the Markov chain, the bounds below hold for any initial distribution.
To derive a lower bound on − log P ¯ s ( M n ) in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 10 into Lemma 15. Then, we can derive the following achievability bound.
Theorem 6
(Direct, Ass. 1). Suppose that transition matrix W satisfies Assumption 1. Let R : = 1 n log M n . Then, for every n 1 , we have:
log P ¯ s ( n ) ( M n ) sup 1 θ 0 θ n R + ( n 1 ) θ H 1 + θ , W ( X | Y ) + δ ̲ ( θ ) ,
where δ ̲ ( θ ) is given by (91).
For the source coding without side-information, from Lemma 16 and a special case of Lemma 10, we have the following achievability bound.
Theorem 7
(Direct, no-side-information). Let R : = 1 n log M n . Then, for every n 1 , we have:
log P e ( n ) ( M n ) sup 1 < θ 0 n θ R + ( n 1 ) θ H 1 + θ W ( X ) + δ ̲ ( θ ) 1 + θ .
To derive an upper bound on − log P s ( M n ) in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy given in Lemma 10 into Theorem 5. Then, we have the following converse bound.
Theorem 8
(Converse, Ass. 1). Suppose that transition matrix W satisfies Assumption 1. Let R : = 1 n log M n . For any H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
log P s ( n ) ( M n ) inf s > 0 1 < θ ˜ < θ ( a ( R ) ) [ ( n 1 ) ( 1 + s ) θ ˜ H 1 + θ ˜ , W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , W ( X | Y ) + δ 1
( 1 + s ) log 1 2 e ( n 1 ) [ ( θ W ( a W ( R ) ) θ ˜ ) a W ( R ) θ W ( a W ( R ) ) H 1 + θ W ( a W ( R ) ) , W ( X | Y ) + θ ˜ H 1 + θ ˜ , W ( X | Y ) ] + δ 2 ] / s ,
where θ ( a ) = θ ( a ) and a ( R ) = a ( R ) are the inverse functions defined by (67) and (70), respectively,
δ 1 : = ( 1 + s ) δ ¯ ( θ ˜ ) δ ̲ ( ( 1 + s ) θ ˜ ) ,
δ 2 : = ( θ W ( a W ( R ) ) θ ˜ ) R ( 1 + θ ˜ ) δ ̲ ( θ W ( a W ( R ) ) ) + ( 1 + θ W ( a W ( R ) ) ) δ ¯ ( θ ˜ ) 1 + θ W ( a W ( R ) ) ,
and δ ¯ ( · ) and δ ̲ ( · ) are given by (90) and (91), respectively.
Proof. 
We first use (124) of Theorem 5 for Q Y n = P Y n and Lemma 10. Then, we restrict the range of θ ˜ as 1 < θ ˜ < θ W ( a W ( R ) ) and set ϑ = θ W ( a W ( R ) ) θ ˜ 1 + θ ˜ . Then, we have the assertion of the theorem. □
Next, we derive tighter bounds under Assumption 2. To derive a lower bound on − log P ¯ s ( M n ) in terms of the Rényi entropy of the transition matrix, we substitute the formula for the Rényi entropy in Lemma 11 into Lemma 14. Then, we have the following achievability bound.
Theorem 9
(Direct, Ass. 2). Suppose that transition matrix W satisfies Assumption 2. Let R : = 1 n log M n . Then, we have:
log P ¯ s ( n ) ( M n ) sup 1 2 θ 0 θ n R + ( n 1 ) θ H 1 + θ , W ( X | Y ) 1 + θ + ξ ̲ ( θ ) ,
where ξ ̲ ( θ ) is given by (98).
Finally, to derive an upper bound on − log P s ( M n ) in terms of the Rényi entropy for the transition matrix, we substitute the formula for the Rényi entropy in Lemma 12 into Theorem 5 with Q Y n = P Y n ( 1 + θ W ( a W ( R ) ) ) . Then, we can derive the following converse bound.
Theorem 10
(Converse, Ass. 2). Suppose that transition matrix W satisfies Assumption 2. Let R : = 1 n log M n . For any H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
log P s ( n ) ( M n ) inf s > 0 1 < θ ˜ < θ W ( a W ( R ) ) [ ( n 1 ) ( 1 + s ) θ ˜ H 1 + θ ˜ , 1 + θ W ( a W ( R ) ) W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , 1 + θ W ( a W ( R ) ) W ( X | Y ) + δ 1
( 1 + s ) log 1 2 e ( n 1 ) [ ( θ W ( a W ( R ) ) θ ˜ ) a W ( R ) θ W ( a W ( R ) ) H 1 + θ W ( a W ( R ) ) , W ( X | Y ) + θ ˜ H 1 + θ ˜ , 1 + θ W ( a W ( R ) ) W ( X | Y ) ] + δ 2 ] / s ,
where θ W ( a ) = θ , W ( a ) and a W ( R ) = a , W ( R ) are the inverse functions defined by (71) and (73), respectively,
δ 1 : = ( 1 + s ) ζ ¯ ( θ ˜ , θ W ( a W ( R ) ) ) ζ ̲ ( ( 1 + s ) θ ˜ , θ W ( a W ( R ) ) ) ,
δ 2 : = ( θ W ( a W ( R ) ) θ ˜ ) R ( 1 + θ ˜ ) ζ ̲ ( θ W ( a W ( R ) ) , θ W ( a W ( R ) ) ) + ( 1 + θ W ( a W ( R ) ) ) ζ ¯ ( θ ˜ , θ W ( a W ( R ) ) ) 1 + θ W ( a W ( R ) ) ,
and ζ ¯ ( · , · ) and ζ ̲ ( · , · ) are given by (102)–(105).
Proof. 
We first use (124) of Theorem 5 for Q Y n = P Y n ( 1 + θ W ( a W ( R ) ) ) and Lemma 12. Then, we restrict the range of θ ˜ as 1 < θ ˜ < θ W ( a W ( R ) ) and set ϑ = θ W ( a W ( R ) ) θ ˜ 1 + θ ˜ . Then, we have the assertion of the theorem. □

3.4. Second-Order

By applying the central limit theorem to Lemma 13 (cf. [67] (Theorem 27.4, Example 27.6)) and Lemma 18 for Q Y = P Y and by using Theorem 2, we have the following.
Theorem 11.
Suppose that transition matrix W on X × Y satisfies Assumption 1. For arbitrary ε ( 0 , 1 ) , we have:
$\log M(n,\varepsilon) = \log \bar{M}(n,\varepsilon) + o(\sqrt{n}) = n H_W(X|Y) + \sqrt{n V_W(X|Y)}\,\Phi^{-1}(1-\varepsilon) + o(\sqrt{n}) .$
Proof. 
The central limit theorem for the Markov process (cf. [67] (Theorem 27.4, Example 27.6)) guarantees that the random variable $\big(-\log P_{X^n|Y^n}(X^n|Y^n) - n H_W(X|Y)\big)/\sqrt{n}$ asymptotically obeys the normal distribution with average zero and variance $V_W(X|Y)$, where we use Theorem 2 to show that the limit of the variance is given by $V_W(X|Y)$. Let $R = \sqrt{V_W(X|Y)}\,\Phi^{-1}(1-\varepsilon)$. Substituting $M = e^{n H_W(X|Y) + \sqrt{n} R}$ and $\gamma = n H_W(X|Y) + \sqrt{n} R - n^{1/4}$ in Lemma 13, we have:
lim n P ¯ s ( n ) e n H W ( X | Y ) + n R ε .
On the other hand, substituting M = e n H W ( X | Y ) + n R and γ = n H W ( X | Y ) + n R + n 1 4 in Lemma 18 for Q Y = P Y , we have:
lim n P s ( n ) e n H W ( X | Y ) + n R ε .
Combining (140) and (141), we have the statement of the theorem.
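As a quick numerical companion to Theorem 11, the following minimal sketch evaluates the Gaussian approximation $n H_W(X|Y) + \sqrt{n V_W(X|Y)}\,\Phi^{-1}(1-\varepsilon)$; the entropy rate and variance below are hypothetical placeholder values rather than quantities computed from a particular chain.

```python
from math import sqrt
from statistics import NormalDist

def second_order_log_M(n, eps, H, V):
    """Gaussian approximation of Theorem 11: n*H + sqrt(n*V) * Phi^{-1}(1 - eps)."""
    return n * H + sqrt(n * V) * NormalDist().inv_cdf(1.0 - eps)

# Hypothetical conditional entropy rate H and variance V of a Markov source (in nats).
H, V = 0.45, 0.20
for n in (100, 1_000, 10_000):
    # The per-symbol rate approaches the entropy rate H as n grows.
    print(n, second_order_log_M(n, eps=1e-3, H=H, V=V) / n)
```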
From the above theorem, the (first-order) compression limit of source coding with side-information for a Markov source under Assumption 1 is given by (although the compression limit of source coding with side-information for a Markov chain is known more generally [68], we need Assumption 1 to get a single-letter characterization):
lim n 1 n log M ( n , ε ) = lim n 1 n log M ¯ ( n , ε )
= H W ( X | Y )
for any ε ( 0 , 1 ) . In the next subsections, we consider the asymptotic behavior of the error probability when the rate is larger than the compression limit H W ( X | Y ) in the moderate deviation regime and the large deviation regime, respectively.

3.5. Moderate Deviation

From Theorems 6 and 8, we have the following.
Theorem 12.
Suppose that transition matrix W satisfies Assumption 1. For arbitrary t ( 0 , 1 / 2 ) and δ > 0 , we have:
lim n 1 n 1 2 t log P s ( n ) e n H W ( X | Y ) + n 1 t δ = lim n 1 n 1 2 t log P ¯ s ( n ) e n H W ( X | Y ) + n 1 t δ
= δ 2 2 V W ( X | Y ) .
Proof. 
We apply Theorems 6 and 8 to the case with R = H W ( X | Y ) + n t δ , i.e., θ ( a ( R ) ) = n 1 δ V W ( X | Y ) + o ( n t ) . For the achievability part, from (88) and Theorem 6, we have:
log P s ( n ) M n sup 1 θ 0 θ n R + ( n 1 ) θ H 1 + θ , W ( X | Y ) + inf 1 θ 0 δ ̲ ( θ )
n 1 2 t δ 2 2 V W ( X | Y ) + o ( n 1 2 t ) .
To prove the converse part, we fix arbitrary s > 0 and choose θ ˜ to be n t δ V W ( X | Y ) + n 2 t . Then, Theorem 8 implies that:
lim sup n 1 n 1 2 t log P s ( M n ) lim sup n n 2 t 1 + s s θ ˜ H 1 + θ ˜ , W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , W ( X | Y )
= lim sup n n 2 t 1 + s s s θ ˜ 2 d H 1 + θ , W ( X | Y ) d θ | θ = θ ˜
= ( 1 + s ) δ 2 2 V W ( X | Y ) .
Remark 6.
In the literature [13,69], the moderate deviation results are stated for $\epsilon_n$ such that $\epsilon_n \to 0$ and $n \epsilon_n^2 \to \infty$ instead of $n^{-t}$ for $t \in (0, 1/2)$. Although the former is slightly more general than the latter, we employ the latter formulation in Theorem 12 since the order of convergence is clearer. In fact, $n^{-t}$ in Theorem 12 can be replaced by a general $\epsilon_n$ without modifying the argument of the proof.

3.6. Large Deviation

From Theorems 6 and 8, we have the following.
Theorem 13.
Suppose that transition matrix W satisfies Assumption 1. For H W ( X | Y ) < R , we have:
lim inf n 1 n log P ¯ s ( n ) e n R sup 1 θ 0 [ θ R + θ H 1 + θ , W ( X | Y ) ] .
On the other hand, for H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
lim sup n 1 n log P s ( n ) e n R θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y )
= sup 1 < θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
Proof. 
The achievability bound (151) follows from Theorem 6. The converse part (152) is proven from Theorem 8 as follows. We first fix s > 0 and 1 < θ ˜ < θ ( a ( R ) ) . Then, Theorem 8 implies:
lim sup n 1 n log P s ( n ) e n R 1 + s s θ ˜ H 1 + θ ˜ , W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , W ( X | Y ) .
By taking the limit s 0 and θ ˜ θ ( a ( R ) ) , we have:
1 + s s θ ˜ H 1 + θ ˜ , W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , W ( X | Y )
= 1 s θ ˜ H 1 + θ ˜ , W ( X | Y ) ( 1 + s ) θ ˜ H 1 + ( 1 + s ) θ ˜ , W ( X | Y ) + θ ˜ H 1 + θ ˜ , W ( X | Y )
θ ˜ d [ θ H 1 + θ , W ( X | Y ) ] d θ | θ = θ ˜ + θ ˜ H 1 + θ ˜ , W ( X | Y ) ( a s s 0 )
θ ( a ( R ) ) d [ θ H 1 + θ , W ( X | Y ) ] d θ | θ = θ ( a ( R ) ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) ( a s θ ˜ θ ( a ( R ) ) )
= θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) .
Thus, (152) is proven. The alternative expression (153) is derived via Lemma 9. □
Under Assumption 2, from Theorems 9 and 10, we have the following tighter bound.
Theorem 14.
Suppose that transition matrix W satisfies Assumption 2. For H W ( X | Y ) < R , we have:
lim inf n 1 n log P ¯ s ( n ) e n R sup 1 2 θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
On the other hand, for H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
lim sup n 1 n log P s ( n ) e n R θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y )
= sup 1 < θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
Proof. 
The achievability bound (160) follows from Theorem 9. The converse part (161) is proven from Theorem 10 as follows. We first fix s > 0 and 1 < θ ˜ < θ ( a ( R ) ) . Then, Theorem 10 implies:
lim sup n 1 n log P s ( n ) e n R 1 + s s θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) .
By taking the limit s 0 and θ ˜ θ ( a ( R ) ) , we have:
1 + s s θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y )
= 1 s θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) ( 1 + s ) θ ˜ H 1 + ( 1 + s ) θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) + θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y )
θ ˜ d [ θ H 1 + θ , 1 + θ ( a ( R ) ) W ( X | Y ) ] d θ | θ = θ ˜ + θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) ( a s s 0 )
θ ( a ( R ) ) d [ θ H 1 + θ , 1 + θ ( a ( R ) ) W ( X | Y ) ] d θ | θ = θ ( a ( R ) ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) ( a s θ ˜ θ ( a ( R ) ) )
= θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) .
Thus, (161) is proven. The alternative expression (162) is derived via Lemma 9.□
Remark 7.
For R R cr , where (cf. (72) for the definition of R ( a ) ):
R cr : = R d [ θ H 1 + θ , W ( X | Y ) ] d θ | θ = 1 2
is the critical rate, the left-hand side of (76) in Lemma 9 is attained by parameters in the range 1 / 2 θ 0 . Thus, the lower bound in (160) is rewritten as:
sup 1 2 θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ = θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) .
Thus, the lower bound and the upper bound coincide up to the critical rate.
Remark 8.
For the source coding without side-information, by taking the limit of Theorem 7, we have:
lim inf n 1 n log P ¯ s ( n ) e n R sup 1 θ 0 θ R + θ H 1 + θ W ( X ) 1 + θ .
On the other hand, as a special case of (152) without side-information, we have:
lim sup n 1 n log P s ( n ) e n R sup 1 < θ 0 θ R + θ H 1 + θ W ( X ) 1 + θ
for H W ( X ) < R < H 0 W ( X ) . Thus, we can recover the results in [40,41] by our approach.

3.7. Numerical Example

In this section, to demonstrate the advantage of our finite-length bounds, we numerically evaluate the achievability bound in Theorem 7 and a special case of the converse bound in Theorem 8 for source coding without side-information. Thanks to the aspect (A2), our numerical calculation shows that our upper finite-length bounds are very close to our lower finite-length bounds when the size n is sufficiently large. Thanks to the aspect (A1), we could calculate both bounds even for the huge block length n = 1 × 10^5, because the computational complexity does not grow with n (it behaves as O(1) in the block length).
We consider a binary transition matrix W given by Figure 2, i.e.,
W = 1 p q p 1 q .
In this case, the stationary distribution is:
P ˜ ( 0 ) = q p + q ,
P ˜ ( 1 ) = p p + q .
The entropy is:
H W ( X ) = q p + q h ( p ) + p p + q h ( q ) ,
where h ( · ) is the binary entropy function. The tilted transition matrix is:
W θ = ( 1 p ) 1 + θ q 1 + θ p 1 + θ ( 1 q ) 1 + θ .
The Perron–Frobenius eigenvalue is:
λ θ = ( 1 p ) 1 + θ + ( 1 q ) 1 + θ + { ( 1 p ) 1 + θ ( 1 q ) 1 + θ } 2 + 4 p 1 + θ q 1 + θ 2
and its normalized eigenvector is:
P ˜ θ ( 0 ) = q 1 + θ λ θ ( 1 p ) 1 + θ + q 1 + θ ,
P ˜ θ ( 1 ) = λ θ ( 1 p ) 1 + θ λ θ ( 1 p ) 1 + θ + q 1 + θ .
The normalized eigenvector of W ρ T is also given by:
P ^ θ ( 0 ) = p 1 + θ λ θ ( 1 p ) 1 + θ + p 1 + θ ,
P ^ θ ( 1 ) = λ θ ( 1 p ) 1 + θ λ θ ( 1 p ) 1 + θ + p 1 + θ .
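To make these formulas concrete, the following is a minimal numerical sketch (ours, for illustration) that computes the stationary distribution, the entropy rate of the source, and the Perron–Frobenius eigenvalue λ θ of the tilted matrix for p = 0.1 and q = 0.2, checking the closed-form eigenvalue above against a direct numerical eigenvalue computation; the connection to the Rényi entropy rate entering Theorems 7 and 8 is only gestured at in a comment, since the exact sign conventions were lost in extraction.

```python
import numpy as np

p, q = 0.1, 0.2                          # transition probabilities used in the plots

# Transition matrix W of the binary Markov source (rows: next symbol, columns: current symbol).
W = np.array([[1 - p, q],
              [p, 1 - q]])
pi = np.array([q / (p + q), p / (p + q)])   # stationary distribution

def h(x):
    """Binary entropy function (in nats)."""
    return -x * np.log(x) - (1 - x) * np.log(1 - x)

H_rate = pi[0] * h(p) + pi[1] * h(q)        # entropy rate of the source

def lam_numeric(theta):
    """Perron-Frobenius eigenvalue of the tilted matrix W_theta (entrywise power 1 + theta)."""
    return max(np.linalg.eigvals(W ** (1 + theta)).real)

def lam_closed(theta):
    """Closed-form expression for the same eigenvalue, as given above."""
    a, d = (1 - p) ** (1 + theta), (1 - q) ** (1 + theta)
    b, c = q ** (1 + theta), p ** (1 + theta)
    return (a + d + np.sqrt((a - d) ** 2 + 4 * b * c)) / 2

theta = -0.3
print(H_rate, lam_numeric(theta), lam_closed(theta))
# log(lam_numeric(theta)) is the quantity that enters the finite-length bounds of
# Theorems 7 and 8 through the Renyi entropy rate of the transition matrix (Lemma 10);
# evaluating the bounds then amounts to optimizing such expressions over theta.
```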
From these calculations, we can evaluate the bounds in Theorems 7 and 8. For p = 0.1 and q = 0.2, the bounds are plotted in Figure 3 for fixed error probability ε = 10^{-3}. Although there is a gap between the achievability bound and the converse bound for rather small n, the gap is less than approximately 5% of the entropy rate for n larger than 10,000. We also plot the bounds in Figure 4 for fixed block length n = 10,000 and varying ε. The gap between the achievability bound and the converse bound remains approximately 5% of the entropy rate even for ε as small as 10^{-10}.
The gap between the achievability bound and the converse bound in Figure 3 is rather large compared to a similar numerical experiment conducted in [1]. One reason for the gap is that our bounds are exponential-type bounds. For instance, when the source is i.i.d., the achievability bound essentially reduces to the so-called Gallager bound [64]. However, an advantage of our bounds is that the computational complexity does not depend on the blocklength. The computational complexities of the bounds plotted in [1] depend on the blocklength, and numerical computation of those bounds for Markov sources seems to be difficult.
When p = q , an alternative approach to derive tighter bounds is to consider encoding of the Markov transition, i.e., 1 [ X i = X i + 1 ] , instead of the source itself (cf. [45] (Example 4)). Then, the analysis can be reduced to the i.i.d. case. However, such an approach is possible only when p = q .

3.8. Summary of the Results

The obtained results in this section are summarized in Table 2. The check marks 🗸 indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second-order) can be obtained from those bounds. The marks 🗸 * indicate that the large deviation bound can be derived up to the critical rate. The computational complexity “Tail” indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 8 is derived from a special case ( Q Y = P Y ) of Theorem 5. The asymptotically optimal choice is Q Y = P Y ( 1 + θ ) , which corresponds to Corollary 1. Under Assumption 1, we can derive the bound of the Markov case only for that special choice of Q Y , while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of Q Y .

4. Channel Coding

In this section, we investigate the channel coding with a conditional additive channel. The first part of this section discusses the general properties of the channel coding with a conditional additive channel. The second part of this section discusses the properties of the channel coding when the conditional additive noise of the channel is Markov. The first part starts with showing the problem setting in Section 4.1 by introducing a conditional additive channel. Section 4.2 gives a canonical method to convert a regular channel to a conditional additive channel. Section 4.3 gives a method to convert a BPSK-AWGN channel to a conditional additive channel. Then, we show some single-shot achievability bounds in Section 4.4 and single-shot converse bounds in Section 4.5.
As the second part, we derive finite-length bounds for the Markov noise channel in Section 4.6. Then, we derive the second-order rate in Section 4.7. In Section 4.8 and Section 4.9, we show the asymptotic characterization for the moderate deviation regime and the large deviation regime, respectively, by using those finite-length bounds.

4.1. Formulation for the Conditional Additive Channel

4.1.1. Single-Shot Case

We first present the problem formulation in the single-shot setting. For a channel P B | A ( b | a ) with input alphabet A and output alphabet B , a channel code Ψ = ( e , d ) consists of one encoder e : { 1 , , M } A and one decoder d : B { 1 , , M } . The average decoding error probability is defined by:
P c [ Ψ ] : = m = 1 M 1 M P B | A ( { b : d ( b ) m } | e ( m ) ) .
For notational convenience, we introduce the error probability under the condition that the message size is M:
P c ( M ) : = inf Ψ P c [ Ψ ] .
Assume that the input alphabet A is the same set as the output alphabet B and they equal an additive group X . When the transition matrix P B | A ( b | a ) is given as P X ( b a ) by using a distribution P X on X , the channel is called additive.
To extend the concept of the additive channel, we consider the case when the input alphabet A is an additive group X and the output alphabet B is the product set X × Y . When the transition matrix P B | A ( x , y | a ) is given as P X Y ( x a , y ) by using a distribution P X Y on X × Y , the channel is called conditional additive. In this paper, we are exclusively interested in the conditional additive channel. As explained in Section 4.2, a channel is a conditional additive channel if and only if it is a regular channel in the sense of [31]. When we need to express the underlying distribution of the noise explicitly, we denote the average decoding error probability by P c [ Ψ | P X Y ] .
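To make the definition concrete, here is a minimal simulation sketch of a single use of a conditional additive channel: the noise pair (X, Y) is drawn from a joint distribution P_XY (the q-ary alphabet and the numerical values below are hypothetical, chosen only for illustration), the X-part is added to the input modulo q, and the Y-part is revealed to the receiver unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)

q = 4                                    # X is the additive group Z_q
# Hypothetical noise distribution P_XY on Z_q x {0, 1} (rows: x, columns: y).
P_XY = np.array([[0.50, 0.10],
                 [0.15, 0.05],
                 [0.05, 0.05],
                 [0.05, 0.05]])

def conditional_additive_channel(a):
    """One use of the channel: output ((a + X) mod q, Y) with (X, Y) drawn from P_XY."""
    flat = P_XY.ravel()
    idx = rng.choice(flat.size, p=flat)
    x, y = divmod(idx, P_XY.shape[1])
    return (a + x) % q, y

print([conditional_additive_channel(a=2) for _ in range(5)])
```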

4.1.2. n-Fold Extension

When we consider n-fold extension, the channel code is denoted with subscript n such as Ψ n = ( e n , d n ) . The error probabilities given in (183) and (184) are written with the superscript ( n ) as P c ( n ) [ Ψ n ] and P c ( n ) ( M n ) , respectively. Instead of evaluating the error probability P c ( n ) ( M n ) for given M n , we are also interested in evaluating:
M ( n , ε ) : = sup M n : P c ( n ) ( M n ) ε
for given 0 ε 1 .
When the channel is given as a conditional distribution, the channel is given by:
P B n | A n ( x n , y n | a n ) = P X n Y n ( x n a n , y n ) ,
where P X n Y n is a noise distribution on X n × Y n .
For the code construction, we investigate linear codes. For an ( n , k ) linear code C n ⊂ A n , there exists a parity check matrix f n : A n → A n − k such that the kernel of f n is C n . Conversely, given a parity check matrix f n : A n → A n − k , we define the encoder I Ker ( f n ) : C n → A n as the embedding of the kernel Ker ( f n ) . Then, using the decoder d f n : = argmin d P c [ ( I Ker ( f n ) , d ) ] , we define Ψ ( f n ) = ( I Ker ( f n ) , d f n ) .
Here, we employ a randomized choice of the parity check matrix. In particular, instead of a general two-universal hash function, we focus on linear two-universal hash functions because linearity is required for the relation with source coding described in Section 4.4. Therefore, denoting the set of linear two-universal hash functions from A n to A n − k by F l , we introduce the quantity:
P ¯ c ( n , k ) : = sup F n F l E F n P c ( n ) [ Ψ ( F n ) ] .
Taking the infimum over all linear codes associated with F n (cf. (113)), we obviously have:
P c ( n ) ( | A | k ) P ¯ c ( n , k ) .
When we consider the error probability for conditionally additive channels, we use notation P ¯ c ( n , k | P X Y ) so that the underlying distribution of the noise is explicit. We are also interested in characterizing:
k ( n , ε ) : = sup k : P ¯ c ( n , k ) ε
for given 0 ε 1 .

4.2. Conversion from the Regular Channel to the Conditional Additive Channel

The aim of this subsection is to show the following theorem by presenting the conversion rule between these two types of channels. Along the way, we see that the binary erasure symmetric channel is an example of a regular channel.
Theorem 15.
A channel is a regular channel in the sense of [31] if and only if it can be written as a conditional additive channel.
To show the conversion from a conditional additive channel to a regular channel, we assume that the input alphabet A has an additive group structure. Let P X ˜ be a distribution on the output alphabet B . Let π a be a representation of the group A on B , and let G = { π a : a A } . A regular channel [31] is defined by:
P B | A ( b | a ) = P X ˜ ( π a ( b ) ) .
The group action induces orbit:
Orb ( b ) : = { π a ( b ) : a A } .
The set of all orbits constitutes a disjoint partition of B . The set of orbits is denoted by B ¯ , and let ϖ : B → B ¯ be the map that sends each element b to its orbit Orb ( b ) .
Example 4
(Binary erasure symmetric channel). Let A = { 0 , 1 } , B = { 0 , 1 , ? } , and:
$P_{\tilde X}(b) = \begin{cases} 1 - p - p' & \text{if } b = 0 \\ p & \text{if } b = 1 \\ p' & \text{if } b = {?} \end{cases}$
Then, let:
π 0 = 0 1 ? 0 1 ? , π 1 = 0 1 ? 1 0 ? .
The channel defined in this way is a regular channel (see Figure 5). In this case, there are two orbits: { 0 , 1 } and { ? } .
Let B = X × Y and P X ˜ = P X Y for some joint distribution on X × Y . Now, we consider a conditional additive channel, whose transition matrix P B | A ( x , y | a ) is given as P X Y ( x a , y ) . When the group action is given by π a ( x , y ) = ( x a , y ) , the above conditional additive channel is given as a regular channel. In this case, there are | Y | orbits, and the size of each orbit is | X | , respectively. This fact shows that any conditional additive channel is written as a regular channel. That is, it shows the “if” part of Theorem 15.
Conversely, we present the conversion from a regular channel to a conditional additive channel. We first explain the construction for the single-shot channel. For random variable X ˜ P X ˜ , let Y = B ¯ and Y = ϖ ( X ˜ ) be the random variable describing the representatives of the orbits. For y = Orb ( b ) and each orbit Orb ( b ) , we fix an element 0 y Orb ( b ) . Then, we define:
P Y ( y ) : = P X ˜ ( Orb ( b ) ) , P X , Y ( a , y ) : = P X ˜ ( π a ( 0 y ) ) | { a A | π a ( 0 y ) = π a ( 0 y ) } | .
Then, we obtain the virtual channel P X , Y | A as P X , Y | A ( x , y | a ) : = P X , Y ( x a , y ) . Using the conditional distributions P X , Y | B and P B | X , Y as:
P X , Y | B ( a , y | b ) = 1 | { a A | π a ( 0 y ) = π a ( 0 y ) } | when b = π a ( 0 y ) 0 otherwise .
P B | X , Y ( b | a , y ) = 1 when b = π a ( 0 y ) , and 0 otherwise ,
we obtain the relations:
P B | A ( b | a ) = x , y P B | X , Y ( b | x , y ) P X , Y | A ( x , y | a ) , P X , Y | A ( x , y | a ) = b P X , Y | B ( x , y | b ) P B | A ( b | a ) .
These two equations show that the receiver information of the virtual conditional additive channel P X , Y | A and the receiver information of the regular channel P B | A can be converted into each other. Hence, we can say that a regular channel in the sense of [31] can be written as a conditional additive channel, which shows the “only if” part of Theorem 15.
Example 5
(Binary erasure symmetric channel revisited). We convert the regular channel of Example 4 to a conditional additive channel. Let us label the orbit { 0 , 1 } as y = 0 and { ? } as y = 1 . Let 0 0 = 0 and 0 1 = ? .
$P_{X,Y}(x, 0) = \begin{cases} 1 - p - p' & \text{if } x = 0 \\ p & \text{if } x = 1 \end{cases}$
$P_{X,Y}(x, 1) = \frac{p'}{2} .$
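The following sketch illustrates Examples 4 and 5 numerically, writing the erasure probability (denoted p′ above, as reconstructed from the garbled display) as pe in the code; the values p = 0.1 and pe = 0.2 are ours, chosen only for illustration. The receiver-side conversion of Section 4.2 maps each output b to a pair (x, y), and an empirical check confirms that the induced channel is conditionally additive with the noise distribution of Example 5.

```python
import numpy as np

rng = np.random.default_rng(1)
p, pe = 0.1, 0.2                     # crossover and erasure probabilities (hypothetical values)

def besc(a):
    """Binary erasure symmetric channel: flip with probability p, erase with probability pe."""
    u = rng.random()
    if u < pe:
        return '?'
    if u < pe + p:
        return str(1 - a)
    return str(a)

def to_conditional_additive(b):
    """Receiver-side conversion of Section 4.2: map b to (x, y)."""
    if b == '?':
        return rng.integers(2), 1    # orbit {?}: x is uniform, y = 1
    return int(b), 0                 # orbit {0, 1}: x = b, y = 0

# Empirical check that P(x, y | a) = P_{X,Y}(x XOR a, y), i.e., the channel is
# conditionally additive with the noise distribution of Example 5.
counts = {a: np.zeros((2, 2)) for a in (0, 1)}
N = 200_000
for a in (0, 1):
    for _ in range(N):
        x, y = to_conditional_additive(besc(a))
        counts[a][x, y] += 1
print(counts[0] / N)                 # approx [[1-p-pe, pe/2], [p, pe/2]]
print(counts[1] / N)                 # approx the same matrix with the rows swapped
```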
When we consider the nth extension, a channel is given by:
P B n | A n ( b n | a n ) = P X ˜ n ( π a n ( b n ) ) ,
where the nth extension of the group action is defined by π a n ( b n ) = ( π a 1 ( b 1 ) , , π a n ( b n ) ) .
Similarly, for n-fold extension, we can also construct the virtual conditional additive channel. More precisely, for X ˜ n P X ˜ n , we set Y n = ϖ ( X ˜ n ) = ( ϖ ( X ˜ 1 ) , , ϖ ( X ˜ n ) ) and:
P X n , Y n ( x n , y n ) : = P X ˜ n ( π a n ( 0 y n ) ) | { a n A n | π a n ( 0 y n ) = π a n ( 0 y n ) } | .

4.3. Conversion of the BPSK-AWGN Channel into the Conditional Additive Channel

Although we only considered finite input/output sources and channels throughout the paper, in order to demonstrate the utility of the conditional additive channel framework, let us consider the additive white Gaussian noise (AWGN) channel with binary phase shift keying (BPSK) in this section. Let A = { 0 , 1 } be the input alphabet of the channel, and let B = R be the output alphabet of the channel. For an input a A and Gaussian noise Z with mean zero and variance σ 2 , the output of the channel is given by B = ( 1 ) a + Z . Then, the conditional probability density function of this channel is given as:
P B | A ( b | a ) = 1 2 π σ e ( b ( 1 ) a ) 2 σ 2 .
Now, to define a conditional additive channel, we choose Y : = R + and define the probability density function p Y on Y with respect to the Lebesgue measure and the conditional distribution P X | Y ( x | y ) as:
p Y ( y ) : = 1 2 π σ ( e ( y 1 ) 2 σ 2 + e ( y + 1 ) 2 σ 2 )
P X | Y ( 0 | y ) : = e ( y 1 ) 2 σ 2 e ( y 1 ) 2 σ 2 + e ( y + 1 ) 2 σ 2
P X | Y ( 1 | y ) : = e ( y + 1 ) 2 σ 2 e ( y 1 ) 2 σ 2 + e ( y + 1 ) 2 σ 2
for y R + . When we define b : = ( 1 ) x y R for x { 0 , 1 } and y R + , we have:
p X Y | A ( y , x | a ) = 1 2 π σ e ( y ( 1 ) a + x ) 2 σ 2 = 1 2 π σ e ( ( 1 ) x y ( 1 ) a ) 2 σ 2 = 1 2 π σ e ( b ( 1 ) a ) 2 σ 2 .
The relations (202) and (206) show that the AWGN channel with BPSK is given as a conditional additive channel in the above sense.
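The following sketch mirrors this conversion numerically: the channel output b is mapped to (x, y) with x the sign bit and y = |b|, and the posterior P_{X|Y}(0|y) is computed from y alone. The noise level σ is a hypothetical value, and the factor 2σ² in the Gaussian exponent follows the standard density; the paper's exact normalization was lost in extraction.

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.8                              # hypothetical noise standard deviation

def bpsk_awgn(a):
    """B = (-1)^a + Z with Gaussian noise Z of standard deviation sigma."""
    return (-1) ** a + sigma * rng.standard_normal()

def to_conditional_additive(b):
    """Map the real output b to (x, y) with b = (-1)^x * y and y >= 0."""
    x = 0 if b >= 0 else 1
    return x, abs(b)

def posterior_x0(y):
    """P_{X|Y}(0|y) computed from y alone (standard Gaussian normalization assumed)."""
    e0 = np.exp(-((y - 1) ** 2) / (2 * sigma ** 2))
    e1 = np.exp(-((y + 1) ** 2) / (2 * sigma ** 2))
    return e0 / (e0 + e1)

a = 1
x, y = to_conditional_additive(bpsk_awgn(a))
print(x, y, posterior_x0(y))             # the virtual additive noise bit is x XOR a
```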
By noting this observation, as explained in Remark 5, the single-shot achievability bounds in Section 3.2 are also valid for continuous Y. Furthermore, the discussions for the single-shot converse bounds in Section 4.5 hold even for continuous Y. Therefore, the bounds in Section 4.4 and Section 4.5 are also applicable to the BPSK-AWGN channel.
In particular, in the n memoryless extension of the BPSK-AWGN channel, the information measures for the noise distribution are given as n times the single-shot information measures for the noise distribution. Even in this case, the upper and lower bounds in Section 4.4 and Section 4.5 are applicable by replacing the information measures by n times the single-shot information measures. Therefore, we obtain finite-length upper and lower bounds on the optimal coding length for the memoryless BPSK-AWGN channel. Furthermore, even when the additive noise is not Gaussian, if the probability density function p Z of the additive noise Z satisfies the symmetry p Z ( z ) = p Z ( − z ) , the BPSK channel with the additive noise Z can be converted to a conditional additive channel in the same way.

4.4. Achievability Bound Derived by Source Coding with Side-Information

In this subsection, we give a code for a conditional additive channel from a code of source coding with side-information in a canonical way. In this construction, we see that the decoding error probability of the channel code equals that of the source code.
When the channel is given as the conditional additive channel with conditional additive noise distribution P X n Y n as (186) and X = A is the finite field F q , we can construct a linear channel code from a source code with full side-information whose encoder and decoder are f n and d n as follows. First, we assume linearity for the source encoder f n . Let C n ( f n ) be the kernel of the linear encoder f n of the source code. Suppose that the sender sends a codeword c n C n ( f n ) and ( c n + X n , Y n ) is received. Then, the receiver computes the syndrome f n ( c n + X n ) = f n ( c n ) + f n ( X n ) = f n ( X n ) , estimates X n from f n ( X n ) and Y n , and subtracts the estimate from c n + X n . That is, we choose the channel decoder d ˜ n as:
d ˜ n ( x n , y n ) : = x n d n ( f n ( x n ) , y n ) .
We succeed in decoding in this channel coding if and only if d n ( f n ( X n ) , Y n ) equals X n . Thus, the error probability of this channel code coincides with that of the source code for the correlated source ( X n , Y n ) . In summary, we have the following lemma, which was first pointed out in [27].
Lemma 19
([27], (19)). Given a linear encoder f n and a decoder d n for a source code with side-information with distribution P X n Y n , let I Ker ( f n ) and d ˜ n be the channel encoder and decoder induced by ( f n , d n ) . Then, the error probability of channel coding for the conditionally additive channel with noise distribution P X n Y n satisfies:
P c ( n ) [ ( I Ker ( f n ) , d ˜ n ) | P X n Y n ] = P s ( n ) [ ( f n , d n ) | P X n Y n ] .
Furthermore, when F n is chosen to be a linear two-universal hash function, we also have the following (in fact, when we additionally impose the linearity on the random function F in the definition (114) of P ¯ s ( M | P X n Y n ) , the result in [27] implies that the equality in (209) holds):
P ¯ c ( n , k ) = sup F n F l E F n P c ( n ) [ Ψ ( F n ) ] sup F n F l E F n P c ( n ) [ ( I Ker ( F n ) , d ˜ n ) ] = sup F n F l E F n P s ( n ) [ ( F n , d n ) ] sup F n F E F n P s ( n ) [ ( F n , d n ) ] = P ¯ s ( n ) ( | A n k | ) .
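The following sketch illustrates the algebra of the decoder construction above over F_2 with a small, hypothetical parity check matrix: the channel decoder computes the syndrome of the received word, estimates the noise by a toy source decoder, and subtracts the estimate. A realistic source decoder d_n would also exploit the side-information, which is omitted here for brevity.

```python
import numpy as np
from itertools import product

# Hypothetical (n, k) setup over F_2: parity check matrix f_n with n = 6, n - k = 3.
F = np.array([[1, 0, 1, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 1, 0, 0, 0, 1]])          # f_n : F_2^6 -> F_2^3
n = F.shape[1]

def syndrome(v):
    return F @ v % 2

def source_decoder(s, y=None):
    """Toy d_n: return the lowest-weight noise pattern with the given syndrome.
    (A real decoder would also use the side-information y; omitted here.)"""
    best = None
    for x in product((0, 1), repeat=n):
        x = np.array(x)
        if np.array_equal(syndrome(x), s) and (best is None or x.sum() < best.sum()):
            best = x
    return best

def channel_decoder(received, y=None):
    """Induced decoder of Lemma 19: subtract the estimated noise."""
    return (received - source_decoder(syndrome(received), y)) % 2

# A codeword is any vector in the kernel of f_n; a weight-1 noise is corrected here
# exactly when the source decoder recovers it from its syndrome.
codeword = np.array([1, 1, 1, 0, 0, 0])
assert not syndrome(codeword).any()          # codeword lies in Ker(f_n)
noise = np.array([0, 0, 0, 1, 0, 0])
print(channel_decoder((codeword + noise) % 2))   # recovers the codeword
```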
By using this observation and the results in Section 3.2, we can derive the achievability bounds. By using the conversion argument in Section 4.2, we can also construct a channel code for a regular channel from a source code with full side-information. Although the following bounds are just a specialization of known bounds for conditional additive channels, we review these bounds here to clarify the correspondence between the bounds in source coding with side-information and channel coding.
From Lemma 13 and (209), we have the following.
Lemma 20
([2]). The following bound holds:
P ¯ c ( n , k ) inf γ 0 P X n Y n log 1 P X n | Y n ( x n | y n ) > γ + e γ | A | n k .
From Lemma 14 and (209), we have the following exponential-type bound.
Lemma 21
([6]). The following bound holds:
P ¯ c ( n , k ) inf 1 2 θ 0 | A | θ ( n k ) 1 + θ e θ 1 + θ H 1 + θ ( X n | Y n ) .
From Lemma 15 and (209), we have the following slightly loose exponential bound.
Lemma 22
([3,70]). The following bound holds (The bound (212) was derived in the original Japanese edition of [3], but it is not written in the English edition [3]. The quantum analogue was derived in [70].):
P ¯ c ( n , k ) inf 1 θ 0 | A | θ ( n k ) e θ H 1 + θ ( X n | Y n ) .
When X has no side-information, i.e., the virtual channel is additive, we have the following special case of Lemma 21.
Lemma 23
([6]). Suppose that X has no side-information. Then, the following bound holds:
P ¯ c ( n , k ) inf 1 2 θ 0 | A | θ ( n k ) 1 + θ e θ 1 + θ H 1 + θ ( X n ) .

4.5. Converse Bound

In this subsection, we show some converse bounds. The following is the information spectrum-type converse shown in [4].
Lemma 24
([4], Lemma 4). For any code Ψ n = ( e n , d n ) and any output distribution Q B n P ( B n ) , we have:
P c ( n ) [ Ψ n ] sup γ 0 m = 1 M n 1 M n P B n | A n log P B n | A n ( b n | e n ( m ) ) Q B n ( b n ) < γ e γ M n .
When a channel is a conditional additive channel, we have:
P B n | A n ( a n + x n , y n | a n ) = P X n Y n ( x n , y n ) .
By taking the output distribution Q B n as:
Q B n ( a n + x n , y n ) = 1 | A | n Q Y n ( y n )
for some Q Y n P ( Y n ) , as a corollary of Lemma 24, we have the following bound.
Lemma 25.
When a channel is a conditional additive channel, for any distribution Q Y n P ( Y n ) , we have:
P c ( n ) ( M n ) sup γ 0 P X n Y n log Q Y n ( y n ) P X n Y n ( x n , y n ) > n log | A | γ e γ M n .
Proof. 
By noting (215) and (216), the first term of the right-hand side of (214) can be rewritten as:
m = 1 M n 1 M n P B n | A n log P B n | A n ( b n | e n ( m ) ) Q B n ( b n ) < γ
= m = 1 M n 1 M n P X n Y n log P B n | A n ( e n ( m ) + x n , y n | e n ( m ) ) Q B n ( e n ( m ) + x n , y n )
= P X n Y n log Q Y n ( y n ) P X n Y n ( x n , y n ) > n log | A | γ ,
which implies the statement of the lemma. □
An argument similar to that for Theorem 5 also yields the following converse bound.
Theorem 16.
For any Q Y n P ( Y n ) , we have:
log P c ( n ) ( M n ) inf s > 0 θ ˜ R , ϑ 0 [ ( 1 + s ) θ ˜ H 1 + θ ˜ ( P X n Y n | Q Y n ) H 1 + ( 1 + s ) θ ˜ ( P X n Y n | Q Y n )
( 1 + s ) log 1 2 e ϑ R + ( θ ˜ + ϑ ( 1 + θ ˜ ) ) H 1 + θ ˜ + ϑ ( 1 + θ ˜ ) ( P X n Y n | Q Y n ) ( 1 + ϑ ) θ ˜ H 1 + θ ˜ ( P X n Y n | Q Y n ) 1 + ϑ ] / s inf s > 0 1 < θ ˜ < θ ( a ( R ) ) [ ( 1 + s ) θ ˜ H 1 + θ ˜ ( P X n Y n | Q Y n ) H 1 + ( 1 + s ) θ ˜ ( P X n Y n | Q Y n )
( 1 + s ) log 1 2 e ( θ ( a ( R ) ) θ ˜ ) a ( R ) θ ( a ( R ) ) H 1 + θ ( a ( R ) ) ( P X n Y n | Q Y n ) + θ ˜ H 1 + θ ˜ ( P X n Y n | Q Y n ) ] / s ,
where R = n log | A | log M n , and θ ( a ) and a ( R ) are the inverse functions defined in (29) and (32), respectively.
Proof. 
See Appendix L. □

4.6. Finite-Length Bound for the Markov Noise Channel

From this subsection on, we address the conditional additive channel whose conditional additive noise is subject to a Markov chain. Here, the input alphabet A n equals the additive group X n = F q n , and the output alphabet B n is X n × Y n . That is, the transition matrix describing the channel is given by using a transition matrix W on X × Y and an initial distribution Q as:
P B n | A n ( x n + a n , y n | a n ) = Q ( x 1 , y 1 ) i = 2 n W ( x i , y i | x i 1 , y i 1 ) .
As in Section 2.2, we consider two assumptions on the transition matrix W of the noise process ( X , Y ) , i.e., Assumptions 1 and 2. We also use the same notations as in Section 2.2.
Example 6
(Gilbert–Elliot channel with state-information available at the receiver). The Gilbert–Elliot channel [29,30] is characterized by a channel state Y n on Y n = { 0 , 1 } n and an additive noise X n on X n = { 0 , 1 } n . The noise process ( X n , Y n ) is a Markov chain induced by the transition matrix W introduced in Example 3. For the channel input a n , the channel output is given by ( a n + X n , Y n ) when the state-information is available at the receiver. Thus, this channel can be regarded as a conditional additive channel, and the transition matrix of the noise process satisfies Assumption 2.
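The following sketch simulates this channel. Since the transition matrix of Example 3 is not reproduced here, the state-flip and crossover probabilities below are hypothetical parameters, with the noise bit X_i drawn according to the current state Y_i.

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical parameters: state-flip probabilities and per-state crossover probabilities.
state_flip = {0: 0.05, 1: 0.10}     # Pr[Y_{i+1} != Y_i | Y_i]
crossover = {0: 0.01, 1: 0.30}      # Pr[X_i = 1 | Y_i] (good state 0, bad state 1)

def gilbert_elliot_noise(n, y0=0):
    """Sample the Markov noise process (X^n, Y^n) of the Gilbert-Elliot channel."""
    xs, ys, y = [], [], y0
    for _ in range(n):
        x = int(rng.random() < crossover[y])   # noise bit depends on the current state
        xs.append(x); ys.append(y)
        y = 1 - y if rng.random() < state_flip[y] else y
    return np.array(xs), np.array(ys)

def transmit(a):
    """Channel output when the state-information is available at the receiver."""
    x, y = gilbert_elliot_noise(len(a))
    return (a + x) % 2, y            # (a^n + X^n, Y^n)

a = rng.integers(0, 2, size=20)      # arbitrary channel input
b, y = transmit(a)
print((a != b).mean(), y)            # raw bit-error rate and the revealed state sequence
```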
Proofs of the following bounds are almost the same as those in Section 3.3, and thus omitted. The combination of Lemmas 10 and 22 derives the following achievability bound.
Theorem 17
(Direct, Ass. 1). Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. Let R : = n k n log | A | . Then, we have:
log P ¯ c ( n , k ) sup 1 θ 0 θ n R + ( n 1 ) θ H 1 + θ , W ( X | Y ) + δ ̲ ( θ ) .
Theorem 16 for Q Y n = P Y n and Lemma 10 yield the following converse bound.
Theorem 18
(Converse, Ass. 1). Suppose that transition matrix W of the conditional additive noise satisfies Assumption 1. Let R : = log | A | 1 n log M n . If H W ( X | Y ) < R < H 0 , W ( X | Y ) , then we have:
log P c ( n ) ( M n ) inf s > 0 1 < θ ˜ < θ ( a ( R ) ) [ ( n 1 ) ( 1 + s ) θ ˜ H 1 + θ ˜ , W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , W ( X | Y ) + δ 1
( 1 + s ) log 1 2 e ( n 1 ) [ ( θ ( a ( R ) ) θ ˜ ) a ( R ) θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) + θ ˜ H 1 + θ ˜ , W ( X | Y ) ] + δ 2 ] / s ,
where θ ( a ) = θ ( a ) and a ( R ) = a ( R ) are the inverse functions defined by (67) and (70), respectively, and:
δ 1 : = ( 1 + s ) δ ¯ ( θ ˜ ) δ ̲ ( ( 1 + s ) θ ˜ ) ,
δ 2 : = ( θ ( a ( R ) ) θ ˜ ) R ( 1 + θ ˜ ) δ ̲ ( θ ( a ( R ) ) ) + ( 1 + θ ( a ( R ) ) ) δ ¯ ( θ ˜ ) 1 + θ ( a ( R ) ) .
Next, we derive tighter bounds under Assumption 2. From Lemmas 11 and 21, we have the following achievability bound.
Theorem 19
(Direct, Ass. 2). Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 2. Let R : = n k n log | A | . Then, we have:
log P ¯ c ( n , k ) sup 1 2 θ 0 θ n R + ( n 1 ) θ H 1 + θ , W ( X | Y ) 1 + θ + ξ ̲ ( θ ) .
By using Theorem 16 for Q Y n = P Y n ( 1 + θ ( a ( R ) ) ) and Lemma 12, we obtain the following converse bound.
Theorem 20
(Converse, Ass. 2). Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 2. Let R : = log | A | 1 n log M n . If H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
log P c ( n ) ( M n ) inf s > 0 1 < θ ˜ < θ ( a ( R ) ) [ ( n 1 ) ( 1 + s ) θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) H 1 + ( 1 + s ) θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) + δ 1
( 1 + s ) log 1 2 e ( n 1 ) [ ( θ ( a ( R ) ) θ ˜ ) a ( R ) θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) + θ ˜ H 1 + θ ˜ , 1 + θ ( a ( R ) ) W ( X | Y ) ] + δ 2 ] / s ,
where θ ( a ) = θ ( a ) and a ( R ) = a ( R ) are the inverse functions defined by (71) and (73), respectively, and:
δ 1 : = ( 1 + s ) ζ ¯ ( θ ˜ , θ ( a ( R ) ) ) ζ ̲ ( ( 1 + s ) θ ˜ , θ ( a ( R ) ) ) ,
δ 2 : = ( θ ( a ( R ) ) θ ˜ ) R ( 1 + θ ˜ ) ζ ̲ ( θ ( a ( R ) ) , θ ( a ( R ) ) ) + ( 1 + θ ( a ( R ) ) ) ζ ¯ ( θ ˜ , θ ( a ( R ) ) ) 1 + θ ( a ( R ) ) .
Finally, when X has no side-information, i.e., the channel is additive, we obtain the following achievability bound from Lemma 23.
Theorem 21
(Direct, no-side-information). Let R : = n k n log | A | . Then, we have:
log P ¯ c ( n , k ) sup 1 2 θ 0 θ n R + ( n 1 ) θ H 1 + θ W ( X ) + δ ̲ ( θ ) 1 + θ .
Remark 9.
Our treatment for the Markov conditional additive channel covers Markov regular channels because Markov regular channels can be reduced to Markov conditional additive channels as follows. Let X ˜ = { X ˜ n } n = 1 be a Markov chain on B whose distribution is given by:
P X ˜ n ( x ˜ n ) = Q ( x ˜ 1 ) i = 2 n W ˜ ( x ˜ i | x ˜ i 1 )
for a transition matrix W ˜ and an initial distribution Q. Let ( X , Y ) = { ( X n , Y n ) } n = 1 be the noise process of the conditional additive channel derived from the noise process X ˜ of the regular channel by the argument of Section 4.2. Since we can write:
P X n Y n ( x n , y n ) = Q ( ι y 1 1 ( ϑ y 1 ( x 1 ) ) ) 1 | Stb ( 0 y 1 ) | i = 2 n W ˜ ( ι y i 1 ( ϑ y i ( x i ) ) | ι y i 1 1 ( ϑ y i 1 ( x i 1 ) ) ) 1 | Stb ( 0 y i ) | ,
the process ( X , Y ) is also a Markov chain. Thus, the regular channel given by X ˜ is reduced to the conditional additive channel given by ( X , Y ) .

4.7. Second-Order

To discuss the asymptotic performance, we introduce the quantity:
C : = log | A | − H W ( X | Y ) .
By applying the central limit theorem (cf. [67] (Theorem 27.4, Example 27.6)) to Lemmas 20 and 25 for Q Y n = P Y n , and by using Theorem 2, we have the following.
Theorem 22.
Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. For arbitrary ε ( 0 , 1 ) , we have:
log M ( n , ε ) = k ( n , ε ) log | A | = C n + V W ( X | Y ) Φ 1 ( ε ) n + o ( n ) .
Proof. 
This theorem follows in the same manner as the proof of Theorem 11 by replacing Lemma 13 with Lemma 20 (achievability) and Lemma 18 with Lemma 25 (converse). □
From the above theorem, the (first-order) capacity of the conditional additive channel under Assumption 1 is given by:
lim n 1 n log M ( n , ε ) = lim n 1 n log k ( n , ε ) log | A | n = C
for every 0 < ε < 1 . In the next subsections, we consider the asymptotic behavior of the error probability when the rate is smaller than the capacity in the moderate deviation regime and the large deviation regime, respectively.

4.8. Moderate Deviation

From Theorems 17 and 18, we have the following.
Theorem 23.
Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. For arbitrary t ( 0 , 1 / 2 ) and δ > 0 , we have:
lim n 1 n 1 2 t log P c ( n ) e n C n 1 t δ = lim n 1 n 1 2 t log P ¯ c ( n ) n , n C n 1 t δ log | A |
= δ 2 2 V W ( X | Y ) .
Proof. 
The theorem follows in the same manner as Theorem 12 by replacing Theorem 6 with Theorem 17 (achievability) and Theorem 8 with Theorem 18 (converse). □

4.9. Large Deviation

From Theorem 17 and Theorem 18, we have the following.
Theorem 24.
Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 1. For H W ( X | Y ) < R , we have:
lim inf n 1 n log P ¯ c ( n ) n , n 1 R log | A | sup 1 θ 0 θ R + θ H 1 + θ , W ( X | Y ) .
On the other hand, for H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
lim sup n 1 n log P c ( n ) e n ( log | A | R ) θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y )
= sup 1 < θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
Proof. 
The theorem follows in the same manner as Theorem 13 by replacing Theorem 6 with Theorem 17 (achievability) and Theorem 8 with Theorem 18 (converse). □
Under Assumption 2, from Theorems 19 and 20, we have the following tighter bound.
Theorem 25.
Suppose that the transition matrix W of the conditional additive noise satisfies Assumption 2. For H W ( X | Y ) < R , we have:
lim inf n 1 n log P ¯ c ( n ) n , n 1 R log | A | sup 1 2 θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
On the other hand, for H W ( X | Y ) < R < H 0 , W ( X | Y ) , we have:
lim sup n 1 n log P c ( n ) e n ( log | A | R ) θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y )
= sup 1 < θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
Proof. 
The theorem follows the same manner as Theorem 14 by replacing Theorem 9 with Theorem 19 and Theorem 10 with Theorem 20. □
When X has no side-information, i.e., the channel is additive, from Theorem 21 and (245), we have the following.
Theorem 26.
For H W ( X ) < R , we have:
lim inf n 1 n log P ¯ c ( n ) n , n 1 R log | A | sup 1 2 θ 0 θ R + θ H 1 + θ W ( X ) 1 + θ .
On the other hand, for H W ( X ) < R < H 0 W ( X ) , we have:
lim sup n 1 n log P c ( n ) e n ( log | A | R ) sup 1 < θ 0 θ R + θ H 1 + θ W ( X ) 1 + θ .
Proof. 
The first claim follows by taking the limit of Theorem 21, and the second claim follows as a special case of (245) without side-information. □

4.10. Summary of the Results

The results shown in this section for the Markov conditional additive noise are summarized in Table 3. The check marks 🗸 indicate that the tight asymptotic bounds (large deviation, moderate deviation, and second-order) can be obtained from those bounds. The marks 🗸 * indicate that the large deviation bound can be derived up to the critical rate. The computational complexity “Tail” indicates that the computational complexities of those bounds depend on the computational complexities of tail probabilities. It should be noted that Theorem 18 is derived from a special case ( Q Y = P Y ) of Theorem 16. The asymptotically optimal choice is Q Y = P Y ( 1 + θ ) . Under Assumption 1, we can derive the bound of the Markov case only for that special choice of Q Y , while under Assumption 2, we can derive the bound of the Markov case for the optimal choice of Q Y . Furthermore, Theorem 18 is not asymptotically tight in the large deviation regime in general, but it is tight if X has no side-information, i.e., the channel is additive. It should be also noted that Theorem 20 does not imply Theorem 18 even for the additive channel case since Assumption 2 restricts the structure of transition matrices even when X has no side-information.

5. Discussion and Conclusions

In this paper, we developed a unified approach to source coding with side-information and channel coding for a conditional additive channel for finite-length and asymptotic analyses of Markov chains. In our approach, the conditional Rényi entropies defined for transition matrices played important roles. Although we only illustrated the source coding with side-information and the channel coding for a conditional additive channel as applications of our approach, it could be applied to some other problems in information theory such as random number generation problems, as shown in another paper [60].
Our results for source coding with side-information and channel coding over the conditional additive channel have been extended to the case where the side-information is continuous (e.g., the real line) and the joint distribution of X and Y is memoryless. Since this case covers the BPSK-AWGN channel, it can be expected that a similar treatment covers the MPSK-AWGN channel. Since such channels are often employed in practical channel coding, it is an interesting future topic to investigate finite-length bounds for these channels. Further, we could not define the conditional Rényi entropy for transition matrices when Y is continuous. Hence, our results for Markov chains could not be extended to such a continuous case. It is another interesting future topic to extend the obtained results to the case with continuous Y.

Author Contributions

Conceptualization, M.H.; methodology, S.W.; formal analysis, S.W. and M.H.; writing, original draft preparation, S.W.; writing, review and editing, M.H. All authors read and agreed to the published version of the manuscript.

Funding

M.H. is partially supported by the Japan Society of the Promotion of Science (JSPS) Grant-in-Aid for Scientific Research (A) No. 23246071, (A) No. 17H01280, (B) No. 16KT0017, the Okawa Research Grant, and Kayamori Foundation of Informational Science Advancement. He is also partially supported by the National Institute of Information and Communication Technology (NICT), Japan. S.W. is supported in part by the Japan Society of the Promotion of Science (JSPS) Grant-in-Aid for Young Scientists (A) No. 16H06091.

Acknowledgments

The authors would like to thank Vincent Y. F. Tan for pointing out Remark 6. The authors are also grateful to Ryo Yaguchi for his helpful comments.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; nor in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
RCU: random coding union
BSC: binary symmetric channel
DMC: discrete memoryless channel
DT: dependence testing
LDPC: low-density parity check
BPSK: binary phase shift keying
AWGN: additive white Gaussian noise
CGF: cumulant generating function
MPSK: M-ary phase shift keying
CC: channel coding
SC: source coding
SI: side-information

Appendix A. Preparation for the Proofs

When we prove some properties of Rényi entropies or derive converse bounds, some properties of cumulant generating functions (CGFs) become useful. For this purpose, we introduce some terminologies in statistics from [22,23]. Then, in Appendix B, we show the relation between the terminologies in statistics and those in information theory. For the proofs, see [22,23].

Appendix A.1. Single-Shot Setting

Let Z be a random variable with distribution P. Let:
$\phi(\rho) := \log \mathbb{E}\big[e^{\rho Z}\big]$
$= \log \sum_z P(z)\, e^{\rho z}$
be the cumulant generating function (CGF). Let us introduce an exponential family:
P ρ ( z ) : = P ( z ) e ρ z ϕ ( ρ ) .
By differentiating the CGF, we find that:
$\phi'(\rho) = \mathbb{E}_\rho[Z]$
$:= \sum_z P_\rho(z)\, z .$
We also find that:
$\phi''(\rho) = \sum_z P_\rho(z) \big(z - \mathbb{E}_\rho[Z]\big)^2 .$
We assume that Z is not constant. Then, (A6) implies that $\phi(\rho)$ is a strictly convex function and $\phi'(\rho)$ is monotonically increasing. Thus, we can define the inverse function $\rho(a)$ of $\phi'(\rho)$ by:
$\phi'(\rho(a)) = a .$
Let:
$D_{1+s}(P \| Q) := \frac{1}{s} \log \sum_z P(z)^{1+s} Q(z)^{-s}$
be the Rényi divergence. Then, we have the following relation:
$s\, D_{1+s}(P_{\tilde\rho} \| P_{\rho}) = \phi\big((1+s)\tilde\rho - s\rho\big) - (1+s)\phi(\tilde\rho) + s\,\phi(\rho) .$
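The following numerical sketch instantiates these definitions for a hypothetical finite-valued Z: it computes the CGF φ(ρ) and the exponential family P_ρ, and checks by a finite difference that φ′(ρ) coincides with the tilted mean E_ρ[Z].

```python
import numpy as np

# Hypothetical random variable Z with values z and distribution P.
z = np.array([-1.0, 0.0, 2.0, 3.5])
P = np.array([0.4, 0.3, 0.2, 0.1])

def phi(rho):
    """CGF: phi(rho) = log E[e^{rho Z}]."""
    return np.log(np.sum(P * np.exp(rho * z)))

def tilted(rho):
    """Exponential family: P_rho(z) = P(z) e^{rho z - phi(rho)}."""
    return P * np.exp(rho * z - phi(rho))

rho = 0.7
mean_tilted = np.sum(tilted(rho) * z)                 # E_rho[Z]
deriv = (phi(rho + 1e-6) - phi(rho - 1e-6)) / 2e-6    # numerical phi'(rho)
print(mean_tilted, deriv)                             # the two values agree
```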

Appendix A.2. Transition Matrix

Let $\{W(z'|z)\}_{(z',z) \in \mathcal{Z}^2}$ be an ergodic and irreducible transition matrix, and let $\tilde P$ be its stationary distribution. For a function $g: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$, let:
$\mathbb{E}[g] := \sum_{z, z'} \tilde P(z)\, W(z'|z)\, g(z', z) .$
We also introduce the following tilted matrix:
$W_\rho(z'|z) := W(z'|z)\, e^{\rho g(z', z)} .$
Let λ ρ be the Perron–Frobenius eigenvalue of W ρ . Then, the CGF for W with generator g is defined by:
ϕ ( ρ ) : = log λ ρ .
Lemma A1.
The function $\phi(\rho)$ is a convex function of ρ, and it is strictly convex iff $\phi''(0) > 0$.
From Lemma A1, $\phi'(\rho)$ is a monotone increasing function. Thus, we can define the inverse function $\rho(a)$ of $\phi'(\rho)$ by:
$\phi'(\rho(a)) = a .$

Appendix A.3. Markov Chain

Let $\boldsymbol{Z} = \{Z_n\}_{n=1}^{\infty}$ be the Markov chain induced by $W(z'|z)$ and an initial distribution $P_{Z_1}$. For functions $g: \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ and $\tilde g: \mathcal{Z} \to \mathbb{R}$, let $S_n := \sum_{i=2}^{n} g(Z_i, Z_{i-1}) + \tilde g(Z_1)$. Then, the CGF of $S_n$ is given by:
$\phi_n(\rho) := \log \mathbb{E}\big[e^{\rho S_n}\big] .$
We will use the following finite evaluation for ϕ n ( ρ ) .
Lemma A2.
Let $v_\rho$ be the eigenvector of $W_\rho^T$ with respect to the Perron–Frobenius eigenvalue $\lambda_\rho$ such that $\min_z v_\rho(z) = 1$. Let $w_\rho(z) := P_{Z_1}(z)\, e^{\rho \tilde g(z)}$. Then, we have:
$(n-1)\,\phi(\rho) + \underline{\delta}_\phi(\rho) \le \phi_n(\rho) \le (n-1)\,\phi(\rho) + \bar{\delta}_\phi(\rho),$
where:
$\bar{\delta}_\phi(\rho) := \log \langle v_\rho | w_\rho \rangle ,$
$\underline{\delta}_\phi(\rho) := \log \langle v_\rho | w_\rho \rangle - \log \max_z v_\rho(z) .$
From this lemma, we have the following.
Corollary A1.
For any initial distribution and ρ R , we have:
$\lim_{n\to\infty} \frac{1}{n}\,\phi_n(\rho) = \phi(\rho) .$
The relation:
$\lim_{n\to\infty} \frac{1}{n}\,\mathbb{E}[S_n] = \phi'(0)$
$= \mathbb{E}[g]$
is well known. Furthermore, we also have the following.
Lemma A3.
For any initial distribution, we have:
$\lim_{n\to\infty} \frac{1}{n}\,\mathrm{Var}[S_n] = \phi''(0) .$
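The following sketch instantiates these quantities for a hypothetical two-state transition matrix and generator g: φ_n(ρ) is computed exactly by matrix products, φ(ρ) is obtained as the logarithm of the Perron–Frobenius eigenvalue of the tilted matrix, and the difference φ_n(ρ) − (n−1)φ(ρ) is seen to remain bounded, as Lemma A2 asserts.

```python
import numpy as np

# Hypothetical two-state chain: column-stochastic W(z'|z), initial law P_Z1,
# generator g(z', z), and initial-term generator g_tilde(z).
W = np.array([[0.9, 0.3],
              [0.1, 0.7]])
P_Z1 = np.array([0.5, 0.5])
g = np.array([[0.0, 1.0],
              [2.0, -0.5]])          # g[z_new, z_old]
g_tilde = np.array([0.3, -0.2])

def phi_rate(rho):
    """phi(rho) = log of the Perron-Frobenius eigenvalue of the tilted matrix W_rho."""
    W_rho = W * np.exp(rho * g)
    return np.log(max(np.linalg.eigvals(W_rho).real))

def phi_n(rho, n):
    """Exact CGF of S_n = sum_{i=2}^n g(Z_i, Z_{i-1}) + g_tilde(Z_1) via matrix products."""
    W_rho = W * np.exp(rho * g)
    vec = P_Z1 * np.exp(rho * g_tilde)   # the vector w_rho of Lemma A2
    for _ in range(n - 1):
        vec = W_rho @ vec
    return np.log(vec.sum())

rho = 0.4
for n in (5, 50, 500):
    print(n, phi_n(rho, n) - (n - 1) * phi_rate(rho))   # stays bounded (Lemma A2)
```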

Appendix B. Relation Between CGF and Conditional Rényi Entropies

Appendix B.1. Single-Shot Setting

For correlated random variable ( X , Y ) , let us consider Z = log Q Y ( Y ) P X Y ( X , Y ) . Then, the relation between the CGF and conditional Rényi entropy relative to Q Y is given by:
θ H 1 + θ ( P X Y | Q Y ) = ϕ ( θ ; P X Y | Q Y ) .
From this, we can also find that the relationship between the inverse functions (cf. (29) and (A7)):
θ ( a ) = ρ ( a ) .
Thus, the inverse function defined in (32) also satisfies:
( 1 ρ ( a ( R ) ) a ( R ) + ϕ ( ρ ( a ( R ) ) ; P X Y | Q Y ) = R .
Similarly, by setting Z = log 1 P X | Y ( X | Y ) , we have:
θ H 1 + θ ( X | Y ) = ϕ ( θ ; P X Y | P Y ) .
Then, the variance (cf. (11)) satisfies:
V ( X | Y ) = ϕ ( 0 ; P X Y | P Y ) .
Let ϕ ( ρ , ρ ) be the CGF of Z = log P Y ( 1 ρ ) ( Y ) P X Y ( X , Y ) (cf. (15) for the definition of P Y ( 1 ρ ) ). Then, we have:
θ H 1 + θ , 1 + θ ( X | Y ) = ϕ ( θ , θ ) .
It should be noted that ϕ ( ρ , ρ ) is a CGF for fixed ρ , but ϕ ( ρ , ρ ) cannot be treated as a CGF.

Appendix B.2. Transition Matrix

For transition matrix W ( x , y | x , y ) , we consider the function given by:
g ( ( x , y ) , ( x , y ) ) : = log W ( y | y ) W ( x , y | x , y ) .
Then, the relation between the CGF and the lower conditional Rényi entropy is given by:
θ H 1 + θ , W ( X | Y ) = ϕ ( θ ) .
Then, the variance defined in (51) satisfies:
V W ( X | Y ) = ϕ ( 0 ) .

Appendix C. Proof of Lemma 2

We use the following lemma.
Lemma A4.
For θ ( 1 , 0 ) ( 0 , 1 ) , we have:
H 1 1 θ ( X | Y ) H 1 1 θ ( X | Y ) H 1 + θ ( X | Y ) .
Proof. 
The left-hand side inequality of (A31) is obvious from the definitions of the two Rényi entropies (the latter is defined by taking the maximum). The right-hand side inequality was proven in [71] (Lemma 6). □
Now, we go back to the proof of Lemma 2. From (10) and (11), by the Taylor approximation, we have:
H 1 + θ ( X | Y ) = H ( X | Y ) 1 2 V ( X | Y ) θ + o ( θ ) .
Furthermore, since 1 1 θ = 1 + θ + o ( θ ) , we also have:
H 1 1 θ ( X | Y ) = H ( X | Y ) 1 2 V ( X | Y ) θ + o ( θ ) .
Thus, from Lemma A4, we can derive (16) and (17).

Appendix D. Proof of Lemma 3

Statements 1 and 3 follow from the relationships in (A22) and (A25) and the strict convexity of the CGFs.
To prove Statement 5, we first prove the strict convexity of the Gallager function:
E 0 ( τ ; P X Y ) : = log y P Y ( y ) x P X | Y ( x | y ) 1 1 + τ 1 + τ
for τ > −1. We use the Hölder inequality:
Σ_i a_i^α b_i^β ≤ (Σ_i a_i)^α (Σ_i b_i)^β
for α, β > 0 such that α + β = 1, where the equality holds iff a_i = c b_i for some constant c. For λ ∈ (0, 1), let 1 + τ_3 = λ(1 + τ_1) + (1 − λ)(1 + τ_2), which implies:
1/(1+τ_3) = [1/(1+τ_1)]·[λ(1+τ_1)/(1+τ_3)] + [1/(1+τ_2)]·[(1−λ)(1+τ_2)/(1+τ_3)]
and:
λ(1+τ_1)/(1+τ_3) + (1−λ)(1+τ_2)/(1+τ_3) = 1.
Then, by applying the Hölder inequality twice, we have:
Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1/(1+τ_3)}]^{1+τ_3}
= Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{[1/(1+τ_1)]·λ(1+τ_1)/(1+τ_3)} P_{X|Y}(x|y)^{[1/(1+τ_2)]·(1−λ)(1+τ_2)/(1+τ_3)}]^{1+τ_3}
≤ Σ_y P_Y(y) {[Σ_x P_{X|Y}(x|y)^{1/(1+τ_1)}]^{λ(1+τ_1)/(1+τ_3)} [Σ_x P_{X|Y}(x|y)^{1/(1+τ_2)}]^{(1−λ)(1+τ_2)/(1+τ_3)}}^{1+τ_3}
= Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1/(1+τ_1)}]^{λ(1+τ_1)} [Σ_x P_{X|Y}(x|y)^{1/(1+τ_2)}]^{(1−λ)(1+τ_2)}
= Σ_y {P_Y(y)^λ [Σ_x P_{X|Y}(x|y)^{1/(1+τ_1)}]^{λ(1+τ_1)}} · {P_Y(y)^{1−λ} [Σ_x P_{X|Y}(x|y)^{1/(1+τ_2)}]^{(1−λ)(1+τ_2)}}
≤ {Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1/(1+τ_1)}]^{1+τ_1}}^{λ} {Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1/(1+τ_2)}]^{1+τ_2}}^{1−λ}.
The equality in the second inequality holds iff:
[Σ_x P_{X|Y}(x|y)^{1/(1+τ_1)}]^{1+τ_1} = c [Σ_x P_{X|Y}(x|y)^{1/(1+τ_2)}]^{1+τ_2} for all y ∈ Y
for some constant c. Furthermore, the equality in the first inequality holds iff P_{X|Y}(x|y) = 1/|supp(P_{X|Y}(·|y))|. Substituting this into (A44), we find that |supp(P_{X|Y}(·|y))| does not depend on y. Thus, both the equalities hold simultaneously iff V(X|Y) = 0. Now, since:
θ H_{1+θ}(X|Y) = −(1+θ) E_0(−θ/(1+θ); P_{XY}),
we have:
d²[θ H_{1+θ}(X|Y)]/dθ² = −[1/(1+θ)^4] E_0″(−θ/(1+θ); P_{XY})
≤ 0
for θ ∈ (−1, ∞), where the equality holds iff V(X|Y) = 0.
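The convexity of E_0(τ; P_{XY}) and its degeneration to a linear function when V(X|Y) = 0 can be checked numerically; the distributions below are assumed examples.

```python
import numpy as np

# E_0(tau; P_XY) = log sum_y P_Y(y) [sum_x P_{X|Y}(x|y)^{1/(1+tau)}]^{1+tau} is convex in tau,
# and linear in tau when every conditional P_{X|Y}(.|y) is uniform on supports of equal size.
rng = np.random.default_rng(2)
P = rng.random((4, 3)); P /= P.sum()
PY = P.sum(axis=0)
PXgY = P / PY[None, :]

def E0(tau, PXgY, PY):
    inner = np.sum(PXgY ** (1.0 / (1.0 + tau)), axis=0) ** (1.0 + tau)
    return np.log(np.sum(PY * inner))

taus = np.linspace(-0.9, 3.0, 40)
vals = np.array([E0(t, PXgY, PY) for t in taus])
second_diff = vals[:-2] - 2 * vals[1:-1] + vals[2:]
print(second_diff.min() >= -1e-12)           # convexity: second differences are nonnegative

# degenerate case: uniform conditionals, so V(X|Y) = 0 and E_0 is linear (here tau*log 4)
PXgY_unif = np.full((4, 3), 0.25)
vals_u = np.array([E0(t, PXgY_unif, PY) for t in taus])
second_diff_u = vals_u[:-2] - 2 * vals_u[1:-1] + vals_u[2:]
print(np.abs(second_diff_u).max() < 1e-10)   # linear: second differences vanish
```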
Statement 7 is obvious from the definitions of the two measures. The first part of Statement 8 follows from (A27) and the convexity of the CGF, but we need another argument to check the conditions for strict concavity. Since the second term of:
θ H_{1+θ,1+θ′}(X|Y) = −log Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1+θ}] [Σ_x P_{X|Y}(x|y)^{1+θ′}]^{−θ/(1+θ′)} + [θθ′/(1+θ′)] H^↑_{1+θ′}(X|Y)
is linear with respect to θ, it suffices to show the strict concavity of the first term. By using the Hölder inequality twice, for θ_3 = λθ_1 + (1−λ)θ_2, we have:
Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1+θ_3}] [Σ_x P_{X|Y}(x|y)^{1+θ′}]^{−θ_3/(1+θ′)}
≤ Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1+θ_1}]^{λ} [Σ_x P_{X|Y}(x|y)^{1+θ_2}]^{1−λ} [Σ_x P_{X|Y}(x|y)^{1+θ′}]^{−[λθ_1+(1−λ)θ_2]/(1+θ′)}
≤ {Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1+θ_1}] [Σ_x P_{X|Y}(x|y)^{1+θ′}]^{−θ_1/(1+θ′)}}^{λ}
· {Σ_y P_Y(y) [Σ_x P_{X|Y}(x|y)^{1+θ_2}] [Σ_x P_{X|Y}(x|y)^{1+θ′}]^{−θ_2/(1+θ′)}}^{1−λ},
where both the equalities hold simultaneously iff V ( X | Y ) = 0 , which can be proven in a similar manner as the equality conditions in (A40) and (A43). Thus, we have the latter part of Statement 8.
Statements 10–12 are also obvious from the definitions. Statements 2, 4, 6, and 9 follow from Statements 1, 3, 5, and 8 (cf. [71], Lemma 1).

Appendix E. Proof of Lemma 4

Since (24) and (28) are obvious from the definitions, we only prove (26). We note that:
y P Y ( y ) x P X | Y ( x | y ) 1 + θ 1 1 + θ 1 + θ
y P Y ( y ) | supp ( P X | Y ( · | y ) ) | 1 1 + θ 1 + θ
max y supp ( P Y ) | supp ( P X | Y ( · | y ) ) |
and:
y P Y ( y ) x P X | Y ( x | y ) 1 + θ 1 1 + θ 1 + θ
P Y ( y * ) 1 + θ x P X | Y ( x | y * ) 1 + θ
θ 1 | supp ( P X | Y ( · | y * ) ) | ,
where:
y * : = argmax y supp ( P Y ) | supp ( P X | Y ( · | y ) ) | .

Appendix F. Proof of Lemma 6

From Lemma A4 and Theorems 1 and 3, we have:
H_{1/(1−θ),W}(X|Y) ≤ H^↑_{1/(1−θ),W}(X|Y) ≤ H_{1+θ,W}(X|Y)
for θ ∈ (−1, 0) ∪ (0, 1). Thus, we can prove Lemma 6 in the same manner as Lemma 2.

Appendix G. Proof of (63)

First, in the same manner as Theorem 1, we can show:
lim_{n→∞} (1/n) H_{1+θ}(P_{X^n Y^n}|Q_{Y^n}) = H^{W|V}_{1+θ}(X|Y),
where Q_{Y^n} is a Markov chain induced by V for some initial distribution. Then, since H_{1+θ}(P_{X^n Y^n}|Q_{Y^n}) ≤ H^↑_{1+θ}(X^n|Y^n) for each n, by using Theorem 3, we have:
H^{W|V}_{1+θ}(X|Y) ≤ H^↑_{1+θ,W}(X|Y).
Thus, the rest of the proof is to show that H^↑_{1+θ,W}(X|Y) is attainable by some V.
Let Q̂_θ be the normalized left eigenvector of K_θ, and let:
V_θ(y|y′) := Q̂_θ(y) K_θ(y|y′) / (κ_θ Q̂_θ(y′)).
Then, V_θ attains the maximum. To prove this, we will show that κ_θ^{1+θ} is the Perron–Frobenius eigenvalue of:
W(x, y|x′, y′)^{1+θ} V_θ(y|y′)^{−θ}.
We first confirm that (Q̂_θ(y)^{1+θ} : (x, y) ∈ X × Y) is an eigenvector of (A64) as follows:
Σ_{x,y} Q̂_θ(y)^{1+θ} W(x, y|x′, y′)^{1+θ} V_θ(y|y′)^{−θ}
= Σ_y Q̂_θ(y)^{1+θ} W_θ(y|y′) [Q̂_θ(y) W_θ(y|y′)^{1/(1+θ)} / (κ_θ Q̂_θ(y′))]^{−θ}
= κ_θ^{θ} Q̂_θ(y′)^{θ} Σ_y Q̂_θ(y) W_θ(y|y′)^{1/(1+θ)}
= κ_θ^{1+θ} Q̂_θ(y′)^{1+θ}.
Since (Q̂_θ(y)^{1+θ} : (x, y) ∈ X × Y) is a positive vector and the Perron–Frobenius eigenvector is the unique positive eigenvector, we find that κ_θ^{1+θ} is the Perron–Frobenius eigenvalue. Thus, we have:
H^{W|V_θ}_{1+θ}(X|Y) = −[(1+θ)/θ] log κ_θ
= H^↑_{1+θ,W}(X|Y).
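The construction of V_θ can be checked numerically. The sketch below builds a small transition matrix with a conditional additive structure (an assumed example satisfying Assumption 2), forms V_θ from the left Perron–Frobenius eigenvector of K_θ, and verifies that the Perron–Frobenius eigenvalue of W(x, y|x′, y′)^{1+θ} V_θ(y|y′)^{−θ} equals κ_θ^{1+θ}.

```python
import numpy as np

theta = 0.6
WY = np.array([[0.85, 0.25],
               [0.15, 0.75]])                      # WY[y, y'] = W(y|y')
Qn = np.array([[[0.9, 0.1], [0.7, 0.3]],
               [[0.6, 0.4], [0.8, 0.2]]])          # Qn[y, y', z]: noise distribution
W = np.zeros((4, 4))                               # joint state s = 2*x + y
for x in range(2):
    for y in range(2):
        for xp in range(2):
            for yp in range(2):
                W[2*x + y, 2*xp + yp] = WY[y, yp] * Qn[y, yp, x ^ xp]

# Assumption 2: W_theta(y|y') = sum_x W(x,y|x',y')^{1+theta} does not depend on x'
W_th = np.array([[np.sum(W[[2*0 + y, 2*1 + y], 2*0 + yp] ** (1 + theta)) for yp in range(2)]
                 for y in range(2)])
K = W_th ** (1.0 / (1 + theta))                    # K_theta(y|y')

eigval, eigvec = np.linalg.eig(K.T)                # left eigenvector of K_theta
k = np.argmax(eigval.real)
kappa = eigval.real[k]
Qhat = np.abs(eigvec[:, k].real); Qhat /= Qhat.sum()

V = np.array([[Qhat[y] * K[y, yp] / (kappa * Qhat[yp]) for yp in range(2)] for y in range(2)])
print(V.sum(axis=0))                               # V_theta is a transition matrix (columns sum to 1)

M = np.zeros((4, 4))
for x in range(2):
    for y in range(2):
        for xp in range(2):
            for yp in range(2):
                M[2*x + y, 2*xp + yp] = W[2*x + y, 2*xp + yp] ** (1 + theta) * V[y, yp] ** (-theta)
pf = np.linalg.eigvals(M).real.max()
print(pf, kappa ** (1 + theta))                    # the two agree
```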

Appendix H. Proof of Lemma 7

Statement 1 follows from (A29) and the strict convexity of the CGF. Statements 5 and 8–10 follow from the corresponding statements in Lemma 3, Theorems 1, 3 and 4.
Now, we prove Statement 3. (The concavity of θ H_{1+θ,W}(X|Y) follows from the limiting argument, i.e., from the concavity of θ H_{1+θ}(X^n|Y^n) (cf. Lemma 3) and Theorem 3; however, the strict concavity does not follow from the limiting argument.) For this purpose, we introduce the transition matrix counterpart of the Gallager function as follows. Let:
K̄_τ(y|y′) := W(y|y′) [Σ_x W(x|x′, y, y′)^{1/(1+τ)}]^{1+τ}
for τ > −1, which is well defined under Assumption 2. Let κ̄_τ be the Perron–Frobenius eigenvalue of K̄_τ, and let Q̃_τ and Q̂_τ be its normalized right and left eigenvectors. Then, let:
L_τ(y|y′) := Q̂_τ(y) K̄_τ(y|y′) / (κ̄_τ Q̂_τ(y′))
be a parametrized transition matrix. The stationary distribution of L_τ is given by:
Q_τ(y) := Q̂_τ(y) Q̃_τ(y) / Σ_{y′} Q̂_τ(y′) Q̃_τ(y′).
We prove the strict convexity of E_0^W(τ) := log κ̄_τ for τ > −1. Then, by the same reason as (A46), we can show Statement 3. Let Q_τ(y, y′) := L_τ(y|y′) Q_τ(y′). By the same calculation as [22] (Proof of Lemmas 13 and 14), we have:
Σ_{y,y′} Q_τ(y, y′) [d log L_τ(y|y′)/dτ]² = −Σ_{y,y′} Q_τ(y, y′) d² log L_τ(y|y′)/dτ².
Furthermore, from the definition of L_τ, we have:
Σ_{y,y′} Q_τ(y, y′) d² log L_τ(y|y′)/dτ²
= Σ_{y,y′} Q_τ(y, y′) [d² log(1/κ̄_τ)/dτ² + d² log(Q̂_τ(y)/Q̂_τ(y′))/dτ² + d² log K̄_τ(y|y′)/dτ²]
= −d² log κ̄_τ/dτ² + Σ_{y,y′} Q_τ(y, y′) d² log K̄_τ(y|y′)/dτ².
Now, we show the convexity of log K̄_τ(y|y′) in τ for each (y, y′). By using the Hölder inequality (cf. Appendix D), for τ_3 = λτ_1 + (1−λ)τ_2, we have:
[Σ_x W(x|x′, y, y′)^{1/(1+τ_3)}]^{1+τ_3} ≤ [Σ_x W(x|x′, y, y′)^{1/(1+τ_1)}]^{λ(1+τ_1)} [Σ_x W(x|x′, y, y′)^{1/(1+τ_2)}]^{(1−λ)(1+τ_2)}.
Thus, E_0^W(τ) is convex. To check the strict convexity, we note that the equality in (A78) holds iff W(x|x′, y, y′) = 1/|supp(W(·|x′, y, y′))|. Since:
Σ_x W(x|x′, y, y′)^{1+θ} = |supp(W(·|x′, y, y′))|^{−θ}
does not depend on x′ by Assumption 2, we have |supp(W(·|x′, y, y′))| = C_{y y′} for some integer C_{y y′}. By substituting this into K̄_τ, we have:
K̄_τ(y|y′) = W(y|y′) C_{y y′}^{τ}.
On the other hand, we note that the CGF ϕ(ρ) is defined as the logarithm of the Perron–Frobenius eigenvalue of:
W(x, y|x′, y′)^{1−ρ} W(y|y′)^{ρ} = W(y|y′) (1/C_{y y′})^{1−ρ} 1[x ∈ supp(W(·|x′, y, y′))].
Since:
Σ_{x,y} Q̂_τ(y) W(y|y′) (1/C_{y y′})^{1−τ} 1[x ∈ supp(W(·|x′, y, y′))]
= Σ_y Q̂_τ(y) W(y|y′) C_{y y′}^{τ}
= κ̄_τ Q̂_τ(y′),
κ̄_τ is the Perron–Frobenius eigenvalue of (A81), and thus, we have E_0^W(τ) = ϕ(τ) when the equality in (A78) holds for every (y, y′) such that W(y|y′) > 0. Since ϕ(τ) is strictly convex if V^W(X|Y) > 0, E_0^W(τ) is strictly convex if V^W(X|Y) > 0. Thus, θ H_{1+θ,W}(X|Y) is strictly concave if V^W(X|Y) > 0. On the other hand, from (57), θ H_{1+θ,W}(X|Y) is strictly concave only if V^W(X|Y) > 0.
Statement 6 can be proven by modifying the proof of Statement 8 of Lemma 3 to the transition matrix setting, in a similar manner to Statement 3 of the present lemma.
Finally, Statements 2, 4 and 7 follow from Statements 1, 3 and 6 (cf. [71], Lemma 1).

Appendix I. Proof of Lemma 9

We only prove (75), since (76) can be proven in exactly the same manner by replacing H_{1+θ,W}(X|Y), θ(a), and a(R) with their counterparts for the other conditional Rényi entropy measure. Let:
f ( θ ) : = θ R + θ H 1 + θ , W ( X | Y ) 1 + θ .
Then, we have:
f ( θ ) = R + ( 1 + θ ) d [ θ H 1 + θ , W ( X | Y ) ] d θ θ H 1 + θ , W ( X | Y ) ( 1 + θ ) 2
= R + R d [ θ H 1 + θ , W ( X | Y ) ] d θ ( 1 + θ ) 2 .
Since R(a) is monotonically increasing and d[θ H_{1+θ,W}(X|Y)]/dθ is monotonically decreasing, we have f′(θ) ≥ 0 for θ ≤ θ(a(R)) and f′(θ) ≤ 0 for θ ≥ θ(a(R)). Thus, f(θ) takes its maximum at θ(a(R)). Furthermore, since −1 ≤ θ(a(R)) ≤ 0 for H_W(X|Y) ≤ R ≤ H_{0,W}(X|Y), we have:
sup 1 θ 0 θ R + θ H 1 + θ , W ( X | Y ) 1 + θ
= θ ( a ( R ) ) R + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) 1 + θ ( a ( R ) )
= θ ( a ( R ) ) [ ( 1 + θ ( a ( R ) ) ) a ( R ) θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) ] + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) 1 + θ ( a ( R ) )
= θ ( a ( R ) ) a ( R ) + θ ( a ( R ) ) H 1 + θ ( a ( R ) ) , W ( X | Y ) ,
where we substituted R = R ( a ( R ) ) in the second equality.

Appendix J. Proof of Lemma 11

Let u be the vector such that u(y) = 1 for every y ∈ Y. From the definition of H^↑_{1+θ}(X^n|Y^n), we have the following sequence of calculations:
e^{−[θ/(1+θ)] H^↑_{1+θ}(X^n|Y^n)}
= Σ_{y_1,…,y_n} [Σ_{x_1,…,x_n} P(x_1, y_1)^{1+θ} ∏_{i=2}^n W(x_i, y_i|x_{i−1}, y_{i−1})^{1+θ}]^{1/(1+θ)}
=^{(a)} Σ_{y_1,…,y_n} [Σ_{x_1} P(x_1, y_1)^{1+θ}]^{1/(1+θ)} ∏_{i=2}^n W_θ(y_i|y_{i−1})^{1/(1+θ)}
= ⟨u|K_θ^{n−1} w_θ⟩
≤ ⟨v_θ|K_θ^{n−1} w_θ⟩
= ⟨(K_θ^T)^{n−1} v_θ|w_θ⟩
= κ_θ^{n−1} ⟨v_θ|w_θ⟩
= e^{−(n−1)[θ/(1+θ)] H^↑_{1+θ,W}(X|Y)} ⟨v_θ|w_θ⟩,
which implies the left-hand side inequality, where we used Assumption 2 in ( a ) . On the other hand, we have the following sequence of calculations:
e^{−[θ/(1+θ)] H^↑_{1+θ}(X^n|Y^n)}
= ⟨u|K_θ^{n−1} w_θ⟩
≥ [1/max_y v_θ(y)] ⟨v_θ|K_θ^{n−1} w_θ⟩
= [1/max_y v_θ(y)] ⟨(K_θ^T)^{n−1} v_θ|w_θ⟩
= κ_θ^{n−1} ⟨v_θ|w_θ⟩ / max_y v_θ(y)
= e^{−(n−1)[θ/(1+θ)] H^↑_{1+θ,W}(X|Y)} ⟨v_θ|w_θ⟩ / max_y v_θ(y),
which implies the right-hand side inequality.
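The two-sided bound of Lemma 11 can be verified by brute force for short blocks. The sketch below uses an assumed conditional additive chain (so that Assumption 2 holds), computes Σ_{y^n}[Σ_{x^n} P(x^n, y^n)^{1+θ}]^{1/(1+θ)} exactly, and checks that it lies between κ_θ^{n−1}⟨v_θ|w_θ⟩/max_y v_θ(y) and κ_θ^{n−1}⟨v_θ|w_θ⟩, with v_θ normalized so that min_y v_θ(y) = 1.

```python
import numpy as np
from itertools import product

theta, n = 0.4, 5
WY = np.array([[0.8, 0.3], [0.2, 0.7]])            # WY[y, y']
Qn = np.array([[[0.9, 0.1], [0.7, 0.3]],
               [[0.6, 0.4], [0.85, 0.15]]])        # Qn[y, y', z]
P1 = np.full((2, 2), 0.25)                         # initial joint distribution P(x_1, y_1)

def Wj(x, y, xp, yp):                              # W(x, y | x', y')
    return WY[y, yp] * Qn[y, yp, x ^ xp]

# exact n-letter quantity by brute force
total = 0.0
for yseq in product(range(2), repeat=n):
    s = 0.0
    for xseq in product(range(2), repeat=n):
        p = P1[xseq[0], yseq[0]]
        for i in range(1, n):
            p *= Wj(xseq[i], yseq[i], xseq[i - 1], yseq[i - 1])
        s += p ** (1 + theta)
    total += s ** (1 / (1 + theta))

# single-letter quantities: K_theta, kappa_theta, v_theta (min-normalized), w_theta
K = np.array([[sum(Wj(x, y, 0, yp) ** (1 + theta) for x in range(2)) ** (1 / (1 + theta))
               for yp in range(2)] for y in range(2)])
eigval, eigvec = np.linalg.eig(K.T)
k = np.argmax(eigval.real)
kappa = eigval.real[k]
v = np.abs(eigvec[:, k].real); v /= v.min()
w = np.array([sum(P1[x, y] ** (1 + theta) for x in range(2)) ** (1 / (1 + theta))
              for y in range(2)])

upper = kappa ** (n - 1) * (v @ w)
lower = upper / v.max()
print(lower <= total <= upper, (lower, total, upper))
```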

Appendix K. Proof of Theorem 5

For arbitrary ρ̃ ∈ ℝ, we set α := P_{XY}{X ≠ d(e(X), Y)} and β := P_{XY,ρ̃}{X ≠ d(e(X), Y)}, where:
P_{XY,ρ}(x, y) := P_{XY}(x, y)^{1−ρ} Q_Y(y)^{ρ} e^{−ϕ(ρ; P_{XY}|Q_Y)}.
Then, by the monotonicity of the Rényi divergence, we have:
s D_{1+s}(P_{XY,ρ̃} ‖ P_{XY}) ≥ log [β^{1+s} α^{−s} + (1−β)^{1+s} (1−α)^{−s}]
≥ log β^{1+s} α^{−s}.
Thus, we have:
−log α ≤ [ϕ((1+s)ρ̃; P_{XY}|Q_Y) − (1+s) ϕ(ρ̃; P_{XY}|Q_Y) − (1+s) log β] / s.
Now, by using Lemma 18, we have:
1 − β ≤ P_{XY,ρ̃}{log [Q_Y(y)/P_{XY,ρ̃}(x, y)] ≥ γ} + M e^{−γ}.
We also have, for any σ ≥ 0,
P_{XY,ρ̃}{log [Q_Y(y)/P_{XY,ρ̃}(x, y)] ≥ γ}
≤ Σ_{x,y} P_{XY,ρ̃}(x, y) e^{σ [log (Q_Y(y)/P_{XY,ρ̃}(x, y)) − γ]}
= e^{−[σγ − ϕ(σ; P_{XY,ρ̃}|Q_Y)]}.
Thus, by setting γ so that:
σγ − ϕ(σ; P_{XY,ρ̃}|Q_Y) = γ − R,
we have
1 − β ≤ 2 e^{−[σR − ϕ(σ; P_{XY,ρ̃}|Q_Y)] / (1−σ)}.
Furthermore, we have the relation:
ϕ(σ; P_{XY,ρ̃}|Q_Y) = log Σ_{x,y} P_{XY,ρ̃}(x, y)^{1−σ} Q_Y(y)^{σ}
= log Σ_{x,y} [P_{XY}(x, y)^{1−ρ̃} Q_Y(y)^{ρ̃} e^{−ϕ(ρ̃; P_{XY}|Q_Y)}]^{1−σ} Q_Y(y)^{σ}
= −(1−σ) ϕ(ρ̃; P_{XY}|Q_Y) + log Σ_{x,y} P_{XY}(x, y)^{1−ρ̃−σ(1−ρ̃)} Q_Y(y)^{ρ̃+σ(1−ρ̃)}
= ϕ(ρ̃ + σ(1−ρ̃); P_{XY}|Q_Y) − (1−σ) ϕ(ρ̃; P_{XY}|Q_Y).
Thus, by substituting ρ̃ = −θ̃ and σ = −ϑ and by using (A22), we can derive (124).
Now, we restrict the range of ρ̃ so that ρ(a(R)) < ρ̃ < 1 and take:
σ = [ρ(a(R)) − ρ̃] / (1 − ρ̃).
Then, by substituting this into (A119) and (A119) into (A115), we have (ϕ(ρ; P_{XY}|Q_Y) is abbreviated as ϕ(ρ)):
[σR − ϕ(ρ̃ + σ(1−ρ̃)) + (1−σ) ϕ(ρ̃)] / (1−σ)
= [(ρ(a(R)) − ρ̃) R − (1−ρ̃) ϕ(ρ(a(R))) + (1 − ρ(a(R))) ϕ(ρ̃)] / (1 − ρ(a(R)))
= [(ρ(a(R)) − ρ̃) {(1 − ρ(a(R))) a(R) + ϕ(ρ(a(R)))} − (1−ρ̃) ϕ(ρ(a(R))) + (1 − ρ(a(R))) ϕ(ρ̃)] / (1 − ρ(a(R)))
= (ρ(a(R)) − ρ̃) a(R) − ϕ(ρ(a(R))) + ϕ(ρ̃),
where we used (A24) in the second equality. Thus, by substituting ρ̃ = −θ̃ and by using (A22) again, we have (125).

Appendix L. Proof of Theorem 16

Let:
P_{X^n Y^n,ρ}(x^n, y^n) := P_{X^n Y^n}(x^n, y^n)^{1−ρ} Q_{Y^n}(y^n)^{ρ} e^{−ϕ(ρ; P_{X^n Y^n}|Q_{Y^n})},
and let P B n | A n , ρ be a conditional additive channel defined by:
P B n | A n , ρ ( a n + x n | a n ) = P X n Y n , ρ ( x n , y n ) .
We also define the joint distribution of the message, the input, the output, and the decoded message for each channel:
P M n A n B n M ^ n ( m , a n , b n , m ^ ) : = 1 M n 1 [ e n ( m ) = a n ] P B n | A n ( b n | a n ) 1 [ d n ( b n ) = m ^ ] ,
P M n A n B n M ^ n , ρ ( m , a n , b n , m ^ ) : = 1 M n 1 [ e n ( m ) = a n ] P B n | A n , ρ ( b n | a n ) 1 [ d n ( b n ) = m ^ ] .
For arbitrary ρ̃ ∈ ℝ, let α := P_{M_n M̂_n}{m ≠ m̂} and β := P_{M_n M̂_n,ρ̃}{m ≠ m̂}. Then, by the monotonicity of the Rényi divergence, we have:
s D_{1+s}(P_{A^n B^n,ρ̃} ‖ P_{A^n B^n}) ≥ s D_{1+s}(P_{M_n M̂_n,ρ̃} ‖ P_{M_n M̂_n})
≥ log [β^{1+s} α^{−s} + (1−β)^{1+s} (1−α)^{−s}]
≥ log β^{1+s} α^{−s}.
Thus, we have:
−log α ≤ [s D_{1+s}(P_{A^n B^n,ρ̃} ‖ P_{A^n B^n}) − (1+s) log β] / s.
Here, we have:
D_{1+s}(P_{A^n B^n,ρ̃} ‖ P_{A^n B^n}) = D_{1+s}(P_{X^n Y^n,ρ̃} ‖ P_{X^n Y^n}).
On the other hand, from Lemma 25, we have:
1 − β ≤ P_{X^n Y^n,ρ̃}{log [Q_{Y^n}(y^n)/P_{X^n Y^n,ρ̃}(x^n, y^n)] ≥ n log|A| − γ} + e^{R} e^{−(n log|A| − γ)}.
Thus, by the same argument as in (A111)–(A119) and by noting (A22), we can derive (222).
Now, we restrict the range of ρ̃ so that ρ(a(R)) < ρ̃ < 1 and take:
σ = [ρ(a(R)) − ρ̃] / (1 − ρ̃).
Then, by noting (A22), we have (223).

References

  1. Polyanskiy, Y.; Poor, H.V.; Verdu, S. Channel coding rate in the finite blocklength regime. IEEE Trans. Inf. Theory 2010, 56, 2307–2359. [Google Scholar] [CrossRef]
  2. Verdú, S.; Han, T.S. A general formula for channel capacity. IEEE Trans. Inf. Theory 1994, 40, 1147–1157. [Google Scholar] [CrossRef] [Green Version]
  3. Han, T.S. Information-Spectrum Methods in Information Theory; Springer: Berlin/Heidelberg, Germany, 2003. [Google Scholar]
  4. Hayashi, M.; Nagaoka, H. General formulas for capacity of classical-quantum channels. IEEE Trans. Inf. Theory 2003, 49, 1753–1768. [Google Scholar] [CrossRef] [Green Version]
  5. Wang, L.; Renner, R. One-shot classical-quantum capacity and hypothesis testing. Phys. Rev. Lett. 2012, 108, 200501. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  6. Gallager, R.G. A simple derivation of the coding theorem and some applications. IEEE Trans. Inf. Theory 1965, 11, 3–18. [Google Scholar] [CrossRef] [Green Version]
  7. Polyanskiy, Y. Channel coding: Non-Asymptotic Fundamental Limits. Ph.D. Dissertation, Princeton University, Princeton, NJ, USA, November 2010. [Google Scholar]
  8. Tomamichel, M.; Hayashi, M. A hierarchy of information quantities for finite block length analysis of quantum tasks. IEEE Trans. Inform. Theory 2013, 59, 7693–7710. [Google Scholar] [CrossRef] [Green Version]
  9. Matthews, W.; Wehner, S. Finite blocklength converse bounds for quantum channels. IEEE Trans. Inf. Theory 2014, 60, 7317–7329. [Google Scholar] [CrossRef]
  10. Gallager, R.G. Information Theory and Reliable Communication; John Wiley & Sons: Hoboken, NJ, USA, 1968. [Google Scholar]
  11. Hayashi, M. Information spectrum approach to second-order coding rate in channel coding. IEEE Trans. Inf. Theory 2009, 55, 4947–4966. [Google Scholar] [CrossRef] [Green Version]
  12. Hayashi, M. Second-order asymptotics in fixed-length source coding and intrinsic randomness. IEEE Trans. Inf. Theory 2008, 54, 4619–4637. [Google Scholar] [CrossRef] [Green Version]
  13. Altug, Y.; Wagner, A.B. Moderate deviation analysis of channel coding: Discrete memoryless case. In Proceedings of the IEEE International Symposium on Information Theory, Austin, TX, USA, 13–18 June 2010; pp. 265–269. [Google Scholar]
  14. He, D.; Lastras-Montano, L.A.; Yang, E.; Jagmohan, A.; Chen, J. On the redundancy of Slepian-Wolf coding. IEEE Trans. Inf. Theory 2009, 55, 5607–5627. [Google Scholar] [CrossRef]
  15. Tan, V.Y.F. Moderate-deviations of lossy source coding for discrete and Gaussian sources. In Proceedings of the 2012 IEEE International Symposium on Information Theory, Cambridge, MA, USA, 1–6 July 2012; pp. 920–924. [Google Scholar]
  16. Kuzuoka, S. A simple technique for bounding the redundancy of source coding with side-information. In Proceedings of the 2012 IEEE International Symposium on Information Theory, Cambridge, MA, USA, 1–6 July 2012; pp. 915–919. [Google Scholar]
  17. Yang, E.; Meng, J. New nonasymptotic channel coding theorems for structured codes. IEEE Trans. Inf. Theory 2015, 61, 4534–4553. [Google Scholar] [CrossRef]
  18. Arimoto, S. Information measures and capacity of order α for discrete memoryless channels. In Colloquia Mathematica Societatis Janos Bolyai, 16. Topics in Information Theory; Elsevier: Amsterdam, The Netherlands, 1975; pp. 41–52. [Google Scholar]
  19. Hayashi, M. Exponential decreasing rate of leaked information in universal random privacy amplification. IEEE Trans. Inf. Theory 2011, 57, 3989–4001. [Google Scholar] [CrossRef] [Green Version]
  20. Teixeira, A.; Matos, A.; Antunes, L. Conditional Rényi entropies. IEEE Trans. Inf. Theory 2012, 58, 4273–4277. [Google Scholar] [CrossRef]
  21. Iwamoto, M.; Shikata, J. Information theoretic security for encryption based on conditional Rényi entropies. In Information Theoretic Security ICITS 2013; Padró, C., Ed.; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; pp. 103–121. [Google Scholar]
  22. Hayashi, M.; Watanabe, S. Information geometry approach to parameter estimation in Markov chains. Ann. Stat. 2016, 44, 1495–1535. [Google Scholar] [CrossRef] [Green Version]
  23. Watanabe, S.; Hayashi, M. Finite-length analysis on tail probability and simple hypothesis testing for Markov chain. Ann. Appl. Probab. 2017, 27, 811–845. [Google Scholar] [CrossRef] [Green Version]
  24. Wyner, A.D. Recent results in the Shannon theory. IEEE Trans. Inf. Theory 1974, 20, 2–10. [Google Scholar] [CrossRef]
  25. Csiszár, I. Linear codes for sources and source networks: Error exponents, universal coding. IEEE Trans. Inf. Theory 1982, 28, 585–592. [Google Scholar] [CrossRef]
  26. Ahlswede, R.; Dueck, G. Good codes can be produced by a few permutations. IEEE Trans. Inf. Theory 1982, 28, 430–443. [Google Scholar] [CrossRef]
  27. Chen, J.; He, D.-K.; Jagmohan, A.; Lastras-Montano, L.A.; Yang, E. On the linear codebook-level duality between Slepian-Wolf coding and channel coding. IEEE Trans. Inf. Theory 2009, 55, 5575–5590. [Google Scholar] [CrossRef]
  28. Hayashi, M. Tight exponential analysis of universally composable privacy amplification and its applications. IEEE Trans. Inf. Theory 2013, 59, 7728–7746. [Google Scholar] [CrossRef] [Green Version]
  29. Gilbert, E.N. Capacity of a burst-noise channel. Bell Syst. Tech. J. 1960, 39, 1253–1265. [Google Scholar] [CrossRef]
  30. Elliott, E.O. Estimates of error rates for codes on burst-noise channels. Bell Syst. Tech. J. 1963, 42, 1977–1997. [Google Scholar] [CrossRef]
  31. Delsarte, P.; Piret, P. Algebraic construction of Shannon codes for regular channels. IEEE Trans. Inf. Theory 1982, 28, 593–599. [Google Scholar] [CrossRef]
  32. Tomamichel, M.; Tan, V.Y.F. Second-order coding rates for channels with state. IEEE Trans. Inf. Theory 2014, 60, 4427–4448. [Google Scholar] [CrossRef] [Green Version]
  33. Kemeny, J.G.; Snell, J. Finite Markov Chains; Springer: Berlin/Heidelberg, Germany, 1976. [Google Scholar]
  34. Feller, W. An Introduction to Probability Theory and Its Applications; Wiley: Hoboken, NJ, USA, 1971. [Google Scholar]
  35. Tikhomirov, A.N. On the convergence rate in the central limit theorem for weakly dependent random variables. Theory Probab. Appl. 1980, 25, 790–890. [Google Scholar] [CrossRef]
  36. Kontoyiannis, I.; Meyn, S.P. Spectral theory and limit theorems for geometrically ergodic Markov processes. Ann. Appl. Probab. 2003, 13, 304–362. [Google Scholar] [CrossRef]
  37. Hervé, L.; Ledoux, J.; Patilea, V. A uniform Berry-Esseen theorem on m-estimators for geometrically ergodic Markov chains. Bernoulli 2012, 18, 703–734. [Google Scholar] [CrossRef]
  38. Watanabe, S.; Hayashi, M. Non-asymptotic analysis of privacy amplification via Rényi entropy and inf-spectral entropy. In Proceedings of the 2013 IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013; pp. 2715–2719. [Google Scholar]
  39. Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley & Sons: Hoboken, NJ, USA, 2006. [Google Scholar]
  40. Davisson, L.D.; Longo, G.; Sgarro, A. The error exponent for the noiseless encoding of finite ergodic Markov sources. IEEE Trans. Inform. Theory 1981, 27, 431–438. [Google Scholar] [CrossRef]
  41. Vašek, K. On the error exponent for ergodic Markov source. Kybernetika 1980, 16, 318–329. [Google Scholar]
  42. Zhong, Y.; Alajaji, F.; Campbell, L.L. Joint source-channel coding error exponent for discrete communication systems with Markovian memory. IEEE Trans. Inf. Theory 2007, 53, 4457–4472. [Google Scholar] [CrossRef]
  43. Polyanskiy, Y.; Poor, H.V.; Verdu, S. Dispersion of the Gilbert-Elliott channel. IEEE Trans. Inf. Theory 2011, 57, 1829–1848. [Google Scholar] [CrossRef] [Green Version]
  44. Kontoyiannis, I. Second-order noiseless source coding theorems. IEEE Trans. Inform. Theory 1997, 43, 1339–1341. [Google Scholar] [CrossRef]
  45. Kontoyiannis, I.; Verdú, S. Optimal lossless data compression: Non-asymptotics and asymptotics. IEEE Trans. Inf. Theory 2014, 60, 777–795. [Google Scholar] [CrossRef]
  46. Scarlett, J.; Martinez, A.; Fábregas, A.G.i. Mismatched decoding: Error exponents, second-order rates and saddlepoint approximations. IEEE Trans. Inform. Theory 2014, 60, 2647–2666. [Google Scholar] [CrossRef] [Green Version]
  47. Scarlett, J.; Martinez, A.; i Fabregas, A.G. The saddlepoint approximation: A unification of exponents, dispersions and moderate deviations. arXiv 2014, arXiv:1402.3941. [Google Scholar]
  48. Ben-Ari, I.; Neumann, M. Probabilistic approach to Perron root, the group inverse, and applications. Linear Multilinear Algebra 2010, 60, 39–63. [Google Scholar] [CrossRef]
  49. Lalley, S.P. Ruelle’s Perron-Frobenius theorem and the central limit theorem for additive functionals of one-dimensional Gibbs states. Adapt. Stat. Proced. Relat. Top. 1986, 8, 428–446. [Google Scholar]
  50. Kato, T. Perturbation Theory for Linear Operators; Springer: New York, NY, USA, 1980. [Google Scholar]
  51. Häggström, O.; Rosenthal, J.S. On the central limit theorem for geometrically ergodic Markov chains. Electron. Commun. Probab. 2007, 12, 454–464. [Google Scholar]
  52. Kipnis, C.; Varadhan, S.R.S. Central limit theorem for additive functionals of reversible Markov processes and applications to simple exclusions. Commun. Math. Phys. 1986, 104, 1–19. [Google Scholar] [CrossRef]
  53. Komorowski, T.; Landim, C.; Olla, S. Fluctuations in Markov Processes: Time Symmetry and Martingale Approximation; Springer: Berlin, Germany, 2012. [Google Scholar]
  54. Meyn, S.P.; Tweedie, R.L. Markov Chains and Stochastic Stability; Springer: London, UK, 1993. [Google Scholar]
  55. Jones, G.L. On the Markov chain central limit theorem. Probab. Surv. 2004, 1, 299–320. [Google Scholar] [CrossRef]
  56. Tomamichel, M.; Berta, M.; Hayashi, M. Relating different quantum generalizations of the conditional Rényi entropy. J. Math. Phys. 2014, 55, 082206. [Google Scholar] [CrossRef] [Green Version]
  57. Hayashi, M. Large deviation analysis for classical and quantum security via approximate smoothing. IEEE Trans. Inf. Theory 2014, 60, 6702–6732. [Google Scholar] [CrossRef]
  58. Csiszár, I. Generalized cutoff rates and Rényi’s information measures. IEEE Trans. Inf. Theory 1995, 41, 6–34. [Google Scholar] [CrossRef]
  59. Polyanskiy, Y.; Verdú, S. Arimoto channel coding converse and Rényi divergence. In Proceedings of the 2010 48th Annual Allerton Conference on Communication, Control, and Computing (Allerton), Allerton, IL, USA, 29 September–1 October 2010; pp. 1327–1333. [Google Scholar]
  60. Hayashi, M.; Watanabe, S. Uniform random number generation from Markov chains: Non-asymptotic and asymptotic analyses. IEEE Trans. Inform. Theory 2016, 62, 1795–1822. [Google Scholar] [CrossRef] [Green Version]
  61. Kemeny, J.G.; Snell, J.L. Finite Markov Chains; Springer: New York, NY, USA, 1960. [Google Scholar]
  62. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 1985. [Google Scholar]
  63. Wegman, M.N.; Carter, J.L. New hash functions and their use in authentication and set equality. J. Comput. Syst. Sci. 1981, 22, 265–279. [Google Scholar] [CrossRef] [Green Version]
  64. Gallager, R.G. Source coding with side-information and universal coding. Proc. IEEE Int. Symp. Inf. Theory 1976. Available online: http://web.mit.edu/gallager/www/papers/paper5.pdf (accessed on 5 April 2020).
  65. Hayashi, M. Quantum Information: An Introduction; Springer: Berlin/Heidelberg, Germany, 2006. [Google Scholar]
  66. Renner, R.; Wolf, S. Simple and tight bounds for information reconciliation and privacy amplification. In Advances in Cryptology – ASIACRYPT 2005; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2005; pp. 199–216. [Google Scholar]
  67. Billingsley, P. Probability and Measure; John Wiley & Sons: Hoboken, NJ, USA, 1995. [Google Scholar]
  68. Cover, T. A proof of the data compression theorem of Slepian and Wolf for ergodic sources. IEEE Trans. Inf. Theory 1975, 21, 226–228. [Google Scholar] [CrossRef] [Green Version]
  69. Dembo, A.; Zeitouni, O. Large Deviations Techniques and Applications, 2nd ed.; Springer: Berlin/Heidelberg, Germany, 1998. [Google Scholar]
  70. Hayashi, M. Error exponent in asymmetric quantum hypothesis testing and its application to classical-quantum channel coding. Phys. Rev. A 2007, 76, 062301. [Google Scholar] [CrossRef] [Green Version]
  71. Hayashi, M. Security analysis of ε-almost dual universal2 hash functions. IEEE Trans. Inf. Theory 2016, 62, 3451–3476. [Google Scholar] [CrossRef] [Green Version]
Figure 1. A comparison of the upper conditional Rényi entropy H^↑_{1+θ,W}(X|Y) (upper red curve) and the lower conditional Rényi entropy H_{1+θ,W}(X|Y) (lower blue curve) for the transition matrix of Example 3 with q_0 = q_1 = 0.1, p_0 = 0.1, and p_1 = 0.4. The horizontal axis is θ, and the vertical axis is the value of the information measures (nats).
Figure 2. The description of the transition matrix in (173).
Figure 3. A comparison of the bounds for p = 0.1, q = 0.2, and ε = 10^{−3}. The horizontal axis is the block length n, and the vertical axis is the rate R (nats). The upper red curve is the achievability bound in Theorem 7. The middle blue curve is the converse bound in Theorem 8. The lower purple line is the first-order asymptotics given by the entropy H_W(X).
Figure 4. A comparison of the bounds for p = 0.1, q = 0.2, and n = 10,000. The horizontal axis is log_{10}(ε), and the vertical axis is the rate R (nats). The upper red curve is the achievability bound in Theorem 7. The middle blue curve is the converse bound in Theorem 8. The lower purple line is the first-order asymptotics given by the entropy H_W(X).
Figure 5. The binary erasure symmetric channel.
Table 1. Summary of asymptotic results and finite-length bounds to derive asymptotic results under Assumptions 1 and 2, which are abbreviated to Ass. 1 and Ass. 2.
ProblemFirst-OrderLarge DeviationModerate DeviationSecond-Order
SC with SISolved (Ass. 1) Solved * (Ass. 2)Solved (Ass. 1),Solved (Ass. 1)
O ( 1 ) O ( 1 ) Tail
CC for ConditionalSolved (Ass. 1) Solved * (Ass. 2)Solved (Ass. 1)Solved (Ass. 1)
Additive Channels O ( 1 ) O ( 1 ) Tail
Table 2. Summary of the bounds for source coding with full side-information. No-side means the case with no side-information.
Ach./Conv.MarkovSingle-Shot P s / P ¯ s ComplexityLarge DeviationModerate DeviationSecond Order
AchievabilityTheorem 6 (Ass. 1)Lemma 15 P ¯ s O ( 1 ) 🗸
Theorem 9 (Ass. 2)Lemma 14 P ¯ s O ( 1 ) 🗸 * 🗸
Theorem 7 (No-side)Lemma 16 P ¯ s O ( 1 ) 🗸 * 🗸
Lemma 13 P ¯ s Tail 🗸 🗸
ConverseTheorem 8 (Ass. 1)(Theorem 5) P s O ( 1 ) 🗸
Theorem 10 (Ass. 2)Corollary 1 P s O ( 1 ) 🗸 * 🗸
Theorem 8 (No-side)(Theorem 5) P s O ( 1 ) 🗸 * 🗸
Lemma 18 P s Tail 🗸 🗸
Table 3. Summary of the finite-length bounds for channel coding.
Ach./Conv.MarkovSingle-Shot P c / P ¯ c ComplexityLarge DeviationModerate DeviationSecond Order
AchievabilityTheorem 17 (Ass. 1)Lemma 22 P ¯ c O ( 1 ) 🗸
Theorem 19 (Ass. 2)Lemma 21 P ¯ c O ( 1 ) 🗸 * 🗸
Theorem 21 (Additive)Lemma 23 P ¯ c O ( 1 ) 🗸 * 🗸
Lemma 20 P ¯ c Tail 🗸 🗸
ConverseTheorem 18 (Ass. 1)(Theorem 16) P c O ( 1 ) 🗸
Theorem 20 (Ass. 2)Theorem 16 P c O ( 1 ) 🗸 * 🗸
Theorem 18 (Additive)(Theorem 16) P c O ( 1 ) 🗸 * 🗸
Lemma 25 P c Tail 🗸 🗸
