Article

On the Optimal Error Exponent of Type-Based Distributed Hypothesis Testing †

Xinyi Tong, Xiangxiang Xu and Shao-Lun Huang
1 Tsinghua–Berkeley Shenzhen Institute, Shenzhen 518055, China
2 Tsinghua Shenzhen International Graduate School, Shenzhen 518055, China
* Author to whom correspondence should be addressed.
† This work was presented in part at the 2021 IEEE International Symposium on Information Theory (ISIT), Melbourne, Victoria, Australia, 12–20 July 2021.
Current address: Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA 02139, USA.
Entropy 2023, 25(10), 1434; https://doi.org/10.3390/e25101434
Submission received: 4 August 2023 / Revised: 1 October 2023 / Accepted: 8 October 2023 / Published: 10 October 2023
(This article belongs to the Special Issue Information Theory for Distributed Systems)

Abstract: Distributed hypothesis testing (DHT) has emerged as a significant research area, but the information-theoretic optimality of coding strategies is often hard to address. This paper studies DHT problems under the type-based setting, which is motivated by popular federated learning methods. Specifically, two communication models are considered: (i) the DHT problem over noiseless channels, where each node observes i.i.d. samples and sends a one-dimensional statistic of the observed samples to the decision center for decision making; and (ii) the DHT problem over AWGN channels, where the distributed nodes are restricted to transmitting functions of the empirical distributions of the observed data sequences due to practical computational constraints. For both problems, we present the optimal error exponent by providing both the achievability and converse results, and we offer the corresponding coding strategies and decision rules. Our results not only offer coding guidance for distributed systems, but also have the potential to be applied to more complex problems, enhancing the understanding and application of DHT in various domains.

1. Introduction

Distributed hypothesis testing (DHT) is a significant problem in the field of information theory [1]. In this problem, each distributed node observes partial data generated from a joint distribution and transmits an encoded message through a communication channel to a decision center, aiming to detect the true hypothesis. The primary goal of DHT is to maximize the decision error exponent in the asymptotic regime, and many different communication models [2,3,4,5,6] have been considered in the previous literature. The main challenges of DHT arise in two respects. Firstly, due to the intricate distributed structures, most of the existing works have focused on demonstrating achievability results, with converse results being limited to specific cases, such as the 1-bit [3], $\log_2 3$-bit [7], and $O(\log_2 n)$-bit [1] communication channels. Secondly, many of the achievability results were established using random coding with auxiliary random variables [8], which are difficult to implement in real systems.
Notice that the distributed encoders in many real applications are required to process high-dimensional data [9], such as images, texts, and audio. Consequently, many federated learning algorithms focus on computing quantities such as statistics, empirical risks, and gradients of the data [10], which can be viewed as certain functions of the empirical distribution (type) of the data (for example, given the data $x_1, \ldots, x_n$ and a feature function $f(x)$, the statistic $\frac{1}{n}\sum_{i=1}^{n} f(x_i) = \sum_{x} \hat{P}_X(x) f(x)$ is a linear function of the empirical distribution $\hat{P}_X$).
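To make this concrete, here is a minimal numerical sketch (not from the paper; the alphabet, feature values, and sampling distribution are illustrative) showing that an empirical-mean statistic is exactly a linear function of the type:

```python
import numpy as np

rng = np.random.default_rng(0)

alphabet = np.arange(4)                      # toy alphabet X = {0, 1, 2, 3}
f = np.array([0.5, -1.0, 2.0, 0.0])          # an arbitrary feature function f(x)
x = rng.choice(alphabet, size=1000, p=[0.1, 0.2, 0.3, 0.4])

# Empirical distribution (type) of the observed samples.
p_hat = np.bincount(x, minlength=len(alphabet)) / len(x)

# (1/n) sum_i f(x_i) coincides with sum_x p_hat(x) f(x): the statistic depends
# on the data only through the type, and linearly so.
assert np.isclose(f[x].mean(), p_hat @ f)
```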
Motivated by this observation, we investigate the optimal decision error exponent of DHT based on empirical distributions (type-based) under two common communication models. The first problem considers a noiseless channel, which is the typical mathematical model in real federated learning scenarios: federated learning often assumes that the nodes and the center machine can exchange information precisely, while the dimensionality of the transmitted signals is limited [9]. Specifically, it is assumed that each node can only transmit the empirical mean of a one-dimensional feature, and such settings have gained significant attention recently in federated and multi-modal machine learning [9,11]. The second problem assumes that the signal of each node, encoded from the empirical distribution, is transmitted over an additive white Gaussian noise (AWGN) channel, which is a widely used mathematical model for real-world channels [12]. The main goal of this paper is to establish the optimal error exponent for the aforementioned two problems by presenting: (i) the converse bound for the error exponent; and (ii) a practical coding strategy that achieves the converse bound.
The contributions of this paper are summarized as follows. First, in Section 4.1, we demonstrate the optimal error exponent for the type-based hypothesis testing over noiseless channels, where one-dimensional functions for all nodes and the corresponding decision rule are provided. Moreover, by applying the information geometric approach in [13], the hypotheses and the feature functions of each node can be modeled as vectors in the joint and marginal distribution spaces, respectively. In Section 4.3, the optimal feature function of each node can be interpreted as a decomposition of the hypothesis vector in the joint distribution space into vectors in the marginal distribution spaces, where each decomposed component indicates the contribution of the corresponding node in making the inference.
Second, we establish the optimal error exponent of type-based hypothesis testing over AWGN channels by presenting both the achievability and converse results. In particular, the achievability part is based on a mixture of the amplify-and-forward and decode-and-forward coding strategies. Specifically, when the observed empirical distribution at a distributed node is sufficiently close to one of the true marginal distributions with respect to the two hypotheses, the node is confident of the true hypothesis. In this case, we apply the decode-and-forward strategy, which first estimates the true hypothesis based on the observed empirical distribution and then applies binary phase shift keying (BPSK) to transmit the decoded bit to the decision center. On the other hand, when the observed empirical distribution is far from both true marginal distributions, we apply the amplify-and-forward strategy, which encodes and transmits the observed empirical distribution by pulse amplitude modulation (PAM) to the decision center. By applying the proposed coding strategy and conducting the log-likelihood ratio test at the decision center, we show the achievable error exponent in Section 5.2. Finally, we demonstrate the converse result of the error exponent in Section 5.3 based on a genie-aided approach. The main idea is to provide additional information to the distributed nodes: by either revealing the true hypothesis to the distributed nodes or eliminating the channel noises, we show that the error exponent in Section 5.2 is also an upper bound of the optimal error exponent, which establishes the optimality.

2. Problem Formulations

Suppose that there are $K$ random variables $X^K \triangleq (X_1, \ldots, X_K)$. In this paper, we consider the binary hypothesis testing problem, where the two hypotheses $H_0$ and $H_1$ are defined as:
$$H_0: \bigl(x_1^{(1)}, \ldots, x_K^{(1)}\bigr), \ldots, \bigl(x_1^{(n)}, \ldots, x_K^{(n)}\bigr) \overset{\text{i.i.d.}}{\sim} P_{X^K}^{(0)}, \qquad H_1: \bigl(x_1^{(1)}, \ldots, x_K^{(1)}\bigr), \ldots, \bigl(x_1^{(n)}, \ldots, x_K^{(n)}\bigr) \overset{\text{i.i.d.}}{\sim} P_{X^K}^{(1)},$$
where the observable data are generated i.i.d. according to either $P_{X^K}^{(0)}$ or $P_{X^K}^{(1)}$ on the alphabet $\mathcal{X}_1 \times \cdots \times \mathcal{X}_K$. In addition, we assume that there are $K$ distributed nodes, where the $k$-th node ($k = 1, \ldots, K$) can only observe the samples $\mathbf{X}_k \triangleq \{x_k^{(1)}, \ldots, x_k^{(n)}\}$. To facilitate clarity in our illustration, we concentrate on the discrete case, assuming that each alphabet $\mathcal{X}_k$ is discrete, and we write $\mathcal{X} \triangleq \mathcal{X}_1 \times \cdots \times \mathcal{X}_K$. In addition, for a joint distribution $Q_{X^K} \in \mathcal{P}^{\mathcal{X}}$, we use $[Q_{X^K}]_{X_k}$ to denote its marginal distribution with respect to $X_k$. We also denote $P_{X_1}^{(i)}, \ldots, P_{X_K}^{(i)}$ as the marginal distributions of $P_{X^K}^{(i)}$, for $i = 0, 1$. In the distributed hypothesis testing problem, we adopt a common assumption in the distributed setup [14] that the generating distributions $P_{X^K}^{(0)}$ and $P_{X^K}^{(1)}$ satisfy $D\bigl(P_{X^K}^{(1)} \,\big\|\, P_{X^K}^{(0)}\bigr) < \infty$ and $D\bigl(P_{X^K}^{(0)} \,\big\|\, P_{X^K}^{(1)}\bigr) < \infty$, to avoid trivial irregularities. Due to the type-based restriction, we further assume that $P_{X_k}^{(0)} \neq P_{X_k}^{(1)}$, $k = 1, \ldots, K$; otherwise, the transmitted message, as a function of the empirical distribution, would be uninformative for distinguishing the hypotheses. In the following, we denote by $\hat{P}_{X_k}$ the empirical distribution of $\mathbf{X}_k$, defined as:
$$\hat{P}_{X_k}(x_k) \triangleq \frac{1}{n} \sum_{i=1}^{n} \mathbb{1}\bigl\{x_k = x_k^{(i)}\bigr\}.$$

2.1. Type-Based Hypothesis Testing over Noiseless Channels

As shown in Figure 1, node $k$ ($k = 1, \ldots, K$) encodes the observed data $\mathbf{X}_k$ and transmits a scalar signal through the function $u_k$. Due to the computational requirements introduced in Section 1, we impose the restriction that the encoder $u_k$ depends on the data only through the empirical distribution $\hat{P}_{X_k}$, i.e., $u_k: \mathcal{P}^{\mathcal{X}_k} \to \mathbb{R}$, where $\mathcal{P}^{\mathcal{X}_k}$ denotes the set of probability distributions on the alphabet $\mathcal{X}_k$. The most direct approach would be to encode the empirical distributions themselves into the real space, but this can be computationally demanding for federated learning data. In this paper, we further consider one of the most commonly used approaches in federated learning [15,16] and assume that $u_k$ computes a one-dimensional statistic
$$u_k(\hat{P}_{X_k}) = \frac{1}{n} \sum_{i=1}^{n} f_k\bigl(x_k^{(i)}\bigr) = \mathbb{E}_{\hat{P}_{X_k}}\bigl[f_k(X_k)\bigr],$$
where the feature function $f_k: \mathcal{X}_k \to \mathbb{R}$. Then, the decision center collects the statistics $\{u_k(\hat{P}_{X_k})\}_{k=1}^{K}$ and makes a decision $\hat{H}$ on the true hypothesis. We prove in Section 4 that this further restriction to empirical means of features incurs no loss of generality: the decisions can be made as well as if the types themselves were observed. Additionally, the error probability is defined as
$$P_n(\hat{H} \neq H) \triangleq \sum_{i \in \{0,1\}} P_H(H_i)\, P_n(\hat{H} \neq H \mid H = H_i),$$
where H denotes the true hypothesis, P H ( H 0 ) and P H ( H 1 ) are the prior distributions, and P n ( · ) is the probability measure defined from the data sampling process (1). In particular, we focus on the asymptotic error decaying rate, i.e., the error exponent, defined as
$$E \triangleq \lim_{n \to \infty} -\frac{1}{n} \log P_n(\hat{H} \neq H),$$
where all logarithms are base $e$ unless otherwise specified. The goal is to find the maximal error exponent in (4) and to design the feature functions $f_1, \ldots, f_K$ and the corresponding decision rule such that this error exponent is achieved by the log-likelihood ratio test (LLRT).

2.2. Type-Based Hypothesis Testing over AWGN Channels

As depicted in Figure 2, we employ the identical hypothesis testing formulation presented in (1). In this setting, nodes $1$ through $K$ encode their respective observations into length-$m$ sequences using functions $g_1, \ldots, g_K$ and transmit them through additive white Gaussian noise (AWGN) channels to the decision center. To accommodate the computational constraints, we restrict the encoder $g_k$ ($k = 1, \ldots, K$) to be a function of the empirical distribution $\hat{P}_{X_k}$, i.e.,
$$g_k: \mathcal{P}^{\mathcal{X}_k} \to \mathbb{R}^m, \quad k = 1, \ldots, K.$$
Moreover, the average power constraints of the AWGN channels are:
$$\frac{1}{m} \mathbb{E}\bigl[\|g_k(\hat{P}_{X_k})\|^2\bigr] \leq p_k, \quad k = 1, \ldots, K,$$
where the expectations are taken over the data sampling process defined in (1). Then, the decision center makes a decision $\hat{H}$ based on the received signals $g_1(\hat{P}_{X_1}) + Z_1, \ldots, g_K(\hat{P}_{X_K}) + Z_K$, where the noises are drawn from
$$Z_k \sim \mathcal{N}\bigl(0, \sigma_k^2 I_m\bigr), \quad k = 1, \ldots, K,$$
and I m denotes the m × m identity matrix.
Additionally, we make the following assumption to make the errors arising from the AWGN channels and the decision process comparable, so that the trade-off between them can be described. In detail, we assume that the sequence length m also increases with n, and there exists a positive constant μ such that
$$\lim_{n \to \infty} \frac{n}{m(n)} = \mu.$$
Our goal is to design the optimal encoders $g_1, \ldots, g_K$, subject to the constraints (5) and (6), as well as the decision rule $\hat{H}$, such that the error exponent defined in (4) is maximized, where we assume $P_H(H_0) = P_H(H_1) = \frac{1}{2}$ for explicit mathematical expressions.

3. Related Works

Distributed hypothesis testing problems, also known as multiterminal hypothesis testing [1,3,14] or decentralized detection [17,18], have been extensively explored in the literature. In scenarios where each node can observe a single observation and send an encoded message to the central machine, the authors of [17] demonstrated that determining the optimal coding scheme is NP-hard, while [18,19] provided characterizations for the minimum decoding error rate and the optimal coding scheme for conditionally independent nodes.
Furthermore, in situations where each node can observe n samples and transmit an encoded message to the decision center, [3,5,14,20] investigated the optimal decoding error exponents for the case of K = 2 nodes, with [21] generalizing the results to K > 2 nodes. Additionally, the author of [5] studied the Neyman–Pearson-like test, which further constrained the encoded messages to being an empirical functional mean, and provided optimal functions for the scenario with K = 2 nodes. The outcome presented in Section 4 can be perceived as a generalization of such setups to the case with K > 2 nodes.
On the other hand, DHT over noisy channels represents a novel and highly significant sub-problem within the broader context. While current research has primarily focused on transmission over discrete memoryless channels, certain aspects of this sub-problem have been investigated. For instance, some studies have explored scenarios involving side information [22] and testing against independence under Gaussian noise [23]. Additionally, the optimal Type-II error has been examined [24], along with investigations into the optimal pairs of Type-I and Type-II errors [25].
Diverging from the existing literature, the present paper delves into the DHT problem in the context of widely considered AWGN channels while also addressing the implications of computational demands. This novel approach fills a critical research gap and extends the understanding of DHT to a broader set of channel conditions, thus contributing to the advancement of the field.

4. Type-Based Hypothesis Testing over Noiseless Channels

In this section, we present the optimal error exponent along with the corresponding decision rule for the type-based hypothesis testing over noiseless channels. We commence by introducing the optimal error exponent under the condition that the decision center has access to the empirical distributions from different nodes.
Definition 1.
The quantities $D_i^*(R_{X_1}, \ldots, R_{X_K})$, for $i = 0, 1$, are defined as
$$D_i^*(R_{X_1}, \ldots, R_{X_K}) \triangleq \min_{Q_{X^K} \in \mathcal{S}} D\bigl(Q_{X^K} \,\big\|\, P_{X^K}^{(i)}\bigr),$$
where
$$\mathcal{S} \triangleq \bigl\{Q_{X^K} : [Q_{X^K}]_{X_k} = R_{X_k},\ k = 1, \ldots, K\bigr\}$$
represents the set of all joint distributions with the given marginals $R_{X_1}, \ldots, R_{X_K}$.
The following result provides the operational meaning of (9), which can be proved by Sanov’s theorem [12].
Lemma 1.
When $H_i$ is the true hypothesis, the probability that nodes $1, \ldots, K$ observe the empirical distributions $\hat{P}_{X_1}, \ldots, \hat{P}_{X_K}$, respectively, is given by
$$P_n\bigl(\hat{P}_{X_1}, \ldots, \hat{P}_{X_K} \,\big|\, H = H_i\bigr) \doteq \exp\bigl(-n\, D_i^*(\hat{P}_{X_1}, \ldots, \hat{P}_{X_K})\bigr), \quad i = 0, 1,$$
where $\doteq$ is the conventional dot-equal notation, i.e., we write $f_n \doteq g_n$ when $\lim_{n \to \infty} \frac{1}{n} \log f_n = \lim_{n \to \infty} \frac{1}{n} \log g_n$. In addition, by applying the log-likelihood ratio test to detect the true hypothesis, the optimal decision error exponent based on the empirical distributions is
$$E^* \triangleq \min_{R_{X_1}, \ldots, R_{X_K}} \max_{i \in \{0,1\}} D_i^*(R_{X_1}, \ldots, R_{X_K}).$$
Note that the type-based hypothesis testing problem assumes that the signal from each node is a function of the empirical distribution. Hence, the optimal error exponent in (4) cannot exceed $E^*$. In the following, we prove that the error exponent $E^*$ can be achieved and provide the corresponding decision rule.
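As a numerical illustration of Definition 1 and Lemma 1, the following sketch (not the authors' code) computes $D_i^*$ for a toy $K = 2$ example by convex optimization and approximates $E^*$ by a crude grid search over the marginals; the joint distributions, the grid, and the use of the cvxpy library are assumptions made purely for illustration.

```python
import itertools
import numpy as np
import cvxpy as cp

def D_star(R1, R2, P_joint):
    """min_{Q : [Q]_{X1}=R1, [Q]_{X2}=R2} D(Q || P_joint), for a K = 2 toy example."""
    R1, R2 = np.asarray(R1, float), np.asarray(R2, float)
    Q = cp.Variable(P_joint.shape, nonneg=True)
    # cvxpy's kl_div(a, b) = a*log(a/b) - a + b; summed over two distributions
    # it reduces to the KL divergence D(Q || P_joint).
    objective = cp.Minimize(cp.sum(cp.kl_div(Q, P_joint)))
    constraints = [cp.sum(Q, axis=1) == R1,   # marginal with respect to X1
                   cp.sum(Q, axis=0) == R2]   # marginal with respect to X2
    cp.Problem(objective, constraints).solve()
    return objective.value

# Toy joint distributions under H0 and H1 (illustrative values only).
P0 = np.array([[0.40, 0.10],
               [0.10, 0.40]])
P1 = np.array([[0.25, 0.25],
               [0.25, 0.25]])

# E* = min over marginals (R1, R2) of max{D_0*, D_1*}; crude grid search.
grid = np.linspace(0.1, 0.9, 9)
E_star = min(
    max(D_star([r1, 1 - r1], [r2, 1 - r2], P0),
        D_star([r1, 1 - r1], [r2, 1 - r2], P1))
    for r1, r2 in itertools.product(grid, grid)
)
print("E* (approx.):", E_star)
```

For larger alphabets or $K > 2$, the same I-projection-type program applies, with one marginal constraint per node.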

4.1. Optimal Feature

First, we introduce the following definitions of exponential and linear families, which will be useful for delineating our results.
Definition 2
(Exponential family). Given a distribution $P_Z(z)$ and a function $T: \mathcal{Z} \to \mathbb{R}$, we define the distribution $\tilde{P}_Z^{(\lambda)}(\cdot\,; T, P_Z)$ as
$$\tilde{P}_Z^{(\lambda)}(z; T, P_Z) \triangleq P_Z(z) \exp\bigl(\lambda T(z) - \alpha(\lambda)\bigr), \quad \text{for all } z \in \mathcal{Z},$$
with $\alpha(\lambda) \triangleq \log \sum_{z \in \mathcal{Z}} P_Z(z) \exp(\lambda T(z))$. In addition, we use
$$\mathcal{E}_Z(T, P_Z) \triangleq \bigl\{\tilde{P}_Z^{(\lambda)}(\cdot\,; T, P_Z) : \lambda \in \mathbb{R}\bigr\}$$
to denote the exponential family passing through $P_Z$ with $T$ as its natural statistic.
Definition 3
(Linear family). Given a function $h: \mathcal{Z} \to \mathbb{R}$, we define the linear family $\mathcal{L}_Z(h)$ as
$$\mathcal{L}_Z(h) \triangleq \bigl\{Q_Z \in \mathcal{P}^{\mathcal{Z}} : \mathbb{E}_{Q_Z}[h(Z)] = 0\bigr\}.$$
In addition, we define the half-spaces $\mathcal{S}_Z^{(0)}(h)$ and $\mathcal{S}_Z^{(1)}(h)$ as
$$\mathcal{S}_Z^{(0)}(h) \triangleq \bigl\{Q_Z \in \mathcal{P}^{\mathcal{Z}} : \mathbb{E}_{Q_Z}[h(Z)] \geq 0\bigr\}, \qquad \mathcal{S}_Z^{(1)}(h) \triangleq \bigl\{Q_Z \in \mathcal{P}^{\mathcal{Z}} : \mathbb{E}_{Q_Z}[h(Z)] \leq 0\bigr\}.$$
Then, for $i = 0, 1$ and $t > 0$, we define the sets
$$\mathcal{D}_i(t) \triangleq \bigl\{(R_{X_1}, \ldots, R_{X_K}) : D_i^*(R_{X_1}, \ldots, R_{X_K}) < t\bigr\}.$$
We also define $\mathcal{D}(t) \triangleq \mathcal{D}_0(t) \cap \mathcal{D}_1(t)$. It can be verified that, for all $t \geq 0$, both $\mathcal{D}_0(t)$ and $\mathcal{D}_1(t)$ are convex subsets of $\mathcal{P}^{\mathcal{X}_1} \times \cdots \times \mathcal{P}^{\mathcal{X}_K}$, and thus $\mathcal{D}(t)$ is also convex. In addition, we have the following lemma.
Lemma 2.
For $E^*$ as defined in (10), we have $\mathcal{D}(t) = \emptyset$ for all $t \in [0, E^*]$ and $\mathcal{D}(t) \neq \emptyset$ for all $t > E^*$. Additionally, a unique $(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) \in \mathcal{P}^{\mathcal{X}_1} \times \cdots \times \mathcal{P}^{\mathcal{X}_K}$ exists such that
$$D_0^*(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) = D_1^*(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) = E^*.$$
Proof. 
See Appendix A. □
Based on Lemma 2, it follows from the separating hyperplane theorem (see, e.g., Section 2.5.1 of [26]) that there exist functions $(f_1^*, \ldots, f_K^*)$, with $f_k^*: \mathcal{X}_k \to \mathbb{R}$ for $k = 1, \ldots, K$, such that for all $(R_{X_1}, \ldots, R_{X_K}) \in \mathcal{D}_0(E^*)$,
$$\sum_{i=1}^{K} \sum_{x_i \in \mathcal{X}_i} R_{X_i}(x_i) f_i^*(x_i) = \sum_{i=1}^{K} \mathbb{E}_{R_{X_i}}\bigl[f_i^*(X_i)\bigr] \geq 0,$$
and for all $(R_{X_1}, \ldots, R_{X_K}) \in \mathcal{D}_1(E^*)$,
$$\sum_{i=1}^{K} \mathbb{E}_{R_{X_i}}\bigl[f_i^*(X_i)\bigr] \leq 0.$$
Furthermore, we denote
$$h^*(x^K) \triangleq \sum_{i=1}^{K} f_i^*(x_i),$$
and then we have the following proposition. Given $P_Z \in \mathcal{P}^{\mathcal{Z}}$ and $\mathcal{S} \subseteq \mathcal{P}^{\mathcal{Z}}$, we adopt the notation [27,28] $D(\mathcal{S} \,\|\, P_Z) \triangleq \inf_{Q_Z \in \mathcal{S}} D(Q_Z \,\|\, P_Z)$, where $\mathcal{P}^{\mathcal{Z}}$ denotes the set of all distributions supported on $\mathcal{Z}$.
Proposition 1.
The optimal exponent $E^*$ as defined in (10) satisfies
$$E^* = D\bigl(\mathcal{S}_{\mathcal{X}}^{(0)}(h^*) \,\big\|\, P_{X^K}^{(1)}\bigr) = D\bigl(\mathcal{S}_{\mathcal{X}}^{(1)}(h^*) \,\big\|\, P_{X^K}^{(0)}\bigr).$$
Proof. 
See Appendix B. □
Consequently, we establish the optimality of E * and provide the corresponding decision rule.
Theorem 1.
Let $f_1^*, \ldots, f_K^*$ denote the features as defined in (15) and (16). The optimal error exponent of (4) is given by
$$\lim_{n \to \infty} -\frac{1}{n} \log P_n(\hat{H} \neq H) = E^*,$$
where $E^*$ is defined in (10). In addition, the corresponding decision rule $\hat{H}$ is
$$\sum_{k=1}^{K} \mathbb{E}_{\hat{P}_{X_k}}\bigl[f_k^*(X_k)\bigr] \ \underset{\hat{H} = H_1}{\overset{\hat{H} = H_0}{\gtrless}}\ 0.$$
Proof. 
See Appendix C. □
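The decision rule (20) is simple to implement once the features are fixed: each node reports the empirical mean of its feature, and the center sums the reports and thresholds at zero. A minimal sketch follows (the feature vectors and data below are placeholders, not the $f_k^*$ obtained from the separating-hyperplane construction):

```python
import numpy as np

def type_based_decision(samples, features):
    """Decision rule of Theorem 1 (sketch): decide H0 iff
    sum_k E_{P_hat_{X_k}}[f_k*(X_k)] >= 0; `samples[k]` holds node k's
    integer-coded data, `features[k][x]` gives the feature value f_k(x)."""
    score = sum(np.mean(f[np.asarray(x)]) for x, f in zip(samples, features))
    return "H0" if score >= 0 else "H1"

# Illustrative features for two nodes over binary alphabets.
f1 = np.array([+1.0, -1.0])
f2 = np.array([+0.5, -0.5])
rng = np.random.default_rng(0)
x1 = rng.choice(2, size=500, p=[0.6, 0.4])   # node 1 samples
x2 = rng.choice(2, size=500, p=[0.7, 0.3])   # node 2 samples
print(type_based_decision([x1, x2], [f1, f2]))
```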

4.2. General Geometric Structure

The geometry associated with Proposition 1 and Theorem 1 is depicted in Figure 3. In this figure, each point represents a distribution in $\mathcal{P}^{\mathcal{X}}$, and the decision boundary (20) corresponds to the linear family $\mathcal{L}_{\mathcal{X}}(h^*)$ defined as in (13). In addition, from Corollary 3.1 of [27], there exist $\lambda_0, \lambda_1 \in \mathbb{R}$ such that
$$Q_{X^K}^{(i)} \triangleq \tilde{P}_{X^K}^{(\lambda_i)}\bigl(\cdot\,; h^*, P_{X^K}^{(i)}\bigr), \quad i = 0, 1,$$
satisfy
$$D\bigl(\mathcal{S}_{\mathcal{X}}^{(1-i)}(h^*) \,\big\|\, P_{X^K}^{(i)}\bigr) = D\bigl(Q_{X^K}^{(i)} \,\big\|\, P_{X^K}^{(i)}\bigr),$$
where $\tilde{P}_{X^K}^{(\lambda_i)}(\cdot\,; h^*, P_{X^K}^{(i)})$, $i = 0, 1$, are as defined in (11). In this context, $Q_{X^K}^{(0)}$ and $Q_{X^K}^{(1)}$ in (21) are the I-projections [27] of $P_{X^K}^{(0)}$ and $P_{X^K}^{(1)}$ onto this linear family, respectively, which also induces the two exponential families $\mathcal{E}_{\mathcal{X}}(h^*, P_{X^K}^{(0)})$ and $\mathcal{E}_{\mathcal{X}}(h^*, P_{X^K}^{(1)})$ with $h^*$ as their common natural statistic. Additionally, the points in $\mathcal{D}_0(E^*)$ and $\mathcal{D}_1(E^*)$ are separated by the linear family $\mathcal{L}_{\mathcal{X}}(h^*)$.

4.3. Local Information Geometric Analysis

Although the analysis above provides an explicit information-geometric picture, we further apply the local information geometric framework [13] to obtain more fundamental insights into this problem. Some useful notations and definitions in local information geometry are introduced as follows.
Definition 4
($\epsilon$-neighborhood). Given a finite alphabet $\mathcal{Z}$, and letting $R_Z$ be a distribution supported on $\mathcal{Z}$ with all entries positive, its $\epsilon$-neighborhood $\mathcal{N}_\epsilon^{\mathcal{Z}}(R_Z)$ is defined as
$$\mathcal{N}_\epsilon^{\mathcal{Z}}(R_Z) \triangleq \Bigl\{P_Z \in \mathcal{P}^{\mathcal{Z}} : \sum_{z \in \mathcal{Z}} \frac{(P_Z(z) - R_Z(z))^2}{R_Z(z)} \leq \epsilon^2\Bigr\}.$$
Then, with $R_Z$ used as the reference distribution, each distribution $P_Z \in \mathcal{P}^{\mathcal{Z}}$ can be equivalently expressed as a vector $\phi \in \mathbb{R}^{|\mathcal{Z}|}$ or a function $f: \mathcal{Z} \to \mathbb{R}$ with
$$\phi(z) \triangleq \frac{P_Z(z) - R_Z(z)}{\sqrt{R_Z(z)}}, \qquad f(z) \triangleq \frac{\phi(z)}{\sqrt{R_Z(z)}}, \quad z \in \mathcal{Z},$$
referred to as the information vector and the feature function associated with $P_Z$, respectively. This provides a three-way correspondence $P_Z \leftrightarrow \phi \leftrightarrow f$, which will be useful in our derivations. Based on Definition 4, we introduce the local assumption that
$$P_{X^K}^{(i)} \in \mathcal{N}_\epsilon^{\mathcal{X}}\bigl(P_{X^K}\bigr), \quad \text{for } i = 0, 1,$$
where $P_{X^K}$ is the reference distribution. We use $\psi^{(i)} \leftrightarrow P_{X^K}^{(i)}$, $i = 0, 1$, to represent the corresponding information vectors [cf. (23)]. For each $k = 1, \ldots, K$ and a given feature $f_k: \mathcal{X}_k \to \mathbb{R}$, we define the corresponding information vector $\phi_k \in \mathbb{R}^{|\mathcal{X}_k|}$, where $P_{X_k} \triangleq [P_{X^K}]_{X_k}$ is used as the reference distribution. Note that for $i = 0, 1$, the correspondence $B_k^{\mathrm{T}} \psi^{(i)} \leftrightarrow P_{X_k}^{(i)}$ holds, where $P_{X_k}^{(i)} \triangleq [P_{X^K}^{(i)}]_{X_k}$ denotes the corresponding marginal distribution. Specifically, $B_k$ is an $|\mathcal{X}| \times |\mathcal{X}_k|$-dimensional matrix with entries [29]
$$B_k(x^K, \hat{x}_k) \triangleq \sqrt{\frac{P_{X^K}(x^K)}{P_{X_k}(\hat{x}_k)}}\ \delta_{x_k \hat{x}_k},$$
where $\delta_{x_k \hat{x}_k}$ denotes the Kronecker delta.
Moreover, the feature $f_k$ defined on $\mathcal{X}_k$, when considered as a mapping from $\mathcal{X}$ to $\mathbb{R}$, corresponds to the information vector $B_k \phi_k \in \mathbb{R}^{|\mathcal{X}|}$. Leveraging this correspondence, we can further establish the information vector for $h(x^K) = \sum_{k=1}^{K} f_k(x_k)$ as
$$\sum_{i=1}^{K} B_i \phi_i = B_0 \phi_0 \in \mathbb{R}^{|\mathcal{X}|},$$
where we have defined
$$B_0 \triangleq \bigl[B_1 \ \cdots \ B_K\bigr] \quad \text{and} \quad \phi_0 \triangleq \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_K \end{bmatrix},$$
and where for each $k = 1, \ldots, K$, $\phi_k \in \mathbb{R}^{|\mathcal{X}_k|}$ denotes the information vector corresponding to $f_k$.
Additionally, given a matrix $A \in \mathbb{R}^{m_1 \times m_2}$, we use $A^\dagger$ to denote its Moore–Penrose inverse [30], and we define the associated column space $\mathcal{R}(A) \triangleq \{Ax : x \in \mathbb{R}^{m_2}\}$ and the projection matrix $\Pi_A \triangleq A A^\dagger$. Then, we can establish the local counterpart of $E^*$ in Theorem 1 as follows.
Theorem 2.
Under the local assumption (24), let $\psi^{(i)} \leftrightarrow P_{X^K}^{(i)}$, $i = 0, 1$, denote the corresponding information vectors. Then, for $h^*$ as defined in (17), we have the correspondence $h^* \leftrightarrow B_0 \phi_0^*$, where
$$\phi_0^* \triangleq B_0^\dagger \bigl(\psi^{(1)} - \psi^{(0)}\bigr),$$
and where $B_0$ is defined in (27). In addition, the optimal exponent $E^*$ in (10) can be expressed as
$$E^* = \frac{1}{8} \bigl\|B_0 \phi_0^*\bigr\|^2 + o(\epsilon^2).$$
Proof. 
See Appendix D. □
Note that from Theorem 2, we have
$$h^* \leftrightarrow B_0 B_0^\dagger \bigl(\psi^{(1)} - \psi^{(0)}\bigr) = \Pi_{B_0} \bigl(\psi^{(1)} - \psi^{(0)}\bigr),$$
where $\Pi_{B_0}$ is the projection matrix associated with the subspace $\mathcal{R}(B_0)$. The optimal feature $B_0 \phi_0^*$ in (26) corresponds to the projection of the sufficient statistic $f_{\mathrm{LLR}} \leftrightarrow \psi^{(1)} - \psi^{(0)}$ onto the function space that encompasses all possible $h$'s of the form $h(x^K) = \sum_{k=1}^{K} f_k(x_k)$. In other words, $B_0 \phi_0^*$ represents the best approximation of $f_{\mathrm{LLR}}$ within the function space of interest, which leads to the optimal decision error exponent $E^*$ as shown in (29).
Moreover, from (26), this optimal feature can be decomposed into $K$ components in the subspaces $\mathcal{R}(B_k)$, $k = 1, \ldots, K$:
$$B_0 \phi_0^* = \sum_{k=1}^{K} B_k \phi_k^*,$$
where $\phi_0^*$ is stacked from $\phi_k^* \in \mathbb{R}^{|\mathcal{X}_k|}$, $k = 1, \ldots, K$, as in (27). This decomposition structure is depicted in Figure 4 for the case $K = 2$.
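The following sketch illustrates Theorem 2 numerically for a toy $K = 2$ binary example (all distribution values are assumed for illustration): it builds the $B_k$ matrices of (25), stacks them into $B_0$, and computes $\phi_0^* = B_0^\dagger(\psi^{(1)} - \psi^{(0)})$ and the local exponent $\frac{1}{8}\|B_0\phi_0^*\|^2$.

```python
import itertools
import numpy as np

P0 = np.array([[0.26, 0.24],    # P_{X^K}^{(0)}
               [0.24, 0.26]])
P1 = np.array([[0.24, 0.26],    # P_{X^K}^{(1)}, a small perturbation
               [0.27, 0.23]])
P  = 0.5 * (P0 + P1)            # reference distribution P_{X^K}

def info_vector(Q, R):
    """Information vector of Q with reference R (both flattened, C order)."""
    return (Q.ravel() - R.ravel()) / np.sqrt(R.ravel())

def B_matrix(P_joint, axis):
    """B_k(x^K, x_k) = sqrt(P(x^K) / P_{X_k}(x_k)) if x^K agrees with x_k, else 0."""
    shape = P_joint.shape
    P_marg = P_joint.sum(axis=1 - axis)          # marginal of X_k (K = 2 here)
    B = np.zeros((P_joint.size, shape[axis]))
    for idx, xK in enumerate(itertools.product(*map(range, shape))):
        xk = xK[axis]
        B[idx, xk] = np.sqrt(P_joint[xK] / P_marg[xk])
    return B

psi0, psi1 = info_vector(P0, P), info_vector(P1, P)
B0 = np.hstack([B_matrix(P, 0), B_matrix(P, 1)])  # stack B_1, ..., B_K
phi0_star = np.linalg.pinv(B0) @ (psi1 - psi0)    # Moore-Penrose inverse
E_star_local = np.linalg.norm(B0 @ phi0_star) ** 2 / 8
print(E_star_local)
```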
Remark 1.
The vectors $B_k \phi_k^*$ are not simply the orthogonal projections of $B_0 \phi_0^*$ onto the subspaces $\mathcal{R}(B_k)$, since these subspaces, for $k = 1, \ldots, K$, are not mutually orthogonal. Therefore, the decomposition of $B_0 \phi_0^*$ depends on the Gram matrix [30] of the subspaces $\mathcal{R}(B_k)$, as illustrated in Figure 4. Furthermore, it is noteworthy that the orthogonal projection of $B_0 \phi_0^*$ onto the subspace $\mathcal{R}(B_k)$ can be interpreted as characterizing the optimal error exponent of the binary hypothesis testing problem based solely on the observations of $X_k$ [12]. When the subspaces $\mathcal{R}(B_k)$ are orthogonal to each other, the optimal inference approach is straightforward: the optimal information is extracted from each node by orthogonal projection. However, when the subspaces $\mathcal{R}(B_k)$ are not orthogonal, different nodes may share various forms of common information. Our result fundamentally demonstrates how to handle this shared information and extract the optimal features through the decomposition of the information vector over non-orthogonal subspaces. This insight provides a novel approach to addressing the challenges posed by non-orthogonal subspaces and reveals how to extract the most informative features effectively, ultimately leading to improved performance in the distributed hypothesis testing problem.

5. Type-Based Hypothesis Testing over AWGN Channels

This section presents the optimal error exponent of the type-based hypothesis testing problem over AWGN channels, along with the corresponding coding strategy. To begin, we introduce several notations that will help in the presentation of the results.
Definition 5.
Let $[K] \triangleq \{1, 2, \ldots, K\}$, and for a subset $\omega \subseteq [K]$ and $i = 0, 1$, we define
$$D_i^\omega\bigl(\{R_{X_k}\}_{k \in \omega}\bigr) \triangleq \min_{Q_{X^K} \in \mathcal{S}^\omega} D\bigl(Q_{X^K} \,\big\|\, P_{X^K}^{(i)}\bigr),$$
where
$$\mathcal{S}^\omega \triangleq \bigl\{Q_{X^K} : [Q_{X^K}]_{X_k} = R_{X_k},\ k \in \omega\bigr\}.$$
It is easy to verify that $D_i^{[K]}(\cdot) = D_i^*(\cdot)$, where $D_i^*(\cdot)$ is as defined in (9). Moreover, we define the following error exponent with respect to $\omega \subseteq [K]$:
$$E^\omega \triangleq \min_{\{R_{X_k}\}_{k \in \omega},\ \{\theta_k\}_{k \in [K] \setminus \omega}} \max\biggl\{ D_0^\omega\bigl(\{R_{X_k}\}_{k \in \omega}\bigr) + \sum_{k \in [K] \setminus \omega} \frac{(\theta_k - \sqrt{p_k})^2}{2\mu\sigma_k^2},\ D_1^\omega\bigl(\{R_{X_k}\}_{k \in \omega}\bigr) + \sum_{k \in [K] \setminus \omega} \frac{(\theta_k + \sqrt{p_k})^2}{2\mu\sigma_k^2} \biggr\},$$
where we use $A \setminus B$ to denote the relative complement of set $B$ in set $A$, and where $\mu$ is as defined in (8). We can also verify that $E^{[K]} = E^*$, where $E^*$ is as defined in (10). Finally, we define the quantity $E$, which will be shown to be the optimal error exponent:
$$E \triangleq \min_{\omega \in \wp([K])} E^\omega,$$
where $\wp([K])$ denotes the power set of $[K]$.
Theorem 3.
The optimal error exponent of (4) is given by
$$\lim_{n \to \infty} -\frac{1}{n} \log P_n(\hat{H} \neq H) = E.$$
In the following, we prove Theorem 3 by establishing both the achievability and the converse results.

5.1. The Coding Strategy for Distributed Nodes

First, for each $k = 1, \ldots, K$ and some $\gamma \in (0, 1)$, we define the following regimes of empirical distributions. The specific choice of $\gamma$ does not affect the achievable error exponent as long as $\gamma \in (0, 1)$; it serves to separate the decode-and-forward and amplify-and-forward coding strategies introduced in Section 1.
  • Decode-and-forward regime:
$$\mathcal{M}_k^{(0)} \triangleq \bigl\{R_{X_k} : D\bigl(R_{X_k} \,\big\|\, P_{X_k}^{(0)}\bigr) < n^{-\gamma}\bigr\}, \qquad \mathcal{M}_k^{(1)} \triangleq \bigl\{R_{X_k} : D\bigl(R_{X_k} \,\big\|\, P_{X_k}^{(1)}\bigr) < n^{-\gamma}\bigr\}.$$
  • Amplify-and-forward regime:
$$\mathcal{M}_k^{c} \triangleq \bigl\{R_{X_k} : \min\bigl\{D\bigl(R_{X_k} \,\big\|\, P_{X_k}^{(0)}\bigr),\ D\bigl(R_{X_k} \,\big\|\, P_{X_k}^{(1)}\bigr)\bigr\} \geq n^{-\gamma}\bigr\}.$$
Note that for each $k = 1, \ldots, K$, the probability that the empirical distribution $\hat{P}_{X_k}$ falls in $\mathcal{M}_k^c$ decays as $\exp(-n^{1-\gamma})$ (up to polynomial factors). Consequently, in the amplify-and-forward regime, we can transmit such empirical distributions with exponentially large power by pulse amplitude modulation (PAM) while still satisfying the power constraint. Specifically, let $\mathcal{P}_n^{\mathcal{X}_k}$ be the set of all possible empirical distributions of $\mathbf{X}_k$ with $n$ samples, and denote $\eta_k \triangleq |\mathcal{P}_n^{\mathcal{X}_k} \cap \mathcal{M}_k^c|$. We define a bijective function $\xi_k: \mathcal{P}_n^{\mathcal{X}_k} \cap \mathcal{M}_k^c \to \{1, \ldots, \eta_k\}$ that indexes these empirical distributions. Then, according to the observed empirical distribution, the encoder of node $k$ ($k = 1, \ldots, K$) transmits the signal
$$Q_k(\hat{P}_{X_k}) \triangleq \xi_k(\hat{P}_{X_k}) \cdot \exp\Bigl(n^{\frac{1-\gamma}{2}}\Bigr).$$
Furthermore, if the empirical distribution falls in one of the decode-and-forward regimes, the node first detects the true hypothesis from its observed empirical distribution and then transmits the detected bit using binary phase shift keying (BPSK) with the appropriate power. By employing these strategies, the achievability result can be obtained through repeated transmissions from all the distributed nodes. In other words, the resulting encoder for node $k$ is defined as follows:
$$\mathbf{g}_k^* = \bigl(g_k^*, \ldots, g_k^*\bigr), \quad k = 1, \ldots, K,$$
where
$$g_k^*(\hat{P}_{X_k}) \triangleq \begin{cases} \sqrt{p_k - \delta(n,\gamma)}, & \text{if } \hat{P}_{X_k} \in \mathcal{M}_k^{(0)}, \\ -\sqrt{p_k - \delta(n,\gamma)}, & \text{if } \hat{P}_{X_k} \in \mathcal{M}_k^{(1)}, \\ Q_k(\hat{P}_{X_k}), & \text{if } \hat{P}_{X_k} \in \mathcal{M}_k^{c}, \end{cases}$$
and where
$$\delta(n,\gamma) \triangleq \max_{k \in [K]} \frac{P_n\bigl(\hat{P}_{X_k} \in \mathcal{M}_k^c\bigr)}{P_n\bigl(\hat{P}_{X_k} \notin \mathcal{M}_k^c\bigr)} \cdot (n+1)^{2|\mathcal{X}_k|} \cdot \exp\Bigl(2 n^{\frac{1-\gamma}{2}}\Bigr).$$
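A minimal sketch of the per-node encoder (38) is given below; it is not the authors' implementation, and the index map `xi_k` (a dictionary from rare types to PAM indices), the power values, and the rounding of types are illustrative assumptions. The returned scalar would be repeated over the $m$ channel uses.

```python
import numpy as np
from scipy.special import rel_entr

def kl(p, q):
    """KL divergence D(p || q) for discrete distributions given as arrays."""
    return float(rel_entr(p, q).sum())

def encode_node_k(p_hat, P0_k, P1_k, n, gamma, power_k, delta, xi_k):
    thr = n ** (-gamma)
    if kl(p_hat, P0_k) < thr:          # decode-and-forward regime M_k^(0): BPSK "+"
        return +np.sqrt(power_k - delta)
    if kl(p_hat, P1_k) < thr:          # decode-and-forward regime M_k^(1): BPSK "-"
        return -np.sqrt(power_k - delta)
    # Amplify-and-forward regime M_k^c: PAM over the index of the observed type.
    # (exp(n^{(1-gamma)/2}) is astronomically large for realistic n; the line
    # only illustrates the structure of the encoder.)
    return xi_k[tuple(np.round(p_hat, 6))] * np.exp(n ** ((1 - gamma) / 2))
```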
Proposition 2.
The encoders defined in (38) satisfy the power constraint (6), and
$$\lim_{n \to \infty} \delta(n, \gamma) = 0.$$
Proof. 
See Appendix E. □

5.2. Decision Rule and Achievable Error Exponent

After the decision center receives the output signals $\mathbf{g}_1^*(\hat{P}_{X_1}) + Z_1, \ldots, \mathbf{g}_K^*(\hat{P}_{X_K}) + Z_K$, it computes
$$\theta_k \triangleq \frac{1}{m} \sum_{i=1}^{m} \bigl[\mathbf{g}_k^*(\hat{P}_{X_k}) + Z_k\bigr]_i, \quad k = 1, \ldots, K,$$
where $[\cdot]_i$ denotes the $i$-th entry of a given vector. Then, we conduct the log-likelihood ratio test (LLRT) to detect the true hypothesis:
$$\log \frac{P_n(\theta_1, \ldots, \theta_K \mid H = H_0)}{P_n(\theta_1, \ldots, \theta_K \mid H = H_1)} \ \underset{\hat{H} = H_1}{\overset{\hat{H} = H_0}{\gtrless}}\ 0.$$
Note that since exponentially large power is allocated to the empirical distributions in the amplify-and-forward regime (cf. (35), (36)), the decision center can correctly detect the coding regime of the nodes with super-exponentially high probability, i.e., for $k = 1, \ldots, K$,
$$\lim_{n \to \infty} \frac{1}{n} \log P_n\Bigl(\hat{P}_{X_k} \in \mathcal{M}_k^c \,\Big|\, \theta_k \leq \exp\bigl(n^{\frac{1-\gamma}{4}}\bigr)\Bigr) = -\infty, \qquad \lim_{n \to \infty} \frac{1}{n} \log P_n\Bigl(\hat{P}_{X_k} \notin \mathcal{M}_k^c \,\Big|\, \theta_k > \exp\bigl(n^{\frac{1-\gamma}{4}}\bigr)\Bigr) = -\infty.$$
Therefore, we can assume that the decision center knows the coding regime of the nodes, and we define the following regimes of the received signals with respect to subsets $\omega \subseteq [K]$:
$$\Theta^\omega \triangleq \Bigl\{(\theta_1, \ldots, \theta_K) : \theta_k > \exp\bigl(n^{\frac{1-\gamma}{4}}\bigr),\ k \in \omega,\ \text{and}\ \theta_k \leq \exp\bigl(n^{\frac{1-\gamma}{4}}\bigr),\ k \in [K] \setminus \omega \Bigr\},$$
for all $\omega \in \wp([K])$. When the received signals $(\theta_1, \ldots, \theta_K) \in \Theta^\omega$, the decision center can recover the empirical distributions $\hat{P}_{X_k}$ ($k \in \omega$) from the received signals $\theta_k$ via the decoder
$$Q_k^{-1}(\theta_k) \triangleq \xi_k^{-1}\biggl(\Bigl\lfloor \theta_k \big/ \exp\bigl(n^{\frac{1-\gamma}{2}}\bigr) + 0.5 \Bigr\rfloor\biggr),$$
where $\lfloor \cdot \rfloor$ denotes the floor function [31]. The following result shows that the decoding error of (43) can be neglected.
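On the receiver side, the preprocessing above amounts to averaging the received block, thresholding to detect the coding regime, and applying the rounding decoder (43) in the amplify-and-forward case. A minimal sketch follows (with an assumed, illustrative inverse index map `xi_inv` from PAM indices back to types):

```python
import numpy as np

def preprocess_node_k(received_k, n, gamma, xi_inv):
    """Average node k's received block, detect its coding regime, and decode."""
    theta_k = float(np.mean(received_k))
    if theta_k > np.exp(n ** ((1 - gamma) / 4)):          # amplify-and-forward detected
        idx = int(np.floor(theta_k / np.exp(n ** ((1 - gamma) / 2)) + 0.5))
        return theta_k, xi_inv[idx]                       # recovered empirical distribution
    return theta_k, None                                  # decode-and-forward regime
```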
Proposition 3.
For all $\hat{P}_{X_k} \in \mathcal{P}_n^{\mathcal{X}_k} \cap \mathcal{M}_k^c$, $k = 1, \ldots, K$,
$$\lim_{n \to \infty} \frac{1}{n} \log P\bigl(Q_k^{-1}(\theta_k) \neq \hat{P}_{X_k}\bigr) = -\infty.$$
Proof. 
See Appendix F. □
In the following, we denote $p_k' \triangleq p_k - \delta$ for $k = 1, \ldots, K$, and discuss the decision error exponent when the received signals are in $\Theta^\omega$. For $k \in \omega$, the empirical distribution $\hat{P}_{X_k}$ can be recovered by (43); for $k \in [K] \setminus \omega$, node $k$ detects the hypothesis according to the observed empirical distribution and transmits the detected bit by BPSK (cf. (38)) through the AWGN channel. Then, the decision center detects the true hypothesis from the received signals by the LLRT (41), which reduces to
$$\tilde{E}_0^\omega(\theta_1, \ldots, \theta_K) \ \underset{\hat{H} = H_0}{\overset{\hat{H} = H_1}{\gtrless}}\ \tilde{E}_1^\omega(\theta_1, \ldots, \theta_K),$$
where for $i = 0, 1$,
$$\tilde{E}_i^\omega(\theta_1, \ldots, \theta_K) \triangleq \min_{\bar{\omega} \in \wp([K] \setminus \omega)} \biggl\{ D_i^*\bigl(\bar{P}_{X_1}, \ldots, \bar{P}_{X_K}\bigr) + \sum_{k \in \bar{\omega}} \frac{(\theta_k - \sqrt{p_k'})^2}{2\mu\sigma_k^2} + \sum_{k \in [K] \setminus (\omega \cup \bar{\omega})} \frac{(\theta_k + \sqrt{p_k'})^2}{2\mu\sigma_k^2} \biggr\},$$
where $\wp([K] \setminus \omega)$ denotes the power set of $[K] \setminus \omega$, and where for $k = 1, \ldots, K$,
$$\bar{P}_{X_k} \triangleq \begin{cases} Q_k^{-1}(\theta_k), & \text{if } k \in \omega, \\ P_{X_k}^{(0)}, & \text{if } k \in \bar{\omega}, \\ P_{X_k}^{(1)}, & \text{if } k \in [K] \setminus (\omega \cup \bar{\omega}). \end{cases}$$
Consequently, the decision error exponent is characterized by the following proposition.
Proposition 4.
For any $\epsilon > 0$ and $\omega \in \wp([K])$, the decision error exponent of the decision rule (45) satisfies
$$\lim_{n \to \infty} -\frac{1}{n} \log P_n\bigl(\hat{H} \neq H,\ (\theta_1, \ldots, \theta_K) \in \Theta^\omega\bigr) \geq E - \epsilon,$$
where E is as defined in (33).
Proof. 
See Appendix G. □
Noticing that the overall decision error probability is
$$P_n(\hat{H} \neq H) = \sum_{\omega \in \wp([K])} P\bigl(\hat{H} \neq H,\ (\theta_1, \ldots, \theta_K) \in \Theta^\omega\bigr),$$
the following proposition establishes the achievable error exponent by the coding strategy (38).
Proposition 5.
By using the encoders $\mathbf{g}_1^*, \ldots, \mathbf{g}_K^*$ as defined in (38) and the decision rule $\hat{H}$ from (41), the achievable error exponent is given by $E$, i.e.,
$$\lim_{n \to \infty} -\frac{1}{n} \log P_n(\hat{H} \neq H) \geq E,$$
where E is as defined in (33).

5.3. The Converse Result

In this section, we show that $E$ is indeed an upper bound of (4), which establishes Theorem 3. Our main technique is a genie-aided approach, which provides different kinds of additional information to the nodes and computes the corresponding error exponents under this additional information. As depicted in Figure 5, given an index set $\omega \in \wp([K])$, suppose that for all $k \in \omega$, node $k$ knows and can cancel the channel noise in advance; then, the channel is noiseless, and the decision center can perfectly receive the empirical distribution $\hat{P}_{X_k}$. On the other hand, suppose that for all $k \in [K] \setminus \omega$, the true hypothesis $H$ is revealed to node $k$. With such additional information, we can establish the following upper bound of (4) (cf. (33)).
Proposition 6.
Given an index set $\omega \in \wp([K])$, suppose that for all $k \in \omega$, the decision center can obtain $\hat{P}_{X_k}$ perfectly, and that for all $k \in [K] \setminus \omega$, node $k$ can obtain the true hypothesis $H$. The resulting optimal decision error exponent is
$$\lim_{n \to \infty} -\frac{1}{n} \log P_n(\hat{H} \neq H) = E^\omega,$$
where E ω is as defined in (32).
Proof. 
See Appendix H. □
Notice that Proposition 6 holds for all $\omega \in \wp([K])$, and without the additional information we cannot obtain a better performance than in Proposition 6 for DHT over AWGN channels. We then conclude the following upper bound on the error exponent.
Proposition 7.
For all possible encoders $g_1, \ldots, g_K$ under the power constraint (6), the corresponding error exponent with respect to the LLRT decision rule satisfies
$$\lim_{n \to \infty} -\frac{1}{n} \log P_n(\hat{H} \neq H) \leq E,$$
where E is as defined in (33).
Finally, by combining Propositions 5 and 7, Theorem 3 is proved.
Remark 2
(Local-geometric interpretation). Note that the expression for the optimal error exponent $E$ as defined in (33) is quite intricate, which could limit our understanding. To simplify the analysis, we introduce the local geometry assumption given in (24). In Appendix I, we demonstrate that the error exponent then corresponds to the more manageable expression
$$E = \min_{\omega \in \wp([K])} \biggl\{ \frac{1}{8} \Bigl\| B_\omega B_\omega^\dagger \bigl(\psi_\omega^{(1)} - \psi_\omega^{(0)}\bigr) \Bigr\|^2 + \sum_{k \in [K] \setminus \omega} \frac{p_k}{2\mu\sigma_k^2} \biggr\} + o(\epsilon^2),$$
where for $\omega = \{i_1, \ldots, i_{|\omega|}\}$, we have defined
$$B_\omega \triangleq \bigl[B_{i_1} \ \cdots \ B_{i_{|\omega|}}\bigr],$$
and $\psi_\omega^{(i)} \leftrightarrow [P_{X^K}^{(i)}]_{X_{i_1} \cdots X_{i_{|\omega|}}}$, $i = 0, 1$. Given $\omega \in \wp([K])$, the first term in (51) represents the optimal error exponent (cf. (29)) when the decision center can access the empirical distributions $\hat{P}_{X_k}$, $k \in \omega$. The second term corresponds to the optimal error exponent when each node $k \in [K] \setminus \omega$ knows the true hypothesis $H$ and transmits the corresponding bit using BPSK modulation. The total error exponent is the sum of these two parts, and $E$ is the minimum of this sum over all possible splits of the index set $[K]$. In other words, $E$ captures the optimal trade-off between accessing empirical distributions at the decision center and having individual nodes transmit bits with BPSK modulation.
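The expression (51) is straightforward to evaluate numerically. The sketch below (not from the paper) enumerates the subsets $\omega$, computes the projected type-based term and the BPSK term for each split, and returns the minimum; it assumes the $B_k$ matrices and information vectors are built as in the earlier local-geometry sketch, with illustrative powers, noise variances, and $\mu$, and it computes the projection in the full space, which yields the same norm as the marginal form in (51).

```python
import itertools
import numpy as np

def local_error_exponent(B_matrices, psi0, psi1, powers, sigma2, mu):
    K = len(B_matrices)
    diff = psi1 - psi0
    best = np.inf
    for r in range(K + 1):
        for omega in itertools.combinations(range(K), r):
            # BPSK (genie) term for the nodes outside omega.
            value = sum(powers[k] / (2 * mu * sigma2[k])
                        for k in range(K) if k not in omega)
            # Projected type-based term for the nodes inside omega.
            if omega:
                B_om = np.hstack([B_matrices[k] for k in omega])
                value += np.linalg.norm(B_om @ np.linalg.pinv(B_om) @ diff) ** 2 / 8
            best = min(best, value)
    return best
```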

6. Discussion

This paper discusses the DHT problem over two communication models. The first is the noiseless channel, which is mostly considered in current distributed learning and federated learning systems [9,11]. For the noiseless channels, we show that by using one-dimensional statistics from different nodes, it is possible to achieve the same error exponent as when the decision center has knowledge of the corresponding empirical distributions. This result is significant, as it simplifies the coding process at the distributed nodes, allowing them to transmit only the necessary statistics rather than the entire empirical distribution, and it provides a practical implementation of the result in [5]. This finding justifies transmitting statistics, the most widely used strategy in distributed learning and federated learning [11].
For the AWGN channels, this paper introduces a novel coding strategy, which combines decode-and-forward and amplify-and-forward techniques. The underlying concept of this coding strategy is the observation that the probability of the empirical distribution deviating significantly from the true marginal distribution diminishes exponentially. Consequently, by employing sufficiently large power, we can transmit the empirical distribution almost perfectly to the decision center while satisfying the average power constraint. When the prior probabilities are not equal to 1/2, the strategy still works for the optimal error exponent; the only difference is to adjust the BPSK points for the two hypotheses according to the power constraint. The demonstrated optimality of the achieved decision error exponent further indicates that the proposed coding strategy is highly effective and successfully approaches the theoretical limit within the given constraints of the problem.

7. Conclusions

This paper focuses on investigating DHT problems over both noiseless channels and AWGN channels, where the distributed nodes are constrained to encoding the received empirical distributions, driven by practical computational considerations. In the first problem, we demonstrate that utilizing one-dimensional statistics of distributed nodes and simply summing them up as the decision rule can lead to the optimal error exponent. For the second problem, we propose a coding strategy that combines decode-and-forward and amplify-and-forward techniques. We further introduce a genie-aided approach to establish the optimality of the achieved decision error exponent. Overall, our findings offer valuable insights into coding techniques for distributed nodes, and the established strategies can be extended to more general scenarios, broadening the applicability of DHT in diverse settings.

Author Contributions

X.T., X.X. and S.-L.H. contributed to the conceptualization, methodology, and writing of this paper. All authors have read and agreed to the published version of the manuscript.

Funding

The research of Shao-Lun Huang is supported in part by National Key R&D Program of China under Grant 2021YFA0715202 and the Shenzhen Science and Technology Program under Grant KQTD20170810150821146.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data sharing not applicable.

Conflicts of Interest

The authors declare no conflict of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

The following abbreviations are used in this manuscript:
DHT   Distributed hypothesis testing
AWGN  Additive white Gaussian noise
BPSK  Binary phase shift keying
LLRT  Log-likelihood ratio test
PAM   Pulse amplitude modulation

Appendix A. Proof of Lemma 2

We have the following facts:
$$D_0^*\bigl(P_{X_1}^{(1)}, \ldots, P_{X_K}^{(1)}\bigr) \leq D\bigl(P_{X^K}^{(1)} \,\big\|\, P_{X^K}^{(0)}\bigr)$$
and
$$D_1^*\bigl(P_{X_1}^{(0)}, \ldots, P_{X_K}^{(0)}\bigr) \leq D\bigl(P_{X^K}^{(0)} \,\big\|\, P_{X^K}^{(1)}\bigr),$$
from which we know that $\mathcal{D}(\tilde{t}\,) \neq \emptyset$, where $\tilde{t} \triangleq \min\bigl\{D\bigl(P_{X^K}^{(0)} \,\big\|\, P_{X^K}^{(1)}\bigr),\ D\bigl(P_{X^K}^{(1)} \,\big\|\, P_{X^K}^{(0)}\bigr)\bigr\}$. Moreover, from the facts that $\mathcal{D}(0) = \emptyset$ and
$$\mathcal{D}(t_1) \subseteq \mathcal{D}(t_2), \quad \text{for all } 0 \leq t_1 \leq t_2,$$
we define
$$t_0 \triangleq \sup\{t \geq 0 : \mathcal{D}(t) = \emptyset\}.$$
We also have
$$\mathcal{D}(t) \neq \emptyset \ \implies\ \mathcal{D}(t - \epsilon) \neq \emptyset \quad \text{for some } \epsilon > 0.$$
Indeed, since $\mathcal{D}(t)$ is non-empty, there exist $(R_{X_1}, \ldots, R_{X_K})$ and $\epsilon > 0$ such that
$$D_i^*(R_{X_1}, \ldots, R_{X_K}) < t - \epsilon,$$
for $i = 0, 1$, and thus $\mathcal{D}(t - \epsilon)$ is non-empty.
To sum up, from (A1)–(A3) we obtain $\mathcal{D}(t) \neq \emptyset$ for all $t > t_0$ and $\mathcal{D}(t) = \emptyset$ for all $t \leq t_0$.
Furthermore, to prove (14), we define
$$\bar{\mathcal{D}}_i(t) \triangleq \bigl\{(R_{X_1}, \ldots, R_{X_K}) : D_i^*(R_{X_1}, \ldots, R_{X_K}) \leq t\bigr\},$$
and $\bar{\mathcal{D}}(t) \triangleq \bar{\mathcal{D}}_0(t) \cap \bar{\mathcal{D}}_1(t)$. Then, for all $t > t_0$ we have
$$\min_{R_{X_1}, \ldots, R_{X_K}} \max_{i \in \{0,1\}} D_i^*(R_{X_1}, \ldots, R_{X_K}) = \min_{(R_{X_1}, \ldots, R_{X_K}) \in \bar{\mathcal{D}}(t)} \max_{i \in \{0,1\}} D_i^*(R_{X_1}, \ldots, R_{X_K}) \in [t_0, t],$$
where the second minimum exists since $\bar{\mathcal{D}}(t)$ is closed and bounded. This implies that $t_0 = E^*$ (cf. (10)). Hence, there exist marginal distributions $\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}$ such that
$$D_i^*(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) = E^*, \quad i = 0, 1.$$
Finally, to show the uniqueness of $(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K})$, suppose that (14) also holds for some $(\tilde{R}'_{X_1}, \ldots, \tilde{R}'_{X_K}) \neq (\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K})$. Let $\tilde{R}''_{X_k} \triangleq (\tilde{R}_{X_k} + \tilde{R}'_{X_k})/2$ for $k = 1, \ldots, K$; then, it follows from the strong convexity of $D_0^*(\cdot)$ and $D_1^*(\cdot)$ that
$$D_i^*(\tilde{R}''_{X_1}, \ldots, \tilde{R}''_{X_K}) < t_0, \quad i = 0, 1,$$
which contradicts (A2).

Appendix B. Proof of Proposition 1

We know that $\mathcal{D}_i(E^*) \subseteq \mathcal{S}_{\mathcal{X}}^{(i)}(h^*)$ for $i = 0, 1$. This implies that $\mathcal{S}_{\mathcal{X}}^{(i)}(h^*) \subseteq \mathcal{D}_{1-i}^c(E^*)$, where for $t \geq 0$ and $i = 0, 1$, we have defined $\mathcal{D}_i^c(t) \triangleq (\mathcal{P}^{\mathcal{X}_1} \times \cdots \times \mathcal{P}^{\mathcal{X}_K}) \setminus \mathcal{D}_i(t)$.
Moreover, let $(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) \in \mathcal{P}^{\mathcal{X}_1} \times \cdots \times \mathcal{P}^{\mathcal{X}_K}$ be as defined in Lemma 2; then, we have
$$(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) \in \mathcal{L}_{\mathcal{X}}(h^*) = \mathcal{S}_{\mathcal{X}}^{(0)}(h^*) \cap \mathcal{S}_{\mathcal{X}}^{(1)}(h^*).$$
As a result, for $i = 0, 1$ we have
$$E^* = D_i^*(\tilde{R}_{X_1}, \ldots, \tilde{R}_{X_K}) \geq D\bigl(\mathcal{S}_{\mathcal{X}}^{(1-i)}(h^*) \,\big\|\, P_{X^K}^{(i)}\bigr) = \min_{(R_{X_1}, \ldots, R_{X_K}) \in \mathcal{S}_{\mathcal{X}}^{(1-i)}(h^*)} D_i^*(R_{X_1}, \ldots, R_{X_K}) \geq \min_{(R_{X_1}, \ldots, R_{X_K}) \in \mathcal{D}_i^c(E^*)} D_i^*(R_{X_1}, \ldots, R_{X_K}) \geq E^*,$$
which implies (18).

Appendix C. Proof of Theorem 1

On the one hand, note that from the Markov relation
$$H \to (\hat{P}_{X_1}, \ldots, \hat{P}_{X_K}) \to \bigl(u_1(\hat{P}_{X_1}), \ldots, u_K(\hat{P}_{X_K})\bigr),$$
the minimum possible decision error can be obtained when we choose the empirical distributions P ^ X 1 , , P ^ X K themselves as the statistics.
On the other hand, from Proposition 1, the error exponents associated with the type I error and the type II error are $D\bigl(\mathcal{S}_{\mathcal{X}}^{(1)}(h^*) \,\big\|\, P_{X^K}^{(0)}\bigr)$ and $D\bigl(\mathcal{S}_{\mathcal{X}}^{(0)}(h^*) \,\big\|\, P_{X^K}^{(1)}\bigr)$, respectively. From (18), both exponents equal $E^*$, and thus the error exponent for $P_n(\hat{H} \neq H)$ is also $E^*$.

Appendix D. Proof of Theorem 2

To begin, we define $\psi \triangleq \psi^{(1)} - \psi^{(0)}$. Then, for given $f_k: \mathcal{X}_k \to \mathbb{R}$, it follows from Lemma 17 of [13] that the exponent based on the feature $h(x^K) = \sum_{k=1}^{K} f_k(x_k)$ is
$$E = \frac{1}{8} \cdot \frac{\langle \psi, \zeta \rangle^2}{\|\zeta\|^2} + o(\epsilon^2),$$
where we have defined $\zeta \triangleq B_0 \phi_0 \in \mathcal{R}(B_0)$, and where $\phi_0$ is as defined in (27).
Then, note that the projection matrix $\Pi_{B_0}$ satisfies $\Pi_{B_0} = (\Pi_{B_0})^2$ and $\zeta = \Pi_{B_0} \zeta$. Therefore, from the Cauchy–Schwarz inequality we have
$$\frac{\langle \psi, \zeta \rangle^2}{\|\zeta\|^2} = \frac{\bigl(\psi^{\mathrm{T}} \Pi_{B_0} \zeta\bigr)^2}{\|\zeta\|^2} = \frac{\langle \Pi_{B_0} \psi, \zeta \rangle^2}{\|\zeta\|^2} \leq \bigl\|\Pi_{B_0} \psi\bigr\|^2,$$
where the inequality holds with equality if and only if $\zeta$ takes the optimal value
$$\zeta^* = c \cdot \Pi_{B_0} \psi,$$
or equivalently, $B_0 \phi_0^* = c \cdot B_0 B_0^\dagger \psi$, for some constant scalar $c \neq 0$.
To determine the value of $c$, note that we have $\zeta^* \leftrightarrow h^*$, where $h^*$ is the optimal feature as defined in (17). Note that in (21), for each $i = 0, 1$, $Q_{X^K}^{(i)}$ depends only on the product $\lambda_i h^*$; we may assume $\lambda_0 = 1/2$ and simply use $\lambda$ to denote $\lambda_1$. Then, we have
$$\begin{aligned} Q_{X^K}^{(0)}(x^K) &= \tilde{P}_{X^K}^{(\frac{1}{2})}\bigl(x^K; h^*, P_{X^K}^{(0)}\bigr)\\ &= P_{X^K}^{(0)}(x^K)\Bigl(1 + \tfrac{1}{2}\bigl(h^*(x^K) - \mathbb{E}_{P_{X^K}^{(0)}}\bigl[h^*(X^K)\bigr]\bigr)\Bigr) + o(\epsilon)\\ &= \Bigl(P_{X^K}(x^K) + \sqrt{P_{X^K}(x^K)}\,\psi^{(0)}(x^K)\Bigr) \cdot \Bigl(1 + \tfrac{1}{2}\tfrac{\zeta(x^K)}{\sqrt{P_{X^K}(x^K)}}\Bigr) + o(\epsilon)\\ &= P_{X^K}(x^K) + \sqrt{P_{X^K}(x^K)} \cdot \Bigl(\psi^{(0)}(x^K) + \tfrac{1}{2}\zeta(x^K)\Bigr) + o(\epsilon), \end{aligned}$$
which implies the correspondence
$$Q_{X^K}^{(0)} \leftrightarrow \psi^{(0)} + \tfrac{1}{2}\zeta + o(\epsilon).$$
Similarly, we have
$$Q_{X^K}^{(1)} \leftrightarrow \psi^{(1)} + \lambda\zeta + o(\epsilon).$$
Then, it follows from the second-order Taylor series expansion of the K-L divergence that (see, e.g., Lemma 10 of [13])
$$D\bigl(Q_{X^K}^{(0)} \,\big\|\, P_{X^K}^{(0)}\bigr) = \frac{1}{8}\|\zeta\|^2 + o(\epsilon^2), \qquad D\bigl(Q_{X^K}^{(1)} \,\big\|\, P_{X^K}^{(1)}\bigr) = \frac{\lambda^2}{2}\|\zeta\|^2 + o(\epsilon^2).$$
Moreover, note that since (cf. Lemma 9 of [13])
$$\mathbb{E}_{Q_{X^K}^{(0)}}\bigl[h^*(X^K)\bigr] = \bigl\langle \psi^{(0)} + \tfrac{1}{2}\zeta,\ \zeta \bigr\rangle + o(\epsilon^2), \qquad \mathbb{E}_{Q_{X^K}^{(1)}}\bigl[h^*(X^K)\bigr] = \bigl\langle \psi^{(1)} + \lambda\zeta,\ \zeta \bigr\rangle + o(\epsilon^2),$$
we have
$$\begin{aligned} 0 &= \mathbb{E}_{Q_{X^K}^{(1)}}\bigl[h^*(X^K)\bigr] - \mathbb{E}_{Q_{X^K}^{(0)}}\bigl[h^*(X^K)\bigr] = \Bigl\langle \psi + \bigl(\lambda - \tfrac{1}{2}\bigr)\zeta,\ \zeta \Bigr\rangle + o(\epsilon^2)\\ &= c\,\Bigl\langle \psi + \bigl(\lambda - \tfrac{1}{2}\bigr)c\,\Pi_{B_0}\psi,\ \Pi_{B_0}\psi \Bigr\rangle + o(\epsilon^2) = c\Bigl(1 + \bigl(\lambda - \tfrac{1}{2}\bigr)c\Bigr) \cdot \bigl\|\Pi_{B_0}\psi\bigr\|^2 + o(\epsilon^2). \end{aligned}$$
As a result, it follows from $D\bigl(Q_{X^K}^{(0)} \,\big\|\, P_{X^K}^{(0)}\bigr) = D\bigl(Q_{X^K}^{(1)} \,\big\|\, P_{X^K}^{(1)}\bigr)$ and (A8) that $c = 1$ and $\lambda = -\frac{1}{2}$. Then, we obtain
$$\zeta^* = \Pi_{B_0}\psi = B_0 B_0^\dagger \psi = B_0\phi_0^*,$$
where $\phi_0^* \triangleq B_0^\dagger \psi$.
Finally, the optimal error exponent is
$$E^* = \frac{1}{8}\bigl\|\Pi_{B_0}\psi\bigr\|^2 + o(\epsilon^2) = \frac{1}{8}\bigl\|B_0\phi_0^*\bigr\|^2 + o(\epsilon^2).$$

Appendix E. Proof of Proposition 2

According to Sanov's theorem, $P_n(\hat{P}_{X_k} \in \mathcal{M}_k^c) \leq \exp(-n^{1-\gamma})$ up to a polynomial factor in $n$, and $P_n(\hat{P}_{X_k} \notin \mathcal{M}_k^c) \to 1$. Then, we have
$$\frac{P_n\bigl(\hat{P}_{X_k} \in \mathcal{M}_k^c\bigr)}{P_n\bigl(\hat{P}_{X_k} \notin \mathcal{M}_k^c\bigr)} \cdot (n+1)^{2|\mathcal{X}_k|} \cdot \exp\Bigl(2n^{\frac{1-\gamma}{2}}\Bigr) \lesssim (n+1)^{3|\mathcal{X}_k|} \exp\Bigl(2n^{\frac{1-\gamma}{2}} - n^{1-\gamma}\Bigr),$$
which converges to 0 as $n \to \infty$. Additionally, for the power constraint,
$$\begin{aligned} \mathbb{E}\bigl[\bigl(g_k^*(\hat{P}_{X_k})\bigr)^2\bigr] &\leq \bigl(p_k - \delta(n,\gamma)\bigr) \cdot P_n\bigl(\hat{P}_{X_k} \notin \mathcal{M}_k^c\bigr) + \eta_k^2 \cdot \exp\Bigl(2n^{\frac{1-\gamma}{2}}\Bigr) \cdot P_n\bigl(\hat{P}_{X_k} \in \mathcal{M}_k^c\bigr)\\ &\leq \bigl(p_k - \delta(n,\gamma)\bigr) \cdot P_n\bigl(\hat{P}_{X_k} \notin \mathcal{M}_k^c\bigr) + (n+1)^{2|\mathcal{X}_k|} \cdot \exp\Bigl(2n^{\frac{1-\gamma}{2}}\Bigr) \cdot P_n\bigl(\hat{P}_{X_k} \in \mathcal{M}_k^c\bigr)\\ &\leq p_k. \end{aligned}$$

Appendix F. Proof of Proposition 3

Note that, equivalently,
$$\theta_k = g_k^*(\hat{P}_{X_k}) + \tilde{Z}_k,$$
where $\tilde{Z}_k \sim \mathcal{N}(0, \sigma_k^2/m)$. We then apply the standard Gaussian tail bound [32], i.e., for any $\alpha > 0$,
$$\lim_{n \to \infty} -\frac{1}{n} \log P\bigl(\tilde{Z}_k > \alpha\bigr) = \frac{\alpha^2}{2\mu\sigma_k^2},$$
which implies that
$$\lim_{n \to \infty} \frac{1}{n} \log P\Bigl(Q_k^{-1}\bigl(Q_k(\hat{P}_{X_k}) + \tilde{Z}_k\bigr) \neq \hat{P}_{X_k}\Bigr) \leq \lim_{n \to \infty} \frac{1}{n} \log P\Bigl(\bigl|\tilde{Z}_k\bigr| > \tfrac{1}{2}\exp\bigl(n^{\frac{1-\gamma}{2}}\bigr)\Bigr) = -\infty.$$

Appendix G. Proof of Proposition 4

Note that
P n ( θ 1 , , θ K ) , ( θ 1 , , θ K ) Θ ω | H = H 0 P n ( θ 1 , , θ K ) , P ^ X k M k c , k ω , P ^ X k M k c , k [ K ] ω | H = H 0 = ω ¯ ( [ K ] ω ) { k ω ¯ P θ k | P ^ X k M k ( 0 ) · k [ K ] ( ω ω ¯ ) P θ k | P ^ X k M k ( 1 ) · k ω P ^ X k P X k ( n ) P ( θ k | P ^ X k ) P n ( P ^ X k , P ^ X k M k c , k ω , P ^ X k M k ( 0 ) , k ω ¯ ,
P ^ X k M k ( 1 ) , k [ K ] ( ω ω ¯ ) | H = H 0 ) } ,
where (A10) comes from (42). By decoding the empirical distributions from θ k with Q k 1 ( · ) for k ω and Proposition 3, we have
P ^ X k P X k ( n ) P ( θ k | P ^ X k ) P n ( P ^ X k , P ^ X k M k c , k ω , P ^ X k M k ( 0 ) , k ω ¯ , P ^ X k M k ( 1 ) , k [ K ] ( ω ω ¯ ) | H = H 0 ) P ( θ k | P ^ X k = Q k 1 ( θ k ) ) P n ( P ^ X k = Q k 1 ( θ k ) , P ^ X k M k c , k ω , P ^ X k M k ( 0 ) , k ω ¯ , P ^ X k M k ( 1 ) , k [ K ] ( ω ω ¯ ) | H = H 0 ) P ( θ k | P ^ X k = Q k 1 ( θ k ) ) · exp n · D 0 * ( P ¯ X 1 , , P ¯ X k ) .
With
$$P\bigl(\theta_k \,\big|\, \hat{P}_{X_k} \in \mathcal{M}_k^{(0)}\bigr) \doteq \exp\Bigl(-n \cdot \frac{(\theta_k - \sqrt{p_k'})^2}{2\mu\sigma_k^2}\Bigr)$$
and
$$P\bigl(\theta_k \,\big|\, \hat{P}_{X_k} \in \mathcal{M}_k^{(1)}\bigr) \doteq \exp\Bigl(-n \cdot \frac{(\theta_k + \sqrt{p_k'})^2}{2\mu\sigma_k^2}\Bigr),$$
we have
P n ( ( θ 1 , , θ K ) , ( θ 1 , , θ K ) Θ ω | H = H 0 ) ω ¯ ( [ K ] ω ) k ω · P ( θ k | P ^ X k = Q k 1 ( θ k ) ) · exp n · E ˜ 0 ω ( θ 1 , , θ K ) .
Similarly,
P n ( ( θ 1 , , θ K ) , ( θ 1 , , θ K ) Θ ω | H = H 1 ) ω ¯ ( [ K ] ω ) k ω P ( θ k | P ^ X k = Q k 1 ( θ k ) ) · exp n · E ˜ 1 ω ( θ 1 , , θ K ) .
Note that P ( θ k | P ^ X k = Q k 1 ( θ k ) ) is not related to ω ¯ and H , and then we can derive the decision rule (45) with LLRT. To compute the error exponent, we use Proposition 3 and the fact that P ( θ k | P ^ X k = Q k 1 ( θ k ) ) 1 when θ k = Q ( P ^ X k ) . Then, the optimal error exponent corresponds to
min { P ^ X k } k ω , { θ k } k [ K ] ω max i = 0 , 1 min ω ¯ ( [ K ] ω ) D i * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ( θ k p k ) 2 2 μ σ k 2 + k [ K ] ( ω ω ¯ ) ( θ k + p k ) 2 2 μ σ k 2 ,
where for $k = 1, \ldots, K$ and $\bar{\omega} \in \wp([K] \setminus \omega)$,
$$\bar{R}_{X_k}^{\bar{\omega}} \triangleq \begin{cases} \hat{P}_{X_k}, & \text{if } k \in \omega, \\ P_{X_k}^{(0)}, & \text{if } k \in \bar{\omega}, \\ P_{X_k}^{(1)}, & \text{if } k \in [K] \setminus (\omega \cup \bar{\omega}). \end{cases}$$
To finish the proof, we introduce the following lemma.
Lemma A1.
For arbitrary functions $v_1, \ldots, v_\ell: \mathcal{Z} \to \mathbb{R}$ and $w_1, \ldots, w_\ell: \mathcal{Z} \to \mathbb{R}$, where $\mathcal{Z}$ is a given set, we have
$$\min_{z \in \mathcal{Z}} \max\Bigl\{ \min\bigl\{v_1(z), \ldots, v_\ell(z)\bigr\},\ \min\bigl\{w_1(z), \ldots, w_\ell(z)\bigr\} \Bigr\} = \min_{i, j \in \{1, \ldots, \ell\}}\ \min_{z \in \mathcal{Z}} \max\bigl\{v_i(z), w_j(z)\bigr\}.$$
With Lemma A1, we only need to compare each component in (A12), i.e.,
min ω ¯ , ω ¯ ( [ K ] ω ) min { P ^ X k } k ω , { θ k } k [ K ] ω max { D 0 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ( θ k p k ) 2 2 μ σ k 2 + k [ K ] ( ω ω ¯ ) ( θ k + p k ) 2 2 μ σ k 2 , D 1 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ( θ k p k ) 2 2 μ σ k 2 + k [ K ] ( ω ω ¯ ) ( θ k + p k ) 2 2 μ σ k 2 } .
Given ω ¯ and ω ¯ , let ω ˜ = ω ¯ ω ¯ . By selecting θ k = p k for k ω ˜ and θ k = p k for k [ K ] ( ω ( ω ¯ ω ¯ ) ) in the minimization of (A15), (A15) equals
min ω ¯ , ω ¯ ( [ K ] ω ) min { P ^ X k } k ω , { θ k } k ω ¯ ω ¯ ω ˜ max { D 0 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ω ˜ ( θ k p k ) 2 2 μ σ k 2 + k ω ¯ ω ˜ ( θ k + p k ) 2 2 μ σ k 2 , D 1 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ω ˜ ( θ k + p k ) 2 2 μ σ k 2 + k ω ¯ ω ˜ ( θ k p k ) 2 2 μ σ k 2 } .
In the following, we denote Ω [ K ] ( ω ( ω ¯ ω ¯ ) ) . For those indices k ω ˜ or k Ω , although they do not contribute to the Gaussian-like error exponents, they restrict that R ¯ X k ω ¯ = R ¯ X k ω ¯ = P X k ( 0 ) or R ¯ X k ω ¯ = R ¯ X k ω ¯ = P X k ( 1 ) . By letting R ¯ X k ω ¯ = R ¯ X k ω ¯ = P ^ X k ( k ω ˜ or k Ω ) that can be optimized, we find the lower bound of (A15).
( A 15 ) min ω ¯ , ω ¯ ( [ K ] ω ) min { P ^ X k } k ω ω ˜ Ω , { θ k } k ω ¯ ω ¯ ω ˜ max { D 0 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ω ˜ ( θ k p k ) 2 2 μ σ k 2 + k ω ¯ ω ˜ ( θ k + p k ) 2 2 μ σ k 2 , D 1 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) + k ω ¯ ω ˜ ( θ k + p k ) 2 2 μ σ k 2 + k ω ¯ ω ˜ ( θ k p k ) 2 2 μ σ k 2 } = min ω ¯ , ω ¯ ( [ K ] ω ) E ω ω ˜ Ω ϵ E ϵ ,
where we have used the fact that lim n p k = p k ,
D 0 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) = D 0 ω ω ˜ Ω { P ^ X k } k ω ω ˜ Ω ,
D 1 * ( R ¯ X 1 ω ¯ , , R ¯ X K ω ¯ ) = D 1 ω ω ˜ Ω { P ^ X k } k ω ω ˜ Ω ,
and have substituted θ k for θ k .

Appendix H. Proof of Proposition 6

Let the encoders for $k \in [K] \setminus \omega$ be functions of $H$ and $\hat{P}_{X_k}$. The upper bound comes from the fact that the type is also generated from the hypothesis $H$; therefore, the encoder based on both the hypothesis and the type is effectively just a function of the true hypothesis. Suppose that $\rho_k: \{0, 1\} \to \mathbb{R}^m$ ($k \in [K] \setminus \omega$) satisfies $\frac{1}{m}\mathbb{E}\bigl[\|\rho_k(H)\|^2\bigr] \leq p_k$. Let $\rho_k^{(i)}$ denote the $i$-th entry of $\rho_k$, and
$$\rho_k^{(i)}(H) \triangleq \begin{cases} \kappa_k^{(i)}, & \text{if } H = H_0, \\ \bar{\kappa}_k^{(i)}, & \text{if } H = H_1, \end{cases}$$
where $\frac{1}{2}\bigl(\kappa_k^{(i)}\bigr)^2 + \frac{1}{2}\bigl(\bar{\kappa}_k^{(i)}\bigr)^2 = p_k^{(i)}$ and $\frac{1}{m}\sum_{i=1}^{m} p_k^{(i)} = p_k$. The error exponent with respect to the LLRT is
min { R X k } k ω , { θ k ( i ) } k [ K ] ω , i = 1 , , m max { 1 n k [ K ] ω i = 1 m ( θ k ( i ) κ k ( i ) ) 2 2 σ k 2 + D 0 ω ( { R X k } k ω ) , 1 n k [ K ] ω i = 1 m ( θ k ( i ) κ ¯ k ( i ) ) 2 2 σ k 2 + D 1 ω ( { R X k } k ω ) } .
Here, we explain the optimality of κ ¯ k ( i ) = κ k ( i ) = p k ( i ) , under which let R X k * , θ k ( i ) * be the solution to problem (A19). For other pairs of ( κ ¯ k ( i ) , κ k ( i ) ) , | κ ¯ k ( i ) κ k ( i ) | < 2 p k ( i ) . Let θ ˜ k ( i ) * = κ k ( i ) + ( κ ¯ k ( i ) κ k ( i ) ) · θ k ( i ) * + p k ( i ) 2 p k ( i ) . Then, we have
( θ k ( i ) * p k ( i ) ) 2 2 σ k 2 ( θ ˜ k ( i ) * κ k ( i ) ) 2 2 σ k 2 ,
and
( θ k ( i ) * + p k ( i ) ) 2 2 σ k 2 ( θ ˜ k ( i ) * κ ¯ k ( i ) ) 2 2 σ k 2 ,
which will lead to a smaller error exponent (cf. (A19)) and the optimality is proved. The solution to problem (A19) is
lim n min { R X k } k ω , { θ k ( i ) } k [ K ] ω , i = 1 , , m max { 1 n k [ K ] ω i = 1 m ( θ k ( i ) p k ( i ) ) 2 2 σ k 2 + D 0 ω ( { R X k } k ω ) , 1 n k [ K ] ω i = 1 m ( θ k ( i ) + p k ( i ) ) 2 2 σ k 2 + D 1 ω ( { R X k } k ω ) } = min { R X k } k ω , { θ k } k [ K ] ω max { D 0 ω ( { R X k } k ω ) + k [ K ] ω ( θ k p k ) 2 2 μ σ k 2 , D 1 ω ( { R X k } k ω ) + k [ K ] ω ( θ k + p k ) 2 2 μ σ k 2 } = E ω .

Appendix I

Based on the results in Appendix D, E ω as defined in (32) satisfies
$$E^\omega = \min_{\phi_\omega \in \mathbb{R}^{k_\omega},\ \{\theta_k\}_{k \in [K] \setminus \omega}} \max\biggl\{ \frac{1}{8}\bigl\|B_\omega\bigl(B_\omega^\dagger \psi_\omega^{(0)} - \phi_\omega\bigr)\bigr\|^2 + \sum_{k \in [K] \setminus \omega} \frac{(\theta_k - \sqrt{p_k})^2}{2\mu\sigma_k^2},\ \frac{1}{8}\bigl\|B_\omega\bigl(B_\omega^\dagger \psi_\omega^{(1)} - \phi_\omega\bigr)\bigr\|^2 + \sum_{k \in [K] \setminus \omega} \frac{(\theta_k + \sqrt{p_k})^2}{2\mu\sigma_k^2} \biggr\} + o(\epsilon^2),$$
where $k_\omega \triangleq \sum_{k \in \omega} |\mathcal{X}_k|$, and the result can then be verified using Lagrange multipliers.

References

  1. Han, T.S.; Amari, S. Statistical inference under multiterminal data compression. IEEE Trans. Inf. Theory 1998, 44, 2300–2324. [Google Scholar] [CrossRef]
  2. Ahlswede, R.; Csiszár, I. Hypothesis testing with communication constraints. IEEE Trans. Inf. Theory 1986, 32, 533–542. [Google Scholar] [CrossRef]
  3. Han, T.S.; Kobayashi, K. Exponential-type error probabilities for multiterminal hypothesis testing. IEEE Trans. Inf. Theory 1989, 35, 2–14. [Google Scholar] [CrossRef]
  4. Amari, S.I.; Han, T.S. Statistical inference under multiterminal rate restrictions: A differential geometric approach. IEEE Trans. Inf. Theory 1989, 35, 217–227. [Google Scholar] [CrossRef]
  5. Watanabe, S. Neyman–Pearson test for zero-rate multiterminal hypothesis testing. IEEE Trans. Inf. Theory 2017, 64, 4923–4939. [Google Scholar] [CrossRef]
  6. Shimokawa, H.; Han, T.S.; Amari, S. Error bound of hypothesis testing with data compression. In Proceedings of the 1994 IEEE International Symposium on Information Theory, Trondheim, Norway, 27 June–1 July 1994; p. 114. [Google Scholar] [CrossRef]
  7. Xu, X.; Huang, S.L. On Distributed Learning with Constant Communication Bits. IEEE J. Sel. Areas Inf. Theory 2022, 3, 125–134. [Google Scholar] [CrossRef]
  8. Sreekumar, S.; Gündüz, D. Strong Converse for Testing Against Independence over a Noisy channel. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 1283–1288. [Google Scholar] [CrossRef]
  9. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; Arcas, B.A.Y. Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, Fort Lauderdale, FL, USA, 20–22 April 2017; Volume 54, pp. 1273–1282. [Google Scholar]
  10. Vapnik, V. Principles of Risk Minimization for Learning Theory. In Proceedings of the 4th International Conference on Neural Information Processing Systems, San Francisco, CA, USA, 2–5 December 1991; pp. 831–838. [Google Scholar]
  11. Srivastava, N.; Salakhutdinov, R. Multimodal learning with deep boltzmann machines. J. Mach. Learn. Res. 2014, 15, 2949–2980. [Google Scholar]
  12. Cover, T.M.; Thomas, J.A. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing); Wiley-Interscience: Hoboken, NJ, USA, 2006. [Google Scholar]
  13. Huang, S.L.; Makur, A.; Wornell, G.W.; Zheng, L. On universal features for high-dimensional learning and inference. arXiv 2019, arXiv:1911.09105. [Google Scholar]
  14. Han, T.S. Hypothesis testing with multiterminal data compression. IEEE Trans. Inf. Theory 1987, 33, 759–772. [Google Scholar] [CrossRef]
  15. Scardapane, S.; Wang, D.; Panella, M.; Uncini, A. Distributed learning for random vector functional-link networks. Inf. Sci. 2015, 301, 271–284. [Google Scholar]
  16. Georgopoulos, L.; Hasler, M. Distributed machine learning in networks by consensus. Neurocomputing 2014, 124, 2–12. [Google Scholar] [CrossRef]
  17. Tsitsiklis, J.; Athans, M. On the complexity of decentralized decision making and detection problems. IEEE Trans. Autom. Control 1985, 30, 440–446. [Google Scholar] [CrossRef]
  18. Tsitsiklis, J.N. Decentralized detection by a large number of sensors. Math. Control. Signals Syst. 1988, 1, 167–182. [Google Scholar] [CrossRef]
  19. Tenney, R.R.; Sandell, N.R. Detection with distributed sensors. IEEE Trans. Aerosp. Electron. Syst. 1981, AES-17, 501–510. [Google Scholar] [CrossRef]
  20. Shalaby, H.M.; Papamarcou, A. Multiterminal detection with zero-rate data compression. IEEE Trans. Inf. Theory 1992, 38, 254–267. [Google Scholar] [CrossRef]
  21. Zhao, W.; Lai, L. Distributed testing with zero-rate compression. In Proceedings of the 2015 IEEE International Symposium on Information Theory (ISIT), Hong Kong, China, 14–19 June 2015; pp. 2792–2796. [Google Scholar]
  22. Sreekumar, S.; Gündüz, D. Distributed hypothesis testing over noisy channels. In Proceedings of the 2017 IEEE International Symposium on Information Theory (ISIT), Aachen, Germany, 25–30 June 2017; pp. 983–987. [Google Scholar]
  23. Zaidi, A. Hypothesis Testing Against Independence Under Gaussian Noise. In Proceedings of the 2020 IEEE International Symposium on Information Theory (ISIT), Los Angeles, CA, USA, 21–26 June 2020; pp. 1289–1294. [Google Scholar] [CrossRef]
  24. Salehkalaibar, S.; Wigger, M.A. Distributed hypothesis testing over a noisy channel. In Proceedings of the International Zurich Seminar on Information and Communication (IZS 2018), Zurich, Switzerland, 21–23 February 2018; pp. 25–29. [Google Scholar]
  25. Weinberger, N.; Kochman, Y.; Wigger, M. Exponent trade-off for hypothesis testing over noisy channels. In Proceedings of the 2019 IEEE International Symposium on Information Theory (ISIT), Paris, France, 7–12 July 2019; pp. 1852–1856. [Google Scholar]
  26. Boyd, S.; Vandenberghe, L. Convex Optimization; Cambridge University Press: Cambridge, UK, 2004. [Google Scholar]
  27. Csiszár, I.; Shields, P.C. Information Theory and Statistics: A Tutorial; Now Publishers Inc.: Delft, The Netherlands, 2004. [Google Scholar]
  28. Csiszár, I. The method of types [information theory]. IEEE Trans. Inf. Theory 1998, 44, 2505–2523. [Google Scholar] [CrossRef]
  29. Huang, S.L.; Xu, X.; Zheng, L. An information-theoretic approach to unsupervised feature selection for high-dimensional data. IEEE J. Sel. Areas Inf. Theory 2020, 1, 157–166. [Google Scholar] [CrossRef]
  30. Horn, R.A.; Johnson, C.R. Matrix Analysis; Cambridge University Press: Cambridge, UK, 2012. [Google Scholar]
  31. Graham, R.L.; Knuth, D.E.; Patashnik, O.; Liu, S. Concrete mathematics: A foundation for computer science. Comput. Phys. 1989, 3, 106–107. [Google Scholar] [CrossRef]
  32. Blair, J.; Edwards, C.; Johnson, J.H. Rational Chebyshev approximations for the inverse of the error function. Math. Comput. 1976, 30, 827–830. [Google Scholar] [CrossRef]
Figure 1. The transmission procedures for the type-based distributed hypothesis testing problem over noiseless channels.
Figure 2. The transmission procedures for the type-based distributed hypothesis testing problem over AWGN channels.
Figure 3. The geometric structure in distributed hypothesis testing, with $Q_{X^K}^{(i)}$ denoting the I-projection of $P_{X^K}^{(i)}$ onto the linear family $\mathcal{L}_{\mathcal{X}}(h^*)$, $i = 0, 1$; the linear family $\mathcal{L}_{\mathcal{X}}(h^*)$ divides $\mathcal{D}_0(E^*)$ and $\mathcal{D}_1(E^*)$ into different half-spaces.
Figure 4. The information decomposition structure in distributed hypothesis testing with $K = 2$ nodes, compared with the orthogonal decompositions on the subspace $\mathcal{R}(B_k)$ for each node $k = 1, 2$.
Figure 5. A geometric explanation of the genie-aided approach, which leads to $E^\omega$ as an upper bound on the error exponent in (4).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
