Article

Theoretical Bounds on the Number of Tests in Noisy Threshold Group Testing Frameworks

Department of Convergence Software, Mokpo National University, Muan 58554, Korea
Mathematics 2022, 10(14), 2508; https://doi.org/10.3390/math10142508
Submission received: 9 June 2022 / Revised: 15 July 2022 / Accepted: 15 July 2022 / Published: 19 July 2022

Abstract:
We consider a variant of group testing (GT) models called noisy threshold group testing (NTGT), in which a pool's test result is positive only when the number of defective samples it contains reaches a given threshold. We deal with a variant model of GT in which, as in the diagnosis of COVID-19 infection, if the virus concentration does not reach a threshold, not only do false positives and false negatives occur, but unexpected measurement noise can also flip a correct result near the threshold into an incorrect one. We aim to determine how many tests are needed to reconstruct a small set of defective samples in this kind of NTGT problem. To this end, we find the necessary and sufficient conditions on the number of tests required to reconstruct all defective samples. First, Fano's inequality is used to derive a lower bound on the number of tests, giving the necessary condition. Second, an upper bound is found using a MAP decoding method, giving the sufficient condition for reconstructing defective samples in the NTGT problem. As a result, we show that the necessary and sufficient conditions for the successful reconstruction of defective samples in NTGT coincide with each other. In addition, we show a trade-off between the defective rate of the samples and the density of the group matrix, which is then used to construct an optimal NTGT framework.

1. Introduction

Group Testing (GT) is an underdetermined problem introduced in [1], and numerous methods have been developed to solve such problems. GT has become relevant in various settings, including probabilistic approaches. The expansion of compressive sensing traces back to the fundamental idea of GT, since it is likewise an effort to find sparse signals [2,3]. Recently, academia has begun using the GT method as one vital approach to finding confirmed COVID-19 cases, showing this field's potential importance in these uncertain times [4,5].
The first study of GT was proposed by Dorfman [1]. The background to the emergence of GT is a large project conducted in the United States to find soldiers with syphilis during World War II. Syphilis testing of individuals involves taking a blood sample, then analyzing it to produce a positive or negative result for that patient. The syphilis testing carried out at the time was very inefficient, since it took a lot of time and money to test all the soldiers one by one [3]. After all, if N soldiers are individually tested for syphilis, N tests are required. Note that the number of soldiers infected with syphilis is very small compared to the total number of soldiers. That is why it is inefficient to test every soldier for syphilis one by one, and why the GT technique emerged. The initial GT model was performed in the following way [1]. Several soldiers' blood samples were randomly selected, put into a pool, and mixed. Then, the blood pool was checked to see whether it tested positive for syphilis. A positive result indicates that at least one of the soldiers in the pool was infected with syphilis; a negative result, on the other hand, indicates that all soldiers in the pool were free of syphilis. GT is attractive because the number of tests can be drastically reduced when few soldiers are infected with syphilis. After these beginnings, GT has mainly been studied with two different approaches, each forming a field of research of its own. One of these fields is how to generate GT models, that is, how to select the samples to be included in one test pool. The second area is how to reconstruct defective samples with as few tests as possible. GT loses its benefits if the need for many retests drives the total number of tests up to that of individual screening.
For GT, various models have been proposed depending on how the test results express positive and negative outcomes and on the presence or absence of noise. In general, a GT test result indicates whether the pool being tested contains one or more defective samples; that is, a positive or negative result indicates whether at least one defective sample is present in the pool. The model called quantitative GT [3] is a generalized framework of GT: its test result indicates the number of defective samples in the test pool. There is also another GT model called Threshold Group Testing (TGT) [6]. In the TGT model, the test result of a pool is positive or negative, as in conventional GT schemes. However, unlike the conventional GT model, a positive result occurs only when the number of defective samples in the pool reaches a given threshold; otherwise, the test outcome is negative. The TGT model is used because it can represent situations in which the test result differs depending on whether a quantity, such as the COVID-19 virus concentration, is high or low. A modified GT model in which measurement noise causes false negatives or false positives is also considered.
TGT problems have been dealt with in various areas such as the construction of TGT models [7], theoretical analysis of performance [8], and efficient model design [9,10]. However, there have been no studies so far quantifying how much measurement noise affects the performance of TGT models. In this paper, we consider a Noisy Threshold Group Testing (NTGT) model and provide guidelines for designing an NTGT model that is robust and reliable under measurement noise. To this end, a lower bound on the number of tests is derived using Fano's inequality, and we show a trade-off between the sparsity of the group matrix and the defective rate of the signal. We also obtain an upper bound on the probability of error using the MAP decoding method. Combining the lower and upper bounds, we show necessary and sufficient conditions on the number of tests required to find a given set of defective samples.

2. Related Work

We look through previous studies and their significance to GT. Then, we classify each type of problem related to current approaches to GT and consider the issues surrounding them. The study of GT first began in 1943 [1]. Dorfman made an effort to find a small number of syphilis-infected soldiers, performing GT with the following procedure. When testing for syphilis, all the soldiers were divided into groups of equal size, and individual testing was performed only on soldiers from the groups that had recorded positive test results. In [1], the optimal group size for a given total number of samples and defective rate was summarized and presented. Later, Sterrett improved the performance by slightly modifying the existing GT method [11]. The main idea of Sterrett's approach is that once the first positive result is obtained, the remaining untested individuals are put into one large group and tested. Other than that, there is no difference between Sterrett's method and Dorfman's. If the infection rate is low, Sterrett's method is more efficient because most of the samples are normal. A more general GT was presented in [12], in which several algorithms were developed for finding defective samples when no infection rate is known. The paper [12] also provided a link between information theory and GT, introduced new applications of GT, and discussed generalizations of GT.
GT is classified based on the type of defective sample distribution and the decoding approach. A probabilistic model uses the assumption that defective samples are generated from a given probability distribution. On the other hand, the combinatorial model attempts to find defective samples without knowledge of a probability distribution [13,14]. A typical example of this model is the minmax algorithm [15]. In [16], improved performance results for the combinatorial model were presented. Looking at other classes, the adaptive case is a model in which the samples to be included in a pool depend on the results of previous tests. The samples used in the next round are changed each time based on the results of previous tests; specifically, the selection of samples for the next pool is optimized using the results obtained from previous tests. Conversely, in the non-adaptive model, all tests are performed at the same time with a sample selection process defined in advance, so every test is independent of the others. This model offers the advantage of being able to run tests simultaneously regardless of the test order. When multiple predetermined stages are used, the non-adaptive model is extended to multi-stage models [1,17]. Although the adaptive model has more constraints in GT design than the non-adaptive model, the adaptive model generally outperforms the non-adaptive model [3]. However, recent research in [18] showed limitations in improving the performance of the adaptive model. Non-adaptive GTs are more efficient when all tests must be performed at the same time.
We now look at the significance of certain recent studies on noisy GT. The work in [19] showed the information-theoretic performance of GT with and without measurement noise. Several recent studies have shown interesting and significant performance. In [17], the proposed algorithm uses the positive rate of the groups each sample is included in; if it is greater than a set value, the sample is considered defective. This approach does not achieve optimal performance in all domains, but it follows a scaling law in a specific domain. In [19], testing is treated separately for signals, and all of the group testing is carried out while still considering each sample. That is, although no individual testing is performed, each sample carries a binary value such as positive or negative. In the case of samples affected by symmetric noise, it was shown that the minimum number of tests is proportional to the optimal information-theoretic bound of K log N for identifying any K defective samples in a population of N samples [19].
In [20,21], GT algorithms for noisy additive models were presented using message passing and linear programming. Although it does not guarantee optimal decoding performance, the algorithm proposed in [22] achieves realistic runtimes in the case of a large population. Although many studies have been performed on the noiseless version of GT models, they rest on the assumption that the test results are always clean, which is not realistic. In addition, most noisy GT approaches deal with measurement noise by considering a symmetric noise model, such as the binary symmetric channel of channel coding theory. The symmetric noise model referred to in this paper assumes that false negatives and false positives occur with the same probability. However, asymmetric noise models are more natural than symmetric ones in various applications. For example, data forensics [23] is an application of noisy GT models, where the goal is to identify whether recorded files have been changed.

3. Noisy Threshold Group Testing Framework

3.1. Problem Statement

We define our NTGT problem. Let the input $\mathbf{x}$ be a binary vector of size $N$, $\mathbf{x} = (x_1, x_2, \ldots, x_N) \in \{0,1\}^N$. For $i \in [N]$, $x_i$ is the $i$-th element of $\mathbf{x}$ and indicates whether the $i$-th sample is defective: $x_i = 1$ if the $i$-th sample is defective, and $x_i = 0$ otherwise. Throughout this work, we assume that $x_i$ has the following probability,

$$\Pr[x_i = \alpha] = \begin{cases} 1-\delta & \text{if } \alpha = 0, \\ \delta & \text{if } \alpha = 1, \end{cases} \qquad (1)$$

where $\delta$ is the defective sample rate and $\alpha$ is a dummy variable. The defective sample rate is less than 0.5, $0 < \delta < 0.5$, and is considered a small value in GT problems.
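As a concrete illustration, the i.i.d. model in (1) can be simulated as follows; the values of N and δ below are hypothetical, not taken from the paper.

```python
import numpy as np

# Sketch of sampling a defective-indicator vector x from the i.i.d. model (1).
# N and delta are illustrative values, not from the paper.
N, delta = 10, 0.2
rng = np.random.default_rng(0)
x = (rng.random(N) < delta).astype(int)   # x_i = 1 with probability delta
```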
As mentioned earlier, one of the key points in GT problems is determining which samples participate in a pool. In this paper, samples to be included in a pool are selected using a non-adaptive model. We use a matrix as a concise way to define the samples included in each pool. Let the group matrix $\mathbf{A} \in \{0,1\}^{M \times N}$ have $M$ rows and $N$ columns, where $M$ is the number of tests in the NTGT model. Note that we aim for a small $M$, the number of tests required to reconstruct the signal $\mathbf{x}$. If the $j$-th test includes the $i$-th sample $x_i$, we write $A_{ji} = 1$; otherwise, $A_{ji} = 0$. That is, whether the $i$-th sample is included in the $j$-th test is expressed by the binary value of each element $A_{ji}$ of the group matrix. Although the $d$-separable matrix and the $d$-disjunct matrix [3] have been used to design group matrices, randomly selecting the elements of the group matrix is also known to be a good design method [3]. For $i \in [N]$ and $j \in [M]$, the entries $A_{ji}$ are independent and identically distributed as follows:

$$\Pr[A_{ji} = \alpha] = \begin{cases} 1-\gamma & \text{if } \alpha = 0, \\ \gamma & \text{if } \alpha = 1, \end{cases} \qquad (2)$$

where $\gamma$ denotes the sparsity of the group matrix and the range of $\gamma$ is $0 < \gamma < 1$. As $\gamma$ increases, the density of the group matrix also increases; conversely, as $\gamma$ gets smaller, increasingly sparse group matrices are designed. It should be noted that the computational complexity of the GT framework also increases when a group matrix is constructed from a large $\gamma$. Therefore, it is necessary to design GT frameworks with group matrices that are as sparse as possible while preserving reconstruction performance. We will consider how the relationship between $\delta$ and $\gamma$ affects the number of tests required for signal reconstruction.
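A randomly generated group matrix following (2) can be sketched as below; M, N, and γ are illustrative values, not from the paper.

```python
import numpy as np

# Sketch: an M x N group matrix whose entries are i.i.d. Bernoulli(gamma), per (2).
# M, N, and gamma are illustrative values, not from the paper.
M, N, gamma = 7, 10, 0.3
rng = np.random.default_rng(1)
A = (rng.random((M, N)) < gamma).astype(int)  # A[j, i] = 1 means sample i joins test j
```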
The reason we consider the NTGT model is as follows. Consider a model that could be used for the diagnosis of COVID-19 infection. There are cases in which a COVID-19 test shows a false positive or false negative result when the concentration of the virus is low or the sample is contaminated. The current diagnosis of COVID-19 infection is positive when the virus concentration is above a certain level. During the incubation period or the early stage of infection, the virus concentration is low, and false negative results may be obtained. In addition, even if the COVID-19 infection is confirmed using a precise and accurate diagnostic method, the result is sometimes reversed due to unexpected measurement noise. Throughout this work, an NTGT model suited to these challenges is considered. In other words, we consider a TGT scheme in which positive and negative results are determined by the quantitative concentration, and we adopt an additive noise model because measurement noise can reverse the results. In a recent study [24] on the diagnosis of COVID-19 infection, false positive and false negative rates were reported to lie between 0.1% and 4.5%. We obtain lower and upper performance bounds on the NTGT model in Section 4 and Section 5.
TGT is different from conventional GT models. In conventional GT, if at least one defective sample exists in a test, the output is positive in the absence of measurement noise. However, a TGT result is positive only when the number of defective samples is greater than or equal to the predefined threshold $T$. For example, $T = 3$ means that a positive result occurs only when there are at least three defective samples in the pool; if there is only one defective sample in the pool, the result is negative. In other words, the result of a pool becomes positive only when the count reaches $T$ in TGT models, just as whether a COVID-19 diagnosis is positive or negative depends on whether the virus concentration is high or low. Conventional GT uses $T = 1$. The following (3) presents the output of a TGT model. Let $z_j$ be the noise-free result of the $j$-th test pool, where $z_j = 1$ indicates a positive result and $z_j = 0$ a negative one, $j \in [M]$, $\mathbf{z} = (z_1, z_2, \ldots, z_M)$.
$$z_j = \begin{cases} 0 & \text{if } \sum_{i=1}^{N} A_{ji} x_i < T, \\ 1 & \text{if } \sum_{i=1}^{N} A_{ji} x_i \ge T, \end{cases} \qquad (3)$$
Throughout this paper, we consider the NTGT framework with measurement noise, assuming a model whose results can be flipped by the noise. $z_j$ is the noise-free result of the pool test, and additive noise can convert it from positive to negative and vice versa. For the NTGT model, the additive noise is defined as follows:

$$\Pr[e_j = \alpha] = \begin{cases} 1-\eta & \text{if } \alpha = 0, \\ \eta & \text{if } \alpha = 1, \end{cases} \qquad (4)$$

where $\eta$ is the measurement noise level, and we assume all $e_j$ are independent of each other. Therefore, the $j$-th output $y_j$ in the NTGT model can be written as

$$y_j = z_j \oplus e_j \qquad (5)$$

where the symbol $\oplus$ denotes the logical XOR operation. We denote $\mathbf{y} = (y_1, y_2, \ldots, y_M)$ and $\mathbf{e} = (e_1, e_2, \ldots, e_M)$.
Figure 1 shows an example of this NTGT. In this example, two samples out of ten are defective, as realized from (1). As shown in Figure 1, the number of tests is $M = 7$. The $7 \times 10$ group matrix is constructed by (2) mentioned above. In the noiseless version, the vector $\mathbf{z}$ is $(0, 0, 1, 0, 0, 0, 0)$ with $T = 2$: only in the third test does the number of defective samples reach two, so only that test result is positive. When additive noise is added as defined in (4), the output becomes $\mathbf{y} = (1, 0, 1, 0, 0, 0, 0)$.
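The forward model (3)-(5) used in this example can be sketched as follows; the specific matrix, defective set, and noise realization below are illustrative and do not reproduce Figure 1 exactly.

```python
import numpy as np

# Sketch of the NTGT forward model (3)-(5) with threshold T = 2.
# The matrix, defective set, and noise pattern are illustrative, not Figure 1's.
T = 2
x = np.array([1, 0, 1, 0, 0, 0, 0, 0, 0, 0])   # two defectives among ten samples
rng = np.random.default_rng(2)
A = (rng.random((7, 10)) < 0.4).astype(int)    # 7 x 10 group matrix per (2)

z = (A @ x >= T).astype(int)              # noiseless threshold results, per (3)
e = (rng.random(7) < 0.05).astype(int)    # measurement noise per (4), eta = 0.05
y = z ^ e                                 # observed results y_j = z_j XOR e_j, per (5)
```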

3.2. Decoding

We use a maximum a posteriori (MAP) method to reconstruct a signal $\mathbf{x}$ in the NTGT model.

$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} P(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) \qquad (6)$$

The a posteriori probability in (6) is as follows:

$$P(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) = \frac{P(\mathbf{x}, \mathbf{y}, \mathbf{A})}{P(\mathbf{y}, \mathbf{A})} \propto P(\mathbf{x}, \mathbf{y}, \mathbf{A}) = \sum_{\mathbf{e}} P(\mathbf{x}, \mathbf{y}, \mathbf{A}, \mathbf{e}) = \sum_{\mathbf{e}} P(\mathbf{x}) P(\mathbf{A}) P(\mathbf{e}) P(\mathbf{y} \mid \mathbf{x}, \mathbf{A}, \mathbf{e}) \qquad (7)$$

The last line of (7) is obtained using the independence of $\mathbf{x}$, $\mathbf{A}$, and $\mathbf{e}$, while the conditional probability $P(\mathbf{y} \mid \mathbf{x}, \mathbf{A}, \mathbf{e})$ is an indicator function satisfying the following condition:

$$P(\mathbf{y} \mid \mathbf{x}, \mathbf{A}, \mathbf{e}) = \begin{cases} 1 & \text{if } \mathbf{y} = \mathbf{z} \oplus \mathbf{e}, \\ 0 & \text{if } \mathbf{y} \ne \mathbf{z} \oplus \mathbf{e}, \end{cases} \qquad (8)$$

We define an error event as the case where $\hat{\mathbf{x}}$ from (6) differs from the true realization of $\mathbf{x}$. In other words, the probability of error is expressed as $P(E) = \Pr[\hat{\mathbf{x}} \ne \mathbf{x}]$.
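For small N, the MAP rule (6)-(8) can be made concrete by brute force: given y and A, each candidate x determines z via (3) and hence a unique e = z XOR y, so the objective reduces to maximizing P(x)P(e). The following sketch uses illustrative parameters and a hypothetical toy instance.

```python
import itertools
import numpy as np

# Brute-force MAP decoder sketch for small N. Given y and A, each candidate x
# fixes z via the threshold rule (3) and hence the unique noise e = z XOR y,
# so maximizing the posterior (6)-(8) reduces to maximizing P(x) * P(e).
def map_decode(y, A, T, delta, eta):
    M, N = A.shape
    best, best_score = None, -np.inf
    for bits in itertools.product([0, 1], repeat=N):
        x = np.array(bits)
        z = (A @ x >= T).astype(int)
        e = z ^ y                           # the only e consistent with (5)
        k, d = int(x.sum()), int(e.sum())
        # log P(x) + log P(e), from the i.i.d. models (1) and (4)
        score = (k * np.log(delta) + (N - k) * np.log(1 - delta)
                 + d * np.log(eta) + (M - d) * np.log(1 - eta))
        if score > best_score:
            best, best_score = x, score
    return best

# Toy instance: four samples, one test per pair, threshold T = 2, noiseless y.
A = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [1, 0, 0, 1],
              [0, 1, 1, 0], [0, 1, 0, 1], [0, 0, 1, 1]])
x_true = np.array([1, 1, 0, 0])
y = (A @ x_true >= 2).astype(int)
x_hat = map_decode(y, A, T=2, delta=0.2, eta=0.05)
```

In this noiseless toy instance, the decoder recovers the true defective set because any competing candidate must pay for either extra defectives or implausible noise flips.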

3.3. Bounds for Group Testing Schemes

Now consider the number of tests for successful decoding in conventional GT models. The number of tests required to identify $K$ defective samples out of $N$ samples for an adaptive GT algorithm with perfect reconstruction is denoted $m(N, K)$; for the non-adaptive case, the number of tests is denoted $\bar{m}(N, K)$. The number of tests $N$ required for individual testing is no smaller than $\bar{m}(N, K)$. Adaptive GT models require at most as many tests as non-adaptive GTs because they can use the results of previous tests when choosing the next ones, $m(N, K) \le \bar{m}(N, K)$. Even if the number of defective samples is one, at least one test must be performed, $1 \le m(N, K)$. Therefore, the number of tests ranges as follows:

$$1 \le m(N, K) \le \bar{m}(N, K) \le N \qquad (9)$$
From an information-theoretic bound, the minimum number of tests $M$ for a GT framework with a sample space is obtained as [3],

$$M \ge \log_2 |\mathcal{S}| \qquad (10)$$

where $\mathcal{S}$ denotes the sample space. In addition, an information-theoretic performance bound is available even for GT frameworks that permit a small error probability. It is expressed as an upper bound on the success probability in terms of the number of tests required for successful decoding. Any GT algorithm obeys the following bound on the success probability $P_s$ for decoding of defective samples [25]:

$$P_s \le \frac{M}{\log_2 \binom{N}{K}} \qquad (11)$$
In the past half century, many studies on GT models have been performed; among them, well-known and important GT algorithms are introduced next. The first to be considered is the binary splitting algorithm [3]. This algorithm solves existing GT problems efficiently and is applicable to adaptive GT models. The reason this algorithm is still used for GT problems is its simplicity and good performance. The number of tests required to reconstruct defective samples using the binary splitting algorithm is known through the following bounds:

$$M = \begin{cases} N & \text{if } N \le 2K - 2, \\ \log_2 \sigma + 2K + p - 1 & \text{if } N \ge 2K - 1, \end{cases} \qquad (12)$$

where $\sigma$ is the number of samples to be included in one test, and $p$ is a uniquely determined nonnegative integer satisfying $p < K$.
Next, the definite defectives algorithm [26] is considered. This algorithm is suitable for non-adaptive GT models because an unknown input signal can be reconstructed using all of the test results at once through an iterative process. A feature of the definite defectives algorithm is that it can eliminate false negatives that may occur during the reconstruction process. As a result, the definite defectives algorithm is especially useful in applications that are sensitive to false negatives or cannot tolerate them. For given $N$ and $K$, the definite defectives algorithm has the following lower bound on the number of tests $M$ required to identify the defective samples if an error rate of $\sigma$ is allowed,

$$M \ge (1 - \sigma) \log_2 \binom{N}{K} \qquad (13)$$

It can be observed that (11) and (13) coincide in the case of perfect reconstruction of defective samples.
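As a quick numerical check of the counting bound and the definite defectives bound (13), the following sketch evaluates both; N, K, and σ are illustrative values, not from the paper.

```python
import math

# The counting bound: distinguishing all C(N, K) defective sets needs at least
# log2 C(N, K) tests; the definite defectives bound (13) relaxes this by a
# factor (1 - sigma). N, K, and sigma are illustrative values.
N, K, sigma = 1000, 10, 0.05
m_perfect = math.log2(math.comb(N, K))   # perfect-reconstruction minimum
m_dd = (1 - sigma) * m_perfect           # lower bound when error rate sigma is allowed
```

With σ = 0 the two bounds coincide, matching the observation above about perfect reconstruction.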

4. Necessary Condition for Complete Recovery

4.1. Lower Bound

In this section, we take into account a necessary condition for the number of tests required to identify defective samples in the NTGT model. We obtain the necessary condition using Fano's inequality [27] from information theory. Fano's inequality is mainly exploited in channel coding theory and describes the connection between error probability and entropy. In addition, in [28], the authors reviewed GT problems comprehensively and in depth from an information theory perspective. The lower bound on the probability of error is obtained by considering Fano's inequality. From this lower bound, we are led to the necessary condition for the number of tests to find all defective samples in the NTGT model. We first state Fano's inequality before deriving the necessary condition.
Theorem 1 
(Fano’s inequality [27]). Suppose there are random variables $A$ and $B$ over finite alphabets. If a decoding function $\Phi$ that estimates $A$ from $B$ is used, the following inequality holds:

$$1 + P(\Phi(B) \ne A) \log_2 |\mathcal{A}| \ge H(A \mid B) \qquad (14)$$

where $P(\Phi(B) \ne A)$ is the probability of error for the decoding function $\Phi$, $|\mathcal{A}|$ is the alphabet size of $A$, and the conditional entropy $H(A \mid B)$ is defined as follows:

$$H(A \mid B) = -\sum_{\alpha \in \mathcal{A}} \sum_{\beta \in \mathcal{B}} P_{A,B}(\alpha, \beta) \log P_{A \mid B}(\alpha \mid \beta) \qquad (15)$$

where $P_{A,B}$ and $P_{A \mid B}$ are the joint and conditional probabilities, respectively.
In the NTGT problem, we are able to obtain a lower bound on the probability of error. This lower bound yields the minimum number of tests required to reconstruct an unknown signal, regardless of which decoding function is used. Our lower bound is a variant of the results obtained in [8]; compared to [8], this work derives the lower bound taking the measurement noise into account. The overall derivations are similar because both use Fano's inequality.
Theorem 2 
(Lower bound). For any decoding function, with the unknown sample signal defined in (1) and the measurement noise defined in (4), a necessary condition for the probability of error $P(E)$ to be less than an arbitrarily small positive value $\rho$, $P(E) < \rho$, is

$$\frac{N H(\delta) - M + M H(\eta) - 1}{N} < \rho \qquad (16)$$

where $H(\cdot)$ is the binary entropy function.
Proof of Theorem 2.
Let $\hat{\mathbf{x}}$ be the estimate of $\mathbf{x}$ found using the decoding function. Viewing the process as a Markov chain, we can write $\mathbf{x} \to (\mathbf{y}, \mathbf{A}) \to \hat{\mathbf{x}}$. Then, the following inequality is satisfied,

$$H(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) \le H(\mathbf{x} \mid \hat{\mathbf{x}}) \qquad (17)$$

Further, from Fano's inequality described in (14), the conditional entropy is bounded by

$$H(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) \le 1 + P(E) \log_2 (2^N - 1) \qquad (18)$$

Then, the probability of error is bounded in terms of the conditional entropy and the total number of samples $N$,

$$P(E) \ge \frac{H(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) - 1}{N} \qquad (19)$$
We now need to tackle the conditional entropy $H(\mathbf{x} \mid \mathbf{y}, \mathbf{A})$. Let us expand it in more detail:

$$H(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) = H(\mathbf{x}) - I(\mathbf{x}; \mathbf{y}, \mathbf{A}) = H(\mathbf{x}) - \left[ I(\mathbf{x}; \mathbf{A}) + I(\mathbf{x}; \mathbf{y} \mid \mathbf{A}) \right] \overset{(a)}{=} H(\mathbf{x}) - H(\mathbf{y} \mid \mathbf{A}) + H(\mathbf{y} \mid \mathbf{A}, \mathbf{x}) \qquad (20)$$

where $I(\cdot\,;\cdot)$ is mutual information, and equality (a) comes from the fact that $\mathbf{x}$ and $\mathbf{A}$ are independent of each other, so $I(\mathbf{x}; \mathbf{A}) = 0$. Note that the smaller the right side of (19), the lower the minimum achievable probability of error. This means the conditional entropy $H(\mathbf{x} \mid \mathbf{y}, \mathbf{A})$ should be as small as possible. Hence, on the last line of (20), the conditional entropy $H(\mathbf{y} \mid \mathbf{A})$ should be large and, conversely, the conditional entropy $H(\mathbf{y} \mid \mathbf{A}, \mathbf{x})$ should be small.
To do this, let us find the maximum and minimum values of the two conditional entropies, respectively.

$$H(\mathbf{y} \mid \mathbf{A}) \le H(\mathbf{y}) = H(\mathbf{z} \oplus \mathbf{e}) \le M \qquad (21)$$

where the first inequality is due to the fact that conditioning cannot increase entropy, and the last inequality comes from the fact that each result $y_j$ is either 0 or 1, the $y_j$ are independent of each other, and the binary entropy is maximized at 1 when $\Pr[y_j = 0] = \Pr[y_j = 1]$. Next, we take into account the other conditional entropy $H(\mathbf{y} \mid \mathbf{A}, \mathbf{x})$, which is minimized,

$$H(\mathbf{y} \mid \mathbf{A}, \mathbf{x}) = H(\mathbf{z} \oplus \mathbf{e} \mid \mathbf{A}, \mathbf{x}) = H(\mathbf{e}) = M H(\eta) \qquad (22)$$

where the second equality comes from the fact that the randomness of $\mathbf{z}$ vanishes once $\mathbf{x}$ and $\mathbf{A}$ are known, and the last equality is due to the independence of the elements of $\mathbf{e}$. Using (21) and (22), together with $H(\mathbf{x}) = N H(\delta)$, (20) can be bounded as

$$H(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) \ge N H(\delta) - M + M H(\eta) \qquad (23)$$
Finally, substituting (23) into (19) and requiring the condition $P(E) < \rho$, where $\rho$ is a small positive value, the following condition holds:

$$\frac{N H(\delta) - M + M H(\eta) - 1}{N} < \rho \qquad (24)$$
This completes the proof of Theorem 2. □

4.2. Construction of Noisy Threshold Group Testing

We now consider the result obtained from Theorem 2. First, Theorem 2 can be expressed as the ratio of the number of tests to the total number of samples as follows:

$$\frac{M}{N} > \frac{H(\delta) - \rho}{1 - H(\eta)} \qquad (25)$$
It is advantageous to use the NTGT framework as long as $M$ is smaller than $N$; otherwise, when $M > N$, individual testing becomes more effective than GT. This shows that NTGT can theoretically be used under the following noise condition:

$$H(\eta) < 1 + \rho - H(\delta) \qquad (26)$$
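Numerically, the necessary condition (25) and the noise condition (26) can be evaluated with the binary entropy function; the values of δ, η, and ρ below are illustrative, not from the paper.

```python
import math

def H(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Minimum test ratio M/N from (25) and the usable-noise condition (26).
# delta, eta, and rho are illustrative values, not from the paper.
delta, eta, rho = 0.1, 0.01, 1e-3
min_ratio = (H(delta) - rho) / (1 - H(eta))   # necessary condition on M/N
noise_ok = H(eta) < 1 + rho - H(delta)        # noise condition (26)
```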
To design an NTGT framework, how the group matrix is constructed is important. The key is shown in the proof of Theorem 2. Looking carefully at the conditions under which the inequality of conditional entropy in (21) holds with equality, the maximum conditional entropy $H(\mathbf{y} \mid \mathbf{A})$ is obtained when $\Pr[y_j = 0] = \Pr[y_j = 1]$. This means the NTGT system should be designed so that each output is 0 or 1 with equal probability. Since $\mathbf{x}$ and $\mathbf{A}$ are independent of each other, each sample is both included in a test and defective with probability $\delta\gamma$, so the probability of an output of 0 is as follows:

$$\Pr[y_j = 0] = \sum_{t=0}^{T-1} \binom{N}{t} (\delta\gamma)^t (1 - \delta\gamma)^{N-t} = \frac{1}{2} \qquad (27)$$
As shown in (27), it can be seen that there is a trade-off between δ and γ . In other words, to reconstruct a sparse signal, a high-density group matrix needs to be generated and used. Conversely, if the signal is not sparse, the group matrix should be designed with low density.
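The trade-off in (27) can be made concrete by solving for γ by bisection, since the left side decreases monotonically in δγ; N, T, and the δ values below are illustrative, not from the paper.

```python
import math

# Solve the design equation (27), Pr[y_j = 0] = 1/2, for the matrix density
# gamma given a defective rate delta. N, T, and the delta values are illustrative.
def p_zero(dg, N, T):
    # Probability of fewer than T defectives in a pool when each sample joins
    # and is defective with probability dg = delta * gamma (Binomial tail)
    return sum(math.comb(N, t) * dg**t * (1 - dg)**(N - t) for t in range(T))

def solve_gamma(delta, N, T, tol=1e-12):
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if p_zero(delta * mid, N, T) > 0.5:   # p_zero decreases as gamma grows
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma_sparse = solve_gamma(delta=0.05, N=100, T=2)  # sparse signal
gamma_dense  = solve_gamma(delta=0.20, N=100, T=2)  # denser signal
```

Running this sketch, gamma_sparse exceeds gamma_dense, matching the stated trade-off: the sparser the signal, the denser the group matrix must be.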

5. Sufficient Condition for Average Performance

5.1. Upper Bound

Now we prove an upper bound on the probability of error for the MAP decoding used in NTGT. We divide the proof into two parts: one considers the definition of the error event, and the other formulates the probability of error.
We rewrite the a posteriori probability.

$$P(\mathbf{x} \mid \mathbf{y}, \mathbf{A}) \propto \sum_{\mathbf{e}} P(\mathbf{x}) P(\mathbf{A}) P(\mathbf{e}) \, \mathbf{1}[\mathbf{y} = \mathbf{z} \oplus \mathbf{e}] \qquad (28)$$

Note that both $\mathbf{A}$ and $\mathbf{y}$ are given and known. Using MAP decoding, we estimate with (28)

$$\hat{\mathbf{x}} = \arg\max_{\mathbf{x}} \sum_{\mathbf{e}} P(\mathbf{x}) P(\mathbf{A}) P(\mathbf{e}) \, \mathbf{1}[\mathbf{y} = \mathbf{z} \oplus \mathbf{e}] \qquad (29)$$
An error event occurs if there is a feasible vector $\bar{\mathbf{x}} \ne \mathbf{x}$ such that

$$\sum_{\mathbf{v}} P(\bar{\mathbf{x}}) P(\mathbf{v}) \, \mathbf{1}[\mathbf{y} = \mathbf{w} \oplus \mathbf{v}] \ge \sum_{\mathbf{e}} P(\mathbf{x}) P(\mathbf{e}) \, \mathbf{1}[\mathbf{y} = \mathbf{z} \oplus \mathbf{e}] \qquad (30)$$

where $\mathbf{w}$, with $w_j = \mathbf{1}\left[\sum_{i=1}^{N} A_{ji} \bar{x}_i \ge T\right]$, comes from (3), and $\mathbf{v}$ is a realization from (4). Given $\mathbf{y}$, $\mathbf{A}$, and $\mathbf{x}$, there is exactly one consistent noise vector $\mathbf{e}$, namely $\mathbf{e} = \mathbf{z} \oplus \mathbf{y}$. Then we can rewrite (30) as

$$P(\bar{\mathbf{x}}) P(\mathbf{v} = \mathbf{y} \oplus \mathbf{w}) \ge P(\mathbf{x}) P(\mathbf{e} = \mathbf{y} \oplus \mathbf{z}) \qquad (31)$$
Therefore, an error event is equivalent to the existence of a pair $(\bar{\mathbf{x}}, \mathbf{v})$ such that

$$\bar{\mathbf{x}} \ne \mathbf{x}, \quad \mathbf{y} = \mathbf{w} \oplus \mathbf{v} = \mathbf{z} \oplus \mathbf{e}, \quad P(\bar{\mathbf{x}}) P(\mathbf{v}) \ge P(\mathbf{x}) P(\mathbf{e}) \qquad (32)$$
So far, we have defined the error event; now we derive an upper bound on the probability of error. Given $\mathbf{x}$ and $\mathbf{e}$, let $P_I(\mathbf{x}, \mathbf{e})$ be the conditional error probability. The average error probability is then

$$P(E) = \sum_{\mathbf{x}} \sum_{\mathbf{e}} P(\mathbf{x}, \mathbf{e}) P_I(\mathbf{x}, \mathbf{e}) \qquad (33)$$
We now introduce two typical sets as defined in [27] (Ch. 3.1). Let $\mathcal{A}_{\varepsilon}^{N}(\mathbf{x})$ and $\mathcal{A}_{\varepsilon}^{M}(\mathbf{e})$ be the typical sets of $\mathbf{x}$ and $\mathbf{e}$ with respect to $P(\mathbf{x})$ and $P(\mathbf{e})$ as defined in (1) and (4). For any positive number $\varepsilon$ and sufficiently large $N$ and $M$, the two typical sets are defined as

$$\mathcal{A}_{\varepsilon}^{N}(\mathbf{x}) = \left\{ \mathbf{x} \in \{0,1\}^N : \left| -\frac{1}{N} \log P(\mathbf{x}) - H(\delta) \right| \le \varepsilon \right\} \qquad (34)$$

and

$$\mathcal{A}_{\varepsilon}^{M}(\mathbf{e}) = \left\{ \mathbf{e} \in \{0,1\}^M : \left| -\frac{1}{M} \log P(\mathbf{e}) - H(\eta) \right| \le \varepsilon \right\} \qquad (35)$$
From the Shannon–McMillan–Breiman theorem [27] (Ch. 16.8), we obtain the following two bounds:

$$\Pr\left[ \left| -\frac{1}{N} \log P(\mathbf{x}) - H(\delta) \right| \le \varepsilon \right] \ge 1 - \varepsilon \qquad (36)$$

and

$$\Pr\left[ \left| -\frac{1}{M} \log P(\mathbf{e}) - H(\eta) \right| \le \varepsilon \right] \ge 1 - \varepsilon \qquad (37)$$
Now we define the space of pairs $(\mathbf{x}, \mathbf{e})$ with respect to the two typical sets. Let $U$ and $U^c$ be the sets of pairs $(\mathbf{x}, \mathbf{e})$ such that

$$U = \left\{ (\mathbf{x}, \mathbf{e}) \in \{0,1\}^N \times \{0,1\}^M : \mathbf{x} \in \mathcal{A}_{\varepsilon}^{N}(\mathbf{x}) \text{ and } \mathbf{e} \in \mathcal{A}_{\varepsilon}^{M}(\mathbf{e}) \right\} \qquad (38)$$

and

$$U^c = \left\{ (\mathbf{x}, \mathbf{e}) \in \{0,1\}^N \times \{0,1\}^M : \mathbf{x} \notin \mathcal{A}_{\varepsilon}^{N}(\mathbf{x}) \text{ or } \mathbf{e} \notin \mathcal{A}_{\varepsilon}^{M}(\mathbf{e}) \right\} \qquad (39)$$

where $U$ is the joint typical set of the pair $(\mathbf{x}, \mathbf{e})$, since $\mathbf{x}$ and $\mathbf{e}$ are independent.
Theorem 3 
(Upper bound). In the NTGT model, with the distribution of defective samples defined in (1) and the noise probability defined in (4), for any small ε, the probability of error can be made small provided the ratio of the number of tests $M$ to the total number of samples $N$ satisfies

$$\frac{M}{N} > \frac{H(\delta) + \varepsilon}{1 - H(\eta) - \varepsilon} \qquad (40)$$
Proof of Theorem 3.
The probability of error is bounded as

$$P(E) = \sum_{(\mathbf{x}, \mathbf{e}) \in U} P(\mathbf{x}) P(\mathbf{e}) P_I(\mathbf{x}, \mathbf{e}) + \sum_{(\mathbf{x}, \mathbf{e}) \in U^c} P(\mathbf{x}) P(\mathbf{e}) P_I(\mathbf{x}, \mathbf{e}) \overset{(a)}{\le} \sum_{(\mathbf{x}, \mathbf{e}) \in U} P(\mathbf{x}) P(\mathbf{e}) P_I(\mathbf{x}, \mathbf{e}) + \sum_{(\mathbf{x}, \mathbf{e}) \in U^c} P(\mathbf{x}) P(\mathbf{e}) \overset{(b)}{\le} \sum_{(\mathbf{x}, \mathbf{e}) \in U} P(\mathbf{x}) P(\mathbf{e}) P_I(\mathbf{x}, \mathbf{e}) + 2\varepsilon \qquad (41)$$

where (a) is due to $P_I(\mathbf{x}, \mathbf{e}) \le 1$, and (b) comes from the following:

$$\sum_{(\mathbf{x}, \mathbf{e}) \in U^c} P(\mathbf{x}) P(\mathbf{e}) = 1 - \sum_{(\mathbf{x}, \mathbf{e}) \in U} P(\mathbf{x}) P(\mathbf{e}) = 1 - \sum_{\mathbf{x} \in \mathcal{A}_{\varepsilon}^{N}(\mathbf{x})} P(\mathbf{x}) \sum_{\mathbf{e} \in \mathcal{A}_{\varepsilon}^{M}(\mathbf{e})} P(\mathbf{e}) \le 1 - (1 - \varepsilon)(1 - \varepsilon) \le 2\varepsilon \qquad (42)$$
Because $\mathbf{A}$ is randomly generated as defined in (2), the event that a competing pair produces the same observation is itself random; we define this event as

$$\mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v}) = \left\{ \mathbf{z} \oplus \mathbf{e} = \mathbf{w} \oplus \mathbf{v} \right\} \qquad (43)$$
The conditional error probability $P_I(\mathbf{x}, \mathbf{e})$ is the probability of the union of all events in (43) over all pairs $(\bar{\mathbf{x}}, \mathbf{v})$ satisfying (32). Thus, the conditional error probability in (33) can be rewritten as

$$P_I(\mathbf{x}, \mathbf{e}) = \Pr\left[ \bigcup_{(\bar{\mathbf{x}}, \mathbf{v}) : P(\bar{\mathbf{x}}) P(\mathbf{v}) \ge P(\mathbf{x}) P(\mathbf{e})} \mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v}) \right] \qquad (44)$$

Using the union bound in (41), we have the following bound:

$$P(E) \le \sum_{(\mathbf{x}, \mathbf{e}) \in U} P(\mathbf{x}) P(\mathbf{e}) \sum_{(\bar{\mathbf{x}}, \mathbf{v}) : P(\bar{\mathbf{x}}) P(\mathbf{v}) \ge P(\mathbf{x}) P(\mathbf{e})} \Pr[\mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v})] + 2\varepsilon = \sum_{(\mathbf{x}, \mathbf{e}) \in U} P(\mathbf{x}) P(\mathbf{e}) \sum_{(\bar{\mathbf{x}}, \mathbf{v})} \Pr[\mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v})] \, \Phi(\mathbf{x}, \bar{\mathbf{x}}, \mathbf{e}, \mathbf{v}) + 2\varepsilon \qquad (45)$$
where $\Phi(\mathbf{x}, \bar{\mathbf{x}}, \mathbf{e}, \mathbf{v})$ is the indicator function of the condition $P(\bar{\mathbf{x}}) P(\mathbf{v}) \ge P(\mathbf{x}) P(\mathbf{e})$:

$$\Phi(\mathbf{x}, \bar{\mathbf{x}}, \mathbf{e}, \mathbf{v}) = \begin{cases} 1 & \text{if } P(\bar{\mathbf{x}}) P(\mathbf{v}) \ge P(\mathbf{x}) P(\mathbf{e}), \\ 0 & \text{if } P(\bar{\mathbf{x}}) P(\mathbf{v}) < P(\mathbf{x}) P(\mathbf{e}), \end{cases} \qquad (46)$$

The indicator function is bounded [29] (Ch. 5.6) for $0 < s \le 1$:

$$\Phi(\mathbf{x}, \bar{\mathbf{x}}, \mathbf{e}, \mathbf{v}) \le \left( \frac{P(\bar{\mathbf{x}}) P(\mathbf{v})}{P(\mathbf{x}) P(\mathbf{e})} \right)^s \qquad (47)$$
Setting $s = 1$ in (47), we have the following bound:

$$P(E) \le \sum_{(\mathbf{x}, \mathbf{e}) \in U} \sum_{(\bar{\mathbf{x}}, \mathbf{v})} P(\bar{\mathbf{x}}) P(\mathbf{v}) \Pr[\mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v})] + 2\varepsilon \qquad (48)$$

From the definition in (43), note that the probability $\Pr[\mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v})]$ is

$$\Pr[\mathcal{E}(\mathbf{x}, \mathbf{e}; \bar{\mathbf{x}}, \mathbf{v})] = \Pr[\mathbf{w} \oplus \mathbf{v} = \mathbf{z} \oplus \mathbf{e}] \qquad (49)$$

Grouping the pairs $(\bar{\mathbf{x}}, \mathbf{v})$ by the numbers of nonzero elements $d_1 = \|\bar{\mathbf{x}}\|_0$ and $d_2 = \|\mathbf{e} \oplus \mathbf{v}\|_0$, (48) becomes

$$P(E) \le \sum_{(\mathbf{x}, \mathbf{e}) \in U} \sum_{d_1, d_2} \sum_{\|\bar{\mathbf{x}}\|_0 = d_1} \sum_{\|\mathbf{e} \oplus \mathbf{v}\|_0 = d_2} P(\bar{\mathbf{x}}) P(\mathbf{v}) \Pr\left[ \mathbf{z} \oplus \mathbf{w} = \mathbf{e} \oplus \mathbf{v} \mid \|\bar{\mathbf{x}}\|_0 = d_1, \|\mathbf{e} \oplus \mathbf{v}\|_0 = d_2 \right] + 2\varepsilon \qquad (50)$$
In (50), we find the following probability, written with the shorthand $d_1 = \|\bar{\mathbf{x}}\|_0$ and $d_2 = \|\mathbf{e} \oplus \mathbf{v}\|_0$ for the conditioning:

$$\Pr\left[ \mathbf{z} \oplus \mathbf{w} = \mathbf{e} \oplus \mathbf{v} \mid d_1, d_2 \right] = \prod_{j=1}^{M} \Pr\left[ z_j \oplus w_j = e_j \oplus v_j \mid d_1, d_2 \right] = \Pr\left[ z_j \oplus w_j = 1 \mid d_1 \right]^{d_2} \Pr\left[ z_j \oplus w_j = 0 \mid d_1 \right]^{M - d_2} = (1 - P_0)^{d_2} P_0^{M - d_2} \qquad (51)$$
where each row is independent. Given this, we define the following probability:

$$P_0 \triangleq \Pr\left[ z_j \oplus w_j = 0 \mid \|\bar{\mathbf{x}}\|_0 = d_1 \right] \qquad (52)$$
We can divide P 0 in (52) into two parts. If d 1 < T ,
P 0 = Pr z j = 0 Pr w j = 0 + Pr z j = 1 Pr w j = 1 = Pr z j = 0
Otherwise,
P 0 = Pr z j = 0 t = 0 T 1 d 1 t γ t 1 γ d 1 t + Pr z j = 1 t = T d 1 d 1 t γ t 1 γ d 1 t = P z , 0 δ , γ P w , 0 d 1 , γ + 1 P z , 0 δ , γ 1 P w , 0 d 1 , γ
where
$$P_{z,0}(\delta,\gamma) \triangleq \Pr(z_j = 0) = \sum_{t=0}^{T-1} \binom{N}{t} (\delta\gamma)^t (1 - \delta\gamma)^{N-t}, \qquad P_{w,0}(d_1,\gamma) \triangleq \Pr(w_j = 0) = \sum_{t=0}^{T-1} \binom{d_1}{t} \gamma^t (1-\gamma)^{d_1 - t} \tag{55}$$
The maximum of $P_0$ is obtained by setting $P_{z,0}(\delta,\gamma) = 1/2$ and $P_{w,0}(d_1,\gamma) = 1/2$, which follows from the fact that $P_0$ in (54) is concave with respect to $P_{z,0}(\delta,\gamma)$ and $P_{w,0}(d_1,\gamma)$. Therefore, its bound is
$$P_0 \leq \frac{1}{2} \tag{56}$$
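To make the two-case computation of $P_0$ in (53)–(55) concrete, the binomial tail probabilities can be evaluated numerically. The following is a minimal sketch under the model above, in which each of the N samples is defective with rate δ and enters a pool with probability γ; the function names are illustrative, not from the paper:

```python
from math import comb

def binom_cdf(k_max, n, p):
    """Pr(X <= k_max) for X ~ Binomial(n, p)."""
    return sum(comb(n, t) * p**t * (1 - p)**(n - t) for t in range(k_max + 1))

def P_z0(N, T, delta, gamma):
    # (55): fewer than T of the N samples are both defective (rate delta)
    # and included in the pool (inclusion probability gamma).
    return binom_cdf(T - 1, N, delta * gamma)

def P_w0(d1, T, gamma):
    # (55): fewer than T of the d1 defectives of the candidate land in the pool.
    return binom_cdf(T - 1, d1, gamma)

def P0(N, T, delta, gamma, d1):
    """Equations (53)-(54): Pr(z_j XOR w_j = 0) given ||x_bar||_0 = d1."""
    p = P_z0(N, T, delta, gamma)
    if d1 < T:                  # (53): w_j = 0 with probability 1
        return p
    q = P_w0(d1, T, gamma)
    return p * q + (1 - p) * (1 - q)    # (54)
```

Note that whenever $P_{z,0} = 1/2$, (54) gives $P_0 = \frac{1}{2}q + \frac{1}{2}(1-q) = 1/2$ for any $q$, which is the balanced design point behind the bound (56).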
Using (51) and (56), (50) can be bounded as follows:
$$\begin{aligned} P(E) &\leq 2^{-M} \sum_{d_1 = 0,\, \bar{\mathbf{x}} \neq \mathbf{x}}^{N} \; \sum_{(\mathbf{x},\mathbf{e}) \in U} \; \sum_{\bar{\mathbf{x}}:\, \|\bar{\mathbf{x}}\|_0 = d_1} P(\bar{\mathbf{x}}) \sum_{\mathbf{v}} P(\mathbf{v}) + 2\varepsilon \leq 2^{-M} \sum_{d_1 = 0,\, \bar{\mathbf{x}} \neq \mathbf{x}}^{N} \; \sum_{(\mathbf{x},\mathbf{e}) \in U} \; \sum_{\bar{\mathbf{x}}:\, \|\bar{\mathbf{x}}\|_0 = d_1} P(\bar{\mathbf{x}}) + 2\varepsilon \\ &\leq 2^{-M} \sum_{\mathbf{x} \in A_{\mathbf{x}}^{(\varepsilon N)}} \; \sum_{\mathbf{e} \in A_{\mathbf{e}}^{(\varepsilon M)}} \; \sum_{d_1 = 0,\, \bar{\mathbf{x}} \neq \mathbf{x}}^{N} P(\bar{\mathbf{x}}) + 2\varepsilon = 2^{-M} \left| A_{\mathbf{x}}^{(\varepsilon N)} \right| \cdot \left| A_{\mathbf{e}}^{(\varepsilon M)} \right| \sum_{d_1 = 0,\, \bar{\mathbf{x}} \neq \mathbf{x}}^{N} P(\bar{\mathbf{x}}) + 2\varepsilon \\ &\leq 2^{-M} \left| A_{\mathbf{x}}^{(\varepsilon N)} \right| \cdot \left| A_{\mathbf{e}}^{(\varepsilon M)} \right| + 2\varepsilon \leq 2^{-M} \, 2^{N(H(\delta)+\varepsilon)} \, 2^{M(H(\eta)+\varepsilon)} + 2\varepsilon = 2^{N(H(\delta)+\varepsilon) + M(H(\eta)+\varepsilon) - M} + 2\varepsilon \end{aligned} \tag{57}$$
For the probability of error in (57) to be made arbitrarily small, the exponent term on the right side must be negative:
$$N\left(H(\delta) + \varepsilon\right) + M\left(H(\eta) + \varepsilon\right) - M < 0 \tag{58}$$
Then, the ratio of M to N is
$$\frac{M}{N} > \frac{H(\delta) + \varepsilon}{1 - H(\eta) - \varepsilon} \tag{59}$$
This completes the proof of Theorem 3. □
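The bound (59) is easy to evaluate numerically. Below is a minimal sketch, assuming $H(\cdot)$ is the binary entropy function and that, as in the derivation above, $\delta$ is the defective rate and $\eta$ the noise flip rate:

```python
from math import log2

def H(p):
    """Binary entropy in bits."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * log2(p) - (1 - p) * log2(1 - p)

def min_test_ratio(delta, eta, eps=0.0):
    """Lower bound on M/N from (59): (H(delta)+eps) / (1 - H(eta) - eps)."""
    return (H(delta) + eps) / (1 - H(eta) - eps)

# For example, with a 5% defective rate and 1% measurement noise,
# roughly 0.31 tests per sample suffice in the limit:
ratio = min_test_ratio(0.05, 0.01)
```

As expected, noise strictly increases the required number of tests, since it shrinks the denominator $1 - H(\eta) - \varepsilon$.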

5.2. Discussion of the Necessary and Sufficient Conditions

In this section, we discuss the results obtained from Theorems 2 and 3. Theorem 2 provides the lower bound for the NTGT problem via Fano’s inequality: it gives the minimum number of tests required to recover all defective samples among N samples with defective rate δ. In other words, Theorem 2 is a necessary condition for the probability of error to be smaller than ρ. Conversely, Theorem 3 leads to the upper bound on the probability of error using the MAP decoding method. This condition is an achievability result on performance and is the sufficient condition that allows us to reconstruct the defective samples.
We show that the results of Theorems 2 and 3 coincide with each other. Establishing the necessary and sufficient conditions on the number of tests required in the NTGT problem is a significant result for TGT. In addition, as shown in (27) above, a system design method for NTGT was proposed so that, depending on the threshold T, a test result is 0 or 1 with equal probability.
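The balanced design rule just mentioned, making a test result 0 or 1 with equal probability, can be realized by choosing the pooling density γ accordingly. The following sketch works under the model of (55); `design_gamma` is an illustrative helper (not from the paper) that bisects on γ, assuming a balanced point exists in (0, 1):

```python
from math import comb

def binom_cdf(k_max, n, p):
    """Pr(X <= k_max) for X ~ Binomial(n, p)."""
    return sum(comb(n, t) * p**t * (1 - p)**(n - t) for t in range(k_max + 1))

def P_z0(N, T, delta, gamma):
    # (55): probability that a test result is 0.
    return binom_cdf(T - 1, N, delta * gamma)

def design_gamma(N, T, delta, tol=1e-10):
    """Find gamma with Pr(z_j = 0) = 1/2 by bisection.

    P_z0 is decreasing in gamma; assumes the balanced point lies in (0, 1).
    """
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if P_z0(N, T, delta, mid) > 0.5:
            lo = mid        # too few positive tests: densify the pools
        else:
            hi = mid
    return (lo + hi) / 2
```

For example, with N = 100, δ = 0.1, and T = 2, the balanced pooling density is γ ≈ 0.17, illustrating the trade-off between the defective rate and the density of the group matrix.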

6. Conclusions

In this paper, we considered an NTGT problem in which a test result is positive when the number of defective samples in a pool equals or exceeds a certain threshold. In GT for the diagnosis of COVID-19 infection, false positives or false negatives can occur when a sample’s virus concentration does not sufficiently reach the threshold, so in this work we dealt with this TGT framework. In addition, a noise model was added for the case where correct results are flipped by unexpected measurement noise. We investigated how many tests are needed to successfully reconstruct a small set of defective samples in the NTGT problem. To this end, we aimed to find the necessary and sufficient conditions on the number of tests required. For the necessary condition, we obtained the lower bound on the number of tests using Fano’s inequality. Next, the upper bound on performance, defined by the probability of error, was derived using the MAP decoding method; this result leads to the sufficient condition for identifying all defective samples in the NTGT problem. We have shown that the necessary and sufficient conditions coincide in the NTGT framework. In addition, we showed that the relationship between the defective rate of the input signal and the sparsity of the group matrix should be considered to design an optimal NTGT system.

Funding

National Research Foundation of Korea: NRF-2020R1I1A3071739.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The author declares no conflict of interest.

References

  1. Dorfman, R. The Detection of Defective Members of Large Populations. Ann. Math. Stat. 1943, 14, 436–440.
  2. Donoho, D.L. Compressed Sensing. IEEE Trans. Inf. Theory 2006, 52, 1289–1306.
  3. Du, D.-Z.; Hwang, F.-K. Pooling Designs and Nonadaptive Group Testing: Important Tools for DNA Sequencing; World Scientific: Singapore, 2006.
  4. Verdun, C.M.; Fuchs, T.; Harar, P.; Elbrächter, D.; Fischer, D.S.; Berner, J.; Grohs, P.; Theis, F.J.; Krahmer, F. Group Testing for SARS-CoV-2 Allows for Up to 10-Fold Efficiency Increase across Realistic Scenarios and Testing Strategies. Front. Public Health 2021, 9, 583377.
  5. Mutesa, L.; Ndishimye, P.; Butera, Y.; Souopgui, J.; Uwineza, A.; Rutayisire, R.; Ndoricimpaye, E.L.; Musoni, E.; Rujeni, N.; Nyatanyi, T.; et al. A pooled testing strategy for identifying SARS-CoV-2 at low prevalence. Nature 2021, 589, 276–280.
  6. Damaschke, P. Threshold group testing. Gen. Theory Inf. Transf. Comb. LNCS 2006, 4123, 707–718.
  7. Bui, T.V.; Kuribayashi, M.; Cheraghchi, M.; Echizen, I. Efficiently Decodable Non-Adaptive Threshold Group Testing. IEEE Trans. Inf. Theory 2019, 65, 5519–5528.
  8. Seong, J.-T. Theoretical Bounds on Performance in Threshold Group Testing. Mathematics 2020, 8, 637.
  9. Chen, H.; Bonis, A.D. An almost optimal algorithm for generalized threshold group testing with inhibitors. J. Comput. Biol. 2011, 18, 851–864.
  10. De Marco, G.; Jurdzinski, T.; Rozanski, M.; Stachowiak, G. Subquadratic non-adaptive threshold group testing. Fundam. Comput. Theory 2017, 111, 177–189.
  11. Sterrett, A. On the Detection of Defective Members of Large Populations. Ann. Math. Stat. 1957, 28, 1033–1036.
  12. Sobel, M.; Groll, P.A. Group testing to eliminate efficiently all defectives in a binomial sample. Bell Syst. Tech. J. 1959, 38, 1179–1252.
  13. Allemann, A. An Efficient Algorithm for Combinatorial Group Testing. In Information Theory, Combinatorics, and Search Theory; Lecture Notes in Computer Science; Springer: Berlin/Heidelberg, Germany, 2013; Volume 7777.
  14. Srivastava, J.N. A Survey of Combinatorial Theory; North Holland Publishing Co.: Amsterdam, The Netherlands, 1973.
  15. Riccio, L.; Colbourn, C.J. Sharper bounds in adaptive group testing. Taiwan. J. Math. 2000, 4, 669–673.
  16. Leu, M.-G. A note on the Hu–Hwang–Wang conjecture for group testing. ANZIAM J. 2008, 49, 561–571.
  17. Chan, C.L.; Che, P.H.; Jaggi, S.; Saligrama, V. Non-adaptive probabilistic group testing with noisy measurements: Near-optimal bounds with efficient algorithms. In Proceedings of the 49th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 28–30 September 2011.
  18. Atia, G.K.; Saligrama, V. Boolean Compressed Sensing and Noisy Group Testing. IEEE Trans. Inf. Theory 2012, 58, 1880–1901.
  19. Malyutov, M. The separating property of random matrices. Math. Notes Acad. Sci. USSR 1978, 23, 84–91.
  20. Sejdinovic, D.; Johnson, O. Note on noisy group testing: Asymptotic bounds and belief propagation reconstruction. In Proceedings of the 48th Annual Allerton Conference on Communication, Control, and Computing, Monticello, IL, USA, 29 September–1 October 2010.
  21. Malioutov, D.; Malyutov, M. Boolean compressed sensing: LP relaxation for group testing. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Kyoto, Japan, 25–30 March 2012.
  22. Bondorf, S.; Chen, B.; Scarlett, J.; Yu, H.; Zhao, Y. Sublinear-time non-adaptive group testing with O(k log n) tests via bit-mixing coding. IEEE Trans. Inf. Theory 2021, 67, 1559–1570.
  23. Goodrich, M.T.; Atallah, M.J.; Tamassia, R. Indexing information for data forensics. In Proceedings of the Third International Conference on Applied Cryptography and Network Security, New York, NY, USA, 7–10 June 2005.
  24. Mistry, D.A.; Wang, J.Y.; Moeser, M.E.; Starkey, T.; Lee, L.Y. A systematic review of the sensitivity and specificity of lateral flow devices in the detection of SARS-CoV-2. BMC Infect. Dis. 2021, 21, 828.
  25. Baldassini, L.; Johnson, O.; Aldridge, M. The capacity of adaptive group testing. In Proceedings of the IEEE International Symposium on Information Theory, Istanbul, Turkey, 7–12 July 2013.
  26. Aldridge, M.; Baldassini, L.; Johnson, O. Group Testing Algorithms: Bounds and Simulations. IEEE Trans. Inf. Theory 2014, 60, 3671–3687.
  27. Cover, T.M.; Thomas, J.A. Elements of Information Theory; Wiley: Hoboken, NJ, USA, 2009.
  28. Aldridge, M.; Johnson, O.; Scarlett, J. Group Testing: An Information Theory Perspective. Found. Trends Commun. Inf. Theory 2019, 15, 196–392.
  29. Gallager, R. Information Theory and Reliable Communication; John Wiley and Sons: Hoboken, NJ, USA, 1968.
Figure 1. One example of NTGT where M = 7, N = 10, and T = 2; the black boxes denote 1 s, and the white ones 0 s.

