# An Optimization Approach of Deriving Bounds between Entropy and Error from Joint Distribution: Case Study for Binary Classifications


## Abstract


## 1. Introduction

- What are the closed-form relations between each bound and error components in a diagram of entropy and error probability?
- What are the lower and upper bounds in terms of non-Bayesian errors when a non-Bayesian rule is applied in the information processing?

- A new approach is proposed for deriving bounds directly through an optimization process based on a joint distribution, which differs significantly from all existing approaches. One advantage of this approach is that it yields closed-form expressions for the bounds and their error components.
- A new upper bound, with a closed-form expression, is derived in the “Error Probability vs. Conditional Entropy” diagram for Bayesian errors in the binary state; this bound has not been reported before and is generally tighter than Kovalevskij’s upper bound. Fano’s lower bound also receives novel interpretations.
- A comparison study of the bounds in terms of Bayesian and non-Bayesian errors is made in the binary state. The bounds on non-Bayesian errors are explored for the first time in information theory and suggest a significant role in the study of machine learning and classification applications.
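The two classical bounds mentioned above can be evaluated numerically in the binary case. The sketch below (a minimal illustration, not the paper's optimization approach) uses the standard binary forms: Fano's inequality $h(e) \ge H(T|Y)$, inverted on the rising branch of the binary entropy $h(\cdot)$, and Kovalevskij's linear upper bound $e \le H(T|Y)/2$ (entropies in bits).

```python
import math

def h2(p):
    """Binary entropy in bits; h2(0) = h2(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def fano_lower_bound(H, tol=1e-12):
    """Smallest error probability e in [0, 0.5] with h2(e) >= H,
    i.e. the inverse of h2 on its rising branch, found by bisection
    (h2 is strictly increasing on [0, 0.5])."""
    lo, hi = 0.0, 0.5
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if h2(mid) < H:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def kovalevskij_upper_bound(H):
    """Binary-case Kovalevskij bound: e <= H(T|Y) / 2 (bits)."""
    return H / 2.0

H = 0.5  # example conditional entropy H(T|Y), in bits
lo = fano_lower_bound(H)
hi = kovalevskij_upper_bound(H)
print(f"H = {H}: {lo:.4f} <= e <= {hi:.4f}")
```

For H(T|Y) = 0.5 bits this brackets the Bayesian error between roughly 0.11 (Fano) and 0.25 (Kovalevskij), illustrating the gap that a tighter upper bound can close.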

## 2. Related Works

## 3. Binary Classifications and Related Definitions

**Definition 1.**

**Definition 2.**

**Remark 1.**

**Definition 3.**

**Remark 2.**

**Remark 3.**

**Definition 4.**

## 4. Lower and Upper Bounds for Bayesian Errors

**Theorem 1.**

**Proof.**

**Remark 4.**

**Theorem 2.**

**Proof.**

**Remark 5.**

**Remark 6.**

## 5. Lower and Upper Bounds for Non-Bayesian Errors

**Definition 5.**

**Proposition 1.**

**Proof.**

**Theorem 3.**

**Proof.**

**Remark 7.**

**Remark 8.**

**Theorem 4.**

**Proof.**

**Remark 9.**

**Remark 10.**

## 6. Classification Interpretations to Some Key Points

**Remark 11.**

## 7. Summary

## 8. Discussions

## Acknowledgments

## Author Contributions

## Conflicts of Interest

## Appendix

## References

- Fano, R.M. Transmission of Information: A Statistical Theory of Communication. Am. J. Phys. 1961.
- Cover, T.M.; Thomas, J.A. Elements of Information Theory, 2nd ed.; John Wiley: New York, NY, USA, 2006.
- Verdú, S. Fifty years of Shannon theory. IEEE Trans. Inf. Theory 1998, 44, 2057–2078.
- Yeung, R.W. A First Course in Information Theory; Kluwer Academic: London, UK, 2002.
- Golić, J.D. Comment on “Relations between entropy and error probability”. IEEE Trans. Inf. Theory 1999.
- Vajda, I.; Zvárová, J. On generalized entropies, Bayesian decisions and statistical diversity. Kybernetika 2007, 43, 675–696.
- Morales, D.; Vajda, I. Generalized information criteria for optimal Bayes decisions. Kybernetika 2012, 48, 714–749.
- Kovalevskij, V.A. The Problem of Character Recognition from the Point of View of Mathematical Statistics. In Character Readers and Pattern Recognition; Spartan: New York, NY, USA, 1968; pp. 3–30.
- Chu, J.T.; Chueh, J.C. Inequalities between information measures and error probability. J. Frankl. Inst. 1966, 282, 121–125.
- Tebbe, D.L.; Dwyer, S.J. Uncertainty and probability of error. IEEE Trans. Inf. Theory 1968, 16, 516–518.
- Hellman, M.E.; Raviv, J. Probability of error, equivocation, and the Chernoff bound. IEEE Trans. Inf. Theory 1970, 16, 368–372.
- Chen, C.H. Theoretical comparison of a class of feature selection criteria in pattern recognition. IEEE Trans. Comput. 1971, 20, 1054–1056.
- Ben-Bassat, M.; Raviv, J. Rényi’s entropy and the probability of error. IEEE Trans. Inf. Theory 1978, 24, 324–330.
- Golić, J.D. On the relationship between the information measures and the Bayes probability of error. IEEE Trans. Inf. Theory 1987, 35, 681–690.
- Feder, M.; Merhav, N. Relations between entropy and error probability. IEEE Trans. Inf. Theory 1994, 40, 259–266.
- Han, T.S.; Verdú, S. Generalizing the Fano inequality. IEEE Trans. Inf. Theory 1994, 40, 1247–1251.
- Poor, H.V.; Verdú, S. A lower bound on the probability of error in multihypothesis testing. IEEE Trans. Inf. Theory 1995, 41, 1992–1994.
- Harremoës, P.; Topsøe, F. Inequalities between entropy and index of coincidence derived from information diagrams. IEEE Trans. Inf. Theory 2001, 47, 2944–2960.
- Erdogmus, D.; Principe, J.C. Lower and upper bounds for misclassification probability based on Renyi’s information. J. VLSI Signal Process. 2004, 37, 305–317.
- Ho, S.-W.; Verdú, S. On the interplay between conditional entropy and error probability. IEEE Trans. Inf. Theory 2010, 56, 5930–5942.
- Liang, X.-B. A note on Fano’s inequality. In Proceedings of the 45th Annual Conference on Information Sciences and Systems, Baltimore, MD, USA, 23–25 March 2011.
- Fano, R.M. Fano inequality. Scholarpedia 2008.
- Shannon, C.E. A mathematical theory of communication. Bell Syst. Tech. J. 1948, 27, 379–423.
- Feder, M.; Merhav, N. Universal prediction of individual sequences. IEEE Trans. Inf. Theory 1992, 38, 1258–1270.
- Wang, Y.; Hu, B.-G. Derivations of normalized mutual information in binary classifications. In Proceedings of the 6th International Conference on Fuzzy Systems and Knowledge Discovery, Tianjin, China, 14–16 August 2009; pp. 155–163.
- Hu, B.-G. What are the differences between Bayesian classifiers and mutual-information classifiers? IEEE Trans. Neural Netw. Learn. Syst. 2014, 25, 249–264.
- Eriksson, T.; Kim, S.; Kang, H.-G.; Lee, C. An information-theoretic perspective on feature selection in speaker recognition. IEEE Signal Process. Lett. 2005, 12, 500–503.
- Fisher, J.W.; Siracusa, M.; Tieu, K. Estimation of signal information content for classification. In Proceedings of the IEEE DSP Workshop, Marco Island, FL, USA, 4–7 January 2009; pp. 353–358.
- Taneja, I.J. Generalized error bounds in pattern recognition. Pattern Recognit. Lett. 1985, 3, 361–368.
- Duda, R.O.; Hart, P.E.; Stork, D. Pattern Classification, 2nd ed.; John Wiley: New York, NY, USA, 2001.
- Cristianini, N.; Shawe-Taylor, J. An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods; Cambridge University Press: London, UK, 2000.
- He, H.; Garcia, E.A. Learning from imbalanced data. IEEE Trans. Knowl. Data Eng. 2009, 21, 1263–1284.
- Sun, Y.M.; Wong, A.K.C.; Kamel, M.S. Classification of imbalanced data: A review. Int. J. Pattern Recognit. Artif. Intell. 2009, 23, 687–719.
- LeCun, Y.; Bengio, Y.; Hinton, G. Deep learning. Nature 2015, 521, 436–444.
- Hu, B.-G.; He, R.; Yuan, X.-T. Information-theoretic measures for objective evaluation of classifications. Acta Autom. Sin. 2012, 38, 1160–1173.
- MacKay, D.J.C. Information Theory, Inference, and Learning Algorithms; Cambridge University Press: Cambridge, UK, 2003.
- Subramanian, V.R.; White, R.E. Symbolic solutions for boundary value problems using Maple. Comput. Chem. Eng. 2000, 24, 2405–2416.
- Temimi, H.; Ansari, A.R. A semi-analytical iterative technique for solving nonlinear problems. Comput. Math. Appl. 2011, 61, 203–210.
- Jordan, M.I. On statistics, computation and scalability. Bernoulli 2013, 19, 1378–1390.
- Principe, J.C. Information Theoretic Learning: Renyi’s Entropy and Kernel Perspectives; Springer: New York, NY, USA, 2010.
- Hu, B.-G. Information theory and its relation to machine learning. In Proceedings of the 2015 Chinese Intelligent Automation Conference; Springer: Berlin/Heidelberg, Germany, 2015; pp. 1–11.

**Figure 1.** Schematic diagram of the pattern recognition systems (adapted from Figure 1.7 in [30]).

**Figure 2.** Bayesian decision boundary ${x}_{b}$ for equal priors $p({t}_{i})$ in a binary classification (adapted from Figure 2.17 in [30]).
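The boundary in Figure 2 can be illustrated with a minimal numerical sketch. It assumes the common textbook setting of one-dimensional Gaussian class-conditional densities with equal variances (these densities, and the specific means used, are illustrative assumptions, not values from the figure); with equal priors the likelihoods then cross at the midpoint of the means, and the Bayes error is the tail mass beyond ${x}_{b}$.

```python
import math

def bayes_boundary_equal_priors(mu1, mu2):
    """Bayesian decision boundary x_b for two classes with equal priors
    p(t1) = p(t2) and Gaussian class-conditional densities sharing one
    standard deviation: the likelihoods cross at the midpoint of the means."""
    return 0.5 * (mu1 + mu2)

def bayes_error_equal_priors(mu1, mu2, sigma):
    """Bayes error for the same setting: each class contributes the tail
    mass of its Gaussian past x_b, weighted by its prior of 0.5; by
    symmetry both tails are equal, so the error is a single Q-function."""
    d = abs(mu2 - mu1) / (2.0 * sigma)          # distance from a mean to x_b, in sigmas
    return 0.5 * math.erfc(d / math.sqrt(2.0))  # Q(d) = P(N(0,1) > d)

x_b = bayes_boundary_equal_priors(0.0, 2.0)
e_bayes = bayes_error_equal_priors(0.0, 2.0, 1.0)
print(f"x_b = {x_b}, Bayes error = {e_bayes:.4f}")
```

With means 0 and 2 and unit variance, the boundary sits at ${x}_{b} = 1$ and the Bayes error equals the standard normal tail beyond one sigma, about 0.159.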

**Figure 3.** Graphic diagram of the probability transformation between variables T and Y in a binary classification (or channel). Instead of using the conditional probability $p(y|t)$, the joint probability distribution $p(t,y)$ is applied to describe the channel.
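The joint-distribution description of the channel in Figure 3 can be made concrete with a small sketch: given a 2×2 table $p(t,y)$ (the particular values below are a hypothetical example, not taken from the paper), both the conditional entropy $H(T|Y)$ and the error probability follow directly from the joint mass, with no need for the conditional probabilities $p(y|t)$.

```python
import math

# Hypothetical 2x2 joint distribution p(t, y) for a binary channel:
# rows index the true state t, columns the decision y.
p = [[0.40, 0.10],
     [0.05, 0.45]]

def conditional_entropy_T_given_Y(p):
    """H(T|Y) = -sum_{t,y} p(t,y) * log2( p(t,y) / p(y) ), in bits."""
    py = [p[0][j] + p[1][j] for j in range(2)]  # marginal p(y)
    H = 0.0
    for i in range(2):
        for j in range(2):
            if p[i][j] > 0.0:
                H -= p[i][j] * math.log2(p[i][j] / py[j])
    return H

def error_probability(p):
    """Error probability = total off-diagonal joint mass, p(t1,y2) + p(t2,y1)."""
    return p[0][1] + p[1][0]

H = conditional_entropy_T_given_Y(p)
e = error_probability(p)
print(f"H(T|Y) = {H:.4f} bits, e = {e:.2f}")
```

For this table the error is 0.15 at roughly 0.60 bits of conditional entropy, which indeed falls between Fano's lower bound and Kovalevskij's upper bound for that entropy.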

© 2016 by the authors; licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Hu, B.-G.; Xing, H.-J.
An Optimization Approach of Deriving Bounds between Entropy and Error from Joint Distribution: Case Study for Binary Classifications. *Entropy* **2016**, *18*, 59.
https://doi.org/10.3390/e18020059
