
Information Bottleneck: Theory and Applications in Deep Learning

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (31 July 2020) | Viewed by 51945

Printed Edition Available!
A printed edition of this Special Issue is available here.

Special Issue Editors


Dr. Bernhard C. Geiger
Guest Editor
1. Area of Methods & Algorithms for Artificial Intelligence, Know-Center GmbH, 8010 Graz, Austria
2. Signal Processing and Speech Communication Laboratory, Graz University of Technology, 8010 Graz, Austria
Interests: information-theoretic model reduction; information bottleneck theory of deep learning; information-theoretic analysis of machine learning systems; theory-inspired machine learning

Prof. Dr. Gernot Kubin
Guest Editor
Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria
Interests: computational intelligence; signal processing; speech communication

Special Issue Information

Dear Colleagues,

The information bottleneck (IB) framework has recently gained popularity in the analysis and design of neural networks (NNs): the “information plane”, which quantifies how latent representations learn what is relevant and “forget” what is irrelevant during training, has been shown to offer unprecedented insight into the inner workings of NNs, and the IB functional and its variants have been suggested as cost functions for NN training.
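
For readers new to the notation: in one common convention, the IB functional is minimized over stochastic encoders p(t|x) and reads L_IB = I(X;T) − β·I(T;Y), trading compression of the input X against preservation of information about the relevance variable Y. The following minimal sketch, assuming discrete variables and an entirely illustrative toy distribution, serves only to make this functional concrete:

    import numpy as np

    def mutual_information(p_joint):
        # I(A;B) in nats from a joint distribution given as a 2-D array
        p_joint = p_joint / p_joint.sum()
        p_a = p_joint.sum(axis=1, keepdims=True)
        p_b = p_joint.sum(axis=0, keepdims=True)
        mask = p_joint > 0
        return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / (p_a @ p_b)[mask])))

    def ib_functional(p_xy, p_t_given_x, beta):
        # L_IB = I(X;T) - beta * I(T;Y) for a stochastic encoder p(t|x),
        # using the Markov chain T - X - Y
        p_x = p_xy.sum(axis=1)
        p_xt = p_x[:, None] * p_t_given_x    # joint p(x,t)
        p_ty = p_t_given_x.T @ p_xy          # joint p(t,y) = sum_x p(t|x) p(x,y)
        return mutual_information(p_xt) - beta * mutual_information(p_ty)

    # Toy example (numbers are illustrative only)
    p_xy = np.array([[0.4, 0.1],
                     [0.1, 0.4]])            # joint distribution of X and Y
    encoder = np.array([[0.9, 0.1],
                        [0.1, 0.9]])         # p(t|x): rows index x, columns index t
    print(ib_functional(p_xy, encoder, beta=5.0))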

Based on this increased attention, this Special Issue aims to investigate the properties of the IB functional in this new context and to propose new learning mechanisms inspired by the IB framework. In the former aspect, we are interested in both purely theoretical as well as empirical observations that shed new light on the IB framework. In the latter aspect, we solicit papers that discuss training of NNs or other deep, multilayer machine learning models using cost functions that are inspired by the IB principle, even if the cost function itself looks different. Specifically, we seek:

  • Manuscripts that provide novel insight into the properties of the IB functional; both purely theoretical and empirical approaches are accepted;
  • Manuscripts that apply the IB principle to the training of deep, i.e., multilayer, machine learning structures;
  • Manuscripts that discuss cost functions for NN training that are inspired by the IB principle but depart from the IB functional in a well-motivated manner.

Dr. Bernhard C. Geiger
Prof. Dr. Gernot Kubin
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • information bottleneck
  • neural networks
  • multi-layer machine learning models
  • learning theory
  • information-theoretic cost functions

Published Papers (13 papers)


Editorial


4 pages, 189 KiB  
Editorial
Information Bottleneck: Theory and Applications in Deep Learning
by Bernhard C. Geiger and Gernot Kubin
Entropy 2020, 22(12), 1408; https://doi.org/10.3390/e22121408 - 14 Dec 2020
Cited by 11 | Viewed by 4555
Abstract
The information bottleneck (IB) framework, proposed in [...] Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)

Research


12 pages, 263 KiB  
Article
A Comparison of Variational Bounds for the Information Bottleneck Functional
by Bernhard C. Geiger and Ian S. Fischer
Entropy 2020, 22(11), 1229; https://doi.org/10.3390/e22111229 - 29 Oct 2020
Cited by 2 | Viewed by 2182
Abstract
In this short note, we relate the variational bounds proposed in Alemi et al. (2017) and Fischer (2020) for the information bottleneck (IB) and the conditional entropy bottleneck (CEB) functional, respectively. Although the two functionals were shown to be equivalent, it was empirically observed that optimizing bounds on the CEB functional achieves better generalization performance and adversarial robustness than optimizing those on the IB functional. This work tries to shed light on this issue by showing that, in the most general setting, no ordering can be established between these variational bounds, while such an ordering can be enforced by restricting the feasible sets over which the optimizations take place. The absence of such an ordering in the general setup suggests that the variational bound on the CEB functional is either more amenable to optimization or a relevant cost function for optimization in its own regard, i.e., without justification from the IB or CEB functionals. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
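
To make the comparison above concrete, here is a rough sketch of how the two variational objectives are typically set up for a Gaussian encoder e(z|x) = N(μ(x), σ(x)²); the modules classifier and backward_net are hypothetical placeholders, and this is a sketch under our own assumptions, not the authors' implementation:

    import torch
    import torch.nn.functional as F
    from torch.distributions import Normal, kl_divergence

    def vib_and_ceb_losses(mu, sigma, y, classifier, backward_net, beta=0.01):
        # e(z|x) = N(mu, sigma^2), with mu and sigma produced by an encoder network
        enc = Normal(mu, sigma)
        z = enc.rsample()                                  # reparameterized sample
        ce = F.cross_entropy(classifier(z), y)             # E[-log q(y|z)], the shared prediction term

        # Variational IB rate: KL(e(z|x) || m(z)) against a fixed standard-normal marginal
        marginal = Normal(torch.zeros_like(mu), torch.ones_like(sigma))
        rate_vib = kl_divergence(enc, marginal).sum(dim=-1).mean()

        # Variational CEB rate: E[log e(z|x) - log b(z|y)], an upper bound on I(X;Z|Y),
        # with a learned class-conditional "backward" encoder b(z|y)
        b_mu, b_sigma = backward_net(y)
        back = Normal(b_mu, b_sigma)
        rate_ceb = (enc.log_prob(z) - back.log_prob(z)).sum(dim=-1).mean()

        return ce + beta * rate_vib, ce + beta * rate_ceb

The two losses differ only in their rate terms; which one is smaller depends on the learned marginal and backward distributions, which is the comparison the paper makes precise.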
16 pages, 1261 KiB  
Article
CEB Improves Model Robustness
by Ian Fischer and Alexander A. Alemi
Entropy 2020, 22(10), 1081; https://doi.org/10.3390/e22101081 - 25 Sep 2020
Cited by 10 | Viewed by 2208
Abstract
Intuitively, one way to make classifiers more robust to their input is to have them depend less sensitively on their input. The Information Bottleneck (IB) tries to learn compressed representations of input that are still predictive. Scaling up IB approaches to large scale image classification tasks has proved difficult. We demonstrate that the Conditional Entropy Bottleneck (CEB) can not only scale up to large scale image classification tasks, but can additionally improve model robustness. CEB is an easy strategy to implement and works in tandem with data augmentation procedures. We report results of a large scale adversarial robustness study on CIFAR-10, as well as the ImageNet-C Common Corruptions Benchmark, ImageNet-A, and PGD attacks. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
21 pages, 650 KiB  
Article
The Conditional Entropy Bottleneck
by Ian Fischer
Entropy 2020, 22(9), 999; https://doi.org/10.3390/e22090999 - 08 Sep 2020
Cited by 37 | Viewed by 3511
Abstract
Much of the field of Machine Learning exhibits a prominent set of failure modes, including vulnerability to adversarial examples, poor out-of-distribution (OoD) detection, miscalibration, and willingness to memorize random labelings of datasets. We characterize these as failures of robust generalization, which extends the traditional measure of generalization as accuracy or related metrics on a held-out set. We hypothesize that these failures to robustly generalize are due to the learning systems retaining too much information about the training data. To test this hypothesis, we propose the Minimum Necessary Information (MNI) criterion for evaluating the quality of a model. In order to train models that perform well with respect to the MNI criterion, we present a new objective function, the Conditional Entropy Bottleneck (CEB), which is closely related to the Information Bottleneck (IB). We experimentally test our hypothesis by comparing the performance of CEB models with deterministic models and Variational Information Bottleneck (VIB) models on a variety of different datasets and robustness challenges. We find strong empirical evidence supporting our hypothesis that MNI models improve on these problems of robust generalization. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
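
A compact way to read the two ideas introduced here (our paraphrase of the abstract, not a quotation from the paper): the MNI criterion asks that a representation Z capture all and only the task-relevant information,

    I(X;Z) = I(Y;Z) = I(X;Y),

while CEB penalizes the residual information through the conditional term I(X;Z|Y), e.g., by minimizing an objective of the form I(X;Z|Y) − γ·I(Y;Z). Under the usual Markov assumption Z − X − Y, I(X;Z|Y) = I(X;Z) − I(Y;Z), so this term charges only for information about X that is not shared with Y.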
33 pages, 3128 KiB  
Article
Variational Information Bottleneck for Semi-Supervised Classification
by Slava Voloshynovskiy, Olga Taran, Mouad Kondah, Taras Holotyak and Danilo Rezende
Entropy 2020, 22(9), 943; https://doi.org/10.3390/e22090943 - 27 Aug 2020
Cited by 11 | Viewed by 4082
Abstract
In this paper, we consider an information bottleneck (IB) framework for semi-supervised classification with several families of priors on the latent space representation. We apply a variational decomposition of the mutual information terms of the IB. Using this decomposition, we analyze several regularizers and demonstrate in practice the impact of the different components of the variational model on the classification accuracy. We propose a new formulation of semi-supervised IB with hand-crafted and learnable priors and link it to previous methods such as the semi-supervised versions of the VAE (M1 + M2), AAE, CatGAN, etc. We show that the resulting model allows a better understanding of the role of various previously proposed regularizers in the semi-supervised classification task in light of the IB framework. The proposed semi-supervised IB model with hand-crafted and learnable priors is experimentally validated on MNIST under different amounts of labeled data. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
15 pages, 1246 KiB  
Article
Convergence Behavior of DNNs with Mutual-Information-Based Regularization
by Hlynur Jónsson, Giovanni Cherubini and Evangelos Eleftheriou
Entropy 2020, 22(7), 727; https://doi.org/10.3390/e22070727 - 30 Jun 2020
Cited by 9 | Viewed by 3214
Abstract
Information theory concepts are leveraged with the goal of better understanding and improving Deep Neural Networks (DNNs). The information plane of neural networks describes the behavior, during training, of the mutual information at various depths between input/output and hidden-layer variables. Previous analysis revealed that, in some networks where finiteness of the mutual information can be established, most of the training epochs are spent on compressing the input. However, the estimation of mutual information is nontrivial for high-dimensional continuous random variables. Therefore, the computation of mutual information for DNNs and its visualization on the information plane have mostly focused on low-complexity, fully connected networks. In fact, even the existence of the compression phase in complex DNNs has been questioned and viewed as an open problem. In this paper, we present the convergence of mutual information on the information plane for a high-dimensional VGG-16 Convolutional Neural Network (CNN) by resorting to Mutual Information Neural Estimation (MINE), thus confirming and extending the results obtained with low-dimensional fully connected networks. Furthermore, we demonstrate the benefits of regularizing a network, especially for a large number of training epochs, by adopting mutual information estimates as additional terms in the loss function characteristic of the network. Experimental results show that the regularization stabilizes the test accuracy and significantly reduces its variance. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
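
For readers unfamiliar with MINE, the estimator mentioned above optimizes the Donsker–Varadhan lower bound I(X;Z) ≥ E_joint[T(x,z)] − log E_marginals[exp(T(x,z))] over a critic network T. A minimal sketch (stats_net is a hypothetical critic; this is not the authors' code):

    import torch

    def mine_lower_bound(stats_net, x, z):
        # stats_net maps (x, z) pairs to scalar scores, shape (batch,)
        joint_term = stats_net(x, z).mean()
        z_shuffled = z[torch.randperm(z.shape[0])]       # break the pairing -> product of marginals
        n = torch.tensor(float(z.shape[0]))
        marginal_term = torch.logsumexp(stats_net(x, z_shuffled), dim=0) - torch.log(n)
        return joint_term - marginal_term                # maximize w.r.t. the critic's parameters

    # As a regularizer (sketch): total_loss = task_loss + reg_weight * mi_estimate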
27 pages, 1642 KiB  
Article
The Convex Information Bottleneck Lagrangian
by Borja Rodríguez Gálvez, Ragnar Thobaben and Mikael Skoglund
Entropy 2020, 22(1), 98; https://doi.org/10.3390/e22010098 - 14 Jan 2020
Cited by 11 | Viewed by 4384
Abstract
The information bottleneck (IB) problem tackles the issue of obtaining relevant compressed representations T of some random variable X for the task of predicting Y. It is defined as a constrained optimization problem that maximizes the information the representation has about the task, I(T;Y), while ensuring that a certain level of compression r is achieved (i.e., I(X;T) ≤ r). For practical reasons, the problem is usually solved by maximizing the IB Lagrangian (i.e., L_IB(T;β) = I(T;Y) − β·I(X;T)) for many values of β ∈ [0,1]. Then, the curve of maximal I(T;Y) for a given I(X;T) is drawn and a representation with the desired predictability and compression is selected. It is known that when Y is a deterministic function of X, the IB curve cannot be explored, and another Lagrangian has been proposed to tackle this problem: the squared IB Lagrangian, L_sq-IB(T;β_sq) = I(T;Y) − β_sq·I(X;T)². In this paper, we (i) present a general family of Lagrangians which allow for the exploration of the IB curve in all scenarios; (ii) provide the exact one-to-one mapping between the Lagrange multiplier and the desired compression rate r for known IB curve shapes; and (iii) show we can approximately obtain a specific compression level with the convex IB Lagrangian for both known and unknown IB curve shapes. This eliminates the burden of solving the optimization problem for many values of the Lagrange multiplier. That is, we prove that we can solve the original constrained problem with a single optimization. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
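
As we read the abstract, the family studied here replaces the linear compression penalty of the IB Lagrangian with a convex one,

    L_h(T;β) = I(T;Y) − β·h(I(X;T)),  with h monotonically increasing and strictly convex,

so that the squared IB Lagrangian is recovered as the special case h(r) = r²; see the paper for the precise conditions on h and for the mapping between β and the compression level r.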
17 pages, 492 KiB  
Article
Probabilistic Ensemble of Deep Information Networks
by Giulio Franzese and Monica Visintin
Entropy 2020, 22(1), 100; https://doi.org/10.3390/e22010100 - 14 Jan 2020
Cited by 5 | Viewed by 2335
Abstract
We describe a classifier made of an ensemble of decision trees, designed using information theory concepts. In contrast to algorithms such as C4.5 or ID3, the tree is built from the leaves instead of the root. Each tree is made of nodes trained independently of the others, to minimize a local cost function (information bottleneck). The trained tree outputs the estimated probabilities of the classes given the input datum, and the outputs of many trees are combined to decide the class. We show that the system is able to provide results comparable to those of the tree classifier in terms of accuracy, while it shows many advantages in terms of modularity, reduced complexity, and memory requirements. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
27 pages, 5507 KiB  
Article
Pareto-Optimal Data Compression for Binary Classification Tasks
by Max Tegmark and Tailin Wu
Entropy 2020, 22(1), 7; https://doi.org/10.3390/e22010007 - 19 Dec 2019
Cited by 10 | Viewed by 4789
Abstract
The goal of lossy data compression is to reduce the storage cost of a data set X while retaining as much information as possible about something (Y) that you care about. For example, what aspects of an image X contain the most information about whether it depicts a cat? Mathematically, this corresponds to finding a mapping X → Z ≡ f(X) that maximizes the mutual information I(Z,Y) while the entropy H(Z) is kept below some fixed threshold. We present a new method for mapping out the Pareto frontier for classification tasks, reflecting the tradeoff between retained entropy and class information. We first show how a random variable X (an image, say) drawn from a class Y ∈ {1, …, n} can be distilled into a vector W = f(X) ∈ ℝ^(n−1) losslessly, so that I(W,Y) = I(X,Y); for example, for a binary classification task of cats and dogs, each image X is mapped into a single real number W retaining all information that helps distinguish cats from dogs. For the n = 2 case of binary classification, we then show how W can be further compressed into a discrete variable Z = g_β(W) ∈ {1, …, m_β} by binning W into m_β bins, in such a way that varying the parameter β sweeps out the full Pareto frontier, solving a generalization of the discrete information bottleneck (DIB) problem. We argue that the most interesting points on this frontier are “corners” maximizing I(Z,Y) for a fixed number of bins m = 2, 3, …, which can conveniently be found without multiobjective optimization. We apply this method to the CIFAR-10, MNIST and Fashion-MNIST datasets, illustrating how it can be interpreted as an information-theoretically optimal image clustering algorithm. We find that these Pareto frontiers are not concave, and that recently reported DIB phase transitions correspond to transitions between these corners, changing the number of clusters. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
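
The compression-by-binning step described above is easy to make concrete. The sketch below (illustrative only, with quantile binning standing in for the paper's optimized bin boundaries) quantizes a scalar statistic W into m bins and estimates I(Z;Y) in bits from the empirical joint distribution:

    import numpy as np

    def binned_mutual_information(w, y, num_bins):
        # Z = bin index of the scalar statistic w; I(Z;Y) from empirical counts, in bits
        edges = np.quantile(w, np.linspace(0.0, 1.0, num_bins + 1)[1:-1])
        z = np.digitize(w, edges)                         # Z in {0, ..., num_bins - 1}
        classes = np.unique(y)
        joint = np.zeros((num_bins, classes.size))
        for zi, yi in zip(z, y):
            joint[zi, np.searchsorted(classes, yi)] += 1.0
        joint /= joint.sum()
        pz = joint.sum(axis=1, keepdims=True)
        py = joint.sum(axis=0, keepdims=True)
        mask = joint > 0
        return float(np.sum(joint[mask] * np.log2(joint[mask] / (pz @ py)[mask])))

    # Example with synthetic data (illustrative only)
    rng = np.random.default_rng(0)
    labels = rng.integers(0, 2, size=1000)
    w_stat = labels + 0.5 * rng.standard_normal(1000)     # a noisy scalar carrying label information
    print(binned_mutual_information(w_stat, labels, num_bins=3))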
15 pages, 2142 KiB  
Article
Nonlinear Information Bottleneck
by Artemy Kolchinsky, Brendan D. Tracey and David H. Wolpert
Entropy 2019, 21(12), 1181; https://doi.org/10.3390/e21121181 - 30 Nov 2019
Cited by 68 | Viewed by 8588
Abstract
Information bottleneck (IB) is a technique for extracting information in one random variable X that is relevant for predicting another random variable Y. IB works by encoding X in a compressed “bottleneck” random variable M from which Y can be accurately decoded. However, finding the optimal bottleneck variable involves a difficult optimization problem, which until recently has been considered for only two limited cases: discrete X and Y with small state spaces, and continuous X and Y with a Gaussian joint distribution (in which case optimal encoding and decoding maps are linear). We propose a method for performing IB on arbitrarily-distributed discrete and/or continuous X and Y, while allowing for nonlinear encoding and decoding maps. Our approach relies on a novel non-parametric upper bound for mutual information. We describe how to implement our method using neural networks. We then show that it achieves better performance than the recently-proposed “variational IB” method on several real-world datasets. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
21 pages, 3759 KiB  
Article
Markov Information Bottleneck to Improve Information Flow in Stochastic Neural Networks
by Thanh Tang Nguyen and Jaesik Choi
Entropy 2019, 21(10), 976; https://doi.org/10.3390/e21100976 - 06 Oct 2019
Cited by 5 | Viewed by 3937
Abstract
While rate distortion theory compresses data under a distortion constraint, the information bottleneck (IB) generalizes rate distortion theory to learning problems by replacing the distortion constraint with a constraint on relevant information. In this work, we further extend IB to multiple Markov bottlenecks (i.e., latent variables that form a Markov chain), namely the Markov information bottleneck (MIB), which fits the context of stochastic neural networks (SNNs) better than the original IB. We show that Markov bottlenecks cannot simultaneously achieve their information optimality in a non-collapse MIB, and thus devise an optimality compromise. With MIB, we take the novel perspective that each layer of an SNN is a bottleneck whose learning goal is to encode relevant information in a compressed form from the data. The inference from a hidden layer to the output layer is then interpreted as a variational approximation to the layer’s decoding of relevant information in the MIB. As a consequence of this perspective, the maximum likelihood estimate (MLE) principle in the context of SNNs becomes a special case of the variational MIB. We show that, compared to MLE, the variational MIB can encourage better information flow in SNNs in both principle and practice, and empirically improves performance in classification, adversarial robustness, and multi-modal learning on MNIST. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
33 pages, 2387 KiB  
Article
Learnability for the Information Bottleneck
by Tailin Wu, Ian Fischer, Isaac L. Chuang and Max Tegmark
Entropy 2019, 21(10), 924; https://doi.org/10.3390/e21100924 - 23 Sep 2019
Cited by 14 | Viewed by 3395
Abstract
The Information Bottleneck (IB) method provides an insightful and principled approach for balancing compression and prediction in representation learning. The IB objective I(X;Z) − β·I(Y;Z) employs a Lagrange multiplier β to tune this trade-off. However, in practice, not only is β chosen empirically without theoretical guidance, but there is also a lack of theoretical understanding of the relationship between β, learnability, the intrinsic nature of the dataset, and model capacity. In this paper, we show that if β is improperly chosen, learning cannot happen: the trivial representation P(Z|X) = P(Z) becomes the global minimum of the IB objective. We show how this can be avoided by identifying a sharp phase transition between the unlearnable and the learnable which arises as β is varied. This phase transition defines the concept of IB-Learnability. We prove several sufficient conditions for IB-Learnability, which provide theoretical guidance for choosing a good β. We further show that IB-Learnability is determined by the largest confident, typical, and imbalanced subset of the examples (the conspicuous subset), and discuss its relation to model capacity. We give practical algorithms to estimate the minimum β for a given dataset. We also empirically demonstrate our theoretical conditions with analyses of synthetic datasets, MNIST, and CIFAR10. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
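
One back-of-the-envelope way to see why a sharp threshold must exist (our reading of the abstract, not the paper's proof): the trivial encoder P(Z|X) = P(Z) always attains objective value I(X;Z) − β·I(Y;Z) = 0, so a non-trivial representation can beat it only if I(X;Z) − β·I(Y;Z) < 0 for some Z, i.e., only once β exceeds the infimum of I(X;Z)/I(Y;Z) over representations with I(Y;Z) > 0. The sufficient conditions in the paper can then be read as computable guarantees that a given β lies above this threshold.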
14 pages, 795 KiB  
Article
Gaussian Mean Field Regularizes by Limiting Learned Information
by Julius Kunze, Louis Kirsch, Hippolyt Ritter and David Barber
Entropy 2019, 21(8), 758; https://doi.org/10.3390/e21080758 - 03 Aug 2019
Cited by 2 | Viewed by 3030
Abstract
Variational inference with a factorized Gaussian posterior estimate is a widely-used approach for learning parameters and hidden variables. Empirically, a regularizing effect can be observed that is poorly understood. In this work, we show how mean field inference improves generalization by limiting mutual information between learned parameters and the data through noise. We quantify a maximum capacity when the posterior variance is either fixed or learned and connect it to generalization error, even when the KL-divergence in the objective is scaled by a constant. Our experiments suggest that bounding information between parameters and data effectively regularizes neural networks on both supervised and unsupervised tasks. Full article
(This article belongs to the Special Issue Information Bottleneck: Theory and Applications in Deep Learning)
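
For concreteness, in a common factorized Gaussian setup with a standard-normal prior (our assumption for this sketch, not necessarily the paper's exact setting), the KL term referred to above has the closed form computed below:

    import torch

    def mean_field_gaussian_kl(mu, log_var):
        # KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over all parameters
        return 0.5 * torch.sum(torch.exp(log_var) + mu ** 2 - 1.0 - log_var)

    # As the abstract notes, the connection to a finite information capacity persists
    # even when this term is scaled by a constant in the objective.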