Information Theory in Machine Learning and Data Science

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (15 May 2018) | Viewed by 115980

Special Issue Editor


Prof. Dr. Maxim Raginsky
Guest Editor
Department of Electrical and Computer Engineering, University of Illinois, 162 Coordinated Science Lab MC 228, 1308 W. Main St., Urbana, IL 61801, USA
Interests: information theory; statistical machine learning; optimization and control

Special Issue Information

Dear Colleagues,

The field of machine learning and data science is concerned with the design and analysis of algorithms for making decisions, reasoning about the world, and extracting knowledge from massive amounts of data. On the one hand, the performance of machine learning algorithms is limited by the amount of predictively relevant information contained in the data. On the other hand, different procedures for accessing and processing data can be more or less informative. Claude Shannon originally developed information theory with communication systems in mind. However, it has also proved to be an indispensable analytical tool in the field of mathematical statistics, where it is used to quantify the fundamental limits on the performance of statistical decision procedures and to guide the processes of feature selection and experimental design.

The purpose of this Special Issue is to highlight the state-of-the-art in applications of information theory to the fields of machine learning and data science. Possible topics include, but are not limited to, the following:

  • Fundamental information-theoretic limits of machine learning algorithms

  • Information-directed sampling and optimization

  • Statistical estimation, optimization, and learning under information constraints

  • Information bottleneck methods

  • Information-theoretic approaches to adaptive data analysis

  • Information-theoretic approaches to feature design and selection

  • Estimation of information-theoretic functionals

Prof. Dr. Maxim Raginsky
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, click here to go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles as well as short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Published Papers (23 papers)

Research

14 pages, 777 KiB  
Article
Information Perspective to Probabilistic Modeling: Boltzmann Machines versus Born Machines
by Song Cheng, Jing Chen and Lei Wang
Entropy 2018, 20(8), 583; https://doi.org/10.3390/e20080583 - 07 Aug 2018
Cited by 61 | Viewed by 6295
Abstract
We compare and contrast the statistical physics and quantum physics inspired approaches for unsupervised generative modeling of classical data. The two approaches represent probabilities of observed data using energy-based models and quantum states, respectively. Classical and quantum information patterns of the target datasets therefore provide principled guidelines for structural design and learning in these two approaches. Taking the Restricted Boltzmann Machine (RBM) as an example, we analyze the information-theoretic bounds of the two approaches. We also estimate the classical mutual information of the standard MNIST dataset and the quantum Rényi entropy of the corresponding Matrix Product State (MPS) representations. Both information measures are much smaller than their theoretical upper bounds and exhibit similar patterns, which implies a common inductive bias of low information complexity. By comparing the performance of RBMs with various architectures on the standard MNIST dataset, we find that the RBM with local sparse connections exhibits high learning efficiency, which supports the application of tensor network states in machine learning problems. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

39 pages, 997 KiB  
Article
Ensemble Estimation of Information Divergence
by Kevin R. Moon, Kumar Sricharan, Kristjan Greenewald and Alfred O. Hero III
Entropy 2018, 20(8), 560; https://doi.org/10.3390/e20080560 - 27 Jul 2018
Cited by 16 | Viewed by 4348
Abstract
Recent work has focused on the problem of nonparametric estimation of information divergence functionals between two continuous random variables. Many existing approaches require either restrictive assumptions about the density support set or difficult calculations at the support set boundary which must be known a priori. The mean squared error (MSE) convergence rate of a leave-one-out kernel density plug-in divergence functional estimator for general bounded density support sets is derived where knowledge of the support boundary, and therefore, the boundary correction is not required. The theory of optimally weighted ensemble estimation is generalized to derive a divergence estimator that achieves the parametric rate when the densities are sufficiently smooth. Guidelines for the tuning parameter selection and the asymptotic distribution of this estimator are provided. Based on the theory, an empirical estimator of Rényi-α divergence is proposed that greatly outperforms the standard kernel density plug-in estimator in terms of mean squared error, especially in high dimensions. The estimator is shown to be robust to the choice of tuning parameters. We show extensive simulation results that verify the theoretical results of our paper. Finally, we apply the proposed estimator to estimate the bounds on the Bayes error rate of a cell classification problem. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
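
As context for the abstract above, here is a minimal k-nearest-neighbor plug-in estimator of the Kullback–Leibler divergence from samples — the kind of single plug-in estimate that optimally weighted ensembles combine and improve upon. This is an illustrative baseline only, not the ensemble estimator proposed in the paper; the function name and parameter choices are assumptions.

```python
import numpy as np
from scipy.spatial import cKDTree

def knn_kl_divergence(x, y, k=5):
    """Plug-in k-NN estimate of D(P||Q) from samples x ~ P and y ~ Q.

    Illustrative baseline (Wang–Kulkarni–Verdú style), not the paper's
    weighted-ensemble estimator.
    """
    x, y = np.atleast_2d(x), np.atleast_2d(y)
    n, d = x.shape
    m = y.shape[0]
    # k-th nearest-neighbour distance of each x_i within x (excluding x_i itself)
    rho = cKDTree(x).query(x, k=k + 1)[0][:, -1]
    # k-th nearest-neighbour distance of each x_i within the sample from Q
    nu = cKDTree(y).query(x, k=k)[0]
    nu = nu[:, -1] if k > 1 else nu
    return d * np.mean(np.log(nu / rho)) + np.log(m / (n - 1.0))

rng = np.random.default_rng(0)
p = rng.normal(0.0, 1.0, size=(2000, 1))   # P = N(0, 1)
q = rng.normal(1.0, 1.0, size=(2000, 1))   # Q = N(1, 1), true KL = 0.5
print(knn_kl_divergence(p, q))
```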

22 pages, 628 KiB  
Article
Recognizing Information Feature Variation: Message Importance Transfer Measure and Its Applications in Big Data
by Rui She, Shanyun Liu and Pingyi Fan
Entropy 2018, 20(6), 401; https://doi.org/10.3390/e20060401 - 24 May 2018
Cited by 8 | Viewed by 3171
Abstract
Information transfer that characterizes the information feature variation can have a crucial impact on big data analytics and processing. Actually, a measure of information transfer can reflect the system change from the statistics by using the variable distributions, similar to Kullback-Leibler (KL) divergence and Rényi divergence. Furthermore, to some degree, small probability events may carry the most important part of the total message in an information transfer of big data. Therefore, it is significant to propose an information transfer measure with respect to the message importance from the viewpoint of small probability events. In this paper, we present the message importance transfer measure (MITM) and analyze its performance and applications in three aspects. First, we discuss the robustness of the MITM by using it to measure information distance. Then, we present a message importance transfer capacity by resorting to the MITM and give an upper bound for the information transfer process with disturbance. Finally, we apply the MITM to discuss queue length selection, which is a fundamental problem of caching operation in mobile edge computing. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

28 pages, 428 KiB  
Article
Shannon Entropy Estimation in ∞-Alphabets from Convergence Results: Studying Plug-In Estimators
by Jorge F. Silva
Entropy 2018, 20(6), 397; https://doi.org/10.3390/e20060397 - 23 May 2018
Cited by 6 | Viewed by 3241
Abstract
This work addresses the problem of Shannon entropy estimation in countably infinite alphabets by studying and adopting some recent convergence results for the entropy functional, which is known to be a discontinuous function in the space of probabilities on ∞-alphabets. Sufficient conditions for the convergence of the entropy are used in conjunction with some deviation inequalities (including scenarios with both finitely and infinitely supported assumptions on the target distribution). From this perspective, four plug-in histogram-based estimators are studied, showing that the convergence results are instrumental to derive new strongly consistent estimators for the entropy. The main application of this methodology is a new data-driven partition (plug-in) estimator. This scheme uses the data to restrict the support where the distribution is estimated by finding an optimal balance between estimation and approximation errors. The proposed scheme offers a consistent (distribution-free) estimator of the entropy in ∞-alphabets and optimal rates of convergence under certain regularity conditions on the problem (finite but unknown support assumptions and tail-bounded conditions on the target distribution). Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
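
A very rough sketch of the plug-in idea discussed above: estimate the entropy from the empirical histogram, optionally restricting the support to symbols whose empirical mass exceeds a threshold before renormalizing. The threshold rule and names below are illustrative assumptions, not the data-driven partition scheme analyzed in the paper.

```python
from collections import Counter
import math
import random

def plugin_entropy(samples, tail_threshold=0.0):
    """Plug-in (histogram) Shannon entropy estimate in nats.

    Keeping only symbols with empirical mass above `tail_threshold` loosely
    mimics a data-driven support restriction (illustrative only).
    """
    n = len(samples)
    counts = Counter(samples)
    probs = [c / n for c in counts.values() if c / n > tail_threshold]
    z = sum(probs)                      # renormalize over the restricted support
    return -sum((p / z) * math.log(p / z) for p in probs)

# Example: a geometric-like distribution over a countably infinite alphabet
random.seed(0)
data = [int(random.expovariate(0.7)) for _ in range(10000)]
print(plugin_entropy(data), plugin_entropy(data, tail_threshold=1e-3))
```
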
15 pages, 1150 KiB  
Article
Polynomial-Time Algorithm for Learning Optimal BFS-Consistent Dynamic Bayesian Networks
by Margarida Sousa and Alexandra M. Carvalho
Entropy 2018, 20(4), 274; https://doi.org/10.3390/e20040274 - 12 Apr 2018
Cited by 4 | Viewed by 4457
Abstract
Dynamic Bayesian networks (DBNs) are powerful probabilistic representations that model stochastic processes. They consist of a prior network, representing the distribution over the initial variables, and a set of transition networks, representing the transition distribution between variables over time. It was shown that learning complex transition networks, considering both intra- and inter-slice connections, is NP-hard. Therefore, the community has searched for the largest subclass of DBNs for which there is an efficient learning algorithm. We introduce a new polynomial-time algorithm for learning optimal DBNs consistent with a breadth-first search (BFS) order, named bcDBN. The proposed algorithm considers the set of networks such that each transition network has a bounded in-degree, allowing for p edges from past time slices (inter-slice connections) and k edges from the current time slice (intra-slice connections) consistent with the BFS order induced by the optimal tree-augmented network (tDBN). This approach enlarges the search space of the state-of-the-art tDBN algorithm exponentially in the number of variables. Concerning worst-case time complexity, given a Markov lag m, a set of n random variables ranging over r values, and a set of observations of N individuals over T time steps, the bcDBN algorithm is linear in N, T and m; polynomial in n and r; and exponential in p and k. We assess the bcDBN algorithm on simulated data against tDBN, revealing that it performs well throughout different experiments. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

13 pages, 4462 KiB  
Article
Compression of a Deep Competitive Network Based on Mutual Information for Underwater Acoustic Targets Recognition
by Sheng Shen, Honghui Yang and Meiping Sheng
Entropy 2018, 20(4), 243; https://doi.org/10.3390/e20040243 - 02 Apr 2018
Cited by 24 | Viewed by 4437
Abstract
The accuracy of underwater acoustic target recognition from limited ship-radiated noise can be improved by a deep neural network trained with a large number of unlabeled samples. However, redundant features learned by a deep neural network have negative effects on recognition accuracy and efficiency. A compressed deep competitive network is proposed to learn and extract features from ship-radiated noise. The core ideas of the algorithm are: (1) Competitive learning: by integrating competitive learning into the restricted Boltzmann machine learning algorithm, the hidden units can share the weights in each predefined group; (2) Network pruning: pruning based on mutual information is deployed to remove redundant parameters and further compress the network. Experiments based on real ship-radiated noise show that the network can increase recognition accuracy with fewer informative features. The compressed deep competitive network achieves a classification accuracy of 89.1%, which is 5.3% higher than the deep competitive network and 13.1% higher than the state-of-the-art signal processing feature extraction methods. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
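
To give a feel for mutual-information-based pruning, the sketch below ranks hidden units by the mutual information between their binarized activations and the class labels and keeps only the top fraction. This is a generic MI filter for illustration; it is not the exact pruning criterion used in the paper, and all names and thresholds are assumptions.

```python
import numpy as np
from sklearn.metrics import mutual_info_score

def prune_hidden_units(hidden_activations, labels, keep_fraction=0.5):
    """Return indices of the hidden units whose (binarized) activations share
    the most mutual information with the labels (generic MI-filter sketch)."""
    binarized = (hidden_activations > np.median(hidden_activations, axis=0)).astype(int)
    mi = np.array([mutual_info_score(labels, binarized[:, j])
                   for j in range(binarized.shape[1])])
    n_keep = max(1, int(keep_fraction * binarized.shape[1]))
    return np.sort(np.argsort(mi)[::-1][:n_keep])

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=500)
acts = rng.normal(size=(500, 16))
acts[:, 0] += 2.0 * labels                 # unit 0 is genuinely informative
print(prune_hidden_units(acts, labels, keep_fraction=0.25))
```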

14 pages, 785 KiB  
Article
Q-Neutrosophic Soft Relation and Its Application in Decision Making
by Majdoleen Abu Qamar and Nasruddin Hassan
Entropy 2018, 20(3), 172; https://doi.org/10.3390/e20030172 - 06 Mar 2018
Cited by 43 | Viewed by 3600
Abstract
Q-neutrosophic soft sets are essentially neutrosophic soft sets characterized by three independent two-dimensional membership functions which stand for uncertainty, indeterminacy and falsity. Thus, they can be applied to the two-dimensional imprecise, indeterminate and inconsistent data which appear in most real-life problems. Relations are a suitable tool for describing correspondences between objects. In this study we introduce and discuss Q-neutrosophic soft relations, which can be seen as a generalization of fuzzy soft relations, intuitionistic fuzzy soft relations, and neutrosophic soft relations. A Q-neutrosophic soft relation is a sub-Q-neutrosophic soft set of the Cartesian product of Q-neutrosophic soft sets; in other words, it is a Q-neutrosophic soft set over a Cartesian product of universes. We also present the notions of the inverse and composition of Q-neutrosophic soft relations and functions, along with some related theorems and properties. Reflexivity, symmetry and transitivity, as well as equivalence relations and equivalence classes of Q-neutrosophic soft relations, are also defined. Some properties of these concepts are presented and supported by real-life examples. Finally, an algorithm to solve decision-making problems using Q-neutrosophic soft relations is developed and verified by an example to show the efficiency of this method. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
25 pages, 400 KiB  
Article
Sequential Change-Point Detection via Online Convex Optimization
by Yang Cao, Liyan Xie, Yao Xie and Huan Xu
Entropy 2018, 20(2), 108; https://doi.org/10.3390/e20020108 - 07 Feb 2018
Cited by 17 | Viewed by 6134
Abstract
Sequential change-point detection when the distribution parameters are unknown is a fundamental problem in statistics and machine learning. When the post-change parameters are unknown, we consider a set of detection procedures based on sequential likelihood ratios with non-anticipating estimators constructed using online convex optimization algorithms such as online mirror descent, which provides a more versatile approach to tackling complex situations where recursive maximum likelihood estimators cannot be found. When the underlying distributions belong to an exponential family and the estimators satisfy the logarithmic regret property, we show that this approach is nearly second-order asymptotically optimal. This means that the upper bound for the false alarm rate of the algorithm (measured by the average run length) meets the lower bound asymptotically up to a log-log factor when the threshold tends to infinity. Our proof is achieved by making a connection between sequential change-point detection and online convex optimization and leveraging the logarithmic regret bound of the online mirror descent algorithm. Numerical and real data examples validate our theory. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
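
A bare-bones rendering of the plug-in idea for a Gaussian mean shift: replace the unknown post-change mean by a non-anticipating running estimate and feed it into a CUSUM-style likelihood-ratio recursion. This is a simplified sketch, not the authors' online-mirror-descent procedure; the threshold and the floor on the estimated mean are illustrative assumptions.

```python
import numpy as np

def adaptive_cusum(x, threshold=10.0, min_mean=0.1):
    """Detect an upward mean shift in unit-variance Gaussian data when the
    post-change mean is unknown (simplified non-anticipating plug-in sketch)."""
    w, s, n = 0.0, 0.0, 0                    # statistic, running sum, running count
    for t, xt in enumerate(x):
        theta = max(min_mean, s / n) if n > 0 else min_mean   # estimate uses past data only
        w += theta * xt - 0.5 * theta ** 2   # log-likelihood ratio increment: N(theta,1) vs N(0,1)
        if w >= threshold:
            return t                         # alarm time
        if w <= 0.0:
            w, s, n = 0.0, 0.0, 0            # reset the statistic and restart the estimate
        else:
            s, n = s + xt, n + 1
    return None

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 300), rng.normal(0.8, 1.0, 200)])
print(adaptive_cusum(data))                  # typically fires a few dozen samples after index 300
```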

18 pages, 805 KiB  
Article
The Complex Neutrosophic Soft Expert Relation and Its Multiple Attribute Decision-Making Method
by Ashraf Al-Quran and Nasruddin Hassan
Entropy 2018, 20(2), 101; https://doi.org/10.3390/e20020101 - 31 Jan 2018
Cited by 28 | Viewed by 3278
Abstract
This paper introduces a novel soft computing technique, called the complex neutrosophic soft expert relation (CNSER), to evaluate the degree of interaction between two hybrid models called complex neutrosophic soft expert sets (CNSESs). CNSESs are used to represent two-dimensional data that are imprecise, uncertain, incomplete and indeterminate. Moreover, it has a mechanism to incorporate the parameter set and the opinions of all experts in one model, thus making it highly suitable for use in decision-making problems where the time factor plays a key role in determining the final decision. The complex neutrosophic soft expert set and complex neutrosophic soft expert relation are both defined. Utilizing the properties of the CNSER introduced, an empirical study is conducted on the relationship between the variability of the currency exchange rate and Malaysian exports and the time frame (phase) of the interaction between these two variables. This study is supported further by an algorithm to determine the type and the degree of this relationship. A comparison between different existing relations and the CNSER is provided to show the advantages of our proposed CNSER. Then, the notions of the inverse, complement and composition of CNSERs, along with some related theorems and properties, are introduced. Finally, we define the symmetry, transitivity and reflexivity of CNSERs, as well as the equivalence relation and equivalence classes on CNSESs. Some interesting properties are also obtained. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
25 pages, 9074 KiB  
Article
Gaussian Guided Self-Adaptive Wolf Search Algorithm Based on Information Entropy Theory
by Qun Song, Simon Fong, Suash Deb and Thomas Hanne
Entropy 2018, 20(1), 37; https://doi.org/10.3390/e20010037 - 10 Jan 2018
Cited by 3 | Viewed by 4142
Abstract
Nowadays, swarm intelligence algorithms are becoming increasingly popular for solving many optimization problems. The Wolf Search Algorithm (WSA) is a contemporary semi-swarm intelligence algorithm designed to solve complex optimization problems, and it has demonstrated its capability especially for large-scale problems. However, it still inherits a weakness common to other swarm intelligence algorithms: its performance is heavily dependent on the chosen values of the control parameters. In 2016, we published the Self-Adaptive Wolf Search Algorithm (SAWSA), which offers a simple solution to the adaptation problem. As a very simple schema, the original SAWSA adaptation is based on random guesses, which is unstable and naive. In this paper, based on the SAWSA, we investigate the WSA search behaviour more deeply. A new parameter-guided updater, the Gaussian-guided parameter control mechanism based on information entropy theory, is proposed as an enhancement of the SAWSA. The heuristic updating function is improved. Simulation experiments for the new method, denoted the Gaussian-Guided Self-Adaptive Wolf Search Algorithm (GSAWSA), validate the increased performance of the improved version of the WSA in comparison to its standard version and other prevalent swarm algorithms. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

2308 KiB  
Article
Context-Aware Generative Adversarial Privacy
by Chong Huang, Peter Kairouz, Xiao Chen, Lalitha Sankar and Ram Rajagopal
Entropy 2017, 19(12), 656; https://doi.org/10.3390/e19120656 - 01 Dec 2017
Cited by 90 | Viewed by 7535
Abstract
Preserving the utility of published datasets while simultaneously providing provable privacy guarantees is a well-known challenge. On the one hand, context-free privacy solutions, such as differential privacy, provide strong privacy guarantees, but often lead to a significant reduction in utility. On the other hand, context-aware privacy solutions, such as information theoretic privacy, achieve an improved privacy-utility tradeoff, but assume that the data holder has access to dataset statistics. We circumvent these limitations by introducing a novel context-aware privacy framework called generative adversarial privacy (GAP). GAP leverages recent advancements in generative adversarial networks (GANs) to allow the data holder to learn privatization schemes from the dataset itself. Under GAP, learning the privacy mechanism is formulated as a constrained minimax game between two players: a privatizer that sanitizes the dataset in a way that limits the risk of inference attacks on the individuals’ private variables, and an adversary that tries to infer the private variables from the sanitized dataset. To evaluate GAP’s performance, we investigate two simple (yet canonical) statistical dataset models: (a) the binary data model; and (b) the binary Gaussian mixture model. For both models, we derive game-theoretically optimal minimax privacy mechanisms, and show that the privacy mechanisms learned from data (in a generative adversarial fashion) match the theoretically optimal ones. This demonstrates that our framework can be easily applied in practice, even in the absence of dataset statistics. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

309 KiB  
Article
On Lower Bounds for Statistical Learning Theory
by Po-Ling Loh
Entropy 2017, 19(11), 617; https://doi.org/10.3390/e19110617 - 15 Nov 2017
Cited by 8 | Viewed by 7174
Abstract
In recent years, tools from information theory have played an increasingly prevalent role in statistical machine learning. In addition to developing efficient, computationally feasible algorithms for analyzing complex datasets, it is of theoretical importance to determine whether such algorithms are “optimal” in the sense that no other algorithm can lead to smaller statistical error. This paper provides a survey of various techniques used to derive information-theoretic lower bounds for estimation and learning. We focus on the settings of parameter and function estimation, community recovery, and online learning for multi-armed bandits. A common theme is that lower bounds are established by relating the statistical learning problem to a channel decoding problem, for which lower bounds may be derived involving information-theoretic quantities such as the mutual information, total variation distance, and Kullback–Leibler divergence. We close by discussing the use of information-theoretic quantities to measure independence in machine learning applications ranging from causality to medical imaging, and mention techniques for estimating these quantities efficiently in a data-driven manner. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
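
For readers new to this line of work, the prototypical reduction mentioned in the abstract — from estimation to multi-way hypothesis testing over a channel — is usually closed with Fano's inequality, a standard result that converts a mutual-information bound into a lower bound on the error probability:

```latex
% Fano's inequality: if V is uniform over M hypotheses and \hat{V} is any
% estimator of V based on the observation X, then
P\{\hat{V} \neq V\} \;\ge\; 1 - \frac{I(V;X) + \log 2}{\log M}.
% Whenever I(V;X) is small relative to \log M, every procedure must incur a
% large error, which is how many of the surveyed minimax lower bounds arise.
```
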
2627 KiB  
Article
Discovering Potential Correlations via Hypercontractivity
by Hyeji Kim, Weihao Gao, Sreeram Kannan, Sewoong Oh and Pramod Viswanath
Entropy 2017, 19(11), 586; https://doi.org/10.3390/e19110586 - 02 Nov 2017
Cited by 1 | Viewed by 4511
Abstract
Discovering a correlation from one variable to another variable is of fundamental scientific and practical interest. While existing correlation measures are suitable for discovering average correlation, they fail to discover hidden or potential correlations. To bridge this gap, (i) we postulate a set of natural axioms that we expect a measure of potential correlation to satisfy; (ii) we show that the rate of information bottleneck, i.e., the hypercontractivity coefficient, satisfies all the proposed axioms; (iii) we provide a novel estimator to estimate the hypercontractivity coefficient from samples; and (iv) we provide numerical experiments demonstrating that this proposed estimator discovers potential correlations among various indicators of WHO datasets, is robust in discovering gene interactions from gene expression time series data, and is statistically more powerful than the estimators for other correlation measures in binary hypothesis testing of canonical examples of potential correlations. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

9103 KiB  
Article
Entropy Ensemble Filter: A Modified Bootstrap Aggregating (Bagging) Procedure to Improve Efficiency in Ensemble Model Simulation
by Hossein Foroozand and Steven V. Weijs
Entropy 2017, 19(10), 520; https://doi.org/10.3390/e19100520 - 28 Sep 2017
Cited by 10 | Viewed by 4732
Abstract
Over the past two decades, the Bootstrap AGGregatING (bagging) method has been widely used for improving simulation. The computational cost of this method scales with the size of the ensemble, but excessively reducing the ensemble size comes at the cost of reduced predictive performance. The novel procedure proposed in this study is the Entropy Ensemble Filter (EEF), which uses the most informative training data sets in the ensemble rather than all ensemble members created by the bagging method. The results of this study indicate the efficiency of the proposed method in application to synthetic data simulation on a sinusoidal signal, a sawtooth signal, and a composite signal. The EEF method can reduce the computational time of simulation by around 50% on average while maintaining predictive performance at the same level as the conventional method, in which all of the ensemble models are used for simulation. The analysis of the error gradient (root mean square error of ensemble averages) shows that using the 40% most informative ensemble members of the set initially defined by the user appears to be most effective. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
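
A toy sketch of the selection step described above: rank the bootstrap training sets by the Shannon entropy of their binned samples and keep only the most informative fraction before training the ensemble. The binning rule, the 40% default, and the function names are illustrative assumptions rather than the paper's exact procedure.

```python
import numpy as np

def sample_entropy(sample, bins=20):
    """Shannon entropy (nats) of the histogram of a 1-D sample."""
    counts, _ = np.histogram(sample, bins=bins)
    p = counts[counts > 0] / counts.sum()
    return -np.sum(p * np.log(p))

def entropy_ensemble_filter(bootstrap_sets, keep_fraction=0.4):
    """Keep the most informative bootstrap training sets, ranked by entropy."""
    ent = np.array([sample_entropy(b) for b in bootstrap_sets])
    n_keep = max(1, int(keep_fraction * len(bootstrap_sets)))
    keep = np.argsort(ent)[::-1][:n_keep]
    return [bootstrap_sets[i] for i in keep]

rng = np.random.default_rng(0)
signal = np.sin(np.linspace(0, 8 * np.pi, 1000))                 # sinusoidal toy signal
boots = [rng.choice(signal, size=signal.size, replace=True) for _ in range(50)]
selected = entropy_ensemble_filter(boots)                        # train ensemble members on these only
print(len(selected), "of", len(boots), "bootstrap sets retained")
```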

589 KiB  
Article
Survey on Probabilistic Models of Low-Rank Matrix Factorizations
by Jiarong Shi, Xiuyun Zheng and Wei Yang
Entropy 2017, 19(8), 424; https://doi.org/10.3390/e19080424 - 19 Aug 2017
Cited by 11 | Viewed by 5642
Abstract
Low-rank matrix factorizations such as Principal Component Analysis (PCA), Singular Value Decomposition (SVD) and Non-negative Matrix Factorization (NMF) are a large class of methods for pursuing the low-rank approximation of a given data matrix. The conventional factorization models are based on the assumption that the data matrices are contaminated stochastically by some type of noise. Thus, the point estimates of the low-rank components can be obtained by Maximum Likelihood (ML) or Maximum a Posteriori (MAP) estimation. In the past decade, a variety of probabilistic models of low-rank matrix factorizations have emerged. The most significant difference between low-rank matrix factorizations and their corresponding probabilistic models is that the latter treat the low-rank components as random variables. This paper surveys the probabilistic models of low-rank matrix factorizations. Firstly, we review some probability distributions commonly used in probabilistic models of low-rank matrix factorizations and introduce the conjugate priors of some probability distributions to simplify the Bayesian inference. Then we provide the two main inference methods for probabilistic low-rank matrix factorizations, i.e., Gibbs sampling and variational Bayesian inference. Next, we roughly classify the important probabilistic models of low-rank matrix factorizations into several categories and review them respectively. The categorization is based on different matrix factorization formulations, which mainly include PCA, matrix factorizations, robust PCA, NMF and tensor factorizations. Finally, we discuss research issues that need to be studied in the future. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
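
One concrete link between the probabilistic and the classical viewpoints surveyed above: with Gaussian noise and independent zero-mean Gaussian priors on the factors, MAP estimation of a low-rank factorization reduces to ridge-regularized alternating least squares. The sketch below makes that reduction explicit; variable names and hyperparameters are illustrative, and it does not correspond to any single model in the survey.

```python
import numpy as np

def map_matrix_factorization(X, rank=5, noise_var=0.1, prior_var=1.0, n_iter=50):
    """MAP estimate of X ~ U V^T + noise, with Gaussian priors on U and V.

    Under these assumptions the MAP objective is a ridge-penalized least-squares
    problem, solved here by alternating closed-form updates (illustrative sketch).
    """
    m, n = X.shape
    lam = noise_var / prior_var                 # effective ridge penalty
    rng = np.random.default_rng(0)
    U = rng.normal(scale=0.1, size=(m, rank))
    V = rng.normal(scale=0.1, size=(n, rank))
    I = np.eye(rank)
    for _ in range(n_iter):
        U = X @ V @ np.linalg.inv(V.T @ V + lam * I)
        V = X.T @ U @ np.linalg.inv(U.T @ U + lam * I)
    return U, V

rng = np.random.default_rng(1)
truth = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))       # rank-3 ground truth
U, V = map_matrix_factorization(truth + 0.1 * rng.normal(size=truth.shape), rank=3)
print(np.linalg.norm(truth - U @ V.T) / np.linalg.norm(truth))    # small relative error
```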

325 KiB  
Article
Estimating Mixture Entropy with Pairwise Distances
by Artemy Kolchinsky and Brendan D. Tracey
Entropy 2017, 19(7), 361; https://doi.org/10.3390/e19070361 - 14 Jul 2017
Cited by 72 | Viewed by 8270 | Correction
Abstract
Mixture distributions arise in many parametric and non-parametric settings—for example, in Gaussian mixture models and in non-parametric estimation. It is often necessary to compute the entropy of a mixture, but, in most cases, this quantity has no closed-form expression, making some form of approximation necessary. We propose a family of estimators based on a pairwise distance function between mixture components, and show that this estimator class has many attractive properties. For many distributions of interest, the proposed estimators are efficient to compute, differentiable in the mixture parameters, and become exact when the mixture components are clustered. We prove that this family includes lower and upper bounds on the mixture entropy. The Chernoff α-divergence gives a lower bound when chosen as the distance function, with the Bhattacharyya distance providing the tightest lower bound for components that are symmetric and members of a location family. The Kullback–Leibler divergence gives an upper bound when used as the distance function. We provide closed-form expressions of these bounds for mixtures of Gaussians, and discuss their applications to the estimation of mutual information. We then demonstrate that our bounds are significantly tighter than well-known existing bounds using numeric simulations. This estimator class is very useful in optimization problems involving maximization/minimization of entropy and mutual information, such as MaxEnt and rate distortion problems. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
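
For a one-dimensional Gaussian mixture, the pairwise bounds described above are easy to evaluate numerically. The sketch below uses a pairwise construction of the form sum_i c_i H(p_i) - sum_i c_i ln sum_j c_j exp(-D(p_i||p_j)), with the KL divergence as the distance for the upper bound and the Bhattacharyya distance for the lower bound, and compares both against a Monte Carlo estimate; treat the exact estimator form as my reading of the paper rather than a verbatim transcription.

```python
import numpy as np

def gauss_entropy(var):                        # differential entropy of N(mu, var), in nats
    return 0.5 * np.log(2 * np.pi * np.e * var)

def kl_gauss(m1, v1, m2, v2):                  # KL( N(m1,v1) || N(m2,v2) )
    return 0.5 * (np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def bhattacharyya_gauss(m1, v1, m2, v2):       # Bhattacharyya distance between two Gaussians
    v = 0.5 * (v1 + v2)
    return 0.125 * (m1 - m2) ** 2 / v + 0.5 * np.log(v / np.sqrt(v1 * v2))

def pairwise_estimate(c, mu, var, dist):
    """sum_i c_i H(p_i) - sum_i c_i ln sum_j c_j exp(-D(p_i||p_j))  (sketch)."""
    c = np.asarray(c, dtype=float)
    ker = np.array([[np.exp(-dist(mi, vi, mj, vj)) for mj, vj in zip(mu, var)]
                    for mi, vi in zip(mu, var)])
    return c @ np.array([gauss_entropy(v) for v in var]) - c @ np.log(ker @ c)

c, mu, var = [0.5, 0.5], [0.0, 3.0], [1.0, 1.0]
lower = pairwise_estimate(c, mu, var, bhattacharyya_gauss)   # Bhattacharyya -> lower bound
upper = pairwise_estimate(c, mu, var, kl_gauss)              # KL            -> upper bound

# Monte Carlo reference for the true mixture entropy
rng = np.random.default_rng(0)
comp = rng.choice(2, size=200_000, p=c)
x = rng.normal(np.asarray(mu)[comp], np.sqrt(np.asarray(var)[comp]))
dens = sum(ci * np.exp(-(x - mi) ** 2 / (2 * vi)) / np.sqrt(2 * np.pi * vi)
           for ci, mi, vi in zip(c, mu, var))
print(lower, -np.mean(np.log(dens)), upper)                  # lower <= MC <= upper
```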

922 KiB  
Article
Overfitting Reduction of Text Classification Based on AdaBELM
by Xiaoyue Feng, Yanchun Liang, Xiaohu Shi, Dong Xu, Xu Wang and Renchu Guan
Entropy 2017, 19(7), 330; https://doi.org/10.3390/e19070330 - 06 Jul 2017
Cited by 23 | Viewed by 6213
Abstract
Overfitting is an important problem in machine learning. Several algorithms, such as the extreme learning machine (ELM), suffer from this issue when facing high-dimensional sparse data, e.g., in text classification. One common issue is that the extent of overfitting is not well quantified. In this paper, we propose a quantitative measure of overfitting referred to as the rate of overfitting (RO) and a novel model, named AdaBELM, to reduce overfitting. With RO, the overfitting problem can be quantitatively measured and identified. The newly proposed model can achieve high performance on multi-class text classification. To evaluate the generalizability of the new model, we designed experiments based on three datasets, i.e., the 20 Newsgroups, Reuters-21578, and BioMed corpora, which represent balanced, unbalanced, and real application data, respectively. Experimental results demonstrate that AdaBELM can reduce overfitting and outperform the classical ELM, decision trees, random forests, and AdaBoost on all three text-classification datasets; for example, it can achieve 62.2% higher accuracy than ELM. Therefore, the proposed model has good generalizability. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

346 KiB  
Article
Rate-Distortion Bounds for Kernel-Based Distortion Measures
by Kazuho Watanabe
Entropy 2017, 19(7), 336; https://doi.org/10.3390/e19070336 - 05 Jul 2017
Viewed by 3355
Abstract
Kernel methods have been used for turning linear learning algorithms into nonlinear ones. These nonlinear algorithms measure distances between data points by the distance in the kernel-induced feature space. In lossy data compression, the optimal tradeoff between the number of quantized points and the incurred distortion is characterized by the rate-distortion function. However, the rate-distortion functions associated with distortion measures involving kernel feature mapping have yet to be analyzed. We consider two reconstruction schemes, reconstruction in input space and reconstruction in feature space, and provide bounds on the rate-distortion functions for these schemes. Comparison of the derived bounds to the quantizer performance obtained by the kernel K-means method suggests that the rate-distortion bounds for input space and feature space reconstructions are informative at low and high distortion levels, respectively. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
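
For readers unfamiliar with the central quantity being bounded here, the rate-distortion function of a source X under a distortion measure d with budget D is the standard textbook object

```latex
% Rate-distortion function: the smallest achievable rate at average distortion D
R(D) \;=\; \min_{p(\hat{x}\mid x)\,:\, \mathbb{E}[d(X,\hat{X})] \le D} I(X;\hat{X}),
% e.g., a Gaussian source with variance \sigma^2 under squared-error distortion has
% R(D) = \tfrac{1}{2}\log\!\left(\sigma^2 / D\right) for 0 < D \le \sigma^2.
```

and the paper bounds its analogue when the distortion is measured in a kernel-induced feature space.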

253 KiB  
Article
The Expected Missing Mass under an Entropy Constraint
by Daniel Berend, Aryeh Kontorovich and Gil Zagdanski
Entropy 2017, 19(7), 315; https://doi.org/10.3390/e19070315 - 29 Jun 2017
Cited by 3 | Viewed by 2983
Abstract
In Berend and Kontorovich (2012), the following problem was studied: A random sample of size t is taken from a world (i.e., probability space) of size n; bound the expected value of the probability of the set of elements not appearing in the sample (unseen mass) in terms of t and n. Here we study the same problem, where the world may be countably infinite, and the probability measure on it is restricted to have an entropy of at most h. We provide tight bounds on the maximum of the expected unseen mass, along with a characterization of the measures attaining this maximum. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
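
The expected missing mass has a simple closed form for a fixed distribution, E[M_t] = sum_x p(x)(1 − p(x))^t, which makes the effect of an entropy constraint easy to explore numerically. The distributions below are illustrative choices, not the extremal measures characterized in the paper.

```python
import numpy as np

def expected_missing_mass(p, t):
    """E[ total probability of symbols unseen in an i.i.d. sample of size t ]."""
    p = np.asarray(p, dtype=float)
    return float(np.sum(p * (1.0 - p) ** t))

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

# Same support size, very different entropies -> very different missing mass.
uniform = np.full(1000, 1.0 / 1000)
geometric = 0.5 ** np.arange(1, 1001)
geometric /= geometric.sum()
for name, p in [("uniform", uniform), ("geometric", geometric)]:
    print(name, round(entropy(p), 3), round(expected_missing_mass(p, t=200), 4))
```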

4272 KiB  
Article
A Novel Geometric Dictionary Construction Approach for Sparse Representation Based Image Fusion
by Kunpeng Wang, Guanqiu Qi, Zhiqin Zhu and Yi Chai
Entropy 2017, 19(7), 306; https://doi.org/10.3390/e19070306 - 27 Jun 2017
Cited by 67 | Viewed by 5444
Abstract
Sparse-representation based approaches have been integrated into image fusion methods in the past few years and show great performance in image fusion. Training an informative and compact dictionary is a key step for a sparsity-based image fusion method. However, it is difficult to balance “informative” and “compact”. In order to obtain sufficient information for sparse representation in dictionary construction, this paper classifies image patches from source images into different groups based on morphological similarities. Stochastic coordinate coding (SCC) is used to extract corresponding image-patch information for dictionary construction. According to the constructed dictionary, image patches of the source images are converted to sparse coefficients by the simultaneous orthogonal matching pursuit (SOMP) algorithm. Finally, the sparse coefficients are fused by the Max-L1 fusion rule and inverted to obtain the fused image. Comparison experiments are conducted to evaluate the fused image in terms of image features, information, structural similarity, and visual perception. The results confirm the feasibility and effectiveness of the proposed image fusion solution. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

2416 KiB  
Article
Face Verification with Multi-Task and Multi-Scale Feature Fusion
by Xiaojun Lu, Yue Yang, Weilin Zhang, Qi Wang and Yang Wang
Entropy 2017, 19(5), 228; https://doi.org/10.3390/e19050228 - 17 May 2017
Cited by 10 | Viewed by 5624
Abstract
Face verification for unrestricted faces in the wild is a challenging task. This paper proposes a method based on two deep convolutional neural networks (CNNs) for face verification. In this work, we explore using identification signals to supervise one CNN and a combination of semi-verification and identification signals to train the other. In order to estimate the semi-verification loss at a low computational cost, a circle composed of all faces is used for selecting face pairs from pairwise samples. In the process of face normalization, we propose using different landmarks of faces to solve the problems caused by poses. In addition, the final face representation is formed by concatenating the features of each deep CNN after principal component analysis (PCA) reduction. Furthermore, each feature is a combination of multi-scale representations obtained through the use of auxiliary classifiers. For the final verification, we adopt only the face representation of one region and one resolution of a face, combined with a Joint Bayesian classifier. Experiments show that our method can extract an effective face representation with a small training dataset, and our algorithm achieves 99.71% verification accuracy on the Labeled Faces in the Wild (LFW) dataset. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)

291 KiB  
Article
Consistent Estimation of Partition Markov Models
by Jesús E. García and Verónica A. González-López
Entropy 2017, 19(4), 160; https://doi.org/10.3390/e19040160 - 06 Apr 2017
Cited by 25 | Viewed by 4444
Abstract
The Partition Markov Model characterizes the process by a partition L of the state space, where the elements in each part of L share the same transition probability to an arbitrary element of the alphabet. This model aims to answer the following questions: what is the minimal number of parameters needed to specify a Markov chain, and how can these parameters be estimated? In order to answer these questions, we build a consistent strategy for model selection which consists of the following: given a size-n realization of the process, find a model within the Partition Markov class, with a minimal number of parts, to represent the process law. From this strategy, we derive a measure that establishes a metric on the state space. In addition, we show that if the law of the process is Markovian, then, eventually, as n goes to infinity, L will be retrieved. We show an application to modeling internet navigation patterns. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
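
The central object above, a partition of the state space into parts that share a transition law, can be illustrated with a naive procedure: estimate the empirical transition rows and greedily group states whose rows are close in total variation. The fixed tolerance below is an ad hoc stand-in for the consistent, penalty-based criterion derived in the paper.

```python
import numpy as np

def empirical_transitions(seq, alphabet):
    """Row-stochastic matrix of empirical transition probabilities."""
    idx = {s: i for i, s in enumerate(alphabet)}
    counts = np.zeros((len(alphabet), len(alphabet)))
    for a, b in zip(seq[:-1], seq[1:]):
        counts[idx[a], idx[b]] += 1
    counts += 1e-12                                      # guard against empty rows
    return counts / counts.sum(axis=1, keepdims=True)

def partition_states(seq, alphabet, tol=0.1):
    """Group states whose empirical transition rows are within `tol` in total
    variation -- a naive stand-in for the paper's consistent criterion."""
    P = empirical_transitions(seq, alphabet)
    parts = []
    for i in range(len(alphabet)):
        for part in parts:
            if 0.5 * np.abs(P[i] - P[part[0]]).sum() < tol:
                part.append(i)
                break
        else:
            parts.append([i])
    return [[alphabet[i] for i in part] for part in parts]

# States 'a' and 'b' share the same transition law; 'c' does not.
rng = np.random.default_rng(0)
law = {'a': [0.1, 0.1, 0.8], 'b': [0.1, 0.1, 0.8], 'c': [0.45, 0.45, 0.1]}
seq, state = [], 'a'
for _ in range(20000):
    seq.append(state)
    state = rng.choice(['a', 'b', 'c'], p=law[state])
print(partition_states(seq, ['a', 'b', 'c']))            # expect [['a', 'b'], ['c']]
```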

Other

4928 KiB  
Letter
Discovery of Kolmogorov Scaling in the Natural Language
by Maurice H. P. M. Van Putten
Entropy 2017, 19(5), 198; https://doi.org/10.3390/e19050198 - 02 May 2017
Viewed by 4802
Abstract
We consider the rate R and variance σ² of Shannon information in snippets of text based on word frequencies in the natural language. We empirically identify Kolmogorov’s scaling law σ² ∝ k^(−1.66 ± 0.12) (95% c.l.) as a function of k = 1/N measured by word count N. This result highlights a potential association of information flow in snippets, analogous to energy cascade in turbulent eddies in fluids at high Reynolds numbers. We propose R and σ² as robust utility functions for objective ranking of concordances in efficient search for maximal information seamlessly across different languages and as a starting point for artificial attention. Full article
(This article belongs to the Special Issue Information Theory in Machine Learning and Data Science)
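
A rough rendering of the measurement described above: assign each word the surprisal −ln f(w) from its corpus frequency, form non-overlapping snippets of N words, and examine how the variance of the snippet information scales with k = 1/N. The corpus path, the choice of snippet lengths, and the reading of σ² as the variance of total snippet information are placeholder assumptions; consult the paper for the exact definitions.

```python
import numpy as np
from collections import Counter

def snippet_information(words, snippet_len):
    """Total Shannon information (nats) of each non-overlapping snippet, using
    corpus word frequencies for the per-word surprisal -ln f(w)."""
    freq = Counter(words)
    total = len(words)
    info = np.array([-np.log(freq[w] / total) for w in words])
    n_snips = total // snippet_len
    return info[:n_snips * snippet_len].reshape(n_snips, snippet_len).sum(axis=1)

# Placeholder corpus: any long plain-text file will do.
with open("corpus.txt", encoding="utf-8") as f:
    words = f.read().lower().split()

lengths = np.array([8, 16, 32, 64, 128, 256])
k = 1.0 / lengths                                        # k = 1/N, as in the abstract
variances = [snippet_information(words, int(n)).var() for n in lengths]
slope = np.polyfit(np.log(k), np.log(variances), 1)[0]
print(f"sigma^2 ~ k^{slope:.2f}")                        # the paper reports -1.66 +/- 0.12
```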
