
Probabilistic Methods for Deep Learning

A special issue of Entropy (ISSN 1099-4300). This special issue belongs to the section "Information Theory, Probability and Statistics".

Deadline for manuscript submissions: closed (1 October 2021) | Viewed by 32337

Special Issue Editors


Dr. Eric Nalisnick
Guest Editor
Amsterdam Machine Learning Lab, Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands
Interests: statistical machine learning; probabilistic modeling; Bayesian statistics; deep learning; generative models; approximate inference

Dr. Dustin Tran
Guest Editor
Google Brain, Mountain View, CA, USA
Interests: statistical machine learning; Bayesian statistics; deep learning; uncertainty; robustness

Special Issue Information

Dear colleagues,

The umbrella of techniques known as deep learning has had empirical success across a variety of predictive modeling tasks. Consequently, there is hope that deep learning can catalyze progress in medicine, the sciences, and other domains of consequence. Yet, many deep learning techniques are ill-equipped for these new settings in which safety and transparency are crucial for their success. For instance, neural networks have been shown to be overconfident, which could lead to them being unduly trusted to make a medical diagnosis.

Combining deep learning with probabilistic and statistical methodologies is one potential way to overcome—or at least ameliorate—these shortcomings. A probabilistic approach can quantify a network’s uncertainty, allowing for more informed downstream decision making. Of course, this is a non-trivial pursuit, as deep learning incurs computational and analytical difficulties that do not plague more traditional models. Adapting deep learning so that its robustness and uncertainty can be quantified without sacrificing predictive power is an open and challenging problem.

In this Special Issue, we aim to highlight work at the intersection of deep learning, probabilistic modeling, and statistical inference. In particular, we welcome work on Bayesian neural networks, deep latent variable models, deep ensembles, networks with statistical guarantees (e.g., via conformal inference), and probabilistic understanding of neural networks (e.g., via infinite limits).

Dr. Eric Nalisnick
Dr. Dustin Tran
Guest Editors

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to the website. Once registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers are published continuously in the journal (as soon as accepted) and are listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Entropy is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 2600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • deep learning
  • neural networks
  • probabilistic modeling
  • Bayesian statistics
  • statistical inference
  • uncertainty quantification
  • robustness

Published Papers (11 papers)


Research

18 pages, 3420 KiB  
Article
Perfect Density Models Cannot Guarantee Anomaly Detection
by Charline Le Lan and Laurent Dinh
Entropy 2021, 23(12), 1690; https://doi.org/10.3390/e23121690 - 16 Dec 2021
Cited by 16 | Viewed by 3767
Abstract
Thanks to the tractability of their likelihood, several deep generative models show promise for seemingly straightforward but important applications like anomaly detection, uncertainty estimation, and active learning. However, the likelihood values empirically attributed to anomalies conflict with the expectations these proposed applications suggest. In this paper, we take a closer look at the behavior of distribution densities through the lens of reparametrization and show that these quantities carry less meaningful information than previously thought, beyond estimation issues or the curse of dimensionality. We conclude that the use of these likelihoods for anomaly detection relies on strong and implicit hypotheses, and highlight the necessity of explicitly formulating these assumptions for reliable anomaly detection.

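The reparametrization argument can be seen numerically with a one-line change of variables: mapping Gaussian data through its own CDF sends every point to the same density, so the likelihood ranking that would flag an anomaly disappears. A minimal sketch using SciPy (the specific points are illustrative, not taken from the paper):

    import numpy as np
    from scipy.stats import norm

    # Density rankings are not invariant under an invertible reparametrization:
    # x -> Phi(x) (the Gaussian CDF) maps N(0, 1) to Uniform(0, 1), where every point
    # has the same density, so "low likelihood" alone cannot certify an anomaly.
    x = np.array([0.0, 1.0, 3.0])      # typical point, mild outlier, strong outlier under N(0, 1)
    p_x = norm.pdf(x)                  # densities differ by two orders of magnitude
    p_u = norm.pdf(x) / norm.pdf(x)    # change of variables: p_U(Phi(x)) = p_X(x) / |dPhi/dx| = 1
    print(p_x)                         # approximately [0.399, 0.242, 0.004]
    print(p_u)                         # [1. 1. 1.] -- the outlier is no longer low-density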

15 pages, 549 KiB  
Article
Gradient Regularization as Approximate Variational Inference
by Ali Unlu and Laurence Aitchison
Entropy 2021, 23(12), 1629; https://doi.org/10.3390/e23121629 - 03 Dec 2021
Cited by 4 | Viewed by 2095
Abstract
We developed Variational Laplace for Bayesian neural networks (BNNs), which exploits a local approximation of the curvature of the likelihood to estimate the ELBO without the need for stochastic sampling of the neural-network weights. The Variational Laplace objective is simple to evaluate, as it is the log-likelihood plus weight-decay, plus a squared-gradient regularizer. Variational Laplace gave better test performance and expected calibration errors than maximum a posteriori inference and standard sampling-based variational inference, despite using the same variational approximate posterior. Finally, we emphasize the care needed in benchmarking standard VI, as there is a risk of stopping before the variance parameters have converged. We show that early-stopping can be avoided by increasing the learning rate for the variance parameters.

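A minimal PyTorch sketch of an objective with the shape described above—log-likelihood plus weight decay plus a squared-gradient penalty. The coefficients and the toy network below are placeholders; the curvature-based derivation that fixes the Variational Laplace weights is in the paper and is not reproduced here.

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Linear(64, 1))
    x, y = torch.randn(32, 10), torch.randn(32, 1)

    def regularized_objective(model, x, y, weight_decay=1e-4, grad_penalty=1e-3):
        nll = nn.functional.mse_loss(model(x), y)        # Gaussian negative log-likelihood, up to constants
        params = list(model.parameters())
        grads = torch.autograd.grad(nll, params, create_graph=True)
        sq_grad = sum(g.pow(2).sum() for g in grads)     # squared-gradient regularizer
        wd = sum(p.pow(2).sum() for p in params)         # weight decay (Gaussian prior on the weights)
        return nll + weight_decay * wd + grad_penalty * sq_grad

    loss = regularized_objective(model, x, y)
    loss.backward()                                      # gradients flow through the gradient penalty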

17 pages, 2791 KiB  
Article
Empirical Frequentist Coverage of Deep Learning Uncertainty Quantification Procedures
by Benjamin Kompa, Jasper Snoek and Andrew L. Beam
Entropy 2021, 23(12), 1608; https://doi.org/10.3390/e23121608 - 30 Nov 2021
Cited by 9 | Viewed by 2495
Abstract
Uncertainty quantification for complex deep learning models is increasingly important as these techniques see growing use in high-stakes, real-world settings. Currently, the quality of a model’s uncertainty is evaluated using point-prediction metrics, such as the negative log-likelihood (NLL), expected calibration error (ECE), or the Brier score on held-out data. Marginal coverage of prediction intervals or sets, a well-known concept in the statistical literature, is an intuitive alternative to these metrics but has yet to be systematically studied for many popular uncertainty quantification techniques for deep learning models. With marginal coverage and the complementary notion of the width of a prediction interval, downstream users of deployed machine learning models can better understand uncertainty quantification both on a global dataset level and on a per-sample basis. In this study, we provide the first large-scale evaluation of the empirical frequentist coverage properties of well-known uncertainty quantification techniques on a suite of regression and classification tasks. We find that, in general, some methods do achieve desirable coverage properties on in-distribution samples, but that coverage is not maintained on out-of-distribution data. Our results demonstrate the failings of current uncertainty quantification techniques as dataset shift increases and reinforce coverage as an important metric in developing models for real-world applications.

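The two quantities studied here, marginal coverage and average interval width, are straightforward to compute once a method emits prediction intervals. A toy sketch (the fixed-width Gaussian intervals stand in for whatever uncertainty quantification method is being evaluated):

    import numpy as np

    rng = np.random.default_rng(0)
    y_true = rng.normal(size=1000)                 # held-out targets
    sigma = 1.2                                    # a (possibly miscalibrated) predictive std
    lower = -1.96 * sigma * np.ones(1000)
    upper = 1.96 * sigma * np.ones(1000)

    coverage = np.mean((y_true >= lower) & (y_true <= upper))   # target: ~0.95 for 95% intervals
    avg_width = np.mean(upper - lower)
    print(f"coverage={coverage:.3f}, width={avg_width:.2f}")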

21 pages, 1331 KiB  
Article
Minimum Message Length in Hybrid ARMA and LSTM Model Forecasting
by Zheng Fang, David L. Dowe, Shelton Peiris and Dedi Rosadi
Entropy 2021, 23(12), 1601; https://doi.org/10.3390/e23121601 - 29 Nov 2021
Cited by 7 | Viewed by 2748
Abstract
Modeling and analysis of time series are important in applications including economics, engineering, environmental science and social science. Selecting the best time series model with accurate parameters in forecasting is a challenging objective for scientists and academic researchers. Hybrid models combining neural networks and traditional Autoregressive Moving Average (ARMA) models are being used to improve the accuracy of modeling and forecasting time series. Most of the existing time series models are selected by information-theoretic approaches, such as AIC, BIC, and HQ. This paper revisits a model selection technique based on Minimum Message Length (MML) and investigates its use in hybrid time series analysis. MML is a Bayesian information-theoretic approach and has been used in selecting the best ARMA model. We utilize the long short-term memory (LSTM) approach to construct a hybrid ARMA-LSTM model and show that MML performs better than AIC, BIC, and HQ in selecting the model—both in the traditional ARMA models (without LSTM) and with hybrid ARMA-LSTM models. These results held on simulated data and both real-world datasets that we considered. We also develop a simple MML ARIMA model.

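For context, the information criteria that MML is compared against (AIC, BIC, and HQ) are exposed directly by statsmodels, so a baseline order search looks like the sketch below. The MML criterion itself involves the full message-length calculation from the paper and is not reproduced here; the AR(2) data are purely illustrative.

    import numpy as np
    from statsmodels.tsa.arima.model import ARIMA

    rng = np.random.default_rng(0)
    y = np.zeros(300)
    for t in range(2, 300):                        # simulate a toy AR(2) process
        y[t] = 0.6 * y[t - 1] - 0.3 * y[t - 2] + rng.normal()

    scores = {}
    for p in range(4):
        for q in range(4):
            res = ARIMA(y, order=(p, 0, q)).fit()
            scores[(p, q)] = (res.aic, res.bic, res.hqic)

    print(min(scores, key=lambda k: scores[k][0])) # (p, q) favoured by AIC
    print(min(scores, key=lambda k: scores[k][1])) # (p, q) favoured by BIC
    print(min(scores, key=lambda k: scores[k][2])) # (p, q) favoured by HQ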

18 pages, 6520 KiB  
Article
History Marginalization Improves Forecasting in Variational Recurrent Neural Networks
by Chen Qiu, Stephan Mandt and Maja Rudolph
Entropy 2021, 23(12), 1563; https://doi.org/10.3390/e23121563 - 24 Nov 2021
Cited by 1 | Viewed by 1832
Abstract
Deep probabilistic time series forecasting models have become an integral part of machine learning. While several powerful generative models have been proposed, we provide evidence that their associated inference models are oftentimes too limited and cause the generative model to predict mode-averaged dynamics. Mode-averaging is problematic since many real-world sequences are highly multi-modal, and their averaged dynamics are unphysical (e.g., predicted taxi trajectories might run through buildings on the street map). To better capture multi-modality, we develop variational dynamic mixtures (VDM): a new variational family to infer sequential latent variables. The VDM approximate posterior at each time step is a mixture density network, whose parameters come from propagating multiple samples through a recurrent architecture. This results in an expressive multi-modal posterior approximation. In an empirical study, we show that VDM outperforms competing approaches on highly multi-modal datasets from different domains.

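A stripped-down PyTorch sketch of the kind of inference step described above: several samples are propagated through a recurrent cell, and each contributes one component of a mixture-of-Gaussians approximate posterior. The class name, dimensions, and the single-logit-per-component head are my simplifications, not the paper's architecture.

    import torch
    import torch.nn as nn
    import torch.distributions as D

    class MixturePosteriorStep(nn.Module):
        """One time step of a mixture approximate posterior built from propagated samples."""
        def __init__(self, z_dim=4, h_dim=16, x_dim=3, n_samples=5):
            super().__init__()
            self.rnn = nn.GRUCell(z_dim + x_dim, h_dim)
            self.head = nn.Linear(h_dim, 2 * z_dim + 1)   # per component: mean, log-std, mixture logit
            self.n_samples, self.z_dim = n_samples, z_dim

        def forward(self, z_prev, h_prev, x_t):
            # z_prev: (S, z_dim), h_prev: (S, h_dim), x_t: (x_dim,)
            S = self.n_samples
            x_rep = x_t.unsqueeze(0).expand(S, -1)
            h = self.rnn(torch.cat([z_prev, x_rep], dim=-1), h_prev)
            out = self.head(h)
            mean, log_std, logit = out[:, :self.z_dim], out[:, self.z_dim:-1], out[:, -1]
            mix = D.Categorical(logits=logit)                     # one weight per propagated sample
            comp = D.Independent(D.Normal(mean, log_std.exp()), 1)
            q_t = D.MixtureSameFamily(mix, comp)                  # multi-modal posterior at step t
            z_t = q_t.sample((S,))                                # propagate S fresh samples onward
            return z_t, h, q_t

    step = MixturePosteriorStep()
    z, h = torch.zeros(5, 4), torch.zeros(5, 16)
    z, h, q_t = step(z, h, torch.randn(3))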

48 pages, 5412 KiB  
Article
Winsorization for Robust Bayesian Neural Networks
by Somya Sharma and Snigdhansu Chatterjee
Entropy 2021, 23(11), 1546; https://doi.org/10.3390/e23111546 - 20 Nov 2021
Cited by 10 | Viewed by 2696
Abstract
With the advent of big data and the popularity of black-box deep learning methods, it is imperative to address the robustness of neural networks to noise and outliers. We propose the use of Winsorization to recover model performances when the data may have outliers and other aberrant observations. We provide a comparative analysis of several probabilistic artificial intelligence and machine learning techniques for supervised learning case studies. Broadly, Winsorization is a versatile technique for accounting for outliers in data. However, different probabilistic machine learning techniques have different levels of efficiency when used on outlier-prone data, with or without Winsorization. We notice that Gaussian processes are extremely vulnerable to outliers, while deep learning techniques in general are more robust.

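Winsorization itself is a one-liner; a minimal SciPy sketch (the injected outliers and the 1% limits are illustrative choices, and the winsorized array can then be fed to any of the models compared in the paper):

    import numpy as np
    from scipy.stats.mstats import winsorize

    rng = np.random.default_rng(0)
    y = rng.normal(size=1000)
    y[:10] += 50.0                                        # inject gross outliers
    y_w = np.asarray(winsorize(y, limits=(0.01, 0.01)))   # clamp the lowest/highest 1% of values
    print(y.max(), y_w.max())                             # extreme values are pulled back to the 99th percentile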

15 pages, 761 KiB  
Article
Conditional Deep Gaussian Processes: Multi-Fidelity Kernel Learning
by Chi-Ken Lu and Patrick Shafto
Entropy 2021, 23(11), 1545; https://doi.org/10.3390/e23111545 - 20 Nov 2021
Cited by 3 | Viewed by 2427
Abstract
Deep Gaussian Processes (DGPs) were proposed as an expressive Bayesian model capable of a mathematically grounded estimation of uncertainty. The expressivity of DGPs results not only from the compositional character but also from the distribution propagation within the hierarchy. Recently, it was pointed out that the hierarchical structure of DGPs is well suited to modeling multi-fidelity regression, in which one is provided with sparse high-precision observations and plenty of low-fidelity observations. We propose the conditional DGP model, in which the latent GPs are directly supported by the fixed lower-fidelity data. The moment-matching method is then applied to approximate the marginal prior of the conditional DGP with a GP. The obtained effective kernels are implicit functions of the lower-fidelity data, manifesting the expressivity contributed by distribution propagation within the hierarchy. The hyperparameters are learned by optimizing the approximate marginal likelihood. Experiments with synthetic and high-dimensional data show comparable performance against other multi-fidelity regression methods, variational inference, and multi-output GPs. We conclude that, with the low-fidelity data and the hierarchical DGP structure, the effective kernel encodes the inductive bias for the true function while allowing compositional freedom.

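A generic Monte Carlo illustration of what moment matching a composed GP prior means (this is not the paper's closed-form effective kernel, and the low-fidelity conditioning is omitted): sample a two-layer GP prior many times and read off the empirical mean and covariance of the outer layer.

    import numpy as np

    def rbf(a, b, ls=1.0):
        d = a[:, None] - b[None, :]
        return np.exp(-0.5 * (d / ls) ** 2)

    rng = np.random.default_rng(0)
    x = np.linspace(-3, 3, 50)
    K1 = rbf(x, x) + 1e-8 * np.eye(50)

    samples = []
    for _ in range(2000):
        f1 = rng.multivariate_normal(np.zeros(50), K1)             # inner-layer draw
        K2 = rbf(f1, f1) + 1e-8 * np.eye(50)                       # kernel evaluated on the warped inputs
        samples.append(rng.multivariate_normal(np.zeros(50), K2))  # outer-layer draw
    samples = np.array(samples)

    matched_mean = samples.mean(axis=0)    # close to zero
    effective_cov = np.cov(samples.T)      # empirical "effective kernel" of the composition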

17 pages, 1079 KiB  
Article
Sampling the Variational Posterior with Local Refinement
by Marton Havasi, Jasper Snoek, Dustin Tran, Jonathan Gordon and José Miguel Hernández-Lobato
Entropy 2021, 23(11), 1475; https://doi.org/10.3390/e23111475 - 08 Nov 2021
Viewed by 1959
Abstract
Variational inference is an optimization-based method for approximating the posterior distribution of the parameters in Bayesian probabilistic models. A key challenge of variational inference is to approximate the posterior with a distribution that is computationally tractable yet sufficiently expressive. We propose a novel method for generating samples from a highly flexible variational approximation. The method starts with a coarse initial approximation and generates samples by refining it in selected, local regions. This allows the samples to capture dependencies and multi-modality in the posterior, even when these are absent from the initial approximation. We demonstrate theoretically that our method always improves the quality of the approximation (as measured by the evidence lower bound). In experiments, our method consistently outperforms recent variational inference methods in terms of log-likelihood and ELBO across three example tasks: the Eight-Schools example (an inference task in a hierarchical model), training a ResNet-20 (Bayesian inference in a large neural network), and the Mushroom task (posterior sampling in a contextual bandit problem).

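A loose caricature of the idea, not the paper's refinement procedure: draw from a coarse mean-field Gaussian and nudge each draw with a few gradient steps on the log-posterior of a toy correlated target, so the refined samples pick up the correlation the coarse approximation misses. All numbers below are arbitrary.

    import torch

    prec = torch.tensor([[1.0, 0.95], [0.95, 1.0]]).inverse()   # toy correlated 2-D posterior
    def log_post(z):
        return -0.5 * (z @ prec * z).sum(-1)

    coarse = torch.distributions.Normal(torch.zeros(2), torch.ones(2))   # coarse mean-field q

    def refined_sample(n_steps=10, lr=0.05):
        z = coarse.sample().requires_grad_(True)
        opt = torch.optim.SGD([z], lr=lr)
        for _ in range(n_steps):
            opt.zero_grad()
            (-log_post(z)).backward()          # move the draw toward nearby high-density regions
            opt.step()
        return z.detach()

    samples = torch.stack([refined_sample() for _ in range(500)])
    print(torch.corrcoef(samples.T)[0, 1])     # refined draws recover the strong positive correlation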

13 pages, 915 KiB  
Article
Ensemble Neuroevolution-Based Approach for Multivariate Time Series Anomaly Detection
by Kamil Faber, Marcin Pietron and Dominik Zurek
Entropy 2021, 23(11), 1466; https://doi.org/10.3390/e23111466 - 06 Nov 2021
Cited by 13 | Viewed by 2583
Abstract
Multivariate time series anomaly detection is a widespread problem in the field of failure prevention. Fast prevention means lower repair costs and losses. The number of sensors in modern industrial systems makes the anomaly detection process quite difficult for humans. Algorithms that automate the process of detecting anomalies are crucial in modern failure prevention systems. Therefore, many machine learning models have been designed to address this problem. Mostly, they are autoencoder-based architectures with some generative adversarial elements. This work presents a framework that incorporates neuroevolution methods to boost the anomaly detection scores of new and already known models. The presented approach adapts evolution strategies to evolve an ensemble model, in which every single model works on a subgroup of data sensors. A further goal of the neuroevolution is to optimize the architecture and hyperparameters, such as the window size, the number of layers, and the layer depths. The proposed framework shows that it is possible to boost most anomaly detection deep learning models in a reasonable time and in a fully automated mode. We ran tests on the SWAT and WADI datasets. To the best of our knowledge, this is the first approach in which an ensemble deep learning anomaly detection model is built in a fully automatic way using a neuroevolution strategy.

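A bare-bones evolution-strategy loop over the kinds of hyperparameters mentioned above (window size, number of layers, layer width). The fitness function is a stub standing in for training an anomaly-detection model and scoring it on validation data, and the search space is invented for illustration.

    import random

    SEARCH_SPACE = {"window": [16, 32, 64, 128], "layers": [1, 2, 3, 4], "width": [32, 64, 128]}

    def fitness(cfg):              # placeholder: replace with train-and-evaluate on validation data
        return -abs(cfg["window"] - 64) - abs(cfg["layers"] - 2) - abs(cfg["width"] - 64) / 32

    def mutate(cfg):
        key = random.choice(list(SEARCH_SPACE))
        return {**cfg, key: random.choice(SEARCH_SPACE[key])}

    random.seed(0)
    population = [{k: random.choice(v) for k, v in SEARCH_SPACE.items()} for _ in range(8)]
    for generation in range(10):
        population.sort(key=fitness, reverse=True)
        parents = population[:4]                                   # keep the fittest configurations
        population = parents + [mutate(random.choice(parents)) for _ in range(4)]

    print(population[0])           # best configuration found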

14 pages, 715 KiB  
Article
Conditional Deep Gaussian Processes: Empirical Bayes Hyperdata Learning
by Chi-Ken Lu and Patrick Shafto
Entropy 2021, 23(11), 1387; https://doi.org/10.3390/e23111387 - 23 Oct 2021
Cited by 1 | Viewed by 1914
Abstract
It is desirable to combine the expressive power of deep learning with the Gaussian process (GP) in one expressive Bayesian learning model. Deep kernel learning showed success by using a deep network for feature extraction and a GP as the function model. Recently, it was suggested that, despite training with the marginal likelihood, the deterministic nature of the feature extractor might lead to overfitting, and replacing it with a Bayesian network seemed to cure the problem. Here, we propose the conditional deep Gaussian process (DGP), in which the intermediate GPs in the hierarchical composition are supported by the hyperdata and the exposed GP remains zero-mean. Motivated by the inducing points in sparse GPs, the hyperdata also play the role of function supports, but are hyperparameters rather than random variables. Following our previous moment-matching approach, we approximate the marginal prior of the conditional DGP with a GP carrying an effective kernel. Thus, as in empirical Bayes, the hyperdata are learned by optimizing the approximate marginal likelihood, which implicitly depends on the hyperdata via the kernel. We show the equivalence with deep kernel learning in the limit of dense hyperdata in latent space. However, the conditional DGP and the corresponding approximate inference enjoy the benefit of being more Bayesian than deep kernel learning. Preliminary extrapolation results demonstrate the expressive power gained from the depth of the hierarchy by exploiting the exact covariance and hyperdata learning, in comparison with GP kernel composition, DGP variational inference, and deep kernel learning. We also address the non-Gaussian aspect of our model as well as a way of upgrading to full Bayesian inference.

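The empirical-Bayes ingredient, optimizing an (approximate) marginal likelihood with respect to quantities treated as hyperparameters, can be sketched for an ordinary GP in a few lines of PyTorch. The conditional-DGP effective kernel and the hyperdata themselves are not reproduced here; only the style of optimization is.

    import torch

    torch.manual_seed(0)
    x = torch.linspace(-2, 2, 40).unsqueeze(-1)
    y = torch.sin(3 * x).squeeze(-1) + 0.1 * torch.randn(40)

    log_ls = torch.zeros((), requires_grad=True)          # log length-scale
    log_noise = torch.tensor(-2.0, requires_grad=True)    # log observation-noise variance

    def rbf(a, b, log_ls):
        return torch.exp(-0.5 * ((a - b.T) / log_ls.exp()) ** 2)

    opt = torch.optim.Adam([log_ls, log_noise], lr=0.05)
    for step in range(200):
        K = rbf(x, x, log_ls) + (log_noise.exp() + 1e-5) * torch.eye(40)   # small jitter for stability
        mvn = torch.distributions.MultivariateNormal(torch.zeros(40), covariance_matrix=K)
        loss = -mvn.log_prob(y)                           # negative log marginal likelihood
        opt.zero_grad()
        loss.backward()
        opt.step()

    print(log_ls.exp().item(), log_noise.exp().item())    # hyperparameters chosen by empirical Bayes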

17 pages, 4401 KiB  
Article
Self-Supervised Variational Auto-Encoders
by Ioannis Gatopoulos and Jakub M. Tomczak
Entropy 2021, 23(6), 747; https://doi.org/10.3390/e23060747 - 14 Jun 2021
Cited by 7 | Viewed by 3128
Abstract
Density estimation, compression, and data generation are crucial tasks in artificial intelligence. Variational Auto-Encoders (VAEs) constitute a single framework to achieve these goals. Here, we present a novel class of generative models, called self-supervised Variational Auto-Encoder (selfVAE), which utilizes deterministic and discrete transformations of data. This class of models allows both conditional and unconditional sampling while simplifying the objective function. First, we use a single self-supervised transformation as a latent variable, where the transformation is either downscaling or edge detection. Next, we consider a hierarchical architecture, i.e., multiple transformations, and we show its benefits compared to the VAE. The flexibility of selfVAE in data reconstruction finds a particularly interesting use case in data compression tasks, where we can trade off memory for better data quality and vice versa. We present the performance of our approach on three benchmark image datasets (CIFAR-10, Imagenette64, and CelebA).

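A toy PyTorch sketch of the general recipe: a deterministic transformation (downscaling here) is used as an additional conditioning variable alongside a stochastic code. The architecture, sizes, and single-level setup are my simplifications rather than the selfVAE model from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinySelfVAE(nn.Module):
        def __init__(self, z_dim=8):
            super().__init__()
            self.enc = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128), nn.ReLU(),
                                     nn.Linear(128, 2 * z_dim))
            self.dec = nn.Sequential(nn.Linear(14 * 14 + z_dim, 128), nn.ReLU(),
                                     nn.Linear(128, 28 * 28))

        def forward(self, x):                        # x: (B, 1, 28, 28)
            y = F.avg_pool2d(x, 2)                   # deterministic transformation (downscaling)
            mu, log_var = self.enc(x).chunk(2, dim=-1)
            z = mu + torch.randn_like(mu) * (0.5 * log_var).exp()   # reparameterized sample
            x_hat = self.dec(torch.cat([y.flatten(1), z], dim=-1))  # reconstruct x from (y, z)
            recon = F.binary_cross_entropy_with_logits(x_hat, x.flatten(1), reduction="sum")
            kl = -0.5 * (1 + log_var - mu.pow(2) - log_var.exp()).sum()
            return (recon + kl) / x.size(0)          # negative ELBO per example (y treated as given)

    loss = TinySelfVAE()(torch.rand(16, 1, 28, 28))
    loss.backward()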
