Review

Advanced Machine Learning Methods for Learning from Sparse Data in High-Dimensional Spaces: A Perspective on Uses in the Upstream of Development of Novel Energy Technologies

School of Materials and Chemical Technology, Tokyo Institute of Technology, Ookayama 2-12-1, Meguro-ku, Tokyo 152-8552, Japan
*
Author to whom correspondence should be addressed.
Physchem 2022, 2(2), 72-95; https://doi.org/10.3390/physchem2020006
Submission received: 28 February 2022 / Revised: 24 March 2022 / Accepted: 24 March 2022 / Published: 29 March 2022
(This article belongs to the Special Issue Data-Driven Research in Physical Chemistry)

Abstract

Machine learning (ML) has found increasing use in physical sciences, including research on energy conversion and storage technologies, in particular, so-called sustainable technologies. While often ML is used to directly optimize the parameters or phenomena of interest in the space of features, in this perspective, we focus on using ML to construct objects and methods that help in or enable the modeling of the underlying phenomena. We highlight the need for machine learning from very sparse and unevenly distributed numeric data in multidimensional spaces in these applications. After a brief introduction of some common regression-type machine learning techniques, we focus on more advanced ML techniques which use these known methods as building blocks of more complex schemes and thereby allow working with extremely sparse data and also allow generating insight. Specifically, we will highlight the utility of using representations with subdimensional functions by combining the high-dimensional model representation ansatz with machine learning methods such as neural networks or Gaussian process regressions in applications ranging from heterogeneous catalysis to nuclear energy.

1. Introduction

Machine learning (ML) as well as more complex techniques of artificial intelligence (AI) have been finding increasing use in research and development for novel technologies [1,2,3,4,5,6,7]. Whenever one can identify an input–output mapping whose construction is helpful in achieving a research or engineering goal, ML techniques are often of assistance. One can cast a research problem as a problem of constructing such a mapping and then searching in the space of the descriptors (features or inputs) [8,9] for a desirable optimal point. One area where ML is gaining more and more traction is novel energy conversion and storage technologies. These techniques are, in particular, intensely explored for application to the development of technologies typically associated with sustainable generation and use of energy such as advanced types (organic and inorganic materials-based) of solar cells and LEDs (light-emitting diodes) [10,11,12,13,14,15,16,17,18,19,20,21,22], inorganic and organic metal ion batteries [23,24], fuel cells, and generally heterogeneous catalysis including electro- and photocatalysis [25,26,27,28,29,30,31,32,33,34]. This is natural in the sense that the development of these technologies often passes through optimization and balancing of multiple factors acting simultaneously and to opposite ends; for example, in the case of organic solar cells, there is an optimum to be sought between the donor’s bandgap, the band offset between the donor and the acceptor, the reorganization energies of both the donor and the acceptor, and the charge transfer integral, etc. [12,15,16]. These properties themselves are a function of the structure of the molecules, which can be encoded through multiple descriptors [9]. In principle, most of these properties can be computed and/or measured experimentally, but doing so for each candidate molecule is expensive. It is enticing, based on a limited number of measurements and/or calculations, to build, with the help of machine learning, a mapping between the chosen descriptors and the properties of interest, and then search the map for an optimal point (such as maximum power conversion efficiency in the case of a solar cell). Similarly, optima need to be sought in the design of battery materials (balancing thermodynamics and kinetics of cation insertion, stability, etc.), catalysts (balancing adsorption energies of key species, kinetics, and stability), and other types of functional materials. Some examples will be given in Section 2.
Functional materials and interfaces are key to the design of novel energy technologies. As the compositional space is too vast to be explored by brute-force and trial-and-error approaches, rational guidance of design is necessary. Such rational guidance comes, in particular, from simulations at various scales (from atomistic and electronic structure levels to the macroscopic level). Specifically, for technologies such as fuel cells, batteries, or solar cells, electronic properties are key to the functionality, and it is atomistic-scale, and in particular ab initio, simulations that provide insight into the mechanism at the material level and can thereby guide rational design. Bandgaps and bandstructures, absorption spectra, molecular adsorption energies, etc., are today routinely computable at the DFT (density functional theory) [35,36] level. Further, structural information is obtainable with techniques such as molecular dynamics [37], Monte Carlo [38], and others which can operate at larger scales than DFT. Such calculations, however, remain costly, in particular DFT, with which routine calculations are still typically limited to about 10³ atoms. It is therefore enticing to deploy ML methods to replace such calculations or at least to reduce the required number of such calculations. Of course, ML can boost the development not only of the so-called sustainable or “green” energy technologies; we will provide an example in this perspective where it can be used to advance research useful for the future of nuclear energy.
The purpose of this perspective is not a comprehensive or even brief review of ML uses in materials science and energy technologies, nor is it to review staple ML methods; there is already ample literature on these topics [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34]. Novel energy technologies are very knowledge-deep, i.e., they require multiple steps in understanding and development; specifically, they require insight at the level of the electronic structure and interatomic interactions. Many aspects are being addressed with machine learning, not just predicting material or device properties from descriptors: much is being done at the back end or upstream by using machine learning in the workflow of understanding and modeling of materials and phenomena for these technologies, or to improve modeling capabilities [1,39,40,41,42,43,44]. This in particular requires ML methods and tools capable of working with extremely sparse and unevenly distributed data in multidimensional spaces, beyond the capabilities of standard methods. That is the focus of this perspective.
We will first show illustrative examples of how machine learning can be used to help improve the way we produce and use energy, especially for the benefit of so-called sustainable technologies. We will highlight the fact that in these applications, one usually has to deal with multiple variables, i.e., to operate in multidimensional spaces, and we will see that there are properties of multidimensional spaces with which one typically does not have to deal in one-, two-, or three-dimensional problems. We will show why ML techniques are effective when working with such data. While all major types of machine learning (classification, regression, clustering) are finding use in the above applications, here we focus specifically on regression-type problems, which are highly relevant, as many types of descriptors (atomic coordinates and other structural descriptors, electron densities, ionization potentials, electron affinities, atomic charges, bandgaps, and redox potentials, among others) are real-valued or can be treated as such (e.g., atomic numbers/nuclei charges). We also focus on supervised machine learning, which is also highly relevant, as in many applications the target values for training points in the space of descriptors are known (for example, potential energy, kinetic energy density, bandgap, etc.). We will briefly introduce neural networks (NN) and Gaussian process regressions (GPR) and their pros and cons. We will then focus on using the so-called high-dimensional model representations to structure the functional representation, in conjunction with NN or GPR used to build components of those representations. We will see that this is especially useful when trying to recover trends or functional forms from sparse data and to gain insight, using examples from catalysis and a prospective nuclear technology, as well as an example of how ML can revolutionize ab initio modeling capabilities, which are critical for the modeling and mechanistic understanding of materials, including those used in novel energy technologies.

2. High Dimensionality and Extremely Low Data Density in the Space of Descriptors

2.1. Examples of Input–Output Mappings Used in ML for Energy Technologies

Humanity is trying to develop more sustainable ways to produce and use energy. Currently, the world still largely relies on fossil fuels, which are believed to be the result of absorption of solar energy [45]. The use of fossil fuels generates many pollutants, including poisonous gases and heavy metals, and it also generates large amounts of CO₂. CO₂ can be recycled through vegetation, but unfortunately the timeframe from dead vegetation to fossil fuels is too long to rely on such a natural carbon cycle with current and projected levels of energy consumption. There is a quest for what can be called an anthropogenic chemical carbon cycle [46]. In it, one does not rely on burning of fossil fuels but on a set of technologies which either generate electricity directly from the sun with solar cells [47] and from other sustainable sources such as wind [48], or synthesize, using sunlight, CO₂, and other inputs, fuels which can be utilized in a clean way, such as hydrogen or liquids which can be used with cleaner exhaust in fuel cells or even burnt directly [49,50,51].
Solar and wind technologies are of course intermittent, and to match their output schedule to the demand schedule, storage is necessary [52], in particular with batteries ([53] and references therein). Batteries are also needed for road transport electrification. At the same time, novel and safer types of nuclear reactors are being developed for a reliable, low-carbon-footprint baseload [54,55,56]. All of these technologies together form an energy mix. Multiple technologies are used, and in each of them there are multiple scientific, engineering, and technological problems that need to be addressed. Bringing these technologies together to work in one system is also a challenge. Machine learning can help resolve many of these problems, and we will now show some examples.
The reader will have heard about the “hydrogen economy” and about fuel cells, which allow “burning” either hydrogen or other fuels in a clean way. At the core of these capabilities is heterogeneous catalysis, i.e., reactions happening at and catalyzed by surfaces. Some of the key reactions relevant for energy production and use are driven by solid catalysts, including the oxygen reduction reaction [57], which is a critical and still problematic step in fuel cells, and the Fischer–Tropsch process [58], which allows producing synthetic fuels from a mixture of hydrogen and CO (so-called syngas); the water–gas shift [59] and steam and dry reforming [60,61] are also very important reactions. These reactions permit transformation between hydrocarbons, water, hydrogen, and carbon oxides. Their efficiency depends on the quality of the catalyst. Many efficient catalysts are rare and/or expensive, such as platinum, which is widely used in commercial fuel cells, e.g., those in Toyota Mirai passenger cars, in particular due to the lack of inexpensive alternatives.
Design of better catalysts is therefore a major bottleneck on the way to wide deployment of more sustainable energy technologies. Recently, machine learning has found increasing use in rational design of catalysts [25,26,27,28,29,30,31,32,33,34]. Typically, one considers a set of descriptors of catalytic activity which include reaction thermodynamics and kinetics, adsorption energies of key reactants and intermediates, as well as structure and electronic structure descriptors of the catalyst [26,33]; some are shown in Figure 1. To sieve through candidate materials, one encodes them into a set of compositional and structural variables such as atomic numbers and positions, crystal structure and surface cut, and adsorption site. Structures and many of the properties, such as adsorption and reaction energies and kinetic barriers as well as electronic structure descriptors, are routinely computed, typically with density functional theory [36,62]. Computations play a key role in rational design of catalysts with or without machine learning, as many properties are either not directly accessible experimentally or cannot be measured with high throughput and modest cost. Altogether, this means that catalytic activity depends on many variables, and one needs to perform searches and uncover relations in multidimensional spaces, which can be done with the help of machine learning [25,26,27,28,29].
To model directly a catalyzed reaction from reactants to products, one would need an interatomic potential energy function, and those potentials also pose a bottleneck in the modeling of heterogeneous catalysis and are more and more often constructed with machine learning methods [1,40,63,64]. There is a key difference between the traditional fossil fuel-based energy technologies and the newer technologies based on solar cells, fuel cells, batteries, etc.: the fossil fuel-based technologies can be understood based on classical thermodynamics and mechanics, while these newer technologies critically depend on quantum phenomena, ultimately on electronic states’ energies, occupancies, and localizations. Quantum mechanics-based modeling therefore takes center stage. Such modeling is difficult and costly, and machine learning is also used to facilitate it; we will show a couple of examples thereof below. This is most obvious in solar cells, where details of the electronic structure such as the bandgap, the effective mass, etc., directly determine solar cell performance. Machine learning is now more and more often used to help design better solar cells and specifically materials for solar cells and other optoelectronic applications [12,13,14,15,65,66]. It is used to predict better active materials, to optimize device performance, or even to optimize fabrication processes [15]. Figure 2 shows the main uses of ML for solar cell design as well as the most widely used ML methods, as summarized in a recent review [15]. Artificial neural networks (ANN) remain the most widely used method; also widely used are genetic algorithms (GA), random forest (RF), particle swarm optimization (PSO), simulated annealing (SA), support vector machines (SVM), kernel ridge regression (KRR) and Gaussian process regression (GPR), extremely randomized trees (ERT), k-nearest neighbors, clustering methods, and principal component analysis. These are just some of the by now very many machine learning methods. That is, all three major classes of ML (classification, regression, and clustering) are finding use in solar cell research as well as in catalysis research [26].
These examples from the fields of catalysis and solar cells are merely illustrative. Machine learning (and the same methods as mentioned above) is also used to help design battery materials [23,24,67,68] and for other energy technologies as indicated above, but ML is in demand not just at the material or device level but also at the system level [69,70,71,72]. In an energy mix containing solar farms, wind farms, and other intermittent technologies alongside more established generation methods such as nuclear or natural gas-powered stations, one needs to constantly balance supply and demand. Predicting demand, predicting possible supply from solar and wind, and choosing how to palliate any excess or shortfall with either battery storage or a call on nuclear or hydrocarbon-based generation is an important problem with much potential for machine learning which is yet to be fully realized [73,74]. It also requires working with multidimensional datasets.
To illustrate specifically the challenge posed by the dimensionality of the feature space in ML for materials for novel energy technologies, consider one recent example from the literature [10] of how ML is applied to design better materials for perovskite solar cells [75], in which the main active material and light absorber is a (typically organic–inorganic hybrid) perovskite material. The composition of the perovskite can be changed by different choices of atoms placed in different sites of the crystal lattice. The perovskites most widely used in labs today contain lead, which is not desirable. Moreover, different structural choices also in principle allow one to modulate the bandgap and other properties that could better match with specific electron and hole transporting materials [76], which are in contact with the perovskite in a solar cell. In the example of [10], candidate compositions were encoded with 32 features which included properties of constituent atoms such as ionization potential, orbital radii, etc., as well as properties of the crystal. We will see below that this dimensionality, which from the point of view of applications does not appear to be high (indeed, hundreds to thousands of features are sometimes available), is in fact very high from the point of view of data density.
We point out that in all of the abovementioned applications of machine learning, a multitude of methods are used: classification methods, regression methods, clustering methods; supervised and unsupervised methods. In what follows, we will narrowly focus on specific regression type methods based on neural networks and Gaussian process regressions, and on composite methods using them as building blocks to enable ML in high dimensional spaces from extremely sparse data.

2.2. New Technologies and Challenges Require New Simulation Methods: A Large Scope for Machine Learning

Machine learning is also useful to improve simulation methods. The example of machine learning interatomic potentials was cited above; ML is also used to improve the widely used DFT method by learning better exchange-correlation functionals, dispersion corrections, corrections to computed excitation energies, etc. ([39] and references therein). We will draw attention here to only one, but critically important, example that allows showcasing the power of ML. We start with a seemingly purely mathematical problem: consider a function of space ρ(x), where x is a vector of Cartesian coordinates x, y, z. The function is positive definite and integrates to N: $\int \rho(\mathbf{x})\,d\mathbf{x} = N$. We represent ρ as a sum of N positive definite pieces, each integrating to 1:
$$\rho(\mathbf{x}) = \sum_{i=1}^{N} \rho_i(\mathbf{x}) = \sum_{i=1}^{N} \left|\phi_i(\mathbf{x})\right|^2, \qquad \int \rho_i(\mathbf{x})\,d\mathbf{x} = 1 \tag{1}$$
Because they are positive definite, they are squares of some functions $\phi_i$. It is known that this decomposition can be done so that the $\phi_i$ are orthonormal: $\int \phi_i(\mathbf{x})\phi_j(\mathbf{x})\,d\mathbf{x} = \delta_{ij}$.
We are interested in this quantity T or the corresponding integrand τ(x), which can take two alternative forms:
$$T = -\frac{1}{2}\sum_{i=1}^{N}\int \phi_i\,\Delta\phi_i\,d\mathbf{x} = \int \tau_{\mathrm{KS}}(\mathbf{x})\,d\mathbf{x}, \qquad T = \frac{1}{2}\sum_{i=1}^{N}\int \left|\nabla\phi_i\right|^2 d\mathbf{x} = \int \tau_{+}(\mathbf{x})\,d\mathbf{x} \tag{2}$$
We want to express T, without explicit reference to the $\phi_i$, as a function of ρ-dependent quantities only. It could involve any derivatives of ρ or any powers of ρ, but only ρ, without explicit reference to the $\phi_i$.
The reason why this example is important is that a solution to this problem opens the way to routine, linear-scaling, large-scale ab initio modeling of materials with the so-called orbital-free DFT (OF-DFT) [77]. We mentioned above that quantum mechanics-based modeling is critical for the understanding and design of novel energy technologies, yet the commonly used Kohn–Sham DFT is unwieldy for systems with more than about 10³ atoms. Large-scale modeling means more realistic modeling and is required to properly account for a range of phenomena which are intrinsically large-scale (e.g., microstructure-driven properties). In OF-DFT, ρ takes the meaning of the electron density, N is the number of electrons, the $\phi_i$ are single-electron (Kohn–Sham) orbitals [36], T is the kinetic energy, and τ(x) is the kinetic energy density (KED) (we neglect spin and partial orbital occupancy without loss of generality). The mapping $T = T[\rho(\mathbf{x})]$ is the kinetic energy functional (KEF). Approximate expressions for the KEF exist, but they are not accurate enough for use in most applications where ab initio modeling is needed, including (organic and inorganic) semiconductors and transition metal-containing functional materials of novel energy technologies. In the past several years, substantial progress has been made on this problem with the help of machine learning using, in particular, techniques such as neural networks and kernel methods [41,78,79,80,81,82]. We will return to this example later in the context of deep learning. Some uses of ML to improve quantum mechanics-based modeling methods are reviewed in [39].

2.3. The Curse of Dimensionality and Why ML Techniques Are Effective

We mentioned in an example above 32 features used to describe perovskite materials [10]. Is 32 dimensions high or low? It is (very) high. In multidimensional spaces, one is hit by the so-called curse of dimensionality. Imagine we sample a simple univariate function with M points sufficient to recover the function to a desired accuracy with a common method (e.g., splines). If one wants to maintain the same density of sampling in D dimensions, the number of sampling points grows exponentially, as $M^D$. With M as small as 10 and D as small as 10, one would need 10¹⁰ data. This is the curse of dimensionality. Moreover, when one constructs a function with polynomials or Fourier expansions, not only the number of required samples but also the number of terms in the representation grows exponentially. As a result, it is impossible to achieve a good density of sampling simply by adding more data. For example, already in 20 dimensions, a million data points correspond to only about 2 data per degree of freedom (of an equivalent direct-product grid). If we somehow managed to obtain 10 times more data, the density of sampling would only increase to about 2.2 data per degree of freedom. Practically, one therefore always works with extremely sparse data. That is why 32 dimensions in the example above is in fact very high. In practice, one starts feeling this curse of dimensionality from about 6 dimensions and up.
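To make this arithmetic concrete, the short Python snippet below (a minimal sketch, not tied to any specific dataset) computes the effective sampling density of a given number of data points in D dimensions and the size of an equivalent direct-product grid:

```python
# Minimal illustration of the curse of dimensionality: the effective sampling
# density of M data points in D dimensions, measured as points per degree of
# freedom of an equivalent direct-product grid, is M**(1/D).
D = 20
for M in (1_000_000, 10_000_000):
    print(f"{M:>10,d} points in {D}D ~ {M ** (1.0 / D):.2f} points per dimension")

# A direct-product grid with only 10 points per dimension in 10 dimensions
# already requires 10**10 samples.
print(f"grid size for M = 10, D = 10: {10 ** 10:,d}")
```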
However, this debilitating scaling strictly speaking only holds for direct product representations. One reason why various machine learning methods work well in multidimensional spaces is because they avoid direct product representations and thereby alleviate the problem of the curse of dimensionality. We will illustrate this point in Section 3.1.
In multidimensional spaces, some intuitions break down. For example, what is local in low dimensions may not be in high dimensions. An example is the Gaussian function, $g(\mathbf{x}) = \prod_{i=1}^{D}\left(2\pi\sigma_i^2\right)^{-1/2}\exp\left(-\frac{(x_i-\bar{x}_i)^2}{2\sigma_i^2}\right)$, which is localized around the mean $\bar{\mathbf{x}}$ in the sense that about 70% of the quadrature of this function is within one standard deviation σ of the mean in the one-dimensional case (D = 1). In six dimensions (D = 6), for example, only about 10% of the quadrature comes from within one standard deviation (in each respective dimension). That is, by this measure a multidimensional Gaussian is no longer a localized function. Working in high dimensionality also has its advantages. For example, there is the concentration of measure whereby, as the dimension increases, the width of the distribution of distances between data points collapses: $\lim_{D\to\infty} E\left[\frac{dist_{max}(D)-dist_{min}(D)}{dist_{min}(D)}\right] \to 0$, where $dist_{max}$ and $dist_{min}$ are the maximum and minimum distances between the data points. We have observed in our own experience that, for example, neural networks perform better in high-dimensional spaces than in 1, 2, or 3 dimensions. This ultimately relates to the growing advantage of non-direct-product representations in high dimensions.
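Both statements are easy to verify numerically. The sketch below (which assumes standard normal data and Euclidean distances purely for illustration) checks the Gaussian mass within one standard deviation per dimension and the relative spread of pairwise distances as the dimension grows:

```python
# Numerical check of the two statements above (illustrative assumptions:
# standard normal data, Euclidean distances).
import numpy as np
from scipy.spatial.distance import pdist
from scipy.special import erf

p1 = erf(1.0 / np.sqrt(2.0))        # 1D Gaussian mass within one sigma, ~0.683
for D in (1, 6):
    print(f"D = {D}: mass within 1 sigma in every dimension = {p1 ** D:.2f}")

rng = np.random.default_rng(0)
for D in (2, 100):
    d = pdist(rng.standard_normal((500, D)))     # pairwise distances
    print(f"D = {D:3d}: (dist_max - dist_min) / dist_min = "
          f"{(d.max() - d.min()) / d.min():.2f}")
```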
One important consequence of the low density of sampling is that the intrinsic dimensionality of the dataset is less than the dimensionality of the space [83,84]. This is obvious when the data are confined to lower-dimensional hypersurfaces or other shapes, but even if the data are not confined to such a sub-dimensional shape, simply by virtue of the low density, the intrinsic dimensionality of the dataset is lowered. This ultimately justifies representations with lower-dimensional functions, which will be described below.

3. Advanced Techniques for Working with Sparse Data

3.1. Brief Introduction to Neural Networks and Gaussian Process Regression

3.1.1. Neural Networks (NN)

Artificial neural networks [85] are an example of a representation of a multivariate function with univariate functions, which are adapted to the problem by the choice of parameters. They are often presented with the help of an analogy with a biological neural network, but for regression problems we find the following interpretation more useful. We expand a function $f(\mathbf{x})$, $\mathbf{x} \in \mathbb{R}^D$, in a basis of univariate functions $\sigma_n$ with coefficients $c_n$:
$$f_k(\mathbf{x}) = \sum_{n=0}^{N} c_{nk}\,\sigma_n\!\left(\mathbf{w}_n\mathbf{x} + b_n\right) \tag{3}$$
Here we added a subscript k to indicate that several outputs can be computed from the same basis. The basis functions depend on all components of $\mathbf{x}$, but their arguments are scalars that depend on $\mathbf{x}$ as well as on the weights $\mathbf{w}$ and biases $b$. This parameterization by $\mathbf{w}$ and $b$ makes a flexible, non-direct-product basis. This representation goes back to the Kolmogorov theorem of 1957 [86], and since then, in a series of papers, restrictions on σ have been relaxed [87,88,89,90,91,92,93,94,95]. This expression is a universal approximator even when σ is the same for all n, as long as σ is smooth and nonlinear [96]. In applications, σ is typically the same for all n and is typically a sigmoid function such as $\sigma(x) = \left(e^{x}-e^{-x}\right)/\left(e^{x}+e^{-x}\right)$, but it does not have to be [96]. The parameters $\mathbf{w}$ and $b$ and the coefficients $c$ are fitted to reproduce a set of known samples of f, $f_j = f(\mathbf{x}_j)$, j = 1, …, M. The fit is nonlinear because σ is nonlinear.
Equation (3) (which is often written as $f(\mathbf{x}) = \sigma\left(\sum_{n=0}^{N} c_n\,\sigma(\mathbf{w}_n\mathbf{x}+b_n)\right)$, i.e., with a so-called output neuron, which, however, can be subsumed in the definition of f without loss of generality) expresses a so-called single-hidden-layer neural network. It is graphically represented in Figure 3, where the arrows to each σ, which are called neurons, reflect the formation of a single scalar input. These neurons form a hidden layer. The outputs are collected in a sum in one or more outputs in the output layer; i.e., one can fit a multi-sheet function, or a function and its derivatives, with the same NN. These are called output neurons.
One can collect the outputs of this last layer of neurons and feed them to yet another layer of neurons and so on, obtaining a multilayer or deep neural network:
$$f_l = \sigma_{out}\!\left(\sum_{k_n=0}^{N_n} w^{(n)}_{l,k_n}\,\sigma_{n,k_n}\!\left(\sum_{k_{n-1}=0}^{N_{n-1}} w^{(n-1)}_{k_n,k_{n-1}}\,\sigma_{n-1,k_{n-1}}\!\left(\cdots \sum_{k_1=0}^{N_1} w^{(2)}_{k_2,k_1}\,\sigma_{1,k_1}\!\left(\sum_{i=0}^{d} w^{(1)}_{k_1,i}\,x_i\right)\cdots\right)\right)\right) \tag{4}$$
We stress that a single-hidden-layer network is already a universal approximator. In many applications, one layer is sufficient, and one should make sure that one actually needs a deep NN, as it comes at the price of a large number of nonlinear parameters. We will show examples later of cases where one actually needs a deep NN.
The fact that σ can be any smooth function opens new possibilities. We showed, for example, that with exponential neurons one easily obtains a sum-of-products representation with a relatively small number of terms [97]: $f(\mathbf{x}) = \sum_{i=0}^{M} w^{(2)}_{i}\prod_{k=0}^{d} e^{w^{(1)}_{ik}x_k}$. Sum-of-products representations are very useful because they greatly simplify integration of the function (multidimensional integrals over the function f can be computed as sums of products of one-dimensional integrals), which is a major advantage when the dimensionality is high. In fact, sum-of-products representations are required in certain computational methods, for example, in some powerful quantum dynamics methods [98]. A more complicated way to obtain a sum of products is with multiplicative neural networks which use error-function-type neurons [99]: $f(\mathbf{x}) = \mu^{(2)}_{1} + \sum_{k_1=1}^{n_1} w^{(2)}_{1,k_1}\prod_{k_0=1}^{J}\mathrm{erf}\!\left(\mu^{(1)}_{k_1 k_0} + w^{(1)}_{k_1 k_0}x_{k_0}\right)$. These are examples of departures from orthodox network designs which bring significant advantages. We will show below how even more drastic changes in the architecture can realize significant advantages.
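As an illustration of the convenience of exponential neurons, the following sketch (with randomly chosen, purely illustrative weights) verifies that the integral of such a sum-of-products network over a unit cube equals the corresponding sum of products of one-dimensional integrals:

```python
# A sum-of-products network f(x) = sum_i w2[i] * prod_k exp(w1[i,k] * x[k]):
# its multidimensional integral factorizes into one-dimensional integrals.
import numpy as np

rng = np.random.default_rng(1)
d, n_neurons = 3, 5
w1 = rng.normal(scale=0.5, size=(n_neurons, d))   # hidden-layer weights (illustrative)
w2 = rng.normal(size=n_neurons)                   # output weights (illustrative)

def int1d(w):
    """Integral of exp(w*x) over [0, 1]."""
    return (np.exp(w) - 1.0) / w

# sum of products of one-dimensional integrals over the unit cube [0, 1]^d
analytic = sum(w2[i] * np.prod([int1d(w1[i, k]) for k in range(d)])
               for i in range(n_neurons))

# brute-force check: the average of f over a dense grid approximates the integral
grid = np.linspace(0.0, 1.0, 41)
X = np.stack(np.meshgrid(*([grid] * d), indexing="ij"), axis=-1).reshape(-1, d)
numeric = (np.exp(X @ w1.T) * w2).sum(axis=1).mean()
print(analytic, numeric)   # the two values should be close
```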

3.1.2. Gaussian Process Regression (GPR)

Another method we briefly highlight is Gaussian process regression (GPR) [100]. GPR answers the question “given the set of samples $f_j$ of a function f(x) at certain points in space $\mathbf{x}_j$, what are the expectation values f(x) and their variances Δf(x) for function values at other points in space x?” One assumes that the correlation between data can be described with a kernel, a chosen type of covariance function $k(\mathbf{x}_1,\mathbf{x}_2)$. The answer is given by
$$f(\mathbf{x}) = \mathbf{K}_{*}\mathbf{K}^{-1}\mathbf{f}, \qquad \Delta f(\mathbf{x}) = K_{**} - \mathbf{K}_{*}\mathbf{K}^{-1}\mathbf{K}_{*}^{T} \tag{5}$$
where $\mathbf{f}$ is a vector of all $f_j$ values, and the matrices $\mathbf{K}$ and $\mathbf{K}_{*}$ are computed from pairwise covariances among the data:
$$\mathbf{K}=\begin{pmatrix} k(\mathbf{x}_1,\mathbf{x}_1)+\delta & k(\mathbf{x}_1,\mathbf{x}_2) & \cdots & k(\mathbf{x}_1,\mathbf{x}_M)\\ k(\mathbf{x}_2,\mathbf{x}_1) & k(\mathbf{x}_2,\mathbf{x}_2)+\delta & \cdots & k(\mathbf{x}_2,\mathbf{x}_M)\\ \vdots & \vdots & \ddots & \vdots\\ k(\mathbf{x}_M,\mathbf{x}_1) & k(\mathbf{x}_M,\mathbf{x}_2) & \cdots & k(\mathbf{x}_M,\mathbf{x}_M)+\delta \end{pmatrix},\qquad \mathbf{K}_{*}=\left(k(\mathbf{x},\mathbf{x}_1)\;\; k(\mathbf{x},\mathbf{x}_2)\;\cdots\; k(\mathbf{x},\mathbf{x}_M)\right),\qquad K_{**}=k(\mathbf{x},\mathbf{x}) \tag{6}$$
The optional δ on the diagonal has the meaning of the magnitude of Gaussian noise, and it helps generalization. Note that Equation (5) as written (and as it commonly appears in the literature) holds for data (the set of $f_j$) normalized to unit variance; otherwise, the left-hand side should be multiplied by the variance of f.
The covariance function is usually chosen as one of the Matérn family of functions [101] given by
$$k(\mathbf{x},\mathbf{x}') = \sigma^{2}\,\frac{2^{1-\nu}}{\Gamma(\nu)}\left(\sqrt{2\nu}\,\frac{\left|\mathbf{x}-\mathbf{x}'\right|}{l}\right)^{\nu}K_{\nu}\!\left(\sqrt{2\nu}\,\frac{\left|\mathbf{x}-\mathbf{x}'\right|}{l}\right) \tag{7}$$
where Γ is the gamma function and $K_\nu$ is the modified Bessel function of the second kind. At different values of ν, this function becomes a Gaussian ($\nu \to \infty$), a simple exponential ($\nu = 1/2$), and various other widely used functions (such as Matérn 3/2 and Matérn 5/2 for ν = 3/2 and 5/2, respectively). The parameters of the covariance function are the only parameters; they are hyperparameters, and they are few, as few as one (for an isotropic kernel at fixed ν). GPR is therefore a non-parametric method, which is an advantage. While the hyperparameters still need to be chosen, the performance is usually about equally good as long as they are in some reasonable range. In Equation (7), the critical parameter is the length parameter l; the prefactor $\sigma^2$ is fully correlated with δ in Equation (6). Note that Equation (5) is a non-direct-product representation.
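For concreteness, the following minimal sketch implements the prediction step of Equations (5)–(7) for a Matérn 5/2 kernel; the toy data, length parameter l, and noise δ are illustrative assumptions rather than values from any of the cited works:

```python
# GPR prediction following Equations (5)-(7) with a Matern 5/2 covariance.
import numpy as np

def matern52(x1, x2, l=1.0, sigma2=1.0):
    r = np.linalg.norm(x1[:, None, :] - x2[None, :, :], axis=-1)
    a = np.sqrt(5.0) * r / l
    return sigma2 * (1.0 + a + a**2 / 3.0) * np.exp(-a)

rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(30, 2))            # 30 training points in 2D
f = np.sin(X[:, 0]) * np.cos(X[:, 1])           # training values (toy target)
delta = 1e-8                                    # noise term on the diagonal

K = matern52(X, X) + delta * np.eye(len(X))     # Equation (6)
Kinv = np.linalg.inv(K)

Xq = rng.uniform(-2, 2, size=(5, 2))            # prediction points
Ks = matern52(Xq, X)                            # K_* rows, Equation (6)
mean = Ks @ (Kinv @ f)                          # expectation value, Equation (5)
var = matern52(Xq, Xq).diagonal() - np.einsum("ij,jk,ik->i", Ks, Kinv, Ks)
print(mean)                                     # GPR predictions
print(np.sin(Xq[:, 0]) * np.cos(Xq[:, 1]))      # reference values for comparison
print(var)                                      # variance, Equation (5)
```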

3.1.3. Relative Pros and Cons of GPR vs. NN

Because in GPR one has to wield matrices, and inverses of matrices, of size M × M, where M is the number of data, this method becomes increasingly costly as the dataset size grows. In fact, it quickly becomes prohibitively expensive for datasets as small as tens of thousands of data unless additional approximations are made [102]. However, the method is very accurate. In fact, GPR is the only method which we have seen outperform NNs in controlled comparisons and on the quality of observables, i.e., not just on various fitting error measures but in controlled tests of the quality of observables computed with functions constructed using NN or GPR [103]. GPR has been shown to be equivalent to an infinitely large neural network [104]. Contrary to NNs, it provides estimates of uncertainty (we caution, however, that Δf in Equation (5) should not be directly used to compute an error bar of the model; it reflects the uncertainty of the expectation value, which can be higher or lower than the fitting error [105,106]). It is quite costly not just to build but also to evaluate a GPR model when the dataset contains more than a few thousand data. The matrix inverse also means instability when data points are close to each other, although this can be handled, e.g., by using pseudoinverses and appropriate values of δ. We also recently proposed a rectangular GPR, which achieves numeric stability and good generalization without δ [107].
Neural networks, in contrast, work well when some points are close; they work well with millions of data even on a desktop computer [78]. Their disadvantage is the large number of parameters that need to be determined in a nonlinear fit. There is therefore a danger of overfitting, and one needs to carefully control for it by a judicious choice of architecture and also by using test sets. Often, as few as 5% to 20% of the available data are reserved for testing, with the rest used for training. In our experience, this is grossly insufficient, and we recommend using test sets which are at least as large as the training sets. This is a disadvantage, as it requires more data. In general, with the same number of training points, an NN achieves a lower accuracy than GPR; alternatively, one needs more data to reach the same accuracy with an NN as with GPR [103].
Given the advantages of GPR, it would be desirable to have a scheme in which GPR is used in a regime where it is most efficient, i.e., with few data. The approach which we introduce next allows doing just that.

3.2. High-Dimensional Model Representation (HDMR)

We now briefly introduce the high-dimensional model representation (HDMR) [108,109,110]. HDMR is an expansion over orders of coupling. Its form is similar to the ANOVA decomposition [111,112]. The idea is to represent the function f(x) as a sum of, first, an uncoupled approximation, then terms due to the collective action of pairs of variables, then triples, etc., up to the last term, which describes the coupling among all D variables:
$$f(\mathbf{x}) = f_0 + \sum_{i=1}^{D} f_i(x_i) + \sum_{i=1,\,j=i+1}^{D} f_{ij}(x_i,x_j) + \cdots + \sum_{i_1,i_2,\ldots,i_d}^{D} f_{i_1,i_2,\ldots,i_d}\!\left(x_{i_1},x_{i_2},\ldots,x_{i_d}\right) + \cdots + f_{1,2,\ldots,D}(\mathbf{x}) \tag{8}$$
If taken to this last term, the expansion is exact. If the expansion is truncated at terms of some lower order d, it is approximate. In most physical phenomena, the relative importance of these coupling terms drops rapidly with the order of coupling d. Depending on the application, one can stop at, e.g., 3rd order or 2nd order or even 1st order, i.e., the uncoupled approximation, without making much error [108].
When one has stopped at some relatively small order of coupling d, one works only with low-dimensional component functions $f_{i_1,i_2,\ldots,i_d}(x_{i_1},x_{i_2},\ldots,x_{i_d})$. All these component functions can be obtained from one and the same set of training points distributed in the full D-dimensional space (in which case one obtains the so-called RS (random-sampling) HDMR [113,114]). This is advantageous on several counts: low-dimensional functions are generally easier to build, and they can be built from fewer data, i.e., with sparse data, without suffering from overfitting [115]. We indicated above that sparse data are a key problem and that, e.g., Gaussian process regression does not work well with large datasets. One can use HDMR to build a multivariate function from sparse data and stay in the “comfort zone” of the method which is used to build the component functions. A representation with lower-dimensional functions is also easier to use in applications, particularly when f needs to be integrated; HDMR in that case requires computing only low-dimensional quadratures.
In the original HDMR formulation, all component functions are mutually orthogonal, i.e.,
$$\int f_{i_1 i_2 \ldots i_d}\!\left(x_{i_1},x_{i_2},\ldots,x_{i_d}\right) f_{j_1 j_2 \ldots j_m}\!\left(x_{j_1},x_{j_2},\ldots,x_{j_m}\right) d\mathbf{x} = 0, \qquad \left\{i_1,i_2,\ldots,i_d\right\} \neq \left\{j_1,j_2,\ldots,j_m\right\} \tag{9}$$
and are equal to multidimensional integrals (specifically, (D − d)-dimensional integrals need to be computed for d-dimensional component functions), which are quite difficult to compute when D is high [113,114]; further, all lower-dimensional component functions need to be constructed before constructing the d-dimensional functions. We proposed an extension of HDMR whereby we do not build the entire expansion but directly represent the function f with d-dimensional functions [105,116,117,118,119]:
$$f(\mathbf{x}) = \sum_{i}^{N} f_i\!\left(x_1, x_2, \ldots, x_d\right) \tag{10}$$
This is achieved by dropping the orthogonality requirement. One need not build the entire HDMR expansion; it is possible to use only the terms of the desired dimensionality d that provide the desired accuracy. The lower-order terms are effectively subsumed into the d-dimensional terms. Most importantly, when using machine learning to represent the component functions, no integrals need to be computed, and it is possible to alleviate the issue of the combinatorial growth of the number of HDMR terms with d and D.
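The combinatorial growth mentioned above is easy to quantify: the number of d-dimensional terms in a full HDMR expansion is the binomial coefficient “D choose d”, as the short sketch below illustrates for dimensionalities appearing elsewhere in this perspective (D = 15 and D = 32 are taken only as examples):

```python
# Number of d-dimensional component functions in a full HDMR expansion.
from math import comb

for D in (15, 32):
    for d in (2, 3, 5):
        print(f"D = {D:2d}, d = {d}: {comb(D, d):7,d} component functions")
```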

3.3. Combining HDMR with ML for Learning from Sparse Data

3.3.1. Machine Learning of HDMR Terms

HDMR component functions can be built with machine learning methods such as neural networks [116,118,120] or Gaussian process regressions [105,106,119]. Practically, this can be done by fitting the component functions one at a time to the difference between the value of the function f at the training points and the sum of all other component functions:
$$f^{NN\ \mathrm{or}\ GPR}_{k_1 k_2 \ldots k_d}\!\left(x_{k_1},x_{k_2},\ldots,x_{k_d}\right) = f(\mathbf{x}) - \sum_{\substack{\{i_1 i_2 \ldots i_d\}\subset\{1 2 \ldots D\}\\ \{i_1 i_2 \ldots i_d\}\neq\{k_1 k_2 \ldots k_d\}}} f^{NN\ \mathrm{or}\ GPR}_{i_1 i_2 \ldots i_d}\!\left(x_{i_1},x_{i_2},\ldots,x_{i_d}\right) \tag{11}$$
One then cycles through all component functions in a sort of self-consistency loop until convergence. We called the resulting methods (RS-)HDMR-NN and (RS-)HDMR-GPR, respectively [105,120]. In Equation (11), a single NN or GPR instance is used for each component function. In the case of NN, the entire HDMR representation can also be encoded into the architecture of one NN. In practice, we found this to be rather cumbersome; the advantage of Equation (11) is that existing NN engines with simple architectures can be used and work well. When using GPR, one can also achieve an HDMR form by using an HDMR-type kernel [119,121]:
$$k(\mathbf{x},\mathbf{x}') = \sum_{\{i_1 i_2 \ldots i_d\}\subset\{1 2 \ldots D\}} k_{i_1 i_2 \ldots i_d}\!\left(\mathbf{x}_{i_1 i_2 \ldots i_d},\mathbf{x}'_{i_1 i_2 \ldots i_d}\right) \tag{12}$$
where $\mathbf{x}_{i_1 i_2 \ldots i_d} \equiv \left(x_{i_1},x_{i_2},\ldots,x_{i_d}\right)$. There is in that case the disadvantage over Equation (11) that a custom kernel has to be defined, but one uses a single GPR instance, and $f^{GPR}_{i_1 i_2 \ldots i_d}\!\left(x_{i_1},x_{i_2},\ldots,x_{i_d}\right)$ is optimal in the least-squares sense without the need for the fitting cycles of Equation (11). The individual component functions are then
$$f_{i_1 i_2 \ldots i_d}\!\left(x_{i_1},x_{i_2},\ldots,x_{i_d}\right) = \mathbf{K}_{i_1 i_2 \ldots i_d}\,\mathbf{c} \tag{13}$$
where $\mathbf{c} = \mathbf{K}^{-1}\mathbf{f}$ and $\mathbf{K}_{i_1 i_2 \ldots i_d}$ is a row vector with elements $k_{i_1 i_2 \ldots i_d}\!\left(\mathbf{x}_{i_1 i_2 \ldots i_d},\mathbf{x}^{(n)}_{i_1 i_2 \ldots i_d}\right)$. The values of the component functions at the training set points are then $\mathbf{f}_{i_1 i_2 \ldots i_d} = \mathbf{K}_{i_1 i_2 \ldots i_d}\,\mathbf{c}$, where the (m,n) elements of the matrix $\mathbf{K}_{i_1 i_2 \ldots i_d}$ are $k_{i_1 i_2 \ldots i_d}\!\left(\mathbf{x}^{(m)}_{i_1 i_2 \ldots i_d},\mathbf{x}^{(n)}_{i_1 i_2 \ldots i_d}\right)$. The relative importance of different component functions can be evaluated by computing the variance of $\mathbf{K}_{i_1 i_2 \ldots i_d}\,\mathbf{c}$. This can be used, in particular, to prune unnecessary component functions and thereby alleviate the issue of the combinatorial scaling of the number of HDMR terms with D and d [106,119]. This scaling is a major bottleneck in high-dimensional problems. It arises because in HDMR the number and the dimensionality of the terms are hardwired to each other: the number of d-dimensional terms is the binomial coefficient $\binom{D}{d}$ and can be quite high. An even further extension that we proposed is to represent f(x) as a sum of lower-dimensional functions in new coordinates $\mathbf{x}^{(i)}$ which are linear combinations of the original coordinates:
$$\mathbf{x}^{(i)} = \mathbf{A}^{(i)}\mathbf{x} + \mathbf{b}^{(i)}, \qquad f(\mathbf{x}) = \sum_{i}^{N} f_i\!\left(x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)}\right) \tag{14}$$
The full set of $\mathbf{x}^{(i)}$ is generally redundant, and the method, when used with NN component functions, was called Red-RS-HDMR-NN [117]. This has the advantage that the number of terms and their dimensionality are uncoupled, and one can achieve good accuracy with a small number N of low-dimensional terms by varying N and d independently. This is illustrated in Figure 4, where the error in the interatomic potential of vinyl bromide is plotted as a function of both the number and the dimensionality of the terms. For comparison, in standard HDMR, there would be about 3000 five-dimensional terms. The coordinate transformation itself can conveniently be done automatically (during the fit) by introducing an additional NN layer with linear neurons. A nonlinear layer can also be used, but we showed that the universal approximator property of the form of Equation (14) holds even with a linear coordinate transformation [117].
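To make the self-consistency cycle of Equation (11) concrete, the following toy sketch (illustrative data, a simple squared-exponential kernel, and first-order, d = 1, terms; none of this is taken from the cited implementations) fits one GPR component function per variable to the residual of all the others and iterates:

```python
# Backfitting-style cycle of Equation (11) with d = 1 GPR component functions.
import numpy as np

def rbf(x1, x2, l=0.5):
    return np.exp(-((x1[:, None] - x2[None, :]) ** 2) / (2.0 * l**2))

def gpr_fit(x, y, l=0.5, delta=1e-6):
    c = np.linalg.solve(rbf(x, x, l) + delta * np.eye(len(x)), y)
    return lambda xq: rbf(xq, x, l) @ c          # 1D GPR predictor

rng = np.random.default_rng(0)
D, M = 4, 200
X = rng.uniform(-1, 1, size=(M, D))
y = np.sum(np.sin(2.0 * X), axis=1) + 0.3 * X[:, 0] * X[:, 1]   # toy target

components = [lambda x: np.zeros_like(x) for _ in range(D)]     # start from zero terms
pred = np.zeros(M)
for cycle in range(10):
    for i in range(D):
        others = pred - components[i](X[:, i])          # sum of all other terms
        components[i] = gpr_fit(X[:, i], y - others)    # refit term i to the residual
        pred = others + components[i](X[:, i])
    print(cycle, np.sqrt(np.mean((y - pred) ** 2)))     # RMSE drops over the cycles
```

The residual error that remains at convergence is due to the coupling term, which first-order component functions cannot capture; increasing d, or adding coordinate transformations as in Equation (14), addresses it.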

3.3.2. RS-HDMR-NN (Random Sampling High-Dimensional Model Representation Neural Network)

When the product of N and d in the representation of Equation (14) is smaller than the dimensionality of the space D, one obtains dimensionality reduction. Specifically, if the coordinate transformation and the component functions are built with neural networks, one effectively obtains a simple version of an autoencoder. An autoencoder is a type of neural network that performs dimensionality reduction, typically by using layers with diminishing numbers of neurons [122].
Dimensionality reduction can be used to find the intrinsic dimensionality of the data, which may be different from the dimensionality of the space. We highlighted above that heterogeneous catalysis is an important pillar of emerging energy technologies. Modeling of catalyzed reactions with either classical or quantum dynamics requires interatomic potentials. This type of HDMR-inspired dimensionality reduction was used to construct interatomic potentials. For example, in [123], the authors used this approach to construct an interatomic potential for the catalytic decomposition of nitrous oxide on copper, in the frozen-surface approximation, and computed dissociation probabilities with it. This was the first ab initio-based interatomic potential for a polyatomic molecule–surface catalytic reaction with a full account of all molecular degrees of freedom (intramolecular as well as orientational with respect to the surface). The interatomic potential was built in a 15-dimensional configuration space that allowed preserving symmetries and asymptotic behavior even though the intrinsic dimensionality of this problem was nine. Figure 5 plots the error (on a test set) in the interatomic potential as a function of the dimensionality of the single component function used (i.e., N = 1). The method allows correctly identifying the intrinsic dimensionality as the value of d beyond which there is no improvement, and it allowed building the function f sampled with only about 2 data per degree of freedom, with no overfitting [123].

3.3.3. RS-HDMR-GPR (Random Sampling High-Dimensional Model Representation Gaussian Process Regression)

The previous example was about using a combination of HDMR with neural networks. Here we give examples of using a combination of HDMR with Gaussian process regression. The first example has to do with the future of nuclear energy. Another example of HDMR-NN will be cited in the next section. Nuclear energy remains a great source of stable baseload electricity, which is needed in the context of intermittent sources such as solar and wind. A typical nuclear fuel cycle includes uranium mining (of uranium oxides), conversion to UF₆ gas, enrichment of the ²³⁵UF₆ fraction, conversion to solid uranium oxide, and then production of nuclear fuel assemblies. A key step in this cycle is uranium enrichment to bring the fraction of uranium-235 from the naturally occurring 0.7% to 3–5% (about 99.3% of naturally occurring U is ²³⁸U) [124]. Not all reactor types require enrichment, but most reactors operating in the world do, while other applications (e.g., defense) require much higher enrichment degrees. The enrichment is typically done in the gaseous phase, by enriching UF₆ gas in centrifuges. This is a costly process: the enrichment cost can account for about 10% of the electricity cost [125]. For many years now, researchers and the industry have been studying laser-driven enrichment whereby one excites isotope-sensitive hyperfine transitions either of uranium or of uranium hexafluoride [126,127]. This requires very high laser coherency (on the order of 10⁵) and brings with it various issues. If one could instead use vibrational transitions in UF₆, some of which are isotopomer-selective (such as the mode at around 628 cm⁻¹, which differs by about 0.6 cm⁻¹ between ²³⁵UF₆ and ²³⁸UF₆, implying a necessary coherency on the order of 10³) [128], one could use cheaper and less coherent IR lasers.
One of us recently proposed the concept of IR laser-driven isotopomer-selective desorption of UF₆ [129]. We computed that UF₆ can be adsorbed on different graphene derivatives with a tunable adsorption energy depending on the derivative, and that in the adsorbed state there exists an isotopomer-unique vibrational mode which can be used to heat the molecules and make them desorb in an isotopomer-selective way, as illustrated in Figure 6.
Vibrational dynamics of UF₆ would be critical for such a technique, and accurate modeling of vibrational properties and dynamics is critical for the ability to simulate this process [130]. Unfortunately, good, well-resolved vibrational spectra of UF₆ are not even found in the experimental literature [128,131], and to compute accurate vibrational spectra or vibrational dynamics, one requires a good interatomic potential function (potential energy surface, PES), which for a UF₆ molecule is a 15-dimensional function. Accurate PESs for UF₆ are still unavailable, in particular due to the difficulty of building a 15-dimensional function from sparse data.
In [105], we applied the HDMR-based approach with Gaussian process regression component functions to construct the 15-dimensional interatomic potential of UF₆ as $E_{UF_6}(\mathbf{x}) = \sum_{i}^{N_{cf}} f_i^{GPR}\!\left(x_1^{(i)},x_2^{(i)},\ldots,x_d^{(i)}\right)$ from samples computed with density functional theory [130]. We trained the model on 2000, 3000, and 5000 data and tested its quality on 50,000 data, i.e., we used a test set much larger than the training set. The test set errors obtained with different orders d of HDMR and different numbers of training data are summarized in Table 1. More details of the calculations are given in [105].
If we first look at the fit results with 5000 training data, we see something quite expected: the higher the considered order of coupling, d, the smaller the test error. The smallest error is achieved with a full-dimensional GPR (i.e., d = D, $N_{cf}$ = 1). Consider now the results obtained with only 2000 training data. This is a very sparse dataset, with a sampling density of only about 1.7 data per dimension. What we observe here is that the fit with three-dimensional functions gives a better test set error than the full-dimensional fit. This is because with a low density of sampling, it is impossible to recover the full D-dimensional function [120]. The data points were sampled quasi-randomly [132] in the 15-dimensional space; they do not lie on subdimensional manifolds, but the information to recover the full D-dimensional terms is just not there. Low data density also increases the danger of overfitting. All this together argues for representations with lower-dimensional functions such as those based on HDMR. Ultimately this has to do with the fact that a finite-size dataset in a D-dimensional space is not a D-dimensional object but has a dimension anywhere between 0 and D [83,84].

3.4. When Are Deep NNs Useful?

Finally, let us touch on the subject of so-called deep neural networks, which have now been widely popularized (“deep learning”). We consider again the problem of kinetic energy functionals for orbital-free DFT, a good solution of which would significantly enhance researchers’ capabilities to perform large-scale ab initio simulations, as explained above. We are interested in the electronic kinetic energy or kinetic energy density, which we know how to express through the orbitals $\phi_i$ with the expression of Equation (2), but which we want to express as a function of the density only. We can use any derivatives or powers of ρ or any other functions of ρ, but without reference to the $\phi_i$: $T = T\left[\rho(\mathbf{x}) \,\middle|\, \nabla\rho, \Delta\rho, \rho^n, \ldots\right]$ or $\tau = \tau\left[\rho(\mathbf{x}) \,\middle|\, \nabla\rho, \Delta\rho, \rho^n, \ldots\right]$. As an example, we show in Figure 7 how the kinetic energy density τ(x) looks for aluminum, magnesium, and silicon crystals.
It is very difficult or impossible to derive analytically an expression for the KEF that is accurate enough for use in applied simulations of most materials [77]. There is now much progress in machine learning this dependence, in particular with neural networks but also with other methods, including GPR and some of the methods mentioned earlier [41,78,79,80,81,82]. This problem is a very stringent test for machine learning methods because the accuracy required here is very high, on the order of a thousandth of a percent, unless there is significant error cancellation. We consider NN-based learning of $\tau[\rho(\mathbf{x})]$.
Figure 8 shows one-dimensional cuts of the kinetic energy densities of bcc Li, hcp Mg, fcc Al, and cubic diamond Si along selected directions in these crystals, computed in [78]. The black lines are the Kohn–Sham kinetic energy density that we want to machine-learn. Neural networks were trained in a five-dimensional space of density-dependent variables which were terms of the fourth-order gradient expansion [133]:
$$T^{(4)} = T_0 + T_2 + T_4, \qquad T_0 \equiv \int \tau_0(\mathbf{r})\,d\mathbf{r} = \frac{3}{10}\left(3\pi^2\right)^{2/3}\int \rho^{5/3}(\mathbf{r})\,d\mathbf{r}, \qquad T_2 \equiv \int \tau_2(\mathbf{r})\,d\mathbf{r} = \frac{1}{72}\int \frac{\left|\nabla\rho(\mathbf{r})\right|^2}{\rho(\mathbf{r})}\,d\mathbf{r},$$
$$T_4 \equiv \int \tau_4(\mathbf{r})\,d\mathbf{r} = \frac{\left(3\pi^2\right)^{2/3}}{540}\int \rho^{1/3}\left[\left(\frac{\Delta\rho}{\rho}\right)^2 - \frac{9}{8}\,\frac{\Delta\rho}{\rho}\left(\frac{\nabla\rho}{\rho}\right)^2 + \frac{1}{3}\left(\frac{\nabla\rho}{\rho}\right)^4\right] d\mathbf{r} \tag{15}$$
The five summands in the integrands served as density-dependent variables (hence “T4” in Figure 8).
With single-hidden-layer neural networks, we could fit the data for each of the materials separately accurately, visually overlapping with the black curve [78]. The red lines are from attempts to fit the kinetic energy densities of all four materials simultaneously with a single-hidden-layer NN. It is important to be able to do so to make sure that the resulting expression of the kinetic energy as a function of the density has a certain portability across many materials. The result is obviously not good. With a multilayer neural network, however, we could obtain a good fit simultaneously for all materials, shown in Figure 8 by the turquoise lines for a four-hidden-layer NN with 20 neurons per layer [78]. This difference in the quality of the fit is not due to an inadequate number of neurons: no single-hidden-layer NN was able to learn the KED of all the materials simultaneously.
A single-hidden-layer NN is a universal approximator, and indeed we saw no advantage of using multilayer networks when fitting individual materials; we also saw no such advantage in our prior works on interatomic potentials [134,135]. One difference between those works and this case is the extremely uneven distribution of the data. To illustrate this, we show in Figure 9 distributions (histograms) of the target function values, i.e., the kinetic energy density τ (and τ+, see Equation (2)), and of some of the density-dependent variables we used [41]. In Figure 9, $p = \frac{\left|\nabla\rho\right|^2}{4\left(3\pi^2\right)^{2/3}\rho^{8/3}}$ is the scaled (to satisfy the so-called exact conditions [136]) squared gradient and $q = \frac{\Delta\rho}{4\left(3\pi^2\right)^{2/3}\rho^{5/3}}$ is the scaled Laplacian of the density, TF stands for $\tau_{TF}(\mathbf{r}) = \frac{3}{10}\left(3\pi^2\right)^{2/3}\rho^{5/3}(\mathbf{r})$, the Thomas–Fermi KED [137], and vW stands for $\tau_{vW}(\mathbf{r}) = \frac{1}{8}\frac{\left|\nabla\rho(\mathbf{r})\right|^2}{\rho(\mathbf{r})}$, the von Weizsäcker KED [138]. The KED distribution is very uneven. The distributions of the density-dependent variables are in some cases extremely uneven. This means that there are vast parts of the space which are extremely sparsely sampled but which are still important for the quality of the model, and here we clearly see the advantage of a deep NN. Data distribution is an issue that still needs to be better addressed in machine learning. Just using weighted fitting is not sufficient, as we also saw in our research [41,78].
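For reference, the density-dependent quantities named above can be computed directly from the density and its derivatives; the sketch below (with random placeholder arrays standing in for quantities that in practice come from a DFT calculation) collects the definitions of τTF, τvW, p, and q:

```python
# Density-dependent features used in KED learning: Thomas-Fermi and
# von Weizsaecker KEDs and the scaled squared gradient p and Laplacian q.
import numpy as np

def ked_features(rho, grad_rho, lap_rho):
    c = (3.0 * np.pi**2) ** (2.0 / 3.0)
    grad2 = np.sum(grad_rho**2, axis=-1)           # |grad rho|^2
    tau_tf = 0.3 * c * rho ** (5.0 / 3.0)          # Thomas-Fermi KED
    tau_vw = grad2 / (8.0 * rho)                   # von Weizsaecker KED
    p = grad2 / (4.0 * c * rho ** (8.0 / 3.0))     # scaled squared gradient
    q = lap_rho / (4.0 * c * rho ** (5.0 / 3.0))   # scaled Laplacian
    return tau_tf, tau_vw, p, q

# placeholder arrays on a grid of 1000 points (in practice: from a DFT code)
rng = np.random.default_rng(0)
rho = rng.uniform(0.01, 1.0, size=1000)
grad_rho = rng.normal(size=(1000, 3))
lap_rho = rng.normal(size=1000)
print([a.shape for a in ked_features(rho, grad_rho, lap_rho)])
```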
When machine learning the KED from the data whose distribution is shown in Figure 9, we also confirmed that the HDMR-GPR combination is able to inform on the relative importance of different combinations of variables (based on Equation (13)) [106]. This is a capability which goes beyond the automatic relevance determination (ARD) used with GPR, whereby the inverse of the optimal length scale parameter in GPR (l in Equation (7)) with a squared exponential kernel can be used to determine the importance of different variables [100]. With HDMR-GPR, it is the relative importance of different variable combinations which can be estimated. As non-important variable combinations can be omitted, this provides a possibility to prune the number of HDMR terms, which is important to address a key disadvantage of HDMR (i.e., the combinatorial scaling of the number of terms). The knowledge of the relative importance of different combinations of descriptors can be used for KEF development with other methods, not necessarily ML-based, including analytic methods.

4. Discussion and Conclusions

Machine learning today is very widely used to assist in the development of different technologies, including energy technologies. Novel technologies, such as those based on fuel cells, advanced solar cells, etc., are especially “knowledge-deep”, i.e., they require multiple steps in understanding and development. This includes modeling at the materials level, device level, and system level, and at all these levels machine learning is useful, usually for the prediction of material or device properties and performance parameters from descriptors. What we attempted to showcase in this perspective is that, as far as the uses of ML for the benefit of energy technologies are concerned, there is more than meets the eye: many more aspects are being addressed with machine learning than just predicting material or device properties from descriptors. Much is being done at the back end by using machine learning to construct objects, such as interatomic potentials, which are part of the workflow of understanding and modeling of materials and phenomena for these technologies, or to improve modeling capabilities. This is important in particular for emergent energy technologies, as they require quantum mechanics-based understanding and modeling, which is difficult and costly but can be helped with machine learning.
We saw that when one learns dependences in multidimensional spaces, one always works with sparse data, and in this regime machine learning methods are effective because they avoid direct product representations. We used neural networks and Gaussian process regressions in our work. When one uses (regression-type) NNs, there is no need to be restricted to the commonly used sigmoid neurons; one can obtain some useful properties, such as a sum-of-products form, with other types of activation functions. When the sampling is sparse, the dimensionality of the data is typically lower than the dimensionality of the space, and it may be impossible to reconstruct the full-dimensional function. The danger of overfitting is then also increased. We reviewed combined methods which represent a multidimensional function with low-dimensional functions of any desired dimensionality, where the component functions are built with either NN or GPR and could potentially be built with other methods. This allows working with very sparse datasets without overfitting, and it allows using machine learning methods within their “comfort zone”.
There is much discussion of deep learning and deep networks. One should not forget that a single-hidden-layer NN is already a universal approximator. In practice, though, when the distribution of the data is very uneven, we found deep NNs to be very useful. In other applications, where the data were more uniformly distributed, we did not see an advantage of deep NNs. In general, the issue of extremely uneven data distributions in some applications is a problem that is yet to be solved in substance. Going forward, we therefore expect methods that go beyond commonly used tools such as NN and GPR, and that use them instead as building blocks of more involved and powerful schemes, to gain increased attention and use, as well as methods which are specifically tailored to address the issues of data distribution, sparsity, and missing data.

Author Contributions

Conceptualization and methodology, S.M.; writing—original draft preparation, S.M.; writing—review and editing, S.M. and M.I.; project administration, S.M. and M.I. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Tong, Q.; Gao, P.; Liu, H.; Xie, Y.; Lv, J.; Wang, Y.; Zhao, J. Combining Machine Learning Potential and Structure Prediction for Accelerated Materials Design and Discovery. J. Phys. Chem. Lett. 2020, 11, 8710–8720. [Google Scholar] [CrossRef]
  2. Walters, W.P.; Barzilay, R. Applications of Deep Learning in Molecule Generation and Molecular Property Prediction. Accounts Chem. Res. 2020, 54, 263–270. [Google Scholar] [CrossRef]
  3. Ramprasad, R.; Batra, R.; Pilania, G.; Mannodi-Kanakkithodi, A.; Kim, C. Machine learning in materials informatics: Recent applications and prospects. npj Comput. Mater. 2017, 3, 54. [Google Scholar] [CrossRef]
  4. Wang, A.Y.-T.; Murdock, R.J.; Kauwe, S.K.; Oliynyk, A.O.; Gurlo, A.; Brgoch, J.; Persson, K.A.; Sparks, T.D. Machine Learning for Materials Scientists: An Introductory Guide toward Best Practices. Chem. Mater. 2020, 32, 4954–4965. [Google Scholar] [CrossRef]
  5. Butler, K.T.; Davies, D.W.; Cartwright, H.; Isayev, O.; Walsh, A. Machine learning for molecular and materials science. Nature 2018, 559, 547–555. [Google Scholar] [CrossRef] [PubMed]
  6. Moosavi, S.M.; Jablonka, K.M.; Smit, B. The Role of Machine Learning in the Understanding and Design of Materials. J. Am. Chem. Soc. 2020, 142, 20273–20287. [Google Scholar] [CrossRef] [PubMed]
  7. del Cueto, M.; Troisi, A. Determining usefulness of machine learning in materials discovery using simulated research landscapes. Phys. Chem. Chem. Phys. 2021, 23, 14156–14163. [Google Scholar] [CrossRef] [PubMed]
  8. Kalidindi, S.R. Feature engineering of material structure for AI-based materials knowledge systems. J. Appl. Phys. 2020, 128, 041103. [Google Scholar] [CrossRef]
  9. Li, S.; Liu, Y.; Chen, D.; Jiang, Y.; Nie, Z.; Pan, F. Encoding the atomic structure for machine learning in materials science. WIREs Comput. Mol. Sci. 2021, 12, e1558. [Google Scholar] [CrossRef]
  10. Im, J.; Lee, S.; Ko, T.-W.; Kim, H.W.; Hyon, Y.; Chang, H. Identifying Pb-free perovskites for solar cells by machine learning. npj Comput. Mater. 2019, 5, 37. [Google Scholar] [CrossRef] [Green Version]
  11. Meftahi, N.; Klymenko, M.; Christofferson, A.J.; Bach, U.; Winkler, D.A.; Russo, S.P. Machine learning property prediction for organic photovoltaic devices. npj Comput. Mater. 2020, 6, 166. [Google Scholar] [CrossRef]
  12. Sahu, H.; Ma, H. Unraveling Correlations between Molecular Properties and Device Parameters of Organic Solar Cells Using Machine Learning. J. Phys. Chem. Lett. 2019, 10, 7277–7284. [Google Scholar] [CrossRef] [PubMed]
  13. Zhuo, Y.; Brgoch, J. Opportunities for Next-Generation Luminescent Materials through Artificial Intelligence. J. Phys. Chem. Lett. 2021, 12, 764–772. [Google Scholar] [CrossRef] [PubMed]
  14. Mahmood, A.; Wang, J.-L. Machine learning for high performance organic solar cells: Current scenario and future prospects. Energy Environ. Sci. 2020, 14, 90–105. [Google Scholar] [CrossRef]
  15. Li, F.; Peng, X.; Wang, Z.; Zhou, Y.; Wu, Y.; Jiang, M.; Xu, M. Machine Learning (ML)—Assisted Design and Fabrication for Solar Cells. Energy Environ. Mater. 2019, 2, 280–291. [Google Scholar] [CrossRef] [Green Version]
  16. Wang, C.-I.; Joanito, I.; Lan, C.-F.; Hsu, C.-P. Artificial neural networks for predicting charge transfer coupling. J. Chem. Phys. 2020, 153, 214113. [Google Scholar] [CrossRef]
  17. An, N.G.; Kim, J.Y.; Vak, D. Machine learning-assisted development of organic photovoltaics via high-throughput in situ formulation. Energy Environ. Sci. 2021, 14, 3438–3446. [Google Scholar] [CrossRef]
  18. Rodríguez-Martínez, X.; Pascual-San-José, E.; Campoy-Quiles, M. Accelerating organic solar cell material’s discovery: High-throughput screening and big data. Energy Environ. Sci. 2021, 14, 3301–3322. [Google Scholar] [CrossRef]
  19. Priya, P.; Aluru, N.R. Accelerated design and discovery of perovskites with high conductivity for energy applications through machine learning. npj Comput. Mater. 2021, 7, 90. [Google Scholar] [CrossRef]
  20. Srivastava, M.; Howard, J.M.; Gong, T.; Dias, M.R.S.; Leite, M.S. Machine Learning Roadmap for Perovskite Photovoltaics. J. Phys. Chem. Lett. 2021, 12, 7866–7877. [Google Scholar] [CrossRef]
  21. Teunissen, J.L.; Da Pieve, F. Molecular Bond Engineering and Feature Learning for the Design of Hybrid Organic–Inorganic Perovskite Solar Cells with Strong Noncovalent Halogen–Cation Interactions. J. Phys. Chem. C 2021, 125, 25316–25326. [Google Scholar] [CrossRef]
  22. Miyake, Y.; Saeki, A. Machine Learning-Assisted Development of Organic Solar Cell Materials: Issues, Analyses, and Outlooks. J. Phys. Chem. Lett. 2021, 12, 12391–12401. [Google Scholar] [CrossRef] [PubMed]
  23. Xu, S.; Liang, J.; Yu, Y.; Liu, R.; Xu, Y.; Zhu, X.; Zhao, Y. Machine Learning-Assisted Discovery of High-Voltage Organic Materials for Rechargeable Batteries. J. Phys. Chem. C 2021, 125, 21352–21358. [Google Scholar] [CrossRef]
  24. Moses, I.A.; Joshi, R.P.; Ozdemir, B.; Kumar, N.; Eickholt, J.; Barone, V. Machine Learning Screening of Metal-Ion Battery Electrode Materials. ACS Appl. Mater. Interfaces 2021, 13, 53355–53362. [Google Scholar] [CrossRef] [PubMed]
  25. Chen, A.; Zhang, X.; Chen, L.; Yao, S.; Zhou, Z. A Machine Learning Model on Simple Features for CO2 Reduction Electrocatalysts. J. Phys. Chem. C 2020, 124, 22471–22478. [Google Scholar] [CrossRef]
  26. Lamoureux, P.S.; Winther, K.T.; Torres, J.A.G.; Streibel, V.; Zhao, M.; Bajdich, M.; Abild-Pedersen, F.; Bligaard, T. Machine Learning for Computational Heterogeneous Catalysis. ChemCatChem 2019, 11, 3581–3601. [Google Scholar] [CrossRef] [Green Version]
  27. Back, S.; Yoon, J.; Tian, N.; Zhong, W.; Tran, K.; Ulissi, Z.W. Convolutional Neural Network of Atomic Surface Structures To Predict Binding Energies for High-Throughput Screening of Catalysts. J. Phys. Chem. Lett. 2019, 10, 4401–4408. [Google Scholar] [CrossRef]
  28. Toyao, T.; Maeno, Z.; Takakusagi, S.; Kamachi, T.; Takigawa, I.; Shimizu, K.-I. Machine Learning for Catalysis Informatics: Recent Applications and Prospects. ACS Catal. 2019, 10, 2260–2297. [Google Scholar] [CrossRef]
  29. Li, X.; Paier, W.; Paier, J. Machine Learning in Computational Surface Science and Catalysis: Case Studies on Water and Metal–Oxide Interfaces. Front. Chem. 2020, 8, 601029. [Google Scholar] [CrossRef]
  30. Pablo-García, S.; García-Muelas, R.; Sabadell-Rendón, A.; López, N. Dimensionality reduction of complex reaction networks in heterogeneous catalysis: From linear-scaling relationships to statistical learning techniques. WIREs Comput. Mol. Sci. 2021, 11, e1540. [Google Scholar] [CrossRef]
  31. Li, X.; Chiong, R.; Page, A.J. Group and Period-Based Representations for Improved Machine Learning Prediction of Heterogeneous Alloy Catalysts. J. Phys. Chem. Lett. 2021, 12, 5156–5162. [Google Scholar] [CrossRef] [PubMed]
  32. Wu, D.; Zhang, J.; Cheng, M.-J.; Lu, Q.; Zhang, H. Machine Learning Investigation of Supplementary Adsorbate Influence on Copper for Enhanced Electrochemical CO2 Reduction Performance. J. Phys. Chem. C 2021, 125, 15363–15372. [Google Scholar] [CrossRef]
  33. Palkovits, S. A Primer about Machine Learning in Catalysis—A Tutorial with Code. ChemCatChem 2020, 12, 3995–4008. [Google Scholar] [CrossRef]
  34. Giordano, L.; Akkiraju, K.; Jacobs, R.; Vivona, D.; Morgan, D.; Shao-Horn, Y. Electronic Structure-Based Descriptors for Oxide Properties and Functions. Accounts Chem. Res. 2022, 55, 298–308. [Google Scholar] [CrossRef] [PubMed]
  35. Hohenberg, P.; Kohn, W. Inhomogeneous Electron Gas. Phys. Rev. 1964, 136, B864–B871. [Google Scholar] [CrossRef] [Green Version]
  36. Kohn, W.; Sham, L.J. Self-consistent equations including exchange and correlation effects. Phys. Rev. 1965, 140, A1133–A1138. [Google Scholar] [CrossRef] [Green Version]
  37. Rapaport, D.C. The Art of Molecular Dynamics Simulation, 2nd ed.; Cambridge University Press: Cambridge, UK, 2004; ISBN 978-0-521-82568-9. [Google Scholar]
  38. Jansen, A.P.J. Kinetic Monte Carlo Algorithms. In An Introduction to Kinetic Monte Carlo Simulations of Surface Reactions; Jansen, A.P.J., Ed.; Lecture Notes in Physics; Springer: Berlin, Heidelberg, 2012; pp. 37–71. ISBN 978-3-642-29488-4. [Google Scholar]
  39. Manzhos, S. Machine learning for the solution of the Schrödinger equation. Mach. Learn. Sci. Technol. 2020, 1, 013002. [Google Scholar] [CrossRef]
  40. Behler, J. Perspective: Machine learning potentials for atomistic simulations. J. Chem. Phys. 2016, 145, 170901. [Google Scholar] [CrossRef] [Green Version]
  41. Manzhos, S.; Golub, P. Data-driven kinetic energy density fitting for orbital-free DFT: Linear vs Gaussian process regression. J. Chem. Phys. 2020, 153, 074104. [Google Scholar] [CrossRef]
  42. Kulik, H.; Hammerschmidt, T.; Schmidt, J.; Botti, S.; Marques, M.A.L.; Boley, M.; Scheffler, M.; Todorović, M.; Rinke, P.; Oses, C.; et al. Roadmap on Machine Learning in Electronic Structure. Electron. Struct. 2022. [Google Scholar] [CrossRef]
  43. Duan, C.; Liu, F.; Nandy, A.; Kulik, H.J. Putting Density Functional Theory to the Test in Machine-Learning-Accelerated Materials Discovery. J. Phys. Chem. Lett. 2021, 12, 4628–4637. [Google Scholar] [CrossRef] [PubMed]
  44. Friederich, P.; Häse, F.; Proppe, J.; Aspuru-Guzik, A. Machine-learned potentials for next-generation matter simulations. Nat. Mater. 2021, 20, 750–761. [Google Scholar] [CrossRef] [PubMed]
  45. Statistical Review of World Energy | Energy Economics | Home. Available online: https://www.bp.com/en/global/corporate/energy-economics/statistical-review-of-world-energy.html (accessed on 7 February 2022).
  46. Olah, G.A.; Prakash, G.K.S.; Goeppert, A. Anthropogenic Chemical Carbon Cycle for a Sustainable Future. J. Am. Chem. Soc. 2011, 133, 12881–12898. [Google Scholar] [CrossRef] [PubMed]
  47. Nayak, P.K.; Mahesh, S.; Snaith, H.J.; Cahen, D. Photovoltaic solar cell technologies: Analysing the state of the art. Nat. Rev. Mater. 2019, 4, 269–285. [Google Scholar] [CrossRef]
  48. Herbert, G.J.; Iniyan, S.; Sreevalsan, E.; Rajapandian, S. A review of wind energy technologies. Renew. Sustain. Energy Rev. 2007, 11, 1117–1145. [Google Scholar] [CrossRef]
  49. Winter, M.; Brodd, R.J. What Are Batteries, Fuel Cells, and Supercapacitors? Chem. Rev. 2004, 104, 4245–4270. [Google Scholar] [CrossRef] [Green Version]
  50. Birdja, Y.Y.; Pérez-Gallent, E.; Figueiredo, M.C.; Göttle, A.J.; Calle-Vallejo, F.; Koper, M.T.M. Advances and challenges in understanding the electrocatalytic conversion of carbon dioxide to fuels. Nat. Energy 2019, 4, 732–745. [Google Scholar] [CrossRef]
  51. Detz, R.J.; Reek, J.N.H.; van der Zwaan, B.C.C. The future of solar fuels: When could they become competitive? Energy Environ. Sci. 2018, 11, 1653–1669. [Google Scholar] [CrossRef]
  52. Barnhart, C.J.; Benson, S.M. On the importance of reducing the energetic and material demands of electrical energy storage. Energy Environ. Sci. 2013, 6, 1083–1092. [Google Scholar] [CrossRef]
  53. Winter, M.; Barnett, B.; Xu, K. Before Li Ion Batteries. Chem. Rev. 2018, 118, 11433–11456. [Google Scholar] [CrossRef]
  54. Abram, T.; Ion, S. Generation-IV nuclear power: A review of the state of the science. Energy Policy 2008, 36, 4323–4330. [Google Scholar] [CrossRef]
  55. Ho, M.; Obbard, E.; A Burr, P.; Yeoh, G. A review on the development of nuclear power reactors. Energy Procedia 2019, 160, 459–466. [Google Scholar] [CrossRef]
  56. Suman, S. Hybrid nuclear-renewable energy systems: A review. J. Clean. Prod. 2018, 181, 166–177. [Google Scholar] [CrossRef]
  57. Shao, M.; Chang, Q.; Dodelet, J.-P.; Chenitz, R. Recent Advances in Electrocatalysts for Oxygen Reduction Reaction. Chem. Rev. 2016, 116, 3594–3657. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  58. Jahangiri, H.; Bennett, J.; Mahjoubi, P.; Wilson, K.; Gu, S. A review of advanced catalyst development for Fischer–Tropsch synthesis of hydrocarbons from biomass derived syn-gas. Catal. Sci. Technol. 2014, 4, 2210–2229. [Google Scholar] [CrossRef] [Green Version]
  59. Chen, W.-H.; Chen, C.-Y. Water gas shift reaction for hydrogen production and carbon dioxide capture: A review. Appl. Energy 2019, 258, 114078. [Google Scholar] [CrossRef]
  60. Chen, L.; Qi, Z.; Zhang, S.; Su, J.; Somorjai, G.A. Catalytic Hydrogen Production from Methane: A Review on Recent Progress and Prospect. Catalysts 2020, 10, 858. [Google Scholar] [CrossRef]
  61. Lavoie, J.-M. Review on dry reforming of methane, a potentially more environmentally-friendly approach to the increasing natural gas exploitation. Front. Chem. 2014, 2, 81. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  62. Jain, A.; Ong, S.P.; Hautier, G.; Chen, W.; Richards, W.D.; Dacek, S.; Cholia, S.; Gunter, D.; Skinner, D.; Ceder, G.; et al. Commentary: The Materials Project: A materials genome approach to accelerating materials innovation. APL Mater. 2013, 1, 011002. [Google Scholar] [CrossRef] [Green Version]
  63. Liu, M.; Kitchin, J.R. SingleNN: Modified Behler–Parrinello Neural Network with Shared Weights for Atomistic Simulations with Transferability. J. Phys. Chem. C 2020, 124, 17811–17818. [Google Scholar] [CrossRef]
  64. Behler, J. Constructing high-dimensional neural network potentials: A tutorial review. Int. J. Quantum Chem. 2015, 115, 1032–1050. [Google Scholar] [CrossRef]
  65. Na, G.S.; Jang, S.; Lee, Y.-L.; Chang, H. Tuplewise Material Representation Based Machine Learning for Accurate Band Gap Prediction. J. Phys. Chem. A 2020, 124, 10616–10623. [Google Scholar] [CrossRef]
  66. Xu, P.; Lu, T.; Ju, L.; Tian, L.; Li, M.; Lu, W. Machine Learning Aided Design of Polymer with Targeted Band Gap Based on DFT Computation. J. Phys. Chem. B 2021, 125, 601–611. [Google Scholar] [CrossRef] [PubMed]
  67. Aykol, M.; Herring, P.; Anapolsky, A. Machine learning for continuous innovation in battery technologies. Nat. Rev. Mater. 2020, 5, 725–727. [Google Scholar] [CrossRef]
  68. Deringer, V.L. Modelling and understanding battery materials with machine-learning-driven atomistic simulations. J. Phys. Energy 2020, 2, 041003. [Google Scholar] [CrossRef]
  69. Thomas, J.K.; Crasta, H.R.; Kausthubha, K.; Gowda, C.; Rao, A. Battery monitoring system using machine learning. J. Energy Storage 2021, 40, 102741. [Google Scholar] [CrossRef]
  70. Li, W.; Cui, H.; Nemeth, T.; Jansen, J.; Ünlübayir, C.; Wei, Z.; Zhang, L.; Wang, Z.; Ruan, J.; Dai, H.; et al. Deep reinforcement learning-based energy management of hybrid battery systems in electric vehicles. J. Energy Storage 2021, 36, 102355. [Google Scholar] [CrossRef]
  71. Elkamel, M.; Schleider, L.; Pasiliao, E.L.; Diabat, A.; Zheng, Q.P. Long-Term Electricity Demand Prediction via Socioeconomic Factors—A Machine Learning Approach with Florida as a Case Study. Energies 2020, 13, 3996. [Google Scholar] [CrossRef]
  72. Krishnadas, G.; Kiprakis, A. A Machine Learning Pipeline for Demand Response Capacity Scheduling. Energies 2020, 13, 1848. [Google Scholar] [CrossRef] [Green Version]
  73. Nti, I.K.; Teimeh, M.; Nyarko-Boateng, O.; Adekoya, A.F. Electricity load forecasting: A systematic review. J. Electr. Syst. Inf. Technol. 2020, 7, 1–19. [Google Scholar] [CrossRef]
  74. Antonopoulos, I.; Robu, V.; Couraud, B.; Kirli, D.; Norbu, S.; Kiprakis, A.; Flynn, D.; Elizondo-Gonzalez, S.; Wattam, S. Artificial intelligence and machine learning approaches to energy demand-side response: A systematic review. Renew. Sustain. Energy Rev. 2020, 130, 109899. [Google Scholar] [CrossRef]
  75. Kim, J.Y.; Lee, J.-W.; Jung, H.S.; Shin, H.; Park, N.-G. High-Efficiency Perovskite Solar Cells. Chem. Rev. 2020, 120, 7867–7918. [Google Scholar] [CrossRef] [PubMed]
  76. Pham, H.D.; Xianqiang, L.; Li, W.; Manzhos, S.; Kyaw, A.K.K.; Sonar, P. Organic interfacial materials for perovskite-based optoelectronic devices. Energy Environ. Sci. 2019, 12, 1177–1209. [Google Scholar] [CrossRef]
  77. Witt, W.C.; del Rio, B.G.; Dieterich, J.M.; Carter, E.A. Orbital-free density functional theory for materials research. J. Mater. Res. 2018, 33, 777–795. [Google Scholar] [CrossRef]
  78. Golub, P.; Manzhos, S. Kinetic energy densities based on the fourth order gradient expansion: Performance in different classes of materials and improvement via machine learning. Phys. Chem. Chem. Phys. 2018, 21, 378–395. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  79. Fujinami, M.; Kageyama, R.; Seino, J.; Ikabata, Y.; Nakai, H. Orbital-free density functional theory calculation applying semi-local machine-learned kinetic energy density functional and kinetic potential. Chem. Phys. Lett. 2020, 748, 137358. [Google Scholar] [CrossRef]
  80. Seino, J.; Kageyama, R.; Fujinami, M.; Ikabata, Y.; Nakai, H. Semi-local machine-learned kinetic energy density functional demonstrating smooth potential energy curves. Chem. Phys. Lett. 2019, 734, 136732. [Google Scholar] [CrossRef]
  81. Snyder, J.C.; Rupp, M.; Hansen, K.; Blooston, L.; Müller, K.-R.; Burke, K. Orbital-free bond breaking via machine learning. J. Chem. Phys. 2013, 139, 224104. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  82. Yao, K.; Parkhill, J. Kinetic Energy of Hydrocarbons as a Function of Electron Density and Convolutional Neural Networks. J. Chem. Theory Comput. 2016, 12, 1139–1147. [Google Scholar] [CrossRef] [PubMed]
  83. Hausdorff, F. Dimension und äußeres Maß. Math. Ann. 1918, 79, 157–179. [Google Scholar] [CrossRef]
  84. Kak, S. Information theory and dimensionality of space. Sci. Rep. 2020, 10, 20733. [Google Scholar] [CrossRef] [PubMed]
  85. Bottou, L. Stochastic gradient descent tricks. In Neural Networks: Tricks of the Trade; Springer: Berlin/Heidelberg, Germany, 2012; pp. 421–436. ISBN 9783642352881. [Google Scholar]
  86. Kolmogorov, A.N.; Arnol’d, V.; Boltjanskiĭ, V.; Efimov, N.; Èskin, G.; Koteljanskiĭ, D.; Krasovskiĭ, N.; Men’šov, D.; Portnov, I.; Ryškov, S.; et al. On the representation of continuous functions of many variables by superposition of continuous functions of one variable and addition. Am. Math. Soc. Transl. Ser. 2 1963, 28, 55–59. [Google Scholar] [CrossRef]
  87. Sprecher, D.A. A Numerical Implementation of Kolmogorov’s Superpositions II. Neural Netw. 1997, 10, 447–457. [Google Scholar] [CrossRef]
  88. Sprecher, D.A. A Numerical Implementation of Kolmogorov’s Superpositions. Neural Netw. 1996, 9, 765–772. [Google Scholar] [CrossRef]
  89. Sprecher, D.A.; Draghici, S. Space-filling curves and Kolmogorov superposition-based neural networks. Neural Netw. 2002, 15, 57–67. [Google Scholar] [CrossRef]
  90. Nees, M. Approximative versions of Kolmogorov’s superposition theorem, proved constructively. J. Comput. Appl. Math. 1994, 54, 239–250. [Google Scholar] [CrossRef] [Green Version]
  91. Katsuura, H.; Sprecher, D.A. Computational aspects of Kolmogorov’s superposition theorem. Neural Netw. 1994, 7, 455–461. [Google Scholar] [CrossRef]
  92. Sprecher, D.A. A universal mapping for kolmogorov’s superposition theorem. Neural Netw. 1993, 6, 1089–1094. [Google Scholar] [CrossRef]
  93. Kurkova, V. Kolmogorov’s theorem and multilayer neural networks. Neural Netw. 1992, 5, 501–506. [Google Scholar] [CrossRef]
  94. Hornik, K. Approximation capabilities of multilayer feedforward networks. Neural Netw. 1991, 4, 251–257. [Google Scholar] [CrossRef]
  95. Hornik, K.; Stinchcombe, M.; White, H. Universal approximation of an unknown mapping and its derivatives using multilayer feedforward networks. Neural Netw. 1990, 3, 551–560. [Google Scholar] [CrossRef]
  96. Gorban, A. Approximation of continuous functions of several variables by an arbitrary nonlinear continuous function of one variable, linear functions, and their superpositions. Appl. Math. Lett. 1998, 11, 45–49. [Google Scholar] [CrossRef] [Green Version]
  97. Manzhos, S.; Carrington, T., Jr. Using neural networks to represent potential surfaces as sums of products. J. Chem. Phys. 2006, 125, 194105. [Google Scholar] [CrossRef]
  98. Beck, M.; Jäckle, A.; Worth, G.; Meyer, H.-D. The multiconfiguration time-dependent Hartree (MCTDH) method: A highly efficient algorithm for propagating wavepackets. Phys. Rep. 2000, 324, 1–105. [Google Scholar] [CrossRef]
  99. Schmitt, M. On the Complexity of Computing and Learning with Multiplicative Neural Networks. Neural Comput. 2002, 14, 241–301. [Google Scholar] [CrossRef] [PubMed]
  100. Rasmussen, C.E.; Williams, C.K.I. Gaussian Processes for Machine Learning; MIT Press: Cambridge MA, USA, 2006; ISBN 0-262-18253-X. [Google Scholar]
  101. Genton, M.G. Classes of Kernels for Machine Learning: A Statistics Perspective. J. Mach. Learn. Res. 2001, 2, 299–312. [Google Scholar]
  102. Smola, A.; Bartlett, P. Sparse Greedy Gaussian Process Regression. In Proceedings of the Advances in Neural Information Processing Systems; MIT Press: Cambridge, MA, USA, 2001; Volume 13. [Google Scholar]
  103. Kamath, A.; Vargas-Hernández, R.A.; Krems, R.V.; Carrington, T., Jr.; Manzhos, S. Neural networks vs Gaussian process regression for representing potential energy surfaces: A comparative study of fit quality and vibrational spectrum accuracy. J. Chem. Phys. 2018, 148, 241702. [Google Scholar] [CrossRef] [PubMed]
  104. Warner, B.A.; Neal, R.M. Bayesian Learning for Neural Networks (Lecture Notes in Statistics Vol. 118). J. Am. Stat. Assoc. 1997, 92, 791. [Google Scholar] [CrossRef]
  105. Boussaidi, M.A.; Ren, O.; Voytsekhovsky, D.; Manzhos, S. Random Sampling High Dimensional Model Representation Gaussian Process Regression (RS-HDMR-GPR) for Multivariate Function Representation: Application to Molecular Potential Energy Surfaces. J. Phys. Chem. A 2020, 124, 7598–7607. [Google Scholar] [CrossRef] [PubMed]
  106. Ren, O.; Boussaidi, M.A.; Voytsekhovsky, D.; Ihara, M.; Manzhos, S. Random Sampling High Dimensional Model Representation Gaussian Process Regression (RS-HDMR-GPR) for representing multidimensional functions with machine-learned lower-dimensional terms allowing insight with a general method. Comput. Phys. Commun. 2021, 271, 108220. [Google Scholar] [CrossRef]
  107. Manzhos, S.; Ihara, M. Rectangularization of Gaussian Process Regression for Optimization of Hyperparameters. arXiv 2021, arXiv:2112.02467. [Google Scholar]
  108. Li, G.; Rosenthal, C.; Rabitz, H. High Dimensional Model Representations. J. Phys. Chem. A 2001, 105, 7765–7777. [Google Scholar] [CrossRef]
  109. Rabitz, H.; Aliş, Ö.F. General foundations of high-dimensional model representations. J. Math. Chem. 1999, 25, 197–233. [Google Scholar] [CrossRef]
  110. Alış, F.; Rabitz, H. Efficient Implementation of High Dimensional Model Representations. J. Math. Chem. 2001, 29, 127–142. [Google Scholar] [CrossRef]
  111. Fisher, R.A. On the “Probable Error” of a Coefficient of Correlation Deduced from a Small Sample. Metron 1921, 1, 3–32. [Google Scholar]
  112. Sobol′, I.M. Global sensitivity indices for nonlinear mathematical models and their Monte Carlo estimates. Math. Comput. Simul. 2001, 55, 271–280. [Google Scholar] [CrossRef]
  113. Li, G.; Hu, J.; Wang, S.-W.; Georgopoulos, P.G.; Schoendorf, A.J.; Rabitz, H. Random Sampling-High Dimensional Model Representation (RS-HDMR) and Orthogonality of Its Different Order Component Functions. J. Phys. Chem. A 2006, 110, 2474–2485. [Google Scholar] [CrossRef]
  114. Wang, S.-W.; Georgopoulos, P.G.; Li, G.; Rabitz, H. Random Sampling-High Dimensional Model Representation (RS-HDMR) with Nonuniformly Distributed Variables: Application to an Integrated Multimedia/Multipathway Exposure and Dose Model for Trichloroethylene. J. Phys. Chem. A 2003, 107, 4707–4716. [Google Scholar] [CrossRef]
  115. Manzhos, S.; Ihara, M. On the Optimization of Hyperparameters in Gaussian Process Regression with the Help of Low-Order High-Dimensional Model Representation. arXiv 2022, arXiv:2112.01374. [Google Scholar]
  116. Manzhos, S.; Yamashita, K.; Carrington, T. Extracting Functional Dependence from Sparse Data Using Dimensionality Reduction: Application to Potential Energy Surface Construction. In Proceedings of the Coping with Complexity: Model Reduction and Data Analysis; Gorban, A.N., Roose, D., Eds.; Springer: Berlin/Heidelberg, Germany, 2011; pp. 133–149. [Google Scholar]
  117. Manzhos, S.; Carrington, T., Jr. Using redundant coordinates to represent potential energy surfaces with lower-dimensional functions. J. Chem. Phys. 2007, 127, 014103. [Google Scholar] [CrossRef]
  118. Manzhos, S.; Yamashita, K.; Carrington, T. Fitting sparse multidimensional data with low-dimensional terms. Comput. Phys. Commun. 2009, 180, 2002–2012. [Google Scholar] [CrossRef]
  119. Manzhos, S.; Sasaki, E.; Ihara, M. Easy representation of multivariate functions with low-dimensional terms via Gaussian process regression kernel design: Applications to machine learning of potential energy surfaces and kinetic energy densities from sparse data. Mach. Learn. Sci. Technol. 2022, 3, 01LT02. [Google Scholar] [CrossRef]
  120. Manzhos, S.; Carrington, T., Jr. A random-sampling high dimensional model representation neural network for building potential energy surfaces. J. Chem. Phys. 2006, 125, 084109. [Google Scholar] [CrossRef]
  121. Duvenaud, D.; Nickisch, H.; Rasmussen, C.E. Additive Gaussian Processes. In Advances in Neural Information Processing Systems; Neural Information Processing Systems: San Diego, CA, USA, 2011; pp. 226–234. [Google Scholar]
  122. Hinton, G.E.; Salakhutdinov, R.R. Reducing the Dimensionality of Data with Neural Networks. Science 2006, 313, 504–507. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  123. Manzhos, S.; Yamashita, K. A model for the dissociative adsorption of N2O on Cu(100) using a continuous potential energy surface. Surf. Sci. 2010, 604, 555–561. [Google Scholar] [CrossRef]
  124. Wolfsberg, M.; Van Hook, A.; Paneth, P.; Rebelo, L.P.N. Isotope Effects; Springer: Dordrecht, The Netherlands, 2009. [Google Scholar]
  125. Schneider, E.; Carlsen, B.; Tavrides, E.; van der Hoeven, C.; Phathanapirom, U. Measures of the environmental footprint of the front end of the nuclear fuel cycle. Energy Econ. 2013, 40, 898–910. [Google Scholar] [CrossRef] [Green Version]
  126. Parvin, P.; Sajad, B.; Silakhori, K.; Hooshvar, M.; Zamanipour, Z. Molecular laser isotope separation versus atomic vapor laser isotope separation. Prog. Nucl. Energy 2004, 44, 331–345. [Google Scholar] [CrossRef]
  127. Ronander, E.; Strydom, H.J.; Botha, L.R. High-pressure continuously tunable CO2 lasers and molecular laser isotope separation. Pramana 2014, 82, 49–58. [Google Scholar] [CrossRef]
  128. McDowell, R.S.; Sherman, R.J.; Asprey, L.B.; Kennedy, R.C. Vibrational spectrum and force field of molybdenum hexafluoride. J. Chem. Phys. 1975, 62, 3974–3978. [Google Scholar] [CrossRef]
  129. Koh, Y.W.; Westerman, K.; Manzhos, S. A computational study of adsorption and vibrations of UF6 on graphene derivatives: Conditions for 2D enrichment. Carbon 2015, 81, 800–806. [Google Scholar] [CrossRef]
  130. Manzhos, S.; Carrington, T.; Laverdure, L.; Mosey, N. Computing the Anharmonic Vibrational Spectrum of UF6 in 15 Dimensions with an Optimized Basis Set and Rectangular Collocation. J. Phys. Chem. A 2015, 119, 9557–9567. [Google Scholar] [CrossRef] [PubMed]
  131. Berezin, A.; Malyugin, S.; Nadezhdinskii, A.; Namestnikov, D.; Ponurovskii, Y.; Stavrovskii, D.; Shapovalov, Y.; Vyazov, I.; Zaslavskii, V.; Selivanov, Y.; et al. UF6 enrichment measurements using TDLS techniques. Spectrochim. Acta Part A Mol. Biomol. Spectrosc. 2007, 66, 796–802. [Google Scholar] [CrossRef] [PubMed]
  132. Sobol’, I. On the distribution of points in a cube and the approximate evaluation of integrals. USSR Comput. Math. Math. Phys. 1967, 7, 86–112. [Google Scholar] [CrossRef]
  133. Hodges, C.H. Quantum Corrections to the Thomas–Fermi Approximation—The Kirzhnits Method. Can. J. Phys. 1973, 51, 1428–1437. [Google Scholar] [CrossRef]
  134. Manzhos, S.; Dawes, R.; Carrington, T. Neural network-based approaches for building high dimensional and quantum dynamics-friendly potential energy surfaces. Int. J. Quantum Chem. 2014, 115, 1012–1020. [Google Scholar] [CrossRef] [Green Version]
  135. Manzhos, S.; Carrington, T. Neural Network Potential Energy Surfaces for Small Molecules and Reactions. Chem. Rev. 2020, 121, 10187–10217. [Google Scholar] [CrossRef]
  136. Bartlett, R.J.; Ranasinghe, D.S. The power of exact conditions in electronic structure theory. Chem. Phys. Lett. 2016, 669, 54–70. [Google Scholar] [CrossRef]
  137. Fermi, E. Eine statistische Methode zur Bestimmung einiger Eigenschaften des Atoms und ihre Anwendung auf die Theorie des periodischen Systems der Elemente. Eur. Phys. J. A 1928, 48, 73–79. [Google Scholar] [CrossRef]
  138. Weizsäcker, C.F.V. Zur Theorie der Kernmassen. Eur. Phys. J. A 1935, 96, 431–458. [Google Scholar] [CrossRef]
Figure 1. Left: Schematic of the SQL data structure of Catalysis-Hub.org, used to store reaction energies (reactions table, green) and DFT calculations (ASE database, blue). Since each reaction energy involves several DFT calculations (and the same DFT calculations can potentially be used for several reactions), a many-to-many mapping schema is used to preserve connections between the table rows. Right: Machine learning-enhanced catalyst candidate prediction: bulk and surface structures retrieved from structure databases like materialsproject.org, OQMD, Catalysis-Hub.org, etc., are used for automated slab generation and enumeration of possible adsorption sites. In an iterative process, limited numbers of DFT-calculated adsorption energies and machine-learning-predicted adsorption energies are used to inform microkinetic models to eventually suggest promising catalyst candidates that should be investigated by experiment. Adapted with permission from [26]. Copyright 2019 Wiley-VCH Verlag GmbH.
Figure 2. Left: Distribution of types of ML techniques applied in design and fabrication of solar cells. Right: Distribution of applications in design and fabrication of solar cells assisted by ML techniques. Adapted with permission from [15]. Copyright 2019 Wiley-VCH Verlag GmbH.
Figure 3. Schematic representation of a single-hidden layer neural network.
Figure 4. Test set root mean square error (rmse) of the interatomic potential of vinyl bromide as a function of both the number and the dimensionality of terms of Red-RS-HDMR-NN. Reproduced with permission from [117]. Copyright 2010 American Institute of Physics.
Figure 5. Test set mean absolute error (mae) of the interatomic potential of N2O over Cu (the system shown on the right) as a function of the dimensionality of a single component function of Red-RS-HDMR-NN. Reproduced with permission from [123]. Copyright 2010 Elsevier.
Figure 6. Principle of IR laser-driven isotopomer-selective desorption of UF6 proposed in [129].
Figure 7. Kinetic energy densities within the unit cells of (a) fcc Al, (b) hcp Mg, and (c) cubic diamond Si crystals.
Figure 8. One-dimensional cuts of the kinetic energy densities of bcc Li (top left), hcp Mg (top right), fcc Al (bottom left), and cubic diamond Si (bottom right) along selected directions in the crystal lattice. The target Kohn–Sham kinetic energy density is shown as a black line, results of a single-hidden layer NN fit of the KEDs of all materials simultaneously with a red line (“[80] NN”, where 80 is the number of neurons in the hidden layer), and the results of a four-hidden layer NN fit of the KEDs of all materials simultaneously with a turquoise line (“[20 20 20 20] NN”, where 20 is the number of neurons in each hidden layer). See [78] for details. Adapted with permission from Ref. [78]. Copyright 2019 The Owner Societies.
Figure 9. Distributions (histograms) of the kinetic energy densities and density dependent variables in a dataset combining data from Al, Mg, and Si at equilibrium geometry as well as under uniform compression and extension. Adapted with permission from [41]. Copyright 2020 American Institute of Physics.
Table 1. Test set root mean square error (rmse) when fitting the potential energy surface of UF6 with HDMR-GPR of different orders d for different numbers of training points Ntrain. For comparison, the results with a full 15-dimensional GPR are also shown. The numbers of component functions Ncf at each d are also shown.
Rmse 1         | Ncf | Ntrain = 5000 | Ntrain = 3000 | Ntrain = 2000
Full-D (d = D) | 1   | 42.2          | 75.4          | 106.7
d = 1          | 15  | 234.6         | 236.4         | 237.3
d = 2          | 105 | 168.1         | 178.6         | 190.3
d = 3          | 455 | 65.6          | 78.0          | 97.4
1 On the test set of 50,000 points.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
