Article

Partially Linear Component Support Vector Machine for Primary Energy Consumption Forecasting of the Electric Power Sector in the United States

1 School of Mathematics and Physics, Southwest University of Science and Technology, Mianyang 621010, China
2 School of Mathematical Sciences, University of Electronic Science and Technology of China, Chengdu 611731, China
3 School of Management Science and Real Estate, Chongqing University, Chongqing 400045, China
* Author to whom correspondence should be addressed.
Sustainability 2023, 15(9), 7086; https://doi.org/10.3390/su15097086
Submission received: 21 February 2023 / Revised: 26 March 2023 / Accepted: 13 April 2023 / Published: 23 April 2023
(This article belongs to the Special Issue Development Trends of Environmental and Energy Economics)

Abstract

Energy forecasting based on univariate time series has long been a challenge in energy engineering and has become one of the most popular tasks in data analytics. In order to take advantage of the characteristics of observed data, a partially linear model is proposed based on principal component analysis and support vector machine methods. The principal linear components of the input with lower dimensions are used as the linear part, while the nonlinear part is expressed by the kernel function. The primal-dual method is used to construct the convex optimization problem for the proposed model, and the sequential minimization optimization algorithm is used to train the model with global convergence. The univariate forecasting scheme is designed to forecast the primary energy consumption of the electric power sector of the United States using real-world data sets ranging from January 1973 to January 2020, and the model is compared with eight commonly used machine learning models as well as the linear auto-regressive model. Comprehensive comparisons with multiple evaluation criteria (including 19 metrics) show that the proposed model outperforms all other models in all scenarios of mid-/long-term forecasting, indicating its high potential in primary energy consumption forecasting.

1. Introduction

Energy forecasting has long been a hot spot in this era of energy revolution. In recent years, energy forecasting has already brought profits to enterprises by helping them make more reasonable financial plans [1]. On the other hand, energy consumption is not only an indicator of economics or finance but also an important factor in environmental issues, especially carbon-related ones [2]. With the increasingly diverse impacts of real-world problems, energy forecasting appeals to many researchers and engineers who wish to make their own contributions. Topics in energy forecasting have also broadened to wider areas, such as energy consumption [3], energy production [4], energy price [5], the relationship between energy, economics, and the environment [6], etc.
Primary energy has a long history of application in industrial production. Accurate forecasts of primary energy consumption remain of great importance for decision-making in energy marketing and management, as well as in policies on pollution emissions. However, our investigation of the existing literature on energy forecasting (presented in Section 2) indicates that there are still issues in existing methods and implies a research gap in the application of partially linear models to primary energy forecasting. In actuality, the time series of primary energy consumption often exhibits very clear patterns of variation, especially for stable economic entities over mid-to-short periods; therefore, it is suitable to use deterministic models to fit such properties. On the other hand, with the development of data-capturing technologies, it is much easier to obtain larger data sets for building forecasting models. Thus, it is natural to consider using machine learning models to further improve forecasting accuracy. Above all, it is more reasonable to combine the merits of both kinds of models for better practice and higher accuracy in real-world applications.
The partially linear model is a typical example of combining deterministic and indeterministic formulations. The earliest work using the partially linear model should be credited to Engle et al., in which a very simple combination of linear regression and a nonlinear function was used [7]. The semi-parametric support vector machine (SVM) presented by Smola and Schölkopf was the first work that used machine learning models to build such a partially linear structure [8] in a uniform way, with the linear kernel representing the linear part. Espinoza et al. presented another kernel-based partially linear model within the framework of least squares support vector machines (LSSVMs) [9]. In contrast, that work used a nonlinear kernel to represent the nonlinear part, and it also presented, for the first time, an analytical way of training the model. Such analytical solutions make it much easier to implement further models, and several models have since been developed for function estimation and system identification [10,11,12]. In the last several years, Ma et al. used a simplified formulation to build kernel-based grey system models by regularizing all linear parameters and parameters in the feature space [13,14,15], which also shares the philosophy of Hammerstein system models. The work by Matías et al. [16] also uses the method of regularizing all parameters, making a partially linear SVM easier to train. Despite their different implementations, all of these works have proven that kernel-based partially linear models are much more efficient in cases where prior knowledge is available, such as a known linear relationship between the input and output.
It can be learned from the previous works that an efficient partially linear model can be developed if the features of the data are properly treated. For instance, Xu et al. [17] pointed out that it is also reasonable to separate the linear and nonlinear functions of the input, and the partially linear LSSVM based on this idea can then outperform the other models. Enlightened by this pattern, a new partially linear SVM using principal linear components extracted by principal component analysis (PCA) is developed, and its related theoretical and computational problems are discussed in detail. A real-world application forecasting the monthly primary energy consumption of the electric power sector in the US is presented, and the proposed model is compared with several other machine learning models that have been very popular in recent research.
The rest of this work is organized as follows: the literature study is presented in Section 2; the specific formulation of the partially linear model, its theoretical basis, and the computational details of the PCA are introduced in Section 3; a complete presentation of the proposed partially linear component support vector machine (PLC-SVM) is given in Section 4, including its formulation in the primal and dual spaces and its computational details for univariate time series forecasting; the case study forecasting the monthly primary energy consumption of the electric power sector in the US, based on a data set of 565 months of real-world data, is presented in Section 5, along with a comprehensive comparison between the models and a detailed discussion; the conclusions are drawn in Section 6.

2. Literature Study

In this section, some recent literature on energy forecasting will be reviewed, and the details of the most commonly used structured and non-structured models for energy forecasting will be briefly summarized. A short discussion on the findings and research gaps will also be presented in the last subsection. For convenience, an overview of the main models for energy forecasting reviewed in this section is presented in Figure 1.

2.1. The Structured Models for Energy Forecasting

In this subsection, the structured models are roughly categorized into empirical models, linear models, and grey system models.
The empirical models are often presented as specific functions (see [18]), which are typically built from engineering experience and validated directly in practice. These models are often easy to use but are not suitable for very complex data sets. Recent works have paid significantly less attention to such models.
The linear regression (LR) and autoregressive integrated moving average (ARIMA) models both share linear structures. While the LR model only simulates a simple linear correlation between the input and output variables [19,20], the ARIMA model mainly considers the auto-correlation of the time series. The linear models are quite popular in energy forecasting and have been used to forecast oil consumption [19], electricity consumption [21,22], demand [20,23], wind generation [24], total energy demand and supply [25], etc. However, the ARIMA model often suffers from "overdifferencing" [26], and both of these linear models are limited in describing nonlinear data sets.
Grey system models are increasingly popular in energy forecasting. There are several techniques used in the recent literature, including designing new structures to fit the data (e.g., nonlinear whitening equations [27], time-delayed terms [28], and periodic terms [29]), using complex accumulation operators (e.g., Hausdorff fractional order accumulation [30] and buffer operators [31]), and combining grey system models with other methods (e.g., Kalman filter [32] and Markov model [33]). Researchers often use intelligence optimizers when new methods contain nonlinear parameters [27,29,30,31]. One advantage of grey system models is their ability to make reliable predictions with limited data. However, for more complex forecasting applications, the proper structure or preprocessing methods still require the experience of researchers.

2.2. The Non-Structured Models for Energy Forecasting

Non-structured models do not have deterministic structures; a complete formulation can only be determined from the data sets. Machine learning models are among the most popular non-structured models, and the recent literature has shown considerable interest in their application. The most popular machine learning models for energy forecasting are neural networks, support vector machines, and regression trees.
Neural networks, particularly multilayer perceptrons, remain popular for energy forecasting, with applications in areas such as electricity [34,35,36] and building energy consumption [37], ocean wave energy and photovoltaic plant generation forecasting [38], etc. Deep learning has led to the development of more complex models, such as LSTM-based networks with fully connected layers [39,40] or convolutional layers [41,42,43]. Other types of networks, such as bagged echo state networks [44], echo state networks [45], and radial belief networks [46], are also used. While these complex networks improve flexibility, they increase computational costs and require expert knowledge for their design. Thus, developing general models for energy forecasting remains challenging.
Kernel-based machine learning models, especially SVMs, remain popular for energy forecasting. Recent studies have focused on combining SVMs with evolutionary algorithms such as particle swarm optimization (PSO) [47], differential evolution (DE) [48], improved chicken swarm optimization (ICSO) [49], covariance matrix adaptation evolutionary strategy (CMAES) [50], improved fruit fly optimization (IFFO) [51], and Harris Hawks optimization [52], to optimize the hyperparameters automatically. These models are less time-consuming and have higher generality. However, partially linear kernel-based models have not been used in recent energy forecasting studies.
Many new models based on basic regression trees have been developed in the past decade and are also widely adopted in energy forecasting, such as for carbon trading volume and price [53], building energy consumption [54], solar radiation [55], hydro-energy [56], etc. One significant merit of regression tree-based models is that those with shallow structures are generally explainable. However, efficient regression trees usually become deeper with larger or more complex data sets, and a large number of hyperparameters may also make the overall forecasting process too complex.
Hybrid models are gaining interest in energy forecasting in both the literature and competitions [57]. The main schemes found in the literature can be categorized into three classes. The first class combines machine learning models with preprocessing methods, such as variational mode decomposition (VMD), autoencoders [58], singular spectrum analysis (SSA) [59], wavelet transform [60], etc. The second class combines different machine learning models using ensemble learning [61,62,63] or multiple combining schemes [64,65], among others. The third class is the integration of the above two schemes; in these works, decomposition methods are often adopted, such as empirical mode decomposition (EMD) [66] and complete ensemble empirical mode decomposition (CEEMD) [67]. Despite being simple and effective, these hybrid models are more complex than other machine learning models and can lead to longer training times, less explainability, and the need for better hardware.

2.3. A Brief Summary of Literature Study

According to the literature study presented above, the research gaps can be briefly summarized in two parts: (1) In terms of methodology, machine learning models are becoming more popular in recent works on energy forecasting. However, the higher performance of more complex models comes with other issues, such as higher computational complexity and the lack of a complete framework for selecting appropriate models in real-world applications. (2) In terms of applications, more complex models often need larger data sets, and many works only present good performance in mid-/short-term predictions. The partially linear SVM (PLSVM) illustrates a new way of combining the linearity and nonlinearity of data sets but, based on our investigation, has not been used in energy forecasting applications.
To fill the above research gaps, this work presents a new machine learning model for energy forecasting in real-world applications, and the main contributions can be summarized as follows:
  • A partially linear component support vector machine is developed, which uses the principal linear features of the input data set obtained by a PCA. This reduces the risk of multicollinearity while keeping the model as simple as possible.
  • A theoretical analysis is also presented, showing that the computational complexity of the main training process of the proposed model is in the same order as the existing SVM model.
  • A complete partially linear auto-regression scheme for out-of-sample time series forecasting is presented in a real-world application with different scenarios on forecasting the primary energy consumption of the electric power sector of the United States, showing that the proposed model outperforms the cutting-edge models, especially in mid-/long-term forecasting.

3. Preliminaries

In this section, the main idea of the partially linear model and key steps of the principal component analysis (PCA) will be briefly summarized.

3.1. Main Idea of the Partially Linear Model

One typical definition of the partially linear model is [68]
$$y = \beta^T x_{\mathrm{lin}} + g(x_{\mathrm{nonl}}), \tag{1}$$
where $x_{\mathrm{lin}}$ consists of the linear dimensions of the input $x$, $x_{\mathrm{nonl}}$ consists of the nonlinear dimensions, and $g(\cdot)$ is an unknown nonlinear function. However, it has been argued that this formulation only separates the linear dimensions of the input vectors, and a more reasonable approach is to separate the linear functions of the input vector [17]. Enlightened by this idea, a simpler formulation is considered
$$y = \beta^T x + g(x), \tag{2}$$
where $\beta^T x$ is the linear function of $x$ and $g(x)$ is an unknown nonlinear function of $x$.
Remark 1. 
It is well known that any differentiable real function can be written as
$$f(x) = f(x_0) + Df(x_0)(x - x_0) + R(x - x_0) \tag{3}$$
according to Taylor's theorem [69], where $D$ is a differential operator (for multivariable functions, the differential operator can be written as $Df = \left( \frac{\partial f}{\partial x_1}, \frac{\partial f}{\partial x_2}, \ldots, \frac{\partial f}{\partial x_d} \right)^T$, and the products between the vectors are inner products). This formulation can be rearranged compactly as
$$f(x) = Df(x_0)\,x + \left[ R(x - x_0) + f(x_0) - Df(x_0)\,x_0 \right]. \tag{4}$$
It is clear that the first term is a linear function of $x$ and the second term is a nonlinear function (with a constant bias), so this formulation is mathematically equivalent to (2).
Based on this idea, the linear function of the input is treated in a more direct way, which makes the model more stable than treating the linear part in a fully nonlinear way. For example, if the real nonlinearity follows a polynomial function such as
$$F(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3, \tag{5}$$
it would be unstable to approximate it with a fully nonlinear function, as the linear term $a_1 x$ would be over-estimated. Above all, the formulation in (2) is used to build the partially linear model in this paper.
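For instance, applying the decomposition (2) to the cubic (5) gives
$$\beta = a_1, \qquad g(x) = a_0 + a_2 x^2 + a_3 x^3,$$
so the linear coefficient is estimated directly instead of being absorbed into the nonlinear approximation.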

3.2. Principal Component Analysis

As described above, the partially linear model (2) contains a linear function of the input. However, in real-world applications, the elements of such a linear input may have high multicollinearity, which can lead to ill-posed problems and higher computational complexity. In this work, principal component analysis (PCA) is used to reduce the dimension of the input.
PCA is one of the most popular classical linear methods; it can efficiently extract the linear features of the input vector and make linear function estimation more stable. For the original input $x = (x_1, x_2, \ldots, x_d)^T$, where $x_i\ (i = 1, 2, \ldots, d)$ represent the elements (features) of the input, the main goal of the PCA is to find a linear transformation $A$ that maps the original input $x$ to a new vector $z$ whose features are linearly independent of each other. For convenience, a set of inputs is denoted by
$$X = \left[ x_1, x_2, \ldots, x_N \right], \tag{6}$$
and the objective of the PCA is to find a linear matrix that satisfies
$$A_{d \times d} \left( X_{d \times N} - U_{d \times N} \right) = Z_{d \times N}, \tag{7}$$
where $U$ is the matrix of mean values of $X$, with elements $u_{ij} = \frac{1}{N} \sum_{k=1}^{N} x_{kj}$ $(i = 1, \ldots, N,\ j = 1, \ldots, d)$. The transformation matrix $A$ can be written as
$$A = \left[ \xi_1, \xi_2, \ldots, \xi_d \right], \tag{8}$$
where the $\xi_i$ are the eigenvectors of the auto-covariance matrix $(X - U)(X - U)^T$, i.e.,
$$(X - U)(X - U)^T \xi_i = \lambda_i \xi_i. \tag{9}$$
The eigenvectors are ordered according to the descending order of the corresponding eigenvalues $\lambda_i$ of the auto-covariance matrix of $X$.
The contribution ratio of the $k$th linear component in the new features $Z$ is calculated by
$$r_k = \frac{\lambda_k}{\sum_{i=1}^{d} \lambda_i}. \tag{10}$$
The total contribution of the first $k$ components is the sum of the first $k$ ratios defined in (10). As the auto-covariance matrix $(X - U)(X - U)^T$ is a positive semi-definite symmetric matrix, all eigenvalues are non-negative; thus, the contribution ratios $r_k$ are all non-negative. Furthermore, the total contribution of the first $k$ components lies in the range $[0, 1]$. Usually, if the total contribution of some components is larger than a threshold $r_p$, they contain almost all of the information of the original samples, and these components are called the principal components.
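To make these steps concrete, a minimal NumPy sketch of the procedure above is given below (an illustration only; the centering convention and the handling of the threshold are implementation choices):

```python
import numpy as np

def principal_components(X, r_p=0.95):
    """Extract the principal linear components of X (d x N, one sample per column).

    Follows the steps above: center the samples, eigendecompose the
    auto-covariance matrix (9), and keep the leading components whose total
    contribution ratio (10) reaches the threshold r_p.
    """
    U = X.mean(axis=1, keepdims=True)      # column of feature means, broadcast over samples
    C = (X - U) @ (X - U).T                # auto-covariance matrix, d x d
    lam, xi = np.linalg.eigh(C)            # eigh returns ascending eigenvalues of a symmetric matrix
    lam, xi = lam[::-1], xi[:, ::-1]       # reorder to descending eigenvalues
    ratios = lam / lam.sum()               # contribution ratios r_k
    p = int(np.searchsorted(np.cumsum(ratios), r_p)) + 1  # smallest p with total contribution >= r_p
    A = xi[:, :p].T                        # transformation matrix, p x d
    Z = A @ (X - U)                        # principal linear components, p x N
    return A, U, Z
```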

4. The Proposed Partially Linear Component Support Vector Machines

The modeling procedures and some key notes on the theoretical basis of the proposed partially linear component support vector machines for regression will now be presented.

4.1. Partially Linear Component Model in the Feature Space

A support vector machine model for regression essentially estimates a nonlinear function in a feature space, which is defined by
$$y = w^T \varphi(x) + b, \tag{11}$$
where
$$\varphi: \mathbb{R}^d \to F \tag{12}$$
is a feature mapping from the space $\mathbb{R}^d$ to a feature space, and $w^T \varphi(x) + b$ is a linear approximation, in the feature space, of a nonlinear function; i.e., $g(x)$ in (2) can be approximated in this way. Based on this idea, it is very natural to rewrite the partially linear function (2) in the following formulation
$$y = \beta^T z + w^T \varphi(x) + b, \tag{13}$$
where $z$ is a vector containing only the principal linear components corresponding to $x$. According to the basic principles of functional analysis, it is very easy to build a new feature space using
$$\tilde{F} = \left\{ \begin{bmatrix} z \\ \varphi(x) \end{bmatrix} \;\middle|\; z \in \mathbb{R}^p,\ \varphi(x) \in F;\ x \in \mathbb{R}^d \right\}, \tag{14}$$
where $p$ is the number of principal linear components and $d$ is the dimension of $x$. Thus, it is very easy to define a new feature mapping $\phi: \mathbb{R}^d \to \tilde{F}$ using
$$\phi(x) = \begin{bmatrix} z \\ \varphi(x) \end{bmatrix}. \tag{15}$$
The linear weights can then be concatenated using
$$\omega = \begin{bmatrix} \beta \\ w \end{bmatrix}. \tag{16}$$
The partially linear model can then be compactly written in the new feature space $\tilde{F}$ as
$$y = \omega^T \phi(x) + b. \tag{17}$$

4.2. Partially Linear Component Support Vector Machines in Primal and Dual Formulations

Within Formula (17), the primal problem of the partially linear component support vector machine for regression (PLC-SVM) can be defined as
$$\min_{\omega, b, \xi, \xi^*} \ \frac{1}{2} \|\omega\|^2 + C \sum_{i=1}^{N} \left( \xi_i + \xi_i^* \right) \tag{18}$$
$$\text{s.t.} \quad \begin{cases} y_i - \omega^T \phi(x_i) - b \le \varepsilon + \xi_i, \\ \omega^T \phi(x_i) + b - y_i \le \varepsilon + \xi_i^*, \\ \xi_i,\ \xi_i^* \ge 0. \end{cases}$$
This formulation shares the same primal problem as support vector regression, which is often known as the $\varepsilon$-insensitive formulation. However, this formulation is not directly usable for computation; thus, its corresponding dual problem should be used, which is defined by Smola et al. [70]
$$\max_{\alpha, \alpha^*} \ J(\alpha, \alpha^*) = -\frac{1}{2} \sum_{i,j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*)\, \phi^T(x_i)\phi(x_j) - \varepsilon \sum_{i=1}^{N} \left( \alpha_i + \alpha_i^* \right) + \sum_{i=1}^{N} y_i \left( \alpha_i - \alpha_i^* \right) \tag{19}$$
$$\text{s.t.} \quad \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) = 0, \quad \alpha_i, \alpha_i^* \in [0, C].$$
It is very important to notice that the linear weight $\omega$ in the feature space can be expressed as a linear combination of the mapping $\phi$, with the Lagrangian multipliers as the weights, i.e.,
$$\omega = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, \phi(x_i) = \begin{bmatrix} \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, z_i \\ \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, \varphi(x_i) \end{bmatrix}. \tag{20}$$
Then the partially linear function can be written as
$$\omega^T \phi(x_j) = \left[ \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, z_i^T,\ \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, \varphi^T(x_i) \right] \begin{bmatrix} z_j \\ \varphi(x_j) \end{bmatrix} = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, z_i^T z_j + \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, \varphi^T(x_i)\varphi(x_j). \tag{21}$$
Recalling the definition of $\omega$ in (16), it is easy to notice that
$$\beta = \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, z_i. \tag{22}$$
Thus the partially linear function can be rewritten as
$$\omega^T \phi(x_j) = \beta^T z_j + \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, \varphi^T(x_i)\varphi(x_j). \tag{23}$$
According to the kernel trick, the inner product of a feature mapping can be expressed by a kernel function that satisfies Mercer's condition, i.e.,
$$\varphi^T(x_i)\varphi(x_j) = k(x_i, x_j). \tag{24}$$
Noticing that the nonlinear mapping $\phi$ contains a linear and a nonlinear part according to its definition (15), the inner products should be written as
$$\phi^T(x_i)\phi(x_j) = \left[ z_i^T,\ \varphi^T(x_i) \right] \begin{bmatrix} z_j \\ \varphi(x_j) \end{bmatrix} = z_i^T z_j + \varphi^T(x_i)\varphi(x_j) = z_i^T z_j + k(x_i, x_j). \tag{25}$$
Finally, the partially linear model can now be written as
$$y = \omega^T \phi(x) + b = \left( \beta^T, w^T \right) \begin{bmatrix} z \\ \varphi(x) \end{bmatrix} + b = \beta^T z + w^T \varphi(x) + b = \beta^T z + \sum_{i=1}^{N} (\alpha_i - \alpha_i^*)\, k(x_i, x) + b. \tag{26}$$
The Gaussian kernel (also known as the radial basis function kernel) is often used:
$$k(x_i, x_j) = \exp\left( -\gamma \left\| x_i - x_j \right\|^2 \right), \tag{27}$$
where $\gamma$ is the reciprocal of the squared kernel width $\sigma$. The dual problem used for computation can now be expressed with the inner product (25) as
$$\max_{\alpha, \alpha^*} \ J(\alpha, \alpha^*) = -\frac{1}{2} \sum_{i,j=1}^{N} (\alpha_i - \alpha_i^*)(\alpha_j - \alpha_j^*) \left( Z^T Z + K \right)_{ij} - \varepsilon \sum_{i=1}^{N} \left( \alpha_i + \alpha_i^* \right) + \sum_{i=1}^{N} y_i \left( \alpha_i - \alpha_i^* \right) \tag{28}$$
$$\text{s.t.} \quad \sum_{i=1}^{N} \left( \alpha_i - \alpha_i^* \right) = 0, \quad \alpha_i, \alpha_i^* \in [0, C],$$
where $K = \left( k(x_i, x_j) \right)_{N \times N}$.
Remark 2. 
The Gram matrix $Z^T Z$ is a positive semi-definite symmetric matrix, and $K$ is also known to be a positive semi-definite symmetric matrix; thus, their sum $Z^T Z + K$ is also positive semi-definite and symmetric. Therefore, the dual problem (28) satisfies the conditions of a typical quadratic program (QP), and it can be solved using sequential minimal optimization (SMO) with global convergence, as proven by Takahashi et al. [71].
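To make the combined inner product (25) concrete, a minimal sketch of building the Gram matrix $Z^T Z + K$ with the Gaussian kernel (27) is given below (an illustration only, reusing principal_components() from Section 3.2):

```python
import numpy as np

def combined_gram(X, Z, gamma):
    """Combined Gram matrix Z^T Z + K used in the dual problem (28).

    X holds the raw inputs (d x N), Z the principal linear components (p x N),
    and the Gaussian kernel (27) supplies the nonlinear part.
    """
    sq = (X ** 2).sum(axis=0)                        # squared norms of the input columns
    d2 = sq[:, None] + sq[None, :] - 2.0 * X.T @ X   # pairwise squared distances
    K = np.exp(-gamma * np.maximum(d2, 0.0))         # Gaussian kernel matrix k(x_i, x_j)
    return Z.T @ Z + K                               # linear part plus nonlinear part
```

As noted in Remark 2, both terms are positive semi-definite, so the returned matrix is a valid kernel matrix.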
Within the above procedures and analysis, the overall computational steps of the proposed PLC-SVM are now clear, and a summary is presented as pseudo-code in Algorithm 1. The main computational steps can be roughly divided into four parts: the first part prepares the data set and initializes the key settings; the second part uses the PCA to extract the principal linear components and builds the kernel matrix used in (28); the third part solves the dual problem using the SMO algorithm, with the same implementation as LibSVM [72]; the last part makes predictions using the trained PLC-SVM model.
Algorithm 1: Algorithm of PLC-SVM (training and predicting).
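As an illustration of the four parts above (a sketch under stated assumptions, not the exact implementation used in this work), the training and prediction steps can be written in Python by passing the combined Gram matrix to an off-the-shelf SMO solver, here scikit-learn's SVR with a precomputed kernel; principal_components() and combined_gram() from the earlier sketches are reused, and the default value of eps is an assumption:

```python
import numpy as np
from sklearn.svm import SVR

def train_plc_svm(X_train, y_train, gamma, C, eps=0.01, r_p=0.95):
    """Train PLC-SVM: PCA for the linear part, then SMO on the combined kernel."""
    A, U, Z = principal_components(X_train, r_p)        # principal linear components
    G = combined_gram(X_train, Z, gamma)                # Gram matrix Z^T Z + K
    svr = SVR(kernel="precomputed", C=C, epsilon=eps)   # solves the dual (28) by SMO
    svr.fit(G, y_train)
    return svr, A, U, Z

def predict_plc_svm(model, X_new, X_train, gamma):
    """Predict with a trained PLC-SVM on new inputs X_new (d x M)."""
    svr, A, U, Z = model
    Z_new = A @ (X_new - U)                             # linear components of the new inputs
    sq_tr = (X_train ** 2).sum(axis=0)
    sq_new = (X_new ** 2).sum(axis=0)
    d2 = sq_new[:, None] + sq_tr[None, :] - 2.0 * X_new.T @ X_train
    K_new = np.exp(-gamma * np.maximum(d2, 0.0))        # cross-kernel k(x_new, x_train)
    return svr.predict(Z_new.T @ Z + K_new)             # combined cross-Gram, then (26)
```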
Remark 3. 
The complexity of the proposed PLC-SVM model is mainly determined by the PCA and the cost of solving the dual problem (28). The complexity of the PCA is known to be $O(d^2 \cdot N + d^3)$ in the worst case. The complexity of the SMO in LibSVM is between $O(N^2)$ and $O(N^3)$.
In general, the sample size is much larger than the dimension of the input vector, i.e., $N \gg d$; therefore, the total complexity of the PLC-SVM model is generally only slightly larger than that of the SVM model with the same hyperparameters.

4.3. Forecasting Scheme for Univariate Time Series

The proposed model presented above essentially estimates a static model describing the relationship between the input and the output. For time series forecasting, however, the model should estimate the correlation between the current point of the series and the former points. One typical formulation is the auto-regressive model, which is represented by
$$y_t = f(y_{t-1}, y_{t-2}, \ldots, y_{t-\tau}). \tag{29}$$
In other words, the former series with $\tau$ points constructs a vector $x_t = [y_{t-1}, y_{t-2}, \ldots, y_{t-\tau}]^T$, which plays the role of the input of the regression models. When the function $f(\cdot)$ is nonlinear, Equation (29) is known as the nonlinear auto-regressive (NAR) model. In this regard, it is easy to use the PLC-SVM model to build such an auto-regressive model; the main difference is that PLC-SVM considers the principal linear components of the input. Thus, the final model used in this work can be written as
$$y_t = \beta^T z_t + g(y_{t-1}, y_{t-2}, \ldots, y_{t-\tau}), \tag{30}$$
where $z_t$ is the vector whose elements are the linear components transformed by the PCA.
A complete partially linear auto-regression forecasting scheme is presented in Algorithm 2.
When executing the forecasting procedures, the newly predicted value of $y_t$ is added to the input at the next time step; thus, the future points can be estimated recursively. It should be noticed that such a procedure differs from $n$-step-ahead forecasting in that all the future values are forecasted based only on the in-sample data.
Algorithm 2: Algorithm of partially linear auto-regression based on PLC-SVM.
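As an illustration of Algorithm 2 under the same assumptions as the sketch of Algorithm 1, the recurrent scheme can be sketched as follows; each prediction is appended to the history and reused as an input for the next step:

```python
import numpy as np

def recursive_forecast(series, tau, steps, gamma, C):
    """Partially linear auto-regression with PLC-SVM (a sketch of Algorithm 2)."""
    scale = np.max(series)                    # max-scaling, as in Section 5.1
    y = np.asarray(series, dtype=float) / scale
    # Reconstruct lagged inputs x_t = [y_{t-1}, ..., y_{t-tau}]^T from the in-sample series.
    X = np.column_stack([y[t - tau:t][::-1] for t in range(tau, len(y))])
    model = train_plc_svm(X, y[tau:], gamma, C)

    history, preds = list(y), []
    for _ in range(steps):
        x_new = np.array(history[-tau:][::-1]).reshape(-1, 1)  # most recent lag vector
        y_hat = float(predict_plc_svm(model, x_new, X, gamma)[0])
        preds.append(y_hat)
        history.append(y_hat)                 # feed the prediction back as an input
    return np.array(preds) * scale            # undo the scaling
```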

5. Case Study

In this section, a real-world case study of forecasting the monthly primary energy consumption of the electric power sector in the US will be presented with three cases. The background information, evaluation metrics, and models for comparison will be introduced first, and then the results along with a discussion of the results will be presented. The general framework illustrating the overall procedures of this case study is presented in Figure 2.

5.1. Data Collection and Preprocessing

As discussed in Section 1, the primary energy consumption is of great importance for industrial economics. In this section, the real-world case of the primary energy consumption of the electric power sector in the US was considered.
The raw data of the monthly primary energy consumption from January 1973 to January 2020 were collected from the US Energy Information Administration (EIA) website (https://www.eia.gov/totalenergy/data/monthly/ Monthly Energy Review of the US, accessed on 1 March 2020). As shown in Figure 3, the data set contains 565 points of monthly primary energy consumption of the electric power sector in the US (unit: trillion Btu). The time series data were first reconstructed using the steps presented in lines 2 to 5 of Algorithm 2. The first 90% of the points were then used as in-sample data and the remaining 10% as out-of-sample data; furthermore, the first 90% of the in-sample data were used for training the models, and the remaining 10% of the in-sample data were used for validating their performance. In order to make it easier to train the machine learning models, the raw data were divided by the largest value in the in-sample data before training, and the final predicted values were multiplied by the same largest value.
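A minimal sketch of this splitting and scaling procedure, assuming the lagged samples have already been reconstructed as in Algorithm 2:

```python
import numpy as np

def prepare_splits(X, targets):
    """Split lagged samples (d x N) and targets as described above: 90% in-sample
    (itself split 90/10 into training/validation) and 10% out-of-sample,
    with all values scaled by the largest in-sample value."""
    n = X.shape[1]
    n_in = int(0.9 * n)                    # in-sample vs. out-of-sample boundary
    n_tr = int(0.9 * n_in)                 # training vs. validation boundary
    scale = targets[:n_in].max()           # largest in-sample value
    X, targets = X / scale, targets / scale
    return (X[:, :n_tr], targets[:n_tr],           # training set
            X[:, n_tr:n_in], targets[n_tr:n_in],   # validation set
            X[:, n_in:], targets[n_in:], scale)    # out-of-sample test set
```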

5.2. Models for Comparison and Evaluation Metrics

Nine models were selected for comparison with the proposed PLC-SVM; their information is summarized in Table 1 together with descriptions of the corresponding hyperparameters. As described above, the PLC-SVM model is essentially based on the methodology of the SVM model; thus, the most closely related models are chosen for comparison. For convenience, the Gaussian kernel (27) is selected for SVM and LSSVM, and the rational quadratic kernel is selected for GPR, as suggested in [73]. On the other hand, the PLC-SVM model has a partially linear structure, so the linear auto-regressive model is also used as the baseline model for comparison.
  • AR: The linear auto-regressive (AR) model used in this work is formulated as $y_t = a_0 + a_1 y_{t-1} + a_2 y_{t-2} + \cdots + a_\tau y_{t-\tau}$, which can be regarded as a simplified version of PLC-SVM (without the kernel-based term and $C$); its parameters are estimated using the ordinary least squares method. Having no hyperparameters, the AR model does not need to be optimized by grid search cross-validation like the other machine learning models.
  • SVM: The ε -insensitive support vector machine (SVM) model for regression is selected in this work, of which the modelling details are described in [70,72]. It shares the most similar regularization formulation to PLC-SVM but has no partially linear part.
  • LSSVM: The least squares support vector machine (LSSVM) model presented by Suykens in 1999 [74] is another version of SVM that uses equality constraints. The regression version of LSSVM is based on the LSSVM model for function estimation described in [75].
  • GPR: The Gaussian process regression (GPR) model also uses the kernel combinations developed from the SVM model as described in [73]; the main difference is that the GPR approach is mainly based on the Bayesian theory.
Decision tree-based models are another kind of cutting-edge method, and they are widely used in energy forecasting fields, such as in carbon energy [53], building energy [54], solar energy [55], and hydro-energy applications [56], among others. These models all use regression trees together with ensemble learning methods, such as boosting and bagging, and often achieve high accuracy and stability in time series forecasting at a very low time cost. Thus, it is very interesting to see whether the proposed PLC-SVM can outperform these emerging models in this case. Information on these models is listed below:
  • RF: The random forest (RF) model is one of the most classical tree-based models, which mainly ensembles the weak regressors using bagging. The general method was first proposed by Ho in 1995 [76], and a complete work was first presented by Breiman in 2001 [77].
  • XGB: The extreme gradient boosting (XGB) model was proposed by Chen in 2015, and the complete work was published in 2016 [78]. It is famous for its high performance in dealing with complex features and its extremely fast speed [79].
  • LGBM: The light gradient boosting model (LGBM) was proposed by Ke in 2017 [80], whose team won a one-million prize from Alibaba Ltd. using this model. The LGBM model uses multiple technologies to improve on the original gradient boosting models, and it can be even more stable and faster than XGB in some tasks.
  • CATB: Gradient boosting with categorical features support (CATB) was proposed by Prokhorenkova et al. in 2018 [81]. It has a very good performance in dealing with categorical features and has very good robustness.
Recurrent neural networks have been widely used in time series forecasting and related works in recent years. In this work, a state-of-the-art gated recurrent unit is used for comparison. Detailed information on this model is as follows:
  • GRU: The gated recurrent unit (GRU) model was introduced by Cho et al. [82] in 2014 as a simplified version of the long short-term memory (LSTM) model by Hochreiter and Schmidhuber [83] in 1997. In time series forecasting, the GRU model is often combined with other layers to capture more complex data patterns or shapes. In this study, a three-layer neural network was used, consisting of a GRU layer directly connected to the input data, an activation layer using a sigmoid function, and an output layer with a linear full connection.
Table 1. Models for comparison and their hyperparameters.

| Model | Abbreviation | References | Hyperparameters |
| --- | --- | --- | --- |
| Auto-Regressive | AR | [21] | None |
| Support Vector Machine | SVM | [70,72] | Kernel parameter, regularization parameter |
| Least Squares Support Vector Machine | LSSVM | [74] | Kernel parameter, regularization parameter |
| Gaussian Process Regression | GPR | [73] | Kernel type |
| Random Forest | RF | [76] | Bootstrap (whether bootstrap samples are used when building trees), maximum tree depth, number of features for the best split, minimum samples at a leaf node, minimum samples for splitting an internal node, number of trees |
| Extreme Gradient Boosting | XGB | [78] | Minimum loss reduction, learning rate, maximum tree depth, minimum weight for new node, L1 regularization parameter |
| Light Gradient Boosting | LGBM | [80] | Maximum tree depth, maximum tree leaves, minimum number of data needed in a child, L1 regularization parameter, L2 regularization parameter |
| Gradient Boosting with Categorical Features Support | CATB | [81] | Maximum number of trees, tree depth, L2 regularization parameter |
| Gated Recurrent Unit | GRU | [82] | Hidden size |
To ensure a fair comparison, all machine learning models were used as nonlinear auto-regressive models, similar to PLC-SVM in Algorithm 2 (any of these models can replace PLC-SVM in line 6 to implement the overall workflow). The models were implemented using Python 3.7, and their forecasting performances were evaluated using the multiple criteria listed in Table 2. The built-in grid search method of the scikit-learn library [84] was used for tuning the hyperparameters of all models except the AR model. Detailed information on the hyperparameters and the original references is summarized in Table 1. In order to keep the grid search tractable, only the most important hyperparameters of each model were chosen, following engineering experience or the suggestions made in the original references. As time series require forward validation to determine a model's performance, it is more reasonable to use 90% of the in-sample data for training and the remaining 10% for validation, as in [85].
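As an illustration of this tuning scheme (a sketch only; the actual grids are not reported here, and SVM is used as the example model), the single forward hold-out split can be expressed with scikit-learn's PredefinedSplit:

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.svm import SVR

def tune_svr(X_in, y_in, param_grid):
    """Grid search with one forward validation split (last 10% of in-sample data)."""
    n = len(y_in)
    fold = np.zeros(n, dtype=int)
    fold[: int(0.9 * n)] = -1                   # first 90%: always kept in the training fold
    split = PredefinedSplit(test_fold=fold)     # last 10%: the single validation fold
    search = GridSearchCV(SVR(), param_grid, cv=split,
                          scoring="neg_mean_absolute_error")  # scoring choice is an assumption
    search.fit(X_in.T, y_in)                    # scikit-learn expects samples as rows
    return search.best_estimator_, search.best_params_

# Hypothetical usage with an illustrative grid:
# best, params = tune_svr(X_in, y_in, {"C": [1, 10, 100], "gamma": [0.01, 0.1, 1.0]})
```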

5.3. Results

In order to make a comprehensive comparison between the PLC-SVM model and the other models, three sub-cases based on the same data sets with different lags were carried out.

5.3.1. Case I: τ = 18

In this case, the time lag is set as $\tau = 18$; i.e., every point will be predicted based on the former 18 points in the way presented in Algorithm 2. Four principal linear components are extracted by the PCA ($r_p = 0.95$) from the eighteen input dimensions; they are presented in Equation (A1) in Appendix A. The semi-analytical output function of PLC-SVM can then be written as
$$y = \beta^T z + w^T \varphi(x) + b = 0.1643 z_1 - 0.0156 z_2 - 0.2404 z_3 - 0.0692 z_4 + w^T \varphi(x) + 0.6824. \tag{31}$$
The testing metrics of all models are listed in Table 3. It is clear that the overall performance of the PLC-SVM model is the best, as all of its metrics are the best among all models. It is very interesting to see that the SVM model has the closest performance to PLC-SVM in this case, which is easy to explain, as they share similar methodologies (the kernel method and the $\varepsilon$-insensitive loss function). Among the kernel-based models, the SVM model has the best performance aside from PLC-SVM, while the GPR model has the worst. The RF model performs the best and CATB the worst among the tree-based models. The GRU model only outperforms the worst tree-based models and even performs worse than the linear AR model.
The predicted values of all 10 models, along with the percentage errors (PEs) at each point, are plotted in Figure 4. It is very interesting to see that the values predicted by PLC-SVM and SVM are very close, which is consistent with the metrics described above. It is also very clear that the predicted series of the other models, except for CATB, appear to be larger than the raw data and are less stable than those of PLC-SVM and SVM, whereas the predicted values of CATB tend to be approximately constant in the last steps. Interestingly, the values predicted by GRU in the first few steps are actually acceptable, but most predicted values become smaller than the raw data over longer steps. The predicted values of AR are very close to the average value, which is consistent with its properties.
From another point of view, the PEs of PLC-SVM and SVM are approximately distributed around zero, as shown in Figure 4. However, more PEs of LSSVM, GPR, RF, LGBM, XGB, and AR are larger than zero, indicating that these models overestimated future consumption. In contrast, more PEs of CATB and GRU are smaller than zero, indicating that these models underestimated the future trend of consumption. Overall, the PLC-SVM model has the best performance in forecasting primary energy consumption in this case.

5.3.2. Case II: τ = 24

In this case, the time lag is set as $\tau = 24$; i.e., every point will be predicted based on the former 24 points, as described in Algorithm 2. The PCA ($r_p = 0.95$) transforms the 24 dimensions into 5 principal components, which are presented in Equation (A2) in Appendix A. The output function of the PLC-SVM model can then be written as
$$y = \beta^T z + w^T \varphi(x) + b = 0.074 z_1 - 0.0036 z_2 - 0.2802 z_3 + 0.0157 z_4 - 0.2419 z_5 + w^T \varphi(x) + 0.6149. \tag{32}$$
The testing metrics of all models are listed in Table 4. In this case, the performance of PLC-SVM is also the best among these models, and its errors are smaller than those of the other models by a more significant margin; SVM still has the closest performance to PLC-SVM. RF performs best among the tree-based models, while GPR and CATB perform the worst among the kernel-based and the tree-based models, respectively. In this case, GRU has the worst performance of all the models. As for the AR model, although it outperforms several other models, its metrics are still significantly worse than those of PLC-SVM.
The predicted values and PEs of all 10 models are plotted in Figure 5. The values predicted by PLC-SVM and SVM still appear to be close. But in this case, it is more obvious that LSSVM, GPR, RF, LGBM, XGB, and AR all overestimate the observations, and the overall trends reflected by these models are less stable and appear to be increasing. The values predicted by CATB still appear to decay, with peak values too far from the observations. Only some of the first values predicted by GRU are close to the raw data; most of the following predicted values are larger than the average value of the corresponding raw data.
By analyzing the PEs shown in Figure 5, it is very clear that most PEs of LSSVM, GPR, RF, LGBM, XGB, GRU, and AR are larger than zero. This presents a clearer picture that these models all overestimate the future trend of real consumption. Meanwhile, most PEs of CATB are smaller than zero, and many of them are quite large, indicating that the results of this model are not acceptable at all. The positive and negative PEs of PLC-SVM and SVM appear to be approximately balanced, and their MPE (defined in Table 2) is closest to zero. Overall, the advantage of PLC-SVM over the other models is still significant in this case.

5.3.3. Case III: τ = 30

In this case, the time lag is set as $\tau = 30$; i.e., every point will be predicted based on the former 30 points. Five principal components are extracted by the PCA ($r_p = 0.95$); they are presented in Equation (A3) in Appendix A. The output function of the PLC-SVM model is obtained as
$$y = \beta^T z + w^T \varphi(x) + b = 0.1029 z_1 - 0.1242 z_2 - 0.2137 z_3 - 0.0626 z_4 - 0.104 z_5 + w^T \varphi(x) + 0.9909. \tag{33}$$
The testing metrics of all models are listed in Table 5. PLC-SVM is still the best model in this case, and it is very interesting to see that all of its metrics are generally better than in the previous two cases. GRU has the second-best performance in this case, and its MedAe is even closer to zero than that of PLC-SVM. The performance of SVM is significantly worse than that of PLC-SVM in this case. XGB performs the best among the tree-based models and is the closest of them to PLC-SVM. Meanwhile, GPR and CATB still have the worst performance in this case.
The predicted values of all 10 models are plotted in Figure 6. The values predicted by PLC-SVM appear to be closer to the observations in this case than they were in the previous two cases. Having the closest performance to PLC-SVM, the predicted values of GRU are very close to most peak values, which appear to be closer to the raw data than the tree-based model XGB. The values predicted by CATB still decay with more steps. It is very interesting to see that only the predicted values by PLC-SVM and CATB all fall within the range of the observations, while there are several points by the other models that are larger than the nearby peak values.
By looking at the results of PEs plotted in Figure 6, most values of PEs of CATB are negative, and most PEs of the other models are positive; this indicates that most models over-estimated the raw data in this case. Moreover, it is very clear that the distributions of PEs of PLC-SVM and GRU appear to be more uniform than others. However, it is clear that the PEs of GRU with larger steps become larger than PLC-SVM; this is the reason why the overall metrics for GRU are not the best. Overall, although GRU presents a highly competitive performance, the PLC-SVM model still performs the best in this case.

5.4. Discussion

It is clear that the PLC-SVM model has the best performance in all cases. One significant finding is that the PLC-SVM model indeed improved the accuracy of the SVM model. Having a similar structure and training algorithm, the SVM model can approach PLC-SVM with a smaller $\tau$, as shown for $\tau = 18, 24$. But it is interesting to note that the difference between PLC-SVM and SVM becomes larger with a longer lag: the related metrics of PLC-SVM are significantly better than those of SVM when $\tau = 30$. This indicates that the PLC-SVM model performs better in higher-dimensional problems than the SVM model. It is also worth noting that although the performance of the AR model is not the best, it generally presents a moderate performance in all cases. This indicates that there indeed exists a linear relationship between the current primary energy consumption and the former values. Having a partially linear structure, the PLC-SVM model takes advantage of such linear features, and its improvements come from the partially linear formulation, which makes the most of the linear features of the original series. At this stage, it can be confirmed that such linear features make the series predicted by the PLC-SVM model more accurate and stable than those of the SVM model, as also reflected in Figure 4, Figure 5 and Figure 6.
It should also be noted that the tree-based models are also very competitive compared with the PLC-SVM model. The best tree-based model in each case often presents a very close performance to the PLC-SVM model and is even much better than the other kernel-based models in some cases. Moreover, it is very interesting to see that the XGB model performs the second best when τ = 30 and is much better than the other kernel-based models. This greatly coincides with a well-recognized result that tree-based models have very good performance in high-dimensional problems.
Although the neural network using the GRU model often performs much worse than the other models with shorter lags, it is also very interesting to see that it performs quite well when $\tau = 30$, where its metrics are the closest to the best model and even its MedAe is better than that of the PLC-SVM model. This implies that the GRU model is very competitive with larger lags. However, even under such conditions, the overall performance of GRU is still slightly worse than that of PLC-SVM.
However, the advantages of PLC-SVM over the tree-based models and GRU are still significant. One of the most significant advantages of PLC-SVM is that it has only a few hyperparameters to tune. In the above cases, only the regularization parameter $C$ and the kernel parameter $\gamma$ are tuned, while $\varepsilon$ is set to a fixed value (this is reasonable because the model uses the $\varepsilon$-insensitive cost function). In contrast, all the tree-based models and GRU (like other neural networks) have many hyperparameters to tune, such as the maximum depth of trees, the number of estimators, and other parameters that need fine-tuning. This is very important because fewer hyperparameters often mean that a model is easier to tune and less time-consuming, which further makes it easier to design an optimal prediction scheme in real-world applications. Another advantage of PLC-SVM is its global convergence: as mentioned in Section 4.2, the dual formulation is essentially a convex optimization problem; thus, the PLC-SVM model can be trained with global convergence. However, the algorithms used for the tree-based models and GRU (e.g., bagging for RF and gradient-based algorithms for the other tree-based models and GRU) do not have global convergence; thus, they generally need more trials to obtain well-trained models.
Regarding application implications, it is first suggested to use larger time lags, as the PLC-SVM model performs better with such settings; this implies that more features may further improve the performance of PLC-SVM. Another point is that the forecasting horizons considered in this work are not short. In the above cases, the forecasting steps all number 55, which means that the monthly primary energy consumption over 55 months (almost 5 years) is predicted. Considering its stability and accuracy, it is reasonable to say that PLC-SVM is eligible for mid-/long-term primary energy consumption forecasting in the electric power sector. Such performance may make it a potential tool for decision-making and market planning in the future.

6. Conclusions

A partially linear component support vector machine, named PLC-SVM, was proposed in this work. By using the PCA algorithm, the linear part of PLC-SVM has fewer linear dimensions, reducing the risk of multicollinearity and the computational complexity. The methodology of SVM was used to construct the partially linear framework, and the primal-dual formulation gives the PLC-SVM model global optimality and easy implementation. The case study focused on forecasting the primary energy consumption of the electric power sector in the US using univariate time series data from January 1973 to January 2020, comprising 565 points of monthly primary energy consumption. The results of three sub-cases showed that the PLC-SVM model produces more accurate and stable forecasts than the other three kinds of typical machine learning models and the linear AR model under different lags; larger lags might improve the performance of the PLC-SVM model. Based on the above discussion, the PLC-SVM model is eligible for mid-/long-term forecasting of the primary energy consumption of the electric power sector in the US. Considering its general formulation, it can be expected to be used for forecasting more kinds of energy in future works.
The possible limitations of this work are twofold. The first issue is that this model might not be suitable for cases with very small data sets. In such conditions, the available lags would be very small, which means that the original dimension of the linear part is already small; thus, obviously, the PCA will not work well. Another limitation is that this work only considered the most commonly used Gaussian kernel in the applications. Better performance may be achieved by designing kernels with proper prior knowledge. In this regard, future works can also be extended by using more advanced kernels or new kernels designed for specific cases, as suggested in the kernel cookbook by Duvenaud [86].

Author Contributions

Conceptualization, methodology, writing—original draft preparation, funding acquisition, X.M.; software, X.M. and Y.C.; writing—review and editing, H.Y. and Y.D. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Humanities and Social Science Fund of the Ministry of Education of China (19YJCZH119), the Scientific and Technological Achievements Transformation Project of the Sichuan Scientific Research Institute (2022JDZH0035), and the National College Students Innovation and Entrepreneurship Training Program of China (S202210619106).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

Publicly available datasets were analyzed in this study. The data can be found at https://www.eia.gov/totalenergy/data/monthly/, accessed on 1 March 2020.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A

The expressions of the principal components obtained in the case studies are presented in this section. In all of the following formulae, $z_t^i$ denotes the $i$th element of the vector $z_t$.
The four principal components in Case I:
$$\begin{aligned}
z_t^1 = {} & -0.2298 y_{t-1} - 0.2321 y_{t-2} - 0.2343 y_{t-3} - 0.2353 y_{t-4} - 0.2355 y_{t-5} - 0.2346 y_{t-6} \\
& - 0.2331 y_{t-7} - 0.2326 y_{t-8} - 0.2326 y_{t-9} - 0.2332 y_{t-10} - 0.2337 y_{t-11} - 0.2348 y_{t-12} \\
& - 0.2371 y_{t-13} - 0.2391 y_{t-14} - 0.2410 y_{t-15} - 0.2417 y_{t-16} - 0.2415 y_{t-17} - 0.2401 y_{t-18} - 2.7449, \\
z_t^2 = {} & 0.2538 y_{t-1} + 0.3692 y_{t-2} + 0.1697 y_{t-3} - 0.1758 y_{t-4} - 0.3613 y_{t-5} - 0.2335 y_{t-6} \\
& + 0.0537 y_{t-7} + 0.2129 y_{t-8} + 0.1098 y_{t-9} - 0.1205 y_{t-10} - 0.2098 y_{t-11} - 0.0388 y_{t-12} \\
& + 0.2465 y_{t-13} + 0.3589 y_{t-14} + 0.1618 y_{t-15} - 0.1788 y_{t-16} - 0.3623 y_{t-17} - 0.2350 y_{t-18} + 0.0178, \\
z_t^3 = {} & -0.2789 y_{t-1} + 0.0345 y_{t-2} + 0.3278 y_{t-3} + 0.3192 y_{t-4} + 0.0167 y_{t-5} - 0.2898 y_{t-6} \\
& - 0.2963 y_{t-7} - 0.0460 y_{t-8} + 0.2106 y_{t-9} + 0.2058 y_{t-10} - 0.0534 y_{t-11} - 0.2939 y_{t-12} \\
& - 0.2745 y_{t-13} + 0.0343 y_{t-14} + 0.3226 y_{t-15} + 0.3152 y_{t-16} + 0.0162 y_{t-17} - 0.2853 y_{t-18} - 0.0094, \\
z_t^4 = {} & -0.1046 y_{t-1} - 0.1707 y_{t-2} - 0.1736 y_{t-3} - 0.1765 y_{t-4} - 0.1749 y_{t-5} - 0.1029 y_{t-6} \\
& + 0.0856 y_{t-7} + 0.3274 y_{t-8} + 0.5002 y_{t-9} + 0.4970 y_{t-10} + 0.3211 y_{t-11} + 0.0822 y_{t-12} \\
& - 0.1002 y_{t-13} - 0.1688 y_{t-14} - 0.1716 y_{t-15} - 0.1726 y_{t-16} - 0.1686 y_{t-17} - 0.0982 y_{t-18} + 0.0197.
\end{aligned} \tag{A1}$$
The five principal components in Case II:
$$\begin{aligned}
z_t^1 = {} & -0.1974 y_{t-1} - 0.1983 y_{t-2} - 0.1994 y_{t-3} - 0.2003 y_{t-4} - 0.2006 y_{t-5} - 0.2007 y_{t-6} \\
& - 0.2013 y_{t-7} - 0.2021 y_{t-8} - 0.2031 y_{t-9} - 0.2037 y_{t-10} - 0.2041 y_{t-11} - 0.2042 y_{t-12} \\
& - 0.2045 y_{t-13} - 0.2052 y_{t-14} - 0.2060 y_{t-15} - 0.2064 y_{t-16} - 0.2064 y_{t-17} - 0.2066 y_{t-18} \\
& - 0.2066 y_{t-19} - 0.2073 y_{t-20} - 0.2082 y_{t-21} - 0.2086 y_{t-22} - 0.2088 y_{t-23} - 0.2085 y_{t-24} - 3.1723, \\
z_t^2 = {} & 0.2448 y_{t-1} + 0.2681 y_{t-2} + 0.0273 y_{t-3} - 0.2352 y_{t-4} - 0.2593 y_{t-5} - 0.0202 y_{t-6} \\
& + 0.2412 y_{t-7} + 0.2631 y_{t-8} + 0.0233 y_{t-9} - 0.2385 y_{t-10} - 0.2610 y_{t-11} - 0.0239 y_{t-12} \\
& + 0.2365 y_{t-13} + 0.2601 y_{t-14} + 0.0241 y_{t-15} - 0.2340 y_{t-16} - 0.2580 y_{t-17} - 0.0217 y_{t-18} \\
& + 0.2346 y_{t-19} + 0.2550 y_{t-20} + 0.0197 y_{t-21} - 0.2375 y_{t-22} - 0.2598 y_{t-23} - 0.0274 y_{t-24} + 0.0187, \\
z_t^3 = {} & -0.1634 y_{t-1} + 0.1274 y_{t-2} + 0.2912 y_{t-3} + 0.1644 y_{t-4} - 0.1281 y_{t-5} - 0.2937 y_{t-6} \\
& - 0.1675 y_{t-7} + 0.1244 y_{t-8} + 0.2910 y_{t-9} + 0.1672 y_{t-10} - 0.1206 y_{t-11} - 0.2850 y_{t-12} \\
& - 0.1628 y_{t-13} + 0.1231 y_{t-14} + 0.2846 y_{t-15} + 0.1610 y_{t-16} - 0.1273 y_{t-17} - 0.2893 y_{t-18} \\
& - 0.1642 y_{t-19} + 0.1217 y_{t-20} + 0.2844 y_{t-21} + 0.1630 y_{t-22} - 0.1195 y_{t-23} - 0.2806 y_{t-24} + 0.0035, \\
z_t^4 = {} & + 0.1268 y_{t-1} + 0.2401 y_{t-2} + 0.2935 y_{t-3} + 0.2712 y_{t-4} + 0.1755 y_{t-5} + 0.0331 y_{t-6} \\
& - 0.1173 y_{t-7} - 0.2321 y_{t-8} - 0.2813 y_{t-9} - 0.2537 y_{t-10} - 0.1586 y_{t-11} - 0.0226 y_{t-12} \\
& + 0.1203 y_{t-13} + 0.2325 y_{t-14} + 0.2837 y_{t-15} + 0.2588 y_{t-16} + 0.1609 y_{t-17} + 0.0186 y_{t-18} \\
& - 0.1304 y_{t-19} - 0.2435 y_{t-20} - 0.2899 y_{t-21} - 0.2600 y_{t-22} - 0.1640 y_{t-23} - 0.0287 y_{t-24} + 0.0157, \\
z_t^5 = {} & -0.2626 y_{t-1} - 0.1649 y_{t-2} - 0.0228 y_{t-3} + 0.1231 y_{t-4} + 0.2348 y_{t-5} + 0.2849 y_{t-6} \\
& + 0.2609 y_{t-7} + 0.1695 y_{t-8} + 0.0315 y_{t-9} - 0.1180 y_{t-10} - 0.2375 y_{t-11} - 0.2914 y_{t-12} \\
& - 0.2640 y_{t-13} - 0.1631 y_{t-14} - 0.0192 y_{t-15} + 0.1277 y_{t-16} + 0.2367 y_{t-17} + 0.2839 y_{t-18} \\
& + 0.2578 y_{t-19} + 0.1635 y_{t-20} + 0.0244 y_{t-21} - 0.1229 y_{t-22} - 0.2389 y_{t-23} - 0.2886 y_{t-24} + 0.0057.
\end{aligned} \tag{A2}$$
The five principal components in Case III:
$$\begin{aligned}
z_t^1 = {} & -0.1751 y_{t-1} - 0.1763 y_{t-2} - 0.1776 y_{t-3} - 0.1786 y_{t-4} - 0.1791 y_{t-5} - 0.1789 y_{t-6} \\
& - 0.1783 y_{t-7} - 0.1783 y_{t-8} - 0.1786 y_{t-9} - 0.1794 y_{t-10} - 0.1800 y_{t-11} - 0.1805 y_{t-12} \\
& - 0.1816 y_{t-13} - 0.1826 y_{t-14} - 0.1838 y_{t-15} - 0.1846 y_{t-16} - 0.1850 y_{t-17} - 0.1848 y_{t-18} \\
& - 0.1839 y_{t-19} - 0.1836 y_{t-20} - 0.1838 y_{t-21} - 0.1844 y_{t-22} - 0.1848 y_{t-23} - 0.1852 y_{t-24} \\
& - 0.1860 y_{t-25} - 0.1870 y_{t-26} - 0.1880 y_{t-27} - 0.1888 y_{t-28} - 0.1890 y_{t-29} - 0.1887 y_{t-30} - 3.5511, \\
z_t^2 = {} & 0.1790 y_{t-1} + 0.2921 y_{t-2} + 0.1452 y_{t-3} - 0.1319 y_{t-4} - 0.2838 y_{t-5} - 0.1779 y_{t-6} \\
& + 0.0656 y_{t-7} + 0.2018 y_{t-8} + 0.1083 y_{t-9} - 0.1034 y_{t-10} - 0.2005 y_{t-11} - 0.0685 y_{t-12} \\
& + 0.1748 y_{t-13} + 0.2863 y_{t-14} + 0.1417 y_{t-15} - 0.1319 y_{t-16} - 0.2826 y_{t-17} - 0.1787 y_{t-18} \\
& + 0.0610 y_{t-19} + 0.1952 y_{t-20} + 0.1042 y_{t-21} - 0.1026 y_{t-22} - 0.1974 y_{t-23} - 0.0679 y_{t-24} \\
& + 0.1700 y_{t-25} + 0.2787 y_{t-26} + 0.1371 y_{t-27} - 0.1307 y_{t-28} - 0.2784 y_{t-29} - 0.1765 y_{t-30} + 0.0241, \\
z_t^3 = {} & -0.2259 y_{t-1} + 0.0134 y_{t-2} + 0.2487 y_{t-3} + 0.2509 y_{t-4} + 0.0205 y_{t-5} - 0.2200 y_{t-6} \\
& - 0.2301 y_{t-7} - 0.0257 y_{t-8} + 0.1870 y_{t-9} + 0.1886 y_{t-10} - 0.0226 y_{t-11} - 0.2285 y_{t-12} \\
& - 0.2235 y_{t-13} + 0.0123 y_{t-14} + 0.2446 y_{t-15} + 0.2475 y_{t-16} + 0.0209 y_{t-17} - 0.2171 y_{t-18} \\
& - 0.2266 y_{t-19} - 0.0258 y_{t-20} + 0.1822 y_{t-21} + 0.1840 y_{t-22} - 0.0227 y_{t-23} - 0.2241 y_{t-24} \\
& - 0.2184 y_{t-25} + 0.0123 y_{t-26} + 0.2393 y_{t-27} + 0.2427 y_{t-28} + 0.0206 y_{t-29} - 0.2127 y_{t-30} - 0.0056, \\
z_t^4 = {} & -0.0758 y_{t-1} - 0.1498 y_{t-2} - 0.1754 y_{t-3} - 0.1783 y_{t-4} - 0.1588 y_{t-5} - 0.0878 y_{t-6} \\
& + 0.0486 y_{t-7} + 0.2115 y_{t-8} + 0.3274 y_{t-9} + 0.3331 y_{t-10} + 0.2270 y_{t-11} + 0.0685 y_{t-12} \\
& - 0.0694 y_{t-13} - 0.1451 y_{t-14} - 0.1714 y_{t-15} - 0.1734 y_{t-16} - 0.1522 y_{t-17} - 0.0802 y_{t-18} \\
& + 0.0554 y_{t-19} + 0.2159 y_{t-20} + 0.3289 y_{t-21} + 0.3320 y_{t-22} + 0.2245 y_{t-23} + 0.0658 y_{t-24} \\
& - 0.0718 y_{t-25} - 0.1473 y_{t-26} - 0.1729 y_{t-27} - 0.1743 y_{t-28} - 0.1526 y_{t-29} - 0.0818 y_{t-30} + 0.0134, \\
z_t^5 = {} & -0.1951 y_{t-1} - 0.0796 y_{t-2} - 0.0195 y_{t-3} + 0.0073 y_{t-4} + 0.0727 y_{t-5} + 0.1963 y_{t-6} \\
& + 0.3028 y_{t-7} + 0.2909 y_{t-8} + 0.1277 y_{t-9} - 0.1064 y_{t-10} - 0.2759 y_{t-11} - 0.2963 y_{t-12} \\
& - 0.1997 y_{t-13} - 0.0826 y_{t-14} - 0.0198 y_{t-15} + 0.0100 y_{t-16} + 0.0759 y_{t-17} + 0.1962 y_{t-18} \\
& + 0.2996 y_{t-19} + 0.2851 y_{t-20} + 0.1207 y_{t-21} - 0.1114 y_{t-22} - 0.2777 y_{t-23} - 0.2967 y_{t-24} \\
& - 0.1998 y_{t-25} - 0.0824 y_{t-26} - 0.0190 y_{t-27} + 0.0107 y_{t-28} + 0.0745 y_{t-29} + 0.1899 y_{t-30} - 0.00001.
\end{aligned} \tag{A3}$$

References

  1. Statt, N. Google and DeepMind Are Using AI to Predict the Energy Output of Wind Farms. The Verge. p. 1. Available online: https://www.theverge.com/2019/2/26/18241632/google-deepmind-wind-farm-ai-machine-learning-green-energy-efficiency (accessed on 26 February 2019).
  2. Ma, M.; Ma, X.; Cai, W.; Cai, W. Low carbon roadmap of residential building sector in China: Historical mitigation and prospective peak. Appl. Energy 2020, 273, 115247. [Google Scholar] [CrossRef]
  3. Lu, H.; Ma, X.; Azimi, M. US natural gas consumption prediction using an improved kernel-based nonlinear extension of the Arps decline model. Energy 2020, 194, 116905. [Google Scholar] [CrossRef]
  4. Zeng, B.; Zhou, M.; Liu, X.; Zhang, Z. Application of a new grey prediction model and grey average weakening buffer operator to forecast China’s shale gas output. Energy Rep. 2020, 6, 1608–1618. [Google Scholar] [CrossRef]
  5. Niu, T.; Wang, J.; Lu, H.; Yang, W.; Du, P. A learning system integrating temporal convolution and deep learning for predictive modeling of crude oil price. IEEE Trans. Ind. Inform. 2020, 17, 4602–4612. [Google Scholar] [CrossRef]
  6. Yang, J.; Cai, W.; Ma, M.; Li, L.; Liu, C.; Ma, X.; Li, L.; Chen, X. Driving forces of China’s CO2 emissions from energy consumption based on Kaya-LMDI methods. Sci. Total Environ. 2020, 711, 134569. [Google Scholar] [CrossRef]
  7. Engle, R.F.; Granger, C.W.J.; Rice, J.; Weiss, A. Semiparametric estimates of the relation between weather and electricity sales. J. Am. Stat. Assoc. 1986, 81, 310–320. [Google Scholar] [CrossRef]
  8. Smola, A.J.; Frieß, T.; Schölkopf, B. Semiparametric support vector and linear programming machines. In Proceedings of the Advances in Neural Information Processing Systems 11, NIPS Conference, Denver, CO, USA, 30 November–5 December 1998; pp. 585–591. [Google Scholar]
  9. Espinoza, M.; Suykens, J.A.K.; De Moor, B. Kernel based partially linear models and nonlinear identification. IEEE Trans. Autom. Control 2005, 50, 1602–1606. [Google Scholar] [CrossRef]
  10. Goethals, I.; Pelckmans, K.; Suykens, J.A.K.; De Moor, B. Identification of MIMO Hammerstein models using least squares support vector machines. Automatica 2005, 41, 1263–1272. [Google Scholar] [CrossRef]
  11. Varoquaux, G. Cross-validation failure: Small sample sizes lead to large error bars. Neuroimage 2018, 180, 68–77. [Google Scholar] [CrossRef]
  12. Castro-Garcia, R.; Agudelo, O.M.; Suykens, J.A.K. Impulse response constrained LS-SVM modelling for MIMO Hammerstein system identification. Int. J. Control 2019, 92, 908–925. [Google Scholar] [CrossRef]
  13. Ma, X.; Liu, Z. Predicting the oil production using the novel multivariate nonlinear model based on Arps decline model and kernel method. Neural Comput. Appl. 2018, 29, 579–591. [Google Scholar] [CrossRef]
  14. Ma, X.; Liu, Z. The kernel-based nonlinear multivariate grey model. Appl. Math. Model. 2018, 56, 217–238. [Google Scholar] [CrossRef]
  15. Ma, X. A brief introduction to the grey machine learning. J. Grey Syst. 2019, 31, 1–12. [Google Scholar]
  16. Matías, J.M.; Taboada, J.; Ordóñez, C.; González-Manteiga, W. Partially linear support vector machines applied to the prediction of mine slope movements. Math. Comput. Model. 2010, 51, 206–215. [Google Scholar] [CrossRef]
  17. Xu, Y.; Chen, D.R. Partially-linear least-squares regularized regression for system identification. IEEE Trans. Autom. Control 2009, 54, 2637–2641. [Google Scholar]
  18. Fan, J.; Wu, L.; Zhang, F.; Cai, H.; Zeng, W.; Wang, X.; Zou, H. Empirical and machine learning models for predicting daily global solar radiation from sunshine duration: A review and case study in China. Renew. Sustain. Energy Rev. 2019, 100, 186–212. [Google Scholar] [CrossRef]
  19. Chang, Y.; Choi, Y.; Kim, C.S.; Miller, J.I.; Park, J.Y. Forecasting regional long-run energy demand: A functional coefficient panel approach. Energy Econ. 2021, 96, 105117. [Google Scholar] [CrossRef]
  20. Johannesen, N.J.; Kolhe, M.; Goodwin, M. Relative evaluation of regression tools for urban area electrical energy demand forecasting. J. Clean. Prod. 2019, 218, 555–564. [Google Scholar] [CrossRef]
  21. Akdi, Y.; Gölveren, E.; Okkaoğlu, Y. Daily electrical energy consumption: Periodicity, harmonic regression method and forecasting. Energy 2020, 191, 116524. [Google Scholar] [CrossRef]
  22. Khalifa, A.; Caporin, M.; Di Fonzo, T. Scenario-based forecast for the electricity demand in Qatar and the role of energy efficiency improvements. Energy Policy 2019, 127, 155–164. [Google Scholar] [CrossRef]
  23. Nafil, A.; Bouzi, M.; Anoune, K.; Ettalabi, N. Comparative study of forecasting methods for energy demand in Morocco. Energy Rep. 2020, 6, 523–536. [Google Scholar] [CrossRef]
  24. Dumitru, C.-D.; Gligor, A. Wind energy forecasting: A comparative study between a stochastic model (ARIMA) and a model based on neural network (FFANN). Procedia Manuf. 2019, 32, 410–417. [Google Scholar] [CrossRef]
  25. Rakpho, P.; Yamaka, W. The forecasting power of economic policy uncertainty for energy demand and supply. Energy Rep. 2021, 7, 338–343. [Google Scholar] [CrossRef]
  26. Karia, A.A.; Bujang, I.; Ahmad, I. Fractionally integrated ARMA for crude palm oil prices prediction: Case of potentially overdifference. J. Appl. Stat. 2013, 40, 2735–2748. [Google Scholar] [CrossRef]
  27. Wang, Z.-X.; Jv, T.-Q. A non-linear systematic grey model for forecasting the industrial economy-energy-environment system. Technol. Forecast. Soc. Chang. 2021, 167, 120707. [Google Scholar] [CrossRef]
  28. Ma, X.; Lu, H.; Ma, M.; Wu, L.; Cai, Y. Urban natural gas consumption forecasting by novel wavelet-kernelized grey system model. Eng. Appl. Artif. Intell. 2023, 119, 105773. [Google Scholar] [CrossRef]
  29. Qian, W.; Sui, A. A novel structural adaptive discrete grey prediction model and its application in forecasting renewable energy generation. Expert Syst. Appl. 2021, 186, 115761. [Google Scholar] [CrossRef]
  30. Wang, Y.; Nie, R.; Ma, X.; Liu, Z.; Chi, P.; Wu, W.; Guo, B.; Yang, X.; Zhang, L. A novel Hausdorff fractional NGMC(p, n) grey prediction model with Grey Wolf Optimizer and its applications in forecasting energy production and conversion of China. Appl. Math. Model. 2021, 97, 381–397. [Google Scholar] [CrossRef]
  31. Wang, Z.-X.; He, L.-Y.; Zheng, H.-H. Forecasting the residential solar energy consumption of the United States. Energy 2019, 178, 610–623. [Google Scholar] [CrossRef]
  32. Moonchai, S.; Chutsagulprom, N. Short-term forecasting of renewable energy consumption: Augmentation of a modified grey model with a Kalman filter. Appl. Soft Comput. 2020, 87, 105994. [Google Scholar] [CrossRef]
  33. Xie, N.; Yuan, C.; Yang, Y. Forecasting China’s energy demand and self-sufficiency rate by grey forecasting model and Markov model. Int. J. Electr. Power Energy Syst. 2015, 66, 1–8. [Google Scholar] [CrossRef]
  34. Piazza, A.D.; Piazza, M.C.D.; Tona, G.L.; Luna, M. An artificial neural network-based forecasting model of energy-related time series for electrical grid management. Math. Comput. Simul. 2021, 184, 294–305. [Google Scholar] [CrossRef]
  35. Kobylinski, P.; Wierzbowski, M.; Piotrowski, K. High-resolution net load forecasting for micro-neighbourhoods with high penetration of renewable energy sources. Int. J. Electr. Power Energy Syst. 2020, 117, 105635. [Google Scholar] [CrossRef]
  36. Al-Gabalawy, M.; Hosny, N.S.; Adly, A.R. Probabilistic forecasting for energy time series considering uncertainties based on deep learning algorithms. Electr. Power Syst. Res. 2021, 196, 107216. [Google Scholar] [CrossRef]
  37. Katsatos, A.L.; Moustris, K.P. Application of artificial neuron networks as energy consumption forecasting tool in the building of Regulatory Authority of Energy, Athens, Greece. Energy Procedia 2019, 157, 851–861. [Google Scholar] [CrossRef]
  38. Bento, P.M.R.; Pombo, J.A.N.; Mendes, R.P.G.; Calado, M.R.A.; Mariano, S.J.P.S. Ocean wave energy forecasting using optimised deep learning neural networks. Ocean. Eng. 2021, 219, 108372. [Google Scholar] [CrossRef]
  39. Abu-Salih, B.; Wongthongtham, P.; Morrison, G.; Coutinho, K.; Al-Okaily, M.; Huneiti, A. Short-term renewable energy consumption and generation forecasting: A case study of Western Australia. Heliyon 2022, 8, e09152. [Google Scholar] [CrossRef]
  40. Somu, N.; Gauthama Raman, M.R.; Ramamritham, K. A hybrid model for building energy consumption forecasting using long short term memory networks. Appl. Energy 2020, 261, 114131. [Google Scholar] [CrossRef]
  41. Khan, N.; Haq, I.U.; Khan, S.U.; Rho, S.; Lee, M.Y.; Baik, S.W. DB-Net: A novel dilated CNN based multi-step forecasting model for power consumption in integrated local energy systems. Int. J. Electr. Power Energy Syst. 2021, 133, 107023. [Google Scholar] [CrossRef]
  42. Etxegarai, G.; López, A.; Aginako, N.; Rodríguez, F. An analysis of different deep learning neural networks for intra-hour solar irradiation forecasting to compute solar photovoltaic generators’ energy production. Energy Sustain. Dev. 2022, 68, 1–17. [Google Scholar] [CrossRef]
  43. Gao, Y.; Ruan, Y.; Fang, C.; Yin, S. Deep learning and transfer learning models of energy consumption forecasting for a building with poor information data. Energy Build. 2020, 223, 110156. [Google Scholar] [CrossRef]
  44. Hu, H.; Wang, L.; Peng, L.; Zeng, Y. Effective energy consumption forecasting using enhanced bagged echo state network. Energy 2020, 193, 116778. [Google Scholar] [CrossRef]
  45. Hu, H.; Wang, L.; Lv, S. Forecasting energy consumption and wind power generation using deep echo state network. Renew. Energy 2020, 154, 598–613. [Google Scholar] [CrossRef]
  46. Natarajan, Y.; Kannan, S.; Selvaraj, C.; Mohanty, S.N. Forecasting energy generation in large photovoltaic plants using radial belief neural network. Sustain. Comput. Inform. Syst. 2021, 31, 100578. [Google Scholar] [CrossRef]
  47. Cui, Y.; Jia, L.; Fan, W. Estimation of actual evapotranspiration and its components in an irrigated area by integrating the shuttleworth-wallace and surface temperature-vegetation index schemes using the particle swarm optimization algorithm. Agric. For. Meteorol. 2021, 307, 108488. [Google Scholar] [CrossRef]
  48. Zhang, F.; Deb, C.; Lee, S.E.; Yang, J.; Shah, K.W. Time series forecasting for building energy consumption using weighted support vector regression with differential evolution optimization technique. Energy Build. 2016, 126, 94–103. [Google Scholar] [CrossRef]
  49. Wen, L.; Cao, Y. Influencing factors analysis and forecasting of residential energy-related CO2 emissions utilizing optimized support vector machine. J. Clean. Prod. 2020, 250, 119492. [Google Scholar] [CrossRef]
  50. Mason, K.; Duggan, J.; Howley, E. Forecasting energy demand, wind generation and carbon dioxide emissions in Ireland using evolutionary neural networks. Energy 2018, 155, 705–720. [Google Scholar] [CrossRef]
  51. Hu, G.; Xu, Z.; Wang, G.; Zeng, B.; Liu, Y.; Lei, Y. Forecasting energy consumption of long-distance oil products pipeline based on improved fruit fly optimization algorithm and support vector regression. Energy 2021, 224, 120153. [Google Scholar] [CrossRef]
  52. Abba, S.I.; Rotimi, A.; Musa, B.; Yimen, N.; Kawu, S.J.; Lawan, S.M.; Dagbasi, M. Emerging Harris Hawks optimization based load demand forecasting and optimal sizing of stand-alone hybrid renewable energy systems—A case study of Kano and Abuja, Nigeria. Results Eng. 2021, 12, 100260. [Google Scholar] [CrossRef]
  53. Lu, H.; Ma, X.; Huang, K.; Azimi, M. Carbon trading volume and price forecasting in China using multiple machine learning models. J. Clean. Prod. 2020, 249, 119386. [Google Scholar] [CrossRef]
  54. Lu, H.; Cheng, F.; Ma, X.; Hu, G. Short-term prediction of building energy consumption employing an improved extreme gradient boosting model: A case study of an intake tower. Energy 2020, 117756. [Google Scholar] [CrossRef]
  55. Fan, J.; Wang, X.; Zhang, F.; Ma, X.; Wu, L. Predicting daily diffuse horizontal solar radiation in various climatic regions of China using support vector machine and tree-based soft computing models with local and extrinsic climatic data. J. Clean. Prod. 2020, 248, 119264. [Google Scholar] [CrossRef]
  56. Huang, G.; Wu, L.; Ma, X.; Zhang, W.; Fan, J.; Yu, X.; Zeng, W.; Zhou, H. Evaluation of CatBoost method for prediction of reference evapotranspiration in humid regions. J. Hydrol. 2019, 574, 1029–1041. [Google Scholar] [CrossRef]
  57. Hong, T.; Xie, J.; Black, J. Global energy forecasting competition 2017: Hierarchical probabilistic load forecasting. Int. J. Forecast. 2019, 35, 1389–1399. [Google Scholar] [CrossRef]
  58. Bedi, J.; Toshniwal, D. Energy load time-series forecast using decomposition and autoencoder integrated memory network. Appl. Soft Comput. 2020, 93, 106390. [Google Scholar] [CrossRef]
  59. Adedeji, P.A.; Akinlabi, S.; Ajayi, O.; Madushele, N. Non-linear autoregressive neural network (NARNET) with SSA filtering for a university energy consumption forecast. Procedia Manuf. 2019, 33, 176–183. [Google Scholar] [CrossRef]
  60. Tayab, U.B.; Lu, J.; Yang, F.; AlGarni, T.S.; Kashif, M. Energy management system for microgrids using weighted salp swarm algorithm and hybrid forecasting approach. Renew. Energy 2021, 180, 467–481. [Google Scholar] [CrossRef]
  61. Zhang, G.; Tian, C.; Li, C.; Zhang, J.J.; Zuo, W. Accurate forecasting of building energy consumption via a novel ensembled deep learning method considering the cyclic feature. Energy 2020, 201, 117531. [Google Scholar] [CrossRef]
  62. Xiao, J.; Li, Y.; Xie, L.; Liu, D.; Huang, J. A hybrid model based on selective ensemble for energy consumption forecasting in China. Energy 2018, 159, 534–546. [Google Scholar] [CrossRef]
  63. Khan, W.; Walker, S.; Zeiler, W. Improved solar photovoltaic energy generation forecast using deep learning-based ensemble stacking approach. Energy 2022, 240, 122812. [Google Scholar] [CrossRef]
  64. Kazemzadeh, M.; Amjadian, A.; Amraee, T. A hybrid data mining driven algorithm for long term electric peak load and energy demand forecasting. Energy 2020, 204, 117948. [Google Scholar] [CrossRef]
  65. Tran, D.; Luong, D.; Chou, J. Nature-inspired metaheuristic ensemble model for forecasting energy consumption in residential buildings. Energy 2020, 191, 116552. [Google Scholar] [CrossRef]
  66. Liu, Z.; Wang, X.; Zhang, Q.; Huang, C. Empirical mode decomposition based hybrid ensemble model for electrical energy consumption forecasting of the cement grinding process. Measurement 2019, 138, 314–324. [Google Scholar] [CrossRef]
  67. da Silva, R.G.; Dal Molin Ribeiro, M.H.; Moreno, S.R.; Mariani, V.C.; dos Santos Coelho, L. A novel decomposition-ensemble learning framework for multi-step ahead wind energy forecasting. Energy 2021, 216, 119174. [Google Scholar] [CrossRef]
  68. Härdle, W.; Liang, H.; Gao, J. Partially Linear Models; Springer: Berlin/Heidelberg, Germany, 2012. [Google Scholar]
  69. Rudin, W. Principles of Mathematical Analysis; McGraw-Hill: New York, NY, USA, 1976; Volume 3. [Google Scholar]
  70. Smola, A.J.; Schölkopf, B. A tutorial on support vector regression. Stat. Comput. 2004, 14, 199–222. [Google Scholar] [CrossRef]
  71. Takahashi, N.; Guo, J.; Nishi, T. Global convergence of SMO algorithm for support vector regression. IEEE Trans. Neural Netw. 2008, 19, 971–982. [Google Scholar] [CrossRef] [PubMed]
  72. Chang, C.; Lin, C. LIBSVM: A library for support vector machines. ACM Trans. Intell. Syst. Technol. 2011, 2, 1–27. [Google Scholar] [CrossRef]
  73. Rasmussen, C.E.; Williams, C.K.I. Gaussian Process Regression for Machine Learning; MIT Press: Cambridge, MA, USA, 2006. [Google Scholar]
  74. Suykens, J.A.K.; Vandewalle, J. Least squares support vector machine classifiers. Neural Process. Lett. 1999, 9, 293–300. [Google Scholar] [CrossRef]
  75. De Brabanter, J.; De Moor, B.; Suykens, J.A.K.; Van Gestel, T.; Vandewalle, J.P.L. Least Squares Support Vector Machines; World Scientific: Singapore, 2002. [Google Scholar]
  76. Ho, T.K. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. Mach. Intell. 1998, 20, 832–844. [Google Scholar]
  77. Breiman, L. Random forests. Mach. Learn. 2001, 45, 5–32. [Google Scholar] [CrossRef]
  78. Chen, T.; Guestrin, C. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794. [Google Scholar]
  79. Dong, J.; Zeng, W.; Wu, L.; Huang, J.; Gaiser, T.; Srivastava, A.K. Enhancing short-term forecasting of daily precipitation using numerical weather prediction bias correcting with XGBoost in different regions of China. Eng. Appl. Artif. Intell. 2023, 117, 105579. [Google Scholar] [CrossRef]
  80. Ke, G.; Meng, Q.; Finley, T.; Wang, T.; Chen, W.; Ma, W.; Ye, Q.; Liu, T. LightGBM: A highly efficient gradient boosting decision tree. In Proceedings of the NIPS’17: 31st International Conference on Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; pp. 3146–3154. [Google Scholar]
  81. Prokhorenkova, L.; Gusev, G.; Vorobev, A.; Dorogush, A.V.; Gulin, A. CatBoost: Unbiased boosting with categorical features. In Proceedings of the NIPS’18: 32nd International Conference on Neural Information Processing Systems, Red Hook, NY, USA, 3–8 December 2018; pp. 6638–6648. [Google Scholar]
  82. Cho, K.; van Merriënboer, B.; Bahdanau, D.; Bengio, Y. On the properties of neural machine translation: Encoder–decoder approaches. In Proceedings of the SSST-8, 8th Workshop on Syntax, Semantics and Structure in Statistical Translation, Doha, Qatar, 25 October 2014; pp. 103–111. [Google Scholar]
  83. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput. 1997, 9, 1735–1780. [Google Scholar] [CrossRef] [PubMed]
  84. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
  85. Miranian, A.; Abdollahzade, M. Developing a local least-squares support vector machines-based neuro-fuzzy model for nonlinear and chaotic time series prediction. IEEE Trans. Neural Netw. Learn. Syst. 2012, 24, 207–218. [Google Scholar] [CrossRef]
  86. Duvenaud, D. Automatic Model Construction with Gaussian Processes. Ph.D. Thesis, University of Cambridge, Cambridge, UK, 2014. [Google Scholar]
Figure 1. Overview of the models for energy forecasting in recent years.
Figure 2. The general framework of the proposed PLC-SVM model structure and its application to forecasting the primary energy consumption of the US electric power sector.
Figure 3. Raw data of monthly primary energy consumption of the electric power sector in the US from January 1973 to January 2020.
Figure 4. Predicted values using (a) PLC-SVM, (b) SVM, (c) LSSVM, (d) GPR, (e) RF, (f) LGBM, (g) XGB, (h) CATB, (i) GRU, and (j) AR with τ = 18.
Figure 5. Predicted values using (a) PLC-SVM, (b) SVM, (c) LSSVM, (d) GPR, (e) RF, (f) LGBM, (g) XGB, (h) CATB, (i) GRU, and (j) AR with τ = 24.
Figure 6. Predicted values using (a) PLC-SVM, (b) SVM, (c) LSSVM, (d) GPR, (e) RF, (f) LGBM, (g) XGB, (h) CATB, (i) GRU, and (j) AR with τ = 30.
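The τ in these captions is the time lag of the univariate scheme: each model receives the previous τ monthly values as input and predicts the next value, with the tail of the series held out for testing. A minimal sketch of that evaluation loop follows; the SVR baseline, its hyperparameters, and the test-window length n_test are illustrative assumptions, not the paper's settings.

```python
# Sketch of the univariate lag-tau evaluation behind Figures 4-6; the
# SVR baseline and the split size are assumptions for illustration only.
import numpy as np
from sklearn.svm import SVR

def forecast_with_lag(y, tau, n_test):
    # Row i of X is (y_{t-tau}, ..., y_{t-1}) for target y_t.
    X = np.column_stack([y[j : len(y) - tau + j] for j in range(tau)])
    target = y[tau:]
    model = SVR(kernel="rbf", C=10.0, epsilon=0.1)
    model.fit(X[:-n_test], target[:-n_test])              # train on the head
    return target[-n_test:], model.predict(X[-n_test:])   # forecast the tail

# e.g. actual, predicted = forecast_with_lag(y, tau=18, n_test=36)
```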
Table 2. Metrics used in this paper, with $x^{(0)}(k)$ the observed value, $\hat{x}^{(0)}(k)$ the predicted value, and $\bar{x}^{(0)}$, $\bar{\hat{x}}^{(0)}$ the corresponding means.

Metric | Abbreviation | Formula
Average Error | AE | $\frac{1}{n}\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)$
Average Relative Error | ARE | $\frac{1}{n}\sum_{k=1}^{n}\left|\frac{x^{(0)}(k)-\hat{x}^{(0)}(k)}{x^{(0)}(k)}\right|$
Index of Agreement | IA | $1-\frac{\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}}{\sum_{k=1}^{n}\left(\left|x^{(0)}(k)-\bar{x}^{(0)}\right|+\left|\hat{x}^{(0)}(k)-\bar{\hat{x}}^{(0)}\right|\right)^{2}}$
Mean Arctangent Absolute Percentage Error | MAAPE | $\frac{1}{n}\sum_{k=1}^{n}\arctan\left|\frac{x^{(0)}(k)-\hat{x}^{(0)}(k)}{x^{(0)}(k)}\right|$
Mean Absolute Error | MAE | $\frac{1}{n}\sum_{k=1}^{n}\left|x^{(0)}(k)-\hat{x}^{(0)}(k)\right|$
Mean Absolute Percentage Error | MAPE | $\frac{1}{n}\sum_{k=1}^{n}\left|\frac{x^{(0)}(k)-\hat{x}^{(0)}(k)}{x^{(0)}(k)}\right|\times 100\%$
Median Absolute Error | MedAe | $\operatorname{median}_{k}\left|x^{(0)}(k)-\hat{x}^{(0)}(k)\right|$
Mean Percentage Error | MPE | $\frac{1}{n}\sum_{k=1}^{n}\frac{x^{(0)}(k)-\hat{x}^{(0)}(k)}{x^{(0)}(k)}\times 100\%$
Mean Squared Error | MSE | $\frac{1}{n}\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}$
Mean Squared Logarithmic Error | MSLE | $\frac{1}{n}\sum_{k=1}^{n}\left(\log\left(x^{(0)}(k)+1\right)-\log\left(\hat{x}^{(0)}(k)+1\right)\right)^{2}$
Normalized Root Mean Square Error | NRMSE | $\frac{\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}}}{x^{(0)}_{\max}-x^{(0)}_{\min}}$
Percent Bias | Pibas | $\frac{\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)}{\sum_{k=1}^{n}\hat{x}^{(0)}(k)}$
Coefficient of Determination | R2 | $1-\frac{\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}}{\sum_{k=1}^{n}\left(x^{(0)}(k)-\bar{x}^{(0)}\right)^{2}}$
Root Mean Square Error | RMSE | $\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}}$
Root Mean Square Logarithmic Error | RMSLE | $\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\log\left(x^{(0)}(k)+1\right)-\log\left(\hat{x}^{(0)}(k)+1\right)\right)^{2}}$
Root Mean Square Percentage Error | RMSPE | $\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\frac{x^{(0)}(k)-\hat{x}^{(0)}(k)}{x^{(0)}(k)}\right)^{2}}$
Symmetric Mean Absolute Percentage Error | SMAPE | $\frac{1}{n}\sum_{k=1}^{n}\frac{\left|x^{(0)}(k)-\hat{x}^{(0)}(k)\right|}{0.5x^{(0)}(k)+0.5\hat{x}^{(0)}(k)}\times 100\%$
Theil U Statistic 1 | U1 | $\frac{\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}}}{\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(x^{(0)}(k)\right)^{2}}+\sqrt{\frac{1}{n}\sum_{k=1}^{n}\left(\hat{x}^{(0)}(k)\right)^{2}}}$
Theil U Statistic 2 | U2 | $\sqrt{\frac{\sum_{k=1}^{n}\left(x^{(0)}(k)-\hat{x}^{(0)}(k)\right)^{2}}{\sum_{k=1}^{n}\left(x^{(0)}(k)\right)^{2}}}$
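Most of these metrics are one-liners in practice. As a sanity check on the definitions above, the sketch below implements a representative subset (MAPE, RMSE, IA, U1, U2) for NumPy arrays x (observed) and xh (forecast); the function names are ours, not from the original codebase.

```python
# Direct implementations of several Table 2 metrics for NumPy arrays.
import numpy as np

def mape(x, xh):
    return float(np.mean(np.abs((x - xh) / x)) * 100.0)

def rmse(x, xh):
    return float(np.sqrt(np.mean((x - xh) ** 2)))

def index_of_agreement(x, xh):
    num = np.sum((x - xh) ** 2)
    den = np.sum((np.abs(x - x.mean()) + np.abs(xh - xh.mean())) ** 2)
    return float(1.0 - num / den)

def theil_u1(x, xh):
    return float(rmse(x, xh) /
                 (np.sqrt(np.mean(x ** 2)) + np.sqrt(np.mean(xh ** 2))))

def theil_u2(x, xh):
    return float(np.sqrt(np.sum((x - xh) ** 2) / np.sum(x ** 2)))
```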
Table 3. Results of the metrics of the ten models with time lag τ = 18.

Metric | PLC-SVM | SVM | LSSVM | GPR | RF | LGBM | XGB | CATB | GRU | AR
AE | −30.4026 | −31.9230 | −131.3443 | −144.5654 | −105.5290 | −135.2636 | −86.1456 | 251.4290 | 161.4449 | −79.9688
ARE | 0.0372 | 0.0379 | 0.0489 | 0.0511 | 0.0387 | 0.0478 | 0.0398 | 0.0866 | 0.0622 | 0.0423
IA | 0.9371 | 0.9347 | 0.9114 | 0.9224 | 0.9492 | 0.9194 | 0.9354 | 0.5436 | 0.8856 | 0.9202
MAAPE | 0.0372 | 0.0379 | 0.0488 | 0.0510 | 0.0386 | 0.0477 | 0.0397 | 0.0859 | 0.0621 | 0.0422
MAE | 115.2908 | 117.3400 | 145.8265 | 157.6530 | 116.6060 | 144.3825 | 119.7081 | 290.9737 | 193.6625 | 127.7884
MAPE | 3.7240 | 3.7935 | 4.8940 | 5.1123 | 3.8695 | 4.7775 | 3.9777 | 8.6588 | 6.2247 | 4.2286
MedAe | 94.8200 | 97.0436 | 123.2384 | 135.7107 | 106.0792 | 119.4479 | 88.2682 | 205.1306 | 178.9841 | 95.8115
MPE | −1.3489 | −1.4058 | −4.5055 | −4.7546 | −3.5528 | −4.5172 | −3.0012 | 7.2134 | 5.0866 | −2.9021
MSE | 19,396.6580 | 19,970.1730 | 32,269.8433 | 33,986.8550 | 20,322.6371 | 33,104.5516 | 24,215.1642 | 146,391.5235 | 51,779.6932 | 26,994.5302
MSLE | 0.0020 | 0.0021 | 0.0034 | 0.0033 | 0.0022 | 0.0034 | 0.0026 | 0.0142 | 0.0060 | 0.0028
NRMSE | 0.1181 | 0.1199 | 0.1524 | 0.1564 | 0.1209 | 0.1543 | 0.1320 | 0.3245 | 0.1930 | 0.1394
Pibas | −0.0096 | −0.0101 | −0.0402 | −0.0440 | −0.0325 | −0.0413 | −0.0267 | 0.0871 | 0.0543 | −0.0249
R2 | 0.8228 | 0.8175 | 0.7051 | 0.6895 | 0.8143 | 0.6975 | 0.7787 | −0.3376 | 0.5269 | 0.7533
RMSE | 139.2719 | 141.3159 | 179.6381 | 184.3552 | 142.5575 | 181.9466 | 155.6122 | 382.6115 | 227.5515 | 164.3001
RMSLE | 0.0446 | 0.0453 | 0.0587 | 0.0575 | 0.0464 | 0.0582 | 0.0506 | 0.1192 | 0.0777 | 0.0534
RMSPE | 0.0457 | 0.0464 | 0.0615 | 0.0599 | 0.0481 | 0.0610 | 0.0529 | 0.1087 | 0.0739 | 0.0559
SMAPE | 3.6733 | 3.7405 | 4.7180 | 4.9439 | 3.7609 | 4.6019 | 3.8561 | 9.2711 | 6.4815 | 4.1009
U1 | 0.0220 | 0.0223 | 0.0279 | 0.0286 | 0.0222 | 0.0282 | 0.0244 | 0.0633 | 0.0370 | 0.0257
U2 | 0.0441 | 0.0448 | 0.0569 | 0.0584 | 0.0452 | 0.0577 | 0.0493 | 0.1213 | 0.0721 | 0.0521
Table 4. Results of the metrics of the ten models with time lag τ = 24.

Metric | PLC-SVM | SVM | LSSVM | GPR | RF | LGBM | XGB | CATB | GRU | AR
AE | −69.5776 | −88.3138 | −146.5563 | −158.9098 | −122.4937 | −129.8408 | −140.6233 | 198.5871 | −290.2731 | −126.2264
ARE | 0.0396 | 0.0417 | 0.0504 | 0.0532 | 0.0439 | 0.0504 | 0.0505 | 0.0742 | 0.1009 | 0.0488
IA | 0.9309 | 0.9262 | 0.9178 | 0.9200 | 0.9365 | 0.9044 | 0.9078 | 0.6071 | 0.7168 | 0.9054
MAAPE | 0.0395 | 0.0416 | 0.0503 | 0.0531 | 0.0438 | 0.0502 | 0.0503 | 0.0738 | 0.1001 | 0.0486
MAE | 120.1745 | 125.5630 | 152.0696 | 163.6975 | 132.1869 | 152.4988 | 151.0061 | 248.7770 | 298.6850 | 145.9707
MAPE | 3.9617 | 4.1726 | 5.0412 | 5.3196 | 4.3856 | 5.0356 | 5.0492 | 7.4202 | 10.0853 | 4.8787
MedAe | 94.6035 | 104.8577 | 123.1539 | 143.7897 | 119.0864 | 116.1281 | 125.8849 | 169.8898 | 294.6203 | 123.9808
MPE | −2.5357 | −3.1289 | −4.8879 | −5.1840 | −4.1159 | −4.3672 | −4.7381 | 5.5953 | −9.8633 | −4.3340
MSE | 23,899.4588 | 26,147.9685 | 33,695.2393 | 36,364.6930 | 25,312.3053 | 39,167.0882 | 37,253.4219 | 106,653.0431 | 119,973.9310 | 35,560.0139
MSLE | 0.0025 | 0.0028 | 0.0035 | 0.0035 | 0.0027 | 0.0040 | 0.0039 | 0.0100 | 0.0122 | 0.0037
NRMSE | 0.1311 | 0.1372 | 0.1557 | 0.1617 | 0.1349 | 0.1679 | 0.1637 | 0.2770 | 0.2938 | 0.1599
Pibas | −0.0217 | −0.0274 | −0.0446 | −0.0482 | −0.0376 | −0.0397 | −0.0429 | 0.0676 | −0.0847 | −0.0387
R2 | 0.7816 | 0.7611 | 0.6921 | 0.6677 | 0.7687 | 0.6421 | 0.6596 | 0.0255 | −0.0962 | 0.6751
RMSE | 154.5945 | 161.7033 | 183.5626 | 190.6953 | 159.0984 | 197.9068 | 193.0115 | 326.5778 | 346.3725 | 188.5736
RMSLE | 0.0502 | 0.0527 | 0.0591 | 0.0594 | 0.0516 | 0.0629 | 0.0625 | 0.1002 | 0.1106 | 0.0611
RMSPE | 0.0524 | 0.0552 | 0.0620 | 0.0620 | 0.0536 | 0.0663 | 0.0659 | 0.0929 | 0.1193 | 0.0645
SMAPE | 3.8536 | 4.0426 | 4.8589 | 5.1372 | 4.2496 | 4.8406 | 4.8452 | 7.8382 | 9.4289 | 4.6886
U1 | 0.0243 | 0.0253 | 0.0285 | 0.0295 | 0.0248 | 0.0307 | 0.0299 | 0.0536 | 0.0526 | 0.0293
U2 | 0.0490 | 0.0513 | 0.0582 | 0.0604 | 0.0504 | 0.0627 | 0.0612 | 0.1035 | 0.1098 | 0.0598
Table 5. Results of the metrics of the ten models with time lag τ = 30.

Metric | PLC-SVM | SVM | LSSVM | GPR | RF | LGBM | XGB | CATB | GRU | AR
AE | −85.5618 | −123.2492 | −145.0989 | −157.4847 | −131.4612 | −128.6614 | −108.7032 | 173.3637 | −95.0323 | −138.1859
ARE | 0.0390 | 0.0458 | 0.0492 | 0.0517 | 0.0447 | 0.0486 | 0.0428 | 0.0676 | 0.0402 | 0.0494
IA | 0.9321 | 0.9161 | 0.9196 | 0.9215 | 0.9345 | 0.9086 | 0.9317 | 0.6547 | 0.9235 | 0.9063
MAAPE | 0.0389 | 0.0457 | 0.0491 | 0.0516 | 0.0446 | 0.0485 | 0.0427 | 0.0673 | 0.0400 | 0.0492
MAE | 117.0587 | 136.7596 | 148.3506 | 158.8722 | 135.0314 | 145.6343 | 128.2523 | 224.7215 | 120.2737 | 147.6295
MAPE | 3.8959 | 4.5838 | 4.9192 | 5.1695 | 4.4729 | 4.8630 | 4.2765 | 6.7565 | 4.0169 | 4.9359
MedAe | 94.3220 | 111.6924 | 114.1242 | 129.1694 | 121.9611 | 123.7311 | 108.4888 | 147.0317 | 76.1974 | 128.1547
MPE | −3.0065 | −4.2051 | −4.8233 | −5.1293 | −4.3722 | −4.3847 | −3.6771 | 4.8896 | −3.2737 | −4.6751
MSE | 23,675.5999 | 30,820.1597 | 32,348.9584 | 34,821.7870 | 26,206.9815 | 33,890.0351 | 26,217.1006 | 85,262.6857 | 29,073.8466 | 35,635.1807
MSLE | 0.0025 | 0.0033 | 0.0034 | 0.0034 | 0.0027 | 0.0036 | 0.0028 | 0.0080 | 0.0031 | 0.0037
NRMSE | 0.1305 | 0.1489 | 0.1525 | 0.1583 | 0.1373 | 0.1561 | 0.1373 | 0.2477 | 0.1446 | 0.1601
Pibas | −0.0266 | −0.0379 | −0.0444 | −0.0480 | −0.0404 | −0.0395 | −0.0336 | 0.0587 | −0.0295 | −0.0423
R2 | 0.7736 | 0.7053 | 0.6907 | 0.6671 | 0.7494 | 0.6760 | 0.7494 | 0.1848 | 0.7220 | 0.6593
RMSE | 153.8688 | 175.5567 | 179.8582 | 186.6060 | 161.8857 | 184.0925 | 161.9170 | 291.9977 | 170.5105 | 188.7728
RMSLE | 0.0502 | 0.0571 | 0.0579 | 0.0583 | 0.0522 | 0.0597 | 0.0525 | 0.0894 | 0.0553 | 0.0611
RMSPE | 0.0525 | 0.0600 | 0.0606 | 0.0608 | 0.0543 | 0.0628 | 0.0547 | 0.0837 | 0.0586 | 0.0644
SMAPE | 3.7758 | 4.4170 | 4.7445 | 4.9923 | 4.3319 | 4.6801 | 4.1388 | 7.0836 | 3.8643 | 4.7417
U1 | 0.0242 | 0.0274 | 0.0280 | 0.0290 | 0.0252 | 0.0287 | 0.0253 | 0.0479 | 0.0267 | 0.0294
U2 | 0.0490 | 0.0559 | 0.0572 | 0.0594 | 0.0515 | 0.0586 | 0.0515 | 0.0929 | 0.0543 | 0.0601
