Developing Feedforward Neural Networks as Benchmark for Load Forecasting: Methodology Presentation and Application to Hospital Heat Load Forecasting

Stienecker, Malte; Hagemeier, Anne

doi:10.3390/en16042026

Open AccessArticle

Developing Feedforward Neural Networks as Benchmark for Load Forecasting: Methodology Presentation and Application to Hospital Heat Load Forecasting

by

Malte Stienecker

^*

and

Anne Hagemeier

Fraunhofer UMSICHT, Fraunhofer Institute for Environmental, Safety, and Energy Technology, Osterfelder Str.3, 46047 Oberhausen, Germany

^*

Author to whom correspondence should be addressed.

Energies 2023, 16(4), 2026; https://doi.org/10.3390/en16042026

Submission received: 17 January 2023 / Revised: 10 February 2023 / Accepted: 16 February 2023 / Published: 18 February 2023

(This article belongs to the Section F5: Artificial Intelligence and Smart Energy)

Download

Browse Figures

Versions Notes

Abstract

:

For load forecasting, numerous machine learning (ML) approaches have been published. Besides fully connected feedforward neural networks (FFNNs), also called multilayer perceptron, more advanced ML approaches like deep, recurrent or convolutional neural networks or ensemble methods have been applied. However, evaluating the added benefit by novel approaches is difficult. Statistical or rule-based methods constitute a too low benchmark. FFNNs need extensive tuning due to their manifold design choices. To address this issue, a structured, comprehensible five-step FFNN model creation methodology is presented, which constitutes of initial model creation, internal parameter selection, feature engineering, architecture tuning and final model creation. The methodology is then applied to forecast real world heat load data of a hospital in Germany. The forecast constitutes of 192 values (upcoming 48 h in 15 min resolution) and is composed of a multi-model univariate forecasting strategy, with three test models developed at first. As a result, the test models show great similarities which simplifies creation of the remaining models. A performance increase of up to 18% between initial and final models points out the importance of model tuning. As a conclusion, comprehensible model tuning is vital to use FFNN models as benchmark. The effort needed can be reduced by the experience gained through repeated application of the presented methodology.

Keywords:

short-term forecasting; machine learning; feedforward neural network; benchmarking; feature selection; heat load prediction; energy demand; hospital

1. Introduction

1.1. Problem Statement

Accurate short-term load forecasting is a frequently addressed problem in scientific literature. Various short-term forecasting models have therefore been presented in the literature to improve the forecast accuracy on a given dataset. [1]

Earlier, the focus lay on novel rule-based models, like variations of the autoregressive integrated moving average (ARIMA) method [2]. With the emergence and successful application of machine learning techniques, the focus has then turned towards models based on artificial neural networks (ANN) [3]. Most of the first presented ANN models are now known as fully connected feedforward neural networks (FFNN), also termed Multilayer Perceptron, shallow neural networks or vanilla neural networks. With increasing computing power also more advanced machine learning (ML) based models for load forecasting have been proposed [4], like recurrent [2] or convolutional neural networks [5] or multiple models were combined to an ensemble [6].

However, more complex models consume more computational resources, which results in a higher energy demand for model training and tuning. This often conflicts with the underlying purpose of the proposed models, like increasing efficiency in energy systems operation. In such cases, only an increased forecast accuracy justifies more complex models. In fact, even if the forecast accuracy is increased, to what extend this increased accuracy actually results in a more efficient energy system operation is still an open research question [7]. Consequently, from an application perspective, model complexity could be seen as even more detrimental.

Therefore, novel approaches must prove to outperform existing, computationally less expensive forecast approaches. In the literature, three strategies of model performance evaluation can be found:

1.: Evaluation of the proposed model individually via performance scores, like mean absolute percentage error (MAPE), and comparison with the performance score values achieved by other forecasting models with a similar task presented in another publication.
2.: Evaluation of the proposed model via comparison with some other forecasting model, considered well accepted, adopted from another publication and applied to the same data set as benchmark.
3.: Evaluation of several models via comparison of their individual performance scores achieved on the same data set and presented in the same paper.

Regarding the first strategy, one must keep in mind that the values of performance scores depend on the underlying data set. For example, a certain MAPE value is easier to reach for a time series with variations around a higher value than for a time series with the same variations around a lower value. The difficulty to reach a certain MAPE value is hence base load dependent. Therefore, the value of a chosen performance score alone does neither reflect the difficulty of the forecasting task nor the quality of the presented model. Judging the difficulty of the forecasting task is assisted by presenting values for the same performance score of a different model applied to the same data set.

Evaluating the model performance by referring to the performance score of a different model, called benchmark, applied to the same data set, sums up the second evaluation strategy. Typical benchmarks are rule-based models like naïve forecasts or ARIMA models, for example used by Ryu et al. [8]. Such approaches are well defined and easily applicable. Therefore, they are very useful to evaluate simple ML models like FFNNs and have been used frequently in the literature. In many cases, the presented FFNN models outperform the rule-based benchmark models [9]. Therefore, since also simple ML models are likely to outperform rule-based models, rule-based models reflect a low benchmark, when presenting more advanced ML models. To assess the added value provided by a novel ML model, it must be compared with other, possibly simpler, or earlier proposed ML models also.

This is done in the third strategy of model performance evaluation, where several models are designed and applied to the same data set. As opposed to rule-based models, the procedure of model creation for simple ML models like FFNNs is, unfortunately, not well defined. Manifold design choices exist when creating FFNNs and no accepted model creation procedure exists. Kováč et al. [10] mention this as the biggest disadvantage of ANN models. Consequently, model performance depends on the design choices taken by the authors and in turn on their experience in creating FFNN models. Naturally, experience is diverse among different research groups and different researchers. Hence the reader needs to be enabled to judge whether the chosen benchmark is performing as well as it could and is thereby trustworthy. To do so, not only the presented novel approach but also the model used as benchmark needs to be explained in detail. Understandably, novel approaches typically get published solely if they outperform the competing benchmark. However, whether a superior model performance is of general nature and not based on an experience bias towards the presented novel approach, remains unclear without detailed benchmark description. Interpreting the model performance is then impossible for the reader, which may prevent adopting the proposed methodology. As a result, a well-designed and thoroughly described benchmark model is in the interest of both the author and the reader.

Unfortunately, neither well-designed nor thoroughly described benchmark models are standard. Amasyali et al. [11] as well as Tealab et al. [12] and Haben et al. [1] state in their review papers that presenting insufficient information both about the proposed models and the benchmark is a common issue in forecasting literature. This incorporates the problem that proposed approaches may be compared with overly simple models with poorly tuned parameters [7].

In 2001, Hippert et al. [3] stated that comparing ML models with FFNN-based models—although the most common—is not valid since FFNNs can yet not be considered standard or well accepted. Twenty years later we, the authors, believe that FFNNs can now be seen as a well-accepted, simple ML forecasting method and can hence serve as a meaningful benchmark for more advanced ML approaches. This is supported by the fact, that FFNNs still comprise the most popular ANN models [13] and that they have already frequently been used as benchmark in the literature [1].

To the best of our knowledge, no structured methodology to create FFNN benchmark models has yet been presented in the literature. Of course, all aspects of FFNN model creation like basic hyperparameter tuning, feature selection or input scaling have been studied before, but–as a results of a missing widely accepted methodology–how these tasks are tackled still varies greatly within the literature.

This publication therefore addresses the research gap that a structured, well defined FFNN model creation procedure is still missing. Our contribution, therefore, is the development of a five-step model creation methodology for FFNN benchmark model creation within the field of short-term load forecasting. It does not contain novel approaches but combines known procedures in a structured manner. Model developers can follow the presented approach instead of the diverse approaches that have been used so far. The selected parameter choices as result of the five-step methodology should then be stated in the corresponding publication.

As one advantage, this increases interpretability of obtained results and highlights, which design choices affected model performance more strongly. Even though the overall best performing FFNN model may not be achieved, following a structured model development makes benchmarking with the developed model more meaningful. Furthermore, the methodology emphasizes the manifoldness of design choices for creating FFNN forecasting models, such that no important aspects get overlooked. This decreases the dependency of chosen FFNN model structure and model performance on the individual developer’s experience.

Finally, through multiple application of the proposed methodology, data set independent answers are gathered regarding beneficial feature selection, input representation, input scaling as well as model architecture. These answers will clarify whether some value choices are applicable independent of forecast horizons, different object types, like residential households or districts, or even different forecast targets, like electricity demand, cooling load or primary energy usage. Profound knowledge about relevant design choices regarding FFNN model performance may decrease the effort and the computational burden needed for creating meaningful benchmark models.

1.2. Case Study

The developed methodology will be exemplified by a case study application. A strong motivation for accurate short-term load forecasting is optimal scheduling and control–also known as optimal operation–to lower operational costs and enhance the integration of distributed, intermittent renewable energy sources [14]. For this purpose, load forecasting is needed with a prediction horizon from one step ahead—e.g., 15 min–up to 48 h ahead, depending on the specific circumstances.

As very energy intensive buildings [15], which are, in addition, typically located centrally within districts, hospitals are especially promising for demand-side management. In addition, hospitals have high demands for both electricity and heat. Generation units like combined heat and power units are, therefore, common in hospitals [16]. Such cross-energy systems in combination with storage units are particularly suited for flexibility assessment. Hospitals are complex buildings and the energy demand between different hospitals differs largely due to the overall size, equipment, specializations and energy efficiency. Therefore, load profile generation is challenging, which favors usage of easily applicable data-driven forecasting models.

Very limited publications regarding load forecasting of hospitals exist. On 21 November 2022 a literature review on Web of Science [17] was performed, searching for the following phrase combination in either title, topic or abstract: (load* OR demand* OR consumption*) AND (predict* OR forecast*) AND (thermal OR heat* OR energy) AND hospital.

284 results were found from which 27 publications actually address load forecasting. Even though terms like ’power’ or ’electricity’ were left out purposely, most of these publications address electric power load forecasting. Only three publications [18,19,20] additionally investigate heat load forecasting also.

Ma et al. [18] predict power, heat and cooling load jointly by a genetic-algorithm back-propagation neural network. Power, heat, and cooling load is predicted for the next four hours in a 15 min resolution, such that the neural network has 48 output nodes. The proposed model is compared with a support-vector-machine-based neural network and a Kalman-filter-based method.

Manno et al. [19] predict power and heat load with two identical FFNN models trained independently with their corresponding target data (power or heat, respectively). Power and heat load is predicted for the next 24 h in hourly resolution. The FFNN model is compared with an ARIMA model, a support vector machine and a long-short term memory network.

Kristofersen et al. [20] performed heat load predictions for the lab center and the gastro center, i.e., for two buildings belonging to the complex of hospital buildings. They predicted the heat load for 15 individual points in time with five different methodologies. Among Decision Tree, Bagging Decision Tree, Support Vector Regression, AdaBoost and RandomForest the latter two models performed best.

Thereby, isolated investigation of short-term heat load forecasting for hospitals has seemingly not been carried out before. Consequently, heat load forecasting of hospitals is seen as a relevant, yet poorly investigated field of research and was, hence, chosen as case study to apply the developed model creation methodology.

2. Materials and Methods

2.1. Data Acquisition

To forecast the hospital’s heat load, previous load data, weather related features as well as time related features were considered as inputs. The hospital used for this study is a medium size hospital with 300 beds, located in Hattingen, Germany. The main, fourteen-story building was built in 1963 and later supplemented with some smaller extensions. The total floor area of the building is 17,500 m

^{2}

. Additionally, an administrative building and a nursing home is supplied with heat by the hospital’s units. Provision of heat is achieved by a gas-fueled CHP plant with generation capacities of

250 {k W}_{e l}

and

380 {k W}_{t h}

, two gas-fueled boilers with

750 {k W}_{t h}

generation capacity each, and a hot water heat storage with a volume of 14

m^{3}

. The measurements were taken in one minute resolution by the Dynasonics DXN Portable Hybrid Ultrasonic Flow Meter from Badger Meter [21], measuring inlet and outlet temperature as well as flow rate. Measurements started in December 2018. Unfortunately, no joint point of heat supply suitable for performing the ultrasonic flow measurements could be found. Instead, heat supply of all devices had to be measured separately. This comes with two complications:

Three separate measurements to acquire a complete load curve are more fault-prone than a single measurement. Hence, it took until March 2020 to obtain consistent and fairly continuous load measurements.
Measuring the heat flow close to the generation units includes measurements of their specific operation characteristics, like starting and stopping behavior or discrete pumping thresholds. This results in a volatile load curve reflecting generation rather than demand.

Before dealing with the latter complication, preprocessing of the data with help of Python [22] and Pandas [23] was performed. Faulty measurements were removed and–in case of isolated single errors–replaced through linear interpolation. Furthermore, based on the inlet

T_{i n l e t}

and return temperatures

T_{r e t u r n}

as well as the flow rate data

\dot{V}

, heat supply

\dot{Q}

was calculated. The formula used for calculations is

\dot{Q} = ρ \cdot c_{p} \cdot \dot{V} \cdot (T_{i n l e t} - T_{r e t u r n}),

(1)

ρ

denoting the fluid density of water,

c_{p}

the specific heat capacity at constant pressure p. All temperatures given in degrees centigrade.

Having a preprocessed heat generation profile, one can try to convert this into a heat load profile. The measured heat generation profile mainly differs from the desired heat load profile due to the thermal inertia of the distribution pipes and radiators within the building. The heat generation curve was, therefore, smoothed by averaging over a centered Gaussian window. Different window sizes and standard deviations were tested. To remove all switching fragments, the window was, finally, chosen to be 180 min long with a standard deviation of 40 min. The Pearson correlation coefficient between the raw and the smoothed data is 0.93 while the total heat demand remains unchanged. As an example, Figure 1 displays both the measured heat generation and the smoothed heat load profile over one week in July.

Finally, by aggregating to a 15 min resolution, a heat load time series from March 2020 to April 2021—with some gaps in between—was obtained. From the full data set, data between 1 August 2020 and 11 February 2021 has been removed. This is a consequence of an upcoming research project, where the created prediction models will be used to forecast that period of heat load. Hence, this data may not be used during the model creation procedure. Thereby, 17,000 heat load values, covering roughly half a year, remain for training, validation, and testing of the FFNN prediction models, including values from summer as well as autumn and winter.

In addition, measurements regarding ambient temperature and relative humidity have been performed. However, here too, faulty and missing data was observed. Therefore, the local measurements were compared with measurements from Essen-Bredeney, provided by the Deutscher Wetterdienst (DWD) [24]. To keep the composite time series as continuous as possible, DWD weather measurements were eventually used.

Lastly, the composite time series also comprises time related features. Features like time-of-day or day-of-week can simply be extracted from the time stamps. Additionally, public holidays were obtained from the holidays package [25] and school holidays for North Rhine-Westphalia were added manually.

2.2. Forecast Composition

The first step of every forecast approach is to clarify which values need to be predicted and when. For short-term load forecasting, predicting only one time step ahead is typically not sufficient. More common are day ahead forecasts, meaning that at some point in time today (e.g., before noon) the predicted values for a larger time window like the entire next day (e.g., in the form of 24 hourly or 96 quarter-hourly values) must be available. This determines the necessary forecast horizons. In the given example, the forecast horizons start with 13 h ahead (just before noon to just after midnight) and end with 36 h ahead. Four different strategies to compose the 24 predicted hourly heat load values can be distinguished:

1.: Single-Model Univariate Forecasting (singleXone)
2.: Single-Model Multivariate Forecasting (singleXseq)
3.: Multi-Model Univariate Forecasting (multiXone)
4.: Multi-Model Multivariate Forecasting (multiXseq).

The different possibilities of composing a forecast consisting of 24 predicted values are sketched in Figure 2. Along the horizontal axis, the number of models and, along the vertical axis, the number of outputs from each model varies from 1 to 24. Both scales are displayed logarithmically with base two. By A, B, C and D, four examples of the four strategies are highlighted.

The black dot in the bottom left corner of the triangle, marked with an A, represents the only possibility for a single-model univariate forecast (e.g., used in [26]). In this case, a single model with only one output neuron (singleXone) is used to predict the 24 values iteratively by 24 executions of the same model with different input sets. The largest necessary horizon is chosen for that model to be applicable for all 24 values, i.e., 36 h ahead in the above example. The benefit is that only a single model needs to be developed and trained. However, by using the largest necessary horizon for all values, most recent observations are not taken into account for values at the beginning of the necessary time window. The forecast accuracy typically decreases with increasing forecast horizon. As a result, the large forecast horizon can be a significant drawback of the singleXone strategy.

The single-model multivariate forecasting strategy reduces the average forecast horizon. This strategy (e.g., used in [13]), where a single model predicts a sequence of values through multiple outputs each time it is executed (singleXseq), is represented by the black vertical line. B highlights the example of a single model with six outputs (1X6). In the above example, these six outputs would represent the horizons 31, 32, 33, 34, 35 and 36 h ahead, decreasing the average horizon to 33.5 h ahead. The model would be executed four times to collect all 24 values. The extremum case in the top left corner of the triangle, is a single model which predicts all 24 values at once, resulting in the shortest achievable average horizon of 24.5 h ahead. As a benefit, this strategy hence reduces the average forecast horizon and can result in an increased forecast accuracy. Moreover, dependencies among the predicted values might be represented better by a multivariate forecast. As a drawback however, the increased number of model outputs does also increase model complexity. A neural network with 50 neurons in the final hidden layer (including bias unit) has 50 trainable weights in case of a single output and 1200 trainable weights in case of 24 outputs. Especially in case of data scarcity, the increased complexity might counterbalance the benefit of a reduced average horizon. Instead of multiple sequential, also multiple isolated values (e.g., 24 and 48 h ahead) could be predicted at once, within a multivariate forecasting approach. Following the above arguments, this is a promising approach, if the isolated values strongly depend on each other. However, to the best of our knowledge, no case study for FFNN heat load prediction following that approach has yet been presented.

While keeping the individual model complexity low, the average forecast horizon is also reduced by a multi-model univariate forecasting strategy. In this approach, represented by the black horizontal line, multiple models predict one heat load value in each execution (multiXone). C highlights the example of using four independent models with a single output each (4X1). These models would, hence, have a forecast horizon of 18, 24, 30 and 36 h ahead, resulting in an average forecast horizon of 27 h ahead. Besides, the latter model is thereby equivalent to the model used in the singleXone strategy. Each model still gets executed iteratively six times to receive the necessary 24 values, but the average horizon is already reduced significantly. Additionally, multi-model strategies open up the opportunity of using different methodologies for different forecast horizons, for example usage of recurrent neural networks for predictions up to 24 h ahead and FFNNs above. As a drawback, instead of a single model, this forecast composition strategy requires development and training of four individual models. Using multiple independent models also adds discontinuities to the forecast, originating from the shift between models. Whether this comes with practical issues in forecast applications has yet to be investigated. If so, discontinuities could–at least partly–be tackled by ensemble methods like averaging two overlapping predictions for the time steps before and after each model shift.

The final strategy is multi-model multivariate forecasting (multiXseq), represented by the entire hatched, green area in Figure 2, with the corresponding example highlighted by D. In this example, four models with six outputs each are used (4X6), such that each model only gets executed once. The six outputs of the first model cover the forecast horizons 13, 14, 15, 16, 17 and 18 h ahead. The six outputs of the second model consequently cover the following six hours and so on. The average forecast horizons of the four models are hence 15.5, 21.5, 27.5, and 33.5 h ahead, such that the overall average forecast horizon is again 24.5 h ahead, the shortest horizon possible. The benefits and drawbacks of a multi-model and a multivariate forecasting strategy were already mentioned above.

In summary, the vertical edge of the triangle in Figure 2 represents all possible compositions with a single model, the horizontal edge of the triangle all possible compositions with a single output and the hypotenuse all possible compositions with a single execution of each model. In addition, the hypotenuse also reflects all possible compositions resulting in the shortest achievable forecast horizon. Most publications choose one of the three triangle’s corners as their forecasting strategy [3].

In the presented case study, a heat load forecast of the upcoming 48 h in 15 min resolution (192 predicted values) is required. Since data scarcity is an issue, the models were kept as simple as possible by omitting a multivariate output. To reduce the forecast horizon and improve forecast accuracy, a multi-model approach was chosen, resulting in a 48X1 multi-model multivariate forecasting strategy. Hence, 48 models, each being responsible for four quarter-hourly values within the full forecast, need to be developed.

2.3. Model Creation Methodology

With the necessary data acquired (Section 2.1) and the forecast composition clarified (Section 2.2), prediction models can be created.

As opposed to single-model forecasting (singleXone and singleXseq), multi model forecasting (multiXone and multiXseq) requires several models. However, the given task of each model is very similar, since only the forecast horizon varies. In the presented case study, the first model predicts one hour ahead, the second model two hours ahead and so on. As a hypothesis, due to the similar task of each model, well performing FFNN models for each task may have many properties in common. To test this hypothesis, only three models with three different prediction horizons, termed test horizons, were initially developed. Three test horizons were chosen, because three is considered the minimal number of models, where common properties are unlikely to be accidental. The three chosen test horizons are 2, 22 and 42 h ahead. The three horizons reflect a relatively short, a medium and a relatively long horizon compared with the full time period needed (48 h ahead). Thereby, they reveal differences between short-term and long-term forecasting within the necessary forecast horizons. The exact numbers of 2, 22 and 42 could have been chosen differently as well. Purposely avoided, however, were horizons that coincide with the seasonality of the measured data (e.g., 24 or 48 h ahead). As a result of the three test horizons, the properties shared by all three best performing test models will be kept constant for all other models without further testing. Depending on the number of common properties, this simplifies model development for the remaining 45 models considerably.

In the presented case study, data scarcity is a concerning issue. Therefore, five-fold nested cross-validation is applied and average scores of all folds presented. The 20% test data were separated en bloc. This prohibits that test data is included in the lagged input features during training. The validation data drawn from the 80% training data, in contrast, is selected randomly, possibly including training data in their lagged input features.

The model creation methodology for each of the three test models constitutes of five steps:

1.: Initial Model Creation
2.: Internal Parameter Selection
3.: Feature Engineering
4.: Architecture Tuning
5.: Final Model Creation

In the first step, a set of initial parameters is defined. This set of parameters defines the initial model and serves as a starting point for further model tuning. The idea is to have a rather simple model that proves to be capable of learning the underlying patterns. That model’s predictions should visibly follow daily or weekly patterns and have the same scale as the given targets. Manual tests of isolated parameters regarding the number of hidden layers and hidden neurons were performed, to test whether this increases model performance significantly. A suitable validation split, batch size, learning rate as well as the number of necessary training epochs for all considered learning algorithms were also determined. This ensures a fair comparison among the considered learning algorithms and decreases training time in the following steps by faster convergence and the least possible number of epochs. The initial parameter set is chosen based personal experience (e.g., previous model developments) and relevant literature.

In the second step, combinations of internal parameters are tested randomly, and suitable values for these parameters are selected. Internal parameters include among others the kernel and bias initializations as well as the activation functions for each layer, the loss function and the learning algorithm. These parameters, investigated in this step, are expected to yield value choices, which barely depend on the exact number of input units, input representation or the precise number of hidden neurons. In turn, early tuning of the learning algorithm or the batch size decreases training time and, hence, computational burden of the ongoing model development in the following steps.

In the third step, influential input features and their representations are investigated. Besides selecting the input features, appropriate representation of the information within these inputs is an important, yet sometimes barely touched aspect of model tuning.

One typical example is the time-of-day input. Apart from the popular representations by a scalar continuous feature (0 to 1) or a vector of binary features from one-hot-encoding (

[1, 0, \dots, 0]

to

[0, 0, \dots, 1]

), further possibilities exist. Two of which are, encoding time of day circularly by sine-cosine functions in two features (e.g., used in [27]) or encoding time categorically by five binary features in case of an hourly resolution (e.g., used in [28]). Furthermore, time-of-day can be given in local time, possibly containing clock changes, or in coordinated universal time (UTC), representing time continuously. Step three clarifies whether one of these representations results in a superior model performance. Testing of the different parameter combinations is based on a random search approach.

In the fourth step, random search is again used to tune the model architecture. Besides the number of hidden layers and hidden neurons in each layer, this involves the number of lags used as input features. Lagged values for both historical measurements and forecasted exogenous observations are taken into account, since historic measurements from the last hours, last day or—in case of a weekly pattern—last week typically contain valuable information.

In the fifth and last step, the final prediction models are created. Several models with the identical parameter set, termed trials, are created and trained, to account for the random weight initialization. As opposed to the nested cross-validation approach from the previous steps, the models in this final step are trained on the full data set, to make the model applicable to all seasons. From all trials the best performing model, evaluated on the entire data set (in-sample evaluation), is selected and saved as final prediction model for that horizon.

Model creation was performed with Python 3.8 [22], TensorFlow 2.3 [29] and Scikit-learn 0.24 [30]. Data handling and visualization was done with Pandas 1.4 [23] and matplotlib 3.3 [31].

3. Results

The presented methodology was applied to a heat load data set of the hospital in Hattingen, Germany. Figure 3 displays both the measured heat load data and the ambient temperature data. The y-axis of the temperature data, on the right-hand side, is inverted, to highlight the correlation between heat load and ambient temperature. Due to data scarcity, five-fold nested cross-validation was applied, such that the different colors of the heat load data correspond to the different test data sets.

3.1. Initial Model Creation

For the initial parameter sets, identical parameters were chosen for all three test horizons, apart from the temperature prediction input. Few parameter alterations between the three horizons were tested but did not result in significantly improved performance and were hence omitted. Only the temperature predictions, used as input, were adapted to the forecast horizon. The full list of FFNN parameters for the initial models is presented in the Appendix A in Table A1.

Table 1 lists the average MAPE values from all five folds of the three test horizons. For comparison, a persistence forecast is used. The persistence forecast is a naïve forecasting method, where measurements are shifted forward by a fixed time period to obtain the predicted values. The corresponding MAPE scores for the persistence forecast are also listed in Table 1.

Due to a daily seasonality, the accuracy of the persistence forecast is typically improved if the forecast horizon meets the daily seasonality. This means that the persistence forecast with 24 h horizon outperforms the persistence forecast with 22 h horizon. Consequently, the 22 h ahead FFNN model (+22-model) is also compared with the 24 h ahead persistence forecast, and the 42 h ahead FFNN (+42-model) is compared with the 48 h ahead persistence forecast. The MAPE scores of the +24-persistence forcast and the +48-persistence forecast are 12.5% and 17.8%, respectively. Irrespective of the specific horizon, the initial FFNN models clearly outperform the persistence forecasts.

For the +42-model, the MAPE for the five folds are 12.1%, 9.4%, 8.6%, 7.7%, 8.2% for the five test sets from left to right in Figure 3, starting with the green period around April 2020. Thereby, worst performance is achieved for the first test period around April 2020, highlighted in green, and best performance for the test period containing the peak heat load in February 2021, highlighted in blue.

3.2. Internal Parameter Selection

Table A2 in Appendix A lists all parameters and their value choices considered during the internal parameter selection. The value choices selected for implementation are highlighted in bold.

Figure 4 exemplifies the results regarding the optimizer selection for the +2-test-horizon. For this representation, all tested models were sorted by the optimizer used and colored accordingly. On the x-axis, a consecutive integer number is used as model identifier. On the y-axis, the average MAPE from the five folds is given. About 40,000 models with different randomly chosen parameter combinations were trained and tested. All models represented by the same color use the same optimizer but differ in all other parameters. Few duplicates–in the sense of models with identical parameter combinations–are possible, due to random parameter selection. This display illustrates whether a specific optimizer allows for a higher achievable model performance irrespective of all the other model parameters.

As can be seen in Figure 4, best performing models either use the Adam (beige) or the Adamax optimizer (blue). The minimal achievable MAPE is visibly worse, when Adadelta (green) or RMSprop (brown) is used, which hence seem to hinder optimal model performance. Similar results were obtained for the two other test horizons also. Since Adamax is a variation of the Adam optimizer, the latter was selected.

By means of the same analysis, the following results were obtained for the other parameters in question: For the activation function, the best models using the rectified linear unit (ReLU) outperform the best models using other activation functions, like the sigmoid or the hyperbolic tangent function. As loss function, the mean absolute error (MAE) was observed to be beneficial for model performance compared to the mean squared error (MSE). Regarding the batch size, batches between 32 and 1024 samples were tested, while equally well performing models were obtained up to a batch size of 128 samples. Above 128 samples, model performances decreased. Since a higher batch size increases training efficiency, 128 samples were selected as batch size.

Of rather low important for the model performance were the initializers. All initializers both for kernels and the bias units resulted in a comparable model performance. The default initializers for dense layers from tensorflow.keras, being glorot_uniform for the kernel and zero for the bias units, were, hence, selected.

Significant differences among the three test horizons were only observed for the scaling choice. A common observation is that omitting any scaling causes below average performance. Less clear among the three horizons is which scaling results in the best performing models. For the +42-models, min-max scaling to [0, 1] for both heat and ambient temperature clearly outperforms the other scaling choices, whereas all three scaling choices result in equally well performing models for the other two test horizons +2 and +22. Since min-max scaling to [0, 1] for both heat and temperature proved beneficial for one test horizon and produces comparable results for the other test horizons, it was selected for further usage.

3.3. Feature Engineering

During feature engineering, relevant inputs and their representations are determined. Table A3 lists all considered inputs and their corresponding representation choices.

Using previous heat demand values and ambient temperature predictions as inputs is known to be beneficial for model performance. This has been shown by publications earlier (e.g., in [32]) and was validated by manual tests for the presented case study. These two, therefore, comprise an inherent component of the input and excluding them is not tested.

Besides using previous ambient temperature values as inputs, relative humidity as input was considered, both as previously observed and predicted values. For all considered inputs, different numbers of features representing the contained information are tested, to avoid that the input is excluded simply due to the fact that the information was underrepresented within the full set of input features. Other weather-related inputs, like irradiance, wind speed or wind direction were not considered. Firstly, these data were unavailable for the object of interest. Secondly, measurements from other, even nearby locations may differ considerably to the conditions at that specific building. Furthermore, lastly, including for example wind speed information has not improved the forecast accuracy in other investigations regarding heat load forecasting (e.g., shown in [33]).

From the considered inputs, relative humidity and weekday inputs proved irrelevant in all cases for the presented case study. Slightly beneficial for model performance was using the holiday information as input in the form of a scalar binary feature, being one for Sundays and bank holidays and zero else.

Of importance proved to be the time-of-day input. Considered was a scalar continuous feature and a sine-cosine representation by two features, both with the time-of-day in CET and UTC. Not considered was the one-hot-encoding because this approach results in a large number of input features (95 in case of a 15 min resolution) and thereby a large number of trainable weights in the model. Figure 5 displays the MAPE scores of all tested parameter combinations during feature engineering. The combinations are sorted and highlighted in different colors according to their time-of-day input.

The highest MAPE scores–and thus lowest performances–are achieved by omitting time-of-day as input entirely (beige). The lowest MAPE scores in turn, are achieved by using the circular representation of time. The performance differences between the circular, the scalar, and no time input, are significant, whereas the differences between using time in CET and UTC are marginal. For further usage, the circular representation of time-of-day in CET (green) was selected.

3.4. Architecture Tuning

During architecture tuning, the precise layout of the neural network and the number of features containing lagged values is determined. Table A4 lists all considered parameters and their corresponding value choices for architecture tuning.

Figure 6 gives an overview of the results for the +22-models depending on the number of lagged heat load observations (A) and the number of lagged ambient temperature predictions (B) as features. All tested models are sorted by the number of lagged features used, from zero features on the far left (green) and twenty features on the far right (grey). As an example, when t denotes the time of prediction making and three lagged heat load features are used, than the observed heat load values at times

t - 1 h

,

t - 2 h

and

t - 3 h

are included in the input vector. Similarly, three lagged features for the ambient temperature prediction correspond to the three hourly values prior to the desired time stamp. For the +22-horizon, this corresponds to the values

t + 22 h

,

t + 21 h

and

t + 20 h

. Predictions for a time stamp ahead of the prediction horizon, e.g.,

t + 23 h

or

t + 24 h

, are not considered, since no predictive control of the heating system was yet implemented at the object of interest.

As Figure 6 highlights, models either using no previous heat load observations (zero features) or no ambient temperature predictions as input clearly underperform. In these cases, the minimal MAPE scores are slightly above (A) or slightly below (B) 12.5%. Compared to the overall best performing models with MAPE scores of less than 7.5%, this reflects an increase of about five percentage points in both cases. For the shorter, +2-horizon, the importance of previous heat load observations increases (difference of 10 percentage points between 16% and 6%) and the importance of ambient temperature predictions decreases (1 percentage point between 7% and 6%). For the longer, +42-horizon, the opposite was observed, with a difference of 4 percentage points for the heat load input and 9 percentage points for the ambient temperature input.

A closer look on the cases from one lagged feature up to twenty lagged features, similar to the presentation in Figure 4, reveals performance differences among these choices, also. The optimal number of lagged features, resulting in minimal MAPE scores, is consistent among the three test horizons. Therefore, the optimal numbers of lagged features were selected identically for all horizons as ten, in case of previous heat load observations, two, in case of previous ambient temperature observations, and three, in case of ambient temperature predictions.

Regarding the number of hidden layers, two hidden layers proved sufficient. Adding a third layer did not visibly improve model performances. More important than the number of layers was the number of trainable weights within the FFNN model. Together with the number of input features, the number of hidden neurons in the hidden layers determines the number of trainable weights. Both too few and too many trainable weights were disadvantageous. The suitable range, however, depended on the forecast horizon. While model performance was observed independent of the number of hidden neurons in the second layer in all cases, significant differences among the models with different numbers of hidden neurons in the first layer occurred. Figure 7 depicts the model performance depending on the number of hidden neurons in the first layer in case of the +2-horizon (A) and the +42-horizon (B).

While minimal MAPE scores are achieved by models having 70 or more hidden neurons in case A, 20 or less hidden neurons is the optimal number in case B. In fact, having 70 hidden neurons is disadvantageous for model performance in case B. For larger forecast horizons, optimal model performance was, therefore, achieved by fewer trainable weights, than in case of the shorter forecast horizon.

The minimal MAPE score achieved during architecture tuning for the three test horizons are listed in Table 2. Compared with the performance of the initial models from Section 3.1, the performance is increased by 18%, 7% and 12%.

3.5. Final Model Creation

In the last step, the final 48 models are created. In the previous steps, the necessary parameters for the three test models with horizons +2, +22 and +42 h were determined. For only one parameter, the selection of a specific value was observed advantageous for one test horizon, while the same value was disadvantageous for another test horizon. This confirms the hypothesis from Section 2.3, that the optimal FFNN models have many properties in common, due to the similar task. Therefore, only one parameter requires tuning for the other 45 models. This parameter is the number of hidden neurons in the first hidden layer, denoted in the following as

n_{1}

. From Figure 7, it can be seen that no single sweet spot for the number of neurons exists. Contrariwise, a broader range of hidden neurons results in well performing models. Therefore, architecture tuning was repeated for five more test horizons and their number of necessary hidden neurons for minimal MAPE scores determined. Figure 8 displays the experimentally determined numbers depending on the forecast horizon.

The necessary number of neurons for the eight test models roughly follows an exponential decay. The number of neurons

n_{1}

for the additional 40 forecast horizons H in hours were, therefore, selected based on the fitted exponential decay:

n_{1} (H) = 86.3 \cdot e^{- 0.055 H} .

(2)

As mentioned in Section 3.4, the number of hidden neurons in the second layer has no visible effect on model performance. However, to accompany the decreasing need of trainable weights within the models, the number of neurons in the second hidden layer is once changed from 30 neurons for all models with a horizon below 32 h ahead, to 20 neurons for models with a larger forecast horizon.

With the numbers of hidden neurons selected, all 48 final models could be created. For final model creation, ten equivalent models for each horizon were trained and tested on the full data set (in-sample evaluation). Depending on the number of trainable weights, training and testing of each model takes between six and ten seconds on a standard CPU. From the ten models, the best performing model was selected for implementation. Figure 9 displays the in-sample MAPE scores of all forecast horizons.

As expected, the MAPE scores generally increase with larger forecast horizons. Moreover, the in-sample scores are slightly better (about 1 percentage point for the three test horizons), than the out-of-sample scores from Section 3.4. The increased training set might, however, also results in an increased model performance. To exemplify the models’ forecasts, predictions and targets for the week with the largest and the smallest occurring MAPE value are shown in Appendix B.

The maximum mean error of all 48 final models for the full time period is

- 7.9

k

W

for the +39-model. This corresponds to a mean underestimation of 1.8%, regarding the average heat load of 439

k

W

. The mean errors on a daily basis are naturally larger than for the full time period (between

- 101

k

W

and

+ 117

k

W

with a standard deviation of

20.7

k

W

). With a median of

1.6

k

W

and a mean of

1.7

k

W

, the daily errors distribute fairly equally around and are on average close to zero. Hence, the errors approximate white noise, which suggests well fitted prediction models.

4. Discussion

4.1. Case Study

As a case study, heat load was predicted for the hospital in Hattingen, Germany. The necessary forecast covers the upcoming 48 h in 15 min resolution. From the four different forecasting strategies of Section 2.2, a multiXone approach was chosen, with 48 models each predicting one value per execution (48X1). This reduces model complexity and the number of training samples needed, thanks to a single output neuron, compared to a multiXseq approach. At the same time, the workload is increased compared to a singleXone approach, since several models must be developed.

To limit the primary workload and check for similarities, three test models with different horizons (+2, +22 and +42 h) were developed. Each of the three models were developed by a five-step model creation methodology.

The parameter set of the three initial models were chosen identically. All three initial models already outperformed their naive benchmark model considerably. For this reason, naive models are no challenging benchmark for more advanced ML models.

During internal parameter selection, parameters like the kernel and bias initializers proved unimportant for model performance. Regarding the optimizer and the activation function, choosing Adam and ReLU, respectively, proved beneficial in all three test cases. No differentiations in internal parameters among the three test horizons were hence necessary.

Likewise observed for all test horizons during feature engineering was also, that including relative humidity in the input was not beneficial for neither of the three test models and that a circular time-of-day representation is most suited.

Architecture tuning in turn, revealed differences among the three horizons, while all models validated the importance of the presence of previous heat load observations and predicted ambient temperature values in the input, the necessary number of hidden neurons in the first hidden layer differed, while the necessary number of hidden neurons decreased with forecast horizon, having too many neurons was also disadvantageous for model performance.

By testing additional horizons, an exponential equation approximating the number of hidden neurons was determined. Since all other parameters were selected independent of the forecast horizon, this approximation allows for the final model creation of all 48 FFNN models.

The final out-of-sample MAPE scores for the three test horizons decreased compared to the initial models’ MAPEs by 18%, 7% and 12% respectively. The parameter sets of the initial and final models mainly differ in the presence of time-of-day and holiday information within the input vector, the number of lagged features, the scaling choice and the number of hidden neurons. Thus, relatively small differences have in parts increased model performance considerably. Therefore, a structured, comprehensible model creation methodology is important for FFNN model benchmarking.

4.2. Model Creation Methodology

During the five-step model creation methodology, many design parameters and their possible value choices were considered. For some parameters, like the activation function, a single value choice, in this case ReLU, consistently outperforms the other possibilites. For other parameters, like the weight initializations, many value choices result in equally well performing models. Both results, if confirmed through multiple load forecasting case studies, simplify model tuning, because choosing the right value becomes obvious in case of the activation function and is arbitrary in case of the initializers.

In summary, a structured FFNN model development procedure for creating benchmark models was presented. To date, no consistent methodology is followed. Instead, model developers choose their own approach, leading to highly different qualities of the developed models. Using our approach will on the one hand increase the quality and interpretability of FFNN benchmark models and on the other hand, through multiple applications to different case studies, generate further insights into FFNN design choices for short-term load forecasting.

4.3. Future Research Directions

Of course, the considered value choices within the presented case study are by far not complete. Different activation functions, learning algorithms or input features could for example be considered. The selection of value choices was mainly motivated by their popularity. Furthermore, the construction of lagged values was kept simple by using continuous sequences of hourly values. Isolated lagged observations from, e.g., 6, 12, 24 or 48 h ago could likewise be considered and could improve model performances. Such additional choices could be added by another application of the methodology to a different heat load data set, while also testing the value choices from the here presented case study, valuable information regarding the effectiveness of new value choices, consistency as well as transferability of previous results is gathered. When, e.g., ReLU consistently proves advantageous as activation function or the initializers irrelevant, tuning these two parameters can be omitted, and model development is simplified. Contrariwise, weekday information was observed irrelevant for predicting the hospital’s heat load but may be relevant to other case studies. Hence, repeated application of the structured methodology can improve individual model performances and reduce the effort needed for model development.

Besides repeated application, a future research direction could involve including the forecast composition into the model creation methodology. Thereby, questions on how the composition affects forecast accuracy and in what cases multivariate forecasting strategies are beneficial, would be addressed. Furthermore, for multivariate forecasts non-consecutive outputs and fixed time-of-day predictions could be tested.

5. Conclusions

Before moving towards more advanced and computationally costly ML approaches, the full potential of simple ML models, like FFNN models, should be exploited. In the presented case study, model performance was improved by up to 18%, based on minor parameter changes. Therefore, FFNN models must be thoroughly and comprehensibly tuned if they are used as a benchmark for performance evaluations of different forecasting approaches.

The effort needed to create benchmark models will be reduced by repeated application of the presented, structured, five-step model creation methodology, meanwhile clarifying consistency and transferability of the individual results.

Author Contributions

Conceptualization, methodology, software, validation, formal analysis, investigation, data curation, writing—original draft preparation and visualization, M.S.; resources, writing—review and editing, M.S. and A.H.; supervision, project administration and funding acquisition, A.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research is part of the research project HESKH funded by the Federal Ministry for Economic Affairs and Climate Action, grant number 03ET1591A. The APC was funded by the Fraunhofer-Gesellschaft.

Data Availability Statement

The weather data presented in this study is openly available on the Open Data Platform at https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/, accessed on 2 December 2021, reference number [24]. The smoothed heat load data presented in this study is available on request from the corresponding author.

Acknowledgments

Many thanks to the Augusta Foundation and Stadtwerke Bochum for their support during measurements and data acquisition. Moreover, we want to thank Christoph Goetschkes and Matthias Schnier for executing the measurements and data collection and Annedore Mittreiter for her support in funding acquisition and project administration.

Conflicts of Interest

The authors declare no conflict of interest.

Nomenclature & Abbreviations

The following abbreviations are used in this manuscript:

$η$	Learning Rate
H	Forecast horizon in hours
$n_{1}$	Number of hidden neurons in the first hidden layer
t	Time of Prediction Making
$+ i$ -model	Prediction Model with a forecast horizon of i hours
ANN	Artificial neural network
ARIMA	Autoregressive integrated moving average
CET	Central European Time
DWD	Deutscher Wetterdienst
FFNN	Feedforward neural network
MAE	Mean Absolute Error
MAPE	Mean Absolute Percentage Error
ML	Machine learning
MSE	Mean Squared Error
ReLU	Rectified Linear Unit
UTC	Coordinated Universial Time

Appendix A

Table A1. Initial parameter set for initial model creation.

Parameter	Value
Heat Load Input	−20 h to −1 h (20 features)
Ambient Temp Input	−6 h to −1 h (6 features)
Temp Prediction Input	0 to 2 h (3 features for +2 horizon),8 to 22 h (15 features for +22 horizon),28 to 42 h (15 features for +42 horizon)
Heat Scaling	min-max to [0, 1]
Temp Scaling	min-max to [−1, 1]
Hidden neurons	50 in first, 10 in second layer
Initializers	Hidden layers: random_normal ¹Output layer: glorot_uniform for kernel and zero for bias
Activation function	Hidden Layers: ReLUOutput Layer: Linear
Loss funtion	Mean Absolute Error
Optimizer	Adam with learning rate $η = 0.001$
Epochs	4
Validation Split	10%
Batch size	256
Test split	20% (from five fold cross-validation)

¹ Italic terms according to tensorflow.keras [29].

Table A2. Considered parameters and their value choices for internal parameter selection.

Parameter	Value Choices
Kernel Initializer	Equivalent for all layers:uniform ¹,glorot_uniform², he_uniform, lecun_uniform, normal, glorot_normal, he_normal, random_normal, zero
Bias Initializer	Equivalent for all layers:uniform, random_normal,zero
Activation Function	For hidden layers: ReLU, tanh, sigmoid, hard_sigmoid
Loss Function	MAE or MSE
Optimizer	SDG with $η = 10^{- 2}$ (11 epochs) RMSprop with $η = 2 \cdot 10^{- 4}$ (4 epochs) Adagrad with $η = 3 \cdot 10^{- 2}$ (4 epochs) Adadelta with $η = 10^{- 1}$ (4 epochs) Adam with $η = 10^{- 3}$ (4 epochs) Adamax with $η = 10^{- 2}$ (4 epochs) Nadam with $η = 10^{- 3}$ (4 epochs)
Batch Size	32, 64, 128, 256, 512, 1024
Scaling	No scaling min-max scaling to [0, 1] for heat, [−1, 1] for temp min-max scaling to [−1, 1] for both heat and temp standard scaling to [0, 1] for both heat and temp

¹ Italic terms according to tensorflow.keras [29]. ² Bold entries refer to the selected value choices.

Table A3. Considered parameters and their value choices for feature engineering.

Parameter	Value Choices
Previous Relative Humidity	Not included¹ Min-max sclaed to [0, 1]
Previous Ambient Temerature	Not included Min-max scaled to [0, 1]
Predicted Relative Humidity	Not included Min-max scaled to [0, 1]
Number of features	6, 12, 18
Time of day	Not included Scalar feature for UTC time between 0 (00:00) and 1 (23:59) Scalar feature for CET time between 0 (00:00) and 1 (23:59) Circular representation of UTC time with two features Circular representation of CET time with two features
Weekday	Not included Scalar feature Six dimensional vector from one-hot-encoding
Holiday	Not included Scalar binary feature: 1 for bank holidays and Sundays, 0 else

¹ Bold entries refer to the selected value choices.

Table A4. Considered parameters and their value choices for architecture tuning.

Parameter	Value Choices
Previous Heat Load	$1, 2, 3, \dots, 9, 10 1, 11, \dots, 19$ or 20 features
Heat Load Prediction	$1, 2, 3, \dots, 19$ or 20 features
Ambient Temp. Prediction	$1, 2, 3, \dots, 19$ or 20 features
Hidden Layers	$1, 2$ or 3 layers
Hidden Neurons in 1st Layer	$1, 2, 3, \dots, 249$ or 250 neurons
Hidden Neurons in 2nd Layer	$1, 2, 3, \dots, 249$ or 250 neurons

¹ Bold entries refer to the selected value choices.

Appendix B

Figure A1. The largest MAPE score of 11.5% over one week is produced by the +39-model and not the model with the largest forecast horizon. The predicted values by the +39-model (predictions) and the real observations for the same period (targets) are displayed for that specific week in March 2021.

Figure A2. The smallest MAPE score of 1.9% is produced by the +1-model, the model with the shortest forecast horizon. The predicted values by the +1-model (predictions) and the real observations for the same period (targets) are displayed for that specific week in March 2021.

References

Haben, S.; Arora, S.; Giasemidis, G.; Voss, M.; Vukadinović Greetham, D. Review of low voltage load forecasting: Methods, applications, and recommendations. Appl. Energy 2021, 304, 117798. [Google Scholar] [CrossRef]
Arvanitidis, A.I.; Bargiotas, D.; Daskalopulu, A.; Laitsos, V.M.; Tsoukalas, L.H. Enhanced Short-Term Load Forecasting Using Artificial Neural Networks. Energies 2021, 14, 7788. [Google Scholar] [CrossRef]
Hippert, H.S.; Pedreira, C.E.; Souza, R.C. Neural networks for short-term load forecasting: A review and evaluation. IEEE Trans. Power Syst. 2001, 16, 44–55. [Google Scholar] [CrossRef]
Wang, Z.; Srinivasan, R.S. A review of artificial intelligence based building energy use prediction: Contrasting the capabilities of single and ensemble prediction models. Renew. Sustain. Energy Rev. 2017, 75, 796–808. [Google Scholar] [CrossRef]
Boicea, V.A.; Ulmeanu, A.P.; Vulpe-Grigoraşi, A. A novel approach for power load forecast based on GAN data augmentation. IOP Conf. Ser. Mater. Sci. Eng. 2022, 1254, 012030. [Google Scholar] [CrossRef]
Cao, L.; Li, Y.; Zhang, J.; Jiang, Y.; Han, Y.; Wei, J. Electrical load prediction of healthcare buildings through single and ensemble learning. Energy Rep. 2020, 6, 2751–2767. [Google Scholar] [CrossRef]
Hong, T.; Pinson, P.; Wang, Y.; Weron, R.; Yang, D.; Zareipour, H. Energy Forecasting: A Review and Outlook. IEEE Open Access J. Power Energy 2020, 7, 376–388. [Google Scholar] [CrossRef]
Ryu, S.; Noh, J.; Kim, H. Deep Neural Network Based Demand Side Short Term Load Forecasting. Energies 2017, 10, 3. [Google Scholar] [CrossRef]
Ghalehkhondabi, I.; Ardjmand, E.; Weckman, G.R.; Young, W.A. An overview of energy demand forecasting methods published in 2005–2015. Energy Syst. 2017, 8, 411–447. [Google Scholar] [CrossRef]
Kováč, S.; Micha’čonok, G.; Halenár, I.; Važan, P. Comparison of Heat Demand Prediction Using Wavelet Analysis and Neural Network for a District Heating Network. Energies 2021, 14, 1545. [Google Scholar] [CrossRef]
Amasyali, K.; El-Gohary, N.M. A review of data-driven building energy consumption prediction studies. Renew. Sustain. Energy Rev. 2018, 81, 1192–1205. [Google Scholar] [CrossRef]
Tealab, A. Time series forecasting using artificial neural networks methodologies: A systematic review. Future Comput. Inform. J. 2018, 3, 334–340. [Google Scholar] [CrossRef]
Fernández-Martínez, D.; Jaramillo-Morán, M.A. Multi-Step Hourly Power Consumption Forecasting in a Healthcare Building with Recurrent Neural Networks and Empirical Mode Decomposition. Sensors 2022, 22, 664. [Google Scholar] [CrossRef] [PubMed]
Hernández-Hernández, C.; Rodríguez, F.; Moreno, J.; Da Costa Mendes, P.; Normey-Rico, J.; Guzmán, J. The Comparison Study of Short-Term Prediction Methods to Enhance the Model Predictive Controller Applied to Microgrid Energy Management. Energies 2017, 10, 884. [Google Scholar] [CrossRef] [Green Version]
González González, A.; García-Sanz-Calcedo, J.; Rodríguez Salgado, D. Evaluation of Energy Consumption in German Hospitals: Benchmarking in the Public Sector. Energies 2018, 11, 2279. [Google Scholar] [CrossRef] [Green Version]
Levsen, A.; Filser, M. Klimaschutz in deutschen Krankenhäusern: Status Quo, Maßnahmen und Investitionskosten: Auswertung klima- und Energierelevanter Daten Deutscher Krankenhäuser. Available online: https://www.dkgev.de/fileadmin/default/Mediapool/1_DKG/1.7_Presse/1.7.1_Pressemitteilungen/2022/2022-07-19_DKI-Gutachten_Klimaschutz_in_deutschen_Krankenha__usern.pdf (accessed on 1 December 2022).
Clarivate. Web of Science. 2022. Available online: https://www.webofscience.com/ (accessed on 1 December 2022).
Ma, D.; Zhang, L.; Sun, B. An interval scheduling method for the CCHP system containing renewable energy sources based on model predictive control. Energy 2021, 236, 121418. [Google Scholar] [CrossRef]
Manno, A.; Martelli, E.; Amaldi, E. A Shallow Neural Network Approach for the Short-Term Forecast of Hourly Energy Consumption. Energies 2022, 15, 958. [Google Scholar] [CrossRef]
Smedsrud, K.b.H.; Xue, K.; Yang, Z.; Stenstad, L.I.; Giske, T.E.; Cao, G. Investigation and prediction of Energy consumption at St. Olavs Hospital. E3S Web Conf. 2021, 246, 04003. [Google Scholar] [CrossRef]
Meter, B. Dynasonics: Hybrid Ultrasonic Flow Meter: DXN Portable Ultrasonic Flow and Energy Meter. Available online: https://www.badgermeter.com/products/meters/ultrasonic-flow-meters/dxn-portable-hybrid-ultrasonic-flow-meter/ (accessed on 1 December 2022).
van Rossum, G.; Drake, F.L. Python 3 Reference Manual; CreateSpace: Scotts Valley, CA, USA, 2009. [Google Scholar]
Reback, J.; McKinney, W.; den Bossche, J.V.; Roeschke, M.; Augspurger, T.; Hawkins, S.; Cloud, P.; Hoefler, P. Pandas 1.4.4. 2022. Available online: https://zenodo.org/record/7037953#.Y-7ZlPlBxPY (accessed on 1 December 2022).
Deutscher Wetterdienst. Open Data Dienst des Deutschen Wetterdiensts. Available online: https://opendata.dwd.de/climate_environment/CDC/observations_germany/climate/10_minutes/air_temperature/historical/ (accessed on 2 December 2021).
dr Prodigy; Ryanss. Python-Holidays v.0.10.1. Available online: https://python-holidays.readthedocs.io/en/latest/index.html (accessed on 5 December 2022).
Bagnasco, A.; Saviozzi, M.; Silvestro, F.; Vinci, A.; Grillo, S.; Zennaro, E. Artificial neural network application to load forecasting in a large hospital facility. In Proceedings of the 2014 International Conference on Probabilistic Methods Applied to Power Systems (PMAPS), Durham, UK, 7–10 July 2014; pp. 1–6. [Google Scholar] [CrossRef]
Karatasou, S.; Santamouris, M.; Geros, V. Modeling and predicting building’s energy use with artificial neural networks: Methods and results. Energy Build. 2006, 38, 949–958. [Google Scholar] [CrossRef]
Moriñigo Sotelo, D.; Duque Pérez, O.; García Escudero, L.A.; Fernández Temprano, M.; Fraile Llorente, P.; Riesco Sanz, M.V.; Zorita Lamadrid, A.L. Short-term hourly load forecasting of a hospital using an artificial neural network. Renew. Energy Power Qual. J. 2011, 1, 441–446. [Google Scholar] [CrossRef]
TensorFlow Developers. TensorFlow. 2022. Available online: https://zenodo.org/record/7641790#.Y-7aTflBxPY (accessed on 1 December 2022).
Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830. [Google Scholar]
Caswell, T.A.; Droettboom, M.; Lee, A.; Andrade, E.S.D.; Hunter, J.; Firing, E.; Hoffmann, T.; Klymak, J.; Stansby, D.; Varoquaux, N.; et al. Matplotlib: V3.3.4. 2021. Available online: https://zenodo.org/record/4475376#.Y-7a_vlBxPY (accessed on 1 December 2022).
Bagnasco, A.; Fresi, F.; Saviozzi, M.; Silvestro, F.; Vinci, A. Electrical consumption forecasting in hospital facilities: An application case. Energy Build. 2015, 103, 261–270. [Google Scholar] [CrossRef]
Bakker, V.; Bosman, M.G.; Molderink, A.; Hurink, J.L.; Smit, G.J. Improved Heat Demand Prediction of Individual Households*. IFAC Proc. Vol. 2010, 43, 110–115. [Google Scholar] [CrossRef] [Green Version]

Figure 1. Comparison of raw heat supply measurements versus averaged heat supply in July. For raw measurements, heat supply of the different generation units was summed up. For the average heat supply, raw measurements were averaged by a 180 min gaussian window with 40 min standard deviation.

Figure 2. Possible forecast compositions in case of a 24-value-forecast are represented by the green triangle, edges included. Both the x- and y-axis are shown logarithmically with base two. Marked with A, B, C and D are four examples for the four different composition strategies.

Figure 3. Heat load (colored) and ambient temperature data (grey) available for model development. The five different test sets from the five-fold nested cross-validation are highlighted by different colors in the heat load data. The period between 1 August 2020 and 11 February 2021 was removed for later application of the developed forecast models.

Figure 4. Best performing models with two hour prediction horizon (+2-models) sorted and highlighted by the optimizer used. Best performing models with Adam (beige) and Adamax (blue) optimizer achieve lower MAPE scores compared to Adadelta (green) or RMSprop (brown).

Figure 5. Performance of +2-models sorted and highlighted by their different time-of-day inputs. The two circular representation (green and blue) results in lower MAPE scores than omitting any time-of-day input (beige).

Figure 6. Influence of the number of lagged features on the performance of the models with a forecast horizon of 22 h. The number of lagged features varies between zero (far left in green) and twenty (far right in grey) for (A) previous heat load observations and (B) ambient temperature predictions.

Figure 7. Comparison between the dependency of the model performances on the number of hidden neurons in the first layer for (A) a 2 h ahead forecast horizon and (B) a 42 h ahead forecast horizon.

Figure 8. Experimental results and exponential curve fit for the necessary number of hidden neurons in the first hidden layer depending on the forecast horizon.

Figure 9. Performance evaluation of (A) the 48 FFNN model based on their in-sample MAPE scores on the full data set and (B) the persistence forecast with a horizon between one and 48 h. Generally, the FFNN models show increasing forecast errors for larger forecast horizons. For the persistence forecast, the performance increases, when the horizon coincides with the daily seasonality.

Table 1. Performance of initial FFNN models compared with a persistence (naïve) forecast with the same horizon.

Horizon [h]	Mean FFNN MAPE [%]	Naïve MAPE [%]
2	6.7	9.6
22	7.3	14.8
42	9.2	23.8

Table 2. Average performance of best FFNN test models after architecture tuning compared with the performance of the initial models from Section 3.1.

Horizon [h]	MAPE after Tuning [%]	MAPE of Initial Models [%]	Performance Increase
2	5.5	6.7	18%
22	6.8	7.3	7%
42	8.1	9.2	12%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Stienecker, M.; Hagemeier, A. Developing Feedforward Neural Networks as Benchmark for Load Forecasting: Methodology Presentation and Application to Hospital Heat Load Forecasting. Energies 2023, 16, 2026. https://doi.org/10.3390/en16042026

AMA Style

Stienecker M, Hagemeier A. Developing Feedforward Neural Networks as Benchmark for Load Forecasting: Methodology Presentation and Application to Hospital Heat Load Forecasting. Energies. 2023; 16(4):2026. https://doi.org/10.3390/en16042026

Chicago/Turabian Style

Stienecker, Malte, and Anne Hagemeier. 2023. "Developing Feedforward Neural Networks as Benchmark for Load Forecasting: Methodology Presentation and Application to Hospital Heat Load Forecasting" Energies 16, no. 4: 2026. https://doi.org/10.3390/en16042026

Note that from the first issue of 2016, this journal uses article numbers instead of page numbers. See further details here.

Article Menu

Developing Feedforward Neural Networks as Benchmark for Load Forecasting: Methodology Presentation and Application to Hospital Heat Load Forecasting

Abstract

1. Introduction

1.1. Problem Statement

1.2. Case Study

2. Materials and Methods

2.1. Data Acquisition

2.2. Forecast Composition

2.3. Model Creation Methodology

3. Results

3.1. Initial Model Creation

3.2. Internal Parameter Selection

3.3. Feature Engineering

3.4. Architecture Tuning

3.5. Final Model Creation

4. Discussion

4.1. Case Study

4.2. Model Creation Methodology

4.3. Future Research Directions

5. Conclusions

Author Contributions

Funding

Data Availability Statement

Acknowledgments

Conflicts of Interest

Nomenclature & Abbreviations

Appendix A

Appendix B

References

Share and Cite

Article Metrics

Article Access Statistics

Further Information

Guidelines

MDPI Initiatives

Follow MDPI