1. Introduction
ENSO refers to the coupled phenomenon of El Niño and the Southern Oscillation. The former is an anomalous warming of sea surface temperature in the tropical Pacific; the latter is a seesaw fluctuation in sea-level atmospheric pressure between the southern Pacific and Indian Oceans. Both phenomena are quasi-periodic, recurring roughly every four years on average. Bjerknes [1] argued that, although El Niño and the Southern Oscillation manifest differently, they are governed by the same underlying physics. El Niño disrupts the normal circulation of ocean currents in the South Pacific and thereby disturbs the usual distribution of the global pressure and wind belts. When an ENSO event is strong, severe climate disasters occur in many regions of the world, especially in countries bordering the Pacific. For example, during the 1986–1987 ENSO event, the sea surface temperature in the equatorial Pacific was about 2 °C above average; northern and central Peru were struck by rainstorms, while South Asia and northern Africa suffered drought. These extremes caused huge economic losses to local agriculture and fisheries [2,3]. The prevailing international standard takes regional sea surface temperature anomalies as the basic monitoring index. For example, the US Climate Prediction Center defines an El Niño/La Niña event as a period in which the absolute value of the three-month moving average of the Niño 3.4 index exceeds 0.5 °C for at least five consecutive overlapping seasons. (The Niño 3.4 index is the average sea surface temperature anomaly (SSTA) over the region 170° W–120° W, 5° N–5° S [4].) Given the great influence of ENSO, scientists worldwide have devoted substantial research to forecasting it. Traditional ENSO prediction models fall mainly into statistical and dynamical categories.
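As a concrete illustration of this monitoring criterion (an illustrative sketch, not part of the cited operational definition), the following NumPy code computes the three-month running mean of a monthly Niño 3.4 SSTA series and flags months belonging to sufficiently long warm or cold runs:

```python
import numpy as np

def oni(nino34_ssta):
    """Three-month running mean of the monthly Nino 3.4 SSTA series."""
    kernel = np.ones(3) / 3.0
    return np.convolve(nino34_ssta, kernel, mode="valid")

def enso_event_months(oni_values, threshold=0.5, min_run=5):
    """Boolean mask of months belonging to an El Nino / La Nina event:
    |ONI| >= threshold with the same sign for at least `min_run`
    consecutive values."""
    mask = np.zeros(len(oni_values), dtype=bool)
    # Signed indicator: +1 warm, -1 cold, 0 below threshold.
    sign = np.sign(oni_values) * (np.abs(oni_values) >= threshold)
    start = 0
    while start < len(sign):
        if sign[start] == 0:
            start += 1
            continue
        end = start
        while end < len(sign) and sign[end] == sign[start]:
            end += 1
        if end - start >= min_run:  # run is long enough to count as an event
            mask[start:end] = True
        start = end
    return mask
```

The run-length check mirrors the "at least five consecutive overlapping seasons" requirement; shorter excursions above the threshold are not counted as events.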
The statistical models are developed from correlations within historical datasets and are typically designed to minimize RMSE. The National Centers for Environmental Prediction (NCEP), the Climate Prediction Center (CPC) and the Earth System Research Laboratory (ESRL) of NOAA have developed several statistical tools for sea surface temperature prediction in the tropical Pacific, including the Linear Inverse Model (LIM) [
5], Climatology and Persistence (CLIPER) [
6], the Markov Model (MKV) [
7], Constructed Analogues (CA) [
8] and Canonical Correlation Analysis (CCA) [
9]. The dynamical models are based primarily on the physical equations governing the coupled atmosphere–land–ocean system, ranging from relatively simple physics to comprehensive fully coupled models. As a fully coupled dynamical model, the Climate Forecast System Version 2 (CFSv2) [10] has been widely used. It was developed at the Environmental Modeling Center at NCEP and became operational in March 2011. CFSv2 offers hourly data at a horizontal resolution down to one-half of a degree, greatly improving SST prediction skill.
Statistical and dynamical models have their own advantages and weaknesses in ENSO forecasts. Previous studies [
11] suggest that the average skill of dynamical models [
12,
13] is usually better than that of statistical models [
14]. However, due to the uncertainty of initial conditions, it is difficult in practice to simulate the interannual variation of sea surface temperature [15]. Statistical models need long histories of predictor data to discover potential relationships, but dense observations of the tropical Pacific only began in the 1990s. Nevertheless, statistical models retain development potential because, compared with complex dynamical models, they are cheaper and easier to develop. Despite decades of effort, producing skillful ENSO forecasts at lead times beyond one year with traditional methods remains an open problem.
With the advent of the big data era, artificial intelligence technology has continuously made breakthrough achievements in various fields. Some scholars [
16,
17,
18] have tried to use machine learning methods to directly predict the Niño 3.4 index. Nooteboom [17] combined the classical autoregressive integrated moving average (ARIMA) technique with an artificial neural network to predict the ENSO index. Ham [
18] utilized SST and heat content (vertically averaged oceanic temperature in the upper 300 m) anomaly maps over 0–360° E, 55° S–60° N for three consecutive months to produce skillful ENSO forecasts at lead times of up to one and a half years, using a convolutional neural network and transfer learning techniques [19]. Although the resulting skill exceeds that of traditional models, only a single index is produced, and it is difficult to further analyze its spatial distribution or causes. In addition, some scholars use multi-model ensembles to reduce the variance of the prediction. Zhang [
20] used Bayesian model averaging to describe model uncertainty: each model is weighted according to its prediction performance, with better-performing models receiving greater weights. The combination of three statistical and dynamical models provides a more skillful ENSO prediction than any single model. In recent years, some scholars have achieved excellent results in predicting meteorological variables using deep learning methods [
21,
22,
23,
24], such as precipitation and temperature. Considering that ENSO is mainly reflected in changes in sea surface temperature, forecasting ENSO phenomena is equivalent to forecasting sea surface temperature anomalies; SST prediction methods can therefore be used for ENSO prediction. Aguilar [25] used the Bayesian Neural Network (BNN) and Support Vector Regression (SVR), two nonlinear regression methods, to predict tropical Pacific sea surface temperature anomalies over the next 3 to 15 months. Zhang [26] formulated SST forecasting as a time series regression problem and made both short-term and long-term forecasts. Although this kind of approach ultimately yields the regional sea surface temperature distribution, treating each grid point independently breaks the relationship between adjacent points and discards the spatial structure of the temperature field. In addition, the training sets of most current models contain only reanalysis data, which are too few to fully train the models.
In this study, we use simulation data and assimilation data to alleviate the problem of insufficient training data. We cast ENSO prediction as an unsupervised spatiotemporal learning problem and propose DC-LSTM, which predicts the full sea surface temperature anomaly field rather than only the Niño 3.4 index.
Firstly, sea surface temperature prediction is formulated as a spatiotemporal prediction problem rather than a time series regression task. We use the improved model, DC-LSTM, to predict the monthly average sea surface temperature anomaly distribution in the equatorial Pacific, and the corresponding Niño 3.4 index, over the next two years. A dense convolution layer and a transposed convolution layer extract and resample the spatial features of the SSTA, and a multi-layer Causal LSTM mines the deep patterns of sea temperature change. We designed six networks with different widths and depths and found the optimal structure by comparing their root mean squared error (RMSE). In addition, we also try to integrate other meteorological elements, including T300 (vertically averaged oceanic temperature in the upper 300 m), u-wind and v-wind. Global observations of the ocean temperature distribution are only available from 1871 onward, which means that fewer than 150 samples exist for each calendar month. To obtain the amount of data necessary for model training, we therefore also used historical simulation data and reanalysis data; to some extent, these can reproduce the development of ENSO and provide initial training for the model. After mixing them, we use a sliding-window method to randomly select sequences as the training set. Compared with traditional dynamical models and time series deep learning models, our method provides longer, more skillful forecasts.
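The sliding-window sampling described above can be sketched as follows (a minimal illustration; the array shapes, window lengths and function names are hypothetical, as the paper does not publish its preprocessing code):

```python
import numpy as np

def sliding_window_samples(fields, in_len=12, out_len=24, n_samples=8, rng=None):
    """Randomly draw (input, target) sequence pairs from a long monthly record.

    fields: array of shape (T, H, W), e.g. monthly SSTA maps.
    Returns inputs of shape (n_samples, in_len, H, W) and
    targets of shape (n_samples, out_len, H, W).
    """
    rng = np.random.default_rng(rng)
    T = fields.shape[0]
    # Latest start index that still leaves room for a full input+target window.
    max_start = T - (in_len + out_len)
    starts = rng.integers(0, max_start + 1, size=n_samples)
    x = np.stack([fields[s:s + in_len] for s in starts])
    y = np.stack([fields[s + in_len:s + in_len + out_len] for s in starts])
    return x, y
```

Mixed simulation, reanalysis and observation records would each be sampled this way, so every training sequence is temporally contiguous within one source.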
The rest of the paper is organized as follows:
Section 2 describes unsupervised spatiotemporal prediction and the DC-LSTM methodology.
Section 3.1 describes the dataset that was used during training.
Section 3.2 describes the training process of DC-LSTM. In
Section 3.3, the results are presented and discussed. In
Section 4, we summarize the strengths and weaknesses of DC-LSTM and outline directions for future work.
2. Materials and Methods
Suppose we express the meteorological data as a dynamical system over time. The observation at any time $t$ can be represented by a tensor $\mathcal{X}_t \in \mathbb{R}^{P \times M \times N}$, where $M \times N$ represents the spatial region and $P$ is the number of meteorological factors. The prediction of the sea surface temperature anomaly can then be expressed as an unsupervised spatiotemporal prediction problem: given a sequence of $J$ past observations, predict the most likely distribution of the sequence over the next $K$ time steps:

$$\hat{\mathcal{X}}_{t+1}, \ldots, \hat{\mathcal{X}}_{t+K} = \mathop{\arg\max}_{\mathcal{X}_{t+1}, \ldots, \mathcal{X}_{t+K}} \, p\left(\mathcal{X}_{t+1}, \ldots, \mathcal{X}_{t+K} \mid \mathcal{X}_{t-J+1}, \ldots, \mathcal{X}_t\right)$$
At present, neural networks based on RNN [
27] and LSTM [
28] have achieved great success in the field of spatiotemporal prediction [
29,
30,
31,
32,
33,
34,
35,
36]. We chose the leading and most stable of these models as our baseline, improved it from the perspective of feature pre-extraction, and propose Dense Convolution-LSTM (DC-LSTM). This section describes the detailed structure of DC-LSTM. As shown in Figure 1, our model contains three components. The bottom module is the Dense Block, which extracts the spatial features of the input at each time step; between adjacent Dense Blocks is a max pooling layer, which reduces the size of the feature maps. The middle module is an L-layer Causal LSTM, in which the input of each layer is the hidden state of the previous layer; the green arrows show the transition paths of the state M, the blue arrow shows the Gradient Highway Unit, and the red arrows show the transition paths of the hidden state H. Subscripts denote time and superscripts denote layers. The top module is the Transposed Convolution layer, which restores the feature map to its original size. At each iteration, the model forecasts the next sea surface temperature field from the current meteorological factors.
Figure 2 shows the prediction mechanism of the whole system. In the training phase, each input frame is the ground truth with probability p and the model's previous prediction with probability 1 − p. In the initial stage, the model's predictive ability is weak, so p is set to 1; as the number of training iterations increases, p decreases gradually. In the verification phase, p is fixed at 0. The specific structure of each module is as follows.
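This scheduled-sampling mechanism can be sketched as follows (the linear decay schedule and the per-frame coin flip are assumptions for illustration; the paper only states that p starts at 1 and decreases during training):

```python
import numpy as np

def sampling_probability(step, decay_steps=50000):
    """Linearly decay the probability p of feeding the ground truth from
    1 to 0 over `decay_steps` iterations (the schedule is an assumption)."""
    return max(0.0, 1.0 - step / decay_steps)

def choose_inputs(truth, prediction, p, rng=None):
    """Per-frame scheduled sampling: with probability p use the true frame,
    otherwise reuse the model's previous prediction.
    truth, prediction: arrays of shape (T, H, W)."""
    rng = np.random.default_rng(rng)
    use_truth = rng.random(truth.shape[0]) < p
    # Broadcast the per-frame choice over the spatial dimensions.
    return np.where(use_truth[:, None, None], truth, prediction)
```

Setting p = 0 at verification time, as in the paper, makes the model run fully autoregressively on its own predictions.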
2.1. Causal LSTM
In sequence-to-sequence tasks, LSTM [28], a special RNN model, has been widely used. The flow of information in an LSTM is controlled by 'gate' structures whose values lie between 0 and 1: the forget gate $f_t$ controls the discarding of old features, the input gate $i_t$ controls the addition of new features, and the output gate $o_t$ controls what the hidden layer outputs. With the help of the gate structure, LSTM has a stable and powerful capability for modeling long-range dependencies. Note that, in the original LSTM, the input, output and states must be 1D vectors. The key equations are shown below, where '⊙' denotes the Hadamard product:

$$i_t = \sigma(W_i [x_t, h_{t-1}] + b_i)$$
$$f_t = \sigma(W_f [x_t, h_{t-1}] + b_f)$$
$$o_t = \sigma(W_o [x_t, h_{t-1}] + b_o)$$
$$g_t = \tanh(W_g [x_t, h_{t-1}] + b_g)$$
$$c_t = f_t \odot c_{t-1} + i_t \odot g_t$$
$$h_t = o_t \odot \tanh(c_t)$$
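The standard LSTM gate computations can be implemented directly; the following minimal NumPy sketch performs one step on 1D vectors (the gate ordering and the fused weight matrix are implementation choices for illustration, not from the paper):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One step of a standard LSTM cell on 1D vectors.
    W has shape (4*hidden, input+hidden); gates are ordered i, f, o, g."""
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[:hidden])              # input gate
    f = sigmoid(z[hidden:2 * hidden])    # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])          # candidate features
    c = f * c_prev + i * g               # new cell state
    h = o * np.tanh(c)                   # new hidden state
    return h, c
```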
Causal-LSTM [37] is a spatiotemporal memory module based on LSTM. Each Causal-LSTM cell memorizes the spatiotemporal features of the sequence through the state $C$, transferred along the horizontal (temporal) direction, and the state $M$, transferred along the zigzag direction across layers. Like ConvLSTM [38], Causal-LSTM extracts spatiotemporal features by replacing the fully connected operations with convolution operators in the state-to-state and input-to-state transitions. Causal-LSTM adopts a cascaded mechanism: the temporal memory $C_t^k$ is computed first and then concatenated with the input $X_t$ and the spatial memory $M_t^{k-1}$ to calculate the input gate $i_t'$ and the forget gate $f_t'$ of the spatial branch. Finally, the current hidden state $H_t^k$ is obtained from the concatenation of $C_t^k$ and $M_t^k$. The equations of Causal LSTM are as follows:

$$\begin{pmatrix} g_t \\ i_t \\ f_t \end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \end{pmatrix} W_1 * [X_t, H_{t-1}^k, C_{t-1}^k]$$
$$C_t^k = f_t \odot C_{t-1}^k + i_t \odot g_t$$
$$\begin{pmatrix} g_t' \\ i_t' \\ f_t' \end{pmatrix} = \begin{pmatrix} \tanh \\ \sigma \\ \sigma \end{pmatrix} W_2 * [X_t, C_t^k, M_t^{k-1}]$$
$$M_t^k = f_t' \odot \tanh(W_3 * M_t^{k-1}) + i_t' \odot g_t'$$
$$o_t = \tanh(W_4 * [X_t, C_t^k, M_t^k])$$
$$H_t^k = o_t \odot \tanh(W_5 * [C_t^k, M_t^k])$$

where ⊙ is the Hadamard product, * is the convolution operator, $\sigma$ is the sigmoid activation function, and the superscript $k$ indicates the layer in the stacked Causal-LSTM network. The square brackets indicate concatenation of tensors along the channel dimension. $W_3$ and $W_5$ are $1 \times 1$ convolutional filters that make the channel number of the concatenated tensor consistent with that of the hidden state $H_t^k$.
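The cascaded update can be sketched in NumPy, with convolutions replaced by matrix products on flattened feature vectors for brevity (a simplification for illustration; the actual Causal LSTM operates on 2D feature maps with convolutions):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def causal_lstm_step(x, h_prev, c_prev, m_in, W1, W2, W3, W4, W5):
    """One Causal LSTM step with the cascaded C -> M mechanism.
    m_in is the spatial memory M from the layer below (or from the
    top layer at the previous time step for the bottom layer)."""
    n = h_prev.shape[0]
    # Temporal branch: update C from [x, h_prev, c_prev].
    z1 = W1 @ np.concatenate([x, h_prev, c_prev])
    g, i, f = np.tanh(z1[:n]), sigmoid(z1[n:2 * n]), sigmoid(z1[2 * n:])
    c = f * c_prev + i * g
    # Spatial branch (cascade): update M from [x, c, m_in].
    z2 = W2 @ np.concatenate([x, c, m_in])
    g2, i2, f2 = np.tanh(z2[:n]), sigmoid(z2[n:2 * n]), sigmoid(z2[2 * n:])
    m = f2 * np.tanh(W3 @ m_in) + i2 * g2
    # Output gate sees both memories; hidden state fuses them.
    o = np.tanh(W4 @ np.concatenate([x, c, m]))
    h = o * np.tanh(W5 @ np.concatenate([c, m]))
    return h, c, m
```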
For the sea surface temperature anomaly prediction task, the length of the input and output sequences is usually above 30. Owing to the zigzag transmission route of the spatial memory $M$, the path along which early information must travel to reach the current state is very long, yet these early features are important for predicting interannual SSTA. Inspired by highway layers [39], a new spatiotemporal recurrent structure named the Gradient Highway Unit (GHU) is inserted between the bottom and second layers of the Causal LSTM stack. The GHU helps the model learn to skip frames and makes gradient propagation more efficient in very deep feed-forward networks. The equations of the GHU are shown below:

$$P_t = \tanh(W_{px} * X_t + W_{pz} * Z_{t-1})$$
$$S_t = \sigma(W_{sx} * X_t + W_{sz} * Z_{t-1})$$
$$Z_t = S_t \odot P_t + (1 - S_t) \odot Z_{t-1}$$

where $P_t$ is the transformed input, $S_t$ is the switch gate, and $Z_t$ is the highway state.
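A minimal sketch of one GHU step, again with matrix products standing in for the convolutions of the real unit:

```python
import numpy as np

def ghu_step(x, z_prev, Wpx, Wpz, Wsx, Wsz):
    """One Gradient Highway Unit step on flattened features.
    The switch gate s interpolates between the transformed input p
    and the previous highway state z_prev."""
    p = np.tanh(Wpx @ x + Wpz @ z_prev)                 # transformed input
    s = 1.0 / (1.0 + np.exp(-(Wsx @ x + Wsz @ z_prev)))  # switch gate
    return s * p + (1.0 - s) * z_prev                   # highway mixing
```

When the switch gate is near 0, the previous state passes through almost unchanged, which is what shortens the gradient path.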
2.2. Dense Convolution Layer
LSTM-based algorithms require substantial computational power during training, which limits the application of such neural networks to some extent. Therefore, the input needs to be compressed before prediction while retaining the key features as much as possible, and the image size must be restored after prediction. Inspired by [40], we use an end-to-end trainable approach, Dense Convolution plus Transposed Convolution, instead of the inflexible patch sampling in the original model. The basic module of the dense convolutional layer is the Dense Block, whose structure is shown in Figure 3.
In this module, each convolutional layer receives the feature maps of all preceding layers as input and passes its own output to all subsequent layers. This cascading lets each layer draw on the "collective knowledge" accumulated before it:

$$x_l = H_l([x_0, x_1, \ldots, x_{l-1}])$$

where $x_l$ is the output of the $l$-th layer, $H_l$ denotes its composite function, and $[\cdot]$ is channel-wise concatenation. Since the channel concatenation operation requires feature maps of the same size within a block, a translation layer is inserted between adjacent blocks to reduce the feature size. The translation layer is composed of a convolution and max pooling: the max pooling layer keeps only the maximum value in each region of the feature maps (size = 2, stride = 2), and the convolution is used to adjust the shape. In the down-sampling stage, the size of the feature map is reduced by a factor of $2^n$, where $n$ is the number of translation layers. The addition of the Dense Convolution layer gives DC-LSTM higher computational and storage efficiency.
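A toy NumPy sketch of the dense connectivity and the pooling part of the translation layer (the layer functions and shapes are illustrative, and the 1×1 channel-adjusting convolution is omitted):

```python
import numpy as np

def dense_block(x, layers):
    """DenseNet-style block: each layer sees the channel-wise concatenation
    of the block input and all previous layer outputs.
    x: array of shape (C, H, W); layers: callables (C_in, H, W) -> (k, H, W)."""
    features = [x]
    for layer in layers:
        out = layer(np.concatenate(features, axis=0))
        features.append(out)  # later layers also see this output
    return np.concatenate(features, axis=0)

def translation_layer(x):
    """Toy translation layer: 2x2 max pooling with stride 2, halving the
    spatial size (assumes H and W are even)."""
    C, H, W = x.shape
    return x.reshape(C, H // 2, 2, W // 2, 2).max(axis=(2, 4))
```

With $n$ such translation layers stacked, the spatial size shrinks by the $2^n$ factor mentioned above.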