3.2.1. Baseline
After a preliminary evaluation of the hidden correlation between the intensity values in ERA5 and IBTrACS, we now apply our approach to correct the ERA5 intensity towards the IBTrACS intensity. To verify the effectiveness of the methods used, the dataset must be split so that a testing set is available for evaluation. We present three splitting methods in Section 2 and compare the distributions of the labels in Figure 6. With the first method, the distributions of the training, validation, and testing datasets are very similar, as shown in Figure 6a. With the second method, the training and validation distributions are similar to each other but differ from the testing dataset, as shown in Figure 6b. With the third method, shown in Figure 6c, the validation and testing distributions both differ from the training dataset.
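To make the partitions concrete, the sketch below illustrates the two underlying strategies (random shuffling versus consecutive-year blocks); it assumes each sample carries a `year` field, and the function names and the 80/10/10 fractions are illustrative rather than taken from the paper.

```python
import numpy as np

def split_random(samples, frac=(0.8, 0.1, 0.1), seed=42):
    """Random partition: all years mixed, e.g. 10% held out for testing."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(samples))
    n_tr = int(frac[0] * len(samples))
    n_va = int(frac[1] * len(samples))
    train = [samples[i] for i in idx[:n_tr]]
    val = [samples[i] for i in idx[n_tr:n_tr + n_va]]
    test = [samples[i] for i in idx[n_tr + n_va:]]
    return train, val, test

def split_by_years(samples, train_years, val_years, test_years):
    """Consecutive-year partition, e.g. 2004-2018 / 2019-2020 / 2021-2022."""
    by = lambda years: [s for s in samples if s["year"] in years]
    return by(train_years), by(val_years), by(test_years)

# train, val, test = split_by_years(samples, range(2004, 2019),
#                                   range(2019, 2021), range(2021, 2023))
```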
In fact, there are only two testing sets: one split randomly and the other split by consecutive years. The point-to-point bias and RMSE in Table 5 represent the errors of the original intensity in the ERA5 reanalysis. If we calculate Vmax from the 10 m wind speed, the RMSE before correction is 69.82 kts in the testing dataset (10%) and 67.98 kts in the testing dataset (2021–2022). After linear correction, the RMSE is reduced to 20.86 kts and 19.01 kts, respectively. The biases of both datasets are close to 1 kt, confirming the accuracy of the linear model, and the two testing datasets differ only slightly. We also calculate the maximum wind speed at 850 hPa and obtain results similar to those at the surface, so the linear method corrects the bias and RMSE to some extent at both levels. However, the wind speed at 850 hPa comes from pressure-level data, which may contain less noise, and the comparison of the wind structure at these two levels in Figure 7 shows that the pattern at 850 hPa is clearer than at the surface. Therefore, 850 hPa was chosen as the base level for constructing the inputs.
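The linear correction can be reproduced with a least-squares fit between the ERA5-derived Vmax and the IBTrACS Vmax. This is a minimal sketch, assuming both are 1-D arrays in knots; the exact regression setup in the paper may differ.

```python
import numpy as np

def fit_linear_correction(vmax_era5, vmax_ibtracs):
    """Least-squares fit: Vmax_corrected = a * Vmax_ERA5 + b."""
    a, b = np.polyfit(vmax_era5, vmax_ibtracs, deg=1)
    return lambda v: a * v + b

def bias_rmse(pred, label):
    """Point-to-point bias and RMSE, as reported in Table 5."""
    err = np.asarray(pred) - np.asarray(label)
    return err.mean(), np.sqrt((err ** 2).mean())

# correct = fit_linear_correction(vmax_train_era5, vmax_train_ibtracs)
# print(bias_rmse(correct(vmax_test_era5), vmax_test_ibtracs))
```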
The above results show the potential of linear correction. However, the remaining RMSE is still large for applications, and the linear method ignores the spatial environmental information surrounding the storm when correcting the intensity. We therefore consider non-linear methods for further correction. Deep neural networks, introduced in the previous sections, are our first choice, and ResNet-18, without pre-trained parameters, is used as our basic network architecture. We split the dataset in the three ways mentioned above and then use the gridded wind speed around the storm centre (an 81 × 81 grid) to train, validate, and test the network. We use bilinear interpolation to convert the original shape of the ERA5 wind speed (81, 81, 1) to (224, 224, 1) to match the original input shape of ResNet-18, and we change the size of the output layer to 1 for our regression task. We set the loss function to the mean square error (MSE) and select adaptive moment estimation (Adam) as the optimisation algorithm. For the hyperparameters, we set the batch size to 32, the number of epochs to 50, and the learning rate to 0.0001 through a series of experiments.
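A sketch of this baseline setup in PyTorch is shown below; the single-channel first convolution is our assumption for feeding the (224, 224, 1) wind field into ResNet-18, and `weights=None` gives the non-pre-trained network.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

model = resnet18(weights=None)                 # no pre-trained parameters
model.conv1 = nn.Conv2d(1, 64, kernel_size=7,  # accept 1-channel wind input
                        stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 1)  # single regression output

criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def prepare(x):
    """x: (batch, 1, 81, 81) wind speed -> (batch, 1, 224, 224)."""
    return F.interpolate(x, size=(224, 224), mode="bilinear",
                         align_corners=False)

# for epoch in range(50):  # batch size 32, learning rate 1e-4
#     loss = criterion(model(prepare(xb)), yb)
#     optimizer.zero_grad(); loss.backward(); optimizer.step()
```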
The results are very similar between the surface and 850 hPa, which further supports using 850 hPa in place of the surface as the base level, so we focus on the bias and RMSE obtained with 850 hPa as the input. The results in Table 6 differ significantly depending on how the testing dataset is partitioned. The test RMSE is 9.8 kts when the testing data come from the same distribution as the training dataset, i.e., under the random partition. The scatter plot in Figure 8a shows a linear correlation between IBTrACS_Vmax and ResNet_Vmax (the intensity predicted by ResNet-18) in that testing dataset, so linear correction could be used to remove the residuals. However, there is no obvious correlation between IBTrACS_Vmax and ResNet_Vmax in the testing dataset (2021–2022). Figure 8b,c show that the RMSE remains above 16 kts, regardless of the validation dataset, when the testing dataset is taken from consecutive years. We can conclude that the non-linear model based on ResNet-18 performs better in intensity correction than the linear model; however, the error is still far from the acceptable average intensity error in practical applications. Therefore, we next focus on the testing dataset split from consecutive years and optimise the inputs and features to improve the performance. Since the validation method has little impact on the testing results in the previous experiments, we only use the third splitting method (2004–2018 for training, 2019–2020 for validation, and 2021–2022 for testing) in the following experiments.
3.2.2. TC Knowledge for Optimising the Inputs
As machine learning approaches rely heavily on data quality, the construction of the inputs is extremely important. The data analysis section showed, using statistical methods, that there is no obvious correspondence between the inputs and the outputs. To make the single-level inputs more informative, we update them in three ways, as shown in Figure 9. The first is to use the original data without bilinear interpolation, preserving the true information hidden in the data. The second is to crop the central region and then use bilinear interpolation to resize it; the purpose of cropping is to make the wind structure near the centre of the storm clearer. The third is to rotate the inputs according to the direction of the storm motion to unify and standardise the wind pattern, and then crop and resize them. Table 7 confirms the effectiveness of resizing the inputs compared with the original inputs. The crop operation yields a small further improvement, but the rotate operation is not useful here. Therefore, in the following experiments, we only use the crop operation.
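A minimal sketch of these input operations is given below; the crop size is left as a parameter since the exact value is not restated here, and the rotation convention is a placeholder.

```python
import torch
import torch.nn.functional as F
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def preprocess(wind, crop=None, motion_angle_deg=None):
    """wind: (1, 81, 81) tensor centred on the storm.
    Optional rotation aligns the storm-motion direction across samples;
    optional cropping keeps the central crop x crop window; bilinear
    interpolation then resizes to the 224 x 224 network input."""
    if motion_angle_deg is not None:                   # rotate
        wind = TF.rotate(wind, motion_angle_deg,
                         interpolation=InterpolationMode.BILINEAR)
    if crop is not None:                               # crop the centre
        c, h = wind.shape[-1] // 2, crop // 2
        wind = wind[..., c - h:c + h, c - h:c + h]
    return F.interpolate(wind.unsqueeze(0), size=(224, 224),  # resize
                         mode="bilinear", align_corners=False).squeeze(0)
```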
The above experiments provide further evidence that the correspondence between single-level wind and the label is ambiguous, and the one-to-many and many-to-one problems remain. Therefore, we add information to the inputs, trying to ensure that they contain enough information for the neural networks to learn from. We do this in two ways: by increasing the number of pressure levels of the wind and by increasing the number of atmospheric variables. We keep the base level of 850 hPa and add the middle level (500 hPa) and the top level (200 hPa). We also add the equivalent potential temperature ($\theta_e$), which contributes to TC evolution [53] and is calculated with MetPy from pressure (p), temperature (t), and relative humidity (r). Table 8 shows the effectiveness of this operation: the RMSE is reduced to 14.90 kts when we use the wind and $\theta_e$ on the three pressure levels of 850 hPa, 500 hPa, and 200 hPa.
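A sketch of the $\theta_e$ calculation with MetPy follows; note that MetPy's `equivalent_potential_temperature` takes dewpoint rather than relative humidity, so the dewpoint is derived first.

```python
from metpy.calc import (dewpoint_from_relative_humidity,
                        equivalent_potential_temperature)
from metpy.units import units

def theta_e(p_hpa, t_k, rh_percent):
    """Equivalent potential temperature from pressure (p), temperature (t),
    and relative humidity (r), as used for the extra input variable."""
    p = p_hpa * units.hPa
    t = t_k * units.kelvin
    td = dewpoint_from_relative_humidity(t, rh_percent * units.percent)
    return equivalent_potential_temperature(p, t, td)

# theta_e(850.0, 295.0, 80.0)  # -> quantity in kelvin
```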
3.2.3. Feature Learning for Improving the Generalisability
Although the results after updating the inputs are better than the baseline (16.41 kts in Figure 8c), the generalisability of the model does not seem to improve much: the testing error of the model trained with (wind, $\theta_e$) at (850 hPa, 500 hPa, 200 hPa) remains 14.90 kts, still larger than the uncertainty of the best track data (7 kts) in the North Atlantic. In this section, we therefore change our approach: we split the model from inputs to outputs into two parts and update them separately. Since ResNet-18 is used in this paper, the network architecture consists of an input layer, a convolutional layer, a max pooling layer, four types of residual blocks (two blocks of each type), an average pooling layer, and an output layer. Specifically, features are extracted by the residual blocks, which are built from convolutional layers, and then reduced to a dimension of 512 by the average pooling layer; we can thus obtain the features from the average pooling layer after the inputs have been fed into the network. Here, we define the input-feature part (feature extractor) of the model as the process from the inputs to the output of the average pooling layer, and the feature-output mapping as the process from the output of the average pooling layer to the output layer. We then consider, step by step, whether the input-feature part and the feature-output mapping are effective, and decide whether to update them to improve performance.
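In PyTorch, the two parts can be separated as in the sketch below: everything up to and including the average pooling layer becomes the feature extractor, and the final fully connected layer is the feature-output mapping. Loading the trained weights is omitted, and the file name is a placeholder.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

model = resnet18(weights=None)
model.fc = nn.Linear(model.fc.in_features, 1)
# model.load_state_dict(torch.load("trained_resnet18.pt"))  # trained weights

# input-feature part: input layer ... average pooling (512-d features)
feature_extractor = nn.Sequential(*list(model.children())[:-1])
# feature-output mapping: the final fully connected layer
head = model.fc

x = torch.randn(4, 3, 224, 224)        # dummy batch
z = feature_extractor(x).flatten(1)    # (4, 512) features
y_hat = head(z)                        # (4, 1) intensity prediction
```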
We introduced three ways of splitting the dataset above, and we found that the testing error in Table 6 from the random split is satisfactory, whereas the results on the subsequent, unseen years are not. From the former finding, we can conclude that the ResNet-18 network is able to effectively represent the inputs from 2004 to 2022 with general features over the entire data space, which also validates the representative ability of deep neural networks. However, in the sub-spaces of the validation dataset (2019–2020) and the testing dataset (2021–2022), i.e., the unseen new years, the ResNet-18 model trained on the historical years (2004–2018) does not work well enough. We consider two possible explanations: the feature-output mapping may not be effective enough, the general features learned from the historical years may not be appropriate for the storms in the new years, or both.
We perform the following experiments to validate the above assumptions. The first operation is to enlarge the training dataset using data augmentation to reduce overfitting and thereby improve the generalisability of the features. This also helps to assess the impact of sample size and to verify that a small dataset can still be used to train a network model. Specifically, we use random rotation to increase the size, since Table 7 shows that rotating the inputs has no obvious effect on the results. We use the same experimental setup as in the previous experiments, including the computational environment, network architecture, hyperparameters, and so on; we only increase the number of epochs to 100 and save the best model according to the validation error during training. We find no significant difference between the different sample sizes in Table 9. To balance the computational cost and the generalisability, we adopt the model trained with the third setting in Table 9: the new training dataset consists of the original training dataset from 2004 to 2018 plus two copies (fold 2) with random rotation. We save the trained model without the output layer as the general feature extractor.
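The fold-2 augmentation can be sketched as follows, assuming the training inputs are stacked in a tensor; the uniform angle range is a stand-in for whatever rotation distribution was actually used.

```python
import random
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def augment_fold(inputs, labels, n_copies=2, seed=0):
    """Return the original training set plus n_copies randomly rotated
    duplicates (fold 2 = original + two rotated copies)."""
    rng = random.Random(seed)
    xs, ys = [inputs], [labels]
    for _ in range(n_copies):
        rotated = torch.stack([
            TF.rotate(x, rng.uniform(0.0, 360.0),
                      interpolation=InterpolationMode.BILINEAR)
            for x in inputs
        ])
        xs.append(rotated)
        ys.append(labels)
    return torch.cat(xs), torch.cat(ys)
```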
We freeze the general feature extractor described in the last paragraph and then reorganise the dataset. For the next set of experiments, the inputs to the learning tasks are changed from the ERA5 variables X to the corresponding 512-dimensional features extracted from them, while the labels remain the same. In the first set of experiments, the features or the feature-output mapping are updated using the samples from the previous training years (2004–2018), with the samples from 2019–2020 for validation and the testing years (2021–2022) for testing. The dataset used to retrain the model is D1, shown in Table 10, and the methods we use here are classical machine learning (ML) algorithms and MLPs. The top three ML algorithms in our validation are linear regression (LR), support vector regression (SVR), and gradient boosting regression (GBR). For the MLP, we use either a single layer with 1 unit, 3 layers with 1024, 512, and 1 units, or 5 layers with 1024, 4096, 1024, 512, and 1 units. According to our definitions in this paper, LR, SVR, GBR, and the one-layer MLP update only the feature-output mapping, whereas the three-layer and five-layer MLPs update the features and the feature-output mapping simultaneously.
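On the frozen 512-dimensional features, the retraining candidates look roughly like the sketch below (reading LR as scikit-learn's linear regression; the MLP widths follow the layer sizes quoted above).

```python
import torch.nn as nn
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.ensemble import GradientBoostingRegressor

# classical ML regressors fitted on the (n, 512) feature matrix
regressors = {
    "LR": LinearRegression(),
    "SVR": SVR(),
    "GBR": GradientBoostingRegressor(),
}
# for name, reg in regressors.items():
#     reg.fit(z_train, y_train.ravel())

# one layer: updates only the feature-output mapping
mlp_1 = nn.Linear(512, 1)
# three layers (1024, 512, 1): also transforms the features themselves
mlp_3 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
                      nn.Linear(1024, 512), nn.ReLU(),
                      nn.Linear(512, 1))
# five layers (1024, 4096, 1024, 512, 1)
mlp_5 = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(),
                      nn.Linear(1024, 4096), nn.ReLU(),
                      nn.Linear(4096, 1024), nn.ReLU(),
                      nn.Linear(1024, 512), nn.ReLU(),
                      nn.Linear(512, 1))
```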
The second set of experiments uses the same methods and settings as the first, only reorganising the samples into dataset D2: 90% of the samples from the previous validation years (2019–2020) are used to retrain the model and update the features or the feature-output mapping, the remaining 10% are used for validation, and the 2021–2022 samples are still used for testing. The third set of experiments also uses the same methods and settings, but reorganises the samples into dataset D3 and adds DA as a new method: 90% of the samples from the previous training and validation years (2004–2020) are used for retraining, the remaining 10% for validation, and the 2021–2022 samples for testing. When using the DA method, we set the loss weight of the MMD to 1, 100, and 1000 separately.
We design these three sets of experiments to validate the effectiveness, for feature learning, of the previous training information, of the validation information not used for training, and of all available known information. In all experiments, the inputs of D1, D2, and D3 are the features extracted from the ERA5 variables X rather than X itself, and the labels are the corresponding IBTrACS intensities.
We now describe the details of the DA method used in this part. As mentioned in Section 2, there may be differences in the data distribution between the training and testing datasets in practical applications, leading to weak generalisability of the trained model. Referring to the concept of DA, we can thus consider the training data as the source domain and the testing data as the target domain. The architecture design is inspired by domain adaptive neural networks (DaNN) [54] and deep domain confusion (DDC) [
55]. Our loss consists of two parts: the MSE between the predictions and labels in the source (training) dataset, and the MMD distance between the features of the training and testing data. The method is shown in
Figure 10, and the total loss can be expressed as
$$\mathcal{L} = \mathcal{L}_{\mathrm{MSE}} + \lambda \, \mathcal{L}_{\mathrm{MMD}}.$$
In this formula, $\mathcal{L}_{\mathrm{MSE}}$ denotes the MSE, $\mathcal{L}_{\mathrm{MMD}}$ denotes the square of the MMD, and $\lambda$ is the loss weight of the MMD. In detail, the square of the MMD can be expressed as follows:
$$\mathrm{MMD}^2(X_s, X_t) = \left\| \frac{1}{n_s} \sum_{i=1}^{n_s} \phi(x_i^s) - \frac{1}{n_t} \sum_{j=1}^{n_t} \phi(x_j^t) \right\|_{\mathcal{H}}^2,$$
where $\phi(\cdot)$ is the mapping that converts the original data into an RKHS (Reproducing Kernel Hilbert Space), and $n_s$ and $n_t$ are the sample sizes of the training and testing datasets, respectively. This transformation allows the features of the two datasets to be compared in a high-dimensional space. Here, we use the multi-kernel MMD (MK-MMD) as the distance function for the features and search for an appropriate weight $\lambda$ to balance the two parts of the loss. The aim of this method is to adapt the general features learned from the training dataset to the specific features of the testing dataset, thereby improving the generalisability of the model and reducing the testing error.
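A sketch of the squared MK-MMD with a bank of Gaussian kernels and the combined loss is given below; the kernel bandwidths and the biased estimator are our simplifying assumptions.

```python
import torch

def multi_gaussian_kernel(a, b, sigmas=(1.0, 2.0, 4.0, 8.0, 16.0)):
    """Sum of Gaussian (RBF) kernels over all row pairs of a and b."""
    d2 = torch.cdist(a, b) ** 2
    return sum(torch.exp(-d2 / (2.0 * s ** 2)) for s in sigmas)

def mk_mmd2(z_s, z_t):
    """Biased estimate of MMD^2 between source features z_s (n_s, 512)
    and target features z_t (n_t, 512) in the kernel-induced RKHS."""
    return (multi_gaussian_kernel(z_s, z_s).mean()
            + multi_gaussian_kernel(z_t, z_t).mean()
            - 2.0 * multi_gaussian_kernel(z_s, z_t).mean())

# total loss, with lam in {1, 100, 1000} as in the weight experiments:
# loss = torch.nn.functional.mse_loss(pred_s, y_s) + lam * mk_mmd2(z_s, z_t)
```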
The results in Table 11 show no significant improvement in error reduction using D1 and D2, whether based on the traditional ML algorithms (LR, SVR, and GBR) or on MLPs with different numbers of layers. However, there is an obvious improvement on D3 based on the MLP, so we focus on the results for the D3 dataset. When we use the MLP with a single layer of one unit, only the feature-output mapping is updated; when we use the MLP with three layers of 1024, 512, and 1 units, the features are updated as well, even though their shape is still (None, 512), and the feature-output mapping is updated at the same time. The results show that three layers perform better than one or five layers, reducing the RMSE to 11.55 kts. Although this is a clear improvement over the test results (16.41 kts) for 2021–2022 shown in Figure 8c, it is still larger than the uncertainty of the intensity (7 kts) in the North Atlantic. Therefore, we apply the DA method shown in Figure 10.
The only setting we adjust in this method is the loss weight of the MMD ($\lambda$), which takes the values 1, 100, and 1000. The results for 1 and 100 are very similar: the RMSE is 5.99 kts in both cases, although the biases differ slightly. However, the RMSE increases when the weight is set to 1000. We choose the best result, obtained with the DA method and a loss weight of 100, as the final approach in this paper and focus on analysing its results. This method reduces the RMSE to 5.99 kts, which is less than the uncertainty of the intensity in the North Atlantic. The scatter plot in Figure 11 shows an obvious linear correlation between the labels (IBTrACS_Vmax) and the predictions (DA_Vmax), and the Pearson correlation coefficient (r) between these two intensity values is 0.7 (>0.5). However, the predictions are not tightly centred and show a dispersed scatter, so we also examine the error distribution of the predictions. The error distribution in Figure 12 (left) shows that the errors are concentrated in the range [−20, 20] kts, although some samples fall outside this range. The prediction distribution in Figure 12 (right) is Gaussian and does not fit the label distribution. We suspect that this is because we use the MSE as the main loss function, and the distribution does not change when we add the MMD loss. Therefore, the DA method could be optimised and further explored in the future.