Retrieval of Volcanic Ash Cloud Base Height Using Machine Learning Algorithms

Zhao, Fenghua; Xia, Jiawei; Zhu, Lin; Sun, Hongfu; Zhao, Dexin

doi:10.3390/atmos14020228

by Fenghua Zhao ¹, Jiawei Xia ¹, Lin Zhu ²,*, Hongfu Sun ¹ and Dexin Zhao ¹

¹ College of Geoscience and Surveying Engineering, China University of Mining and Technology, Beijing 100083, China

² National Satellite Meteorological Center, China Meteorological Administration, Beijing 100081, China

* Author to whom correspondence should be addressed.

Atmosphere 2023, 14(2), 228; https://doi.org/10.3390/atmos14020228

Submission received: 8 December 2022 / Revised: 14 January 2023 / Accepted: 16 January 2023 / Published: 23 January 2023

(This article belongs to the Special Issue Remote Sensing Applied in Atmosphere: Recent Trends, Current Progress and Future Directions)

Abstract:

There are distinct differences between radiation characteristics of volcanic ash and meteorological clouds, and conventional retrieval methods for cloud base height (CBH) of the latter are difficult to apply to volcanic ash without substantial parameterisation and model correction. Furthermore, existing CBH inversion methods have limitations, including the involvement of many empirical formulae and a dependence on the accuracy of upstream cloud products. A machine learning (ML) method was developed for the retrieval of volcanic ash cloud base height (VBH) to reduce uncertainties in physical CBH retrieval methods. This new methodology takes advantage of polar-orbit active remote-sensing data from the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP), from vertical profile information and from geostationary passive remote-sensing measurements from the Spinning Enhanced Visible and Infrared Imager (SEVIRI) and the Advanced Geostationary Radiation Imager (AGRI) aboard the Meteosat Second Generation (MSG) and FengYun-4B (FY-4B) satellites, respectively. The methodology involves a statistics-based algorithm with hybrid use of principal component analysis (PCA) and one of four ML algorithms including the k-nearest neighbour (KNN), extreme gradient boosting (XGBoost), random forest (RF), and gradient boosting decision tree (GBDT) methods. Eruptions of the Eyjafjallajökull volcano (Iceland) during April-May 2010, the Puyehue-Cordón Caulle volcanic complex (Chilean Andes) in June 2011, and the Hunga Tonga-Hunga Ha’apai volcano (Tonga) in January 2022 were selected as typical cases for the construction of the training and validation sample sets. We demonstrate that a combination of PCA and GBDT performs more accurately than other combinations, with a mean absolute error (MAE) of 1.152 km, a root mean square error (RMSE) of 1.529 km, and a Pearson’s correlation coefficient (r) of 0.724. 
Applying PCA as a pre-processing step before training reduces the correlation among input predictors and improves algorithm accuracy. Although the ML algorithm performs well under relatively simple single-layer volcanic ash cloud conditions, it tends to overestimate VBH in multi-layer conditions, a problem that also remains unresolved in meteorological CBH retrieval.

Keywords:

volcanic ash cloud base height; machine learning; CALIOP lidar data; passive satellite measurement

1. Introduction

Volcanic ash clouds may seriously affect climate and aviation safety, and large-scale eruptions may have profound impacts on the climate through abnormal temperature and precipitation, air pollution, acid rain, and greenhouse effects [1,2,3,4,5]. With the growth of global air traffic, the dangers of volcanic ash clouds to aviation are increasing. Therefore, the prediction of volcanic ash cloud height has become increasingly important.

Volcanic ash top height (VTH) and base height (VBH) are key parameters that describe the vertical distribution of a volcanic ash cloud and are closely related to vertical motions in the atmosphere. They are therefore very important input parameters in volcanic ash transport models. Generally, for a single layer of volcanic ash, VTH refers to the top height of the volcanic ash layer, while the bottom height of the same layer is regarded as VBH. Passive meteorological satellites have been widely used in the quantitative estimation of VTH. The launch of new generations of meteorological satellites over the last decade has improved the assessment of cloud heights by providing higher temporospatial and spectral resolution [6,7]. Higher spatial resolution makes it possible to resolve smaller volcanic clouds, contributing to an increase in the number of advisories issued per year by the Volcanic Ash Advisory Centers (VAACs) [8]. Many techniques have been developed for the retrieval of VTH using passive meteorological satellites [9,10,11,12,13,14]. However, most current radiative transfer models assume single-layer clouds, and multi-layer conditions limit the accuracy of VTH retrieval [15,16].

VBH retrieval from passive meteorological satellites is more challenging than VTH retrieval. Cloud studies currently focus mainly on meteorological cloud base height (CBH) rather than VBH. The two main methods for the retrieval of CBH from passive meteorological satellites involve either extrapolation according to cloud type [17,18,19] or subtraction of the estimated cloud geometric thickness from the cloud top height (CTH) obtained from upstream cloud products [20,21,22]. These CBH inversion methods have several limitations: they involve numerous empirical formulae, depend heavily on the accuracy of upstream cloud products, and have limited ability to observe radiance from the cloud bottom under multi-layer cloud conditions [23,24]. Furthermore, there are distinct differences between the radiation characteristics of volcanic ash and meteorological clouds, such that conventional CBH retrieval methods are difficult to apply to volcanic ash clouds without substantial parameterization and model correction, which introduce further uncertainties.

Advanced machine learning (ML) techniques have been developed over recent years, including the k-nearest neighbour (KNN), support vector machine (SVM), random forest (RF), gradient boosting decision tree (GBDT), extreme gradient boosting (XGBoost), and deep learning (DL) techniques, and these have been successfully applied in the retrieval of cloud and aerosol parameters from satellite data [16,25,26]. ML methods have been widely used for the retrieval of CTH, CBH, and cloud phase. They differ from traditional physical methods in that they flexibly and efficiently learn hidden relationships among a large number of sample features, and they are able to fit nonlinear variables without considering the influence of sensor spectral features. ML can thus be used as an alternative statistical method describing complicated relationships between quantified cloud parameters and multi-channel remote-sensing information, with which traditional physical one-layer radiative transfer functions have difficulty [16]. Previous studies of ML methods in CBH retrieval have focused mainly on the use of satellite upstream products as input features, with possible error accumulation across products [24]. To reduce this error accumulation, Jiménez and McCandless (2021) explored the possibility of using ML for the retrieval of CBH directly from the radiation measurements taken from geostationary satellites, finding that an avoidance of upstream products enhances CBH retrieval [23].

Here, an ML methodology for VBH retrieval was developed, drawing on the advantages of ML in fitting nonlinear variables. The new methodology takes advantage of both vertical polar-orbit active remote-sensing data from the Cloud-Aerosol Lidar with Orthogonal Polarization (CALIOP) [27] system and horizontal geostationary passive remote-sensing radiative data from the Spinning Enhanced Visible and Infrared Imager (SEVIRI) and the Advanced Geostationary Radiation Imager (AGRI) aboard the Meteosat Second Generation (MSG, Europe) [28] and Fengyun-4B (FY-4B, China) satellites [29], respectively, in collecting training datasets. A total of 19,424 volcanic ash cloud sample points of active and passive satellites for the three eruptions (the Eyjafjallajökull (16 April and 5–16 May 2010), Puyehue-Cordón Caulle (5 and 15–18 June 2011) and Hunga Tonga–Hunga Ha’apai eruptions (15 and 17 January 2022)) were temporospatially matched. Among these, 4222 Eyjafjallajökull (9 and 16 May 2010) and Puyehue-Cordón (16 June 2011) datapoints were randomly selected as test data, and the rest were used as training data. The target data of the regression correspond to the CALIOP Level2 (L2) CBH. Principal component analysis (PCA) and the KNN, XGBoost, RF, and GBDT ML algorithms were used together to define a robust relationship between VBH and satellite observations. Section 2 introduces the data used in this study; Section 3 describes the methodology based on PCA and ML algorithms; Section 4 describes independent validation on VBH retrievals from a series of volcanic eruptions; Section 5 discusses VBH sensitivity to various combinations of PCA and ML algorithms, as well as atmospheric temperature profile inputs and uncertainties analysis; finally, Section 6 gives concluding remarks.

2. Data

2.1. Passive Satellite Data

2.1.1. MSG/SEVIRI

SEVIRI is a 12-channel imager aboard the MSG satellite with a repeat cycle of 15 min. Its high-resolution channel, High Resolution Visible (HRV), covers half of the disc in the E–W direction and the full disc in the N–S direction. The other 11 channels comprise eight thermal infrared (IR) and three solar spectrum channels, with a sampling distance of 3 km at nadir [28]. The wavelength ranges and main applications of the SEVIRI channels are shown in Table 1. SEVIRI Level 1 (L1) data were downloaded from the European Organization for the Exploitation of Meteorological Satellites’ (EUMETSAT) Earth observation portal (https://eoportal.EUMETSAT.int/userMgmt (accessed on 24 May 2022)).

2.1.2. FY-4B/AGRI

The FY-4B meteorological satellite is in geostationary orbit over the Equator at 133° E; AGRI is included in its payload. AGRI data are divided into four categories, according to spatial resolution: the spatial resolution of channel 2 is up to 0.5 km, down-scalable to 1 km, 2 km, and 4 km; the highest resolution of channels 1 and 3 is 1 km, down-scalable to 2 km and 4 km; the highest resolution of channels 4–7 is 2 km, down-scalable to 4 km; the highest spatial resolution of the remaining seven channels (8–14) is 4 km. AGRI also features high temporal resolution, performing area scans at 1 min intervals and full Earth disc scans at 15 min intervals [29]. The wavelength ranges and main applications of the AGRI channels are shown in Table 2. The AGRI data used in this study are the FY-4B in-orbit test data provided by the National Satellite Meteorological Center (NSMC). FY-4B began to provide observation data for global users on 1 June 2022; data acquired after that date can be downloaded from the FENGYUN Satellite Data Center (http://satellite.nsmc.org.cn/portalsite/default.aspx (accessed on 10 January 2023)).

Using SEVIRI and AGRI data, eruptions of the Eyjafjallajökull volcano (Iceland; April–May 2010), the Puyehue-Cordón Caulle volcanic complex (Chilean Andes; June 2011), and the Hunga Tonga–Hunga Ha’apai volcano (Tonga; January 2022) were selected as typical cases for the construction of training and validation sample sets. Volcanic ash, ice, and water have different absorption properties in the split-window channels centred at 10.8 and 12 μm, and these channels have been widely used to distinguish volcanic ash from meteorological clouds and in volcanic ash parameter retrieval [11,12]. In addition to these traditional split-window channels, other channels are sensitive to volcanic ash, volcanic SO₂, and water vapor, including those centred at 6.25, 8.6, and 13.3 μm [7,30,31]. To maximize the use of IR data, the brightness temperatures (BTs) of SEVIRI channels 4 (3.9 μm), 5 (6.25 μm), 7 (8.7 μm), 9 (10.8 μm), 10 (12.0 μm), and 11 (13.4 μm) and AGRI channels 7 (3.75 μm), 9 (6.25 μm), 12 (8.55 μm), 13 (10.8 μm), 14 (12.0 μm), and 15 (13.3 μm) were used to build volcanic ash datasets for training and testing the VBH retrieval model (Figure 1).

2.2. Active Satellite Data

Here, the CALIOP L2 CBH product (Version: 4.10) was used as the ‘true’ value for both training and independent validation. The US NASA Earth System Science Pathfinder (ESSP) programme, together with the French space agency, the National Centre for Space Studies (CNES), developed the Cloud-Aerosol Lidar and Infrared Pathfinder Satellite Observation (CALIPSO) satellite for observation of the global distribution and properties of aerosols and clouds. CALIOP is a satellite-based lidar system aboard CALIPSO and is the first such lidar system with three channels (one at 1064 nm and two at 532 nm, polarized parallel and perpendicular to the transmitted beam), providing high-resolution vertical profiles of aerosols and clouds in cloud-free conditions and under low-lying and optically thin clouds. Its detection of backscattered radiation at 532 and 1064 nm provides more accurate distributions of clouds and aerosols than are available from other systems [27,32]. There are three basic types of CALIOP L2 product with different spatial resolutions, namely layer products, profile products, and the vertical feature mask. Owing to the lidar's high sensitivity to backscattered signals from atmospheric particles (i.e., clouds and aerosols), a more accurate CBH can theoretically be obtained using the lidar echo signal [27]. We also analysed the vertical structure of volcanic clouds using CALIOP Level 1B (L1B) (Version: 4.10) data (comprising the lower-troposphere 532 nm total-attenuated backscatter product of 333 m horizontal resolution) to distinguish between meteorological clouds and volcanic ash cloud. CALIOP L1B and L2 data are available from the NASA Langley Research Center’s Atmospheric Sciences Data Center (ASDC) website (https://asdc.larc.nasa.org (accessed on 26 January 2022)).

2.3. Atmospheric Profiles

ERA5 is the fifth generation of the European Centre for Medium-Range Weather Forecasts (ECMWF) reanalysis product, which reconstructs the global climate and weather of the past four to seven decades [33] and provides many atmospheric, land, sea surface, and sea state parameters on an hourly basis. ERA5’s vertical temperature-profile and relative-humidity-profile data were also used as input variables to account for the effect of atmospheric temperature and humidity on observed radiation. ERA5 data are available on a latitude/longitude grid of 0.25° × 0.25° resolution with 37 pressure levels for atmospheric parameters [34]. The ERA5 data used here can be downloaded from https://cds.climate.copernicus.eu/cdsapp#!/home (accessed on 21 March 2022).

2.4. Data Processing and Quality Control

The pre-processing of data before construction of the hybrid retrieval model involved three main steps. Firstly, SEVIRI L1 data and CALIOP L2 CBH data and atmospheric profile data for the eruptions were collected and collocated. Secondly, meteorological and volcanic ash clouds were distinguished using the CALIOP aerosol type product (Figure 2), which relies on latitude, temperature, and the measured properties of each layer (total attenuated backscatter (Figure 3), depolarization ratio at 532 nm (Figure 4), and total attenuated colour ratio (1064 nm/532 nm) (Figure 5)) [35], together with the volcanic ash mask of the volcanic ash detection algorithm of the US National Oceanic and Atmospheric Administration (NOAA; Figure 6) [36]. Finally, a total of 19,424 volcanic ash cloud sample points of active and passive satellites for the three eruptions were temporospatially matched. The data were used as training and testing datasets in the development of ML models, and 4222 Eyjafjallajökull and Puyehue-Cordón datapoints were randomly selected as test data (Section 4).

3. Methods

3.1. PCA

PCA is a multivariate statistical method for dimension reduction. It reduces a set of potentially correlated high-dimensional variables to a set of low-dimensional linearly uncorrelated synthetic variables, preserving as much variance of the original data as possible. In order to achieve these goals, PCA computes new variables called principal components which are obtained as linear combinations of the original variables. The first principal component has the maximum variance, which contains the most information. The second principal component is computed under the constraint of being orthogonal to the first component, which removes the information contained in the first principal component and reduces data redundancy. The other components are computed likewise [38]. The computed principal components are not only uncorrelated, but also their variances decrease in turn, which is an optimization of the original variables. The reduced-dimensional feature vectors exclude the effect of covariance between the original vectors on the accuracy of the prediction algorithm and improve the accuracy of the algorithm using small data samples of high dimensionality. The main steps for dimensionality reduction using the PCA algorithm are as follows [39].

The original data are represented by:

$X = \begin{pmatrix} x_{11} & x_{12} & \dots & x_{1m} \\ x_{21} & x_{22} & \dots & x_{2m} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \dots & x_{nm} \end{pmatrix} = (X_{1}, X_{2}, \dots, X_{m}),$

(1)

where n is the number of sample points in the dataset, and each sample point has m variables.

Standardization of the original data is undertaken to eliminate the adverse effects of different dimensions (units and scales). The standardization formula is:

$Z_{ij} = \frac{x_{ij} - \bar{x}_{j}}{\sqrt{Var(x_{j})}} \quad (i = 1, 2, \dots, n; \; j = 1, 2, \dots, m),$

(2)

$\bar{x}_{j} = \frac{1}{n} \sum_{i = 1}^{n} x_{ij},$

(3)

$Var(x_{j}) = \frac{1}{n - 1} \sum_{i = 1}^{n} (x_{ij} - \bar{x}_{j})^{2} \quad (j = 1, 2, \dots, m).$

(4)

Calculation of the correlation coefficient matrix R:

$R = \frac{1}{n - 1} Z' Z,$

(5)

where R is an m × m symmetric matrix whose diagonal elements are all 1, and Z′ is the transpose of Z.

Calculation of the eigenvalues and eigenvectors of the correlation coefficient matrix. The eigenvalues $λ_{i}$ are obtained from

$|λ E - R| = 0,$

(6)

where E is the identity matrix. The eigenvalues $λ_{i}$ are then sorted by size. The eigenvectors $ZX_{i}$ are obtained from

$(λ_{i} E - R) \, ZX_{i} = 0,$

(7)

$ZX_{i} = (ZX_{i1}, ZX_{i2}, \dots, ZX_{im})'.$

(8)

Determination of the number of principal components. The contribution rate of the ith principal component, $α_{i}$, is

$α_{i} = \frac{λ_{i}}{\sum_{k = 1}^{m} λ_{k}},$

(9)

and the cumulative contribution rate of the first p principal components is given by

$CV_{p} = \frac{\sum_{i = 1}^{p} λ_{i}}{\sum_{i = 1}^{m} λ_{i}}.$

(10)

Dimensionality reduction of the original sample is thus achieved. As a commonly used data-processing algorithm, PCA has the following advantages:

The amount of information is calculated according to the covariance, which is influenced only by the data characteristics of the original sample.
The obtained principal components are orthogonal to each other, effectively avoiding the mutual influence of each dimensional parameter.
The principle is simple with rapid calculations.

Based on the BT data of all training datasets, a heat map of BT correlations was calculated (Figure 7), indicating strong correlation among the BT features. PCA was therefore applied before the ML algorithms to reduce data dimensionality and thus feature redundancy.
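The steps in Equations (1)–(10) can be sketched in a few lines of NumPy. This is a minimal illustration rather than the code used in the study: the function name `pca_by_correlation`, the 99% default threshold argument, and the synthetic input matrix are assumptions made for the example.

```python
import numpy as np

def pca_by_correlation(X, threshold=0.99):
    """Standardize, build R, eigendecompose, and keep p components (Eqs. 1-10)."""
    X = np.asarray(X, dtype=float)
    n, m = X.shape
    # Eqs. (2)-(4): z-score each column using the sample variance (n - 1).
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Eq. (5): m x m correlation coefficient matrix.
    R = Z.T @ Z / (n - 1)
    # Eqs. (6)-(8): eigenvalues and eigenvectors of the symmetric matrix R.
    eigvals, eigvecs = np.linalg.eigh(R)
    order = np.argsort(eigvals)[::-1]  # sort by decreasing eigenvalue
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Eqs. (9)-(10): individual and cumulative contribution rates.
    alpha = eigvals / eigvals.sum()
    cumulative = np.cumsum(alpha)
    # Smallest p whose cumulative contribution rate reaches the threshold.
    p = min(int(np.searchsorted(cumulative, threshold)) + 1, m)
    scores = Z @ eigvecs[:, :p]  # samples projected onto the first p components
    return scores, alpha, cumulative, p

# Synthetic example (not the study's data): three independent signals
# plus three noisy copies, giving six strongly correlated columns.
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
X_demo = np.hstack([base, base + 0.05 * rng.normal(size=(200, 3))])
scores, alpha, cumulative, p = pca_by_correlation(X_demo)
print(p, np.round(cumulative, 4))
```

Because the noisy copies add almost no new variance, far fewer than six components are needed to reach the threshold, which is the kind of redundancy that PCA removes from highly correlated BT channels.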

3.2. The Optimal ML Algorithm

We applied four main ML algorithms in training the VBH prediction model: the KNN [40], RF [41], GBDT [42], and XGBoost [43] algorithms. The prediction of VBH is a regression problem rather than a classification problem, and all four algorithms can be used for regression. With the KNN algorithm, the output value is simply the average over the nearest neighbours, which are identified in the input space using a distance metric. Ensemble learning is a machine learning method that trains a series of base learners and integrates their individual results according to a certain rule to achieve better outcomes than a single learner [44]. Here, the ensemble learning algorithms include RF, GBDT, and XGBoost. RF constructs a large number of decision trees and returns the average of the predictions of all trees; GBDT is an iterative decision tree algorithm comprising multiple decision trees, with the outputs of all trees being accumulated into the final result; finally, XGBoost is an improvement on GBDT with highly optimised speed that supports regularization to reduce the possibility of over-fitting. Ensemble learning algorithms have become increasingly popular in recent years because they often outperform single ML algorithms [45,46]. However, results vary from application to application and even from dataset to dataset, so it is necessary to test and compare the prediction effectiveness of different ML methods for each subject and dataset. Here, the optimal prediction model for VBH was selected by applying the PCA algorithm before each of the four ML algorithms and comparing results. A flow chart of the VBH retrieval algorithm used here is shown in Figure 8.

The specific implementation steps were as follows.

Selection of input features, which include BT data from six IR bands of MSG/SEVIRI and FY-4B/AGRI, the BT difference (BTD) of three thermal IR channels of MSG/SEVIRI and FY-4B/AGRI, temperature and humidity profiles of ERA5, and latitude and longitude of each sample (Table 3).
Application of the PCA algorithm. The cumulative explained-variance curve plots the number of features retained after dimensionality reduction on the horizontal axis against the share of variance captured by the new feature matrix on the vertical axis. This curve guides the choice of the number of features to retain. Here, the cumulative contribution threshold of the PCA principal components was set to 99% (Figure 9); 25 components reached this threshold, so the number of new features to retain was set to 25.
Training on the new features obtained from PCA using the KNN, XGBoost, RF, and GBDT algorithms. The performance of ML models is closely related to the choice of parameters, and reasonable settings usually improve accuracy, so parameter tuning must be part of model training [47]. Scikit-learn (sklearn) is a Python module integrating a wide range of state-of-the-art ML algorithms. Its GridSearchCV class was used to adjust parameters automatically. The parameters adjusted for the four algorithms and their ranges are shown in Table 4.
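The workflow described in these steps (standardization, PCA with a 99% cumulative variance threshold, a GBDT regressor, and a grid search over parameters) might be sketched with scikit-learn as follows. This is an illustration under stated assumptions, not the authors' implementation: the data are synthetic, the parameter grid is hypothetical (the ranges actually searched are those of Table 4), and chaining the steps in a `Pipeline` is a design choice made here so that PCA is re-fitted on each training fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for a 51-feature matrix; NOT the study's data.
latent = rng.normal(size=(300, 5))
X = latent @ rng.normal(size=(5, 51)) + 0.1 * rng.normal(size=(300, 51))
# Pseudo VBH target in km, built from two latent factors plus noise.
y = 8.0 + latent[:, 0] - 0.5 * latent[:, 1] + 0.2 * rng.normal(size=300)

pipeline = Pipeline([
    ("scale", StandardScaler()),
    # A float n_components keeps the fewest components whose cumulative
    # explained variance reaches that fraction (here 99%).
    ("pca", PCA(n_components=0.99, svd_solver="full")),
    ("gbdt", GradientBoostingRegressor(random_state=0)),
])

# Hypothetical grid for illustration only.
param_grid = {
    "gbdt__n_estimators": [50, 100],
    "gbdt__max_depth": [2, 3],
}

search = GridSearchCV(pipeline, param_grid,
                      scoring="neg_mean_absolute_error", cv=3)
search.fit(X, y)

n_kept = search.best_estimator_.named_steps["pca"].n_components_
print("best parameters:", search.best_params_)
print("components kept:", n_kept)
print("cross-validated MAE (km):", -search.best_score_)
```

Swapping the last pipeline step for `KNeighborsRegressor`, `RandomForestRegressor`, or an XGBoost regressor would allow the four candidates to be compared under the same pre-processing.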

3.3. Evaluation of Retrieval Accuracy

The optimal algorithm was selected by comparing the prediction results of the four algorithms. To assess the effectiveness of the model, five evaluation indicators were used: the mean absolute error (MAE), root mean square error (RMSE), Pearson’s correlation coefficient (r), mean bias error (MBE) and mean absolute percentage error (MAPE). During the calculation of MBE, the positive and negative errors can offset each other, impacting the accurate evaluation of the model performance. Therefore, we only use MAE, RMSE and r in the evaluation of four machine learning algorithms. However, MBE can determine whether there is a positive or negative error in the model. Therefore, MBE is applied in the result validation (Section 4) to judge whether the VBH predicted by the model is under- or overestimated. MAPE is calculated to compare this study with other works. These were calculated as follows:

$MAE = \frac{1}{n} \sum_{i = 1}^{n} |Y_{i} - \hat{Y}_{i}|,$

(11)

$RMSE = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} (Y_{i} - \hat{Y}_{i})^{2}},$

(12)

$r = \frac{\sum_{i = 1}^{n} (\hat{Y}_{i} - \bar{\hat{Y}}) (Y_{i} - \bar{Y})}{\sqrt{\sum_{i = 1}^{n} (\hat{Y}_{i} - \bar{\hat{Y}})^{2} \sum_{i = 1}^{n} (Y_{i} - \bar{Y})^{2}}},$

(13)

$MBE = \frac{1}{n} \sum_{i = 1}^{n} (\hat{Y}_{i} - Y_{i}),$

(14)

$MAPE = \frac{100}{n} \sum_{i = 1}^{n} \left| \frac{\hat{Y}_{i} - Y_{i}}{Y_{i}} \right|,$

(15)

where n is the number of test datapoints, $Y_{i}$ is the actual value of the ith VBH, $\hat{Y}_{i}$ is the predicted value of the algorithm for the ith VBH, and $\bar{Y}$ and $\bar{\hat{Y}}$ are the means of the actual and predicted values, respectively.
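The five indicators in Equations (11)–(15) can be computed directly with NumPy. The sketch below is illustrative: the function name `evaluation_metrics` and the four sample heights are made up for the example and are not values from the study.

```python
import numpy as np

def evaluation_metrics(y_true, y_pred):
    """Return MAE, RMSE, r, MBE and MAPE as defined in Eqs. (11)-(15)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_pred - y_true  # signed error: prediction minus reference
    return {
        "MAE": float(np.mean(np.abs(err))),
        "RMSE": float(np.sqrt(np.mean(err ** 2))),
        "r": float(np.corrcoef(y_pred, y_true)[0, 1]),
        "MBE": float(np.mean(err)),
        "MAPE": float(100.0 * np.mean(np.abs(err / y_true))),
    }

# Made-up heights in km, for illustration only.
caliop_vbh = [4.0, 6.0, 8.0, 10.0]
predicted_vbh = [5.0, 6.5, 7.5, 9.0]
metrics = evaluation_metrics(caliop_vbh, predicted_vbh)
print(metrics)
```

In this toy case the signed errors are +1.0, +0.5, −0.5 and −1.0 km, so MBE is 0 while MAE is 0.75 km, which illustrates why MBE alone cannot rank models even though its sign is useful for detecting over- or underestimation.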

Parameter tuning for the four algorithms yielded optimal parameter combinations for each algorithm. The MAE, RMSE and r values of the four algorithm models were calculated based on the optimal parameters for each algorithm (Table 5). Comparison of the MAE and RMSE values of the four models indicates that the MAE and RMSE values of the GBDT algorithm (1.152 and 1.529 km, respectively) are the lowest of the four algorithms (Figure 10). Min et al. (2020) [25] compared four of the ML algorithms (KNN, SVM, GBDT, and RF) using CTH estimates from the Advanced Himawari Imager (AHI) onboard Himawari-8 and found that the GBDT algorithm outperformed the others with a minimum MAE of 1.54 km and small fluctuations in tuning experiments [48]. GBDT thus has advantages when estimating CBH. We therefore used a combination of PCA and GBDT algorithms (the PCA_GBDT algorithm) for the prediction of VBH.

4. Results and Validation

4.1. Use of CALIOP Data for Validation

The optimal ML algorithm (PCA_GBDT) was applied to an independent sample set and retrieved VBHs were compared with collocated CALIOP VBHs. To test the performance of PCA_GBDT under different cloud conditions, the original independent sample set was divided into three types of dataset for single- and multi-layer volcanic ash clouds and for all samples. These datasets were then stratified at 1 km intervals and prediction results were compared.
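The stratified comparison can be sketched as a per-layer error summary. The example below assumes that samples are grouped by the reference (CALIOP) height into 1 km layers; the text does not state which height defines the layers, so this choice, the function name `mae_by_height_bin`, and the sample values are all assumptions.

```python
import numpy as np

def mae_by_height_bin(y_true, y_pred, bin_width_km=1.0):
    """Summarize errors within height layers of width bin_width_km."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    # Index of the layer that each reference height falls into.
    layer = np.floor(y_true / bin_width_km).astype(int)
    result = {}
    for k in np.unique(layer):
        mask = layer == k
        lower = float(k * bin_width_km)
        err = y_pred[mask] - y_true[mask]
        result[(lower, lower + bin_width_km)] = {
            "count": int(mask.sum()),
            "MAE": float(np.mean(np.abs(err))),
            "MBE": float(np.mean(err)),  # sign shows over/underestimation
        }
    return result

# Made-up heights in km, for illustration only.
reference = [3.2, 3.8, 7.1, 7.9, 11.5]
predicted = [4.0, 4.4, 7.0, 7.2, 10.5]
table = mae_by_height_bin(reference, predicted)
for bounds, stats in table.items():
    print(bounds, stats)
```

Reporting the signed MBE next to MAE for each layer makes it possible to see at which heights a model overestimates or underestimates.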

PCA_GBDT gave optimal performance under single-layer volcanic ash cloud conditions (Figure 11c), with an MAE of 1.104 km, an RMSE of 1.374 km, and a generally linear relationship with validation data (r = 0.754). The PCA_GBDT algorithm tended to overestimate VBH at <7 km and underestimate it at >7 km (Figure 11d). When all samples were used (Figure 11a), MAE increased slightly (from 1.104 km to 1.152 km), and errors were higher than under single-layer conditions, especially for lower layers (Figure 11b). When only samples with multi-layer volcanic ash clouds were used, MAE increased to 1.351 km and RMSE to 2.051 km (Figure 11e), indicating greater retrieval uncertainty under multi-layer conditions. When predicting multi-layer VBH, the CBH of non-top-layer clouds may be predicted too high, which likely increases the error (Section 5). In general, for VBH > 7 km the overall prediction error of the PCA_GBDT algorithm was relatively small, and for VBH < 7 km the algorithm still performed well for single-layer volcanic ash clouds, whereas VBH retrieval remained difficult under multi-layer conditions.

4.2. Case Study

The Puyehue-Cordón Caulle eruption (04:01–04:07 UTC on 16 June 2011) was selected to further verify the accuracy of the PCA_GBDT algorithm, as it produced both single- and multi-layer volcanic ash clouds together with mixtures of meteorological and volcanic ash clouds. Meteorological conditions were complex, so the prediction performance of the algorithm could be tested rigorously. The cloud vertical structure for this period is depicted in Figure 12, which indicates that VBH retrievals for single-layer clouds had smaller errors than those for multi-layer clouds.

Although the PCA_GBDT-predicted VBH appeared slightly lower when there were meteorological clouds below the volcanic ash cloud, the overall prediction results were consistent with the CALIOP CBH. For the multi-layer volcanic ash cloud, the retrieved VBH was considerably higher than the real VBH from CALIOP and nearer the CBH of the top layer, consistent with the previous result. The PCA_GBDT algorithm thus yields better VBH retrievals for single-layer volcanic ash clouds and the top clouds of multi-layer clouds, and it can generally capture the complex nonlinear relationship between VBH and satellite IR sensors.

Piontek et al. (2021) [49] introduced a new volcanic ash retrieval algorithm, VACOS, using the geostationary instrument MSG/SEVIRI and artificial neural networks. VACOS performs pixelwise classifications and retrieves (indirectly) the mass column concentration, the cloud top height and the effective particle radius. The data and algorithm used by VACOS are similar to those used in this work. We therefore compared the retrieval accuracy of this study with that of VACOS using the Puyehue-Cordón Caulle eruption case of 16 June 2011 (Table 6). The MAPE is 18% for VACOS and 15% for PCA_GBDT, indicating that our VBH retrieval has accuracy comparable to previous results, even though base height retrieval is more difficult [24].

5. Discussion

5.1. Sensitivity Analysis

5.1.1. Comparison between GBDT and PCA_GBDT

PCA was undertaken with a feature space comprising 51 features. Eigenvalues, variance contribution rates, and the cumulative contribution rate of variance of each principal component were calculated (Table 7). Only the first 25 principal components needed to be selected, as they contained 99% of the original information. Correlation coefficients between the first three principal components and standardized features are listed in Table 8.

The GBDT and PCA_GBDT algorithms were compared to assess the influence of PCA on the accuracy of the ML algorithm, with their parameters adjusted and independent verification data (test set; Section 2) predicted using the optimal parameter combination. Result precision is shown in Table 9 for the two ML algorithms.

The results indicate that, when PCA was combined with GBDT, the MAE of the retrieved VBH decreased from 1.278 km to 1.152 km and the RMSE from 1.663 km to 1.529 km, while r increased from 0.715 to 0.724. PCA thus has advantages in feature extraction while also improving prediction accuracy.
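These differences can also be expressed as relative changes, the form used for the sensitivity results in Sections 5.1.2 and 5.1.3. The short sketch below applies the usual percentage-change formula to the values quoted above; the helper name `relative_change_percent` is introduced only for this example.

```python
def relative_change_percent(before, after):
    """Percentage change from 'before' to 'after' (negative means a decrease)."""
    return 100.0 * (after - before) / before

# Values quoted in the text for GBDT alone and for PCA combined with GBDT.
gbdt = {"MAE": 1.278, "RMSE": 1.663, "r": 0.715}
pca_gbdt = {"MAE": 1.152, "RMSE": 1.529, "r": 0.724}

changes = {name: relative_change_percent(gbdt[name], pca_gbdt[name])
           for name in gbdt}
for name, value in changes.items():
    print(f"{name}: {value:+.2f}%")
```

With these inputs, adding PCA corresponds to roughly a 9.9% lower MAE, an 8.1% lower RMSE, and a 1.3% higher r.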

5.1.2. Sensitivity to Atmospheric Profile

Uncertainties caused by different eruption locations, times, and atmospheric conditions mean it is difficult to completely capture spatial changes in VBH retrieval using only satellite IR data as inputs. Whether traditional physical retrieval methods or statistical estimation methods are used, both types are inevitably sensitive to meteorological conditions in cloud-parameter retrieval. The sensitivity of the PCA_GBDT model to atmospheric temperature and relative humidity was assessed by comparison of prediction results with and without adding temperature and humidity vertical profile data to the training samples, and with the cumulative contribution rate threshold of the PCA principal component set at 99%. The results (Table 10) indicate that with addition of the atmospheric temperature and humidity profile data, MAE decreased by 26.39% and RMSE by 21.67%, whereas r increased by 2.84%. Therefore, the addition of temperature and humidity profile data to the training improves the accuracy of VBH retrieval.

5.1.3. Sensitivity to BT Data

The sensitivity of the PCA_GBDT algorithm to BT was examined by comparison of prediction results with and without BT added to the training set, and with the cumulative contribution rate threshold of the PCA principal component set at 99%. With the addition of satellite-observed BT data, MAE decreased by 12.99% and RMSE by 12.28%, whereas r increased by 14.56% (Table 11). The satellite-observed BT data therefore contain considerable information about VBH, and including them improves VBH retrieval.

5.2. Uncertainty Analysis

Complex meteorological conditions and different types of data were involved in the development of ML models for the retrieval of VBH, and both factors may cause uncertainties in final retrieval.

Under complex meteorological conditions, vertical multi-layer structures in the volcanic ash cloud may adversely affect VBH retrieval. If the cloud optical thickness of the top layer is greater than that of lower layers, the satellite-observed radiation tends to reflect the cloud top more than the bottom [15], causing VBH retrieval to be close to VTH. With complex backgrounds, where volcanic ash and meteorological cloud overlap on vertical scales, low-level meteorological cloud may reduce the prediction accuracy of VBH, generating large uncertainties and causing possible underestimation of VBH [9].

In data pre-processing, volcanic ash cloud false-colour images were used together with the CALIOP product to distinguish meteorological and volcanic ash clouds. Even though the data are quality controlled, the two cloud types may still be confused, so further differentiation is required. Additional uncertainties may be introduced in multi-source data matching owing to differences in data resolution: the spatial resolutions of the CALIOP L2 CBH, SEVIRI L1, and AGRI L1 data were 1, 3, and 4 km, respectively. The lower-resolution ERA5 data may not capture horizontal and vertical variations in atmospheric temperature and relative humidity well, especially near the surface.
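To make the resolution-mismatch issue concrete, the sketch below pairs one lidar footprint with the closest imager pixel by great-circle (haversine) distance and rejects the match if the pixel lies farther away than a tolerance. This is only an assumed nearest-neighbour scheme for illustration; the matching procedure actually used is described in the data-processing section and is not restated here. The coordinates, the 4 km tolerance, and all names are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_pixel(footprint, pixels, max_km):
    """Return (index, distance_km) of the closest pixel centre, or
    (None, distance_km) if even the closest one exceeds `max_km`."""
    best_index, best_dist = None, float("inf")
    for i, (lat, lon) in enumerate(pixels):
        d = haversine_km(footprint[0], footprint[1], lat, lon)
        if d < best_dist:
            best_index, best_dist = i, d
    if best_dist > max_km:
        return None, best_dist
    return best_index, best_dist


# Hypothetical pixel centres and lidar footprint (lat, lon in degrees)
pixels = [(-40.00, -70.00), (-40.00, -69.95), (-40.05, -70.00)]
footprint = (-40.01, -69.96)

index, dist = nearest_pixel(footprint, pixels, max_km=4.0)
print(index, round(dist, 2))
```

Even an accepted match can pair a roughly 1 km lidar footprint with a 3 to 4 km imager pixel, so the two instruments do not sample exactly the same cloud volume.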

6. Conclusions

A new algorithm was developed for retrieval of VBH based on combinations of PCA and four ML methodologies, taking advantage of both vertical and horizontal information for a variety of ash clouds under single- and multi-layer conditions. Vertical information was derived from polar-orbit active remote-sensing data (CALIOP) and ERA5 reanalysis profiles, which respectively represent “true” VBHs and meteorological conditions at different vertical layers. Horizontal information was derived from geostationary passive remote-sensing BT measurements (MSG/SEVIRI and FY-4B/AGRI). Through independent verification and analysis of results, the following conclusions were drawn.

Under single-layer cloud conditions, the new statistically based VBH algorithms developed here generally capture the complex nonlinear relationship between VBH and satellite IR radiation, consistent with independent data. The combination of PCA and GBDT yields optimal performance, with an MAE of 1.152 km, an RMSE of 1.529 km, and an r value of 0.724. PCA_GBDT slightly overestimated VBH for low-level volcanic ash cloud layers and underestimated it for high-level ash layers. Under multi-layer conditions, MAE increased to 1.35 km and RMSE to 2.05 km, indicating greater retrieval uncertainty. Retrieval errors were partly due to the limited detection ability of the passive geostationary imagers that supply the horizontal information, whose data reflect mainly radiation from the top cloud layer.
ML requires a large number of training samples, but increasing the sample size generates redundancy and increases computational load. A PCA step before training improves VBH retrieval precision: relative to GBDT alone, PCA_GBDT reduced MAE and RMSE by 9.86% and 8.06%, respectively, and increased r by 1.26%. PCA thus reduces redundancy in the input information and improves the accuracy of VBH retrieval.
Previous CBH studies have focused mainly on the use of remote-sensing data and upstream products to establish training data sets [24], and the use of several upstream products tends to introduce additional inversion errors. Atmospheric reanalysis data include large amounts of vertical-structure information for the atmosphere, closely related to the thermal, dynamic, and microphysical characteristics of volcanic ash clouds. However, such information has not been considered in previous CBH retrieval studies. Here, vertical information was included in the training dataset, which reduced MAE from 1.565 to 1.152 km and RMSE from 1.952 to 1.529 km and increased r from 0.704 to 0.724. The use of atmospheric profiles as auxiliary data thus improves the accuracy of VBH retrieval, especially under multi-layer cloud conditions.

Here, the Eyjafjallajökull, Puyehue-Cordón Caulle and Hunga Tonga–Hunga Ha’apai eruptions provided training datasets; in future work, training samples with more spectral and microphysical ash-cloud diversity may further improve retrieval accuracy.

Author Contributions

Conceptualization, L.Z. and F.Z.; methodology, J.X. and L.Z.; software, J.X. and D.Z.; validation, L.Z. and J.X.; resources, L.Z., F.Z. and H.S.; data curation, J.X.; writing—original draft preparation, J.X.; writing—review and editing, L.Z. and J.X.; visualization, J.X.; project administration, L.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 41871263.

Data Availability Statement

Only publicly available data sets were analysed in this study. Passive satellite data were obtained from the EUMETSAT Earth Observation Portal and the FENGYUN Satellite Data Center; active satellite data were obtained from the NASA Langley Research Center’s Atmospheric Sciences Data Center (ASDC) website; and atmospheric profile data were obtained from the European Centre for Medium-Range Weather Forecasts (ECMWF).

Acknowledgments

The authors thank Michael Pavolonis of the National Oceanic and Atmospheric Administration Center, who provided GOES-R volcanic ash products for the two cases (Eyjafjallajökull eruption and Puyehue-Cordón Caulle eruption) so that we could better distinguish between volcanic ash cloud and meteorological cloud. The authors would also like to thank Li Jun for his guidance in theoretical methods and writing. Thanks also go to the National Meteorological Satellite Center for providing FY-4B/AGRI data during in-orbit test.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Durant, A.J.; Bonadonna, C.; Horwell, C.J. Atmospheric and Environmental Impact of Volcanic Particulates. Elements 2010, 6, 235–240.
2. Gu, G.; Adler, R.F. Precipitation and Temperature Variations on the Interannual Time Scale: Assessing the Impact of ENSO and Volcanic Eruptions. J. Clim. 2011, 24, 2258–2270.
3. Guenther, T.; Schulze, M.; Friederici, A.; Theisel, H. Visualizing Volcanic Clouds in the Atmosphere and Their Impact on Air Traffic. IEEE Comput. Graph. Appl. 2016, 36, 36–47.
4. Oman, L.; Robock, A.; Stenchikov, G.; Schmidt, G.; Ruedy, R. Climatic response to high-latitude volcanic eruptions. J. Geophys. Res. 2005, 110, D13103.
5. Prata, A.T. Remote Sensing of Volcanic Eruptions: From aviation hazards to global cooling. In Plate Boundaries and Natural Hazards, 1st ed.; Duarte, J.C., Schellart, W.P., Eds.; American Geophysical Union: Washington, DC, USA, 2016; pp. 289–322.
6. Zhang, P.; Zhu, L.; Tang, S.; Gao, L.; Chen, L.; Zheng, W.; Han, X.; Chen, J.; Shao, J. General Comparison of FY-4A/AGRI With Other GEO/LEO Instruments and Its Potential and Challenges in Non-meteorological Applications. Front. Earth Sci. 2019, 6, 224.
7. Yang, J.; Zhang, Z.Q.; Wei, C.Y.; Lu, F.; Guo, Q. Introducing the new generation of Chinese geostationary weather satellites, Fengyun-4. Bull. Am. Meteorol. Soc. 2017, 98, 1637–1658.
8. Engwell, S.; Mastin, L.; Tupper, A.; Kibler, J.; Acethorp, P.; Lord, G.; Filgueira, R. Near-real-time volcanic cloud monitoring: Insights into global explosive volcanic eruptive activity through analysis of Volcanic Ash Advisories. Bull. Volcanol. 2021, 83, 9.
9. Francis, P.N.; Cooke, M.C.; Saunders, R.W. Retrieval of physical properties of volcanic ash using Meteosat: A case study from the 2010 Eyjafjallajökull eruption. J. Geophys. Res. 2012, 117, D00U09.
10. Pavolonis, M.J. Advances in extracting cloud composition information from spaceborne infrared radiances—A robust alternative to brightness temperatures. Part I: Theory. J. Appl. Meteorol. Clim. 2010, 49, 1992–2012.
11. Prata, A.J. Observations of volcanic ash clouds in the 10–12 μm window using AVHRR/2 data. Int. J. Remote Sens. 1989, 10, 751–761.
12. Prata, A.J. Infrared radiative transfer calculations for volcanic ash clouds. J. Geophys. Res. Atmos. 1989, 16, 1293–1296.
13. Prata, A.J.; Prata, A.T. Eyjafjallajokull volcanic ash concentrations determined using Spin Enhanced Visible and Infrared Imager measurements. J. Geophys. Res. Atmos. 2012, 117, D00U23.
14. Sawada, Y. The Detection Capability of Explosive Eruptions Using GMS Imagery, and the Behaviour of Dispersing Eruption Clouds. In Volcanic Hazards; Latter, J.H., Ed.; Springer: Berlin/Heidelberg, Germany, 1989; Volume 1, pp. 233–245.
15. Zhu, L.; Li, J.; Zhao, Y.; Gong, H.; Li, W. Retrieval of volcanic ash height from satellite-based infrared measurements. J. Geophys. Res. Atmos. 2017, 122, 5364–5379.
16. Zhu, W.; Zhu, L.; Li, J.; Sun, H. Retrieving Volcanic Ash Top Height through Combined Polar Orbit Active and Geostationary Passive Remote Sensing Data. Remote Sens. 2020, 12, 953.
17. Hutchison, K.D. The retrieval of cloud base heights from MODIS and three-dimensional cloud fields from NASA’s EOS Aqua mission. Int. J. Remote Sens. 2002, 23, 5249–5265.
18. Hutchison, K.; Wong, E.; Ou, S.C. Cloud base heights retrieved during night-time conditions with MODIS data. Int. J. Remote Sens. 2006, 27, 2847–2862.
19. Miller, S.D.; Forsythe, J.M.; Partain, P.T.; Haynes, J.M.; Bankert, R.L.; Sengupta, M.; Mitrescu, C.; Hawkins, J.D.; Vonder Haar, T.H. Estimating Three-Dimensional Cloud Structure via Statistically Blended Satellite Observations. J. Appl. Meteorol. Climatol. 2014, 53, 437–455.
20. Brenguier, J.; Pawlowska, H.; Schüller, L.; Preusker, R.; Fischer, J.; Fouquart, Y. Radiative Properties of Boundary Layer Clouds: Droplet Effective Radius versus Number Concentration. J. Atmos. Sci. 2000, 57, 803–821.
21. Chakrapani, V.; Doelling, D.R.; Rapp, A.D.; Minnis, P. Cloud thickness estimation from GOES-8 satellite data over the ARM SGP site. In Proceedings of the Twelfth ARM Science Team Meeting, St. Petersburg, FL, USA, 8–12 April 2002.
22. Wilheit, T.T.; Hutchison, K.D. Retrieval of cloud base heights from passive microwave and cloud top temperature data. IEEE Trans. Geosci. Electron. 2000, 38, 1253–1259.
23. Jiménez, P.A.; McCandless, T. Exploring the Potential of Statistical Modeling to Retrieve the Cloud Base Height from Geostationary Satellites: Applications to the ABI Sensor on Board of the GOES-R Satellite Series. Remote Sens. 2021, 13, 375.
24. Tan, Z.; Huo, J.; Shuo, M.; Han, D.; Yan, W. Estimating cloud base height from himawari-8 based on a random forest algorithm. Int. J. Remote Sens. 2021, 42, 2485–2501.
25. Min, M.; Li, J.; Wang, F.; Liu, Z.; Menzel, W.P. Retrieval of cloud top properties from advanced geostationary satellite imager measurements based on machine learning algorithms. Remote Sens. Environ. 2020, 239, 111616.
26. Zhao, D.; Zhu, L.; Sun, H.; Li, J.; Wang, W. Fengyun-3D/MERSI-II Cloud Thermodynamic Phase Determination Using a Machine-Learning Approach. Remote Sens. 2021, 13, 2251.
27. Winker, D.M.; Vaughan, H.A.; Omar, A.; Hu, Y.; Powell, K.A. Overview of the CALIPSO mission and CALIOP data processing algorithms. J. Atmos. Ocean. Technol. 2009, 26, 2310–2323.
28. Schmetz, J.P.; Pili, S.; Tjemkes, D.; Just, J.; Kerkmann, S.R.; Ratier, A. An introduction to Meteosat Second Generation (MSG). Bull. Am. Meteorol. Soc. 2002, 83, 977–993.
29. Xian, D. Fengyun-4B. Satell. Appl. 2021, 7, 68. (In Chinese)
30. Heidinger, A.K.; Pavolonis, M.J.; Holz, R.E.; Baum, B.A.; Berthier, S. Using CALIPSO to explore the sensitivity to cirrus height in the infrared observations from NPOESS/VIIRS and GOES-R/ABI. J. Geophys. Res. Atmos. 2010, 115, D00H20.
31. Pavolonis, M.J.; Heidinger, A.K.; Sieglaff, J. Automated retrievals of volcanic ash and dust cloud properties from upwelling infrared measurements. J. Geophys. Res. Atmos. 2013, 118, 1436–1458.
32. Winker, D.M.; Pelon, J.; Coakley, J.A.; Ackerman, S.A.; Charlson, R.J.; Colarco, P.R.; Flamant, P.; Fu, Q.; Hoff, R.M.; Kittaka, C.; et al. The CALIPSO mission: A global 3D view of aerosols and clouds. Bull. Am. Meteorol. Soc. 2010, 91, 1211.
33. Hersbach, H.; Bell, B.; Berrisford, P.; Biavati, G.; Horányi, A.; Muñoz Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Rozum, I.; et al. ERA5 Hourly Data on Pressure Levels from 1959 to Present. Copernicus Climate Change Service (C3S) Climate Data Store (CDS). Available online: https://cds.climate.copernicus.eu/cdsapp#!/dataset/10.24381/cds.bd0915c6?tab=overview (accessed on 11 January 2023).
34. Hersbach, H.; Bell, B.; Berrisford, P.; Hirahara, S.; Horányi, A.; Muñoz-Sabater, J.; Nicolas, J.; Peubey, C.; Radu, R.; Schepers, D.; et al. The era5 global reanalysis. Q. J. R. Meteorol. Soc. 2020, 146, 1999–2049.
35. Kim, M.H.; Omar, A.H.; Tackett, J.L.; Vaughan, M.A.; Magill, B.E. The calipso version 4 automated aerosol classification and lidar ratio selection algorithm. Atmos. Meas. Tech. 2018, 11, 6107–6135.
36. Pavolonis, M.J.; Sieglaff, J. GOES-R Advanced Baseline Imager (ABI) Algorithm Theoretical Basis Document for Volcanic Ash: Detection and Height, 2nd ed.; NOAA: Silver Spring, MD, USA, 2010.
37. National Aeronautics and Space Administration. Available online: https://www-calipso.larc.nasa.gov/products/lidar/browse_images/show_v4_detail.php?s=production&v=V4-10&browse_date=2011-06-16&orbit_time=15-54-36&page=1&granule_name=CAL_LID_L1-Standard-V4-10.2011-06-16T15-54-36ZD.hdf (accessed on 9 January 2023).
38. Abdi, H.; Williams, L.J. Principal component analysis. Wiley Interdiscip. Rev. Comput. Stat. 2010, 2, 433–459.
39. Li, L.; Zhao, J.; Wang, C.; Yan, C. Comprehensive evaluation of robotic global performance based on modified principal component analysis. Int. J. Adv. Robot. Syst. 2020, 17, 220–226.
40. Altman, N.S. An Introduction to Kernel and Nearest-Neighbor Nonparametric Regression. Am. Stat. 1992, 46, 175–185.
41. Breiman, L. Random Forests. Mach. Learn. 2001, 45, 5–32.
42. Freund, Y.; Schapire, R.E. A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. J. Comput. Syst. Sci. 1997, 55, 119–139.
43. Chen, T.; Guestrin, C. XGBoost: A Scalable Tree Boosting System. In Proceedings of the KDD’16: The 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, 13–17 August 2016; pp. 785–794.
44. Yang, S.; Chen, L.F.; Yan, T.; Zhao, Y.H.; Fan, Y.J. An ensemble classification algorithm for convolutional neural network based on adaboost. In Proceedings of the 2017 IEEE/ACIS 16th International Conference on Computer and Information Science (ICIS), Wuhan, China, 24–26 May 2017; pp. 401–446.
45. Xi, B.; Yu, Z.; Cao, W.; Shi, Y.; Ma, Q. A survey on ensemble learning. Front. Comput. Sci. 2019, 14, 241–258.
46. Wang, K.; Liu, X.; Zhao, J.; Gao, H.; Zhang, Z. Application research of ensemble learning frameworks. In Proceedings of the Chinese Automation Congress (CAC), Shanghai, China, 6–8 November 2020.
47. Franco, M.A.; Krasnogor, N.; Bacardit, J. Automatic tuning of rule-based evolutionary machine learning via problem structure identification. IEEE Comput. Intell. Mag. 2020, 15, 28–46.
48. Lin, H.; Li, Z.; Li, J.; Zhang, F.; Min, M.; Menzel, W.P. Estimate of daytime single-layer cloud base height from advanced baseline imager measurements. Remote Sens. Environ. 2022, 274, 112970.
49. Piontek, D.; Bugliaro, L.; Kar, J.; Schumann, U.; Marenco, F.; Plu, M.; Voigt, C. The New Volcanic Ash Satellite Retrieval VACOS Using MSG/SEVIRI and Artificial Neural Networks: 2. Validation. Remote Sens. 2021, 13, 3128.

Figure 1. Spectral-response functions of six IR bands of MSG/SEVIRI and FY-4B/AGRI.

Figure 2. Image of CALIOP aerosol subtype from 15:54–16:08 UTC on 16 June 2011 for the Puyehue–Cordón Caulle eruption [37].

Figure 3. Image of CALIOP total attenuated backscatter at 532 nm from 15:54–16:08 UTC on 16 June 2011 for the Puyehue–Cordón Caulle eruption [37].

Figure 4. Image of CALIOP depolarization ratio at 532 nm from 15:54–16:08 UTC on 16 June 2011 for the Puyehue–Cordón Caulle eruption [37].

Figure 5. Image of CALIOP total attenuated colour ratio (1064 nm/532 nm) from 15:54–16:08 UTC on 16 June 2011 for the Puyehue–Cordón Caulle eruption [37].

Figure 6. Example of a SEVIRI RGB false-colour image (red = 12–10.8 μm; green = 10.8–8.7 μm; blue = 10.8 μm) for the Puyehue–Cordón Caulle eruption at 16:00 UTC on 16 June 2011. The contemporaneous CALIOP overpass (black line) and volcanic ash mask (yellow) are shown.

Figure 7. Heat map of correlations among brightness temperature (BT) data.

Figure 8. Flow chart of the VBH retrieval algorithm. The BT data, longitude, latitude, and atmospheric profile data are taken as the input data, and their dimension is reduced through PCA to obtain new features. The CALIOP CBH data are taken as the “true value”. The optimal ML algorithm for predicting VBH is selected by comparing the performance of the KNN, RF, XGBoost, and GBDT models.

Figure 9. Cumulative explained-variance contribution curve.

Figure 10. MAE and RMSE comparison of four ML models.

Figure 11. Validation of retrieved VBH for all sample datasets (a,b), single-layer volcanic ash cloud sample sets (c,d), and multi-layer volcanic ash cloud sample sets (e,f).

Figure 12. CALIOP 532 nm total-attenuation backscattering profile for 04:01–04:07 UTC, 16 June 2011, for the Puyehue–Cordón Caulle eruption. The solid black line represents the tropopause, the red dot is the CALIOP CTH, the green dot is the CALIOP CBH, and the yellow dot is the VBH predicted by PCA_GBDT; the red dashed box represents multi-layer volcanic ash cloud; the blue dashed box represents single-layer volcanic ash cloud [15].

Table 1. Wavelength range of different SEVIRI channels and application objects of each channel.

Channel	Wavelength Range (µm)	Corresponding Band	Object of Band Application
1	0.56–0.71	Visible	Aerosol, surface, vegetation, cloud
2	0.74–0.88	Visible	Aerosol, surface, vegetation, cloud
3	1.50–1.78	Short-wave infrared	Ice, snow, cloud
4	3.48–4.36	Mid-wave infrared	Low cloud, surface, ocean, cloud
5	5.35–7.15	Long-wave infrared	Water vapour, wind field, high-level cloud
6	6.85–7.85	Long-wave infrared	Water vapour, wind field
7	8.30–9.10	Long-wave infrared	Atmosphere, surface, cloud
8	9.38–9.94	Long-wave infrared	Ozone
9	9.80–11.80	Long-wave infrared	Atmosphere, surface, wind field, cloud
10	11.00–13.00	Long-wave infrared	Atmosphere, surface, cloud
11	12.40–14.40	Long-wave infrared	Atmosphere and cloud height
12	0.4–1.1	HRV	Surface, cloud

Table 2. Wavelength range of different AGRI channels and application objects of each channel.

Channel	Wavelength Range (µm)	Corresponding Band	Object of Band Application
1	0.45–0.49	Visible	Small particle aerosol, true colour synthesis
2	0.55–0.75	Visible	Vegetation, image navigation registration, star observation
3	0.75–0.9	Near infrared	Vegetation, aerosol on water surface
4	1.36–1.39	Short-wave infrared	Cirrus cloud
5	1.58–1.64	Short-wave infrared	Recognition of low cloud, snow, water cloud and ice cloud
6	2.1–2.35	Short-wave infrared	Cirrus cloud, aerosol, particle size
7	3.5–4.0 (high)	Mid-wave infrared	Cloud level high albedo target, fire point
8	3.5–4.0 (low)	Mid-wave infrared	Low albedo target, surface
9	5.8–6.7	Long-wave infrared	High level water vapor
10	6.9–7.3	Long-wave infrared	Middle level water vapor
11	7.24–7.6	Long-wave infrared	Low level water vapor
12	8.0–9.0	Long-wave infrared	Cloud, total water vapor
13	10.3–11.3	Long-wave infrared	Cloud, surface temperature
14	11.5–12.5	Long-wave infrared	Cloud, total water vapor, surface temperature
15	13.2–13.8	Long-wave infrared	Water vapor, cloud

Table 3. Predictand and predictor variables used for regression model training and prediction for different ML methods.

Predictand	CALIPSO CBH (km)
Predictor (satellite measurement)	SEVIRI: BT (3.9 µm), BT (6.2 µm), BT (8.7 µm), BT (10.8 µm), BT (12.5 µm), BT (13.4 µm); AGRI: BT (3.75 µm), BT (6.25 µm), BT (8.5 µm), BT (10.7 µm), BT (12.0 µm), BT (13.5 µm); BTD (12–10.8 µm), BTD (10.8–8.5 µm).
Predictor (ERA5)	Temperature profile data of 37 layers of 1–1000 hPa; relative humidity profile data of 1–5 hPa
Predictor (other)	Latitude, longitude

Table 4. Adjustment parameters and dynamic ranges of different ML algorithms.

Algorithm Name	Parameter and Dynamic Range
KNN	1. weight function (weights); ‘uniform’, ‘distance’
KNN	2. number of neighbours (n_neighbors); 1–10 with an interval of 1
RF	1. number of trees (n_estimators); 10–1000 with an interval of 20
RF	2. maximum depth of the tree (max_depth); 15–100 with an interval of 5
RF	3. minimum number of samples required to split an internal node (min_samples_split); 2, 3, 4, 5, 6, 7, 8, 9, 10
RF	4. minimum number of samples required to be at a leaf node (min_samples_leaf); 1, 2, 3, 4, 5, 6, 7, 8, 9, 10
GBDT	1. number of trees (n_estimators); 20–200 with intervals of 10, 190, 380, 1900, 3800
GBDT	2. maximum depth of the tree (max_depth); 3, 5, 7, 9, 11, 13
GBDT	3. minimum number of samples required to be at a leaf node (min_samples_leaf); 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
GBDT	4. minimum number of samples required to split an internal node (min_samples_split); 2, 4, 6, 8, 10
GBDT	5. learning rate (learning_rate); 0.005, 0.01, 0.05, 0.1
GBDT	6. fraction of samples used for fitting individual base learners (subsample); 0.6, 0.7, 0.75, 0.8, 0.85, 0.9
XGBoost	1. number of trees (n_estimators); 10–1000 with an interval of 50
XGBoost	2. maximum depth of the tree (max_depth); 2–10 with an interval of 1
XGBoost	3. minimum sum of instance weight (hessian) needed in a child (min_child_weight); 2, 3, 4, 5, 6, 7, 8, 9, 10, 11
XGBoost	4. loss reduction threshold for decision tree splitting (gamma); 0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
XGBoost	5. fraction of samples used for fitting individual base learners (subsample); 0.6, 0.7, 0.8, 0.9
XGBoost	6. feature sampling ratio for each tree (colsample_bytree); 0.6, 0.7, 0.8, 0.9
XGBoost	7. L1 regularization coefficient (reg_alpha); 0.00, 0.001, 0.01, 0.1, 1, 100
XGBoost	8. L2 regularization coefficient (reg_lambda); 0.00, 0.001, 0.01, 0.1, 1, 100
XGBoost	9. learning rate (learning_rate); 0.001, 0.01, 0.5, 0.1, 0.2, 1
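The ranges in Table 4 define a set of candidate parameter combinations for each algorithm. How the candidates were searched is not restated here; assuming an exhaustive search, the sketch below expands the two KNN ranges into the full list of combinations that would be evaluated. The helper name expand_grid is hypothetical.

```python
from itertools import product

# KNN ranges from Table 4: two weight functions and 1-10 neighbours
knn_grid = {
    "weights": ["uniform", "distance"],
    "n_neighbors": list(range(1, 11)),
}


def expand_grid(grid):
    """Return every combination of the listed parameter values as a dict."""
    names = list(grid)
    return [
        dict(zip(names, values))
        for values in product(*(grid[name] for name in names))
    ]


candidates = expand_grid(knn_grid)
print(len(candidates))  # 2 weight functions x 10 neighbour counts
print(candidates[0])
```

The number of combinations is the product of the range sizes, so the grids for the tree-based algorithms grow much faster than the KNN grid as more parameters are tuned jointly.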

Table 5. Comparison of VBH prediction errors for the four ML algorithms based on optimal parameters.

Algorithm	Parameter	MAE (km)	RMSE (km)	r
PCA_KNN	n_neighbors = 3; weights = ‘distance’	1.457	2.070	0.615
PCA_RF	n_estimators = 910; max_depth = 40; min_samples_split = 2; min_samples_leaf = 2; max_features = ‘sqrt’	1.160	1.532	0.726
PCA_XGBoost	learning_rate = 0.1; n_estimators = 960; max_depth = 9; min_child_weight = 4; colsample_bytree = 0.8; reg_alpha = 0.1; reg_lambda = 1	1.275	1.641	0.682
PCA_GBDT	learning_rate = 0.01; n_estimators = 1900; max_depth = 13; min_samples_split = 2; min_samples_leaf = 4; max_features = 14; subsample = 0.9	1.152	1.529	0.724

Table 6. Comparison of the retrievals of VACOS and PCA_GBDT against CALIOP data.

Volcanic Ash	Algorithm	MAPE
Top height	VACOS	18%
Base height	PCA_GBDT	15%

Table 7. Eigenvalues, variance contribution rates, and cumulative contribution rates of each principal component.

Principal Component	Eigenvalue	Contribution Rate (%)	Cumulative Contribution Rate (%)
1	21.351	41.864	41.864
2	12.671	24.845	66.710
3	4.669	9.155	75.864
…	…	…	…
25	0.025	0.048	99.677
…	…	…	…
51	7.632 × 10⁻¹²	1.496 × 10⁻¹¹	100

Table 8. Linear combination coefficients between the first three principal components and standardized features.

1st Principal Component	Standardized Feature	2nd Principal Component	Standardized Feature	3rd Principal Component	Standardized Feature
0.906	T (875 hPa)	0.859	Lat	0.817	BT (12 µm)
0.903	T (900 hPa)	0.836	T (2 hPa)	0.805	BT (10.8 µm)
0.9	T (850 hPa)	0.834	T (3 hPa)	0.781	BT (13.4 µm)
0.892	T (20 hPa)	0.826	T (125 hPa)	0.774	BT (8.7 µm)
0.891	T (825 hPa)	0.806	T (100 hPa)	0.699	BT (6.2 µm)
0.891	T (925 hPa)	0.803	T (5 hPa)	0.534	BT (3.9 µm)
0.881	T (800 hPa)	0.783	T (150 hPa)	0.484	Lon
0.88	T (775 hPa)	0.773	T (1 hPa)	0.456	BTD2
0.877	T (750 hPa)	0.723	T (175 hPa)	−0.333	T (150 hPa)
0.871	T (700 hPa)	0.72	T (70 hPa)	−0.329	T (125 hPa)

Table 9. Comparison of prediction accuracy between GBDT and PCA_GBDT algorithms.

	MAE (km)	RMSE (km)	r
GBDT	1.278	1.663	0.715
PCA_GBDT	1.152	1.529	0.724
Relative change (PCA_GBDT vs. GBDT)	−9.86%	−8.06%	+1.26%

Table 10. Comparison of algorithm results with or without atmospheric profile data.

	MAE (km)	RMSE (km)	r
Without profile data	1.565	1.952	0.704
With profile data	1.152	1.529	0.724
Relative change (with vs. without)	−26.39%	−21.67%	+2.84%

Table 11. Comparison of algorithm results with and without BT data.

	MAE (km)	RMSE (km)	r
Without BT data	1.324	1.743	0.632
With BT data	1.152	1.529	0.724
Relative change (with vs. without)	−12.99%	−12.28%	+14.56%

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

