3.2.1. U-Net Semantic Segmentation Model
Semantic segmentation models play a key role in ground object classification algorithms, and a segmentation model that incorporates low-level features offers clear advantages. On this basis, a U-shaped network structure was constructed in which the convolution results of every layer of the model participate in the final feature fusion; this U-shaped semantic segmentation network is known as U-net [17]. In this study, an improved U-net semantic segmentation model is employed to classify ground objects such as water bodies, roads, and green spaces in satellite images. For each pixel in the input satellite image, the model determines the category to which the pixel belongs and outputs the prediction result. The specific steps are as follows:
- (1)
Construct a training dataset
In training a semantic segmentation model for satellite images, the performance of the model is inextricably linked to the quality of the training data. Generally speaking, training falls into two categories, supervised and unsupervised learning, according to whether the data have been manually annotated. The semantic segmentation model used in this paper belongs to the supervised learning category, so a manually annotated satellite image dataset has to be constructed for training.
The annotation tool LabelMe is utilized in this paper to annotate ground objects such as water bodies, roads, and green spaces in satellite images, yielding 200 satellite images and 200 corresponding annotated images. Example data pairs from the training dataset are shown in Figure 2; each pair shows the satellite image on the left and the manually annotated image on the right.
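To make this step concrete, the following is a minimal sketch of how such image/annotation pairs might be loaded for training. The PyTorch framework, directory layout, and file naming are assumptions for illustration, not details given in the paper.

```python
import os
from PIL import Image
from torch.utils.data import Dataset

class SatelliteSegDataset(Dataset):
    """Loads paired satellite images and annotation masks from disk.

    Assumes (hypothetically) that images/ and masks/ contain files
    with matching names, as produced by exporting LabelMe annotations.
    """

    def __init__(self, image_dir, mask_dir, transform=None):
        self.image_dir = image_dir
        self.mask_dir = mask_dir
        self.filenames = sorted(os.listdir(image_dir))
        self.transform = transform

    def __len__(self):
        return len(self.filenames)

    def __getitem__(self, idx):
        name = self.filenames[idx]
        image = Image.open(os.path.join(self.image_dir, name)).convert("RGB")
        # The mask stores one class index per pixel (water, road, green space, ...).
        mask = Image.open(os.path.join(self.mask_dir, name))
        if self.transform is not None:
            image, mask = self.transform(image, mask)
        return image, mask
```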
- (2)
Model training
The dataset used to train the model in this paper is a high-resolution satellite image dataset created by us, which has to be preprocessed before the model can be trained. The dataset is first normalized by subtracting the mean from each image and dividing by the standard deviation, so that images with large variations in values share the same scale of distribution. Then, during training, the image data are augmented in a random manner, including image flipping, panning, and zooming, with the aim of improving the robustness of the model and reducing overfitting. Finally, the larger the batch size, the more representative a batch is of the overall distribution of the dataset; however, due to hardware limitations, the entire dataset cannot be loaded at once in a semantic segmentation task. In view of this, the batch size of the input data is set to 8 with due regard to the hardware capacity of the server. A sketch of this preprocessing and augmentation pipeline is given below.
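A minimal sketch of the pipeline just described, assuming PyTorch/torchvision. The channel statistics, translation range, and zoom range are illustrative placeholders, not the paper's exact settings; note that geometric augmentations must be applied identically to the image and its mask so the labels stay aligned.

```python
import random
import numpy as np
import torch
import torchvision.transforms.functional as TF
from torch.utils.data import DataLoader

# Placeholder channel statistics; in practice they are computed from the
# training set so normalization subtracts the mean and divides by the std.
MEAN, STD = [0.40, 0.42, 0.38], [0.20, 0.19, 0.21]

def joint_transform(image, mask):
    # Random flipping, applied to image and mask together.
    if random.random() < 0.5:
        image, mask = TF.hflip(image), TF.hflip(mask)
    # Random panning and zooming with shared parameters (assumed ranges).
    dx, dy = random.randint(-20, 20), random.randint(-20, 20)
    scale = random.uniform(0.9, 1.1)
    image = TF.affine(image, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
    mask = TF.affine(mask, angle=0.0, translate=[dx, dy], scale=scale, shear=[0.0])
    # Normalize the image only; the mask keeps its integer class labels.
    image = TF.normalize(TF.to_tensor(image), mean=MEAN, std=STD)
    mask = torch.as_tensor(np.array(mask), dtype=torch.long)
    return image, mask

# Batch size of 8, chosen with regard to the server's hardware limits.
# train_dataset = SatelliteSegDataset("images/", "masks/", transform=joint_transform)
# train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
```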
The learning rate is not only the iteration step size of a deep learning model but also one of the most important hyperparameters when training a model. If the learning rate is set too small, the model usually converges too slowly and tends to converge to a local optimum; conversely, if the learning rate is too large, the model tends not to converge. Therefore, the strategy for setting the learning rate is of paramount importance in training a model. Compared to non-stochastic algorithms, stochastic gradient descent (SGD) utilizes information more effectively, especially when it is redundant, and performs better in early iterations. Moreover, SGD has an advantage over non-stochastic algorithms in computational complexity with large samples [23]. In this paper, the SGD algorithm is chosen as the parameter update strategy, and the initial learning rate for training is set to 0.0001. In a semantic segmentation model, a loss function is usually used to measure the training effect of the model and to perform gradient optimization. The smaller the value of the loss function, the higher the accuracy of the model and the better the training effect, so the selection of the loss function is of prime importance to model training. A properly selected loss function leads to a steady improvement in the model's predictions. In this paper, cross-entropy loss and Dice loss are selected as the loss functions for investigation. The Dice coefficient is a statistic used to gauge the similarity of two samples, indicating the degree of overlap between the predicted results and the true results. Its possible values lie in the interval [0, 1], and the larger the value, the better. As such, Dice Loss = 1 - Dice is taken as the loss function for semantic segmentation. The calculation formula is as follows:
$$\mathrm{Dice\;Loss} = 1 - \frac{2\,|X \cap Y|}{|X| + |Y|}$$

where $X$ denotes the predicted results and $Y$ denotes the true results.
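A minimal sketch of this loss under the definitions above, assuming PyTorch tensors; the small smoothing constant eps is a common implementation convention, not part of the formula itself.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """Dice Loss = 1 - Dice, with Dice = 2|X ∩ Y| / (|X| + |Y|).

    pred:   softmax probabilities, shape (N, C, H, W)
    target: one-hot ground truth,  shape (N, C, H, W)
    """
    intersection = (pred * target).sum(dim=(2, 3))          # |X ∩ Y| per class
    union = pred.sum(dim=(2, 3)) + target.sum(dim=(2, 3))   # |X| + |Y| per class
    dice = (2.0 * intersection + eps) / (union + eps)
    return 1.0 - dice.mean()
```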
Finally, the raw output of the model is a convolution result with unbounded values, whereas the result required by the semantic segmentation task is a probability indicating the category to which each pixel belongs, so the output of the model has to be normalized to a number between 0 and 1. In addition, the values of each pixel across all channels should sum to 1, meaning that the probabilities of each pixel over all output categories sum to 1. To this end, the softmax function is selected in this paper to normalize the output of the model:
$$p_i = \frac{e^{z_i}}{\sum_{c=1}^{C} e^{z_c}}$$

where $z_i$ denotes the output value of a pixel on channel $i$, $C$ denotes the number of channels, and $p_i$ denotes the predicted probability.
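As a toy illustration of this normalization (the values are made up), softmax turns one pixel's raw channel outputs into category probabilities that sum to 1; for a full output map of shape (N, C, H, W) in PyTorch, the same operation would be taken along the channel dimension with `torch.softmax(logits, dim=1)`.

```python
import torch

# Raw convolution outputs for one pixel across C = 4 channels (made-up values).
z = torch.tensor([1.3, -0.2, 0.8, 2.1])
p = torch.softmax(z, dim=0)    # exp(z_i) / sum_c exp(z_c)
print(p, p.sum())              # per-category probabilities, summing to 1
```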
- (3)
Model prediction
Since the original U-net semantic segmentation model is mainly used for the segmentation of single-channel biomedical images, the improved model in this paper adopts VGG-16 [24] as the encoding network, with its network structure detailed in Table 1. In this study, Conv1 to Conv5 are selected as the encoding structure of U-net, where each Conv comprises two 3 × 3 convolutions and two ReLU activation functions. Each Conv is followed by a max pooling layer, whose main purpose is to extract important features from the input of the preceding layer and reduce the number of feature parameters; it operates by selecting the extreme value within a fixed region of the preceding layer's features. The max pooling selected in this paper is 2 × 2 max pooling with a stride of 2, i.e., the maximum within each 2 × 2 region of the input feature matrix is selected, so the output width and height are half those of the input features. A sketch of one such encoder stage is given below.
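A minimal PyTorch sketch of one encoder stage as described; the channel widths follow the VGG-16 convention (e.g., 64 for Conv1) per Table 1, and this is an illustration rather than the paper's own code.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One VGG-style Conv block: two 3x3 convolutions, each followed by ReLU."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
    )

# 2x2 max pooling with stride 2 halves the width and height of its input.
pool = nn.MaxPool2d(kernel_size=2, stride=2)

# Example: the first encoder stage, Conv1, maps 3 input channels to 64.
conv1 = conv_block(3, 64)
```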
In the decoding structure, to facilitate network construction and improve versatility, the U-net used in this paper concatenates the outputs of Conv1 to Conv4 with the 2× upsampled output features of the decoder, thus obtaining a feature layer with the same height and width as the input image. The detailed structure of U-net is shown in Figure 3, and a sketch of one decoder step is given below.
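A minimal PyTorch sketch of one decoder step under this description; the bilinear upsampling mode and the channel arguments are assumptions for illustration.

```python
import torch
import torch.nn as nn

class UpBlock(nn.Module):
    """One decoder step: 2x upsampling, concatenation with the matching
    encoder output (Conv1 to Conv4), then two 3x3 conv + ReLU layers."""

    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch + skip_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x, skip):
        x = self.up(x)                   # double the height and width
        x = torch.cat([x, skip], dim=1)  # fuse decoder and encoder features
        return self.conv(x)
```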
3.2.2. Extreme Gradient Boosting (XGBoost) Model
The extreme gradient boosting (XGBoost) model is a decision tree-based ensemble machine learning algorithm proposed by Chen [25], which uses classification and regression trees (CART) to classify and predict datasets. XGBoost is employed in this study to train on and predict the integrated dataset, and its prediction process is as follows:
Step 1: Construct a dataset containing $n$ samples and $m$ features, $D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$, and the predicted output of the ensemble model is expressed as:

$$\hat{y}_i = \sum_{k=1}^{K} f_k\left(x_i\right) \tag{2}$$

where $x_i$ denotes the $i$th sample, $\hat{y}_i$ denotes the prediction value of the $i$th sample $x_i$, $f_k$ denotes the $k$th regression tree, and $K$ denotes the number of regression trees. Equation (2) indicates that given an input $x_i$, the output value is the sum of the predicted values of the $K$ regression trees (i.e., the weights of the leaf nodes divided according to the decision rules of the corresponding regression trees).
Step 2: Define the objective function. The objective function of XGBoost is composed of a loss function and a regularization term, and defining the objective function is to define the loss function and the regularization term. The loss function is used to fit the training data, while the regularization term is used to control the model complexity, and the equation is as follows:
$$Obj = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i\right) + \sum_{k=1}^{K} \Omega\left(f_k\right)$$

where $Obj$ denotes the objective function, $l$ denotes the loss function, $\Omega$ is the regularization term, and $y_i$ denotes the true value of the sample.
Step 3: Optimize the objective function. A forward stagewise algorithm is used to optimize the objective function. Supposing $\hat{y}_i^{(t)}$ is the predicted value of the $i$th sample after the $t$th iteration (the $t$th tree), then:

$$\hat{y}_i^{(t)} = \hat{y}_i^{(t-1)} + f_t\left(x_i\right)$$

where $\hat{y}_i^{(t)}$ denotes the predicted value of the $i$th sample after $t$ iterations, $\hat{y}_i^{(t-1)}$ denotes the predicted value of the $i$th sample after $t-1$ iterations, and $f_t(x_i)$ denotes the predicted value of the $t$th tree.
Therefore, the objective function can be expressed as:
$$Obj^{(t)} = \sum_{i=1}^{n} l\left(y_i, \hat{y}_i^{(t-1)} + f_t(x_i)\right) + \Omega\left(f_t\right)$$

where $Obj^{(t)}$ denotes the objective function at the $t$th iteration, $l$ denotes the loss function, $\Omega$ is the regularization term, and $y_i$ denotes the true value of the sample.
Step 4: Optimize the loss function in the objective function. A second-order Taylor expansion is used to approximate the loss function. Its equation is as follows:
$$Obj^{(t)} \simeq \sum_{i=1}^{n} \left[\, l\left(y_i, \hat{y}_i^{(t-1)}\right) + g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega\left(f_t\right)$$

where $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ denotes the loss value of the $i$th sample from the preceding $t-1$ trees, $g_i$ is the first-order partial derivative of the loss with respect to $\hat{y}_i^{(t-1)}$, and $h_i$ is the second-order partial derivative of the loss with respect to $\hat{y}_i^{(t-1)}$.
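As a concrete example (not part of the original derivation), with the squared-error loss the gradient statistics have a simple closed form:

```latex
% With l(y_i, \hat{y}_i) = (y_i - \hat{y}_i)^2, differentiating with respect
% to the previous prediction \hat{y}_i^{(t-1)} gives
g_i = \frac{\partial l}{\partial \hat{y}_i^{(t-1)}} = 2\left(\hat{y}_i^{(t-1)} - y_i\right),
\qquad
h_i = \frac{\partial^2 l}{\partial \bigl(\hat{y}_i^{(t-1)}\bigr)^2} = 2 .
```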
Step 5: Obtain the final objective function. Since $l\left(y_i, \hat{y}_i^{(t-1)}\right)$ is a constant that does not depend on $f_t$, it can be removed, giving:

$$Obj^{(t)} \simeq \sum_{i=1}^{n} \left[ g_i f_t(x_i) + \frac{1}{2} h_i f_t^2(x_i) \right] + \Omega\left(f_t\right)$$
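In practice, this whole procedure is implemented by the xgboost library; a minimal usage sketch with its scikit-learn interface follows. The hyperparameter values here are illustrative, not the settings used in this study.

```python
import xgboost as xgb

model = xgb.XGBClassifier(
    n_estimators=100,             # K, the number of regression trees
    max_depth=6,                  # limits the complexity of each tree
    learning_rate=0.1,            # shrinkage applied to each tree's contribution
    objective="binary:logistic",  # loss l for a binary classification task
)
# model.fit(X_train, y_train)     # X: n samples x m features
# y_pred = model.predict(X_test)  # sum of K trees' leaf weights, thresholded
```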
3.2.3. Performance Evaluation Metrics
- (1)
Confusion matrix:
A confusion matrix is an important tool for evaluating the performance of a classification model, with each column representing the instances in a predicted class and each row representing the instances in an actual class. The four entries used in the analysis are True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN). The confusion matrix is shown in Table 2.
- (2)
Performance metrics for the semantic segmentation model:
The semantic segmentation model is mainly evaluated using two metrics, Mean Pixel Accuracy (MPA) and Mean Intersection over Union (mIoU), whose expressions are as follows:
$$\mathrm{MPA} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FN_i}$$

$$\mathrm{mIoU} = \frac{1}{k} \sum_{i=1}^{k} \frac{TP_i}{TP_i + FP_i + FN_i}$$

where $k$ denotes the number of categories, and $TP$, $TN$, $FP$, and $FN$ denote correctly identified positive samples, correctly identified negative samples, incorrectly identified positive samples, and incorrectly identified negative samples, respectively.
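A minimal NumPy sketch of both metrics computed from a k × k confusion matrix, under the per-class TP/FP/FN reading above.

```python
import numpy as np

def mpa_miou(cm):
    """Compute MPA and mIoU from a k x k confusion matrix `cm`,
    where cm[i, j] counts pixels of true class i predicted as class j."""
    tp = np.diag(cm).astype(float)
    fn = cm.sum(axis=1) - tp           # per-class false negatives
    fp = cm.sum(axis=0) - tp           # per-class false positives
    mpa = np.mean(tp / (tp + fn))      # mean per-class pixel accuracy
    miou = np.mean(tp / (tp + fp + fn))
    return mpa, miou
```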
- (3)
Performance metrics for the XGBoost model:
The XGBoost model is mainly evaluated using four metrics: accuracy (ACC), precision (P), recall (R), and F-score (F1), all of which can be calculated from the confusion matrix. The F-score (F1) is the harmonic mean of the precision and recall values [26]. These metrics are calculated by the following formulas:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad P = \frac{TP}{TP + FP}$$

$$R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$
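A minimal sketch computing the four metrics directly from the confusion-matrix counts defined above.

```python
def classification_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, and F1 from confusion-matrix counts."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)   # harmonic mean of precision and recall
    return acc, p, r, f1
```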
In addition, the receiver operating characteristic (ROC) curve is plotted to measure the area under the ROC curve (AUC), which is then used to assess the accuracy of the classification results of the binary classification model [27]. AUC < 0.6 indicates that the model has poor predictive ability; 0.6 < AUC < 0.7 indicates moderate predictive ability; 0.7 < AUC < 0.8 indicates good predictive ability; and AUC > 0.8 indicates excellent predictive ability.
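A minimal sketch using scikit-learn's standard ROC utilities on made-up labels and scores; `roc_curve` yields the points of the ROC curve and `roc_auc_score` the area under it.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

y_true = np.array([0, 0, 1, 1])            # made-up binary labels
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # predicted positive-class probabilities
fpr, tpr, _ = roc_curve(y_true, y_score)   # points of the ROC curve
print(roc_auc_score(y_true, y_score))      # AUC = 0.75 for this toy example
```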