Remote Sensing of Soil Organic Carbon at Regional Scale Based on Deep Learning: A Case Study of Agro-Pastoral Ecotone in Northern China

Guo, Zichen; Li, Yuqiang; Wang, Xuyang; Gong, Xiangwen; Chen, Yun; Cao, Wenjie

doi:10.3390/rs15153846


by Zichen Guo ¹,², Yuqiang Li ¹,²,³,⁴,*, Xuyang Wang ¹,², Xiangwen Gong ⁵, Yun Chen ¹,² and Wenjie Cao ¹,²,³

¹ Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Lanzhou 730000, China

² Naiman Desertification Research Station, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences, Tongliao 028300, China

³ University of Chinese Academy of Sciences, Beijing 100049, China

⁴ Key Laboratory of Strategic Mineral Resources of the Upper Yellow River, Ministry of Natural Resources, Lanzhou 730000, China

⁵ Wansheng Mining of Chongqing Conservation and Repair of Ecological Environment Observation and Research Station, Chongqing Institute of Geology and Mineral Resources, Chongqing 400042, China

* Author to whom correspondence should be addressed.

Remote Sens. 2023, 15(15), 3846; https://doi.org/10.3390/rs15153846

Submission received: 20 May 2023 / Revised: 29 July 2023 / Accepted: 31 July 2023 / Published: 2 August 2023

(This article belongs to the Section Earth Observation Data)


Abstract: The North China agro–pastoral zone is a large, ecologically fragile zone in the arid and semi-arid regions. Quantitative remote sensing inversion of soil organic carbon (SOC) in this region can facilitate understanding of the current status of degraded land restoration and provide data support for carbon cycling research in the region. Deep learning with deep neural networks (DNNs) for SOC inversion has been a hot topic over the past decade, but there have been few studies at the regional scale in the arid and semi-arid zones. In this study, a DNN model with five hidden layers and five skip connections was established using 644 spatially distributed SOC samples and Landsat 8 OLI imagery. The model was compared with the random forest algorithm in terms of generalization ability. The main conclusions were as follows: 1. The DNN algorithm can establish a high-precision SOC inversion model (R² = 0.52, RMSE = 0.7), with 90% of errors concentrated in the range of −2.5 to 2.5 kg·C/m²; 2. the Boruta variable-screening algorithm can effectively improve the model accuracy of the random forest algorithm, but due to the DNN’s better ability to mine hidden information in the data, the improvement effect on the DNN model accuracy is limited; 3. the SOC samples in arid and semi-arid areas are highly positively skewed, with a significant impact on the modeling accuracy of DNN, and conversion is required to obtain a model with better generalization ability; and 4. in arid and semi-arid regions, SOC has a weak correlation with vegetation indices but a stronger correlation with temperature, elevation, and aridity. This study established a reliable deep learning model for SOC density in a large arid and semi-arid region, providing a reference and framework for the establishment of SOC inversion models in other regions.

Keywords: deep learning; random forest; digital soil mapping; agro–pastoral ecotone

1. Introduction

Soil organic carbon (SOC) represents the largest carbon pool in terrestrial ecosystems, accounting for 50% to 80% of the total carbon present on land. Its stock is 3–4 times greater than that of both the atmosphere and biotic carbon pools [1]. SOC storage is an important means of mitigating climate change and an important indicator of land degradation [2]. SOC reduction in arid and semi-arid regions particularly impacts agriculture and animal husbandry by decreasing soil fertility, water retention capacity, and structural stability [3], ultimately affecting yields. Therefore, research on SOC in arid and semi-arid regions plays an important role in agricultural and animal husbandry production and management.

SOC spatialization methods fall into three categories according to their underlying principles: spatial interpolation methods, ecological process models, and remote sensing inversion methods. Spatial interpolation started with ordinary kriging and developed into co-kriging, regression kriging, and kriging combined with machine learning [4,5,6]. Ecological process models simulate the spatial distribution of SOC based on ecological processes such as plant growth and decomposition; examples include the RothC model, CENTURY model [7], DNDC model [8], and ICBM model [9]. Spatial interpolation methods capture the temporal dynamics of SOC poorly, and process simulation requires a mathematical model of the actual ecosystem, which makes such models difficult to transfer to other areas [10]. Remote sensing inversion methods, by contrast, can be extended in both time and space.

There is a certain correlation between the content of SOC and the reflectance of visible near-infrared spectroscopy, providing a basis for remotely sensing SOC [11]. Therefore, remote sensing can provide a rapid, repeatable, and cost-effective means of quantitatively assessing the spatial distribution of soil properties [12]. Hyperspectral remote sensing has been widely applied in predicting SOC content in spatially small areas, such as farmland and forests [13,14]. However, for large-scale areas, multispectral images with wider coverage and easier accessibility are required to predict SOC content. The spatial heterogeneity of medium-resolution satellite images, such as Landsat and Sentinel, results in lower accuracy in predicting SOC. In particular, scholars have pointed out that the organic matter content of soil needs to be greater than 2% to be measured accurately using soil reflectance [15]. Both scale issues and optical inversion capabilities have contributed to the limitations of traditional remote sensing in accurately measuring SOC.

The development of computer software and hardware has made machine learning a new hotspot for remote sensing SOC prediction [16]. Machine learning algorithms can be divided into traditional neural networks and deep learning networks based on the number and structure of hidden layers. Traditional machine learning methods, such as partial least squares regression, extreme learning machines, back propagation neural networks, and support vector machines, have shown limited performance in prediction accuracy [17], but they are computationally efficient and have lower hardware and software requirements. For example, gray relational analysis–BP neural networks were used to establish an SOC content calculation model in the Abi Lake Wetland National Nature Reserve and found that the 1.2-order derivative model was optimal [18]. Compared to traditional machine learning methods, deep learning methods are better at uncovering hidden relationships among data and thus can achieve higher accuracy in predicting SOC. For example, some scholars tested several methods, including support vector machines (SVMs), artificial neural networks (ANNs), regression trees, RF, extreme gradient boosting (XGBoost), and deep neural networks (DNNs), in a province in Iran and found that DNNs had the greatest accuracy [19]. The prediction accuracy of convolutional neural networks (CNNs) was found to be much higher than that of RF when SOC in Anhui Province, China, was predicted using MODIS phenological variables [20].

The North China pastoral-farming ecotone covers an area of about 650,000 km² and is a transitional zone from semi-humid to arid regions. In this context, the agro–pastoral transitional zone in northern China refers to the geographical region that extends from the western foothills of the Greater Khingan Mountains in the east, through the mountainous front plains of northeastern China, the southwestern part of the Songliao Plain, the mountainous and hilly areas of North China, and the eastern part of the Inner Mongolia Plateau to the Guanzhong Basin in southern Shanxi Province. This region is characterized by overgrazing, reclamation, and other human activities, and it includes four of the sandy lands in China. Previous studies have indicated that the region is sensitive to changes in environmental factors, has poor ecological resilience, and is a typical ecologically vulnerable area [21]. With the process of desertification, SOC in the area will decrease significantly [22]. Since 2000, artificial ecological engineering has led to an increase in vegetation cover in most areas of this region [23]. The conversion of agricultural land into grassland and forest land is the most prevalent type of land use change in the region, accompanied by a reduction in landscape fragmentation [24]. Studies conducted by other researchers [25] have suggested that this region holds great potential for carbon sequestration through improved land use practices. Therefore, investigating the spatial distribution of SOC in this area could provide insights into the effectiveness of ecological restoration projects. Many scholars have made efforts in spatializing SOC in the region. In a study conducted by other researchers [26], 644 samples of SOC were collected from this region and analyzed to assess the correlation between SOC and factors such as latitude, vegetation cover, land use, and soil type. 
Additionally, the researchers utilized kriging interpolation to model the spatial distribution of SOC in the area [27]. Some researchers [28] explored the relationship between SOC and spectral reflectance in the farming–pastoral zone of northern China and found that the inverse first-order differentiation of spectral reflectance formed a good multivariate linear regression model. Other scholars [29] used the random forest method to invert the distribution of SOC from 1989 to 2018 in a county-scale area of the farming–pastoral zone and explored the impact of human and natural factors on changes in SOC. Although a few studies have used interpolation methods to estimate regional-scale distributions, there has been a lack of research using remote sensing images to estimate the spatial distribution of SOC at the regional scale.

Although DNNs have demonstrated superior abilities compared to traditional learning in research conducted elsewhere, they still face some challenges. First, the collection of SOC requires substantial human resources and has high economic costs, resulting in limited sample sizes in most studies. Second, there is a certain skewness in the distribution of samples in arid and semi-arid areas. Additionally, the heterogeneity of images may lead to model overfitting [30]. These issues indicate that building deep learning models in the large research area of the North China agro–pastoral zone is a challenging task. This study aims to establish an accurate and moderately generalizable SOC model in the farming–pastoral ecotone of northern China, which has relatively fragile and sensitive ecosystems, to provide data support for carbon cycle research and ecological protection measures in this region. Therefore, the research objectives of this paper are: 1. to use the Boruta algorithm to screen variables; 2. to establish an SOC inversion model based on an RF algorithm; 3. to establish an SOC inversion model based on DNN methods with skip connections; and 4. to use the optimal model to calculate the spatial distribution of SOC in the farming–pastoral ecotone of northern China and analyze the distribution characteristics of SOC under different land use types and different degrees of desertification.

2. Materials and Methods

2.1. Overview of Study Area

The agro–pastoral transitional zone in northern China is a transitional zone between the monsoon climate and continental climate, covering semi-humid, semi-arid, and arid regions (as shown in Figure 1a). The overall terrain rises from east to west, with the east mainly being plains, the middle area comprising both plains and hills, and the west including highland hills and basins, making the terrain relatively complex. The northeastern part of the study area has higher precipitation, with an average annual rainfall of 400–450 mm. Most of the cultivated land is distributed in this region, which also lies at the southern edge of the Greater Khingan Mountains forest, where mild desertification is more prevalent. The southwestern part of the study area has lower precipitation, with an average annual rainfall of 300–400 mm. Low- to medium-coverage grasslands are mostly distributed in the west, where the areas of moderate to severe desertification increase toward the northeast. Overall, there is an interspersed distribution of cultivated land and grassland in the study area, with a diverse range of land use types. The region also includes China’s four major sandy areas (Hulunbeier Sandy Land, Horqin Sandy Land, Hunshandake Sandy Land, and Mu Us Sandy Land), as shown in Figure 1b.

The primary soil types in the forested region in the northern research area are gleyic phaeozems and haplic greyzems. Meanwhile, in the northern grassland area, eutric cambisols and haplic luvisols are contiguous. The southern forest region consists of gleyic luvisols, haplic phaeozems, and gleyic phaeozems, all of which are relatively fertile soils. Moving on to the central grassland area, dystric cambisols are sheet-like in distribution. In the agricultural and grassland mixed area in the south, haplic luvisols and gleyic podzoluvisols are banded in distribution. The soil fertility is lower in the desertification-prone area, where the soil type is calcaric cambisols.

2.2. SOC Sampling and Laboratory Analysis

In this study, 644 SOC sampling points were established within the study area based on the principles of representativeness, uniformity, and accessibility. The sampling was conducted between April and July 2018. For each sample, 15 sampling positions were randomly selected within a 10 m × 10 m quadrat, and soil was collected from the 0–20 cm and 20–30 cm depths and then mixed. Three soil bulk density samples were also collected using a cutting ring. The latitude and longitude of each sampling point were recorded using a GPS.

After sampling, the soil samples were air-dried and sieved through a 2-mm mesh sieve in the laboratory, followed by grinding and passing through a 0.25-mm mesh sieve. The content of SOC was determined using the potassium dichromate oxidation method with external heating. After measuring the SOC content, the soil organic carbon density ($\mathrm{SOCD}$) was calculated using the mean soil bulk density ($\mathrm{BD}$). The calculation method for BD (g·cm⁻³) is shown in Equation (2), where $m_1$ represents the weight of the soil and container (in grams), $m_2$ represents the weight of the container (in grams), and $V$ represents the volume of the container (100 cm³). SOCD (kg·m⁻²) refers to the amount of SOC stored in a certain depth of soil per unit area (Long Li et al., 2016), and the calculation formulas are as follows:

$$\mathrm{SOCD} = \mathrm{EC} \times \mathrm{BD} \times D \times (1 - \mathrm{SC}) \times 0.01 \tag{1}$$

$$\mathrm{BD} = \frac{m_1 - m_2}{V} \tag{2}$$

$$\mathrm{EC} = \frac{\frac{C \times 5}{V_1} \times (V_1 - V_2) \times 10^{-3} \times 3.0 \times 1.1}{M} \times 100\% \tag{3}$$

where $D$ represents the thickness of the soil layer, and $\mathrm{SC}$ represents the volume content of gravels larger than 2 mm. The soil organic carbon concentration ($\mathrm{EC}$) is calculated by Equation (3). In the equation, $C$ represents the concentration of the standard solution, 0.8000 mol·L⁻¹ (1/6 K₂Cr₂O₇), so that $C \times 5 / V_1$ is the equivalent concentration of FeSO₄. $V_1$ and $V_2$ represent the volumes (in mL) of FeSO₄ used in the blank titration and in the titration of the soil sample, respectively. $M$ represents the weight (in grams) of the dried soil sample. The value 3.0 represents the molar mass of 1/4 of a carbon atom (g·mol⁻¹) [31]. The value 1.1 is the correction factor for organic matter, which accounts for the 90% oxidation efficiency.

2.3. Remote Sensing Images and Preprocessing

In this study, data preprocessing and band calculation were carried out on the Google Earth Engine (GEE) platform. GEE is a cloud computing platform developed by Google for processing satellite images and other Earth observation data. The platform supports large-scale geospatial processing, spares users the time needed to download images for local processing, and provides a variety of remote sensing computing programs with flexible and convenient access. In this study, the Landsat 8 Collection 2 Level-2 dataset was selected (https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C02_T1_L2, accessed on 22 March 2023), which has been processed for geometric and radiometric quality and provides surface reflectance data generated using the Land Surface Reflectance Code algorithm. The algorithm uses the coastal aerosol band for aerosol inversion tests and employs MODIS auxiliary climate data and a radiative transfer model to atmospherically correct Landsat 8 and 9 satellite images [32]. From 2017 to 2019, a total of 679 images taken between April and October were selected based on a cloud cover threshold of less than 20%. A median composite of the selected images was computed, and the composite was sampled at the coordinates of the SOC points. In field ecological studies, researchers have found that SOC has lower sensitivity to aboveground biomass, especially in arid and semi-arid regions [33]. Therefore, median composites over three snow-free seasons were used in this study to ensure the integrity of the remote sensing data.
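The compositing step can be illustrated locally. The sketch below is a NumPy analogue of the filtering and median synthesis described above, not the GEE API; the function name, the array layout, and the sample values are hypothetical.

```python
import numpy as np

def median_composite(scenes, cloud_cover, months, max_cloud=20.0, season=(4, 10)):
    """Per-pixel median of the scenes that pass the cloud and season filters.

    scenes:      array (n_scenes, rows, cols), NaN where a pixel is masked
    cloud_cover: per-scene cloud cover in percent
    months:      per-scene acquisition month (1-12)
    """
    scenes = np.asarray(scenes, dtype=float)
    cloud_cover = np.asarray(cloud_cover, dtype=float)
    months = np.asarray(months)
    keep = (cloud_cover < max_cloud) & (months >= season[0]) & (months <= season[1])
    if not keep.any():
        raise ValueError("no scene passes the filters")
    # nanmedian ignores masked (NaN) pixels, so a gap in one scene is
    # filled from the remaining scenes
    return np.nanmedian(scenes[keep], axis=0)

# Hypothetical 2 x 2 reflectance scenes
scenes = np.array([
    [[0.10, 0.20], [0.30, np.nan]],  # May, 5 % cloud
    [[0.12, 0.22], [0.28, 0.40]],    # July, 10 % cloud
    [[0.90, 0.90], [0.90, 0.90]],    # August, 60 % cloud: dropped
    [[0.14, 0.18], [0.32, 0.44]],    # January, 3 % cloud: out of season, dropped
])
composite = median_composite(scenes, cloud_cover=[5, 10, 60, 3], months=[5, 7, 8, 1])
print(composite)
```

A median is less sensitive than a mean to residual cloud or shadow pixels that survive the scene-level cloud filter, which is one common reason for choosing it.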

To provide a broad set of candidate features for machine learning, we calculated as many indices as possible on GEE. These can be divided into several categories: vegetation indices, soil indices, climate indices, and band combination indices. All indices are listed in Table 1.

All indices were calculated and used as input features for machine learning. The coordinates of the 644 sample points were uploaded to the GEE platform, and the corresponding feature values were extracted and exported for subsequent modeling.

The land use data used in this study were produced by the Institute of Geographic Sciences and Natural Resources Research, Chinese Academy of Sciences, in 2020 (Geographical Information Monitoring Cloud Platform, 2020). The download website for the data is http://www.dsac.cn/DataProduct/Detail/200804 (accessed on 22 March 2023). The desertification degree data used in this study are from the SDG15.3 dataset (Big Earth Data in Support Of The Sustainable Development Goals, 2020), and the website for this dataset is http://sdgs.casearth.cn/zw/zbal/sdgldsw/202009/t20200925_582301.html (accessed on 22 March 2023).

2.4. Boruta Feature Selection Algorithm

To accurately identify redundant features and retain important features to improve model accuracy and interpretability, this study tested the performance of the Boruta algorithm in the samples [50]. The Boruta algorithm is a feature selection algorithm based on RF, the main principle of which is to remove redundant features from the feature selection process while retaining the features with the strongest predictive power for the target variable. Specifically, the Boruta algorithm performs feature selection through the following steps. 1. It creates multiple random forest models, each of which is trained on a random subset of the original features. 2. For each feature, it calculates its importance score on the original features and on the random subset features. These scores are calculated based on the node splitting ability and accuracy of the random forest model. 3. For each feature, it calculates the difference between the maximum score on the random feature subset and the maximum score on the original features. If the difference is less than or equal to a predefined tolerance value, the feature is marked as unimportant. 4. It repeats steps 2 and 3 until all features are marked as important or unimportant. 5. For features marked as important, it determines their true importance scores, which are calculated when training the random forest model on all original features. 6. Features with importance scores higher than the random score are retained, while other features are considered redundant and removed.
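For orientation, the widely used formulation of Boruta compares each real feature against "shadow" features, which are shuffled copies of the real columns. The sketch below is a simplified version of that idea using scikit-learn. It assumes a majority vote over a fixed number of iterations in place of Boruta's statistical test, and the function name, data, and settings are hypothetical rather than those used in this study.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def shadow_feature_selection(X, y, n_iter=10, n_trees=50, seed=0):
    """Simplified Boruta-style screening; returns (keep_mask, hit_counts)."""
    rng = np.random.default_rng(seed)
    n_features = X.shape[1]
    hits = np.zeros(n_features, dtype=int)
    for i in range(n_iter):
        # Shadow features: each column shuffled independently, which keeps
        # its distribution but destroys any relation to y
        shadows = np.column_stack(
            [rng.permutation(X[:, j]) for j in range(n_features)]
        )
        forest = RandomForestRegressor(n_estimators=n_trees, random_state=seed + i)
        forest.fit(np.hstack([X, shadows]), y)
        importance = forest.feature_importances_
        # A real feature scores a hit when it beats the best shadow feature
        hits += importance[:n_features] > importance[n_features:].max()
    # Simplification (assumption): majority vote instead of a binomial test
    return hits > n_iter / 2, hits

# Synthetic data: only the first two of five features carry signal
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)
keep, hits = shadow_feature_selection(X, y)
print(keep, hits)
```

The best shadow score acts as a data-driven noise floor, so no fixed importance threshold has to be chosen in advance.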

2.5. Random Forest Regression

Random forest (RF) regression is a traditional machine learning method and an ensemble learning algorithm based on multiple decision tree models, used for predicting continuous numerical outputs. The algorithm uses a bootstrap sampling method to obtain multiple training sets, and each training set constructs a decision tree. Each decision tree predicts the test sample, and the final prediction result is the average value of all decision tree model predictions. RF algorithms perform well in traditional machine learning and have some robustness to missing samples. In this study, RF models were established using all features and the features selected by the Boruta algorithm, and the maximum depth of the decision tree and the minimum number of samples in the node were tested. The accuracy validation method used 10-fold cross-validation.

2.6. Fully Connected Deep Learning Network

Fully connected neural network regression is a type of DL algorithm that maps inputs to outputs through multiple layers of nonlinear transformations. The basic principles of fully connected neural network regression are as follows. 1. A neural network consists of multiple neurons, each of which receives input from the previous layer and transforms it into output through an activation function. 2. In a fully connected neural network, each neuron is connected to all neurons in the previous layer. Therefore, after multiple transformations, a set of new features is formed, which can better represent the abstract features of the input data. 3. The output layer of the neural network usually has only one neuron, the output of which is a continuous real value. The activation function of the output layer is usually an identity function because regression problems require predicting continuous outputs. 4. The training process of the neural network usually uses the backpropagation algorithm to update the weights and biases in the network by minimizing the loss function. Common loss functions include mean squared error (MSE), mean absolute error, etc. 5. During the training process, we can use optimization algorithms, such as batch gradient descent, to update weights and biases, thereby minimizing the loss function. Meanwhile, to prevent overfitting, regularization techniques, such as L1, L2 regularization, dropout, etc., can be used.

In summary, fully connected neural network regression is a powerful deep learning algorithm that can be used to predict continuous outputs. It has good flexibility and prediction performance but requires substantial computing resources. The fully connected neural network structure used in this study is shown in Figure 2, which includes five hidden layers and five skip connections, with 128 neurons in each hidden layer. The activation function between hidden layers is the ReLU function.
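A forward pass through a network of this shape can be sketched in a few lines of NumPy. The wiring of the skip connections is an assumption here, since Figure 2 is not reproduced: each hidden layer adds its input to its ReLU output, and the first skip uses a linear projection because the input width differs from 128. The input size of nine, the initialization, and all names are hypothetical, and dropout is omitted because it is inactive at inference time.

```python
import numpy as np

def init_params(n_inputs, width=128, n_hidden=5, seed=0):
    rng = np.random.default_rng(seed)
    sizes = [n_inputs] + [width] * n_hidden
    layers = []
    for fan_in, fan_out in zip(sizes[:-1], sizes[1:]):
        layers.append({
            # He initialization, a common choice for ReLU layers
            "W": rng.normal(scale=np.sqrt(2.0 / fan_in), size=(fan_in, fan_out)),
            "b": np.zeros(fan_out),
        })
    # Assumed: projection so the first skip matches the hidden width
    proj = rng.normal(scale=np.sqrt(1.0 / n_inputs), size=(n_inputs, width))
    head = {
        "W": rng.normal(scale=np.sqrt(1.0 / width), size=(width, 1)),
        "b": np.zeros(1),
    }
    return {"layers": layers, "proj": proj, "head": head}

def forward(params, x):
    h = x
    for i, layer in enumerate(params["layers"]):
        skip = h @ params["proj"] if i == 0 else h   # one skip per hidden layer
        h = np.maximum(h @ layer["W"] + layer["b"], 0.0) + skip
    # Single output neuron with identity activation for regression
    out = h @ params["head"]["W"] + params["head"]["b"]
    return out[:, 0]

params = init_params(n_inputs=9)
x = np.random.default_rng(1).normal(size=(4, 9))   # four hypothetical samples
pred = forward(params, x)
print(pred.shape)
```

Skip connections give gradients a direct path to earlier layers, which tends to make deeper fully connected stacks easier to train on small samples.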

RF and DNN algorithms generally do not require data to follow a normal distribution. However, the distribution of the sample data can significantly affect the generalization ability of machine learning models. Inconsistencies between the sample distribution and the actual distribution, as well as inconsistencies between the sample distribution and the predicted distribution, can lead to a decrease in the model’s generalization ability [51,52]. One approach to mitigating sample imbalances and dataset shifts is to correct the skewness of the sample distribution and perform the same operations on the predicted values using correction parameters based on the sample values. Common methods for correcting skewness include square root transformation, logarithmic transformation, and the Box–Cox method. The Box–Cox transformation is a statistical technique used to bring non-normal data closer to a normal distribution. It involves raising the data to a power, lambda, which is determined by maximizing the likelihood function. This transformation can be used to improve the accuracy of statistical models that assume normally distributed data.
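The Box–Cox step and its inverse can be sketched as follows. This is a minimal NumPy illustration in which lambda is chosen by a grid search over the profile log-likelihood; the grid range and the synthetic, positively skewed sample are assumptions, not the study's data.

```python
import numpy as np

def skewness(x):
    x = np.asarray(x, dtype=float)
    d = x - x.mean()
    return np.mean(d**3) / np.mean(d**2) ** 1.5

def boxcox(x, lam):
    x = np.asarray(x, dtype=float)
    return np.log(x) if abs(lam) < 1e-12 else (x**lam - 1.0) / lam

def inv_boxcox(y, lam):
    y = np.asarray(y, dtype=float)
    return np.exp(y) if abs(lam) < 1e-12 else (lam * y + 1.0) ** (1.0 / lam)

def boxcox_loglik(x, lam):
    """Profile log-likelihood of lambda under a normal model (up to a constant)."""
    x = np.asarray(x, dtype=float)
    y = boxcox(x, lam)
    return (lam - 1.0) * np.log(x).sum() - 0.5 * x.size * np.log(y.var())

def fit_lambda(x, grid=None):
    grid = np.linspace(-2.0, 2.0, 401) if grid is None else grid
    scores = [boxcox_loglik(x, lam) for lam in grid]
    return float(grid[int(np.argmax(scores))])

rng = np.random.default_rng(0)
socd = rng.lognormal(mean=1.3, sigma=0.5, size=644)   # synthetic, strictly positive
lam = fit_lambda(socd)
transformed = boxcox(socd, lam)       # train the model on this scale
restored = inv_boxcox(transformed, lam)   # map predictions back with the same lambda
print(lam, skewness(socd), skewness(transformed))
```

The same lambda fitted on the training samples has to be reused when predictions are mapped back to the original scale, which is the "correction parameters based on the sample values" step described above.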

2.7. Model Evaluation and Uncertainty Analysis

This study used 10 DNN models obtained through 10-fold cross-validation and used these models to invert the SOC in the agro–pastoral transitional zone in the north. After 10-fold cross-validation, the average $R^{2}$ and $\mathrm{RMSE}$ values for each iteration were calculated. The formulas for calculating $R^{2}$ and $\mathrm{RMSE}$ for each model are as follows:

$$R^{2} = 1 - \frac{\mathrm{SSR}}{\mathrm{SST}} \tag{4}$$

$$\mathrm{RMSE} = \sqrt{\frac{\mathrm{SSE}}{n}} \tag{5}$$

where $\mathrm{SSR}$ represents the sum of squared residuals, $\mathrm{SST}$ represents the total sum of squares, $\mathrm{SSE}$ represents the sum of squared errors, and $n$ represents the number of samples.
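Equations (4) and (5) translate directly into code. The sketch below assumes that SSR and SSE denote the same quantity, the sum of squared differences between observed and predicted values; the sample values are hypothetical.

```python
import numpy as np

def r2_score(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ssr = np.sum((y_true - y_pred) ** 2)          # sum of squared residuals
    sst = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
    return 1.0 - ssr / sst

def rmse(y_true, y_pred):
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    sse = np.sum((y_true - y_pred) ** 2)          # sum of squared errors
    return np.sqrt(sse / y_true.size)

y_true = np.array([2.0, 4.0, 6.0, 8.0])
y_pred = np.array([3.0, 4.0, 5.0, 8.0])
print(r2_score(y_true, y_pred), rmse(y_true, y_pred))
```

In a 10-fold setting, both functions would be applied to the held-out samples of each fold and the ten values averaged.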

We assume that the SOC inverted by these models follows a normal distribution; thus, based on the 10 SOC predicted values on each pixel, we calculated the confidence interval of the pixel. Using this method, we can obtain the uncertainty of DNN models in predicting SOC in the agro–pastoral transitional zone in the north [53].
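One way to turn the ten per-pixel predictions into upper and lower limit maps is sketched below. It assumes that the limits are the 5% and 95% quantiles of a normal distribution fitted to the ten predictions at each pixel; the exact interval definition used in the study may differ, and the prediction stack here is synthetic.

```python
import numpy as np
from statistics import NormalDist

def pixel_interval(predictions, lower=0.05, upper=0.95):
    """predictions: array (n_models, rows, cols) -> (mean, lower, upper) maps."""
    predictions = np.asarray(predictions, dtype=float)
    mean = predictions.mean(axis=0)
    std = predictions.std(axis=0, ddof=1)    # sample standard deviation across models
    z_lo = NormalDist().inv_cdf(lower)       # about -1.645 for 5 %
    z_hi = NormalDist().inv_cdf(upper)       # about +1.645 for 95 %
    return mean, mean + z_lo * std, mean + z_hi * std

# Synthetic stack: 10 models predicting a 3 x 3 pixel window
rng = np.random.default_rng(0)
preds = 4.0 + 0.5 * rng.normal(size=(10, 3, 3))
mean_map, lower_map, upper_map = pixel_interval(preds)
print(upper_map - lower_map)   # interval width as a simple uncertainty map
```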

3. Results

3.1. Exploratory Data Analysis

In this study, we aim to investigate the influence of the distribution of SOC data and image spectral data on the results of DNN inversion. Therefore, in this section, we conducted an analysis of the distribution of the data used to construct the samples to help illustrate the effectiveness of the model establishment in subsequent sections. The analysis results are shown in Figure 3.

From Figure 3a,b, it can be observed that there is significant variation in SOC values among forests and grassland, while the SOC variation in sandy areas is relatively small. After spatially uniform sampling, the distribution of SOC is concentrated around 4.2 kg/m², with the average value slightly higher in grassland compared to other land use types. The extreme values of SOC are more prevalent in grassland and cropland.

For the spectral bands used in model construction, all bands exhibited significant skewness, particularly elevation, which showed a distribution with three peaks. The land surface temperature (LST) had the smallest skewness, while the IBI_temp, which reflects bare soil, had the highest skewness.

The skewness of SOC is likely to result in a decrease in the generalization performance of the DNN model. Therefore, in the following sections, we correct its skewness and investigate its impact on generalization performance.

3.2. The Feature Results Selected by the Boruta Algorithm

After applying the Boruta algorithm to screen vegetation, soil and climate indices, we determined that the following features should be retained: land surface temperature (lst), brightness temperature (ST_B10), near-infrared (SR_B5), elevation, salinity index-2 (SI2), and hillshade (as indicated by the green part in Figure 3a). In contrast, we found that vegetation indices, most soil salinity indices, and soil particle size indices should be removed (as indicated by the red part in Figure 3a). This result implies that, in the agro–pastoral transition zone of northern China, there is a correlation of SOCD with temperature, elevation, and soil salinity.

After applying the Boruta algorithm to screen two-band combination features, we identified that, while RIB3B6 should be removed, the following features should be retained: NDIB3B6, DIB6B10, and DIB7B10. These findings suggest that SOCD is correlated with normalized values of the green (0.525–0.600 µm) and shortwave infrared (SWIR) 1 (1.566–1.651 µm), the difference between the SWIR 1 and thermal bands, and the difference between the SWIR 2 and thermal infrared bands.
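The three families of two-band features can be written compactly. The formulas below are the conventional definitions of a normalized difference, a difference, and a ratio index, and are assumed to match the NDI, DI, and RI prefixes used here; the band values are hypothetical, and how the thermal band was scaled before differencing is not stated in this section.

```python
import numpy as np

def normalized_difference(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return (a - b) / (a + b)

def difference(a, b):
    return np.asarray(a, dtype=float) - np.asarray(b, dtype=float)

def ratio(a, b):
    return np.asarray(a, dtype=float) / np.asarray(b, dtype=float)

# Hypothetical values for two pixels: green (B3), SWIR 1 (B6), thermal (B10)
b3 = np.array([0.08, 0.10])
b6 = np.array([0.24, 0.30])
b10 = np.array([0.50, 0.60])

ndi_b3_b6 = normalized_difference(b3, b6)   # analogue of NDIB3B6
di_b6_b10 = difference(b6, b10)             # analogue of DIB6B10
ri_b3_b6 = ratio(b3, b6)                    # analogue of RIB3B6
print(ndi_b3_b6, di_b6_b10, ri_b3_b6)
```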

3.3. Random Forest Regression Model

The study first tested an RF regression model using all features. The model had an in-sample coefficient of determination (R²) of 0.86 and root mean squared error (RMSE) of 6.55 but an out-of-sample 10-fold cross-validation accuracy of R² = 0.28 and RMSE = 11.76, indicating overfitting. The feature importance ranking is shown in Figure 4a. Using MSE as the metric to determine the number of decision trees, it can be seen that the model error remains essentially unchanged after 50 trees (Figure 4b). In this study, 100 decision trees were used as the parameter for building the random forest. Similarly, we set the minimum number of samples required at a leaf node to 5. The model was then rebuilt using the bands and parameters selected by the Boruta algorithm. The internal accuracy of the model was R² = 0.93 and RMSE = 0.82, while the 10-fold cross-validation results were R² = 0.40 and RMSE = 5.76 (Figure 4c). The degree of overfitting was reduced, and the model’s generalization performance improved significantly.
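The gap between in-sample and cross-validated accuracy can be reproduced on toy data. The sketch below uses scikit-learn with the two settings stated above (100 trees, at least 5 samples per leaf); the synthetic features and target are assumptions and are unrelated to the study's samples.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 9))   # synthetic stand-in features
y = X[:, 0] + 0.5 * X[:, 1] ** 2 + rng.normal(scale=1.0, size=300)

forest = RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=0)

# In-sample: the model is scored on the data it was trained on
r2_in = r2_score(y, forest.fit(X, y).predict(X))

# Out-of-sample: R² per held-out fold, averaged over the 10 folds
folds = KFold(n_splits=10, shuffle=True, random_state=0)
r2_cv = cross_val_score(forest, X, y, cv=folds, scoring="r2").mean()

print(round(r2_in, 2), round(r2_cv, 2))
```

A large difference between the two scores is the symptom of overfitting discussed in this section.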

3.4. Fully Connected Neural Network Model

Similar to the training process of the RF model, in this study, we first built a model using all features, with internal accuracy of R² = 0.87 and RMSE = 5.88. However, the out-of-sample 10-fold cross-validation results still indicated overfitting, with R² = 0.38 and RMSE = 11.57. It is noteworthy that the R² of the DNN model was close to that of the RF at this point, and the feature set selected by the Boruta algorithm did not significantly improve the accuracy of the DNN model.

The skewness analysis of the original data shows that the skewness is 1.44, as shown in Figure 5a. Therefore, the prediction of SOC in arid and semi-arid areas is a highly positively skewed regression modeling problem, and the data need to be corrected for skewness. This study tested the skewness correction of SOC data using the square root, logarithmic, and Box–Cox transformations, and the results are shown in Figure 5d. It can be seen that, after using the square root, the SOC was still moderately positively skewed, while after using the logarithmic transformation, it became moderately negatively skewed. After using Box–Cox, the data closely followed a normal distribution.

The loss functions tested in this study were L1, MSE, SmoothL1, and Huber Loss. SmoothL1 and Huber Loss showed significant improvements in accuracy. The feature selection was still based on the Boruta results from the previous section, and dropout layers and skip connections were added to reduce overfitting. The accuracy changes during the training process are shown in Figure 6a. The final model achieved an accuracy of R² = 0.52 and an RMSE of 0.7 (as shown in Figure 6c). Compared to the RF model, the accuracy was significantly improved, especially in terms of the RMSE. As shown in Figure 5b, the error distribution of the model followed a normal distribution, with 95% of the errors concentrated within ±2, indicating a reliable model.
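The two robust losses differ from MSE in how they treat large residuals. The sketch below follows their commonly used definitions; the threshold parameters (delta and beta) and the sample values are assumptions, since the settings used in this study are not stated.

```python
import numpy as np

def huber_loss(y_true, y_pred, delta=1.0):
    r = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * r**2                 # small residuals: like MSE
    linear = delta * (r - 0.5 * delta)     # large residuals: like L1
    return np.mean(np.where(r <= delta, quadratic, linear))

def smooth_l1_loss(y_true, y_pred, beta=1.0):
    r = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    quadratic = 0.5 * r**2 / beta
    linear = r - 0.5 * beta
    return np.mean(np.where(r < beta, quadratic, linear))

def mse_loss(y_true, y_pred):
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return np.mean(r**2)

y_true = np.array([0.0, 0.0, 0.0])
y_pred = np.array([0.5, 1.0, 3.0])   # the last residual plays the role of an outlier

print(mse_loss(y_true, y_pred))        # dominated by the outlier
print(huber_loss(y_true, y_pred))      # outlier contributes only linearly
print(smooth_l1_loss(y_true, y_pred))  # equals Huber when delta = beta = 1
```

With these definitions, the Huber loss with threshold delta equals delta times the SmoothL1 loss with beta set to the same value, so the two differ only by a scale factor once the thresholds match. For a positively skewed target, the linear tail limits the influence of the few very large SOC values on the gradient.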

3.5. Spatial Distribution of 0–30 cm SOCD in the Agro–Pastoral Ecotone in Northern China

The SOCD of the 0- to 30-cm layer in the North China agro–pastoral transition zone was obtained using a fully connected neural network model, as shown in Figure 7a. The total organic carbon storage in the soil was 2439.26 Tg, with an average value of 3.86 kg·C/m². The areas with relatively high SOCD were located in the north and central parts of the study area, where the soil types were mainly gleyic phaeozems and haplic greyzems, with values ranging from 5.53 to 10.44 kg·C/m². The areas with relatively low SOCD were in sandy areas and calcaric cambisols, with values less than 3.99 kg·C/m². Among the four sandy areas, the majority of the SOCD in the Mu Us Sandy Land and the Horqin Sandy Land was less than 3.99 kg·C/m², and the Horqin Sandy Land was slightly better than the Mu Us Sandy Land. Except for the sandy belt area, the SOCD in the Hulunbeier Sandy Land was greater than 5.34 kg·C/m².

Overall, the DNN model established in this study predicted that 90% of the estimated SOCD values fall within the range of 0 to 18.15 kg·C/m². Figure 7b and Figure 7d, respectively, depict the lower limit (5%) and upper limit (95%) calculated by the DNN model in the agro–pastoral transitional zone of northern China. The overall distribution is consistent with the average value shown in Figure 7c. Values relatively greater than the average value are more likely to occur in the forested areas in the southern part of the study area, while values relatively lower than the average value are more likely to occur in the desertification-prone regions of the study area.

Figure 8a shows the SOCD in the agro–pastoral transition zone by land use type. Forests exhibit the highest SOCD among all types of land use, with both the mean and overall distributions exceeding those of other types. The average SOCD value of forests is 6.4 kg·C/m². In addition to forests, the mean SOCD of other types of woodland (shrubland and sparse forest) is also higher than that of other land use types. The SOCD of grassland areas decreases with decreasing vegetation cover. The SOCD of paddy fields and high-cover grasslands is around 4 kg·C/m², while dryland and medium-cover grasslands have SOCD values similar to each other.

Figure 8b shows the SOCD in the agro–pastoral transition zone by desertification type. The SOCD of sandy areas was lower than that of other land use types. As desertification developed, the SOCD gradually decreased. The SOCD in lightly desertified areas ranged from 2.5 to 3.5 kg·C/m², whereas in moderately desertified areas, it was between 2.2 and 3.2 kg·C/m². SOCD was found to be concentrated between 2.1 and 3 kg·C/m² in heavily desertified areas, whereas in extremely desertified areas, the concentration was between 2.1 and 2.8 kg·C/m².

4. Discussion

4.1. Factors Affecting the Accuracy of a SOC Inversion Model Established by Machine Learning

Although there have been many studies using machine learning methods to infer SOC content, research at the regional scale is still limited, partly due to the difficulty of collecting soil samples at a large regional scale and partly due to the limitations of the spectral resolution of Landsat and Sentinel satellite imagery. For example, in a county-scale study conducted in our research area, only the RF method could achieve an R² value of 0.59 [29]. In a SOC study at the national scale in an African country, scholars [54] achieved good results using a fully connected neural network without skip layers. Our study, which had a similar study area but only one-third of the sample size and featured highly positively skewed sample distributions, also achieved credible accuracy (R² = 0.52, RMSE = 0.7).
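For readers comparing the accuracy figures quoted above, the two metrics can be computed as in the following sketch. It assumes the coefficient-of-determination definition of R² (one minus the ratio of residual to total sum of squares). Some studies instead report the squared Pearson correlation, and this section does not state which definition was used. The toy values are hypothetical.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error."""
    n = len(y_true)
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n)


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical validation values.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.5, 2.0, 2.5, 4.0]

print(r_squared(y_true, y_pred))  # 0.9
print(rmse(y_true, y_pred))       # about 0.354
```

Because RMSE carries the units of the target, values computed on a transformed scale (for example, after Box–Cox) are not directly comparable with values computed in kg·C/m².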

This study found that the performance of the neural network model for inferring SOC could be significantly improved when the output data were made to conform to a normal distribution as closely as possible. Additionally, the skewness of the features input into the model should be noted, as it may be among the factors affecting the accuracy [55]. The skewness of the input features of the model used in this study was calculated, as shown in Figure 3c, and the skewness of all features was found to be relatively low. Correcting the skewness of the input features did not significantly improve the accuracy of the model. Therefore, having considered the skewness of both the SOC values and the input features, we believe that the model accuracy in this study was limited not only by the highly positively skewed SOC values in arid and semi-arid areas but possibly also by the limited sample size.

The selection of features is another important factor influencing the accuracy of machine learning models for SOC inversion. The same feature may have different levels of importance in different ecological environments and at different spatial scales. For example, in plain areas dominated by agricultural production in humid regions, NDVI is the most important influencing factor for SOC models [56]. In a nationwide DNN model in Africa, the 655-nm band and NDVI were found to be the two most important factors influencing the results [54]. However, in this study, the contribution of vegetation information, such as NDVI, to the model is small, which may be due to the limited correlation between green vegetation and SOC in arid and semi-arid regions. After comparing various indices and dual-band combinations, this study found that, in arid and semi-arid regions, SOC is more closely related to the normalized difference of the near-infrared and green bands, as well as to the difference between the near-infrared and thermal infrared bands. In laboratory-based spectral studies of soils, the combination of visible and near-infrared bands has been shown in many studies to have good predictive ability for SOC [57,58], while some studies have suggested that the mid-infrared range is also important for expressing SOC [59].
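A screening of dual-band combinations of the kind described above can be sketched as follows. The code computes the normalized difference for every pair of bands and ranks the pairs by the absolute Pearson correlation with SOCD. The band names follow the Landsat 8 OLI numbering (B3 green, B4 red, B5 near-infrared, as in Table 1); the reflectance and SOCD values are invented for illustration, and the ranking procedure itself is an assumption rather than the authors' documented workflow.

```python
import math
from itertools import combinations


def normalized_difference(a, b):
    """(a - b) / (a + b) per sample; assumes a + b is never zero."""
    return [(x - y) / (x + y) for x, y in zip(a, b)]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((u - mx) * (v - my) for u, v in zip(x, y))
    sx = math.sqrt(sum((u - mx) ** 2 for u in x))
    sy = math.sqrt(sum((v - my) ** 2 for v in y))
    return cov / (sx * sy)


def rank_band_pairs(bands, target):
    """Rank band pairs by |r| between their normalized difference and target."""
    results = []
    for n1, n2 in combinations(bands, 2):
        index = normalized_difference(bands[n1], bands[n2])
        results.append((abs(pearson(index, target)), n1, n2))
    return sorted(results, reverse=True)


# Hypothetical surface reflectance per sample and matching SOCD (kg C/m^2).
bands = {
    "B3": [0.08, 0.10, 0.12, 0.09, 0.11, 0.13],
    "B4": [0.10, 0.14, 0.18, 0.11, 0.15, 0.20],
    "B5": [0.30, 0.28, 0.25, 0.32, 0.27, 0.24],
}
socd = [5.1, 3.9, 2.4, 5.6, 3.5, 2.1]

for r, n1, n2 in rank_band_pairs(bands, socd):
    print(f"ND({n1}, {n2}): |r| = {r:.3f}")
```

With `B5` and `B4` as inputs, `normalized_difference` gives the negative of NDVI as defined in Table 1, since the argument order is reversed; the absolute correlation used for ranking is unaffected by this sign. A linear correlation screen of this kind only captures monotonic, roughly linear relationships, so it complements rather than replaces a model-based importance measure such as Boruta.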

Temperature affects the dissociation of biochemical complexes, microbial enzyme production, and conformation, and it restricts the effectiveness of soil organic matter processes, such as adsorption/desorption and aggregate turnover, thereby influencing the decomposition of soil organic matter [60,61]. At the same time, temperature indirectly alters plant productivity through soil moisture, affecting carbon allocation to leaves, roots, and litter, thus impacting the input of SOC [62]. In a sampling-based study within the same region, researchers [63] found that the annual average temperature and precipitation in the temperate grassland of Inner Mongolia had a greater impact on SOC than grazing. Another study suggested that temperature, precipitation, soil texture, and their interactions could explain 71% of SOC variation in China’s temperate grassland region [64]. In this study, the surface temperature-related features were identified as the primary factors influencing the accuracy of the SOC prediction model, which is consistent with the findings of the aforementioned research. It is worth noting that the model established in this study does not rely on vegetation index data, which are among the most obviously time-varying variables in this study area. Instead, it relies more on temperature, elevation, and other variables. Therefore, we have reason to believe that it has some capacity for temporal extrapolation.

4.2. SOC in Different Land Use Types

Compared with the ordinary kriging interpolation results obtained from the same data by other scholars [27], the total and mean values of soil organic carbon storage in this study were slightly higher, and the overall distribution was similar. In the northern black soil region, the soil organic carbon density could reach 21 kg·C/m² according to the interpolation method, but in the DNN model simulation, the maximum value in this area did not exceed 10.44 kg·C/m². At the same time, the extremely low values of sandy soil disappeared, indicating that machine learning relies heavily on the samples and can miss extreme values in the inversion results. The SOCDs of different land use types in this study and in the aforementioned study are compared in Table 2, which indicates that the prediction results of the DNN model established in this study are reliable in terms of spatial representation.

In 2018, researchers conducted an investigation on the SOCD of more than a dozen continuously grazed grasslands in Inner Mongolia, and the average SOCD was approximately 1.2 kg/m² [65]. This value is slightly lower than the SOCD of low-coverage grassland observed in our study. Other researchers conducted soil organic carbon density (SOCD) investigations at 30 sample sites of meadow steppe, typical steppe, and desert steppe in northern China. The results showed that the SOCD values were 7.47 kg/m² for typical steppe, 4.00 kg/m² for desert steppe, and 6.64 kg/m² for meadow steppe [66]. These values are slightly higher than our average value, and the upper and lower limit ranges are consistent with our study. Wang X. Y. et al. [26] reported that, compared with 1980, the carbon storage of grassland ecosystems has increased significantly. Due to the extensive protective measures implemented by China in ecologically vulnerable areas since 1950, farmland and other areas have been converted into grassland [67]. A study of the organic carbon content of soil aggregates in restored grasslands in the farming-pastoral zone of northern China [68] found that returning farmland to forest and grassland was effective in increasing the concentration of organic carbon in soil surface aggregates. Therefore, considering the proportion and total amount of the grassland soil organic carbon pool in this study and other research results [26], grassland is the largest soil organic carbon pool in the study area and plays an important role in carbon sequestration in this region.

Researchers [69] investigated the SOCD of forests and sparse forests in the study area, and the results were approximately 4–6 kg/m² for forests and 1.2–1.92 kg/m² for sparse forests at a depth of 0–30 cm. These results are close to the findings of our study. Timberland accounted for 20.82% of the SOC storage in the study area, with carbon storage of 507.853 Tg in the 0–30 cm layer. The main contribution comes from forests, while shrubland and sparse forests account for only 7.36%. According to the DNN simulation results, the SOCD is highest in forest land, with an average value of 6.4 kg·C/m², because the forest land in the study area is mainly located on the southern edge of the Greater Khingan Range, with gleyic phaeozems and haplic greyzems as the main soil types and a semi-humid climate. The forests are mainly coniferous and broad-leaved, making this the region with the best ecological environment in the study area.

In a survey of SOCD in the Horqin Sandy Land, the mean values for mild, moderate, and severe desertification were generally consistent with the results of our study [70]. The carbon storage of sandy soil in the study area accounts for 2.91% of the total. The same survey [70] suggested that the content of soil organic carbon decreases with the increasing degree of desertification, consistent with the results of this study in the four sandy areas. Soil wind erosion resulted in an annual loss of SOC in the surface layer (0–30 cm) ranging from 0 to 37.56 g·m⁻²·a⁻¹, with a mean annual loss of 0.21 g·m⁻²·a⁻¹ in the northern agricultural-pastoral zone [71]. Therefore, even though the SOCD in desertified areas may not be high, desertified land remains sensitive to SOC changes, in particular because desertification leads to a decrease in SOC in the surrounding areas.

This study showed similar results to other studies [26] in terms of the proportion and total amount of soil organic carbon in different land use and desert types, further verifying the reliability of the DNN model. The advantage of the DNN model over interpolation methods is that it can extrapolate forward and backward in time based on remote sensing images.

4.3. Limitations of the Model and Future Developments

In this study, we observed consistency between the distribution of soil organic carbon (SOC) and soil type distribution, which may be attributed to the driving role of clay content in SOC variation. Clay content is an important driving factor for soil organic carbon changes, as it affects the adsorption and retention of organic matter, as well as the physical and chemical properties of soil, thereby influencing the accumulation and decomposition processes of organic carbon in soil [9,72]. Therefore, we believe that incorporating clay content or soil type as additional features in future research will improve the generalization performance of the model. Similarly, as mentioned in the previous discussion, the contribution of temperature to the SOC inversion model has been highlighted. Other studies [64] have also mentioned the synergistic effects of precipitation and temperature on SOC content. Therefore, incorporating temperature and precipitation as model features may also be a way to improve model accuracy.

It should be noted that the limited number of SOC samples constrains the generalizability of the RF and DNN models, which is a significant factor restricting SOC remote sensing mapping. Using models with stronger learning capabilities and increasing the sample size are direct approaches to addressing this issue. For example, convolutional neural networks can better capture information from imagery. However, collecting SOC samples requires substantial time and resources. Therefore, future research could also consider methods to expand the sample size, such as various semi-supervised regression algorithms [73].

In addition to improving spatial accuracy, it is important to consider the temporal generalizability of the model, as the original intention of using remote sensing to estimate SOC is to infer past changes in soil organic carbon based on remote sensing imagery. In addition to collecting samples at multiple time points, long-term image fusion or aligning reflectance values from different time periods may improve the credibility of the model’s temporal generalization in SOC inversion.

5. Conclusions

This study used a limited number of soil samples and Landsat 8 OLI data to test the performance of fully connected deep learning networks with skip layers and random forest methods for estimating soil organic carbon density at the regional scale in arid and semi-arid regions. The DNN model outperformed the RF model, especially in terms of reducing errors in generalization performance (R²_DNN = 0.52, RMSE_DNN = 0.7, R²_RF = 0.40, RMSE_RF = 5.76). We also explored ways to improve the accuracy of SOC inversion models in arid and semi-arid regions with limited sample sizes. First, using the Boruta algorithm to select features can significantly improve the accuracy of random forest models, but it has limited effects on DNN models. Second, temperature and elevation have a strong correlation with SOC in arid and semi-arid regions, but vegetation indices have a weaker correlation. Furthermore, SOCD in arid and semi-arid regions shows a highly positively skewed distribution, and using Box–Cox transformation to correct the skewness can significantly improve the accuracy of DNN models. This outcome suggests that DNN models are better at extracting hidden information from features but also rely more on the distribution of samples being normal. This study provides an example of SOC mapping in ecologically vulnerable areas at a regional scale. In particular, it highlights the significance of temperature and elevation in remote sensing-based SOC inversion. This study will offer reliable experience and insights for similar studies in other regions. In future studies, it may be possible to increase the sample size or use convolutional neural networks, which can utilize neighboring pixel information.

Author Contributions

Conceptualization, Y.L. and Z.G.; methodology, Z.G.; validation, Z.G.; formal analysis, Z.G.; investigation, X.W. and X.G.; writing—original draft preparation, Z.G.; writing—review and editing, Z.G., Y.L., Y.C. and W.C. All authors have read and agreed to the published version of the manuscript.

Funding

This study is supported by the National Natural Science Foundation of China under grant nos. 31971466 and 32001214.

Acknowledgments

We thank the associate editor and the reviewers for their useful feedback, which helped to improve this paper.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Post, W.M.; Peng, T.-H.; Emanuel, W.R.; King, A.W.; Dale, V.H.; DeAngelis, D.L. The global carbon cycle. Am. Sci. 1990, 78, 310–326.
2. Jiao, S.; Li, J.; Li, Y.; Xu, Z.; Kong, B.; Li, Y.; Shen, Y. Variation of soil organic carbon and physical properties in relation to land uses in the Yellow River Delta, China. Sci. Rep. 2020, 10, 20317.
3. Lal, R. Restoring Soil Quality to Mitigate Soil Degradation. Sustainability 2015, 7, 5875–5895.
4. Hamzehpour, N.; Shafizadeh-Moghadam, H.; Valavi, R. Exploring the driving forces and digital mapping of soil organic carbon using remote sensing and soil texture. Catena 2019, 182, 104141.
5. Mallik, S.; Bhowmik, T.; Mishra, U.; Paul, N. Mapping and prediction of soil organic carbon by an advanced geostatistical technique using remote sensing and terrain data. Geocarto Int. 2022, 37, 2198–2214.
6. Mirzaee, S.; Ghorbani-Dashtaki, S.; Mohammadi, J.; Asadi, H.; Asadzadeh, F. Spatial variability of soil organic matter using remote sensing data. Catena 2016, 145, 118–127.
7. Falloon, P.; Smith, P. Simulating SOC changes in long-term experiments with RothC and CENTURY: Model evaluation for a regional scale application. Soil Use Manag. 2002, 18, 101–111.
8. Li, C.; Frolking, S.; Frolking, T.A. A model of nitrous oxide evolution from soil driven by rainfall events: 1. Model structure and sensitivity. J. Geophys. Res. Atmos. 1992, 97, 9759–9776.
9. Jastrow, J.D.; Amonette, J.E.; Bailey, V.L. Mechanisms controlling soil carbon turnover and their potential application for enhancing carbon sequestration. Clim. Change 2007, 80, 5–23.
10. Qin, C.Z.; Zhu, A.X.; Qiu, W.L.; Lu, Y.J.; Li, B.L.; Pei, T. Mapping soil organic matter in small low-relief catchments using fuzzy slope position information. Geoderma 2012, 171, 64–74.
11. Rossel, R.A.V.; Walvoort, D.J.J.; McBratney, A.B.; Janik, L.J.; Skjemstad, J.O. Visible, near infrared, mid infrared or combined diffuse reflectance spectroscopy for simultaneous assessment of various soil properties. Geoderma 2006, 131, 59–75.
12. Croft, H.; Kuhn, N.; Anderson, K. On the use of remote sensing techniques for monitoring spatio-temporal soil organic carbon dynamics in agricultural systems. Catena 2012, 94, 64–74.
13. Nocita, M.; Stevens, A.; Toth, G.; Panagos, P.; van Wesemael, B.; Montanarella, L. Prediction of soil organic carbon content by diffuse reflectance spectroscopy using a local partial least square regression approach. Soil Biol. Biochem. 2014, 68, 337–347.
14. Tatzber, M.; Mutsch, F.; Mentler, A.; Leitgeb, E.; Englisch, M.; Gerzabek, M.H. Determination of organic and inorganic carbon in forest soil samples by mid-infrared spectroscopy and partial least squares regression. Appl. Spectrosc. 2010, 64, 1167–1175.
15. Ben-Dor, E.; Inbar, Y.; Chen, Y. The reflectance spectra of organic matter in the visible near-infrared and short wave infrared region (400–2500 nm) during a controlled decomposition process. Remote Sens. Environ. 1997, 61, 1–15.
16. Zhang, L.P.; Zhang, L.F.; Du, B. Deep Learning for Remote Sensing Data: A technical tutorial on the state of the art. IEEE Geosci. Remote Sens. Mag. 2016, 4, 22–40.
17. Guo, L.; Fu, P.; Shi, T.Z.; Chen, Y.Y.; Zhang, H.T.; Meng, R.; Wang, S.Q. Mapping field-scale soil organic carbon with unmanned aircraft system-acquired time series multispectral images. Soil Tillage Res. 2020, 196, 104477.
18. Wang, X.P.; Zhang, F.; Kung, H.T.; Johnson, V.C. New methods for improving the remote sensing estimation of soil organic matter content (SOMC) in the Ebinur Lake Wetland National Nature Reserve (ELWNNR) in northwest China. Remote Sens. Environ. 2018, 218, 104–118.
19. Emadi, M.; Taghizadeh-Mehrjardi, R.; Cherati, A.; Danesh, M.; Mosavi, A.; Scholten, T. Predicting and Mapping of Soil Organic Carbon Using Machine Learning Algorithms in Northern Iran. Remote Sens. 2020, 12, 2234.
20. Yang, L.; Cai, Y.Y.; Zhang, L.; Guo, M.; Li, A.Q.; Zhou, C.H. A deep learning method to predict soil organic carbon content at a regional scale using satellite-based phenology variables. Int. J. Appl. Earth Obs. Geoinf. 2021, 102, 102428.
21. Chengping, L.; Jiyu, X. Ecologically Vulnerable Characteristics of the farming-Pastoral Zigzag Zone In Northern China. J. Arid. Land Resour. Environ. 1995, 9, 1–7.
22. Yu-Qiang, L.; Ha-Lin, Z.; Xiao-Yong, Y.; Xiao-An, Z.; Yin-ping, C. Dynamics of Carbon and Nitrogen Storages in Plant-Soil System During Desertification Process in Horqin Sandy Land. Environ. Sci. 2006, 4, 635–640.
23. Liu, Z.; Liu, Y.; Li, Y. Anthropogenic contributions dominate trends of vegetation cover change over the farming-pastoral ecotone of northern China. Ecol. Indic. 2018, 95, 370–378.
24. Liu, D.; Chen, J.; Ouyang, Z. Responses of landscape structure to the ecological restoration programs in the farming-pastoral ecotone of Northern China. Sci. Total Environ. 2020, 710, 136311.
25. Zhou, Z.; Sun, O.J.; Huang, J.; Li, L.; Liu, P.; Han, X. Soil carbon and nitrogen stores and storage potential as affected by land-use in an agro-pastoral ecotone of northern China. Biogeochemistry 2007, 82, 127–138.
26. Wang, X.Y.; Li, Y.G.; Gong, X.W.; Niu, Y.Y.; Chen, Y.P.; Shi, X.P.; Li, W. Storage, pattern and driving factors of soil organic carbon in an ecologically fragile zone of northern China. Geoderma 2019, 343, 155–165.
27. Wang, X.Y.; Li, Y.Q.; Chen, Y.P.; Lian, J.; Luo, Y.Q.; Niu, Y.Y.; Gong, X.W. Spatial pattern of soil organic carbon and total nitrogen, and analysis of related factors in an agro-pastoral zone in Northern China. PLoS ONE 2018, 13, e0197451.
28. Chen, Y.; Li, Y.Q.; Wang, X.Y.; Wan, J.L.; Gong, X.W.; Niu, Y.Y.; Liu, J. Estimating soil organic carbon density in Northern China’s agro-pastoral ecotone using vis-NIR spectroscopy. J. Soils Sediments 2020, 20, 3698–3711.
29. Wang, L.P.; Wang, X.; Wang, D.Y.; Qi, B.S.; Zheng, S.F.; Liu, H.J.; Luo, C.; Li, H.X.; Meng, L.H.; Meng, X.T.; et al. Spatiotemporal Changes and Driving Factors of Cultivated Soil Organic Carbon in Northern China’s Typical Agro-Pastoral Ecotone in the Last 30 Years. Remote Sens. 2021, 13, 3607.
30. Yuan, Q.Q.; Shen, H.F.; Li, T.W.; Li, Z.W.; Li, S.W.; Jiang, Y.; Xu, H.Z.; Tan, W.W.; Yang, Q.Q.; Wang, J.W.; et al. Deep learning in environmental remote sensing: Achievements and challenges. Remote Sens. Environ. 2020, 241, 111716.
31. Shidan, B. Soil Analysis in Agricultural Chemistry; China Agriculture Press: Beijing, China, 2000.
32. Vermote, E.; Justice, C.; Claverie, M.; Franch, B. Preliminary analysis of the performance of the Landsat 8/OLI land surface reflectance product. Remote Sens. Environ. 2016, 185, 46–56.
33. Zhiyong, L. Spatial Patterns of Soil Carbon and Their Control in the Steppe of Northern Mongolian Plateau. Ph.D. Thesis, Inner Mongolia University, Inner Mongolia, China, 2017.
34. Rouse, J.W.; Haas, R.H.; Schell, J.A.; Deering, D.W. Monitoring vegetation systems in the Great Plains with ERTS. NASA Spec. Public 1974, 351, 309.
35. Qi, J.; Chehbouni, A.; Huete, A.R.; Kerr, Y.H.; Sorooshian, S. A modified soil adjusted vegetation index. Remote Sens. Environ. 1994, 48, 119–126.
36. Huete, A.; Justice, C.; Van Leeuwen, W. MODIS vegetation index (MOD13). Algorithm Theor. Basis Doc. 1999, 3, 295–309.
37. Jamalabad, M.S.; Abkar, A.A. Forest Canopy Density Monitoring, Using Satellite Images. Available online: https://www.researchgate.net/publication/263901692_FOREST_CANOPY_DENSITY_ESTIMATING_USING_SATELLITE_IMAGES (accessed on 22 March 2023).
38. Huete, A.R. A soil-adjusted vegetation index (SAVI). Remote Sens. Environ. 1988, 25, 295–309.
39. Baret, F.; Guyot, G. Potentials and limits of vegetation indices for LAI and APAR assessment. Remote Sens. Environ. 1991, 35, 161–173.
40. Deering, D.W. Measuring Forage Production of Grazing Units from Landsat MSS Data. Available online: https://www.semanticscholar.org/paper/Measuring-forage-production-of-grazing-units-from-Deering/bbdad3db3dc2c59ccf145407390a0771011c44b9 (accessed on 22 March 2023).
41. Richardson, A.J.; Wiegand, C. Distinguishing vegetation from soil background information. Photogramm. Eng. Remote Sens. 1977, 43, 1541–1552.
42. Gitelson, A.A.; Merzlyak, M.N. Remote sensing of chlorophyll concentration in higher plant leaves. Adv. Space Res. 1998, 22, 689–692.
43. Scudiero, E.; Skaggs, T.H.; Corwin, D.L. Regional-scale soil salinity assessment using Landsat ETM+ canopy reflectance. Remote Sens. Environ. 2015, 169, 335–343.
44. Allbed, A.; Kumar, L.; Aldakheel, Y.Y. Assessing soil salinity using soil salinity and vegetation indices derived from IKONOS high-spatial resolution imageries: Applications in a date palm dominated region. Geoderma 2014, 230, 1–8.
45. Abbas, A.; Khan, S. Using remote sensing techniques for appraisal of irrigated soil salinity. In Proceedings of the International Congress on Modelling and Simulation (MODSIM), Christchurch, New Zealand, 10–13 December 2007; pp. 2632–2638.
46. Delavar, M.A.; Naderi, A.; Ghorbani, Y.; Mehrpouyan, A.; Bakhshi, A. Soil salinity mapping by remote sensing south of Urmia Lake, Iran. Geoderma Reg. 2020, 22, e00317.
47. Rikimaru, A.; Roy, P.; Miyatake, S. Tropical forest cover density mapping. Trop. Ecol. 2002, 43, 39–47.
48. Developers, G. USGS Landsat 8 Level 2, Collection 2, Tier 2. 2013. Available online: https://developers.google.com/earth-engine/datasets/catalog/LANDSAT_LC08_C02_T2_L2 (accessed on 22 March 2023).
49. Todd, S.W.; Hoffer, R.M. Responses of spectral indices to variations in vegetation cover and soil background. Photogramm. Eng. Remote Sens. 1998, 64, 915–922.
50. Kursa, M.B.; Rudnicki, W.R. Feature selection with the Boruta package. J. Stat. Softw. 2010, 36, 1–13.
51. Chawla, N.V.; Bowyer, K.W.; Hall, L.O.; Kegelmeyer, W.P. SMOTE: Synthetic minority over-sampling technique. J. Artif. Intell. Res. 2002, 16, 321–357.
52. Pan, S.J.; Yang, Q. A survey on transfer learning. IEEE Trans. Knowl. Data Eng. 2009, 22, 1345–1359.
53. Minasny, B.; Setiawan, B.I.; Arif, C.; Saptomo, S.K.; Chadirin, Y. Digital mapping for cost-effective and accurate prediction of the depth and carbon stocks in Indonesian peatlands. Geoderma 2016, 272, 20–31.
54. Odebiri, O.; Mutanga, O.; Odindi, J. Deep learning-based national scale soil organic carbon mapping with Sentinel-3 data. Geoderma 2022, 411, 115695.
55. Bhattacharya, S.; Rajan, V.; Shrivastava, H. ICU mortality prediction: A classification algorithm for imbalanced datasets. In Proceedings of the AAAI Conference on Artificial Intelligence, San Francisco, CA, USA, 4–9 February 2017; Volume 31.
56. Ma, L.; Zhao, L.; Cao, L.; Li, D.; Chen, G.; Han, Y. Inversion of Soil Organic Matter Content Based on Improved Convolutional Neural Network. Sensors 2022, 22, 7777.
57. Lin, L.; Wang, Y.; Teng, J.; Wang, X. Hyperspectral analysis of soil organic matter in coal mining regions using wavelets, correlations, and partial least squares regression. Environ. Monit. Assess. 2016, 188, 97.
58. Shi, Z.; Ji, W.; Viscarra Rossel, R.; Chen, S.; Zhou, Y. Prediction of soil organic matter using a spatially constrained local partial least squares regression and the Chinese vis–NIR spectral library. Eur. J. Soil Sci. 2015, 66, 679–687.
59. Silvero, N.E.Q.; Di Raimo, L.A.D.L.; Pereira, G.S.; De Magalhães, L.P.; da Silva Terra, F.; Dassan, M.A.A.; Salazar, D.F.U.; Demattê, J.A. Effects of water, organic matter, and iron forms in mid-IR spectra of soils: Assessments from laboratory to satellite-simulated data. Geoderma 2020, 375, 114480.
60. Kätterer, T.; Reichstein, M.; Andrén, O.; Lomander, A. Temperature dependence of organic matter decomposition: A critical review using literature data analyzed with different models. Biol. Fertil. Soils 1998, 27, 258–262.
61. Conant, R.T.; Ryan, M.G.; Ågren, G.I.; Birge, H.E.; Davidson, E.A.; Eliasson, P.E.; Evans, S.E.; Frey, S.D.; Giardina, C.P.; Hopkins, F.M. Temperature and soil organic matter decomposition rates–synthesis of current knowledge and a way forward. Glob. Change Biol. 2011, 17, 3392–3404.
62. Luo, Y.; Hui, D.; Zhang, D. Elevated CO2 stimulates net accumulations of carbon and nitrogen in land ecosystems: A meta-analysis. Ecology 2006, 87, 53–63.
63. Zhao, Y.; Ding, Y.; Hou, X.; Li, F.Y.; Han, W.; Yun, X. Effects of temperature and grazing on soil organic carbon storage in grasslands along the Eurasian steppe eastern transect. PLoS ONE 2017, 12, e0186980.
64. Evans, S.E.; Burke, I.C.; Lauenroth, W.K. Controls on soil organic carbon and nitrogen in Inner Mongolia, China: A cross-continental comparison of temperate grasslands. Glob. Biogeochem. Cycles 2011, 25.
65. Wang, L.; Gan, Y.; Wiesmeier, M.; Zhao, G.; Zhang, R.; Han, G.; Siddique, K.H.; Hou, F. Grazing exclusion—An effective approach for naturally restoring degraded grasslands in Northern China. Land Degrad. Dev. 2018, 29, 4439–4456.
66. Liu, Z.; Zhou, Q.; Ma, Q.; Kuang, W.; Daryanto, S.; Wang, L.; Wu, J.; Liu, B.; Zhu, J.; Cao, C. Scale effect of climate factors on soil organic carbon stock in natural grasslands of northern China. Ecol. Indic. 2023, 146, 109757.
67. Zhao, X.-X. The Spatio-Temporal Dynamics and Driving Factors of Grassland Net Primary Productivity in the Farming-Pastoral Ecotone of Inner Mongolia. Master’s Thesis, Inner Mongolia University, Inner Mongolia, China, 2023.
68. Yao, Y.F.; Ge, N.N.; Yu, S.; Wei, X.R.; Wang, X.; Jin, J.W.; Liu, X.T.; Shao, M.G.; Wei, Y.C.; Kang, L. Response of aggregate associated organic carbon, nitrogen and phosphorous to re-vegetation in agro-pastoral ecotone of northern China. Geoderma 2019, 341, 172–180.
69. Yuan, J.; Ouyang, Z.; Zheng, H.; Su, Y. Ecosystem carbon storage following different approaches to grassland restoration in south-eastern Horqin Sandy Land, northern China. Glob. Ecol. Conserv. 2021, 26, e01438.
70. Meng, Y.; Xuyang, W.; Liye, Z.; Yuqiang, L. Characteristics and influencing factors of soil organic carbon in the process of desertification in Horqin Sandy Land. J. Desert Res. 2022, 42, 221–231.
71. Jun, L. Estimation of Soil Nutrients Loss by Wind Erosion in the Agro-Pastoral Ecotone of Northern China. Master’s Thesis, Hebei Normal University, Shijiazhuang, China, 2021.
72. Wiesmeier, M.; Munro, S.; Barthold, F.; Steffens, M.; Schad, P.; Kögel-Knabner, I. Carbon storage capacity of semi-arid grassland soils and sequestration potentials in northern China. Glob. Change Biol. 2015, 21, 3836–3845.
73. Tarvainen, A.; Valpola, H. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. Adv. Neural Inf. Process. Syst. 2017, 30.

Figure 1. (a) Schematic map of the study area location in northern China and spatial distribution of sampling points. (b) Schematic map of the spatial relationship between arid and semi-arid regions, desertification regions, and the study area. (c) Land use type distribution map of the study area.

Figure 2. Schematic diagram of the DNN structure used in this study. “Rⁱ” denotes a layer with i neurons.

Figure 3. (a) Boxplot of the sampling data categorized by land use types. (b) Distribution of sampling data for different land uses. (c) Distribution of selected indices and spectral bands involved in the modeling.

Figure 4. (a) Random forest band importance ranking diagram; (b) the relationship between the number of trees and MSE in random forest training; (c) the relationship between the predicted and true values of the random forest after Boruta screening. The blue dots represent the correspondence between predicted values and true values.

Figure 5. (a) The original distribution of SOCD and its probability density function; (b) sample distribution of SOCD and its probability density function after square root transformation; (c) sample distribution of SOCD and its probability density function after log transformation; (d) sample distribution of SOCD and its probability density function after Box–Cox transformation.

Figure 6. (a) Precision change of deep learning model training process; (b) deep learning model verification set error distribution; (c) the relationship between the predicted and true values of the deep learning model after Boruta screening.

Figure 7. (a) Soil organic carbon density (SOCD) of 0–30 cm in the agro–pastoral ecotone of northern China. (b) The spatial distribution of SOCD for deep neural networks (DNNs) at a lower limit of 5%. (c) The spatial distribution of SOCD for DNNs at the mean. (d) The spatial distribution of SOCD for DNNs at the upper limit of 95%.

Figure 8. (a) Box plots of SOCD for different land use types. (b) Box plots of SOCD at different levels of desertification. The red dots represent extreme values.

Table 1. The indices and formulas used in this research.

Type	Name of Index	Formula	Reference	Name of Index	Formula	Reference
Vegetation index	Normalized Difference Vegetation Index (NDVI)	$(B_{5} - B_{4}) / (B_{5} + B_{4})$	[34]	Modified Soil Adjusted Vegetation Index (MSAVI)	$0.5 [2 B_{5} + 1 - \sqrt{{(2 B_{5} + 1)}^{2} - 8 (B_{5} - B_{4})}]$	[35]
	Enhanced Vegetation Index (EVI)	$2.5 (B_{5} - B_{4}) / (1 + B_{5} + 6 B_{4} - 7.5 B_{2})$	[36]	Optimized Soil Adjusted Vegetation Index (OSAVI)	$1.16 (B_{5} - B_{4}) / (B_{5} + B_{4} + 0.16)$	[37]
	Soil Adjusted Vegetation Index (SAVI)	$(B_{5} - B_{4}) / (B_{5} + B_{4} + 0.5) \times 1.5$	[38]	Ratio Vegetation Index (RVI)	$B_{5} / B_{4}$	[39]
	Renormalized Difference Vegetation Index (RDVI)	$(B_{5} - B_{4}) / \sqrt{(B_{5} + B_{4})}$	[39]	Transformed Vegetation Index (TVI)	$\sqrt{(B_{5} - B_{4}) / (B_{5} + B_{4})} + 0.5$	[40]
	Difference Vegetation Index (DVI)	$B_{5} - B_{4}$	[41]	Green Normalized Difference Vegetation Index (GDVI)	$(B_{5} - B_{3}) / (B_{5} + B_{3})$	[42]
Soil index	Canopy Response Salinity Index (CRSI)	$\sqrt{(B_{5} \times B_{4} - B_{3} \times B_{2}) / (B_{5} \times B_{4} + B_{3} \times B_{2})}$	[43]	Salinity Index (SI-T)	$B_{4} / B_{5}$	[44]
	Salinity Index (SI)	$\sqrt{(B_{2} \times B_{4})}$	[45]	Salinity Index-1 (SI-1)	$\sqrt{(B_{3} \times B_{4})}$	[45]
	Salinity Index-2 (SI-2)	$\sqrt{{B_{3}}^{2} + {B_{4}}^{2} + {B_{5}}^{2}}$	[45]	Salinity Index-3 (SI-3)	$\sqrt{{B_{3}}^{2} + {B_{4}}^{2}}$	[45]
	Salinity Index-4 (SI-4)	$B_{2} \times B_{4} / B_{3}$	[45]	Salinity Index-5 (SI-5)	$B_{3} \times B_{4} / 2$	[45]
	Salinity Ratio (SAIO)	$(B_{4} - B_{5}) / (B_{5} + B_{3})$	[46]	S1 (Soil Index 1)	$B_{2} / B_{4}$	[44]
	S2 (Soil Index 2)	$(B_{2} - B_{4}) / (B_{2} + B_{4})$	[44]	S3 (Soil Index 3)	$(B_{2} \times B_{4}) / B_{3}$	[44]
	Bare Soil Index (IBI_temp)	$[B_{7} + B_{4} - (B_{2} + B_{5})] / [B_{7} + B_{4} + (B_{2} + B_{5})]$	[47]	Salinity Index-temp (SI-temp)	$(B_{4} - B_{5}) \times 100$	[44]
Meteorological index	Land Surface Temperature (LST)	$\mathrm{ST\_B10} - 273.15$	[48]	Surface Moisture (WETtemp)	$0.1511 B_{2} + 0.1973 B_{3} + 0.3283 B_{4} + 0.3407 B_{5} - 0.7117 B_{6} - 0.4559 B_{7}$	[49]
Topographic factor	Elevation	ASTER GDEM v2	[48]	Slope	$\operatorname{arctan2}(\sqrt{(\partial z / \partial x)^{2} + (\partial z / \partial y)^{2} / \partial s})$	[48]
	Aspect	$\operatorname{arctan2}(\partial z / \partial x, - \partial z / \partial x)$	[48]			
Band combination index	Ratio Index (RI_BiBj)	$B_{i} / B_{j}$		Difference Index (DI_BiBj)	$B_{i} - B_{j}$	
	Normalized Index (NDI_BiBj)	$(B_{i} - B_{j}) / (B_{i} + B_{j})$				

$B_{i}$ represents the i-th band of Landsat 8 L2 data, $\partial z$ is the height difference between two adjacent points, $\partial x$ is the horizontal distance, and $\partial s$ is the distance between two adjacent points.
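
As an illustration of how the band formulas in Table 1 translate into code, the following minimal sketch evaluates a handful of the indices for a single pixel. It assumes the inputs are Landsat 8 surface reflectance values already scaled to the 0–1 range; the function name `spectral_indices`, the dictionary keys, and the example reflectances are hypothetical and are not taken from the study.

```python
def spectral_indices(b2, b3, b4, b5):
    """Return a few Table 1 indices from blue, green, red and NIR reflectance.

    Hypothetical helper: inputs are assumed to be reflectance in the 0-1 range.
    Only arithmetic operators are used, so scalars work directly.
    """
    return {
        "NDVI": (b5 - b4) / (b5 + b4),
        "SAVI": (b5 - b4) / (b5 + b4 + 0.5) * 1.5,
        "OSAVI": 1.16 * (b5 - b4) / (b5 + b4 + 0.16),
        "MSAVI": 0.5 * (2 * b5 + 1 - ((2 * b5 + 1) ** 2 - 8 * (b5 - b4)) ** 0.5),
        "RVI": b5 / b4,
        "DVI": b5 - b4,
        "SI": (b2 * b4) ** 0.5,
        "SI-1": (b3 * b4) ** 0.5,
    }


# Made-up reflectances for one vegetated pixel (assumption, not study data).
pixel = spectral_indices(b2=0.05, b3=0.08, b4=0.10, b5=0.50)
for name, value in pixel.items():
    print(f"{name}: {value:.4f}")
```

Because the function relies only on arithmetic operators, the same expressions should also apply element-wise if whole band rasters are passed in as array objects, although masking of zero denominators would then need to be handled separately.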

Table 2. Comparison between kriging interpolation method and DNN inversion method.

Land Use Type	Kriging Interpolation Result (Tg)	DNN Result (Tg)
Farmland	639.02	639.57
Forest land	405.17	328.32
Shrubland	101.20	130.01
Sparse woodland	39.83	49.52
High coverage grassland	493.93	469.31
Medium coverage grassland	287.81	391.75
Low coverage grassland	104.11	136.35
Sandy land	59.06	70.98
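
To make the comparison in Table 2 easier to read, the short sketch below sums the two columns over the eight listed land use types and expresses the DNN result as a percentage difference from the kriging result. The values are copied from Table 2; the variable names are illustrative, and the totals cover only the land use types shown in the table.

```python
# Table 2 values in Tg: (kriging interpolation, DNN) per land use type.
stocks = {
    "Farmland": (639.02, 639.57),
    "Forest land": (405.17, 328.32),
    "Shrubland": (101.20, 130.01),
    "Sparse woodland": (39.83, 49.52),
    "High coverage grassland": (493.93, 469.31),
    "Medium coverage grassland": (287.81, 391.75),
    "Low coverage grassland": (104.11, 136.35),
    "Sandy land": (59.06, 70.98),
}

# Column totals over the listed land use types only.
total_kriging = sum(k for k, _ in stocks.values())
total_dnn = sum(d for _, d in stocks.values())

# Percentage difference of the DNN estimate relative to kriging.
relative_diff = {name: (d - k) / k * 100 for name, (k, d) in stocks.items()}

print(f"Kriging total: {total_kriging:.2f} Tg, DNN total: {total_dnn:.2f} Tg")
for name, diff in relative_diff.items():
    print(f"{name}: {diff:+.1f}%")
```

In this arithmetic, farmland is nearly identical between the two methods, forest land and high coverage grassland are lower in the DNN result, and the remaining types are higher.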

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
