1. Introduction
Forests play an irreplaceable role in human survival and sustainable development [
1]. They affect the productivity of terrestrial ecosystems, soil formation, nutrient cycling [
2], and the ecology of the surrounding watersheds [
3]. Therefore, determining the species composition and distribution of trees in forests is vital for understanding forest ecosystem change and for developing forest management strategies. Tree species can be identified via field surveys and remote sensing. Traditional forest inventory is time-consuming and cannot efficiently provide detailed information on the spatial distribution of trees over large areas. In contrast, remote sensing technology (e.g., satellite imagery) offers advantages such as a large monitoring scale, fast information acquisition, a short revisit period, and low operating costs, and it has become an important tool for tree species classification and for forest resource survey and monitoring [
4,
5,
6,
7].
Subtropical forests cover a quarter of the total area and are mainly distributed in central and southern China [
8]. In the north subtropical evergreen broad-leaved and deciduous mixed forest belt, natural forest is scarce, and pine forest and planted Chinese fir forest are common. The middle and south subtropical evergreen broad-leaved forest belts are the central distribution areas of Masson pine forest, Chinese fir forest, and evergreen broadleaved forest in China. Natural Masson pine forest accounts for about 50% of the forest area, and Chinese fir forest accounts for 20%–30%. Mixed forests, dominated by evergreen broadleaved forests, account for 10%–20%. The subtropical areas of China are important timber production areas and are also the focus of ecological projects, such as soil and water conservation and biodiversity conservation [
9,
10]. Tree species classification based on remote sensing data is closely related to the development of digital image processing technology. Early research mainly adopted pixel-based supervised and unsupervised classification, and the maximum likelihood and k-nearest neighbour (KNN) methods were widely used [
11,
12]. Förster et al. used QuickBird remote sensing images and object-oriented classification and recognition methods to effectively extract spruce, larch, and other tree types in forest areas in southern Germany [
13]. With the progress of deep learning technology, various machine learning algorithms have provided a new paradigm for tree species (group) classification and recognition. Algorithms such as random forest (RF) and support vector machine (SVM) adapt well to high-dimensional features [
14,
15,
16], and are gradually replacing traditional classification algorithms in research on forest tree species recognition. Based on Sentinel-2 and Landsat series images, Chen et al. used a random forest classifier to identify five forest types, including coniferous forest, broad-leaved forest, and mixed forest. The overall accuracy exceeded 85% [
17].
Recent studies have increasingly adopted vegetation information extraction methods involving multiple features [
18,
19,
20,
21]. Topography is a multidimensional variable; with the development of remote sensing technology, terrain features such as altitude, slope, and aspect can be obtained quickly at the regional scale. Altitude, slope, and aspect are the main factors affecting the distribution of tree species: changes in them alter light, heat, water, and soil nutrient conditions, which in turn affect the distribution of plant communities [
22]. In remote sensing-based tree species identification, researchers have added topographic factors to improve identification accuracy. Integrating spectral features, textural features, and topographic factors obtained from remote sensing data can effectively improve tree species identification accuracy. By combining spectral and textural features with topography and other auxiliary data, Luo et al. [
19] distinguished mangroves from other land cover types and achieved a markedly improved extraction accuracy. Kampouri et al. [
23] combined information on topographic factors (e.g., altitude and slope) and expert knowledge to improve the accuracy of tree species classification in Sentinel-2 images.
Machine learning models differ in data structure requirements and algorithms. Ballanti et al. [
24] and Modzelewska et al. [
25] used SVM algorithms based on hyperspectral data to identify tree species in forests near Marin County, California, and in the Bialowieza Forest, respectively. They achieved overall classification accuracies of 95.02% and 70.00%, respectively. Previous studies have mainly focused on temperate forests [
26,
27], which often have relatively simple tree species structures. Studies on the remote sensing-based identification of evergreen broadleaved forest species in subtropical regions are relatively few [
28]. Subtropical areas are rich in forest resources and feature high densities, complex forest stands and mixed forests of different species, which can complicate remote sensing-based identification. Although some scholars have obtained high-precision classification results, they are often based on airborne data for which coverage area is limited [
29,
30,
31]. Research on using high-resolution satellite data to identify subtropical evergreen forest tree species is lacking. Furthermore, how different machine learning algorithms perform in identifying subtropical evergreen forests, and which algorithms and feature combinations are best suited to this task, remain unclear.
In this study, we focused on a typical subtropical evergreen forest area in Nankang District, Jiangxi Province, southern China. Masson pine, Chinese fir, and evergreen broad-leaved trees are widely distributed in the study area, as they are in the surrounding cities and counties. The overall goal of this study was to explore the capabilities of different algorithms and feature combination Schemes in subtropical tree species identification, and to find the most suitable remote sensing-based tree species identification method for subtropical evergreen forest areas. The main steps included (1) constructing different feature factor combination Schemes and comparing the effects of using different types of factors to identify subtropical evergreen tree species; (2) constructing machine learning classification algorithms, such as k-nearest neighbour (KNN), support vector machine (SVM), BP neural network (BP), and random forest (RF), to explore the ability of different classification algorithms to identify subtropical evergreen tree species; and (3) evaluating the relative importance of variables and analysing the contribution of each variable to the recognition model.
3. Results
3.1. OA Evaluation of Each Classification Scheme
As shown in
Figure 2, SVM exhibited the highest classification accuracy, with an OA of 90.27% when using Scheme 8 (band reflectance + vegetation index + topographic factor). The SVM classifier exhibited the highest OA when all of the Schemes except Scheme 11 (band reflectance + vegetation index + textural features + topographic factor) were used; when Scheme 11 was used, the OA was lower than RF by 0.31%. Therefore, SVM can be considered the optimal classification algorithm for tree species identification.
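For readers who want to reproduce the two summary metrics used throughout this section, the following is a minimal sketch of how OA and the kappa coefficient can be computed from a confusion matrix. The matrix values are hypothetical placeholders rather than results of this study, and the function names are illustrative only.

```python
def overall_accuracy(cm):
    """Share of samples on the diagonal of a square confusion matrix."""
    total = sum(sum(row) for row in cm)
    correct = sum(cm[i][i] for i in range(len(cm)))
    return correct / total


def cohen_kappa(cm):
    """Cohen's kappa: agreement beyond what the marginal totals give by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_observed = overall_accuracy(cm)
    row_totals = [sum(cm[i]) for i in range(n)]
    col_totals = [sum(cm[i][j] for i in range(n)) for j in range(n)]
    p_chance = sum(row_totals[k] * col_totals[k] for k in range(n)) / total ** 2
    return (p_observed - p_chance) / (1 - p_chance)


# Hypothetical 3-class matrix (rows = reference, columns = predicted).
# These counts are made up for illustration and are NOT the study's data.
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 44]]

print(f"OA = {overall_accuracy(cm):.4f}, kappa = {cohen_kappa(cm):.4f}")
```

Kappa is lower than OA for the same matrix because it discounts the agreement expected by chance, which is why the two metrics are reported side by side.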
A comparison of Schemes 1–3 revealed that for tree species identification using a single feature factor, Scheme 3 (textural feature) achieved the highest accuracy with PLR, SVM, RF, and KNN, whereas with the artificial neural network (BP), its OA was 6.07% lower than that of Scheme 2 (vegetation index). Scheme 2 (vegetation index) achieved the next-highest accuracy, while Scheme 1 (band reflectance) exhibited the worst performance. When a combination of multiple feature factors was used for tree species identification, Scheme 8 (band reflectance + vegetation index + topographic factor) yielded the highest classification accuracy with BP, PLR, SVM, and KNN, exceeding that of all the other Schemes. For RF, the OA of Scheme 8 was only 1.54% lower than that of Scheme 11 (band reflectance + vegetation index + textural features + topographic factor). In addition, the SVM classifier constructed using Scheme 8 yielded the highest classification accuracy and Kappa coefficient. Thus, Scheme 8 can be regarded as the optimal feature combination for tree species identification.
A comparison of Schemes 4 and 7, 5 and 8, 6 and 9, and 10 and 11 showed that adding topographic factors changed the OA by 12.66%–24.36%, 17.15%–25.34%, 9.52%–17.16%, and −2.88%–15.28%, respectively. The overall classification accuracies of BP and PLR under Scheme 11 were 0.41% and 2.88% lower than those under Scheme 10, respectively; in the other cases, the addition of topographic factors improved the overall classification accuracy.
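The paired comparison described here can be sketched as follows: for each pair of Schemes that differ only in the topographic factor, the OA without topography is subtracted from the OA with topography for every classifier, and the minimum and maximum of these differences give the reported range. The helper name `topographic_gain` and the OA values below are hypothetical placeholders, not figures from this study.

```python
def topographic_gain(oa_by_scheme, pairs):
    """Per-classifier OA change for each (without, with) topography Scheme pair.

    Returns {(without, with): (differences, lowest, highest)} in percentage points.
    """
    result = {}
    for without, with_topo in pairs:
        diffs = {
            clf: round(oa_by_scheme[with_topo][clf] - oa_by_scheme[without][clf], 2)
            for clf in oa_by_scheme[without]
        }
        result[(without, with_topo)] = (diffs, min(diffs.values()), max(diffs.values()))
    return result


# Placeholder OA values in percent, invented for illustration only.
oa = {
    5: {"SVM": 70.0, "RF": 68.0},
    8: {"SVM": 90.0, "RF": 87.0},
}

gains = topographic_gain(oa, [(5, 8)])
for pair, (diffs, low, high) in gains.items():
    print(pair, diffs, f"range: {low} to {high}")
```

A negative lower bound, as in the Scheme 10 and 11 pair, simply means that at least one classifier lost accuracy when the topographic factor was added.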
3.2. Assessment of the Optimal Accuracy of Each Classification Algorithm
Five classification algorithms, namely BP, PLR, SVM, RF, and KNN, were considered in this study. Their parameters in R-project were set as shown in
Table 7.
To more intuitively compare the classification abilities of the algorithms, their mean highest accuracy results were compared (
Figure 3). The OA and kappa coefficient decreased as follows: SVM > RF > BP > PLR > KNN.
Generally, the five classification methods performed well in identifying subtropical evergreen forests, with the OA and the kappa coefficient exceeding 87% and 81%, respectively. SVM yielded the highest classification accuracy, with an OA of 90.27% and a kappa coefficient of 85.37%. RF yielded the second-highest OA of 88.90%, with a kappa coefficient of 83.30%; these values were lower than those of SVM by 1.52% and 2.42% in relative terms (1.37 and 2.07 percentage points), respectively. KNN yielded the lowest OA and kappa coefficient of 87.40% and 81.08%, which were lower than those of SVM by 3.18% and 5.03% in relative terms (2.87 and 4.29 percentage points), respectively.
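The gaps to SVM quoted in this paragraph are consistent with relative differences rather than percentage-point differences. The short check below uses only the OA and kappa values reported above; the function name `relative_drop` is illustrative.

```python
def relative_drop(best, other):
    """Relative difference of `other` with respect to `best`, in percent."""
    return (best - other) / best * 100


# OA and kappa values (percent) as reported in this section.
svm = {"OA": 90.27, "kappa": 85.37}
rf = {"OA": 88.90, "kappa": 83.30}
knn = {"OA": 87.40, "kappa": 81.08}

for name, clf in (("RF", rf), ("KNN", knn)):
    for metric in ("OA", "kappa"):
        absolute = svm[metric] - clf[metric]
        relative = relative_drop(svm[metric], clf[metric])
        print(f"{name} {metric}: {absolute:.2f} points, {relative:.2f}% relative")
```

The relative values round to 1.52%, 2.42%, 3.18%, and 5.03%, whereas the percentage-point gaps are 1.37, 2.07, 2.87, and 4.29.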
3.3. Assessment of Tree Species Classification Accuracy
For the three tree species, the user’s accuracy and producer’s accuracy for Masson pine exceeded 90%, while those for the broadleaved evergreen and fir trees exceeded 80%. The highest producer’s accuracy and user’s accuracy for evergreen broadleaved tree identification were 88.17% (SVM) and 84.66% (SVM), respectively. The highest producer’s accuracy and user’s accuracy for Masson pine identification were 100% (SVM) and 93.22% (RF), respectively. The highest producer’s accuracy and user’s accuracy for Chinese fir identification were 88.09% (BP) and 94.67% (BP), respectively (
Table 8).
To assess the identification of the three tree species from another angle, we also evaluated commission and omission errors. The omission error of a category is the share of samples that truly belong to the category but are not assigned to it. The commission error of a category is the share of samples assigned to the category that actually belong to another category [
59]. High commission errors combined with low omission errors indicate that samples of other categories are misclassified into the target category, so that the target category is overestimated. Conversely, low commission errors combined with high omission errors indicate that samples of the target category are missed, so that the target category is underestimated. The commission errors and omission errors for the three species analysed at the image metric scale generally decreased as follows: broadleaved evergreen > Chinese fir > Masson pine (
Figure 4). Broadleaved evergreen trees showed comparable omission errors (0.1183–0.1957) and commission errors (0.1534–0.1673). Masson pine showed small omission errors (0–0.0921) and commission errors (0.0678–0.0920), all below 0.1. For Chinese fir, omission errors (0.1189–0.1745) exceeded the commission errors (0.0914–0.1218).
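The relationship between the per-class measures can be made concrete with a small sketch. Producer's accuracy is the complement of the omission error and user's accuracy is the complement of the commission error, so a commission error of 0.0678 corresponds to a user's accuracy of 93.22%. The confusion matrix below is a hypothetical placeholder, not the study's data, and the function name is illustrative.

```python
def per_class_accuracy(cm, labels):
    """Producer's/user's accuracy and omission/commission error per class.

    cm[i][j] is the number of samples of reference class i predicted as class j.
    Assumes every class has at least one reference and one predicted sample.
    """
    n = len(labels)
    out = {}
    for k in range(n):
        reference_total = sum(cm[k])                       # samples truly in class k
        predicted_total = sum(cm[i][k] for i in range(n))  # samples assigned to class k
        producers = cm[k][k] / reference_total
        users = cm[k][k] / predicted_total
        out[labels[k]] = {
            "PA": producers,
            "UA": users,
            "omission": 1 - producers,
            "commission": 1 - users,
        }
    return out


# Hypothetical counts for illustration only (rows = reference, columns = predicted).
labels = ["Masson pine", "Chinese fir", "broadleaved evergreen"]
cm = [[50, 3, 2],
      [4, 45, 6],
      [1, 5, 44]]

stats = per_class_accuracy(cm, labels)
for name, values in stats.items():
    print(name, {key: round(value, 4) for key, value in values.items()})
```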
3.4. Relative Importance of Variables
When KNN, SVM, BP, and the PLR model were used, the highest overall classification accuracy was obtained under Scheme 8 (band reflectance + vegetation index + topographic factor). When the RF model was used, the highest overall classification accuracy was obtained under Scheme 11 (band reflectance + vegetation index + textural feature + topographic factor). Therefore, we used the RF model to generate the relative importance of the feature variables under Schemes 11 and 8 and obtained the importance parameter MDG (
Figure 5a,b), which was used to explore the relative importance of each feature factor.
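As background on the MDG measure, random forest implementations typically accumulate, for each variable, the decrease in Gini impurity produced by every split made on that variable and then average over the trees. The sketch below shows only the per-split quantity in a simplified form. It is not the implementation used in this study, and the altitude values and class labels are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0 for an empty or pure node)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def gini_decrease(values, labels, threshold):
    """Impurity decrease from splitting a node at `values <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    n = len(labels)
    weighted_children = len(left) / n * gini(left) + len(right) / n * gini(right)
    return gini(labels) - weighted_children


# Hypothetical altitudes (m) and classes, invented for illustration only.
altitude = [210, 230, 250, 400, 420, 450]
species = ["pine", "pine", "pine", "fir", "fir", "fir"]

print(gini_decrease(altitude, species, threshold=300))   # split separates the classes
print(gini_decrease(altitude, species, threshold=1000))  # split separates nothing
```

A variable that repeatedly yields large decreases across many trees receives a high MDG, which is the sense in which the importance values are ranked.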
The results without textural features (Scheme 8) showed that the four variables with high relative importance assessed by the RF model were the normalized difference green index (NDGI), altitude, the modified soil-adjusted vegetation index (MSAVI), and difference vegetation index (DVI), with NDGI having the highest relative importance (67.96), followed by altitude (65.28), and then MSAVI (23.94). After the addition of textural features (Scheme 11), the four key variables were altitude, NDGI, Contrast4, and MSAVI, with importance values of 55.84, 34.88, 31.63, and 17.47, respectively. According to the combined results of the variable importance evaluation for both classification Schemes, NDGI, altitude, and MSAVI are the most important variables.
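For orientation, the three vegetation indices named above can be computed from band reflectance as sketched below. The formulas are commonly cited formulations and are an assumption here, because the index definitions used in this study are not given in this section. The reflectance values are hypothetical.

```python
import math


def dvi(nir, red):
    """Difference vegetation index: NIR minus red reflectance."""
    return nir - red


def msavi(nir, red):
    """Modified soil-adjusted vegetation index (widely used closed form)."""
    return (2 * nir + 1 - math.sqrt((2 * nir + 1) ** 2 - 8 * (nir - red))) / 2


def ndgi(green, red):
    """Normalized difference green index, assumed here as (G - R) / (G + R)."""
    return (green - red) / (green + red)


# Hypothetical surface reflectance values for a single pixel.
green, red, nir = 0.15, 0.10, 0.50

print(f"DVI = {dvi(nir, red):.4f}")
print(f"MSAVI = {msavi(nir, red):.4f}")
print(f"NDGI = {ndgi(green, red):.4f}")
```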
4. Discussion
The study showed that when a single feature factor was used for tree species identification, the textural index (Scheme 3) performed best, followed by the vegetation index (Scheme 2), and band reflectance performed worst (Scheme 1). This suggests that accurately identifying tree species by band reflectance alone is difficult, possibly because evergreen species exhibit relatively similar spectral reflectance, so that relying solely on band reflectance often leads to different species sharing similar spectra or the same species showing different spectra [
60]. Wang et al. [
61] reached similar conclusions after classifying tree species with insignificant differences in spectral curves, such as mixed broadleaved, mixed conifer and mixed conifer forests. This suggests that other information, such as textural or topographic information, is needed to distinguish the species with similar spectral curves.
Furthermore, for all five machine learning algorithms employed, the accuracy of tree species identification based on a single feature factor was lower than that based on multiple features. This suggests that using a single type of feature for remote sensing-based tree species identification is significantly limited, particularly in subtropical broadleaf evergreen forest areas with high biodiversity and high vegetation cover. In contrast, combining multiple feature variables can fully leverage the feature information of ground objects to increase the discrimination between different tree species [
62]. Wang et al. [
63] integrated the tree species’ vegetation indices, phenological information, textural features, and topographic features in the construction of a multi-feature random forest tree species classification model and found that the fusion of multiple features could effectively improve the recognition accuracy of tree species.
A comparison of the classification accuracy with and without topographic factors revealed that the inclusion of topographic features significantly contributed to the OA, mainly because the tree distribution is closely related to topographic factors [
64]. The main landform types in the study area are hills and mountains, which are undulating. The three tree species feature different altitudes, slopes, and aspects. Chinese fir is located on a higher slope than Masson pine and on a lower slope than evergreen broadleaved trees. Masson pine and Chinese fir are sun-loving species, and broadleaved evergreens generally prefer shade, which leads to a difference in their slope orientation. The importance of topographic factors in tree species identification is also reflected in the findings of several studies. Wang et al. [
63] found that topographic features play a crucial role in random forest-based tree species classification using fused features, with altitude being the most important feature. Li et al. [
65] found that the inclusion of topographic factors improved the identification accuracy of coniferous forests when tree species were classified using GF-2 PMS data. Several studies have shown that the contribution of topographic factors to recognition accuracy depends on the topographic difference. Li et al. [
66] studied the Huangfushan National Forest Park and found that topographic features had little effect on tree species identification accuracy, and even reduced the OA, owing to the few peaks in the study area and the low average altitude. The highly accessible flat areas were subject to intensive tending and regeneration, which affected the slope and aspect distribution of tree species to some extent. Hoscilo et al. [
67] used Sentinel-2 data to classify eight tree species and found that the classification accuracy increased from 75.60% to 81.70% after adding topographic features, and the elevation had the greatest impact on tree species classification, followed by slope. Zhang et al. [
68] found that the spatial resolution and accuracy of the DEM also affected classification accuracy. The spatial resolution of the DEM used in the present study was 8 m, which is generally finer than that of the DEMs used in other studies; this may explain the important role of topographic factors.
In the present study, the five algorithms exhibited good recognition accuracies, and the OA exceeded 87%. SVM performed best, followed by RF. Other studies have also found that SVM and RF exhibited high accuracy in forest and land cover classification [
69]. SVM exhibited the best classification effect in terms of OA and kappa coefficient, which indicates that the SVM classifier has high generalisability and only requires a small amount of training sample data. The SVM classifier shows many unique advantages in solving the problem of small-sample and high-dimensional pattern remote sensing recognition [
70,
71], and is particularly suitable for the classification of remote sensing images. It is generally accepted that the SVM classifier can effectively process limited training samples [
72], because the optimal decision hyperplane, which separates the classes with the largest margin, is determined by only a subset of the training samples (the support vectors). RF outperformed the other algorithms when the highest number of variables was used (Scheme 11), which is attributable to the stronger ability of the RF algorithm to process large numbers of high-dimensional variables in parallel compared with other machine learning algorithms [
73]. Li et al. [
65] conducted a comparison experiment using GF-2 PMS data combined with spectral features, textural features, the vegetation index, and topographic factors. The experimental results showed that the OA and kappa coefficients of the RF classifier were higher than those of the SVM classifier.
The omission and commission errors of the evergreen broadleaved trees were higher than those of Chinese fir and Masson pine, which is consistent with previous findings. Zhang et al. [
74] studied the remote sensing-based identification of subtropical evergreen trees and found that the omission and commission errors of both Chinese fir and Masson pine were lower than those of evergreen broadleaved forests. In the present study, the classification errors of broadleaved trees were relatively high, possibly because the study area comprised typical subtropical trees, including a wide variety of broadleaved species, and few pure broadleaved forest areas, which tend to produce mixed pixels.
The results of the variable importance assessment showed that the topographic factor was an important model variable. Including topographic factors, especially altitude, can improve tree species identification accuracy. Rautiainen et al. [
75] indicated that topographic factors and the light preference of tree species were closely related to the spectral differences of conifers and showed that the addition of elevation and aspect factors improved the separability of fir and Masson pine. In addition, in the present study, the vegetation index played an important role in tree species identification, with NDGI having the highest relative importance, and it could be used as a classification variable to distinguish evergreen trees. This is likely because NDGI reflects the chlorophyll content, the biomass of trees, and the water content of leaves, in which the three evergreen tree types differ.
5. Conclusions
Knowledge of the distribution of tree species is the basis for forest resource management, and remote sensing is an important means of tree species identification. The combination of high-resolution satellite data and machine learning makes tree species identification possible at the regional scale. Forests are widely distributed in China, and subtropical evergreen forests cover large areas and are the main forest type in southern China. However, their high stand density makes their tree species difficult to identify. Most studies have focused on tree species identification in temperate regions, and relatively few have addressed subtropical regions. In subtropical tree species identification, UAV data have been used more often than satellite data. The effects of different feature combinations and different machine learning methods on the identification of subtropical evergreen tree species therefore need to be explored.
The purpose of this study was to compare the effects of different feature combination Schemes and machine learning algorithms in the identification of subtropical evergreen tree species, and to construct the classification model with the best recognition effect for these species. Evergreen tree species in a subtropical area were identified via remote sensing, and the results provide supporting information for the recognition of subtropical evergreen tree species using GF-2 imagery. The main conclusions are as follows:
- (1) The combination of Scheme 8 and SVM produced the best recognition effect for subtropical evergreen forest tree species, with an overall accuracy of 90.27% and a Kappa coefficient of 85.37%.
- (2) BP, SVM, RF, KNN, and PLR classifiers were constructed through the multi-feature combination method. The best OA and kappa coefficient obtained by each classification algorithm exceeded 87% and 81%, respectively (SVM > RF > BP > PLR > KNN); thus, the classification results met the application requirements of tree species identification and extraction in subtropical natural evergreen forests with a complex canopy structure and high stand density.
- (3) Achieving accurate tree species recognition using a single class of feature factors was difficult. The combination of multiple features yielded a higher classification accuracy, and the addition of topographic factors effectively improved the tree species recognition accuracy.
- (4) Band reflectance, vegetation indices, textural features, and topographic factors extracted from GF-2 data were combined into different Schemes. Scheme 8 (band reflectance + vegetation index + topographic factor) yielded the best effect, followed by Scheme 11 (band reflectance + vegetation index + textural feature + topographic factor).
- (5) The recognition effect for the three evergreen tree species was evaluated based on commission errors and omission errors. Across the five models, Masson pine had the best recognition effect, followed by Chinese fir.
- (6) Different variables had different importance values in tree species identification. NDGI, altitude, and MSAVI were the most relevant variables.