1. Introduction
The root system is an important organ for water and nutrient uptake in soybean plants, and its morphological characteristics are closely related to the growth and development of above-ground organs, yield, quality and stress resistance [1]. Understanding the root morphological traits associated with soybean growth and development offers considerable potential for breeding cultivars that absorb water and nutrients efficiently [2]. This makes the root important for identifying crucial traits in promising breeding targets [3]. However, roots normally function within the soil, which limits direct observation [4]. Collecting root images from soil is time-consuming and laborious, and may damage the root system architecture [5]. Nutrient solution hydroponics, in which plants are grown directly in nutrient solution [6], avoids the non-visibility of soil cultivation, facilitates the real-time observation and measurement of root morphology, and is widely used in soybean seedling germplasm resource identification and root trait screening [7,8,9]. The seedling stage occupies an important position in the entire soybean growth period and is the best period for cultivating a good root system [10]. The quantitative study of root morphological characteristics during this period is therefore of great significance for germplasm resource innovation, the selection and breeding of high-quality varieties, and root genetic improvement.
Root segmentation is a key technology in root morphological feature extraction and measurement, and its results directly affect the subsequent measurement of root morphological parameters. In traditional root segmentation, Liu et al. [11] proposed a rape root segmentation method based on color and a Gaussian model, but it requires morphological parameters to be set manually and leaves a small amount of noise in the background. Threshold-based segmentation is sensitive to noise and tends to group targets with similar root and background colors into one class, which is often unsatisfactory in the face of interference such as water stains, noise spots and regions with low contrast. She et al. [12] adopted an automatic global threshold segmentation method for cotton root images, which easily loses part of the roots in complex environments. Wang et al. [13] used a pixel-classification background segmentation method based on a support vector machine to segment maize root images, but it generates disconnected regions and orphaned pixels, and requires the manual extraction of features, which complicates the algorithm. Falk et al. [14] proposed a soybean seedling root segmentation method based on a computer-vision imaging platform and machine learning; this platform delivers biologically relevant time-series data on root growth and development for phenomics and plant breeding applications.
In recent years, with the rapid development of convolutional neural networks [15], deep learning has become a research hotspot in machine learning, showing the capability to address various challenging computer vision tasks. Instead of the traditionally tedious manual extraction of target features, convolutional neural networks automatically learn features from input data and achieve end-to-end pixel-level classification of image targets. Long et al. [16] proposed fully convolutional networks (FCN), which introduced deep learning into image semantic segmentation and remarkably improved segmentation accuracy [17]. At present, more and more scholars have applied deep learning-based semantic segmentation to fields such as medicine [18], semiconductor materials [19], remote sensing [20] and agriculture [21,22,23], and great progress has been made in plant root image segmentation. Wang et al. [24] proposed a fully automated soybean root segmentation method based on convolutional neural networks, called SegRoot, validated its segmentation performance using a transfer learning technique and evaluated how different network capacities affected performance. Teramoto et al. [25] adopted a U-shaped fully convolutional neural network to semantically segment rice roots in trench profile images. Smith et al. [26] used the convolutional neural network U-Net to effectively segment chicory roots from soil in RGB root images. However, the feature layers of these traditional convolutional neural networks assign the same weight to target features and interference features, and detailed image information is gradually lost after multiple convolution and pooling operations, making segmented root edges rough and roots disconnected. Gong et al. [27] improved the U-Net model by introducing a residual module and SENet (Squeeze-and-Excitation Networks) [28] to segment root images of rice seedlings planted in transparent bags; compared with the Otsu method, their model automatically segments root morphology in rice root images under strong noise with higher accuracy, but it also increases model complexity. Kang et al. [29] proposed an attention mechanism-based semantic segmentation model for the in situ imaging of cotton root systems to distinguish roots from the soil background; a simple and effective attention module, CBAM (Convolutional Block Attention Module) [30], is introduced into the model, giving it a good segmentation effect. However, the above segmentation methods often produce false positive pixels when extracting roots from original images containing water stains, noise spots and regions with low contrast, causing partial root disconnection. Further improvement of the network architecture is therefore necessary.
This paper proposes a semantic segmentation model for soybean seedling root images in a hydroponic environment based on an improved U-Net network. To achieve accurate soybean seedling root segmentation and meet the demands of fine root phenotype measurement, a U-Net-based architecture was used. Inspired by the visual attention mechanism [31], attention modules were embedded in the downsampling and skip connection parts of the model, so that the model pays more attention to the root region and accurately distinguishes pseudo-root features. To validate the rationality of the improved network, the model prediction process was visually interpreted using feature maps and class activation maps. A connected component analysis method was used to remove the remaining noise in the prediction map. Furthermore, to verify the generalization ability of the model, a segmentation experiment on soybean root images was performed in a soil-culturing environment. The segmentation network designed in this study can segment soybean seedling root features more accurately in both hydroponic and soil-culturing environments, providing accurate and reliable data support for the measurement of soybean root parameters at the seedling stage.
4. Experiments and Results
4.1. Network Training
4.1.1. Experimental Environment
Python 3.6 was used as the programming development language; the operating system environment was Windows 10 64-bit. The experimental hardware development environment was an Intel(R) Core (TM) i5-10400 CPU @ 2.9 GHz processor with 16 GB of running memory and an NVIDIA GeForce RTX 2060 graphics card. It was also equipped with CUDA 10.0 as the parallel computing framework and CUDNN 7.4.2 as the deep neural network acceleration library. The proposed model was implemented in the PyTorch 1.2.0 deep learning development framework.
4.1.2. Loss Function
The selection of the loss function directly affects the training results of the network: the loss function guides the optimization of the network parameters, and training a deep convolutional neural network amounts to finding the optimum of the loss function. Since the semantic segmentation of soybean seedling root images is a binary classification problem with only two categories, root and background, the binary cross entropy (BCE) loss is used. However, root pixels make up only a small proportion of each image, so the BCE loss inhibits learning of the root category and biases the network toward predicting background. Compared with BCE loss, Dice loss handles this imbalance between positive and negative samples well [36]. The mathematical expressions of BCE loss and Dice loss are shown in Equations (4) and (5).
$$L_{BCE} = -\frac{1}{N}\sum_{i=1}^{N}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right] \tag{4}$$

$$L_{Dice} = 1 - \frac{2\sum_{i=1}^{N} y_i p_i + \varepsilon}{\sum_{i=1}^{N} y_i + \sum_{i=1}^{N} p_i + \varepsilon} \tag{5}$$

where N is the total number of pixel points in the root image; i indexes the i-th pixel point; y_i is the label value of the i-th pixel point, defined as 1 if the pixel is root and 0 otherwise; p_i is the probability, taking values from 0 to 1, that the i-th pixel point is predicted as root; and ε is a small positive value, set to 1, that avoids a zero denominator.
The combination of the BCE and Dice loss functions is used as the loss function of the root segmentation model, which stabilizes training and effectively addresses the imbalance between positive and negative samples [37]:

$$L = \lambda L_{BCE} + (1 - \lambda) L_{Dice}$$

where L is the total loss; L_{BCE} is the binary cross-entropy loss; L_{Dice} is the Dice loss; and λ is the weight coefficient balancing the importance of the BCE and Dice loss functions.
The BCE loss judges the classification of each pixel locally, while the Dice loss judges the predicted segmentation map globally. Since root image segmentation relies more on this global judgment, and too large a BCE weight makes the model tend to predict background, λ is set to 0.3 in the combined loss function.
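As a minimal NumPy sketch of the combined loss described above (the actual model was trained in PyTorch; function names and the `eps` clipping guard are illustrative, and the Dice smoothing constant mirrors the ε = 1 setting):

```python
import numpy as np

def bce_loss(y, p, eps=1e-7):
    """Pixel-wise binary cross entropy; eps clips p to guard against log(0)."""
    p = np.clip(p, eps, 1.0 - eps)
    return float(-np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))

def dice_loss(y, p, smooth=1.0):
    """Dice loss with smoothing constant (epsilon = 1, as in the paper)."""
    inter = np.sum(y * p)
    return float(1.0 - (2.0 * inter + smooth) / (np.sum(y) + np.sum(p) + smooth))

def combined_loss(y, p, lam=0.3):
    """Weighted sum lam * BCE + (1 - lam) * Dice, with lam = 0.3."""
    return lam * bce_loss(y, p) + (1 - lam) * dice_loss(y, p)
```

For a perfect prediction both terms approach zero, while confident wrong predictions are penalized by both the local BCE term and the global Dice term.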
4.1.3. Experimental Parameter Setting
To ensure the repeatability of the training experiment, a fixed random seed strategy was adopted so that each run received the same input data and the model achieved the same results. To avoid overfitting, the maximum number of training rounds was set to 10 epochs [38], with a batch size of 1; the 240 training samples were thus iterated 240 times per epoch, for a total of 2400 iterations. The network weights were randomly initialized using the Kaiming normal distribution strategy, and training used momentum-based Stochastic Gradient Descent (SGD) [39] for network optimization and updating, with an initial learning rate of 0.01, a momentum factor of 0.99 and a weight decay factor of 0.00001 to prevent overfitting. A dynamic learning rate decay strategy was used: after each epoch, the model was tested on the validation set and its F1 value recorded; if the validation F1 did not rise for 3 consecutive epochs, the learning rate was reduced to 10% of its current value, and training stopped once the maximum number of epochs was reached.
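The training configuration above can be sketched in PyTorch roughly as follows; the `nn.Conv2d` module stands in as a placeholder for the actual segmentation network, and the scheduler mirrors the "reduce to 10% after 3 stagnant epochs" rule:

```python
import random
import numpy as np
import torch
import torch.nn as nn

# Fixed random seeds for reproducible training runs
random.seed(0)
np.random.seed(0)
torch.manual_seed(0)

# Placeholder for the improved U-Net model
model = nn.Conv2d(3, 1, kernel_size=3, padding=1)
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight)  # Kaiming normal initialization

# Momentum SGD with the reported hyperparameters
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.99, weight_decay=1e-5)

# Cut the learning rate to 10% when validation F1 stops improving for 3 epochs
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="max", factor=0.1, patience=3)
```

In the training loop, `scheduler.step(val_f1)` would be called once per epoch after evaluating on the validation set.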
4.2. Model Evaluation Metrics
To quantitatively evaluate the performance of the constructed network model on the semantic segmentation of soybean root images, five commonly used performance measures were adopted: Accuracy, Precision, Recall, F1-Score (F1) and Intersection over Union (IoU). These five metrics are calculated from the confusion matrix [40], in which positive samples are root pixels and negative samples are non-root pixels. The confusion matrix is shown in Table 3.
Accuracy refers to the proportion of correctly predicted pixels among all pixels. It is calculated as: Accuracy = (TP + TN) / (TP + TN + FP + FN).
Precision refers to the proportion of pixels correctly predicted as root among all pixels predicted as root, reflecting the false detection rate in the prediction results. It is calculated as: Precision = TP / (TP + FP).
Recall refers to the proportion of pixels correctly predicted as root among all pixels that are actually root, reflecting the missed detection rate in the prediction results. It is calculated as: Recall = TP / (TP + FN).
F1 is an evaluation metric that balances precision and recall and comprehensively reflects the segmentation effect; it is defined as the harmonic mean of precision and recall and lies in the range (0, 1]. The larger the value, the better the segmentation effect of the model. It is calculated as: F1 = 2 × Precision × Recall / (Precision + Recall).
IoU refers to the ratio of the intersection to the union of the predicted root pixels and the true labels, and comprehensively evaluates segmentation performance; a value of 1 indicates that the prediction is completely consistent with the true label. It is calculated as: IoU = TP / (TP + FP + FN).
where TP (true positive) is the number of pixels manually labeled as root and automatically predicted as root by the model; TN (true negative) is the number of pixels manually labeled as background and predicted as background; FP (false positive) is the number of pixels manually labeled as background but predicted as root; and FN (false negative) is the number of pixels manually labeled as root but predicted as background.
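The five metrics above can be computed directly from the four confusion-matrix counts; a small self-contained sketch (the function name is illustrative):

```python
def segmentation_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix pixel counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    iou = tp / (tp + fp + fn)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1, "iou": iou}
```

For example, tp = 90, tn = 100, fp = 10, fn = 10 gives Precision = Recall = F1 = 0.9 and IoU = 90/110 ≈ 0.818, illustrating that IoU is always the stricter of the two composite metrics.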
Since segmentation time is an important metric for evaluating segmentation methods in practical applications, the single-image segmentation time t is also used as an evaluation criterion.
4.3. Model Performance Evaluation
The accuracy evaluation metric and loss value of the validation set are calculated after each epoch. Model segmentation performance metrics of the validation stage after 10 rounds of training are shown in
Table 4. Precision and Recall of the improved U-Net model are 0.9895 and 0.9844, respectively. The comprehensive evaluation metrics F1 and IoU of the model segmentation performance are 0.9869 and 0.9742, respectively. The validation loss value is 0.0208, indicating that the model has high segmentation accuracy.
4.4. Comparison of Segmentation Methods
The threshold-based Otsu method (OTSU algorithm) and the traditional SegNet [41], PSPNet [42], DeepLabv3+ [43] and U-Net network models were selected for comparison tests to further verify the effectiveness of the proposed method for soybean seedling root image segmentation. Eighty images were randomly selected from the test set as input, and the evaluation metrics were used to quantitatively analyze the segmentation methods. The average value of the evaluation metrics for each algorithm is shown in
Table 5.
Overall, the segmentation performance of the improved U-Net model is better than that of the other five algorithms, with fewer cases of mis-segmentation and over-segmentation. Its single-image segmentation time is 0.153 s, differing little from those of the traditional U-Net, SegNet, PSPNet and DeepLabv3+ models, although its total time consumption is larger than that of the four traditional deep learning models because the introduced attention modules require some additional computation of attention weights.
In order to compare the segmentation effects of the OTSU algorithm, traditional SegNet network model, traditional PSPNet network model, traditional DeepLabv3+ network model, traditional U-Net model and improved U-Net model more intuitively, the segmentation effects of the methods are visualized. Segmentation effect graphs of some images are shown in
Figure 10.
There is a small amount of over-segmentation when using the OTSU algorithm to segment the root system. In other words, the background pixel points are classified as root pixel points. The OTSU algorithm can roughly segment the root region when the difference between foreground and background colors in the image is large and the texture is simple. However, for images with root colors similar to or the same as the background color, the segmented root region shows more cases of root disconnection, while the edges are not smooth enough. The deep learning model in general has a better segmentation effect than the traditional image segmentation algorithm. The root morphology has been clearly segmented with the traditional SegNet model, the traditional PSPNet model, the traditional DeepLabv3+ model and the traditional U-Net model. The taproot is correctly segmented, and part of the lateral roots is distinguished. However, these four traditional models still have a few roots over-segmented. Very few noise spots and water stains similar to roots are wrongly classified as roots. The edges of roots extracted by the traditional SegNet model and traditional PSPNet model are a little rough. There is adhesion between roots in the segmentation results of the traditional DeepLabv3+ model. The lateral roots with low color contrast cannot be effectively segmented by the traditional U-Net model. The improved U-Net model effectively solves the problem of a small number of disconnected root systems. The segmentation is more accurate and the details are better. It can accurately segment the disconnected lateral roots, but there are still a few stray points in the segmentation result.
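For reference, the OTSU baseline selects the grey-level threshold that maximizes between-class variance; a minimal NumPy sketch (not the exact implementation used in the comparison) helps show why it struggles when root and background intensities overlap:

```python
import numpy as np

def otsu_threshold(gray):
    """Return the threshold maximizing between-class variance on 8-bit data."""
    hist = np.bincount(np.asarray(gray).ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 (background) probability per threshold
    mu = np.cumsum(prob * np.arange(256))    # cumulative intensity mean
    mu_t = mu[-1]                            # global mean
    # Between-class variance: (mu_t * omega - mu)^2 / (omega * (1 - omega))
    denom = omega * (1.0 - omega)
    denom[denom == 0] = np.nan               # skip degenerate thresholds
    sigma_b2 = (mu_t * omega - mu) ** 2 / denom
    return int(np.nanargmax(sigma_b2))
```

On a cleanly bimodal image the threshold falls between the two modes, but when lateral roots and background share similar intensities the two histogram modes merge and pixels on either side of the threshold are misassigned, producing exactly the disconnections described above.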
4.5. Visualization of Feature Maps and Heat Maps
The feature map is helpful to understand the features learned by the network. The heat map can directly reflect the importance of a certain area in the image. The brighter the color, the more attention the network pays to this area. To better understand the improved U-Net model, the feature maps and heat maps were visualized in the last convolutional layer of the traditional U-Net model and the improved U-Net model, respectively, during the prediction process. The results shown in
Figure 11 were obtained from the segmentation task of the soybean seedling root image.
From
Figure 11a,b, it is observed that the traditional U-Net network has obvious defects in learning the local detail regions of the target: part of the root target region is not identified, and the segmented image shows a few stray noise points. It is observed from
Figure 11c,d that the improved U-Net network learns better and covers the main region of the target more comprehensively. In the fourth image, consider the roots in the lower left corner, framed by the red circle and red box: the model locates these targets in the image and can learn root features with lower contrast. The attention mechanism effectively integrates the information of low-dimensional and high-dimensional semantic feature maps and makes the improved network pay more attention to the areas where lateral roots are disconnected, which fully explains the rationality of the improved network.
4.6. Post-Processing Results and Analysis Based on Connected Component Area Threshold
In the segmentation results obtained by using the improved U-Net model, there still exist a few stray noise points, as shown in
Figure 10e. The largest feature difference between the foreground target and the stray points is the area feature. Therefore, the area of the connected component was extracted for analysis to remove the stray points from the image. In order to extract the area feature parameter, the regions in the binary segmentation map with a gray value of 255 and isolated from each other were extracted and labeled separately using the connected component labeling method, as shown in
Figure 12a. The number of pixels in a stray-point region is generally less than 50, so the noise area threshold is set to 50: if a connected region covers fewer than 50 pixels, it is treated as noise and its pixels are set to 0. The image after noise removal is shown in
Figure 12b.
In
Figure 12, each column shows the connected components labeled with different colors and the corresponding root image after the removal of stray noise. In Figure 12a, each isolated root region is labeled with a different color, which highlights the stray-point noise more intuitively; in Figure 12b, the stray noise with smaller areas has been removed, leaving an accurate root segmentation result.
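The post-processing step can be sketched as a plain-Python connected-component pass with 8-connectivity (a real pipeline would typically use a library labeling routine; `min_area` corresponds to the 50-pixel threshold above):

```python
from collections import deque

def remove_small_components(mask, min_area=50):
    """Zero out 8-connected foreground components smaller than min_area pixels.

    mask: list of lists of 0/1 values; returns a new mask of the same shape.
    """
    h, w = len(mask), len(mask[0])
    out = [row[:] for row in mask]
    seen = [[False] * w for _ in range(h)]
    for sy in range(h):
        for sx in range(w):
            if mask[sy][sx] and not seen[sy][sx]:
                # Flood-fill one component, collecting its pixel coordinates
                comp, queue = [], deque([(sy, sx)])
                seen[sy][sx] = True
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and mask[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                if len(comp) < min_area:  # stray noise: reset to background
                    for y, x in comp:
                        out[y][x] = 0
    return out
```

Large root components survive untouched while isolated specks below the area threshold are cleared, which is exactly the behavior shown in Figure 12b.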
4.7. Results and Analysis of Root System Image Testing of Multiple Soybean Varieties at Seedling Stage
To further verify the generalization ability and practicality of the improved U-Net model, 30 root images of multiple soybean varieties, none used in model training, were segmented for each of the hydroponic and soil culture environments. The model prediction and post-processing results for some images in the two environments are shown in
Figure 13 and
Figure 14.
As can be seen in
Figure 13b, the improved U-Net model shows no significant performance degradation, apart from very few stray noise points in the prediction results. This shows that the improved network generalizes well to soybean seedling root images in the hydroponic environment: it can not only accurately segment single-variety soybean seedling root systems, but also performs well on the multi-variety segmentation task without additional model training. It can be seen from
Figure 13c that the residual stray points in the model prediction results have been eliminated and accurate root segmentation results have been obtained, which meets the segmentation requirements needed for the study to a certain extent.
From
Figure 14b, it can be seen that the improved U-Net model shows no obvious performance degradation, although a few stray noise points and some disconnected lateral root ends remain in the prediction results. This verifies that the improved network has strong generalization ability and good segmentation results on root images of multiple soybean varieties in the soil culture environment without additional model training. From
Figure 14c, it can be seen that the residual stray points in the model prediction results have been filtered out. More accurate root segmentation results have been obtained.
In summary, the segmentation method can handle the root system images of multiple soybean species at the seedling stage in both hydroponic and soil-culturing environments. The segmentation result has less background noise, strong generalization ability and practicality.
4.8. Discussion
The accurate segmentation of a root system from the image background is a prerequisite for the fine-grained measurement of root morphological characteristics. Effective analysis of soybean seedling root images therefore requires advanced automatic image segmentation methods, and deep learning methods can now help automate this task. It is worth noting that the performance of deep learning-based segmentation partially depends on image quality; since image quality is very important for root system research, a high-precision scanner was used to obtain high-quality root images in this study. In future research, an effective image enhancement method, such as a generative adversarial network, will be explored to enhance the quality of the root images and thereby improve segmentation accuracy.
To illustrate the performance of the proposed model, we explored five deep learning-based image semantic segmentation models in the self-built soybean seedling root dataset. Comparing the segmentation results of all models, the improved U-Net model has a better segmentation effect than the other four traditional models. The PSPNet model uses the Pyramid Pooling Module (PPM) to provide global context information by different-region-based context aggregation. The U-Net model transfers complete feature map information (i.e., both location and pixel values) to the decoder in the channel dimension. However, the PSPNet model and U-Net model have difficulty segmenting the soybean seedling root system completely, and many root pixels are lost. More specifically, due to the low contrast of lateral root and background, most pixels of lateral roots are classified as background. The SegNet model transfers only max-pooling indices (i.e., location of feature maps) from encoder to decoder for feature concatenation and reconstruction. Based on DeepLabv3 with Atrous Spatial Pyramid Pooling (ASPP), the DeepLabv3+ model uses a simple and effective decoder module to capture sharper object boundaries. However, some small bright spots and water drops similar to the color of soybean roots are also segmented by the SegNet model and DeepLabv3+ model. There is still some residual noise in the mask. In addition, we also performed segmentation experiments on soybean seedling root systems in the soil culture environment without model training. Our experimental results on soybean seedling root images in hydroponic and soil culture environments demonstrated the excellent performance and high accuracy of the improved U-Net model for end-to-end fully automatic root segmentation. This is important for the study of soybean root phenotypic parameters.
However, the proposed method also has some limitations. For the soybean seedling root images under a soil culture environment, the improved U-Net model could not completely segment the end of the lateral roots when the contrast at the end of the lateral roots was relatively low. The main reason for this problem is that, when constructing the root semantic segmentation dataset, the sample size with an insignificant lateral root end contrast is relatively small. The root system in most images has high contrast with the background, and the proposed model is trained on the whole root system image in the hydroponic environment. In future research, the network model can be further improved to enhance the feature extraction ability of the model for the root system. In addition, the root segmentation dataset can be further improved, for example, by increasing the number of samples with insignificant lateral root contrast to enhance the ability of the model to extract lateral root features with low contrast.
5. Conclusions
In this paper, we proposed a root image semantic segmentation model based on an improved U-Net network with a dual attention and attention gate mechanism to segment soybean seedling roots automatically and precisely. The dual attention mechanism in the downsampling process automatically focuses on the root region and important feature channels of the image. The attention gate mechanism in the skip connection part can make full use of the spatial information in the encoder to bridge the semantic gap between the encoder and the decoder, and suppress the feature activation of irrelevant regions in the image.
The proposed model was trained, validated and tested with self-built soybean seedling root datasets. The experimental results on the root test dataset showed that the Accuracy, Precision, Recall, F1 and IoU of the improved U-Net model were 0.9962, 0.9883, 0.9794, 0.9837 and 0.9683, respectively. The segmentation time of a single image was 0.153 s. Compared with five other algorithms including the OTSU algorithm, traditional SegNet model, traditional PSPNet model, traditional DeepLabv3+ model and traditional U-Net model, the proposed model has the highest overall accuracy and the best segmentation effect. It can extract more accurate root edges and effectively solve the problem of a few broken roots. Through the visual experiment of feature maps and heat maps, it is verified that the proposed model is reasonable and better than the traditional U-Net model. The post-processing algorithm based on the area threshold of the connected component can remove the background noise present in segmentation and obtain accurate root segmentation. It provides a theoretical basis and technical support for the quantitative assessment of soybean seedling root morphological characteristics. In the future, we need to add a quantitative module of root phenotype parameters, so as to achieve complete soybean root phenotype analysis on the premise of ensuring accuracy.