Article

Research on Red Jujubes Recognition Based on a Convolutional Neural Network

1 Key Laboratory of Tarim Oasis Agriculture, Tarim University, Ministry of Education, Alar 843300, China
2 National and Local Joint Engineering Laboratory for Efficient and High Quality Cultivation and Deep Processing Technology of Characteristic Fruit Trees in Southern Xinjiang, Tarim University, Alar 843300, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(11), 6381; https://doi.org/10.3390/app13116381
Submission received: 18 April 2023 / Revised: 20 May 2023 / Accepted: 22 May 2023 / Published: 23 May 2023

Abstract

Red jujube is one of the most important crops in China. Smart agriculture is essential to meet the needs of the scientific and technological development of the jujube industry, alleviate poverty, leverage latecomer advantages, and promote economic development. The main objective of this study was the online detection of unpicked red jujubes, detecting as many red jujubes in an image as possible while minimizing overfitting and underfitting. Experiments were conducted using the traditional Histogram of Oriented Gradients + Support Vector Machine (HOG + SVM) detection method and the modern deep learning detection methods You Only Look Once version 5 (YOLOV5) and Faster R-CNN. Precision, recall, and F1 score were compared to determine the better algorithm. The study also introduced the AlexNet model with the main objective of attempting to combine it with other traditional algorithms to maximize accuracy. Labeling was used to annotate the training images for YOLOV5 and Faster Regions with CNN Features (Faster R-CNN) so that the trained models could recognize these features in new, unlabeled data in subsequent experiments. The experimental results show that in the online recognition and detection of red jujubes, the YOLOV5 and Faster R-CNN algorithms performed better than the HOG + SVM algorithm, which achieved precision, recall, and F1 score values of 93.55%, 82.79%, and 87.84%, respectively, although the HOG + SVM algorithm was relatively quicker. Detection precision was clearly more important than detection efficiency in this study, so the YOLOV5 and Faster R-CNN algorithms were preferable to the HOG + SVM algorithm. In the experiments, the Faster R-CNN algorithm achieved 100% precision, 99.65% recall, an F1 score of 99.82%, and 83% non-underfitted images among the recognized images; its recall, F1 score, and proportion of non-underfitted images were higher than the corresponding YOLOV5 values of 97.17%, 98.56%, and 64.42%. In this study, therefore, the Faster R-CNN algorithm performed best.

1. Introduction

In recent years, with the increasing scale of production, it has become increasingly difficult to predict crop yields and harvest crops using manual labor, and a large amount of human and material resources is invested in the related work [1]. Because crop yields are important to farmers and related departments, this demand remains high. With the continuous development and improvement of technologies such as machine learning, computer vision, and artificial intelligence, it has become possible to replace humans with machines for these tasks [2].
The forestry and fruit industries are important driving forces for the development of the agricultural economy in Xinjiang and represent a bright spot in its economic growth [3]. Red jujube is an important crop in Xinjiang and is known for its high quality, long cultivation history, and huge planting area. Red jujube (Ziziphus jujuba Mill., of the family Rhamnaceae) is a drupe that is round or long ovoid in shape, 2–3.5 cm long and 1.5–2 cm in diameter. It ripens from red to reddish-purple with a fleshy, thick, sweet mesocarp. It flowers between May and July, grows in mountainous areas, hills, or plains at altitudes below 1700 m a.s.l., and is widely cultivated. The species is native to China and is often cultivated in Asia, Europe, and America [4]. It is usually planted for three years before bearing fruit and has an economic life of about 60–80 years. It is not only one of the main cash crops in southern Xinjiang but also an important source of increased income for local farmers [5]. Because sunlight is abundant in Xinjiang, jujube trees, being high-light crops, benefit greatly from the favorable light conditions. Additionally, Xinjiang experiences significant temperature differences between day and night; the greater the temperature difference, the more dry matter accumulates, making it easier for jujube trees to achieve high yields. Moreover, the water supply in Xinjiang is highly controllable, and production relies entirely on irrigation, which can be carried out according to the water requirements of the jujube trees; both quantity and quality are therefore easy to control, in line with the trees' needs for light, temperature, and water. Owing to these advantages, coupled with its extensive cultivation areas, Xinjiang has become the largest production base of high-quality dried jujubes in China and the world [6].
In recent years, many scholars have conducted extensive research on crop identification and focused on automatic identification based on machine learning and computer vision technologies. For example, spectral imaging technology has been used to detect the quality of red jujubes [7].
A preliminary study by Wang et al. [8] explored a method of setting a single threshold and combining screened images from the same time period with that threshold to extract the NDVI index of the jujube planting area in the crop zone, which is fast, convenient, simple, and easy to understand. Alharbi et al. [9] used different models, including convolutional neural networks, to identify healthy and diseased apples; all models exceeded 90% precision on the test images, with a maximum precision of 99.17%. Bhatt et al. [10] evaluated random forest (RF), support vector machine (SVM), and average neural network (avNNet) models using National Agricultural Imagery Program (NAIP) and unmanned aerial vehicle (UAV) imagery, ultimately concluding that RF worked best. Researchers have studied different crops. For instance, Xu et al. [11] used the super-green feature algorithm and the maximum between-class variance (OTSU) method for segmentation, observed a remarkable segmentation effect, and achieved a recognition accuracy of 94.1% when the weeding robot was traveling at a speed of 1.6 km/h. Chandel et al. [12] found that the combination of computer vision and thermal RGB images helped in the high-throughput mitigation and management of crop water stress. The method proposed by Khan et al. [13] can be applied on a large scale to effectively map the crop types of smallholder farms at an early stage, enabling them to plan a seamless supply of food. There are also studies that use different methods to study the same crop. By investigating the use of a neural network model to intelligently sample fruits in images and infer missing regions, Mirbod et al. [14] concluded that their method could serve as an alternative way to deal with fruit occlusion in agricultural imaging and to improve measurement accuracy. Wang et al. [15] combined Faster R-CNN with different deep convolutional neural networks, including VGG16, ResNet50, and ResNet101; their experimental results showed that the proposed system could effectively identify different types of tomato diseases. Velumani et al. [16] used Faster R-CNN to study plant density in the early growth stages and concluded that the super-resolution method showed significant improvement. Alruwaili et al. [17] used Faster R-CNN to study tomatoes, and the final results showed that the accuracy of the RTF-RCNN proposed in their study was as high as 97.42%, which is better than traditional methods. The most common research direction is using different methods to identify different crops and ultimately obtain relatively suitable methods for identifying each crop [18,19,20].
Because of the large-scale cultivation of various crops in Xinjiang, it is too expensive to rely on manpower for picking and other tasks, and there is a trend toward replacing manpower with machines. Because fruit trees, unlike many field crops, cannot simply be harvested destructively in their entirety, it makes sense to accomplish these tasks through the application of precision agriculture and computer vision.
In the past, researchers have used various deep learning models to identify and classify red jujubes of different varieties and qualities, but they mainly studied red jujubes that had already been harvested and sorted. At the same time, relatively more research has focused on diseases and yield [21,22]. This study selected images of red jujubes still growing on trees for detection; therefore, compared with detecting red jujubes that have already been harvested, there were more environmental interference factors. The ultimate aim of this study is to use a suitable method to improve detection accuracy as much as possible and finally achieve the online detection of red jujubes.

2. Materials and Methods

2.1. Study Dataset

The data for this study were partially obtained from Kaggle. A total of 4104 detection images were collected. Of these images, 1100 were from Kaggle, and 3004 were collected from the field via cell phone. The obtained images were divided into several groups, among which 2400 images were used as a training set, 574 images were used as a validation set, and 1130 images were used as a test set for subsequent studies. No images from Kaggle were used in the test set.

2.2. Study Process

This study comprised three main parts. The first part involved verifying the obtained images to ensure that the classification of the research images was correct. The second part consisted of model training, which involved using different methods to randomly select images according to a certain ratio for training and obtaining the corresponding models. The third part was the validation stage, in which the obtained models were used to predict the validation set images and obtain the corresponding results. The basic flow of the Faster R-CNN and YOLOV5 methods used in this study is shown in Figure 1. The process depicted in the figure is the specific workflow of the deep learning algorithms used in this research; its main purpose is to keep the workflow of each algorithm under control, complete tasks more efficiently, and support collaboration.
The process of the AlexNet algorithm used in the study, along with the HOG + SVM algorithm, differs from Faster R-CNN and YOLOv5. In this study, the main purpose of using the AlexNet algorithm was to improve image recognition accuracy and subsequently explore its applicability to other algorithms. Therefore, its process does not include the detection part shown in Figure 1. On the other hand, the HOG + SVM algorithm, which differs from the training method of convolutional neural networks, has a different process for feature extraction compared to the feature extraction part shown in Figure 1. As for the specific processes of other parts, these two algorithms share a basic similarity with Faster R-CNN and YOLOv5.

2.3. Training and Testing Data

The main purpose of this study was to achieve the online detection of jujubes. Therefore, 565 jujube images were classified as the validation set, and the training set images were randomly selected according to a certain ratio for model training. Here, some training set image examples are shown in Figure 2a, some verification set image examples are shown in Figure 2b, and some test set image examples are shown in Figure 2c. The numbers appearing under the image files serve as identifiers for the image quantities. In this study, the images in the training set, testing set, and validation set are stored in separate folders. The classification of the respective images is based on the names of the folders where the images are stored. Therefore, there is no labeling of the image names for classification purposes.
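For the classification-style experiments (e.g., AlexNet), this folder-per-split, folder-per-class layout can be loaded as sketched below; this is only a minimal illustration, and the folder paths, image size, and batch size are hypothetical placeholders rather than the authors' configuration.

```python
import torch
from torchvision import datasets, transforms

tfm = transforms.Compose([
    transforms.Resize((224, 224)),   # assumed input size for the classification model
    transforms.ToTensor(),
])

# Each split lives in its own folder; class labels are taken from subfolder names.
train_set = datasets.ImageFolder("data/train", transform=tfm)   # hypothetical paths
val_set = datasets.ImageFolder("data/val", transform=tfm)
test_set = datasets.ImageFolder("data/test", transform=tfm)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)
```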
Although some of the data used in this study were from Kaggle, these images were similar to the images captured through the field investigation; the partially public Kaggle datasets used in this study were therefore relatively close to actual situations.

2.4. Detection Algorithms Used

In this study, the main methods used for image detection were Faster R-CNN, YOLOV5, and HOG + SVM, which were compared to determine the most suitable method.
A CNN [23] is a convolutional neural network, which mainly consists of convolutional layers, pooling layers, activation functions, and fully connected layers. It is a large class of networks with convolution at its core and is mainly divided into two categories: one mainly used for image classification, such as LeNet and AlexNet, and the other mainly used for target detection, such as Faster R-CNN and YOLO.
(1)
Input layer and convolution layer: The input layer receives the original image, while the convolution layer extracts the feature information of the image, traversing the image according to the size of the selected convolution kernel and finally summarizing the result. The convolution layer formula is shown in Equation (1), where $x_j^l$ represents the jth feature map of the lth convolutional layer; $k_{ij}^l$ represents the convolution kernel matrix of the lth layer; $M_{l-1}$ represents the set of feature maps of the (l−1)th layer; $b_j^l$ represents the network bias parameter; and f represents the activation function.
$x_j^l = f\left( \sum_{i \in M_{l-1}} x_i^{l-1} \ast k_{ij}^l + b_j^l \right)$ (1)
The convolution kernel used for feature extraction is one of the main parameters of a CNN model and directly affects the performance of feature extraction. The activation function defines the nonlinear mapping applied to the data so that the CNN can better overcome insufficient feature expression ability. Commonly used activation functions include sigmoid, tanh, and ReLU (a minimal code sketch of these layer types is given after this list).
(2)
Pooling layer: The pooling layer is mainly used for downsampling—that is, to reduce the amount of data reasonably according to the detection characteristics to achieve the reduction in calculation and to control overfitting to a certain extent. Its specific calculation formula is the same as the convolution layer.
(3)
Fully connected layer and output layer: The fully connected layer transforms the two-dimensional feature maps of the convolution output into a one-dimensional vector, thus realizing the end-to-end learning process. Each node of the fully connected layer is connected to all of the nodes of the previous layer, hence the name, and its single-layer computation is shown in Equation (2), where M represents the previous layer's calculation and F represents the size of the convolution kernel of the current layer.
$N = M \times F$ (2)
Its calculation formula is presented in Equation (3), where $w^l$ represents the weight of the fully connected layer; $b^l$ is the bias parameter of fully connected layer l; and $x^{l-1}$ represents the output feature map of the previous layer.
$x^l = f\left( w^l x^{l-1} + b^l \right)$ (3)
After the convolution is completed for multi-layer feature extraction, the output layer acts as a classifier to predict the categories of input samples.
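As an illustration of the layer types behind Equations (1)–(3), the following is a minimal PyTorch sketch; the layer sizes and class count are illustrative assumptions and do not correspond to any model configuration reported in this paper.

```python
import torch
import torch.nn as nn

class TinyCNN(nn.Module):
    """Illustrative only: convolution + activation (Eq. (1)), pooling, and a
    fully connected output layer (Eq. (3)) acting as the classifier."""
    def __init__(self, num_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1),  # x_j^l = f(sum_i x_i^{l-1} * k_ij^l + b_j^l)
            nn.ReLU(),                                             # activation function f
            nn.MaxPool2d(kernel_size=2, stride=2),                 # pooling layer: downsampling
        )
        self.classifier = nn.Linear(16 * 112 * 112, num_classes)   # x^l = f(w^l x^{l-1} + b^l)

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)     # 2-D feature maps -> 1-D vector for the fully connected layer
        return self.classifier(x)

logits = TinyCNN()(torch.randn(1, 3, 224, 224))  # the output layer then acts as the classifier
```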

2.4.1. Faster R-CNN

Faster R-CNN is a deep learning model used for object detection, proposed by the Microsoft Research team in 2015 [24]. R-CNN stands for Region-based Convolutional Neural Network, which is a method that uses a convolutional neural network to detect and classify objects in images [25]. Unlike traditional region-based CNNs (such as R-CNN and Fast R-CNN [26]), Faster R-CNN integrates the region proposal process into the CNN, thus greatly improving detection speed and accuracy.
The Faster R-CNN model consists of two main components: a “region proposal network” (RPN [24]) that generates possible regions containing objects and an “object detection network” that classifies and locates these regions. RPN uses sliding windows and convolutional feature mapping to generate region proposals, while the object detection network uses techniques similar to Fast R-CNN to classify and locate these proposals. The theory is roughly shown in Figure 3.
RoI pooling is used in Faster R-CNN so that the generated region proposal map of candidate jujube boxes can generate a feature map of fixed size for subsequent classification and regression.
In this study, VGG16 was used as the feature extraction backbone network [27], with a downsampling factor of 16. After the images were normalized, the region proposals were mapped onto the feature map and quantized a first time. Each mapped region was then divided into a 7 × 7 grid and quantized a second time. The maximum pixel value was taken from each small area to represent that area, and the 49 small areas output 49 values, forming a 7 × 7 feature map.
At the same time, this model’s main advantage over Fast R-CNN is the RPN network, which is specifically shown in Figure 4.
In the first five layers of this part, the first value is the size of the input image, which is assumed to be 224 × 224 × 3 as an example. After the image is imported, it passes through the convolution kernel of layer 1, whose dimensions are 7 × 7 × 3 × 96. From these values, the result obtained by conv1 is 110 × 110 × 96, where 110 is obtained from Equation (4); here, W1 is the size of the input image, W2 is the size of the convolution kernel, P is the zero-padding, and 2 is the stride.
$W = \left( W_1 - W_2 + P \right) / 2 + 1$ (4)
Then, pooling is performed to obtain pool 1. The pooling kernel size is 3 × 3, so the dimensions of the image after pooling are 55 × 55 × 96, obtained by the same process as above using Equation (4). The subsequent layer 2 convolution follows the same process with a kernel size of 5 × 5 × 96 × 256, which yields conv2 as 26 × 26 × 256. The subsequent layers proceed in the same way until the final result is obtained.
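As a quick check of this arithmetic, the following is a minimal Python sketch of Equation (4); the padding value of 1 is an assumption chosen because it reproduces the 224 → 110 and 110 → 55 examples given above.

```python
def output_size(w_in, kernel, padding, stride):
    """Equation (4): W = (W1 - W2 + P) / stride + 1, with integer division."""
    return (w_in - kernel + padding) // stride + 1

print(output_size(224, 7, 1, 2))   # 110: conv1 with a 7 x 7 kernel and stride 2 (P = 1 assumed)
print(output_size(110, 3, 1, 2))   # 55: pool 1 with a 3 x 3 kernel and stride 2 (P = 1 assumed)
```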
After training the model, the image was detected using the model to obtain the final results.
In this study, the model used by Faster R-CNN was VGG16, which has been around for a while but has demonstrated excellent performance. The loss function used was the L1 loss, which affects convergence but avoids exploding gradients and related problems and achieved good results. The chosen IoU threshold range was 0.5:0.95 with an increment of 0.05. The convolution kernel size was 3 with a stride of 1, and the pooling kernel size was 2 with a stride of 2. The learning rate used in the algorithm was 0.001. Figure 5 shows an example of the detection results in this study.
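The paper does not provide implementation code; purely as an illustration, the following is a minimal torchvision-style sketch consistent with the configuration described above (VGG16 backbone with a downsampling factor of 16, 7 × 7 RoI pooling, learning rate 0.001). The anchor sizes, number of classes, optimizer choice, and torchvision API usage are assumptions, not the authors' code.

```python
import torch
import torchvision
from torchvision.models.detection import FasterRCNN
from torchvision.models.detection.rpn import AnchorGenerator
from torchvision.ops import MultiScaleRoIAlign

# VGG16 feature extractor as the backbone; dropping the last max-pool keeps the
# downsampling factor at 16. torchvision expects the backbone to expose `out_channels`.
backbone = torchvision.models.vgg16(weights="DEFAULT").features[:-1]
backbone.out_channels = 512

anchor_generator = AnchorGenerator(sizes=((32, 64, 128, 256, 512),),
                                   aspect_ratios=((0.5, 1.0, 2.0),))
roi_pooler = MultiScaleRoIAlign(featmap_names=["0"], output_size=7, sampling_ratio=2)  # 7 x 7 RoI map

model = FasterRCNN(backbone,
                   num_classes=2,                     # red jujube + background (assumption)
                   rpn_anchor_generator=anchor_generator,
                   box_roi_pool=roi_pooler)

optimizer = torch.optim.SGD(model.parameters(), lr=0.001, momentum=0.9)  # lr from the text; momentum assumed
```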
In summary, Faster R-CNN introduces the RPN to quickly generate candidate jujube object regions, thereby reducing detection time while maintaining high detection accuracy. It also has better scalability and supports easier end-to-end training.
In conclusion, Faster R-CNN is an efficient and accurate target detection algorithm that has the advantages of high accuracy, high efficiency, and flexibility compared with traditional target detection algorithms and has a wide range of application prospects.

2.4.2. YOLOV5

The YOLO [28,29,30] series is a one-stage [28] regression method based on deep learning. Due to its speed and accuracy, YOLO is one of the most famous object detection algorithms. Compared with the two-stage [24,31] Faster R-CNN, YOLO does not include a region proposal step; its operating process is shown in Figure 6, where F is the feature map, C is confidence, P is class probability, and GT is ground truth. First, the input data are processed and subjected to data augmentation. YOLOv5 applies Mosaic data augmentation at the input end, which significantly improves detection accuracy for small objects, and adaptive anchor box calculation is employed for anchor box selection. These steps complete the input preparation. Next, in the backbone section, YOLOv5 introduces a Focus layer after the input, which is similar to the PassThrough layer in YOLOv2 and increases the channel count to four times that of the original feature map. The neck section of YOLOv5 adopts a CSP structure, primarily utilizing an FPN (Feature Pyramid Network) and a PAN (Path Aggregation Network) for upsampling and downsampling, respectively, resulting in three feature maps of different scales for prediction. Regarding the loss section, the most significant change lies in the calculation of the anchor regions for the samples. The matching process involves calculating the aspect ratio between the bounding box (bbox) and the anchors of the current layer. If the ratio exceeds a predetermined threshold, the anchor is considered unmatched with the bbox, and the bbox is discarded as a negative sample. The remaining bboxes are assigned to the corresponding grid cells, and the two adjacent grid cells are also considered potential predictors for the bbox. This leads to at least a three-fold increase in the number of positive anchor samples compared with previous versions; consequently, for a bbox, there are at least three anchors that match it. The loss function is generally divided into three parts: classification loss, confidence loss, and localization loss. Binary Cross-Entropy (BCE) loss is still used for the classification and confidence losses, while the localization loss employs the GIoU (Generalized Intersection over Union) loss.
Compared with the previous YOLO series, the most significant difference in YOLOV5 is the handling mechanism for anchors, which enables YOLOV5 to converge faster.
At the input, YOLOV5 uses the Mosaic data augmentation method, which greatly improves its ability to recognize small objects. In the backbone part, YOLOV5 adds a Focus layer after the input, which makes the channel number four times that of the original feature map compared with YOLOV4. In the loss part, YOLOV5 has made many changes compared with previous versions. In the previous YOLO series, there was a unique anchor corresponding to each ground truth, selected as the anchor with the largest IoU with the ground truth, without considering the case where one ground truth corresponds to multiple anchors. The matching rule adopted by YOLOV5 instead involves calculating the aspect ratio of the bbox and the current layer's anchors. If the ratio is greater than the set threshold, the anchor and bbox do not match, and the bbox is discarded and considered a negative sample. For each remaining bbox, the grid it falls into and the two adjacent grids are determined, and all three grids are considered able to predict the bbox, which leads to at least three times more positive samples than before. The loss function is still divided overall into category loss, confidence loss, and localization loss. The BCE loss is still used for the category loss and confidence loss, but for the localization loss (the loss of w, h, x, and y), the GIoU loss is used. Specifically, GIoU is an improvement of IoU (Intersection over Union), as specified in Equation (5), where A and B denote the bounding boxes of the predicted frame and the real frame, respectively, and C denotes their smallest enclosing box.
$\mathrm{GIoU} = \mathrm{IoU} - \dfrac{\left| C \setminus (A \cup B) \right|}{\left| C \right|}$ (5)
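As a concrete illustration of Equation (5), the following is a minimal PyTorch sketch of GIoU for axis-aligned boxes in (x1, y1, x2, y2) format; it is an illustrative reimplementation, not the loss code used in YOLOv5.

```python
import torch

def giou(box_a, box_b):
    """Equation (5): GIoU = IoU - (|C| - |A U B|) / |C|, boxes as (x1, y1, x2, y2)."""
    inter_w = (torch.min(box_a[2], box_b[2]) - torch.max(box_a[0], box_b[0])).clamp(min=0)
    inter_h = (torch.min(box_a[3], box_b[3]) - torch.max(box_a[1], box_b[1])).clamp(min=0)
    inter = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    iou = inter / union

    # C: smallest enclosing box of A and B
    c_w = torch.max(box_a[2], box_b[2]) - torch.min(box_a[0], box_b[0])
    c_h = torch.max(box_a[3], box_b[3]) - torch.min(box_a[1], box_b[1])
    c_area = c_w * c_h
    return iou - (c_area - union) / c_area

# Identical boxes give GIoU = 1; partially overlapping boxes give a smaller value.
print(giou(torch.tensor([0., 0., 2., 2.]), torch.tensor([1., 1., 3., 3.])))  # ~ -0.079
```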
In the YOLOV5 algorithm of this study, the BCEWithLogits loss function is used to calculate the classification loss, the binary cross-entropy loss (BCEclsloss) is used for the confidence loss, and the GIoU loss is used for the bounding box. Non-maximum suppression is used for suppression in prediction. The initial learning rate was 0.001, and the optimizer was chosen after testing optim.Adagrad, optim.RMSprop, optim.SGD, and optim.Adam; when experimenting with different learning rates, each configuration had its own advantages and disadvantages. The depth_multiple and width_multiple parameters of YOLOV5 were also tested, and 0.3 and 0.5 were chosen for this study.
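The width/height-ratio matching rule described above can be sketched as follows; this is an illustrative reimplementation rather than YOLOv5's own code, and the threshold of 4.0 and the example anchor sizes are assumptions (4.0 mirrors YOLOv5's default anchor_t hyperparameter).

```python
import torch

def match_anchors(bbox_wh, anchors_wh, thr=4.0):
    """Return a boolean mask of anchors whose width/height ratio with the box stays
    below `thr`; a box matched by no anchor is treated as a negative sample."""
    ratio = bbox_wh[None, :] / anchors_wh                 # (num_anchors, 2)
    worst = torch.max(ratio, 1.0 / ratio).max(dim=1).values
    return worst < thr

anchors = torch.tensor([[10., 13.], [16., 30.], [33., 23.]])   # example anchor sizes (assumed)
print(match_anchors(torch.tensor([50., 8.]), anchors))         # tensor([False,  True,  True])
```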

2.4.3. AlexNet

Due to poor detection results in the early stage of the research, an image recognition (classification) algorithm was considered to assist image detection.
The network model used in this study was slightly different from the conventional model. The AlexNet network model is a deep convolutional neural network designed by Alex Krizhevsky [32] in 2012. Because the number of layers in the model is not very deep and it has good classification performance, this study constructed a CNN model based on AlexNet, whose structure is shown in Figure 7; the original AlexNet model is shown on the left, and the modified model on the right. Unlike AlexNet, this model removes part of the convolutional and fully connected layers and adds pooling layers and local response normalization, which eliminates the dimensional differences between data and facilitates data utilization and fast calculation. At the same time, competition is introduced between the feature maps generated by adjacent convolution kernels, so that features that are prominent in one feature map become more prominent while being suppressed in adjacent feature maps, reducing the correlation between feature maps generated by different convolution kernels.
The activation function of an ordinary fully connected layer in a convolutional neural network is usually ReLU or a similar function, but the last fully connected layer is a Softmax classification layer used to predict the probability of each class, whose expression is shown in Equation (6), where x is the input to the fully connected layer; $W^T$ is the weight matrix; b is the bias term; and y is the probability output by Softmax.
$y = \mathrm{softmax}(z) = \mathrm{softmax}\left( W^T x + b \right)$ (6)
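The exact modified architecture is only given in Figure 7; as a rough illustration of local response normalization and the Softmax output of Equation (6), the following PyTorch sketch uses hypothetical layer sizes that do not reproduce the authors' model.

```python
import torch
import torch.nn as nn

# Illustrative only: an AlexNet-style block with local response normalization and a
# Softmax head as in Equation (6); channel counts and kernel sizes are placeholders.
model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=5, stride=2, padding=2),
    nn.ReLU(),
    nn.LocalResponseNorm(size=5),      # normalization across neighbouring feature maps
    nn.MaxPool2d(kernel_size=3, stride=2),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(32, 2),                  # z = W^T x + b
    nn.Softmax(dim=1),                 # y = softmax(z), Equation (6)
)

probs = model(torch.randn(1, 3, 224, 224))  # per-class probabilities; each row sums to 1
```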

2.4.4. HOG + SVM

The Histogram of Oriented Gradients (HOG) [33] is a feature extraction algorithm proposed by Dalal and Triggs at the 2005 CVPR conference, where it was combined with an SVM for pedestrian detection. As a traditional object detection approach, it was a milestone in hand-crafted image feature extraction and achieved great success in pedestrian detection at the time.
The main purpose of the HOG algorithm is to perform gradient calculations on the image, computing the gradient direction and gradient magnitude. The extracted edge and gradient features capture the characteristics of local shapes well, and because gamma correction is applied to the image and normalization is performed over cells, the features have good invariance to geometric and optical changes; such transformations or rotations have little effect on sufficiently small regions [34,35].
The basic idea is to divide the image into many small, connected regions, i.e., cell units, and then calculate the gradient amplitude and direction of the cell by voting to form a histogram based on the gradient characteristics. The histogram is normalized in a larger range of the image (also known as an interval or block). The normalized block descriptor is called the HOG descriptor (feature descriptor). The HOG descriptors of all blocks in the detection window are then combined into the final feature vector. Then, the SVM classifier is used to perform binary classification (detection) of targets and non-targets. The details are shown in Figure 8.
During the detection process, the local object shape detected by HOG can be described by the distribution of gradients or edge directions, and HOG can capture local shape information well, with good invariance to geometric and optical changes. At the same time, HOG is obtained by the dense sampling of image blocks, and the calculated HOG feature vector implicitly contains the spatial relationship between the block and the detection window. The image detection process is shown in Figure 9, and the calculation of the HOG feature value is shown in Figure 10.
The HOG + SVM algorithm is faster to train than the deep learning algorithms, but it struggles to obtain good feature descriptions, has poor real-time performance, and struggles with occlusion and with large changes in the target. Due to the nature of gradients, HOG is also quite sensitive to noise and requires further processing.
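As an illustration of this pipeline, the following is a minimal scikit-image/scikit-learn sketch; the 128 × 64 window, the classic Dalal–Triggs HOG parameters, and the placeholder training patches are assumptions and do not reproduce the authors' implementation.

```python
import numpy as np
from skimage.feature import hog
from skimage.transform import resize
from sklearn.svm import LinearSVC

def hog_descriptor(gray_patch):
    """HOG on a fixed-size grayscale patch (9 orientations, 8x8 cells, 2x2 blocks)."""
    patch = resize(gray_patch, (128, 64))
    return hog(patch, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(2, 2), block_norm="L2-Hys")

# Placeholder positive (jujube) and negative patches standing in for the samples
# described above; real patches would be cropped from the labeled images.
pos = np.random.rand(10, 128, 64)
neg = np.random.rand(10, 128, 64)
X = np.array([hog_descriptor(p) for p in np.concatenate([pos, neg])])
y = np.array([1] * len(pos) + [0] * len(neg))

clf = LinearSVC().fit(X, y)          # binary target / non-target classifier
print(clf.predict(X[:2]))
```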

2.5. Evaluation Indicators

Regarding the detection model, the average precision is currently used as the standard metric for object detection, and it is also used as the evaluation metric for the COCO dataset [36]. The accuracy evaluation metrics used in this study were F1 score, precision, and recall, as shown in Equations (7)–(9), where TP represents correct detections, FP represents non-target detections, and FN represents missed detections.
$F_1 = \dfrac{2PR}{P + R}$ (7)
$P = \dfrac{TP}{TP + FP}$ (8)
$R = \dfrac{TP}{TP + FN}$ (9)
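The following minimal Python sketch evaluates Equations (7)–(9); the counts in the example are made up for illustration and are not the paper's results.

```python
def precision_recall_f1(tp, fp, fn):
    """Equations (7)-(9): P = TP/(TP+FP), R = TP/(TP+FN), F1 = 2PR/(P+R)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    return p, r, 2 * p * r / (p + r)

# Hypothetical counts: 93 correct detections, 7 false detections, 10 missed jujubes.
print(precision_recall_f1(93, 7, 10))   # (0.93, 0.903..., 0.916...)
```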

2.6. Efficiency of Detection Methods

The efficiency of the detection method is one of the most important indicators in the experiment, and in this study, the results related to this aspect are shown directly in terms of time duration, which is relatively intuitive.
In this study, the average training time, total training time, fastest test time, and total test time are reported separately for each method because the number of training iterations ultimately selected for each method was different.

3. Results

The experiments in this study were conducted on a local PC, and the experimental environment configuration of the model is shown in Table 1.

3.1. Model Training

In this study, repeated experiments were conducted to obtain the research results. Since the two deep learning models used could be saved after training, we saved the models at regular intervals during the experimental process and, after comparative tests, selected the model with the best performance. Labeling was used to mark the labels, which were divided into red jujubes and other labels; the marked images are shown in Figure 11.
In order to make the images robust to geometric transformations, establish invariants in the images, and speed up the convergence of the training network, we normalized the images and converted them all to a size of 640 × 1280. Although this normalization caused a certain degree of change in the images, and some images had lower resolutions, testing with different resolutions several times showed that the results were basically unaffected.
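A minimal sketch of this resize-and-normalize step is shown below; the 640 × 1280 target size comes from the text (its height/width order is assumed), while the mean and standard deviation values are generic ImageNet statistics used purely as placeholders.

```python
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((640, 1280)),                      # (height, width) order assumed
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],     # placeholder ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])
```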

3.2. Model Results

During the study, when using the Faster R-CNN method, there was no obvious overfitting phenomenon in the training process, and the loss rate gradually decreased while the accuracy improved as the number of training iterations increased. The loss rate was basically stable when trained for 5000 iterations, and there was no overfitting problem with the training data. The final model trained for 5000 iterations was selected for this study.
When using the YOLOV5 algorithm, the accuracy and loss rate improved continuously as the number of training iterations increased. They gradually stabilized when the number of iterations reached 40, and basically reached their peak and were stable when the number of iterations reached 50. Therefore, the model trained for 50 iterations was selected for this study.
When using the AlexNet algorithm for auxiliary research, the loss rate gradually decreased, and the accuracy gradually improved in the early stage of training. These values were basically stable when trained for 500 iterations. There was a long stable period as the number of training iterations increased. However, as the number of iterations reached 1000–1500, the training results gradually showed an overfitting phenomenon, and this became more and more obvious. Therefore, it was concluded that 500 training iterations could be used to accurately represent the results.
When using the HOG + SVM algorithm, it was mainly based on positive and negative samples for detection. A total of 1487 images were selected as positive and negative samples, and 565 images were selected as the test set for this study.

3.3. The Prediction Results of Model

In addition to the methods listed above, various models were used in this study for experimentation. However, due to issues such as accuracy problems and overfitting and underfitting, these methods are not included in the comparison.
With the HOG + SVM method, the final detection accuracy was 93.54%. Although there were many cases in which not all of the red jujubes in an image could be detected, the accuracy was relatively high for a traditional detection method.
When the AlexNet method was used for auxiliary research, the model trained 500 times was selected for prediction research. The accuracy image obtained from the research is shown in Figure 12a, and the loss image is shown in Figure 12b.
Through the image, it can be found that the accuracy of the first 200 training iterations increased significantly, and the loss decreased significantly. In the subsequent training process, although the results fluctuated to some extent, the overall trend was toward improving accuracy and decreasing loss, gradually becoming stable. The training accuracy was basically stable at 0.87, and the validation accuracy was basically stable at 0.83. The training loss was basically stable at 0.30, and the validation loss was basically stable at 0.45.
In the YOLOV5 algorithm, the accuracy was significantly higher than that of the traditional detection algorithms, and the detection accuracy for other interfering objects fluctuated, mainly because the image environment during detection was more complicated. However, the main purpose of this study was to detect red jujubes, so this did not affect the results. The confusion matrix obtained through training is shown in Figure 13a, and the overall accuracy is relatively high. The F1 score curve is shown in Figure 13b, and the F1 score is relatively good in the confidence range of 0.2–0.4. The relationship between accuracy and confidence is shown in Figure 13c; when the confidence reached 0.859, the accuracy was 100%. Although the accuracy peaked at a relatively high confidence, it was already relatively high at around 0.4 confidence, so the detection effect was relatively good. The accuracy and recall rate curves are shown in Figure 13d; although there was some fluctuation, the overall result was good. The relatively poor performance on other interfering objects was attributed to the fact that these objects were not labeled for training. The recall rate and confidence curves are shown in Figure 13e and are basically consistent with the above: the curve for red jujubes is relatively good, while the other interference curves fluctuate, which could also be due to the labeling of other interfering objects in the training set.
The final training results are shown in Figure 14a. Through this image, it can be found that although there were some fluctuations in the trend, it was basically stable and developed well. The training result was basically stable after 50 times. Some of the verification set images are shown in Figure 14b, and it can be found from the verification results that although there were some problems with underfitting and overfitting, the overall effect was good.
In the experimental test set, because the testing environment was more complex than the validation set, the detection results of some images were slightly lower than those of the validation set. However, the overall detection effect was good: fully detected images account for 35.58%, partially detected (underfitted) images account for 59.29%, and images presenting an overfitting phenomenon account for 5.13%. The detected images are shown in Figure 15a–c.
In the Faster R-CNN method, it was found that the loss rate and accuracy of the training set and the validation set were basically stable when the number of training iterations was 5000, so this model was chosen for the subsequent study. Some of the image examples of the validation set are shown in Figure 16.
In this model, all jujube images were recognized, and the recognition rate was 100%. During detection, most images were detected well, but in some images not all jujubes were detected, as shown in Figure 17; this situation accounts for about 36% of all detected images. In addition, there were also cases of overfitting, as shown in Figure 18, accounting for about 17% of all red jujube detection images.
The final results of the efficiency of the detection method are shown in Table 2.
The table data show that Faster R-CNN has the fastest single training iteration but the longest total training time. Under high accuracy requirements, YOLOV5 has the fastest model training speed and the fastest testing speed. Without controlling for accuracy, HOG + SVM is the fastest overall.
After the complete experiments were conducted and the final results were obtained, the precision, recall, and F1 scores of different algorithms were summarized, and the specific values are shown in Table 3.
The data in Table 3 show that HOG + SVM has a relatively lower precision, recall, and F1 score compared to the other two algorithms, and Faster R-CNN and YOLOV5 both have a precision of 100%, while Faster R-CNN has a slightly higher recall and F1 score. Therefore, from the data observation, the Faster R-CNN algorithm used in the study is relatively superior.

4. Discussion

In this study, the Faster R-CNN and YOLOV5 algorithms achieve high accuracy and generalize well to complex and diverse object environments. Therefore, this approach can in principle achieve the online detection of red jujubes while accepting a greater variety of information and data in order to expand the scale of application. In computer vision research [37,38,39], external environmental impacts on overall performance are often encountered, such as occlusion and lighting issues. In this study, some jujube images had problems with leaf occlusion; because these partially occluded jujubes were also selected for training, leaves were detected as jujubes (overfitting) in some images. Some pictures were taken toward the sun, and the lighting seriously affected detection, resulting in incomplete detection in some images. At the same time, some jujube detections merged multiple fruits, mainly because a large number of training images showed multiple connected fruits that could not be fully separated; this is a normal detection phenomenon [40].
In the Faster R-CNN and YOLOV5 algorithms, although the recognition effect of the two algorithms is 100% in the pre-positioning stage, the number of fully detected images in the Faster R-CNN algorithm is much higher than that for the YOLOV5 algorithm. Additionally, YOLOV5 has a slightly lower recall and F1 score than the Faster R-CNN method. Considering the special nature of crops, in overfitting images, only some leaves are detected as jujube, which does not overly affect the picking, but underfitting means that the model cannot recognize jujube, which has a great impact on picking and other results [41].
In this study, it is worth mentioning the detection efficiency of various algorithms on images. In the HOG + SVM and some methods that are not listed due to their low accuracy, the overall training, validation, and testing speeds are fast, but the accuracy is relatively low. When attempting to use the AlexNet algorithm for auxiliary experiments, although its accuracy is significantly higher than most traditional methods, the results are still not optimistic. In the YOLOV5 and Faster R-CNN algorithms, although the training model stage is slow in this study, the detection accuracy is significantly improved. After detection is completed, the verification time for each image is basically 2–5 s in the Faster R-CNN, and the speed is generally average, but compared to the training, validation, and detection speeds of CNN algorithms and other algorithms, it is 5–10 times faster. In the YOLOV5 algorithm, the verification and detection speed of images is basically 0.2–0.5 s per image in this study environment, which is much faster than the Faster R-CNN.
Due to the special nature of agriculture, if the detection error in online detection is too large to detect the red jujubes, a lot of human and material resources will still be consumed, so in the online identification in this study, accuracy is more important than speed. Comparing the YOLOV5 and Faster R-CNN algorithms used in this study, Faster R-CNN has an advantage in image detection, as its percentage of compliant red jujube detections is 87%, which is significantly higher than the 64.42% observed for YOLOV5.
The limitations of this study cannot be ignored. First, the images used for testing are basically from the same place, so it is not possible to verify whether the model is equally applicable to red jujube recognition in other regions. Secondly, the images currently used for detection are static, whereas real-world use will be more complex and will require dynamic recognition during motion; detection performance in that setting has not yet been evaluated. Moreover, although overfitted images have little effect on the picking of jujube fruit, there is bound to be an effect on the fruit tree itself, so the corresponding fruit should be fully detected to reduce damage as much as possible. At the same time, when dynamic video is used for detection, higher detection efficiency will inevitably be required. Of course, if dynamic video is used for online inspection with a large number of frames, the accuracy will certainly be much higher than that of a single image; choosing a better angle for image acquisition would also greatly reduce environmental interference, and professional imaging equipment would make the images more accurate.

5. Conclusions

Although the sample size in this study is small compared to the scale of cultivation, the Faster R-CNN and YOLOV5 algorithms achieved high accuracy in this experiment, and the detected object environments were complex and diverse, indicating good universality. In this study, the accuracy of Faster R-CNN and YOLOV5 was significantly higher than that of the traditional HOG + SVM algorithm. The training speed of these two deep learning models is slower, but a good model only needs to be trained once, and accuracy is clearly more important than speed in the detection of red jujubes, so the Faster R-CNN and YOLOV5 methods used in this study are superior to HOG + SVM. Meanwhile, the Faster R-CNN method in this paper works well in the recognition and detection of static images, reaching 100% recognition accuracy, with 83% of images showing no overfitting phenomenon in detection. Therefore, the method used in this study is feasible.
In the numerical analysis of the precision, recall, and F1 score, the Faster R-CNN data are optimal, so in this study, the Faster R-CNN results are optimal.
In future research, the problem of overfitting in image recognition will be considered, and the main causes of this problem will be examined. Additionally, image recognition detection of dynamic video will be considered to make the experimental environment closer to the real environment. Additionally, we will try to include other kinds of crops in detection studies. Models such as ResNet [42,43,44] will also be considered for research. In the case of accuracy problems, other algorithms will be combined with the current approach to improve recognition accuracy. Additionally, researchers should consider using different algorithms, such as SSD, for experimental testing.

Author Contributions

Conceptualization, X.L. and J.W.; methodology, X.L.; software, J.W.; validation, C.W., H.G., T.B., Y.H. and X.L.; formal analysis, J.W.; investigation, X.L.; data curation, X.L. and J.W.; writing—original draft preparation, X.L. and J.W.; writing—review and editing, C.W.; visualization, X.L.; supervision, C.W., H.G. and Y.H.; project administration, T.B.; funding acquisition, T.B. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Open Project of the National and Local Joint Engineering Laboratory for Efficient and High Quality Cultivation and Deep Processing Technology of Characteristic Fruit Trees in Southern Xinjiang under Grant FE201805; the earmarked fund of the Xinjiang Jujube Industrial Technology System under Grant XJCYTX-01-01; the Bingtuan Science and Technology Program under Grants 2019CB001, 2021CB041, 2021BB023, and 2021DB001; the Alar City Science and Technology Plan Project under Grant 2021GX02; and the Tarim University Innovation Team Project under Grants TDZKCX202306 and TDZKCX202102.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The experimental data used to support the findings of this study are available from the corresponding author upon request.

Acknowledgments

The authors would like to show sincere thanks to those technicians who have contributed to this research.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Lai, J.; Li, Y.; Chen, J.; Niu, G.Y.; Lin, P.; Li, Q.; Wang, L.; Han, J.; Luo, Z.; Sun, Y. Massive crop expansion threatens agriculture and water sustainability in northwestern China. Environ. Res. Lett. 2022, 17, 3. [Google Scholar] [CrossRef]
  2. Meng, X.; Yuan, Y.; Teng, G.; Liu, T. Deep learning for fine-grained classification of jujube fruit in the natural environment. Food Meas. 2021, 15, 4150–4165. [Google Scholar] [CrossRef]
  3. Liu, M.; Li, C.; Cao, C.; Wang, L.; Li, X.; Che, J.; Yang, H.; Zhang, X.; Zhao, H.; He, G.; et al. Walnut Fruit Processing Equipment: Academic Insights and Perspectives. Food Eng. Rev. 2021, 13, 822–857. [Google Scholar] [CrossRef]
  4. Yao, S. Past, Present, and Future of Jujubes—Chinese Dates in the United States. HortScience Horts. 2013, 48, 672–680. [Google Scholar] [CrossRef]
  5. Wang, X.; Shen, L.; Liu, T.; Wei, W.; Zhang, S.; Li, L.; Zhang, W. Microclimate, yield, and income of a jujube–cotton agroforestry system in Xinjiang, China. Ind. Crops Prod. 2022, 182, 114941. [Google Scholar] [CrossRef]
  6. Shahrajabian, M.H.; Sun, W.; Cheng, Q. Chinese jujube (Ziziphus jujuba Mill.)—A promising fruit from Traditional Chinese Medicine. Ann. Univ. Paedagog. Crac. Stud. Nat. 2020, 5, 194–219. [Google Scholar]
  7. Wang, S.; Sun, J.; Fu, L.; Xu, M.; Tang, N.; Cao, Y.; Yao, K.; Jing, J. Identification of red jujube varieties based on hyperspectral imaging technology combined with CARS-IRIV and SSA-SVM. J. Food Process Eng. 2022, 45, e14137. [Google Scholar] [CrossRef]
  8. Wang, Y.; Wang, L.; Tuerxun, N.; Luo, L.; Han, C.; Zheng, J. Extraction of Jujube Planting Areas in Sentinel-2 Image Based on NDVI Threshold—A case study of Ruoqiang County. In Proceedings of the 29th International Conference on Geoinformatics, Beijing, China, 15–18 August 2022; pp. 1–6. [Google Scholar]
  9. Alharbi, A.G.; Arif, M. Detection and Classification of Apple Diseases using Convolutional Neural Networks. In Proceedings of the 2020 2nd International Conference on Computer and Information Sciences (ICCIS), Sakaka, Saudi Arabia, 13–15 October 2020; pp. 1–6. [Google Scholar]
  10. Bhatt, P.; Maclean, A.L. Comparison of high-resolution NAIP and unmanned aerial vehicle (UAV) imagery for natural vegetation communities classification using machine learning approaches. GIScience Remote Sens. 2023, 60, 2177448. [Google Scholar] [CrossRef]
  11. Xu, B.; Chai, L.; Zhang, C. Research and application on corn crop identification and positioning method based on Machine vision. Inf. Process. Agric. 2023, 10, 106–113. [Google Scholar] [CrossRef]
  12. Chandel, N.S.; Rajwade, Y.A.; Dubey, K.; Chandel, A.K.; Subeesh, A.; Tiwari, M.K. Water Stress Identification of Winter Wheat Crop with State-of-the-Art AI Techniques and High-Resolution Thermal-RGB Imagery. Plants 2022, 11, 3344. [Google Scholar] [CrossRef]
  13. Khan, H.R.; Gillani, Z.; Jamal, M.H.; Athar, A.; Chaudhry, M.T.; Chao, H.; He, Y.; Chen, M. Early Identification of Crop Type for Smallholder Farming Systems Using Deep Learning on Time-Series Sentinel-2 Imagery. Sensors 2023, 23, 1779. [Google Scholar] [CrossRef] [PubMed]
  14. Mirbod, O.; Choi, D.; Heinemann, P.H.; Marini, R.P.; He, L. On-tree apple fruit size estimation using stereo vision with deep learning-based occlusion handling. Biosyst. Eng. 2023, 226, 27–42. [Google Scholar] [CrossRef]
  15. Wang, Q.; Qi, F. Tomato Diseases Recognition Based on Faster RCNN. In Proceedings of the 2019 10th International Conference on Information Technology in Medicine and Education (ITME), Qingdao, China, 23–25 August 2019; pp. 772–776. [Google Scholar]
  16. Velumani, K.; Lopez-Lozano, R.; Madec, S.; Guo, W.; Gillet, J.; Comar, A.; Baret, F. Estimates of Maize Plant Density from UAV RGB Images Using Faster-RCNN Detection Model: Impact of the Spatial Resolution. Plant Phenomics 2021, 2021, 9824843. [Google Scholar] [CrossRef] [PubMed]
  17. Alruwaili, M.; Siddiqi, M.H.; Khan, A.; Azad, M.; Khan, A.; Alanazi, S. RTF-RCNN: An Architecture for Real-Time Tomato Plant Leaf Diseases Detection in Video Streaming Using Faster-RCNN. Bioengineering 2022, 9, 565. [Google Scholar] [CrossRef] [PubMed]
  18. Lutfi, M.; Rizal, H.S.; Hasyim, M.; Amrulloh, M.F.; Saadah, Z.N. Feature Extraction and Naïve Bayes Algorithm for Defect Classification of Manalagi Apples. J. Phys. Conf. Ser. 2022, 2394, 012014. [Google Scholar] [CrossRef]
  19. Yang, Q.; Duan, S.; Wang, L. Efficient Identification of Apple Leaf Diseases in the Wild Using Convolutional Neural Networks. Agronomy 2022, 12, 2784. [Google Scholar] [CrossRef]
  20. Hao, Q.; Guo, X.; Yang, F. Fast Recognition Method for Multiple Apple Targets in Complex Occlusion Environment Based on Improved YOLOv5. J. Sens. 2023, 2023, 3609541 . [Google Scholar] [CrossRef]
  21. Liu, M.; Wang, J.; Wang, L.; Liu, P.; Zhao, J.; Zhao, Z.; Yao, S.; Stănică, F.; Liu, Z.; Wang, L.; et al. The historical and current research progress on jujube–a superfruit for the future. Hortic. Res. 2020, 7, 119. [Google Scholar] [CrossRef]
  22. Liu, Y.; Lei, X.; Deng, B.; Chen, O.; Deng, L.; Zeng, K. Methionine enhances disease resistance of jujube fruit against postharvest black spot rot by activating lignin biosynthesis. Postharvest Biol. Technol. 2022, 190, 111935. [Google Scholar] [CrossRef]
  23. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324. [Google Scholar] [CrossRef]
  24. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015, 2016. [Google Scholar] [CrossRef] [PubMed]
  25. Liao, X.; Zeng, X. Review of Target Detection Algorithm Based on Deep Learning. In Proceedings of the 2020 International Conference on Artificial Intelligence and Communication Technology(AICT 2020), Chongqing, China, 28–29 March 2020; p. 5. [Google Scholar]
  26. Girshick, R. Fast R-CNN. In Proceedings of the 2015 IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015. [Google Scholar]
  27. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. In Proceedings of the ICLR 2015, San Diego, CA, USA, 7–9 May 2015. [Google Scholar]
  28. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2016, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  29. Redmon, J.; Farhadi, A. YOLO9000: Better, Faster, Stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2017, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  30. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020, Seattle, WA, USA, 13–19 June 2020. [Google Scholar]
  31. Li, H.; Ji, Y.; Gong, Z.; Qu, S. Two-stage stochastic minimum cost consensus models with asymmetric adjustment costs. Inf. Fusion 2021, 71, 77–96. [Google Scholar] [CrossRef]
  32. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 6. [Google Scholar] [CrossRef]
  33. Dalal, N.; Triggs, B. Histograms of Oriented Gradients for Human Detection. In Proceedings of the 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), San Diego, CA, USA, 20–26 June 2005. [Google Scholar]
  34. Li, Q.; Qu, G.; Li, Z. Matching between SAR images and optical images based on HOG descriptor. IET Int. Radar Conf. 2013, 2013, 1–4. [Google Scholar]
  35. Bedo, J.; Macintyre, G.; Haviv, I.; Kowalczyk, A. Simple SVM based whole-genome segmentation. Nat. Prec. 2009. [Google Scholar] [CrossRef]
  36. Lin, T.-Y.; Maire, M.; Belongie, S.; Bourdev, L.; Girshick, R.; Hays, J.; Perona, P.; Ramanan, D.; Zitnick, C.L.; Dollár, P. Microsoft COCO: Common Objects in Context. Eur. Conf. Comput. Vis. 2014, 8693, 740–755. [Google Scholar]
  37. Cemil, Z. A Review of COVID-19 Diagnostic Approaches in Computer Vision. Curr. Med. Imaging 2023, 19, 695–712. [Google Scholar]
  38. Xu, M.; Yoon, S.; Fuentes, A.; Park, D.S. A Comprehensive Survey of Image Augmentation Techniques for Deep Learning. Pattern Recognit. 2023, 137, 109347. [Google Scholar] [CrossRef]
  39. Lu, Y.; Chen, D.; Olaniyi, E.; Huang, Y. Generative adversarial networks (GANs) for image augmentation in agriculture: A systematic review. Comput. Electron. Agric. 2022, 200, 107208. [Google Scholar] [CrossRef]
  40. Sengupta, S.; Lee, W.S. Identification and determination of the number of immature green citrus fruit in a canopy under different ambient light conditions. Biosyst. Eng. 2014, 117, 51–61. [Google Scholar] [CrossRef]
  41. Wang, R.Q.; Zhu, F.; Zhang, X.Y.; Liu, C.L. Training with scaled logits to alleviate class-level over-fitting in few-shot learning. Neurocomputing 2023, 522, 142–151. [Google Scholar] [CrossRef]
  42. Aversano, L.; Bernardi, M.L.; Cimitile, M.; Pecori, R. Deep neural networks ensemble to detect COVID-19 from CT scans. Pattern Recognit. 2021, 120, 108135. [Google Scholar] [CrossRef] [PubMed]
  43. He, R.; Xiao, Y.; Lu, X.; Zhang, S.; Liu, Y. ST-3DGMR: Spatio-temporal 3D grouped multiscale ResNet network for region-based urban traffic flow prediction. Inf. Sci. 2023, 624, 68–93. [Google Scholar] [CrossRef]
  44. Song, H.M.; Woo, J.; Kim, H.K. In-vehicle network intrusion detection using deep convolutional neural network. Veh. Commun. 2020, 21, 100198. [Google Scholar] [CrossRef]
Figure 1. Flow chart for the basic process of the Faster R-CNN and YOLOV5 algorithms.
Figure 2. (a) Some training set image examples. (b) Some verification set image examples. (c) Some test set image examples.
Figure 3. Region Proposal Network.
Figure 4. RPN Network Flow.
Figure 5. Example of test results.
Figure 6. YOLO operating process.
Figure 7. Schematic diagram of our model.
Figure 8. HOG process.
Figure 9. HOG + SVM image detection process.
Figure 10. Calculation of HOG feature values.
Figure 11. Marker example image.
Figure 12. Accuracy and loss rate of the training and validation sets. (a) represents the accuracy of recognition; (b) represents the loss of recognition.
Figure 13. Related images to YOLOV5. (a) represents the confusion matrix; (b) represents the F1 score; (c) represents accuracy versus confidence; (d) represents accuracy versus recall rate; (e) represents recall rate versus confidence.
Figure 14. Related images to YOLOV5. (a) represents the missing rate; (b) represents results of the validation set.
Figure 15. YOLOV5-related test results. (a) represents examples of a completely detected image; (b) represents an example of an underfitted image; (c) represents an example of an overfitted image.
Figure 16. Examples of validation set images.
Figure 17. Examples of underfitted images.
Figure 18. Examples of overfitted images.
Table 1. Experimental environment configuration.

| Software, Hardware/Systems | Configuration |
| --- | --- |
| System | Windows |
| CPU | Intel(R) Core(TM) i7-10750H CPU @ 2.60 GHz |
| GPU | GTX 1650Ti |
| Development Language | Python 3.8 |
| Deep Learning Framework | torch 1.12.0 + tensorflow 2.3.1 |
| Accelerated Environment | CUDA 11.6 |
Table 2. Times required for various methods.

| Method | Average Training Time (s) | Total Training Time (s) | Fastest Testing Time (s) | Total Test Time (s) | Precision (%) |
| --- | --- | --- | --- | --- | --- |
| Faster R-CNN | 8.37 | 41,846 | 1.7 | 3051 | 100 |
| YOLOV5 | 189 | 9450 | 0.2 | 339 | 100 |
| HOG + SVM | 822 | 822 | 0.09 | 102 | 93.55 |
| AlexNet | 162 | 16,200 | 2.8 | 4294 | 86 |
Table 3. Summary table of Precision, Recall, and F1 score of different methods.

| Method | Precision | Recall | F1 Score |
| --- | --- | --- | --- |
| Faster R-CNN | 100% | 99.65% | 99.82% |
| YOLOV5 | 100% | 97.17% | 98.56% |
| HOG + SVM | 93.55% | 82.79% | 87.84% |
Citation: Wu, J.; Wu, C.; Guo, H.; Bai, T.; He, Y.; Li, X. Research on Red Jujubes Recognition Based on a Convolutional Neural Network. Appl. Sci. 2023, 13, 6381. https://doi.org/10.3390/app13116381