Article

Birds Detection in Natural Scenes Based on Improved Faster RCNN

1 School of Mechanical and Automotive Engineering, Shanghai University of Engineering Science, Shanghai 201620, China
2 State Key Laboratory of Automotive Safety and Energy, School of Vehicle and Mobility, Tsinghua University, Beijing 100084, China
3 Beijing Key Lab of Traffic Data Analysis and Mining, School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
4 School of Information Science and Engineering, Hebei University of Science and Technology, Shijiazhuang 050018, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(12), 6094; https://doi.org/10.3390/app12126094
Submission received: 20 May 2022 / Revised: 3 June 2022 / Accepted: 6 June 2022 / Published: 15 June 2022
(This article belongs to the Section Computing and Artificial Intelligence)

Abstract

To realize the accurate detection of small-scale birds in natural scenes, this paper proposes an improved Faster RCNN model to detect bird species. Firstly, the model uses a deep residual network to extract convolutional features and performs multi-scale fusion on the feature maps of different convolutional layers. Secondly, the K-means clustering algorithm is used to cluster the bounding boxes, and the anchors are improved according to the clustering results so that they better match the real bounding boxes of the dataset. Finally, the Soft Non-Maximum Suppression method is used to reduce the missed detection of overlapping birds. Compared with the original model, the improved model achieves faster detection and higher accuracy.

1. Introduction

Object detection is an important research direction in the field of computer vision. Tracking and protection of rare birds has always been one of the focuses in the field of animal protection. In recent years, with the mature application of Image Processing [1], Pattern Recognition [2] and Computer Vision Technology [3], it has become possible to detect birds unattended in the wild. The Hengshui Lake Wetland Bird Sanctuary project in China uses high-definition cameras to photograph birds, including grey jays, egrets, gulls, and black-billed gulls. Because the cameras are far from the birds, the captured images mainly contain small-scale birds. Bird subjects often account for only a small part of the whole image, which makes model detection very difficult [4,5]. In addition, birds live in groups, which leads to birds blocking each other in some pictures. Although research on bird detection algorithms under ideal conditions has made great progress, bird detection under real natural conditions is affected by many factors. Therefore, research on bird detection under real natural conditions has important value.
Since the great success of AlexNet [6] in the ImageNet [7] image classification competition in 2012, CNNs have been widely used in the field of computer vision. In recent years, various new CNN structures have been proposed, which has driven rapid progress in object detection algorithms [8]. At present, object detection algorithms can be divided into two categories: two-stage object detection algorithms represented by Faster RCNN [9] and single-stage object detection algorithms represented by YOLO [10] and SSD [11]. In order to address the difficulty of identifying small birds in natural scenes, this paper proposes an improved Faster RCNN model to detect bird species.
This paper makes the following contributions:
  • Convolutional features are extracted by ResNet-50 [12], and the feature maps of different convolutional layers are fused at multiple scales [13];
  • The K-means [14] clustering algorithm is used to cluster the bounding boxes, and the anchors are improved according to the clustering results so that they better match the real bounding boxes of the dataset;
  • The Soft NMS method [15] is used to address the occlusion of birds, and multi-scale training is used to improve the generalization ability of the model in the training stage.

2. Related Work

Ke et al. [16] proposed a Multiple Instance Learning approach that selects anchors and jointly optimizes the two modules of a CNN-based object detector. This approach, referred to as Multiple Anchor Learning (MAL), constructs anchor bags and selects the most representative anchors from each bag. Such an iterative selection process is potentially NP-hard to optimize. To address this issue, the approach [17] solves MAL by repetitively depressing the confidence of the selected anchors through perturbation of their corresponding features. In an adversarial selection-depression manner, MAL not only pursues optimal solutions but also fully leverages the features of multiple anchors to learn a detection model. Dong et al. [18] proposed CentripetalNet, which uses centripetal shift to pair corners belonging to the same instance. CentripetalNet predicts the position and centripetal shift of corner points. On MS-COCO test-dev, CentripetalNet not only beat all existing anchor-free detectors with 48.00% AP but also achieved performance equivalent to the latest instance segmentation methods with 40.21% mask AP. Li et al. [19] set a cleanliness score for each anchor; the impact of noisy samples on classification and localization is reduced by adaptively adjusting the importance of each anchor during training. Cleanliness serves as a soft label to supervise the training of the classification branch. Since it contains more information than the previous positive/negative labels, it can prevent the network from learning incorrect predictions from noisy samples. The contribution of different anchors to the network loss is also weighted, which makes the model tend to choose samples with higher cleanliness scores. Ren et al. [20] developed a unified framework that exploits instance and context information. Weakly supervised learning has become a compelling object detection tool by reducing the need for strong supervision during training. An instance-aware self-training algorithm and a learnable Concrete DropBlock were used, and a memory-efficient sequential batch backpropagation algorithm was designed.
Li et al. [21] proposed a new Balanced Group Softmax module to balance the classifiers within the detection framework through group-wise training. It implicitly adjusts the training process for both head and tail classes and ensures that they are fully trained without any additional sampling of instances of the tail classes. Cao et al. [22] proposed the concept of the Prime Sample and focused on the training process of these samples. Experiments have proven that focusing on prime samples rather than hard samples is more effective for training. Guo et al. [23] first analyzed the design defects of the feature pyramid in FPN and then introduced a new feature pyramid structure. Wang et al. [24] proposed a compositional convolutional neural network in which the voting mechanism is extended to vote on the corners of the bounding box. This model can reliably estimate the bounding boxes of partially occluded objects.

3. Materials and Methods

3.1. Feature Extraction Network

Feature extraction refers to the extraction of more advanced features from the original pixels, which can capture the differences between various categories. Feature extraction is an unsupervised method, which does not use the image category label when extracting information from pixels. At present, the commonly used image feature extraction networks include VGG19, ResNet-50, and InceptionNet-V3 [25].
Although VGG19 performs well on ImageNet, it is not efficient due to the large number of channels in its convolutional layers [26]. For example, if the numbers of input and output channels of a 3 × 3 convolution kernel are both 512, the amount of computation per output location is 9 × 512 × 512.
InceptionNet-V3 decomposes a 7 × 7 convolution into two one-dimensional convolutions (1 × 7 and 7 × 1). This not only speeds up the calculation but also, by splitting one convolution into two, further increases the depth of the network and increases its nonlinearity. It is also worth noting that the network input size is changed from 224 × 224 to 299 × 299.
ResNet-50 has more convolutional layers and can obtain more object features through convolution. ResNet-50 has a skip (shortcut) structure, which can directly skip one or more layers, alleviating the vanishing-gradient problem caused by stacking layers. A ResNet-50 block is shown in Figure 1. A residual block originally aims to learn $H(x)$, but as the network becomes deeper it becomes harder to learn $H(x)$ directly, so the block instead learns $F(x) := H(x) - x$. $F(x)$ is the difference between $H(x)$ and $x$, so $F(x)$ is the residual, and a simple addition recovers $H(x) = F(x) + x$, which is exactly the mapping to be fitted. This is the concept of residual learning.
In a residual block there are two paths along which parameters can be passed: if learning along one path does not work, the other path can be used. As the hierarchy becomes deeper, it is difficult to learn $H(x)$ directly from layer $w_{n-1}$ to layer $w_n$. A second path bypasses the intermediate convolutions, ReLU and bias and carries $x$ directly to the addition, which is equivalent to an identity mapping from layer $w_{n-1}$ to layer $w_n$, so that $F(x) + x$ satisfies the expected value of $H(x)$. This shortcut gives the network more options and allows it to learn for itself which path is more convenient: if the mapping to be fitted is simple, the network can rely mostly on the shortcut; if it is very complex, the network has to rely on the convolutional path. Note that the final output of the block is $ReLU(F(x) + x)$.
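To make the residual mapping concrete, the following minimal sketch builds one ResNet-50 bottleneck block in TensorFlow/Keras (the framework used in Section 4). The function name, layer arrangement and filter counts are our own illustrative assumptions rather than the authors' exact implementation; the point is only that the block outputs ReLU(F(x) + x).

```python
import tensorflow as tf
from tensorflow.keras import layers

def bottleneck_block(x, filters, stride=1):
    """One ResNet-50-style bottleneck block: three convolutions produce the
    residual F(x), the shortcut carries x, and the block returns ReLU(F(x) + x)."""
    shortcut = x
    # 1x1 -> 3x3 -> 1x1 bottleneck that produces the residual F(x)
    y = layers.Conv2D(filters, 1, strides=stride, use_bias=False)(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(filters, 3, padding="same", use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(4 * filters, 1, use_bias=False)(y)
    y = layers.BatchNormalization()(y)
    # Project the shortcut when the spatial size or channel count changes
    if stride != 1 or shortcut.shape[-1] != 4 * filters:
        shortcut = layers.Conv2D(4 * filters, 1, strides=stride, use_bias=False)(shortcut)
        shortcut = layers.BatchNormalization()(shortcut)
    # H(x) = F(x) + x, followed by the final ReLU
    return layers.ReLU()(layers.Add()([y, shortcut]))
```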
VGG19 has 26 layers and its pretraining parameters reach 549 MB; ResNet-50 has 168 layers but its pretraining parameters are only 99 MB; InceptionNet-V3 has 159 layers and its pretraining parameters are 92 MB, as shown in Table 1. Compared with VGG19 and InceptionNet-V3, ResNet-50 has more convolutional layers and can obtain more object features through convolution; thus, we select ResNet-50 as the feature extraction network of the improved Faster RCNN model.

3.2. Multi-Scale Fusion Network

Firstly, high-level feature maps have relatively large receptive fields and strong semantic representation ability, but their resolution is low, their representation of geometric information is weak and their spatial geometric features lack detail. Secondly, low-level feature maps have relatively small receptive fields; although their resolution is high, their semantic representation ability is weak. Finally, high-level semantic information helps us detect objects accurately. Therefore, we fuse all these features in the deep network, which is very effective for detection.
In the original Faster RCNN network structure, ROI-pooling [27] is performed on the last layer of the convolutional neural network to generate candidate regions. ROI-pooling on the last layer only works well for birds that are close to the cameras. In our actual test, the camera is far away from the birds and the photos taken mainly contain small-sized birds. The original Faster RCNN network performed better for large-scale bird objects, while its performance for small-scale bird objects was not ideal [28]. In order to capture fine-grained feature information of bird objects and introduce context information, this paper proposes to improve ROI-pooling by fusing multiple convolutional feature maps. As shown in Figure 2, the convolutional feature maps conv4f_x, conv3d_x and conv2c_x each conduct ROI-pooling with the obtained ROIs to ensure that there are no significant differences in each dimension. The obtained results are then fused and scaled. In order to match the fused result with the original network structure, a 1 × 1 convolution kernel is used for channel dimensionality reduction. The RPN takes a picture of any size as input and outputs a series of rectangular object proposals [29], each of which carries an objectness score. To generate the proposals, a small network is slid over the convolutional feature map output by the last shared convolutional layer.
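As an illustration of the fusion described above, the following sketch (our own, not the paper's code) approximates ROI pooling on three feature maps of different scales with tf.image.crop_and_resize, concatenates the results and reduces the channels with a 1 × 1 convolution. The function name, argument names and output sizes are assumptions made for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fused_roi_pooling(feat_maps, rois, box_indices, out_size=7, out_channels=256):
    """Minimal sketch of multi-scale ROI fusion.

    feat_maps:   list of feature maps (e.g. outputs of the conv4, conv3 and
                 conv2 stages), each of shape (batch, H_i, W_i, C_i).
    rois:        (num_rois, 4) normalized boxes (y1, x1, y2, x2).
    box_indices: (num_rois,) batch index of each ROI.
    """
    pooled = []
    for fmap in feat_maps:
        # crop_and_resize approximates ROI pooling on each scale so every
        # branch yields the same spatial size (out_size x out_size)
        p = tf.image.crop_and_resize(fmap, rois, box_indices,
                                     crop_size=(out_size, out_size))
        pooled.append(p)
    # Fuse the per-scale results along the channel axis
    fused = tf.concat(pooled, axis=-1)
    # 1x1 convolution reduces the channel dimension to match the original head
    fused = layers.Conv2D(out_channels, kernel_size=1)(fused)
    return fused
```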

3.3. K-Means Clustering Algorithm

In anchor-based object detection algorithms, the anchors are generally designed manually. For example, in SSD and Faster RCNN, nine anchors with different sizes and aspect ratios were designed. However, one drawback of manually designed anchors is that there is no guarantee that they will be a good fit for the dataset. If the size of the anchors and the size of the objects differ significantly, the detection effect of the model will be affected. YOLOv2 [30] suggested using K-means clustering to replace manual design. By clustering the bounding boxes in the training set, a group of anchors that are more suitable for the dataset can be generated automatically.
K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster center or centroid), which serves as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. K-means clustering minimizes within-cluster variances (squared Euclidean distances) but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances; better Euclidean solutions can be found using k-medians and k-medoids. The K-means clustering algorithm is a typical distance-based clustering algorithm [31]: it uses distance as the evaluation index of similarity, so the closer two objects are, the more similar they are considered to be.
The bounding box is represented by its upper-left vertex and its lower-right vertex, namely $(x_1, y_1)$ and $(x_2, y_2)$. When clustering the boxes, we only need the width and height of each box as features. In addition, since the sizes of the images in the dataset may differ, the width and height of each box should be normalized by the width and height of its image, as shown in Formula (1):
$$w = \frac{x_2 - x_1}{w_{img}}, \qquad h = \frac{y_2 - y_1}{h_{img}} \tag{1}$$
where $w_{img}$ is the true width of the image, $h_{img}$ is the true height of the image, and $w$ and $h$ are the normalized width and height of the bounding box.
If the Euclidean distance of standard K-means is used directly as the measure, clusters of large boxes will generate larger squared errors than clusters of small boxes. Since we only care about the IoU between the anchor and the box and not about the absolute size of the box, it is more appropriate to use an IoU-based distance, $d(\mathrm{box}, \mathrm{centroid}) = 1 - IoU(\mathrm{box}, \mathrm{centroid})$, as the measure.
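A minimal sketch of this anchor clustering, assuming normalized (w, h) pairs obtained from Formula (1) and the distance d = 1 − IoU; the helper names and the choice of k = 15 centroids (matching the anchor count used later) are our own illustration.

```python
import numpy as np

def iou_wh(boxes, centroids):
    """IoU between boxes and centroids given only (width, height),
    assuming all boxes share the same top-left corner."""
    w = np.minimum(boxes[:, None, 0], centroids[None, :, 0])
    h = np.minimum(boxes[:, None, 1], centroids[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            (centroids[:, 0] * centroids[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=15, iters=100, seed=0):
    """Cluster normalized (w, h) pairs with distance d = 1 - IoU."""
    rng = np.random.default_rng(seed)
    centroids = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        # minimum distance corresponds to maximum IoU
        assign = np.argmax(iou_wh(boxes, centroids), axis=1)
        new_centroids = np.array([boxes[assign == c].mean(axis=0) if np.any(assign == c)
                                  else centroids[c] for c in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids
```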

3.4. Soft Non-Maximum Suppression

In the process of object detection, the Non-Maximum Suppression (NMS) algorithm [32] generates a set of detection boxes B and the corresponding scores S in the detected image. When the detection box M with the maximum score is selected, M is removed from the set B and incorporated into the final detection result set D. Meanwhile, any detection box in the set B whose overlap with M is greater than the threshold $N_t$ is also removed. The NMS rule is shown in Formula (2):
$$s_i = \begin{cases} s_i, & IoU(M, b_i) < N_t \\ 0, & IoU(M, b_i) \ge N_t \end{cases} \tag{2}$$
where $IoU$ represents the overlap rate (Intersection over Union). However, this method has an obvious problem: if there is a high degree of overlap among birds in the same area of the image and the scores of some bird detection boxes are set to 0, the detection of those birds will fail and the mean average precision (mAP) of the algorithm will be reduced, as shown in Figure 3.
In Figure 3, the score of the left detection box is 0.77 and that of the middle detection box is 0.95. The threshold for the overlap rate between birds is set to 0.5, while the overlap rate of the two detection boxes in the figure is 0.54. According to the NMS algorithm, the detection box with the lower score whose overlap rate exceeds the threshold will be removed, resulting in the failure to detect the bird in the middle. Aiming at this problem of NMS, the Soft-NMS algorithm is used in this paper. An attenuation function is set for neighboring detection boxes based on the amount of overlap, rather than setting their scores to zero completely. Simply put, if a detection box overlaps most of M, it will receive a low score; however, if there is only a small overlap between the detection box and M, its original detection score will not be greatly affected. In addition, Soft-NMS does not require additional training and is easy to implement, so it can easily be integrated into the model. The Soft-NMS score attenuation function is shown in Formula (3):
$$s_i = \begin{cases} s_i, & IoU(M, b_i) < N_t \\ s_i\left(1 - IoU(M, b_i)\right), & IoU(M, b_i) \ge N_t \end{cases} \tag{3}$$
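The following minimal sketch (our own illustration, not the authors' code) applies the linear decay of Formula (3): boxes that overlap the current maximum M by at least N_t have their scores multiplied by (1 − IoU) instead of being removed. The function names and the score threshold used to drop near-zero boxes are assumptions.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    iw, ih = max(0.0, ix2 - ix1), max(0.0, iy2 - iy1)
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter) if inter > 0 else 0.0

def soft_nms_linear(boxes, scores, nt=0.3, score_thresh=0.001):
    """Linear Soft-NMS as in Formula (3): decay instead of hard removal."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        m = max(idxs, key=lambda i: scores[i])   # box M with the maximum score
        keep.append(m)
        idxs.remove(m)
        for i in idxs:
            iou = box_iou(boxes[m], boxes[i])
            if iou >= nt:
                scores[i] *= (1.0 - iou)         # decay the score instead of zeroing it
        idxs = [i for i in idxs if scores[i] > score_thresh]
    return keep, scores
```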

4. Experiment and Result Analysis

4.1. Experimental Environment and Experimental Data

The experiment was carried out under the Windows 10 operating system. The deep learning framework is TensorFlow and the programming language is Python. The graphics cards used are two NVIDIA GeForce RTX 2080 Ti cards with a total of 22 GB of video memory. The backbone network is ResNet-50. The transfer learning method is used to train the network, and the pretrained model of ImageNet is used to initialize the network parameters.
In order to evaluate the bird detection performance of the proposed method in a real natural environment, the bird detection dataset used in this paper comes from camera images of Hengshui Lake in China. The image sampling scene is shown in Figure 4. In this experiment, 10 kinds of birds were used for detection. Image samples of the dataset are shown in Figure 5. Categories with insufficient data in the training set were expanded through brightness adjustment and various other data-augmentation strategies, and finally, the amount of data in each category reached more than 3000 samples.

4.2. Evaluation Indicators

FPS and mAP were used to evaluate the algorithm model. The mAP index is obtained by first calculating the Average Precision (AP) of each category and then taking the mean of the AP values over all categories. The calculation is shown in Formula (4), where $TP$ (True Positive) is the number of true positives, $FP$ (False Positive) is the number of false positives, $FN$ (False Negative) is the number of false negatives, $P$ is the precision, $R$ is the recall, $N_c$ is the number of recall points $r_c \in R_c$ for class $c$, $P(r_c)$ is the precision when the recall of class $c$ is $r_c$, and $N$ is the number of classes. $F1\_score$ is a common metric for classification problems and is often used as the final metric for multi-class problems.
$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}, \quad AP_c = \frac{1}{N_c}\sum_{r_c \in R_c} P(r_c), \quad mAP = \frac{1}{N}\sum_{c} AP_c, \quad F1\_score = \frac{2 \times P \times R}{P + R} \tag{4}$$
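As a small worked illustration of Formula (4) (our own example with assumed counts, not results from the paper), the following snippet computes precision, recall and F1 from TP/FP/FN counts and averages per-recall-point precision values into an AP:

```python
import numpy as np

def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from raw counts, as in Formula (4)."""
    p = tp / (tp + fp)
    r = tp / (tp + fn)
    f1 = 2 * p * r / (p + r)
    return p, r, f1

def average_precision(precisions_at_recall):
    """AP_c = mean of the precision values P(r_c) measured at each recall point."""
    return float(np.mean(precisions_at_recall))

# Example with assumed counts: 80 true positives, 20 false positives, 25 false negatives
p, r, f1 = precision_recall_f1(tp=80, fp=20, fn=25)
print(f"P={p:.3f}, R={r:.3f}, F1={f1:.3f}")   # P=0.800, R=0.762, F1=0.780
```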
In the real-time detection task, the FPS value is an extremely important indicator, which is a direct reflection of detection speed and has a direct impact on the application scenarios of the task.

4.3. Model Parameters

In the training stage, the model was trained for 20,000 iterations on the bird dataset, and the initial learning rate was set to 0.001. A learning rate decay strategy was adopted: the decay rate is 0.005 and the learning rate decays once after 4000 iterations. Before an image is input to the network, it is rescaled so that its short edge is 480, 600 or 700 pixels and its long edge does not exceed 1000 pixels. Horizontal flipping is adopted as a data augmentation strategy. In the RPN, the number of anchors is increased from the original 9 to 15; the three aspect ratios are 1:1, 1:1.5 and 2:1, and the five base scales are 16 × 16, 32 × 32, 64 × 64, 128 × 128 and 256 × 256, as sketched below. For the classification and regression part of Faster RCNN, an ROI is labeled as foreground if its IoU with a ground-truth box is greater than or equal to 0.5, and the rest are treated as background. When the score of an ROI is higher than 0.8 but its IoU with the corresponding ground-truth box is less than 0.5, it is regarded as a hard example, and these hard examples are sent to the subsequent network for further training. Similar to the training stage, in the test stage the test images are randomly cropped and input into the test network. For each test image, the RPN generates 128 candidate boxes, and a candidate box is considered to contain a bird if its classification score exceeds 0.8. In this paper, the threshold value $N_t$ in the Soft-NMS algorithm is set to 0.3.
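For illustration, the 15 RPN anchors can be enumerated from the three aspect ratios and five base scales listed above. The sketch is our own: the helper name and the interpretation of each ratio as height:width (so 1:1, 1:1.5 and 2:1 become 1.0, 1.5 and 0.5) are assumptions.

```python
import numpy as np

def generate_anchors(ratios=(1.0, 1.5, 0.5), base_sizes=(16, 32, 64, 128, 256)):
    """One (w, h) anchor per (ratio, base size) pair, keeping the anchor area
    equal to base_size ** 2; 3 ratios x 5 scales = 15 anchors."""
    anchors = []
    for s in base_sizes:
        area = float(s * s)
        for r in ratios:                 # r interpreted as height / width
            w = np.sqrt(area / r)
            h = w * r
            anchors.append((w, h))
    return np.array(anchors)             # shape (15, 2)

print(generate_anchors().round(1))
```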

4.4. Result and Analysis

4.4.1. Ablation Studies

To test the effectiveness and contribution of the different strategies used in the model, experiments were carried out on the bird dataset. The same learning rate of 0.001 and 20,000 iterations were used in all experiments. Experimental results are shown in Table 2, where × means a strategy was not used and √ means it was used.
According to Table 2, the average detection accuracy of the original Faster RCNN model is 73.68%. Different improvement strategies contribute differently to the model, among which multi-scale feature map fusion has the most obvious effect. The average detection accuracy of the final model using all the improved strategies reached 79.85%, which is 6.17 percentage points higher than that of the original model.

4.4.2. Comparison of Our Model and State-of-the-Art Detection Models

In order to compare the proposed model with other classical object detection models (including SSD300 and YOLOv4), the proposed method and the other classical methods were tested and evaluated on the bird dataset. With IoU = 0.5, the model in this paper was compared with SSD300 and YOLOv4, as shown in Table 3. The proposed model performs better in terms of mAP and F1_score, and its performance is more stable; however, its FPS is lower than that of the other models. The experimental results show that, compared with the original Faster RCNN model, the average accuracy and detection speed of the proposed model are improved by 3.5% and 2.6%, respectively.

5. Conclusions

Aiming at the problem of bird detection in natural scenes, an improved Faster RCNN model is proposed in this paper. The residual network ResNet-50 is used as the backbone to extract image features, and a multi-scale feature map fusion strategy is used to detect small-scale birds, so as to improve the detection accuracy of difficult samples. On this basis, the Soft Non-Maximum Suppression method is used to address the problem of overlapping bird objects, and a multi-scale training strategy is introduced to further improve the detection accuracy and detection rate of the model. Experimental results show that the model is effective for bird detection in natural scenes, and the mAP of the model on the bird dataset is 89.0%. The proposed model performs better in terms of mAP and F1_score, and its performance is more stable; however, its detection speed (FPS) is still low. Future work will focus on improving the detection speed of the model.

Author Contributions

Conceptualization, Z.S. and W.X.; methodology, Z.S.; software, W.X.; validation, W.X., Z.S. and X.W.; formal analysis, W.X.; investigation, Z.S.; resources, W.X.; data curation, X.W.; writing—original draft preparation, G.Z.; writing—review and editing, Z.S.; visualization, G.Z.; supervision, W.X.; project administration, X.W.; funding acquisition, W.X. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the National Natural Science Foundation of China (No. 51805312).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

The authors sincerely thank Shanghai University of Engineering Science, Beijing Jiaotong University, the State Key Laboratory of Automotive Safety and Energy and the School of Vehicles and Mobility, Tsinghua University, Beijing.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Iqbal, Z.; Khan, M.A.; Sharif, M.; Shah, J.H.; ur Rehman, M.H.; Javed, K. An automated detection and classification of citrus plant diseases using image processing techniques: A review. Comput. Electron. Agric. 2018, 153, 12–32. [Google Scholar] [CrossRef]
  2. Saijo, Y.; Loo, E.P.I.; Yasuda, S. Pattern recognition receptors and signaling in plant-microbe interactions. Plant J. 2018, 93, 592–613. [Google Scholar] [CrossRef] [PubMed]
  3. Scharr, H.; Dee, H.; French, A.P.; Tsaftaris, S.A. Special issue on computer vision and image analysis in plant phenotyping. Mach. Vis. Appl. 2016, 27, 607–609. [Google Scholar] [CrossRef] [Green Version]
  4. Yue, X.; Li, H.; Shimizu, M.; Kawamura, S.; Meng, L. YOLO-GD: A Deep Learning-Based Object Detection Algorithm for Empty-Dish Recycling Robots. Machines 2022, 10, 294. [Google Scholar] [CrossRef]
  5. Wang, J.; Su, S.; Wang, W.; Chu, C.; Jiang, L.; Ji, Y. An Object Detection Model for Paint Surface Detection Based on Improved YOLOv3. Machines 2022, 10, 261. [Google Scholar] [CrossRef]
  6. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  7. Fei-Fei, L.; Deng, J.; Li, K. ImageNet: Constructing a large-scale image database. J. Vis. 2010, 9, 1037. [Google Scholar] [CrossRef]
  8. Chauhan, R.; Ghanshala, K.K.; Joshi, R. Convolutional neural network (CNN) for image detection and recognition. In Proceedings of the 2018 First International Conference on Secure Cyber Computing and Communication (ICSCCC), Jalandhar, India, 15–17 December 2018; pp. 278–282. [Google Scholar]
  9. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster RCNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28. [Google Scholar]
  10. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  11. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  12. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778. [Google Scholar]
  13. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 January 2017; pp. 2117–2125. [Google Scholar]
  14. Zhong, Y.; Wang, J.; Peng, J.; Zhang, L. Anchor box optimization for object detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2020; pp. 1286–1294. [Google Scholar]
  15. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS–improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569. [Google Scholar]
  16. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556. [Google Scholar]
  17. Ke, W.; Zhang, T.; Huang, Z.; Ye, Q.; Liu, J.; Huang, D. Multiple anchor learning for visual object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10206–10215. [Google Scholar]
  18. Dong, Z.; Li, G.; Liao, Y.; Wang, F.; Ren, P.; Qian, C. Centripetalnet: Pursuing high-quality keypoint pairs for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10519–10528. [Google Scholar]
  19. Li, H.; Wu, Z.; Zhu, C.; Xiong, C.; Socher, R.; Davis, L.S. Learning from noisy anchors for one-stage object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10588–10597. [Google Scholar]
  20. Ren, Z.; Yu, Z.; Yang, X.; Liu, M.Y.; Lee, Y.J.; Schwing, A.G.; Kautz, J. Instance-aware, context-focused, and memory-efficient weakly supervised object detection. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10598–10607. [Google Scholar]
  21. Li, Y.; Wang, T.; Kang, B.; Tang, S.; Wang, C.; Li, J.; Feng, J. Overcoming classifier imbalance for long-tail object detection with balanced group softmax. In Proceedings of the IEEE/CVF conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10991–11000. [Google Scholar]
  22. Cao, Y.; Chen, K.; Loy, C.C.; Lin, D. Prime sample attention in object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11583–11591. [Google Scholar]
  23. Guo, C.; Fan, B.; Zhang, Q.; Xiang, S.; Pan, C. Augfpn: Improving multi-scale feature learning for object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12595–12604. [Google Scholar]
  24. Wang, A.; Sun, Y.; Kortylewski, A.; Yuille, A.L. Robust object detection under occlusion with context-aware compositionalnets. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 12645–12654. [Google Scholar]
  25. Szegedy, C.; Vanhoucke, V.; Ioffe, S.; Shlens, J.; Wojna, Z. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, San Juan, PR, USA, 17–19 June 2016; pp. 2818–2826. [Google Scholar]
  26. da Silva, J.R.; de Almeida, G.M.; Cuadros, M.A.d.S.; Campos, H.L.; Nunes, R.B.; Simão, J.; Muniz, P.R. Recognition of Human Face Regions under Adverse Conditions—Face Masks and Glasses—In Thermographic Sanitary Barriers through Learning Transfer from an Object Detector. Machines 2022, 10, 43. [Google Scholar] [CrossRef]
  27. Qin, Y.; He, S.; Zhao, Y.; Gong, Y. RoI pooling based fast multi-domain convolutional neural networks for visual tracking. In Proceedings of the International Conference on Artificial Intelligence and Industrial Engineering, Phuket, Thailand, 26–27 July 2016. [Google Scholar]
  28. Xavier, A.I.; Villavicencio, C.; Macrohon, J.J.; Jeng, J.H.; Hsieh, J.G. Object Detection via Gradient-Based Mask R-CNN Using Machine Learning Algorithms. Machines 2022, 10, 340. [Google Scholar] [CrossRef]
  29. Zitnick, C.L.; Dollár, P. Edge boxes: Locating object proposals from edges. In Proceedings of the European Conference on Computer Vision; Springer: Berlin/Heidelberg, Germany, 2014; pp. 391–405. [Google Scholar]
  30. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 2017; pp. 7263–7271. [Google Scholar]
  31. Nalli, G.; Amendola, D.; Perali, A.; Mostarda, L. Comparative Analysis of Clustering Algorithms and Moodle Plugin for Creation of Student Heterogeneous Groups in Online University Courses. Appl. Sci. 2021, 11, 5800. [Google Scholar] [CrossRef]
  32. Hosang, J.; Benenson, R.; Schiele, B. Learning non-maximum suppression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, San Juan, PR, USA, 17–19 June 2017; pp. 4507–4515. [Google Scholar]
Figure 1. Schematic diagram of the ResNet-50 block.
Figure 2. Structure of the whole work model.
Figure 3. IoU overlap diagram.
Figure 4. Image sampling scene.
Figure 5. Sample of the bird dataset. It includes 10 categories, where (a) is a picture of Motacilla alba; (b) egrets; (c) Saunders's gull; (d) a seagull; (e) the Greyhound; (f) a cormorant; (g) a sparrow; (h) a little dog; (i) a woodpecker; and (j) a white goose.
Table 1. Comparison of model and pretraining parameters.

Pretraining Model | Network Layers | Pretraining Parameters
VGG19 | 26 | 549 MB
InceptionNet-V3 | 159 | 92 MB
ResNet-50 | 168 | 99 MB
Table 2. Effect comparison of model promotion by different strategies.

Num | Anchor | KM MSFIF SNMS MT | NET | mAP | FPS | P | R | F
1 | 9 | × × × × | VGG19 | 84.60 | 12.52 | 73.15% | 74.22% | 73.68%
2 | 15 | × × × | VGG19 | 86.07 | 11.86 | 74.99% | 75.37% | 75.18%
3 | 15 | × × × | VGG19 | 85.23 | 11.99 | 75.83% | 76.51% | 76.17%
4 | 15 | × × × | VGG19 | 85.08 | 12.43 | 77.61% | 76.13% | 76.86%
5 | 15 | × × × | VGG19 | 85.81 | 12.18 | 76.39% | 77.43% | 77.41%
6 | 15 | × × | VGG19 | 86.52 | 12.63 | 77.22% | 75.48% | 76.34%
7 | 15 | × | VGG19 | 87.59 | 13.91 | 76.84% | 78.91% | 77.86%
8 | 15 | √ √ √ √ | VGG19 | 88.67 | 13.78 | 77.67% | 78.35% | 78.00%
9 | 15 | √ √ √ √ | INet-V3 | 88.72 | 13.17 | 78.94% | 79.76% | 79.35%
10 | 15 | √ √ √ √ | Res-50 | 89.04 | 13.63 | 80.03% | 79.68% | 79.85%
KM: K-means; MSFIF: multi-scale feature image fusion; SNMS: Soft-NMS; MT: multi-scale training; NET: feature extraction network; mAP: mAP (IoU = 0.5); P: Precision; R: Recall; F: F1_score.
Table 3. Comparison results of the proposed method and state-of-the-art methods.

Algorithm | mAP (IoU = 0.5) | FPS | Precision | Recall | F1_Score
SSD300 | 71.23 | 45.64 | 70.86% | 72.09% | 71.47%
YOLOv4 | 86.91 | 31.22 | 79.55% | 81.13% | 80.33%
Faster RCNN | 84.60 | 12.52 | 73.15% | 74.22% | 73.68%
Proposed model | 89.04 | 13.63 | 80.03% | 79.68% | 79.85%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
