Article

An Efficient Method for Monitoring Birds Based on Object Detection and Multi-Object Tracking Networks

1 College of Information Engineering, Sichuan Agricultural University, Ya’an 625000, China
2 Ya’an Digital Agricultural Engineering Technology Research Center, Ya’an 625000, China
* Author to whom correspondence should be addressed.
Animals 2023, 13(10), 1713; https://doi.org/10.3390/ani13101713
Submission received: 21 March 2023 / Revised: 14 May 2023 / Accepted: 16 May 2023 / Published: 22 May 2023
(This article belongs to the Topic Ecology, Management and Conservation of Vertebrates)

Simple Summary

Knowing the species and numbers of birds in nature reserves is essential to achieving the goals of bird conservation. However, it still relies on inefficient and inaccurate manual monitoring methods, such as point counts conducted by researchers and ornithologists in the field. To address this difficulty, this paper explores the feasibility of using computer vision technology for wetland bird monitoring. To this end, we build a dataset of manually labeled wetland birds for species detection and implement taxonomic counts of ten wetland bird species using a deep neural network model with multiple improvements. This tool can improve the accuracy and efficiency of monitoring, providing more precise data for scientists, policymakers, and nature reserve managers to take targeted conservation measures in protecting endangered birds and maintaining ecological balance. The algorithm performance evaluation demonstrates that the artificial intelligence method proposed in this paper is a feasible and efficient method for bird monitoring, opening up a new perspective for bird conservation and serving as a reference for the conservation of other animals.

Abstract

To protect birds, it is crucial to identify their species and determine their population across different regions. However, bird monitoring currently relies mainly on manual techniques, such as point counts conducted by researchers and ornithologists in the field. This approach can be inefficient, error-prone, and limited in coverage, which is not always conducive to bird conservation efforts. In this paper, we propose an efficient method for wetland bird monitoring based on object detection and multi-object tracking networks. First, we construct a manually annotated dataset for bird species detection, comprising 3737 bird images in which the entire body and the head of each bird are annotated separately. We also build a new dataset containing 11,139 complete, individual bird images for the multi-object tracking task. Second, we perform comparative experiments with a batch of state-of-the-art object detection networks, and the results demonstrate that the YOLOv7 network, trained with the dataset labeling the entire body of the bird, is the most effective method. To enhance YOLOv7 performance, we add three GAM modules on the head side of YOLOv7 to minimize information diffusion and amplify global interaction representations, and we utilize Alpha-IoU loss to achieve more accurate bounding box regression. The experimental results reveal that the improved method offers greater accuracy, with mAP@0.5 improving to 0.951 and mAP@0.5:0.95 improving to 0.815. Then, we send the detection information to DeepSORT for bird tracking and classification counting. Finally, we use an area counting method to count birds by species and obtain information about flock distribution. The method described in this paper effectively addresses the monitoring challenges in bird conservation.

1. Introduction

The global ecological environment is under severe threat due to rapid social and economic development [1,2], leading to the endangerment of many bird species [3]. Consequently, the protection and conservation of endangered organisms have emerged as one of humanity’s most pressing concerns. Numerous countries worldwide have implemented various measures to help protect birds, ensuring their reproduction and survival. Typically, monitoring stations are established in protected areas for wildlife monitoring and management. The data on bird species and populations collected through monitoring allow reserve managers to better understand bird survival and distribution patterns, enabling the development of the most effective measures for bird protection. Therefore, efficient monitoring of birds is one of the keys to addressing bird conservation issues.
Nevertheless, currently, bird monitoring methods mainly rely on manual techniques, such as point counts conducted by researchers and ornithologists in the field [4,5,6,7,8], utilizing equipment such as binoculars, high-powered cameras, and telephoto lenses to conduct fixed-point observations in areas where birds congregate. This approach is not only time-consuming and labor-intensive, but it also suffers from limited coverage, and the data obtained are often untimely, inaccurate, and incomplete. This is particularly true for near-threatened and endangered species, which may not be observed due to their low occurrence and population numbers [9]. This outdated method significantly impedes bird conservation efforts. As a result, there is an urgent need to develop efficient methods to enhance bird monitoring efficiency.
In recent years, bird monitoring technology has experienced significant advancements. For instance, Zheng Fa et al. conducted a sample line survey at field sample sites using a telephoto-lens SLR digital camera together with monoculars and binoculars [10]. Sun Ruolei et al. employed bird songs, photographs, and professional resources, such as the “Field Manual of Chinese Birds”, for identification [11]. Liu Jian et al. utilized biological foot-ring sensors [12]. In addition, some scholars have employed methods such as aerial photography, unmanned aerial vehicle (UAV) surveys [13,14,15,16,17,18,19], and bioacoustics for bird monitoring [20,21,22,23]. Compared with traditional manual monitoring methods, these methods can improve the efficiency of bird monitoring and reduce wasted time. However, they still rely heavily on manual labor and accumulated experience, and they are vulnerable to factors such as small bird objects, heavy occlusion, high density, harsh field environments, and a heavy manual workload. Despite these improvements, the monitoring efficiency is still not high enough. Therefore, there is an urgent need for intelligent, modern methods that push bird conservation toward precision and automation.
The development of artificial intelligence has expanded from a single application area to a wide range of applications [24,25], and computer vision technology is one of them. The application of computer vision to animal protection is a research hot spot among scholars all over the world [26,27]. Juha Niemi et al. investigated bird identification through a bird radar system combined with an object-tracking algorithm [28]. They applied convolutional neural networks trained by deep learning algorithms to image classification, demonstrating the need for an automatic bird identification system suitable for real-world applications. In a breakthrough in this research area, scientists from research teams at CNRS, the University of Montpellier, and the University of Porto, Portugal, developed the first artificial intelligence model capable of identifying individual birds [29]; their system can automatically identify individual animals with no external markers at all, without human intervention, and without harming the animals. However, the system has some limitations: it can only identify birds already in the database and cannot cope with changes in appearance, such as feather changes. Lyu Xiuli and colleagues from Northeast Petroleum University utilized a convolutional neural network to identify and locate red-crowned cranes [30] and established a recognition model specifically for this species that showed good identification performance for red-crowned crane populations. However, their method is not a comprehensive multi-class system; it can only identify red-crowned cranes, which greatly limits its practicality.
Reviewing the related works in recent years, it can be seen that although many effective studies have been carried out in the field of bird monitoring and protection, they still have some shortcomings, such as limited practicality, insufficient model stability, and low accuracy. In addition, bird species are numerous and often difficult to distinguish, and no existing algorithm can accurately classify birds and record their numbers by species. Therefore, it is necessary to study a highly efficient method with good classification performance and species-counting capability to monitor birds, deepen the research on the intelligence and automation of bird conservation, and promote the conservation of biodiversity.
This study selects ten priority-protected bird species, including the Ruddy Shelduck (Tadorna ferruginea), Whooper Swan (Cygnus cygnus), Red-crowned Crane (Grus japonensis), Black Stork (Ciconia nigra), Little Grebe (Tachybaptus ruficollis), Mallard (Anas platyrhynchos), Pheasant-tailed Jacana (Hydrophasianus chirurgus), Demoiselle Crane (Anthropoides virgo), Mandarin Duck (Aix galericulata), and the Scaly-sided Merganser (Mergus squamatus), as research objects. These species are under protection due to their declining population numbers and are of great conservation concern. Monitoring bird populations is a crucial tool for protecting bird species. To better monitor these protected bird species, we propose an efficient and automated bird monitoring method based on the latest object detection and multi-object tracking technologies, which is capable of achieving precise monitoring for these ten bird species and offering a new perspective on bird monitoring. Firstly, we detect and locate the birds by object detection and obtain the species information of the birds; then, we use the multi-object tracking algorithm to assign a unique ID to each object to ensure the accuracy of counting and avoid duplicate counting or missed counting due to occlusion; and finally, we combine the detection results with the ID information to realize the classification counting of the birds. Since the performance of the tracking by the detection method depends on the quality of the object detection algorithm, we also target improving the object detection algorithm, aiming to improve the efficiency of bird monitoring, promote the research of intelligent and automated bird conservation, and protect biodiversity.
Specifically, the contributions of this paper include the following points. Firstly, we propose a new method of bird monitoring based on object detection and multi-object tracking networks. The method improves the efficiency of bird conservation and, at the same time, is highly portable and provides a reference for the conservation of other animals. Secondly, we improve the object detection part. The YOLOv7 algorithm is used as the baseline for object detection; three GAM modules are added to the head side of YOLOv7 to reduce information dispersion and amplify the global interaction representation, and the loss function is replaced with Alpha-IoU loss to obtain more accurate bounding box regression and object detection. These changes optimize the performance of the YOLOv7 algorithm and, in turn, of the bird monitoring method proposed in this paper. Thirdly, an ingenious method of sorting and counting is designed. We make a counting board of the same size as the original image and combine the detection result and the tracking-assigned ID information to realize the counting of birds by species. (The specific technical idea of counting will be described in detail in Section 2.4.4 of this paper.) Finally, a manually annotated bird species detection dataset is constructed in this paper. It contains ten species of key protected birds, including 3737 images of bird flocks, and adopts pure-head annotation and whole-body annotation methods, respectively. The dataset images contain both single-bird activities and dense flock activities, which are inevitably disturbed by natural factors such as vegetation shadows, non-avian animals, water bodies, and litter. These data are derived from various real environments in wetlands, which makes the trained model more robust and well-suited for practical use, and the dataset can also serve as a reference for bird species detection studies. We also build a new dataset for the multi-object tracking task, containing 11,139 complete individual bird images covering various motion patterns and shooting angles, allowing the trained model to extract more effective features and be more robust, in addition to expanding the fine-grained classification dataset of birds (e.g., “CUB-200-2011”).

2. Materials and Methods

2.1. Data Acquisition

Ten types of protected birds in China were selected as the objects of the experiment. (The names of the ten types of birds are as follows: “Ruddy Shelduck, Whooper Swan, Red-crowned Crane, Black Stork, Little Grebe, Mallard, Pheasant-tailed Jacana, Demoiselle Crane, Mandarin Duck, Scaly-sided Merganser”.)
The dataset for this paper is divided into two sections: (1) the dataset for object detection and (2) the dataset for a multi-object tracking feature extraction network.
The data used in this paper for object detection come from the internet. In order to ensure the authenticity and validity of the experiment, we screened the quality of the data to meet the minimum pixel requirements of more than 1080P, and all of them are authentic bird images in a wetland environment. The collected data contain a variety of interference factors, such as overlap, distance change, light and shade change, vegetation shadow, non-avian animals, and garbage. These interferences can replicate many conditions in real scenes, enhance the robustness of the algorithm, and improve the generalization ability of the model to ensure the effectiveness of the method. In addition, considering the occlusion between birds, this paper not only marks the whole body of the bird but also marks the head of the bird separately.
The dataset used in the multi-target tracking feature extraction network in this paper comes from the dataset of object detection. We extract the complete image of each bird from the dataset of object detection, including multiple angles and actions of birds. The dataset can also be used to expand the fine-grained classification dataset of birds. The preview of the above two parts of the dataset is shown in Figure 1.
All datasets in this paper are divided into a training set, a validation set, and a test set according to a ratio of 85:10:5, and can be accessed and downloaded online at the following link: “https://www.kaggle.com/datasets/dieselcx/birds-chenxian (accessed on 12 May 2023)”. The division of the datasets is shown in Table 1 and Table 2.
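For illustration, such a split can be scripted in a few lines. The following is a minimal sketch, assuming images sit in a flat directory with a hypothetical path; it is not the authors' actual preprocessing pipeline:

```python
import random
from pathlib import Path

def split_dataset(image_dir, ratios=(0.85, 0.10, 0.05), seed=42):
    """Shuffle image paths and split them into train/validation/test lists."""
    images = sorted(Path(image_dir).glob("*.jpg"))
    random.Random(seed).shuffle(images)
    n_train = int(len(images) * ratios[0])
    n_val = int(len(images) * ratios[1])
    return (images[:n_train],
            images[n_train:n_train + n_val],
            images[n_train + n_val:])

# hypothetical directory name, for illustration only
train, val, test = split_dataset("datasets/birds/images")
print(len(train), len(val), len(test))
```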

2.2. Data Preprocessing

2.2.1. Mosaic Data Enhancement

Mosaic is a data enhancement method proposed in YOLOv4 [31]. The method randomly selects four images, transforms them by random scaling, random cropping, and random arrangement, and splices them into a new image that is used as training data.
Mosaic data enhancement has two main advantages. (1) Expanding the dataset: In addition to enriching the background of the detection dataset, random scaling also adds many small targets, making the model more robust. (2) Reducing GPU memory: In batch normalization, the data of four images are computed simultaneously, which can reduce the dependence on batch size, and a single GPU can complete the training. The workflow for Mosaic’s data augmentation operation is shown in Figure 2.
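A simplified sketch of the splicing step is shown below. It only illustrates the image composition; bounding-box remapping and the exact YOLOv4/YOLOv7 implementation details are omitted, and the grey padding value of 114 is an assumption borrowed from common YOLO implementations:

```python
import random
import numpy as np
import cv2

def simple_mosaic(images, out_size=640):
    """Paste four randomly scaled images into the quadrants of one canvas."""
    canvas = np.full((out_size, out_size, 3), 114, dtype=np.uint8)  # grey background
    cx = random.randint(out_size // 4, 3 * out_size // 4)  # random mosaic centre x
    cy = random.randint(out_size // 4, 3 * out_size // 4)  # random mosaic centre y
    regions = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(images, regions):
        patch = cv2.resize(img, (x2 - x1, y2 - y1))  # scaling is implied by the region size
        canvas[y1:y2, x1:x2] = patch
    return canvas
```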

2.2.2. Mixup Data Enhancement

Mixup [32] is an algorithm for mixing classes of augmentation of images in computer vision. It is based on the principle of neighborhood risk minimization and uses linear interpolation to mix images between different classes to construct new training samples and labels, which expand the training dataset. The image processing formula for the mixup data enhancement is as follows:
$\tilde{x} = \lambda x_i + (1 - \lambda) x_j$ (1)
$\tilde{y} = \lambda y_i + (1 - \lambda) y_j$ (2)
$(x_i, y_i)$ and $(x_j, y_j)$ are two randomly selected samples and their corresponding labels from the same batch, and $\lambda \in [0, 1]$ is a mixing coefficient sampled from a Beta distribution with parameter $\beta$. Figure 3 shows several images after the mixup data enhancement process.
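As a concrete illustration, Equations (1) and (2) can be applied to a training batch as in the minimal PyTorch sketch below; the value 0.2 for the Beta parameter is an assumed, commonly used default rather than a value taken from this paper:

```python
import numpy as np
import torch

def mixup(x, y, alpha=0.2):
    """Mix a batch of images x and one-hot labels y with a Beta-sampled lambda."""
    lam = np.random.beta(alpha, alpha)          # lambda ~ Beta(alpha, alpha), lies in [0, 1]
    index = torch.randperm(x.size(0))           # random pairing within the batch
    x_mixed = lam * x + (1.0 - lam) * x[index]  # Equation (1)
    y_mixed = lam * y + (1.0 - lam) * y[index]  # Equation (2)
    return x_mixed, y_mixed
```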

2.2.3. HSV Data Enhancement

HSV is a color space created based on the intuitive properties of color. H stands for hue, S for saturation, and V for value. Hue, which denotes the color itself, is measured in degrees in the range [0°, 360°]; we can change the color by changing the angle. Saturation indicates how close a color is to a spectral color: a color can be seen as the result of mixing a certain spectral color with white, and the larger the proportion of the spectral color, the closer the color is to it and the higher the saturation. Saturation lies in the range [0, 1]. Value indicates the color’s degree of brightness and is related to the luminosity of the luminous body: increasing the luminosity makes the color brighter, and value also lies in the range [0, 1]. Figure 4 shows several images after the HSV data enhancement process.
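In practice, HSV augmentation randomly jitters the three channels. The sketch below shows one common way to do this with OpenCV; the gain values are assumptions based on typical YOLO-style settings, not the parameters used in this paper:

```python
import numpy as np
import cv2

def hsv_augment(img_bgr, h_gain=0.015, s_gain=0.7, v_gain=0.4):
    """Randomly jitter hue, saturation, and value of a BGR image."""
    gains = 1.0 + np.random.uniform(-1.0, 1.0, 3) * np.array([h_gain, s_gain, v_gain])
    h, s, v = cv2.split(cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV))
    h = (h.astype(np.float32) * gains[0]) % 180           # OpenCV hue range is [0, 180)
    s = np.clip(s.astype(np.float32) * gains[1], 0, 255)  # saturation stays in [0, 255]
    v = np.clip(v.astype(np.float32) * gains[2], 0, 255)  # value stays in [0, 255]
    hsv = cv2.merge((h, s, v)).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)
```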

2.3. Related Networks

2.3.1. Object Detection: YOLOv7

YOLOv7 [33] is a recent work in the YOLO (You Only Look Once) series and one of the most advanced object detection models available. YOLOv1 [34] was proposed in 2015 as the debut of the single-stage detection paradigm, which effectively addressed the slow inference of two-stage detection networks while maintaining good detection accuracy. Subsequently, the authors proposed an improved YOLOv2 [35] based on YOLOv1, which used a joint training method of detection and classification, enabling the model to detect more than 9000 classes of objects. Next, YOLOv3 [36] was proposed as an improved version of the previous work. Its most significant feature is the introduction of the residual backbone Darknet-53 and the FPN architecture, which allows object prediction and multiscale fusion at three different scales, further improving the detection accuracy of the YOLO series. Building on this, YOLOv4 and YOLOv5 were introduced, adding many tricks to YOLOv3. YOLOv4 introduced modules such as CSPDarknet53, SPP [37], and PAN [38], which enhanced the receptive field and feature representation of the network, and adopted new tricks, such as Mosaic data enhancement and DropBlock [39] regularization, to further improve detection accuracy. YOLOv5, on the other hand, adopted a large number of design tricks, such as the focus structure, an improved CSP module, adaptive anchor box calculation, and adaptive image scaling, giving the model a qualitative leap in speed and accuracy.
Finally, YOLOv7 came out in 2022 with the network architecture shown in Figure 5. Based on its predecessor, it innovatively proposes the extended ELAN architecture, which can improve the self-learning capability of the network without destroying the original gradient path. In addition, it employs a cascade-based model scaling method so that a model of the appropriate scale can be generated for the actual task to meet the detection requirements. The introduction of these new tricks and architectures further improves the performance and effectiveness of the YOLO series networks. In this paper, we take the YOLOv7 network as a baseline and further enhance it.

2.3.2. Introducing Attention Mechanism into YOLOv7: GAM

The attention mechanism [40] is a signal-processing mechanism that was discovered by some scientists in the 1990s while studying human vision. Practitioners in artificial intelligence have introduced this mechanism into some models with success. Currently, the attention mechanism has become one of the most widely used “components” in the field of deep learning, especially in the field of natural language processing. Models or structures such as BERT [41], GPT [42], Transformer [43], etc., which have received much exposure in the past two years, all use the attention mechanism. It simulates the phenomenon that humans selectively pay attention to some visible information and ignore others to rationalize the limited visual processing resources. Specifically, the information redundancy problem is mainly solved by selecting only a part of the input information or assigning different weights to different parts of the input information.
In the process of exploring the application of attention mechanisms in computer vision, many excellent works have emerged, although they also have some drawbacks. For example, SENet [44] suffers from low efficiency when suppressing unimportant pixels; CBAM [45] performs channel and spatial attention operations sequentially, while BAM [46] performs them in parallel, but both ignore the channel–space interaction and thus lose cross-dimensional information. Considering the importance of cross-dimensional interactions, TAM [47] improves efficiency by exploiting the attention weights between each pair of the three dimensions: channel, spatial width, and spatial height. However, its attention operation is still applied to two dimensions at a time, rather than all three. Therefore, to amplify cross-dimensional interactions, GAM [48] proposes an attention mechanism capable of capturing important features in all three dimensions, which can amplify global dimensional interaction features while reducing information dispersion. The authors use a sequential channel–spatial attention mechanism and redesign the CBAM submodules. The whole process is shown in Figure 6, and the definitions are stated in Equations (3) and (4). Given an input feature map $F_1 \in \mathbb{R}^{C \times H \times W}$, the intermediate state $F_2$ and the output $F_3$ are defined as follows, where $M_C$ and $M_S$ are the channel attention map and the spatial attention map, respectively, and $\otimes$ denotes element-wise multiplication.
$F_2 = M_C(F_1) \otimes F_1$ (3)
$F_3 = M_S(F_2) \otimes F_2$ (4)
1. Channel Attention Sub-module
The channel attention submodule uses a three-dimensional arrangement to retain information across three dimensions. It then uses a two-layer MLP (Multilayer Perceptron) to amplify the cross-dimensional channel–space dependence. (The MLP is an encoder–decoder structure, the same as in BAM, with compression ratio r.) The channel attention submodule is shown in Figure 7.
2. Spatial Attention Sub-module
In the spatial attention submodule, in order to focus on spatial information, two convolutional layers are used for spatial information fusion, and the same reduction ratio r as BAM is used from the channel attention submodule. At the same time, since the maximum pooling operation reduces the use of information and has a negative impact, the pooling operation is deleted to further preserve feature mapping.
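To make the structure concrete, the following is a schematic PyTorch sketch of a GAM block implementing Equations (3) and (4) as described above; the reduction ratio of 4 and the 7 × 7 convolution kernels are assumptions based on common GAM implementations, not values reported by the authors:

```python
import torch
import torch.nn as nn

class GAM(nn.Module):
    """Global Attention Mechanism: sequential channel and spatial attention (Eqs. (3)-(4))."""
    def __init__(self, channels, rate=4):
        super().__init__()
        # Channel attention: 3D permutation followed by a two-layer MLP (encoder-decoder, ratio r)
        self.channel_mlp = nn.Sequential(
            nn.Linear(channels, channels // rate),
            nn.ReLU(inplace=True),
            nn.Linear(channels // rate, channels),
        )
        # Spatial attention: two 7x7 convolutions, no pooling
        self.spatial = nn.Sequential(
            nn.Conv2d(channels, channels // rate, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels // rate),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // rate, channels, kernel_size=7, padding=3),
            nn.BatchNorm2d(channels),
        )

    def forward(self, x):                                   # x: (B, C, H, W)
        b, c, h, w = x.shape
        # --- channel attention, Eq. (3) ---
        perm = x.permute(0, 2, 3, 1).reshape(b, -1, c)      # (B, H*W, C)
        att = self.channel_mlp(perm).reshape(b, h, w, c)
        m_c = torch.sigmoid(att.permute(0, 3, 1, 2))        # (B, C, H, W)
        f2 = x * m_c
        # --- spatial attention, Eq. (4) ---
        m_s = torch.sigmoid(self.spatial(f2))
        return f2 * m_s
```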
In order to increase the precision of object detection, GAM is added to the YOLOv7 network in this study. The modified network structure is shown in Figure 8. There are two primary uses for the GAM module. First, it can lessen information dispersion, allowing the network to focus more on the properties of the target object and enhance detection performance. Second, it can expand the global interactive representation, increase the sufficiency of information exchange between various components, and improve the accuracy of detection. When attention mechanisms are added to the backbone network, some of the original weights of the backbone network are destroyed, which causes errors in the prediction results of the network. In this case, we decided to keep the original network features intact while incorporating the attention mechanism into the enhanced feature network extraction.

2.3.3. Introducing Alpha-IoU into YOLOv7

Object detection is used to locate the object in the image by bounding box regression. In early object detection work, IoU was used as the localization loss [49]. However, when the prediction box does not overlap with the ground truth, the IoU loss will cause the problem of gradient disappearance, resulting in a slower convergence speed and an inaccurate detector. To solve this problem, researchers have proposed several improved IoU-based loss designs, including GIoU [50], DIoU [51], and CIoU [52]. Among them, GIoU introduces a penalty term in the IoU loss to alleviate the gradient disappearance problem, while DIoU and CIoU consider the center point distance and aspect ratio between the prediction box and the ground truth in the penalty term.
Alpha-IoU [53] generalizes the existing IoU-based losses into a new family of power IoU losses, which have a power IoU term and an additional power regularization term. First, the Box-Cox transform is applied to the IoU loss to generalize it into a power IoU loss with power parameter α, and this is then extended to a more general form by adding a power regularization term. In simple terms, a power operation is applied to the IoU and to the penalty term in the loss expression. The calculation is shown in Equation (5).
$\mathcal{L}_{\mathrm{IoU}} = 1 - IoU \;\;\Longrightarrow\;\; \mathcal{L}_{\alpha\text{-}\mathrm{IoU}} = 1 - IoU^{\alpha}$ (5)
The Alpha-IoU loss function can generalize the existing IoU-based losses, including GIoU and DIoU, to a new power IoU loss function to achieve more accurate bounding box regression and object detection. For example, based on GIOU and DIOU, the formula changed to the corresponding Alpha-IoU is shown in Equations (6) and (7).
$\mathcal{L}_{\mathrm{GIoU}} = 1 - IoU + \dfrac{\left| C - B \cup B^{gt} \right|}{\left| C \right|} \;\;\Longrightarrow\;\; \mathcal{L}_{\alpha\text{-}\mathrm{GIoU}} = 1 - IoU^{\alpha} + \left( \dfrac{\left| C - B \cup B^{gt} \right|}{\left| C \right|} \right)^{\alpha}$ (6)
$\mathcal{L}_{\mathrm{DIoU}} = 1 - IoU + \dfrac{\rho^{2}\left( b, b^{gt} \right)}{c^{2}} \;\;\Longrightarrow\;\; \mathcal{L}_{\alpha\text{-}\mathrm{DIoU}} = 1 - IoU^{\alpha} + \dfrac{\rho^{2\alpha}\left( b, b^{gt} \right)}{c^{2\alpha}}$ (7)
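For reference, the basic power term of Equation (5) can be computed as in the sketch below; the GIoU/DIoU penalty terms of Equations (6) and (7) are omitted for brevity, α = 3 is the default suggested in the Alpha-IoU paper, and boxes are assumed to be in (x1, y1, x2, y2) format:

```python
import torch

def alpha_iou_loss(pred, target, alpha=3.0, eps=1e-7):
    """Alpha-IoU loss (Eq. (5)) for batches of boxes given as (x1, y1, x2, y2) tensors."""
    # intersection rectangle
    x1 = torch.max(pred[:, 0], target[:, 0])
    y1 = torch.max(pred[:, 1], target[:, 1])
    x2 = torch.min(pred[:, 2], target[:, 2])
    y2 = torch.min(pred[:, 3], target[:, 3])
    inter = (x2 - x1).clamp(0) * (y2 - y1).clamp(0)
    # union area
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + eps
    iou = inter / union
    return 1.0 - iou.pow(alpha)  # L_alpha-IoU = 1 - IoU^alpha
```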
With the introduction of Alpha-IoU, the YOLOv7 network in this paper gains clear performance benefits over the plain IoU-based loss, and the detector gains more flexibility, since adjusting α allows different levels of bounding box regression accuracy. Additionally, the loss is more robust to noisy bounding boxes in our datasets, strengthening the model’s resistance to complex situations.

2.3.4. Multi-Object Tracking: DeepSORT

The multi-object tracking task is to detect multiple objects in a video and assign each of them an ID for trajectory tracking, without prior knowledge of the number of objects; each object keeps a distinct ID to enable subsequent trajectory prediction, precise searching, and other downstream tasks. DeepSORT [54] is among the most popular algorithms for multi-object tracking and is an improved algorithm based on the ideas of SORT [55].
The SORT algorithm uses a simple Kalman filter [56] to process the correlation of frame-by-frame data and a Hungarian algorithm for the correlation metric, a simple algorithm that achieves good performance at high frame rates. However, as SORT ignores the surface features of the object being inspected, it is only accurate when the uncertainty in the estimation of the object’s state is low. In DeepSORT, appearance information is added, the ReID domain model is borrowed to extract appearance features, the number of ID switches is reduced, a more reliable metric is used instead of the association metric, and a CNN network is used to extract features to increase the robustness of the network to misses and obstacles.
The tracking scene of the DeepSORT algorithm is defined on an eight-dimensional state space $(u, v, \gamma, h, \dot{x}, \dot{y}, \dot{\gamma}, \dot{h})$, where $(u, v)$ are the coordinates of the detection box center, $\gamma$ is the aspect ratio, $h$ is the height of the detection box, and the remaining components are their respective velocities in image coordinates. A Kalman filter with a constant-velocity motion model and a linear observation model is then used, with the observed variables $(u, v, \gamma, h)$, to predict and update the state. For each track $k$, a counter $a_k$ records the number of frames since the last successful measurement association; it is incremented at each Kalman prediction step and reset to 0 when the track is associated with a new detection. A lifetime threshold $A_{max}$ is also set: an object that has not been matched for more than $A_{max}$ frames is considered to have left the tracking area and is removed from the track list. Since each newly detected object may or may not correspond to a genuine new trajectory, directly treating every new detection as a track would frequently produce false tracks. DeepSORT therefore marks each new detection as “tentative” for its first few frames (usually three); if it is matched in the next three consecutive frames, it is confirmed as a new track, otherwise it is marked as “deleted” and is no longer considered a track. Figure 9 shows the bird tracking process.
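The track lifecycle described above can be summarized in a small state machine. The sketch below is a simplified illustration only: the Kalman filter, appearance embedding, and matching cascade are omitted, and the defaults n_init = 3 and max_age = 30 frames are assumptions based on common DeepSORT configurations:

```python
from dataclasses import dataclass

TENTATIVE, CONFIRMED, DELETED = "tentative", "confirmed", "deleted"

@dataclass
class Track:
    """Minimal track-state bookkeeping mirroring the lifecycle described above."""
    track_id: int
    species: str
    hits: int = 1                  # consecutive matched frames
    time_since_update: int = 0     # frames since the last successful association (a_k)
    state: str = TENTATIVE

    def predict(self):
        # called once per Kalman prediction step
        self.time_since_update += 1

    def update(self, n_init=3):
        # called when a detection is associated with this track
        self.hits += 1
        self.time_since_update = 0
        if self.state == TENTATIVE and self.hits >= n_init:
            self.state = CONFIRMED

    def mark_missed(self, max_age=30):
        # delete tentative tracks immediately; drop others after A_max missed frames
        if self.state == TENTATIVE or self.time_since_update > max_age:
            self.state = DELETED
```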
The original DeepSORT’s output is not intuitive enough to display the detected species. To improve the presentation, the source code is modified in this paper to add the display of species, making the output more intuitive. Figure 10 shows the results of the bird tracks.

2.4. Monitoring Methods

This paper proposes a computer vision-based bird monitoring method for detecting birds in nature reserves and counting them by species. By combining the information from the monitoring points, bird surveyors can obtain information on the distribution of bird populations and migration routes and thus develop more effective ways to protect birds. The algorithm is divided into three steps: first, an object detection algorithm detects the species of birds and locates the objects; then, a multi-object tracking algorithm tracks the birds, assigning each bird a unique ID for subsequent data processing and analysis; and finally, the birds are counted by species using a counting method that combines the species, location, and ID information. To better illustrate the methodology of this paper, further details are given below.

2.4.1. Different Labelling Methods

In object detection, if only the bird’s body is annotated, there may be more overlap between the annotation box and the background, introducing background noise that affects the accuracy and reliability of detection. A bird’s head, by contrast, has distinctive structural features that can be used to identify birds accurately. In complex scenes, such as those with object overlap and occlusion, annotating the bird’s head can help to better distinguish different birds and reduce the probability of misjudgment. Therefore, in this paper, not only the bird’s body but also the bird’s head is labeled, and the two labeling methods form a set of comparative experiments to find the optimal model.

2.4.2. Obtaining the Best Algorithmic Model

The dominant multi-object tracking method is based on detection to track. Therefore, the method’s effectiveness depends on object detection. To achieve better results, we need an excellent object detection model.
First, we selected a group of current mainstream object detection networks to form a set of comparative experiments, selected the network with the best results, and continued to improve the object detection part to obtain the best model, with the following ideas for improvement. (1) Data augmentation: expanding the dataset and increasing the diversity of the data by rotating, flipping, cropping, and scaling the data to improve the robustness and generalization of the model. (2) Algorithm optimization: optimization of the object detection algorithm, e.g., improving the loss function, network structure, optimizer, etc., to improve the training efficiency and detection accuracy of the model. (3) Feature fusion: using multiple feature maps for object detection and fusing features at different levels and scales to improve the model’s ability to perceive and recognize objects.
The above measures will improve the accuracy and precision of bird monitoring, providing more accurate data to support bird research and conservation work.

2.4.3. Multi-Object Tracking

Multiple detected objects are given unique identifiers, and trajectory tracking is carried out. Each object has a different ID to enable subsequent counting, accurate searching, and so on.

2.4.4. Implementation of the Counting Area

To count the different species of birds separately and more accurately, a new method is proposed in this paper. The method uses an area that covers 95% of the image for counting, which ensures a more accurate count of the birds. Specifically, when a bird enters the counting area, we read information about the bird’s species and ID, which is fed back using the computer vision method described above. If the ID is a first occurrence, we record the ID and increase the number of birds in that species; if the ID has already been recorded, we leave the number of birds recorded for that species unchanged. In this way, we can more accurately count birds by species and, therefore, obtain a more accurate count.
For the logical implementation of the counting area, this paper first creates a matrix of the same size as the image and fills it with 1. Next, the values inside the counting area are set to 0, and the positions corresponding to the detected birds’ locations, passed back from the detector and tracker, are set to 1 to mark the presence of a bird in those cells. Finally, whether to count is judged according to whether a 1 appears inside the counting area. The counting method realized by this logic can effectively record the number of birds. The overall logical implementation of the counting method is shown in Figure 11, which clearly shows the counting process.
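A simplified sketch of this area-based counting logic is given below; it checks whether a tracked bird's center falls inside the counting mask and counts each track ID only once per species. The mask margin and the tuple format of the tracker output are illustrative assumptions, not the authors' exact implementation:

```python
import numpy as np

def make_count_mask(height, width, margin=0.05):
    """Matrix of ones with the central counting area (~95% of each dimension) set to 0."""
    mask = np.ones((height, width), dtype=np.uint8)
    dy, dx = int(height * margin / 2), int(width * margin / 2)
    mask[dy:height - dy, dx:width - dx] = 0
    return mask

def count_by_species(tracks, mask, counted_ids, counts):
    """tracks: iterable of (track_id, species, cx, cy) produced by the tracker per frame."""
    for track_id, species, cx, cy in tracks:
        inside = mask[int(cy), int(cx)] == 0           # bird centre lies in the counting area
        if inside and track_id not in counted_ids:     # first appearance of this ID
            counted_ids.add(track_id)
            counts[species] = counts.get(species, 0) + 1
    return counts
```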

3. Results

3.1. Experimental Environment

Table 3 below shows the basic equipment information of the software and hardware used in this paper.

3.2. Training Parameters

Table 4 shows the training parameters of the training process used in the experiment.

3.3. Evaluation Metrics

In this paper, we use mainstream evaluation metrics such as precision, recall, F1 score, mAP, and FPS to evaluate the effect of the model. Before introducing each evaluation metric, we briefly present the confusion matrix, whose parameters are defined in Table 5.
Precision indicates the proportion of samples that the model correctly identifies as belonging to positive classes. It reflects the ability of the model to distinguish positive class samples. Equation (8) listed below calculates precision.
$\text{precision} = \dfrac{TP}{TP + FP}$ (8)
Recall represents the ratio of the number of samples that the model correctly identified as positive classes to the total number of positive samples. Equation (9) listed below calculates recall.
$\text{recall} = \dfrac{TP}{TP + FN}$ (9)
The F1 score is a measure of the classification problem. Precision and recall are contradictory metrics. When the value of precision is high, the value of recall is often low. Therefore, we need to consider both metrics together to evaluate the effect of the model, and the F1 score is the harmonized average of precision and recall. Equation (10) calculates the F1 score, where P represents precision and R represents recall.
$F1 = \dfrac{2PR}{P + R} = \dfrac{2TP}{2TP + FP + FN}$ (10)
Accuracy is the most commonly used classification performance metric. It expresses the overall correctness of the model, that is, the ratio of the number of samples that the model classifies correctly (both positive and negative) to the total number of samples. Accuracy can be calculated using Equation (11).
$\text{accuracy} = \dfrac{TP + TN}{TP + FN + FP + TN}$ (11)
IoU (Intersection over Union) is often used to measure the degree of overlap between the predicted box and the ground truth box to evaluate the accuracy of the target detection algorithm. In this paper, IoU is used to measure the ratio of intersection to union set between the bounding boxes of predicted birds and the bounding boxes of real birds. Equation (12) below calculates the IoU.
$IoU = \dfrac{\left| A \cap B \right|}{\left| A \cup B \right|}$ (12)
AP (Average Precision) refers to the average precision of a single species in a multi-species prediction task and measures the effect of the model on that species. It is computed as the area enclosed by the P–R curve and the coordinate axes, where the P–R curve is plotted according to the precision and recall of each species. The AP can be calculated using Equation (13).
$AP = \int_{0}^{1} p(t)\,dt$ (13)
mAP (Mean Average Precision) refers to the average of the AP across all species and is usually used as the final indicator of performance, measuring the effectiveness of the model on all species. The mAP is calculated by Equation (14), where S is the total number of species and the numerator is the sum of the AP over all species.
$mAP = \dfrac{\sum_{j=1}^{S} AP_j}{S}$ (14)
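As a small worked illustration of Equations (8)–(14), the following sketch computes these metrics from confusion-matrix counts and per-species P–R curves. It is a simplified approximation; real mAP implementations additionally handle IoU thresholds and interpolation of the P–R curve:

```python
import numpy as np

def detection_metrics(tp, fp, fn, tn):
    """Precision, recall, F1, and accuracy from confusion-matrix counts (Eqs. (8)-(11))."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    return precision, recall, f1, accuracy

def average_precision(precisions, recalls):
    """Area under the P-R curve (Eq. (13)), approximated by numerical integration."""
    order = np.argsort(recalls)
    return float(np.trapz(np.asarray(precisions)[order], np.asarray(recalls)[order]))

def mean_average_precision(ap_per_species):
    """mAP over all S species (Eq. (14))."""
    return float(np.mean(ap_per_species))
```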
FPS (Frames Per Second) is the frame rate per second. Speed is another important metric for an object detection algorithm, and FPS measures the number of images that the network can process per second. The higher the FPS, the better the timeliness.

3.4. Experimental Results

3.4.1. Comparison Experiments of the Most Advanced Methods for Object Detection under Different Labeling Methods

The method for bird monitoring proposed in this paper needs to have a good-performance object detection network for bird species detection. At the same time, it is necessary to use the multi-target tracking network to assign a unique ID to each object to assist the classification count. The multi-target tracking adopts the tracking-by-detection method, and its effect depends on the effectiveness of object detection. Therefore, we need to investigate a high-precision and high-performance object detection network.
We chose the mainstream object detection networks in recent years for comparison (Faster-RCNN [57], EfficientDet [58], CenterNet [59], SSD [60], YOLOv4, YOLOv5, YOLOv7, YOLOv8). Considering the occlusion between birds and the problem of image noise, we not only use the dataset labeled on the bird’s body, but also use the dataset labeled only on the bird’s head to train and test the object detection network, and then combine the experimental results to compare their F1 score, mAP@0.5, FPS and other evaluation indicators. Figure 12 shows the variation of mAP@0.5 for each target detection network during the training period. Table 6 and Table 7 show the comparative experimental results of the object detection networks trained using the datasets of the two labeling methods, respectively.
The experimental results show that the YOLOv7 network trained with a dataset labeling the entire body of the bird is the most effective method. Therefore, we choose YOLOv7 as the research object of this paper and use the dataset annotating the whole body of the bird for subsequent experiments.

3.4.2. Ablation Experiment on Data Enhancement

We tried several training tricks based on the original YOLOv7 model, divided the experimental groups by combining the tricks in different ways, and obtained the experimental results shown in Table 8 below through training and testing. The results of this ablation experiment show that when HSV, Mosaic, and mixup data augmentation are used simultaneously, the method of group 12 achieves the best experimental effect: evaluation indexes such as mAP and F1 score improve the most, and the FPS decreases but still meets the real-time requirement.

3.4.3. Ablation Experiment of Introducing a Series of Improved Strategies for YOLOv7

We reduce information dispersion and amplify the global interaction representation in this paper by adding three GAM modules to the head side of YOLOv7 and replacing the loss function with Alpha-IoU loss to achieve more accurate bounding box regression and object detection. We conducted ablation experiments to validate the method’s effectiveness, and the results show that our method improved the performance of the original YOLOv7 network. Figure 13 shows the variation of mAP@0.5 for each object detection network during the training period. Table 9 shows the experimental results of the ablation experiments.
Commonly used deep learning networks (such as CNNs) are generally considered to be black boxes and are not very interpretable. To help better understand and explain the principle and decision-making process of our improvement work, we introduced Grad-CAM [61] in the ablation experiment. Grad-CAM generates class activation maps, which help us analyze the areas of the image the network attends to for a given species; from these attention areas we can, in turn, judge whether the network has learned the correct features or information. The heat map drawn by Grad-CAM is shown in Figure 14. From the figure, it can be seen that the improved method proposed in this paper better mines the structural characteristics of the birds and is less affected by image noise, confirming that the method is more effective and reasonable.

3.4.4. Manual Verification of Algorithm Effectiveness

The monitoring method proposed in this paper uses YOLOv7 and DeepSORT networks to detect and track birds and combines the self-designed counting method to classify and count birds. The overall process is shown in Figure 15, and Figure 16 shows the interface design for the bird monitoring system that we proposed based on actual monitoring results.
We manually counted the species and numbers of birds in the counting area of the video frames to simulate a real-world situation and compared the model’s counting results against these manual counts. To compare performance across periods, the sampling interval should be neither excessively long nor excessively short; in this experiment, counting results were compared every 15 s. Table 10 presents the experimental outcomes.
The results of the verification experiment indicate that the model’s bird counts are consistent with the actual values. Our method is efficient and feasible; it can assist personnel in understanding the distribution of bird populations and in formulating targeted bird protection strategies, thus significantly improving the efficiency of bird protection.

4. Discussion

In this paper, we have developed an integrated framework based on computer vision technology for real-time automatic classification and the counting of birds. By using automated monitoring techniques, such as the computer vision monitoring method proposed in this study or sound sensors, it is possible to collect information on birds in remote areas over extended periods of time, thereby increasing the likelihood of discovering rare species [62,63]. Researchers can combine the location information of monitoring sites to collect information on the distribution of bird populations and migration routes in order to develop more effective bird conservation plans. This method plays a certain role in the field of bird conservation, improving the efficiency and accuracy of bird monitoring and making bird conservation more effective [64].
The traditional method of using sample points requires human observers to observe at various sample points within a certain space, which has obvious limitations. For example, in studies of site occupancy or habitat preference [65,66], these limitations are particularly evident in species with low detectability. In such cases, increasing the duration of observations (such as by using automated monitoring techniques) may improve detectability [67,68,69]. Additionally, birds typically inhabit unique and often remote environments, such as dense forests and high-altitude mountains [70], and their small size and large numbers make close observation challenging for researchers. Due to birds’ sensitivity to human disturbance, even slight disturbances can cause significant behavioral reactions and potentially have negative impacts on their survival [71]. In this study, we used deep learning-based techniques for bird species classification and tracking and developed an automated method for classification and counting that can help address the above problems.
Budka et al. used a scientometric analysis to examine publication and research trends [20] and found that, in recent years, most publications related to bird monitoring or classification involve “deep learning”. This indicates that applying deep learning to bird monitoring is a rapidly developing research topic, although the overall research quantity is still limited, which confirms the necessity of using deep learning for bird monitoring. In this paper, we experimented with eight models: Faster-RCNN, EfficientDet, CenterNet, SSD, YOLOv4, YOLOv5, YOLOv7, and YOLOv8, to investigate suitable computer vision methods. Among the eight models, YOLOv7 achieved the best performance, and we further optimized this algorithm. We have not found scholars who use similar methods for bird monitoring, so we can only compare our model with articles that use different methods for similar tasks. Our proposed model achieved an average precision exceeding the 71% [72], 88.1% [73], and 88.7% [74] reported by those methods, which validates its effectiveness in identifying bird categories. Combined with our tracking and counting method, this opens up new perspectives for bird monitoring.
However, our method still has certain limitations, such as the need to improve recognition accuracy in complex backgrounds and lighting conditions. Future research could further explore how to optimize deep learning models to address these challenges, as well as integrate other auxiliary technologies (such as drones, satellite remote sensing, and bioacoustics) [13,14,15,16,17,18,19,20,21,22,23] into the bird monitoring system. Additionally, we could add more negative samples during the training phase and use image datasets generated by GAN [75] for data augmentation to improve monitoring effectiveness.

5. Conclusions

It is worth mentioning that this study has constructed a bird species detection and tracking dataset. The dataset includes 3737 images of bird flocks and 11,139 images of whole individual birds, with manual annotation of the whole body and the head of each bird. Such a dataset can provide data support for future bird conservation research. This study proposes a bird monitoring method based on computer vision techniques, which uses object detection and a multi-object tracking network to detect and track birds by species, and then combines the information from detection and tracking to count birds by species using an area counting method. We also improved the object detection part by taking YOLOv7, a current mainstream object detection network, as the baseline, fusing GAM into the head side of YOLOv7, and changing the loss to Alpha-IoU loss to obtain more accurate bounding box regression and object detection. The improvement resulted in a mAP of 95.1%, which is, to our knowledge, among the best results reported for this task to date. Our experiments have shown that our method can effectively monitor birds and obtain their population distribution, which meets the requirements for practical applications.
In the future, we will continue to optimize the method to achieve better results in more scenarios. For example, we could try to train the model using more datasets or use more advanced tricks to analyze the sounds and behavior of birds. We believe that with these improvements our method will better serve the cause of bird conservation and provide more help for ornithological research.

Author Contributions

Conceptualization, X.C. and H.P. (Hongli Pu); methodology, X.C. and Y.H.; software, X.C. and H.P. (Hongli Pu); validation, M.L., H.P. (Hongli Pu), D.Z., J.C. and Y.H.; formal analysis, X.C.; investigation, M.L., D.Z. and J.C.; resources, H.P. (Haibo Pu); data curation, H.P. (Hongli Pu); writing—original draft preparation, X.C.; writing—review and editing, X.C., M.L. and Y.H.; visualization, X.C. and Y.H.; supervision, H.P. (Haibo Pu); project administration, H.P. (Haibo Pu). All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Student Innovation Training Program of Sichuan Agricultural University (Grant No. 202210626050).

Institutional Review Board Statement

The animal study protocol was approved by the Institutional Animal Care and Use Committee of Sichuan Agricultural University (protocol code 20230075, 6 March 2023).

Informed Consent Statement

Not applicable.

Data Availability Statement

The data are available online at https://www.kaggle.com/datasets/dieselcx/birds-chenxian, accessed on 12 May 2023.

Acknowledgments

Thanks to Haibo Pu for his help in the discussion of ideas.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Almond, R.E.A.; Grooten, M.; Peterson, T. Living Planet Report 2020—Bending the Curve of Biodiversity Loss; World Wildlife Fund: Gland, Switzerland, 2020. [Google Scholar]
  2. Murali, G.; de Oliveira Caetano, G.H.; Barki, G.; Meiri, S.; Roll, U. Emphasizing Declining Populations in the Living Planet Report. Nature 2022, 601, E20–E24. [Google Scholar] [CrossRef] [PubMed]
  3. IUCN. The IUCN Red List of Threatened Species. Version 2022-2. 2022. Available online: https://www.iucnredlist.org (accessed on 17 April 2023).
  4. Sun, G.; Zeng, L.; Qian, F.; Jiang, Z. Current Status and Development Trend of Bird Diversity Monitoring Technology. Geomat. World 2022, 29, 26–29. Available online: https://chrk.cbpt.cnki.net/WKE2/WebPublication/paperDigest.aspx?paperID=8cfe8d55-8931-42f0-8cd9-0fe1011305e5# (accessed on 12 May 2023).
  5. Cui, P.; Xu, H.; Ding, H.; Wu, J.; Cao, M.; Chen, L. Status Quo, Problems and Countermeasures of Bird Monitoring in China. J. Ecol. Rural. Environ. 2013, 29, 403–408. [Google Scholar]
  6. Pugesek, B.H.; Stehn, T.V. The Utility of Census or Survey for Monitoring Whooping Cranes in Winter; University of Nebraska-Lincoln: Lincoln, NE, USA, 2016. [Google Scholar]
  7. Bibby, C.J.; Burgess, N.D.; Hillis, D.M.; Hill, D.A.; Mustoe, S. Bird Census Techniques; Elsevier: Amsterdam, The Netherlands, 2000. [Google Scholar]
  8. Gregory, R.D.; Gibbons, D.W.; Donald, P.F. Bird census and survey techniques. Bird Ecol. Conserv. 2004, 17–56. [Google Scholar] [CrossRef]
  9. Pacifici, K.; Simons, T.R.; Pollock, K.H. Effects of vegetation and background noise on the detection process in auditory avian point-count surveys. Auk 2008, 125, 600–607. [Google Scholar] [CrossRef]
  10. Zheng, F.; Guo, X.; Zhai, R.; Xian, S.; Huang, F. Analysis of the status and protection measures of birds in Xinying Mangrove National Wetland Park. Guizhou Sci. 2022, 40, 62–66. [Google Scholar]
  11. Sun, R.; Ma, H.; Yu, L.; Xu, Z.; Chen, G.; Pan, T.; Zhou, W.; Yan, L.; Sun, Z.; Peng, Z.; et al. A preliminary report on bird diversity and distribution in Dabie Mountains. J. Anhui Univ. 2021, 45, 85–102. [Google Scholar]
  12. Liu, J. Design of Bird Image Recognition System Based on DNN. Agric. Equip. Veh. Eng. 2019, 57, 113–116. [Google Scholar]
  13. Chabot, D.; Francis, C.M. Computer-Automated Bird Detection and Counts in High-Resolution Aerial Images: A Review. J. Field Ornithol. 2016, 87, 343–359. [Google Scholar] [CrossRef]
  14. Weissensteiner, M.H.; Poelstra, J.W.; Wolf, J.B.W. Low-Budget Ready-to-Fly Unmanned Aerial Vehicles: An Effective Tool for Evaluating the Nesting Status of Canopy-Breeding Bird Species. J. Avian Biol. 2015, 46, 425–430. [Google Scholar] [CrossRef]
  15. Chabot, D.; Craik, S.R.; Bird, D.M. Population Census of a Large Common Tern Colony with a Small Unmanned Aircraft. PLoS ONE 2015, 10, e0122588. [Google Scholar] [CrossRef]
  16. McClelland, G.; Bond, A.; Sardana, A.; Glass, T. Rapid Population Estimate of a Surface-Nesting Seabird on a Remote Island Using a Low-Cost Unmanned Aerial Vehicle. Mar. Ornithol. 2016, 44, 215–220. [Google Scholar]
  17. Hodgson, J.C.; Baylis, S.M.; Mott, R.; Herrod, A.; Clarke, R.H. Precision Wildlife Monitoring Using Unmanned Aerial Vehicles. Sci. Rep. 2016, 6, 22574. [Google Scholar] [CrossRef]
  18. Sardà-Palomera, F.; Bota, G.; Padilla, N.; Brotons, L.; Sardà, F. Unmanned Aircraft Systems to Unravel Spatial and Temporal Factors Affecting Dynamics of Colony Formation and Nesting Success in Birds. J. Avian Biol. 2017, 48, 1273–1280. [Google Scholar] [CrossRef]
  19. Wilson, A.M.; Barr, J.; Zagorski, M. The feasibility of counting songbirds using unmanned aerial vehicles. AUK A Q. J. Ornithol. 2017, 134, 350–362. [Google Scholar] [CrossRef]
  20. Xie, J.; Zhu, M. Acoustic Classification of Bird Species Using an Early Fusion of Deep Features. Birds 2023, 4, 11. [Google Scholar] [CrossRef]
  21. Bateman, H.L.; Riddle, S.B.; Cubley, E.S. Using Bioacoustics to Examine Vocal Phenology of Neotropical Migratory Birds on a Wild and Scenic River in Arizona. Birds 2021, 2, 19. [Google Scholar] [CrossRef]
  22. Yip, D.; Leston, L.; Bayne, E.; Sólymos, P.; Grover, A. Experimentally derived detection distances from audio recordings and human observers enable integrated analysis of point count data. Avian Conserv. Ecol. 2017, 12, 11. [Google Scholar] [CrossRef]
  23. Budka, M.; Kułaga, K.; Osiejuk, T.S. Evaluation of Accuracy and Precision of the Sound-Recorder-Based Point-Counts Applied in Forests and Open Areas in Two Locations Situated in a Temperate and Tropical Regions. Birds 2021, 2, 26. [Google Scholar] [CrossRef]
  24. Zhang, C.; Lu, Y. Study on Artificial Intelligence: The State of the Art and Future Prospects. J. Ind. Inf. Integr. 2021, 23, 100224. [Google Scholar] [CrossRef]
  25. Pan, Y. Heading toward Artificial Intelligence 2.0. Engineering 2016, 2, 409–413. [Google Scholar] [CrossRef]
  26. Berger-Wolf, T.Y.; Rubenstein, D.I.; Stewart, C.V.; Holmberg, J.A.; Parham, J.; Menon, S.; Crall, J.; Van Oast, J.; Kiciman, E.; Joppa, L. Wildbook: Crowdsourcing, Computer Vision, and Data Science for Conservation. arXiv 2017, arXiv:1710.08880. [Google Scholar]
  27. Tuia, D.; Kellenberger, B.; Beery, S.; Costelloe, B.R.; Zuffi, S.; Risse, B.; Mathis, A.; Mathis, M.W.; van Langevelde, F.; Burghardt, T.; et al. Perspectives in Machine Learning for Wildlife Conservation. Nat. Commun. 2022, 13, 792. [Google Scholar] [CrossRef] [PubMed]
  28. Niemi, J.; Tanttu, J.T. Deep Learning Case Study for Automatic Bird Identification. Appl. Sci. 2018, 8, 2089. [Google Scholar] [CrossRef]
  29. Ferreira, A.C.; Silva, L.R.; Renna, F.; Brandl, H.B.; Renoult, J.P.; Farine, D.R.; Covas, R.; Doutrelant, C. Deep Learning-Based Methods for Individual Recognition in Small Birds. Methods Ecol. Evol. 2020, 11, 1072–1085. [Google Scholar] [CrossRef]
  30. Lyu, X.; Chen, S. Location and identification of red-crowned cranes based on convolutional neural network. Electron. Meas. Technol. 2020, 43, 104–108. [Google Scholar]
  31. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  32. Zhang, H.; Cisse, M.; Dauphin, Y.N.; Lopez-Paz, D. Mixup: Beyond empirical risk minimization. arXiv 2017, arXiv:1710.09412. [Google Scholar]
  33. Wang, C.Y.; Bochkovskiy, A.; Liao, H.Y.M. YOLOv7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. arXiv 2022, arXiv:2207.02696. [Google Scholar]
  34. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 26 June–1 July 2016; pp. 779–788. [Google Scholar]
  35. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  36. Redmon, J.; Farhadi, A. Yolov3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  37. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  38. Liu, S.; Qi, L.; Qin, H.; Shi, J.; Jia, J. Path Aggregation Network for Instance Segmentation. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 8759–8768. [Google Scholar]
  39. Ghiasi, G.; Lin, T.Y.; Le, Q.V. Dropblock: A regularization method for convolutional networks. Adv. Neural Inf. Process. Syst. 2018. [Google Scholar] [CrossRef]
  40. Wickens, C. Attention: Theory, Principles, Models and Applications. Int. J. Hum. Comput. Interact. 2021, 37, 403–417. [Google Scholar] [CrossRef]
  41. Bidirectional Encoder Representations from Transformers—An Overview ScienceDirect Topics. Available online: https://www.sciencedirect.com/topics/computer-science/bidirectional-encoder-representations-from-transformers (accessed on 4 March 2023).
  42. Radford, A.; Narasimhan, K.; Salimans, T.; Sutskever, I. Improving Language Understanding by Generative Pre-Training; OpenAI: San Francisco, CA, USA, 2018. [Google Scholar]
  43. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is all you need. Adv. Neural Inf. Process. Syst. 2017, 30, 5998–6008. [Google Scholar]
  44. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  45. Woo, S.; Park, J.; Lee, J.Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 9–14 September 2018; pp. 3–19. [Google Scholar]
  46. Park, J.; Woo, S.; Lee, J.Y.; Kweon, I.S. Bam: Bottleneck attention module. arXiv 2018, arXiv:1807.06514. [Google Scholar]
  47. Liu, Z.; Wang, L.; Wu, W.; Qian, C.; Lu, T. Tam: Temporal adaptive module for video recognition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 11–17 October 2021; pp. 13708–13718. [Google Scholar]
  48. Liu, Y.; Shao, Z.; Hoffmann, N. Global attention mechanism: Retain information to enhance channel-spatial interactions. arXiv 2021, arXiv:2112.05561. [Google Scholar]
  49. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T. Unitbox: An advanced object detection network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016; pp. 516–520. [Google Scholar]
  50. Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Reid, I.; Savarese, S. Generalized intersection over union: A metric and a loss for bounding box regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019; pp. 658–666. [Google Scholar]
  51. Zheng, Z.; Wang, P.; Liu, W.; Li, J.; Ye, R.; Ren, D. Distance-IoU loss: Faster and better learning for bounding box regression. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12993–13000. [Google Scholar]
  52. Zheng, Z.; Wang, P.; Ren, D.; Liu, W.; Ye, R.; Hu, Q.; Zuo, W. Enhancing geometric factors in model learning and inference for object detection and instance segmentation. IEEE Trans. Cybern. 2021, 52, 8574–8586. [Google Scholar] [CrossRef]
  53. He, J.; Erfani, S.; Ma, X.; Bailey, J.; Chi, Y.; Hua, X.S. α-IoU: A Family of Power Intersection over Union Losses for Bounding Box Regression. Adv. Neural Inf. Process. Syst. 2021, 34, 20230–20242. [Google Scholar]
  54. Wojke, N.; Bewley, A.; Paulus, D. Simple online and realtime tracking with a deep association metric. In Proceedings of the 2017 IEEE International Conference on Image Processing (ICIP), Beijing, China, 17–20 September 2017; IEEE: Piscataway, NJ, USA, 2017; pp. 3645–3649. [Google Scholar]
  55. Bewley, A.; Ge, Z.; Ott, L.; Ramos, F.; Upcroft, B. Simple online and realtime tracking. In Proceedings of the 2016 IEEE International Conference on Image Processing (ICIP), Phoenix, AZ, USA, 25–28 September 2016; IEEE: Piscataway, NJ, USA, 2016; pp. 3464–3468. [Google Scholar]
  56. Kalman, R.E. A New Approach to Linear Filtering and Prediction Problems. In Control Theory: Twenty-Five Seminal Papers (1932–1981); Başar, T., Ed.; IEEE: Piscataway, NJ, USA, 2001; pp. 167–179. ISBN 978-0-470-54433-4. [Google Scholar]
  57. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015. [Google Scholar] [CrossRef]
  58. Tan, M.; Pang, R.; Le, Q.V. Efficientdet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 10781–10790. [Google Scholar]
  59. Duan, K.; Bai, S.; Xie, L.; Qi, H.; Huang, Q.; Tian, Q. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 6569–6578. [Google Scholar]
  60. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Springer International Publishing: Cham, Switzerland, 2016; pp. 21–37. [Google Scholar]
  61. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2020, 128, 336–359. [Google Scholar] [CrossRef]
  62. Dixon, A.P.; Baker, M.E.; Ellis, E.C. Agricultural landscape composition linked with acoustic measures of avian diversity. Land 2020, 9, 145. [Google Scholar] [CrossRef]
  63. Farina, A.; James, P.; Bobryk, C.; Pieretti, N.; Lattanzi, E.; McWilliam, J. Low cost (audio) recording (LCR) for advancing soundscape ecology towards the conservation of sonic complexity and biodiversity in natural and urban landscapes. Urban Ecosyst. 2014, 17, 923–944. [Google Scholar] [CrossRef]
  64. Nichols, J.D.; Williams, B.K. Monitoring for conservation. Trends Ecol. Evol. 2006, 21, 668–673. [Google Scholar] [CrossRef] [PubMed]
  65. MacKenzie, D.; Nichols, J.; Lachman, G.; Droege, S.; Royle, J.; Langtimm, C. Estimating site occupancy rates when detection probabilities are less than one. Ecology 2002, 83, 2248–2255. [Google Scholar] [CrossRef]
  66. Gu, W.; Swihart, R.K. Absent or undetected? Effects of non-detection of species occurrence on wildlife–habitat models. Biol. Conserv. 2003, 116, 195–203. [Google Scholar] [CrossRef]
  67. Sliwinski, M.; Powell, L.; Koper, N.; Giovanni, M.; Schacht, W. Research design considerations to ensure detection of all species in an avian community. Methods Ecol. Evol. 2016, 7, 456–462. [Google Scholar] [CrossRef]
  68. Dettmers, R.; Buehler, D.; Bartlett, J.; Klaus, N. Influence of point count length and repeated visits on habitat model performance. J. Wild Manag. 1999, 63, 815–823. [Google Scholar] [CrossRef]
  69. Budka, M.; Czyż, M.; Skierczyńska, A.; Skierczyński, M.; Osiejuk, T.S. Duration of survey changes interpretation of habitat preferences: An example of an endemic tropical songbird, the Bangwa Forest Warbler. Ostrich 2020, 91, 195–203. [Google Scholar] [CrossRef]
  70. Johnson, M.D. Measuring habitat quality: A review. Condor 2007, 109, 489–504. [Google Scholar] [CrossRef]
  71. Battin, J. When Good Animals Love Bad Habitats: Ecological Traps and the Conservation of Animal Populations. Conserv. Biol. 2004, 18, 1482–1491. [Google Scholar] [CrossRef]
  72. Zottesso, R.H.; Costa, Y.M.; Bertolini, D.; Oliveira, L.E. Bird species identification using spectrogram and dissimilarity approach. Ecol. Inform. 2018, 48, 187–197. [Google Scholar] [CrossRef]
  73. Zheng, H.; Fu, J.; Zha, Z.J.; Luo, J. Learning deep bilinear transformation for fine-grained image representation. Adv. Neural Inf. Process. Syst. 2019. [Google Scholar] [CrossRef]
  74. Ji, X.; Jiang, K.; Xie, J. LBP-based bird sound classification using improved feature selection algorithm. Int. J. Speech Technol. 2021, 24, 1033–1045. [Google Scholar] [CrossRef]
  75. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial networks. Commun. ACM 2020, 63, 139–144. [Google Scholar] [CrossRef]
Figure 1. A preview image of the dataset. (a) The preview shows the dataset for object detection; (b) the preview shows the dataset for the multi-object tracking feature extraction network.
Figure 2. Mosaic data augmentation. First, a batch of images is randomly drawn from the bird dataset. Then, four images are randomly selected, randomly scaled, randomly arranged, and spliced into a new composite image, and this operation is repeated batch-size times. Finally, the neural network is trained on the Mosaic-augmented data.
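To make the splicing step in Figure 2 concrete, below is a minimal Python sketch of Mosaic augmentation. It is an illustration under stated assumptions rather than the authors' implementation: images are NumPy arrays, the composite uses a 2 × 2 layout around a random centre on a 640 × 640 canvas, resizing is nearest-neighbour, and the remapping of bounding-box labels is omitted.

```python
# Minimal Mosaic-augmentation sketch (illustrative, not the authors' code).
import random
import numpy as np


def mosaic(images, out_size=640):
    """Splice four randomly chosen source images into one mosaic image."""
    assert len(images) >= 4, "Mosaic needs at least four source images"
    canvas = np.zeros((out_size, out_size, 3), dtype=np.uint8)
    # Random centre point that splits the canvas into four unequal cells.
    cx = random.randint(out_size // 4, 3 * out_size // 4)
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    cells = [(0, 0, cx, cy), (cx, 0, out_size, cy),
             (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for (x1, y1, x2, y2), img in zip(cells, random.sample(images, 4)):
        h, w = y2 - y1, x2 - x1
        # Nearest-neighbour resize of the source image into its cell
        # (this stands in for the random-scaling step; a real pipeline
        # would also remap the bounding-box annotations accordingly).
        ys = np.linspace(0, img.shape[0] - 1, h).astype(int)
        xs = np.linspace(0, img.shape[1] - 1, w).astype(int)
        canvas[y1:y2, x1:x2] = img[ys][:, xs]
    return canvas


if __name__ == "__main__":
    # A toy "batch" of random images; the splicing is repeated batch-size times.
    batch = [np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8) for _ in range(8)]
    mosaics = [mosaic(batch) for _ in range(len(batch))]
    print(mosaics[0].shape)  # (640, 640, 3)
```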
Figure 3. Images after the mixup augmentation processing.
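For reference, the mixup augmentation of Zhang et al. [32] that produces the blended images in Figure 3 forms convex combinations of pairs of training images and their labels; in its standard formulation (the mixing parameter α used in this work is not restated in the caption):

$\tilde{x} = \lambda x_i + (1 - \lambda) x_j, \qquad \tilde{y} = \lambda y_i + (1 - \lambda) y_j, \qquad \lambda \sim \mathrm{Beta}(\alpha, \alpha)$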
Figure 4. Images after HSV augmentation processing.
Figure 5. The network architecture diagram of YOLOv7.
Figure 6. The overview of GAM.
Figure 7. Channel attention submodule.
Figure 8. The network structure after adding GAM in YOLOv7.
Figure 9. The flowchart of bird tracking.
Figure 10. (a,b) The results of bird tracking.
Figure 11. The logic diagram for bird counting.
Figure 12. The variation in mAP@0.5 during training. (a) Training using the dataset labeled only with the head of the bird; (b) training using the dataset labeled with the body of the bird.
Figure 13. The variation in mAP@0.5 during training. (a) Ablation experiments with improved strategies; (b) a comparison of the original model with our final improved model.
Figure 14. The heat map of various models in the ablation experiment.
Figure 15. Flowchart of the overall processing pipeline of the proposed method.
Figure 16. The interface design for the monitoring system.
Table 1. Partitioning of the datasets for object detection.
Annotation Method | Name | Proportion | Number of Pictures | Number of Birds
Whole Body Annotation | training set | 85% | 3176 | 11,322
Whole Body Annotation | validation set | 10% | 373 | 1543
Whole Body Annotation | test set | 5% | 188 | 863
Head Annotation | training set | 85% | 2782 | 10,085
Head Annotation | validation set | 10% | 327 | 1217
Head Annotation | test set | 5% | 164 | 681
Total | Whole Body Annotation | 100% | 3737 | 13,728
Total | Head Annotation | 100% | 3273 | 11,983
Head annotation means that only the bird’s head is annotated, whereas whole body annotation means that the bird’s entire body, including the head, is annotated.
Table 2. Partitioning of the datasets for multi-target tracking feature extraction networks.
Partition Name | Proportion | Number of Pictures
training set | 85% | 9468
validation set | 10% | 1114
test set | 5% | 557
Total | 100% | 11,139
Table 3. Software and hardware experimental equipment.
Name | Type/Version
Operating system | Ubuntu 20.04
Python version | Python 3.8
Version of the library | Torch 1.9.0+cu111
Integrated Development Environment | PyCharm 2021.3.3
Central Processing Unit | AMD EPYC 7543 32-Core Processor
Graphics Processing Unit | A40 (48 GB) × 2
Table 4. Parameter configuration for training neural networks.
Parameter | Value | Parameter | Value
Initial Learning Rate | 0.01 | Weight Decay | 0.0005
Momentum | 0.937 | Batch Size | 32
Image Size | 640 × 640 | Epochs | 200
Table 5. Parameter definitions.
Confusion Matrix | Predicted Positive | Predicted Negative
Real Positive (True) | TP ¹ | FN ²
Real Negative (False) | FP ³ | TN ⁴
¹ TP (True Positive): a positive sample predicted as positive. ² FN (False Negative): a positive sample predicted as negative. ³ FP (False Positive): a negative sample predicted as positive. ⁴ TN (True Negative): a negative sample predicted as negative.
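Using the confusion-matrix terms defined above, the Precision, Recall, and F1 Score columns reported in Tables 6–9 are assumed to follow the standard definitions:

$\mathrm{Precision} = \dfrac{TP}{TP + FP}, \qquad \mathrm{Recall} = \dfrac{TP}{TP + FN}, \qquad F1 = \dfrac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$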
Table 6. A comparison of different object detection algorithms (using the dataset annotated with only the head).
Model | Class | Precision | Recall | F1 Score | mAP@0.5 | FPS
Faster-RCNN | All | 0.793 | 0.854 | 0.82 | 0.841 | 25
Faster-RCNN | Ruddy Shelduck | 0.697 | 0.874 | 0.78 | 0.855
Faster-RCNN | Whooper Swan | 0.686 | 0.802 | 0.74 | 0.775
Faster-RCNN | Red-crowned Crane | 0.639 | 0.860 | 0.73 | 0.825
Faster-RCNN | Black Stork | 0.806 | 0.847 | 0.83 | 0.824
Faster-RCNN | Little Grebe | 0.856 | 0.897 | 0.88 | 0.873
Faster-RCNN | Mallard | 0.879 | 0.855 | 0.87 | 0.861
Faster-RCNN | Pheasant-tailed Jacana | 0.868 | 0.869 | 0.87 | 0.859
Faster-RCNN | Demoiselle Crane | 0.921 | 0.818 | 0.87 | 0.854
Faster-RCNN | Mandarin Duck | 0.857 | 0.865 | 0.86 | 0.859
Faster-RCNN | Scaly-sided Merganser | 0.725 | 0.852 | 0.78 | 0.824
EfficientDet | All | 0.873 | 0.832 | 0.85 | 0.851 | 12
EfficientDet | Ruddy Shelduck | 0.887 | 0.900 | 0.89 | 0.895
EfficientDet | Whooper Swan | 0.820 | 0.745 | 0.78 | 0.807
EfficientDet | Red-crowned Crane | 0.872 | 0.737 | 0.80 | 0.821
EfficientDet | Black Stork | 0.838 | 0.821 | 0.83 | 0.834
EfficientDet | Little Grebe | 0.913 | 0.892 | 0.90 | 0.899
EfficientDet | Mallard | 0.812 | 0.869 | 0.84 | 0.829
EfficientDet | Pheasant-tailed Jacana | 0.897 | 0.857 | 0.88 | 0.871
EfficientDet | Demoiselle Crane | 0.930 | 0.741 | 0.82 | 0.807
EfficientDet | Mandarin Duck | 0.911 | 0.883 | 0.90 | 0.891
EfficientDet | Scaly-sided Merganser | 0.854 | 0.874 | 0.86 | 0.861
CenterNet | All | 0.828 | 0.611 | 0.70 | 0.712 | 59
CenterNet | Ruddy Shelduck | 1.000 | 0.749 | 0.86 | 0.892
CenterNet | Whooper Swan | 0.914 | 0.750 | 0.82 | 0.883
CenterNet | Red-crowned Crane | 0.956 | 0.733 | 0.83 | 0.840
CenterNet | Black Stork | 0.802 | 0.630 | 0.71 | 0.799
CenterNet | Little Grebe | 0.644 | 0.793 | 0.71 | 0.636
CenterNet | Mallard | 0.892 | 0.600 | 0.72 | 0.793
CenterNet | Pheasant-tailed Jacana | 0.802 | 0.761 | 0.78 | 0.792
CenterNet | Demoiselle Crane | 0.703 | 0.612 | 0.65 | 0.694
CenterNet | Mandarin Duck | 0.802 | 0.393 | 0.53 | 0.611
CenterNet | Scaly-sided Merganser | 0.762 | 0.093 | 0.17 | 0.181
SSD | All | 0.861 | 0.768 | 0.81 | 0.821 | 63
SSD | Ruddy Shelduck | 0.790 | 0.810 | 0.80 | 0.809
SSD | Whooper Swan | 0.759 | 0.623 | 0.68 | 0.673
SSD | Red-crowned Crane | 0.854 | 0.707 | 0.77 | 0.811
SSD | Black Stork | 0.890 | 0.790 | 0.84 | 0.840
SSD | Little Grebe | 0.934 | 0.885 | 0.91 | 0.892
SSD | Mallard | 0.867 | 0.844 | 0.86 | 0.866
SSD | Pheasant-tailed Jacana | 0.926 | 0.892 | 0.91 | 0.917
SSD | Demoiselle Crane | 0.843 | 0.586 | 0.69 | 0.725
SSD | Mandarin Duck | 0.859 | 0.722 | 0.78 | 0.796
SSD | Scaly-sided Merganser | 0.893 | 0.821 | 0.86 | 0.878
YOLOv4 | All | 0.907 | 0.679 | 0.76 | 0.790 | 40
YOLOv4 | Ruddy Shelduck | 0.879 | 0.888 | 0.88 | 0.889
YOLOv4 | Whooper Swan | 0.839 | 0.702 | 0.76 | 0.808
YOLOv4 | Red-crowned Crane | 0.777 | 0.664 | 0.72 | 0.767
YOLOv4 | Black Stork | 0.891 | 0.765 | 0.82 | 0.849
YOLOv4 | Little Grebe | 0.962 | 0.839 | 0.90 | 0.933
YOLOv4 | Mallard | 0.894 | 0.696 | 0.78 | 0.790
YOLOv4 | Pheasant-tailed Jacana | 0.985 | 0.767 | 0.86 | 0.900
YOLOv4 | Demoiselle Crane | 0.951 | 0.656 | 0.78 | 0.838
YOLOv4 | Mandarin Duck | 0.906 | 0.556 | 0.69 | 0.801
YOLOv4 | Scaly-sided Merganser | 0.985 | 0.254 | 0.40 | 0.329
YOLOv5 | All | 0.923 | 0.847 | 0.88 | 0.841 | 88
YOLOv5 | Ruddy Shelduck | 0.818 | 0.877 | 0.85 | 0.811
YOLOv5 | Whooper Swan | 0.901 | 0.578 | 0.70 | 0.734
YOLOv5 | Red-crowned Crane | 0.942 | 0.770 | 0.85 | 0.675
YOLOv5 | Black Stork | 0.961 | 0.936 | 0.95 | 0.914
YOLOv5 | Little Grebe | 0.855 | 0.946 | 0.90 | 0.876
YOLOv5 | Mallard | 0.940 | 0.923 | 0.93 | 0.918
YOLOv5 | Pheasant-tailed Jacana | 0.960 | 0.950 | 0.95 | 0.862
YOLOv5 | Demoiselle Crane | 0.971 | 0.796 | 0.87 | 0.830
YOLOv5 | Mandarin Duck | 0.917 | 0.924 | 0.92 | 0.914
YOLOv5 | Scaly-sided Merganser | 0.967 | 0.771 | 0.86 | 0.878
YOLOv7 | All | 0.850 | 0.836 | 0.84 | 0.862 | 81
YOLOv7 | Ruddy Shelduck | 0.911 | 0.668 | 0.77 | 0.800
YOLOv7 | Whooper Swan | 0.648 | 0.701 | 0.67 | 0.705
YOLOv7 | Red-crowned Crane | 0.851 | 0.846 | 0.85 | 0.876
YOLOv7 | Black Stork | 0.652 | 0.759 | 0.70 | 0.726
YOLOv7 | Little Grebe | 0.968 | 0.902 | 0.93 | 0.968
YOLOv7 | Mallard | 0.841 | 0.900 | 0.87 | 0.915
YOLOv7 | Pheasant-tailed Jacana | 0.749 | 0.909 | 0.82 | 0.773
YOLOv7 | Demoiselle Crane | 0.959 | 0.784 | 0.86 | 0.903
YOLOv7 | Mandarin Duck | 0.960 | 0.958 | 0.96 | 0.989
YOLOv7 | Scaly-sided Merganser | 0.962 | 0.934 | 0.95 | 0.966
YOLOv8 | All | 0.846 | 0.800 | 0.82 | 0.835 | 91
YOLOv8 | Ruddy Shelduck | 0.852 | 0.569 | 0.68 | 0.713
YOLOv8 | Whooper Swan | 0.646 | 0.603 | 0.62 | 0.573
YOLOv8 | Red-crowned Crane | 0.790 | 0.815 | 0.80 | 0.834
YOLOv8 | Black Stork | 0.763 | 0.774 | 0.77 | 0.787
YOLOv8 | Little Grebe | 0.954 | 0.961 | 0.96 | 0.970
YOLOv8 | Mallard | 0.902 | 0.864 | 0.88 | 0.896
YOLOv8 | Pheasant-tailed Jacana | 0.749 | 0.864 | 0.80 | 0.791
YOLOv8 | Demoiselle Crane | 0.910 | 0.768 | 0.83 | 0.877
YOLOv8 | Mandarin Duck | 0.961 | 0.881 | 0.92 | 0.938
YOLOv8 | Scaly-sided Merganser | 0.932 | 0.898 | 0.91 | 0.970
FPS stands for Frames Per Second and mAP@0.5 is an abbreviation for Mean Average Precision when the Intersection over Union (IoU) is set to 0.5.
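As a worked complement to this footnote, the usual COCO-style reading of the metric is assumed here: the per-class Average Precision is the area under the precision–recall curve at a given IoU threshold, mAP@0.5 averages it over the N = 10 bird classes at IoU = 0.5, and mAP@0.5:0.95 (Tables 8 and 9) additionally averages over IoU thresholds from 0.5 to 0.95 in steps of 0.05:

$AP_c^{t} = \int_0^1 p_c(r)\,\mathrm{d}r, \qquad \mathrm{mAP@0.5} = \frac{1}{N}\sum_{c=1}^{N} AP_c^{0.5}, \qquad \mathrm{mAP@0.5{:}0.95} = \frac{1}{10}\sum_{t \in \{0.5, 0.55, \ldots, 0.95\}} \frac{1}{N}\sum_{c=1}^{N} AP_c^{t}$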
Table 7. A comparison of different object detection algorithms (using the dataset annotated with the whole body).
Model | Class | Precision | Recall | F1 Score | mAP@0.5 | FPS
Faster-RCNN | All | 0.831 | 0.892 | 0.86 | 0.879 | 26
Faster-RCNN | Ruddy Shelduck | 0.735 | 0.912 | 0.81 | 0.893
Faster-RCNN | Whooper Swan | 0.724 | 0.840 | 0.78 | 0.813
Faster-RCNN | Red-crowned Crane | 0.677 | 0.898 | 0.77 | 0.863
Faster-RCNN | Black Stork | 0.844 | 0.885 | 0.86 | 0.862
Faster-RCNN | Little Grebe | 0.894 | 0.935 | 0.91 | 0.911
Faster-RCNN | Mallard | 0.917 | 0.893 | 0.91 | 0.899
Faster-RCNN | Pheasant-tailed Jacana | 0.906 | 0.907 | 0.91 | 0.897
Faster-RCNN | Demoiselle Crane | 0.959 | 0.856 | 0.90 | 0.892
Faster-RCNN | Mandarin Duck | 0.895 | 0.903 | 0.90 | 0.897
Faster-RCNN | Scaly-sided Merganser | 0.763 | 0.890 | 0.82 | 0.862
EfficientDet | All | 0.915 | 0.874 | 0.89 | 0.898 | 14
EfficientDet | Ruddy Shelduck | 0.932 | 0.955 | 0.94 | 0.945
EfficientDet | Whooper Swan | 0.860 | 0.785 | 0.82 | 0.857
EfficientDet | Red-crowned Crane | 0.912 | 0.777 | 0.84 | 0.867
EfficientDet | Black Stork | 0.893 | 0.866 | 0.88 | 0.884
EfficientDet | Little Grebe | 0.943 | 0.937 | 0.94 | 0.945
EfficientDet | Mallard | 0.867 | 0.899 | 0.88 | 0.875
EfficientDet | Pheasant-tailed Jacana | 0.927 | 0.912 | 0.92 | 0.917
EfficientDet | Demoiselle Crane | 0.985 | 0.771 | 0.86 | 0.853
EfficientDet | Mandarin Duck | 0.941 | 0.938 | 0.94 | 0.937
EfficientDet | Scaly-sided Merganser | 0.894 | 0.904 | 0.90 | 0.897
CenterNet | All | 0.968 | 0.659 | 0.74 | 0.796 | 58
CenterNet | Ruddy Shelduck | 1.000 | 0.977 | 0.99 | 0.998
CenterNet | Whooper Swan | 0.984 | 0.750 | 0.85 | 0.783
CenterNet | Red-crowned Crane | 0.973 | 0.750 | 0.85 | 0.840
CenterNet | Black Stork | 1.000 | 0.851 | 0.92 | 0.881
CenterNet | Little Grebe | 0.842 | 0.800 | 0.82 | 0.936
CenterNet | Mallard | 0.909 | 0.517 | 0.66 | 0.740
CenterNet | Pheasant-tailed Jacana | 1.000 | 0.778 | 0.88 | 0.906
CenterNet | Demoiselle Crane | 0.971 | 0.810 | 0.88 | 0.862
CenterNet | Mandarin Duck | 1.000 | 0.333 | 0.50 | 0.785
CenterNet | Scaly-sided Merganser | 1.000 | 0.023 | 0.04 | 0.226
SSD | All | 0.901 | 0.796 | 0.84 | 0.858 | 60
SSD | Ruddy Shelduck | 0.838 | 0.838 | 0.84 | 0.857
SSD | Whooper Swan | 0.784 | 0.661 | 0.72 | 0.696
SSD | Red-crowned Crane | 0.892 | 0.745 | 0.81 | 0.849
SSD | Black Stork | 0.909 | 0.805 | 0.85 | 0.878
SSD | Little Grebe | 0.976 | 0.900 | 0.94 | 0.930
SSD | Mallard | 0.882 | 0.882 | 0.88 | 0.904
SSD | Pheasant-tailed Jacana | 0.968 | 0.930 | 0.95 | 0.955
SSD | Demoiselle Crane | 0.910 | 0.601 | 0.72 | 0.763
SSD | Mandarin Duck | 0.941 | 0.737 | 0.83 | 0.834
SSD | Scaly-sided Merganser | 0.912 | 0.859 | 0.88 | 0.916
YOLOv4 | All | 0.924 | 0.683 | 0.77 | 0.811 | 37
YOLOv4 | Ruddy Shelduck | 0.848 | 0.907 | 0.88 | 0.944
YOLOv4 | Whooper Swan | 0.877 | 0.713 | 0.79 | 0.846
YOLOv4 | Red-crowned Crane | 0.769 | 0.625 | 0.69 | 0.782
YOLOv4 | Black Stork | 0.929 | 0.776 | 0.85 | 0.887
YOLOv4 | Little Grebe | 1.000 | 0.850 | 0.92 | 0.948
YOLOv4 | Mallard | 0.932 | 0.707 | 0.80 | 0.805
YOLOv4 | Pheasant-tailed Jacana | 1.000 | 0.778 | 0.88 | 0.892
YOLOv4 | Demoiselle Crane | 0.966 | 0.667 | 0.79 | 0.853
YOLOv4 | Mandarin Duck | 0.921 | 0.556 | 0.69 | 0.816
YOLOv4 | Scaly-sided Merganser | 1.000 | 0.250 | 0.40 | 0.333
YOLOv5 | All | 0.865 | 0.805 | 0.83 | 0.920 | 93
YOLOv5 | Ruddy Shelduck | 0.808 | 0.824 | 0.82 | 0.911
YOLOv5 | Whooper Swan | 0.897 | 0.678 | 0.77 | 0.734
YOLOv5 | Red-crowned Crane | 0.875 | 0.477 | 0.62 | 0.875
YOLOv5 | Black Stork | 0.911 | 0.874 | 0.89 | 0.984
YOLOv5 | Little Grebe | 0.823 | 0.894 | 0.86 | 0.960
YOLOv5 | Mallard | 0.914 | 0.909 | 0.91 | 0.978
YOLOv5 | Pheasant-tailed Jacana | 0.769 | 0.912 | 0.83 | 0.992
YOLOv5 | Demoiselle Crane | 0.873 | 0.774 | 0.82 | 0.910
YOLOv5 | Mandarin Duck | 0.828 | 0.912 | 0.87 | 0.964
YOLOv5 | Scaly-sided Merganser | 0.954 | 0.791 | 0.86 | 0.889
YOLOv7 | All | 0.942 | 0.870 | 0.90 | 0.932 | 111
YOLOv7 | Ruddy Shelduck | 0.766 | 0.896 | 0.83 | 0.921
YOLOv7 | Whooper Swan | 0.929 | 0.601 | 0.73 | 0.760
YOLOv7 | Red-crowned Crane | 0.975 | 0.809 | 0.88 | 0.915
YOLOv7 | Black Stork | 0.984 | 0.947 | 0.97 | 0.980
YOLOv7 | Little Grebe | 0.900 | 0.973 | 0.94 | 0.974
YOLOv7 | Mallard | 0.951 | 0.939 | 0.95 | 0.978
YOLOv7 | Pheasant-tailed Jacana | 1.000 | 0.987 | 0.99 | 0.996
YOLOv7 | Demoiselle Crane | 0.983 | 0.830 | 0.90 | 0.915
YOLOv7 | Mandarin Duck | 0.967 | 0.930 | 0.95 | 0.960
YOLOv7 | Scaly-sided Merganser | 0.966 | 0.791 | 0.87 | 0.921
YOLOv8 | All | 0.925 | 0.864 | 0.89 | 0.927 | 97
YOLOv8 | Ruddy Shelduck | 0.846 | 0.900 | 0.87 | 0.929
YOLOv8 | Whooper Swan | 0.892 | 0.597 | 0.72 | 0.759
YOLOv8 | Red-crowned Crane | 0.971 | 0.799 | 0.88 | 0.899
YOLOv8 | Black Stork | 0.953 | 0.947 | 0.95 | 0.976
YOLOv8 | Little Grebe | 0.852 | 0.946 | 0.90 | 0.970
YOLOv8 | Mallard | 0.946 | 0.937 | 0.94 | 0.971
YOLOv8 | Pheasant-tailed Jacana | 1.000 | 0.975 | 0.99 | 0.995
YOLOv8 | Demoiselle Crane | 0.975 | 0.859 | 0.91 | 0.947
YOLOv8 | Mandarin Duck | 0.897 | 0.930 | 0.91 | 0.953
YOLOv8 | Scaly-sided Merganser | 0.954 | 0.755 | 0.84 | 0.875
FPS stands for Frames Per Second and mAP@0.5 is an abbreviation for Mean Average Precision when the Intersection over Union (IoU) is set to 0.5.
Table 8. Each experimental group in YOLOv7's data-augmentation ablation experiments corresponds to a combination of tricks and its evaluation metrics. The "✓" indicates that the trick is not used in that group of experiments, and the "🗴" indicates that it is used.
Group | HSV / Mosaic / MixUp / FocalLoss | Precision | Recall | F1 Score | mAP@0.5 | mAP@0.5:0.95 | FPS
1 | 🗴 🗴 🗴 🗴 | 0.939 | 0.870 | 0.90 | 0.932 | 0.807 | 111
2 | 🗴 🗴 🗴 | 0.914 | 0.885 | 0.90 | 0.930 | 0.801 | 92
3 | 🗴 🗴 🗴 | 0.933 | 0.876 | 0.90 | 0.931 | 0.798 | 89
4 | 🗴 🗴 🗴 | 0.929 | 0.878 | 0.90 | 0.929 | 0.798 | 91
5 | 🗴 🗴 🗴 | 0.925 | 0.872 | 0.90 | 0.924 | 0.782 | 82
6 | 🗴 🗴 | 0.924 | 0.885 | 0.90 | 0.930 | 0.801 | 84
7 | 🗴 🗴 | 0.935 | 0.880 | 0.91 | 0.927 | 0.790 | 79
8 | 🗴 🗴 | 0.913 | 0.877 | 0.89 | 0.929 | 0.801 | 83
9 | 🗴 🗴 | 0.940 | 0.881 | 0.91 | 0.932 | 0.807 | 86
10 | 🗴 🗴 | 0.916 | 0.884 | 0.90 | 0.929 | 0.777 | 77
11 | 🗴 🗴 | 0.930 | 0.874 | 0.90 | 0.929 | 0.797 | 82
12 | 🗴 | 0.942 | 0.888 | 0.91 | 0.933 | 0.809 | 85
13 | 🗴 | 0.932 | 0.876 | 0.90 | 0.927 | 0.783 | 81
14 | 🗴 | 0.933 | 0.879 | 0.91 | 0.927 | 0.789 | 81
15 | 🗴 | 0.945 | 0.879 | 0.91 | 0.931 | 0.801 | 80
16 |  | 0.932 | 0.875 | 0.90 | 0.927 | 0.788 | 84
FPS stands for Frames Per Second and mAP@0.5 is an abbreviation for Mean Average Precision when the Intersection over Union (IoU) is set to 0.5.
Table 9. Ablation experiments with improved algorithms. (For the following experiments, mixup, Mosaic, and HSV data augmentation methods are used by default.)
Model | Class | Precision | Recall | F1 Score | mAP@0.5 | mAP@0.5:0.95 | FPS
YOLOv7 | All | 0.942 | 0.888 | 0.91 | 0.933 | 0.809 | 85
YOLOv7 | Ruddy Shelduck | 0.766 | 0.912 | 0.83 | 0.921 | 0.804
YOLOv7 | Whooper Swan | 0.931 | 0.669 | 0.78 | 0.770 | 0.536
YOLOv7 | Red-crowned Crane | 0.975 | 0.819 | 0.89 | 0.915 | 0.757
YOLOv7 | Black Stork | 0.984 | 0.956 | 0.97 | 0.980 | 0.864
YOLOv7 | Little Grebe | 0.900 | 0.979 | 0.94 | 0.974 | 0.931
YOLOv7 | Mallard | 0.951 | 0.939 | 0.94 | 0.978 | 0.878
YOLOv7 | Pheasant-tailed Jacana | 1.000 | 0.957 | 0.98 | 0.996 | 0.912
YOLOv7 | Demoiselle Crane | 0.983 | 0.883 | 0.93 | 0.915 | 0.760
YOLOv7 | Mandarin Duck | 0.967 | 0.934 | 0.95 | 0.960 | 0.852
YOLOv7 | Scaly-sided Merganser | 0.966 | 0.833 | 0.89 | 0.921 | 0.793
YOLOv7 + GAM | All | 0.929 | 0.883 | 0.91 | 0.938 | 0.803 | 101
YOLOv7 + GAM | Ruddy Shelduck | 0.750 | 0.890 | 0.81 | 0.917 | 0.792
YOLOv7 + GAM | Whooper Swan | 0.920 | 0.649 | 0.76 | 0.780 | 0.538
YOLOv7 + GAM | Red-crowned Crane | 0.972 | 0.834 | 0.90 | 0.922 | 0.754
YOLOv7 + GAM | Black Stork | 0.956 | 0.962 | 0.96 | 0.979 | 0.849
YOLOv7 + GAM | Little Grebe | 0.818 | 0.973 | 0.89 | 0.982 | 0.923
YOLOv7 + GAM | Mallard | 0.947 | 0.961 | 0.95 | 0.977 | 0.884
YOLOv7 + GAM | Pheasant-tailed Jacana | 1.000 | 0.987 | 0.99 | 0.996 | 0.889
YOLOv7 + GAM | Demoiselle Crane | 0.983 | 0.834 | 0.90 | 0.921 | 0.752
YOLOv7 + GAM | Mandarin Duck | 0.967 | 0.939 | 0.95 | 0.971 | 0.851
YOLOv7 + GAM | Scaly-sided Merganser | 0.978 | 0.799 | 0.88 | 0.931 | 0.797
YOLOv7 + Alpha-IoU | All | 0.945 | 0.887 | 0.92 | 0.947 | 0.809 | 92
YOLOv7 + Alpha-IoU | Ruddy Shelduck | 0.873 | 0.908 | 0.89 | 0.944 | 0.816
YOLOv7 + Alpha-IoU | Whooper Swan | 0.914 | 0.684 | 0.78 | 0.811 | 0.549
YOLOv7 + Alpha-IoU | Red-crowned Crane | 0.972 | 0.825 | 0.89 | 0.935 | 0.747
YOLOv7 + Alpha-IoU | Black Stork | 0.952 | 0.962 | 0.96 | 0.980 | 0.867
YOLOv7 + Alpha-IoU | Little Grebe | 0.885 | 0.973 | 0.93 | 0.989 | 0.906
YOLOv7 + Alpha-IoU | Mallard | 0.946 | 0.947 | 0.95 | 0.979 | 0.879
YOLOv7 + Alpha-IoU | Pheasant-tailed Jacana | 1.000 | 0.987 | 0.99 | 0.995 | 0.901
YOLOv7 + Alpha-IoU | Demoiselle Crane | 0.972 | 0.828 | 0.89 | 0.920 | 0.778
YOLOv7 + Alpha-IoU | Mandarin Duck | 0.968 | 0.943 | 0.96 | 0.978 | 0.854
YOLOv7 + Alpha-IoU | Scaly-sided Merganser | 0.971 | 0.817 | 0.89 | 0.940 | 0.793
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | All | 0.945 | 0.898 | 0.92 | 0.951 | 0.815 | 82
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Ruddy Shelduck | 0.892 | 0.904 | 0.90 | 0.944 | 0.803
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Whooper Swan | 0.922 | 0.730 | 0.81 | 0.825 | 0.573
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Red-crowned Crane | 0.984 | 0.858 | 0.92 | 0.935 | 0.752
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Black Stork | 0.947 | 0.962 | 0.95 | 0.981 | 0.857
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Little Grebe | 0.817 | 0.973 | 0.89 | 0.990 | 0.919
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Mallard | 0.956 | 0.946 | 0.95 | 0.983 | 0.880
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Pheasant-tailed Jacana | 1.000 | 0.987 | 0.99 | 0.996 | 0.911
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Demoiselle Crane | 0.977 | 0.871 | 0.92 | 0.932 | 0.807
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Mandarin Duck | 0.975 | 0.943 | 0.96 | 0.985 | 0.847
YOLOv7 + GAM + Alpha-IoU (YOLOv7Birds) | Scaly-sided Merganser | 0.977 | 0.808 | 0.88 | 0.937 | 0.797
FPS stands for Frames Per Second and mAP@0.5 is an abbreviation for Mean Average Precision when the Intersection over Union (IoU) is set to 0.5.
Table 10. Comparison between quantity in reality and quantity calculated by the model.
Interval of Time | 0–15 s | 16–30 s | 31–45 s | 46–60 s
Results of manual count ("true count"): Quantity | 36 | 44 | 65 | 72
Results of manual count ("true count"): Counting Accuracy (%) | 100% | 100% | 100% | 100%
Results of our algorithm: Quantity | 36 | 44 | 65 | 72
Results of our algorithm: Counting Accuracy (%) | 100% | 100% | 100% | 100%
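The Counting Accuracy rows in Table 10 compare the per-interval counts against the manual reference; a plausible formulation (an assumption here, since the exact formula is given in the main text rather than in this table) is

$\mathrm{Counting\;Accuracy} = \left(1 - \frac{\lvert N_{\mathrm{pred}} - N_{\mathrm{true}} \rvert}{N_{\mathrm{true}}}\right) \times 100\%$

so that, for example, the 0–15 s interval with $N_{\mathrm{pred}} = N_{\mathrm{true}} = 36$ yields 100%.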
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
