Article

Visual Detection of Lost Ear Tags in Breeding Pigs in a Production Environment Using the Enhanced Cascade Mask R-CNN

College of Computer Science, Inner Mongolia Agricultural University, Hohhot 010018, China
* Author to whom correspondence should be addressed.
Agriculture 2023, 13(10), 2011; https://doi.org/10.3390/agriculture13102011
Submission received: 13 September 2023 / Revised: 9 October 2023 / Accepted: 13 October 2023 / Published: 17 October 2023
(This article belongs to the Special Issue Artificial Intelligence in Livestock Farming)

Abstract

As the unique identifier of individual breeding pigs, the loss of ear tags can result in the loss of breeding pigs’ identity information, leading to data gaps and confusion in production and genetic breeding records, which can have catastrophic consequences for breeding efforts. Detecting the loss of ear tags in breeding pigs can be challenging in production environments due to factors such as overlapping breeding pig clusters, imbalanced pig-to-tag ratios, and relatively small-sized ear tags. This study proposes an improved method for the detection of lost ear tags in breeding pigs based on Cascade Mask R-CNN. Firstly, the model utilizes ResNeXt combined with a feature pyramid network (FPN) as the feature extractor; secondly, the classification branch incorporates the online hard example mining (OHEM) technique to improve the utilization of ear tags and low-confidence samples; finally, the regression branch employs a decay factor of Soft-NMS to reduce the overlap of redundant bounding boxes. The experiment employs a sliding window detection method to evaluate the algorithm’s performance in detecting lost ear tags in breeding pigs in a production environment. The results show that the accuracy of the detection can reach 92.86%. This improvement effectively enhances the accuracy and real-time performance of lost ear tag detection, which is highly significant for the production and breeding of breeding pigs.


1. Introduction

Ear tags are widely used in large-scale farms as a crucial device for breeding pigs’ identification [1]. Individual identification through ear tags is crucial for the development of the breeding industry, as it enables the management and tracking of breeding pigs throughout their entire life cycle; this practice supports the refinement, intensification, and intelligence of the industry and holds great significance for production management, genetic breeding, and lineage tracing [2]. Ear tag loss is a frequent problem in the production environment, often caused by the breeding pigs’ natural activity, rubbing against farm facilities, and mutual aggression. Although measures can be taken to reduce the loss rate, such as improving the quality of ear tags and changing the marking position, it is impossible to eliminate the occurrence of ear tag loss. The loss of ear tags has significant consequences for breeding pigs’ identity information, resulting in the loss and confusion of production and genetic breeding data. Livestock caretakers rely primarily on direct observation or video surveillance methods to detect tag loss. However, direct observation poses the risk of zoonotic diseases and offers limited economic benefits. Video monitoring, although reducing contact with breeding pigs, is hindered by factors like overlapping breeding pig clusters and video contamination, making it time-consuming, inefficient, and prone to errors. Therefore, researching an automated, real-time, and accurate method for detecting breeding pigs’ lost ear tags is imperative.
In recent years, with the rapid development of artificial neural networks in computer vision, deep convolutional neural networks (DCNNs) based on computer vision have been widely applied in the field of animal husbandry [3]. A DCNN offers several advantages in breeding pig production, including high recognition rates, non-invasiveness, minimal animal stress response, and easy deployment. It enables real-time, efficient, and continuous detection, making it suitable for tasks such as individual recognition [4,5,6], pose detection [7,8,9], target tracking [10,11], and count statistics [12,13]. Previous studies have primarily focused on learning the image feature representation of breeding pigs, extracting features, and using image-based classification and object recognition for practical applications. These advancements offer automated and intelligent solutions for farming management of breeding pigs. However, the existing research mainly focuses on individual breeding pigs, with limited investigations into detecting lost ear tags. Cascade Mask R-CNN [14], an advanced target detection algorithm based on Mask R-CNN [15], improves both detection accuracy and speed by introducing a cascade structure, utilizing shared convolutional feature maps and using smaller receptive fields. Additionally, it demonstrates robust generalization capabilities, making it suitable for various object detection scenarios. This paper presents an enhanced Cascade Mask R-CNN approach for visually detecting lost ear tags in a production environment.
To address the issue of insufficient feature extraction in complex pigsty environments within the Cascade Mask R-CNN, this study utilizes the ResNeXt [16] network as the feature extractor. Additionally, this paper introduces online hard example mining (OHEM) to enhance the model’s ability to learn from challenging samples, effectively tackling the problems of imbalanced dataset categories and uneven distribution of complex and easy samples. Soft-NMS is utilized to enhance instance segmentation performance to mitigate issues caused by dense overlaps of breeding pigs, such as false merges and missed instances due to occlusion. The effectiveness of the algorithm improvements and their usability in a production environment are validated through sliding window detection experiments. The main contributions of this paper are as follows:
(1)
The study applies instance segmentation to detect breeding pigs’ lost ear tags and demonstrates the feasibility of using deep learning instance segmentation algorithms.
(2)
An improved algorithm based on Cascade Mask R-CNN is proposed. The algorithm integrates the ResNeXt backbone network, online hard example mining (OHEM) [17], and Soft-NMS [18] to enhance the performance of breeding pig and ear tag instance segmentation.
(3)
The proposed method effectively identifies breeding pigs with lost ear tags under production conditions.

2. Materials and Methods

2.1. Data Acquisition

The data used in this study were collected from a large-scale breeding pig farm located in Hohhot, Inner Mongolia Autonomous Region, China, between 24 December 2022 and 23 January 2023. Four breeding pig pens were selected as the data collection areas, each with dimensions of 5.3 × 2.7 m (see Figure 1) and housing 28 breeding pigs aged 2 to 3 months; two breeding pigs had lost ear tags, while the remaining 26 had intact ear tags. A dome-shaped camera (Hikvision, Hangzhou, China) was installed 3.4 m above the feeding area. The camera model used was DS-2PT7D20IW-DE, which had a resolution of 1920 × 1080 pixels and a frame rate of 25 fps. The breeding pig pens were continuously monitored for 24 h. Color images were captured during well-lit conditions, and in darker conditions, the camera automatically switched to infrared illumination mode, capturing grayscale images. The recorded videos were stored in the local area network NVR of the farm and accessed remotely using NVR management software. The exported video files had the extension .asf and were encoded in H.264 format.

2.2. Establishment of the Sample Database

2.2.1. Data Filtering

Eighty-four video segments, each three minutes long, were manually selected from the collected video files. These video segments exclusively contained instances of breeding pigs with lost ear tags, making them suitable as validation data for assessing the effectiveness of ear tag detection in a production environment. Additionally, 37.9 h of usable video segments were selected from the remaining video files, and frames were extracted from these videos at one frame per second using FFmpeg, yielding 136,350 image files.
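A minimal sketch of this extraction step, driving FFmpeg from Python; the directory layout and output naming are assumptions for illustration:

```python
import subprocess
from pathlib import Path

# Extract one frame per second from each exported .asf clip with FFmpeg,
# as described above. Directory names and the output pattern are assumptions.
video_dir, frame_dir = Path("videos"), Path("frames")
frame_dir.mkdir(exist_ok=True)
for clip in sorted(video_dir.glob("*.asf")):
    out_pattern = frame_dir / f"{clip.stem}_%06d.jpg"
    subprocess.run(
        ["ffmpeg", "-i", str(clip), "-vf", "fps=1", str(out_pattern)],
        check=True,
    )
```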
Due to the high similarity of images between neighboring frames, using them directly for model training may result in longer training time, lower training efficiency, or even overfitting. In this study, the Structural Similarity (SSIM) [19] algorithm was utilized to evaluate image similarity and eliminate highly similar images, thereby improving model training efficiency. The SSIM index ranges from −1 to 1, where 1 indicates that the two images are nearly identical. After a series of experimental attempts on the study data, the empirical threshold for SSIM was set to 0.78. This threshold eliminated approximately 92% of the data, resulting in the retention of 10,908 image files. The similarity between images f_1 and f_2 is calculated using Equation (1):
\mathrm{SSIM}(f_1, f_2) = \frac{(2\mu_1\mu_2 + c_1)(2\sigma_{1,2} + c_2)}{(\mu_1^2 + \mu_2^2 + c_1)(\sigma_1^2 + \sigma_2^2 + c_2)}, \quad (1)
where \mu_1 and \mu_2 represent the pixel means of images f_1 and f_2, respectively; \sigma_1^2 and \sigma_2^2 denote the pixel variances of f_1 and f_2; \sigma_{1,2} is the pixel covariance between f_1 and f_2; and c_1 and c_2 are two constants used to avoid division by zero, both set to 0.01.
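A minimal sketch of this filtering step, assuming scikit-image's SSIM implementation and a sequential comparison against the most recently kept frame (both assumptions, not the authors' exact procedure):

```python
import cv2
from skimage.metrics import structural_similarity

# Keep a frame only if its similarity to the most recently kept frame falls
# below the empirical threshold of 0.78 described above. Grayscale conversion
# and the comparison order are illustrative assumptions.
def filter_similar_frames(frame_paths, threshold=0.78):
    kept, last = [], None
    for path in frame_paths:
        gray = cv2.cvtColor(cv2.imread(str(path)), cv2.COLOR_BGR2GRAY)
        if last is None or structural_similarity(last, gray) < threshold:
            kept.append(path)
            last = gray
    return kept
```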
To further improve the efficiency of model training, images that were defaced or contained no breeding pigs were manually removed after the SSIM-based filtering, leaving 6271 images. The images were then cropped to the feeding area of the breeding pigs, resulting in a resolution of 1438 × 973 pixels.

2.2.2. Image Annotation

The PaddleSeg [20] open-source image annotation software (Version 2.7, licensed under the Apache License 2.0) was used to annotate the breeding pig and ear tag regions. The annotations were stored in the COCO dataset [21] format. The original image is shown in Figure 2a, while the annotated result is shown in Figure 2b. In the annotated image, the semi-transparent red regions represent the breeding pigs, and the semi-transparent purple regions represent the ear tags. Specifically, there are two breeding pigs with ear tags in the image, while one breeding pig is missing the ear tag.

2.2.3. Data Enhancement

In this study, data augmentation was performed using the Python data augmentation library Albumentations [22] and a single-sample data augmentation approach. The methods include random rotation (randomly generating a rotation angle within the range of −45 to 45 degrees), Gaussian blur (convolution kernel with a size of 5 × 5 and standard deviation of 3.0), contrast adjustment (randomly sampling adjustment factors within the range of 0 to 1), modification of pixel values (randomly modifying RGB channel values between −50 and 50), dropout (with a probability of 0.4 and a size of 7 × 7), and random fog (randomly generating fogging levels within the range of 0.3 to 1). After randomly extracting 15% of the data for data augmentation, the total number of samples expands to 7212 images. The augmented images are shown in Figure 3.
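An approximate Albumentations pipeline matching this description is sketched below; the application probabilities and some argument names depend on the library version and are assumptions rather than the authors' exact settings:

```python
import albumentations as A

# Approximate reconstruction of the augmentation pipeline described above.
# Probabilities and version-dependent argument names are assumptions.
augment = A.Compose([
    A.Rotate(limit=45, p=0.5),                                  # rotation in [-45, 45] degrees
    A.GaussianBlur(blur_limit=(5, 5), p=0.5),                   # 5 x 5 Gaussian blur
    A.RandomBrightnessContrast(brightness_limit=0.0,
                               contrast_limit=(0.0, 1.0), p=0.5),  # contrast adjustment
    A.RGBShift(r_shift_limit=50, g_shift_limit=50,
               b_shift_limit=50, p=0.5),                         # shift RGB values by up to +-50
    A.CoarseDropout(max_holes=8, max_height=7, max_width=7, p=0.4),  # 7 x 7 dropout patches
    A.RandomFog(fog_coef_lower=0.3, fog_coef_upper=1.0, p=0.5),  # random fog
])
# augmented_image = augment(image=image)["image"]
```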

2.2.4. Dataset Partition

The dataset was randomly split into a training set of 5770 images and a test set of 1442 images, using an 8:2 ratio. To further evaluate the model’s robustness under various lighting conditions and breeding pig activity levels, the test set was divided into six groups: daytime stationary, daytime active, daytime mixed, nighttime stationary, nighttime active, and nighttime mixed. The term ‘mixed’ indicates that the images in these groups contain stationary and active breeding pigs. The data presented in Table 1 show the distribution of images in each group within the test set.

2.3. Design of Detection Model for Breeding Pigs’ Lost Ear Tags

Cascade Mask R-CNN is a widely used deep learning method. It improves the performance of object detection and instance segmentation by incorporating a cascade structure into the Mask R-CNN framework. This study proposes an improved Cascade Mask R-CNN model for detecting breeding pigs’ lost ear tags. The model comprises three parts: the backbone network, the region proposal network, and the Cascade Detection Network, as illustrated in Figure 4.

2.3.1. Backbone Network

In this study, the ResNeXt101 network is combined with the feature pyramid network as a feature extraction backbone network to accommodate the multi-scale feature extraction needs in complex breeding pig barn scenes. ResNeXt101 maintains the same number of network layers as ResNet101 but enhances the internal structure of the network blocks. It consists of four residual block groups connected in series, each containing 3, 4, 23, and 3 residual blocks, respectively. Inside the residual block, three convolution operations (1 × 1, 3 × 3, and 1 × 1) are utilized, and the second 3 × 3 convolution employs channel group convolution with 32 branches and 64 groups. Additionally, each convolution layer is followed by a group normalization (GN) [23] operation, and each residual block is connected to the ReLU [24] activation function after the last group normalization. The formulas for GN are provided in Equations (2)–(5), and the formula for the ReLU activation function is given in Equation (6). The advantages of using the ResNeXt101 network are as follows: (1) the channel group convolution mechanism reduces the number of parameters and improves the speed of model training; and (2) the residual connection structure facilitates gradient back-propagation and enhances feature extraction ability.
In this paper, the outputs of the four residual block groups in ResNeXt101 are all input into the FPN [25] network. The four residual block groups’ output channels are 256, 512, 1024, and 2048 channels, respectively. After upsampling, feature fusion, and generation of multi-scale feature pyramids, the FPN network outputs five 256-channel feature layers, which enables the network to perform target detection and image segmentation at different scales, thereby improving the model’s performance.
\mu_\beta = \frac{1}{n}\sum_{i=1}^{n} x_i \quad (2)

\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \mu_\beta)^2 \quad (3)

\hat{x}_i = \frac{x_i - \mu_\beta}{\sqrt{\sigma^2 + \varepsilon}} \quad (4)

y_i = \gamma \hat{x}_i + \beta \quad (5)

Y_l = \max(0, X_l^{\mathrm{conv}}) \quad (6)
In Equations (2)–(5), n is the number of elements used to compute the normalization statistics, x_i represents the i-th element, \mu_\beta and \sigma^2 are the mean and variance of the elements in the same group, and \hat{x}_i represents the result of normalizing x_i with that mean and variance. \varepsilon is a small constant to avoid division by zero, while \gamma and \beta are the learnable scaling and shift parameters. In Equation (6), X_l^{\mathrm{conv}} represents the output of the l-th convolution layer, and Y_l represents the output of the activation function.
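For concreteness, a simplified PyTorch sketch of one such residual block with a grouped 3 × 3 convolution and GN is shown below; the channel widths and group counts are illustrative assumptions rather than the exact ResNeXt101 configuration:

```python
import torch
import torch.nn as nn

class ResNeXtBlock(nn.Module):
    """Sketch of a ResNeXt-style bottleneck: 1x1 conv, grouped 3x3 conv, 1x1 conv,
    each followed by GroupNorm, with a residual connection and a final ReLU."""
    def __init__(self, in_ch=256, mid_ch=128, out_ch=256, conv_groups=32, gn_groups=32):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, mid_ch, kernel_size=1, bias=False)
        self.gn1 = nn.GroupNorm(gn_groups, mid_ch)
        self.conv2 = nn.Conv2d(mid_ch, mid_ch, kernel_size=3, padding=1,
                               groups=conv_groups, bias=False)  # grouped convolution
        self.gn2 = nn.GroupNorm(gn_groups, mid_ch)
        self.conv3 = nn.Conv2d(mid_ch, out_ch, kernel_size=1, bias=False)
        self.gn3 = nn.GroupNorm(gn_groups, out_ch)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        identity = x
        out = self.relu(self.gn1(self.conv1(x)))
        out = self.relu(self.gn2(self.conv2(out)))
        out = self.gn3(self.conv3(out))
        return self.relu(out + identity)  # residual connection, then ReLU

x = torch.randn(1, 256, 56, 56)
print(ResNeXtBlock()(x).shape)  # torch.Size([1, 256, 56, 56])
```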

2.3.2. Region Proposal Network

The Region Proposal Network (RPN) [26] is used to generate candidate regions and perform preliminary classification and position regression. To accomplish this, the RPN takes the outputs of five 256-channel feature layers from the FPN, and a set of anchors is placed on different feature map layers, each corresponding to a fixed receptive field size. These anchors are selected based on predefined scales and aspect ratios to cover targets of varying scales and aspect ratios. Multiple candidate boxes are generated for each anchor, with their center positions fixed at the center of the anchor and adjusted based on the predefined aspect ratios and scales. Specifically, in this paper, anchors with a scale of 8 are utilized, and three different aspect ratios of 1:2, 1:1, and 2:1 are generated based on these anchors. The strides of these anchors on the feature maps are predetermined to cover various regions while sliding over the input image. The paper employs five different strides, which are 4, 8, 16, 32, and 64. The RPN effectively covers targets of diverse scales and positions by placing anchors on different feature map layers and generating a range of candidate boxes. Once the anchor boxes are generated, the RPN maps the candidate boxes back to the original image space and performs classification and coordinate regression, ultimately outputting a set of 256-channel candidate boxes along with their respective categories.
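Expressed as an MMDetection 2.x configuration fragment (the framework listed in Section 2.6), the anchor settings above correspond roughly to the following; the surrounding keys are standard MMDetection conventions rather than the authors' published configuration:

```python
# Sketch of an RPN head fragment with the scale, aspect ratios, and strides
# described in the text; in_channels/feat_channels follow the FPN output width.
rpn_head = dict(
    type='RPNHead',
    in_channels=256,
    feat_channels=256,
    anchor_generator=dict(
        type='AnchorGenerator',
        scales=[8],                  # single anchor scale
        ratios=[0.5, 1.0, 2.0],      # aspect ratios 1:2, 1:1, 2:1
        strides=[4, 8, 16, 32, 64],  # one stride per FPN level
    ),
)
```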

2.3.3. Cascade Detection Network

This research employs a cascaded detection architecture to accomplish breeding pig and ear tag object detection and instance segmentation. The cascaded detection network is designed with multiple stages connected sequentially to enhance detection and segmentation accuracy. The number of cascade stages in this study is set to 3, and stage loss weights of 1, 0.5, and 0.25 are allocated to balance the contributions of each stage. Before the cascaded detection, the RoIAlign operation extracts Region of Interest (RoI) features. This operation maps RoI feature regions of varying sizes to a fixed-size 7 × 7 feature map with an output channel of 256. The feature map strides are set to 4, 8, 16, and 32 to maintain consistency with the backbone network. This design effectively extracts discriminative features and handles targets of different scales, improving overall performance.
The cascade detection network comprises two fully connected branches: a classification network and a bounding box regression network. Its purpose is to predict the bounding box coordinates and categories of the RoI. The cascade detection network’s input and output channels are set to 256. In order to address the issues of class imbalance and the uneven distribution of hard and easy samples in the training set, this paper incorporates the online hard example mining (OHEM) technique into the cascade classification softmax layer. The technique calculates the loss value for each sample, sorts the values in descending order, and selects the top 25% as hard examples, while the remaining samples are considered easy examples. The weight coefficients for back-propagation of hard and easy examples are 1 and 0.25, respectively. The loss function is then recalculated, as shown in Equation (7), to enhance the detection capability for more challenging objects.
L = \frac{1}{N}\sum_{i=1}^{N} \omega_i L_{\mathrm{cls}}^{(i)} \quad (7)
In Equation (7), L_{\mathrm{cls}}^{(i)} represents the cascaded classification loss of the i-th sample, \omega_i is the weight coefficient for back-propagation of that sample, and L denotes the mean loss.
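A simplified PyTorch sketch of this loss re-weighting, assuming the per-sample classification losses are already available; the selection ratio and weights follow the description above, everything else is illustrative:

```python
import torch

def ohem_weighted_loss(per_sample_loss: torch.Tensor,
                       hard_fraction: float = 0.25,
                       easy_weight: float = 0.25) -> torch.Tensor:
    """Sketch of the OHEM re-weighting in Equation (7): the top 25% highest-loss
    samples keep weight 1, the remaining (easy) samples get weight 0.25."""
    n = per_sample_loss.numel()
    k = max(1, int(n * hard_fraction))
    weights = torch.full_like(per_sample_loss, easy_weight)
    hard_idx = torch.topk(per_sample_loss, k).indices  # hardest (largest-loss) samples
    weights[hard_idx] = 1.0
    return (weights * per_sample_loss).mean()
```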
In the cascade regression, a resampling mechanism is used to predict the precise position of object bounding boxes, and each sub-regressor consists of a fully connected layer. The original algorithm utilizes the non-maximum suppression (NMS) algorithm, which sorts all detection boxes with confidence scores higher than a threshold and then iterates through them, removing a box if its overlap with an already selected box exceeds a predefined threshold (0.5). However, NMS may mistakenly suppress correct boxes when instances overlap heavily. This paper employs an improved non-maximum suppression algorithm, Soft-NMS, to address the issue of highly overlapping proposed regions. Soft-NMS introduces dynamic weight coefficients as attenuation factors that soften the suppression strategy, which helps to overcome the misclassification caused by overlapping instances of breeding pigs. This approach retains more overlapping boxes and reduces the risk of removing correct instances.
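For illustration, a compact NumPy sketch of the Soft-NMS idea is given below; a Gaussian score decay is used here, and the exact decay function and thresholds in the authors' configuration are assumptions:

```python
import numpy as np

def soft_nms(boxes, scores, iou_thr=0.5, sigma=0.5, score_thr=0.001):
    """Sketch of Soft-NMS: instead of discarding boxes that overlap the currently
    selected box, decay their scores; boxes only drop out once their score falls
    below score_thr. boxes: (N, 4) [x1, y1, x2, y2]; scores: (N,) confidences."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float).copy()
    keep, idxs = [], list(range(len(scores)))
    while idxs:
        best = max(idxs, key=lambda i: scores[i])       # highest remaining score
        keep.append(best)
        idxs.remove(best)
        for i in idxs:
            iou = _iou(boxes[best], boxes[i])
            if iou > iou_thr:
                scores[i] *= np.exp(-(iou ** 2) / sigma)  # decay instead of discard
        idxs = [i for i in idxs if scores[i] >= score_thr]
    return keep

def _iou(a, b):
    # Intersection-over-union of two [x1, y1, x2, y2] boxes.
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / (union + 1e-9)
```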
After completing the cascade instance detection of breeding pigs and ear tags, it is necessary to generate binary masks of the same size as the input to locate the position and shape of the target objects. Initially, features are extracted from the feature map using four different strides—4, 8, 16, and 32—capturing object information at multiple scales. Subsequently, the RoIAlign operation is applied to extract features for each candidate region and map them to a 14 × 14 feature map. Finally, these features are transformed into a 256-dimensional feature space and input into a fully convolutional network to predict the segmentation masks for each candidate region. The network comprises four convolutional layers, with the input and output channels set to 256. This design ensures sufficient expressive power to learn and predict complex object shapes. This research’s predicted target categories consist of breeding pigs and ear tags. As a result, segmentation masks are generated for each category in the last layer.

2.3.4. Lost Ear Tag Detection

Based on instance segmentation, this study determines whether a breeding pig is in a tagged state by calculating the intersection area of masks. Initially, the experiment obtains the sets of masks for breeding pigs and ear tags in the image, denoted as B and E, respectively; assuming there are n breeding pigs and m ear tags recognized in the image, B = {b_1, b_2, \ldots, b_n} and E = {e_1, e_2, \ldots, e_m}. For each element b_i in B, we calculate its intersection area A_{ij} with each element e_j in E, as shown in Equation (8). If any A_{ij} is greater than 0, b_i intersects with an element of E, implying that breeding pig b_i is wearing an ear tag. Conversely, if A_{ij} = 0 for every e_j, breeding pig b_i is considered to have lost its ear tag, as shown in Equation (9).
A_{ij} = b_i \cap e_j, \quad i = 1, 2, \ldots, n; \; j = 1, 2, \ldots, m \quad (8)

Result = \begin{cases} \text{ear tag lost}, & A_{ij} = 0 \ \text{for all } j \\ \text{ear tag not lost}, & A_{ij} > 0 \ \text{for some } j \end{cases} \quad (9)
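A minimal sketch of this mask-intersection test, assuming the predicted masks are available as boolean arrays (names and array layout are assumptions):

```python
import numpy as np

def tag_status(pig_masks, tag_masks):
    """Sketch of the rule in Equations (8)-(9): a pig mask with zero overlap
    against every ear-tag mask is flagged as having lost its tag.
    Masks are assumed to be boolean H x W arrays on the same image grid."""
    results = []
    for pig in pig_masks:
        overlap = any(np.logical_and(pig, tag).sum() > 0 for tag in tag_masks)
        results.append("ear tag not lost" if overlap else "ear tag lost")
    return results
```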

2.4. Design of Loss Function

The loss function is crucial in evaluating the model, as it measures the difference between predicted and actual values. In this study, the loss function consists of several components: the classification loss (L_{\mathrm{rpn\_cls}}) and regression loss (L_{\mathrm{rpn\_reg}}) of the Region Proposal Network, and the classification loss (L_{\mathrm{cls}}), mask loss (L_{\mathrm{mask}}), and regression loss (L_{\mathrm{reg}}) of the cascade detection. L_{\mathrm{rpn\_cls}} is computed using binary cross-entropy loss, while L_{\mathrm{rpn\_reg}} uses smoothed L1 loss. Similarly, L_{\mathrm{cls}} and L_{\mathrm{mask}} are computed using pixel-level binary cross-entropy loss, and L_{\mathrm{reg}} is calculated using smoothed L1 loss. The overall loss is the weighted sum of these components, as shown in Equation (10).
L_{\mathrm{cascade}} = L_{\mathrm{rpn\_cls}} + L_{\mathrm{rpn\_reg}} + \sum_{i=1}^{n} \lambda_i \left( L_{\mathrm{cls}}^{(i)} + L_{\mathrm{reg}}^{(i)} \right) + L_{\mathrm{mask}} \quad (10)
The weights \lambda_i of the three cascade stages are set to 1, 0.5, and 0.25, respectively.

2.5. Evaluation Indicators of the Model

This study utilizes accuracy as the primary evaluation metric to assess the model’s performance. Additionally, bounding box detection mean average precision (bbox mAP), instance segmentation mean average precision (mask mAP), detection speed (fps), recall (R), and F1 score (F_1) are employed as auxiliary evaluation metrics. The mAP was calculated using the COCO dataset evaluation protocol; specifically, it is the mean of the average precision (AP) computed at IoU thresholds from 0.50 to 0.95 in steps of 0.05. The AP at each IoU threshold is the area under the precision–recall curve. Precision and recall are calculated using Equations (11) and (12), respectively; the mAP is then calculated using Equation (13), and the F1 score is calculated using Equation (14).
\mathrm{precision} = \frac{TP}{TP + FP} \quad (11)

\mathrm{recall} = \frac{TP}{TP + FN} \quad (12)

\mathrm{mAP} = \frac{1}{N}\sum_{i=1}^{N} \int_0^1 p_i(r)\, dr \quad (13)

F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}} \quad (14)
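In practice the mAP in Equation (13) is computed with the COCO evaluation tooling mentioned above; the helper below only illustrates Equations (11), (12), and (14) for a single set of confusion counts (a sketch, not the evaluation code used in the experiments):

```python
def detection_metrics(tp: int, fp: int, fn: int):
    # Precision, recall, and F1 per Equations (11), (12), and (14).
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

# Hypothetical counts: 95 true positives, 5 false positives, 3 false negatives.
print(detection_metrics(95, 5, 3))  # approx (0.950, 0.969, 0.960)
```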

2.6. Experimental Platform and Parameter Settings

The experimental hardware platform used in this study was a server running the Ubuntu 20.04 operating system; it was equipped with two Intel(R) Xeon(R) Gold 6137 processors, 256 GB of RAM, and eight NVIDIA GeForce RTX 3090 graphics cards. The software environment included Miniconda3, Python 3.8.5, CUDA 11.7, PyTorch 2.0.0, and the MMDetection 2.28.2 deep learning framework. During the training process, stochastic gradient descent was employed as the optimizer with an initial learning rate of 0.02, a momentum coefficient of 0.9, and a weight decay coefficient of 0.0001. The input images were scaled to 1438 × 973 pixels, and the batch size was 24.
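For reference, a minimal MMDetection 2.x-style configuration fragment consistent with these optimizer settings is shown below; the per-GPU batch split and the omitted keys (schedule, data pipeline) are assumptions:

```python
# Optimizer and data-loading fragment matching the settings listed above.
optimizer = dict(type='SGD', lr=0.02, momentum=0.9, weight_decay=0.0001)
optimizer_config = dict(grad_clip=None)
# 3 samples per GPU x 8 GPUs = total batch size 24 (assumed split).
data = dict(samples_per_gpu=3, workers_per_gpu=2)
```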

3. Results

3.1. Selection of Backbone Network

In order to evaluate the performance of the model using ResNeXt101 as the backbone network, a comparative experiment is conducted on the test set. VGG16 [27], ResNet50 [28], ResNet101, Res2Net101 [29], and ResNeXt101 are used as the backbone networks for Cascade Mask R-CNN, respectively. The results in Table 2 show that ResNeXt101 achieves the highest bbox mAP of 86.43%, mask mAP of 83.97%, and recall of 98.14%. These results demonstrate its superior performance. Considering the need for high-precision detection in lost ear tag detection, ResNeXt101 is the optimal choice among the five backbone networks in this study.

3.2. Impact of Improved Components on Model Performance

3.2.1. Performance Analysis of the Improvement Strategy

In order to evaluate the impact of the improvement component on the model performance, this study introduces the improvement strategy gradually while training the model. It calculates its evaluation metric values on the test set.
The data presented in Table 3 show that, overall, the recall is 99.19%, the F1 score is 98.79%, and the bbox mAP and mask mAP are 90.07% and 87.84%, respectively; these values represent improvements of varying degrees over the baseline Cascade Mask R-CNN. In terms of the individual components, the cascade classification branch incorporates the OHEM technique and the regression branch introduces the dynamic weight coefficients of Soft-NMS; this allows the model to make better use of features from harder-to-identify samples and to compute bounding box confidence more accurately. As a result, the bbox mAP and mask mAP improve by 5.33 and 6.21 percentage points, respectively. This demonstrates that the enhancement strategy can effectively address the unbalanced distribution of breeding pig and ear tag samples and the difficulty of identifying small ear tag targets, making the model more suitable for visually detecting breeding pigs’ lost ear tags in production environments.
Figure 5 shows the changes in the average accuracy of detection and segmentation throughout the model training. It is evident that the bbox mAP and mask mAP of the model in this study start improving earlier and tend to stabilize after 10 epochs. The maximum values reached are approximately 90% and 88%, close to the test set values of 90.07% and 87.84%, respectively. These results indicate that the model does not exhibit overfitting or underfitting and possesses good generalization ability.

3.2.2. Detection and Segmentation Performance for Breeding Pigs and Ear Tags

In order to assess the detection and segmentation performance for breeding pigs and ear tags, the study model’s detection effects are compared with those of Cascade Mask R-CNN on the test set.
Analyzing the data presented in Table 4, the model proposed in this study demonstrates superior performance in detecting and segmenting breeding pigs and ear tags compared to Cascade Mask R-CNN. Specifically, the average detection accuracy for breeding pigs and ear tags improved by 5.13 and 6.41 percentage points, respectively. The improvement is particularly notable for ear tags, attributed to effectively utilizing features extracted from difficult-to-recognize samples using Soft-NMS and OHEM techniques. Therefore, the proposed model exhibits enhanced performance in detecting small targets.

3.2.3. Analysis of the Visualization Effect of the Model

The experiment employed class activation maps [30] to visualize feature extraction in ear tag recognition. A class activation map is rendered as a heatmap in which warmer colors indicate the regions to which the model attends more strongly.
Based on Figure 6, it is evident that both models demonstrate accurate detection and recognition of breeding pigs as well as ear tags. Notably, the heat map of the study model exhibits intensified warm colors in the ear region, indicating that the utilization of the OHEM technique has led to heightened activation levels of features in this area. As a result, ear tag detection and recognition performance have been enhanced.

3.2.4. Training Loss Analysis of the Models

Figure 7 illustrates the changes in the model loss throughout the iteration process. The total loss starts to converge after approximately 2400 iterations, suggesting the effectiveness of the training strategy employed in the proposed model.

3.3. Detection Effects on Different Test Sets

In order to examine the impact of the model on detecting breeding pigs and ear tags under various lighting conditions and activity states, this study calculated the detection accuracy for the following six test sets: daytime stationary group, daytime active group, daytime mixed group, nighttime stationary group, nighttime active group, and nighttime mixed group. The results are presented in Table 5.
Analyzing the data presented in Table 5, the average precision performance of detection across the six test sets reveals the following trends. The daytime detection performance surpasses nighttime within the same activity state data group. Among the daytime and nighttime groups, the stationary group outperforms the mixed group, which outperforms the moving group. Overall, the detection accuracy remains above 84%. These results indicate that the model can meet the ear tag detection requirements under different lighting conditions and motion states.

3.4. Analysis of Lost Ear Tag Detection in a Production Environment for Breeding Pig Farming

In order to assess the performance of the proposed model in detecting lost ear tags in a production environment, a statistical approach was employed. The average duration that each breeding pig spent in the field of view of the experimental camera was 7.2 s. Since images from adjacent frames were highly similar, the experiment captured one frame for detection every ten frames and used an 18-frame sliding window with a step size of 1 frame to scan through each video and identify breeding pigs that had lost their ear tags. The criteria for classifying a breeding pig as having lost its ear tag were as follows (a code sketch of this decision rule is given after the list):
(1)
A distance greater than 1 pixel exists between the breeding pig mask and each of the four edges of the detection image (i.e., the breeding pig has entered the detection region completely).
(2)
If no ear tag mask is detected in the current detection image that intersects with the breeding pig’s mask, this indicates “there may be a breeding pig with lost ear tag”.
(3)
Among the 18 consecutive detection images, 80% of the images had the detection result of “there may be a breeding pig with lost ear tag”, leading to the judgment of “breeding pigs with lost ear tag”.
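The following is a minimal sketch of this windowed decision rule, assuming each captured frame has already been reduced to a boolean flag by the mask-intersection test in Section 2.3.4 (function and variable names are illustrative):

```python
def lost_tag_in_video(frame_flags, window=18, step=1, ratio=0.8):
    """frame_flags: per-detection-frame booleans, True meaning 'there may be a
    breeding pig with lost ear tag'. An 18-frame window with step 1 reports a
    lost tag when at least 80% of its frames carry that flag."""
    for start in range(0, max(1, len(frame_flags) - window + 1), step):
        w = frame_flags[start:start + window]
        if len(w) == window and sum(w) / window >= ratio:
            return True   # judgment: breeding pig with lost ear tag
    return False
```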
This experiment used 3 min video clips extracted from the production environment, distributed across 24 h of the day, to validate the effectiveness of lost ear tag detection. Among them, 84 videos contained breeding pigs with lost ear tags and 84 did not. The model produced 156 correct detections, and each correctly detected loss was reported, on average, 6.42 s after the tagless breeding pig first appeared in the video.
The results are presented in Table 6. The algorithm’s accuracy in this study is 92.86%, which is 4.76 percentage points higher than that of Cascade Mask R-CNN.

4. Discussion

Identifying breeding pigs through ear tag recognition is crucial for precise management of livestock farms; however, target identification based on machine vision still faces certain challenges. Previous researchers [31,32] have employed detectors for object recognition, but this approach is only suitable for low-density and limited-view scenarios. There are also studies [33] that achieved high accuracy and efficiency in recognizing pigs from different perspectives using deep learning, but the detection accuracy decreases when dealing with heavily overlapping samples. Additionally, dust particles in the air of pigsties can deposit on camera lenses, affecting image quality and leading to false detections even when using models with higher precision and recall rates [34]. In this study, the proposed method addresses issues such as misclassification of instances due to imbalanced distributions of pig and ear tag categories, as well as dense overlapping of pigs, by focusing on instance-level features, which enhances feature extraction capabilities and detection segmentation accuracy. Furthermore, image enhancement strategies are employed to mitigate the influence of environmental factors, resulting in improved detection performance of the model in real production environments.
According to the data presented in Table 2, it is clear that when VGG16 is used as the backbone network, the detection performance is noticeably lower than with the other four backbone networks. This can be attributed to the fact that VGG16 has a shallower depth than the other networks, which limits its ability to extract features and capture deeper semantic information in complex environments. In contrast, ResNet50 demonstrates the fastest detection speed thanks to its fewer convolutional layers while also achieving the highest F1 score. When ResNeXt101 is used as the backbone network, it achieves a detection rate of 7.86 frames per second; it also shows improved bbox mAP, mask mAP, and R metrics compared to VGG16, with increases of 8.55, 7.69, and 1.12 percentage points, respectively. The effectiveness of ResNeXt101 can be attributed to its deeper backbone network, which efficiently utilizes multi-scale information in the cascaded detection process, resulting in superior detection performance. Additionally, the 32-branch strategy employed in the residual blocks of ResNeXt101 effectively preserves fine-grained details. Therefore, using ResNeXt101 as the backbone network in Cascade Mask R-CNN is more suitable for detecting breeding pigs’ lost ear tags.
Table 3 shows that the improvement strategy significantly improves the bbox mAP and mask mAP, achieving a bbox mAP of 90.07% and a mask mAP of 87.84%. From Table 4, it is evident that the model effectively detects and identifies breeding pigs and ear tags, significantly improving the detection accuracy of ear tags and indicating that the OHEM technique can compensate for the lower detection accuracy of minority class samples caused by imbalanced dataset categories. However, it is worth noting that the introduced improvements increase the computational requirements, enlarging the model by 143.85 MB and reducing the detection speed to 6.79 fps. With an average image processing time of 0.147 s, real-time detection of breeding pigs’ lost ear tags can still be achieved.
Table 5 shows that the RGB three-channel images collected during the daytime contain more information than the grayscale images collected at night; therefore, the overall detection accuracy is higher for the daytime data groups. In images of active breeding pigs, occlusion or blurring of ear tags often leads to incorrect or missed detections, which explains why, within both the daytime and nighttime groups, the stationary group outperforms the mixed group, which in turn outperforms the moving group.
Due to limitations in experimental conditions, this study focuses on 2- to 3-month-old pigs, taking into account the differences in ear morphology and ear-to-body proportion among different breeds and age groups of pigs; therefore, all conclusions drawn are solely based on the pigs involved in this experiment. Moving forward, the research will concentrate on three main areas: firstly, streamlining the network model to enhance detection efficiency; secondly, implementing a target tracking algorithm to enable real-time tracking and tagging of pigs with lost ear tags; lastly, developing a prediction model for lost ear tags that utilizes relevant features as input, which will facilitate lost ear tag monitoring and early warning throughout the entire business process of pig farming.

5. Conclusions

In general, the enhanced Cascade Mask R-CNN enables the automatic detection of lost ear tags in breeding pigs. The present study utilizes ResNeXt and FPN as the backbone network for feature extraction and incorporates the OHEM and Soft-NMS algorithms to enhance feature extraction and utilization. This effectively addresses the low precision caused by imbalanced sample distribution and by overlapping, clustered breeding pigs in lost ear tag detection. Finally, the experiment validates the algorithm’s performance in a production environment using a sliding window detection method. Compared to Cascade Mask R-CNN (ResNet50), the proposed model achieves an accuracy of 92.86%, effectively improving detection accuracy. These findings are significant for breeding pig production and lay a foundation for developing real-time monitoring and early warning systems for lost ear tags in large-scale breeding pig farms.

Author Contributions

F.W.: Conceptualization, funding acquisition, investigation, methodology, software, writing—original draft, writing—review and editing. X.F.: Funding acquisition, resources, supervision. W.D.: Data curation, formal analysis, investigation, software, validation. B.W.: Data curation, formal analysis, investigation. H.L.: Conceptualization, funding acquisition, methodology. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported by the Key Science and Technology Special Project of Inner Mongolia Autonomous Region (2021ZD0005), the Special Project for Building a Science and Technology Innovation Team at Universities of Inner Mongolia Autonomous Region (BR231302), the National Natural Science Foundation of China (61962047), and the Research Innovation Foundation of Graduate Students of Inner Mongolia Autonomous Region (BZ2020054).

Informed Consent Statement

Informed consent was obtained from all subjects involved in the study.

Data Availability Statement

The data presented in this study are available from the corresponding author on reasonable request.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

References

  1. Bergqvist, A.S.; Forsberg, F.; Eliasson, C.; Wallenbeck, A. Individual identification of pigs during rearing and at slaughter using microchips. Livest. Sci. 2015, 180, 233–236.
  2. Wang, R.; Gao, R.; Li, Q.; Dong, J. Pig Face Recognition Based on Metric Learning by Combining a Residual Network and Attention Mechanism. Agriculture 2023, 13, 144.
  3. Oliveira, D.A.B.; Pereira, L.G.R.; Bresolin, T.; Ferreira, R.E.P.; Dorea, J.R.R. A review of deep learning algorithms for computer vision systems in livestock. Livest. Sci. 2021, 253, 104700.
  4. Lei, K.; Zong, C.; Yang, T.; Peng, S.; Zhu, P.; Wang, H.; Teng, G.; Du, X. Detection and analysis of sow targets based on image vision. Agriculture 2022, 12, 73.
  5. Marsot, M.; Mei, J.; Shan, X.; Ye, L.; Feng, P.; Yan, X.; Li, C.; Zhao, Y. An adaptive pig face recognition approach using Convolutional Neural Networks. Comput. Electron. Agric. 2020, 173, 105386.
  6. Yan, H.; Cui, Q.; Liu, Z. Pig face identification based on improved AlexNet model. Inmateh-Agric. Eng. 2020, 61.
  7. Liu, L.; Zhou, J.; Zhang, B.; Dai, S.; Shen, M. Visual detection on posture transformation characteristics of sows in late gestation based on Libra R-CNN. Biosyst. Eng. 2022, 223, 219–231.
  8. Ji, H.; Yu, J.; Lao, F.; Zhuang, Y.; Wen, Y.; Teng, G. Automatic position detection and posture recognition of grouped pigs based on deep learning. Agriculture 2022, 12, 1314.
  9. Xu, J.; Zhou, S.; Xu, A.; Ye, J.; Zhao, A. Automatic scoring of postures in grouped pigs using depth image and CNN-SVM. Comput. Electron. Agric. 2022, 194, 106746.
  10. Tu, S.; Zeng, Q.; Liang, Y.; Liu, X.; Huang, L.; Weng, S.; Huang, Q. Automated Behavior Recognition and Tracking of Group-Housed Pigs with an Improved DeepSORT Method. Agriculture 2022, 12, 1907.
  11. Ryu, H.W.; Tai, J.H. Object detection and tracking using a high-performance artificial intelligence-based 3D depth camera: Towards early detection of African swine fever. J. Vet. Sci. 2022, 23, e17.
  12. Zhou, Z. Detection and Counting Method of Pigs Based on YOLOV5_Plus: A Combination of YOLOV5 and Attention Mechanism. Math. Probl. Eng. 2022, 2022, 7078670.
  13. Liu, C.; Su, J.; Wang, L.; Lu, S.; Li, L. LA-DeepLab V3+: A Novel Counting Network for Pigs. Agriculture 2022, 12, 284.
  14. Cai, Z.; Vasconcelos, N. Cascade R-CNN: Delving into high quality object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6154–6162.
  15. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask R-CNN. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2961–2969.
  16. Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1492–1500.
  17. Shrivastava, A.; Gupta, A.; Girshick, R. Training region-based object detectors with online hard example mining. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 761–769.
  18. Bodla, N.; Singh, B.; Chellappa, R.; Davis, L.S. Soft-NMS—Improving object detection with one line of code. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 5561–5569.
  19. Wang, Z.; Bovik, A.C.; Sheikh, H.R.; Simoncelli, E.P. Image quality assessment: From error visibility to structural similarity. IEEE Trans. Image Process. 2004, 13, 600–612.
  20. Liu, Y.; Chu, L.; Chen, G.; Wu, Z.; Chen, Z.; Lai, B.; Hao, Y. PaddleSeg: A High-Efficient Development Toolkit for Image Segmentation. 2021. Available online: http://xxx.lanl.gov/abs/2101.06175 (accessed on 11 July 2023).
  21. Lin, T.Y.; Maire, M.; Belongie, S.; Hays, J.; Perona, P.; Ramanan, D.; Dollár, P.; Zitnick, C.L. Microsoft COCO: Common objects in context. In Proceedings of the Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, 6–12 September 2014; pp. 740–755.
  22. Buslaev, A.; Iglovikov, V.I.; Khvedchenya, E.; Parinov, A.; Druzhinin, M.; Kalinin, A.A. Albumentations: Fast and Flexible Image Augmentations. Information 2020, 11, 125.
  23. Wu, Y.; He, K. Group normalization. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19.
  24. Glorot, X.; Bordes, A.; Bengio, Y. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, JMLR Workshop and Conference Proceedings, Fort Lauderdale, FL, USA, 11–13 April 2011; pp. 315–323.
  25. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125.
  26. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 1137–1149.
  27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv 2014, arXiv:1409.1556.
  28. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 770–778.
  29. Gao, S.H.; Cheng, M.M.; Zhao, K.; Zhang, X.Y.; Yang, M.H.; Torr, P. Res2Net: A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell. 2019, 43, 652–662.
  30. Zhou, B.; Khosla, A.; Lapedriza, A.; Oliva, A.; Torralba, A. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 2921–2929.
  31. Liu, J.; Gao, C.; Meng, D.; Hauptmann, A.G. DecideNet: Counting varying density crowds through attention guided detection and density estimation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 5197–5206.
  32. Zhang, Y.; Zhou, D.; Chen, S.; Gao, S.; Ma, Y. Single-image crowd counting via multi-column convolutional neural network. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 589–597.
  33. Tian, M.; Guo, H.; Chen, H.; Wang, Q.; Long, C.; Ma, Y. Automated pig counting using deep learning. Comput. Electron. Agric. 2019, 163, 104840.
  34. Nasirahmadi, A.; Sturm, B.; Edwards, S.; Jeppsson, K.H.; Olsson, A.C.; Müller, S.; Hensel, O. Deep learning and machine vision approaches for posture detection of individual pigs. Sensors 2019, 19, 3738.
Figure 1. The layout of the pig pens during the experiment is illustrated in the diagram. The green star represents the location of the camera. The shaded area represents the image acquisition area, which is also the feeding area for the breeding pigs, and all of the feeding breeding pigs pass through this area.
Figure 2. Instance annotation images, where (a) represents the original image, while (b) represents the annotated result.
Figure 3. Image enhancement examples.
Figure 4. Structure diagram of the breeding pigs’ lost ear tag detection model based on the improved Cascade Mask R-CNN.
Figure 5. The changes in bbox mAP and mask mAP of the four models.
Figure 6. Heat map of the model.
Figure 7. Training loss of the model.
Table 1. Distribution of images in test set groups.

| Test Set | Number of Images |
| --- | --- |
| Daytime stationary group | 221 |
| Daytime active group | 242 |
| Daytime mixed group | 258 |
| Nighttime stationary group | 223 |
| Nighttime active group | 241 |
| Nighttime mixed group | 257 |
| Total | 1442 |
Table 2. Model performance of different backbone networks.

| Backbone | Bbox mAP/% | Mask mAP/% | fps | Recall/% | F1/% |
| --- | --- | --- | --- | --- | --- |
| VGG16 | 77.88 | 76.28 | 8.9 | 97.02 | 96.13 |
| ResNet50 | 84.74 | 81.63 | 11.92 | 98.09 | 97.78 |
| ResNet101 | 85.45 | 82.47 | 10.37 | 98.01 | 97.67 |
| Res2Net101 | 85.95 | 82.72 | 9.61 | 98.02 | 97.61 |
| ResNeXt101 | 86.43 | 83.97 | 7.86 | 98.14 | 97.69 |
Table 3. Detection and instance segmentation results of the four models with progressively introduced improvement strategies.

| Model | Bbox mAP/% | Mask mAP/% | fps | Recall/% | F1/% | Model Size/MB |
| --- | --- | --- | --- | --- | --- | --- |
| Cascade Mask R-CNN (ResNet50) | 84.74 | 81.63 | 11.92 | 98.09 | 97.78 | 515.93 |
| Cascade Mask R-CNN (ResNeXt101) | 86.43 | 83.97 | 7.86 | 98.14 | 97.69 | 607.23 |
| Cascade Mask R-CNN (ResNeXt101_OHEM) | 87.48 | 87.09 | 7.01 | 97.95 | 97.58 | 644.47 |
| Cascade Mask R-CNN (ResNeXt101_OHEM_Soft-NMS) | 90.07 | 87.84 | 6.79 | 99.19 | 98.79 | 659.78 |
Table 4. Detection performance for breeding pigs and ear tags.

| Model | Bbox mAP/% (Pig) | Bbox mAP/% (Ear Tag) | Mask mAP/% (Pig) | Mask mAP/% (Ear Tag) | Recall/% (Pig) | Recall/% (Ear Tag) | F1/% (Pig) | F1/% (Ear Tag) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Cascade Mask R-CNN (ResNet50) | 86.41 | 82.52 | 84.51 | 80.22 | 98.66 | 97.53 | 84.51 | 81.87 |
| Cascade Mask R-CNN (ResNeXt101_OHEM_Soft-NMS) | 92.16 | 88.83 | 89.64 | 86.28 | 99.86 | 98.47 | 89.64 | 88.28 |
Table 5. Detection performance on the six test sets.

| Evaluation Indicator | Daytime Stationary | Daytime Moving | Daytime Mixed | Nighttime Stationary | Nighttime Moving | Nighttime Mixed |
| --- | --- | --- | --- | --- | --- | --- |
| Bbox mAP | 92.17% | 88.95% | 90.62% | 91.87% | 86.35% | 90.47% |
| Mask mAP | 89.93% | 86.67% | 89.53% | 88.27% | 84.39% | 88.23% |
Table 6. Accuracy of lost ear tag detection.

| Model | Accuracy |
| --- | --- |
| Cascade Mask R-CNN (ResNet50) | 88.10% |
| Cascade Mask R-CNN (ResNeXt101_OHEM_Soft-NMS) | 92.86% |

