Article

A Real-Time Fish Target Detection Algorithm Based on Improved YOLOv5

1 Ocean College, Jiangsu University of Science and Technology, Zhenjiang 212100, China
2 Department of Communication, Wuhan Maritime Communication Research Institute, Wuhan 430223, China
* Author to whom correspondence should be addressed.
J. Mar. Sci. Eng. 2023, 11(3), 572; https://doi.org/10.3390/jmse11030572
Submission received: 16 February 2023 / Revised: 2 March 2023 / Accepted: 4 March 2023 / Published: 7 March 2023
(This article belongs to the Section Physical Oceanography)

Abstract

Marine fish target detection technology is of great significance for underwater vehicles to realize automatic fish recognition. However, the complex underwater environment and lighting conditions produce images with cluttered backgrounds and considerable irrelevant interference, which makes fish target detection more difficult. In order to detect fish targets accurately and quickly, a real-time fish target detection network based on an improved YOLOv5s is proposed. Firstly, the Gamma transform is introduced in the preprocessing stage to improve the gray level and contrast of marine fish images, which facilitates model detection. Secondly, the ShuffleNetv2 lightweight network with an embedded SE channel attention mechanism replaces the original YOLOv5 backbone network CSPDarkNet53 to reduce the model size and the amount of computation and to speed up detection. Finally, the improved BiFPN-Short network replaces the PANet network for feature fusion, enhancing information propagation between different levels and improving the accuracy of the detection algorithm. Experimental results show that the volume of the improved model is reduced by 76.64%, the number of parameters is reduced by 81.60%, the floating-point operations (FLOPs) are decreased by 81.22% and the mean average precision (mAP) is increased to 98.10%. A balance between lightweight design and detection accuracy is achieved, and this paper also provides a reference for the development of underwater target detection equipment.

1. Introduction

The ocean is rich in biological resources and is the largest supply base for protein, among which fishery resources are one of the most important. However, the global ocean area is huge and the distribution range of fish is wide, so fishery analysis and statistics are particularly important. The traditional method of exploring marine fish distribution is mainly fishing at sea, using longline fishing, trawl fishing and other common marine fishing techniques [1], but the cost is very high and the efficiency is low. In addition, fishing operations at sea are risky, so the use of unmanned underwater vehicles has become a trend. The operation of unmanned underwater vehicles [2,3], such as autonomous underwater vehicles (AUVs) and remotely operated vehicles (ROVs), can not only reduce the cost of missions but also provide data support for scientific research [4,5,6,7]. To meet the above expectations, it is very important to design an efficient real-time fish target detection algorithm.
With the development of deep learning, deep learning technology has broken through the bottleneck of traditional target detection algorithms in feature extraction [8] and has become the mainstream approach to target detection. According to their detection principles, deep learning target detection algorithms can be divided into two categories, two-stage and one-stage, whose representative algorithms are the R-CNN series [9,10,11,12] and the YOLO [13,14,15,16] and SSD [17] series, respectively, where SSD builds on improvements to the former two. Although there are many target detection methods, most of them focus on improving accuracy, which makes the network structure more complex and requires greater hardware cost. Moreover, the complex underwater environment and lighting conditions lead to cluttered backgrounds and considerable irrelevant interference in the collected images, which makes existing target detection methods difficult to transfer directly. Therefore, it is particularly important to study a lightweight target detection method with high detection accuracy for underwater images. As classic one-stage target detection algorithms, the YOLO series are widely used because of their excellent detection performance. Sung et al. [18] migrated the YOLO network structure designed for real-time detection of general ground targets to the detection of underwater fish targets and achieved good classification accuracy and real-time performance; however, the hardware limitations of underwater robots, that is, the lightweight requirement of the model, were not considered. Cai et al. [19] used MobileNetv1 to replace DarkNet-53 as the backbone network and optimized the feature map selection strategy in the backbone to improve the accuracy of fish target detection, but the model complexity was not taken into account. Hua et al. [20] integrated multiple 1 × 1 convolutions into the Tiny-YOLOv3 algorithm to enhance semantic features and introduced dilated convolution, which improved the detection accuracy but required a great deal of computing power to search for the optimal structure. Fang et al. [21] simplified the backbone network and changed the structure of the detection head; although the inference speed was improved, the detection accuracy of the model was sacrificed. Fang et al. [22] used network pruning to remove feature layers, but the optimization effect was not stable. Li et al. [23] proposed a channel-based pruning model compression method to realize a lightweight network and pruned again on the basis of MobileNetv2 to reduce the number of model parameters, which had a significant effect; however, the accuracy loss was serious, the detection accuracy was difficult to guarantee and the generalization was not strong. The above studies focus on either accuracy or lightweight design alone, and it is difficult for them to achieve a balance between the two.
In order to be suitable for underwater target detection equipment with a small memory, this paper proposes a real-time fish target detection algorithm based on improved YOLOv5s, and verifies it on a marine fish image dataset in a real environment. The main contributions of this paper are as follows:
  • The Gamma transform is added to the preprocessing stage to improve the contrast and gray level of the underwater image, so that the model can better identify the target object in the image and the accuracy of the model is improved.
  • The SE channel attention mechanism is integrated into the ShuffleNetv2 lightweight network, which then replaces the YOLOv5s backbone network for feature extraction, greatly reducing the number of parameters and achieving a lightweight model.
  • The improved simplified version of the weighted bidirectional feature pyramid network is used as a module, and feature fusion is repeated three times to obtain richer feature information and further improve the detection performance.
The rest of this paper is organized as follows: Section 2 mainly introduces the acquisition and annotation of the dataset. Section 3 details the proposed real-time fish target detection algorithm based on improved YOLOv5. Experimental and analytical results are presented in Section 4. Finally, the conclusion is described in Section 5.

2. Materials and Methods

2.1. Dataset Acquisition

The dataset used in this paper was selected from the Fish4Knowledge (F4K) dataset [24], from which a total of 2985 fish images were taken. The Fish4Knowledge (F4K) dataset is a fish image dataset collected by the Taiwan Power Corporation, the Taiwan Institute of Oceanology and Kenting National Park from 1 October 2010 to 30 September 2013 at underwater viewing platforms in Taiwan's Nanwan Strait, Lanyu Island and Hubi Lake. It includes 23 species of fish and a total of 27,370 fish images. In order to ensure the diversity of fish in the dataset and make sure the final model performed well in complex scenes, we used random sampling to select the fish images. The specific selection method was as follows:
  • Firstly, the number of images of each type of fish in F4K dataset was counted;
  • Then, for each fish species with more than 200 images, 200 images were randomly sampled;
  • Finally, the samples extracted in the second step were merged with the samples of species with fewer than 200 images to form the dataset used in this paper.
Examples of the datasets are shown in Figure 1.

2.2. Dataset Annotation

Fish images were annotated with the open-source software LabelImg to build the dataset needed to train the deep learning model. Firstly, the LabelImg annotation software was used to read and display the images, and then the fish targets in the images were marked with rectangular boxes and saved in the YOLO data format. A corresponding annotation file, generally in .txt format, was generated for each annotated image, and each line in the file represents the category and location of one object: the first column is the class label of the object, and the next four columns are its location information, namely x, y, w, h. The dataset contained 2985 fish images, which were divided into a training set and a test set at a ratio of 8:2. The training set of 2388 images contained 3467 fish targets, and the 597 images in the test set contained 912 fish targets.
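For illustration, the sketch below shows how such a YOLO-format label file can be read back into pixel-coordinate bounding boxes; the file name and image size in the commented usage line are placeholders, not part of the dataset described above.

```python
def yolo_labels_to_boxes(label_path, img_w, img_h):
    """Each line of a YOLO label file: <class_id> <x_center> <y_center> <w> <h>, normalized to [0, 1]."""
    boxes = []
    with open(label_path) as f:
        for line in f:
            cls, x, y, w, h = line.split()
            # rescale normalized center/size to pixel coordinates
            x, y = float(x) * img_w, float(y) * img_h
            w, h = float(w) * img_w, float(h) * img_h
            # convert center/size to top-left and bottom-right corners
            boxes.append((int(cls), x - w / 2, y - h / 2, x + w / 2, y + h / 2))
    return boxes

# boxes = yolo_labels_to_boxes("fish_0001.txt", img_w=640, img_h=480)  # illustrative usage
```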

3. Proposed Method

YOLOv5 is characterized by fast speed and high flexibility. It is composed of the Input, Backbone, Neck and Head. As the smallest model in YOLOv5, YOLOv5s is widely used in lightweight research; its network structure is shown in Figure 2. The function of the Input is to preprocess the input dataset, including Mosaic data augmentation, adaptive anchor box calculation, adaptive image scaling and other operations. The Backbone uses the CSPDarknet53 network to extract rich feature information from the input images. In addition, after version 5.0, the SPP module is replaced by the SPPF module, which produces the same result while computing about twice as fast as SPP. The core of the Neck is the feature pyramid network (FPN) and path aggregation network (PAN) structure, with the CSP structure introduced into the PAN to fuse feature information of different scales. The Head is the detection structure of YOLOv5, which outputs feature maps of different sizes for target prediction.
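As a concrete illustration of the SPPF idea mentioned above, the PyTorch sketch below chains three 5 × 5 max-pooling layers, which reproduces the receptive fields of the parallel 5/9/13 pools in SPP at lower cost. It is a simplified stand-in for the official YOLOv5 implementation; the channel sizes in the usage line are illustrative.

```python
import torch
import torch.nn as nn

class SPPF(nn.Module):
    """Minimal sketch of the SPPF idea used in YOLOv5 (>= v5.0)."""
    def __init__(self, c_in, c_out, k=5):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = nn.Sequential(nn.Conv2d(c_in, c_hidden, 1, bias=False),
                                 nn.BatchNorm2d(c_hidden), nn.SiLU())
        self.cv2 = nn.Sequential(nn.Conv2d(c_hidden * 4, c_out, 1, bias=False),
                                 nn.BatchNorm2d(c_out), nn.SiLU())
        self.pool = nn.MaxPool2d(kernel_size=k, stride=1, padding=k // 2)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # equivalent receptive field of a 5x5 pool
        y2 = self.pool(y1)   # equivalent receptive field of a 9x9 pool
        y3 = self.pool(y2)   # equivalent receptive field of a 13x13 pool
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))

out = SPPF(256, 256)(torch.randn(1, 256, 20, 20))  # illustrative channel sizes
```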
Although YOLOv5 has achieved great results, it still has some shortcomings: it requires high hardware cost and is difficult to deploy on small embedded or mobile devices (such as AUVs). Therefore, to solve this problem, YOLOv5 was improved from the perspectives of model complexity and detection accuracy.
The improved YOLOv5s is mainly improved from three aspects: data image enhancement, lightweight feature extraction and feature fusion. Firstly, the Gamma transform was used to enhance the underwater image data, so as to facilitate the model detection. Secondly, the ShuffleNetv2 lightweight network integrated with the SE channel attention mechanism was used to replace the YOLOv5s backbone network for feature extraction, greatly reducing the parameters and realizing the overall lightweight nature of the YOLOv5s algorithm. Finally, the improved simplified version of weighted bidirectional feature pyramid network was used as a module, and the feature enhancement extraction was repeated three times to obtain more abundant feature information and further improve the detection performance. In the following, the basic theories and improved methods related to data image enhancement, lightweight feature extraction and feature fusion are introduced in detail.

3.1. Data Image Enhancement

Due to the particularity of the underwater environment, the acquired underwater images have problems such as low contrast and unclear targets, and it is difficult to improve the accuracy of the underwater target detection model. Therefore, before underwater target detection, the gray and contrast of the image are processed to ensure that the image can provide sufficient and correct feature information.
In this paper, the Gamma transform is introduced in the preprocessing stage to adjust the gray level and contrast of underwater fish images, so as to facilitate the detection of subsequent models. Gamma transform [25] refers to the non-linear transformation of the image gray value, so that the gray level of the output image and the gray level of the input image show an exponential relationship as shown in Equation (1):
V_{out} = A V_{in}^{\gamma}    (1)
It can be seen from Equation (1) that when the Gamma value is greater than 1, Gamma transform will stretch the gray level of the brighter region and compress the gray level of the darker region, thus darkening the image as a whole. Conversely, when the Gamma value is less than 1, the Gamma transform will compress the gray level of the brighter region and stretch the gray level of the darker region, thus brightening the image as a whole. Increasing the contrast at low gray level is more conducive to the resolution of image details at low gray level.
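A minimal NumPy sketch of this preprocessing step is given below, assuming intensities are normalized to [0, 1] before applying Equation (1). In practice the frame would be loaded with an image library such as OpenCV; the synthetic array here only keeps the example self-contained.

```python
import numpy as np

def gamma_transform(img, gamma=0.75, A=1.0):
    """Apply V_out = A * V_in^gamma (Equation (1)) to an 8-bit image."""
    v_in = img.astype(np.float32) / 255.0        # normalize to [0, 1]
    v_out = A * np.power(v_in, gamma)            # gamma < 1 brightens, gamma > 1 darkens
    return np.clip(v_out * 255.0, 0, 255).astype(np.uint8)

img = (np.random.rand(480, 640, 3) * 255).astype(np.uint8)  # stand-in for an underwater frame
enhanced = gamma_transform(img, gamma=0.75)                  # 0.75 is the value selected in Section 4.3.1
```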

3.2. Lightweight Feature Extraction Network Design with a Fused Attention Mechanism

The initial YOLOv5 model easily loses target feature information in the feature extraction stage, which reduces the detection performance. Moreover, its network model is large, has a large number of parameters and places high demands on hardware, so deployment is difficult. In view of this, based on the original YOLOv5s network, this paper combines the SE channel attention mechanism with the lightweight ShuffleNetv2 network to replace the original backbone network, improving the feature extraction ability while reducing the number of parameters, so as to achieve a lightweight network.

3.2.1. ShuffleNetv2 Model

Based on the ShuffleNetv1 model, Ma et al. proposed four guidelines for designing efficient lightweight networks and designed the new ShuffleNetv2 network module [26], as shown in Figure 3. The ShuffleNetv2 module is mainly composed of a basic unit and a down-sampling unit. As shown in Figure 3a, in the basic unit the input feature channels are split into two groups: the left branch is an identity mapping with no operation, while the right branch performs convolution and batch normalization, after which the two branches are merged and channel-shuffled to strengthen the fusion of information between the two channel groups. In the down-sampling unit (Figure 3b), both the left and right branches perform down-sampling operations, the size of the feature map is halved, the number of channels is doubled and the channel shuffle operation is performed on the merged feature map. Unlike the basic unit, the down-sampling unit does not use the channel split operation; without increasing the amount of calculation of the network model, it directly increases the number of network channels and the width of the network to further enhance the feature extraction ability.
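The channel split and channel shuffle operations underlying these two units can be sketched in a few lines of PyTorch, as shown below; the tensor sizes are illustrative, and a full unit would additionally contain the 1 × 1 and depthwise convolutions described above.

```python
import torch

def channel_shuffle(x, groups=2):
    """Interleave channels across groups so information can flow between the two branches."""
    n, c, h, w = x.shape
    x = x.view(n, groups, c // groups, h, w)  # split channels into groups
    x = x.transpose(1, 2).contiguous()        # swap group and per-group channel axes
    return x.view(n, c, h, w)

def channel_split(x):
    """Basic-unit preprocessing: half the channels bypass the block, half are transformed."""
    c = x.shape[1] // 2
    return x[:, :c], x[:, c:]

x = torch.randn(1, 116, 28, 28)                          # illustrative feature map
left, right = channel_split(x)                           # identity branch / convolution branch
out = channel_shuffle(torch.cat([left, right], dim=1))   # merge and shuffle
```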

3.2.2. SE Channel Attention Mechanism

In order to make the network better fit the correlation between channels, increase the weight of more important channel features and improve the model’s ability to extract target features and detection accuracy, the SE (squeeze-and-excitation) channel attention mechanism module was introduced [27]. The SE channel attention mechanism module gives each channel a weight, so that different channels have different forces on the results, and the SE channel attention mechanism module is easy to embed in the neural network. Its network structure diagram is shown in Figure 4.
After the input X is convolutional, the feature map (U) with dimension [C,H,W] is obtained. Then squeeze is performed on the feature map U; that is, average pooling or max pooling is performed on it, and the dimension is reduced to [C,1,1], the number of channels is unchanged and [C,1,1] is the weight extracted from each channel that has influence on feature extraction. The squeeze Equation is:
Z = F_{sq}(X_c) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_c(i, j)    (2)
In Equation (2), H × W is the spatial size of the channel; X_c is the input feature map; (i, j) is the point on the feature map with abscissa i and ordinate j; F_{sq}(X_c) represents the squeeze operation on the feature map; Z is the weight obtained by squeezing the channel.
After the pooling operation, the excitation operation is performed, and the vector passes through the MLP including the fully connected layer FC, the activation function ReLU and Sigmoid to obtain the weight of each channel [C,1,1]. The excitation Equation is:
S_c = F_{ex}(Z, W) = \mathrm{Sigmoid}(W_2 \cdot \mathrm{ReLU}(W_1 Z))    (3)
In Equation (3), W refers to the fully connected layers; W_1 and W_2 are the two fully connected layers applied in turn after global average pooling, followed by the ReLU and Sigmoid activation functions, respectively. The dimension of the vector Z becomes [C/R,1,1] after passing through the W_1 fully connected layer and returns to [C,1,1] after passing through the W_2 fully connected layer, where R is a hyperparameter (the reduction ratio); S_c is the attention weight generated by the excitation operation. Finally, the reweight operation is carried out, and the weight [C,1,1] is applied to the feature map with dimension [C,H,W]; that is, each channel is multiplied by its own weight to complete the redistribution of weights. The weighting Equation is:
\hat{X} = F_{scale}(X_c, S_c) = X_c \otimes S_c    (4)
In Equation (4), ⊗ denotes element-wise multiplication; F_{scale} is the reweighting operation; \hat{X} is the output obtained through the SE channel attention.
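Equations (2)–(4) can be summarized in the following PyTorch sketch of an SE block; the reduction ratio R = 16 is a common default rather than a value specified in this paper.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Sketch of the squeeze-and-excitation steps in Equations (2)-(4)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)            # F_sq: global average pooling -> [C,1,1]
        self.excite = nn.Sequential(                      # F_ex: FC -> ReLU -> FC -> Sigmoid
            nn.Linear(channels, channels // reduction),   # W1: reduce to C/R
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),   # W2: restore to C
            nn.Sigmoid(),
        )

    def forward(self, x):
        n, c, _, _ = x.shape
        z = self.squeeze(x).view(n, c)                    # Equation (2)
        s = self.excite(z).view(n, c, 1, 1)               # Equation (3)
        return x * s                                      # Equation (4): reweight each channel

out = SEBlock(64)(torch.randn(2, 64, 32, 32))             # illustrative usage
```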

3.2.3. Improved Feature Extraction Network

In this paper, according to the characteristics of underwater fish images, ShuffleNetv2 1.0x was selected as the backbone network and improved on this basis. The improved backbone network structure is shown in Figure 5, where a ShuffleNetv2 unit with a down-sampling-unit repeat of 1 and a basic-unit repeat of 3 means that the down-sampling unit is stacked once and the basic unit is stacked three times; using three consecutive ShuffleNetv2 units strengthens the feature extraction ability without increasing the amount of calculation. As shown in Figure 5, the structure block in the proposed backbone network is composed of the basic unit and the down-sampling unit, and the backbone network processes the input image to produce three effective feature layers. ShuffleNet uses three methods, channel merging, channel shuffling and channel splitting, to improve the feature extraction ability, and uses depthwise convolution (DWConv) to reduce network parameters. As can be seen from Figure 5, the activation function in the proposed backbone network is SiLU instead of ReLU; its non-monotonic and smooth characteristics help improve the generalization ability of the model and the detection accuracy. The SiLU activation function is shown in Equation (5). In addition, based on the ShuffleNetv2 unit, this paper adds the SE attention mechanism module after the 1 × 1 convolution layer (Conv) of the basic unit and after the 1 × 1 convolution layers (Conv) of the left and right branches of the down-sampling unit, strengthening the weights of the more important channel features so as to improve the model's ability to extract target features and its detection accuracy.
\mathrm{SiLU}(x) = x \cdot \mathrm{sigmoid}(x)    (5)
In Equation (5), \mathrm{sigmoid}(x) = \frac{1}{1 + e^{-x}}.

3.3. Improved Feature Fusion Network Design

3.3.1. BiFPN and Its Simplification

The traditional FPN structure has only top-down unidirectional information flow [28], while the PANet network adopted by YOLOv5 adds an additional bottom-up path on the basis of FPN for information enhancement, effectively retaining more shallow features, as shown in Figure 6a. However, the Google team believes that the input feature maps with different resolutions have different effects on the output feature maps. Therefore, a bidirectional weighted feature pyramid network structure (BiFPN) [29] is proposed, which introduces learnable weights when the networks at different levels are stacked and fused, so that the network continuously adjusts the weights to learn the importance of different input features. Figure 6b shows the structure diagram of BiFPN. In Figure 6, P3 to P7 are the five input nodes of the original network model.
In this paper, the idea of weighted bidirectional fusion is applied to YOLOv5s model, and combined with the PANet network structure, further optimization is made:
  • Reduce the number of BiFPN input nodes to adapt to the three input effective feature layers of the lightweight backbone network;
  • Delete the nodes that have only one input edge, because their contribution to feature fusion is small;
  • A cross-scale connection method is proposed, and an extra edge is added to fuse the features in the feature extraction network directly with the features of the same size in the bottom-up path, so that the network retains more shallow semantic information while not losing too much relatively deep semantic information.
Thus, a simplified version of bidirectional cross-scale feature fusion pyramid network structure is proposed, denoted as BiFPN-Short, and Figure 6c shows the structure diagram of BiFPN-Short.
In the BiFPN-Short structure, taking the P4 node as an example, the feature fusion process is shown in Equations (6) and (7).
P_4^{td} = \mathrm{Conv}\left( \frac{w_1 P_4^{in} + w_2 \cdot \mathrm{Resize}(P_5^{in})}{w_1 + w_2 + \varepsilon} \right)    (6)
P_4^{out} = \mathrm{Conv}\left( \frac{w_1' P_4^{in} + w_2' P_4^{td} + w_3' \cdot \mathrm{Resize}(P_3^{out})}{w_1' + w_2' + w_3' + \varepsilon} \right)    (7)
In Equations (6) and (7), P_i^{in} is the input feature of layer i (i = 4, 5); P_j^{out} is the output feature of layer j (j = 3, 4); P_4^{td} is an intermediate feature; Conv is the convolution operation; Resize is an up-sampling or down-sampling operation; w_1, w_2, w_1', w_2', w_3' are the learnable weights corresponding to each feature; ε is set to 0.0001 to avoid numerical instability when the weight sum approaches zero.
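The fast normalized fusion in Equations (6) and (7) can be sketched as follows. The module only shows the weighted sum; the subsequent Conv block is indicated in a comment, and the channel counts of the fused inputs are assumed to be equal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightedFusion(nn.Module):
    """Sketch of fast normalized fusion: each input gets a learnable non-negative weight,
    and the weights are normalized so the contributions sum to one."""
    def __init__(self, num_inputs, eps=1e-4):
        super().__init__()
        self.w = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, inputs):
        w = F.relu(self.w)               # keep weights non-negative
        w = w / (w.sum() + self.eps)     # normalize; eps avoids division by zero
        return sum(wi * xi for wi, xi in zip(w, inputs))

# e.g. the intermediate node P4_td fuses P4_in with the resized P5_in (Equation (6))
p4_in = torch.randn(1, 128, 40, 40)
p5_up = F.interpolate(torch.randn(1, 128, 20, 20), scale_factor=2)  # Resize(P5_in)
p4_td = WeightedFusion(2)([p4_in, p5_up])  # a Conv block would follow in the full network
```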

3.3.2. Lightweight Ghost Convolution Module

Ghost convolution is a lightweight convolution module proposed by Han [30] in 2020, and the schematic diagram is shown in Figure 7.
The Ghost convolution module first generates a set of basic feature maps through ordinary 1 × 1 convolution, then applies linear transformations φ_1, …, φ_k to these feature maps one by one to obtain another set of redundant ("ghost") feature maps, and finally concatenates the two sets to increase the number of channels. Obtaining redundant feature maps through linear operations is much cheaper than generating them with ordinary convolution, so the total number of parameters is reduced and the model is simplified.
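A minimal PyTorch sketch of this idea is given below, assuming a ratio of 2 (half primary maps, half ghost maps) and a 3 × 3 depthwise convolution as the cheap linear transformation φ; the exact configuration used in the improved network may differ.

```python
import torch
import torch.nn as nn

class GhostConv(nn.Module):
    """Sketch of Ghost convolution: a 1x1 convolution produces the primary feature maps,
    and a cheap depthwise (linear) operation generates the ghost maps, which are concatenated."""
    def __init__(self, c_in, c_out, cheap_kernel=3):
        super().__init__()
        c_primary = c_out // 2
        self.primary = nn.Sequential(nn.Conv2d(c_in, c_primary, 1, bias=False),
                                     nn.BatchNorm2d(c_primary), nn.SiLU())
        # depthwise convolution acts as the cheap linear transformation phi
        self.cheap = nn.Sequential(nn.Conv2d(c_primary, c_primary, cheap_kernel,
                                             padding=cheap_kernel // 2,
                                             groups=c_primary, bias=False),
                                   nn.BatchNorm2d(c_primary), nn.SiLU())

    def forward(self, x):
        y = self.primary(x)
        return torch.cat([y, self.cheap(y)], dim=1)

out = GhostConv(64, 128)(torch.randn(1, 64, 40, 40))  # illustrative usage
```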

3.3.3. Improved Feature Fusion Networks

Although the BiFPN-Short feature fusion network aggregates the semantic information of different layers in the form of bidirectional weighting and strengthens the connection between deep and shallow networks, this module still has a large number of parameters. In order to ensure the balance between the amount of parameters and the detection accuracy, it is considered to further optimize the BiFPN-Short network structure. Firstly, the ordinary convolution module and C3 module in BiFPN-Short are replaced by the lightweight Ghost convolution module and C3Ghost module. Then, the replaced BiFPN-Short network structure is used as a module, denoted as New-BiFPN-Short, and the feature fusion of the three feature maps obtained by the improved feature extraction network is repeated three times to improve the accuracy of the detection algorithm. The structure diagram of the improved feature fusion network is shown in Figure 8. Among them, C1~C3 are the three effective feature maps output by the lightweight backbone network, which are the three effective input layers of the feature fusion network at this time, and P1~P3 are the three output feature maps of the improved BiFPN-Short module.
Compared with the original PANet network module, the improved BiFPN-Short network introduces learnable parameters for each path on the basis of PANet, improving the equal contribution of different input features in the original network. In addition, the skip connection structure is added to aggregate features of different resolutions, enrich the semantic expression of features and realize multi-scale feature fusion. The New-BiFPN-Short module is repeated three times, and the three feature maps obtained by the improved feature extraction network are fused to improve the accuracy of the detection algorithm. Finally, the improved BiFPN-Short feature fusion module is used for YOLOv5s underwater fish detection model to perform multi-scale feature fusion, becoming a powerful link connecting the backbone network and the prediction end.
Therefore, the final network structure of the proposed algorithm is shown in Figure 9.

4. Experiment and Analysis

4.1. Experimental Configuration

The experiments were carried out on the "Jiutian·Bisheng" cloud service platform developed by China Mobile. The operating system is Ubuntu 18.04, the processor is an Intel(R) Xeon(R) Gold 6240 CPU @ 2.60 GHz and the graphics card is an NVIDIA Tesla V100S-PCIE with 32 GB of video memory. Python 3.9.7, the deep learning framework PyTorch 1.8.0, the IDE VSCode 1.57.1 and CUDA 10.2 for GPU acceleration were used. Stochastic gradient descent (SGD) is used to update and optimize the network parameters. The training hyperparameter settings are shown in Table 1.
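The corresponding optimizer setup can be sketched as follows; the tiny placeholder module only makes the snippet self-contained and stands in for the improved YOLOv5s model.

```python
import torch
import torch.nn as nn

# Hyperparameters from Table 1
hyp = {"lr": 0.01, "momentum": 0.937, "weight_decay": 0.0005, "batch_size": 16, "epochs": 150}

model = nn.Conv2d(3, 16, 3)  # placeholder network standing in for the improved YOLOv5s

optimizer = torch.optim.SGD(model.parameters(), lr=hyp["lr"],
                            momentum=hyp["momentum"], weight_decay=hyp["weight_decay"])
```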

4.2. Evaluating Indicator

In this paper, Precision (P), Recall (R), mean Average Precision (mAP) and F1 score were used to evaluate the performance of the algorithm.
Precision refers to the proportion of positive correct predictions to all positive predictions, representing the accuracy of predictions in positive sample results. The calculation formula is given in Equation (8) below.
P = \frac{TP}{TP + FP} \times 100\%    (8)
Recall is the proportion of actual positive samples that are correctly predicted as positive. The calculation formula is given in Equation (9) below.
R = \frac{TP}{TP + FN} \times 100\%    (9)
In target detection, positive and negative output samples are usually divided according to IoU. If the IoU between the detected box and the true box is greater than a threshold, which is set to 0.45 in our experiments, the detected box is marked as TP. Otherwise, it is marked FP, and if there is no detection box matching the true box, it is marked FN. Therefore, in Equations (8) and (9), TP represents the number of correctly identified targets, FN is the number of targets that are not detected and FP is the number of incorrectly identified targets.
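The IoU computation underlying this matching rule can be written as follows; the boxes and the 0.45 threshold in the usage line are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# a detection counts as TP when its IoU with a ground-truth box exceeds the threshold (0.45 here)
print(iou((10, 10, 60, 60), (15, 15, 65, 65)) > 0.45)
```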
F1 is the harmonic average of Precision and Recall, and the larger F1 is, the better the model effect is. The calculation formula is given in Equation (10) below.
F1 = \frac{2 \times P \times R}{P + R} \times 100\%    (10)
Another evaluation index, the PR curve, is also a comprehensive evaluation based on P and R and can be used to evaluate the performance of the model; it generally takes R as the horizontal axis and P as the vertical axis. The area between the curve and the axes represents the average precision (AP) of the corresponding category, and the calculation formula is given in Equation (11) below.
AP = \int_{0}^{1} p(r) \, dr    (11)
mAP = \frac{1}{m} \sum_{i=1}^{m} AP_i    (12)
In Equation (11), AP refers to the average precision of the recognition results and p(r) refers to the value of each point on the PR curve; m in Equation (12) represents the number of target categories to be detected.
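A sketch of how AP can be computed from discrete precision-recall points is given below, using the common all-point interpolation; the sample recall and precision values are illustrative, not measured results.

```python
import numpy as np

def average_precision(recall, precision):
    """Sketch of Equation (11): AP as the area under the precision-recall curve."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    # make precision monotonically decreasing before integrating
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Equation (12): mAP is the mean AP over the m target classes
ap_per_class = [average_precision(np.array([0.2, 0.6, 0.9]), np.array([1.0, 0.9, 0.8]))]
print(sum(ap_per_class) / len(ap_per_class))
```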
In addition, the number of parameters, model size, FLOPs and frames per second (FPS) are used as indicators of the lightweight model. The number of parameters and the model size are mainly determined by the network structure. The number of floating-point operations (FLOPs) is the number of calculations the model needs to perform, which measures the complexity of the model. FPS is the number of image frames processed per second and reflects the detection speed.

4.3. Experimental Results and Analysis

4.3.1. Experiment on the Selection of Gamma Value in Preprocessing

Due to the complex underwater environment and lighting conditions, the collected underwater fish images are dim and fuzzy, and Gamma transform can effectively correct these images. The selection of the Gamma value is very important, and inappropriate selection of the Gamma value may lead to the decline in the detection effect. Therefore, this paper explores the enhancement effect of different Gamma values on images before conducting other experiments.
As shown in Figure 10, three typical dim underwater images are selected for display and transformed by using seven different Gamma values in Figure 10a–g.
It can be seen from Figure 10 that the appropriate values of Gamma such as 0.50, 0.75 and 1.00 can enhance the image contrast and brightness, thereby highlighting the target features, which is conducive to the network to extract the features of fish images. Inappropriate values, such as Gamma values greater than 1.00 or less than 0.50, may cause the originally dim image to become more blurred, making the target difficult to identify and reducing the accuracy of the model.
More intuitive experimental results are shown in Table 2, where “—” represents that Gamma transformation is not used. When Gamma transformation is not used, the mAP of YOLOv5s model is 97.00%, and when the Gamma value is 0.75, the detection accuracy is the highest. Therefore, other experiments in this paper set the Gamma value to 0.75 for image data enhancement.

4.3.2. Experiments on Attention Mechanism Selection

In order to study whether introducing an attention module is effective for underwater fish target detection and recognition, this paper conducts five groups of comparison experiments based on the image-enhanced dataset; that is, the model obtained by replacing the backbone network of the original network with ShuffleNetv2 is compared with the models obtained by additionally introducing the ECA, CA, CBAM and SE modules, respectively, as shown in Table 3.
As can be seen from Table 3, compared with the ShuffleNetv2 model, the models after introducing four different attention modules, respectively, improve the Precision by 0.08%, 0.64%, 0.09% and 0.7%. In terms of Recall, it is increased by 0.07%, 0.27%, 0.05% and 0.8%, respectively. The F1 value is increased by 0.08%, 0.46%, 0.07% and 0.75%, respectively. In terms of mAP, it is increased by 0.1%, 0.31%, 0.06% and 0.75%, respectively. At the same time, the number of model parameters and model size change little after the introduction of different attention modules, and the FLOPs remain basically unchanged. This shows that the introduction of the attention mechanism is helpful for the detection and recognition of underwater fish targets. Therefore, this paper chooses to introduce the SE channel attention mechanism when improving the feature extraction network.

4.3.3. Comparative Experiments before and after the Improvement

In this paper, the original YOLOv5s model and the improved YOLOv5s model (ours) are trained and tested on the same image-enhanced dataset with the same parameters, and the experimental data in Table 4 are obtained. The computation of the improved YOLOv5s algorithm (ours) is reduced to 2.96 G FLOPs, the number of parameters is reduced to 1,290,218, the model size is reduced to 3.2 MB and the mAP is increased to 98.10%. The improved YOLOv5s algorithm (ours) thus further improves the detection performance, achieving a balance between being lightweight and accurate and addressing the problem that existing models cannot balance detection speed and accuracy.

4.3.4. Ablation Experiment

In order to further verify the effectiveness of each improvement, this paper sets up an ablation experiment based on the image-augmented dataset. The ablation experiment is set up as follows: Scheme 0 is the original YOLOv5s network; Scheme 1 replaces the original backbone CSPDarkNet53 with ShuffleNetv2; Scheme 2 introduces the SE channel attention mechanism on the basis of Scheme 1; Scheme 3 introduces the improved BiFPN-Short network on the basis of Scheme 2. Scheme 3 is the final model proposed in this paper (ours).
The methods used in different schemes are shown in Table 5, where “√” indicates the introduction of this method, and the results of the ablation experiments are shown in Table 6.
As can be seen from Table 6, although the Precision, Recall, F1 and mAP values of Scheme 1 and Scheme 2 decrease slightly compared with Scheme 0, these schemes yield a model with far fewer parameters, a smaller model size and fewer FLOPs. This shows that introducing the lightweight ShuffleNetv2 network is beneficial for reducing the complexity of the model, but it slightly affects the target detection and recognition accuracy. Compared with Scheme 1, Scheme 2 improves the Precision by 0.7%, the Recall by 0.8%, the F1 value by 0.75% and the mAP by 0.75%. This is because, after introducing the SE channel attention mechanism, the weights of the more important channel features are increased, which enhances the feature extraction ability, while the parameters and model size are only slightly affected. Compared with Scheme 2, the parameters of Scheme 3 increase by less than 0.44 M (from 850,934 to 1,290,218), the model size is only 1.1 MB larger and the FLOPs increase by only 1.13 G, while the Precision improves by 2.43%, the Recall by 0.01%, the F1 value by 1.2% and the mAP by 0.65%. This shows that, after introducing the improved BiFPN-Short network, the connection between the deep and shallow networks is strengthened and richer feature information is obtained. Compared with Scheme 0, the Precision, Recall, F1 and mAP values of Scheme 3 are increased by 0.22%, 0.82%, 0.53% and 0.5%, respectively, while the parameters, model size and FLOPs are reduced by 81.60% (from 7,012,822 to 1,290,218), 76.64% (from 13.7 MB to 3.2 MB) and 81.22% (from 15.76 G to 2.96 G), respectively. These experimental results show that each of the improvements proposed in this paper is effective.

4.3.5. Comparison of Different Detection Algorithms

In order to further verify the superiority of the proposed algorithm, in the same experimental environment, in this paper, the improved algorithm (ours) is compared with the current mainstream two-stage target detection algorithms (Faster R-CNN) and one-stage target detection algorithms (SSD, YOLOv4, YOLOv5x, YOLOv5-Lite, YOLOv5s, etc.). The statistical results of the experiment are shown in Table 7.
From the experimental results in Table 7, it can be seen that the proposed algorithm model has the smallest model size, number of parameters and FLOPs among the compared mainstream detection models while maintaining a high detection accuracy. Its mAP is 0.5% higher than that of the original YOLOv5s model, 0.4% higher than YOLOv5-Lite, 3.76% higher than SSD and 0.94% higher than YOLOv4. Compared with the traditional two-stage target detection algorithm Faster R-CNN, the mAP is improved by 1.95% and the F1 value is greatly improved, increasing by 12.21%, while the model size, parameters and FLOPs are reduced by 518.2 MB, 135.4 M (from 136,689,024 to 1,290,218) and 366.76 G, respectively. Compared with YOLOv5x, the proposed algorithm (ours) achieves the same mAP with a slightly lower F1 value, but its far smaller model size, parameter count and FLOPs compensate for this. In addition, the detection speed of the proposed algorithm reaches 30.94 FPS, which is lower than that of the original YOLOv5s model but still ensures real-time performance. Therefore, the proposed algorithm model (ours) achieves the highest detection accuracy and good real-time performance while remaining lightweight, which proves the feasibility and superiority of the algorithm in this paper.

5. Conclusions

In this paper, we propose a real-time fish target detection algorithm based on an improved YOLOv5s, which addresses the challenges faced by traditional algorithms in fish target detection. Firstly, the Gamma transform is added to the preprocessing stage to improve the contrast and gray level of the underwater image, ensuring that the image provides sufficient and correct feature information. Secondly, the ShuffleNetv2 lightweight network integrated with the SE channel attention mechanism replaces the YOLOv5s backbone network for feature extraction, greatly reducing the complexity of the model and realizing the overall lightweight design of the YOLOv5s algorithm. Finally, the improved simplified version of the weighted bidirectional feature pyramid network is used as a module, and feature enhancement and fusion are repeated three times to obtain richer feature information and further improve the detection performance. Experimental results show that the parameters of the improved model are reduced by 81.60%, the model size is reduced to 3.2 MB and the mAP is increased to 98.10%. Compared with mainstream target detection models, the proposed algorithm model has lower complexity and higher detection accuracy and meets real-time requirements. In the future, we will further explore various underwater environments and work on the development of unmanned underwater vehicles. In addition, it is necessary to collect more marine fish target data to continuously improve the generalization performance of the model.

Author Contributions

Conceptualization, methodology and software, W.L.; validation and formal analysis, W.L. and Z.Z.; investigation, resources and data curation, W.L., B.J. and W.Y.; writing—original draft preparation, W.L.; writing—review and editing, W.L., Z.Z., B.J. and W.Y.; visualization, W.L.; supervision and funding acquisition, Z.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 61871203.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the first author.

Acknowledgments

We are grateful to the anonymous reviewers for their insightful comments and suggestions, all of which were valuable in improving our manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Huang, H.; Feng, C.; Li, L.; Rao, X.; Chen, S.; Yang, J. The development status and prospect of contemporary marine fisheries. J. Fish. Sci. China 2022, 29, 938–949. [Google Scholar]
  2. Bryson, M.; Johnson-Roberson, M.; Pizarro, O.; Williams, S.B. True color correction of autonomous underwater vehicle imagery. J. Field Robot. 2016, 33, 853–874. [Google Scholar] [CrossRef]
  3. Kim, H.-G.; Seo, J.; Kim, S.M. Underwater Optical-Sonar Image Fusion Systems. Sensors 2022, 22, 8445. [Google Scholar] [CrossRef] [PubMed]
  4. Mahmood, A.; Bennamoun, M.; An, S.; Sohel, F.A.; Boussaid, F.; Hovey, R.; Kendrick, G.A.; Fisher, R.B. Deep Image Representations for Coral Image Classification. IEEE J. Ocean Eng. 2018, 44, 121–131. [Google Scholar] [CrossRef] [Green Version]
  5. Bonin-Font, F.; Oliver, G.; Wirth, S.; Massot, M.; Negre, P.L.; Beltran, J.P. Visual sensing for autonomous underwater exploration and intervention tasks. Ocean Eng. 2015, 93, 25–44. [Google Scholar] [CrossRef]
  6. Qiao, X.; Bao, J.; Zeng, L.; Zou, J.; Li, D. An automatic active contour method for sea cucumber segmentation in natural underwater environments. Comput. Electron. Agric. 2017, 135, 134–142. [Google Scholar] [CrossRef]
  7. Sahoo, A.; Dwivedy, S.K.; Robi, P. Advancements in the field of autonomous underwater vehicle. Ocean Eng. 2019, 181, 145–160. [Google Scholar] [CrossRef]
  8. Wan, Q.; Li, Z.; Li, Y.; Ge, Z.; Wang, Y.; Wu, D. Target Tracking Method of Mobile Robot Based on Improved YOLOX. Acta Autom. Sin. 2022, 45, 1–15. [Google Scholar]
  9. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  10. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Santiago, Chile, 7–13 December 2015; pp. 1440–1448. [Google Scholar]
  11. Li, J.; Liang, X.; Shen, S.; Xu, T.; Feng, J.; Yan, S. Scale-aware fast R-CNN for pedestrian detection. IEEE Trans. Multimed. 2017, 20, 985–996. [Google Scholar] [CrossRef] [Green Version]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  13. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  14. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 7263–7271. [Google Scholar]
  15. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018, arXiv:1804.02767. [Google Scholar]
  16. Bochkovskiy, A.; Wang, C.-Y.; Liao, H.-Y.M. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  17. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. Ssd: Single shot multibox detector. In Proceedings of the Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; pp. 21–37. [Google Scholar]
  18. Sung, M.; Yu, S.C.; Girdhar, Y. Vision Based Real-Time Fish Detection Using Convolutional Neural Network. In Proceedings of the OCEANS 2017-Aberdeen, Aberdeen, UK, 19–22 June 2017; pp. 1–6. [Google Scholar]
  19. Cai, K.; Miao, X.; Wang, W.; Pang, H.; Liu, Y.; Song, J. A modified YOLOv3 model for fish detection based on MobileNetv1 as backbone. Aquacult. Eng. 2020, 91, 102117. [Google Scholar] [CrossRef]
  20. Hua, Y.; Zhang, Z.; Long, S.; Zhang, Q. Remote sensing image target detection based on improved YOLO algorithm. Electron. Meas. Technol. 2020, 43, 87–92. [Google Scholar]
  21. Fang, R.; Wang, M. Retail product packaging type detection based on improved YOLO network. Electron. Meas. Technol. 2020, 43, 108–112. [Google Scholar]
  22. Fang, W.; Wang, L.; Ren, P. Tinier-YOLO: A real-time object detection method for constrained environments. IEEE Access 2019, 8, 1935–1944. [Google Scholar] [CrossRef]
  23. Li, Y.S.; Zhang, C.Y.; Zhao, Y.K. Research on lightweight obstacle detection model based on model compression. Laser J. 2022, 43, 38–43. [Google Scholar]
  24. Boom, B.J.; Huang, P.X.; He, J.; Fisher, R.B. Supporting ground-truth annotation of image datasets using clustering. In Proceedings of the 21st International Conference on Pattern Recognition, Tsukuba, Japan, 11–15 November 2012; pp. 1542–1545. [Google Scholar]
  25. Hu, K.; Weng, C.; Zhang, Y.; Jin, J.; Xia, Q. An overview of underwater vision enhancement: From traditional methods to recent deep learning. J. Mar. Sci. Eng. 2022, 10, 241. [Google Scholar] [CrossRef]
  26. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision, Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  27. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–22 June 2018; pp. 7132–7141. [Google Scholar]
  28. Lin, T.Y.; Dollár, P.; Girshick, R.; He, K.; Hariharan, B.; Belongie, S. Feature pyramid networks for object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 2117–2125. [Google Scholar]
  29. Tan, M.; Pang, R.; Le, Q.V. EfficientDet: Scalable and efficient object detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 10781–10790. [Google Scholar]
  30. Han, K.; Wang, Y.; Tian, Q.; Guo, J.; Xu, C.; Xu, C. GhostNet: More features from cheap operations. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 14–19 June 2020; pp. 1580–1589. [Google Scholar]
Figure 1. Examples of the dataset.
Figure 2. YOLOv5s network structure diagram.
Figure 3. ShuffleNetv2 unit. (a) Basic unit. (b) Down-sampling unit.
Figure 4. Structural diagram of the SE channel attention mechanism.
Figure 5. Lightweight backbone network structure.
Figure 6. PANet, BiFPN and BiFPN-Short structure diagram.
Figure 7. Schematic diagram of Ghost convolution module.
Figure 8. Improved feature fusion network structure.
Figure 9. The overall network structure diagram after improvement.
Figure 10. Visualization of the Gamma transformation.
Table 1. Training hyperparameter settings.

Parameter Name | Parameter Value
Learning rate | 0.01
Momentum | 0.937
Weight decay | 0.0005
Batch size | 16
Epochs | 150
Table 2. Effect of different values of Gamma on detection accuracy.

Value of Gamma | mAP (%)
— | 97.00
0.25 | 96.93
0.50 | 97.35
0.75 | 97.60
1.00 | 97.52
1.25 | 96.95
1.50 | 96.50
1.75 | 96.12
Table 3. Effect comparison of introducing different attention mechanisms.

Model | Precision/% | Recall/% | F1/% | Parameters | Model Size/MB | FLOPs/G | mAP/%
ShuffleNetv2 | 92.67 | 92.10 | 92.38 | 842,358 | 2.0 | 1.83 | 96.70
+ECA | 92.75 | 92.17 | 92.46 | 842,388 | 2.0 | 1.83 | 96.80
+CA | 93.31 | 92.37 | 92.84 | 861,734 | 2.1 | 1.83 | 97.01
+CBAM | 92.76 | 92.15 | 92.45 | 851,914 | 2.1 | 1.83 | 96.76
+SE | 93.37 | 92.90 | 93.13 | 850,934 | 2.1 | 1.83 | 97.45
Table 4. Comparison of experimental results before and after improvement.

Model | Parameters | FLOPs/G | Model Size/MB | mAP/%
YOLOv5s | 7,012,822 | 15.76 | 13.7 | 97.60
Ours | 1,290,218 | 2.96 | 3.2 | 98.10
Table 5. Different scheme design.

Scheme | Replace CSPDarkNet53 with ShuffleNetv2 | Add SE Attention Mechanism | Add Improved BiFPN-Short
0 |  |  | 
1 | √ |  | 
2 | √ | √ | 
3 | √ | √ | √
Table 6. Model comparison in the ablation experiment.

Model | Precision/% | Recall/% | F1/% | Parameters | Model Size/MB | FLOPs/G | mAP/%
Scheme 0 | 95.58 | 92.09 | 93.80 | 7,012,822 | 13.7 | 15.76 | 97.60
Scheme 1 | 92.67 | 92.10 | 92.38 | 842,358 | 2.0 | 1.83 | 96.70
Scheme 2 | 93.37 | 92.90 | 93.13 | 850,934 | 2.1 | 1.83 | 97.45
Scheme 3 | 95.80 | 92.91 | 94.33 | 1,290,218 | 3.2 | 2.96 | 98.10
Table 7. Comparison of different target detection methods.

Method | mAP/% | F1/% | FLOPs/G | Parameters | Model Size/MB | FPS
Faster R-CNN | 96.15 | 82.12 | 369.72 | 136,689,024 | 521.4 | 6.57
YOLOv5x | 98.10 | 95.19 | 203.76 | 86,173,414 | 173.1 | 23.09
YOLOv4 | 97.16 | 92.00 | 59.95 | 63,937,686 | 244.4 | 14.49
SSD | 94.34 | 91.29 | 60.76 | 23,611,734 | 90.6 | 16.36
YOLOv5-Lite | 97.70 | 94.00 | 14.58 | 5,257,558 | 11.2 | 30.06
YOLOv5s | 97.60 | 93.80 | 15.76 | 7,012,822 | 13.7 | 32.96
Ours | 98.10 | 94.33 | 2.96 | 1,290,218 | 3.2 | 30.94
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
