# Adaptation of YOLOv7 and YOLOv7_tiny for Soccer-Ball Multi-Detection with DeepSORT for Tracking by Semi-Supervised System

## Abstract

## 1. Introduction

#### Introduction Background and Scope of This Study

## 2. Materials and Methods

#### 2.1. Dataset

#### 2.2. Proposed Methodology

In the focal loss, $p_t = p$ if $y = 1$ (that is, $p$ is the predicted probability of the correct class), and $p_t = 1 - p$ if $y = 0$ (the probability assigned to an incorrect class). The hyperparameter $\gamma$ modulates the loss: it gives more weight to the most difficult predictions, which increases the assertiveness of the model when detecting this type of object. This plays a crucial role in improving the accuracy of the YOLO models, which retain the same number of neurons, convolutions, and anchor boxes.
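The effect of $\gamma$ can be illustrated with a minimal sketch. It assumes the standard binary focal loss form, $FL(p_t) = -(1 - p_t)^{\gamma} \log(p_t)$, without the optional class-balancing weight; the function names `focal_loss` and `cross_entropy` are hypothetical and are not taken from the YOLOv7 code base.

```python
import math


def focal_loss(p, y, gamma=1.0, eps=1e-12):
    """Binary focal loss for a single prediction (assumed standard form).

    p     -- predicted probability of the positive class
    y     -- ground-truth label, 1 or 0
    gamma -- focusing hyperparameter; gamma = 0 reduces to cross-entropy
    """
    # p_t is the probability the model assigns to the true class.
    p_t = p if y == 1 else 1.0 - p
    # Clamp to avoid log(0) for fully wrong predictions.
    p_t = min(max(p_t, eps), 1.0)
    # The factor (1 - p_t)^gamma shrinks the loss of easy examples.
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def cross_entropy(p, y, eps=1e-12):
    """Plain binary cross-entropy, i.e. focal loss with gamma = 0."""
    return focal_loss(p, y, gamma=0.0, eps=eps)


if __name__ == "__main__":
    # Compare an easy, a medium, and a hard positive example.
    for p in (0.9, 0.6, 0.1):
        ce = cross_entropy(p, 1)
        fl = focal_loss(p, 1, gamma=1.0)
        print(f"p={p:.1f}  cross-entropy={ce:.4f}  focal={fl:.4f}")
```

With $\gamma > 0$, a well-classified example (large $p_t$) contributes far less to the loss than under cross-entropy, whereas a hard example (small $p_t$) keeps most of its weight. This is the sense in which the loss "highlights" difficult predictions.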

## 3. Results

## 4. Discussion

## 5. Conclusions

## 6. Future Works

## References

**Figure 1.** Comparison of the version used vs. previous versions [43].

**Figure 6.** YOLOv7 architecture in two well-defined phases, with max pooling in both phases to improve performance [47].

**Figure 15.** Metric results: (**a**) PR curve for test YOLOv7 (0.602); (**b**) test YOLOv7_tiny (0.598); (**c**) confidence function, with recall greater than 80% and confidence greater than 90% for YOLOv7 test.

**Figure 16.** Ball detection results with the model obtained from YOLOv7 in 200 epochs with the adjustment of hyperparameters.

**Figure 21.** Metric results: (**a**) precision vs. recall function, with an average of 95% for ball detection; (**b**) confidence function for the semi-supervised system, with 99% certainty; (**c**) F1 threshold function, with a score of 90%.

**Figure 23.** Implementation of the model in a real environment on the INEF Madrid fields, with the balls that are used there.

**Figure 25.** Multi-detection of balls and implementation of DeepSORT to visualize the trajectories of the ball at different distances and along different parabolas.

**Figure 27.** Mean average precision for models used in this article, where the semi-supervised model shows better results.

**Table 1.** Summary of related works that address the ball detection problem under the small object approach.

Author | Addressed Problem | Dataset | Model | Precision (%) | Observations |
---|---|---|---|---|---|
D’Orazio et al. [22] (2004) | Soccer ball recognition on real images | Image sequences taken by a camera connected to an S-VHS video: 318 ball images; 364 no-ball images; 139 occluded ball images | Adaptation of the Atherton algorithm | 96.46 | Includes evaluation with ball occlusion, obtaining 92% accuracy |
Zhang et al. [39] (2022) | Golf ball detection and tracking with CNN and Kalman filter | 2169 high-resolution golf images from online tournaments, from which 17,436 golf ball labels are generated | YOLO v3; YOLO v3_tiny; YOLO v4; Faster R-CNN; SSD | Tracking with Faster R-CNN: 81.3; YOLOv3 tiny: 82.1 | Addresses small object detection issues |
Kamble et al. [44] (2019) | Deep learning approach for 2D ball detection and tracking (DLBT) in soccer videos | Own dataset: 1500 images for each class (ball, player, and background) | CNN architecture designed by modifying the Visual Geometry Group (VGG) network from the University of Oxford, named VGG-M | 93.25 | Soccer videos are used |
Komorowski [45] (2019) | Soccer ball detection in long take videos | ISSIA-CNR Soccer Dataset (20,000 frames): 7000 ball; 13,000 no-ball | A classical CNN + Softmax | 87 | The hypercolumn concept is implemented with convolutional feature maps |
Hiemann [46] (2021) | Volleyball ball detection | 12,555 images: 10,363 training images; 2192 test images | YOLOv3 | 73.2 | Inference time metrics are presented in frames per second (FPS) |

Design Parameter | Values |
---|---|
Convolutional Layers | |
Kernel size | 1 × 1, 3 × 3 or 5 × 5 |
Kernel dilation | 1 or 2 |
Stride | 4 |
Output channels | 512 |
Pooling | Max pooling |
Activation | Mish, Sigmoid or ReLU |
Batch normalization | No |
Fully Connected Layers | |
Layer outputs | 16 |

Hyperparameter | Adjusted Value |
---|---|
Initial Learning Rate | 0.01 |
Final Learning Rate | 0.1 |
Momentum | 0.937 |
Weight Decay | 0.0005 |
Warmup Epochs | 3.0 |
Warmup Bias Learning Rate | 0.1 |
Box Loss Factor | 0.05 |
Classification Loss Factor | 0.3 |
Classification Loss Weight | 1.0 |
Objectness Loss Factor | 0.7 |
Intersection over Union Threshold | 0.2 |
Anchor Threshold | 4.0 |
Focal Loss Gamma | 1.0 |
Mosaic Scale | 0.5 |
Mosaic Augmentation | 1.0 |
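As an illustration, the hyperparameters listed above could be collected in a single configuration object, as in the sketch below. The key names follow the naming used in the public YOLOv7 hyperparameter files (`lr0`, `lrf`, `fl_gamma`, and so on), but their mapping to the table rows is an assumption, in particular `scale` for the "Mosaic Scale" row. The helper `final_learning_rate` is hypothetical and assumes that the "Final Learning Rate" value acts as a multiplicative factor on the initial rate, which is how those files document `lrf`.

```python
# Hypothetical mapping of the table rows to YOLOv7-style hyperparameter keys.
HYP = {
    "lr0": 0.01,             # Initial Learning Rate
    "lrf": 0.1,              # Final Learning Rate (assumed: factor applied to lr0)
    "momentum": 0.937,       # Momentum
    "weight_decay": 0.0005,  # Weight Decay
    "warmup_epochs": 3.0,    # Warmup Epochs
    "warmup_bias_lr": 0.1,   # Warmup Bias Learning Rate
    "box": 0.05,             # Box Loss Factor
    "cls": 0.3,              # Classification Loss Factor
    "cls_pw": 1.0,           # Classification Loss Weight
    "obj": 0.7,              # Objectness Loss Factor
    "iou_t": 0.2,            # Intersection over Union Threshold
    "anchor_t": 4.0,         # Anchor Threshold
    "fl_gamma": 1.0,         # Focal Loss Gamma
    "scale": 0.5,            # Mosaic Scale (assumed mapping)
    "mosaic": 1.0,           # Mosaic Augmentation (probability)
}


def final_learning_rate(hyp):
    """Learning rate at the end of training, assuming lrf scales lr0."""
    return hyp["lr0"] * hyp["lrf"]


if __name__ == "__main__":
    print(f"final learning rate: {final_learning_rate(HYP):.4f}")
```

Under this reading, the schedule would decay from 0.01 to 0.001 over training. If the value 0.1 is instead meant as an absolute rate, the helper does not apply.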

Name | Patch Size/Stride | Output Size |
---|---|---|
Conv1 | 3 × 3/1 | 32 × 128 × 64 |
Conv2 | 3 × 3/1 | 32 × 128 × 64 |
Max Pool 3 | 3 × 3/2 | 32 × 64 × 32 |
Residual 4 | 3 × 3/1 | 32 × 64 × 32 |
Residual 5 | 3 × 3/1 | 32 × 64 × 32 |
Residual 6 | 3 × 3/2 | 64 × 32 × 16 |
Residual 7 | 3 × 3/1 | 64 × 32 × 16 |
Residual 8 | 3 × 3/2 | 128 × 16 × 8 |
Residual 9 | 3 × 3/1 | 128 × 16 × 8 |
Dense 10 | 128 | |
Batch and l2 normalization | 128 | |
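The output sizes in the layer table above can be checked with a short sketch. It assumes a 128 × 64 input (implied by the stride-1 output of Conv1), padding that preserves the spatial size at stride 1, and the channel counts listed in the table; the names `LAYERS` and `trace_shapes` are hypothetical.

```python
import math

# (layer name, stride, output channels), read from the table.
LAYERS = [
    ("Conv1", 1, 32),
    ("Conv2", 1, 32),
    ("Max Pool 3", 2, 32),
    ("Residual 4", 1, 32),
    ("Residual 5", 1, 32),
    ("Residual 6", 2, 64),
    ("Residual 7", 1, 64),
    ("Residual 8", 2, 128),
    ("Residual 9", 1, 128),
]


def trace_shapes(height=128, width=64, layers=LAYERS):
    """Return (name, (channels, height, width)) after each layer.

    Assumes "same"-style padding, so only the stride changes the
    spatial size: each dimension becomes ceil(size / stride).
    """
    shapes = []
    for name, stride, channels in layers:
        height = math.ceil(height / stride)
        width = math.ceil(width / stride)
        shapes.append((name, (channels, height, width)))
    return shapes


if __name__ == "__main__":
    for name, (c, h, w) in trace_shapes():
        print(f"{name:<11} {c} x {h} x {w}")
```

Each stride-2 layer halves both spatial dimensions, so the feature map shrinks from 128 × 64 to 16 × 8 before the dense layer produces the 128-dimensional output.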

Model | Precision Range | CNN Training Method |
---|---|---|
YOLOv7_tiny | 70–75% | Transfer learning |
YOLOv7 | 70–77% | Transfer learning |
YOLOv7_tiny Focal Loss | 50–60% | Transfer learning |
YOLOv7 Focal Loss | 65–70% | Transfer learning |
YOLOv7_tiny semi-supervised with Focal Loss | 90–94.5% | Inherited weights |
YOLOv7 semi-supervised with Focal Loss | 90–95% | Inherited weights |

Model | Model Size | Backbone | Loss Function | mAP | APtest |
---|---|---|---|---|---|
YOLOv7_tiny | 640 | E-ELAN | SigmoidBin | 74.88 | 38.7% |
YOLOv7 | 640 | E-ELAN | SigmoidBin | 76.15 | 51.4% |
YOLOv7_tiny | 640 | RCSP-ELAN | Focal Loss | 53.7 | 43.1% |
YOLOv7 | 640 | RCSP-ELAN | Focal Loss | 60.2 | 56.0% |
YOLOv7_tiny_semisupervised | 640 | RCSP-ELAN | Focal Loss | 94.5 | 59.2% |
YOLOv7_semisupervised | 640 | RCSP-ELAN | Focal Loss | 95 | 67.5% |

