# An Improved Character Recognition Framework for Containers Based on DETR Algorithm


## Abstract


## 1. Introduction

- The object detection algorithm DETR, which uses transformers as its basic unit, is introduced into port container character recognition, compensating for the insufficient feature extraction and poor anti-interference ability of the convolution operations widely used in current detection models.
- In the backbone of the model, we introduce a split-attention structure on the basis of ResNet [11] and improve the backbone's feature extraction ability through multi-channel weighted convolution connections.
- With the introduction of multi-scale location coding (MSLC), the transformer structure no longer focuses only on the semantic information of the input, but considers both semantic and location information.

## 2. Materials and Methods

- Obtain the multi-scale features of the image through the ResNeSt backbone;
- Use the sum of the multi-scale position code learned from the multi-scale feature maps and the sinusoidal code of the feature map as the location input of the transformer;
- Pass each output embedding of the decoder to a shared feedforward network that predicts either a detection (class and bounding box) or a "no object" class, as shown in the pink color block in Figure 1;
- Finally, apply a bipartite matching loss based on the Hungarian algorithm to uniquely match each prediction with the ground truth, thereby achieving parallel processing.
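The bipartite matching step above can be sketched with SciPy's Hungarian-algorithm solver. The cost here (negative probability of the true class plus a weighted L1 box distance) and the names `hungarian_match` and `box_weight` are illustrative assumptions, not the paper's exact formulation; DETR's full matching cost also includes a generalized-IoU term [21].

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_probs, pred_boxes, gt_labels, gt_boxes, box_weight=5.0):
    """One-to-one matching of N predictions to M ground-truth objects.

    pred_probs: (N, C) class probabilities; pred_boxes: (N, 4);
    gt_labels: (M,); gt_boxes: (M, 4). Returns (pred_idx, gt_idx).
    """
    # Classification cost: the less likely the true class, the higher the cost.
    cost_class = -pred_probs[:, gt_labels]                          # (N, M)
    # Localization cost: L1 distance between predicted and true boxes.
    cost_box = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    # The Hungarian algorithm finds the assignment minimizing the total cost.
    return linear_sum_assignment(cost_class + box_weight * cost_box)
```

Predictions left unmatched (when N > M) are assigned the "no object" class during training.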

#### 2.1. Backbone

#### 2.2. Multi-Scale Location Coding

#### 2.3. Encoder-Decoder

#### 2.4. Self-Attention

$$\mathrm{Attention}\left(Q,K,V\right)=\mathrm{softmax}\left(\frac{Q{K}^{T}}{\sqrt{{d}_{k}}}\right)V$$

where ${d}_{k}$ is the dimension of $K$. The dot product of $Q$ and $K$ calculates the attention score between each pixel of the input and the different parts of the global feature map, representing the attention a given key point receives relative to the entire image. The normalized attention matrix is multiplied by $V$ to obtain a weighted output, which makes each pixel attend to the parts it should attend to. The multi-head attention used in this paper adopts the parallel summation of multiple self-attentions [20], and the specific calculation process is shown in Figure 3.
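A minimal single-head sketch of the scaled dot-product attention described above, in plain NumPy; the function names are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # subtract max for stability
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # pairwise attention scores
    weights = softmax(scores, axis=-1)    # each row is a distribution over keys
    return weights @ V, weights
```

Multi-head attention runs several such heads in parallel on projected inputs and combines their outputs.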

## 3. Experiments Setup

#### 3.1. The Forward Process

#### 3.2. Dataset

- (1) Random noise, filters and other methods were superimposed to simulate raindrop and snow movement trajectories on the image;
- (2) Inspired by the literature [23], the following formula was used to simulate the effect of fog:$$I\left(x,y\right)=A\rho \left(x,y\right){e}^{-\beta d\left(x\right)}+A\left(1-{e}^{-\beta d\left(x\right)}\right)$$
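A minimal sketch of the fog formula above, assuming intensities $\rho\in[0,1]$ and a per-pixel depth map for $d$; here `A` plays the role of the atmospheric light and `beta` the scattering coefficient, with illustrative default values:

```python
import numpy as np

def add_fog(rho, depth, A=1.0, beta=1.5):
    """Atmospheric scattering model: I = A*rho*e^(-beta*d) + A*(1 - e^(-beta*d))."""
    t = np.exp(-beta * depth)          # transmission, decaying with depth
    return A * rho * t + A * (1.0 - t)
```

At zero depth the image passes through unchanged; distant pixels wash out toward the atmospheric light `A`.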

- (1) Random pixel-block occlusion was performed in the original image to simulate situations where characters are occluded by foreign objects;
- (2) The brightness of the image was enhanced by the gamma transform algorithm and the adaptive white balance algorithm at the same time to simulate an environment illuminated by strong light.
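The occlusion and brightness augmentations above can be sketched as follows; the block size, gamma value, and function names are illustrative assumptions rather than the paper's exact parameters:

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def random_occlusion(image, block=8):
    """Zero out one random square block to mimic foreign-object occlusion."""
    h, w = image.shape[:2]
    y = rng.integers(0, h - block)
    x = rng.integers(0, w - block)
    out = image.copy()
    out[y:y + block, x:x + block] = 0.0
    return out

def gamma_brighten(image, gamma=0.5):
    """Gamma transform on intensities in [0, 1]; gamma < 1 brightens."""
    return np.power(image, gamma)
```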

#### 3.3. The Training Process

**Training set:** 33,000 images; **Training GPU:** NVIDIA Tesla V100 (the deep learning framework requires compute capability ≥ 3.5); **Memory:** 16 GB × 1; **Epochs:** 100.

## 4. Results Analysis

#### 4.1. Test Results and Discussion Based on the Improved Model

#### 4.2. Discussion on the Experimental Results of Sample Reconstruction for Small Sample Classes

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Acknowledgments

## Conflicts of Interest

## Appendix A

## References

- Druzhkov, P.N.; Kustikova, V.D. A survey of deep learning methods and software tools for image classification and object detection. Pattern Recognit. Image Anal.
**2016**, 26, 9–15. [Google Scholar] [CrossRef] - Liu, X.; Meng, G.; Pan, C. Scene text detection and recognition with advances in deep learning: A survey. Int. J. Doc. Anal. Recognit.
**2019**, 22, 143–162. [Google Scholar] [CrossRef] - Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. IEEE Trans. Pattern Anal. Mach. Intell.
**2017**, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed] [Green Version] - Redmon, J.; Farhadi, A. YOLOv3: An Incremental Improvement. arXiv
**2018**, arXiv:1804.02767. Available online: https://arxiv.org/abs/1804.02767 (accessed on 4 July 2021). - Bochkovskiy, A.; Wang, C.Y.; Liao, H. YOLOv4: Optimal Speed and Accuracy of Object Detection. arXiv
**2020**, arXiv:2004.10934. [Google Scholar] - Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.Y.; Berg, A.C. SSD: Single Shot MultiBox Detector. In Proceedings of the European Conference on Computer Vision, Amsterdam, The Netherlands, 11–14 October 2016. [Google Scholar] [CrossRef] [Green Version]
- Feng, F.; Yang, Y.; Cer, D.; Arivazhagan, N.; Wang, W. Language-agnostic BERT Sentence Embedding. arXiv
**2020**, arXiv:2007.01852. [Google Scholar] - Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Houlsby, N. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv
**2020**, arXiv:2010.11929. [Google Scholar] - Valanarasu, J.; Oza, P.; Hacihaliloglu, I.; Patel, V.M. Medical Transformer: Gated Axial-Attention for Medical Image Segmentation. arXiv
**2021**, arXiv:2102.10662. [Google Scholar] - Carion, N.; Massa, F.; Synnaeve, G.; Usunier, N.; Kirillov, A.; Zagoruyko, S. End-to-End Object Detection with Transformers. In Proceedings of the European Conference on Computer Vision, Glasgow, UK, 23–28 August 2020. [Google Scholar] [CrossRef]
- He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016. [Google Scholar] [CrossRef] [Green Version]
- Zhang, H.; Wu, C.; Zhang, Z.; Zhu, Y.; Zhang, Z.; Lin, H.; Sun, Y.; He, T.; Mueller, J.; Manmatha, R. ResNeSt: Split-Attention Networks. arXiv
**2020**, arXiv:2004.08955. [Google Scholar] - Hu, J.; Shen, L.; Sun, G.; Albanie, S. Squeeze-and-Excitation Networks. In Proceedings of the 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Salt Lake City, UT, USA, 18–22 June 2018; Available online: https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_Squeeze-and-Excitation_Networks_CVPR_2018_paper (accessed on 4 July 2021).
- Li, X.; Wang, W.; Hu, X.; Yang, J. Selective Kernel Networks. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
- Xie, S.; Girshick, R.; Dollár, P.; Tu, Z.; He, K. Aggregated Residual Transformations for Deep Neural Networks. Comput. Vis. Pattern Recognit.
**2017**, 1, 5987–5995. [Google Scholar] - Zhu, X.; Su, W.; Lu, L.; Li, B.; Dai, J. Deformable DETR: Deformable Transformers for End-to-End Object Detection. arXiv
**2020**, arXiv:2010.04159. [Google Scholar] - Di, H.; Ke, X.; Dadong, W. Design of multi-scale receptive field convolutional neural network for surface inspection of hot rolled steels. Image Vis. Comput.
**2019**, 89, 12–20. [Google Scholar] - He, D.; Xu, K.; Zhou, P. Defect detection of hot rolled steels with a new object detection framework called classification priority network. Comput. Ind. Eng.
**2019**, 128, 290–297. [Google Scholar] [CrossRef] - Tang, G.; Müller, M.; Rios, A.; Sennrich, R. Why Self-Attention? A Targeted Evaluation of Neural Machine Translation Architectures. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, 31 October–4 November 2018; Association for Computational Linguistics: Brussels, Belgium, 2018. [Google Scholar]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention Is All You Need. arXiv
**2017**, arXiv:1706.03762. [Google Scholar] - Kuhn, H.W. The Hungarian method for the assignment problem. Nav. Res. Logist.
**2010**, 52, 7–21. [Google Scholar] [CrossRef] [Green Version] - Rezatofighi, H.; Tsoi, N.; Gwak, J.Y.; Sadeghian, A.; Savarese, S. Generalized Intersection Over Union: A Metric and a Loss for Bounding Box Regression. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; IEEE: Piscataway, NJ, USA, 2019. [Google Scholar]
- Dong, J.; Liu, K.; Wang, J. Simulation algorithm of outdoor natural fog scene based on monocular video. J. Hebei Univ. Technol.
**2012**, 6, 16–20. [Google Scholar]

**Figure 4.** Data enhancement results of the sample set: (**a**) original image, (**b**) simulated foggy day, (**c**) simulated rainy day, (**d**) simulated snowy day, (**e**) foreign-object occlusion 1, (**f**) foreign-object occlusion 2, (**g**) simulated smoke, (**h**) simulated strong light.

Layer Name | Output Size | ResNet-50 | ResNeSt-50 |
---|---|---|---|
Conv1_x | $112\times 112$ | $7\times 7$, 64, stride 2 | $7\times 7$, 64, stride 2 |
Conv2_x | $56\times 56$ | $3\times 3$ max pool, stride 2; $\left\{\begin{array}{c}1\times 1, 64\\ 3\times 3, 64\\ 1\times 1, 256\end{array}\right\}\times 3$ | $3\times 3$ max pool, stride 2; concat of $K$ cardinal groups, each $\left\{\begin{array}{c}1\times 1, 64/(KR)\\ 3\times 3, 64/K\end{array}\right\}\times R$ with split attention, then $1\times 1$, 256; $\times 3$ |
Conv3_x | $28\times 28$ | $\left\{\begin{array}{c}1\times 1, 128\\ 3\times 3, 128\\ 1\times 1, 512\end{array}\right\}\times 4$ | concat of $K$ cardinal groups, each $\left\{\begin{array}{c}1\times 1, 128/(KR)\\ 3\times 3, 128/K\end{array}\right\}\times R$ with split attention, then $1\times 1$, 512; $\times 4$ |
Conv4_x | $14\times 14$ | $\left\{\begin{array}{c}1\times 1, 256\\ 3\times 3, 256\\ 1\times 1, 1024\end{array}\right\}\times 6$ | concat of $K$ cardinal groups, each $\left\{\begin{array}{c}1\times 1, 256/(KR)\\ 3\times 3, 256/K\end{array}\right\}\times R$ with split attention, then $1\times 1$, 1024; $\times 6$ |
Conv5_x | $7\times 7$ | $\left\{\begin{array}{c}1\times 1, 512\\ 3\times 3, 512\\ 1\times 1, 2048\end{array}\right\}\times 3$ | concat of $K$ cardinal groups, each $\left\{\begin{array}{c}1\times 1, 512/(KR)\\ 3\times 3, 512/K\end{array}\right\}\times R$ with split attention, then $1\times 1$, 2048; $\times 3$ |
 | $1\times 1$ | average pool, 1000-d fc, softmax | average pool, 1000-d fc, softmax |

Model | DE→EN 2014 | DE→EN 2017 | Acc (%) |
---|---|---|---|
RNNS2S | 29.1 | 30.1 | 84.0 |
ConvS2S | 29.1 | 30.4 | 82.3 |
Transformer | 32.7 | 33.7 | 90.3 |

Model | mAP@0.5 (%) | mAP@0.5:0.95 (%) | Recall (%) | F1 Score (%) |
---|---|---|---|---|
DETR | 93.42 | 60.02 | 95.6 | 93 |
DETR-ResNeSt | 96 | 64.2 | 96.03 | 94 |
DETR-ResNeSt-MSLC | 98.6 | 67 | 98.7 | 98 |

 | Class | V | P | N | Z | W | ALL |
---|---|---|---|---|---|---|---|
Before | Quantity | 11 | 33 | 132 | 88 | 110 | 4000 |
 | mAP@0.5 | 0.514 | 0.83 | 0.878 | 0.902 | 0.782 | 0.958 |
After | Quantity | 355 | 743 | 454 | 817 | 467 | 4500 |
 | mAP@0.5 | 0.995 | 0.991 | 0.96 | 0.994 | 0.97 | 0.986 |

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zhao, X.; Zhou, P.; Xu, K.; Xiao, L.
An Improved Character Recognition Framework for Containers Based on DETR Algorithm. *Sensors* **2021**, *21*, 4612.
https://doi.org/10.3390/s21134612
