Article

DAMP-YOLO: A Lightweight Network Based on Deformable Features and Aggregation for Meter Reading Recognition

1 College of Information Engineering, Beijing Institute of Petrochemical Technology, Beijing 102617, China
2 Laboratory of Petroleum and Chemical Industry Process Control System Information Security Engineering, Beijing 102617, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2023, 13(20), 11493; https://doi.org/10.3390/app132011493
Submission received: 7 September 2023 / Revised: 13 October 2023 / Accepted: 18 October 2023 / Published: 20 October 2023
(This article belongs to the Special Issue Automation and Digitization in Industry: Advances and Applications)

Abstract

With the development of Industry 4.0, although some smart meters have appeared on the market, traditional mechanical meters are still widely used due to their long-standing presence and the difficulty of modifying or replacing them in large quantities. Most meter readings are still manually taken on-site, and some are even taken in high-risk locations such as hazardous chemical storage. However, existing methods often fail to provide real-time detections or result in misreadings due to the complex nature of natural environments. Thus, we propose a lightweight network called DAMP-YOLO. It combines the deformable CSP bottleneck (DCB) module, aggregated triplet attention (ATA) mechanism, meter data augmentation (MDA), and network pruning (NP) with the YOLOv8 model. In the meter reading recognition dataset, the model parameters decreased by 30.64% while mAP50:95 rose from 87.92% to 88.82%, with a short inference time of 129.6 ms for the Jetson TX1 intelligent car. In the VOC dataset, our model demonstrated improved performance, with mAP50:95 increasing from 41.03% to 45.64%. The experimental results show that the proposed model is competitive for general object detection tasks and possesses exceptional feature extraction capabilities. Additionally, we have devised and implemented a pipeline on the Jetson TX1 intelligent vehicle, facilitating real-time meter reading recognition in situations where manual interventions are inconvenient and hazardous, thereby confirming its feasibility for practical applications.

1. Introduction

Industry 4.0 is widely regarded as the latest stage of industrial development. It builds on information technology to drive the intelligent development of industry, emphasizing the intelligent upgrading of manufacturing through technologies such as the Internet of Things, big data, and artificial intelligence. Within this framework, all manufacturing processes are digitized, enabling highly individualized production. The intelligent manufacturing pursued by Industry 4.0 requires frequent readings from a wide variety of meters, which are critical for monitoring production processes, optimizing manufacturing, improving product quality, and more. In the early stages of industrial development, meter reading relied mainly on manual operation and management, and manual meter reading emerged at this stage. Traditional industrial mechanical meters remain widely used during this transition, so manual meter reading is still necessary. Although technologies such as Bluetooth, LTE, and LoRa offer alternative ways to collect meter readings, they require additional hardware to be installed on mechanical meters, which is not always feasible or cost-effective, and replacing all mechanical instrumentation demands significant capital investment and technical support. By comparison, image-based meter reading requires only a camera, which is already built into many devices, and provides a more general solution that can handle multiple types of meters and reading scenarios. Therefore, meter reading recognition technology can serve as a key technical means of achieving the goals of Industry 4.0, helping enterprises achieve automation, digitization, and intelligence.
Early research focused on theoretical recognition studies and later expanded to recognizing simple characters such as the numerals 0–9 and Chinese symbols. OCR technology can be classified into two categories: traditional and deep-learning-based character recognition methods. Traditional OCR typically comprises three sequential stages: preprocessing, text area localization, and text recognition. Notably, the classifiers used in the traditional recognition phase include techniques such as template matching, the line-cross method, and support vector machines (SVMs).
Duan et al. [1] were among the earliest to apply OCR technology to digital meter recognition, proposing a method based on fuzzy theory and constructing a recognizer that can quickly identify digital displays. However, the method places high demands on image quality, and oblique images are difficult to segment well, which can leave digits incomplete and degrade recognition. Traditional template matching methods [2,3,4] struggle to distinguish slightly deformed, displaced, or rotated images under interference and perform poorly on digital characters with different fonts and inclinations. Lu et al. [5] proposed an improved standard-template-matching method that subdivides the various interferences for each type of digit and builds multiple templates for matching, reducing the recognition error caused by interference to a certain extent. However, as the number of templates grows, so does the algorithm's complexity, so this improvement places specific demands on computer performance. He et al. [6] proposed the line-cross method, a deterministic recognition algorithm. Although this method achieves high accuracy, adapting it to complex environmental scenarios takes considerable time and effort. Nodari et al. [7] and Cui et al. [8] proposed a digital meter recognition method that uses a linear support vector machine based on the histogram of oriented gradients (HOG) feature: HOG features are detected and digit recognition is performed through SVM classification. Sampath and Gomathi [9] first used the HOG feature descriptor to extract features from character images and then applied the fuzzy-based multi-kernel spherical support vector machine (FMS-SVM) technique to classify the extracted features. SVM-based methods offer good accuracy and robustness; still, they are unsuitable for processing very large sample sets, and it is not easy to find appropriate hyperparameters for the SVM classifier or the best features to extract.
Although traditional OCR technologies have established a well-developed technical process system, they demand a complex and time-consuming preprocessing phase. Generally, it necessitates images with a relatively uncluttered background; even a slightly intricate picture may cause recognition failure. Furthermore, limitations, such as the inadequate recognition of digital characters with varied fonts and inclinations, result in poor adaptability, a low fault tolerance, and an inability to recognize text in diverse layouts, complex backgrounds, and low-resolution environments.
With the advent of deep learning, a plethora of object-detection networks have emerged; notable among them are the RCNN (Region-CNN) series [10,11,12,13], the YOLO (you only look once) series [14,15,16,17,18], and SSD networks [19]. Many scholars have applied these vision-based object detection networks to OCR character recognition. Gómez et al. [20] introduced an end-to-end digital meter recognition method that reads values without segmentation. The authors use the Caffe framework to train a convolutional neural network that reads directly from the input image without detecting the meter reading area. However, their method was only evaluated on a large private dataset with close to 180,000 training samples, in which most images are centered on the digital meter and the digits occupy most of the image. Therefore, as the authors point out, the system is only applicable when the meter is centered in, or occupies a large portion of, the test image. Cai et al. [21] used a fully convolutional neural network to recognize the digits displayed on digital meters and obtained the final result through image-to-image prediction. Guo et al. [22] proposed a deformable convolutional network structure, optimized the receptive field of the convolutional neural network, and used it for character segmentation and recognition on digital meters. Waqar et al. [23] proposed a two-stage digital meter detection and recognition model based on Faster R-CNN, enabling the model to detect meter digits more accurately under different lighting conditions. Laroca et al. [24] designed a two-stage method for digital meter detection and recognition. They compared three CNN-based models on their dataset (CR-NET, Multi-Task Learning, and CRNN) and finally combined Fast-YOLO with CR-NET to achieve the best meter area localization and reading recognition on their meter dataset. Sun et al. [25] used an improved SSD network to detect meter images, replacing the VGG-16 backbone of the basic SSD with ResNet50 in order to extract more feature information from the image, preserving speed while improving recognition accuracy. Li et al. [26] introduced an improved capsule network that overcomes the limitation of the basic capsule network [27], which is difficult to run on devices with limited computational resources; additionally, they integrated morphology-based data preprocessing with the network to improve meter reading recognition. Martinelli et al. [28] proposed a method for localizing and recognizing the cubic meter and liter areas on water meters using YOLOv5 by annotating the dataset. Lin et al. [29] used the fast Fourier transform (FFT) for blur detection and DeblurGANv2 to restore blurred images; they then employed Polygon-YOLOv5 to detect the reading area and CRNN to recognize the readings. Although they state that their method was applied to inspection robots, it does not ensure real-time performance, which is crucial for robots. Carvalho et al. [30] devised a pipeline for mobile devices that incorporates MobileNetv2-SSD [31] for screen detection, EAST [32] for text detection, and Rosetta [33] for text recognition; the entire pipeline requires 1500 ms on a Samsung Active Table 3 device.
The above-mentioned deep learning methods often have a complex network structure and a large amount of calculation, resulting in a trade-off between detection accuracy and inference speed. Additionally, missed detections or misidentifications still occur with state-of-the-art lightweight networks in complex real-life scenarios, such as lighting conditions, specular reflection, and shadow occlusions. In order to combat these challenges in practical applications and enhance network accuracy and efficiency, we propose a lightweight network called DAMP-YOLO. The main work and contributions of this paper are as follows:
  • The proposal of a deformable CSP bottleneck (DCB) module that adapts by learning kernel offsets for objects, making the network resilient to noise interference from objects with similar shapes and structures and thus achieving high accuracy;
  • The design of an aggregated triplet attention (ATA) mechanism that enhances global information interaction by aggregating diverse branches, effectively capturing attention across all dimensions;
  • The use of meter data augmentation (MDA), which addresses the particularities of meter images by generating training data similar to natural scenes, improving the robustness of the model in complex environments;
  • The use of network slimming to prune unimportant network channels, achieving model compression and accelerated computation;
  • The deployment of the recognition pipeline on a Jetson TX1-based intelligent car to realize meter reading recognition in real-world scenarios;
  • The demonstration that our model can still enhance accuracy even with reduced parameters on general object-detection datasets.
The remainder of this paper is organized as follows. Section 2 reviews the YOLOv8 network on which our method is built. Section 3 presents the proposed DAMP-YOLO network architecture and its components. Section 4 describes the embedded deployment on the Jetson TX1 platform. In Section 5, we evaluate the performance of our model on the meter reading and VOC datasets and compare it with state-of-the-art methods. Finally, Section 6 concludes the paper and outlines future research directions.

2. YOLOv8 Network

The object detection networks based on deep learning are classified into two categories: single-stage and two-stage methods. Single-stage methods aim to directly predict the classification and localization of objects through a convolutional neural network, such as YOLO, SSD [19], and Retina-Net [34]. On the other hand, two-stage methods consist of two networks. The input image first generates region proposals through a region proposal network and then predicts the classification and location of the objects using a convolutional neural network, such as SPPNet [35], Faster R-CNN [12], and Mask R-CNN [13]. Since the single-stage methods significantly simplify the training process, reduce the number of model parameters, and improve the inference speed compared to two-stage methods, the single-stage methods are commonly used in mobile devices.
The YOLOv8 [18] method, developed by Ultralytics, is a single-stage neural network algorithm for object detection, image classification, and instance segmentation tasks. It consists of four main components: input preprocessing, backbone network, neck network, and detection head layer. When compared to YOLOv5 [15], which is also developed by Ultralytics, YOLOv8 includes numerous architectural and developer experience enhancements and modifications. The architecture of YOLOv8 is depicted in Figure 1.
The input preprocessing mainly includes image size processing, random scaling, rotation, cropping, mosaic, and other data enhancements so that the expanded training data is as close as possible to the actual distribution data, preventing the model from learning target-irrelevant information and reducing the overfitting problem caused by sample imbalance, thereby improving the prediction accuracy and robustness of the model.
The backbone network mainly comprises the Conv, C2f, and SPPF modules. The Conv module consists of three components: Conv2D, BatchNorm, and SiLU. SPPF is an improvement on SPP: it uses a cascade of three 5 × 5 max-pooling layers to replace the original 5 × 5, 9 × 9, and 13 × 13 max-pooling layers, further improving inference speed while fusing receptive fields of the same sizes. The C2f module is designed by drawing on the ideas of the C3 and ELAN modules, as shown in Figure 2: each network layer obtains richer gradient flow information with the help of C3's gradient split and residual structure, combined with the more parallel gradient flow branches of ELAN.
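As a concrete reference for the modules above, the following PyTorch sketch shows a Conv-style helper (Conv2D + BatchNorm + SiLU) and an SPPF-style block that cascades three 5 × 5 max-pooling layers; the class names and channel choices are illustrative and are not the Ultralytics implementation.

```python
import torch
import torch.nn as nn

class ConvBNSiLU(nn.Module):
    """Conv2D + BatchNorm + SiLU, as described for the YOLOv8 Conv module."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False)
        self.bn = nn.BatchNorm2d(c_out)
        self.act = nn.SiLU()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

class SPPF(nn.Module):
    """Three cascaded 5x5 max-pool layers replace the 5x5/9x9/13x13 pools of SPP."""
    def __init__(self, c_in, c_out):
        super().__init__()
        c_hidden = c_in // 2
        self.cv1 = ConvBNSiLU(c_in, c_hidden, k=1)
        self.pool = nn.MaxPool2d(kernel_size=5, stride=1, padding=2)
        self.cv2 = ConvBNSiLU(c_hidden * 4, c_out, k=1)

    def forward(self, x):
        x = self.cv1(x)
        y1 = self.pool(x)    # receptive field equivalent to 5x5
        y2 = self.pool(y1)   # equivalent to 9x9
        y3 = self.pool(y2)   # equivalent to 13x13
        return self.cv2(torch.cat([x, y1, y2, y3], dim=1))
```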
The neck network uses a top-down FPN (feature pyramid network) combined with a bottom-up PAN (path aggregation network) structure to form a feature pyramid structure, which reduces the transmission path from low-level features to high-level features; it also enables the fusion of feature information of different sizes and better uses low-level feature information to predict results.
In the detection head layer, YOLOv8 introduces the decoupled head mechanism. The traditional coupled head directly predicts the category and position from the convolutional output features of the neck network. In contrast, the decoupled head uses the output features of the neck network to predict the category and location through separate network branches. The difference between the coupled and decoupled heads is shown in Figure 3. Although a decoupled head adds a small amount of extra parameters and computation, separating the regression and classification tasks speeds up model convergence and improves network accuracy.
Compared with the YOLOv5 model, YOLOv8 utilizes the anchor-free mechanism, which directly predicts the object’s center rather than the offset from the known anchor point. Since the anchor-free mechanism does not need to set the anchor manually, it will predict the bounding box more flexibly and efficiently.
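Before describing our modifications in Section 3, the following snippet shows how a baseline YOLOv8 detection model is trained and queried through the Ultralytics API; the dataset configuration file `meter.yaml` and the image path are hypothetical, and the call pattern is only meant to indicate the workflow this paper builds on.

```python
from ultralytics import YOLO

# Load the nano-sized YOLOv8 detection model (pretrained weights).
model = YOLO("yolov8n.pt")

# Train on a custom dataset described by a YAML file (hypothetical path).
model.train(data="meter.yaml", epochs=100, imgsz=640)

# Run inference on an image; results carry boxes, classes, and confidences.
results = model("meter_image.jpg")
print(results[0].boxes.xyxy, results[0].boxes.conf, results[0].boxes.cls)
```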

3. DAMP-YOLO Network

As illustrated in Figure 4, the proposed DAMP-YOLO network architecture enhances the YOLOv8 baseline by incorporating a deformable CSP bottleneck (DCB), aggregated triplet attention (ATA), random meter data augmentation (MDA), and network pruning (NP). The DCB module effectively captures the features of objects with similar shapes and structures through adaptive adjustment of deformable convolutional kernel offsets. The ATA module enhances focus on attention across all dimensions by aggregating features from different branches. The MDA module aims to simulate data distribution in real-world scenarios, and network pruning is employed to reduce network parameters and computational costs. These techniques collectively ensure that our network is robust in detecting objects in complex environments with high accuracy while significantly reducing model size.
The following part of this section will describe, in detail, the implementation process and methods for improvement.

3.1. Deformable CSP Bottleneck

When a traditional convolution layer processes images in visual recognition and detection tasks, it samples input features on a fixed rectangular grid (for example, a 3 × 3 convolution kernel selects the eight neighboring pixels of the current pixel), which cannot cope well with geometric transformations of the image. At the same time, traditional convolution operates within a small local receptive field and ignores more distant information. Therefore, traditional convolutional layers are limited in handling unknown geometric transformations. Their ability to model geometric transformations mainly comes from extensive data augmentation, larger model capacity, and some hand-designed modules (such as max-pooling, which provides invariance to small translations).
Deformable convolution [36] introduces an offset into the traditional convolution operation. It assigns an offset to each input feature point sampled by the convolution layer so that, during training, the convolution learns to sample input features within the informative regions; different convolutional layers thus attend to different geometric transformations of the image, improving the model's robustness to such transformations.
As illustrated in Figure 5, let the input feature map be $X \in \mathbb{R}^{H \times W \times C}$ and the output feature map be $Y \in \mathbb{R}^{H \times W \times D}$, where H and W represent the height and width of the feature map, respectively, and C and D represent the number of channels of the input and output feature maps, respectively. The operation of deformable convolution can be expressed as
$$Y[h, w, d] = \sum_{c=1}^{C} \sum_{i=1}^{I_h} \sum_{j=1}^{I_w} W_{c,i,j,d} * X[h + i - 1,\; w + j - 1,\; c] \tag{1}$$
where $h$, $w$, and $d$ denote the spatial and channel co-ordinates on the output feature map, respectively; $I_h$ and $I_w$ represent the offset range of the convolution kernel in height and width, respectively; $W_{c,i,j,d}$ represents the weight of the convolution kernel; and the sign "*" represents the convolution operation.
In contrast to traditional convolution, deformable convolution must account for the offset of each sampling point, so offsets are added to the convolution kernel. These offsets can be calculated by the following formula:
$$\mathrm{Offset}_{i,j} = \frac{1}{N} \sum_{n=1}^{N} \mathrm{Grad}_{i,j,n} \tag{2}$$
where $\mathrm{Offset}_{i,j}$ denotes the offset of the convolution kernel at position $(i, j)$, $\mathrm{Grad}_{i,j,n}$ denotes the gradient of the loss function with respect to the n-th pixel of the output feature map, and N is the number of pixels in the output feature map.
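For illustration, the following PyTorch sketch builds a deformable convolution with `torchvision.ops.DeformConv2d`. It follows the standard formulation of [36], in which the offsets are predicted by an auxiliary convolution and learned end-to-end by backpropagation, rather than being computed explicitly from Formula (2); the module name `DeformableConvBlock` is illustrative.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution: a regular conv predicts one (dy, dx) offset
    for each of the k*k kernel sampling points at every output position."""
    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        # 2 * k * k offset channels: one (dy, dx) pair per kernel element
        self.offset_conv = nn.Conv2d(c_in, 2 * k * k, kernel_size=k, padding=k // 2)
        nn.init.zeros_(self.offset_conv.weight)   # start from the regular sampling grid
        nn.init.zeros_(self.offset_conv.bias)
        self.deform_conv = DeformConv2d(c_in, c_out, kernel_size=k, padding=k // 2)

    def forward(self, x):
        offsets = self.offset_conv(x)             # learned jointly with the task loss
        return self.deform_conv(x, offsets)

# usage: y = DeformableConvBlock(64, 64)(torch.randn(1, 64, 80, 80))
```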
Motivated by the above, we propose the DCB module, which combines the advantages of C2f, ELAN, and deformable convolution. The DCB module structure is shown in Figure 6. The input features are first processed by an initial convolutional layer to extract preliminary features. The output of this layer is then split into two branches: the main gradient flow and the subgradient flow. The main gradient flow is further processed through deformable convolutional bottleneck layers to extract deeper features. As with the first layer, the output of each deformable convolutional bottleneck is split into two branches, with the main gradient flow serving as the input for the next layer. After passing through N deformable convolutional bottlenecks, the last main gradient flow is concatenated with all subgradient flows and fed into a second convolutional layer to produce the final fused features.
The deformable CSP bottleneck module not only captures more gradient information from the input feature map through the use of main and subgradient flows, thus enhancing network accuracy, but it also adaptively adjusts the deformable convolution kernel’s offset based on the object’s shape and structure. These allow the network to accurately recognize objects with similar shapes and structures under complex environments, ultimately improving the robustness of the network’s recognition performance.
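A structural sketch of the DCB module, modeled on the C2f-style split-and-concatenate pattern described above, is given below. It reuses the `ConvBNSiLU` and `DeformableConvBlock` helpers from the earlier sketches; the channel widths, the number of bottlenecks, and the exact placement of the deformable convolution inside each bottleneck are assumptions rather than the authors' exact configuration.

```python
import torch
import torch.nn as nn
# reuses ConvBNSiLU and DeformableConvBlock from the sketches above

class DeformableBottleneck(nn.Module):
    """Residual bottleneck whose second convolution is deformable (an assumed layout)."""
    def __init__(self, c):
        super().__init__()
        self.cv1 = ConvBNSiLU(c, c, k=3)
        self.cv2 = DeformableConvBlock(c, c, k=3)

    def forward(self, x):
        return x + self.cv2(self.cv1(x))   # residual connection

class DCB(nn.Module):
    """Deformable CSP bottleneck: split into main/sub gradient flows, pass the
    main flow through N deformable bottlenecks, then fuse all flows."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        c = c_out // 2
        self.cv1 = ConvBNSiLU(c_in, 2 * c, k=1)          # first conv, then split
        self.blocks = nn.ModuleList(DeformableBottleneck(c) for _ in range(n))
        self.cv2 = ConvBNSiLU((n + 2) * c, c_out, k=1)   # fuse main + all sub flows

    def forward(self, x):
        main, sub = self.cv1(x).chunk(2, dim=1)
        flows = [sub, main]
        for block in self.blocks:
            flows.append(block(flows[-1]))               # each output also feeds the next block
        return self.cv2(torch.cat(flows, dim=1))
```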

3.2. Aggregated Triplet Attention

In recent years, attention mechanisms have been widely used in various computer vision tasks. The squeeze-and-excitation (SE) [37] channel attention mechanism learns the importance of each channel to enhance useful features and suppress those that contribute little to the current task. Unlike SE, efficient channel attention (ECA) [38] does not require channel compression of the input feature map; it uses a one-dimensional convolution along the channel dimension to capture inter-channel dependencies and estimate channel importance. Frequency channel attention (FCA) [39] proves mathematically that global average pooling (GAP) is a special case of the discrete cosine transform (DCT) and extends the SE attention mechanism to operate in the frequency domain. The convolutional block attention module (CBAM) [40] is a mixed-domain attention module comprising a channel attention module and a spatial attention module: the channel attention module highlights important channels in each feature map, whereas the spatial attention module focuses on essential regions of the image. Co-ordinate attention (CA) [41] uses co-ordinate information to compute attention weights so that the network pays more attention to the spatial relationships between pixels. Triplet attention [42] uses a three-branch structure to capture crossdimensional interactions when computing attention weights, establishing spatial and channel dependencies through rotation operations and residual transformations, and encoding inter-channel and spatial information with negligible computational overhead.
We compared several attention mechanisms. As shown in Table 1, the triplet attention mechanism performed best when added to the YOLOv8n network, so we based our improvements on triplet attention to further enhance the network's performance.
Triplet attention is an attention mechanism with a three-branch structure. The main idea is to divide the input feature representation into three branches, each responsible for computing and applying attention weights over two of the three dimensions of the input tensor, thereby capturing the crossdimensional interaction information between different dimensions. The structure of triplet attention is shown in Figure 7. The first branch captures the crossdimensional interaction between the channel dimension C and the spatial height H; the second branch captures the interaction between channel C and the spatial width W; and the third branch captures the interaction between spatial height H and width W. Specifically, the input tensor passes through the average-pooling and max-pooling layers of Z-Pool (zeroth pool), which compress one of the three dimensions; the result is then processed by a convolutional layer and activated by a Sigmoid to become an attention weight in the range 0–1, which is subsequently multiplied by the original input. Finally, the outputs of the three branches are aggregated by taking their mean.
Given an input tensor $X \in \mathbb{R}^{C \times H \times W}$, the operation can be expressed as follows:
$$Y = \frac{1}{3}\left(\overline{Y_1} + \overline{Y_2} + Y_3\right) = \frac{1}{3}\left(\overline{\hat{X}_1\,\sigma\!\left(\psi_1(\hat{X}_1^{*})\right)} + \overline{\hat{X}_2\,\sigma\!\left(\psi_2(\hat{X}_2^{*})\right)} + X\,\sigma\!\left(\psi_3(\hat{X}_3^{*})\right)\right) \tag{3}$$
where $\sigma$ represents the sigmoid activation function; $\psi_1$, $\psi_2$, and $\psi_3$ represent standard two-dimensional convolutional layers with a kernel size of $k \times k$; $\hat{X}_1$ and $\hat{X}_2$ represent the rotated tensors that enable interaction between two dimensions; $\hat{X}_1$, $\hat{X}_2$, and $X$ are passed through Z-Pool and subsequently reduced to $\hat{X}_1^{*}$, $\hat{X}_2^{*}$, and $\hat{X}_3^{*}$ in the three branches of triplet attention; and the overbar denotes rotating the tensor back to its original orientation.
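For reference, Formula (3) corresponds closely to the public reference implementation of triplet attention [42]; the condensed PyTorch sketch below follows that reference (the 7 × 7 kernel, the BatchNorm after each branch convolution, and the permutation order are details taken from that reference implementation rather than from this paper).

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Z-Pool: stack max- and average-pooling over the channel dimension."""
    def forward(self, x):
        return torch.cat(
            [x.max(dim=1, keepdim=True).values, x.mean(dim=1, keepdim=True)], dim=1
        )

class AttentionGate(nn.Module):
    """Z-Pool -> k x k conv -> sigmoid -> rescale, applied to one branch."""
    def __init__(self, k=7):
        super().__init__()
        self.compress = ZPool()
        self.conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=k, padding=k // 2, bias=False),
            nn.BatchNorm2d(1),
        )

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.compress(x)))

class TripletAttention(nn.Module):
    """Three branches capture (C, H), (C, W), and (H, W) interactions; their
    outputs are averaged, as in Formula (3)."""
    def __init__(self):
        super().__init__()
        self.branch_ch = AttentionGate()   # channel C with height H
        self.branch_cw = AttentionGate()   # channel C with width W
        self.branch_hw = AttentionGate()   # height H with width W

    def forward(self, x):                  # x: (N, C, H, W)
        # rotate W into the channel position -> attend over the (C, H) plane
        y1 = self.branch_ch(x.permute(0, 3, 2, 1).contiguous()).permute(0, 3, 2, 1)
        # rotate H into the channel position -> attend over the (C, W) plane
        y2 = self.branch_cw(x.permute(0, 2, 1, 3).contiguous()).permute(0, 2, 1, 3)
        # identity view -> attend over the (H, W) plane
        y3 = self.branch_hw(x)
        return (y1 + y2 + y3) / 3.0
```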
Triplet attention avoids the information bottleneck and dimensionality-reduction problems common in most attention structures by capturing the interaction between pairs of dimensions. Because the structure contains no fully connected layers and uses only three convolutional layers, its parameter count is very small. However, since the three branches are computed independently, the joint interaction across all three dimensions is ignored, and using three large 7 × 7 convolution kernels to compute the three branches adds computational overhead that slows inference.
Thus, we propose aggregated triplet attention. The core idea of this module is to aggregate the features of the three concurrent branches, blend them with a convolutional layer so that information from all branches is attended to simultaneously, and then split them back into the original branches for attention-weight calculation in distinct dimensions. The structure of the proposed module is shown in Figure 8. First, the input features are fed into three branches. Each branch performs global average pooling to compress two dimensions and is responsible for capturing crossdimensional information interaction between channel C and spatial height H, between channel C and spatial width W, and between spatial height H and width W, respectively. After the three branches are concatenated, a convolutional layer with a 3 × 3 kernel captures the information interaction among all branches. The output is then split back into three branches, and each branch applies a 1 × 1 convolution to compute its attention weight. The computed attention weights are multiplied with the original input features and passed through a Sigmoid activation function to constrain the values to the range 0 to 1. Finally, the average of the three branches is taken as the output.
The aggregated triplet attention mechanism better captures the interactive information among its three branches and computes the corresponding attention weights through feature aggregation. Additionally, employing small convolution kernels instead of the original 7 × 7 large kernel enhances the network’s inference speed.

3.3. Meter Data Augmentation

Traditional data preprocessing relies heavily on parameter tuning, typically based on morphological features. However, given the intricacy and dynamism of real-world environments, fixed-parameter traditional data preprocessing often struggles to adapt well to environmental changes. By artificially incorporating prior knowledge of human vision, data augmentation can enhance the model’s performance and has become a widely accepted practice for training computer vision models. Therefore, we have chosen to adopt data enhancement as a means of simulating real-life data distribution to improve the robustness of the network. Compared with ordinary object detection images, meter reading images have their particularities. These characteristics can be considered and analyzed during meter reading image preprocessing and network training:
  • Background color: The digital display area of the meter image is usually dominated by light colors (such as light green or light grey), and the peripheral areas of the meter are usually dark colors (such as dark black and dark brown).
  • Font labels: Meter images contain text labels such as meter readings and timestamps. Meter readings are the core of meter recognition and are usually presented using specific digital fonts, which cannot be flipped left and right or up and down.
  • Shape layout: Meters of the same model have similar shapes and layouts, and the numbers and labels in the meter reading image are primarily located in the center of the image with the same fonts.
  • Noise interference: The meter image may be affected by noise interference, such as day and night light changes, stains, reflections, and shadow occlusion. These interferences may affect the accuracy and stability of readings.
Given the above particularities of the meter reading data, the dataset undergoes random processing during input preprocessing to augment the data. Apart from basic geometric transformations such as random translation, rotation, and scaling, meter data augmentation also includes random channel multiplication, Gaussian blur, Gaussian noise, alpha grid, and conversion to grayscale. Example images are shown in Figure 9.
These steps replicate the process in which a camera captures images from various angles and positions, often with noise interference. By adjusting and refining the training data, we aim to align more closely with the data distribution in real-world scenarios, ultimately enhancing the model’s performance and generalization capabilities.
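A pipeline with roughly the augmentations listed above can be assembled with a library such as Albumentations; the sketch below is only an approximation, since the exact transforms, probabilities, and parameter ranges are not specified in this paper, random channel multiplication is approximated with the closest standard transform, and the alpha-grid augmentation is omitted.

```python
import albumentations as A

# Approximate meter data augmentation pipeline (probabilities are illustrative).
meter_augment = A.Compose(
    [
        A.ShiftScaleRotate(shift_limit=0.1, scale_limit=0.2, rotate_limit=10, p=0.5),
        A.MultiplicativeNoise(multiplier=(0.8, 1.2), per_channel=True, p=0.3),  # ~ random channel multiply
        A.GaussianBlur(blur_limit=(3, 7), p=0.3),
        A.GaussNoise(p=0.3),
        A.ToGray(p=0.2),   # conversion to grayscale
        # note: no horizontal/vertical flips, since meter digits cannot be mirrored
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)

# usage: augmented = meter_augment(image=img, bboxes=boxes, class_labels=labels)
```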

3.4. Network Pruning

A model trained by a neural network contains weight parameters that carry only a small proportion of the weight and play a minor role during inference. Although these parameters have little influence on the results, they still consume considerable computation. Deploying models on mobile devices with limited memory and computing resources therefore requires a model compression method that removes these trivial weight parameters while sacrificing only a little accuracy. Network pruning is a mainstream model compression method: by pruning unimportant neurons, filters, or channels, the model's parameters and computation can be effectively compressed.
In this paper, we use the network slimming [43] method to implement a channel pruning algorithm based on the BatchNorm (BN) layer, which uses the BN layer parameters as the index for evaluating channel importance. The BN layer is computed as follows:
$$\hat{z} = \frac{Z_{in} - \mu_{\beta}}{\sqrt{\sigma_{\beta}^{2} + \epsilon}} \tag{4}$$
$$Z_{out} = \gamma \hat{z} + \beta \tag{5}$$
In Formulas (4) and (5), $\gamma$ and $\beta$ are the normalization parameters of the BN layer; $Z_{in}$ and $Z_{out}$ represent the input and output of the BN layer, respectively; $\mu_{\beta}$ and $\sigma_{\beta}$ represent the mean and variance of the BN layer; and $\epsilon$ is a small constant that prevents the denominator from being 0. In Formula (5), when $\gamma$ is close to 0, the output is essentially independent of the input. Therefore, the channel pruning algorithm uses the $\gamma$ parameter of the BN layer as the scaling factor for network pruning to measure the importance of each channel. The scaling factor $\gamma$ is multiplied with the output of each channel, and the network weights of the original network are trained jointly with the $\gamma$ scaling factors; the $\gamma$ values are then collected and sorted, and the input-output connections of channels whose $\gamma$ falls below a preset global threshold are removed directly, compressing the model's parameters and computation.
Figure 10 shows the overall scheme of model channel pruning. In the pruning process, we first need to perform sparse training on the pretrained model, update the gradient during backpropagation, then prune the noncritical channels in the model by setting the pruning rate, and finally retrain the pruned model to obtain the final model.

3.4.1. Sparse Training

The $\gamma$ values of the BN layer generally follow a normal distribution, which makes it difficult to use them directly to measure channel importance. Therefore, before model pruning, it is necessary to add an L1 regularization constraint on $\gamma$ to the loss function to make the $\gamma$ values sparse. The loss function then becomes
$$L = \sum_{(x, y)} l\big(f(x, W), y\big) + \lambda \sum_{\gamma \in \Gamma} g(\gamma) \tag{6}$$
In Formula (6), $\sum_{(x, y)} l(f(x, W), y)$ is the loss produced by the YOLOv8 model prediction, where x represents the input data, y represents the target output, W represents the weight parameters during training, and l is the loss function of each layer of the network; $\lambda \sum_{\gamma \in \Gamma} g(\gamma)$ is the added L1 regularization constraint on the scaling factors, where $g(\cdot)$ is the sparsity penalty and $\lambda$ is a hyperparameter that balances the two terms.
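In practice, the L1 term in Formula (6) can be applied as a subgradient added to the gradients of the BN scaling factors after the ordinary backward pass, as in the network slimming reference implementation; a minimal sketch follows, with an illustrative value for the sparsity coefficient.

```python
import torch
import torch.nn as nn

def add_bn_l1_subgradient(model: nn.Module, sparsity_lambda: float = 1e-4) -> None:
    """After loss.backward(), add the L1 subgradient lambda * sign(gamma) to the
    gradient of every BN scaling factor, pushing the gammas toward zero."""
    for module in model.modules():
        if isinstance(module, nn.BatchNorm2d) and module.weight.grad is not None:
            module.weight.grad.add_(sparsity_lambda * torch.sign(module.weight.detach()))

# training-loop sketch (detection_loss is a hypothetical loss call):
# loss = detection_loss(model(images), targets)
# loss.backward()
# add_bn_l1_subgradient(model, 1e-4)
# optimizer.step()
```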

3.4.2. Channel Prune and Finetune

After the network model is sparsely trained, some of the $\gamma$ values in the BN layers tend toward 0, and these weights have little influence on the inference result. Therefore, an appropriate pruning rate, P, can be set to prune the corresponding BN channels, removing the associated convolution kernels in the adjacent preceding layer and the channel outputs fed to the following layer, thereby obtaining a model with fewer parameters and less computation. The pruned model is then fine-tuned with the same training procedure as standard training, restoring accuracy and yielding a lightweight model without sacrificing too much accuracy. The schematic diagram of channel pruning is shown in Figure 11.
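A minimal sketch of the global-threshold selection described above: gather all BN scaling factors, take the value at the chosen pruning rate P as the threshold, and derive a keep-mask per BN layer (the surgery that actually rebuilds the slimmer network and remaps the weights is omitted here).

```python
import torch
import torch.nn as nn

def bn_prune_masks(model: nn.Module, prune_rate: float = 0.5):
    """Compute per-BN-layer channel keep-masks from a global gamma threshold."""
    gammas = torch.cat(
        [m.weight.detach().abs().flatten()
         for m in model.modules() if isinstance(m, nn.BatchNorm2d)]
    )
    threshold = torch.sort(gammas).values[int(len(gammas) * prune_rate)]
    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            masks[name] = m.weight.detach().abs() > threshold   # True = keep channel
    return threshold, masks

# Channels with mask == False (and their convolution kernels in the adjacent
# layers) are removed, after which the slimmer model is fine-tuned as usual.
```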

4. Embedded Deployment

In order to address the potential safety concerns associated with real-time meter readings, we implement the model presented in this paper on an intelligent car, as depicted in Figure 12. The intelligent car integrates the Jetson TX1 embedded development board. It contains a 256-core NVIDIA Maxwell GPU, 64-bit ARM Cortex-A57 CPU, 4GB LPDDR4 memory, 16 GB flash memory, Bluetooth, an 802.11ac Wi-Fi module, and a gigabit ethernet card. It operates on the Linux for Tegra operating system, leveraging its substantial computational power and image-processing capabilities to achieve intelligent functionalities, such as obstacle avoidance, path planning, and autonomous driving.
Simultaneously, the intelligent vehicle is equipped with an array of sensors, including an Astra Pro camera, an LSLiDAR N10 lidar, an MPU9250 gyroscope, and an Actions ATS3605D smart voice chip. These sensors collect data on the vehicle's surroundings, enabling it to detect obstructions, plan driving routes, and manage its movement.
We intend to utilize the Jetson TX1 as the primary control unit for our intelligent vehicle. By executing machine learning algorithms and deep learning models, it will be capable of processing and analyzing sensor data, ultimately achieving intelligent functionalities.
During the inference stage, we implemented the detection and recognition network model onto the Jetson TX1 intelligent car, seamlessly integrating the input and output of both network components. As illustrated in Figure 13, the intelligent car’s inference pipeline with the meter reading recognition network primarily comprises five stages: (i) input images captured from the camera, (ii) meter reading area detection, (iii) image cropping from the detected area, (iv) meter reading recognition, and (v) meter reading aggregation.
After capturing an image from the camera using OpenNI, we resized the input image to 640 × 640 pixels. We then utilized the proposed model in this paper to identify the meter reading area within the image. Based on the confidence level of the detection, the detected meter reading areas are classified as either clear and operable or illegible and faulty. If an area is deemed illegible, it is discarded, and the image is recaptured. The operable area is then cropped and input into our recognition network to obtain the meter readings. Once again, based on the confidence level, the recognized characters are classified as either correct or false. False readings are rejected, and the meter reading area detection process is repeated. The correct readings are sorted and combined according to their positions, and the final output results are saved and displayed. Figure 14 shows the detailed flow chart.
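The two-stage pipeline can be expressed roughly as follows, assuming both the area-detection and digit-recognition models are loaded through the Ultralytics API; the weight file names, confidence thresholds, the left-to-right ordering rule, and the mapping from class indices to digit characters are illustrative assumptions.

```python
from ultralytics import YOLO

area_detector = YOLO("meter_area.pt")       # hypothetical weights: reading-area detection
digit_recognizer = YOLO("meter_digits.pt")  # hypothetical weights: digit recognition

def read_meter(frame, area_conf=0.5, digit_conf=0.5):
    """Detect the meter reading area, crop it, recognize digits, and assemble
    the reading left to right. Returns None if the area is illegible."""
    areas = area_detector(frame, conf=area_conf)[0].boxes
    if len(areas) == 0:
        return None                                   # illegible or faulty: recapture
    x1, y1, x2, y2 = map(int, areas.xyxy[0].tolist()) # highest-confidence area
    crop = frame[y1:y2, x1:x2]
    digits = digit_recognizer(crop, conf=digit_conf)[0].boxes
    if len(digits) == 0:
        return None
    # sort recognized digits by horizontal position and concatenate;
    # assumes class index i corresponds to the digit character i
    ordered = sorted(zip(digits.xyxy[:, 0].tolist(), digits.cls.tolist()),
                     key=lambda t: t[0])
    return "".join(str(int(c)) for _, c in ordered)

# frame: a BGR image captured from the on-board camera (e.g., via OpenNI/OpenCV)
# print(read_meter(frame))
```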

5. Experiment and Results

5.1. Dataset Setup

In our experiment, we evaluate models using three datasets: the meter reading area detection dataset, the meter reading recognition dataset, and the Pascal VOC dataset [44].
For the meter reading area detection and meter reading recognition datasets, we utilized the floodX [45] dataset in this experiment. The floodX dataset was collected over two days during flash flood experiments in an urban setting, where 21 sensors, such as radars, ultrasonics, and cameras, were installed to monitor flooding-related variables. The environmental conditions during data collection included overcast, direct sunlight, cloudy conditions, and nighttime, thus providing a robust test for vision-based meter detection and recognition. Since the floodX dataset only contains meter reading annotations without specific locations, we employed LabelImg to annotate the reading areas and positions, thereby obtaining two required datasets.
The meter reading area detection dataset aims to localize the meter reading area, whereas the meter reading recognition dataset focuses on recognizing the digits within the detected area. The data collected on the first day was utilized as the training set, and the second day was used for testing. The meter reading area detection dataset comprises 23,365 images, including 12,716 training images. On the other hand, the meter reading recognition dataset consists of 46,632 images, with 25,337 images in the training set. Figure 15 illustrates a few sample images from the dataset.
The PASCAL VOC dataset is a collection of images created for the PASCAL Visual Object Classes (VOC) Challenge and comprises 20 object categories, such as person, car, bird, and cat, among others. The dataset consists of 9963 images, with 5011 images in the training and validation set and 4952 images in the test set. Each image is annotated with bounding boxes and class labels for the objects it contains, and some images are additionally annotated with per-object segmentation masks.
The PASCAL VOC dataset is widely recognized as a benchmark for object detection, providing a standardized set of images and annotations to evaluate a model’s performance. This enables us to assess our model’s generalization ability and competitiveness in the field. Figure 16 showcases some of the images included in the dataset.

5.2. Evaluation Metrics

This experiment uses Precision, Recall, Mean Average Precision (mAP), and Model Parameters as evaluation indicators to evaluate the performance of different models comprehensively. The calculation formulas for Precision, Recall, AP, and mAP are as follows:
$$Precision = \frac{TP}{TP + FP} \tag{7}$$
$$Recall = \frac{TP}{TP + FN} \tag{8}$$
$$AP = \int_{0}^{1} P(r)\, dr \tag{9}$$
$$mAP = \frac{1}{K} \sum_{i=1}^{K} AP_i \tag{10}$$
In these formulas, TP is the number of positive samples correctly identified, TN is the number of negative samples correctly identified, FP is the number of negative samples falsely reported as positive, and FN is the number of positive samples that are missed. $P(r)$ denotes the precision-recall curve, in which the ordinate is precision and the abscissa is recall; K is the total number of categories; and $AP_i$ is the average precision of the i-th category.
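As a small worked example of these formulas: with 8 true positives, 2 false positives, and 4 false negatives, precision is 8/(8+2) = 0.8 and recall is 8/(8+4) ≈ 0.667; the snippet below also averages illustrative per-class AP values into mAP.

```python
def precision(tp, fp):
    return tp / (tp + fp)

def recall(tp, fn):
    return tp / (tp + fn)

def mean_average_precision(ap_per_class):
    """mAP: the mean of the per-class average precision values."""
    return sum(ap_per_class) / len(ap_per_class)

print(precision(8, 2))                       # 0.8
print(recall(8, 4))                          # 0.666...
print(mean_average_precision([0.91, 0.87]))  # 0.89
```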

5.3. Environment Details

All models in this experiment were trained using the Ubuntu 18.04.6 LTS 64-bit operating system, which was configured with NVIDIA Tesla P100 PCIe 16GB GPU, Intel(R) Xeon(R) Silver 4114 CPU@2.20 GHz and 126 GB memory, and trained using Python 3.8 and the Pytorch 1.12 framework. During the stage of testing the inference speed, we used the mobile graphics card MX350 and the Jetson TX1 developer kit to verify the efficiency of the models on devices with limited resources.

5.4. Experiments and Discussion

In this section, we evaluate the proposed model and conduct ablation experiments on the meter reading recognition, meter reading area detection, and VOC datasets. The aim is to compare the performance of our model with other state-of-the-art models and demonstrate its robustness. In addition to comparing our proposed model with YOLOv8, we replace the Backbone in YOLOv8 with other lightweight models, such as MobileNet [46], ShuffleNet [47], and RepVGG [48]. Moreover, we compare these with other YOLO series models like YOLOv5 [15], YOLOv6 [16], and YOLOv7 [17]. Furthermore, we also compare with other popular models, such as SSD300 [19], Retinanet [34], MobileNetV2-SSDLite [31], and TOOD [49], to provide a comprehensive evaluation.

5.4.1. Meter Reading Recognition Results

At the outset of our experiment, we assessed the effectiveness of various models using the meter reading recognition dataset to determine their capacity for identifying meter readings. When referring to Table 2 and Figure 17, it is evident that the models presented in this paper exhibit superior performance compared to other models with comparable parameters. Additionally, once we integrated meter data augmentation, the mAP@50:95 of our model surpassed that of YOLOv8s while significantly reducing the number of parameters required.

5.4.2. PASCAL VOC Object Detection Results

The PASCAL VOC dataset primarily comprises images captured from real-world scenarios, each featuring multiple objects. Each object in an image is annotated with a bounding box and labeled with its corresponding category. The dataset encompasses 20 distinct object categories, such as humans, vehicles, and structures. Consequently, our study leverages this dataset to assess the efficacy and resilience of various models.
Table 3 and Figure 18 demonstrate that the model introduced in this paper outperforms other models with similar parameters in the VOC dataset comparison experiment. Moreover, our pruned model possesses fewer parameters, rendering it well-suited for deployment on embedded platforms with constrained memory resources while maintaining high accuracy.
Verification on the PASCAL VOC dataset demonstrates that the proposed model not only excels at meter reading recognition but also performs strongly on general object detection tasks. These results show that DAMP-YOLO possesses outstanding feature learning and generalization capabilities.

5.4.3. Meter Reading Area Detection Results

In addition, we compared the performance of the YOLOv8 models for the meter reading area detection dataset. As shown in Table 4, the model introduced in this paper exhibits outstanding performance on the three mentioned datasets. This confirms the model’s remarkable learning capability and robustness across various datasets. Moreover, considering the varying levels of task complexity, the pruned model size also adjusts accordingly. In the meter reading area detection task, our model can still rival the YOLOv8 model when the parameters are reduced by 43.93%. Hence, the model can be tailored to different tasks by adjusting suitable parameters while maintaining high accuracy.

5.4.4. Ablation Experiments

The ablation experiments utilized the YOLOv8n model as the baseline and assessed the impact of incorporating the deformable CSP bottleneck (DCB) module, aggregated triplet attention (ATA), meter data augmentation (MDA), and network pruning (NP). The results, presented in Table 5, reveal that each of the proposed modules in this study contributed to enhanced performance when incorporated into the network model.
In order to assess the effectiveness and reliability of meter data augmentation, we incorporated it into various models and evaluated it using the meter reading recognition dataset to compare the performances. The findings of our experiments are presented in Table 6. It is evident that meter data augmentation enhances the precision of distinct models, thus demonstrating its positive impact on meter reading.

5.4.5. Embedded Deployed Results

By utilizing our lightweight network model and inference pipeline, the Jetson TX1 intelligent car can effectively capture images from its camera, conduct real-time recognition of meter readings, and demonstrate robust anti-interference capabilities. After executing the inference process, the resulting output of the intelligent car is presented in Figure 19.
Moreover, we conducted real-world testing on a smart car to compare the performance of DAMP-YOLO and YOLOv8 under varying lighting conditions. The results, presented in Figure 20, demonstrate that YOLOv8n fails to detect objects in scenarios with high or low brightness, whereas DAMP-YOLO maintains consistent and accurate detection capabilities. This further validates the robustness and reliability of DAMP-YOLO in real-world applications.

5.4.6. Inference Speed Results

As this model is designed for deployment on resource-constrained devices, its inference speed is evaluated on the mobile graphics card MX350 and the Jetson TX1 development kit to verify its efficiency on mobile devices.
Table 7 compares the inference speed performance of various models for meter reading recognition. Please note that networks with lower accuracy and slower inference speeds were not evaluated on the Jetson TX1. Our model surpasses YOLOv8s in accuracy, with a slight compromise in inference speed. This trade-off results in a model that occupies less memory than YOLOv8n and exhibits appropriate real-time inference speed while maintaining robust recognition capabilities.

6. Conclusions

In this paper, we propose a lightweight network called DAMP-YOLO. It utilizes deformable CSP bottleneck (DCB), aggregated triplet attention (ATA), and meter data augmentation (MDA) to enhance the YOLOv8 network and employs network pruning (NP) to reduce the number of model parameters. The proposed modules enable the network to better capture objects with similar shapes and structures through deformable convolutional kernel offsets and focus attention on global information interaction across different dimensions. The experimental results demonstrate the superiority of the DAMP-YOLO network compared to other state-of-the-art methods in terms of precision, recall, and mAP, with a significant reduction in model size. This makes DAMP-YOLO a promising solution for real-time meter reading recognition when using resource-constrained devices. Furthermore, we designed a pipeline and deployed it on a Jetson TX1 intelligent car to achieve real-time recognition of meter readings. The successful deployment of this on the Jetson TX1 intelligent car demonstrated the practicability of our method in real-world applications. In the future, we can explore further optimizations of the DAMP-YOLO network for faster inference speed, such as post-training quantization, to enhance its efficiency and reduce memory requirements. Additionally, we can investigate the network’s performance in real-world scenarios to further validate its applicability in practical applications.

Author Contributions

Conceptualization, S.Z.; funding acquisition, X.Z. and W.W. and F.W.; methodology, S.Z. and Z.C.; software, S.Z. and Q.L.; data curation, W.W.; formal analysis, X.Z.; writing—original draft, S.Z.; writing—review and editing, S.Z.; visualization, Q.L. and Y.G.; supervision, X.Z.; project administration, X.Z. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the 2022 Scientific Research Project of Beijing Municipal Education Commission (Grant No. KM202210017006), the 2020 Scientific Research Project of Beijing Municipal Education Commission (Grant No. KM202010017011), and the National College Student Innovation and Entrepreneurship Training Program (Grant No. 2023J00070).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are publicly available on Zenodo at: https://doi.org/10.5281/zenodo.830501 (accessed on 26 September 2023) (floodX Flooding Images).

Acknowledgments

The authors thank the editor and anonymous reviewers for providing helpful suggestions for improving the quality of this manuscript.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Duan, H.; Zhang, H.; Zhang, S.; Shou, S.; Liu, H. Reasearch of meter digits recognition based on fuzzy theory. Instrum. Tech. Sens. 2004, 1, 37–39. [Google Scholar]
  2. Shuang, G. Study on automatic identification method of digital tube. Commun. Technol. 2012, 1, 91–93. [Google Scholar]
  3. Zhao, S.; Li, B.; Yuan, J.; Cui, G. Research on remote meter automatic reading based on computer vision. In Proceedings of the 2005 IEEE/PES Transmission & Distribution Conference & Exposition: Asia and Pacific, Dalian, China, 18 August 2005; pp. 1–4. [Google Scholar]
  4. Vanetti, M.; Gallo, I.; Nodari, A. Gas meter reading from real world images using a multi-net system. Pattern Recognit. Lett. 2013, 34, 519–526. [Google Scholar] [CrossRef]
  5. Lu, W.; Liu, C.; Zheng, Y.; Wang, H. A method for digital instrument character recognition based on template matching. Mod. Comput. 2008, 1, 70–72. [Google Scholar]
  6. He, H.; Kong, H. A new processing method of nixie tube computer vision recognition. Electron. Eng. 2007, 33, 65–69. [Google Scholar]
  7. Nodari, A.; Gallo, I. A multi-neural network approach to image detection and segmentation of gas meter counter. In MVA; Citeseer: Centre County, PA, USA, 2011; pp. 239–242. [Google Scholar]
  8. Cui, X.; Hua, F.; Yang, G. A new method of digital number recognition for substation inspection robot. In Proceedings of the 2016 4th International Conference on Applied Robotics for the Power Industry (CARPI), Jinan, China, 11–13 October 2016. [Google Scholar]
  9. Sampath, A.; Gomathi, N. Fuzzy-based multi-kernel spherical support vector machine for effective handwritten character recognition. Sādhanā 2017, 42, 1513–1525. [Google Scholar] [CrossRef]
  10. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Columbus, OH, USA, 23–28 June 2014; pp. 580–587. [Google Scholar]
  11. Girshick, R. Fast r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Boston, MA, USA, 7–12 June 2015; pp. 1440–1448. [Google Scholar]
  12. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster r-cnn: Towards real-time object detection with region proposal networks. Adv. Neural Inf. Process. Syst. 2015, 28, 91–99. [Google Scholar] [CrossRef] [PubMed]
  13. He, K.; Gkioxari, G.; Dollár, P.; Girshick, R. Mask r-cnn. In Proceedings of the IEEE International Conference on Computer Vision, Honolulu, HI, USA, 21–26 July 2017; pp. 2961–2969. [Google Scholar]
  14. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You only look once: Unified, real-time object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar]
  15. Jocher, G. Ultralytics yolov5. 2020. Available online: https://github.com/ultralytics/yolov5 (accessed on 26 September 2023).
  16. Li, C.; Li, L.; Jiang, H.; Weng, K.; Geng, Y.; Li, L.; Ke, Z.; Li, Q.; Cheng, M.; Nie, W.; et al. Yolov6: A single-stage object detection framework for industrial applications. arXiv 2022, arXiv:2209.02976. [Google Scholar]
  17. Wang, C.-Y.; Bochkovskiy, A.; Liao, H.-Y.M. Yolov7: Trainable bag-of-freebies sets new state-of-the-art for real-time object detectors. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Vancouver, BC, Canada, 17–24 June 2023; pp. 7464–7475. [Google Scholar]
  18. Jocher, G.; Chaurasia, A.; Qiu, J. Ultralytics yolov8. 2023. Available online: https://github.com/ultralytics/ultralytics (accessed on 26 September 2023).
  19. Liu, W.; Anguelov, D.; Erhan, D.; Szegedy, C.; Reed, S.; Fu, C.-Y.; Berg, A.C. Ssd: Single shot multibox detector. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, 11–14 October 2016; Proceedings, Part I 14; Springer: Berlin/Heidelberg, Germany, 2016; pp. 21–37. [Google Scholar]
  20. Gómez, L.; Rusinol, M.; Karatzas, D. Cutting sayre’s knot: Reading scene text without segmentation. Application to utility meters. In Proceedings of the 2018 13th IAPR International Workshop on Document Analysis Systems (DAS), Vienna, Austria, 24–27 April 2018; pp. 97–102. [Google Scholar]
  21. Cai, M.; Zhang, L.; Wang, Y.; Mo, J. A fully convolution network based approach for character recognition in digital meter. Mod. Comput. 2018, 1, 38–43. [Google Scholar]
  22. Guo, L.; Han, R.; Cheng, X. Digital instrument identification method based on deformable convolutional neural network. Comput. Sci. 2020, 47, 187–193. [Google Scholar]
  23. Waqar, M.; Waris, M.A.; Rashid, E.; Nida, N.; Nawaz, S.; Yousaf, M.H. Meter digit recognition via faster r-cnn. In Proceedings of the 2019 International Conference on Robotics and Automation in Industry (ICRAI), Rawalpindi, Pakistan, 21–22 October 2019; pp. 1–5. [Google Scholar]
  24. Laroca, R.; Barroso, V.; Diniz, M.A.; Gonçalves, G.R.; Schwartz, W.R.; Menotti, D. Convolutional neural networks for automatic meter reading. J. Electron. Imaging 2019, 28, 013023. [Google Scholar] [CrossRef]
  25. Sun, S.; Yang, T. Instrument target detection algorithm based on deep learning. Instrum. Tech. Sens. 2021, 6, 104–108. [Google Scholar]
  26. Li, D.; Hou, J.; Gao, W. Instrument reading recognition by deep learning of capsules network model for digitalization in industrial internet of things. Eng. Rep. 2022, 4, e12547. [Google Scholar] [CrossRef]
  27. Hinton, G.E.; Krizhevsky, A.; Wang, S.D. Transforming auto-encoders. In Artificial Neural Networks and Machine Learning–ICANN 2011: 21st International Conference on Artificial Neural Networks, Espoo, Finland, 14–17 June 2011; Proceedings, Part I 21; Springer: Berlin/Heidelberg, Germany, 2011; pp. 44–51. [Google Scholar]
  28. Martinelli, F.; Mercaldo, F.; Santone, A. Water meter reading for smart grid monitoring. Sensors 2023, 23, 75. [Google Scholar] [CrossRef] [PubMed]
  29. Lin, W.; Zhao, Z.; Tao, J.; Lian, C.; Zhang, C. Research on digital meter reading method of inspection robot based on deep learning. Appl. Sci. 2023, 13, 7146. [Google Scholar] [CrossRef]
  30. Carvalho, R.; Melo, J.; Graça, R.; Santos, G.; Vasconcelos, M.J.M. Deep learning-powered system for real-time digital meter reading on edge devices. Appl. Sci. 2023, 13, 2315. [Google Scholar] [CrossRef]
  31. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.-C. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520. [Google Scholar]
  32. Zhou, X.; Yao, C.; Wen, H.; Wang, Y.; Zhou, S.; He, W.; Liang, J. East: An efficient and accurate scene text detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 5551–5560. [Google Scholar]
  33. Borisyuk, F.; Gordo, A.; Sivakumar, V. Rosetta: Large scale system for text detection and recognition in images. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, London, UK, 19–23 August 2018; pp. 71–79. [Google Scholar]
  34. Lin, T.-Y.; Goyal, P.; Girshick, R.; He, K.; Dollár, P. Focal loss for dense object detection. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2980–2988. [Google Scholar]
  35. He, K.; Zhang, X.; Ren, S.; Sun, J. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 1904–1916. [Google Scholar] [CrossRef]
  36. Dai, J.; Qi, H.; Xiong, Y.; Li, Y.; Zhang, G.; Hu, H.; Wei, Y. Deformable convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 764–773. [Google Scholar]
  37. Hu, J.; Shen, L.; Sun, G. Squeeze-and-excitation networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 7132–7141. [Google Scholar]
  38. Wang, Q.; Wu, B.; Zhu, P.; Li, P.; Zuo, W.; Hu, Q. Eca-net: Efficient channel attention for deep convolutional neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Seattle, WA, USA, 13–19 June 2020; pp. 11534–11542. [Google Scholar]
  39. Qin, Z.; Zhang, P.; Wu, F.; Li, X. Fcanet: Frequency channel attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Montreal, QC, Canada, 10–17 October 2021; pp. 783–792. [Google Scholar]
  40. Woo, S.; Park, J.; Lee, J.-Y.; Kweon, I.S. Cbam: Convolutional block attention module. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 3–19. [Google Scholar]
  41. Hou, Q.; Zhou, D.; Feng, J. Coordinate attention for efficient mobile network design. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13713–13722. [Google Scholar]
  42. Misra, D.; Nalamada, T.; Arasanipalai, A.U.; Hou, Q. Rotate to attend: Convolutional triplet attention module. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, Waikoloa, HI, USA, 3–8 January 2021; pp. 3139–3148. [Google Scholar]
  43. Liu, Z.; Li, J.; Shen, Z.; Huang, G.; Yan, S.; Zhang, C. Learning efficient convolutional networks through network slimming. In Proceedings of the IEEE International Conference on Computer Vision, Venice, Italy, 22–29 October 2017; pp. 2736–2744. [Google Scholar]
  44. Everingham, M.; Gool, L.V.; Williams, C.K.I.; Winn, J.; Zisserman, A. The PASCAL Visual Object Classes Challenge 2007 (VOC2007) Results. Available online: http://www.pascal-network.org/challenges/VOC/voc2007/workshop/index.html (accessed on 26 September 2023).
  45. de Vitry, M.M.; Dicht, S.; Leitão, J.P. floodx: Urban flash flood experiments monitored with conventional and alternative sensors. Earth Syst. Sci. Data 2017, 9, 657–666. [Google Scholar] [CrossRef]
  46. Howard, A.; Sandler, M.; Chu, G.; Chen, L.-C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324. [Google Scholar]
  47. Ma, N.; Zhang, X.; Zheng, H.-T.; Sun, J. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131. [Google Scholar]
  48. Ding, X.; Zhang, X.; Ma, N.; Han, J.; Ding, G.; Sun, J. Repvgg: Making vgg-style convnets great again. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Nashville, TN, USA, 20–25 June 2021; pp. 13733–13742. [Google Scholar]
  49. Feng, C.; Zhong, Y.; Gao, Y.; Scott, M.R.; Huang, W. Tood: Task-aligned one-stage object detection. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV); IEEE Computer Society: Washington, DC, USA, 2021; pp. 3490–3499. [Google Scholar]
Figure 1. A structure diagram of the YOLOv8 network, mainly consisting of a Conv module, C2f module, SPPF module, and a Detect module.
Figure 2. Comparison of C3, ELAN, and C2f modules. The C2f module combines the advantages of C3 and ELAN modules.
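To make the comparison in Figure 2 more concrete, the following simplified PyTorch sketch shows a C2f-style block: a CSP-style channel split followed by ELAN-style aggregation of every intermediate bottleneck output. The channel widths, bottleneck design, and expansion ratio are illustrative assumptions rather than the exact Ultralytics implementation.

```python
import torch
import torch.nn as nn

def conv_bn_act(c_in, c_out, k=1):
    """Conv + BatchNorm + SiLU, the basic convolution unit used throughout YOLOv8."""
    return nn.Sequential(nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
                         nn.BatchNorm2d(c_out), nn.SiLU())

class Bottleneck(nn.Module):
    def __init__(self, c, shortcut=True):
        super().__init__()
        self.block = nn.Sequential(conv_bn_act(c, c, 3), conv_bn_act(c, c, 3))
        self.shortcut = shortcut

    def forward(self, x):
        return x + self.block(x) if self.shortcut else self.block(x)

class C2fSketch(nn.Module):
    """Split the features, run n bottlenecks, and concatenate every intermediate output."""
    def __init__(self, c_in, c_out, n=2):
        super().__init__()
        self.c = c_out // 2
        self.cv1 = conv_bn_act(c_in, 2 * self.c, 1)
        self.cv2 = conv_bn_act((2 + n) * self.c, c_out, 1)
        self.blocks = nn.ModuleList(Bottleneck(self.c) for _ in range(n))

    def forward(self, x):
        y = list(self.cv1(x).chunk(2, dim=1))   # CSP-style split into two halves
        for m in self.blocks:
            y.append(m(y[-1]))                  # ELAN-style dense aggregation
        return self.cv2(torch.cat(y, dim=1))

print(C2fSketch(64, 128)(torch.randn(1, 64, 40, 40)).shape)  # (1, 128, 40, 40)
```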
Figure 3. Comparison of the Coupled Head and Decoupled Head modules. The Decoupled Head uses separate branches to predict the category and the location from the feature map, whereas the Coupled Head predicts both from a single shared branch.
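As an illustration of the distinction in Figure 3, the simplified PyTorch sketch below builds a decoupled head with separate classification and regression branches; the widths, depths, and output encoding are assumptions and do not reproduce the exact YOLOv8 head.

```python
import torch
import torch.nn as nn

class SimpleDecoupledHead(nn.Module):
    """Toy decoupled head: separate conv branches for class scores and box regression."""
    def __init__(self, in_ch: int = 256, num_classes: int = 10, reg_ch: int = 4):
        super().__init__()
        # Classification branch predicts per-location class scores.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, num_classes, 1),
        )
        # Regression branch predicts box offsets (e.g., left, top, right, bottom).
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_ch, reg_ch, 1),
        )

    def forward(self, feat: torch.Tensor):
        return self.cls_branch(feat), self.reg_branch(feat)

x = torch.randn(1, 256, 20, 20)              # one FPN level
cls_out, reg_out = SimpleDecoupledHead()(x)
print(cls_out.shape, reg_out.shape)          # (1, 10, 20, 20) (1, 4, 20, 20)
```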
Figure 4. The structure diagram of the DAMP-YOLO network, mainly consisting of a Conv module, DCB module, ATA module, SPPF module, and a Detect module.
Figure 5. Schematic diagram of 3 × 3 deformable convolution structure.
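As a hedged, minimal sketch of the 3 × 3 deformable convolution in Figure 5, the example below uses torchvision.ops.DeformConv2d; the offsets are predicted by an ordinary convolution, and the channel sizes are illustrative assumptions rather than the paper's configuration.

```python
import torch
import torch.nn as nn
from torchvision.ops import DeformConv2d

class DeformableConvBlock(nn.Module):
    """3x3 deformable convolution whose sampling offsets are predicted from the input."""
    def __init__(self, in_ch: int, out_ch: int, k: int = 3):
        super().__init__()
        # Two offsets (dx, dy) per kernel position -> 2 * k * k offset channels.
        self.offset_conv = nn.Conv2d(in_ch, 2 * k * k, k, padding=k // 2)
        self.deform_conv = DeformConv2d(in_ch, out_ch, k, padding=k // 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        offsets = self.offset_conv(x)          # (B, 2*k*k, H, W)
        return self.deform_conv(x, offsets)    # sampling grid shifts per location

x = torch.randn(1, 64, 32, 32)
y = DeformableConvBlock(64, 128)(x)
print(y.shape)  # torch.Size([1, 128, 32, 32])
```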
Figure 6. Architecture of deformable CSP bottleneck (DCB) module, which improves the feature extraction ability.
Figure 7. The structure of the triplet attention module.
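Figure 7 follows the published triplet attention design [42]: each branch rotates the tensor, applies Z-pool (channel-wise max and mean), a 7 × 7 convolution, and a sigmoid gate, and the three branch outputs are averaged. The sketch below is a minimal PyTorch rendition of that published design; the aggregated variant in Figure 8 is the authors' modification and is not reproduced here.

```python
import torch
import torch.nn as nn

class ZPool(nn.Module):
    """Concatenate max- and mean-pooled features along the first (channel) dimension."""
    def forward(self, x):
        return torch.cat([x.max(dim=1, keepdim=True)[0],
                          x.mean(dim=1, keepdim=True)], dim=1)

class AttentionGate(nn.Module):
    """Z-pool -> 7x7 conv + BN -> sigmoid, producing a 2D attention map."""
    def __init__(self, k: int = 7):
        super().__init__()
        self.pool = ZPool()
        self.conv = nn.Sequential(nn.Conv2d(2, 1, k, padding=k // 2, bias=False),
                                  nn.BatchNorm2d(1))

    def forward(self, x):
        return x * torch.sigmoid(self.conv(self.pool(x)))

class TripletAttention(nn.Module):
    """Three rotated branches capture (C,W), (C,H), and (H,W) interactions, then average."""
    def __init__(self):
        super().__init__()
        self.cw, self.ch, self.hw = AttentionGate(), AttentionGate(), AttentionGate()

    def forward(self, x):
        x_cw = self.cw(x.permute(0, 2, 1, 3)).permute(0, 2, 1, 3)  # swap C and H
        x_ch = self.ch(x.permute(0, 3, 2, 1)).permute(0, 3, 2, 1)  # swap C and W
        x_hw = self.hw(x)                                          # plain spatial branch
        return (x_cw + x_ch + x_hw) / 3.0

y = TripletAttention()(torch.randn(2, 64, 40, 40))
print(y.shape)  # torch.Size([2, 64, 40, 40])
```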
Figure 8. The structure of the aggregated triplet attention module.
Figure 9. Example images of meter data augmentation: (a) the Raw image is the unprocessed original image; (b) the Channel Multiply image multiplies each channel of the image by a random factor to adjust brightness and darkness; (c) the Gaussian Blur image smooths the image to reduce noise and fine detail; (d) the Gaussian Noise image adds noise to the image so that the model learns to cope with noise interference; (e) the Alpha Grid image adds a translucent black-and-white mask to the image to simulate realistic shadow occlusion; (f) the Gray Scale image converts the three-channel image into a single-channel grayscale image to further eliminate color interference.
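For readers who wish to reproduce similar augmentations, the sketch below implements the five operations of Figure 9 with NumPy and OpenCV on a three-channel BGR image; all parameter ranges (multiplication factors, kernel size, noise sigma, grid cell size, mask alpha) are illustrative assumptions, not the values used in the paper.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def channel_multiply(img, low=0.7, high=1.3):
    """Scale each colour channel independently to vary brightness and colour balance."""
    factors = rng.uniform(low, high, size=(1, 1, 3))
    return np.clip(img.astype(np.float32) * factors, 0, 255).astype(np.uint8)

def gaussian_blur(img, k=5):
    """Smooth the image to suppress noise and fine detail."""
    return cv2.GaussianBlur(img, (k, k), 0)

def gaussian_noise(img, sigma=10.0):
    """Add pixel-wise Gaussian noise so the model learns to tolerate sensor noise."""
    noise = rng.normal(0.0, sigma, img.shape)
    return np.clip(img.astype(np.float32) + noise, 0, 255).astype(np.uint8)

def alpha_grid(img, cell=32, alpha=0.5):
    """Blend a checkerboard mask over the image to mimic shadow occlusion."""
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    mask = (((yy // cell) + (xx // cell)) % 2).astype(np.float32)[..., None]
    return np.clip(img * (1 - alpha * mask), 0, 255).astype(np.uint8)

def gray_scale(img):
    """Collapse to grayscale (replicated to three channels) to remove colour cues."""
    g = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    return cv2.cvtColor(g, cv2.COLOR_GRAY2BGR)
```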
Figure 10. Schematic diagram of the model pruning flow, mainly including Sparse Training, Prune Channel, and Finetune.
Figure 11. The schematic diagram of channel pruning; channels whose scaling factors fall below the set threshold are pruned.
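A minimal sketch of the channel selection step in Figure 11, in the spirit of network slimming [43]: the batch normalization scaling factors are collected globally, a threshold is derived from the target prune ratio, and channels whose factors fall below it are marked for removal. The sparse-training penalty is indicated only as a loss term, and the surgery that actually removes channels from the convolution weights is omitted; the prune ratio is an illustrative assumption.

```python
import torch
import torch.nn as nn

def bn_prune_masks(model: nn.Module, prune_ratio: float = 0.3):
    """Return {layer_name: keep_mask} using a global threshold on |gamma| of every BN layer."""
    gammas = torch.cat([m.weight.data.abs().flatten()
                        for m in model.modules() if isinstance(m, nn.BatchNorm2d)])
    threshold = torch.quantile(gammas, prune_ratio)        # global cut-off value

    masks = {}
    for name, m in model.named_modules():
        if isinstance(m, nn.BatchNorm2d):
            keep = m.weight.data.abs() > threshold         # channels to retain
            if not keep.any():                             # never empty a layer completely
                keep[m.weight.data.abs().argmax()] = True
            masks[name] = keep
    return masks

# Sparse training adds an L1 penalty on the BN scaling factors to the detection loss,
# pushing unimportant channels toward zero before pruning.
model = nn.Sequential(nn.Conv2d(3, 16, 3), nn.BatchNorm2d(16), nn.SiLU(),
                      nn.Conv2d(16, 32, 3), nn.BatchNorm2d(32), nn.SiLU())
l1_penalty = sum(m.weight.abs().sum() for m in model.modules()
                 if isinstance(m, nn.BatchNorm2d))          # add lambda * l1_penalty to the loss
masks = bn_prune_masks(model, prune_ratio=0.3)
print({name: int(mask.sum()) for name, mask in masks.items()})
```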
Figure 12. The entire body of the Jetson TX1 intelligent car, which uses our lightweight network.
Figure 13. The pipeline of the Jetson TX1 intelligent car for meter reading recognition.
Figure 14. The flowchart of the Jetson TX1 intelligent car for meter reading recognition.
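The sketch below is a hedged illustration of the two-stage pipeline suggested by Figures 13 and 14: detect the reading area, crop it, detect the digits inside the crop, and order them from left to right to form the reading. The weight file names, the confidence threshold, and the use of the Ultralytics Python API here are assumptions for illustration, not the exact on-board implementation.

```python
import cv2
from ultralytics import YOLO

area_model = YOLO("meter_area.pt")    # stage 1: reading-area detector (assumed weights)
digit_model = YOLO("meter_digit.pt")  # stage 2: digit detector (assumed weights)

def read_meter(frame, conf=0.5):
    """Return one reading string per detected meter reading area in the frame."""
    areas = area_model(frame, conf=conf)[0].boxes
    readings = []
    for box in areas.xyxy.cpu().numpy().astype(int):
        x1, y1, x2, y2 = box[:4]
        crop = frame[y1:y2, x1:x2]
        digits = digit_model(crop, conf=conf)[0].boxes
        # Sort detected digits by their left x-coordinate and join their class names.
        order = digits.xyxy[:, 0].argsort()
        labels = [digit_model.names[int(c)] for c in digits.cls[order]]
        readings.append("".join(labels))
    return readings

frame = cv2.imread("meter.jpg")
print(read_meter(frame))
```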
Figure 15. Some of the images in the floodX dataset.
Figure 16. Some of the images in the PASCAL VOC dataset.
Figure 17. Comparison of the mean average precision curves for the meter reading recognition dataset; at similar parameter counts, the curves of the models proposed in this paper lie consistently above those of the other models.
Figure 18. Comparison of the mean average precision curves for the VOC dataset; the model proposed in this paper outperforms the other lightweight models.
Figure 19. The inference results for the Jetson TX1 intelligent car for meter reading recognition.
Figure 20. The inference results of different models on the Jetson TX1 intelligent car: (a) the results of YOLOv8n and (b) the results of DAMP-YOLO, showing the stability and robustness of the proposed model under light interference.
Table 1. A comparison of different attention mechanisms. The optimal result is shown in bold font. Adding the triplet attention module to the YOLOv8n model provides the best performance.
Model | mAP@50/% | mAP@50:95/% | Parameters/M
YOLOv8n | 62.02 | 41.03 | 3.15
+SE | 62.29 | 41.35 | 3.16
+ECA | 62.14 | 41.12 | 3.15
+CA | 61.95 | 40.67 | 3.17
+FCA | 61.89 | 40.61 | 3.16
+CBAM | 61.65 | 40.48 | 3.16
+TripletAttention | 62.92 | 41.87 | 3.15
Table 2. Comparison of digit recognition performance of lightweight networks for the meter reading recognition dataset. DA-YOLO denotes that DCB and ATA are integrated into YOLOv8n; DAM-YOLO means that DA-YOLO adds MDA, and DAMP-YOLO indicates that DCB, ATA, MDA, and network pruning are combined into YOLOv8n. The experimental results demonstrate that our model is better than other lightweight models regarding precision, recall, and mean average precision.
Model | Precision/% | Recall/% | mAP@50/% | mAP@50:95/% | Parameters/M
YOLOv8s | 99.44 | 98.84 | 99.14 | 88.61 | 11.16
SSD300 | - | - | 98.38 | 82.44 | 25.35
Retinanet | 96.02 | 96.01 | 98.80 | 84.10 | 20.02
TOOD | - | - | 98.70 | 84.61 | 32.05
MobileNetV2-SSDLite | - | - | 98.52 | 83.65 | 3.21
YOLOv8n | 99.50 | 98.81 | 99.14 | 87.92 | 3.15
YOLOv8n-MobileNetV3 | 99.28 | 98.14 | 99.05 | 87.58 | 2.25
YOLOv8n-ShuffleNetV2 | 98.85 | 98.15 | 99.12 | 87.31 | 1.86
YOLOv8n-RepVGG | 99.17 | 98.81 | 99.14 | 87.74 | 2.89
YOLOv5n | 99.46 | 98.56 | 99.02 | 85.57 | 1.79
YOLOv6n | - | - | 98.93 | 85.34 | 4.70
YOLOv7-tiny | 99.53 | 98.76 | 99.00 | 84.95 | 6.07
DA-YOLO | 99.57 | 98.82 | 99.15 | 88.60 | 3.46
DAM-YOLO | 99.58 | 98.86 | 99.12 | 88.77 | 3.46
DAMP-YOLO | 99.44 | 98.41 | 99.10 | 88.82 | 2.40
Table 3. Comparison of the object detection performance of lightweight networks on the VOC 2007 dataset. Among models with similar parameter counts, the model proposed in this paper performs best and approaches the performance of the model with the most parameters.
Model | Precision/% | Recall/% | mAP@50/% | mAP@50:95/% | Parameters/M
YOLOv8s | 72.22 | 61.23 | 67.86 | 45.76 | 11.16
SSD300 | - | - | 71.01 | 41.35 | 26.29
Retinanet | 49.64 | 56.23 | 60.67 | 33.38 | 20.17
TOOD | - | - | 69.93 | 44.54 | 32.05
MobileNetV2-SSDLite | - | - | 59.89 | 34.89 | 3.32
YOLOv8n | 64.83 | 58.02 | 62.02 | 41.03 | 3.15
YOLOv8n-MobileNetV3 | 57.37 | 46.63 | 49.49 | 31.36 | 2.25
YOLOv8n-ShuffleNetV2 | 59.98 | 48.53 | 52.53 | 32.50 | 1.86
YOLOv8n-RepVGG | 65.59 | 53.06 | 58.64 | 36.89 | 2.89
YOLOv5n | 55.21 | 49.65 | 50.34 | 25.74 | 1.79
YOLOv6n | - | - | 62.11 | 39.36 | 4.70
YOLOv7-tiny | 57.17 | 55.76 | 55.66 | 30.74 | 6.07
DA-YOLO | 69.79 | 59.88 | 66.11 | 45.64 | 3.46
DAP-YOLO | 70.09 | 56.80 | 63.88 | 43.57 | 2.42
Table 4. Comparison of the area detection performance of lightweight networks for the meter reading area detection dataset.
Model | mAP@50:95/% | Parameters/M
YOLOv8s | 96.01 | 11.16
YOLOv8n | 96.47 | 3.15
YOLOv8n-MobileNetV3 | 92.76 | 2.25
YOLOv8n-ShuffleNetV2 | 93.27 | 1.86
YOLOv8n-RepVGG | 95.38 | 2.89
DAM-YOLO | 97.95 | 3.46
DAMP-YOLO | 95.72 | 1.94
Table 5. The ablation experiments for the meter reading recognition dataset. DCB stands for the deformable CSP bottleneck; ATA represents the aggregated triplet attention module; MDA represents meter data augmentation; Prune denotes the channel pruning step of network pruning.
Model | DCB | ATA | MDA | Prune | mAP@50:95 | Parameters/M | Inference/ms (Jetson TX1)
YOLOv8n | | | | | 87.92 | 3.15 | 76.0
D-YOLOv8n | ✓ | | | | 88.36 | 3.46 | 126.2
A-YOLOv8n | | ✓ | | | 88.21 | 3.15 | 84.4
M-YOLOv8n | | | ✓ | | 88.46 | 3.15 | 76.0
DA-YOLOv8n | ✓ | ✓ | | | 88.60 | 3.46 | 136.5
DAM-YOLOv8n | ✓ | ✓ | ✓ | | 88.77 | 3.46 | 136.5
DAMP-YOLOv8n | ✓ | ✓ | ✓ | ✓ | 88.82 | 2.40 | 129.6
Table 6. Comparison of the meter data augmentation results for the meter reading recognition dataset.
Model | mAP@50:95/% | Parameters/M
YOLOv8n | 87.92 | 3.15
YOLOv8n+MDA | 88.46 | 3.15
YOLOv8n-MobileNetV3 | 87.58 | 2.25
YOLOv8n-MobileNetV3+MDA | 87.64 | 2.25
DA-YOLO | 88.60 | 3.46
DA-YOLO+MDA | 88.77 | 3.46
DAP-YOLO | 88.34 | 2.40
DAP-YOLO+MDA | 88.82 | 2.40
Table 7. Comparison of inference speed for the meter reading recognition dataset.
Model | mAP@50:95/% | Parameters/M | Inference/ms (MX350) | Inference/ms (Jetson TX1)
YOLOv8s | 88.61 | 11.16 | 44.2 | 184.5
SSD300 | 82.44 | 25.35 | 86.2 | -
Retinanet | 84.10 | 20.02 | 267.4 | -
TOOD | 84.61 | 32.05 | 414.3 | -
MobileNetV2-SSDLite | 83.65 | 3.21 | 25.7 | -
YOLOv8n | 87.92 | 3.15 | 23.5 | 76.0
YOLOv8n-MobileNetV3 | 87.58 | 2.25 | 23.7 | 83.9
YOLOv8n-ShuffleNetV2 | 87.31 | 1.86 | 18.6 | 62.1
YOLOv8n-RepVGG | 87.74 | 2.89 | 25.1 | 78.1
DAM-YOLO | 88.77 | 3.46 | 40.2 | 136.5
DAMP-YOLO | 88.82 | 2.40 | 37.3 | 129.6