Article

An Improved YOLOX Model and Domain Transfer Strategy for Nighttime Pedestrian and Vehicle Detection

1 College of Automotive and Mechanical Engineering, Changsha University of Science and Technology, Changsha 410114, China
2 Changsha Intelligent Driving Institute Ltd., Changsha 410208, China
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(23), 12476; https://doi.org/10.3390/app122312476
Submission received: 14 November 2022 / Revised: 29 November 2022 / Accepted: 30 November 2022 / Published: 6 December 2022

Abstract

To address the vehicle/pedestrian visual sensing task under low-light conditions and the problems of small, dense objects and line-of-sight occlusion, a nighttime vehicle/pedestrian detection method is proposed. First, a vehicle/pedestrian detection algorithm was designed based on You Only Look Once X (YOLOX). The model structure was re-parameterized and lightened, and a coordinate-based attention mechanism was introduced into the backbone network to enhance the feature extraction efficiency for vehicle/pedestrian targets. A feature-scale fusion detection branch was added to the feature pyramid, and a loss function was designed that combines Complete Intersection Over Union (CIoU) loss for target localization and Varifocal Loss for confidence prediction to improve the feature extraction ability for small, dense, and low-illumination targets. In addition, to further improve the detection accuracy of the algorithm under low-light conditions, a training strategy based on data domain transfer was proposed, which fuses the larger-scale daytime dataset, after low-illumination degrading, with the smaller-scale nighttime dataset; training and testing were then performed after unified low-light enhancement. The experimental results show that, compared with the original YOLOX model, the improved algorithm trained with the proposed data domain transfer strategy achieved better performance, with the mean Average Precision (mAP) increasing by 5.9% to 82.4%. This research provides effective technical support for autonomous driving safety at night.

1. Introduction

Traffic accidents are more likely to occur at night due to drivers' poor vision and eye fatigue. Statistics show that the probability of an accident at night is 1–1.5 times higher than during the day, and the traffic death rate per kilometer at night is approximately three times that of the daytime [1]. Additionally, at night there is often not enough light to clearly discern the color and detail of the vehicles and pedestrians ahead. The safety of autonomous driving systems and advanced driver-assistance systems is adversely impacted by the inability of current vision-based detection algorithms to reliably identify targets under such conditions. Therefore, it is crucial to improve the accuracy of object detection algorithms in low-light conditions at night. To cope with this, researchers have used different approaches, such as radar [2], lidar [3], the Global Navigation Satellite System (GNSS) [4], etc. However, vision-based solutions have been the preferred choice of many researchers due to the ease of deployment and affordability of visual sensors.
Object detection is one of the most widely studied problems in computer vision, and many effective algorithms exist for this task. However, there are still two difficulties in vehicle/pedestrian detection that current techniques cannot fully solve. First, performing vision-related tasks under low-light conditions in natural environments is challenging because of short exposure times, images lacking the features necessary for target detection, and the fact that directly enhancing image brightness may generate noise that interferes with vision tasks. The second is the detection of small vehicle/pedestrian targets and multi-scale variations of vehicle/pedestrian targets. The vehicle/pedestrian features extracted by the detector are more susceptible to noise interference from the environmental background when vehicles and pedestrians of different sizes appear in the scene at the same time, or when there are individuals of small size and low clarity among the targets. This causes missed and false detections and presents significant challenges to the accuracy of the detection results. Traditional vehicle/pedestrian detection algorithms recognize targets using manually crafted feature extraction methods [5,6], but in natural application environments, target features are complex and varied, which makes it difficult to abstract and generalize handcrafted features successfully. The current common approach for detecting vehicles and pedestrians uses deep-learning-based object detection algorithms, which can learn feature representations from data with improved detection accuracy and robustness. Deep-learning-based object detection algorithms fall into two categories: one-stage and two-stage detection methods. To achieve a trade-off between speed and accuracy, one-stage detection techniques such as the You Only Look Once (YOLO) family of algorithms [7,8] and the Single-Shot multibox Detector (SSD) [9] directly classify and regress the target. The detection accuracy of two-stage detection techniques, such as the Faster Region-based Convolutional Neural Network (R-CNN) [10], Cascade R-CNN [11], etc., is often greater than that of one-stage detectors, but their detection speed is significantly slower, so they cannot run online in real time. The authors of [12] improved the feature pyramid and the feature learning capabilities of the YOLOv3 model for practical applications. They also included an attention mechanism and optimized the loss function to further increase the accuracy of vehicle/pedestrian identification. By increasing the number of detection scales in YOLOv3 from 3 to 4 and establishing a feature fusion detection layer down-sampled by 4×, Moran Ju et al. [13] were able to extract more object attributes and improve small-target recognition. Yixing Zhu et al. [14] improved the Cascade R-CNN algorithm. In the first stage, the outline of the object is estimated using a Locally Sliding Line-based Point Regression (LocSLPR) method, defined by the intersection of sliding lines with the object's bounding box, to fully utilize the available information. The performance is then further enhanced in the second stage by gradually regressing the target object using a Rotated Cascade Region-based Convolutional Neural Network (RCR-CNN).
Although the performance of these detection algorithms has been improved, there are still problems of target misses and false detections in the face of small, dense targets, line-of-sight occlusions, and low-illumination conditions at night. In addition, although the detection accuracy of the two-stage detection algorithm is higher, it is difficult to meet the real-time requirements of intelligent driving systems in terms of detection speed, while the YOLOv3 [13] and YOLOv4 [15] models of the one-stage algorithm have larger files and lower detection accuracy, which are not suitable for the deployment of low-performance storage devices.
To address the above issues, a lightweight YOLOX [16] is chosen as the baseline model to adapt to the nighttime vehicle/pedestrian detection task by simultaneously improving the algorithm and training strategy.
The main contributions of this paper are as follows:
(1)
An improved algorithm based on YOLOX was proposed for small target pedestrian and vehicle detection at night. The main improvements include 1. reparameterization of the model structure using the Re-parameterization Visual Geometry Group (RepVGG) technique; 2. the introduction of a coordinate-based attention mechanism; 3. the addition of a new feature scale fusion branch; and 4. improvement of the loss function.
(2)
A domain transfer training strategy was proposed that allows the model to be trained more efficiently using daytime datasets. The large-scale daytime dataset is fused with the much smaller night dataset after low-illumination degrading. The models were then trained and tested separately after unified low-illumination enhancement to fully extract the data features of the existing daytime dataset and remedy the nighttime data deficiencies problem.
(3)
The proposed improved YOLOX and domain transfer training strategies were validated on a real-world dataset. The experimental results showed that the improved YOLOX algorithm produced fewer errors than the original algorithm, and it was more accurate for nighttime vehicle/pedestrian detection when combined with a domain transfer training strategy.

2. Related Work

2.1. Object Detection

2.1.1. Traditional Detection Methods

If we consider deep-learning-based object detection to be a form of technical artistry, then going back 20 years would allow us to observe "the wisdom of the cold weapon era". The majority of the early object-detection algorithms were built on handcrafted features. By incorporating three crucial techniques ("integral image", "feature selection", and "detection cascades"), the VJ detector [17,18] significantly increased its detection speed. To reconcile feature invariance (to translation, scaling, illumination, etc.) with nonlinearity (for discriminating different object categories), N. Dalal and B. Triggs first developed the Histograms of Oriented Gradients (HOG) feature descriptor in 2005 [19]. P. Felzenszwalb [20] first proposed the Deformable Part Model (DPM) as an extension of the HOG detector in 2008, and R. Girshick [21,22,23] subsequently developed a number of modifications. The detection accuracy of modern object detectors has grown well beyond the DPM, yet many of them are still greatly influenced by its insightful findings, including mixture models, hard negative mining, bounding-box regression, etc.

2.1.2. Deep-Learning-Based Detection Methods

Due to the continuous development of computing hardware, deep-learning-based object detection has become the mainstream of machine vision. Existing object detection techniques can be broadly split into two groups: proposal-free and proposal-based. Proposal-free approaches view object detection as a bounding-box regression problem. For instance, YOLO directly predicts detection results by jointly classifying categories and regressing the positions of predetermined anchors. To handle objects of various scales, SSD integrates predictions computed from hierarchical feature maps with different receptive fields. A number of extensions [24,25,26] have been suggested for more effective proposal-free object detectors. Proposal-based algorithms first generate region proposals and then classify them to obtain the final detection results. For example, R-CNN [27] extracts dense region proposals using a hierarchical grouping method and then classifies these proposals to produce detection results. By adding a Region Of Interest (ROI) pooling layer to share features across proposals, Fast R-CNN [28] accelerates R-CNN. Faster R-CNN replaces the hierarchical grouping used in Fast R-CNN [28] with a more accurate and efficient Region Proposal Network (RPN). For improved detection accuracy, several extensions [29,30,31,32,33] propose more powerful proposal-based detectors. Most existing object detection methods require large amounts of annotated data to train the model, which usually takes time and effort to produce, making domain-adaptive detectors a popular choice for a large number of researchers.

2.1.3. Domain Adaption-Based Object Detection Method

As previously mentioned, the research presented in this paper is closely related to the fields of knowledge transfer and domain-adaptive object detection, both of which focus on learning a high-performing detector in an unlabeled target domain without accessing any annotation of the target domain. A sizable number of Domain Adaptive (DA) detectors have been presented. For example, to reduce domain bias through adversarial learning at both the image and instance levels, DA [34] introduces a domain-adaptive detector based on Faster R-CNN. By including object relations in teacher–student consistency regularization, Mean Teacher with Object Relations (MTOR) [35] reduces domain discrepancy. By strongly aligning local similar features and weakly aligning global dissimilar features, Strong-Weak Distribution Alignment (SWDA) [36] offers a powerful cross-domain detection approach. At both the image and instance levels, the Image-Instance Full Alignment Network (IFAN) [37] aligns feature distributions in a coarse-to-fine fashion. To reduce the possibility of collapse caused by parameter sharing between the source and target domains, Asymmetric Tri-way Faster-RCNN (ATF) [38] introduces a tri-stream Faster R-CNN. The object detection classifier and the RPN are trained through collaborative self-training in Collaborative Training between Regions (CTR) [39]. To reduce the domain discrepancy, Graph-induced Prototype Alignment (GPA) [40] aligns graph-induced prototype representations in two steps. DA and SWDA can be further improved thanks to the effective categorical regularization module proposed by Categorical Regularization for Domain Adaptive detection (CRDA) [41].

2.2. Low-Light Detection

2.2.1. Low Illumination Datasets

For low-light object detection tasks, several datasets have been proposed. The NightOwls dataset was proposed by the authors of [42] for the detection of pedestrians at night. To account for different unfavorable conditions such as rain, snow, haze, and low illumination, the authors of [43] gathered an Unconstrained Face Detection Dataset (UFDD). Recently, the UG2+ challenge [44] has included several tracks for vision tasks in various low-visibility environments. Among them, the DARK FACE dataset contains 10,000 images, including 6000 labeled and 4000 unlabeled ones. An exclusively dark (ExDark) dataset with 7363 images and 12 object classes was proposed by the authors of [45] for a multi-class dark object detection task. The BDD100K [46] dataset, released by the Berkeley AI Lab, is the largest and most varied open-source video dataset in computer vision to date. The dataset consists of 100,000 videos with an average length of 40 s at 720p and 30 frames per second, adding up to over 1100 h. The videos were collected from a variety of American locations. The dataset covers a variety of weather conditions, such as clear, cloudy, and rainy days, at various times of the day and night.

2.2.2. Low-Illumination Enhancement and Restoration Methods

When faced with a low-light image, the first idea that comes to mind is illumination enhancement and restoration of the low-light image to support subsequent tasks. Low-light vision methods recover image detail and correct color shifts based on the human visual experience. Early attempts relied on Retinex-theory approaches [47,48,49] or histogram equalization (HE)-based approaches [50,51]. Nowadays, Convolutional Neural Network (CNN)-based and Generative Adversarial Network (GAN)-based methods have significantly advanced this task thanks to the development of deep learning. The authors of [52] combined the Retinex theory with a deep network for low-light image enhancement. The authors of [53] used an unsupervised GAN to solve this problem. A self-supervised learning strategy for images with abnormal illumination was recently proposed by the authors of [54]. Existing low-light image enhancement methods mainly focus on low contrast to increase visibility, while the high noise is usually addressed with a post-processing module. To solve this problem, the authors of [55] suggested a novel technique for improving low-light images based on simultaneous illumination and noise adjustment with unpaired data. A Structure- and Texture-Aware Network (STAN) for low-light image enhancement was proposed by the authors of [56], based on the observation that the representations of structure and texture are highly separated in the frequency spectrum; STAN consists of a structure sub-network and a texture sub-network. The methods mentioned above work well for low-light enhancement, but feeding the enhanced images directly into existing object detection networks usually does not yield the best performance.

2.2.3. Different Methods Applied to Low-Illuminated Detection

The authors of [57] proposed a domain adaptation method for object detection in low-light situations. Pretrained models from different domains are merged in this method using glue layers and a generative model, which feeds latent features to the glue layers so they can be trained without an additional dataset. The authors of [58] suggested an active object detection approach and a brightness-control strategy based on reinforcement learning. Without having to retrain the detector, low-quality images can be enhanced into high-quality images with the aid of the pretrained models, and the overall performance is increased. In [59], a hyperbolic tangent curve is first used to map the image brightness to the desired level; then, unsharp filtering in the YCbCr color space and a block-matching and 3D filtering algorithm are applied; finally, the nighttime surveillance task is completed with pedestrian detection using a convolutional neural network model. With only modest processing resources and no supervised training, the authors of [60] described a Flash-No-Flash (FNF) controlled-illumination acquisition methodology that enables reliable object detection. The technique depends on the simultaneous acquisition of two images, one with and one without strong artificial illumination (flash/no flash). The authors of [61] provided a dataset of unprocessed short-exposure low-light photos together with corresponding long-exposure reference images to aid in the development of learning-based pipelines for low-light image processing. Using the presented dataset, they created a pipeline for processing low-light images based on end-to-end training of a fully convolutional network that works directly on the raw sensor input. However, the performance of these existing methods in truly dark environments is still not very satisfactory.

3. Improved YOLOX Detection Algorithm

3.1. YOLOX Model

YOLOX is a new generation of the YOLO series following YOLOv5 [62]; it builds on YOLOv3 and YOLOv5 and further improves detection performance. YOLOX is mainly divided into three parts: a backbone network (Backbone), a feature pyramid (Neck), and a prediction module (Head). The backbone network adopts Cross Stage Partial (CSP) Darknet, the backbone feature extraction network of YOLOX, which produces three effective feature layers from the input image and primarily consists of Focus, Cross Stage Partial Network (CSPnet) [63], and Spatial Pyramid Pooling (SPP) components. The feature pyramid is constructed by CSP-PAFPN (PAFPN, Path Aggregation Feature Pyramid Network), which makes full use of the three effective feature layers obtained from the backbone network and improves model performance through bottom-up and top-down feature fusion. The prediction module upgrades the detection head to a Decoupled Head, which improves the training convergence speed and detection accuracy of the model, and adopts an anchor-free approach, which removes the preset prior anchor boxes and predicts the edges of the target directly to reduce computational redundancy.
Although the original YOLOX model achieves better performance than other methods for the detection task in this paper, there are still many missed and false detections for small targets, occlusions, and low-illumination nighttime environments. Therefore, the model structure and loss function of the algorithm are improved in this paper to increase the detection accuracy for nighttime vehicle/pedestrian targets. A schematic diagram of the improved model structure is shown in Figure 1.

3.2. Structural Re-Parameterization and Light-Weighting of Model

3.2.1. Structural Re-Parameterization

In general, the more complex the structure and the more parameters a deep convolutional neural network has, the more expressive the features it can extract, which helps achieve better results in the training phase. However, more complex models can adversely affect inference speed. To trade off detection accuracy and speed, the training and inference structures of the YOLOX model are decoupled, drawing on the RepVGG re-parameterization idea [64]. The specific method is shown in the structural re-parameterization module in Figure 1: the original 3 × 3 convolutions in the YOLOX backbone network are replaced with the RepVGG module; a multi-branch structure (3 × 3 convolution + 1 × 1 convolution + identity branch) is used in the training phase to improve detection performance; in the inference phase, re-parameterization is performed and a single 3 × 3 convolution structure is used to accelerate inference, which is consistent with the original YOLOX backbone network structure. By improving the model structure with re-parameterization, the performance advantage of a complex model in the training phase and the speed advantage of a single-branch structure in the inference phase are both obtained.
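As a concrete illustration of this branch fusion, the following is a minimal PyTorch sketch of how the three training-time branches (3 × 3 convolution, 1 × 1 convolution, and identity, each followed by BatchNorm) can be folded into the weight and bias of a single 3 × 3 convolution under the standard RepVGG fusion rules (groups = 1, stride 1). The function names are illustrative and this is not the authors' implementation.

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv_weight, bn):
    """Fold a BatchNorm2d layer into the preceding convolution's weight and bias."""
    std = (bn.running_var + bn.eps).sqrt()
    scale = bn.weight / std                          # per-channel scale
    weight = conv_weight * scale.reshape(-1, 1, 1, 1)
    bias = bn.bias - bn.running_mean * scale
    return weight, bias

def reparameterize_repvgg_block(conv3x3, bn3, conv1x1, bn1, bn_id=None):
    """Merge the 3x3, 1x1, and (optional) identity branches, each followed by
    BatchNorm, into the weight and bias of a single equivalent 3x3 convolution."""
    w3, b3 = fuse_conv_bn(conv3x3.weight, bn3)
    w1, b1 = fuse_conv_bn(conv1x1.weight, bn1)
    w1 = nn.functional.pad(w1, [1, 1, 1, 1])         # place the 1x1 kernel at the 3x3 center
    weight, bias = w3 + w1, b3 + b1
    if bn_id is not None:                            # identity branch exists only when in_ch == out_ch, stride 1
        out_ch, in_ch = w3.shape[0], w3.shape[1]
        w_id = torch.zeros_like(w3)
        for c in range(out_ch):
            w_id[c, c % in_ch, 1, 1] = 1.0           # identity expressed as a 3x3 kernel
        w_id, b_id = fuse_conv_bn(w_id, bn_id)
        weight, bias = weight + w_id, bias + b_id
    return weight, bias                              # load these into a plain 3x3 nn.Conv2d for inference
```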

3.2.2. Lightweight Design of the Model

Compared with large models, small models require less storage space, which saves resources and enables deployment on low-performance storage devices. In view of this, this paper optimizes the number of Neck channels and the Head structure of the YOLOX algorithm to reduce the number of model parameters. First, the number of Neck channels is reduced so that the output dimension of all components (including Conv and CSP) in the PAFPN is changed to 128, which matches the input dimension of the decoupled head in YOLOX (128); subsequently, the redundant 1 × 1 convolution in the Head is removed (this convolution originally scaled the dimensionality of the feature maps output by the PAFPN to a uniform 128); finally, the depth of the CSP components in the PAFPN is increased (from the default 1 to 3) to compensate for the reduced output dimensionality and avoid degrading the detection performance of the model. These structural improvements reduce the size of the YOLOX model to 13.5 MB, which is smaller than the YOLOv5 model (13.7 MB).

3.3. Introduction of Attention Mechanism

To obtain richer semantic information from vehicle/pedestrian target features, an attention mechanism that fuses coordinate information, Coordinate Attention (CA) [65], is introduced into the YOLOX backbone network and added after the CSP component; the specific structure is shown in the coordinate attention module in Figure 1. The CA module decomposes channel attention into two parallel 1D feature encoding operations, which effectively fuse spatial coordinate information into the generated attention maps, enhance the utilization of cross-channel, direction-aware, and position-aware information, and help the model locate and identify targets more accurately.
Based on the YOLOX model, the CA module aggregates the input features along the vertical and horizontal directions into two independent direction-aware feature maps using pooling kernels of size (H, 1) and (1, W). The two feature maps embedded with direction-specific information are then encoded into two attention maps by cascading, convolving, slicing, and re-convolving. In this way, long-range dependencies can be captured along one spatial direction while precise positional information is preserved along the other. Finally, the two attention maps are multiplied onto the input feature map to enhance its representation. The final output of the CA module can be expressed as shown in Equation (1).
y_c(i, j) = x_c(i, j) \times g_c^h(i) \times g_c^w(j)
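To make the operation concrete, the following is a minimal PyTorch sketch of a coordinate attention block in the spirit of [65]: the input is pooled along height and width, the two descriptors are jointly encoded by a shared 1 × 1 convolution, split, and turned into the direction-aware attention maps g^h and g^w of Equation (1). The reduction ratio, activation, and layer sizes are assumptions for illustration, not the exact module used in the paper.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Sketch of a coordinate attention block: two 1D pooled descriptors are
    encoded jointly, split, and turned into direction-aware attention maps."""
    def __init__(self, channels, reduction=32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x):
        n, c, h, w = x.shape
        x_h = x.mean(dim=3, keepdim=True)                       # pool along width  -> (n, c, h, 1)
        x_w = x.mean(dim=2, keepdim=True).permute(0, 1, 3, 2)   # pool along height -> (n, c, w, 1)
        y = torch.cat([x_h, x_w], dim=2)                        # cascade
        y = self.act(self.bn1(self.conv1(y)))                   # shared 1x1 encoding
        y_h, y_w = torch.split(y, [h, w], dim=2)                # slice
        g_h = torch.sigmoid(self.conv_h(y_h))                   # (n, c, h, 1)
        g_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))  # (n, c, 1, w)
        return x * g_h * g_w                                    # Equation (1)
```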

3.4. Feature Pyramid Improvement

The label-scale distribution of the nighttime vehicle/pedestrian dataset is visualized as a scatter plot in Figure 2. The normalized width and height of the targets are mainly concentrated below 0.07, i.e., the width of the detected targets is mostly less than 89.6 pixels (1280 × 0.07) and the height is mostly less than 50.4 pixels (720 × 0.07). It is obvious that most of the detected targets are comparatively small. Therefore, it is necessary to ensure the detection accuracy of the algorithm for small, dense targets.
In the feature extraction and fusion process of the YOLOX algorithm, the input image is down-sampled five times in the backbone network, generating feature maps of size 320 × 320, 160 × 160, 80 × 80, 40 × 40, and 20 × 20 in sequence, and the last three feature maps are fed into the PAFPN for feature fusion. Large-scale feature maps pay more attention to detailed information and retain a large amount of small-target information, while small-scale feature maps pay more attention to semantic information. Although small-scale feature maps are better for understanding complex objects, the loss of information on small targets may be more serious owing to their lower resolution. Therefore, the model was improved by adding a new branch with a feature scale of 160 × 160 (where the depth of the CSP component is 1) to the PAFPN, and feature fusion is performed on four different scales of feature maps to obtain more shallow image feature information while retaining the deep image semantic information. The specific improved PAFPN network structure is shown in the fusion detection branch module in the Neck part of Figure 1.
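The following is a minimal PyTorch sketch of what such an additional 160 × 160 fusion step could look like for a 640 × 640 input: the 80 × 80 PAFPN output (stride 8) is upsampled and fused with the stride-4 backbone feature before feeding an extra detection head. The channel widths and the simple convolutional block standing in for the depth-1 CSP component are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class ExtraFusionBranch(nn.Module):
    """Sketch of the additional 160x160 fusion branch."""
    def __init__(self, c_backbone=128, c_pafpn=128, c_out=128):
        super().__init__()
        self.lateral = nn.Conv2d(c_backbone, c_out, kernel_size=1)
        self.upsample = nn.Upsample(scale_factor=2, mode="nearest")
        self.fuse = nn.Sequential(                       # stand-in for a CSP block of depth 1
            nn.Conv2d(c_out + c_pafpn, c_out, kernel_size=3, padding=1),
            nn.BatchNorm2d(c_out),
            nn.SiLU(),
        )

    def forward(self, feat_stride4, feat_stride8):
        # feat_stride4: (B, c_backbone, 160, 160); feat_stride8: (B, c_pafpn, 80, 80)
        fused = torch.cat([self.lateral(feat_stride4), self.upsample(feat_stride8)], dim=1)
        return self.fuse(fused)                          # (B, c_out, 160, 160), fed to an extra head
```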

3.5. Loss Function Design

3.5.1. Confidence Loss Function Design

In nighttime vehicle and pedestrian detection based on the YOLOX model, the anchor-free mechanism and the large foreground-to-background pixel ratio in the experimental data leave the positive and negative samples in the prediction stage in an extremely unbalanced state, which makes it difficult to train the model fully. In addition, the targets in the experimental dataset are concentrated in certain locations and most of them are occluded, which easily leads to missed detections.
To address these two problems, Varifocal Loss [66] was introduced as the confidence prediction loss function. Varifocal Loss is mainly used to train dense object detectors to regress an Intersection Over Union (IoU)-aware classification score, thereby improving detection accuracy. Target occlusion is one of the characteristics of dense targets, so this loss function can alleviate the missed detections caused by occlusion. The Varifocal loss function is also based on the weighting scheme of Focal Loss [24], which mitigates the positive/negative sample imbalance during training. The formula of the Varifocal loss function is shown in Equation (2).
L_{vf}(p, q) = \begin{cases} -q\,(q \log(p) + (1 - q) \log(1 - p)), & q > 0 \\ -\alpha p^{\gamma} \log(1 - p), & q = 0 \end{cases}
where p is the prediction confidence after the Sigmoid activation function; q is the IoU between the positive sample and its ground truth; α balances the positive and negative samples and is set to 0.25; and γ balances the easy and hard samples and is set to 1.5.
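For reference, the following is a minimal PyTorch sketch of Equation (2) written in the usual weighted binary cross-entropy form (positives are weighted by the IoU target q, negatives by α·p^γ). The function signature and the sum reduction are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def varifocal_loss(pred_logits, target_q, alpha=0.25, gamma=1.5):
    """Sketch of Equation (2). target_q holds the IoU with the matched ground
    truth for positive samples and 0 for negative samples."""
    p = torch.sigmoid(pred_logits)
    positive = (target_q > 0).float()
    # Positives are weighted by the IoU score q itself; negatives are
    # down-weighted by the focal factor alpha * p^gamma.
    weight = target_q * positive + alpha * p.pow(gamma) * (1.0 - positive)
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_q, reduction="none")
    return (weight * bce).sum()
```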

3.5.2. Bounding Box Regression Loss Function Design

The bounding-box regression loss in the YOLOX model is mainly calculated from the IoU: the larger the overlap area between the predicted box and the ground truth box, the larger the IoU value and the better the localization. However, IoU loss has the following drawbacks. When there is no overlap between the prediction box and the ground truth box, the IoU is 0 and does not reflect their relative distance; the gradient of the IoU loss is also 0 in this case, so it cannot be optimized. In addition, IoU is calculated from the overlap area only, so boxes with the same overlap area yield the same IoU value; the loss therefore cannot distinguish how the prediction box and the ground truth box are aligned, which affects the convergence speed.
To address these problems, CIoU [67] loss was introduced as the bounding-box loss function to improve the convergence speed and detection accuracy. The CIoU loss integrates the overlap area, center-point distance, and aspect ratio between the prediction box and the ground truth box, which directly minimizes the relative distance between the two boxes and accelerates convergence. This solves the problem that IoU loss does not accurately reflect how the two boxes overlap, making the regression more accurate and faster when the prediction box overlaps with or even contains the target box. The formula for calculating the CIoU loss is shown in Equations (3)–(5).
L_{ciou} = 1 - \frac{S_i}{S_u} + \frac{\rho^2(b, b^{gt})}{c^2} + \alpha v
\alpha = \frac{v}{\left(1 - \frac{S_i}{S_u}\right) + v}
v = \frac{4}{\pi^2}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^2
where ρ denotes the Euclidean distance between the two center points; b and b^gt denote the center points of the prediction box and the ground truth box, respectively; c denotes the diagonal length of the minimum enclosing box covering the prediction box and the ground truth box; S_i is the overlap area and S_u is the union area of the prediction box and the ground truth box; w and w^gt denote the widths of the prediction box and the ground truth box, respectively; and h and h^gt denote their heights, respectively.
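As a concrete reference, the following is a minimal PyTorch sketch of Equations (3)–(5) for axis-aligned boxes in (x1, y1, x2, y2) format. The epsilon terms and the detachment of α from the gradient are common implementation conventions assumed here, not taken from the paper.

```python
import math
import torch

def ciou_loss(pred, target, eps=1e-7):
    """Sketch of Equations (3)-(5). pred and target are (N, 4) tensors of (x1, y1, x2, y2) boxes."""
    # Intersection and union areas
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    union = area_p + area_t - inter + eps
    iou = inter / union
    # Squared center distance and squared diagonal of the minimum enclosing box
    c_pred = (pred[:, :2] + pred[:, 2:]) / 2
    c_targ = (target[:, :2] + target[:, 2:]) / 2
    rho2 = ((c_pred - c_targ) ** 2).sum(dim=1)
    enc_lt = torch.min(pred[:, :2], target[:, :2])
    enc_rb = torch.max(pred[:, 2:], target[:, 2:])
    c2 = ((enc_rb - enc_lt) ** 2).sum(dim=1) + eps
    # Aspect-ratio consistency term, Equation (5)
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) - torch.atan(w_p / (h_p + eps))) ** 2
    with torch.no_grad():
        alpha = v / (1 - iou + v + eps)
    return 1 - iou + rho2 / c2 + alpha * v    # Equation (3), per box
```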

3.5.3. Combined Loss Function

The final combined loss function consists of the bounding-box regression loss L_ciou, the confidence prediction loss L_vf, the category prediction loss L_cls, and the L_1 loss, and is calculated as shown in Equations (6)–(8).
L = \lambda L_{ciou} + L_{vf} + L_{cls} + L_{1}
L_{cls} = -\frac{1}{n}\sum\left(\hat{p} \log(\sigma(p)) + (1 - \hat{p}) \log(1 - \sigma(p))\right)
L_{1} = \frac{1}{n}\sum\left|P - T\right|
where λ is the loss weight, which is set to 5; n is the number of candidate boxes; p̂ is the true category probability; p is the predicted category probability; and σ(x) is the Sigmoid function that maps values into (0, 1). In addition, the L_1 loss is only used after Mosaic data augmentation is turned off, P is the predicted value, and T is the ground truth label value.

4. Training Strategy for Data Domain Transfer

To further improve the generalization capability of the improved YOLOX algorithm in the nighttime scenario, data augmentation was performed on the existing dataset: first, low-illumination enhancement is applied to the nighttime data to enhance the color characteristics of the targets; then, the existing larger-scale daytime vehicle/pedestrian images are used for data expansion to compensate for the small amount of nighttime vehicle/pedestrian detection data in the publicly available autonomous driving dataset, thus improving the generalization capability of the model.
However, because the color features of daytime and nighttime data differ greatly and the two are distributed in different data domains, the model easily deviates from the target data domain if mixed training is performed directly without domain transfer, and the resulting performance improvement is not significant. To reduce the color feature differences between the two data domains, a data domain transfer approach was used to process the nighttime and daytime datasets separately. First, the raw daytime vehicle–pedestrian dataset is low-illumination degraded to generate a fake-night vehicle–pedestrian dataset. Then, the fake-night dataset and the raw nighttime dataset are each low-illumination enhanced to generate fake-day vehicle–pedestrian datasets. Finally, the two generated fake-day datasets are mixed for training, achieving the purpose of data expansion. It is worth noting that, since the data domain was transformed in the training phase, the test set also requires low-illumination enhancement to shrink the differences in color features between the training and test sets. The whole training and testing process of the detection algorithm is shown in Figure 3.
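In pseudocode form, the strategy can be sketched as follows. This is a minimal Python illustration: the callables `degrade` and `enhance` stand for the low-illumination degrading transform of Section 4.2 and the Zero-DCE enhancement of Section 4.1, and the function names are assumptions rather than the authors' code.

```python
def build_domain_transferred_training_set(day_images, night_images, degrade, enhance):
    """Sketch of the domain transfer strategy in Figure 3: daytime images are
    degraded into fake-night images, and both the fake-night and the real
    nighttime images are then low-light enhanced into a mixed fake-day set."""
    fake_night = [degrade(img) for img in day_images]       # day -> fake night
    fake_day_from_day = [enhance(img) for img in fake_night]
    fake_day_from_night = [enhance(img) for img in night_images]
    return fake_day_from_day + fake_day_from_night          # mixed training data

def preprocess_test_image(img, enhance):
    # The test domain must match the training domain, so the same low-light
    # enhancement is applied to every test image before inference.
    return enhance(img)
```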

4.1. Low Light Enhancement

To enhance the color features of vehicle and pedestrian targets at night, this paper uses the Zero-Reference Deep Curve Estimation (Zero-DCE) algorithm to achieve low-light enhancement. The processing flow of the algorithm is shown in Figure 4.
The algorithm designs image-specific curves and applies them iteratively to approximate pixel-wise higher-order curves, producing images with improved brightness and contrast.
The curve equation for brightness enhancement is shown in Equation (9) below.
LE_{n}(x) = LE_{n-1}(x) + A_{n}(x)\,LE_{n-1}(x)\,(1 - LE_{n-1}(x))
When n = 1, the curve reduces to Equation (10):
LE(I(x); A) = I(x) + A\,I(x)\,(1 - I(x))
where n is the number of iterations, A is a matrix of curve parameters with the same size as the input image, I(x) is the input low-illumination image at pixel location x, and LE_n(x) is the enhanced image.
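A minimal sketch of how Equation (9) is applied is given below, assuming the per-iteration curve parameter maps A_n have already been predicted by the Zero-DCE curve-estimation network; the function only applies the curves and is illustrative, not the authors' implementation.

```python
import torch

def apply_zero_dce_curves(image, curve_maps):
    """Apply Equation (9) iteratively.
    image: low-light input in [0, 1], shape (B, 3, H, W).
    curve_maps: list of tensors A_n with the same shape as `image`, assumed
    to have been predicted by the Zero-DCE curve-estimation network."""
    le = image
    for a_n in curve_maps:
        le = le + a_n * le * (1.0 - le)   # LE_n = LE_{n-1} + A_n * LE_{n-1} * (1 - LE_{n-1})
    return le.clamp(0.0, 1.0)
```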
Furthermore, this algorithm has fewer model parameters and a faster processing speed than other enhancement algorithms such as Low-Light Net (LLNet) [68] and EnlightenGAN, making it suitable for deployment on low-end devices while ensuring the enhancement effect. In practical tests on the experimental dataset in this paper, the algorithm consumes only 2 ms of inference time, and the model size is only 313 KB. The performance of different low-light enhancement algorithms is shown in Table 1.

4.2. Low-Illumination Degrading Transformations

Furthermore, the low-illumination degrading stage proposed by Cui et al. [69] was redesigned in this work so that the daytime dataset generates images closer to the target domain. The entire low-illumination degrading transformation is shown in Figure 5: a new domain adaptation estimation module was added, and the inverse tone mapping and quantization modules were removed compared with the original paper. Experiments showed that these two modules degrade the generated results, making it difficult for the model to learn the target features, which is not conducive to training.

4.2.1. Domain Adaptation Estimation

The module first converts the input daytime image into a grayscale image and estimates the ratio of dark pixels in the grayscale image, and then establishes corresponding suppression factors for different proportions. The factor acts on the low-light corruption process so that fake-night data with more stable illumination are generated from daytime data with different brightness levels.
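A minimal sketch of this estimation step is given below. The grayscale conversion and dark-pixel ratio follow the description above, while the dark-pixel threshold and the mapping from ratio to suppression factor are illustrative assumptions, since the paper does not specify them.

```python
import numpy as np

def dark_ratio_suppression_factor(rgb_image, dark_threshold=0.2):
    """Estimate how dark a daytime image already is and derive a suppression
    factor for the subsequent low-light corruption. Threshold and mapping are
    illustrative assumptions."""
    gray = rgb_image @ np.array([0.299, 0.587, 0.114])   # (H, W, 3) in [0, 1] -> grayscale
    dark_ratio = float((gray < dark_threshold).mean())   # proportion of dark pixels
    # Images that are already dark receive a weaker additional corruption.
    return 1.0 - 0.5 * dark_ratio
```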

4.2.2. Gamma Correction and Inverse Gamma Correction

Gamma correction compensates for the nonlinearity of human perception in dark areas. The standard gamma curve [70] and its inverse process (inverse gamma correction) are shown in Equations (11) and (12).
y_{gamma} = \max(x, \varepsilon)^{\frac{1}{\gamma}}
y_{invert\_gamma} = \max(x, \varepsilon)^{\gamma}
where the parameter γ is randomly sampled from the uniform distribution γ ∼ U(2, 3.5), and ε is a very small value that ensures numerical stability during the conversion.
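The two mappings in Equations (11) and (12) can be expressed as a short NumPy sketch; the per-image sampling of γ follows the distribution above, and the epsilon value is an assumption.

```python
import numpy as np

def gamma_correct(x, gamma, eps=1e-8):
    """Equation (11): gamma correction of a linear image x with values in [0, 1]."""
    return np.maximum(x, eps) ** (1.0 / gamma)

def inverse_gamma_correct(x, gamma, eps=1e-8):
    """Equation (12): inverse gamma correction, mapping back towards linear values."""
    return np.maximum(x, eps) ** gamma

# gamma is drawn per image as described above
gamma = np.random.uniform(2.0, 3.5)
```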

4.2.3. Color Space Conversion

Two color space conversions are included in the low-illumination degrading stage: the first converts sRGB to cRGB, and the second converts the white-balanced signal from the camera-internal cRGB back to sRGB; this is needed because the camera-internal cRGB color space is not identical to sRGB [71,72]. The converted signal y_sRGB can be obtained from the color correction matrix (CCM) M_ccm, as shown in Equation (13).
y_{sRGB} = M_{ccm} \cdot y_{cRGB}
Its inverse process is shown in Equation (14).
y_{cRGB} = M_{ccm}^{-1} \cdot y_{sRGB}
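The two conversions can be sketched as follows for an image whose pixels are treated as RGB row vectors; the 3 × 3 matrix `ccm` is assumed to be a given color correction matrix.

```python
import numpy as np

def crgb_to_srgb(img, ccm):
    """Equation (13): map a camera-space (cRGB) image to sRGB with the color
    correction matrix. img has shape (H, W, 3); ccm is a 3x3 matrix."""
    return img @ ccm.T

def srgb_to_crgb(img, ccm):
    """Equation (14): the inverse mapping from sRGB back to cRGB."""
    return img @ np.linalg.inv(ccm).T
```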

4.2.4. White Balance and Inverse White Balance

White balance simulates the color constancy of the Human Visual System (HVS) by mapping “white” colors onto white objects [72]. The color of the image is determined by the color of the light and the reflectivity of the material. The white balance step in the camera pipeline estimates and adjusts the gain of the red channel and the blue channel so that the image appears to be illuminated under “neutral” lighting. The process is shown in the following Equation (15).
\begin{bmatrix} y_r \\ y_g \\ y_b \end{bmatrix} = \begin{bmatrix} g_r & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & g_b \end{bmatrix} \cdot \begin{bmatrix} x_r \\ x_g \\ x_b \end{bmatrix}
where the gains g_r and g_b are sampled independently from uniform distributions, g_r from (1.9, 2.4) and g_b from (1.5, 1.9); in the inverse white balance process, the red and blue gains are the reciprocals 1/g_r and 1/g_b.
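Equation (15) amounts to a per-channel scaling, as in the following sketch; the RGB channel order is an assumption.

```python
import numpy as np

def apply_white_balance(img, g_r, g_b):
    """Equation (15): scale the red and blue channels by the sampled gains.
    img has shape (H, W, 3) in RGB order; the inverse step uses 1/g_r and 1/g_b."""
    return img * np.array([g_r, 1.0, g_b])

# Gains sampled independently as described above
g_r = np.random.uniform(1.9, 2.4)
g_b = np.random.uniform(1.5, 1.9)
```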

4.2.5. Low-Light Corruption

When photons are focused by the lens onto the sensor's capacitor clusters, the electric charge produced by each capacitor, for identical exposure duration, aperture size, and automatic gain control, correlates with the lux of the ambient illumination. The random arrival of photons at the sensor causes shot noise, which is a fundamental limitation of the camera. Since photon arrival times follow a Poisson distribution, the uncertainty in the number of photons collected in a fixed period is δ_s = √S, where δ_s is the shot noise and S is the signal from the sensor. In the output amplifier, read noise is generated during the electronic voltage conversion, which can be modeled as a zero-mean Gaussian random variable with fixed variance.
In camera imaging systems, shot and read noise are the dominant noise sources; therefore, we model the noise measurements on the sensor [73] as shown in Equations (16) and (17).
x_{noise} \sim N\left(\mu = kx, \; \delta^2 = \delta_r^2 + \delta_s kx\right)
y_{noise} = kx + x_{noise}
where the true intensity x of each pixel is obtained from the unprocessed image and is linearly attenuated by the parameter k. The light intensity parameter k is randomly sampled from a truncated Gaussian distribution with a mean of 0.1 and a variance of 0.08, restricted to the range (0.01, 1.0), to replicate various lighting situations. In addition, the parameter ranges of δ_r and δ_s follow the literature [74], as shown in Equations (18) and (19).
\log \delta_s \sim U(-4, -2)
\log \delta_r \mid \log \delta_s \sim N\left(\mu = 2.18 \log \delta_s + 0.12, \; \sigma = 0.26\right)
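The corruption step can be sketched as follows, drawing the noisy measurement around the attenuated signal kx with variance δ_r² + δ_s·kx as in Equation (16). The sampling of k, δ_s, and δ_r follows the text above; the clipping to [0, 1] and the NumPy-based implementation are assumptions for illustration.

```python
import numpy as np

def low_light_corruption(x, k, delta_r, delta_s, rng=None):
    """Attenuate the clean linear image x (values in [0, 1]) by the light factor k
    and draw a noisy measurement with variance delta_r**2 + delta_s * k * x."""
    rng = rng or np.random.default_rng()
    signal = k * x
    sigma = np.sqrt(delta_r ** 2 + delta_s * signal)   # read noise + shot noise
    return np.clip(rng.normal(signal, sigma), 0.0, 1.0)

# Parameter sampling as described in the text
rng = np.random.default_rng()
k = float(np.clip(rng.normal(0.1, 0.08), 0.01, 1.0))   # truncated Gaussian light level
log_delta_s = rng.uniform(-4.0, -2.0)
log_delta_r = rng.normal(2.18 * log_delta_s + 0.12, 0.26)
delta_s, delta_r = np.exp(log_delta_s), np.exp(log_delta_r)
```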

5. Experimental Comparison and Discussion

5.1. Experimental Datasets and Performance Evaluation Metrics

5.1.1. Experimental Dataset

The experiments use the public BDD100K dataset, which contains 100,000 images covering different weather conditions and road scenes at different times of day and is the largest and most diverse autonomous driving dataset in terms of content; the image resolution is 1280 × 720.
Since the task studied in this paper is nighttime vehicle/pedestrian detection, a total of 4800 nighttime images with vehicle and pedestrian targets were manually screened, including 3800 images in the training set and 1000 images in the validation set. In the data domain transfer study, 9399 daytime images with vehicle and pedestrian targets were prepared. Data enhancement processes such as random level flipping, color transformation, and Mosaic were also performed on the training data to further expand the dataset and improve the generalization capability of the model.

5.1.2. Evaluation Metrics

To evaluate the improvement achieved by the algorithm in this paper and the differences from other detection algorithms, six metrics are used: mean Average Precision (mAP), F1 score, Recall, Precision, inference time, and model weight size.

5.2. Experimental Parameter Setting

The model training framework was based on PyTorch 1.8, running on an i7-9700k CPU and an NVIDIA GeForce RTX 2070 SUPER GPU with 8 GB of video memory. The network input size is 640 × 640, and Stochastic Gradient Descent (SGD) is used as the optimizer. The learning rate is set to 0.01 and the weight decay to 0.0005. The momentum of SGD is 0.937, the image batch size is 8, and the total number of training rounds is 200 epochs. The first 3 epochs use the WarmUp learning-rate strategy, and Mosaic data augmentation is turned off for the last 15 epochs. The rotation angle is set to 0, random multi-scale training and Mixup data augmentation are removed, and the rest of the settings are left at their default values.
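For reference, the core optimizer settings above can be expressed as the following PyTorch sketch; the placeholder model and the linear form of the warm-up schedule are assumptions, since the paper does not detail them.

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 16, 3)   # placeholder standing in for the improved YOLOX network

# Optimizer hyperparameters as listed above
optimizer = torch.optim.SGD(model.parameters(), lr=0.01,
                            momentum=0.937, weight_decay=0.0005)

def warmup_lr(epoch, base_lr=0.01, warmup_epochs=3):
    """Linear warm-up over the first 3 epochs; the exact schedule used in the
    paper is not specified, so this mapping is an assumption."""
    return base_lr * min(1.0, (epoch + 1) / warmup_epochs)
```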

5.3. Analysis

5.3.1. Training Evaluation Process and PR Curve

In this paper, the whole training evaluation process of YOLOv5, YOLOX, and the improved YOLOX algorithm was visualized, and the results are shown in Figure 6. The improved YOLOX algorithm (the algorithm in this paper) has better mAP values (IoU threshold is taken as 0.5) than YOLOv5 and YOLOX, and the results tend to converge. In addition, in order to further improve the detection robustness of the algorithm in the nighttime scenario, the improved YOLOX algorithm is trained again using the training strategy of data domain transfer in this paper, and the results show that the mAP is significantly improved, thus demonstrating the effectiveness of the strategy.
Finally, the trained models are analyzed using Precision–Recall (PR) curves (with the IoU threshold set to 0.5), and the results are shown in Figure 6. The PR curve of this paper's algorithm completely envelops those of the previous algorithms, indicating that this paper's algorithm performs better.

5.3.2. Ablation Studies

The improved YOLOX algorithm proposed in this paper mainly includes five improvements, which were model structure reparameterization and light weighting, the CA attention mechanism, feature pyramid improvement, the confidence loss function based on Varifocal Loss, and the CIoU-based bounding-box regression loss function. In order to verify the effectiveness of each improvement point of the algorithm in this paper, an exhaustive experimental study of model ablation is conducted on the nighttime vehicle/pedestrian dataset, and the results are shown in Table 2.
Here, Baseline represents the original YOLOX algorithm trained with default parameters, and its mAP is 76.5%. After the model structure re-parameterization and light-weighting, the model weight size was reduced by approximately 4 MB with almost unchanged accuracy, giving an mAP of 76.6%. After three CA attention layers were added after the CSP components at positions P3, P4, and P5 in the YOLOX backbone network, the accuracy increased by about 1% with almost the same model size, giving an mAP of 77.5%, but the inference time increased because the model structure became more complex. After the new feature-scale fusion branch was added to the feature pyramid, the mAP increased by a further 1.8% to 79.3%. Using Varifocal Loss as the confidence loss left the inference time almost unchanged while improving the accuracy by 0.4%, giving an mAP of 79.7%. Finally, using CIoU loss as the bounding-box regression loss improved the accuracy by another 0.4%, giving a final mAP of 80.1%.

5.3.3. Effect of CA Module Location and Number of Layers

In this paper, we further investigate the effect of the additional position and number of layers of CA modules in the backbone network. As shown in Table 3, a total of eight comparative experiments are conducted, including four different location addition schemes and three different layer number analyses.
In Table 3, a "√" indicates that the CA module was used at that location in the network, and bold values indicate the best results. From Table 3, the CA module achieves the best accuracy of 80.1% when added after the CSP components at positions P3, P4, and P5 in the backbone network. Upon further reducing the CA module from the initial three layers to one layer, the mAP decreases by only 0.5%, while the inference time is reduced by 5.3 ms and the model size by 0.2 MB, which is more advantageous than adding three layers of the CA module. The final mAP of the improved YOLOX algorithm is 79.6%; compared with the original YOLOX model, the mAP is improved by 3.1% and the model size is reduced by 1.6 MB.

5.3.4. Comparison of Different Testing Methods

To further illustrate the effectiveness of the proposed method, the detection performance is compared with YOLOv3, YOLOv4, and YOLOv5 in the one-stage detection method and Faster R-CNN and Cascade R-CNN in the two-stage detection method using the same experimental configuration in the nighttime vehicle/pedestrian dataset, and the results are shown in Table 4.
From Table 4, it can be seen that the improved algorithm proposed in this paper achieves the best mAP, F1, and Recall values compared with the YOLO series and the two-stage algorithms, at 79.6%, 75.8%, and 72.3%, respectively. Meanwhile, the algorithm meets the requirements of real-time detection in terms of inference speed, and the model size is only 15.6 MB, so it can be deployed on low-performance storage devices. Although this algorithm has no advantage over YOLOv5 in inference speed and model size, its mAP is 5.8% higher and its Recall is 8.7% higher; nighttime object detection is therefore improved, and the model size and inference time can be further reduced by model pruning and quantization. Therefore, in practical applications, the algorithm in this paper is more advantageous.
In addition, to illustrate the effectiveness of the data domain transfer training strategy more clearly, a comparative analysis was performed against the improved algorithm trained without the data domain transfer method, and the comparison results are also shown in Table 4. The results show that performance metrics such as mAP, F1, Recall, and Precision are further improved after the improved algorithm is trained with the data domain transfer strategy, and the mAP is better than that obtained by directly adding the additional daytime training data for mixed training. Compared with the original YOLOX algorithm, the final improved algorithm increases the mAP by 5.9% to 82.4%, the F1 score by 4.3%, the Recall by 4.9%, and the Precision by 3.3%.

5.4. Effectiveness Analysis

To test the detection effectiveness of the proposed method in practice, the detection results of the original YOLOX algorithm, the improved algorithm in this paper, and the improved algorithm trained with the data domain transfer strategy are visualized on a real nighttime vehicle/pedestrian dataset, as shown in Figure 7.
From the figure, it can be seen that the improved YOLOX algorithm produces fewer missed detections and false positives than the original algorithm, and it is even more effective for nighttime vehicle/pedestrian detection when combined with the data domain transfer training strategy.

6. Conclusions

In this work, a vehicle/pedestrian detection algorithm was designed based on YOLOX. In addition, in order to further improve the detection accuracy of the algorithm under low-light conditions, a training strategy based on data domain transfer was proposed. We trained the improved YOLOX with the proposed domain transfer strategy and achieved satisfactory results. The major remarkable features of the proposed approach are:
(1) The ablation experiments revealed that the improvement to the feature pyramid part of the YOLOX model contributed the most, with a 1.8% gain in mAP. By adding a fusion detection branch at the large-scale feature map from the backbone network and fusing it with the original three smaller-scale feature maps, more shallow image feature information is obtained while the deep image semantic information is retained, effectively enhancing the detection of small, dense targets.
(2) Introducing the coordinate-based attention mechanism in the YOLOX backbone network can improve the feature extraction capability of the deep model, but the extra computational effort increases the model inference time by 7.7 ms; by simplifying the attention mechanism module from a three-layer structure to a one-layer structure, the inference time can be reduced by 5.3 ms at the cost of a 0.5% decrease in mAP.
(3) A model training strategy based on data domain transfer was proposed, in which the nighttime and daytime datasets are domain transferred by combining low-light enhancement and low-illumination degrading methods and then mixed for training. After training the improved algorithm with the domain transfer strategy, the detection performance for nighttime vehicle and pedestrian targets is significantly improved, with the mAP increasing by a further 2.8%.
(4) The improved algorithm, after being trained by the domain transfer strategy, eventually improved the average detection accuracy of nighttime vehicle/pedestrian targets by 5.9% to 82.4%.
Our proposed method can effectively improve the target detection accuracy of self-driving vehicles in a nighttime environment and has good implications for other target detection tasks in low-light environments. However, it is still a long way from the highly reliable sensing method needed to guarantee safe nighttime driving. In the future, we will further explore the target detection task in fog conditions and explore the fusion of visual detection with other techniques such as radar and lidar.

Author Contributions

Guarantor: K.Y. is responsible for the entirety of the work and the final decision to submit the manuscript. Conceptualization, K.Y. and R.H.; methodology, K.Y., K.L. and T.C.; software, K.L. and T.C.; validation, K.Y., K.L. and T.C.; formal analysis, K.Y. and K.L.; resources, K.L. and T.C.; data curation, K.L. and T.C.; writing—original draft preparation, K.Y. and K.L.; writing—review and editing, K.Y., K.L., T.C. and R.H.; visualization, K.L. and T.C.; supervision, K.Y. and R.H.; project administration, K.Y. and R.H.; funding acquisition, K.Y. and R.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the Natural Science Foundation of China, grant number 52002036, partly supported by the Hunan Provincial Natural Science Foundation of China, grant number 2022JJ30611, the Scientific Research Fund of Hunan Provincial Education Department, grant number 21B0342, the Changsha Science and Technology Major Project, grant number kh2202002, and the Postgraduate Scientific Research Innovation Project of Hunan Province, grant number CX20220903.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are available in a publicly accessible repository. The data presented in this study are openly available at https://bdd-data.berkeley.edu/ (accessed on 1 December 2022).

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Liu, J.; Li, J.; Wang, K.; Zhao, J.; Cong, H.; He, P. Exploring Factors Affecting the Severity of Night-Time Vehicle Accidents under Low Illumination Conditions. Adv. Mech. Eng. 2019, 11, 1687814019840940. [Google Scholar] [CrossRef]
  2. Chuma, E.L.; Iano, Y. Human Movement Recognition System Using CW Doppler Radar Sensor with FFT and Convolutional Neural Network. In Proceedings of the 2020 IEEE MTT-S Latin America Microwave Conference (LAMC 2020), Cali, Colombia, 26–28 May 2021; pp. 1–4. [Google Scholar]
  3. Navarro, P.J.; Fernández, C.; Borraz, R.; Alonso, D. A Machine Learning Approach to Pedestrian Detection for Autonomous Vehicles Using High-Definition 3D Range Data. Sensors 2017, 17, 18. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  4. Lee, W.; Cho, H.; Hyeong, S.; Chung, W. Practical Modeling of GNSS for Autonomous Vehicles in Urban Environments. Sensors 2019, 19, 4236. [Google Scholar] [CrossRef] [Green Version]
  5. Wei, Y.; Tian, Q.; Guo, J.; Huang, W.; Cao, J. Multi-Vehicle Detection Algorithm through Combining Harr and HOG Features. Math. Comput. Simul. 2019, 155, 130–145. [Google Scholar] [CrossRef]
  6. Wu, H.; Hu, Y.; Wang, W.; Mei, X.; Xian, J. Ship Fire Detection Based on an Improved YOLO Algorithm with a Lightweight Convolutional Neural Network Model. Sensors 2022, 22, 7420. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Structure of Improved YOLOX Model.
Figure 2. Distribution of sample label scale.
Figure 3. General training and testing framework for algorithms.
Figure 4. The framework of Zero-DCE.
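Figure 4 depicts Zero-DCE, which brightens an image by repeatedly applying a pixel-wise quadratic light-enhancement curve whose parameter maps are predicted by a small CNN (DCE-Net). As a minimal sketch of that curve-application step only, the snippet below assumes the standard 8 iterations from the original Zero-DCE paper and uses random toy inputs in place of a real image and a trained DCE-Net.

```python
import torch

def apply_le_curves(image: torch.Tensor, curve_maps: torch.Tensor) -> torch.Tensor:
    """Iteratively apply Zero-DCE light-enhancement curves.

    image:      (N, 3, H, W) tensor with values in [0, 1]
    curve_maps: (N, 3 * n_iter, H, W) per-pixel curve parameters in [-1, 1],
                as predicted by DCE-Net (stubbed out here with random values).
    """
    n_iter = curve_maps.shape[1] // image.shape[1]
    x = image
    for i in range(n_iter):
        a = curve_maps[:, 3 * i: 3 * (i + 1)]
        x = x + a * x * (1.0 - x)      # LE(x) = x + a * x * (1 - x)
    return x

# Toy example: a dark image and random curve maps standing in for DCE-Net output
dark = torch.rand(1, 3, 64, 64) * 0.2
curves = torch.tanh(torch.randn(1, 24, 64, 64))   # 8 iterations * 3 channels
enhanced = apply_le_curves(dark, curves)
print(enhanced.shape, float(enhanced.mean()))
```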
Figure 5. The pipeline of low-illumination degrading transformations.
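The actual degrading pipeline in Figure 5 is built on ISP inversion and sensor-noise modeling; purely as a rough illustration of the idea of converting daytime images into night-like training data, the sketch below applies a simplified darkening-plus-noise transform (random gamma compression, gain reduction, and Gaussian read noise). The parameter ranges and function name are hypothetical and are not taken from the paper.

```python
import numpy as np

def degrade_to_low_light(img, gamma_range=(2.0, 3.5), gain_range=(0.1, 0.3),
                         read_noise_sigma=0.02, rng=None):
    """Simplified low-illumination degradation: darken an RGB image in [0, 1]
    with random gamma and gain, then add Gaussian noise. An illustrative
    stand-in for the full ISP-based pipeline, not the authors' implementation."""
    rng = np.random.default_rng() if rng is None else rng
    img = img.astype(np.float32)

    gamma = rng.uniform(*gamma_range)   # stronger gamma -> darker mid-tones
    gain = rng.uniform(*gain_range)     # global illumination drop
    dark = gain * np.power(img, gamma)

    noise = rng.normal(0.0, read_noise_sigma, size=dark.shape)  # sensor read noise
    return np.clip(dark + noise, 0.0, 1.0)

# Example: degrade a synthetic "daytime" image
day = np.random.rand(256, 256, 3).astype(np.float32)
night_like = degrade_to_low_light(day)
```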
Figure 6. (a) Comparison of mAP for different algorithms. (b) Comparison of PR curves for different algorithms.
Figure 7. Comparison of detection results under the night scenario. (a) YOLOX; (b) improved YOLOX (ours); (c) improved YOLOX + Domain Transfer (ours).
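For readers unfamiliar with the metric behind Figure 6 and the tables below, average precision is the area under the precision-recall curve for one class, and mAP is its mean over classes. The sketch below uses standard all-point interpolation; the PR points are made-up values for illustration only.

```python
import numpy as np

def average_precision(recall, precision):
    """All-point interpolated AP: area under the precision-recall curve
    after enforcing a monotonically non-increasing precision envelope."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([0.0], precision, [0.0]))
    for i in range(len(p) - 2, -1, -1):        # precision envelope
        p[i] = max(p[i], p[i + 1])
    idx = np.where(r[1:] != r[:-1])[0]         # recall change points
    return float(np.sum((r[idx + 1] - r[idx]) * p[idx + 1]))

# Hypothetical PR points for the two classes (person, car)
ap_person = average_precision(np.array([0.1, 0.4, 0.7, 0.9]),
                              np.array([0.95, 0.90, 0.80, 0.60]))
ap_car = average_precision(np.array([0.2, 0.5, 0.8, 0.95]),
                           np.array([0.92, 0.88, 0.75, 0.55]))
print("mAP:", (ap_person + ap_car) / 2)
```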
Table 1. Performance comparison of different low-light enhancement algorithms.

| Algorithm | Processing Time | Training Parameters | FLOPs | Average Absolute Error |
|---|---|---|---|---|
| RetinexNet | 0.1200 | 555,205 | 587.47 | 104.81 |
| EnlightenGAN | 0.0078 | 8,636,675 | 273.24 | 102.78 |
| Zero-DCE | 0.0025 | 79,416 | 84.99 | 98.78 |
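As a hedged illustration of how the per-image processing time and parameter count in Table 1 might be measured (the paper does not publish its benchmarking script; the toy network, input size, and device below are placeholders, not the actual enhancement models):

```python
import time
import torch
import torch.nn as nn

def count_parameters(model: nn.Module) -> int:
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

@torch.no_grad()
def mean_inference_time(model: nn.Module, input_shape=(1, 3, 640, 640),
                        warmup=5, runs=50, device="cpu") -> float:
    """Average forward-pass time per image in seconds (CPU timing shown)."""
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)
    for _ in range(warmup):                  # warm-up runs to stabilise caches
        model(x)
    start = time.perf_counter()
    for _ in range(runs):
        model(x)
    return (time.perf_counter() - start) / runs

# Placeholder network standing in for RetinexNet / EnlightenGAN / Zero-DCE
toy_enhancer = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                             nn.Conv2d(32, 3, 3, padding=1))
print(count_parameters(toy_enhancer), mean_inference_time(toy_enhancer))
```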
Table 2. Ablation experiments on the improved YOLOX algorithm.

| Training Method | AP (Person) | AP (Car) | mAP | F1 | Recall | Precision | Inference Time (ms) | Model Size (MB) |
|---|---|---|---|---|---|---|---|---|
| YOLOX (baseline) | 0.768 | 0.762 | 0.765 | 0.736 | 0.685 | 0.796 | 11.2 | 17.2 |
| + Lightweighting (fewer channels) | 0.760 | 0.756 | 0.758 | 0.729 | 0.678 | 0.788 | 11.0 | 13.3 |
| + Lightweighting (increased CSP depth) | 0.763 | 0.762 | 0.763 | 0.736 | 0.687 | 0.793 | 12.8 | 13.8 |
| + Re-parameterization and lightweighting | 0.770 | 0.761 | 0.766 | 0.740 | 0.662 | 0.839 | 12.6 | 13.5 |
| + Attention mechanism module (3 levels) | 0.781 | 0.769 | 0.775 | 0.743 | 0.700 | 0.792 | 20.3 | 13.8 |
| + Feature pyramid improvement | 0.789 | 0.796 | 0.793 | 0.752 | 0.700 | 0.813 | 23.0 | 15.8 |
| + Varifocal Loss | 0.795 | 0.798 | 0.797 | 0.753 | 0.724 | 0.786 | 23.2 | 15.8 |
| + CIoU Loss | 0.801 | 0.801 | 0.801 | 0.761 | 0.718 | 0.809 | 23.1 | 15.8 |
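The last two ablation rows swap in Varifocal Loss for confidence prediction and CIoU for box regression. The snippet below is a minimal, self-contained sketch of both terms written from the published formulas, not from the authors' code; the box format, the focal parameters alpha and gamma, and the toy tensors are assumptions.

```python
import math
import torch
import torch.nn.functional as F

def ciou_loss(pred, target, eps=1e-7):
    """Complete-IoU loss for boxes in (x1, y1, x2, y2) format, shape (N, 4)."""
    # IoU
    lt = torch.max(pred[:, :2], target[:, :2])
    rb = torch.min(pred[:, 2:], target[:, 2:])
    wh = (rb - lt).clamp(min=0)
    inter = wh[:, 0] * wh[:, 1]
    area_p = (pred[:, 2] - pred[:, 0]) * (pred[:, 3] - pred[:, 1])
    area_t = (target[:, 2] - target[:, 0]) * (target[:, 3] - target[:, 1])
    iou = inter / (area_p + area_t - inter + eps)

    # Centre-distance term over the enclosing box diagonal
    c_lt = torch.min(pred[:, :2], target[:, :2])
    c_rb = torch.max(pred[:, 2:], target[:, 2:])
    c_diag = ((c_rb - c_lt) ** 2).sum(dim=1) + eps
    centre_dist = (((pred[:, :2] + pred[:, 2:]) / 2 -
                    (target[:, :2] + target[:, 2:]) / 2) ** 2).sum(dim=1)

    # Aspect-ratio consistency term
    w_p, h_p = pred[:, 2] - pred[:, 0], pred[:, 3] - pred[:, 1]
    w_t, h_t = target[:, 2] - target[:, 0], target[:, 3] - target[:, 1]
    v = (4 / math.pi ** 2) * (torch.atan(w_t / (h_t + eps)) -
                              torch.atan(w_p / (h_p + eps))) ** 2
    alpha = v / (1 - iou + v + eps)
    return (1 - iou + centre_dist / c_diag + alpha * v).mean()

def varifocal_loss(pred_logits, target_score, alpha=0.75, gamma=2.0):
    """Varifocal Loss: IoU-aware BCE on positives, focal down-weighting on negatives.
    target_score is the IoU-aware classification target q (0 for negatives)."""
    p = torch.sigmoid(pred_logits)
    pos = (target_score > 0).float()
    weight = target_score * pos + alpha * p.pow(gamma) * (1 - pos)
    bce = F.binary_cross_entropy_with_logits(pred_logits, target_score, reduction="none")
    return (weight * bce).mean()

# Toy usage with made-up predictions and targets
pred_boxes = torch.tensor([[10., 10., 50., 60.], [5., 5., 30., 40.]])
gt_boxes = torch.tensor([[12., 8., 48., 62.], [0., 0., 25., 35.]])
print(ciou_loss(pred_boxes, gt_boxes))
print(varifocal_loss(torch.randn(8), (torch.rand(8) > 0.5).float() * torch.rand(8)))
```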
Table 3. Comparison of CA module location and number of layers.

| Model | P2 | P3 | P4 | P5 | mAP (3 layers) | Inference Time, 3 layers (ms) | mAP (1 layer) | Inference Time, 1 layer (ms) |
|---|---|---|---|---|---|---|---|---|
| 1 | | | | | 0.788 | 17.2 | 0.780 | 16.1 |
| 2 | | | | | 0.791 | 21.6 | 0.789 | 17.2 |
| 3 | | | | | 0.801 | 23.1 | 0.796 | 17.8 |
| 4 | | | | | 0.782 | 25.1 | 0.782 | 18.5 |
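Table 3 varies where the coordinate attention (CA) module is inserted in the feature pyramid and over how many levels. As a reference for the operation itself, here is a minimal PyTorch sketch of a CA block following the published design (directional average pooling, a shared bottleneck, and two sigmoid attention maps); the reduction ratio, channel count, and feature-map size are placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class CoordinateAttention(nn.Module):
    """Minimal coordinate attention block (Hou et al., CVPR 2021)."""
    def __init__(self, channels: int, reduction: int = 32):
        super().__init__()
        mid = max(8, channels // reduction)
        self.pool_h = nn.AdaptiveAvgPool2d((None, 1))   # pool along width  -> (C, H, 1)
        self.pool_w = nn.AdaptiveAvgPool2d((1, None))   # pool along height -> (C, 1, W)
        self.conv1 = nn.Conv2d(channels, mid, kernel_size=1)
        self.bn1 = nn.BatchNorm2d(mid)
        self.act = nn.Hardswish()
        self.conv_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.conv_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        n, c, h, w = x.shape
        x_h = self.pool_h(x)                          # (N, C, H, 1)
        x_w = self.pool_w(x).permute(0, 1, 3, 2)      # (N, C, W, 1)
        y = self.act(self.bn1(self.conv1(torch.cat([x_h, x_w], dim=2))))
        y_h, y_w = torch.split(y, [h, w], dim=2)
        a_h = torch.sigmoid(self.conv_h(y_h))                       # (N, C, H, 1)
        a_w = torch.sigmoid(self.conv_w(y_w.permute(0, 1, 3, 2)))   # (N, C, 1, W)
        return x * a_h * a_w

# Example: attach CA to a 256-channel feature map from one pyramid level
feat = torch.randn(1, 256, 40, 40)
print(CoordinateAttention(256)(feat).shape)   # torch.Size([1, 256, 40, 40])
```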
Table 4. Performance comparison of different algorithms.

| Algorithm | AP (Person) | AP (Car) | mAP | F1 | Recall | Precision | Inference Time (ms) | Model Size (MB) |
|---|---|---|---|---|---|---|---|---|
| Faster R-CNN | 0.762 | 0.757 | 0.760 | 0.753 | 0.705 | 0.810 | 84.7 | 166.0 |
| Cascade R-CNN | 0.765 | 0.765 | 0.765 | 0.749 | 0.710 | 0.792 | 97.1 | 277.0 |
| YOLOv3 | 0.700 | 0.743 | 0.722 | 0.676 | 0.564 | 0.836 | 34.9 | 235.0 |
| YOLOv4 | 0.747 | 0.745 | 0.746 | 0.687 | 0.570 | 0.863 | 44.2 | 244.0 |
| YOLOv5 | 0.740 | 0.737 | 0.738 | 0.721 | 0.636 | 0.831 | 10.6 | 13.7 |
| YOLOX | 0.768 | 0.762 | 0.765 | 0.736 | 0.685 | 0.796 | 11.2 | 17.2 |
| Improved YOLOX (ours) | 0.794 | 0.798 | 0.796 | 0.758 | 0.723 | 0.797 | 17.8 | 15.6 |
| Improved YOLOX + Dataset Expansion | 0.820 | 0.815 | 0.817 | 0.767 | 0.742 | 0.793 | 17.8 | 15.6 |
| Improved YOLOX + Domain Transfer (ours) | 0.831 | 0.816 | 0.824 | 0.779 | 0.734 | 0.829 | 19.7 | 15.9 |