Article

High-Precision Carton Detection Based on Adaptive Image Augmentation for Unmanned Cargo Handling Tasks

Naval Architecture and Ocean Engineering College, Dalian Maritime University, Dalian 116026, China
* Author to whom correspondence should be addressed.
Sensors 2024, 24(1), 12; https://doi.org/10.3390/s24010012
Submission received: 1 November 2023 / Revised: 4 December 2023 / Accepted: 15 December 2023 / Published: 19 December 2023
(This article belongs to the Topic Applications in Image Analysis and Pattern Recognition)

Abstract

Unattended intelligent cargo handling is an important means of improving the efficiency and safety of port cargo trans-shipment, and high-precision carton detection is an essential prerequisite for it. Therefore, this paper introduces an adaptive image augmentation method for high-precision carton detection. First, the imaging parameters of the collected images are clustered into various scenarios, and the imaging parameters and perspectives are adaptively adjusted to achieve automatic augmentation and balancing of the carton dataset in each scenario, which reduces the interference of the scenarios with carton detection precision. Then, carton boundary features are extracted and stochastically sampled to synthesize new images, enhancing the detection performance of the trained model on dense cargo boundaries. Moreover, a weight function over the hyperparameters of the trained model is constructed to achieve their preferential crossover during genetic evolution, which preserves training efficiency on the augmented dataset. Finally, an intelligent cargo handling platform is developed and field experiments are conducted. The experimental results show that the method attains a detection precision of 0.828, improving the detection precision by 18.1% and 4.4% compared with the baseline and other methods, respectively, which provides a reliable guarantee for intelligent cargo handling processes.

1. Introduction

Ports are important channels for import–export trade and economic growth. Cargo handling is the most time-consuming task during cargo trans-shipment and is a key factor leading to cargo backlogs and reduced port throughput. In addition, viruses can be easily transmitted during port operations. Therefore, unattended and intelligent cargo handling is key to improving the efficiency of port operations and reducing the rate of virus transmission. Since cargo is often packed in cartons, carton detection has become one of the core technologies in the intelligent cargo handling process [1,2].
In the process of cargo handling, the intelligent cargo handling system needs to detect each carton in advance and generate corresponding grab instructions. Bulk cargo is characterized by high density, random placement and varying scales, as shown in Figure 1, which seriously aggravates the difficulty of carton detection. Stacked cartons are densely packed with poor boundary discrimination, and other rectangular objects seriously interfere with carton detection. Traditional image processing methods are sensitive to the environment, generalize poorly and are therefore unsuitable for carton detection in intelligent cargo handling, whereas deep-learning-based object detection methods, which rely on training with large amounts of data, generalize well and are widely used in carton detection. Therefore, this paper focuses on the deep-learning-based carton detection method and optimizes its performance in the carton detection process.
In this paper, a large-scale carton dataset is first presented to train the carton detection model, which includes logistics cartons, containers and bulk cargo under various interference environments. Traditional deep learning algorithms, such as regions with convolutional neural network features (R-CNN) [3] and You Only Look Once (YOLO) [4,5], need to be trained on a wide range of data to improve model precision and generalization. However, in port operation scenarios, where the carton angle and imaging parameters are highly random, it is difficult for the presented carton dataset to cover all cases. Combined with the high density and poor boundary discrimination of stacked cartons, this sharply reduces the generalization ability of the trained model, lowering carton detection precision, causing multiple cartons to be detected as one or even leaving cartons undetected. Therefore, it is particularly important to improve the generalization ability of the carton detection model based on the presented limited carton dataset.
Collecting the targets to be detected in all possible scenarios to solve the problem of poor model generalization is time-consuming and, in practice, impossible. Therefore, traditional deep learning methods augment the training sample set by transforming the imaging parameters and perspective of a single image or by synthesizing multiple images. Single-image augmentation methods generate new images through style transfer [6,7], motion blur [8], perspective transformation [9], rotation [10], cropping [11], etc., while multi-image synthesis methods generate new images by pasting cropped foreground objects onto a new background [12,13,14]. However, traditional data augmentation methods still have two main disadvantages when adopted for carton detection: (1) they only transform the images in the training set but ignore the impact of the actual environment on target detection, which limits the improvement in model generalization ability; (2) for dense targets such as a carton stack, they do not purposefully augment the indistinguishable target boundaries, which still leads to low target detection precision.
To overcome the disadvantages of traditional deep learning methods in carton detection, this paper proposes a data augmentation method that takes into account the interference of both multiple scenarios and indistinguishable target boundaries. Firstly, since it is difficult for the presented training set to cover all the scenarios, an adaptive augmentation method for complementary scenarios is proposed, which transforms the background and perspective of the carton dataset to adapt to various practical scenarios. Then, aiming at the poor boundary discrimination of stacked cartons, a stochastic synthesis method of multiple boundary features is proposed to enhance the detection ability of deep learning methods on boundary features. Finally, a hyperparameter optimization method for the detection model based on a modified genetic algorithm (GA) is proposed to further improve the detection precision. Extensive experimental results on YOLO [15] demonstrate the effectiveness of the proposed method in improving the generalization ability of the carton detection model, and this method can better guide the intelligent cargo handling system to generate grab instructions.
The remainder of this paper is organized as follows. Section 2 reviews the previous work related to this paper. Section 3 details the proposed data augmentation method and the model hyperparameters optimization method. Experiments are presented in Section 4 and discussed in Section 5. Finally, the paper is concluded in Section 6.

2. Related Work

This paper is devoted to solving the problem of the generalization ability of a detection model. At present, there are two main approaches. (1) Deep learning networks are optimized to enhance the learning ability of the detection model. (2) Data augmentation strategies are proposed to realize the volume expansion of the limited data samples.

2.1. Deep Learning Models

According to the number of stages in the object recognition process, deep-learning-based methods for object recognition fall into two categories: two-stage and one-stage [16]. The two-stage series was proposed first, with representative methods such as R-CNN [17] and Faster R-CNN. Subsequently, to improve the efficiency of object recognition, the one-stage series was proposed, represented by YOLO [18].
For the two-stage series, the Faster R-CNN model was proposed on the basis of the R-CNN model with a precision of 45–79%, in which candidate regions are first determined and target detection is then performed on them to enhance the pertinence of detection [19]. Subsequently, a target feature extraction and detection model was proposed based on Mask R-CNN, which improved the precision by 2.64% compared with Faster R-CNN [20]. To overcome the problem of insufficient training sets, a Global Mask R-CNN detection algorithm for small training sets was also presented, which precisely composes the target feature region and preserves the target semantic information in the deep learning backbone, reaching a precision of 66.45% [21]. For the one-stage series, the YOLO family has been progressively refined in network structure, e.g., YOLO9000 [22], YOLOv3 [15] and YOLOv5 [23]. In a YOLOv3-based ship detection case, the detection precision reached 55.3% [24]. By combining CenterNet and YOLOv3 and introducing the spatial shuffle-group enhance (SSE) attention module, more advanced semantic features were integrated, detection omissions were avoided and the precision was further improved to 90.6% [4]. On this basis, an extra detection head was added to the YOLOv5 model to improve multi-scale and small-target detection, yielding an 11.6% gain [23]. In view of the better performance of the YOLO series, this paper uses YOLOv5 as the baseline to demonstrate the effectiveness of the proposed method.

2.2. Data Augmentations

There are two different approaches to data augmentation: transforming a single image and synthesizing multiple images. For the transformation of a single image, augmentation strategies such as color jittering [25], auto or rand augment [26,27], motion blur [8], perspective transformation [9,28], stochastic cropping [11,25,29] and rotation [10] can effectively improve the learning ability on the training set. However, these methods contribute little to the generalization ability of the detection model, because the training set is transformed randomly rather than according to the scenarios that can actually occur. In the aspect of multi-image synthesis, cut-and-paste methods [12,30] are adopted. However, in synthetic images, the contextual semantic relationship between the target and the background is too stiff to effectively improve the precision of the detection model. Several studies [12,14,31] attempt to improve the detection precision by making the detector insensitive to the subtle pixel artifacts in synthesized images, but such artifacts are unavoidable [32].

2.3. Discussion

For the optimized deep learning networks, the precision on the training sets is indeed greatly improved; however, the improvement on the prediction sets is not particularly evident, especially in the carton stack detection process, where a large number of detection omissions or errors still occur. In comparison, data augmentation methods can effectively expand the training sets and improve the generalization ability of the detection models on the prediction sets. However, the interference of actual scenarios is not taken into account in the existing data augmentation methods, which limits the precision of target detection in actual scenarios. Therefore, a data augmentation method allowing for multiple scenarios and indistinguishable target boundaries is proposed in this paper.

3. Methodology

The goal of the present study is three-fold. First, this study seeks to investigate the distribution law of imaging parameters in multiple scenarios and to construct matrices of imaging parameters in complementary scenarios for each specific scenario, thus enabling adaptive augmentation of complementary scenarios. Second, for problems where dense boundaries are indistinguishable, this study attempts to propose a stochastic synthesis method for multi-boundary features to enable boundary enhancement during training. Third, the correlation between model hyperparameters and model fitness is explored to improve the crossover probability function in GA, and the optimization of the model hyperparameters is achieved by the modified GA. The complete process of our method is shown in Figure 2.

3.1. Adaptive Augmentation for Complementary Scenarios

The size of the training set and its coverage of the various scenarios determine to some extent the precision and generalization ability of the carton detection model. Since it is time-consuming and impossible to artificially collect carton samples in all the scenarios, this paper proposes an adaptive augmentation method for complementary scenarios based on carton samples in limited scenarios, which significantly reduces sample collection and labeling efforts. This approach involves three steps. (1) Calculation of imaging parameters: The imaging parameters in multiple scenarios, such as lightness, saturation and contrast, are calculated according to a large number of easily collected images in daytime, night, fog, etc. (2) Adaptive augmentation: New images are derived by converting the imaging parameters of each carton sample into the imaging parameters calculated above. (3) Perspective augmentation: Perspective augmentation is also applied to take into account the differences in the perspective of the cartons during the actual image acquisition. The architecture of this approach is shown in Figure 3.
The adaptive augmentation approach for complementary scenarios is detailed as follows.
First, the imaging parameters in multiple scenarios need to be calculated. Images from multiple scenarios are collected stochastically for imaging parameter calculation. For illustrative purposes, the scenarios are roughly classified as “day”, “night” and “fog”, and the imaging parameters of lightness, saturation and contrast are taken into account in this paper. Lightness L, saturation S and contrast C can be calculated through Equation (1).
$L = \frac{1}{2}(MAX + MIN), \qquad S = \frac{MAX - MIN}{1 - |2L - 1|}, \qquad C = \sum_{\delta} \delta(u, v)^{2}\, P_{r\delta}(u, v)$
where MAX and MIN are the maximum and minimum values of (R̄, Ḡ, B̄), the average values of the red (R), green (G) and blue (B) channels of an image, respectively; (u, v) denotes the horizontal and vertical coordinates of a given pixel; δ(u, v) is the gray-level difference between adjacent pixels at (u, v); and P_rδ(u, v) is the distribution probability of the pixels whose gray-level difference is δ.
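As a concrete illustration, the short sketch below computes L, S and C for one image with NumPy. The paper does not fix the pixel value range or the neighbourhood used for the grey-level differences, so the float RGB input in [0, 1] and the 4-neighbour differences are our assumptions.
```python
import numpy as np

def imaging_parameters(rgb: np.ndarray) -> tuple[float, float, float]:
    """Lightness L, saturation S and contrast C of an H x W x 3 float RGB image
    in [0, 1], following Equation (1)."""
    channel_means = rgb.reshape(-1, 3).mean(axis=0)          # (R_bar, G_bar, B_bar)
    mx, mn = channel_means.max(), channel_means.min()
    lightness = 0.5 * (mx + mn)
    saturation = (mx - mn) / (1.0 - abs(2.0 * lightness - 1.0) + 1e-12)
    # Contrast: squared grey-level differences between 4-connected neighbours,
    # weighted by the empirical probability of each difference value.
    grey = rgb.mean(axis=2)
    diffs = np.concatenate([np.abs(np.diff(grey, axis=0)).ravel(),
                            np.abs(np.diff(grey, axis=1)).ravel()])
    values, counts = np.unique(diffs, return_counts=True)
    contrast = float(np.sum(values ** 2 * counts / counts.sum()))
    return float(lightness), float(saturation), float(contrast)
```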
The average value of the imaging parameters of the images from each scenario will be taken to represent the imaging parameters of the scenario, which can be expressed as:
P ¯ s c = ( L ¯ s c , S ¯ s c , C ¯ s c )
where P ¯ s c is the imaging parameter representing the s c scenario ( s c for day, night and fog), and L ¯ s c , S ¯ s c , C ¯ s c , respectively, stand for the lightness, saturation and contrast in P ¯ s c .
Then, for the ith image in the training set, the imaging parameter P i can also be calculated through Equation (1), which is denoted as:
P_i = (L_i, S_i, C_i), i = 1, 2, …
Proceeding to the next step, the new image will be generated by converting P i to P ¯ s c . L i and C i are converted by Equation (4).
f i _ s c ( u , v ) = α f i ( u , v ) + β
where f i ( u , v ) and f i _ s c ( u , v ) , respectively, represent ( R , G , B ) on ( u , v ) of the ith image in the training set and its derived image. α is the contrast coefficient and β is the lightness gain coefficient.
After that, the saturation will be adjusted towards S̄_sc by Equation (5).
$s'_{i\_sc}(u, v) = (1 + \gamma)\, s_{i\_sc}(u, v)$
where $s_{i\_sc}(u, v) = \frac{\max(f) - \min(f)}{1 - |\max(f) + \min(f) - 1|}$, with f short for $f_{i\_sc}(u, v)$; $s_{i\_sc}(u, v)$ is the saturation of the newly derived image, $s'_{i\_sc}(u, v)$ is its adjusted saturation and γ is the saturation adjustment coefficient.
Thus, the ith image in the training set can be converted to a new image with the imaging parameters of P ¯ s c through a set of appropriate coefficients of α , β and γ .
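A minimal sketch of this conversion, assuming float RGB images in [0, 1], is given below. The paper does not state how the saturation change of Equation (5) is realized on the pixels, so rescaling each pixel's chroma around its HSL mid-point (which scales the HSL saturation by exactly 1 + γ) is our interpretation.
```python
import numpy as np

def convert_image(rgb: np.ndarray, alpha: float, beta: float, gamma: float) -> np.ndarray:
    """Derive a new image with adjusted contrast/lightness (Equation (4)) and
    saturation (Equation (5)). Input and output are float RGB images in [0, 1]."""
    f = np.clip(alpha * rgb + beta, 0.0, 1.0)              # Equation (4)
    mid = 0.5 * (f.max(axis=2) + f.min(axis=2))            # per-pixel HSL mid-point
    scale = max(1.0 + gamma, 0.0)                          # s' = (1 + gamma) * s, Equation (5)
    # Scaling the chroma (f - mid) by `scale` scales the HSL saturation by the same factor.
    return np.clip(mid[..., None] + (f - mid[..., None]) * scale, 0.0, 1.0)
```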
Finally, considering the effect of imaging perspective, perspective augmentation is implemented by translating, rotating and shearing an image according to Equations (6)–(8).
$(u', v')^{T} = (u, v)^{T} + (u_t, v_t)^{T}$
$(u', v')^{T} = \begin{bmatrix} \cos\theta_r & -\sin\theta_r \\ \sin\theta_r & \cos\theta_r \end{bmatrix} (u, v)^{T}$
$(u', v')^{T} = \begin{bmatrix} \cos\phi_u & \sin\phi_u \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ \sin\phi_v & \cos\phi_v \end{bmatrix} (u, v)^{T}$
where (u′, v′) represents the pixel coordinates obtained by transforming the original (u, v), u_t and v_t are the translations of (u, v) along the horizontal and vertical axes, respectively, θ_r is the rotation angle and ϕ_u and ϕ_v represent the shear angles along the horizontal and vertical axes.
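In practice the three transforms can be stacked into one affine warp; a sketch using OpenCV is shown below. The tan-based shear matrix is a common substitute for the cos/sin parameterization of Equation (8) and should be read as an assumption, and the carton bounding boxes must be transformed with the same matrix (omitted here).
```python
import numpy as np
import cv2

def perspective_augment(image: np.ndarray, u_t: float, v_t: float,
                        theta_r: float, phi_u: float, phi_v: float) -> np.ndarray:
    """Translate (Eq. (6)), rotate (Eq. (7)) and shear (Eq. (8)) an image in one warp.
    Angles are in radians."""
    h, w = image.shape[:2]
    T = np.array([[1.0, 0.0, u_t], [0.0, 1.0, v_t], [0.0, 0.0, 1.0]])   # translation
    R = np.array([[np.cos(theta_r), -np.sin(theta_r), 0.0],
                  [np.sin(theta_r),  np.cos(theta_r), 0.0],
                  [0.0, 0.0, 1.0]])                                      # rotation
    S = np.array([[1.0, np.tan(phi_u), 0.0],
                  [np.tan(phi_v), 1.0, 0.0],
                  [0.0, 0.0, 1.0]])                                      # shear (assumed form)
    M = (T @ R @ S)[:2, :]                                               # composed 2 x 3 affine
    return cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
```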
The algorithm flow is shown in Algorithm 1.
Algorithm 1 Adaptive Complementary Augmentation Algorithm
Input: image sets of multiple scenarios Im_sc, sc = (day, night, fog); original training set Im; allowable deviation of imaging parameters ϵ
Output: augmented training set Im_aug
 1: Initialize: allowable error ϵ; α = 1, β = 0, γ = 0; α_r = [α_rl, α_ru], β_r = [β_rl, β_ru], γ_r = [γ_rl, γ_ru] are the search ranges of α, β, γ
 2: # Imaging parameters of scenarios:
 3: for sc in (day, night, fog) do
 4:     for image in Im_sc do
 5:         Calculate L, S, C for each image by Equation (1)
 6:     Calculate P̄_sc = (L̄_sc, S̄_sc, C̄_sc) by Equation (2)
 7: # Appropriate coefficients of α, β, γ:
 8: for the ith image in Im do
 9:     Calculate P_i = (L_i, S_i, C_i) by Equation (3)
10:     for sc in (day, night, fog) do
11:         err = P̄_sc − P_i
12:         while |err| > ϵ · P̄_sc do
13:             (bo_L, bo_S, bo_C) = BOOL(err > 0)
14:             α = α + α_r[bo_C]/2
15:             β = β + β_r[bo_L]/2
16:             γ = γ + γ_r[bo_S]/2
17:             Generate a new image by Equations (4) and (5)
18:             Calculate P_i by Equation (3)
19:             err = P̄_sc − P_i
20:         Save the new image in Im_aug
21: # Perspective augmentation:
22: for image in Im_aug do
23:     Random generation of (u_t, v_t), θ_r, ϕ_u, ϕ_v
24:     Augment by Equations (6)–(8) and save to Im_aug
25: Return Im_aug
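The coefficient search in steps 8–20 can be sketched as below, reusing the imaging_parameters() and convert_image() helpers from the previous sketches. Algorithm 1 only specifies nudging each coefficient towards the upper or lower bound of its range according to the sign of the error; shrinking the ranges between iterations so that the search settles, and the default range values, are our additions.
```python
import numpy as np

def fit_coefficients(rgb, p_target, eps=0.05, max_iter=20,
                     a_range=(-0.5, 0.5), b_range=(-0.3, 0.3), g_range=(-0.5, 0.5)):
    """Search alpha, beta, gamma so that the derived image's (L, S, C) matches the
    scenario parameters p_target within a relative tolerance eps."""
    alpha, beta, gamma = 1.0, 0.0, 0.0
    p_target = np.asarray(p_target, dtype=np.float64)              # (L_sc, S_sc, C_sc)
    new_img = convert_image(rgb, alpha, beta, gamma)
    for _ in range(max_iter):
        err = p_target - np.asarray(imaging_parameters(new_img))   # (dL, dS, dC)
        if np.all(np.abs(err) <= eps * np.abs(p_target)):
            break
        # err > 0: the parameter is still too low, so step towards the upper bound.
        beta  += (b_range[1] if err[0] > 0 else b_range[0]) / 2.0   # lightness  -> beta
        gamma += (g_range[1] if err[1] > 0 else g_range[0]) / 2.0   # saturation -> gamma
        alpha += (a_range[1] if err[2] > 0 else a_range[0]) / 2.0   # contrast   -> alpha
        # Shrink the adjustment ranges so the loop converges (our addition).
        a_range = tuple(x / 2.0 for x in a_range)
        b_range = tuple(x / 2.0 for x in b_range)
        g_range = tuple(x / 2.0 for x in g_range)
        new_img = convert_image(rgb, alpha, beta, gamma)
    return alpha, beta, gamma, new_img
```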

3.2. Stochastic Synthesis of Multi-Boundary Features

After the adaptive augmentation for complementary scenarios of the training set, the enhancement of boundary features is also considered to improve the recognition precision for dense targets. A stochastic synthesis approach of multi-boundary features is proposed in this paper, which can improve the weight of boundary features without greatly expanding the training set.
The flow of the proposed approach is depicted in Figure 4. First, four images are selected stochastically from the training set to serve as metadata for a synthesized image. The targets in each image are then selected stochastically and cropped. To facilitate synthesis, cropped slices are resized to the size of the synthesized image. Meanwhile, a random center is generated to determine the configuration of the synthesized image. Then, a corner is chosen stochastically from the top left, top right, bottom left and bottom right in each resized cropped slice. Finally, the synthesized image is formed by image mosaics.
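A minimal sketch of this synthesis step is shown below. The 640-pixel output size and the fixed quadrant-to-corner mapping are simplifying assumptions (the paper picks the corner of each resized slice stochastically), and the remapping of the carton bounding boxes into the mosaic is omitted for brevity.
```python
import random
import numpy as np
import cv2

def synthesize_mosaic(images, boxes_per_image, out_size=640):
    """Stochastic multi-boundary synthesis: crop one carton from each of four images,
    resize it to the canvas size and mosaic one corner of each crop around a random centre.
    images: four RGB arrays; boxes_per_image: four non-empty lists of (x1, y1, x2, y2)."""
    assert len(images) == 4 and len(boxes_per_image) == 4
    cx = random.randint(out_size // 4, 3 * out_size // 4)       # random mosaic centre
    cy = random.randint(out_size // 4, 3 * out_size // 4)
    canvas = np.zeros((out_size, out_size, 3), dtype=images[0].dtype)
    # (canvas quadrant) paired with (corner taken from the resized crop)
    regions = [((0, 0, cx, cy),               lambda s: s[:cy, :cx]),                        # top-left
               ((cx, 0, out_size, cy),        lambda s: s[:cy, cx - out_size:]),             # top-right
               ((0, cy, cx, out_size),        lambda s: s[cy - out_size:, :cx]),             # bottom-left
               ((cx, cy, out_size, out_size), lambda s: s[cy - out_size:, cx - out_size:])]  # bottom-right
    for img, boxes, ((x1, y1, x2, y2), take) in zip(images, boxes_per_image, regions):
        bx1, by1, bx2, by2 = random.choice(boxes)                # pick one carton stochastically
        crop = img[by1:by2, bx1:bx2]
        resized = cv2.resize(crop, (out_size, out_size))         # resize the crop to the canvas size
        canvas[y1:y2, x1:x2] = take(resized)
    return canvas
```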

3.3. Hyperparameters Optimization Based on Modified GA

Based on adaptive complementary augmentation and boundary augmentation, the influence of model hyperparameters on the detection precision is also considered in this paper; thus, the GA is introduced to optimize the hyperparameters. However, in the existing GA, the stochastic crossover principle is employed in the gene crossover process with relatively low efficiency. As a result, a crossover probability function is developed to perform the optimal crossover of the genes and hence improve the optimization efficiency of the model hyperparameters. The hyperparameters optimization process is shown in Figure 5.
For illustrative purposes, a population of model hyperparameters is generated stochastically within the hyperparameter ranges as follows.
Par = [Par_1, …, Par_p, …, Par_P]^T
where Par_p = [Par_p1, …, Par_pq, …, Par_pQ] represents the pth set of hyperparameters in the hyperparameter population Par, p = (1, 2, …, P), where P is the number of sets of hyperparameters, and q = (1, 2, …, Q), where Q is the number of components in a set of hyperparameters; thus, Par_pq represents the qth component of the pth set of hyperparameters.
To evaluate the model performance, four typical evaluation metrics are employed: (1) the precision Pr; (2) the recall Re; (3) the average precision AP for a specific value of the intersection over union (IoU) threshold used to determine true positives (TPs) and false positives (FPs); and (4) AP̄, which averages AP across IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
Then, a metric weight is set and a fitness function is established to simplify the evaluation process of the model performance, as shown in Equation (10).
f n p = ω × [ P r p , R e p , A P p , A P ¯ p ] T
where ω is the metric weight and f n p is the fitness of the model based on Par p . Thus, the fitness vector fn of the hyperparameter population Par can be expressed as:
fn = (fn_1, …, fn_p, …, fn_P)^T
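For instance, the fitness of every hyperparameter set can be computed in one step from its validation metrics; the weight values in ω below are hypothetical, since the paper does not report them.
```python
import numpy as np

OMEGA = np.array([0.1, 0.1, 0.4, 0.4])      # hypothetical weights for [Pr, Re, AP, AP_bar]

def fitness(metrics: np.ndarray) -> np.ndarray:
    """Equations (10) and (11): fn_p = omega . [Pr_p, Re_p, AP_p, AP_bar_p]^T for every
    hyperparameter set p; `metrics` has shape (P, 4), one row per trained model."""
    return metrics @ OMEGA                   # fitness vector fn, shape (P,)

# Example with P = 3 candidate hyperparameter sets (rows taken from Table 2 for illustration):
metrics = np.array([[0.715, 0.657, 0.701, 0.430],
                    [0.741, 0.719, 0.793, 0.493],
                    [0.775, 0.773, 0.828, 0.521]])
print(fitness(metrics))                      # one scalar fitness per set
```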
For a couple of selected hyperparameters from Par , component crossover will be performed to obtain a new set of hyperparameters. However, to achieve optimal crossover of hyperparameter components, the correlation of the model fitness with each component in the hyperparameters should be determined first, in which the statistical distributions of fn and Par with respect to their respective medians are employed. Thus, the correlation function is described as Equation (12).
$c_q = (\mathbf{Par}_q - \check{\mathbf{Par}}_q)^{T} \times (\mathbf{fn} - \check{\mathbf{fn}})$
where Par q consists of the qth components in each Par p , ( · ) ˇ represents the median value of ( · ) and c q is the correlation of the model fitness with the qth component, of which the positivity and negativity indicate the positive and negative correlations, respectively, and the absolute value reflects the correlation degree. Thus, the correlation vector c can be further expressed as:
c = (c_1, …, c_q, …, c_Q)
Furthermore, for a couple of hyperparameters, such as Par j and Par k , j , k ( 1 , 2 , , P ) , the crossover probability function is established as:
$\mathbf{Pc}_{jk} = \mathrm{sgn}\big((fn_j - fn_k) \times (\mathbf{Par}_j - \mathbf{Par}_k)\big) \odot \mathbf{c}$
where Pc_jk is the crossover probability vector of each component in Par_j and Par_k, sgn(·) represents the signum function, which is equal to +1 or −1 when (·) > 0 or (·) < 0, respectively, and ⊙ represents the element-wise multiplication of two vectors. Thus, new sets of hyperparameters can be obtained by the crossover of Par_j and Par_k according to Pc_jk. Finally, the optimal set of hyperparameters can be efficiently found by the modified GA based on the introduced crossover probabilities. The algorithm flow is shown in Algorithm 2.
Algorithm 2 Hyperparameters Optimization Algorithm
Input: hyperparameters population Par
Output: the optimal set of hyperparameters Par_op
 1: Initialize: allowable error ϵ, the metric weight ω
 2: # Fitness calculation:
 3: Calculate the metrics [Pr, Re, AP, AP̄]
 4: Calculate the fitness of each Par_p in Par by Equation (10), and form the fitness vector fn by Equation (11)
 5: # Selection, crossover and mutation:
 6: while max(fn) − min(fn) > ϵ do
 7:     Select: Par_j, Par_k in Par
 8:     for Par_q in Par do
 9:         Calculate c_q by Equation (12)
10:     Calculate Pc_jk by Equation (14)
11:     Crossover: Par_j, Par_k → Par_jn, Par_kn
12:     Mutation: stochastic, with low probability
13:     Calculate fn based on Par_jn and Par_kn
14:     if fn_jn or fn_kn > min(fn) then
15:         Remove min(fn) and Par_min
16:         Add fn_jn or fn_kn, and Par_jn or Par_kn
17: Return Par_op in Par with the maximum fitness
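A rough sketch of the guided crossover at the core of the modified GA (Equations (12) and (14), steps 7–11 of Algorithm 2) is given below. Mapping Pc to an actual component-selection probability with a sigmoid is our assumption; the paper only fixes the sign structure of the crossover preference.
```python
import numpy as np

def correlation_vector(par: np.ndarray, fn: np.ndarray) -> np.ndarray:
    """Equation (12): correlation of the fitness with each hyperparameter component,
    using deviations from the population medians. par: (P, Q), fn: (P,); returns (Q,)."""
    return (par - np.median(par, axis=0)).T @ (fn - np.median(fn))

def guided_crossover(par_j, par_k, fn_j, fn_k, c, rng=None):
    """Equation (14) followed by crossover: a component is preferentially inherited
    from parent j when its crossover preference Pc is positive."""
    rng = np.random.default_rng() if rng is None else rng
    pc = np.sign((fn_j - fn_k) * (par_j - par_k)) * c          # Equation (14)
    prob_from_j = 1.0 / (1.0 + np.exp(-pc))                    # squash to (0, 1), assumption
    take_from_j = rng.random(par_j.shape) < prob_from_j
    return np.where(take_from_j, par_j, par_k)                 # child hyperparameter set

# Toy usage with P = 4 sets of Q = 3 hyperparameters (e.g. learning rate, momentum, mosaic prob.):
par = np.array([[0.010, 0.90, 0.5], [0.020, 0.95, 0.7], [0.005, 0.85, 0.3], [0.015, 0.92, 0.9]])
fn = np.array([0.70, 0.74, 0.66, 0.72])
child = guided_crossover(par[1], par[3], fn[1], fn[3], correlation_vector(par, fn))
```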

4. Experiments

This section mainly explores the application of the adaptive complementary augmentation and stochastic synthesis approaches to carton training set expansion, as well as the role of the hyperparameters optimization method in improving the generalization ability of the trained models. The effectiveness of our approaches is evaluated on YOLOv5; the experiments are implemented in PyTorch (Python 3.10) and run on an NVIDIA RTX 3090 GPU.

4.1. Experimental Settings

Multiple scenarios dataset Since the imaging parameters of images in various scenarios are necessary for the adaptive complementary augmentation method, 200 images of ports or waters were collected for each scenario. Some samples are shown in Figure 6. Thus, the imaging parameters of each scenario can be calculated by Equations (1) and (2).
Carton dataset The carton dataset in this paper is drawn from the stacked carton dataset (SCD) [33]. As directly applying the proposed method to the SCD would be too time-consuming due to its large scale, a portion of the samples is drawn from the SCD to form our carton dataset. The distribution of our carton dataset is given in Table 1. Due to the different difficulties of image collection under the various scenarios, the images in the carton dataset are mainly collected under the “day” scenario, accounting for 81.7%, while the images collected under the “night” and “fog” scenarios only account for 8.2% and 10.1%, respectively, which greatly reduces the generalization ability of the trained model. Moreover, Figure 7 shows that cartons of different sizes are densely stacked and suffer from poor boundary discrimination, which severely affects the detection precision. Therefore, during the experiments, the carton dataset was split into a training set of 850 images and a testing set of 150 images, and the training set was augmented using the methods described in Section 3.1 and Section 3.2.
Evaluation metrics As in Section 3.3, four typical evaluation metrics are employed: the precision Pr, the recall Re, the average precision AP when the IoU threshold is equal to 0.5 (denoted as AP@0.5) and AP̄, which averages AP across IoU thresholds from 0.5 to 0.95 with a step size of 0.05.
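As an illustration of the last metric, AP̄ is simply the mean of AP over the ten IoU thresholds; ap_at_iou below is a hypothetical callable (e.g., a wrapper around a COCO-style evaluator).
```python
import numpy as np

def ap_bar(ap_at_iou) -> float:
    """Average AP over the IoU thresholds 0.50, 0.55, ..., 0.95 (step 0.05)."""
    thresholds = np.arange(0.50, 1.00, 0.05)
    return float(np.mean([ap_at_iou(round(float(t), 2)) for t in thresholds]))
```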

4.2. Adaptive Complementary Augmentation

Before performing augmentation for the training set, the imaging parameters were first calculated based on the multiple-scenarios dataset. The distribution of imaging parameters of each image in multiple scenarios is shown in Figure 8.
Figure 8 shows that the imaging parameters are obviously differentiated for different scenarios. Therefore, the average value of the imaging parameters in each scenario was taken to represent the imaging parameters in this scenario as follows.
P̄_day = (0.562, 409.447, 39.678)
P̄_night = (0.274, 254.860, 46.044)
P̄_fog = (0.536, 195.848, 16.624)
Then, the imaging parameters of each image in the training set were calculated, based on which the images were classified into their corresponding scenarios. Then, following the adaptive complementary augmentation approach in Section 3.1, the imaging parameters of each image in the original training set were adjusted to those representing other scenarios. Further, two perspective augmentation methods were randomly selected from the translation, rotation and shear with two random conversion amplitudes. In this way, new images were generated as shown in Figure 9 and the training set was augmented.
Finally, the precision of the trained models based on the original training set and the augmented training set is compared in Figure 10. It can be seen that the adaptive complementary augmentation approach effectively improves the model average precision AP@0.5 by 8.99%, from the original 0.701 to 0.764.

4.3. Stochastic Synthesis

When using a model trained on a dataset without the synthesized images for detection, multiple cartons are easily identified as one due to the poor boundary discrimination of dense cartons, as shown in Figure 11a,c.
Therefore, the stochastic synthesis method in Section 3.2 was employed, and some of the stochastic-synthesized images are shown in Figure 12. The synthesized images enhanced the detection capability of the newly trained model on dense cartons, as shown in Figure 11b,d. It can be seen that, after the introduction of stochastic synthesis, cartons with indistinguishable boundaries can be detected separately, and previously undetectable ones can also be detected. At the same time, the detection box of each carton is more accurate due to the enhanced boundary features. Thus, the model average precision A P @ 0.5 is further improved by 3.80%, from 0.764 to 0.793.

4.4. Hyperparameters Optimization

Since model hyperparameters have an important impact on the precision of the trained model, it is necessary to perform a hyperparameters optimization process. However, due to the large expansion of the training set by the augmentation approaches proposed in this paper, even a single training procedure takes a long time. The hyperparameters optimization process based on the conventional GA can be time-consuming and requires a large number of iterations. Therefore, the modified GA in Section 3.3 is used to reduce the number of training iterations and significantly shorten the hyperparameters optimization time.
With FN = max(fn), where fn is given by Equation (11), as the simplified evaluation of the trained model, Figure 13 shows the variation trend of fn during hyperparameters optimization when the conventional and modified GAs are adopted. The hyperparameters optimization process based on the modified GA requires fewer iterations, resulting in an 8.9% reduction in time consumption.

4.5. Analysis of Carton Detection Precision

To illustrate the effectiveness of the proposed approach, an intelligent cargo handling system has been designed as described in Figure 14. The evaluation metrics for the trained models of the proposed approach have been calculated using the images collected during the actual cargo handling process, and the comparison results among the alternative approaches are presented in Figure 15 and Table 2.
It can be seen that, with the introduction of the approach proposed in this paper, the precision, recall and other metrics of the trained model are greatly improved, and the average precision is increased by 18.1% from the initial 0.701 to 0.828, providing a good guarantee for the carton detection in the cargo handling process.

5. Discussion

The proposed method enables the automatic augmentation and balancing of images collected in various scenarios. Figure 7 illustrates the obtained images in different scenarios. As can be seen from Figure 7, the parameters of the images, such as brightness, saturation and contrast, vary considerably across scenarios, as demonstrated in Figure 8. Through clustering analysis, the imaging parameters of each scenario are represented by the mean values of parameters such as brightness, saturation and contrast, which are used to guide the augmentation of the collected images in each scenario, thus increasing the scale of the original dataset from 1000 to 3000 images, as shown in Figure 9. Figure 10 shows that the precision of the model trained on the augmented dataset is significantly improved compared to the baseline. However, the detection precision on dense boundaries still needs to be improved, as shown in Figure 11a,c. Thus, the stochastic boundary feature synthesis strategy is adopted to further enlarge the dataset from 3000 to 4000 images, which significantly improves the detection ability of the trained model, as shown in Figure 11b,d. To address the decrease in training efficiency caused by the enlarged dataset, Figure 13 demonstrates the effect of the modified GA: compared with the traditional GA, the number of training iterations is reduced from 269 to 246, a saving of nearly 8.5%. Finally, Figure 15 compares the detection precision of the baseline, cut-and-paste, rand augment and the proposed method in the actual cargo handling process. As shown in Figure 15, the proposed method performs significantly better than the other methods and the baseline, achieving a precision of 0.828, an improvement of between 4.4% and 18.1%.
In summary, we believe that our study contributes significantly to the recognition of dense objects in complex environments due to the simultaneous consideration of the complexity of the scenario, the poor boundary discrimination of the objects and the optimization of the model hyperparameters. The proposed adaptive augmentation method can balance the dataset, making the performance of the trained model better and more stable in each scenario. Meanwhile, the proposed stochastic synthesis method can overcome the effect of dense boundaries and improve the recognition precision. Moreover, with the proposed hyperparameter optimization method, the effect of the augmented dataset on the training speed is eliminated and the training efficiency is improved.
However, the proposed method still suffers from some shortcomings. In the actual cargo handling process, it is found that the proposed method has a significant effect on the detection precision for images collected in “night” and “fog” scenarios, but it is almost ineffective for images collected in the “day” scenario. The reason is that the images from the “day” scenario make up the majority of the original dataset; however, high quality datasets should be balanced. The method in this paper focuses on the balanced augmentation of datasets and is therefore beneficial for scenarios other than “day”. From a generalization point of view, for round-the-clock target detection efforts, there will be an inevitable imbalance in the dataset. Therefore, the method in this paper still has an important role and significance.

6. Conclusions

Carton detection is crucial for unattended intelligent cargo handling to achieve efficient port operations and reduce the virus transmission rate. However, cargo handling scenarios are diverse, and the carton stacks are characterized by high densities with indistinguishable boundaries. Therefore, this paper proposes a novel data augmentation approach to achieve a high detection precision, which takes into account the interferences of multiple scenarios and indistinguishable target boundaries. First, the distribution law of the imaging parameters in multiple scenarios is investigated, and the imaging parameters of each image in the training set are adjusted to those of the complementary scenario of that image, thus enabling adaptive augmentation of complementary scenarios. Then, the images in the training set are stochastically selected, cropped and synthesized to enhance the carton boundary features. Finally, the hyperparameters are also optimized through a modified GA to further improve the precision of the trained model. With the proposed approach, the trained model achieves a large improvement in average precision from 0.701 to 0.828 in the actual cargo detection process. Comparisons with other data augmentation methods are also performed to demonstrate the better performance of the proposed approach.

Author Contributions

Conceptualization, B.L.; methodology, B.L.; software, B.L.; validation, B.L.; resources, X.W. (Xin Wang) and W.Z.; data curation, X.W. (Xin Wang) and W.Z.; writing—original draft, B.L.; writing—review and editing, B.L. and X.W. (Xiaobang Wang); funding acquisition, B.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was funded by the Fundamental Research Funds for the Central Universities (No. 3132022113) and the National Natural Science Foundation of China (No. 52301411).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to thank the SCD developers for their support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Arpenti, P.; Caccavale, R.; Paduano, G.; Andrea Fontanelli, G.; Lippiello, V.; Villani, L.; Siciliano, B. RGB-D recognition and localization of cases for robotic depalletizing in supermarkets. IEEE Robot. Autom. Lett. 2020, 5, 6233–6238. [Google Scholar] [CrossRef]
  2. Chiaravalli, D.; Palli, G.; Monica, R.; Aleotti, J.; Rizzini, D.L. Integration of a multi-camera vision system and admittance control for robotic industrial depalletizing. In Proceedings of the 2020 25th IEEE International Conference on Emerging Technologies and Factory Automation (ETFA), Vienna, Austria, 8–11 September 2020; pp. 667–674. [Google Scholar] [CrossRef]
  3. Passos, W.L.; Barreto, C.d.S.; Araujo, G.M.; Haque, U.; Netto, S.L.; da Silva, E.A.B. Toward improved surveillance of Aedes aegypti breeding grounds through artificially augmented data. Eng. Appl. Artif. Intell. 2023, 123, 106488. [Google Scholar] [CrossRef]
  4. Cui, Z.; Wang, X.; Liu, N.; Cao, Z.; Yang, J. Ship detection in large-scale SAR images Via spatial shuffle-group enhance attention. IEEE Trans. Geosci. Remote Sens. 2021, 59, 379–391. [Google Scholar] [CrossRef]
  5. Mushtaq, F.; Ramesh, K.; Deshmukh, S.; Ray, T.; Parimi, C.; Tandon, P.; Jha, P.K. Nuts&bolts: YOLO-v5 and image processing based component identification system. Eng. Appl. Artif. Intell. 2023, 118, 105665. [Google Scholar] [CrossRef]
  6. Elad, M.; Milanfar, P. Style transfer via texture synthesis. IEEE Trans. Image Process. 2017, 26, 2338–2351. [Google Scholar] [CrossRef] [PubMed]
  7. Zhu, J.Y.; Park, T.; Isola, P.; Efros, A.A. Unpaired image-to-image translation using cycle-consistent adversarial networks. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 2242–2251. [Google Scholar] [CrossRef]
  8. Wang, Y.; Wei, X.; Tang, X.; Shen, H.; Ding, L. CNN tracking based on data augmentation. Knowl.-Based Syst. 2020, 194, 105594. [Google Scholar] [CrossRef]
  9. Huang, H.; Zhou, H.; Yang, X.; Zhang, L.; Qi, L.; Zang, A.Y. Faster R-CNN for marine organisms detection and recognition using data augmentation. Neurocomputing 2019, 337, 372–384. [Google Scholar] [CrossRef]
  10. Chen, T.; Wang, N.; Wang, R.; Zhao, H.; Zhang, G. One-stage CNN detector-based benthonic organisms detection with limited training dataset. Neural Netw. 2021, 144, 247–259. [Google Scholar] [CrossRef]
  11. Park, S.; Lee, S.-b.; Park, J. Data augmentation method for improving the accuracy of human pose estimation with cropped images. Pattern Recognit. Lett. 2020, 136, 244–250. [Google Scholar] [CrossRef]
  12. Dwibedi, D.; Misra, I.; Hebert, M. Cut, paste and learn: Surprisingly easy synthesis for instance detection. In Proceedings of the 2017 IEEE International Conference on Computer Vision (ICCV), Venice, Italy, 22–29 October 2017; pp. 1310–1319. [Google Scholar] [CrossRef]
  13. Liu, S.; Guo, H.; Hu, J.G.; Zhao, X.; Zhao, C.; Wang, T.; Zhu, Y.; Wang, J.; Tang, M. A novel data augmentation scheme for pedestrian detection with attribute preserving GAN. Neurocomputing 2020, 401, 123–132. [Google Scholar] [CrossRef]
  14. Tripathi, S.; Chandra, S.; Agrawal, A.; Tyagi, A.; Rehg, J.M.; Chari, V. Learning to generate synthetic data via compositing. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 461–470. [Google Scholar] [CrossRef]
  15. Redmon, J.; Farhadi, A. YOLOv3: An incremental improvement. arXiv 2018. [Google Scholar] [CrossRef]
  16. Zou, Z.; Chen, K.; Shi, Z.; Guo, Y.; Ye, J. Object Detection in 20 Years: A Survey. Proc. IEEE 2023, 111, 257–276. [Google Scholar] [CrossRef]
  17. Zhao, Z.Q.; Zheng, P.; Xu, S.T.; Wu, X. Object Detection With Deep Learning: A Review. IEEE Trans. Neural Netw. Learn. Syst. 2019, 30, 3212–3232. [Google Scholar] [CrossRef] [PubMed]
  18. Redmon, J.; Divvala, S.; Girshick, R.; Farhadi, A. You Only Look Once: Unified, Real-Time Object Detection. In Proceedings of the 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 27–30 June 2016; pp. 779–788. [Google Scholar] [CrossRef]
  19. Ren, S.; He, K.; Girshick, R.; Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 1137–1149. [Google Scholar] [CrossRef] [PubMed]
  20. Zhang, D.; Zhan, J.; Tan, L.; Gao, Y.; Zupan, R. Comparison of two deep learning methods for ship target recognition with optical remotely sensed data. Neural Comput. Appl. 2021, 33, 4639–4649. [Google Scholar] [CrossRef]
  21. Sun, Y.; Su, L.; Luo, Y.; Meng, H.; Li, W.; Zhang, Z.; Wang, P.; Zhang, W. Global mask R-CNN for marine ship instance segmentation. Neurocomputing 2022, 480, 257–270. [Google Scholar] [CrossRef]
  22. Redmon, J.; Farhadi, A. YOLO9000: Better, faster, stronger. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017; pp. 6517–6525. [Google Scholar] [CrossRef]
  23. Li, Y.; Bai, X.; Xia, C. An Improved YOLOV5 Based on Triplet Attention and Prediction Head Optimization for Marine Organism Detection on Underwater Mobile Platforms. J. Mar. Sci. Eng. 2022, 10, 1230. [Google Scholar] [CrossRef]
  24. Zheng, R.; Zhou, Q.; Wang, C. Inland river ship auxiliary collision avoidance system. In Proceedings of the 2019 18th International Symposium on Distributed Computing and Applications for Business Engineering and Science (DCABES), Wuhan, China, 8–10 November 2019; pp. 56–59. [Google Scholar] [CrossRef]
  25. Szegedy, C.; Liu, W.; Jia, Y.; Sermanet, P.; Reed, S.; Anguelov, D.; Erhan, D.; Vanhoucke, V.; Rabinovich, A. Going deeper with convolutions. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 1–9. [Google Scholar] [CrossRef]
  26. Cubuk, E.D.; Zoph, B.; Mané, D.; Vasudevan, V.; Le, Q.V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 15–20 June 2019; pp. 113–123. [Google Scholar] [CrossRef]
  27. Cubuk, E.D.; Zoph, B.; Shlens, J.; Le, Q.V. Randaugment: Practical automated data augmentation with a reduced search space. In Proceedings of the 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), Seattle, WA, USA, 14–19 June 2020; pp. 3008–3017. [Google Scholar] [CrossRef]
  28. Qais, M.H.; Hasanien, H.M.; Alghuwainem, S. Augmented grey wolf optimizer for grid-connected PMSG-based wind energy conversion systems. Appl. Soft. Comput. 2018, 69, 504–515. [Google Scholar] [CrossRef]
  29. Krizhevsky, A.; Sutskever, I.; Hinton, G.E. ImageNet classification with deep convolutional neural networks. Commun. ACM 2017, 60, 84–90. [Google Scholar] [CrossRef]
  30. Georgakis, G.; Mousavian, A.; Berg, A.C.; Kosecka, J. Synthesizing training data for object detection in indoor scenes. In Proceedings of the 13th Conference on Robotics—Science and Systems, Cambridge, MA, USA, 12–16 July 2017. [Google Scholar]
  31. Ghiasi, G.; Cui, Y.; Srinivas, A.; Qian, R.; Lin, T.Y.; Cubuk, E.D.; Le, Q.V.; Zoph, B. Simple copy-paste is a strong data augmentation method for instance segmentation. In Proceedings of the 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Nashville, TN, USA, 20–25 June 2021; pp. 2917–2927. [Google Scholar] [CrossRef]
  32. Gou, L.; Wu, S.; Yang, J.; Yu, H.; Lin, C.; Li, X.; Deng, C. Carton dataset synthesis method for loading-and-unloading carton detection based on deep learning. Int. J. Adv. Manuf. Technol. 2023, 124, 3049–3066. [Google Scholar] [CrossRef]
  33. Yang, J.; Wu, S.; Gou, L.; Yu, H.; Lin, C.; Wang, J.; Wang, P.; Li, M.; Li, X. SCD: A Stacked Carton Dataset for Detection and Segmentation. Sensors 2022, 22, 3617. [Google Scholar] [CrossRef] [PubMed]
Figure 1. Samples of bulk cargo.
Figure 2. The complete process of our method.
Figure 3. Architecture of the adaptive complementary augmentation approach.
Figure 4. Flow of the stochastic synthesis approach.
Figure 5. Hyperparameters optimization process based on the modified GA.
Figure 6. Samples of multiple-scenarios dataset.
Figure 7. Samples of carton dataset.
Figure 8. Distribution of imaging parameters.
Figure 9. Samples of augmented images in training set.
Figure 10. Precision–recall curve based on the original training set (Baseline) and the adaptive complementary augmented training set (ComAug).
Figure 11. False detection of carton stack.
Figure 12. Some of the stochastic-synthesized images.
Figure 13. Hyperparameters optimization process based on conventional and modified GA.
Figure 14. Intelligent cargo handling system.
Figure 15. Precision–recall curve of carton detection.
Table 1. Distribution of the carton dataset.

Scenario     Carton Dataset    Training Set    Testing Set
“day”        817 (81.7%)       694             123
“night”      82 (8.2%)         70              12
“fog”        101 (10.1%)       86              15
ALL          1000              850             150
Table 2. Metrics comparison of trained models.

Approaches              Pr       Re       AP       AP̄
baseline                0.715    0.657    0.701    0.430
cut-and-paste [28]      0.732    0.714    0.784    0.460
rand augment [25]       0.741    0.719    0.793    0.493
proposed approach       0.775    0.773    0.828    0.521