1. Introduction
In recent years, convolutional neural networks (CNNs) have been widely used in self-driving car computer vision, including obstacle detection, classification, segmentation, object tracking [1], and semantic analysis. To achieve better performance, designing deeper and more complex CNN models has become a general trend.
While deeper neural networks have enhanced learning ability, their increased parameter counts also bring certain drawbacks. Storage cost: the model must be trained on a high-performance GPU server. Hardware cost: the network must load its parameters into memory and build the computation graph during training, evaluation, and prediction. Computation cost: convolutional neural networks must perform convolution operations when processing data, and since parameters are usually stored as 32-bit floating-point values, the multiplications on this data consume substantial computing resources. As a result, both training and use of the model are very time-consuming.
Approaches to simplifying neural network structures mainly include tensor factorization [2], sparse connection [3], quantization [4], channel pruning [5], and so on. The tensor factorization method decomposes a convolutional layer into several more efficient layers; however, it does not reduce the number of channels, and the decomposition process introduces additional computational overhead. The sparse connection method compresses the model by invalidating connections between neurons or channels. Although it can theoretically achieve a large speedup, sparse convolutional layers place high requirements on hardware. Quantization is a relatively effective and widely used method, but when dealing with deep CNNs such as GoogLeNet, the accuracy drops significantly [6]. Compared with the previous methods, channel pruning directly reduces the number of convolutional kernels and feature maps, making the network narrower and its structure simpler. It has an obvious acceleration effect on both CPU and GPU because it requires neither additional computation nor special hardware. The idea of channel pruning is simple, but deleting channels in one layer directly changes the input of the next layer. In recent years, a training-based channel pruning approach [7] has been proposed that applies sparsity constraints to the weights during training so that hyperparameters can be determined adaptively. However, each training run consumes a lot of time and resources. Moreover, this method is rarely applied to CNNs trained on ImageNet-scale datasets, and experimental results are rarely reported.
There is some novel research on distributed computing that can be used to speed up CNN computation; related work can be found in [8,9]. This work instead accelerates computation through the compression of convolutional neural networks. Many researchers have conducted relevant research in this area, including weight pruning, quantization, and other methods. However, weight pruning introduces sparse matrix operations, and for quantization, solving the Hessian matrix is also a great difficulty. This paper chooses the filter pruning method to compress and thereby accelerate the deep neural network. A neural network compression method is proposed based on feature map clustering. The basic idea is that clustering reveals how many features the input images contain and, therefore, how many filters are enough to extract all of them. The main contributions are as follows. According to the correspondence between feature maps and filters, the filters are clustered via feature map clustering. K-Means and hierarchical clustering algorithms are used to cluster the feature maps, and a comparison between them is made. In the K-Means algorithm, the elbow and silhouette coefficient methods are used to determine the best number of clusters. The changes in network performance before and after pruning are tested by experiments. The core of this thesis is the compression of convolutional neural networks. The main difficulties of this method include: (1) the definition of filters' importance, (2) determining the number of pruned filters, and (3) explaining the effectiveness of the proposed network.
2. Related Work
In recent years, deep neural networks have received extensive attention and research. Their powerful modeling ability comes from their complex structure and large number of weight parameters, which makes it difficult for network models to run on mobile devices or embedded platforms with limited computing and storage resources [9]. Therefore, these network models need to be pruned or compressed to make them easier to transplant to mobile terminals. This section introduces four important classes of algorithms in the field of network compression and acceleration and analyzes their advantages and disadvantages.
LeCun et al. proposed the OBD (Optimal Brain Damage) method [10] for weight pruning; it has some effect, but it cannot be applied in the general case because the Hessian matrix is not always diagonal, and its computational cost is relatively high. To address the non-diagonal Hessian case, Hassibi et al. proposed the OBS (Optimal Brain Surgeon) method [11], which finds the optimal solution of the Hessian matrix via the Lagrange equation. Although this method accounts for the general situation, it still has a high computational cost because it requires the inverse of the Hessian matrix. In short, the OBD and OBS methods use a second-order Taylor expansion to select weights to delete and improve the generalization performance of the model through pruning and retraining. However, both methods inevitably require Hessian matrix operations, which significantly increase the memory and computation cost of the hardware used for network fine-tuning.
The essence of network quantization is to compress the neural network by reducing the number of bits needed for weight storage. In deterministic quantization, the quantized value and the actual value have a one-to-one correspondence; rounding is the easiest way to quantize actual values. Courbariaux et al. [12] proposed deterministic quantization using a rounding function. However, during backpropagation the error cannot be propagated through the same function because its gradient is zero almost everywhere, so a heuristic algorithm is needed to estimate the gradient of the neurons. Rastegari et al. [13] improved the backward function, and Polino et al. [14] proposed a more general form of the rounding function. Gong et al. [15] first considered using vector quantization to quantize and compress neural networks. In stochastic quantization, by contrast, the mapping between actual and quantized values is probabilistic. Generally speaking, rounding is a simple way to convert actual values into quantized values, but the performance of the network may degrade significantly when the parameters are quantized. Singular Value Decomposition (SVD) [16] is a popular low-rank matrix decomposition method: after the matrix is decomposed, the relatively small singular values are discarded and the network parameters are fine-tuned. However, when the weight matrix is high-dimensional, SVD cannot decompose it effectively, and the Tucker decomposition method [17] is needed instead. Low-rank matrix decomposition can reduce the data scale and remove noise, which effectively compresses and accelerates the network model, but it has the disadvantages that the transformed data are difficult to interpret and the adaptability is poor.
Filter pruning refers to the direct deletion of unimportant filters in a neural network together with the feature maps associated with them [18]. Compared with weight pruning, filter pruning is a structured pruning method that introduces no additional sparse operations, so there is no need for sparse libraries or specific hardware. The number of pruned filters directly affects the number of convolution operations, and a large reduction in convolution operations greatly increases the operational efficiency of the network. Although filter pruning has many advantages, it is rarely used in practical projects. One reason is that it is difficult to determine which filters need to be pruned; the other is how to balance the gain in speed against the loss in precision during pruning. Hu et al. [19] proposed a method based on the percentage of zero activations to find the unimportant filters. Subsequent experiments proved that evaluating and deleting filters in all layers at the same time greatly reduces the accuracy of the network, and it is difficult to restore its performance through retraining; therefore, filters need to be pruned layer by layer. Li et al. [11] proposed a filter pruning method based on the L1 norm of the filter matrix. Because the L1 norms of filters from different convolutional layers are on different orders of magnitude, they cannot be compared directly, so the whole network must be fine-tuned after each layer is pruned. One obvious disadvantage of this method is that it is difficult to achieve the best trade-off between pruning efficiency and network performance. Hu et al. [20] proposed a squeeze-and-excitation block, referred to as SEBlock, at the 2017 ImageNet image classification challenge. SEBlock was originally used to improve the accuracy of image classification; its scaling factors reflect the network's selection among feature channels. Wang et al. [21] proposed that the importance of a channel can reflect the importance of the corresponding filter, because channels and filters are in one-to-one correspondence. Hence, a soft attention mechanism can be added to the neural network to determine the importance of the filters through its scaling factors.
2.1. Weight Pruning
Weight pruning is the earliest method of network pruning. Its basic idea is to treat weights below a certain threshold as unimportant and remove them from the network; the network is then retrained on the reserved sparse connections to obtain the final weights. The pruning process is shown in Figure 1.
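The thresholding step described above can be sketched as follows (an illustrative sketch; the weight values and the threshold are made up for demonstration):

```python
# Weight pruning sketch: zero out weights whose magnitude falls below a
# threshold, producing a sparse weight matrix (illustrative values only).
def prune_weights(weights, threshold):
    return [[w if abs(w) >= threshold else 0.0 for w in row] for row in weights]

W = [[0.8, -0.02, 0.5],
     [0.01, -0.9, 0.03],
     [0.4, 0.07, -0.6]]
pruned = prune_weights(W, threshold=0.1)
# Small-magnitude weights are replaced by 0, leaving a sparse connection
# pattern; in practice the network is then retrained on the surviving weights.
```

Note that the resulting sparsity is unstructured, which is exactly why this method needs sparse libraries or special hardware to realize a speedup.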
2.2. Quantization Methods
The essence of network quantization is to compress the neural network by reducing the number of bits needed for weight storage. The process is shown in
Figure 2.
Figure 2 shows the process of weight and gradient quantization. The upper-left corner is a 4 × 4 weight matrix and the lower-left corner is a 4 × 4 gradient matrix. The weights are all stored as 32-bit floats. To quantize the weights, they are divided into four categories, represented by four different colors in Figure 2, and the average value of each category is calculated. Then, only an index into the table of shared weight values needs to be stored for each weight. During the weight update, the gradients are grouped into the same categories as the weights, and the sum within each category gives the gradient of that whole category. Finally, subtracting this gradient from the shared weight yields the final weight value. In general, quantization methods can be divided into two types: deterministic quantization and stochastic quantization.
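The weight-sharing step of this scheme can be sketched as follows (a minimal illustration; the weight values and the cluster assignments are invented here, whereas in Figure 2 they come from clustering the real weight matrix):

```python
# Weight-sharing quantization sketch: group weights into k clusters and
# store one shared (average) value per cluster plus an index per weight.
def shared_values(weights, assignments, k):
    # shared value of each cluster = mean of the weights assigned to it
    sums, counts = [0.0] * k, [0] * k
    for w, a in zip(weights, assignments):
        sums[a] += w
        counts[a] += 1
    return [s / c for s, c in zip(sums, counts)]

weights     = [2.0, 2.2, -1.0, -1.2, 0.1, 0.0, 2.1, -1.1]
assignments = [0,   0,    1,    1,   2,   2,   0,    1]   # 3 clusters
shared = shared_values(weights, assignments, k=3)
# Each original 32-bit weight is now represented only by a small index
# into `shared`; gradients would be summed per cluster in the same way.
```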
2.3. Low-Rank Matrix Decomposition
The main idea of low-rank matrix decomposition is to compress and accelerate the neural network by decomposing the parameter matrices into products of low-rank matrices, eliminating redundancy in the convolution filters. Common approaches are the SVD method and the Tucker decomposition method.
2.4. Filter Pruning
Filter pruning refers to the direct deletion of unimportant filters in the neural network together with the feature maps associated with them [11]. The process is shown in
Figure 3.
Let n_i represent the number of input channels in the i-th convolutional layer, and let h_i and w_i represent the height and width of the input feature map, respectively. The layer's n_{i+1} filters, each of size n_i × k × k (k refers to the filter size), convert the feature maps x_i into x_{i+1} with n_{i+1} n_i k² h_{i+1} w_{i+1} convolution operations. If one of the n_{i+1} filters is pruned, n_{i+1} is directly reduced and, at the same time, the feature map output by that filter is deleted. In other words, the feature maps input to the next convolutional layer are also reduced, so the convolution operations of the next layer decrease as well. Following this rule, the subsequent convolutional layers save a large number of operations.
Compared with the weight pruning method, filter pruning is a structured pruning method that introduces no additional sparse operations, so there is no need for sparse libraries or specific hardware. The number of pruned filters directly affects the number of convolution operations, and a large reduction in convolution operations greatly increases the operational efficiency of the network. Most existing filter pruning methods are based on the importance of filters, commonly measured by the L1 norm: the greater the L1 norm of a filter, the higher its importance.
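The L1-norm criterion mentioned above can be sketched as follows (the filter weights are illustrative only; real filters would be n_i × k × k tensors):

```python
# Filter-importance sketch: rank filters by the L1 norm of their weights;
# filters with the smallest norms are candidates for pruning.
def l1_norm(filt):
    return sum(abs(w) for row in filt for w in row)

filters = [
    [[0.5, -0.4], [0.3, 0.2]],    # L1 = 1.4
    [[0.01, 0.02], [-0.03, 0.0]], # L1 = 0.06 (least important)
    [[1.0, -1.0], [0.5, -0.5]],   # L1 = 3.0
]
ranking = sorted(range(len(filters)), key=lambda i: l1_norm(filters[i]))
# ranking[0] is the index of the filter with the smallest L1 norm,
# i.e., the first candidate for pruning under this criterion.
```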
3. Framework of Retinanet
The authors of [22] proposed the one-stage detection model Retinanet. This model uses the Focal Loss function to solve the problem of extreme foreground-background class imbalance, so that it can match the speed of previous one-stage detectors while surpassing the accuracy of all existing state-of-the-art two-stage detectors. This work takes Retinanet as an example for filter pruning and tests the network before and after pruning on the WIDER FACE dataset, aiming to maintain a good balance between accuracy and speed. The framework of Retinanet is shown in
Figure 4.
The Retinanet network consists of three parts: Resnet [21], a network for feature extraction; FPN (feature pyramid network) [23], a feature pyramid network for fine processing of the extracted image features; and subnets [24], used for classification and localization.
3.1. Feature Extraction Network, Resnet 18
The feature map extraction process of Retinanet is divided into two parts: a bottom-up path and a top-down path. In deep learning, plain convolutional neural networks such as AlexNet [25], LeNet [26], and VGG-Net [27] are usually used to extract image features. The ability of this type of network to extract image features tends to improve as the number of layers increases. However, increasing network depth does not always improve the accuracy of the model; sometimes it causes higher evaluation and training errors, with accuracy deteriorating rapidly after saturation. This phenomenon is called degradation in the field of deep learning. Resnet avoids this shortcoming of plain convolutional neural networks.
3.2. FPN
Before the appearance of FPN (feature pyramid network), most object detection algorithms used only top-level features to predict. This kind of method has drawbacks: although the semantic information acquired at a high level is richer than at a low level, the target localization is coarser. In addition, although algorithms using multi-scale fusion have been proposed, they usually predict only from the final fused features. FPN fuses feature maps from different layers through top-down connections, bottom-up connections, and lateral connections, so it performs better in small target detection.
3.3. Subnets
Faster R-CNN adopts the idea of a region proposal network (RPN) [28]. The mapping point of the sliding-window center in the original image is called the anchor. With the anchor as the center, proposal regions can be generated from different network layers of FPN. The Retinanet model follows this idea: when generating anchor boxes, nine different boxes are produced at each position by combining three scales and three aspect ratios.
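The 3 × 3 anchor generation described above can be sketched as follows (the base size, scales, and ratios here are illustrative values, not Retinanet's exact configuration):

```python
# Anchor-box sketch: at each sliding-window center, generate 9 boxes from
# 3 scales and 3 aspect ratios (values are illustrative only).
def make_anchors(cx, cy, base_size, scales, ratios):
    boxes = []
    for s in scales:
        for r in ratios:
            area = (base_size * s) ** 2
            w = (area / r) ** 0.5   # width from area and ratio h/w = r
            h = w * r
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return boxes

anchors = make_anchors(cx=16, cy=16, base_size=32,
                       scales=[1.0, 1.26, 1.59], ratios=[0.5, 1.0, 2.0])
# 3 scales x 3 ratios = 9 anchors centered on the same point, each with the
# same center but different area and shape.
```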
4. Filter Pruning Based on Feature Maps Clustering
A filter is essentially a matrix of size k × k, used to detect specific features in images; different filters have different parameters. When an image is processed in the computer, it takes the form of an h × w × c array. Suppose we only consider the grayscale of the image regardless of RGB; then the size of the image is h × w. The filter processes the image by sliding over all its regions from left to right and from top to bottom, taking the dot product with each same-size area of the image. Summing the products yields a new filtered image, i.e., the feature map.
When the filter processes the image, it dot-multiplies with each equal area in the image. If a certain area of the image is similar to the feature detected by the filter, the filter will be activated and get a high dot multiplication result when passing through the area. Conversely, if an area of the image is very dissimilar to the features detected by the filter, the filter will not be activated or the value obtained by dot multiplication will be very small.
It can be concluded that when the filter slides over the whole image, the higher the value obtained from an area, the higher the correlation between that area and the feature detected by the filter. With a large number of filters, the neural network can obtain sufficient features of the image to detect and localize objects in it.
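The sliding dot-product operation described above can be sketched as follows (a minimal grayscale example with a hand-made vertical-edge filter, purely for illustration):

```python
# Convolution sketch: slide a k x k filter over a grayscale image and take
# the dot product at each position; a high response marks a matching feature.
def convolve2d(image, filt):
    k = len(filt)
    out_h = len(image) - k + 1
    out_w = len(image[0]) - k + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            out[i][j] = sum(image[i + a][j + b] * filt[a][b]
                            for a in range(k) for b in range(k))
    return out

# A vertical-edge filter responds strongly where brightness changes
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
edge_filter = [[-1, 1],
               [-1, 1]]
feature_map = convolve2d(image, edge_filter)
# The middle column of the feature map lights up: that is where the image
# area matches the feature the filter detects.
```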
The feature map is extracted by a filter and passed as input to the next convolutional layer. Zeiler et al. [5] used a de-convolution method to project the activated feature matrix back into the input pixel space and found that filters in different layers extract features of different levels. Chen et al. [29] demonstrated that different filters in the same layer can also extract features of different aspects. Therefore, the feature map can represent its filter in some respects [30]. Inspired by these observations, this work hypothesizes that redundant filters in a convolutional neural network may extract feature maps similar to those of other filters in the same layer when processing the same input.
Based on this assumption, we can determine how many similar feature maps are extracted by each convolutional layer using the clustering method, and then we can determine how many necessary features of the input image need to be extracted. Finally, we can select the filters to be pruned. The image clustering method used in this paper will be introduced below.
4.1. K-Means Method
The K-Means [31] algorithm is a common clustering method. Its basic idea is to select K points in the space as centers and assign each object to the class of its closest center; through iteration, the cluster centers are updated step by step until the best clustering result is obtained. Since the matrix representing a feature map is 3D, it is difficult to represent it with a single point, so the feature map must be reduced in dimension and mapped to a vector containing its important components.
4.1.1. Feature Map Dimension Reduction
Methods for a feature map's dimension reduction mainly include SVD (singular value decomposition) and PCA (principal component analysis) [13]. Similar to the SVD method, the basic idea of PCA is to project the original data onto a new coordinate system. If the variance of all data along one coordinate axis is the largest, this axis is recorded as A1; that is, the projection of the data is most scattered in the direction of A1, which means that the most information is retained, and A1 is the first principal component. Next, consider A2: if the covariance between A2 and A1 is 0, so that the information of A2 and A1 does not overlap, and the variance of the data in this direction is as large as possible, then A2 is the second principal component. By analogy, the third, fourth, and Nth principal components can be found.
Usually, only the first two principal components of the data need to be preserved so that the high-dimensional image can be transformed into a two-dimensional vector.
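The reduction of a batch of feature maps to two-dimensional points can be sketched as follows (a sketch assuming PCA via SVD of the centered data; the feature maps here are random placeholders for a real layer's outputs):

```python
import numpy as np

# PCA sketch: flatten each feature map to a vector, then project onto the
# first two principal components so every map becomes a 2-D point.
def reduce_feature_maps(feature_maps, n_components=2):
    X = feature_maps.reshape(len(feature_maps), -1)   # (n_maps, h*w)
    X = X - X.mean(axis=0)                            # center the data
    # principal axes = right singular vectors of the centered data
    _, _, vt = np.linalg.svd(X, full_matrices=False)
    return X @ vt[:n_components].T                    # (n_maps, 2)

rng = np.random.default_rng(0)
maps = rng.normal(size=(8, 4, 4))     # 8 feature maps of size 4 x 4
points = reduce_feature_maps(maps)
# each of the 8 feature maps is now a 2-D point, ready for clustering
```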
4.1.2. Clustering
After the dimensional reduction of the feature maps, the K-Means algorithm can be used to cluster the feature maps. The specific algorithm steps are as follows:
Choose the initial centers of the K classes appropriately; generally, the class centers are initialized heuristically or at random.
For any sample, calculate its distance to the K centers and classify the sample into the class whose center is closest.
Calculate the average of all data points belonging to the same class, that is, calculate the average of each dimension of the vector, and use the result as the new class center.
Repeat steps 2 and 3 until convergence occurs, i.e., the centers of all classes do not change anymore.
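The four steps above can be sketched as follows (one-dimensional points and hand-picked initial centers, purely for illustration):

```python
# K-Means sketch following the four steps above (1-D points for brevity).
def kmeans(points, centers, iters=20):
    clusters = [[] for _ in centers]
    for _ in range(iters):
        # step 2: assign each sample to the class with the nearest center
        clusters = [[] for _ in centers]
        for p in points:
            idx = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            clusters[idx].append(p)
        # step 3: recompute each center as the mean of its class
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:   # step 4: stop once centers no longer move
            break
        centers = new_centers
    return centers, clusters

points = [1.0, 1.2, 0.8, 9.0, 9.5, 8.5]
centers, clusters = kmeans(points, centers=[0.0, 10.0])
# the two centers converge to the means of the two natural groups
```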
Figure 5 is an example of the K-Means algorithm, where A, B, C, D, and E represent different samples and the two orange circles represent different centers. According to the distances between the samples and the centers, A, B, and C are eventually clustered in one group, while D and E are clustered in the other.
The advantages of this algorithm are its simple steps and fast operation; most importantly, it is easy to implement and can be run in parallel. However, it also has drawbacks: the number of classes K must be set in advance, and an improper choice leads to poor results. The key to the algorithm lies in the selection of the initial centers and the distance formula. Moreover, how to determine the clustering number K is an important problem to be solved.
4.1.3. Determination of the Number of Classes
At present, there are many methods to determine the clustering number, two of which are selected in this thesis.
The first method is the elbow method. This rule examines the relationship between the number of classes K and the cost function J, which is the sum of the squared distances from all data points in the classes to their center points. It can be expressed as:

J = Σ_{k=1}^{K} Σ_{i=1}^{m_k} || x_i^{(k)} − μ_k ||²

where m_k represents the number of data points in the k-th class, x_i^{(k)} represents the i-th data point in the k-th class, and μ_k is the center of the k-th class. For a class, the smaller the cost function value, the closer together the members of the class are; conversely, the larger the value, the looser the structure within the class. The cost function decreases as the number of classes increases.
Take Figure 6 as an example. When the number of clusters reaches 3, the decrease in the cost function levels off markedly, so 3 can be considered the best number of clusters. However, it is also possible that the cost function curve has no significant change in curvature; in that case, the second method is required to determine the optimal number of clusters.
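The elbow criterion can be sketched as follows (1-D points with three well-separated groups; the centers for K = 2 and K = 3 are hand-picked here, standing in for K-Means results):

```python
# Elbow-method sketch: evaluate the cost J(K) = sum of squared distances
# from each point to its nearest class center, for several values of K.
def cost(points, centers):
    return sum(min((p - c) ** 2 for c in centers) for p in points)

# three well-separated 1-D groups, so the elbow should appear at K = 3
points = [0.0, 0.2, 0.1, 5.0, 5.2, 5.1, 10.0, 10.2, 10.1]
costs = {
    1: cost(points, [sum(points) / len(points)]),
    2: cost(points, [0.1, 7.6]),          # hand-picked centers for brevity
    3: cost(points, [0.1, 5.1, 10.1]),
}
# J drops sharply up to K = 3 and would flatten afterwards: the "elbow"
```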
The second method is the silhouette coefficient method. For a clustering task, the best clustering should make the data within a class as compact as possible and the data between classes as far apart as possible. The silhouette coefficient is an index measuring the degree of dispersion and compactness of a class. For the i-th sample, it is expressed as follows:

s(i) = (b(i) − a(i)) / max(a(i), b(i))

where a(i) represents the mean distance from the i-th sample to the other samples in the same class, b(i) denotes the mean distance from the i-th sample to the samples in the nearest other class, and s(i) represents the silhouette coefficient of the sample. The value of s(i) lies in the range (−1, 1). If it is close to 1, the classification of this sample is reasonable; if it is close to −1, the sample should be classified into another class.

The average of all samples' s(i) is called the silhouette coefficient of the whole clustering result. The formula is as follows:

S = (1/N) Σ_{i=1}^{N} s(i)
In the above formula, N represents the total number of samples and S represents the final silhouette coefficient. The silhouette coefficient is a reasonable and effective measure of the clustering result: in general, the larger the silhouette coefficient, the better the clustering effect.
Take Figure 7 as an example. The relationship between the cost function and the cluster number is shown in Figure 7a; there is no obvious elbow point in this curve. However, we can use the silhouette coefficient method, as shown in Figure 7b: when the clustering number is 3, the silhouette coefficient is the largest, meaning that the classification of all samples is the most reasonable. Hence, the optimal clustering number is 3.
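The silhouette computation can be sketched as follows (1-D points and two classes; the "good" and "bad" partitions are invented to show the contrast):

```python
# Silhouette sketch: s(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is
# the mean distance to samples in the same class and b(i) the mean distance
# to samples in the nearest other class; the overall score is the average.
def silhouette(clusters):
    scores = []
    for ci, cluster in enumerate(clusters):
        for p in cluster:
            same = [q for q in cluster if q is not p]
            a = sum(abs(p - q) for q in same) / len(same)
            b = min(sum(abs(p - q) for q in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

good = silhouette([[0.0, 0.1, 0.2], [9.0, 9.1, 9.2]])   # compact, far apart
bad  = silhouette([[0.0, 9.1], [0.1, 9.2]])             # mixed-up classes
# `good` is close to 1; `bad` is negative, signaling a poor partition
```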
4.2. HCA Method
The HCA (hierarchical clustering) method divides the dataset into classes hierarchically, each level building on the result of the previous one. Hierarchical clustering algorithms are generally divided into two types. Bottom-up hierarchical clustering [32]: each sample starts as its own class, and at each step the two closest classes are merged into a new class according to certain criteria, until finally all samples belong to one class. Top-down hierarchical clustering [32]: at first all samples belong to one class, and each class is repeatedly split into several classes according to certain criteria, until finally each sample forms its own class.
In this thesis, a bottom-up clustering method is adopted. The process is shown in
Figure 8. Suppose there are N samples to be clustered; the specific steps are as follows:
Classify each sample into its own class and calculate the distance between every two classes, in other words, the similarity between different samples. In this thesis, the class-average method is used to measure the similarity between two classes, defined as the average of the distances between pairs of points drawn from the two classes. For any two classes C_p and C_q, their similarity is recorded as D(C_p, C_q), and it can be calculated with the following formula:

D(C_p, C_q) = (1 / (n_p n_q)) Σ_{x ∈ C_p} Σ_{y ∈ C_q} d(x, y)

where n_p and n_q represent the numbers of samples in the two classes, respectively, and x and y represent samples in the two classes. d(x, y) represents the Euclidean distance between the two samples:

d(x, y) = || x − y ||₂
Set a threshold and find the two classes with the closest distance among all classes; the distance must be smaller than the threshold. If such a pair exists, merge them into one class and reduce the total number of classes by 1; otherwise, the classification process stops.
Recalculate the similarity between the newly generated class and the remaining old classes.
Repeat steps 2 and 3 until all samples fall into one class, or until the distance between the two closest classes is greater than the threshold and the classification stops.
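The bottom-up procedure with class-average linkage can be sketched as follows (1-D points and an illustrative threshold, not the 0.049 used in the experiments):

```python
# Bottom-up hierarchical clustering sketch with class-average linkage:
# start from singleton classes and repeatedly merge the closest pair while
# their average pairwise distance stays below the threshold.
def hca(points, threshold):
    clusters = [[p] for p in points]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # class-average linkage: mean pairwise distance
                d = (sum(abs(p - q) for p in clusters[i] for q in clusters[j])
                     / (len(clusters[i]) * len(clusters[j])))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:          # closest pair too far apart: stop
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

clusters = hca([0.0, 0.3, 0.1, 5.0, 5.2], threshold=1.0)
# two classes remain: one around 0 and one around 5
```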
4.3. Filter Pruning
This section prunes the convolutional layers that contain redundant filters. The idea is to first cluster the output feature maps of all filters in a convolutional layer, then randomly select and save one sample in each class. The index of this sample is saved and its corresponding filter is kept; the remaining filters in the layer are pruned. The process is represented in
Figure 9.
In Figure 9, x_i represents the input feature maps, and h_i and w_i represent the height and width of the feature maps in the i-th layer; the original convolutional layer is recorded as W_i. The output feature map is recorded as x_{i+1}, and the number of convolution operations for this convolutional layer is n_{i+1} n_i k² h_{i+1} w_{i+1}, where n_i is the number of input channels, n_{i+1} the number of filters, and k the filter size. Let c_{i+1} represent the cluster number of the filters. Then, after pruning, c_{i+1} filters are preserved and the number of convolution operations in this layer is reduced to c_{i+1} n_i k² h_{i+1} w_{i+1}.
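The selection step can be sketched as follows (a minimal sketch: the cluster labels are hand-assigned here to stand in for the K-Means/HCA result, and the first filter of each cluster is kept instead of a random one, for determinism):

```python
import numpy as np

# Pruning sketch: given one cluster label per output feature map, keep one
# filter per cluster and drop the rest of the layer's filters.
def select_filters(labels):
    keep, seen = [], set()
    for idx, lab in enumerate(labels):
        if lab not in seen:       # first filter encountered in each cluster
            seen.add(lab)
            keep.append(idx)
    return keep

weights = np.random.rand(6, 3, 3, 3)   # 6 filters, 3 input channels, 3 x 3
labels = [0, 1, 0, 2, 1, 0]            # the 6 feature maps fell into 3 clusters
keep = select_filters(labels)
pruned = weights[keep]                 # pruned layer keeps c = 3 filters
# convolution operations drop in proportion from 6 filters' worth to 3
```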
6. Conclusions
This paper studies the filter pruning method in convolutional neural networks and proposes a filter selection method based on feature map clustering. Through experiments, we verify the feasibility of this method: selecting pruned filters based on feature map clustering is effective, and its precision is higher than that of randomly selecting pruned filters. Of the two clustering methods adopted in this thesis, the HCA method is superior to the K-Means method.
The silhouette coefficient method is a feasible way to find the best clustering number. In general, when the number of filters in each layer is pruned down to the number of clusters, the detection precision of the model does not decrease much, and the precision of the HCA method is higher and more robust. This work sets the number of initial centers in the K-Means method to 112 and the threshold in the HCA method to 0.049.
The detection speed of the pruned model improves greatly. The improvement in computing speed on CPU and GPU depends mainly on the structure of the model: models dominated by parallel operations save more computation time on the GPU after pruning, while models dominated by serial operations save more time on the CPU.
This paper also compares the pruned network with an existing lightweight network, SSD. After pruning, Retinanet exceeds SSD in both precision and speed on the WIDER FACE dataset, as shown in
Table 3.
In addition to the above research results, there are also many areas that need improvement. There are many feature map clustering methods, and it would be better to have one that is more robust and makes it easier to determine the best clustering number. A more complete theory is also needed to clarify the relationships among feature maps.