An Efficient FPGA-Based Implementation for Quantized Remote Sensing Image Scene Classification Network
Abstract
1. Introduction
- A Q-IORN is proposed to facilitate the implementation of remote sensing scene classification on FPGA. The quantization-aware training method used in this network converts the feature maps and parameters of the network from floating-point to fixed-point, which efficiently compresses the model size without accuracy loss.
- We analyze and optimize each calculation module of the proposed Q-IORN, which lays a foundation for the subsequent hardware implementation.
- An efficient hardware architecture is proposed to implement the Q-IORN. This architecture adopts an efficient dual-channel DDR access mode, a rational on-chip data processing scheme, and a high-performance processing engine.
- We verify the proposed hardware architecture on a Xilinx VC709 development board and evaluate it on the NWPU-RESISC45 dataset. The experimental results show that the classification accuracy of the proposed accelerator is consistent with that of the GPU, i.e., 88.31%. The proposed accelerator achieves an overall energy efficiency of 33.16 Giga-Operations Per Second Per Watt (GOP/s/W) at 200 MHz, which is superior to the CPU, GPU, and other advanced accelerators.
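The fixed-point conversion behind the Q-IORN can be illustrated with a minimal symmetric 8-bit quantization sketch. The paper's exact scheme (per-layer vs. per-channel scales, rounding mode, training-time simulation) is not reproduced here; `quantize`/`dequantize` and the max-based scale are illustrative assumptions.

```python
import numpy as np

def quantize(x, num_bits=8):
    """Map a float tensor to signed fixed-point values plus a scale factor.

    A minimal symmetric-quantization sketch; the paper's actual
    quantization-aware training scheme may differ in scale selection
    and rounding.
    """
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for signed 8-bit
    max_abs = float(np.abs(x).max())
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from fixed-point values."""
    return q.astype(np.float32) * scale

w = np.array([0.5, -1.27, 0.03], dtype=np.float32)
q, s = quantize(w)
w_hat = dequantize(q, s)    # close to w, within one quantization step of s
```

Storing `q` (8-bit) instead of `w` (32-bit) is what yields the roughly 4x model-size reduction reported in Section 5.2.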
2. Quantized IORN
2.1. IORN
2.2. Q-IORN
3. Algorithm Basics and Mapping Optimization Scheme
3.1. Quantized A-ORConv Layer
3.1.1. Network Quantization
3.1.2. Quantized A-ORConv
3.2. Activation Function
3.3. Pooling Layer
3.4. Quantized FC Layer
3.5. Softmax Layer
4. Hardware Implementation
4.1. Efficient Storage Scheme
4.2. Processing Engine Architecture
- (1)
- A-ARF generation module. As shown in Section 3.1, the unmaterialized filters of A-ARF are produced by coordinate rotation and orientation rotation. In the proposed hardware accelerator, the mapping order of these two steps is exchanged: the orientation rotation is implemented before the coordinate rotation, which reduces logical resource consumption while still generating the required unmaterialized filters. Figure 8 depicts the procedure of generating the m-th input channel of A-ARF. The materialized filter is stored with a custom data arrangement, as shown in Figure 8, so that the four adjacent orientation channels of the filter required for orientation rotation can be provided simultaneously. The orientation rotation module reads the i-th column in the storage unit and reorders it according to the value of the counters; the reordered sequence corresponds to the i-th input channel of the materialized filter. After the orientation rotation, the m-th input channel of A-ARF is finally generated by rotating the coordinates of each kernel separately. The detailed process of the coordinate rotation is shown in Figure 1. In conclusion, the implementation of A-ARF mainly depends on a reasonable storage arrangement and efficient logical control. The proposed scheme provides the convolution kernels required for convolution calculation in real time and relieves the on-chip storage pressure that storing the unmaterialized filters would cause.
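The two-step generation above can be sketched as follows. This is a behavioral model under stated assumptions, not the hardware datapath: a 4-orientation 3x3 kernel is assumed, the orientation rotation is modeled as a cyclic reorder of the channel sequence, and the coordinate rotation as a 90-degree spatial rotation of each kernel; the exact index and rotation-direction conventions are not taken from the paper.

```python
import numpy as np

def generate_a_arf_channel(materialized, r):
    """Sketch of producing one A-ARF input channel from a materialized filter.

    `materialized` holds the 4 orientation channels of one 3x3 kernel,
    shape (4, 3, 3). `r` in {0, 1, 2, 3} selects a rotation in multiples
    of 90 degrees. Following the accelerator's exchanged order:
    orientation rotation first, then coordinate rotation.
    """
    # Orientation rotation: cyclically reorder the 4 orientation channels.
    reordered = np.roll(materialized, shift=r, axis=0)
    # Coordinate rotation: rotate each kernel's 3x3 spatial grid by r*90 deg.
    return np.stack([np.rot90(k, -r) for k in reordered])

base = np.arange(4 * 9, dtype=np.int8).reshape(4, 3, 3)
rotated = generate_a_arf_channel(base, 1)   # one unmaterialized view of base
```

Because each rotated view is derived on the fly, only `base` needs to reside on-chip, matching the storage-relief argument above.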
- (2)
- Feature padding module. Unlike the implementation in [29], which used additional Block Random Access Memory (BRAM) to store padding data, the proposed hardware accelerator performs padding during the calculation by judging the position of the current calculation patch within the feature map. With this optimization, only a few register resources are required to map this module.
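The position-judgment idea reduces to a bounds check at read time, as in the following sketch (function and variable names are illustrative, not from the paper):

```python
def padded_pixel(feature, row, col):
    """Return the pixel at (row, col), emitting zero outside the feature map.

    Sketch of padding-by-position-judgment: instead of materializing a
    zero-padded copy of the feature map in BRAM, the address logic checks
    whether the requested coordinate lies inside the map and substitutes
    zero otherwise -- only comparators/registers, no extra storage.
    """
    h, w = len(feature), len(feature[0])
    if 0 <= row < h and 0 <= col < w:
        return feature[row][col]
    return 0    # padding region: produced by the bounds check alone

fmap = [[1, 2], [3, 4]]
# A 3x3 window centered on pixel (0, 0) of a 2x2 map, with implicit padding.
window = [padded_pixel(fmap, r, c) for r in range(-1, 2) for c in range(-1, 2)]
```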
- (3)
- Convolution calculation module. During the deployment of the convolution calculation module, the main challenge is mapping the dense multiplication and addition operations to the FPGA, so we focus on exploring a suitable parallel computing scheme. The proposed convolution calculation module contains a total of 64 PEs. Each PE holds nine multipliers and adopts an adder tree structure to carry out the calculation of one window. Based on the network structure and the proposed parallel scheme, our accelerator reads three consecutive rows of the input feature map and the required convolution kernel into the corresponding on-chip buffers each time. A row of the output feature map is then generated from the obtained input feature map and convolution kernel, as shown in Figure 9. The detailed calculation process for generating the m-th row of the output feature map is illustrated in Figure 10. After obtaining the required input feature map and convolution kernel, we divide them into groups: the i-th group contains three rows of the i-th input channel of the input feature map and the i-th channel of the convolution kernel. This grouping enables pipelining within a layer. As 64 PEs are set in this module, the convolution kernels in the i-th group are further divided into units of 64 windows each, as shown in Figure 10. The final outputs are then produced by combining data reuse, vector inner products within each group, and accumulation between groups.
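The core of each PE — nine multipliers feeding an adder tree — can be sketched functionally. The pairwise grouping of the tree levels below is illustrative; the hardware's exact tree shape is not specified in the text.

```python
def pe_window(pixels, weights):
    """One processing element: nine products reduced by an adder tree.

    `pixels` and `weights` are the nine values of one 3x3 window,
    flattened. The nine multiplications happen in parallel in hardware;
    the tree then sums them in log-depth (4 -> 2 -> 1, plus the odd term).
    """
    p = [x * w for x, w in zip(pixels, weights)]    # 9 parallel multipliers
    s1 = [p[0] + p[1], p[2] + p[3], p[4] + p[5], p[6] + p[7]]   # level 1
    s2 = [s1[0] + s1[1], s1[2] + s1[3]]                          # level 2
    return s2[0] + s2[1] + p[8]                                  # root

acc = pe_window(list(range(9)), [1] * 9)    # identity weights: sum 0..8 = 36
```

With 64 such PEs running in lockstep, 64 window results of one output row are produced per step, and the inter-group accumulation merges the contributions of the input channels.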
- (4)
- FC calculation module. The mapping of the FC calculation module is mainly limited by the access of weights. Since the DDR can only provide 512 bits of data per clock cycle, we set up a PE with 64 multipliers for the FC calculation, which makes full use of the DDR bandwidth. The detailed calculation process of the FC layer is shown in Figure 11. To match the bit-width of the DDR, the input neurons and weights are processed into a custom form through flattening and splicing (the input neurons are processed online, and the weights are processed offline). In this way, the input neurons and corresponding weights are divided into groups, each holding a 512-bit input and 512-bit weights. The input neurons are reused during the calculation, which effectively reduces accesses to on-chip storage, while the weights are read from external memory in real time. Similar to the convolution calculation module, the final output neurons are generated by applying vector inner products within each group and accumulation between groups.
- (5)
- Fusion calculation module. Based on our previous work [25], we propose a fused layer to apply the quantization operation within the proposed processing engine. The fused layer is obtained by merging the n-th de-quantization layer and the (n + 1)-th quantization layer and is implemented before ReLU. According to Equations (3) and (5), de-quantization and quantization rely on a multiplication and a division, respectively, whereas the fused layer requires only a single multiplication, which reduces the consumption of computing resources. Figure 12 depicts the process of producing the (n + 1)-th convolutional layer's input element in the general and the optimized network inference. As shown in Figure 12, this optimized implementation scheme also reduces the input bit-width of the ReLU module and the max-pooling module from 32-bit to 8-bit, which effectively reduces the consumption of storage resources.
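The fusion amounts to folding the de-quantization scale and the quantization divisor into one precomputed factor. A sketch under assumed scale names (`s_in`, `s_w`, `s_out` are illustrative; the exact form of Equations (3) and (5), and the rounding/clipping behavior, are not reproduced here):

```python
def fused_requantize(acc, s_in, s_w, s_out, num_bits=8):
    """Fold de-quantization (a multiply) and re-quantization (a divide)
    into one multiplication by an offline-computed scale.

    `acc` models the 32-bit integer accumulator of layer n. The combined
    factor m = s_in * s_w / s_out is computed offline, so only one
    runtime multiplication remains, and the result is clipped back to
    8 bits before ReLU / max-pooling.
    """
    m = s_in * s_w / s_out                  # single offline-computed factor
    q = int(round(acc * m))                 # one multiplication at runtime
    qmax = 2 ** (num_bits - 1) - 1
    return max(-qmax - 1, min(qmax, q))     # clip to the 8-bit range

y = fused_requantize(1000, s_in=0.02, s_w=0.01, s_out=0.05)
```

Because the fused output is already 8-bit, the downstream ReLU and max-pooling buffers shrink accordingly, matching the 32-bit to 8-bit reduction described above.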
5. Experiments and Results
5.1. Experimental Settings
5.1.1. Dataset Description
5.1.2. Experimental Procedure
5.2. Performance Evaluation of the Q-IORN
5.3. Performance Evaluation of the Proposed Accelerator
6. Conclusions
Author Contributions
Funding
Acknowledgments
Conflicts of Interest
References
1. Cheng, G.; Han, J.; Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proc. IEEE 2017, 105, 1865–1883.
2. Cheng, G.; Guo, L.; Zhao, T.; Han, J.; Li, H.; Fang, J. Automatic landslide detection from remote-sensing imagery using a scene classification method based on BoVW and pLSA. Int. J. Remote Sens. 2013, 34, 45–59.
3. Mishra, N.B.; Crews, K.A. Mapping vegetation morphology types in a dry savanna ecosystem: Integrating hierarchical object-based image analysis with random forest. Int. J. Remote Sens. 2014, 35, 1175–1198.
4. Cheng, G.; Han, J.; Guo, L.; Liu, Z.; Bu, S.; Ren, J. Effective and efficient midlevel visual elements-oriented land-use classification using VHR remote sensing images. IEEE Trans. Geosci. Remote Sens. 2015, 53, 4238–4249.
5. Xia, G.-S.; Hu, J.; Hu, F.; Shi, B.; Bai, X.; Zhong, Y.; Zhang, L.; Lu, X. AID: A benchmark data set for performance evaluation of aerial scene classification. IEEE Trans. Geosci. Remote Sens. 2017, 55, 3965–3981.
6. Hu, F.; Xia, G.-S.; Hu, J.; Zhang, L. Transferring deep convolutional neural networks for the scene classification of high-resolution remote sensing imagery. Remote Sens. 2015, 7, 14680–14707.
7. Zhong, Y.; Fei, F.; Zhang, L. Large patch convolutional neural networks for the scene classification of high spatial resolution imagery. J. Appl. Remote Sens. 2016, 10, 025006.
8. Cheng, G.; Li, Z.; Yao, X.; Guo, L.; Wei, Z. Remote sensing image scene classification using bag of convolutional features. IEEE Geosci. Remote Sens. Lett. 2017, 14, 1735–1739.
9. Yuan, B.; Li, S.; Li, N. Multiscale deep features learning for land-use scene recognition. J. Appl. Remote Sens. 2018, 12, 015010.
10. Zhang, C.; Prasanna, V. Frequency domain acceleration of convolutional neural networks on CPU-FPGA shared memory system. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 35–44.
11. Zhang, J.; Li, J. Improving the performance of OpenCL-based FPGA accelerator for convolutional neural network. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 25–34.
12. Zhang, M.; Li, L.; Wang, H.; Liu, Y.; Qin, H.; Zhao, W. Optimized compression for implementing convolutional neural networks on FPGA. Electronics 2019, 8, 295.
13. Mei, C.; Liu, Z.; Niu, Y.; Ji, X.; Zhou, W.; Wang, D. A 200MHZ 202.4GFLOPS@10.8W VGG16 accelerator in Xilinx VX690T. In Proceedings of the 2017 IEEE Global Conference on Signal and Information Processing, Montreal, QC, Canada, 14–16 November 2017; pp. 784–788.
14. Gysel, P.; Motamedi, M.; Ghiasi, S. Hardware-oriented approximation of convolutional neural networks. arXiv 2016, arXiv:1604.03168.
15. Hubara, I.; Courbariaux, M.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Quantized neural networks: Training neural networks with low precision weights and activations. J. Mach. Learn. Res. 2017, 18, 6869–6898.
16. Han, S.; Pool, J.; Tran, J.; Dally, W. Learning both weights and connections for efficient neural network. In Proceedings of the Advances in Neural Information Processing Systems, Montreal, QC, Canada, 7–12 December 2015; pp. 1135–1143.
17. Liang, S.; Yin, S.; Liu, L.; Luk, W.; Wei, S. FP-BNN: Binarized neural network on FPGA. Neurocomputing 2018, 275, 1072–1086.
18. Zhao, R.; Song, W.; Zhang, W.; Xing, T.; Lin, J.H.; Srivastava, M.; Gupta, R.; Zhang, Z. Accelerating binarized convolutional neural networks with software-programmable FPGAs. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 15–24.
19. Zhang, C.; Li, P.; Sun, G.; Guan, Y.; Xiao, B.; Cong, J. Optimizing FPGA-based accelerator design for deep convolutional neural networks. In Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2015; pp. 161–170.
20. Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J. An FPGA-based CNN accelerator integrating depthwise separable convolution. Electronics 2019, 8, 281.
21. Alwani, M.; Chen, H.; Ferdman, M.; Milder, P. Fused-layer CNN accelerators. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture, Taipei, Taiwan, 15–19 October 2016; pp. 1–12.
22. Sun, F.; Wang, C.; Gong, L.; Xu, C.; Zhou, X. A high-performance accelerator for large-scale convolutional neural networks. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications, Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–9.
23. Li, L.; Zhang, S.; Wu, J. Efficient object detection framework and hardware architecture for remote sensing images. Remote Sens. 2019, 11, 2376.
24. Wang, J.; Liu, W.; Ma, L.; Chen, H.; Chen, L. IORN: An effective remote sensing image scene classification framework. IEEE Geosci. Remote Sens. Lett. 2018, 15, 1695–1699.
25. Wei, X.; Liu, W.; Chen, L.; Ma, L.; Chen, H.; Zhuang, Y. FPGA-based hybrid-type implementation of quantized neural networks for remote sensing applications. Sensors 2019, 19, 924.
26. Lu, X.; Wang, B.; Zheng, X.; Li, X. Exploring models and data for remote sensing image caption generation. IEEE Trans. Geosci. Remote Sens. 2018, 56, 2183–2195.
27. Simonyan, K.; Zisserman, A. Very deep convolutional networks for large-scale image recognition. In Proceedings of the 2015 International Conference on Learning Representations, San Diego, CA, USA, 7–9 May 2015; pp. 1–14.
28. Zhou, Y.; Ye, Q.; Qiu, Q.; Jiao, J. Oriented response networks. In Proceedings of the 2017 IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 4961–4970.
29. Liu, Z.; Chow, P.; Xu, J.; Jiang, J.; Dou, Y.; Zhou, J. A uniform architecture design for accelerating 2D and 3D CNNs on FPGAs. Electronics 2019, 8, 65.
30. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J.S. Optimizing loop operation and dataflow in FPGA acceleration of deep convolutional neural networks. In Proceedings of the 2017 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, Monterey, CA, USA, 22–24 February 2017; pp. 45–54.
| Network | Accuracy | Model Size (MB) |
|---|---|---|
| IORN | 88.49% | 485.88 |
| Q-IORN | 88.31% | 121.51 |
| Multiplier Type | Floating-Point Multipliers | Fixed-Point Multipliers |
|---|---|---|
| Multiplier Number | 65 | 640 |
| DSP Utilization | 130 | 640 |
| Resource | DSP | BRAM | FF | LUT |
|---|---|---|---|---|
| Available | 3600 | 1470 | 866,400 | 433,200 |
| Utilization | 770 | 404.5 | 116,742 | 73,320 |
| Utilization (%) | 21.39 | 27.52 | 13.47 | 16.93 |
| Platform | CPU | GPU | FPGA |
|---|---|---|---|
| Device | Intel Xeon E5-2697 v4 | NVIDIA TITAN Xp | Xilinx VC709 |
| Technology (nm) | 14 | 16 | 28 |
| Frequency (MHz) | 2300 | 1582 | 200 |
| Power (W) | 145 | 250 | 6.32 |
| Latency (ms) | 1137.62 | 43.00 | 147.62 |
| System Performance (fps) | 0.88 | 23.26 | 6.77 |
| Throughput (GOP/s) | 27.20 | 720.55 | 209.60 |
| Energy Efficiency (GOP/s/W) | 0.19 | 2.88 | 33.16 |
| | [30] | [13] | [12] | [23] | Ours |
|---|---|---|---|---|---|
| FPGA | Arria-10 GX 1150 | Xilinx XC7VX690T | Xilinx XCZU7EV | Xilinx XC7Z100 | Xilinx XC7VX690T |
| Frequency (MHz) | 150 | 200 | 300 | 200 | 200 |
| Precision | 8–16 bit fixed | 16-bit float | 8-bit fixed | 16-bit fixed | 8-bit fixed |
| Power (W) | 21.2 | 10.81 | 17.67 | 19.52 | 6.32 |
| Throughput (GOP/s) | 645.25 | 202.03 (conv) 202.42 (all) | 290.40 (conv) 14.11 (all) | 452.8 | 224.43 (conv) 209.60 (all) |
| Energy efficiency (GOP/s/W) | 30.44 | 18.69 (conv) 18.73 (all) | 16.44 (conv) 0.80 (all) | 23.20 | 35.51 (conv) 33.16 (all) |
Share and Cite
Zhang, X.; Wei, X.; Sang, Q.; Chen, H.; Xie, Y. An Efficient FPGA-Based Implementation for Quantized Remote Sensing Image Scene Classification Network. Electronics 2020, 9, 1344. https://doi.org/10.3390/electronics9091344