# A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator


## Abstract


## 1. Introduction

1. A scalable and highly reusable multiplication and accumulation engine (MAE) is proposed to solve the engine-idling problem caused by separate dedicated-engine architectures; the MAE is compatible with different types of computation.
2. An efficient convolution algorithm is proposed for DWC and PWC, respectively, to reduce on-chip memory occupancy. Together, the two algorithms achieve layer fusion between DWC and PWC and improve off-chip memory access efficiency.
3. An address-mapping method for off-chip access is proposed, which maximizes bandwidth utilization and reduces latency when reading feature maps.

## 2. Background

### 2.1. Convolutional Neural Network Components

#### 2.1.1. Convolutional Layer

#### 2.1.2. Activation Function Layer

#### 2.1.3. Global Pooling Layer

#### 2.1.4. Fully Connected Layer

### 2.2. Depthwise Separable Convolution
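Depthwise separable convolution factors a standard convolution into a depthwise stage (one K×K filter per input channel) and a pointwise stage (a 1×1 convolution that mixes channels). A minimal NumPy sketch, assuming channel-last layout, stride 1, and valid padding (the function name and shapes are illustrative, not from the paper):

```python
import numpy as np

def depthwise_separable_conv(x, dw_w, pw_w):
    """Depthwise separable convolution (valid padding, stride 1).

    x:    input feature map, shape (H, W, IC)
    dw_w: depthwise weights,  shape (K, K, IC) -- one K x K filter per channel
    pw_w: pointwise weights,  shape (IC, OC)   -- the 1 x 1 convolution
    """
    H, W, IC = x.shape
    K = dw_w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1

    # Depthwise stage: each channel is convolved with its own K x K filter;
    # no cross-channel accumulation happens here.
    dw_out = np.zeros((Ho, Wo, IC))
    for i in range(Ho):
        for j in range(Wo):
            dw_out[i, j] = np.sum(x[i:i+K, j:j+K] * dw_w, axis=(0, 1))

    # Pointwise stage: a 1 x 1 convolution mixes channels at every position,
    # which is just a matrix product along the channel axis.
    return dw_out @ pw_w
```

This factorization is what reduces the multiply count from $K^2 \cdot IC \cdot OC$ per output pixel (standard convolution) to $K^2 \cdot IC + IC \cdot OC$.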

## 3. Design of Accelerator

### 3.1. Overall Architecture

### 3.2. The Scalable Multiplication and Accumulation Engine

### 3.3. Two Efficient Convolution Algorithms

Algorithm 1: DWC calculation with PUDL convolution order.

Algorithm 2: PWC calculation with PUPL convolution order.
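The PUDL order for DWC can be sketched as a loop nest. This is an illustrative reading of the caption of Figure 7a — channels processed m at a time (one "Fragment"), with the $K\times K\times m$ weights resident while the whole spatial plane streams through — not the paper's actual RTL:

```python
import numpy as np

def dwc_pudl(x, w, m):
    """DWC computed fragment by fragment (PUDL order, sketch).

    x: input feature map, shape (H, W, IC); w: weights, shape (K, K, IC).
    Channels are processed in groups of m; for each group the K x K x m
    weights stay resident while all spatial positions stream through,
    so each input value is read only once.
    """
    H, W, IC = x.shape
    K = w.shape[0]
    Ho, Wo = H - K + 1, W - K + 1
    out = np.zeros((Ho, Wo, IC))
    for c0 in range(0, IC, m):              # one "Fragment" = m channels
        wm = w[:, :, c0:c0+m]               # weights loaded once per fragment
        for i in range(Ho):                 # depthwise loop: raster order
            for j in range(Wo):
                out[i, j, c0:c0+m] = np.sum(
                    x[i:i+K, j:j+K, c0:c0+m] * wm, axis=(0, 1))
    return out
```

The result is identical to a per-channel depthwise convolution; only the traversal order (and hence the buffering requirement) changes.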

### 3.4. Address-Mapping Method

- Input features are read into the accelerator from off-chip memory with large burst lengths from contiguous addresses; the access itself is efficient, but the data must be heavily cached on-chip because its order does not match the input order expected by the $PW{C}_{PUPL}$.
- Input features are read into the accelerator from discrete external memory locations with small burst lengths; the data order matches the computational order of the $PW{C}_{PUPL}$, but the off-chip memory access is inefficient.
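The paper's exact mapping is the one shown in Figure 8. As one plausible illustration of how such a mapping reconciles the two orders, a channel-last (pixel-major) layout places all channel values of one pixel at consecutive addresses, so the pointwise input unit of the PWC becomes a single contiguous burst (the function is a hypothetical sketch, not the paper's scheme):

```python
def mapped_address(i, j, c, W, IC):
    """Linear address of element (row i, col j, channel c) in a
    channel-last layout of a W-wide feature map with IC channels.

    All IC channel values of one pixel sit at consecutive addresses,
    so a pointwise unit (one pixel, all channels) is one burst; scanning
    pixels in raster order keeps successive bursts contiguous as well.
    """
    return (i * W + j) * IC + c
```

Under this layout the DWC can write its outputs so that the downstream PWC reads them back in long, in-order bursts, which is the property the bullet points above trade off against each other.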

## 4. Evaluation

### 4.1. Implementation Consideration

### 4.2. Implementation Results

#### 4.2.1. FPGA Resource Utilization and Burst Length

#### 4.2.2. Comparison with CPU Implementation

#### 4.2.3. Comparison with FPGA Implementations

### 4.3. Evaluations on Other Networks

## 5. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## References

1. Chen, L.; Li, S.; Bai, Q.; Yang, J.; Jiang, S.; Miao, Y. Review of Image Classification Algorithms Based on Convolutional Neural Networks. Remote Sens. **2021**, 13, 4712.
2. Wang, Y.; Zhang, W.; Gao, R.; Jin, Z.; Wang, X. Recent advances in the application of deep learning methods to forestry. Wood Sci. Technol. **2021**, 55, 1171–1202.
3. Guo, Z.; Huang, Y.; Hu, X.; Wei, H.; Zhao, B. A Survey on Deep Learning Based Approaches for Scene Understanding in Autonomous Driving. Electronics **2021**, 10, 471.
4. Sandler, M.; Howard, A.; Zhu, M.; Zhmoginov, A.; Chen, L.C. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 4510–4520.
5. Li, B.; Wang, H.; Zhang, X.; Ren, J.; Liu, L.; Sun, H.; Zheng, N. Dynamic Dataflow Scheduling and Computation Mapping Techniques for Efficient Depthwise Separable Convolution Acceleration. IEEE Trans. Circuits Syst. I Regul. Pap. **2021**, 68, 3279–3292.
6. Li, G.; Zhang, J.; Zhang, M.; Wu, R.; Cao, X.; Liu, W. Efficient depthwise separable convolution accelerator for classification and UAV object detection. Neurocomputing **2022**, 490, 1–16.
7. Ding, W.; Huang, Z.; Huang, Z.; Tian, L.; Wang, H.; Feng, S. Designing efficient accelerator of depthwise separable convolutional neural network on FPGA. J. Syst. Archit. **2019**, 97, 278–286.
8. Wu, D.; Zhang, Y.; Jia, X.; Tian, L.; Li, T.; Sui, L.; Xie, D.; Shan, Y. A high-performance CNN processor based on FPGA for MobileNets. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 136–143.
9. Lin, Y.; Zhang, Y.; Yang, X. A Low Memory Requirement MobileNets Accelerator Based on FPGA for Auxiliary Medical Tasks. Bioengineering **2023**, 10, 28.
10. Liu, X.; Yang, J.; Zou, C.; Chen, Q.; Yan, X.; Chen, Y.; Cai, C. Collaborative Edge Computing With FPGA-Based CNN Accelerators for Energy-Efficient and Time-Aware Face Tracking System. IEEE Trans. Comput. Soc. Syst. **2022**, 9, 252–266.
11. Shang, J.; Zhang, K.; Zhang, Z.; Li, C.; Liu, H. A high-performance convolution block oriented accelerator for MBConv-based CNNs. Integr. VLSI J. **2023**, 88, 298–312.
12. Wang, X.; Tian, T.; Zhao, L.; Wu, W.; Jin, X. Exploration of Balanced Design in Resource-Constrained Edge Device for Efficient CNNs. IEEE Trans. Circuits Syst. II Express Briefs **2022**, 69, 4573–4577.
13. Choi, K.; Sobelman, G.E. An Efficient CNN Accelerator for Low-Cost Edge Systems. ACM Trans. Embed. Comput. Syst. **2022**, 21, 44.
14. Xuan, L.; Un, K.F.; Lam, C.S.; Martins, R.P. An FPGA-Based Energy-Efficient Reconfigurable Depthwise Separable Convolution Accelerator for Image Recognition. IEEE Trans. Circuits Syst. II Express Briefs **2022**, 69, 4003–4007.
15. Yu, X.; Wang, Y.; Miao, J.; Wu, E.; Zhang, H.; Meng, Y.; Zhang, B.; Min, B.; Chen, D.; Gao, J. A data-center FPGA acceleration platform for convolutional neural networks. In Proceedings of the 2019 29th International Conference on Field Programmable Logic and Applications (FPL), Barcelona, Spain, 8–12 September 2019; pp. 151–158.
16. Bai, L.; Zhao, Y.; Huang, X. A CNN Accelerator on FPGA Using Depthwise Separable Convolution. IEEE Trans. Circuits Syst. II Express Briefs **2018**, 65, 1415–1419.
17. Fan, H.; Liu, S.; Ferianc, M.; Ng, H.C.; Que, Z.; Liu, S.; Niu, X.; Luk, W. A real-time object detection accelerator with compressed SSDLite on FPGA. In Proceedings of the 2018 International Conference on Field-Programmable Technology, Naha, Japan, 10–14 December 2018; pp. 14–21.
18. Liang, Y.; Lu, L.; Jin, Y.; Xie, J.; Huang, R.; Zhang, J.; Lin, W. An Efficient Hardware Design for Accelerating Sparse CNNs With NAS-Based Models. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2022**, 41, 597–613.
19. Chang, L.; Zhang, S.; Du, H.; Chen, Y.; Wang, S. A Reconfigurable Neural Network Processor With Tile-Grained Multicore Pipeline for Object Detection on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2021**, 29, 1967–1980.
20. Howard, A.G.; Zhu, M.; Chen, B.; Kalenichenko, D.; Wang, W.; Weyand, T.; Andreetto, M.; Adam, H. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv **2017**, arXiv:1704.04861.
21. Howard, A.; Sandler, M.; Chu, G.; Chen, L.C.; Chen, B.; Tan, M.; Wang, W.; Zhu, Y.; Pang, R.; Vasudevan, V.; et al. Searching for MobileNetV3. In Proceedings of the IEEE/CVF International Conference on Computer Vision, Seoul, Republic of Korea, 27 October–2 November 2019; pp. 1314–1324.
22. Zhang, X.; Zhou, X.; Lin, M.; Sun, J. ShuffleNet: An extremely efficient convolutional neural network for mobile devices. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 6848–6856.
23. Ma, N.; Zhang, X.; Zheng, H.T.; Sun, J. ShuffleNet V2: Practical guidelines for efficient CNN architecture design. In Proceedings of the European Conference on Computer Vision (ECCV), Munich, Germany, 8–14 September 2018; pp. 116–131.
24. Tan, M.; Le, Q. EfficientNet: Rethinking model scaling for convolutional neural networks. In Proceedings of the International Conference on Machine Learning, PMLR, Long Beach, CA, USA, 9–15 June 2019; pp. 6105–6114.
25. Tan, M.; Le, Q. EfficientNetV2: Smaller models and faster training. In Proceedings of the International Conference on Machine Learning, PMLR, Virtual, 18–24 July 2021; pp. 10096–10106.
26. Chollet, F. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Honolulu, HI, USA, 21–26 July 2017; pp. 1251–1258.
27. Fan, H.; Luo, C.; Zeng, C.; Ferianc, M.; Que, Z.; Liu, S.; Niu, X.; Luk, W. F-E3D: FPGA-based acceleration of an efficient 3D convolutional neural network for human action recognition. In Proceedings of the 2019 IEEE 30th International Conference on Application-Specific Systems, Architectures and Processors (ASAP), New York, NY, USA, 15–17 July 2019; Volume 2160, pp. 1–8.
28. Ma, Y.; Suda, N.; Cao, Y.; Seo, J.S.; Vrudhula, S. Scalable and modularized RTL compilation of convolutional neural networks onto FPGA. In Proceedings of the 2016 26th International Conference on Field Programmable Logic and Applications (FPL), Lausanne, Switzerland, 29 August–2 September 2016; pp. 1–8.
29. Knapheide, J.; Stabernack, B.; Kuhnke, M. A high throughput MobileNetV2 FPGA implementation based on a flexible architecture for depthwise separable convolution. In Proceedings of the 2020 30th International Conference on Field-Programmable Logic and Applications (FPL), Gothenburg, Sweden, 31 August–4 September 2020; pp. 277–283.
30. Liu, B.; Zou, D.; Feng, L.; Feng, S.; Fu, P.; Li, J. An FPGA-Based CNN Accelerator Integrating Depthwise Separable Convolution. Electronics **2019**, 8, 281.

**Figure 1.** Comparison of the computational flow of STC and DSC. (**a**) The computational flow of standard convolution. (**b**) The computational flow of depthwise separable convolution. (∗) represents the operation of convolution.

**Figure 5.** Four convolution orders based on the minimum cache and access unit. (**a**) The convolution order with pointwise unit and pointwise loop (PUPL). (**b**) The convolution order with pointwise unit and depthwise loop (PUDL). (**c**) The convolution order with depthwise unit and pointwise loop (DUPL). (**d**) The convolution order with depthwise unit and depthwise loop (DUDL).

**Figure 7.** The diagram of convolution order for $DW{C}_{PUDL}$ and $PW{C}_{PUPL}$. By default, m is equal to n. (**a**) DWC calculation using pointwise unit and depthwise loop. $K\times K\times m$ weights are stored in the on-chip buffer and are updated after the calculation of a $Fragment$. Input feature maps need to be read only once. (**b**) PWC calculation using pointwise unit and pointwise loop. The weights are stored in the on-chip buffer and updated to another $n\times IC$ after the calculation of each $Map$. Input feature maps need to be read multiple times. (∗) represents the operation of convolution.

**Figure 8.** The address mapping between the output unit of $DW{C}_{PUDL}$ and the input unit of $PW{C}_{PUPL}$.

**Figure 10.** Block diagram of the evaluation system. The number of accelerator cores is configurable and is determined by the available FPGA resources and off-chip access bandwidth.

**Table 1.** The minimum preconvolution feature cache size and output order for DWC and PWC with various convolution orders.

| Convolution Order | Convolution Type | Minimum Cache Size (bit) | Output Order |
|---|---|---|---|
| PUPL | DWC | $(({F}_{in}+1)\times (K-1)\times IC+m)\times Q$ | PUPL |
| PUPL | PWC | 0 | PUDL |
| PUDL | DWC | $({F}_{in}\times (K-1)+K)\times m\times Q$ | PUDL |
| PUDL | PWC | 0 | PUDL |
| DUPL | DWC | $({F}_{in}\times IC\times (K-1)+m\times n)\times Q$ | DUPL |
| DUPL | PWC | $m\times n\times Q$ | PUDL |
| DUDL | DWC | $({F}_{in}\times ({F}_{in}\times (n-1)+K-1)+m)\times Q$ | DUDL |
| DUDL | PWC | $({F}_{in}^{2}\times (n-1)+m)\times Q$ | PUDL |
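The DWC formulas in Table 1 can be evaluated directly to compare orders. In the sketch below (illustrative, not from the paper), $F_{in}$ is taken as the input feature-map side length, $K$ the kernel size, $IC$ the input channel count, $m$ and $n$ the unit sizes, and $Q$ the data bit width:

```python
def dwc_min_cache_bits(order, F_in, K, IC, m, n, Q):
    """Minimum preconvolution feature cache size (bits) for DWC,
    evaluating the formulas of Table 1 for each convolution order."""
    if order == "PUPL":
        return ((F_in + 1) * (K - 1) * IC + m) * Q
    if order == "PUDL":
        return (F_in * (K - 1) + K) * m * Q
    if order == "DUPL":
        return (F_in * IC * (K - 1) + m * n) * Q
    if order == "DUDL":
        return (F_in * (F_in * (n - 1) + K - 1) + m) * Q
    raise ValueError(f"unknown order: {order}")

# Example layer (hypothetical): 112 x 112 input, 3 x 3 kernel, 32 channels,
# m = n = 16 units, 32-bit data.
for order in ("PUPL", "PUDL", "DUPL", "DUDL"):
    print(order, dwc_min_cache_bits(order, 112, 3, 32, 16, 16, 32))
```

For these parameters PUDL needs only a few kernel rows of one channel group ($({F}_{in}(K-1)+K)\times m$ elements), which is why it yields the smallest DWC cache among the four orders.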

| | ALM | M20K | DSP |
|---|---|---|---|
| Utilization | 42,116.7 (16.73%) | 1118 (52.46%) | 1082 (64.14%) |

| Work | Network | Platform | Bit-Width | Frequency (MHz) | FPS | GOPS | GFLOPS | DSP | DSP Efficiency ^{1} (GOPS/DSP) |
|---|---|---|---|---|---|---|---|---|---|
| M. Sandler’18 [4] | MobileNetV2 | CPU | - | - | 13.3 | - | - | - | - |
| G. Li’22 [6] | MobileNetV2 | XC7Z100 | 16-bit | 200 | 371.4 | 222.8 | - | 1163 | 0.19 |
| W. Ding’19 [7] | DS-CNN | Arria10 GX 1150 | 16-bit | 180 | - | 98.9 | - | 712 | 0.14 |
| L. Bai’18 [16] | MobileNetV2 | Arria10 SX 660 | 16-bit | 133 | 266.2 | 170.6 | - | 1278 | 0.13 |
| B. Liu’19 [30] | MobileNetV2 | Zynq 7100 | float 32-bit | 100 | - | - | 17.11 | 1926 | 0.02 |
| Proposed | MobileNetV2 | Arria10 GX 660 | float 32-bit | 200 | 205.1 | - | 128.8 | 1082 | 0.24 |

^{1} For a fair comparison, DSP efficiency is normalized to 16-bit: the value of a 32-bit design is multiplied by 2, using the same normalization method as Ref. [5].
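The footnote's normalization can be checked with the table's own numbers for the proposed design:

```python
# Proposed design (float 32-bit): 128.8 GFLOPS on 1082 DSPs.
# Normalized to 16-bit, a 32-bit result counts double.
gflops, dsp = 128.8, 1082
norm_efficiency = 2 * gflops / dsp
print(round(norm_efficiency, 2))  # matches the 0.24 GOPS/DSP in the table
```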

| | L. Bai’18 [16] | G. Li’22 [6] | Proposed |
|---|---|---|---|
| Platform | Intel Arria10 SX660 | Xilinx XC7Z100 | Intel Arria10 GX660 |
| Bit width | 16-bit | 16-bit | float 32-bit |
| Logic Usage | ALM 82K | LUT 183K/FF 231K | ALM 42K |
| Normalized ^{1} Logic Usage | LUT 164K/FF 328K | LUT 183K/FF 231K | LUT 84K/FF 168K |
| DSP Usage | 1278 | 1163 | 1082 |
| Memory Usage (Mb) | 24.5 | - | 13.3 ^{2} |
| Normalized ^{3} Memory Usage (Mb) | 49 | - | 13.3 |
| GOPS/W | - | 12.5 | - |
| GFLOPS/W | - | - | 7.8 |
| GOPS/DSP ^{4} | 0.13 | 0.19 | 0.24 |

^{1} An ALM in an Arria 10 FPGA contains 2 LUTs and 4 registers.

^{2} Total block memory bits.

^{3} Normalized to 32-bit: the value of a 16-bit system is multiplied by 2.

^{4} Normalized to 16-bit: the value of a 32-bit system is multiplied by 2.

| Network | Bit Width | Frequency (MHz) | Total Off-Chip Memory Accesses (Million) | FPS |
|---|---|---|---|---|
| MobileNetV3-Small | 117.7 | 200 | 1.80 | 222.1 |


© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Huang, J.; Liu, X.; Guo, T.; Zhao, Z. A High-Performance FPGA-Based Depthwise Separable Convolution Accelerator. *Electronics* **2023**, *12*, 1571.
https://doi.org/10.3390/electronics12071571
