# Recent Developments in Low-Power AI Accelerators: A Survey


## Abstract


## 1. Introduction

## 2. Background and Related Work

### 2.1. Core Operation in Artificial Neural Networks

### 2.2. Typical Accelerator Structure

### 2.3. Related Work

## 3. Method

### 3.1. Aim and Scope

### 3.2. Criteria for Acceptance

### 3.3. Missing Data

## 4. Results

### 4.1. Power, Throughput, and Power Efficiency

### 4.2. Power, Area, and Clock Frequency

### 4.3. Strategies to Decrease Power

### 4.4. Acceleration Targets

### 4.5. Number and Precision Formats

### 4.6. Neuromorphic Accelerators

### 4.7. Summary of Company Accelerators

### 4.8. Summary of Recent AI Accelerators

## 5. Discussion

## 6. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

## Appendix A

| Accelerator | Acc. Target | Tech (nm) | Area (mm²) | Power (mW) | Freq. (MHz) | Perf. (GOPS) | Power Eff. (GOPS/W) | Area Eff. (GOPS/mm²) | Type |
|---|---|---|---|---|---|---|---|---|---|
| Eyeriss v2 [47] | CNN | 65 | - | 476.25 * | 200.0 | 153.6 | 578.3 | - | simulated |
| Khabbazan et al. [79] | CNN | - | - | 1770.00 | 160.0 | 41.0 | 23.1 | - | FPGA |
| Wang et al. [31] | CNN | 40 | 2.16 | 68.00 | 50.0 | 3.2 * | 47.1 * | 1.5 * | FPGA |
| SparTen [80] | CNN | 45 | 0.77 | 118.30 | 800.0 | 25.6 * | 216.4 * | 33.4 * | simulated |
| DT-CNN [81] | CNN, GAN | 65 | 6.80 | 196.00 | 200.0 | 639.7 | 3260.0 | 94.1 | simulated |
| Dante [82] | DNN | 14 | 2.30 | - | 330.0 | - | - | - | simulated |
| TIE [83] | DNN | 28 | 1.74 | 154.80 | 1000.0 | 7640.0 | 72,900.0 | 43,807.3 * | simulated |
| MnnFast [84] | Transformer | 28 | - | - | 1000.0 | 120.0 | - | - | FPGA |
| ExTensor [85] | SpGEMM | 32 | 98.89 | - | 1000.0 | 128.0 | - | - | simulated |
| WAX [29] | CNN | 28 | 0.33 | 1.55 * | 200.0 | 26.2 * | 16,938.5 * | 206.0 | simulated |
| CGNet-NM [41] | CNN | 28 | 3.86 | 159.50 | 500.0 | 144.0 * | 902.8 * | 37.3 * | simulated |
| Tianjic [61,62] | SNN, ANN | 28 | 14.44 | 950.00 | 300.0 | 1214.1 * | 1278.0 | 84.1 * | ASIC |
| MASR [86] | RNN | 16 | 4.70 | 375.00 | 1000.0 | 544.0 * | 1051.4 * | 194.6 * | simulated |
| Li et al. [87] | CNN | 28 | 10.92 | 154.03 | 240.0 | 604.7 | 2970.0 | 55.4 * | ASIC |
| NPU’19 [66] ‡ | DNN, CNN | 8 | 5.50 | 796.00 | 933.0 | 4423.5 | 8000.0 | 804.3 * | ASIC |
| LNPU [88] | CNN, LSTM, RNN | 45 | 16.00 | 367.00 | 200.0 | 2482.8 * | 3382.5 * | 77.6 * | ASIC |
| eCNN [89] | CNN | 40 | 55.23 | 6950.00 | 250.0 | 41,000.0 | 5899.3 | 742.4 | simulated |
| Lightspeeur 5801S [59] ‡ | CNN | 28 | 36.00 | 224.00 | 200.0 | 2800.0 | 12,600.0 | 77.8 * | ASIC |
| Lightspeeur 2801S [60] ‡ | CNN | - | 49.00 | 600.00 | 100.0 | 5600.0 | 9300.0 | 114.3 * | ASIC |
| Hailo-8 [67] ‡ | DNN | - | 225.00 | 2500.00 | - | 26,000.0 | 10,400.0 * | 115.6 * | ASIC |
| KL520 [57] ‡ | LSTM, CNN | - | - | 575.00 * | 300.0 | 345.0 | 600.0 | - | ASIC |
| Jetson Nano [33] ‡ | DNN | - | 3150.00 | 7500.00 | - | 250.0 | 37.5 * | 0.1 * | ASIC |

\* Calculated based on the configuration and/or published metrics of the specified accelerator.

‡ The published research paper’s authors are affiliated with a company (see Table 6 for specifics).
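The derived columns in these tables follow the usual definitions: power efficiency is throughput divided by power drawn (GOPS/W), and area efficiency is throughput divided by die area (GOPS/mm²). A minimal sketch of that arithmetic (the function names are ours, not from any cited paper; starred entries were derived from configurations and may not reduce to this simple ratio):

```python
def power_efficiency_gops_per_w(perf_gops: float, power_mw: float) -> float:
    # Power efficiency: throughput (GOPS) divided by power, converted from mW to W.
    return perf_gops / (power_mw / 1000.0)


def area_efficiency_gops_per_mm2(perf_gops: float, area_mm2: float) -> float:
    # Area efficiency: throughput (GOPS) divided by die area (mm^2).
    return perf_gops / area_mm2


# Hailo-8 row from the table above: 26,000 GOPS at 2500 mW over 225 mm^2.
print(power_efficiency_gops_per_w(26000.0, 2500.0))  # 10400.0 GOPS/W
print(area_efficiency_gops_per_mm2(26000.0, 225.0))  # ~115.6 GOPS/mm^2
```

For Hailo-8, 26,000 GOPS at 2.5 W gives 10,400 GOPS/W and 26,000 GOPS over 225 mm² gives about 115.6 GOPS/mm², matching the starred entries above.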

| Accelerator | Acc. Target | Tech (nm) | Area (mm²) | Power (mW) | Freq. (MHz) | Perf. (GOPS) | Power Eff. (GOPS/W) | Area Eff. (GOPS/mm²) | Type |
|---|---|---|---|---|---|---|---|---|---|
| JPEG-ACT [90] | CNN | - | 1.48 | 1360.00 | 1455.0 | - | - | - | simulated |
| TIMELY [32] | DNN, CNN | 65 | 91.00 | 0.41 * | 40.0 | 4.2 * | 21,000.0 | 0.0 * | simulated |
| RecNMP [54] | PR | 40 | 0.54 | 169.25 | 240.0 | - | - | - | simulated |
| uGEMM [91] | GEMM | 45 | 0.60 | 165.00 | 400.0 | 102.4 * | 670.5 * | 185.6 * | simulated |
| SpinalFlow [48] | SNN | 28 | 2.09 | 162.40 | 200.0 | 25.6 *† | 157.6 *† | 32.0 *† | simulated |
| NEBULA [51] | SNN, ANN | 32 | 86.73 | 5200.00 | 1200.0 | - | - | - | simulated |
| XpulpNN [43] | DNN, CNN | 22 | 1.00 | 100.00 | 250.0 | 5.0 | 550.0 | 55.0 * | simulated |
| YOSO [42] | SNN | 22 | - | 0.73 | <1.0 | 1.3 *† | 1757.8 *† | - | simulated |
| DRQ [44] | DNN | 45 | 0.32 | 71.96 * | 500.0 | 1584.0 * | 22,160.2 * | 4950.0 * | simulated |
| SmartExchange [92] | DNN, CNN | 28 | - | 2853.00 * | 1000.0 | 8192.0 * | 4054.0 * | - | simulated |
| MatRaptor [93] | SpGEMM | 28 | 2.26 | 1344.95 | 2000.0 | 32.0 | 23.8 * | 14.2 * | simulated |
| DT-CNN [94] | CNN, GAN | 65 | 6.80 | 206.00 | 200.0 | 664.4 | 3220.0 | 97.6 | simulated |
| Zhang et al. [35] | MAC | 45 | - | 199.68 | 25.0 | 121.4 | 610.0 | - | simulated |
| A3 [95] | Transformer | 40 | 2.08 | 98.92 | 1000.0 | 221.0 | 2234.1 * | 106.1 * | simulated |
| OPTIMUS [96] | GEMM | 28 | 5.19 | 928.14 | 200.0 | - | - | - | simulated |
| SpArch [97] | SpGEMM | 40 | 28.49 | 9260.00 | 1000.0 | 16.0 | 1.7 * | 0.6 * | simulated |
| Oh et al. [56] ‡ | CNN, LSTM, RNN | 14 | 9.80 | 2727.00 * | 1500.0 | 3000.0 | 1100.0 | 300.0 | ASIC |
| REVEL [98] | Matrix Alg. | 28 | 1.93 | 1663.30 | 1250.0 | - | - | - | simulated |
| ETHOS-U55 [55] ‡ | CNN, LSTM, RNN | 16 | 0.10 | - | 1000.0 | 512.0 | - | 2880.0 * | ASIC |
| ETHOS-U65 [68] ‡ | CNN, LSTM, RNN | 16 | 0.60 | - | - | 1000.0 | - | 1666.7 * | ASIC |
| KL720 [58] ‡ | CNN, LSTM, RNN | - | - | 1362.50 | 696.0 | 1500.0 | 1250.0 * | - | ASIC |

\* Calculated based on the configuration and/or published metrics of the specified accelerator.

† The corresponding research paper reports results in GSyOPS (giga synaptic operations per second) instead of GOPS.

‡ The published research paper’s authors are affiliated with a company (see Table 6 for specifics).

| Accelerator | Acc. Target | Tech (nm) | Area (mm²) | Power (mW) | Freq. (MHz) | Perf. (GOPS) | Power Eff. (GOPS/W) | Area Eff. (GOPS/mm²) | Type |
|---|---|---|---|---|---|---|---|---|---|
| GoSPA [30] | CNN | 28 | 0.25 * | 26.64 | 500.0 | 16.0 * | 600.6 * | 64.0 * | simulated |
| RingCNN [45] | CNN | 40 | 23.36 | 2220.00 | 250.0 | 41.0 | 28,400.0 | 1755.1 * | simulated |
| SNAFU-ARCH [6] | GEMM | 28 | 1.00 | 0.32 | 50.0 | 0.2 * | 625.0 * | 0.2 * | simulated |
| Cambricon-Q [69] ‡ | CNN, RNN | 45 | 9.18 * | 1030.31 | 1000.0 | 2000.0 | 1941.2 * | 217.9 * | simulated |
| ELSA [99] | Transformer | 40 | 2.10 | 1472.89 | 1000.0 | 1088.0 | 738.7 * | 518.1 * | simulated |
| RaPiD [70,71] ‡ | CNN, LSTM, RNN, Transformer | 7 | 36.00 | 6206.00 * | 1600.0 | 51,200.0 | 8250.0 | 2610.0 | ASIC |
| NPU’21 [72,73] ‡ | DNN, CNN | 5 | 5.46 | 327.00 | 1196.0 | 14,700.0 | 13,600.0 | 4447.2 * | simulated |
| Lin et al. [37] | MLP | 40 | 0.57 | 93.00 | 1600.0 | 899.0 | 9700.0 | 1577.2 * | simulated |
| PermCNN [46] | DNN, CNN | 28 | 3.44 | 500.00 | 800.0 | 102.4 * | 204.8 * | 29.8 * | simulated |
| GAMMA [100] | SpGEMM | 45 | 30.60 | - | 1000.0 | 20.0 | - | 0.7 * | simulated |
| FPRaker [38] | CNN | 65 | 771.61 * | 3942.00 * | 600.0 | 11,059.2 * | 2835.7 * | 14.3 * | simulated |
| Bitlet [101] | DNN, CNN, GAN, Transformer | 28 | 1.54 | 366.00 | 1000.0 | 744.7 | 1335.9 | 483.6 * | simulated |
| RASA [102] | GEMM | 15 | 0.90 * | - | 500.0 | 128.0 * | - | 138.5 * | simulated |
| StepStone PIM [103] | GEMM | - | - | 1400.00 | 1200.0 | - | - | - | simulated |
| SpAtten [39] | Transformer | 40 | 18.71 | 8300.00 | 1000.0 | 2880.0 * | 347.0 * | 153.9 * | simulated |
| IMPULSE [28] | SNN | 65 | 0.09 | 0.20 | 200.0 | 0.2 | 990.0 | 2.2 | ASIC |
| VSA [63] | SNN | 40 | - | 88.97 | 500.0 | 2304.0 | 25,900.0 | - | simulated |
| Zeng et al. [104] | CNN | 65 | - | 478.00 | 250.0 | 576.0 | 1205.0 | - | simulated |
| Ethos-N78 [74] ‡ | CNN, LSTM, RNN | - | - | - | - | 10,000.0 | - | - | ASIC |
| KL530 [75] ‡ | CNN, LSTM, RNN, Transformer | - | - | 500.00 | - | 1000.0 | 2000.0 * | - | ASIC |
| q16 [76] ‡ | DNN | 16 | 256.00 | 5500.00 | 1000.0 | 64,000.0 | 11,636.4 * | 253.9 | ASIC |
| M1076 AMP [77] ‡ | DNN | - | 294.50 | 4000.00 | - | 25,000.0 | 6250.0 * | 84.9 | ASIC |
| VE2002 [78] ‡ | DNN, CNN | 7 | - | 6000.00 | - | 5000.0 | 833.3 * | - | ASIC |
| VE2102 [78] ‡ | DNN, CNN | 7 | - | 7000.00 | - | 8000.0 | 1142.9 * | - | ASIC |

\* Calculated based on the configuration and/or published metrics of the specified accelerator.

‡ The published research paper’s authors are affiliated with a company (see Table 6 for specifics).

| Accelerator | Acc. Target | Tech (nm) | Area (mm²) | Power (mW) | Freq. (MHz) | Perf. (GOPS) | Power Eff. (GOPS/W) | Area Eff. (GOPS/mm²) | Type |
|---|---|---|---|---|---|---|---|---|---|
| RedMulE [27] | GEMM | 22 | 0.50 | 43.50 | 476.0 | 30.0 | 688.0 | 60.0 * | simulated |
| DOTA [36] | Transformer | 22 | 2.75 | 3020.00 | 100.0 | 2048.0 * | 678.1 * | 745.8 * | simulated |
| EcoFlow [49] | CNN, GAN | 45 | - | 332.00 | 200.0 | 39.0 * | 117.5 * | - | simulated |
| FINGERS [105] | Graph Mining | 28 | 18.68 * | 3682.00 * | 1000.0 | - | - | - | simulated |
| DTQAtten [106] | Transformer | 40 | 1.41 | - | 1000.0 | 952.8 | - | 678.4 | simulated |
| Chen et al. [64] | SNN | 28 | 0.89 | 149.30 | 500.0 | - | - | - | simulated |
| Skydiver [65] | SNN | - | - | 960.00 | 200.0 | 22.6 † | 19.3 † | - | FPGA |
| DIMMining [40] | Graph Mining | 32 | 0.38 | 105.82 | 500.0 | - | - | - | simulated |
| MeNDA-PU [50] | SpMatrix Transposition | 40 | 7.10 | 92.40 | 800.0 | - | - | - | simulated |
| Ubrain [107] | DNN, CNN, RNN | 32 | 4.00 | 45.00 | 400.0 | - | - | - | simulated |
| Mokey [108] | Transformer | 65 | 23.90 | 4478.00 * | 1000.0 | - | - | - | simulated |
| HP-LeOPArd [109] | Attention | 65 | 3.47 | - | 800.0 | 574.1 | - | 165.5 | simulated |

\* Calculated based on the configuration and/or published metrics of the specified accelerator.

† The corresponding research paper reports results in GSyOPS (giga synaptic operations per second) instead of GOPS.

## References

- Amant, R.S.; Jiménez, D.A.; Burger, D. Low-power, high-performance analog neural branch prediction. In Proceedings of the 2008 41st IEEE/ACM International Symposium on Microarchitecture, Lake Como, Italy, 8–12 November 2008; IEEE: Piscataway, NJ, USA, 2008; pp. 447–458. [Google Scholar]
- Chen, Y.; Zheng, B.; Zhang, Z.; Wang, Q.; Shen, C.; Zhang, Q. Deep Learning on Mobile and Embedded Devices: State-of-the-Art, Challenges, and Future Directions. ACM Comput. Surv. **2020**, 53, 1–37. [Google Scholar] [CrossRef]
- Theis, T.N.; Wong, H.S.P. The End of Moore’s Law: A New Beginning for Information Technology. Comput. Sci. Eng. **2017**, 19, 41–50. [Google Scholar] [CrossRef]
- Hennessy, J.L.; Patterson, D.A. A New Golden Age for Computer Architecture. Commun. ACM **2019**, 62, 48–60. [Google Scholar] [CrossRef] [Green Version]
- Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. Survey and Benchmarking of Machine Learning Accelerators. In Proceedings of the 2019 IEEE High Performance Extreme Computing Conference (HPEC), Waltham, MA, USA, 24–26 September 2019; pp. 1–9. [Google Scholar] [CrossRef] [Green Version]
- Gobieski, G.; Atli, A.O.; Mai, K.; Lucia, B.; Beckmann, N. Snafu: An Ultra-Low-Power, Energy-Minimal CGRA-Generation Framework and Architecture. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 1027–1040. [Google Scholar] [CrossRef]
- Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. Survey of Machine Learning Accelerators. In Proceedings of the 2020 IEEE High Performance Extreme Computing Conference (HPEC), Greater Boston Area, MA, USA, 22–24 September 2020; pp. 1–12. [Google Scholar] [CrossRef]
- Reuther, A.; Michaleas, P.; Jones, M.; Gadepally, V.; Samsi, S.; Kepner, J. AI Accelerator Survey and Trends. In Proceedings of the 2021 IEEE High Performance Extreme Computing Conference (HPEC), Virtual, 19–23 September 2022; pp. 1–9. [Google Scholar] [CrossRef]
- Lin, W.; Adetomi, A.; Arslan, T. Low-Power Ultra-Small Edge AI Accelerators for Image Recognition with Convolution Neural Networks: Analysis and Future Directions. Electronics **2021**, 10, 2048. [Google Scholar] [CrossRef]
- Sze, V.; Chen, Y.H.; Yang, T.J.; Emer, J.S. Efficient Processing of Deep Neural Networks: A Tutorial and Survey. Proc. IEEE **2017**, 105, 2295–2329. [Google Scholar] [CrossRef] [Green Version]
- Nabavinejad, S.M.; Baharloo, M.; Chen, K.C.; Palesi, M.; Kogel, T.; Ebrahimi, M. An Overview of Efficient Interconnection Networks for Deep Neural Network Accelerators. IEEE J. Emerg. Sel. Top. Circuits Syst. **2020**, 10, 268–282. [Google Scholar] [CrossRef]
- Chen, T.; Du, Z.; Sun, N.; Wang, J.; Wu, C.; Chen, Y.; Temam, O. DianNao: A Small-Footprint High-Throughput Accelerator for Ubiquitous Machine-Learning. In Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems, Salt Lake City, UT, USA, 1–5 March 2014; Association for Computing Machinery: New York, NY, USA, 2014; pp. 269–284. [Google Scholar] [CrossRef]
- Chen, Y.; Luo, T.; Liu, S.; Zhang, S.; He, L.; Wang, J.; Li, L.; Chen, T.; Xu, Z.; Sun, N.; et al. DaDianNao: A Machine-Learning Supercomputer. In Proceedings of the 2014 47th Annual IEEE/ACM International Symposium on Microarchitecture, Cambridge, UK, 13–17 December 2014; pp. 609–622. [Google Scholar] [CrossRef]
- Du, Z.; Fasthuber, R.; Chen, T.; Ienne, P.; Li, L.; Luo, T.; Feng, X.; Chen, Y.; Temam, O. ShiDianNao: Shifting Vision Processing Closer to the Sensor. SIGARCH Comput. Archit. News **2015**, 43, 92–104. [Google Scholar] [CrossRef]
- Liu, D.; Chen, T.; Liu, S.; Zhou, J.; Zhou, S.; Teman, O.; Feng, X.; Zhou, X.; Chen, Y. PuDianNao: A Polyvalent Machine Learning Accelerator. In Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, Istanbul, Turkey, 14–18 March 2015; Association for Computing Machinery: New York, NY, USA, 2015; pp. 369–381. [Google Scholar] [CrossRef]
- Akopyan, F.; Sawada, J.; Cassidy, A.; Alvarez-Icaza, R.; Arthur, J.; Merolla, P.; Imam, N.; Nakamura, Y.; Datta, P.; Nam, G.J.; et al. TrueNorth: Design and Tool Flow of a 65 mW 1 Million Neuron Programmable Neurosynaptic Chip. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2015**, 34, 1537–1557. [Google Scholar] [CrossRef]
- DeBole, M.V.; Taba, B.; Amir, A.; Akopyan, F.; Andreopoulos, A.; Risk, W.P.; Kusnitz, J.; Ortega Otero, C.; Nayak, T.K.; Appuswamy, R.; et al. TrueNorth: Accelerating From Zero to 64 Million Neurons in 10 Years. Computer **2019**, 52, 20–29. [Google Scholar] [CrossRef]
- Chen, Y.H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits **2017**, 52, 127–138. [Google Scholar] [CrossRef] [Green Version]
- Ibtesam, M.; Solangi, U.S.; Kim, J.; Ansari, M.A.; Park, S. Highly Efficient Test Architecture for Low-Power AI Accelerators. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2022**, 41, 2728–2738. [Google Scholar] [CrossRef]
- Shrestha, A.; Fang, H.; Mei, Z.; Rider, D.P.; Wu, Q.; Qiu, Q. A Survey on Neuromorphic Computing: Models and Hardware. IEEE Circuits Syst. Mag. **2022**, 22, 6–35. [Google Scholar] [CrossRef]
- Schuman, C.D.; Potok, T.E.; Patton, R.M.; Birdwell, J.D.; Dean, M.E.; Rose, G.S.; Plank, J.S. A Survey of Neuromorphic Computing and Neural Networks in Hardware. arXiv **2017**, arXiv:1705.06963. [Google Scholar]
- Seo, J.S.; Saikia, J.; Meng, J.; He, W.; Suh, H.S.; Anupreetham; Liao, Y.; Hasssan, A.; Yeo, I. Digital Versus Analog Artificial Intelligence Accelerators: Advances, Trends, and Emerging Designs. IEEE Solid-State Circuits Mag. **2022**, 14, 65–79. [Google Scholar] [CrossRef]
- Sunny, F.P.; Taheri, E.; Nikdast, M.; Pasricha, S. A Survey on Silicon Photonics for Deep Learning. J. Emerg. Technol. Comput. Syst. **2021**, 17, 1–57. [Google Scholar] [CrossRef]
- Talib, M.A.; Majzoub, S.; Nasir, Q.; Jamal, D. A systematic literature review on hardware implementation of artificial intelligence algorithms. J. Supercomput. **2021**, 77, 1897–1938. [Google Scholar] [CrossRef]
- Li, W.; Liewig, M. A survey of AI accelerators for edge environment. In Proceedings of the World Conference on Information Systems and Technologies, Budva, Montenegro, 7–10 April 2020; Springer: Berlin/Heidelberg, Germany, 2020; pp. 35–44. [Google Scholar]
- Wohlin, C. Guidelines for snowballing in systematic literature studies and a replication in software engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering, London, UK, 13–14 May 2014; pp. 1–10. [Google Scholar]
- Tortorella, Y.; Bertaccini, L.; Rossi, D.; Benini, L.; Conti, F. RedMulE: A Compact FP16 Matrix-Multiplication Accelerator for Adaptive Deep Learning on RISC-V-Based Ultra-Low-Power SoCs. arXiv **2022**, arXiv:2204.11192. [Google Scholar]
- Agrawal, A.; Ali, M.; Koo, M.; Rathi, N.; Jaiswal, A.; Roy, K. IMPULSE: A 65-nm Digital Compute-in-Memory Macro With Fused Weights and Membrane Potential for Spike-Based Sequential Learning Tasks. IEEE Solid-State Circuits Lett. **2021**, 4, 137–140. [Google Scholar] [CrossRef]
- Gudaparthi, S.; Narayanan, S.; Balasubramonian, R.; Giacomin, E.; Kambalasubramanyam, H.; Gaillardon, P.E. Wire-Aware Architecture and Dataflow for CNN Accelerators. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 1–13. [Google Scholar] [CrossRef]
- Deng, C.; Sui, Y.; Liao, S.; Qian, X.; Yuan, B. GoSPA: An Energy-efficient High-performance Globally Optimized SParse Convolutional Neural Network Accelerator. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 1110–1123. [Google Scholar] [CrossRef]
- Wang, M.; Chandrakasan, A.P. Flexible Low Power CNN Accelerator for Edge Computing with Weight Tuning. In Proceedings of the 2019 IEEE Asian Solid-State Circuits Conference (A-SSCC), Macau, China, 4–6 November 2019; pp. 209–212. [Google Scholar] [CrossRef]
- Li, W.; Xu, P.; Zhao, Y.; Li, H.; Xie, Y.; Lin, Y. Timely: Pushing Data Movements And Interfaces In Pim Accelerators Towards Local And In Time Domain. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 832–845. [Google Scholar] [CrossRef]
- NVIDIA Corporation. JETSON NANO. Available online: https://www.nvidia.com/en-us/autonomous-machines/embedded-systems/jetson-nano/product-development/ (accessed on 29 June 2022).
- Dennard, R.; Gaensslen, F.; Yu, H.N.; Rideout, V.; Bassous, E.; LeBlanc, A. Design of ion-implanted MOSFET’s with very small physical dimensions. IEEE J. Solid-State Circuits **1974**, 9, 256–268. [Google Scholar] [CrossRef] [Green Version]
- Zhang, S.; Huang, K.; Shen, H. A Robust 8-Bit Non-Volatile Computing-in-Memory Core for Low-Power Parallel MAC Operations. IEEE Trans. Circuits Syst. I Regul. Pap. **2020**, 67, 1867–1880. [Google Scholar] [CrossRef]
- Qu, Z.; Liu, L.; Tu, F.; Chen, Z.; Ding, Y.; Xie, Y. DOTA: Detect and Omit Weak Attentions for Scalable Transformer Acceleration. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February–4 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 14–26. [Google Scholar] [CrossRef]
- Lin, W.C.; Chang, Y.C.; Huang, J.D. An Efficient and Low-Power MLP Accelerator Architecture Supporting Structured Pruning, Sparse Activations and Asymmetric Quantization for Edge Computing. In Proceedings of the 2021 IEEE 3rd International Conference on Artificial Intelligence Circuits and Systems (AICAS), Washington, DC, USA, 6–9 June 2021; pp. 1–5. [Google Scholar] [CrossRef]
- Awad, O.M.; Mahmoud, M.; Edo, I.; Zadeh, A.H.; Bannon, C.; Jayarajan, A.; Pekhimenko, G.; Moshovos, A. FPRaker: A Processing Element For Accelerating Neural Network Training. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual, 18–22 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 857–869. [Google Scholar] [CrossRef]
- Wang, H.; Zhang, Z.; Han, S. SpAtten: Efficient Sparse Attention Architecture with Cascade Token and Head Pruning. In Proceedings of the 2021 IEEE International Symposium on High-Performance Computer Architecture (HPCA), Seoul, Korea, 27 February–3 March 2021; pp. 97–110. [Google Scholar] [CrossRef]
- Dai, G.; Zhu, Z.; Fu, T.; Wei, C.; Wang, B.; Li, X.; Xie, Y.; Yang, H.; Wang, Y. DIMMining: Pruning-Efficient and Parallel Graph Mining on near-Memory-Computing. In Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, NY, USA, 18–22 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 130–145. [Google Scholar] [CrossRef]
- Hua, W.; Zhou, Y.; De Sa, C.; Zhang, Z.; Suh, G.E. Boosting the Performance of CNN Accelerators with Dynamic Fine-Grained Channel Gating. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 139–150. [Google Scholar] [CrossRef]
- P, S.; Chu, K.T.N.; Tavva, Y.; Wu, J.; Zhang, M.; Li, H.; Carlson, T.E. You Only Spike Once: Improving Energy-Efficient Neuromorphic Inference to ANN-Level Accuracy. arXiv **2020**, arXiv:2006.09982. [Google Scholar]
- Garofalo, A.; Tagliavini, G.; Conti, F.; Rossi, D.; Benini, L. XpulpNN: Accelerating Quantized Neural Networks on RISC-V Processors Through ISA Extensions. In Proceedings of the 2020 Design, Automation Test in Europe Conference Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 186–191. [Google Scholar] [CrossRef]
- Song, Z.; Fu, B.; Wu, F.; Jiang, Z.; Jiang, L.; Jing, N.; Liang, X. DRQ: Dynamic Region-based Quantization for Deep Neural Network Acceleration. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 1010–1021. [Google Scholar] [CrossRef]
- Huang, C.T. RingCNN: Exploiting Algebraically-Sparse Ring Tensors for Energy-Efficient CNN-Based Computational Imaging. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 1096–1109. [Google Scholar] [CrossRef]
- Deng, C.; Liao, S.; Yuan, B. PermCNN: Energy-Efficient Convolutional Neural Network Hardware Architecture With Permuted Diagonal Structure. IEEE Trans. Comput. **2021**, 70, 163–173. [Google Scholar] [CrossRef]
- Chen, Y.H.; Yang, T.J.; Emer, J.; Sze, V. Eyeriss v2: A Flexible Accelerator for Emerging Deep Neural Networks on Mobile Devices. IEEE J. Emerg. Sel. Top. Circuits Syst. **2019**, 9, 292–308. [Google Scholar] [CrossRef]
- Narayanan, S.; Taht, K.; Balasubramonian, R.; Giacomin, E.; Gaillardon, P.E. SpinalFlow: An Architecture and Dataflow Tailored for Spiking Neural Networks. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 349–362. [Google Scholar] [CrossRef]
- Orosa, L.; Koppula, S.; Umuroglu, Y.; Kanellopoulos, K.; Gómez-Luna, J.; Blott, M.; Vissers, K.A.; Mutlu, O. EcoFlow: Efficient Convolutional Dataflows for Low-Power Neural Network Accelerators. arXiv **2022**, arXiv:2202.02310. [Google Scholar]
- Feng, S.; He, X.; Chen, K.Y.; Ke, L.; Zhang, X.; Blaauw, D.; Mudge, T.; Dreslinski, R. MeNDA: A near-Memory Multi-Way Merge Solution for Sparse Transposition and Dataflows. In Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, NY, USA, 18–22 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 245–258. [Google Scholar] [CrossRef]
- Singh, S.; Sarma, A.; Jao, N.; Pattnaik, A.; Lu, S.; Yang, K.; Sengupta, A.; Narayanan, V.; Das, C.R. NEBULA: A Neuromorphic Spin-Based Ultra-Low Power Architecture for SNNs and ANNs. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 363–376. [Google Scholar] [CrossRef]
- Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE **1998**, 86, 2278–2324. [Google Scholar] [CrossRef] [Green Version]
- Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is All you Need. In Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Guyon, I., Luxburg, U.V., Bengio, S., Wallach, H., Fergus, R., Vishwanathan, S., Garnett, R., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2017; Volume 30. [Google Scholar]
- Ke, L.; Gupta, U.; Cho, B.Y.; Brooks, D.; Chandra, V.; Diril, U.; Firoozshahian, A.; Hazelwood, K.; Jia, B.; Lee, H.H.S.; et al. RecNMP: Accelerating Personalized Recommendation with Near-Memory Processing. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 790–803. [Google Scholar] [CrossRef]
- ARM Limited. ARM MICRONPU ETHOS-U55. Available online: https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55 (accessed on 29 June 2022).
- Oh, J.; Lee, S.K.; Kang, M.; Ziegler, M.; Silberman, J.; Agrawal, A.; Venkataramani, S.; Fleischer, B.; Guillorn, M.; Choi, J.; et al. A 3.0 TFLOPS 0.62V Scalable Processor Core for High Compute Utilization AI Training and Inference. In Proceedings of the 2020 IEEE Symposium on VLSI Circuits, Honolulu, HI, USA, 14–19 June 2020; pp. 1–2. [Google Scholar] [CrossRef]
- Kneron. KL520 AI SoC. Available online: https://www.kneron.com/cn/page/soc/ (accessed on 29 June 2022).
- Kneron. KL720 AI SoC. Available online: https://www.kneron.com/cn/page/soc/ (accessed on 29 June 2022).
- Gyrfalcon Technology Inc. LIGHTSPEEUR® 5801S NEURAL ACCELERATOR. Available online: https://www.gyrfalcontech.ai/solutions/lightspeeur-5801/ (accessed on 29 June 2022).
- Gyrfalcon Technology Inc. LIGHTSPEEUR® 2801S NEURAL ACCELERATOR. Available online: https://www.gyrfalcontech.ai/solutions/2801s/ (accessed on 29 June 2022).
- Pei, J.; Deng, L.; Song, S.; Zhao, M.; Zhang, Y.; Wu, S.; Wang, G.; Zou, Z.; Wu, Z.; He, W.; et al. Towards artificial general intelligence with hybrid Tianjic chip architecture. Nature **2019**, 572, 106–111. [Google Scholar]
- Deng, L.; Wang, G.; Li, G.; Li, S.; Liang, L.; Zhu, M.; Wu, Y.; Yang, Z.; Zou, Z.; Pei, J.; et al. Tianjic: A Unified and Scalable Chip Bridging Spike-Based and Continuous Neural Computation. IEEE J. Solid-State Circuits **2020**, 55, 2228–2246. [Google Scholar] [CrossRef]
- Lien, H.H.; Hsu, C.W.; Chang, T.S. VSA: Reconfigurable Vectorwise Spiking Neural Network Accelerator. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Daegu, Korea, 22–28 May 2021; pp. 1–5. [Google Scholar] [CrossRef]
- Chen, Q.; He, G.; Wang, X.; Xu, J.; Shen, S.; Chen, H.; Fu, Y.; Li, L. A 67.5 μJ/Prediction Accelerator for Spiking Neural Networks in Image Segmentation. IEEE Trans. Circuits Syst. II Express Briefs **2022**, 69, 574–578. [Google Scholar] [CrossRef]
- Chen, Q.; Gao, C.; Fang, X.; Luan, H. Skydiver: A Spiking Neural Network Accelerator Exploiting Spatio-Temporal Workload Balance. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2022**, 1. [Google Scholar] [CrossRef]
- Song, J.; Cho, Y.; Park, J.S.; Jang, J.W.; Lee, S.; Song, J.H.; Lee, J.G.; Kang, I. 7.1 An 11.5TOPS/W 1024-MAC Butterfly Structure Dual-Core Sparsity-Aware Neural Processing Unit in 8nm Flagship Mobile SoC. In Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 17–21 February 2019; pp. 130–132. [Google Scholar] [CrossRef]
- Hailo. Hailo-8™ AI Processor. Available online: https://hailo.ai/product-hailo/hailo-8 (accessed on 29 June 2022).
- ARM Limited. ARM MICRONPU ETHOS-U65. Available online: https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u65 (accessed on 29 June 2022).
- Zhao, Y.; Liu, C.; Du, Z.; Guo, Q.; Hu, X.; Zhuang, Y.; Zhang, Z.; Song, X.; Li, W.; Zhang, X.; et al. Cambricon-Q: A Hybrid Architecture for Efficient Training. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 706–719. [Google Scholar] [CrossRef]
- Venkataramani, S.; Srinivasan, V.; Wang, W.; Sen, S.; Zhang, J.; Agrawal, A.; Kar, M.; Jain, S.; Mannari, A.; Tran, H.; et al. RaPiD: AI Accelerator for Ultra-low Precision Training and Inference. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 153–166. [Google Scholar] [CrossRef]
- Agrawal, A.; Lee, S.K.; Silberman, J.; Ziegler, M.; Kang, M.; Venkataramani, S.; Cao, N.; Fleischer, B.; Guillorn, M.; Cohen, M.; et al. 9.1 A 7nm 4-Core AI Chip with 25.6TFLOPS Hybrid FP8 Training, 102.4TOPS INT4 Inference and Workload-Aware Throttling. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; Volume 64, pp. 144–146. [Google Scholar] [CrossRef]
- Jang, J.W.; Lee, S.; Kim, D.; Park, H.; Ardestani, A.S.; Choi, Y.; Kim, C.; Kim, Y.; Yu, H.; Abdel-Aziz, H.; et al. Sparsity-Aware and Re-configurable NPU Architecture for Samsung Flagship Mobile SoC. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 15–28. [Google Scholar] [CrossRef]
- Park, J.S.; Jang, J.W.; Lee, H.; Lee, D.; Lee, S.; Jung, H.; Lee, S.; Kwon, S.; Jeong, K.; Song, J.H.; et al. 9.5 A 6K-MAC Feature-Map-Sparsity-Aware Neural Processing Unit in 5nm Flagship Mobile SoC. In Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; Volume 64, pp. 152–154. [Google Scholar] [CrossRef]
- ARM Limited. ARM NPU ETHOS-N78. Available online: https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-n78 (accessed on 29 June 2022).
- Kneron. KL530 AI SoC. Available online: https://www.kneron.com/cn/page/soc/ (accessed on 29 June 2022).
- Quadric. Quadric Dev Kit. Available online: https://www.quadric.io/technology/devkit (accessed on 30 June 2022).
- Mythic. M1076 Analog Matrix Processor. Available online: https://mythic.ai/products/m1076-analog-matrix-processor/ (accessed on 30 June 2022).
- Advanced Micro Devices, Inc. Versal AI Edge Series. Available online: https://www.xilinx.com/products/silicon-devices/acap/versal-ai-edge.html (accessed on 30 June 2022).
- Khabbazan, B.; Mirzakuchaki, S. Design and Implementation of a Low-Power, Embedded CNN Accelerator on a Low-end FPGA. In Proceedings of the 2019 22nd Euromicro Conference on Digital System Design (DSD), Kallithea, Greece, 28–30 August 2019; pp. 647–650. [Google Scholar] [CrossRef]
- Gondimalla, A.; Chesnut, N.; Thottethodi, M.; Vijaykumar, T.N. SparTen: A Sparse Tensor Accelerator for Convolutional Neural Networks. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 151–165. [Google Scholar] [CrossRef]
- Im, D.; Han, D.; Choi, S.; Kang, S.; Yoo, H.J. DT-CNN: Dilated and Transposed Convolution Neural Network Accelerator for Real-Time Image Segmentation on Mobile Devices. In Proceedings of the 2019 IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–5. [Google Scholar] [CrossRef]
- Chandramoorthy, N.; Swaminathan, K.; Cochet, M.; Paidimarri, A.; Eldridge, S.; Joshi, R.V.; Ziegler, M.M.; Buyuktosunoglu, A.; Bose, P. Resilient Low Voltage Accelerators for High Energy Efficiency. In Proceedings of the 2019 IEEE International Symposium on High Performance Computer Architecture (HPCA), Washington, DC, USA, 16–20 February 2019; pp. 147–158. [Google Scholar] [CrossRef]
- Deng, C.; Sun, F.; Qian, X.; Lin, J.; Wang, Z.; Yuan, B. TIE: Energy-Efficient Tensor Train-Based Inference Engine for Deep Neural Network. In Proceedings of the 46th International Symposium on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 264–278. [Google Scholar] [CrossRef]
- Jang, H.; Kim, J.; Jo, J.E.; Lee, J.; Kim, J. MnnFast: A Fast and Scalable System Architecture for Memory-Augmented Neural Networks. In Proceedings of the 46th International Symposium on Computer Architecture, Phoenix, AZ, USA, 22–26 June 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 250–263. [Google Scholar] [CrossRef]
- Hegde, K.; Asghari-Moghaddam, H.; Pellauer, M.; Crago, N.; Jaleel, A.; Solomonik, E.; Emer, J.; Fletcher, C.W. ExTensor: An Accelerator for Sparse Tensor Algebra. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 319–333. [Google Scholar] [CrossRef]
- Gupta, U.; Reagen, B.; Pentecost, L.; Donato, M.; Tambe, T.; Rush, A.M.; Wei, G.Y.; Brooks, D. MASR: A Modular Accelerator for Sparse RNNs. In Proceedings of the 2019 28th International Conference on Parallel Architectures and Compilation Techniques (PACT), Seattle, WA, USA, 23–26 September 2019; pp. 1–14. [Google Scholar] [CrossRef] [Green Version]
- Li, Z.; Chen, Y.; Gong, L.; Liu, L.; Sylvester, D.; Blaauw, D.; Kim, H.S. An 879GOPS 243mW 80fps VGA Fully Visual CNN-SLAM Processor for Wide-Range Autonomous Exploration. In Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 17–21 February 2019; pp. 134–136. [Google Scholar] [CrossRef]
- Lee, J.; Lee, J.; Han, D.; Lee, J.; Park, G.; Yoo, H.J. 7.7 LNPU: A 25.3TFLOPS/W Sparse Deep-Neural-Network Learning Processor with Fine-Grained Mixed Precision of FP8-FP16. In Proceedings of the 2019 IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 17–21 February 2019; pp. 142–144. [Google Scholar] [CrossRef]
- Huang, C.T.; Ding, Y.C.; Wang, H.C.; Weng, C.W.; Lin, K.P.; Wang, L.W.; Chen, L.D. ECNN: A Block-Based and Highly-Parallel CNN Accelerator for Edge Inference. In Proceedings of the 52nd Annual IEEE/ACM International Symposium on Microarchitecture, Columbus, OH, USA, 12–16 October 2019; Association for Computing Machinery: New York, NY, USA, 2019; pp. 182–195. [Google Scholar] [CrossRef] [Green Version]
- Evans, R.D.; Liu, L.; Aamodt, T.M. JPEG-ACT: Accelerating Deep Learning via Transform-based Lossy Compression. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 860–873. [Google Scholar] [CrossRef]
- Wu, D.; Li, J.; Yin, R.; Hsiao, H.; Kim, Y.; Miguel, J.S. UGEMM: Unary Computing Architecture for GEMM Applications. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 377–390. [Google Scholar] [CrossRef]
- Zhao, Y.; Chen, X.; Wang, Y.; Li, C.; You, H.; Fu, Y.; Xie, Y.; Wang, Z.; Lin, Y. SmartExchange: Trading Higher-cost Memory Storage/Access for Lower-cost Computation. In Proceedings of the 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), Virtual, 30 May–3 June 2020; pp. 954–967. [Google Scholar] [CrossRef]
- Srivastava, N.; Jin, H.; Liu, J.; Albonesi, D.; Zhang, Z. MatRaptor: A Sparse-Sparse Matrix Multiplication Accelerator Based on Row-Wise Product. In Proceedings of the 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Athens, Greece, 17–21 October 2020; pp. 766–780. [Google Scholar] [CrossRef]
- Im, D.; Han, D.; Choi, S.; Kang, S.; Yoo, H.J. DT-CNN: An Energy-Efficient Dilated and Transposed Convolutional Neural Network Processor for Region of Interest Based Image Segmentation. *IEEE Trans. Circuits Syst. Regul. Pap.* **2020**, *67*, 3471–3483. [Google Scholar] [CrossRef]
- Ham, T.J.; Jung, S.J.; Kim, S.; Oh, Y.H.; Park, Y.; Song, Y.; Park, J.H.; Lee, S.; Park, K.; Lee, J.W.; et al. A3: Accelerating Attention Mechanisms in Neural Networks with Approximation. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 22–26 February 2020; pp. 328–341. [Google Scholar] [CrossRef] [Green Version]
- Park, J.; Yoon, H.; Ahn, D.; Choi, J.; Kim, J.J. OPTIMUS: OPTImized matrix MUltiplication Structure for Transformer neural network accelerator. In Proceedings of the Machine Learning and Systems, Austin, TX, USA, 2–4 March 2020; Volume 2, pp. 363–378. [Google Scholar]
- Zhang, Z.; Wang, H.; Han, S.; Dally, W.J. SpArch: Efficient Architecture for Sparse Matrix Multiplication. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 22–26 February 2020; pp. 261–274. [Google Scholar] [CrossRef] [Green Version]
- Weng, J.; Liu, S.; Wang, Z.; Dadu, V.; Nowatzki, T. A Hybrid Systolic-Dataflow Architecture for Inductive Matrix Algorithms. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA), San Diego, CA, USA, 22–26 February 2020; pp. 703–716. [Google Scholar] [CrossRef]
- Ham, T.J.; Lee, Y.; Seo, S.H.; Kim, S.; Choi, H.; Jung, S.J.; Lee, J.W. ELSA: Hardware-Software Co-design for Efficient, Lightweight Self-Attention Mechanism in Neural Networks. In Proceedings of the 2021 ACM/IEEE 48th Annual International Symposium on Computer Architecture (ISCA), Valencia, Spain, 14–18 June 2021; pp. 692–705. [Google Scholar] [CrossRef]
- Zhang, G.; Attaluri, N.; Emer, J.S.; Sanchez, D. Gamma: Leveraging Gustavson’s Algorithm to Accelerate Sparse Matrix Multiplication. In Proceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Virtual, 19–23 April 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 687–701. [Google Scholar] [CrossRef]
- Lu, H.; Chang, L.; Li, C.; Zhu, Z.; Lu, S.; Liu, Y.; Zhang, M. Distilling Bit-Level Sparsity Parallelism for General Purpose Deep Learning Acceleration. In Proceedings of the MICRO-54: 54th Annual IEEE/ACM International Symposium on Microarchitecture, Virtual, 18–22 October 2021; Association for Computing Machinery: New York, NY, USA, 2021; pp. 963–976. [Google Scholar] [CrossRef]
- Jeong, G.; Qin, E.; Samajdar, A.; Hughes, C.J.; Subramoney, S.; Kim, H.; Krishna, T. RASA: Efficient Register-Aware Systolic Array Matrix Engine for CPU. In Proceedings of the 2021 58th ACM/IEEE Design Automation Conference (DAC), San Francisco, CA, USA, 5–9 December 2021; pp. 253–258. [Google Scholar] [CrossRef]
- Cho, B.Y.; Jung, J.; Erez, M. Accelerating Bandwidth-Bound Deep Learning Inference with Main-Memory Accelerators. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, St. Louis, MO, USA, 14–19 November 2021; Association for Computing Machinery: New York, NY, USA, 2021. [Google Scholar] [CrossRef]
- Zeng, Y.; Sun, H.; Katto, J.; Fan, Y. Accelerating Convolutional Neural Network Inference Based on a Reconfigurable Sliced Systolic Array. In Proceedings of the 2021 IEEE International Symposium on Circuits and Systems (ISCAS), Virtual, 22–28 May 2021; pp. 1–5. [Google Scholar] [CrossRef]
- Chen, Q.; Tian, B.; Gao, M. FINGERS: Exploiting Fine-Grained Parallelism in Graph Mining Accelerators. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Lausanne, Switzerland, 28 February–4 March 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 43–55. [Google Scholar] [CrossRef]
- Yang, T.; Li, D.; Song, Z.; Zhao, Y.; Liu, F.; Wang, Z.; He, Z.; Jiang, L. DTQAtten: Leveraging Dynamic Token-based Quantization for Efficient Attention Architecture. In Proceedings of the 2022 Design, Automation & Test in Europe Conference & Exhibition, Virtual, 14–23 March 2022; pp. 700–705. [Google Scholar] [CrossRef]
- Wu, D.; Li, J.; Pan, Z.; Kim, Y.; Miguel, J.S. UBrain: A Unary Brain Computer Interface. In Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, NY, USA, 18–22 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 468–481. [Google Scholar] [CrossRef]
- Zadeh, A.H.; Mahmoud, M.; Abdelhadi, A.; Moshovos, A. Mokey: Enabling Narrow Fixed-Point Inference for out-of-the-Box Floating-Point Transformer Models. In Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, NY, USA, 18–22 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 888–901. [Google Scholar] [CrossRef]
- Li, Z.; Ghodrati, S.; Yazdanbakhsh, A.; Esmaeilzadeh, H.; Kang, M. Accelerating Attention through Gradient-Based Learned Runtime Pruning. In Proceedings of the 49th Annual International Symposium on Computer Architecture, New York, NY, USA, 18–22 June 2022; Association for Computing Machinery: New York, NY, USA, 2022; pp. 902–915. [Google Scholar] [CrossRef]

**Figure 2.** Typical architecture of an ANN accelerator. Each PE has a private register (gray box) attached to it, used for storing partial sums. The structure of the PE grid varies depending on the dataflow used.
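As a loose illustration of the accumulation scheme described in this caption, an output-stationary style of partial-sum accumulation can be sketched in a few lines of Python. The function name and the plain nested-loop structure are illustrative only, not the control logic of any surveyed accelerator.

```python
# Hypothetical sketch of the partial-sum accumulation in Figure 2: each entry
# C[i][j] plays the role of one PE's private register (an output-stationary
# dataflow). Purely illustrative; real PE grids are fixed-function hardware.

def output_stationary_matmul(A, B):
    """Compute C = A @ B, accumulating each C[i][j] locally over time."""
    n, k, m = len(A), len(A[0]), len(B[0])
    C = [[0] * m for _ in range(n)]               # one "register" per PE
    for t in range(k):                            # operands stream in at step t
        for i in range(n):
            for j in range(m):
                C[i][j] += A[i][t] * B[t][j]      # partial sum stays in the PE
    return C

output_stationary_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]])
# → [[19, 22], [43, 50]]
```

A weight-stationary or no-local-reuse dataflow would reorder these loops instead, trading which operand is pinned in place against which is streamed.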

**Figure 3.** Power (mW) vs. throughput (GOPS) scatter plot of low-power AI accelerators at logarithmic scales.

**Figure 4.** Power (mW) vs. throughput (GOPS) scatter plot of low-power AI accelerators at logarithmic scales. The colors represent the year, and the shapes (filled or hollow) indicate whether an accelerator can be used for training and/or inference.

**Figure 5.** Power (mW) vs. throughput (GOPS) scatter plot of low-power AI accelerators at logarithmic scales. The colors represent an accelerator’s affiliation: company-affiliated (green) or non-company-affiliated (red).

**Figure 6.** Power efficiency of AI accelerators throughout the years. Left figure: data from accelerators created by different companies. Right figure: data from accelerators published as research papers and not affiliated with a company.

**Figure 8.** Power consumption throughout the years. Left figure: data from accelerators created by different companies. Right figure: data from accelerators published as research papers and not affiliated with a company.

**Figure 9.** Clock frequencies of AI accelerators throughout the years. Left figure: data from accelerators created by different companies. Right figure: data from accelerators published as research papers and not affiliated with a company.

**Figure 10.** Acceleration targets throughout the years, where RNN refers to RNN and LSTM; DNN includes DNN, ANN, and MLP; Transformer groups the Transformer model and the Attention mechanism together; and Other refers to the remaining acceleration targets. Note, however, that some accelerators target a variety of applications and thus have more than one acceleration target.

**Figure 11.** Number of accelerators and targets for both company and non-company affiliated accelerators over the years.

**Figure 12.** Occurrences of different precision formats throughout the years. Unspecified denotes all accelerators where the precision format used was not explicitly mentioned, and Bit denotes all accelerators where the number of bits was mentioned but not the type (integer, float, fixed-point, etc.).

**Figure 13.** Occurrences of different precision formats throughout the years. ≤8-bit and >8-bit denote precision formats of at most 8 bits and more than 8 bits, respectively. Unspecified denotes all accelerators where the precision format used was not explicitly mentioned.

**Table 1.** Number of acceleration targets per year, divided into company and non-company (research) acceleration targets. Note that an accelerator can accelerate more than one target.

| Targets | 2019 Research | 2019 Company | 2020 Research | 2020 Company | 2021 Research | 2021 Company | 2022 Research | 2022 Company |
|---|---|---|---|---|---|---|---|---|
| CNN | 10 | 4 | 5 | 4 | 7 | 6 | 2 | - |
| GAN | 1 | - | 1 | - | 1 | - | 1 | - |
| DNN | 2 | 3 | 4 | - | 2 | 5 | 1 | - |
| Transformer | 1 | - | 1 | - | 3 | 2 | 3 | - |
| SpGEMM | 1 | - | 2 | - | 1 | - | - | - |
| SNN | 1 | - | 3 | - | 2 | - | 2 | - |
| ANN | 1 | - | 1 | - | - | - | - | - |
| RNN | 2 | - | - | 4 | 1 | 3 | 1 | - |
| LSTM | 1 | 1 | - | 4 | - | 3 | - | - |
| PR system | - | - | 1 | - | - | - | - | - |
| GEMM | - | - | 2 | - | 3 | - | 1 | - |
| MAC | - | - | 1 | - | - | - | - | - |
| Matrix Alg. | - | - | 1 | - | - | - | - | - |
| MLP | - | - | - | - | 1 | - | - | - |
| Graph Mining | - | - | - | - | - | - | 2 | - |
| SpMatrix Transposition | - | - | - | - | - | - | 1 | - |
| Attention | - | - | - | - | - | - | 1 | - |
| Unique targets | 9 | 3 | 11 | 3 | 9 | 5 | 10 | 0 |
| Total targets | 19 | 8 | 22 | 12 | 21 | 19 | 15 | 0 |

**Table 2.** Number of accelerators per year, divided into company and non-company (research) accelerators.

| | 2019 | 2020 | 2021 | 2022 |
|---|---|---|---|---|
| Company | 6 | 4 | 8 | 0 |
| Academia | 16 | 17 | 16 | 12 |
| Total | 22 | 21 | 24 | 12 |

| | Power ${}^{\mathit{a}}$ (Mean) | Power ${}^{\mathit{a}}$ (Median) | Performance ${}^{\mathit{b}}$ (Mean) | Performance ${}^{\mathit{b}}$ (Median) | Power Eff. ${}^{\mathit{c}}$ (Mean) | Power Eff. ${}^{\mathit{c}}$ (Median) |
|---|---|---|---|---|---|---|
| All | 1654.6 | 478.0 | 5008.6 | 639.7 | 6060.7 | 1264.0 |
| Company | 2928.0 | 1931.2 | 11,912.1 | 4423.5 | 5558.8 | 4125.0 |
| Non-Company | 1270.2 | 206.0 | 2027.5 | 148.8 | 6272.0 | 1020.7 |
| SNN | 938.9 | 155.8 | 1172.8 | 1214.1 | 9389.3 | 1278.0 |
| Non-SNN | 1748.5 | 500.0 | 5200.4 | 622.2 | 5864.9 | 1250.0 |

^{a} Power is measured in mW.

^{b} Performance is measured in GOPS.

^{c} Power Efficiency is measured in GOPS/W.
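For reference, the derived efficiency columns in these tables follow directly from the reported metrics. The sketch below uses made-up numbers, not data from any surveyed accelerator.

```python
# Deriving power efficiency (GOPS/W) and area efficiency (GOPS/mm^2) from
# reported throughput, power, and area. All values here are hypothetical.
perf_gops = 640.0    # throughput in GOPS
power_mw = 200.0     # power in mW
area_mm2 = 4.0       # die area in mm^2

power_eff = perf_gops / (power_mw / 1000.0)   # convert mW to W, then GOPS/W
area_eff = perf_gops / area_mm2               # GOPS/mm^2
print(power_eff, area_eff)                    # 3200.0 160.0
```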

| Strategy | Description | Examples |
|---|---|---|
| Pruning | Remove parameters based on different criteria, thereby reducing the workload (e.g., zero-valued weights in DNNs and tokens in Transformers can be removed, resulting in less computation). | [37,38,39,40,41] |
| Quantization | Map continuous values to discrete values (e.g., 32-bit float to 16-bit integer) and perform the computations on the discrete representation. | [37,38,42,43,44,45,46] |
| Dataflow | Reduce data movement by specifying how data flows through the architecture (e.g., Weight Stationary, Output Stationary, No Local Reuse). | [29,47,48,49] |
| Near-Memory Processing | Reduce the distance between memory and processing components. | [40,50] |
| Compute-in-Memory | Perform MAC computations in the analog domain, inside the memory array, thereby removing the need to move weights. | [28,32] |
| Reduced flexibility | Reduce the flexibility/reconfigurability of the accelerator by restricting the hardware to specific, predetermined tasks (e.g., a PE might compute only a single operation, support only ReLU, etc.). | [6,51] |
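As a minimal illustration of the quantization strategy in the table above, the sketch below maps floats to signed 8-bit integers with a single per-tensor scale. This is a generic symmetric scheme, not the method of any particular surveyed accelerator, and the helper names are made up.

```python
# Symmetric linear quantization: map floats to signed 8-bit integers using one
# per-tensor scale factor, so arithmetic can run on the integer representation.

def quantize(values, n_bits=8):
    qmax = 2 ** (n_bits - 1) - 1                   # 127 for 8 bits
    scale = max(abs(v) for v in values) / qmax     # one scale per tensor
    return [round(v / scale) for v in values], scale

def dequantize(q_values, scale):
    return [q * scale for q in q_values]

q, s = quantize([-1.27, 0.64, 1.27])
# q == [-127, 64, 127]; dequantize(q, s) recovers the inputs up to rounding
```

Accelerators typically pair such a scheme with calibration or retraining to limit accuracy loss; the sketch skips that entirely.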

| Accelerator | Year | Tech (nm) | Area (mm${}^{2}$) | Power (mW) | Freq. (MHz) | Perf. (GOPS) | Power Eff. (GOPS/W) | Area Eff. (GOPS/mm${}^{2}$) | Type | Acc. Target |
|---|---|---|---|---|---|---|---|---|---|---|
| Tianjic [61,62] | 2019 | 28 | 14.44 | 950.0 | 300.0 | 1214.1 * | 1278.0 | 84.1 * | ASIC | SNN, ANN |
| SpinalFlow [48] | 2020 | 28 | 2.09 | 162.4 | 200.0 | 25.6 *${}^{\dagger}$ | 157.6 *${}^{\dagger}$ | 32.0 *${}^{\dagger}$ | simulated | SNN |
| NEBULA [51] | 2020 | 32 | 86.73 | 5200.0 | 1200.0 | - | - | - | simulated | SNN, ANN |
| YOSO [42] | 2020 | 22 | - | 0.73 | <1.0 | 1.3 *${}^{\dagger}$ | 1757.8 *${}^{\dagger}$ | - | simulated | SNN |
| IMPULSE [28] | 2021 | 65 | 0.09 | 0.2 | 200.0 | 0.2 | 990.0 | 2.2 | ASIC | SNN |
| VSA [63] | 2021 | 40 | - | 88.97 | 500.0 | 2304.0 | 25,900.0 | - | simulated | SNN |
| Chen et al. [64] | 2022 | 28 | 0.89 | 149.3 | 500.0 | - | - | - | simulated | SNN |
| Skydiver [65] | 2022 | - | - | 960.0 | 200.0 | 22.6 ${}^{\dagger}$ | 19.3 ${}^{\dagger}$ | - | FPGA | SNN |

^{*} Calculated based on configuration and/or published metrics of specified accelerator.

^{†} The corresponding research paper reports results in GSyOPS (Giga-Synaptic Operations per Second) instead of GOPS.

| Company | Accelerator | Year | Tech (nm) | Area (mm${}^{2}$) | Power (mW) | Freq. (MHz) | Perf. (GOPS) | Power Eff. (GOPS/W) | Area Eff. (GOPS/mm${}^{2}$) | Type |
|---|---|---|---|---|---|---|---|---|---|---|
| Samsung | NPU’19 [66] | 2019 | 8 | 5.50 | 796.00 | 933.0 | 4423.5 | 8000.0 | 804.3 * | ASIC |
| Gyrfalcon ${}^{\dagger}$ | Lightspeeur 5801S [59] | 2019 | 28 | 36.00 | 224.00 | 200.0 | 2800.0 | 12,600.0 | 77.8 * | ASIC |
| Gyrfalcon ${}^{\dagger}$ | Lightspeeur 2801S [60] | 2019 | - | 49.00 | 600.00 | 100.0 | 5600.0 | 9300.0 | 114.3 * | ASIC |
| Hailo | Hailo-8 [67] | 2019 | - | 225.00 | 2500.00 | - | 26,000.0 | 10,400.0 * | 115.6 * | ASIC |
| Kneron | KL520 [57] | 2019 | - | - | 575.00 * | 300.0 | 345.0 | 600.0 | - | ASIC |
| NVIDIA | Jetson Nano [33] | 2019 | - | 3150.00 | 7500.00 | - | 250.0 | 37.5 * | 0.1 * | ASIC |
| IBM | Oh et al. [56] | 2020 | 14 | 9.80 | 2727.00 * | 1500.0 | 3000.0 | 1100.0 | 300.0 | ASIC |
| ARM | ETHOS-U55 [55] | 2020 | 16 | 0.10 | - | 1000.0 | 512.0 | - | 2880.0 * | ASIC |
| ARM | ETHOS-U65 [68] | 2020 | 16 | 0.60 | - | - | 1000.0 | - | 1666.7 * | ASIC |
| Kneron | KL720 [58] | 2020 | - | - | 1362.50 | 696.0 | 1500.0 | 1250.0 * | - | ASIC |
| Cambricon | Cambricon-Q [69] | 2021 | 45 | 9.18 * | 1030.31 | 1000.0 | 2000.0 | 1941.2 * | 217.9 * | simulated |
| IBM | RaPiD [70,71] | 2021 | 7 | 36.00 | 6206.00 * | 1600.0 | 51,200.0 | 8250.0 | 2610.0 | ASIC |
| Samsung | NPU’21 [72,73] | 2021 | 5 | 5.46 | 327.00 | 1196.0 | 14,700.0 | 13,600.0 | 4447.2 * | simulated |
| ARM | Ethos-N78 [74] | 2021 | - | - | - | - | 10,000.0 | - | - | ASIC |
| Kneron | KL530 [75] | 2021 | - | - | 500.00 | - | 1000.0 | 2000.0 * | - | ASIC |
| quadric | q16 [76] | 2021 | 16 | 256.00 | 5500.00 | 1000.0 | 64,000.0 | 11,636.4 * | 253.9 | ASIC |
| Mythic | M1076 AMP [77] | 2021 | - | 294.50 | 4000.00 | - | 25,000.0 | 6250.0 * | 84.9 | ASIC |
| Xilinx | VE2002 [78] | 2021 | 7 | - | 6000.00 | - | 5000.0 | 833.3 * | - | ASIC |
| Xilinx | VE2102 [78] | 2021 | 7 | - | 7000.00 | - | 8000.0 | 1142.9 * | - | ASIC |

^{*} Calculated based on configuration and/or published metrics of specified accelerator.

^{†} Gyrfalcon Technology Inc.

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Åleskog, C.; Grahn, H.; Borg, A.
Recent Developments in Low-Power AI Accelerators: A Survey. *Algorithms* **2022**, *15*, 419.
https://doi.org/10.3390/a15110419
