Review

A Brief Review of Deep Neural Network Implementations for ARM Cortex-M Processor

by Ioan Lucan Orășan, Ciprian Seiculescu and Cătălin Daniel Căleanu *
Department of Applied Electronics, Faculty of Electronics, Telecommunications and Information Technologies, Politehnica University Timișoara, 300006 Timișoara, Romania
* Author to whom correspondence should be addressed.
Electronics 2022, 11(16), 2545; https://doi.org/10.3390/electronics11162545
Submission received: 20 July 2022 / Revised: 10 August 2022 / Accepted: 11 August 2022 / Published: 14 August 2022
(This article belongs to the Special Issue New Trends in Deep Learning for Computer Vision)

Abstract
Deep neural networks have recently become increasingly used for a wide range of applications (e.g., image and video processing). The demand for edge inference is growing, especially in areas of relevance to the Internet-of-Things. Low-cost microcontrollers as edge devices are a promising solution for optimal application systems from several points of view, such as cost, power consumption, latency, and real-time execution. The implementation of these systems has become feasible due to the advanced development of hardware architectures and DSP capabilities, while cost and power consumption have been maintained at a low level. The aim of the paper is to provide a literature review on the implementation of deep neural networks using ARM Cortex-M core-based low-cost microcontrollers. As an emerging research direction, only a limited number of publications address this topic at the moment. Therefore, the research papers that stand out have been analyzed in greater detail, to promote further interest of researchers in bringing AI techniques to low-power standard ARM Cortex-M microcontrollers. The article addresses a niche research domain. Despite the increasing interest manifested toward both (1) edge AI applications and (2) theoretical contributions in DNN optimization and compression, the number of existing publications dedicated to this topic is rather limited, and a comprehensive literature survey using systematic mapping is therefore not possible. The presentation focuses on systems that have shown increased efficiency in resource-constrained applications, as well as on the predominant impediments that still hinder their implementation. The reader will take away the following from this paper: (1) an overview of applications, DNN architectures, and results obtained using ARM Cortex-M core-based microcontrollers, (2) an overview of low-cost hardware devices and SW development solutions, and (3) an understanding of recent trends and opportunities.

1. Introduction

In recent years, a variety of different applications, such as face, speech, image, or handwriting recognition, natural language processing, and automatic medical diagnostics, have demonstrated outstanding performance by applying Deep Learning (DL) techniques [1]. To further improve application performance and add features, complex Deep Neural Network (DNN) designs are currently being studied. However, this results in increasingly high computational requirements. To satisfy these requirements, integrated circuit manufacturers have focused on increasing the number of available cores, the working frequencies of processing cores and memory systems, and the amount of specialized hardware circuitry. Lately, specialized hardware accelerators paired with high-bandwidth memory systems have become the preferred architecture to manage high computational demands at manageable power levels. Apart from these mainstream applications, a new class of systems is trying to take advantage of DNN algorithms to solve various tasks. These applications come from the field of smart sensors and devices. Unlike mainstream applications, they have lower complexity but, at the same time, require very low power consumption, as most of these systems have to operate on batteries for long periods of time. Some of these applications can even be implemented on low-cost, low-power microcontrollers such as the ones built around the ARM Cortex-M core.
This paper aims to briefly review the representative achievements in the area of DL implementations on low-cost ARM Cortex-M core-based microcontrollers. In addition to this overview, we also highlight the current challenges for DL implementation on low-power microcontrollers. The ARM Cortex-M core-based architecture is already in the focus of researchers because of the tooling and firmware support provided by the manufacturers, which is very helpful in reducing development effort, time, and cost. A recent systematic review of existing tiny machine learning research [2] has shown that STM32 microcontrollers and the ARM Cortex-M series are among the most widely used hardware devices in the field. Our aim is to promote and encourage future research in this area by presenting the state-of-the-art applications already available in this field. To the best of our knowledge, this is the first work providing an overview of the utilization of ARM Cortex-M core-based microcontrollers for DL paradigms.
More specifically, we will focus on the following main topics needed in building efficient edge processing systems: typical DNN embedded architectures; optimization techniques (pruning, quantization, etc.); the ARM Cortex-M core (M3, M4, M7, etc.); the selected microcontroller and the chip manufacturer (STM (Plan-les-Ouates, Switzerland), TI (Dallas, TX, USA), Samsung (Suwon City, Korea)); the application/use case; and, finally, the experimental results, with special emphasis on the reported accuracy, inference time, and power consumption.
We summarize our main contributions as follows:
  • We present some representative low-cost hardware devices (e.g., less than USD 10 per microcontroller) and the support libraries or tools that are commonly used for DL deployments.
  • We provide a brief overview of more than 10 papers using ARM Cortex-M core-based microcontrollers for DL applications. The following main points are presented: model architecture, hardware features (e.g., memory footprint, core architecture, and operating frequency), tools and support libraries, and results obtained (e.g., accuracy, inference time, and power consumption).
  • We discuss the results with a focus on the papers in which the best results were obtained. The main reasons that make it difficult to achieve the target results are also discussed.
  • We provide challenges and research opportunities that emphasize important issues that hinder progress in this area.
This paper is organized as follows. In Section 2 we introduce the migration from cloud to edge computing. We also describe some hardware devices and support libraries or tools for DL embedded implementation. In Section 3 we present an overview of DL implementations using the ARM Cortex-M core. In Section 4, based on the earlier review, we provide insights related to the topic of the paper along with future research challenges and trends. Section 5 concludes this paper.

2. From Cloud to Edge Computing

Usually, DL models require high computational power and substantial available memory, especially if we refer to State-Of-The-Art (SOTA) models. For this reason, some applications based on DL models use cloud computing services, e.g., Google Colaboratory [3], which is a free cloud service hosted by Google. There are also other popular services such as AWS Deep Learning AMIs [4] and Azure [5], the Microsoft MLOps solution for accelerated training, deployment, and management of deep learning projects. At the same time, using a cloud computing approach has some significant drawbacks. On the one hand, in a classical cloud computing paradigm, a large number of computational tasks are executed in the cloud. Therefore, traffic overload of the network may cause unacceptable delays in some real-time scenarios [6]. On the other hand, real-time inference is an important consideration, especially for latency-sensitive applications such as autonomous driving. In this case, a cloud computing approach can introduce significant latencies. In addition, there are concerns about whether data transmission to the cloud can be performed in a sufficiently secure manner. For these reasons, on-device computing is a new trend that brings deep learning model computation directly to the source of data, rather than transmitting the data to remote devices with high computational capabilities. This approach is generally found under the name of edge computing [7]. The relevant DL applications for edge computing are present in several fields such as smart multimedia, smart transportation, smart cities, and smart industry [8].
This migration from cloud computing to edge computing also comes with certain constraints or limitations that must be considered when developing a DL architecture for a certain task. This is due to computational limitations that are inherent when considering edge computing using resource-constrained low-power embedded devices. To mitigate this problem, two approaches are commonly used: (1) different compression methods are applied to existing DL models and (2) the architecture itself must be optimized directly from the design stage [9].
Considering the first approach, to run a DL model on embedded devices, one or more compression algorithms must be applied. Examples of the most common compression algorithms are: quantization of the model parameters [10], neural network pruning [11], network distillation [12], and binarization [13].
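As a simple illustration of the first of these techniques, the sketch below shows symmetric per-tensor 8-bit quantization of a weight matrix in NumPy and the reconstruction error it introduces. It is only a conceptual example; the tools discussed later implement more elaborate (e.g., asymmetric or per-channel) schemes.

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization of a weight array (illustrative only)."""
    scale = np.abs(w).max() / 127.0                      # map the largest magnitude to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float32 tensor from the int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(64, 32).astype(np.float32)           # stand-in for a trained weight matrix
q, scale = quantize_int8(w)
print("max absolute reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```

Storing `q` plus a single `scale` per tensor reduces the weight storage to roughly a quarter of the 32-bit floating-point footprint.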
For the second approach, the basic idea is to obtain an optimized architecture that, after the training process, does not require the use of compression methods. The SqueezeNet architecture made a significant contribution in this direction [14]. Its main purpose is to obtain a small number of parameters with minimal impact on accuracy.
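As an illustration of this design-stage optimization, the sketch below reproduces the structure of a SqueezeNet "fire" module (a squeeze layer of 1 × 1 convolutions followed by parallel 1 × 1 and 3 × 3 expand convolutions) in Keras; the input size and filter counts are arbitrary placeholders, not values from the original paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def fire_module(x, squeeze_filters, expand_filters):
    # Squeeze: 1x1 convolutions reduce the channel count cheaply.
    s = layers.Conv2D(squeeze_filters, 1, activation="relu")(x)
    # Expand: parallel 1x1 and 3x3 convolutions whose outputs are concatenated.
    e1 = layers.Conv2D(expand_filters, 1, activation="relu")(s)
    e3 = layers.Conv2D(expand_filters, 3, padding="same", activation="relu")(s)
    return layers.Concatenate()([e1, e3])

inputs = tf.keras.Input(shape=(96, 96, 3))               # placeholder input resolution
x = layers.Conv2D(32, 3, strides=2, activation="relu")(inputs)
x = fire_module(x, squeeze_filters=16, expand_filters=32)
x = layers.GlobalAveragePooling2D()(x)
model = tf.keras.Model(inputs, x)
model.summary()                                          # far fewer parameters than plain 3x3 stacks
```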
It is important to note that the embedded devices are suitable only for the inference task, which is much less expensive in terms of computational resources compared to the training process. Among these embedded devices are also general-purpose microcontrollers that have been used with high efficiency in various fields, such as IoT applications [15].

2.1. Embedded Hardware for Deep Learning

In the last decade, computational constraints have become easier to cope with due to the introduction of specialized hardware devices on the market, at an accelerated pace [16]. These hardware devices that are found in the context of DL are usually called hardware accelerators. They are optimized and specialized hardware architectures that can reduce system cost and power consumption by optimizing the necessary resources while improving performance [17]. For instance, common implementations of hardware accelerators are based on the use of FPGAs, GPUs, ASICs [18], or Tensor Processing Units (TPUs) developed by Google [19]. The embedded system used to implement deep learning applications must have high processing capacity, as well as be able to acquire and process data in real time. At the same time, the processor must have enough memory to store the model data and the parameters.
System-On-Chip (SoC) devices are an attractive solution that, in addition to high processing capabilities, includes multiple peripheral devices that can be very helpful for the sophisticated requirements of deep-learning applications. Examples of manufacturers that develop AI integrated circuits for edge computing are Samsung, Texas Instruments, Qualcomm, and STM. Some of their recent products are briefly presented below.
The Maxim Integrated MAX78000 ultra-low power microcontroller is a relatively new device specially designed for edge Artificial Intelligence (AI) applications. It integrates a dedicated Convolutional Neural Network (CNN) accelerator along with a low-power ARM Cortex-M4 core and a RISC-V core [20]. Low-power AI applications can be developed using this architecture, because it provides a variety of configuration options such as different oscillators, clock sources, or operation modes. One of the main advantages of this device is that it combines an energy-efficient AI processing unit with Maxim Integrated’s proven ultra-low power microcontrollers. It is suitable for battery-powered applications. The architecture is briefly presented in Figure 1, which summarizes the main features of the CNN accelerator and the microcontroller, such as cores, memory footprint, and external interfaces. The microcontroller has a dual-core architecture: an ARM Cortex-M4 processor with FPU running at clock frequencies of up to 100 MHz and a 32-bit RISC-V coprocessor running at up to 60 MHz. To demonstrate the performance of this device, two applications were used: keyword spotting [21] and face identification [22]. The accuracy results are quite promising: 99.6% for keyword spotting and 94.4% for face identification.
STMicroelectronics is another example of a manufacturer that has made important contributions to the market to support AI edge computing as a new paradigm for IoT. These contributions were aimed at running neural networks on STM32 general-purpose microcontrollers, with a significant impact on the productivity of edge AI system developers by reducing application deployment time. In this case, the focus is not on hardware accelerators, but on an extensive software toolchain designed to port DNN models to standard STM32 microcontrollers with high efficiency for the ARM Cortex-M4 and M7 processor cores. The STM32Cube.AI framework will be detailed in the next section. A similar solution has also been developed for SPC5 automotive microcontrollers. In this case, an AI plug-in of the SPC5-STUDIO development environment, called SPC5-STUDIO-AI, is used.
A Parallel Ultra-Low-Power (PULP) SoC architecture has been developed [23] in order to achieve peak performance for the new generation of IoT applications, which have higher computing power requirements, up to giga operations per second (GOp/s), and a memory footprint of up to a few MB. However, to be competitive for IoT applications, the power consumption budget had to be maintained at levels similar to those of existing solutions such as ARM Cortex-M microcontrollers. Named Mr. Wolf [23], the device is a fully programmable SoC for IoT edge computing. It contains a RISC-V general-purpose processor and a parallel computing cluster. The RISC-V processor was designed to handle the connectivity tasks and manage the I/O peripherals for the system integration with the outside world. The parallel cluster acts as an accelerator designed to handle computationally intensive tasks. It consists of eight customized RISC-V cores with DSP capabilities and other extensions such as Single Instruction Multiple Data (SIMD) or bit-manipulation instructions. The Mr. Wolf SoC has already been used in several applications such as tactile data decoding [24], electromyographic gesture recognition [25], stress detection using a wearable multisensor bracelet [26], and electroencephalography signal classification for brain–machine interfaces [27]. The results showed that Mr. Wolf exceeds the performance of ARM Cortex-M processors, while maintaining low power consumption.
Other SoCs also have optimization and embedding solutions. For instance, a 16 nm SoC [28] was recently developed with dedicated optimizations for automatic speech recognition real-life applications. Another example is the TI TDAx series of boards [29].

2.2. Deep Learning Frameworks and Tools for Embedded Implementation

Firmware support for microcontrollers has seen accelerated development in recent years. Some examples of firmware and framework solutions are: CMSIS-NN, released by ARM in 2018, which is an open-source library consisting of efficient kernels developed to maximize Neural Network (NN) performance on ARM Cortex-M processors [30]; TensorFlow Lite Micro, an open-source Machine Learning framework to enable DL models on embedded systems [31]; the STM X-CUBE-AI expansion package, providing automatic conversion of a pretrained Neural Network for 32-bit microcontrollers [32]; MicroTensor (µTensor), a lightweight Machine Learning (ML) framework used for TensorFlow models and optimized for ARM cores [33]; and PyTorch Mobile, to execute ML models on edge devices using the PyTorch ecosystem [34].
CMSIS-NN was developed to help build IoT applications that run small neural networks directly on the systems that collect the data. This approach is preferred over cloud computing, as the number of IoT nodes is increasing, which leads to bandwidth limitations as well as increasing latencies. The utility of this library has been demonstrated using a CNN designed for an image classification task on the CIFAR-10 dataset. An ARM Cortex-M7 platform was used for the demonstration, achieving a classification rate of 10.1 images per second with an accuracy of 79.9% [30,35].
With the multitude of available embedded platforms, each with different hardware support, converting and optimizing an inference model to run on such a device is a very difficult task. TensorFlow Lite Micro was designed to address this situation, covering these shortcomings with a flexible, unified ML framework for embedded devices. In summary, it is an interpreter-based approach in which hardware vendors have the possibility to provide platform-specific optimizations; it can be easily adapted to new applications, and official benchmark solutions are supported.
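As a minimal sketch of this workflow, the snippet below converts a small Keras model to a fully integer-quantized TensorFlow Lite flatbuffer using a representative dataset for calibration; the stand-in model and random calibration data are placeholders for a real training pipeline.

```python
import numpy as np
import tensorflow as tf

# Stand-in model and calibration data; a real application would reuse its own.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(32, 32, 3)),
    tf.keras.layers.Conv2D(8, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(10, activation="softmax"),
])
calib_samples = np.random.rand(100, 32, 32, 3).astype(np.float32)

def representative_dataset():
    # Yields representative inputs so the converter can calibrate activation ranges.
    for sample in calib_samples:
        yield [np.expand_dims(sample, 0)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model.tflite", "wb") as f:
    f.write(converter.convert())
# The .tflite file is then typically embedded as a C array (e.g., with `xxd -i`)
# and executed on the target by the TensorFlow Lite Micro interpreter.
```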
X-CUBE-AI is a package that extends the capabilities of STM32CubeMX. It adds the possibility to convert a pre-trained NN into an ANSI C library that is performance-optimized for STM32 microcontrollers based on the ARM Cortex-M4 and M7 processor cores. The generated ANSI C source files are then compiled to run inference on the microcontroller. The generation process using this framework is depicted in Figure 2. In fact, CMSIS-NN kernels are used at a low level, but using this tool has a series of advantages for developers, such as a graphical user interface, support for different DL frameworks such as Keras and TensorFlow Lite, 8-bit quantization, and compatibility with different STM32 microcontroller series.
µTensor is a framework that converts a model built and trained with TensorFlow into a microcontroller-optimized C++ code. Its structure consists of a runtime library and an offline tool that make this conversion possible. The resulting C++ source files can be compiled to obtain an efficient and optimized inference engine that can run on ARM microcontrollers.

3. Deep Learning on ARM Cortex-M: Case Reviews

F. Alongi et al. [36] used deep learning techniques to perform weather forecasting using a Deep Tiny Neural Network (DTNN) running on a standalone system without cloud dependencies. The standalone system was based on an STM32 microcontroller and the X-CUBE-AI toolchain to automatically convert the neural network model to an optimized microcontroller version. The predictions are made using atmospheric pressure as the input parameter. They provide a detailed description of the system architecture constructed around the STM32 microcontroller. The core is an ARM Cortex-M4 with 512 kB flash memory and 96 kB SRAM memory. Additionally, a Raspberry Pi is used to manage the data visualizer. The Miosix real-time operating system is used for thread management [37]. The authors investigated several models. Recurrent Neural Networks (RNN) such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were employed because they are used quite often for time-series data processing. Mixed CNN-RNN architectures were also investigated due to their promising results for time-series data processing. In the end, the authors settled on four different families of models: LSTM, GRU, CNN-LSTM, and CNN-GRU. The model with the best performance for each family was selected. The dataset was obtained from a certified weather station and used as input data after specific pre-processing steps. For training, the Keras and TensorFlow frameworks were used. The results were presented using the following metrics: Normalized Root Mean Square Error (NRMSE) and Normalized Mean Absolute Error (NMAE). They provide a detailed description of the models that showed the best performance, such as the number of layers, LSTM cells, and filters. Based on the results, LSTM and GRU were selected as promising candidates. After further analysis of the memory footprint using the X-CUBE-AI Analyze tool, and considering the trade-off between complexity and accuracy, an LSTM model was selected for the final evaluation. The performance of the system was evaluated in real-time conditions for 30 days. The results are slightly different from those obtained in the validation stage. This could be explained by the fact that the tests were performed approximately 50 km from the place where the training data were acquired. The NRMSE was 0.0328 and the NMAE was 0.0251.
S. Akhtari et al. [38] implemented a Deep Neural Network (DNN) running on an STM32 microcontroller to monitor the applied load in a powertrain system. More specifically, the goal was to use smart sensors for condition monitoring in order to perform predictive maintenance and reduce costs. If the applied forces are known, the operating conditions can be estimated and possible defects can be predicted in time. To detect the applied forces, they measure the vibrations using a capacitive accelerometer. They used the STM32F469AI microcontroller, an ARM Cortex-M4 with 2 MB flash memory and 384 + 4 kB SRAM memory working at a frequency of 180 MHz. A Fast Fourier Transform (FFT) of the vibration signal was calculated on the same microcontroller. They used the Keras framework to build the DNN. The DNN has three convolutional layers, followed by three fully connected dense layers, having approximately 45,000 trainable parameters. The main objective behind this topology was to extract useful characteristics from the vibration signal in the frequency domain using the convolutional layers and to classify them using the fully connected layers. Seven forces were selected to be classified, resulting in seven output classes. The pre-trained model was automatically converted to an optimized C library using the STM32Cube.AI tool. The overall accuracy is 97.71%, slightly lower than the accuracy obtained with the original model before conversion. The lowest accuracy is still higher than 90%. Therefore, it can be concluded that this work is a very good example of a DNN implementation on the STM32 ARM Cortex-M4 microcontroller, providing good results for industrial applications when using the STM32Cube.AI toolchain. As an important note, the microcontroller can handle the DNN and FFT algorithms with the help of the DSP features of the STM32.
A. A. Jordan et al. [39] proposed the implementation of a Convolutional Neural Network (CNN) for the detection of drowsiness. The system was integrated into smart glasses as a wearable device. The method was compared with a method commonly used in similar applications, which is based on threshold detection mechanisms. An IR sensor was used to provide the input data. The detection is based on eye blink events. This is a difficult task because there may be different normal situations that could be interpreted as blinking. They provide a detailed description of the dataset used and the IR signal waveform. The basic CNN architecture is composed of two 1D convolution layers with 6 and 12 filters, with a filter size of 7 for both layers. Every convolution layer is followed by a mean-pooling layer. At the end, a fully connected layer is used to predict the class. The model was optimized using the binary cross-entropy loss function and the efficient Adam version of gradient descent. After training for 30 epochs using a batch size of 10, the average accuracy achieved over 5 iterations was 98.2% ± 0.8%. In order to improve the accuracy, the number of convolutional layers, the number of filters per layer, the size of the filters, and the type of down-sampling operation were varied. As a result, they present seven CNN models that showed the best accuracy. Specifically, the highest average accuracy was 99.5%. The microcontroller integrated in the glasses belongs to the STM32L451xx family. It is designed on the 32-bit ARM Cortex-M4 core and was used at a frequency of 40 MHz. It has 512 kB of ROM memory and 160 kB of RAM memory. The X-CUBE-AI toolchain was used to convert the pre-trained models. On top of the accuracy constraint, there was an additional application requirement to limit the ROM memory used to 90 kB; the resulting usage reported by X-CUBE-AI is 47 kB. They provided a detailed description of different performance metrics for all the evaluated models compared to the threshold-based mechanism. Examples of the performance metrics considered are: execution time, average power consumption, sensitivity, specificity, and accuracy. They concluded that CNN models provide better accuracy than the threshold-based method. For every model, they discussed the behavior regarding the performance metrics. The lowest average accuracy was 87.4% and the highest was 90.8%. Finally, the model that showed the best performance improved sensitivity, specificity, and accuracy by 10%, more than 4%, and almost 6%, respectively, compared to the threshold-based mechanism. As a drawback, power and memory consumption increased. However, for the average MCU power consumption, there is only a slight increase from 3.5 mW to a maximum of 5 mW. The authors concluded that the ability to generalize was improved compared to the threshold-based method, while respecting the constraints of memory and power consumption.
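For orientation, the sketch below reconstructs the baseline architecture described above (two 1D convolutions with 6 and 12 filters of size 7, each followed by mean-pooling, and a final fully connected layer) in Keras; the ReLU activations, pooling size of 2, single sigmoid output, and 128-sample input window are assumptions made only for illustration, not details taken from [39].

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(128, 1)),                       # assumed IR-signal window length
    layers.Conv1D(6, 7, padding="same", activation="relu"),
    layers.AveragePooling1D(2),                           # mean-pooling after each convolution
    layers.Conv1D(12, 7, padding="same", activation="relu"),
    layers.AveragePooling1D(2),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),                # blink / no-blink decision
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```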
F. de Vita et al. [40] developed an AI-based system in the field of IoT smart agriculture. The system is intended to use a neural network to detect diseases in coffee plants, and it can run on resource-constrained devices, such as low-power microcontrollers. Because of the limitations imposed by cloud computing, such as latency and security, an edge computing approach was considered more suitable for this work, as the processing is performed directly on the device. The application is called Deep Leaf and uses a Quantized Convolutional Neural Network (Q-CNN) running on an STM32 microcontroller. They implemented five different models using the X-CUBE-AI tool: (1) a 32-bit floating-point model, (2) a compressed model, (3) a quantized model using the TensorFlow Lite converter, (4) a quantized model using an integer representation, and (5) a quantized model using a fixed-point Q-format representation. They provide a performance analysis using the following metrics: inference time, memory utilization, and energy consumption. A STM32F746GDISCOVERY development platform with the ARM Cortex-M7 core-based STM32F746NG microcontroller was used. It has 1 MB of flash memory and 340 kB of RAM memory. The features of the X-CUBE-AI tool, such as compression and quantization techniques, were used to convert the model to meet the microcontroller constraints. The dataset used consists of images of healthy and diseased coffee leaves. Data augmentation techniques were applied to increase the number of images, and noise was added to the images to improve noise tolerance. A detailed description of the CNN architecture is provided: the number of layers and filters, the type of activation function, etc. Four classes were considered: (1) healthy, (2) miner, (3) phoma, and (4) rust. The system consists of a box into which a leaf is introduced to be analyzed. For all five models, the accuracy, precision, and recall metrics were measured. The TensorFlow Lite and integer quantized models obtained the same performance as the 32-bit floating-point model. For the Qm,n quantized model, a slight decrease in accuracy was obtained: 95%, one percentage point lower than the maximum accuracy. Additionally, the models were compared in terms of flash and RAM utilization, average inference time, and average energy consumption. The authors concluded that the quantized model using a fixed-point Q-format representation is suitable for deployment on the microcontroller. For this model, the average energy consumption is 134.12 mJ, the lowest compared to the other models. An important conclusion was that the quantization techniques outperformed the compression method in all performance metrics considered.
L. Grzymkowski et al. [41] present a performance analysis of Convolutional Neural Networks (CNNs) deployed on an ARM microcontroller. This work addresses the problem from a different perspective, namely the performance impact on CNNs of various aspects such as core frequency, memory access, or DSP instruction utilization. They provide a description of Real-Valued and Complex-Valued Neural Networks (RVNN and CVNN), highlighting the differences, especially for the complex-valued type. TensorFlow Lite and CMSIS-NN were used as inference engines, and benchmarks were developed to measure the performance of CNNs under various conditions. To execute the benchmarks, they used the NXP i.MX RT1050 development board with an ARM Cortex-M7 core and 512 kB SRAM memory. Different CNN architectures derived from a baseline model were tested. Both RVNNs and CVNNs were tested with a focus on inference time and less on accuracy. The input images belong to the CIFAR-10 dataset, which consists of 32 × 32 color images. The results concerning the memory footprint showed that the required flash memory using CMSIS-NN is significantly lower than when using TensorFlow Lite. RAM memory utilization depends on the NN size (the number of layers or parameters) and was similar for both CMSIS-NN and TensorFlow Lite for the same NN size. Regarding inference time, better results were obtained with TensorFlow Lite. The efficiency of computation with TensorFlow Lite is higher when the NNs are deeper, unlike CMSIS-NN, where the efficiency remains constant. The inference time for RVNN and CVNN was measured in detail for every layer, concluding that with DSP acceleration the computation times are similar for both. Therefore, if a complex-valued representation is required from the accuracy point of view, it can be used without a penalty on the inference time. Finally, an analysis was performed using different core frequencies and cache configurations. Detailed results were provided that highlight the importance of these configurations in achieving an optimal system. For example, the effects of memory latency at different core frequencies on inference time were analyzed. A key conclusion of the presented approach is the importance of understanding the whole system in order to obtain an energy-efficient application, improving the inference time using DSP instructions, and considering memory access latency.
M. T. Nyamukuru et al. [42] developed Tiny Eats GRU, a shallow Gated Recurrent Unit (GRU) neural network running on a low-power ARM Cortex-M0+ microcontroller to classify eating episodes. To detect eating episodes, they used a contact microphone that senses jaw movement, whose output is analyzed by the microcontroller. The GRU neural network was used for the detection of audio events. The network was modified to comply with the memory and computation constraints of the microcontroller. The ARM Cortex-M0+ is the most energy-efficient ARM processor, operating at a 48 MHz clock frequency with a very restricted memory footprint of 32 kB RAM and 156 kB flash. The dataset was collected from 20 participants and labeled as eating or not eating. A pre-processing step was applied to the dataset by computing a Short-Time Fourier Transform (STFT) to extract the features. These features were used to train the neural network. A floating-point model was initially trained using Python and TensorFlow Keras, followed by an integer quantized model implemented in PyTorch. The trained integer quantized model was implemented on the microcontroller. The feature extraction was performed using only 23 kB (9%) of memory. The GRU inference time was 6 ms for one sample, using 12 kB (4%) of memory. On the one hand, the authors show that it is possible to develop an efficient GRU neural network on a low-cost, resource-constrained ARM Cortex-M0+ microcontroller, including the implementation of the STFT and 8-bit quantization, with only a 2% decrease in accuracy. On the other hand, they show that the use of the computationally efficient soft-sign activation function achieves a cross-validation accuracy of 96.13% and 94.41% for the floating-point and quantized-weight implementations, respectively.
G. Cerutti et al. [43] proposed an implementation of an outdoor sound event detection application using deep learning techniques on the ARM Cortex-M4 microcontroller, as part of the IoT concept. The authors present this as the first work in which a student–teacher approach is used for sound event recognition at the edge. This approach refers to the fact that a smaller network (the student) is trained to mimic the output of a larger one (the teacher). The UrbanSound8K dataset was selected for training/testing purposes. It consists of 8732 city-environment audio samples organized in 10 classes. The teacher model consists of a VGGish feature extractor and a Gated Recurrent Unit (GRU) classifier followed by a fully connected layer. A new distillation strategy is presented, called two-stage distillation. Using this strategy, an accuracy of 72.67% was achieved, a three-point improvement over the traditional distillation strategy. The STM32L476RG platform was selected for implementation. The CMSIS-NN framework was used to implement the networks on the ARM Cortex-M4 microcontroller. A detailed description of the quantization design from a 32-bit floating-point to an 8-bit fixed-point representation is provided. The main target was to achieve a prediction accuracy as close as possible to that of the 32-bit floating-point representation. The results were evaluated in terms of power consumption, execution time, and recognition accuracy. The average accuracy decrease was only 2% compared to the floating-point implementation. The execution time is compared between a 'fast' implementation that uses the SIMD unit and a 'basic' implementation that does not. Experiments show that a 'basic' convolutional layer takes more than half of the overall execution time. In addition, the implementation with SIMD directives was compared with a plain C implementation. The conclusion of this comparison was that the optimized CMSIS-NN implementation improves the execution time by a factor of 2.32: the execution time for each 1-second audio clip was 125.6 ms, significantly lower than that of the plain C model, which was 291.4 ms. The average power consumption was 5.5 mW. The RAM memory required was 34.4 kB.
S. Adhau et al. [44] have implemented a Deep Neural Network for Model Predictive Control (DNN-MPC) using an ARM microcontroller. More specifically, they investigated the performance of a deep learning-based MPC model for anesthesia control for intravenous anesthesia drug delivery. Since linear MPC models have high computational demands, they are not suitable for real-time implementation. However, by using deep learning methods to provide an accurate approximation of the linear MPC control law, the computational complexity and memory footprint are reduced. A Recurrent Neural Network (RNN) has been chosen, as such a model is usually used for MPC applications. Training data was collected from simulations. The training process was performed offline using the MATLAB-based neural network time series function and the Levenberg–Marquardt method. The training was stopped when no further generalization improvement was visible, and the Mean Squared Error (MSE) and the Regression (R) got close to zero. The microcontroller was an ARM Cortex-M3 with 512 kB flash memory and 96 kB SRAM memory, running at a frequency of 84 MHz. The computational time of the iteration is reduced from 11.354 ms to 2.99 ms. These results are shown as a comparison between a linear MPC and a DNN MPC, including data and program memory usage. The memory footprint was similar, but they mentioned that, in the case of larger systems, the difference will be much more visible.
G. Cerutti et al. [45] implemented a Convolutional Neural Network (CNN) for the detection of outdoor human presence using a low-resolution thermal camera. The inference is executed on a 32-bit ARM Cortex-M4 microcontroller with 1 MB flash memory and 128 kB SRAM memory. A Grid-EYE infrared array sensor is used to capture an 8 × 8 thermal image at a rate of 10 Hz. Thermal cameras are more suitable for this task, but their cost and power consumption are much higher than those of PIR sensors. The dataset used is a custom one, which was extended by adding images taken under different temperature circumstances. Some preprocessing steps (background subtraction and a running background average) were applied to the image before it is used as input to the CNN. The network architecture is simple, targeting a binary classification: 'person' or 'no person'. It consists of three convolutional layers and one fully connected layer. The kernel size and stride hyperparameters were three and one, respectively. The Rectified Linear Unit (ReLU) activation function was used for the convolutional layers and sigmoid for the last dense layer because of the binary classification. The training was performed using the TensorFlow framework for 1000 epochs. The cross-entropy cost function and the Adam optimization algorithm were used. CMSIS-NN kernels optimized for ARM Cortex-M were used, and 8-bit fixed-point quantization of the weights and activations was performed. The STM NUCLEO-L476RG generic development board was used. Power consumption was 16.5 mW, the execution time was 4.01 ms, and the memory footprint was 25.08 kB (text, BSS, and data). The classification performance was analyzed using both models, TensorFlow (32-bit floating-point representation) and CMSIS-NN (8-bit fixed-point implementation). All three divisions of the dataset (train, validation, and test) were used for the evaluation. The classification performance decreased by 0.2%, 1.0%, and 0.2%, respectively, the CMSIS-NN model achieving 80.9%, 76.4%, and 76.7%, respectively. In conclusion, an inference time of only 4 ms was obtained with 2.3 mW power consumption. Experiments show that the 8-bit fixed-point representation does not significantly affect the performance, introducing a maximum 1% accuracy loss.
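A minimal Keras sketch of a network with the shape described above (8 × 8 single-channel input, three 3 × 3 convolutions with stride 1 and ReLU, and one sigmoid-activated dense output) is given below; the per-layer filter counts are placeholders, as they are not stated in the summary of [45].

```python
import tensorflow as tf
from tensorflow.keras import layers

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8, 8, 1)),                      # one 8x8 Grid-EYE thermal frame
    layers.Conv2D(8, 3, strides=1, padding="same", activation="relu"),
    layers.Conv2D(16, 3, strides=1, padding="same", activation="relu"),
    layers.Conv2D(32, 3, strides=1, padding="same", activation="relu"),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),                # 'person' vs. 'no person'
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
```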
G. Crocioni et al. [46] presented different ML algorithms to estimate the State of Health (SoH) and the maximum releasable capacity of Lithium-Ion (Li-Ion) batteries. They provide a comparison of these ML algorithms with a focus on Feedforward and Recurrent Neural Networks (FNNs and RNNs). These estimations can be useful for preventive battery replacement, reducing safety risks, and preventing critical failures. According to the authors, existing models are too large to run on microcontrollers, and therefore the following architectures were analyzed: CNN, LSTM, GRU, CNN-LSTM, and CNN-GRU. The dataset used for training and testing is made available by the NASA Ames Prognostics Center of Excellence (PCoE) and contains measurements of voltage, current, temperature, capacity, and impedance of Li-Ion batteries. During the training process, the Adam optimization algorithm and the MSE loss function were used. A Random Forest (RF) model and an SVR model were used as baselines for the developed architectures. The SensorTile Wireless Industrial Node (STWIN) development kit based on the ARM Cortex-M4 MCU was used to run the trained neural networks. To convert the pre-trained architectures into an optimized ANSI C library, the STM32Cube.AI software tool was used, allowing performance analysis both on the PC and on the MCU. The same tool was also used to generate 8-bit integer quantized models. They provide many details regarding the quantization algorithm used, noting that quantization is not supported for recurrent layers (LSTM and GRU), whose execution uses a floating-point representation. Benchmarking was performed for the CNN model, comparing TensorFlow Lite for Microcontrollers (TFLM) and the STM32Cube.AI tools. Three different development boards were used: Nucleo-H743ZI (480 MHz), Nucleo-F411 (100 MHz), and Nucleo-L4r5ZI (120 MHz). To evaluate the accuracy of the prediction, they used the RMSE and MAE metrics. They demonstrated that Neural Networks (NNs) give better results than the baseline models (RF and SVR). CNN-GRU is the model with the best results compared to the other NNs; the maximum RMSE and MAE were 0.0488 and 0.0414, respectively. The total number of parameters for all the models was specified, concluding that CNN-GRU has almost the best results, exceeded only by the GRU model. The complexity of the NN models was evaluated in terms of RAM, flash memory usage, and 32-bit floating-point MACC operations. From this point of view, the GRU model has the best results. In the next step, the models were converted to ANSI C code to evaluate the RMSE and MAE metrics; the performance after conversion remained unchanged. Quantization brought a large improvement in RAM size and CPU cycles; however, it also impacted the accuracy, with 0.0334 additional RMSE and 0.0331 additional MAE. From the point of view of energy consumption, the quantized model is one order of magnitude better than the non-quantized models. More specifically, the inference energy consumption for the non-quantized CNN models is 399.079 nJ, while for the quantized model it is 18.654 nJ. Additionally, they provide a comparison between STM32Cube.AI and TFLM in terms of inference time and memory consumption for both quantized and non-quantized models. In conclusion, it was demonstrated that the results obtained using STM32Cube.AI outperform those obtained with TFLM.
B. Karg et al. [47] use deep neural networks for mixed-integer model predictive control. The targeted application is the energy management system of a smart building. The dataset used for training was obtained from 500 different MPC runs and was divided into a training set (90%) and an evaluation set (10%). The frameworks used to design the DNN were TensorFlow and Keras, with the Adam optimizer. Three different models were trained: two shallow networks and one deep network. The authors show that the deep network performs better than the shallow networks, having a smaller training error and a reduced memory footprint. Therefore, the deep network architecture was selected for further investigation and, because of its simplicity, was finally implemented on a microcontroller. The EdgeAI tool was used to generate the C code. The microcontroller was based on an ARM Cortex-M3 core with 96 kB of RAM and 512 kB of flash, running at a frequency of 89 MHz. The network has only 5 hidden layers with 10 neurons per layer. The computation time was 2.9 ms and the memory footprint was only 35 kB. Differences between the use of the ReLU and tanh activation functions were analyzed, concluding that the computation time is significantly longer (7.3 ms) and the code is larger (37.3 kB) with the tanh function due to the additional math libraries that were needed.
E. Torti et al. [48] implemented a fall detection wearable system using LSTM Recurrent Neural Networks (RNNs) running on a microcontroller. Such a system is useful to monitor older adults for unintentional falls and to issue alert notifications to a remote monitoring system on positive detection. The authors started from three main requirements: (1) a permanent wireless connection shall be ensured for alert notifications, (2) the system shall be as small and lightweight as possible to avoid possible inconveniences, and (3) the system shall be a low-power device as it is powered from a battery. These requirements were the basis for the need to implement a real-time fall detection system that performs the computation directly on the embedded device. The SensorTile miniaturized board produced by STMicroelectronics was selected as a suitable device for this application due to the low-power ARM Cortex-M4 core-based STM32L476JGY microcontroller and the additional onboard features available, such as the three-axial accelerometers. The microcontroller has 1 MB of flash memory and 128 kB of RAM memory. Single-precision floating-point arithmetic is used; this was decided to avoid significant losses in precision that could be introduced by pruning or quantization techniques. The dataset used was SisFall, which is one of the common datasets available for this application. It was manually labeled and divided into a training set (80%) and a test set (20%). The relevant classes defined were FALLS, ALERTS, and a Background Class (BKG) that covers normal activities that are not related to a fall. The network architecture consists of two LSTM cells with an inner dimension LS of 32 units for each cell. It was designed using the TensorFlow library, and the training procedure was performed on a DELL 5810 workstation. The results were presented in terms of accuracy, specificity, and sensitivity. The overall accuracy obtained was 98%. A quantization approach was tested using a technique called node quantization. The results showed an overall precision loss of more than 9%, concluding that this technique is not feasible to use. The network was tested on the SensorTile device and implemented using the CMSIS library on the microcontroller, which works at a maximum frequency of 80 MHz. The required memory was 82 kB of the 128 kB available. To evaluate the power consumption, the STM32CubeMX Power Consumption Calculator was used. The resulting current consumption was 5 mA, and the estimated time for which the device can remain active according to the battery capacity was 20 h.
A. Faraone et al. [49] converted an existing convolutional recurrent neural network designed to detect and classify cardiac arrhythmias from a single-lead electrocardiogram, so that it could run on a low-power embedded device, the nRF52 System-on-Chip, equipped with an ARM Cortex-M4 processing core. The paper is especially focused on the inference process and the trade-offs between model complexity and performance degradation. They used the CMSIS-NN optimized software library designed for implementing neural networks on ARM Cortex-M. The hardware platform was the nRF52832 SoC, designed for IoT applications and medical wearable devices. The microcontroller runs at a frequency of 64 MHz and has 64 kB of RAM and 512 kB of flash memory. The dataset consists of 8528 single-lead ECG signal samples and was used as a reference dataset for the Computing in Cardiology 2017 Challenge. In summary, the architecture contains seven convolutional layers followed by a gated recurrent unit topology. The training was performed using Keras with the cross-entropy loss function and the Adam optimizer. They summarize the main changes performed on the original neural network model, obtaining a total memory of slightly less than 200 kB using an 8-bit fixed-point representation. The quantization technique was described with a focus on defining the Q format, i.e., the number of integer and fractional bits, in order to achieve an efficient quantization. After training for 250 epochs, the accuracy was 89.3% on the training set and 86.1% on the test set, while with the fixed-point implementation the accuracy was 85.7%. It was concluded that sensitivity to noise is the most penalized metric. However, overall, only a slight negative impact due to fixed-point quantization was observed. Slight improvements were even visible on some metrics, for example, the specificity for the normal rhythm class (0.002) or the sensitivity for the atrial fibrillation class (0.007). The overall memory footprint was around 210 kB: 195.6 kB for hard-coded data and approximately 7 kB of RAM. The execution time for one inference was measured at 94.8 ms: around 91 ms was necessary for the convolutional part, 3.8 ms for the GRU execution, and 28 µs for the fully connected layer. Moreover, the total number of network operations was measured at 33.98 MOps/s. With a power consumption of 20.65 mW, the resulting power efficiency was 1.64 GOps/s/W. Taking into account the idle time, which was significantly present in the final implementation, the efficiency achieved for the convolutional part alone is 0.124 GOps/s/W. In conclusion, in this work, the authors demonstrated the applicability of resource-constrained microcontrollers in the medical area for an arrhythmia detection case study.

4. Discussion

4.1. Summary of the Selected Works

In the previous section, we analyzed a few representative works on the use of DNNs on ARM microcontrollers. Considering all recent developments in the field, we can conclude that the use of ARM Cortex-M core-based microcontrollers for applications using deep learning algorithms is a promising emerging solution in several fields. This is supported by the fact that the low-cost ARM Cortex-M architecture is very popular and is already widely used in various embedded applications. Therefore, adding AI algorithms to these platforms is the next logical step. As a brief overview, solutions using deep learning edge computing algorithms were presented for the following applications: weather forecasting [36], predictive maintenance [38], drowsiness detection [39], agriculture [40], classification of eating episodes [42], outdoor sound event detection [43], model predictive control [44,47], outdoor human presence detection [45], Lithium-Ion battery monitoring [46], fall detection systems [48], and cardiac arrhythmia detection [49]. Table 1 summarizes these solutions. It presents the tools used to train the models or convert them into a compatible format for running on the microcontroller, the model architecture, the application, the hardware resources, and the results.
In some of the presented papers, the hardware resources are not adequately detailed, e.g., [42,44,47], and those details were therefore omitted from Table 1. The most common approach to building a model for embedded devices is to perform the training on a host workstation platform and to apply optimization techniques to the pre-trained model. However, there has recently been great interest in obtaining an optimized model directly during training, for example, by performing quantization-aware training [50]. The main drawback of the train-then-optimize approach is that the deployed inference model differs from the trained one, which can affect the performance of the models to a certain extent depending on the efficiency of the optimization technique.
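To make the quantization-aware training idea concrete, the sketch below uses the TensorFlow Model Optimization toolkit to insert fake-quantization nodes into a toy Keras model before fine-tuning; the model, data, and toolkit choice are illustrative assumptions and do not reproduce the procedure of [50].

```python
import numpy as np
import tensorflow as tf
import tensorflow_model_optimization as tfmot

# Stand-in float model and toy data; a real application would reuse its own.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(64,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(4),                             # logits for 4 classes
])
x = np.random.rand(256, 64).astype(np.float32)
y = np.random.randint(0, 4, size=(256,))

# quantize_model wraps the layers with fake-quantization so training already
# "sees" the 8-bit rounding it will face at inference time.
qat_model = tfmot.quantization.keras.quantize_model(model)
qat_model.compile(optimizer="adam",
                  loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
                  metrics=["accuracy"])
qat_model.fit(x, y, epochs=2, batch_size=32, verbose=0)

# The fine-tuned model is then converted to an int8 TensorFlow Lite model as usual.
converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
```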

4.2. Architectures and Results

We can conclude that the most common deep neural network architectures running on embedded ARM Cortex-M core-based systems at this moment are CNN, RNN, LSTM, GRU, or combinations. The architecture may be limited by the computational requirements that are subject to critical constraints due to the hardware limitations and because the microcontrollers typically operate at a low frequency to reduce power consumption. Another limitation is the supported layers, which depend on the SW library. State-of-the-art architectures usually come with new features that are not supported by existing SW libraries. Thus, the NN architecture must be defined considering these constraints related to SW implementation, computational power, and memory, depending on the microcontroller used.
When pre-trained models are converted to be deployed on embedded devices, it is normal and expected that performance will be slightly affected, as the original model is changed. Consider, for example, the drowsiness detection application [39]. In this case, one of the authors' main goals is to perform a comparison with a traditional threshold-based method. However, the average accuracy before and after applying the conversion techniques can be observed as well: the decrease in the average accuracy for the most efficient CNN model is 8.7%. A decrease in accuracy is also visible for the predictive maintenance application [38], where it decreased by 5.8%. This indicates an interest in further optimizing the conversion techniques. A more promising method is to apply a specific procedure in the training stage, so that the limitations of edge devices are considered from architecture definition and training onward. A detailed analysis of conversion techniques was performed for the detection of coffee plant diseases [40].
However, despite performance degradation, high-performance models can still be obtained. For example, the best average results for CNN architectures, of more than 95%, were obtained in applications such as predictive maintenance [38] and disease detection on coffee leaves [40]. The CNN used for the outdoor human presence application [45] has a modest accuracy, below 80%; however, the input image has a low resolution of only 8 × 8, and using a higher-resolution image could improve the detection accuracy. A different class of neural networks, RNNs, is widely used in edge computing applications due to their efficiency in operating with time-series data. The most common architectures used are LSTM and GRU. Furthermore, mixed architectures with the CNN class have been useful for certain applications [36,46]. The best accuracy results were obtained for applications such as fall detection wearable systems [48] and eating episode classification [42], with accuracies of 98% and 94.41%, respectively. Overall, the results are similar to those obtained for CNN architectures, with a slight decrease in performance caused by the use of optimization techniques.
Some of the reviewed works present an evaluation of power consumption, which is mentioned in Table 1. For instance, in the case of disease detection on coffee leaves [40], where a good performance is obtained, the maximum power consumption is 5 mW. It can be noticed that the power consumption of the analyzed systems is less than 10 mW. Therefore, such systems can easily be designed as battery-powered devices. The efficiency can be improved using model optimization techniques such as quantization, as stated in [46], where a difference in energy consumption of 380.4 nJ is obtained after quantization.

4.3. Useful Hardware Features

We may conclude that, in most cases, the analyzed solutions involve ARM Cortex-M4 or M7 cores, as they demonstrate high performance in the category of low-cost systems. In this section, we briefly describe the features that, in our opinion, make them a preferred solution. Like most processor cores, the ARM Cortex-M cores are optimized for applications dominated by flow control. However, inference of DNNs mostly consists of parallel data processing and would underperform running only on the CPU. Adding a second DSP or accelerator is not a feasible solution for low-power, low-cost sensors. To bridge the gap between flow-control-dominated programs and data-parallel computation, ARM provides two families of cores, the M4 and M7, that include DSP instructions directly in the core without the need for a coprocessor. Furthermore, the M4 includes a single-precision floating-point unit, while the M7 integrates a double-precision floating-point unit [51]. The DSP instructions can be used on both the integer data path and the floating-point data path. An example of a DSP instruction supported for both integer and floating-point operands is the multiply-accumulate, which is also a very frequently used operation in the inference of DNNs. The instruction set is also extended with Single Instruction Multiple Data (SIMD) instructions, where the 32-bit data path can be used to perform two 16-bit or four 8-bit additions or subtractions. Combined with the optimized software functions from the CMSIS-DSP and CMSIS-NN libraries, these hardware extensions can be used to efficiently increase the performance of DNN inference [51]. The M4 and M7 integrated SIMD instructions should not be confused with the NEON SIMD instructions. NEON is a dedicated SIMD engine designed by ARM, but it is only available in the ARM Cortex-A and -R families.

4.4. Resources and Tools

Power consumption, memory footprint, and inference time are strongly dependent on the hardware platform used, the complexity of the model, and the operating frequency. The corresponding requirements are set according to the application, in order to obtain a system that is as efficient as possible. As noted above, an important step in developing applications for embedded devices is model optimization. For instance, quantization is a common technique in the optimization process; among other benefits, it can significantly reduce the memory footprint. However, the technique must be chosen carefully because it can introduce a loss of accuracy, and this negative impact depends strongly on the quantization method used. A recent article has shown that this influence can be overcome and that quantization may even introduce an increase in accuracy [52].
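As a concrete illustration of the affine 8-bit quantization scheme popularized in [10], the short C sketch below derives a scale and zero-point from an observed value range and converts values back and forth; the helper names and the standalone main() are our own assumptions and do not reproduce the API of TensorFlow Lite, STM32Cube.AI, or CMSIS-NN.

```c
/* Minimal sketch of 8-bit affine post-training quantization, in the spirit of
 * the scheme in [10]; helper names are illustrative assumptions only. */
#include <math.h>
#include <stdint.h>
#include <stdio.h>

typedef struct {
    float   scale;       /* real value represented by one quantization step */
    int32_t zero_point;  /* quantized value that maps to real 0.0           */
} qparams_t;

/* Derive scale/zero-point so that [min_val, max_val] covers the int8 range. */
static qparams_t q_calibrate(float min_val, float max_val)
{
    qparams_t p;
    if (min_val > 0.0f) min_val = 0.0f;   /* range must contain zero         */
    if (max_val < 0.0f) max_val = 0.0f;
    p.scale = (max_val - min_val) / 255.0f;
    p.zero_point = (int32_t)lroundf(-128.0f - min_val / p.scale);
    return p;
}

static int8_t q_quantize(float x, qparams_t p)
{
    long q = lroundf(x / p.scale) + p.zero_point;
    if (q < -128) q = -128;               /* saturate to the int8 range      */
    if (q >  127) q =  127;
    return (int8_t)q;
}

static float q_dequantize(int8_t q, qparams_t p)
{
    return p.scale * (float)(q - p.zero_point);
}

int main(void)
{
    qparams_t p = q_calibrate(-0.8f, 1.2f);   /* e.g., observed weight range */
    int8_t q = q_quantize(0.37f, p);
    printf("q = %d, dequantized = %.4f\n", q, q_dequantize(q, p));
    return 0;
}
```

The small rounding error visible in the dequantized value is exactly the source of the accuracy loss discussed above.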
The most common tools and frameworks that support the development of deep learning models on embedded devices are TensorFlow Lite, STM32Cube.AI, and CMSIS-NN. Currently, STM32Cube.AI is the most commonly used, because its X-CUBE-AI expansion package provides an end-to-end solution for automatic neural network model conversion, validation, and system performance measurement. Consequently, 32-bit ARM Cortex-M STMicroelectronics microcontrollers are the most common platform. The availability of these toolchains also explains the popularity of ARM-based solutions versus specialized hardware such as the Maxim Integrated MAX78000.

4.5. Challenges and Research Opportunities

Most of the papers presented have shown quite promising results in terms of accuracy, execution time, power consumption, and memory footprint. However, the edge computing paradigm is a relatively new research topic that faces many new challenges, concerning both hardware and software solutions. In this section, we briefly discuss several high-level challenges and opportunities for future research in DL implementation on low-cost microcontrollers, organized into hardware devices, software implementation, and deep neural network compression.
Hardware devices: The use of custom hardware accelerators leads to efficient implementations that can maximize computation throughput with reduced power consumption. However, the cost of developing such application-specific hardware accelerators is prohibitive for most IoT applications. When using general-purpose microcontrollers such as those referred to in this brief review, certain capabilities can be leveraged to improve computational performance and efficiency (i.e., SIMD or vector extensions, hardware floating-point units, cache hierarchies, or larger nonvolatile memories to store the large number of parameters). Furthermore, mathematical algorithms such as the singular value decomposition have been proposed for bare-metal systems based on the ARM Cortex-M series [53], which may benefit deep learning implementations. Memory access is a key point in these applications due to the large number of data movements, which have a strong impact on energy consumption and latency. To mitigate this, advanced techniques such as in-memory computing have emerged [54]. Unlike the von Neumann architecture, where the memory and processing units are physically separated, with this technique certain computational tasks are performed in the memory itself, based on the physical attributes of the memory devices. Nonvolatile analog memristor crossbar arrays are used, which physically represent the weights as conductances at each cross point. When voltages are applied to the row lines, the vector-matrix multiplication result is obtained as the currents on the column lines, according to Kirchhoff's and Ohm's laws (see the illustrative sketch after this paragraph). Such technology is still in its infancy and cannot be used for real-world applications at the moment. Multicore architectures are a more practical way to increase parallel computation while maintaining a low power budget; however, managing a multicore system is more challenging. Another efficient approach is to combine specially designed deep learning hardware accelerators with a general-purpose CPU and I/O peripherals, as in the Maxim Integrated MAX78000 presented in Section 2.1. In this way, the inference task is performed by the specialized hardware accelerator, while other tasks, such as input data acquisition or results management, are performed using typical I/O peripherals. This device is relatively new and has not been widely used so far; moreover, its accelerator is designed for CNN architectures only.
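For clarity, the following idealized C simulation (an illustration only, not firmware for any cited device) shows the vector-matrix product that a memristor crossbar computes in the analog domain: row voltages V and cross-point conductances G yield column currents I according to Ohm's and Kirchhoff's laws.

```c
/* Idealized illustration of the analog vector-matrix multiplication performed
 * by a memristor crossbar: I[j] = sum_i V[i] * G[i][j]. Plain C simulation
 * for explanatory purposes only. */
#include <stdio.h>

#define ROWS 3   /* input neurons  (row lines)    */
#define COLS 2   /* output neurons (column lines) */

static void crossbar_vmm(const double V[ROWS],
                         const double G[ROWS][COLS],
                         double I[COLS])
{
    for (int j = 0; j < COLS; j++) {
        I[j] = 0.0;
        for (int i = 0; i < ROWS; i++) {
            I[j] += V[i] * G[i][j];   /* current contribution of cell (i, j) */
        }
    }
}

int main(void)
{
    /* Inputs encoded as row voltages [V]; weights stored as conductances [S]. */
    const double V[ROWS] = { 0.2, 0.5, 0.1 };
    const double G[ROWS][COLS] = { { 1e-3, 2e-3 },
                                   { 4e-3, 1e-3 },
                                   { 2e-3, 3e-3 } };
    double I[COLS];

    crossbar_vmm(V, G, I);
    printf("I0 = %.4e A, I1 = %.4e A\n", I[0], I[1]);
    return 0;
}
```

In an actual crossbar, the two nested loops are replaced by a single analog read-out, which is precisely why the approach is attractive for energy-constrained inference.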
Software implementation: Typically, manufacturers of the latest hardware devices do not provide basic software libraries for application development. Therefore, using state-of-the-art hardware devices often requires development from scratch, which is time-consuming and challenging. Currently, STM32 microcontrollers are widely used as general-purpose microcontrollers because the vendor provides the X-CUBE-AI package. However, the freedom for detailed manual optimization is limited, as it is a precompiled environment with only high-level configuration; on the other hand, this is an advantage because developers without extensive experience in the field can use it. Microchip is another manufacturer that provides complete AI-based solutions for various silicon devices, including microcontrollers, microprocessors, and field-programmable gate arrays. Popular ML frameworks such as TensorFlow, Keras, and Caffe are supported by their software toolkits, making this a promising solution and a good direction for future research [55]. Recently, new frameworks have emerged that promise to surpass the results obtained so far, such as MicroAI [56] and MCUNet [57]. MicroAI is a framework for end-to-end deployment of deep neural networks on microcontrollers, including quantization; its main features are (1) support for CNNs with nonsequential topologies, (2) 16-bit quantization, and (3) no restriction to a limited family of hardware targets. MCUNet is a framework that provides efficient neural architecture design using a two-stage neural architecture search approach along with an inference library; it achieved a record accuracy of 70.7% on the large-scale ImageNet dataset running on microcontrollers. Both frameworks outperform existing solutions such as TensorFlow Lite Micro [31] and CMSIS-NN [30], which places them among the state-of-the-art frameworks.
Deep neural network compression: Network compression techniques such as quantization and pruning are evolving at an accelerated pace. Their complexity is increasing, and developing them from scratch is a great challenge. To mitigate this, different open-source solutions are being developed, for example, model compression from Neural Network Intelligence (NNI) [58], the AI Model Efficiency Toolkit (AIMET) [59], and SparseML [60]. Most scientific papers that rely on a specific compression technique do not actually test it on bare-metal devices. This is understandable, because it is not a mandatory step for validating a compression algorithm; however, in the end, the technique must be applied in a real use case on standalone edge devices. The limitations of the available tools such as X-CUBE-AI hinder the implementation of state-of-the-art algorithms in application systems. Currently, there is a lack of tools that take a model compressed with state-of-the-art techniques and produce a model deployable on edge devices such as general-purpose microcontrollers. Solving this problem would be of great benefit for developing end-to-end applications with the latest techniques.
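For readers unfamiliar with the basic mechanism, the sketch below shows magnitude-based weight pruning in a few lines of C; the threshold, weight array, and function name are illustrative assumptions and do not reflect the internals of NNI, AIMET, or SparseML.

```c
/* Minimal sketch of magnitude-based weight pruning: weights whose absolute
 * value falls below a threshold are set to zero, producing a sparse layer.
 * Generic illustration only, not the algorithm of any cited toolkit. */
#include <math.h>
#include <stddef.h>
#include <stdio.h>

/* Returns the number of weights that were pruned (set to zero). */
static size_t prune_by_magnitude(float *w, size_t n, float threshold)
{
    size_t pruned = 0;
    for (size_t i = 0; i < n; i++) {
        if (fabsf(w[i]) < threshold) {
            w[i] = 0.0f;
            pruned++;
        }
    }
    return pruned;
}

int main(void)
{
    float weights[] = { 0.42f, -0.03f, 0.008f, -0.91f, 0.05f, 0.30f };
    size_t n = sizeof(weights) / sizeof(weights[0]);

    size_t removed = prune_by_magnitude(weights, n, 0.05f);
    printf("pruned %zu of %zu weights\n", removed, n);
    return 0;
}
```

The practical difficulty discussed above is not this step itself, but exporting the resulting sparse model in a form that toolchains such as X-CUBE-AI can actually exploit on a microcontroller.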

5. Conclusions

Deep learning and deep neural networks are emerging as promising solutions for solving complex problems. Solving complex problems requires high computational capabilities and large memory resources, so such networks are traditionally designed to run on large computer systems built around specialized hardware. However, recent research shows that simple applications can also benefit from the deep learning paradigm and from edge computing implementations. Edge computing is the solution to many real-world problems that will need to be addressed soon. For instance, the automotive industry is developing prototypes that use state-of-the-art hardware and software solutions for autonomous driving. Once these prototypes prove their ability to solve the problem, the systems will have to run on production cars. At that stage, low cost is necessary to be competitive in the market, and high-performance computing solutions are expensive. The edge computing paradigm must therefore be prepared with efficient and low-cost solutions while meeting specific requirements such as functional safety.
In this work, we provide a summary of what edge computing means in the context of low-cost/low-power applications. Here, the ARM Cortex-M processor represents one of the best possible candidates. More specifically, we summarize deep neural network implementations using ARM Cortex-M core-based microcontrollers. From the software perspective, the STM32Cube.AI support package, made available by STMicroelectronics for its 32-bit microcontroller series, represents one of the best freely available tools.
Implementing deep neural networks on embedded devices, such as microcontrollers, is a difficult task, mainly due to computation and memory footprint constraints. For this reason, developers are often forced to customize existing architectures or even to develop innovative models from scratch that better suit embedded processors. Optimization techniques such as quantization, pruning, and distillation are constantly evolving to achieve higher performance, and they are enabling developers to bring state-of-the-art models of increasing complexity to the embedded domain. Ultimately, combining optimized hardware with optimized deep neural network architectures leads to maximally energy-efficient systems.
In future work, we propose to extend the study to a wider family of ARM cores, including, for example, deep learning applications running on Cortex-A processors or even on the specialized Arm Ethos-N series machine learning processors [61].

Author Contributions

Conceptualization, I.L.O. and C.S.; methodology, C.D.C.; resources, I.L.O., C.S. and C.D.C.; writing—original draft preparation, I.L.O.; writing—review and editing, C.S. and C.D.C.; visualization, I.L.O.; supervision, C.D.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Dargan, S.; Kumar, M.; Ayyagari, M.R.; Kumar, G. A Survey of Deep Learning and Its Applications: A New Paradigm to Machine Learning. Arch. Comput. Methods Eng. 2020, 27, 1071–1092. [Google Scholar] [CrossRef]
  2. Han, H.; Siebert, J. TinyML: A Systematic Review and Synthesis of Existing Research. In Proceedings of the International Conference on Artificial Intelligence in Information and Communication (ICAIIC), Jeju Island, Korea, 21–24 February 2022; pp. 269–274. [Google Scholar] [CrossRef]
  3. Carneiro, T.; da Nobrega, R.V.M.; Nepomuceno, T.; Bian, G.B.; de Albuquerque, V.H.C.; Filho, P.P.R. Performance Analysis of Google Colaboratory as a Tool for Accelerating Deep Learning Applications. IEEE Access 2018, 6, 61677–61685. [Google Scholar] [CrossRef]
  4. Jackovich, J.; Richards, R. Machine Learning with AWS: Explore the Power of Cloud Services for Your Machine Learning and Artificial Intelligence Projects; Packt Publishing: Birmingham, UK, 2018. [Google Scholar]
  5. Salvaris, M.; Dean, D.; Tok, W.H. Deep Learning with Azure: Building and Deploying Artificial Intelligence Solutions on the Microsoft AI Platform, 1st ed.; Apress Imprint: Berkeley, CA, USA, 2018. [Google Scholar]
  6. Han, Y.; Wang, X.; Leung, V.; Niyato, D.; Yan, X.; Chen, X. Convergence of Edge Computing and Deep Learning: A Comprehensive Survey. arXiv 2019, arXiv:1907.08349. [Google Scholar]
  7. Shi, W.; Cao, J.; Zhang, Q.; Li, Y.; Xu, L. Edge computing: Vision and challenges. IEEE Internet Things J. 2016, 3, 637–646. [Google Scholar] [CrossRef]
  8. Wang, F.; Zhang, M.; Wang, X.; Ma, X.; Liu, J. Deep learning for edge computing applications: A state-of-the-art survey. IEEE Access 2020, 8, 58322–58336. [Google Scholar] [CrossRef]
  9. Berthelier, A.; Chateau, T.; Duffner, S.; Garcia, C.; Blanc, C. Deep Model Compression and Architecture Optimization for Embedded Systems: A Survey. J. Signal Process. Syst. 2020, 93, 863–878. [Google Scholar] [CrossRef]
  10. Benoit, J.; Skirmantas, K.; Chen, B.; Zhu, M.; Tang, M.; Andrew, G.H.; Hartwig, A.; Kalenichenko, D. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 18–23 June 2018; pp. 2704–2713. [Google Scholar]
  11. Blalock, D.; Ortiz, J.J.G.; Frankle, J.; Guttag, J. What is the state of neural network pruning? arXiv 2020, arXiv:2003.03033. [Google Scholar]
  12. Hinton, G.; Vinyals, O.; Dean, J. Distilling the Knowledge in a Neural Network. arXiv 2015, arXiv:1503.02531. [Google Scholar]
  13. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv 2016, arXiv:1602.02830. [Google Scholar]
  14. Iandola, F.N.; Han, S.; Moskewicz, M.W.; Ashraf, K.; Dally, W.J.; Keutzer, K. SqueezeNet: AlexNet-level accuracy with 50× fewer parameters and <0.5 MB model size. arXiv 2016, arXiv:1602.07360. [Google Scholar]
  15. Merenda, M.; Porcaro, C.; Iero, D. Edge machine learning for AI-enabled IoT devices?: A review. Sensors 2020, 20, 2533. [Google Scholar] [CrossRef] [PubMed]
  16. Thompson, C.N.; Greenewald, K.; Lee, K.; Manso, F.G. The Computational Limits of Deep Learning. arXiv 2020, arXiv:2007.05558. [Google Scholar]
  17. Misra, J.; Saha, I. Artificial neural networks in hardware: A survey of two decades of progress. Neurocomputing 2010, 74, 239–255. [Google Scholar] [CrossRef]
  18. Talib, M.A.; Majzoub, S.; Nasir, Q.; Jamal, D. A systematic literature review on hardware implementation of artificial intelligence algorithms. J. Supercomput. 2020, 77, 1897–1938. [Google Scholar] [CrossRef]
  19. Jouppi, N.; Young, C.; Patil, N.; Patterson, D. Motivation for and evaluation of the first tensor processing unit. IEEE Micro 2018, 38, 10–19. [Google Scholar] [CrossRef]
  20. Maxim Integrated. Application Note 7417: Developing Power-Optimized Applications on the MAX78000. Available online: https://www.maximintegrated.com/en/design/technical-documents/app-notes/7/7417.html (accessed on 31 May 2021).
  21. Maxim Integrated. Application Note 7359: Keywords Spotting Using the MAX78000. Available online: https://www.maximintegrated.com/en/design/technical-documents/app-notes/7/7359.html (accessed on 31 May 2021).
  22. Maxim Integrated. Application Note 7364: Face Identification Using the MAX78000. Available online: https://www.maximintegrated.com/en/design/technical-documents/app-notes/7/7364.html (accessed on 31 May 2021).
  23. Pullini, A.; Rossi, D.; Loi, I.; Tagliavini, G.; Benini, L. Mr.Wolf: An Energy-Precision Scalable Parallel Ultra Low Power SoC for IoT Edge Processing. IEEE J. Solid-State Circuits 2019, 54, 1970–1981. [Google Scholar] [CrossRef]
  24. Osta, M.; Ibrahim, A.; Magno, M.; Eggimann, M.; Pullini, A.; Gastaldo, P.; Valle, M. An energy efficient system for touch modality classification in electronic skin applications. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Sapporo, Japan, 26–29 May 2019; pp. 1–4. [Google Scholar] [CrossRef]
  25. Benatti, S.; Montagna, F.; Kartsch, V.; Rahimi, A.; Rossi, D.; Benini, L. Online Learning and Classification of EMG-Based Gestures on a Parallel Ultra-Low Power Platform Using Hyperdimensional Computing. IEEE Trans. Biomed. Circuits Syst. 2019, 13, 516–528. [Google Scholar] [CrossRef]
  26. Magno, M.; Wang, X.; Eggimann, M.; Cavigelli, L.; Benini, L. InfiniWolf: Energy efficient smart bracelet for edge computing with dual source energy harvesting. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 342–345. [Google Scholar] [CrossRef]
  27. Schneider, T.; Wang, X.; Hersche, M.; Cavigelli, L.; Benini, L. Q-EEGNet: An energy-efficient 8-bit quantized parallel EEGNet implementation for edge motor-imagery brain-machine interfaces. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP), Bologna, Italy, 14–17 September 2020; pp. 284–289. [Google Scholar] [CrossRef]
  28. Tambe, T.; Yang, E.-Y.; Ko, G.G.; Chai, Y.; Hooper, C.; Donato, M.; Whatmough, P.N.; Rush, A.M.; Brooks, D.; Wei, G.-Y. 9.8 A 25mm2 SoC for IoT devices with 18ms noise-robust speech-to-text latency via bayesian speech denoising and attention-based sequence-to-sequence DNN speech recognition in 16 nm FinFET. In Proceedings of the IEEE International Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 13–22 February 2021; pp. 158–160. [Google Scholar] [CrossRef]
  29. Texas Instruments. Embedded Low-Power Deep Learning with TIDL. Available online: https://www.ti.com/lit/wp/spry314/spry314.pdf (accessed on 31 May 2021).
  30. Lai, L.; Suda, N.; Chandra, V. CMSIS-NN: Efficient neural network kernels for arm cortex-m cpus. arXiv 2018, arXiv:1801.06601. [Google Scholar]
  31. David, R.; Duke, J.; Jain, A.; Reddi, V.J.; Jeffries, N.; Li, J.; Kreeger, N.; Nappier, I.; Natraj, M.; Regev, S.; et al. TensorFlow Lite Micro: Embedded Machine Learning on TinyML Systems. arXiv 2020, arXiv:2010.08678. [Google Scholar]
  32. Falbo, V.; Apicella, T.; Aurioso, D.; Danese, L.; Bellotti, F.; Berta, R.; De Gloria, A. Analyzing machine learning on mainstream microcontrollers. In Proceedings of the International Conference on Applications Electronics Pervading Industry, Environment and Society, Pisa, Italy, 19–20 November 2019; Springer: Cham, Switzerland, 2020; pp. 103–108. Available online: https://link.springer.com/chapter/10.1007/978-3-030-37277-4_12 (accessed on 29 November 2021).
  33. uTensor. TinyML AI Inference Library. Available online: https://github.com/uTensor/uTensor (accessed on 31 May 2021).
  34. Pytorch Mobile. End-to-End Workflow from Training to Deployment for iOS and Android Mobile Devices. Available online: https://pytorch.org/mobile/home/ (accessed on 31 May 2021).
  35. Orășan, I.L.; Căleanu, C.D. ARM embedded low cost solution for implementing deep learning paradigms. In Proceedings of the International Symposium on Electronics and Telecommunications (ISETC), Timișoara, Romania, 5–6 November 2020; pp. 1–4. [Google Scholar] [CrossRef]
  36. Alongi, F.; Ghielmetti, N.; Pau, D.; Terraneo, F.; Fornaciari, W. Tiny neural networks for environmental predictions: An integrated approach with miosix. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP), Bologna, Italy, 14–17 September 2020; pp. 350–355. [Google Scholar] [CrossRef]
  37. Miosix OS Kernel. Available online: https://miosix.org/ (accessed on 8 August 2022).
  38. Akhtari, S.; Pickhardt, F.; Pau, D.; Di Pietro, A.; Tomarchio, G. Intelligent embedded load detection at the edge on industry 4.0 powertrains applications. In Proceedings of the IEEE 5th International Forum on Research and Technology for Society and Industry (RTSI), Florence, Italy, 9–12 September 2019; pp. 427–430. [Google Scholar] [CrossRef]
  39. Jordan, A.A.; Pegatoquet, A.; Castagnetti, A.; Raybaut, J.; Le Coz, P. Deep Learning for Eye Blink Detection Implemented at the Edge. IEEE Embed. Syst. Lett. 2020, 13, 130–133. [Google Scholar] [CrossRef]
  40. De Vita, F.; Nocera, G.; Bruneo, D.; Tomaselli, V.; Giacalone, D.; Das, S.K. Quantitative Analysis of Deep Leaf: A plant disease detector on the smart edge. In Proceedings of the IEEE International Conference on Smart Computing (SMARTCOMP), Bologna, Italy, 14–17 September 2020; pp. 49–56. [Google Scholar] [CrossRef]
  41. Grzymkowski, L.; Stefański, T.P. Performance analysis of convolutional neural networks on embedded systems. In Proceedings of the 27th International Conference on Mixed Design of Integrated Circuits and System (MIXDES), Wroclaw, Poland, 25–27 June 2020; pp. 266–271. [Google Scholar]
  42. Nyamukuru, M.T.; Odame, K.M. Tiny eats: Eating detection on a microcontroller. In Proceedings of the IEEE Second Workshop on Machine Learning on Edge in Sensor Systems (SenSys-ML), Sydney, Australia, 21 April 2020; pp. 19–23. [Google Scholar] [CrossRef]
  43. Cerutti, G.; Prasad, R.; Brutti, A.; Farella, E. Compact Recurrent Neural Networks for Acoustic Event Detection on Low-Energy Low-Complexity Platforms. IEEE J. Sel. Top. Signal Process. 2020, 14, 654–664. [Google Scholar] [CrossRef]
  44. Adhau, S.; Patil, S.; Ingole, D.; Sonawane, D. Embedded implementation of deep learning-based linear model predictive control. In Proceedings of the Sixth Indian Control Conference (ICC), Hyderabad, India, 18–20 December 2019; pp. 200–205. [Google Scholar] [CrossRef]
  45. Cerutti, G.; Prasad, R.; Farella, E. Convolutional neural network on embedded platform for people presence detection in low resolution thermal images. In Proceedings of the ICASSP 2019—IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Brighton, UK, 12–17 May 2019; pp. 7610–7614. [Google Scholar] [CrossRef]
  46. Crocioni, G.; Pau, D.; Delorme, J.-M.; Gruosso, G. Li-Ion Batteries Parameter Estimation with Tiny Neural Networks Embedded on Intelligent IoT Microcontrollers. IEEE Access 2020, 8, 122135–122146. [Google Scholar] [CrossRef]
  47. Karg, B.; Lucia, S. Deep learning-based embedded mixed-integer model predictive control. In Proceedings of the European Control Conference (ECC), Limassol, Cyprus, 12–15 June 2018; pp. 2075–2080. [Google Scholar] [CrossRef]
  48. Torti, E.; Fontanella, A.; Musci, M.; Blago, N.; Pau, D.; Leporati, F.; Piastra, M. Embedded real-time fall detection with deep learning on wearable devices. In Proceedings of the 21st Euromicro Conference on Digital System Design (DSD), Prague, Czech Republic, 29–31 August 2018; pp. 405–412. [Google Scholar] [CrossRef]
  49. Faraone, A.; Delgado-Gonzalo, R. Convolutional-recurrent neural networks on low-power wearable platforms for cardiac arrhythmia detection. In Proceedings of the 2nd IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Genova, Italy, 31 August–2 September 2020; pp. 153–157. [Google Scholar] [CrossRef]
  50. Krishnamoorthi, R. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv 2018, arXiv:1806.08342. [Google Scholar]
  51. Lorenser, T. The DSP capabilities of arm cortex-m4 and cortex-m7 processors. ARM White Pap. 29 November 2016. Available online: https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/white-paper-dsp-capabilities-of-cortex-m4-and-cortex-m7 (accessed on 29 November 2021).
  52. Chen, W.; Qiu, H.; Zhuang, J.; Zhang, C.; Hu, Y.; Lu, Q.; Xu, X. Quantization of Deep Neural Networks for Accurate EdgeComputing. arXiv 2021, arXiv:2104.12046. [Google Scholar]
  53. Alessandrini, M.; Biagetti, G.; Crippa, P.; Falaschetti, L.; Manoni, L.; Turchetti, C. Singular Value Decomposition in Embedded Systems Based on ARM Cortex-M Architecture. Electronics 2021, 10, 34. [Google Scholar] [CrossRef]
  54. Mehonic, A.; Sebastian, A.; Rajendran, B.; Simeone, O.; Vasilaki, E.; Kenyon, A.J. Memristors—From in-memory computing, deep learning acceleration, and spiking neural networks to the future of neuromorphic and bio-inspired computing. Adv. Intell. Syst. 2020, 2, 2000085. [Google Scholar] [CrossRef]
  55. Microchip. Artificial Intelligence and Machine Learning. Available online: https://www.microchip.com/en-us/solutions/machine-learning# (accessed on 29 November 2021).
  56. Novac, P.E.; Boukli Hacene, G.; Pegatoquet, A.; Miramond, B.; Gripon, V. Quantization and Deployment of Deep Neural Networks on Microcontrollers. Sensors 2021, 21, 2984. [Google Scholar] [CrossRef] [PubMed]
  57. Lin, J.; Chen, W.-M.; Lin, Y.; Cohn, J.; Gan, C.; Han, S. MCUNet: Tiny deep learning on IoT devices. arXiv 2020, arXiv:2007.10319. [Google Scholar]
  58. NNI (Neural Network Intelligence). Model Compression. Available online: https://nni.readthedocs.io/en/v2.0/model_compression.html (accessed on 29 November 2021).
  59. Q.I Center. AI Model Efficiency Toolkit User Guide. Available online: https://quic.github.io/aimet-pages/index.html (accessed on 29 November 2021).
  60. SparseML. Available online: https://github.com/neuralmagic/sparseml (accessed on 29 November 2021).
  61. ARM Developer. ARM Ethos-N Series Processors. Available online: https://developer.arm.com/ip-products/processors/machine-learning/arm-ethos-n (accessed on 31 May 2021).
Figure 1. The architecture of MAX78000.
Figure 2. Conversion of a pre-trained model using STM32CubeMX.AI.
Table 1. Summary of the selected works on deep learning inference with ARM Cortex-M.

| [Source], Year | Model | Application | Tools, Support Frameworks | Hardware | Results |
|---|---|---|---|---|---|
| [36], 2020 | LSTM, GRU, CNN-LSTM, CNN-GRU | Weather forecasting | X-CUBE-AI toolchain, Keras, TensorFlow | STM32F401RET6, ARM Cortex-M4, 512 kB Flash, 96 kB SRAM, 84 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | NRMSE: 0.0328; NMAE: 0.0251 |
| [38], 2019 | CNN | Monitoring the load applied to a powertrain system | STM32Cube.AI, Keras | STM32F469AI, ARM Cortex-M4, 2 MB Flash, 384 + 4 kB SRAM including 64 kB CCM (Core Coupled Memory), 180 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | Accuracy: 97.71% |
| [39], 2020 | CNNs | Drowsiness detection based on eye blink sensing | X-CUBE-AI toolchain | STM32L451xx, ARM Cortex-M4, 512 kB Flash, 160 kB SRAM, 80 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | Accuracy range: 87.4–90.8%; average MCU power consumption: 3.5–5 mW |
| [40], 2020 | Q-CNN | Detection of coffee plant diseases | X-CUBE-AI toolchain, TensorFlow Lite | STM32F746NG, ARM Cortex-M7, 1 MB Flash, 340 kB SRAM, 216 MHz, FPU, DSP instructions, SIMD, L1 cache: 4 kB instruction + 4 kB data | Accuracy: 96%; average energy consumption: 134.12 mJ |
| [41], 2020 | CNN, RVNN, CVNN | No application was used | TensorFlow Lite, CMSIS-NN | i.MX RT1050 MCU, ARM Cortex-M7, 512 kB SRAM, 600 MHz, FPU, DSP instructions, SIMD, L1 cache: 32 kB instruction + 32 kB data | CMSIS-NN: smaller memory footprint, penalty on the inference time; TensorFlow Lite: better inference time, higher computation efficiency |
| [42], 2020 | GRU | Eating episodes classification | TensorFlow, Keras, PyTorch | ARM Cortex-M0+, 156 kB Flash, 32 kB SRAM, 48 MHz | Inference time: 6 ms; accuracy: 94.41% |
| [43], 2020 | VGGish feature extractor, GRU | Outdoor sound event detection | CMSIS-NN | STM32L476RG, ARM Cortex-M4, 1 MB Flash, 128 kB SRAM, 80 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | Accuracy: 72.67%; inference time using CMSIS-NN: 125.6 ms; inference time with a plain C implementation: 291.4 ms; average power consumption: 5.5 mW |
| [44], 2019 | RNN | Model predictive control for anesthesia control | Matlab | ARM Cortex-M3, 512 kB Flash, 96 kB SRAM, 84 MHz | Inference time: 2.99 ms |
| [45], 2019 | CNN | Outdoor human presence detection | TensorFlow, CMSIS-NN | STM32L476RG, ARM Cortex-M4, 1 MB Flash, 128 kB SRAM, 80 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | Inference time: 4.01 ms; TensorFlow accuracy: 76.9%; CMSIS-NN accuracy: 76.7%; power consumption: 2.3 mW |
| [46], 2020 | CNN, LSTM, GRU, CNN-LSTM, CNN-GRU | State of health estimation and maximum releasable capacity of lithium-ion batteries | STM32Cube.AI, TensorFlow Lite for Microcontrollers | STM32H743ZI (ARM Cortex-M7), STM32F411RE (ARM Cortex-M4), STM32L4R5ZI (ARM Cortex-M4) | RMSE: 0.0488; MAE: 0.0414; inference energy consumption: 399.079 nJ (non-quantized), 18.654 nJ (quantized) |
| [47], 2018 | DNN | Mixed-integer model predictive control for the energy management system of a smart building | TensorFlow, Keras, EdgeAI | ARM Cortex-M3, 512 kB Flash, 96 kB SRAM, 89 MHz | Inference time: 2.9 ms; memory footprint: 35 kB |
| [48], 2018 | LSTM, RNNs | Fall detection wearable system | TensorFlow, CMSIS | STM32L476JGY, ARM Cortex-M4, 1 MB Flash, 128 kB SRAM, 80 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | Accuracy: 98% |
| [49], 2020 | C-RNN | Cardiac arrhythmia detection | Keras, CMSIS-NN | nRF52832, ARM Cortex-M4, 512 kB Flash, 64 kB SRAM, 64 MHz, FPU, DSP instructions, SIMD, instruction and data cache memory | Accuracy: 85.7%; inference time: 94.8 ms; power consumption: 20.65 mW |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
