# Energy-Efficient Non-Von Neumann Computing Architecture Supporting Multiple Computing Paradigms for Logic and Binarized Neural Networks


## Abstract

The energy consumption is reduced by 10^3 times when performing a BNN inference task with respect to a SIMPLY implementation.

## 1. Introduction

## 2. Results

### 2.1. Logic-in-Memory and the SIMPLY Architecture

#### 2.1.1. Material Implication Logic

The elementary IMPLY gate comprises two RRAM devices connected to a shared resistor (R_G in Figure 1a), a control logic, and analog tri-state buffers that deliver appropriate voltages to the RRAM devices. Since IMPLY and FALSE form a complete logic group, all logic operations can be implemented with a sequence of these two operations [27]. In this framework, logic bits are mapped to the resistances of RRAM devices: a logic 0 and a logic 1 are encoded as a high-resistive state (HRS) and a low-resistive state (LRS), respectively. During computations, the RRAMs act simultaneously as both the inputs and the outputs of the computation. In particular, to perform an IMPLY operation between two bits (i.e., P and Q in Figure 1b,c), the control logic simultaneously drives the top electrodes of the two input RRAM devices with two voltage pulses of amplitudes V_COND and V_SET (see Figure 1b), respectively. By choosing appropriate values for the V_SET and V_COND voltages, which must satisfy the requirements for each input combination reported in Figure 1c, the state of the device receiving V_COND (i.e., P in Figure 1b,c) never changes, while the state of the other device (i.e., Q in Figure 1b,c) changes according to the IMPLY truth table (see Figure 1c, where Q' is the state of Q after the IMPLY operation). The FALSE operation is executed by applying a negative voltage pulse of amplitude V_FALSE to a single device, resetting it to the HRS (see Figure 1d). However, this IMPLY scheme suffers from several reliability challenges, such as logic state degradation and a small tolerance to voltage variations, which hinder its implementation [28,29].
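As a minimal illustration of the completeness of the {IMPLY, FALSE} group, the truth tables above can be sketched as a purely logical model (the helper names below are ours; device physics is not simulated):

```python
def imply(p: int, q: int) -> int:
    """P IMPLY Q: the result is 0 only when P = 1 and Q = 0."""
    return int((not p) or q)

def false_op() -> int:
    """FALSE: unconditionally resets a device to HRS (logic 0)."""
    return 0

def nand(p: int, q: int) -> int:
    """NAND built from the complete group: clear a work device s with
    FALSE, then compute P IMPLY (Q IMPLY s) = P IMPLY (NOT Q)."""
    s = false_op()       # work device cleared to 0
    t = imply(q, s)      # Q IMPLY 0 = NOT Q
    return imply(p, t)   # P IMPLY (NOT Q) = NAND(P, Q)

# Exhaustive check against the IMPLY truth table of Figure 1c.
assert [imply(p, q) for p in (0, 1) for q in (0, 1)] == [1, 1, 0, 1]
assert [nand(p, q) for p in (0, 1) for q in (0, 1)] == [1, 1, 1, 0]
```

Since NAND is itself functionally complete, this three-step sequence is enough to argue that any Boolean function can be composed from IMPLY and FALSE alone.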

#### 2.1.2. SIMPLY

SIMPLY splits the IMPLY operation into a read step and a conditional set step: a small read voltage V_READ is applied to P and Q, and the voltage across R_G (V_N) is compared with a threshold (V_TH) using a comparator, as shown in Figure 2a. In fact, V_N is lower when both inputs are zero than in all the other cases (see Figure 2b), providing a sufficient read margin (RM) for the comparator. The comparator output is fed to a control logic, which then pulses V_SET on Q only when necessary. Using a sufficiently low V_READ voltage effectively solves the problem of logic state degradation [17]. Moreover, the drivers in the peripheral circuitry of the array can be simplified, as V_COND is no longer required. In addition, the high V_SET voltage pulse is applied only in the first case of the truth table, while in the other three cases the main energy consumption is due to the small V_READ pulses and the comparator. As described in previous works [19,20], the latter can be implemented with the voltage sense amplifier (VSA) in Figure 2d, which is fast and energy efficient (i.e., the VSA implemented in a 45 nm technology from [30] with a V_DD of 2 V dissipates just 8 fJ per comparison on average). Therefore, SIMPLY considerably improves the energy per IMPLY operation in three out of four cases of the truth table compared to the conventional IMPLY architecture [17,19].
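The read-then-conditional-set sequence can be sketched numerically with a simple resistive-divider model, where the two input devices are read in parallel against the grounded R_G (all resistor and voltage values below are illustrative assumptions, not the paper's calibrated technology parameters):

```python
# Illustrative device and circuit parameters (assumptions).
R_LRS, R_HRS = 10e3, 1e6   # resistive states encoding logic 1 and 0 (ohms)
R_G = 50e3                 # shared pull-down resistor
V_READ = 0.2               # small read voltage, limiting state degradation

def v_n(r_p: float, r_q: float) -> float:
    """Voltage across R_G when V_READ drives P and Q in parallel."""
    r_par = (r_p * r_q) / (r_p + r_q)
    return V_READ * R_G / (R_G + r_par)

def simply_imply(p: int, q: int, v_th: float) -> int:
    """Read-before-set: pulse V_SET on Q only when the comparator fires."""
    r = {0: R_HRS, 1: R_LRS}
    if v_n(r[p], r[q]) < v_th:   # both inputs 0 -> V_N is lowest
        return 1                 # conditional SET: Q' = 1
    return q                     # Q keeps its state, no energetic pulse

# Place the threshold between the all-HRS case and the next-lowest case.
v_th = 0.5 * (v_n(R_HRS, R_HRS) + v_n(R_HRS, R_LRS))
assert [simply_imply(p, q, v_th) for p in (0, 1) for q in (0, 1)] == [1, 1, 0, 1]
```

Note how the energetic V_SET pulse is issued in exactly one of the four input combinations, which is the source of SIMPLY's energy advantage.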

Similarly, applying the V_FALSE voltage to a device that is already in the HRS results in unnecessary energy dissipation. This can be prevented by first reading the state of the device and then applying V_FALSE only when the device is in the LRS (see Figure 2c). The effectiveness of this scheme in reducing the energy per operation depends on the employed RRAM technology [19]: the energy reduction achieved with RRAM technologies characterized by a very high HRS is limited, while it is relevant for technologies with a relatively low HRS.

Besides using a sufficiently low V_READ (see Figure 3a), a higher RM is also obtained by using the optimal value of R_G (see Figure 3b), which is determined using Equation (1) from [31], where R_HRS,MAX and R_HRS,MIN are the ±3σ values of the R_HRS distribution, while R_LRS,MAX is the +3σ value of the R_LRS distribution.
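The same optimum can also be found numerically, by sweeping R_G and maximizing the worst-case read margin between the all-HRS corner and the worst one-LRS corner; this sketch stands in for the closed-form Equation (1), and the ±3σ corner values below are illustrative assumptions:

```python
# Illustrative +/-3-sigma corner values (assumptions, not measured data).
V_READ = 0.2
R_HRS_MIN, R_HRS_MAX = 0.5e6, 2e6
R_LRS_MAX = 20e3

def v_n(r_par: float, r_g: float) -> float:
    """Divider voltage across R_G for a given parallel device resistance."""
    return V_READ * r_g / (r_g + r_par)

def read_margin(r_g: float) -> float:
    # Worst "both HRS" case: smallest parallel HRS -> highest V_N for 00.
    v00 = v_n(R_HRS_MIN / 2, r_g)
    # Worst "at least one LRS" case: largest parallel combo -> lowest V_N.
    v01 = v_n(R_HRS_MAX * R_LRS_MAX / (R_HRS_MAX + R_LRS_MAX), r_g)
    return v01 - v00

# Coarse sweep over candidate R_G values from 1 kOhm to 500 kOhm.
candidates = [r * 1e3 for r in range(1, 501)]
best = max(candidates, key=read_margin)
assert read_margin(best) > 0   # a positive worst-case margin must exist
```

The sweep confirms the qualitative trend of Figure 3b: the margin collapses at both extremes of R_G and peaks at an intermediate value.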

During the parallel read step, the transistors connecting the columns to R_G and those enabling the selected rows are turned on, routing each active row to its VSA. The control logic receives as input the results of all the comparisons and activates only the rows where an RRAM device should be switched during the device SET step. As shown in previous works [19,31], SIMPLY-based SIMD architectures achieve high computing efficiency and throughput.

### 2.2. Binarized Neural Networks (BNNs) Hardware Accelerator Architectures

#### 2.2.1. Binarized Neural Networks with SIMPLY

#### 2.2.2. Binarized Neural Networks with Analog Vector Matrix Multiplication

The design of the AVMM accelerator involves a tradeoff between R_HRS, R_LRS, the pull-down resistance (R_PD), and the VSA threshold voltage. As shown in Figure 7a, considering the case with 15 devices read in parallel during each MAC, increasing R_PD changes the required VSA threshold voltage. Ideally, a linear relation between V_N and the number of positive products is desirable. However, achieving such linearity requires very low R_PD values, which considerably reduce the dynamic range at the input of the comparator and thus increase the probability of errors due to resistive state variability. On the other hand, a too high R_PD value causes the input voltage to the VSA (i.e., V_N) to rapidly saturate to V_READ. Conversely, for a fixed V_TH, lowering R_PD increases the number of devices that can be read in parallel during each MAC operation, as shown in Figure 7b. However, too low R_PD values would require large FET devices, and the effect of line parasitic resistances could make the circuit more susceptible to noise, affecting its reliability. Thus, in this framework, MAC operations are split into multiple computing steps using the input split method [21,36], so that partial MAC operations are computed with the maximum parallelism enabled by the designed architecture. These partial results are then accumulated, and the result of the accumulation is compared to a trained threshold to produce the neuron output activation.
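The R_PD tradeoff and the input split method can be sketched with a simple divider model in which only the k devices in LRS conduct (illustrative values; the `v_n` and `split_mac` helpers are hypothetical names, and HRS leakage is neglected):

```python
import math

V_READ = 0.2
R_LRS = 10e3   # assumed LRS resistance (ohms)

def v_n(k: int, r_pd: float) -> float:
    """Node voltage with k LRS devices conducting in parallel (k >= 0)."""
    if k == 0:
        return 0.0
    return V_READ * r_pd / (r_pd + R_LRS / k)

# Low R_PD keeps V_N roughly proportional to k; high R_PD saturates fast.
lin = [v_n(k, 500) for k in range(16)]     # low R_PD: near-linear
sat = [v_n(k, 50e3) for k in range(16)]    # high R_PD: saturates to V_READ
assert lin[15] / lin[1] > sat[15] / sat[1]

def split_mac(products, width=15):
    """Input split: accumulate a long MAC in ceil(n/width) partial steps."""
    steps = math.ceil(len(products) / width)
    acc = sum(sum(products[i*width:(i+1)*width]) for i in range(steps))
    return acc, steps

acc, steps = split_mac([1] * 400)   # a 400-input MAC, 15-wide reads
assert (acc, steps) == (400, 27)
```

The last two lines mirror the first-layer figure used later in the paper: a 400-input MAC executed as 27 partial reads of 15 inputs each.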

### 2.3. Merging SIMPLY and BNN Analog Vector Matrix Multiplication Accelerator

The two computing paradigms share most of the circuitry, i.e., the 1T1R crossbar array, the programmable pull-down resistances (R_G in the SIMPLY paradigm and R_PD in the BNN AVMM acceleration), the VSAs, and the corresponding voltage thresholds. However, when considering the same 1T1R crossbar array, there are some differences between the two architectures and in the management of their respective control signals. Specifically, when performing a read step in the SIMPLY computing paradigm, the select line corresponding to the row where the devices are located is activated, the read voltages are applied to the crossbar columns, and the output is read from the appropriate crossbar row by means of the VSA and of a threshold. This holds true both when performing an IMPLY between devices located in the same row and when the devices are located in the same column.

Conversely, the BNN MAC operations are performed by applying V_READ to the crossbar rows and comparing the V_N voltage with the appropriate threshold at each column. Since IMPLY operations can be performed on devices on the same row or column by applying V_READ at the crossbar columns or rows, respectively, the selector transistor in series with each RRAM device is subject to different source-bulk voltages. Nevertheless, since V_READ is small, the influence of the body effect can be minimized by driving these transistors with sufficiently high gate voltages. Also, using the same VSA threshold voltage for the two computing paradigms translates to different optimal R_G and R_PD values, requiring appropriate control of the gate voltage of the FET devices implementing these resistances. In fact, R_PD is much lower than R_G, since more devices are read in parallel than in SIMPLY: a too high R_PD would let the input of the VSA saturate at V_READ with only a few active devices in the LRS, thus hindering correct circuit operation.
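How a single FET can realize both R_G and R_PD can be illustrated with a first-order long-channel triode model of the channel resistance, R_ch ≈ 1 / (k'·(W/L)·(V_GS − V_T)); the process parameters below are assumptions, and only the two gate voltages are those quoted later in Section 4.3:

```python
# Assumed long-channel process parameters (illustrative only).
K_PRIME = 200e-6   # process transconductance (A/V^2)
W_OVER_L = 5.0     # device geometry
V_T = 0.5          # threshold voltage (V)

def channel_resistance(v_gs: float) -> float:
    """First-order triode-region channel resistance for small V_DS."""
    assert v_gs > V_T, "device must be on"
    return 1.0 / (K_PRIME * W_OVER_L * (v_gs - V_T))

r_simply = channel_resistance(1.48)  # SIMPLY mode: lower V_GS -> higher R_G
r_bnn = channel_resistance(2.9)      # AVMM mode: higher V_GS -> lower R_PD
assert r_bnn < r_simply              # matches the R_PD << R_G requirement
```

The model only captures the trend: raising V_GS lowers the realized resistance, which is exactly the knob the control logic uses to switch between the two paradigms.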

### 2.4. Circuit Design Tradeoffs for Performance and Reliability

A first strategy consists in programming the RRAM devices with a lower compliance current (I_C), and therefore a higher R_LRS. Firstly, the use of a lower I_C leads to lower energy dissipation when programming a device, thus tackling the main energy limitation associated with the SIMPLY paradigm and largely improving the energy efficiency. Secondly, higher R_LRS values also lower the energy required for each parallel read, both when computing an IMPLY in the SIMPLY paradigm and when implementing the BNN AVMM. Furthermore, this strategy yields additional advantages in overall area consumption and speed. Lowering I_C reduces the required size of the FET devices used as selectors and in the array periphery, thus reducing the chip area. Also, a higher R_LRS increases the parallelism of the BNN AVMM implementation, since more rows can be read in parallel with the same R_PD resistance. However, cycle-to-cycle (C2C) and device-to-device (D2D) variability is inversely proportional to I_C [31,37], so too low I_C values may affect the circuit reliability, depending on the memory technology employed. The energy efficiency is also improved by reducing V_READ, which in turn reduces the energy consumption of the read operations performed in both the SIMPLY and the BNN AVMM computing paradigms. Here, too, reliability issues may arise, since too low read voltages reduce the RM and the SNR at the input of the VSAs.

On the other hand, the 1T1R structure occupies a larger area per cell with respect to a passive 1R device [38] (i.e., 4F^2) and requires a higher number of control signals and interconnections. Other passive selector devices could introduce the high non-linearity required to solve the sneak-path issue while retaining the 4F^2 feature size [39], and could also be used with the proposed architecture by changing the driving voltage scheme. Also, the equivalent number of memory devices per chip area can be increased by fabricating 3D array structures. These can be implemented by stacking horizontal crossbar arrays or, even more efficiently, by realizing a 3D vertical structure, which would lead to the highest densities and cost reduction [40].

## 3. Discussion


For such applications, the endurance requirement is extremely high (>10^14). Thus, spin-transfer torque magnetic RAM (STT-MRAM) devices are more suitable candidates, thanks to their high retention (>10 years), endurance (>10^14), and switching speed (~ns) [43]. However, STT-MRAM devices usually have a small tunnel magnetoresistance (TMR), which leads to a very small memory window that can affect the circuit reliability if not addressed properly, especially when implementing the AVMM. For instance, Gao et al. [44] showed that STT-MRAM can indeed be used to accelerate the AVMM for BNNs, but at the cost of additional in-hardware calibration steps and more complex peripheral circuitry, including operational amplifiers to implement the virtual ground, thus limiting the throughput, energy efficiency, and chip density. Conversely, devices characterized by lower endurance are better suited for applications requiring less frequent burst operations, such as smart sensors. For instance, providing reliable device operation over a 10-year period with a memory technology offering a 10^8 endurance would limit the computing speed to 20 inferences per minute, unless mitigation strategies are introduced, such as periodically changing the devices used for computations.

Currently, several emerging non-volatile memory (NVM) technologies have been shown to achieve endurance >10^8. Among these technologies, phase change memory (PCM) devices are the most mature and offer long retention and high endurance (>10^12) [45,46,47], but require higher switching currents than other ENVMs, which limits the integration density. Ferroelectric tunnel junction (FTJ) devices are also a promising candidate for the development of ultra-low-power in-memory computing architectures thanks to their low programming energy and fast speed; however, high endurance, retention, and scalability still need to be fully demonstrated [45,47]. At the state of the art, RRAM technologies provide the best overall characteristics. Depending on the material stack, RRAMs can achieve endurance up to 10^10 [48], long retention, a large memory window (≈10), and can be used to realize vertical 3D structures similar to flash memory technology, leading to ultra-dense arrays. However, two main technology-related challenges remain. Specifically, C2C and D2D variability leads to random resistance distributions that spread when lowering I_C [31,37], introducing a tradeoff between energy efficiency, reliability, and throughput when performing the AVMM. Also, achieving ultra-dense vertical 3D arrays while preventing the sneak-path issue requires a particular research focus on the development of compatible selector devices with a strongly non-linear conduction behavior [39,49].
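The quoted 20-inferences-per-minute limit follows from back-of-the-envelope arithmetic, under the assumption that each inference consumes one endurance cycle on the most-stressed device:

```python
# Budget a 10^8-cycle endurance uniformly over a 10-year lifetime.
minutes_per_10_years = 10 * 365 * 24 * 60   # about 5.26 million minutes
ops_per_minute = 1e8 / minutes_per_10_years
assert 19 <= ops_per_minute <= 20           # ~19 cycles/minute per device
```

This is why rotating the devices used for computation, as suggested above, directly multiplies the sustainable inference rate.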

## 4. Materials and Methods

### 4.1. Circuit Simulations

#### 4.1.1. RRAM Physics-Based Compact Model

Circuit simulations were performed using the RRAM physics-based compact model from [50], available on nanoHub, which was calibrated on a TiN/HfOx/AlOx/Pt RRAM technology from [26] programmed with an I_C of 100 µA. A sketch of the compact model is reported in Figure 9a,b. While other general-purpose memristor models [51,52,53,54] and physics-based RRAM compact models [55,56,57,58,59,60] exist in the literature, the adopted compact model is particularly suited for estimating circuit performance and reliability, as it includes all the relevant RRAM device characteristics and non-idealities (e.g., dynamic temperature modeling, resistive state variability, and RTN), which only a few other physics-based compact models [56,57] consider, as discussed in [28]. Specifically, the compact model approximates the device resistance as the sum of a conductive filament (CF) component and a dielectric barrier component (see Figure 9a,b). Differential equations model the field-activated and temperature-accelerated bond breaking during set, and the field-driven drift and recombination of oxygen ions during reset, thus reproducing the dynamics of the dielectric barrier thickness. Thermal effects are also modeled with differential equations, leading to accurate results even when ultra-fast pulses are considered. As shown in Figure 9c,d, the compact model reproduces both the DC I-V characteristics and the response to fast reset pulses with a single set of parameters. Additionally, the compact model includes all the RRAM non-idealities relevant for accurately estimating circuit performance and reliability, specifically RTN and variability [28,50]. The complete list of calibrated parameters is available in [20].

#### 4.1.2. SIMPLY Simulations

The R_G resistors were simulated using planar NMOS devices in a 45 nm technology [30], with a channel width of 250 nm and a channel length of 50 nm. The energy contribution of the SA is also considered by simulating the circuit shown in Figure 4 in the same 45 nm technology, which results in an average energy consumption of 8 fJ when operated with a 2 V V_DD over a temperature range from 0 °C to 85 °C, as reported in [20]. The device SET and RESET operations are achieved using 1 ns voltage pulses with amplitudes of 3 V and −2.9 V, respectively. The SET and RESET amplitudes were determined using the physics-based compact model to ensure that a memory window larger than 10 is achieved even for very short voltage pulses, as discussed in [19]. The read margin (RM) and the performance of both the IMPLY and FALSE operations reported in Table 2 and Table 3 were estimated including the effects of variability and RTN during the read operation, by repeating the simulations (i.e., 50 trials) while reseeding the random sources. Additional information on the SIMPLY circuit simulations and the list of variability and RTN model parameters are available in [20] and [50], respectively.

### 4.2. Implemented Neural Network

### 4.3. BNN Performance Estimates

The same FET devices are used to implement both R_PD and R_G by simply adjusting V_GS (i.e., V_GS is 1.48 V and 2.9 V when operating the crossbar array as a SIMPLY or BNN AVMM accelerator, respectively). With the considered RRAM technology, a maximum of 15 crossbar rows can be reliably read in parallel during the AVMM. Thus, as an example, 27 (i.e., ⌈400/15⌉) computing steps are needed to compute all the intermediate MAC operations of the first layer. After each parallel read operation, the intermediate MAC results are stored in RRAM devices in the crossbar array and accumulated using the SIMPLY accumulator implementation shown in Figure 5e [20]. The results of the accumulations are compared with a threshold to produce each layer's output activations using the SIMPLY comparator implementation from [19], which requires $9\cdot m+\frac{m\left(m+1\right)}{2}$ computing steps, where m is the number of compared bits. Finally, the output layer computes the predicted class using the hardmax SIMPLY implementation reported in [19], which accounts for 1457 computing steps on the proposed architecture and determines the predicted class as the class with the highest activation value. In total, 7902 computing steps are required for each inference, resulting in a 31.6 µs inference latency when 1 ns voltage pulses are used, as reported in Table 1. The worst-case energy for an inference task reported in Table 1 is estimated by running the neural network on the complete test set and considering only the worst-case energy of the IMPLY, SET, and FALSE operations for each specific input combination. The VSA energy contribution is included for both SIMPLY and BNN MAC operations. By considering the worst-case energy for each SIMPLY operation, which is the main contribution to the overall energy consumption, the energy assessments are slightly overestimated and roughly account for additional energy contributions possibly introduced by the peripheral circuitry. Nevertheless, even when increasing the energy consumption by 20% to account for the decoders and drivers, considering the power breakdown reported by He et al. [21], the results (see Table 1) underline the remarkable energy efficiency in comparison with conventional embedded system implementations.
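The step counts quoted above can be cross-checked with a short script; the 8-bit comparator width is an assumed example, and the 4 ns effective duration per computing step is inferred from the quoted totals rather than stated explicitly:

```python
import math

# First layer: 400-input MACs executed as 15-wide partial reads.
mac_steps_layer1 = math.ceil(400 / 15)
assert mac_steps_layer1 == 27

def comparator_steps(m: int) -> float:
    """SIMPLY m-bit comparator cost from the text: 9*m + m*(m+1)/2 steps."""
    return 9 * m + m * (m + 1) / 2

assert comparator_steps(8) == 108   # e.g., an assumed 8-bit comparison

# Quoted totals: 7902 steps per inference, 31.6 us latency.
total_steps, latency = 7902, 31.6e-6
step_time = latency / total_steps
assert abs(step_time - 4e-9) < 0.05e-9   # ~4 ns per computing step
```

The last assertion shows that each computing step effectively spans about four 1 ns pulse slots, which is consistent with a read phase, a comparator decision, and a conditional set phase per step.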

## 5. Conclusions

The energy consumption of a BNN inference task is reduced by more than 10^3 times with respect to a SIMPLY implementation, indicating that the proposed architecture is a viable solution for the realization of reconfigurable ultra-low-power hardware accelerators for edge computing applications.

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## References

1. Zhang, W.; Gao, B.; Tang, J.; Yao, P.; Yu, S.; Chang, M.-F.; Yoo, H.-J.; Qian, H.; Wu, H. Neuro-Inspired Computing Chips. Nat. Electron. **2020**, 3, 371–382.
2. Deng, S.; Zhao, H.; Fang, W.; Yin, J.; Dustdar, S.; Zomaya, A.Y. Edge Intelligence: The Confluence of Edge Computing and Artificial Intelligence. IEEE Internet Things J. **2020**, 7, 7457–7469.
3. Pedretti, G.; Ielmini, D. In-Memory Computing with Resistive Memory Circuits: Status and Outlook. Electronics **2021**, 10, 1063.
4. Kvatinsky, S.; Belousov, D.; Liman, S.; Satat, G.; Wald, N.; Friedman, E.G.; Kolodny, A.; Weiser, U.C. MAGIC—Memristor-Aided Logic. IEEE Trans. Circuits Syst. II Express Briefs **2014**, 61, 895–899.
5. Ziegler, T.; Waser, R.; Wouters, D.J.; Menzel, S. In-Memory Binary Vector–Matrix Multiplication Based on Complementary Resistive Switches. Adv. Intell. Syst. **2020**, 2, 2000134.
6. Kingra, S.K.; Parmar, V.; Chang, C.-C.; Hudec, B.; Hou, T.-H.; Suri, M. SLIM: Simultaneous Logic-in-Memory Computing Exploiting Bilayer Analog OxRAM Devices. Sci. Rep. **2020**, 10.
7. Pei, J.; Deng, L.; Song, S.; Zhao, M.; Zhang, Y.; Wu, S.; Wang, G.; Zou, Z.; Wu, Z.; He, W.; et al. Towards Artificial General Intelligence with Hybrid Tianjic Chip Architecture. Nature **2019**, 572, 106–111.
8. Xiao, T.P.; Bennett, C.H.; Feinberg, B.; Agarwal, S.; Marinella, M.J. Analog Architectures for Neural Network Acceleration Based on Non-Volatile Memory. Appl. Phys. Rev. **2020**, 7, 031301.
9. Saxena, V. Neuromorphic Computing: From Devices to Integrated Circuits. J. Vac. Sci. Technol. B **2021**, 39, 010801.
10. Berggren, K.; Xia, Q.; Likharev, K.K.; Strukov, D.B.; Jiang, H.; Mikolajick, T.; Querlioz, D.; Salinga, M.; Erickson, J.R.; Pi, S.; et al. Roadmap on Emerging Hardware and Technology for Machine Learning. Nanotechnology **2020**, 32, 012002.
11. Benoit, P.; Dalmasso, L.; Patrigeon, G.; Gil, T.; Bruguier, F.; Torres, L. Edge-Computing Perspectives with Reconfigurable Hardware. In Proceedings of the 2019 14th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), York, UK, 1–3 July 2019; pp. 51–58.
12. Yu, J.; Du Nguyen, H.A.; Abu Lebdeh, M.; Taouil, M.; Hamdioui, S. Enhanced Scouting Logic: A Robust Memristive Logic Design Scheme. In Proceedings of the 2019 IEEE/ACM International Symposium on Nanoscale Architectures (NANOARCH), Qingdao, China, 17–19 July 2019; pp. 1–6.
13. Borghetti, J.; Snider, G.S.; Kuekes, P.J.; Yang, J.J.; Stewart, D.R.; Williams, R.S. 'Memristive' Switches Enable 'Stateful' Logic Operations via Material Implication. Nature **2010**, 464, 873–876.
14. Siemon, A.; Menzel, S.; Waser, R.; Linn, E. A Complementary Resistive Switch-Based Crossbar Array Adder. IEEE J. Emerg. Sel. Top. Circuits Syst. **2015**, 5, 64–74.
15. Siemon, A.; Drabinski, R.; Schultis, M.J.; Hu, X.; Linn, E.; Heittmann, A.; Waser, R.; Querlioz, D.; Menzel, S.; Friedman, J.S. Stateful Three-Input Logic with Memristive Switches. Sci. Rep. **2019**, 9, 1–13.
16. Hu, S.-Y.; Li, Y.; Cheng, L.; Wang, Z.-R.; Chang, T.-C.; Sze, S.M.; Miao, X. Reconfigurable Boolean Logic in Memristive Crossbar: The Principle and Implementation. IEEE Electron Device Lett. **2019**, 40, 200–203.
17. Puglisi, F.M.; Zanotti, T.; Pavan, P. SIMPLY: Design of a RRAM-Based Smart Logic-in-Memory Architecture Using RRAM Compact Model. In Proceedings of the ESSDERC 2019—49th European Solid-State Device Research Conference (ESSDERC), Krakow, Poland, 23–26 September 2019; pp. 130–133.
18. Courbariaux, M.; Hubara, I.; Soudry, D.; El-Yaniv, R.; Bengio, Y. Binarized Neural Networks: Training Deep Neural Networks with Weights and Activations Constrained to +1 or −1. arXiv **2016**, arXiv:1602.02830.
19. Zanotti, T.; Puglisi, F.M.; Pavan, P. Reliability and Performance Analysis of Logic-in-Memory Based Binarized Neural Networks. IEEE Trans. Device Mater. Reliab. **2021**, 1.
20. Zanotti, T.; Puglisi, F.M.; Pavan, P. Reconfigurable Smart In-Memory Computing Platform Supporting Logic and Binarized Neural Networks for Low-Power Edge Devices. IEEE J. Emerg. Sel. Top. Circuits Syst. **2020**, 1.
21. He, W.; Yin, S.; Kim, Y.; Sun, X.; Kim, J.-J.; Yu, S.; Seo, J.-S. 2-Bit-Per-Cell RRAM-Based In-Memory Computing for Area-/Energy-Efficient Deep Learning. IEEE Solid State Circuits Lett. **2020**, 3, 194–197.
22. Sun, X.; Peng, X.; Chen, P.; Liu, R.; Seo, J.; Yu, S. Fully Parallel RRAM Synaptic Array for Implementing Binary Neural Network with (+1, −1) Weights and (+1, 0) Neurons. In Proceedings of the 2018 23rd Asia and South Pacific Design Automation Conference (ASP-DAC), Jeju, Korea, 22–25 January 2018; pp. 574–579.
23. Vieira, J.; Giacomin, E.; Qureshi, Y.; Zapater, M.; Tang, X.; Kvatinsky, S.; Atienza, D.; Gaillardon, P.-E. A Product Engine for Energy-Efficient Execution of Binary Neural Networks Using Resistive Memories. In Proceedings of the 2019 IFIP/IEEE 27th International Conference on Very Large Scale Integration (VLSI-SoC), Cuzco, Peru, 6–9 October 2019; pp. 160–165.
24. Yi, W.; Kim, Y.; Kim, J.-J. Effect of Device Variation on Mapping Binary Neural Network to Memristor Crossbar Array. In Proceedings of the 2019 Design, Automation & Test in Europe Conference & Exhibition (DATE), Florence, Italy, 25–29 March 2019; pp. 320–323.
25. Qin, Y.-F.; Kuang, R.; Huang, X.-D.; Li, Y.; Chen, J.; Miao, X.-S. Design of High Robustness BNN Inference Accelerator Based on Binary Memristors. IEEE Trans. Electron Devices **2020**, 67, 3435–3441.
26. Yu, S.; Wu, Y.; Chai, Y.; Provine, J.; Wong, H.-S.P. Characterization of Switching Parameters and Multilevel Capability in HfOx/AlOx Bi-Layer RRAM Devices. In Proceedings of the 2011 International Symposium on VLSI Technology, Systems and Applications, Hsinchu, Taiwan, 25–27 April 2011; pp. 1–2.
27. Lehtonen, E.; Poikonen, J.H.; Laiho, M. Two Memristors Suffice to Compute All Boolean Functions. Electron. Lett. **2010**, 46, 239–240.
28. Zanotti, T.; Puglisi, F.M.; Pavan, P. Reliability-Aware Design Strategies for Stateful Logic-in-Memory Architectures. IEEE Trans. Device Mater. Reliab. **2020**, 20, 278–285.
29. Kvatinsky, S.; Satat, G.; Wald, N.; Friedman, E.G.; Kolodny, A.; Weiser, U.C. Memristor-Based Material Implication (IMPLY) Logic: Design Principles and Methodologies. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. **2014**, 22, 2054–2066.
30. Stine, J.E.; Castellanos, I.; Wood, M.; Henson, J.; Love, F.; Davis, W.R.; Franzon, P.D.; Bucher, M.; Basavarajaiah, S.; Oh, J.; et al. FreePDK: An Open-Source Variation-Aware Design Kit. In Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education (MSE'07), San Diego, CA, USA, 3–4 June 2007; pp. 173–174.
31. Zanotti, T.; Zambelli, C.; Puglisi, F.M.; Milo, V.; Pérez, E.; Mahadevaiah, M.K.; Ossorio, O.G.; Wenger, C.; Pavan, P.; Olivo, P.; et al. Reliability of Logic-in-Memory Circuits in Resistive Memory Arrays. IEEE Trans. Electron Devices **2020**, 67, 4611–4615.
32. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv **2018**, arXiv:1606.06160.
33. Krestinskaya, O.; Otaniyozov, O.; James, A.P. Binarized Neural Network with Stochastic Memristors. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu, Taiwan, 18–20 March 2019; pp. 274–275.
34. Chen, W.-H.; Dou, C.; Li, K.-X.; Lin, W.-Y.; Li, P.-Y.; Huang, J.-H.; Wang, J.-H.; Wei, W.-C.; Xue, C.-X.; Chiu, Y.-C.; et al. CMOS-Integrated Memristive Non-Volatile Computing-in-Memory for AI Edge Processors. Nat. Electron. **2019**, 2, 420–428.
35. Wan, W.; Kubendran, R.; Gao, B.; Joshi, S.; Raina, P.; Wu, H.; Cauwenberghs, G.; Wong, H.S.P. A Voltage-Mode Sensing Scheme with Differential-Row Weight Mapping for Energy-Efficient RRAM-Based In-Memory Computing. In Proceedings of the 2020 IEEE Symposium on VLSI Technology, Honolulu, HI, USA, 16–19 June 2020; pp. 1–2.
36. Yin, S.; Kim, Y.; Han, X.; Barnaby, H.; Yu, S.; Luo, Y.; He, W.; Sun, X.; Kim, J.-J.; Seo, J. Monolithically Integrated RRAM- and CMOS-Based In-Memory Computing Optimizations for Efficient Deep Learning. IEEE Micro **2019**, 39, 54–63.
37. Grossi, A.; Nowak, E.; Zambelli, C.; Pellissier, C.; Bernasconi, S.; Cibrario, G.; El Hajjam, K.; Crochemore, R.; Nodin, J.F.; Olivo, P.; et al. Fundamental Variability Limits of Filament-Based RRAM. In Proceedings of the 2016 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 3–7 December 2016.
38. Mahmoodi, M.R.; Vincent, A.F.; Nili, H.; Strukov, D.B. Intrinsic Bounds for Computing Precision in Memristor-Based Vector-by-Matrix Multipliers. IEEE Trans. Nanotechnol. **2020**, 19, 429–435.
39. Xia, Q.; Yang, J.J. Memristive Crossbar Arrays for Brain-Inspired Computing. Nat. Mater. **2019**, 18, 309–323.
40. Yu, M.; Cai, Y.; Wang, Z.; Fang, Y.; Liu, Y.; Yu, Z.; Pan, Y.; Zhang, Z.; Tan, J.; Yang, X.; et al. Novel Vertical 3D Structure of TaOx-Based RRAM with Self-Localized Switching Region by Sidewall Electrode Oxidation. Sci. Rep. **2016**, 6, 21020.
41. Fouda, M.E.; Eltawil, A.M.; Kurdahi, F. Modeling and Analysis of Passive Switching Crossbar Arrays. IEEE Trans. Circuits Syst. I Regul. Pap. **2018**, 65, 270–282.
42. McDanel, B.; Teerapittayanon, S.; Kung, H.T. Embedded Binarized Neural Networks. In Proceedings of the 2017 International Conference on Embedded Wireless Systems and Networks, Uppsala, Sweden, 20–22 February 2017; pp. 168–173.
43. Kim, C.-H.; Lim, S.; Woo, S.Y.; Kang, W.-M.; Seo, Y.-T.; Lee, S.-T.; Lee, S.; Kwon, D.; Oh, S.; Noh, Y.; et al. Emerging Memory Technologies for Neuromorphic Computing. Nanotechnology **2019**, 30, 032001.
44. Gao, S.; Chen, B.; Qu, Y.; Zhao, Y. MRAM Acceleration Core for Vector Matrix Multiplication and XNOR-Binarized Neural Network Inference. In Proceedings of the 2020 International Symposium on VLSI Technology, Systems and Applications (VLSI-TSA), Hsinchu, Taiwan, 10–13 August 2020; pp. 153–154.
45. Slesazeck, S.; Mikolajick, T. Nanoscale Resistive Switching Memory Devices: A Review. Nanotechnology **2019**, 30, 352003.
46. Ielmini, D.; Wong, H.-S.P. In-Memory Computing with Resistive Switching Devices. Nat. Electron. **2018**, 1, 333–343.
47. Chen, A. A Review of Emerging Non-Volatile Memory (NVM) Technologies and Applications. Solid State Electron. **2016**, 125, 25–38.
48. Nail, C.; Molas, G.; Blaise, P.; Piccolboni, G.; Sklenard, B.; Cagli, C.; Bernard, M.; Roule, A.; Azzaz, M.; Vianello, E.; et al. Understanding RRAM Endurance, Retention and Window Margin Trade-off Using Experimental Results and Simulations. In Proceedings of the 2016 IEEE International Electron Devices Meeting (IEDM), San Francisco, CA, USA, 3–7 December 2016.
49. Shi, L.; Zheng, G.; Tian, B.; Dkhil, B.; Duan, C. Research Progress on Solutions to the Sneak Path Issue in Memristor Crossbar Arrays. Nanoscale Adv. **2020**, 2, 1811–1827.
50. Puglisi, F.M.; Zanotti, T.; Pavan, P. Unimore Resistive Random Access Memory (RRAM) Verilog-A Model. nanoHUB **2019**.
51. Yakopcic, C.; Taha, T.M.; Subramanyam, G.; Pino, R.E.; Rogers, S. A Memristor Device Model. IEEE Electron Device Lett. **2011**, 32, 1436–1438.
52. Kvatinsky, S.; Friedman, E.G.; Kolodny, A.; Weiser, U.C. TEAM: ThrEshold Adaptive Memristor Model. IEEE Trans. Circuits Syst. I Regul. Pap. **2013**, 60, 211–221.
53. Kvatinsky, S.; Ramadan, M.; Friedman, E.G.; Kolodny, A. VTEAM: A General Model for Voltage-Controlled Memristors. IEEE Trans. Circuits Syst. II Express Briefs **2015**, 62, 786–790.
54. Messaris, I.; Serb, A.; Stathopoulos, S.; Khiat, A.; Nikolaidis, S.; Prodromakis, T. A Data-Driven Verilog-A ReRAM Model. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. **2018**, 37, 3151–3162.
55. La Torre, C.; Zurhelle, A.F.; Breuer, T.; Waser, R.; Menzel, S. Compact Modeling of Complementary Switching in Oxide-Based ReRAM Devices. IEEE Trans. Electron Devices **2019**, 66, 1268–1275.
56. Wiefels, S.; Bengel, C.; Kopperberg, N.; Zhang, K.; Waser, R.; Menzel, S. HRS Instability in Oxide-Based Bipolar Resistive Switching Cells. IEEE Trans. Electron Devices **2020**, 67, 4208–4215.
57. González-Cordero, G.; González, M.B.; Campabadal, F.; Jiménez-Molinos, F.; Roldán, J.B. A Physically Based SPICE Model for RRAMs Including RTN. In Proceedings of the 2020 XXXV Conference on Design of Circuits and Integrated Systems (DCIS), Segovia, Spain, 18–20 November 2020; pp. 1–6.
58. Yu, S.; Gao, B.; Fang, Z.; Yu, H.; Kang, J.; Wong, H.-P. A Neuromorphic Visual System Using RRAM Synaptic Devices with Sub-pJ Energy and Tolerance to Variability: Experimental Characterization and Large-Scale Modeling. In Proceedings of the 2012 International Electron Devices Meeting, San Francisco, CA, USA, 10–13 December 2012.
59. Jiang, Z.; Yu, S.; Wu, Y.; Engel, J.H.; Guan, X.; Wong, H.-P. Verilog-A Compact Model for Oxide-Based Resistive Random Access Memory (RRAM). In Proceedings of the 2014 International Conference on Simulation of Semiconductor Processes and Devices (SISPAD), Yokohama, Japan, 9–11 September 2014; pp. 41–44.
60. Li, H.; Jiang, Z.; Huang, P.; Wu, Y.; Chen, H.-; Gao, B.; Liu, X.Y.; Kang, J.F.; Wong, H.-P. Variation-Aware, Reliability-Emphasized Design and Optimization of RRAM Using SPICE Model. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2015; pp. 1425–1430.
61. LeCun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-Based Learning Applied to Document Recognition. Proc. IEEE **1998**, 86, 2278–2324.

**Figure 1.** (**a**) Circuit implementing the elementary IMPLY logic gate. (**b**) Driving voltage scheme implementing the P IMPLY Q operation. (**c**) IMPLY operation truth table highlighting the contrasting requirements on V_{SET} and V_{COND} for reliable gate functionality; Q’ represents the state of Q after the IMPLY operation execution. (**d**) Driving voltage scheme implementing the FALSE Q operation.
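The truth table in Figure 1c is that of material implication, Q’ = (NOT P) OR Q, with the result overwriting the state of Q. A minimal functional sketch of the two primitives (logic level only, not the device-level voltage scheme):

```python
# Functional model of the IMPLY/FALSE primitives: bits are mapped to RRAM
# resistive states (0 = HRS, 1 = LRS), and IMPLY writes its result into Q.
def imply(p, q):
    """Material implication: Q' = (NOT P) OR Q; P is left unchanged."""
    return int((not p) or q)

def false(q):
    """FALSE: unconditionally reset a device to HRS (logic 0)."""
    return 0

# Enumerate the truth table of Figure 1c: (P, Q) -> Q'
table = {(p, q): imply(p, q) for p in (0, 1) for q in (0, 1)}
print(table)  # {(0, 0): 1, (0, 1): 1, (1, 0): 0, (1, 1): 1}
```

Since IMPLY and FALSE form a complete logic group, NOT a is `imply(a, 0)` on a freshly reset work device, and any Boolean function follows by composition.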

**Figure 2.** (**a**) Circuit implementation of the elementary IMPLY gate in the SIMPLY framework. (**b**) Driving voltage scheme used to implement the P IMPLY Q operation. The control logic pulses V_{SET} on Q only when the comparator detects P = Q = 0 (green lines), while the drivers are kept in high impedance (Hi-Z) in all other cases (dashed black lines). (**c**) Driving voltage scheme used to implement the FALSE Q operation in the SIMPLY framework. The comparator detects when Q = 1 (black lines) and pulses V_{FALSE} accordingly. (**d**) Voltage sense amplifier implemented and simulated with a 45 nm technology [30]. All FETs have minimum size (i.e., L = 50 nm, W = 90 nm).

**Figure 3.** Distribution of the comparator input voltage (V_{N}) for increasing V_{READ}, considering the TiN/HfO_{x}/AlO_{x}/Pt RRAM devices from [26], with a suboptimal R_{G} in (**a**) and with the optimal R_{G,opt}, which maximizes the read margin (RM), in (**b**). The distributions for P = Q = 0 (grey bands) and P ≠ Q (green bands) are reported together with the read margins (RM, blue arrows) and the associated comparator threshold voltages (V_{TH}, violet line). The effects of cycle-to-cycle variability, device-to-device variability, and random telegraph noise (RTN) are considered by repeating the simulations 50 times. The extreme points of the distributions are indicated with black whiskers, and outliers due to RTN with red crosses.
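The R_{G} optimization behind Figure 3 can be sketched with a simple voltage-divider model: V_{N} is the voltage across R_{G} when the two addressed cells are read in parallel, and the read margin separates the all-HRS case from any case containing an LRS device. The resistance and voltage values below are illustrative placeholders, not the calibrated parameters from [26]:

```python
# Voltage-divider sketch of the read margin (RM) in Figure 3. V_N is the
# voltage across R_G when cells P and Q are read in parallel.
R_HRS, R_LRS = 1e6, 1e4   # assumed nominal resistances (ohms)
V_READ = 0.2              # assumed read voltage (V)

def v_n(r_p, r_q, r_g):
    r_cells = 1.0 / (1.0 / r_p + 1.0 / r_q)   # P and Q read in parallel
    return V_READ * r_g / (r_g + r_cells)

def read_margin(r_g):
    v_00 = v_n(R_HRS, R_HRS, r_g)   # P = Q = 0: both cells in HRS, lowest V_N
    v_01 = v_n(R_HRS, R_LRS, r_g)   # P != Q: one LRS cell pulls V_N up
    return v_01 - v_00

# Coarse logarithmic sweep for the R_G that maximizes RM (the R_Gopt of the figure)
candidates = [10 ** (e / 10.0) for e in range(30, 70)]   # ~1 kohm .. ~8 Mohm
r_g_opt = max(candidates, key=read_margin)
```

Too small an R_{G} compresses all V_{N} values toward 0 and too large an R_{G} compresses them toward V_{READ}, so the RM peaks at an intermediate R_{G,opt}, as in Figure 3b.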

**Figure 4.** SIMPLY implementation on a 1T1R crossbar array. FET devices are used to implement R_{G}, to select specific rows, and to connect adjacent columns.

**Figure 5.** (**a**) Breakdown of the percentage of computing steps performed in a binarized neural network (BNN) layer with 1000 input activations. (**b**) Example of the SIMPLY implementation of the multiply-and-accumulate (MAC) operation for a single neuron. Devices storing the neural network weights (W), their complements ($\overline{W}$), the results of the bitwise XNOR (O), the result of the accumulation (S), the carry-out (C_{0}), and supporting intermediate computations (M_{1}, M_{2}, M_{3}) are reported. (**c**) Sequence of IMPLY and FALSE operations implementing a two-input XNOR [19]. (**d**) SIMPLY-based half-adder (HA) implementation. (**e**) SIMPLY-based accumulator implementing the popcount operation.
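Because IMPLY and FALSE form a complete logic group, the two-input XNOR of Figure 5c can be expressed entirely in those primitives. The sketch below verifies the logical decomposition only; it is not the exact device-level operation sequence of [19]:

```python
# Building XNOR from IMPLY/FALSE alone (functional sketch).
def IMPLY(p, q):              # material implication primitive
    return int((not p) or q)

FALSE = 0                     # a freshly reset (HRS) work device holds logic 0

def NOT(a):
    return IMPLY(a, FALSE)    # a IMPLY 0  ==  NOT a

def OR(a, b):
    return IMPLY(NOT(a), b)   # (NOT a) IMPLY b  ==  a OR b

def XNOR(a, b):
    # a XNOR b == (a IMPLY b) AND (b IMPLY a)
    #          == NOT((a IMPLY b) IMPLY NOT(b IMPLY a))
    return NOT(IMPLY(IMPLY(a, b), NOT(IMPLY(b, a))))

print([XNOR(a, b) for a in (0, 1) for b in (0, 1)])  # [1, 0, 0, 1]
```

Each NOT above consumes one FALSE (to prepare a logic-0 work device) plus one IMPLY, which is why the step count in Figure 5a is dominated by these elementary operations.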

**Figure 6.** (**a**) Example of an in-memory computing architecture based on a 1T1R crossbar array enabling analog BNN vector-matrix multiplication acceleration using voltage sense amplifiers (VSAs). (**b**) Implemented binary multiplication between the input activation and the neuron weight. A pair of 1T1R devices with complementary resistive states is used to map each neuron weight. The input activation is realized with two complementary signals driving the two selector transistors of each weight.
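With the standard BNN encoding of bipolar values {−1, +1} into bits {0, 1}, the binary multiplication of Figure 6b reduces exactly to an XNOR, which is what the complementary 1T1R weight pair implements. A quick check of this equivalence:

```python
# In BNNs, activations and weights take values in {-1, +1}. With the usual
# encoding b = (v + 1) / 2, their product maps onto the XNOR of the encodings.
def encode(v):                 # bipolar {-1, +1} -> binary {0, 1}
    return (v + 1) // 2

def xnor(a, b):
    return int(a == b)

ok = all(
    encode(av * wv) == xnor(encode(av), encode(wv))
    for av in (-1, 1) for wv in (-1, 1)
)
print(ok)  # True
```

This is why the MAC of a binarized neuron reduces to bitwise XNORs followed by a popcount, both in the SIMPLY flow of Figure 5 and in the analog scheme of Figure 6.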

**Figure 7.** (**a**) Qualitative trends of the voltage at the input of the comparator for different R_{PD} values at an increasing number of +1 product results, with V_{READ} = 0.2 V. The comparator commutes when the number of positive products is greater than or equal to 8; thus, the voltage threshold, the trend, and the slope change with R_{PD}. (**b**) Optimal R_{PD} values at an increasing number of devices read in parallel, for a fixed threshold voltage V_{TH} (i.e., the same used for SIMPLY). Increasing the computation parallelism requires lowering R_{PD}, thus leading to a tradeoff between area (i.e., a lower R_{PD} requires a larger FET area) and parallelism. In all cases, the nominal R_{HRS} and R_{LRS} of the TiN/HfO_{x}/AlO_{x}/Pt RRAM technology from the literature [26] are considered.
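The R_{PD} sizing of Figure 7b follows the same voltage-divider reasoning: with N cells read in parallel and k of them in LRS (a +1 product), R_{PD} is chosen so that V_{N} crosses the fixed V_{TH} exactly at the popcount threshold. All numerical values below are illustrative placeholders, not the calibrated parameters from [26]:

```python
# Sketch of the R_PD sizing in Figure 7: V_N is a voltage divider between
# R_PD and N parallel 1T1R cells, k in LRS (+1 products) and N - k in HRS.
R_HRS, R_LRS = 1e6, 1e4              # assumed nominal resistances (ohms)
V_READ, V_TH, K_FLIP = 0.2, 0.1, 8   # comparator must flip at k >= K_FLIP

def v_n(k, n, r_pd):
    g_cells = k / R_LRS + (n - k) / R_HRS   # column conductance
    return V_READ * r_pd / (r_pd + 1.0 / g_cells)

def r_pd_opt(n):
    # V_N is monotonic in R_PD, so bisect for v_n(K_FLIP, n, R_PD) == V_TH.
    lo, hi = 1.0, 1e6
    for _ in range(60):
        mid = (lo + hi) / 2.0
        if v_n(K_FLIP, n, mid) > V_TH:
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0
```

Growing N adds HRS conductance to the column, so the same V_{TH} crossing at k = 8 requires a smaller R_{PD}: this is the area-versus-parallelism tradeoff described in the caption.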

**Figure 8.** Proposed architecture enabling the coexistence of the SIMPLY and BNN analog vector-matrix multiplication computing paradigms on the same 1T1R crossbar array.

**Figure 9.** Sketch of an RRAM device in (**a**) high-resistive state (HRS) and in (**b**) low-resistive state (LRS), as represented in the compact model. (**c**) Experimental (square symbols) and simulated (dotted line) I–V characteristics of the RRAM technology from [26]. (**d**) Experimental (boxes) and simulated (lines) responses to 50 ns reset voltage pulses at different reset voltages (V_{RESET}); data from [26].

**Table 1.** Benchmark of the performance of the proposed architecture on a classification task of black-and-white 20 × 20 pixel images from the MNIST dataset, performed with a shallow multilayer perceptron neural network with 1 hidden layer of 1000 neurons and 10 output neurons.

| Implementation ^{1} | Average Energy | Latency | Average EDP | EDP Improvement |
|---|---|---|---|---|
| Embedded system [42] | 5.37 mJ | 17.35 ms | 9.3 × 10^{−5} Js | 1 |
| SIMPLY parallel [19] ^{2} | 11.4 µJ | 663 µs | 7.6 × 10^{−9} Js | 1.2 × 10^{4} |
| SIMPLY parallel w/ R_{G,opt} | 78.9 µJ | 663 µs | 5.2 × 10^{−9} Js | 1.8 × 10^{4} |
| This work w/ R_{G,opt} | 231 nJ | 31.6 µs | 7.3 × 10^{−12} Js | 1.3 × 10^{7} |

^{1} Estimates were determined considering the RRAM technology from [26] and the ideal case where all the network parameters can be stored in a single crossbar. Only the worst-case estimates for RRAM variability are reported. The energy estimates do not include the decoder and driver energy overhead. ^{2} In [19], a suboptimal R_{G} = 10 kΩ was used.
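The EDP column of Table 1 is the product of the energy and latency columns, and the improvement factor is the baseline EDP divided by the row EDP. A quick check on three rows:

```python
# Energy-delay product (EDP) check for Table 1: EDP = energy x latency,
# improvement = baseline EDP / row EDP.
baseline = 5.37e-3 * 17.35e-3      # embedded system: 5.37 mJ x 17.35 ms
simply   = 11.4e-6 * 663e-6        # SIMPLY parallel: 11.4 uJ x 663 us
this     = 231e-9 * 31.6e-6        # this work: 231 nJ x 31.6 us

improvement_simply = baseline / simply   # ~1.2e4
improvement_this   = baseline / this     # ~1.3e7
```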

**Table 2.** Performance estimates of the IMPLY operation implemented on the SIMPLY architecture using R_{G,opt}.

| P | Q | Energy ^{1} (min–avg–max) |
|---|---|---|
| 0 | 0 | 139 – 429 – 509 fJ |
| 0 | 1 | 6.18 – 6.183 – 6.185 fJ |
| 1 | 0 | 6.18 – 6.183 – 6.185 fJ |
| 1 | 1 | 6.184 – 6.184 – 6.185 fJ |

^{1} Device-to-device (D2D) and cycle-to-cycle (C2C) variability are included by repeating the circuit simulations with different seeds for the random noise sources (50 trials).

**Table 3.** Performance estimates of the FALSE operation implemented on the SIMPLY architecture using R_{G,opt}.

| Input Q | Energy ^{1} (min–avg–max) |
|---|---|
| 0 | 9.6 – 11.2 – 12 fJ |
| 1 | 100 – 145 – 190 fJ |

^{1} D2D and C2C variability are included by repeating the circuit simulations with different seeds for the random noise sources (50 trials).


© 2021 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Zanotti, T.; Puglisi, F.M.; Pavan, P.
Energy-Efficient Non-Von Neumann Computing Architecture Supporting Multiple Computing Paradigms for Logic and Binarized Neural Networks. *J. Low Power Electron. Appl.* **2021**, *11*, 29.
https://doi.org/10.3390/jlpea11030029
