Booth Encoded Bit-Serial Multiply-Accumulate Units with Improved Area and Energy Efficiencies

Cheng, Xiaoshu; Wang, Yiwen; Liu, Jiazhi; Ding, Weiran; Lou, Hongfei; Li, Ping

doi:10.3390/electronics12102177


by Xiaoshu Cheng ¹, Yiwen Wang ¹,*, Jiazhi Liu ¹, Weiran Ding ¹, Hongfei Lou ¹ and Ping Li ¹,²

¹ School of Integrated Circuit Science and Engineering, University of Electronic Science and Technology of China, Chengdu 610054, China
² State Key Laboratory of Electronic Thin Films and Integrated Devices, University of Electronic Science and Technology of China, Chengdu 610054, China
* Author to whom correspondence should be addressed.

Electronics 2023, 12(10), 2177; https://doi.org/10.3390/electronics12102177

Submission received: 25 March 2023 / Revised: 30 April 2023 / Accepted: 5 May 2023 / Published: 10 May 2023

(This article belongs to the Section Artificial Intelligence Circuits and Systems (AICAS))


Abstract:

Bit-serial multiply-accumulate units (MACs) play a crucial role in various hardware accelerator applications, including deep learning, image processing, and signal processing. Despite the advantages of bit-serial MACs, such as a small footprint, full hardware utilization, and high frequency, their serial nature can lead to high latency and potentially compromised performance. This study investigates the potential of bit-serial solutions by applying Booth encoding to bit-serial multipliers within MACs to enhance area and power efficiencies. We present two types of bit-serial MACs based on radix-2 and radix-4 Booth encoding multipliers, respectively. Their performance is assessed through simulations and synthesis results, demonstrating the benefits of the proposed approach. The radix-4 Booth bit-serial MAC improves power and area efficiencies compared to the original bit-serial MAC. Operating at TSMC 90 nm and 150 MHz, our design exhibits a remarkable 96.39% reduction in area-power-product (APP). Moreover, the prototype verification on a Xilinx Kintex-7 FPGA proved successful. The proposed solution offers significant advantages in energy efficiency, area reduction, and APP, making it a promising candidate for next-generation hardware accelerators in offline inference, low-power devices, and other applications.

Keywords: hardware accelerator; Booth encoding; bit-serial; multiplier; multiply-accumulate unit (MAC)

1. Introduction

Bit-serial multiply-accumulate units (MACs) are essential building blocks for various hardware accelerator applications, including deep learning, image processing, and signal processing, due to their small footprint, full hardware utilization, and high frequency [1]. In many applications, power consumption and chip area are critical factors that influence the overall system performance. Thus, many types of MACs have been introduced to reduce the power consumption and chip area.

Most MACs have been implemented using bit-parallel technology, which transfers several bits of a data word at a time. Although many studies have focused on improving the architecture of MACs in various hardware accelerators, bit-serial solutions have received little attention. Bit-serial systems, which have a longer history, transfer data one bit at a time [2]. Bit-serial MACs offer potential advantages such as a smaller area footprint and lower power consumption, but their serial nature can result in high latency and potentially compromised performance.

By exploring bit-serial MACs, we could potentially uncover innovative design approaches that can further optimize hardware accelerators for deep learning applications. The focus on bit-serial solutions may help reveal new techniques to improve energy efficiency and resource utilization, while maintaining or even enhancing overall system performance. This underscores the importance of broadening the scope of investigation beyond conventional bit-parallel MAC architectures, to fully understand and exploit the potential benefits of bit-serial MACs in the context of hardware accelerator applications.

MAC units consist of multipliers and adders. As shown in Table 1, multipliers tend to have higher energy and area costs compared to adders. This difference in resource consumption can significantly impact the overall efficiency and performance of MAC-based hardware accelerator systems. By focusing on the improvement of multiplier efficiency, the overall performance of MAC-based systems can be significantly enhanced.

In this study, we propose the application of Booth encoding to bit-serial multipliers within MACs as a means to improve area and power efficiencies. Booth encoding is a well-known technique for reducing the number of partial products in a multiplication operation, leading to significant reductions in power consumption and chip area. In conjunction with the double-precision bit-serial adder, we present two types of bit-serial MACs based on radix-2 and radix-4 Booth encoding multipliers, respectively. Although these proposed architectures do not directly address the high latency issue, they do offer substantial improvements in area and power efficiencies, making them promising approaches for optimizing bit-serial MAC designs in some aspects [4].

The main contributions of this paper are as follows:

Proposing two bit-serial MAC designs based on radix-2 and radix-4 Booth encoding multipliers to improve area and power efficiencies.
Evaluating the performance of the proposed multiplier through simulations and synthesis results and comparing it with the original bit-serial multiplier and state-of-the-art multipliers. Demonstrating a 96.39% reduction in area-power-product (APP) for the proposed design at TSMC 90 nm process and 150 MHz.
Verifying the correctness of a general bit-serial MAC using our radix-4 Booth encoding multiplier, via our implementation of an FPGA prototype, demonstrating that the proposed bit-serial MAC provides correct computation results, and that the prototype verification is successful.

The remainder of this paper is organized as follows: Section 2 introduces some prior studies relating to our work, Section 3 introduces the basics of bit-serial representation and Booth encoding, Section 4 presents the two new Booth bit-serial multipliers and introduces the bit-serial adders briefly, Section 5 presents the performance evaluation, and Section 6 concludes and discusses this paper.

2. Related Work

MACs consume a significant number of logical resources, increasing the occupied area and extending the critical path. Current optimization techniques include pipelining, CSD encoding, Booth encoding, Wallace tree compression, and more. By improving the efficiency of MAC units and transforming data structures, computational demands can be reduced. For instance, Garland et al. [5] employed weight sharing in data compression, indexing, and access, as well as parallel accumulation and MAC restructuring, to decrease gate counts and power consumption, leading to fewer logic operations. In the computation process of neural networks, numerous matrix multiplications and convolution operations are involved, with the primary computing unit being the MAC. Parashar et al. [6] introduced the Sparse Convolutional Neural Network (SCNN) acceleration architecture, where the gate MAC takes advantage of zero-value weights generated during network pruning in training and zero-value activations produced by regular ReLU operators. This approach enhances performance and energy efficiency while eliminating unnecessary data transfers and lowering storage requirements. To theoretically boost the computational rate of convolutional neural network (CNN) accelerators, Lee et al. [7] developed a novel method called double MAC, which doubles the computational throughput of CNN layers by packaging two MAC operations into a single digital signal processing (DSP) block. Furthermore, Xie et al. [8] optimized network throughput, energy consumption, and execution time by simplifying matrix multiplication using Fast Fourier Transform (FFT) in convolution, thereby reducing the complexity of convolution layers. Kang et al. [9] proposed a mixed-precision MAC unit structure that supports both low-precision and high-precision multiplication modes, reducing the cost of multiplication operations and energy consumption compared to traditional MAC structures.

Using bit-serial approaches, Judd et al. [10] proposed Serial Inner Product (SIP) units within a hardware accelerator called Stripes, which has an execution time that scales almost proportionally to the length of the numerical representation used. Stripes leverages bit-serial computation units and inherent parallelism to enhance the execution time and energy efficiency of convolution layers without compromising accuracy, facilitating dynamic trade-offs between accuracy, performance, and energy. Utilizing the lookup table (LUT)-based bit-serial processing element (LBPE), UNPU [11] supports fully variable weight bit precision ranging from 1 to 16 bits, achieving reduced energy consumption compared to the conventional fixed-point MAC array. Hsu et al. [12] introduced an energy-aware bit-serial streaming CNN accelerator. Their bit-serial processing elements (PE) are designed to use fewer bits in weights, reducing computation and external memory access. In [13], an efficient hardware architecture combining bit-serial processing and systolic arrays to accelerate CNN convolution operations is presented. The implementation of pipelined multipliers based on bit-serial computing decreases the logic complexity of computation units and the interconnect complexity of the circuit. Utilizing systolic arrays minimizes data access demands between computation and storage units.

In the MAC circuit, the multiplier consists of three primary components: partial product generation, compression tree, and final result summation. The multiplier’s bottleneck is often the simplification of partial products. Nguyen et al. [14] doubled the computation by constructing virtual SIMD lanes within the DSP, ensuring that data on the channels would not clash. The design concept enables unsigned multiplication with common operands using a single multiplier. Since MACs are frequently implemented in a pipelined form, the number of flip-flops, area, and power consumption all increase. As a result, Ryu et al. [15] proposed a new pipelined technique that selectively eliminates some flip-flops, making the system more energy-efficient and compact. The Advanced Precision Scalable MAC [16] utilizes a high-precision multiplier configured in parallel low-precision mode, rather than individually tuned low-precision multipliers and precision-variable accumulation schemes. A detailed comparison of two designs based on Sum Separate (SS) and Sum Together (ST) is conducted, taking throughput, energy efficiency, and area efficiency into consideration. Since the precision of input and output neural and weight parameters in CNN is low, [9] proposed a stacked and combinable layer MAC with mode 0 for low-precision multiplication and mode 1 for high-precision multiplication to reduce power consumption and cost.

3. Basics

There is still room for improvement in the design of MACs, particularly in the area and power efficiency of bit-serial multipliers. By applying Booth encoding to bit-serial multipliers within MACs, this study aims to optimize area and power efficiencies by reducing the number of partial products in the multiplication operation.

3.1. Bit-Serial Representation

Fixed-point binary signed data are represented in two’s complement. Two types of bit-serial data can be defined: single precision and double precision. For data of length P bits, $X = (x_{P-1} \cdot x_{P-2} \dots x_{1} x_{0})$, where the dot marks the binary point, single-precision data are $X_{single} = -x_{P-1} + \sum_{i=0}^{P-2} x_{i} 2^{i-P+1}$, and double-precision data are $X_{double} = -x_{2P-1} + \sum_{i=0}^{2P-2} x_{i} 2^{i-2P+1}$. The data length is used to express the dynamic range and precision of binary data. Figure 1 shows the composition and timing of the two types of data with a data length P = 4. The binary point lies between the most significant bit (MSB) and the next lower bit. The least significant bit (LSB) is transmitted first. Single-precision data comprise a sign and (P − 1) other bits of data; double-precision data comprise a sign and (2P − 1) other bits of data. The bit-serial operator produces not only the output bits but also the control bits of the output data, which are called the head bits [17].
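The single- and double-precision formulas above can be checked with a short software sketch. It is only an illustration of the number format, not part of the hardware design: the function names `serial_to_value` and `value_to_serial` are hypothetical, and the same decoder covers both cases because the double-precision formula is the single-precision one with P replaced by 2P.

```python
def serial_to_value(bits_lsb_first):
    """Decode a two's-complement fixed-point word received LSB first.

    The binary point sits just below the sign bit, so an n-bit word
    represents a value in [-1, 1). Use n = P for single precision and
    n = 2P for double precision.
    """
    n = len(bits_lsb_first)
    sign = bits_lsb_first[-1]              # the MSB (sign) arrives last
    value = -float(sign)                   # sign bit has weight -1
    for i, bit in enumerate(bits_lsb_first[:-1]):
        value += bit * 2.0 ** (i - n + 1)  # x_i has weight 2^(i - n + 1)
    return value


def value_to_serial(value, n):
    """Encode a value in [-1, 1) as n bits, LSB first (hypothetical helper)."""
    scaled = int(round(value * 2 ** (n - 1))) & ((1 << n) - 1)
    return [(scaled >> i) & 1 for i in range(n)]


if __name__ == "__main__":
    # P = 4, as in Figure 1: the bit pattern 1.100 read LSB first.
    print(serial_to_value([0, 0, 1, 1]))
    print(value_to_serial(0.875, 4))
```

With P = 4, the word 1.100 decodes to −1 + 0.5 = −0.5, which follows directly from the single-precision formula.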

3.2. Booth Encoding Basics

In binary computing, increasing the data length in a multiplier increases the number of partial products, which requires more adders. This problem can be mitigated by Booth encoding. When combined with bit-serial multipliers, Booth encoding offers a promising way to build on the inherent advantages of bit-serial MACs. It reduces the number of partial products, which in turn decreases the hardware resources required, such as the number of inner cells and adders. If standard binary multiplication is regarded as radix-1 multiplication, then Booth encoding converts basic radix-1 multiplication to radix-2, radix-4, or radix-8 multiplication to reduce the valid length of the multiplicand. Table 2 presents radix-2 Booth encoding, where x[i] and x[i − 1] are the MSB and LSB, respectively, of the 2-bit encoding group taken from the multiplicand. Table 3 presents radix-4 Booth encoding, in which x[i + 1] and x[i − 1] are the MSB and LSB, respectively, of the 3-bit encoding group of the binary input x[i] [18]. The Booth code −2Y is interpreted as −(2Y), which means that Y is shifted one bit to the left and then negated (in two’s complement).
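The recoding rules described above can be illustrated with a small behavioural sketch. It uses the textbook Booth digit formulas (radix-2: x[i − 1] − x[i]; radix-4: −2·x[i + 1] + x[i] + x[i − 1]) and is not the authors' gate-level circuit; the function names and the `width` parameter are assumptions made for this example.

```python
def to_bits(y, width):
    """Two's-complement bits of the integer y, LSB first."""
    return [(y >> i) & 1 for i in range(width)]


def booth_radix2(y, width):
    """Radix-2 Booth digits B_i = y[i-1] - y[i], with y[-1] taken as 0."""
    digits, prev = [], 0
    for bit in to_bits(y, width):
        digits.append(prev - bit)      # each digit is -1, 0 or +1
        prev = bit
    return digits                      # digit i has weight 2**i


def booth_radix4(y, width):
    """Radix-4 Booth digits from overlapping 3-bit groups."""
    if width % 2:
        raise ValueError("radix-4 recoding needs an even width")
    bits = to_bits(y, width)
    digits, prev = [], 0
    for k in range(0, width, 2):
        lo, hi = bits[k], bits[k + 1]
        digits.append(-2 * hi + lo + prev)   # each digit is in {-2, ..., +2}
        prev = hi
    return digits                      # digit k has weight 4**k


def booth_multiply(x, y, width, radix=4):
    """Sum the Booth partial products digit * x * radix**position."""
    if radix == 4:
        return sum(d * x * 4 ** k for k, d in enumerate(booth_radix4(y, width)))
    return sum(d * x * 2 ** i for i, d in enumerate(booth_radix2(y, width)))


if __name__ == "__main__":
    print(booth_radix2(6, 4), booth_radix4(6, 4))
    print(booth_multiply(-7, 6, 4, radix=4))
```

For a 16-bit operand, radix-2 recoding yields 16 digits while radix-4 yields 8, which is the halving of partial products that the radix-4 design relies on.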

4. Design of the Booth Bit-Serial Multipliers

The previous multipliers were implemented with a bit-parallel structure. In this study, we used the bit-serial structure. Compared with bit-parallel multipliers, bit-serial multipliers are simpler and perform fewer operations per clock cycle. Moreover, bit-parallel multipliers take up more area to perform the same operations of MACs and multibit parallel computing. Bit-serial multipliers are slower, but this can be partly compensated for by Booth encoding. In this section, we present the proposed radix-2 and radix-4 Booth bit-serial multipliers. In general, both multipliers comprise an LSB cell, an MSB cell, and several inner cells. A longer multiplier increases the number of inner cells, but the lengths of the LSB and MSB cells remain unchanged.

We used a gate-level manual design approach, creating our own schematics by hand to better control the design process and ensure accurate implementation of radix-2/4 Booth encoding and bit-serial format. Specifically, we designed logic gates for the corresponding process and directly wrote code for each gate using a gate-level netlist format. We then applied constraints to ensure that the synthesized netlist matched our design schematics. Since our main focus was on the circuit structure, the synthesis tool carried out some minor optimizations, with logic gate size and other conditions being managed by the tool itself. Using synthesis tools to generate gate-level architectures might yield more cost-effective circuits, but this approach is generally intended for behavioral-level description methods.

4.1. Radix-2 Booth Bit-Serial Multiplier

Adding radix-2 Booth encoding requires modifying a bit-serial multiplier in three ways: an encoding circuit should be added, addition and subtraction circuits should be selected, and the timing problem should be solved. Thus, the radix-2 Booth bit-serial multiplier consists of the input cell, output cell, and inner cells. The inner cells have four parts: multiplication, addition and subtraction, shift judgment, and shift. Figure 2a shows the calculation diagram of the radix-2 Booth bit-serial multiplier. It is similar to the original bit-serial multiplier, but the partial products are produced differently. In the radix-2 Booth bit-serial multiplier, y0, y1, y2, and y3 are multiplied by X after being converted to B0, B1, B2, and B3 through the encoding circuit. The multiplication results have sign bits, so addition or subtraction needs to be chosen. To minimize the circuits and errors, the result is fed into the addition and subtraction circuit, and the final choice of addition or subtraction is made by regarding y as the encoding data, B0 as 0, and B1 as −1. Figure 2b shows the calculation timing of the radix-2 Booth bit-serial multiplier.

Figure 3 shows the radix-2 Booth bit-serial multiplier. As shown in Figure 3a, the subtraction circuit should be used to process –Y in the Booth code while either the subtraction or addition circuit can be used to process 0. To reduce the area and power consumption, we only utilize the subtraction circuit to process 0. Figure 3b,c show that the addition and subtraction circuits and the D flip-flop of the right shift circuit are connected back to the multiplexer (MUX). The radix-2 Booth bit-serial multiplier mainly differs from the original bit-serial multiplier in terms of the calculation approach toward sign bits. The radix-2 Booth bit-serial multiplier does not use an OR gate to calculate sign bits but instead utilizes the nature of the addition and subtraction circuits and the interaction between the multiplier and right shift circuit connected back to the MUX to produce sign bits.

There are two calculation cases. In the first case, the head bit h_xin appears only once (i.e., the input remains the same). Then, sign bits are calculated by the input and the addition and subtraction circuits. The input comprises the output of Cell [0], X, and Y. As the output high bit of Cell [0], sout_h also includes the part of the D flip-flop connected to the MUX, where the sign bits are transferred many times. X is fixed as the high bit from the input circuit. Y is locked by registers, and the results are the three inputs, which remain unchanged. Correct sign bits are guaranteed by the calculation of the addition and subtraction circuits. In the second case, the head bit h_xin appears more than once (i.e., the input changes). Then, sign bits are calculated by the right shift circuit connected back to the MUX. After exhaustive enumeration, h_xin becomes 1, which causes the MUX to select 1 (i.e., to select the sign bits stored in the D flip-flop). The sign bits are transmitted to ensure that the sign bits in the next cell are also properly calculated. This part is consistent with the right shift circuit of Cell [0].

4.2. Radix-4 Booth Bit-Serial Multiplier

Adding radix-4 Booth encoding increases the encoding depth even further. The bit-serial multiplier needs to be modified in three ways: the encoding circuit should be modified, the shift judgment circuit should be added and modified, and the timing problem should be solved. The number of cells is half the length P of the multiplier, which can only be an even number because of the nature of radix-4 Booth encoding. Figure 4a shows the calculation diagram of the radix-4 Booth bit-serial multiplier. It is similar to the radix-2 Booth bit-serial multiplier, but it produces partial products differently. Both the sign bits and radix-4 Booth encoding of +2Y and −2Y appear in the multiplication results, which need the addition and subtraction circuits and a new shift circuit to ensure normal function. y is regarded as the encoding data, B0 is −2, and B1 is −0. Figure 4b shows the calculation timing of the radix-4 Booth bit-serial multiplier.

Figure 5 shows the structure of the radix-4 Booth bit-serial multiplier. The number of D flip-flops is significantly increased for two main reasons: the change in the relationship among h_xin, yin_h, and xin_h; and the change in the timing relationship between p_in_judge and sin_h or sin_l. Regarding the first relationship among h_xin, yin_h, and xin_h, the D flip-flops in the encoding circuit should have corresponding timing to ensure that xin_h and yin_h are already prepared when h_xin is at a high level. In the radix-2 Booth bit-serial multiplier, only two bits of yin_h are locked, and y needs only one D flip-flop (i.e., y0 and 0 are encoded as B0, y1 and y0 are encoded as B1, and y2 and y1 are encoded as B2). In the circuit, y only shifts 1 bit at a time compared to h_xin. In the radix-4 Booth bit-serial multiplier, three bits of yin_h are locked, and y needs two D flip-flops (i.e., y1, y0, and 0 are encoded as B0; y3, y2, and y1 are encoded as B1; and y5, y4, and y3 are encoded as B2). In the circuit, y shifts 2 bits at a time compared to h_xin. Correspondingly, the numbers of D flip-flops for xin_h and h_xin need to be increased simultaneously. Figure 5a shows that the LSB has two D flip-flops for y and three D flip-flops for xin_h and h_xin. The inner part has three D flip-flops for y and four D flip-flops for h_xin. Regarding the second relationship among sin_h, sin_l, and p_in_judge, the radix-4 Booth encoding of +2Y and −2Y appear in the shift circuit, which means that p_in needs to be shifted 1 bit to the left. Therefore, the D flip-flops and MUXs are utilized for operation, and the modified shift circuit is as shown in Figure 5a–c. sin_h and sin_l misalign with p_in_judge because of the D flip-flops of the shift circuit. Compared with the radix-2 Booth multiplier, the radix-4 Booth LSB multiplier cell has two more D flip-flops for h_xin and two more D flip-flops for p_in_judge so that their timing can be aligned.

4.3. Bit-Serial Adder

As discussed by Isshiki [17], bit-serial adders also include single- and double-precision adders. The single-precision bit-serial adder contains only one full adder, which accumulates bits by reusing the latched carry output and the value 0, controlled by the delayed head bits through a MUX, and feeding them back to the carry input. The double-precision bit-serial adder consists of two full adders, with the default carry input of the second adder being the carry output of the first adder. Since the double-precision adder performs addition operations simultaneously on both the high and low bits of the data, it is twice as fast as the single-precision adder. As our proposal involves two double-precision multipliers, choosing a double-precision adder is a more suitable option in this study.
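The single-precision adder described above can be modelled in a few lines. This is a behavioural sketch under stated assumptions, not the circuit from [17]: the latched carry is a local variable that is cleared at the start of each word (in hardware the head bit would drive this reset through the MUX), and the helper names are hypothetical.

```python
def to_bits(v, width):
    """Two's-complement bits of v, LSB first."""
    return [(v >> i) & 1 for i in range(width)]


def from_bits(bits):
    """Signed integer value of a two's-complement word given LSB first."""
    n = len(bits)
    v = sum(b << i for i, b in enumerate(bits))
    return v - (1 << n) if bits[-1] else v


def bit_serial_add(a_bits, b_bits):
    """Add two equal-length words, LSB first, one bit per clock cycle.

    A single full adder is reused on every cycle; `carry` stands in for
    the latched carry output that is fed back to the carry input.
    """
    carry = 0                                    # reset at the start of a word
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)                # sum bit of the full adder
        carry = (a & b) | (carry & (a ^ b))      # carry latched for next cycle
    return out                                   # final carry is discarded


if __name__ == "__main__":
    total = bit_serial_add(to_bits(5, 8), to_bits(-3, 8))
    print(from_bits(total))
```

Because the final carry is dropped, the result wraps modulo 2^n, as it would in a fixed-width two's-complement datapath.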

5. Experiment and Results

Bit-serial multipliers were implemented in Verilog and simulated in Synopsys VCS and Verdi. The TSMC 90 nm process was used, and Design Compiler was used for synthesis to obtain information on the area, power consumption, and timing. Additionally, we used a Xilinx Kintex-7 FPGA (model XC7K325TFFG900-2) for functional verification of a general MAC.

5.1. Simulation of Bit-Serial Multipliers

The multipliers were assumed to have a precision P of 16 bits. The original bit-serial multiplier and radix-2 Booth bit-serial multiplier took P clock cycles to input the multipliers and multiplicands simultaneously and (2P − 1) clock cycles to output the product. The radix-4 Booth bit-serial multiplier took P clock cycles to input the multipliers and multiplicands simultaneously and 1.5P clock cycles to output the product. Note that the multipliers were reset by inputting serial 0 instead of resetting the whole system at once. Figure 6 shows the simulation waveforms; the signals were reset periodically to denote data entry, where the head bits represent the beginning and end of data.

5.2. Simulation of Bit-Serial Adders

Figure 7a shows the addition operation of a single-precision bit-serial adder with a precision of 16 bits. The adder is reset by driving h_xin high. One clock cycle later, the two addends enter the adder and h_out goes high at the same time, which indicates that the result is output serially after 16 clock cycles. Figure 7b shows a double-precision bit-serial adder. In contrast to the single-precision bit-serial adder, the double-precision bit-serial adder has two additional high-bit inputs, in1_h and in2_h, and the carry read signal h_xin_h. For 16-bit addition, the calculation takes eight clock cycles, and the carry of the 8-bit low-bit inputs (in1_l and in2_l) is output. The high bits and the carry read signal enter the high-level adder together to complete the addition operation. The low 8 bits of the next data also enter the double-precision bit-serial adder, which fully loads the adder. The double-precision bit-serial adder takes only eight clock cycles to add two 16-bit data series, which means that the low and high bits are added at the same time. This computing architecture effectively releases the computing capacity of Booth bit-serial multipliers and MACs to increase the overall computing speed.

5.3. Synthesis Results of Bit-Serial Multipliers

Table 4 compares the synthesis results of the three bit-serial multipliers. The area and power product comprise a performance metric known as area-power-product (APP), which can be used as an indicator to measure design efficiency. A smaller APP implies higher design efficiency, meaning that better performance is achieved under given area and power constraints. Operating at TSMC 90 nm, 500 MHz and 16 bits, the radix-4 Booth bit-serial multiplier had the lowest power consumption, shortest latency, and highest energy efficiency. The area was 7.01% larger and the APP was 1.15% higher than that of the original bit-serial multiplier, but the power consumption was 5.48% lower, and the energy efficiency was 36.61% higher. At 8 bits, the radix-4 Booth bit-serial multiplier was 8.68% larger in area, 2.99% lower in power consumption, 33.15% higher in energy efficiency, and 5.43% higher in APP than the original bit-serial multiplier. However, upon implementing these three types of bit-serial multipliers in SMIC 0.18 μm and SMIC 28 nm processes, we found that the radix-4 Booth bit-serial multiplier offers a smaller area, lower power consumption, higher energy efficiency, and lower APP than the original bit-serial multiplier, except for the delay. This could be because the use of Design Compiler to synthesize the circuits resulted in slight differences between the designs due to the variation in technology libraries used to translate, optimize, and map. The radix-4 Booth encoding method, despite the additional encoding circuit added to the original bit-serial multiplier, reduced the total calculation time by half, resulting in an overall reduction in area and power consumption of the multiplying circuit [19]. While the original design has a lower delay, enabling a higher operating frequency, it is not common to set the maximum clock frequency for bit-serial applications. 
At an equal frequency, the radix-4 Booth bit-serial multiplier delivers greater energy efficiency and requires fewer cycles, thus maintaining an overall latency advantage over the original design. The radix-2 Booth bit-serial multiplier performed worse than the other two multipliers in all aspects. Therefore, the radix-4 Booth bit-serial multiplier is more beneficial than the original bit-serial multiplier and the radix-2 Booth bit-serial multiplier. Moreover, we have included data comparisons at a 1.5 GHz frequency, which further illustrate the enhanced energy efficiency of the radix-4 Booth bit-serial multiplier. At this higher frequency, we observed an increase in power consumption and APP, while the area, delay, and energy efficiency remained virtually unchanged. The slight variations in area, delay, and energy efficiency across clock frequencies can be attributed to the synthesis software optimizing our manually constructed gate-level design.
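Because APP is the product of area and power, the relative changes quoted in this subsection can be cross-checked with one multiplication. The short sketch below does only that arithmetic, using the percentages stated in the text; the function name `app_change_percent` is hypothetical, and no values from Table 4 beyond those percentages are assumed.

```python
def app_change_percent(area_change_pct, power_change_pct):
    """Relative APP change implied by relative area and power changes.

    APP = area * power, so the ratios multiply:
    APP_new / APP_old = (1 + dA) * (1 + dP).
    """
    area_ratio = 1 + area_change_pct / 100
    power_ratio = 1 + power_change_pct / 100
    return (area_ratio * power_ratio - 1) * 100


if __name__ == "__main__":
    # 16 bits: area +7.01%, power -5.48%
    print(round(app_change_percent(7.01, -5.48), 2))   # 1.15
    # 8 bits: area +8.68%, power -2.99%
    print(round(app_change_percent(8.68, -2.99), 2))   # 5.43
```

Both results agree with the APP increases of 1.15% and 5.43% reported for the 16-bit and 8-bit radix-4 Booth multipliers.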

The radix-4 Booth bit-serial multiplier is designed to handle signed numbers efficiently and to reduce the number of partial products generated during multiplication. To compare it with state-of-the-art multipliers, we focus on several key aspects: area, power consumption, energy efficiency, and APP. The radix-4 Booth bit-serial multiplier is smaller in area than the other two state-of-the-art multipliers and the traditional multiplier in Table 5, as it requires fewer resources for the same functionality. A larger chip area implies higher manufacturing costs and lower yield rates. Power consumption also affects costs, as higher power consumption may require more expensive cooling solutions or higher operating costs. As an indicator for evaluating the effectiveness of design optimization, the APP has been reduced by 96.39% compared to the state-of-the-art bit-serial multiplier [12], reflecting the high efficiency of our design. However, bit-serial multipliers inherently have a higher delay and are less energy-efficient than bit-parallel multipliers [16], as they process bits one at a time rather than in parallel. Nevertheless, compared to the state-of-the-art bit-serial multiplier, our proposed architecture demonstrates better energy efficiency. While bit-serial multipliers cannot compete with parallel multipliers in terms of throughput, they do possess certain advantages. For instance, the area occupied by our bit-serial multiplier is only about a quarter of that of the traditional parallel multiplier, the power consumption is roughly a third of that of the traditional parallel multiplier, and the maximum operating frequency is approximately three times higher than that of a parallel architecture. In summary, radix-4 Booth bit-serial multipliers offer advantages in terms of area, power consumption, energy efficiency, and APP, but may have a high delay.

5.4. Synthesis Results of Bit-Serial Adders

Table 6 compares the two types of bit-serial adders with different precisions. Although the double-precision bit-serial adder improved the overall computing speed, this came at the cost of hardware performance. The area and power consumption were more than twice those of the single-precision bit-serial adder, and the timing performance was worse as well. Taken in isolation, the single-precision bit-serial adder therefore seems preferable, but employing double-precision bit-serial adders can enhance the performance of MACs. A double-precision MAC with a single-precision adder requires more wait time for computation, which in turn requires more registers and control logic than a double-precision MAC with a double-precision adder. As the two types of Booth bit-serial multipliers we propose are also double-precision, they pair naturally with the double-precision adder.

5.5. MAC Implementation on an FPGA

A general MAC [20] performs two basic operations: multiplication and accumulation. In a MAC unit, the inputs are first multiplied together, and the product is then added to an accumulator. These units are used extensively in various applications, including filtering, matrix operations, and inner product computations. As shown in Figure 8, the circuit structure of a MAC primarily consists of a multiplier, an adder, and a D flip-flop. The multiplier is responsible for taking two input signals and computing their product. The adder receives the product from the multiplier and the accumulated value from the D flip-flop. It performs the addition operation and generates a new accumulated value. The D flip-flop mainly has a data input (D) and a data output (Q). The output from the adder is connected to the data input (D) of the D flip-flop, and the data output (Q) of the D flip-flop is connected back to one input of the adder. This feedback loop enables the MAC unit to perform sequential accumulation of the products generated by the multiplier. The final accumulated result can be retrieved from the data output (Q) of the D flip-flop after the completion of the required multiply-accumulate operations.
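The feedback loop described above (multiplier, adder, and register with Q fed back to the adder) can be summarised in a minimal behavioural model. This is a sketch, not the circuit of Figure 8: the class name `Mac`, the `step` method, and the 32-bit accumulator width are assumptions chosen for illustration, and the multiplier is modelled as an ordinary integer product.

```python
def wrap_signed(value, bits):
    """Wrap an integer into the signed two's-complement range of `bits` bits."""
    mask = (1 << bits) - 1
    value &= mask
    return value - (1 << bits) if value >> (bits - 1) else value


class Mac:
    """Multiplier -> adder -> register, with the register output fed back."""

    def __init__(self, acc_bits=32):
        self.acc_bits = acc_bits   # assumed accumulator width
        self.q = 0                 # register output Q, cleared at start

    def step(self, x, y):
        """One multiply-accumulate operation (one register update)."""
        product = x * y                           # multiplier output
        d = self.q + product                      # adder output drives D
        self.q = wrap_signed(d, self.acc_bits)    # clock edge: Q takes D
        return self.q


if __name__ == "__main__":
    mac = Mac()
    for x, y in [(1, 2), (3, 4), (-5, 6)]:
        mac.step(x, y)
    print(mac.q)   # 2 + 12 - 30 = -16
```

After the last operation, the accumulated inner product is read from Q, matching the description of retrieving the final result from the register output.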

5.5.1. Resource Utilization

In this study, we used a Xilinx Kintex-7 FPGA (model XC7K325TFFG900-2) to implement the 16-bit radix-4 Booth bit-serial MAC. The add_double module serves as a double-precision adder, while the int_output module functions as an integer bit-serial-to-parallel MAC output circuit. The parallel_serial_input module is a parallel-to-serial conversion circuit placed before the multiplier input, and the top_PE module is a 16-bit radix-4 Booth bit-serial multiplier. The resource utilization of the bit-serial MAC after synthesis is shown in Figure 9.

5.5.2. Prototype Verification

To verify the correctness of the bit-serial MAC proposed in this study, we implemented an FPGA prototype. We used software to generate a 16-bit by 16-bit signed parallel multiplier and created a 32-bit register to store the computed values. After each computation, the result was added to the existing value in the register and written back, implementing the MAC operation for the parallel multiplier. We buffered the output of the parallel multiplier so that it could be compared with the output of the radix-4 Booth bit-serial MAC at the same time. If the two results were consistent, LED 1 would light up; if they were inconsistent, LED 2 would light up and wrong_count would decrement by one. The wrong_count signal is a 6-bit number that starts from 6'b111111 and is displayed on LEDs 3–8.
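
The comparison scheme can be mimicked in software. The following is a minimal sketch under stated assumptions: `make_mac` is a hypothetical stand-in for the device under test (it is not the bit-serial design), its `error` argument injects a deliberate fault, the sticky `right` flag simplifies the per-comparison LED 1 signal, and the wrap-around of `wrong_count` at 6 bits is assumed rather than taken from the paper.

```python
import random


def wrap_signed(value: int, bits: int) -> int:
    """Wrap an integer to a signed two's-complement value of the given width."""
    half = 1 << (bits - 1)
    return (value + half) % (1 << bits) - half


def make_mac(error: int = 0):
    """Hypothetical stand-in for the MAC under test; `error` injects a fault."""
    acc = 0

    def mac(a: int, b: int) -> int:
        nonlocal acc
        acc = wrap_signed(acc + a * b, 32)
        return wrap_signed(acc + error, 32)

    return mac


def run_check(dut_mac, n_samples: int, seed: int = 0):
    """Compare the MAC under test with a 32-bit reference accumulator."""
    rng = random.Random(seed)
    ref_acc = 0               # 32-bit register of the reference parallel MAC
    wrong_count = 0b111111    # 6-bit counter, starts at all ones
    right = True              # stays True only if every comparison matches
    for _ in range(n_samples):
        a = rng.randint(-(1 << 15), (1 << 15) - 1)   # 16-bit signed operands
        b = rng.randint(-(1 << 15), (1 << 15) - 1)
        ref_acc = wrap_signed(ref_acc + a * b, 32)
        if dut_mac(a, b) != ref_acc:
            right = False
            wrong_count = (wrong_count - 1) & 0b111111  # assumed 6-bit wrap
    return right, wrong_count


print(run_check(make_mac(), 1000))        # (True, 63)
print(run_check(make_mac(error=1), 3))    # (False, 60)
```

A fault-free run leaves `wrong_count` at its initial value, while each mismatch lowers it by one, which is the behaviour the LEDs are meant to expose on the board.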

Operating at a 150 MHz board-level clock with input data generated by a random number sequence generator, the results after running for several tens of seconds are shown in Figure 10. LED 1 remains lit, while LEDs 2–8 remain off, indicating that the results are consistently accurate. This demonstrates that the bit-serial MAC proposed in this study provides correct computation results and that the prototype verification is successful.

6. Conclusions and Discussion

In this study, we introduced two types of Booth encoding to the original bit-serial multiplier. Overall, the radix-4 Booth bit-serial multiplier is more beneficial than the original bit-serial multiplier. The radix-2 Booth bit-serial multiplier performed worse than the other two bit-serial multipliers in all aspects. Compared with the single-precision adder, the double-precision adder effectively increased the overall computing speed by processing the low and high bits at the same time. It can also effectively complement the multipliers we proposed. With radix-4 Booth encoding, our study achieved significant reductions in chip area and power consumption, making it a suitable option for MAC designs where power efficiency and a smaller area footprint are of higher priority.

However, if minimizing latency and achieving higher throughput are the main concerns, alternative state-of-the-art multipliers such as bit-parallel multipliers, Wallace tree multipliers, or Karatsuba-based multipliers could be more appropriate. It is worth noting that in variable-precision MACs, using the original multiplier or radix-2 Booth bit-serial multiplier may be advantageous over the radix-4 Booth multiplier. Due to the inherent characteristics of the radix-4 Booth bit-serial multiplier, its cell count can only be even. This implies that the precision of a variable-precision MAC based on the radix-4 Booth bit-serial multiplier can only change in even-numbered increments. In contrast, the original multiplier and radix-2 Booth bit-serial multiplier allow for full precision adjustment.

Future work may involve designing MACs with variable precision, a smaller area, lower power consumption, and reduced delay. Moreover, we will continue to optimize MAC arrays based on the bit-serial MAC and bit-serial dataflow to make them more suitable for bit-serial computing. Building on the work presented in this study, we plan to continue applying our bit-serial MACs to specific applications. For example, by constructing a complete CNN accelerator, we aim to explore further innovations and development of processing elements. Additionally, we will investigate the possibility of applying our bit-serial MACs to other application domains, such as natural language processing, computer vision, and IoT devices, to assess their versatility and effectiveness across a wide range of use cases.

Author Contributions

Conceptualization, all authors; methodology, X.C., Y.W. and J.L.; software and validation, W.D. and H.L.; formal analysis, X.C., Y.W. and J.L.; investigation, X.C.; resources, Y.W. and P.L.; data curation, X.C., W.D. and H.L.; writing—original draft preparation, X.C., W.D. and H.L.; writing—review and editing, all authors; visualization, X.C., W.D. and H.L.; supervision, Y.W. and P.L.; project administration, X.C., J.L., W.D. and H.L.; funding acquisition, Y.W. and P.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

This work was supported by the State Key Laboratory of Electronic Thin Films and Integrated Devices.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Tu, F.; Yin, S.; Ouyang, P.; Tang, S.; Liu, L.; Wei, S. Deep Convolutional Neural Network Architecture with Reconfigurable Computation Patterns. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2017, 25, 2220–2233.
2. Hartley, R.; Parhi, K.K. Digit-Serial Computation; Springer: Boston, MA, USA, 1995.
3. Chen, Y.-H.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138.
4. Umuroglu, Y.; Conficconi, D.; Rasnayake, L.; Preusser, T.B.; Själander, M. Optimizing Bit-Serial Matrix Multiplication for Reconfigurable Computing. ACM Trans. Reconfig. Technol. Syst. 2019, 12, 1–24.
5. Garland, J.; Gregg, D. Low Complexity Multiply Accumulate Unit for Weight-Sharing Convolutional Neural Networks. IEEE Comput. Arch. Lett. 2017, 16, 132–135.
6. Parashar, A.; Rhu, M.; Mukkara, A.; Puglielli, A.; Venkatesan, R.; Khailany, B.; Emer, J.; Keckler, S.W.; Dally, W.J. SCNN: An Accelerator for Compressed-Sparse Convolutional Neural Networks. In Proceedings of the 44th Annual International Symposium on Computer Architecture, Toronto, ON, Canada, 24 June 2017; ACM: New York, NY, USA, 2017; pp. 27–40.
7. Lee, S.; Kim, D.; Nguyen, D.; Lee, J. Double MAC on a DSP: Boosting the Performance of Convolutional Neural Networks on FPGAs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2019, 38, 888–897.
8. Xie, B.; Zhang, G.; Shen, Y.; Liu, S.; Ge, Y. Fast FFT-Based Inference in 3D Convolutional Neural Networks. In Innovative Mobile and Internet Services in Ubiquitous Computing; Barolli, L., Xhafa, F., Javaid, N., Enokido, T., Eds.; Advances in Intelligent Systems and Computing; Springer International Publishing: Cham, Switzerland, 2019; Volume 773, pp. 420–431.
9. Kang, J.; Kim, T. PV-MAC: Multiply-and-Accumulate Unit Structure Exploiting Precision Variability in on-Device Convolutional Neural Networks. Integration 2020, 71, 76–85.
10. Judd, P.; Albericio, J.; Hetherington, T.; Aamodt, T.M.; Moshovos, A. Stripes: Bit-Serial Deep Neural Network Computing. IEEE Comput. Arch. Lett. 2016, 16, 80–83.
11. Lee, J.; Kim, C.; Kang, S.; Shin, D.; Kim, S.; Yoo, H.-J. UNPU: An Energy-Efficient Deep Neural Network Accelerator with Fully Variable Weight Bit Precision. IEEE J. Solid-State Circuits 2019, 54, 173–185.
12. Hsu, L.-C.; Chiu, C.-T.; Lin, K.-T.; Chou, H.-H.; Pu, Y.-Y. ESSA: An Energy-Aware Bit-Serial Streaming Deep Convolutional Neural Network Accelerator. J. Syst. Archit. 2020, 111, 101831.
13. Li, L.; Hu, J.; Huang, Q.; Zhou, W. Bit-Serial Systolic Accelerator Design for Convolution Operations in Convolutional Neural Networks. IEICE Electron. Express 2020, 17, 1–6.
14. Nguyen, D.; Kim, D.; Lee, J. Double MAC: Doubling the Performance of Convolutional Neural Networks on Modern FPGAs. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Lausanne, Switzerland, 27–31 March 2017; pp. 890–893.
15. Ryu, S.; Park, N.; Kim, J.-J. Feedforward-Cutset-Free Pipelined Multiply–Accumulate Unit for the Machine Learning Accelerator. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2019, 27, 138–146.
16. Mei, L.; Dandekar, M.; Rodopoulos, D.; Constantin, J.; Debacker, P.; Lauwereins, R.; Verhelst, M. Sub-Word Parallel Precision-Scalable MAC Engines for Efficient Embedded DNN Inference. In Proceedings of the 2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS), Hsinchu City, Taiwan, 18–20 March 2019; pp. 6–10.
17. Isshiki, T. High-Performance Bit-Serial Datapath Implementation for Large-Scale Configurable Systems. Citeseer 1996, 187, 1–181.
18. MacSorley, O. High-Speed Arithmetic in Binary Computers. Proc. IRE 1961, 49, 67–91.
19. Balsara, P.T.; Harper, D.T. Understanding VLSI Bit Serial Multipliers. IEEE Trans. Educ. 1996, 39, 19–28.
20. Ma, Y.; Cao, Y.; Vrudhula, S.; Seo, J. Optimizing the Convolution Operation to Accelerate Deep Neural Networks on FPGA. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2018, 26, 1354–1367.

Figure 1. Composition and timing of (a) single-precision, and (b) double-precision data.

Figure 2. Radix-2 Booth bit-serial multiplier: (a) calculation diagram, and (b) calculation timing.

Figure 3. Radix-2 Booth bit-serial multiplier: (a) LSB multiplier cell (Cell [0]), (b) inner multiplier cell (Cell[i] (i = 1, 2, …, P − 2)), (c) MSB multiplier cell (Cell[P − 1]), and (d) overall structure. The numbers 0, 1, …, P − 1 represent the sequence of cells, and the letter D denotes a D flip-flop.

Figure 4. Radix-4 Booth bit-serial multiplier: (a) calculation diagram, and (b) timing.

Figure 5. Radix-4 Booth bit-serial multiplier: (a) LSB multiplier cell (Cell[0]), (b) inner multiplier cell (Cell[i] (i = 1, 2, …, P/2 − 2)), (c) MSB multiplier cell (Cell[P/2 − 1]), and (d) overall structure. The numbers 0, 1, …, P/2 − 1 represent the sequence of cells, and the letter D denotes a D flip-flop.

Figure 6. Simulation waveforms of (a) original bit-serial multiplier, (b) radix-2 Booth’s bit-serial multiplier, and (c) radix-4 Booth’s bit-serial multiplier. The magenta lines indicate the valid sections of the simulation waveform, while the cyan lines represent the invalid sections. Blue arrows point from the multipliers and multiplicands toward the resulting products.

Figure 7. Simulation waveforms of (a) single-precision, and (b) double-precision bit-serial adders. The magenta lines indicate the valid sections of the simulation waveform, while the cyan lines represent the invalid sections.

Figure 8. Formula representation and circuit structure of a general MAC.

Figure 9. Resource utilization of the 16-bit radix-4 Booth bit-serial MAC.

Figure 10. MAC implementation on a Xilinx Kintex-7 FPGA. LED 1 represents the “right” signal, which lights up when the MAC computation is correct; LED 2 represents the “wrong” signal, which stays off when the MAC computation is correct; and LEDs 3–8 represent the wrong_count signal, turning on according to the value of wrong_count when the MAC computation is incorrect.

Table 1. Rough resource consumption in 45 nm 0.9 V from Eyeriss [3].

Operation	Multiplier Energy (pJ)	Adder Energy (pJ)	Multiplier Area (μm²)	Adder Area (μm²)
8-bit INT ¹	0.2	0.03	282	36
16-bit FP ²	1.1	0.4	1640	1360
32-bit FP ²	3.7	0.9	770	4184

¹ Integer operation. ² Floating-point operation.

Table 2. Radix-2 Booth encoding.

x[i]	x[i − 1]	Booth Code
0	0	0
0	1	Y
1	0	−Y
1	1	0
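
As an illustration of Table 2, the recoding can be expressed as a lookup over adjacent bit pairs. This is a minimal software sketch, not the bit-serial hardware: the function names are hypothetical, the multiplier x is assumed to be a two's-complement integer of width `bits`, and x[−1] is taken as 0, as is usual for Booth recoding.

```python
# Table 2 as a lookup: (x[i], x[i-1]) -> multiple of Y
RADIX2_CODE = {(0, 0): 0, (0, 1): +1, (1, 0): -1, (1, 1): 0}


def booth_radix2_digits(x: int, bits: int) -> list:
    """Return the Booth digits of x, least significant digit first."""
    ux = x & ((1 << bits) - 1)   # two's-complement bit pattern of x
    digits, prev = [], 0         # prev holds x[i-1]; x[-1] is assumed to be 0
    for i in range(bits):
        cur = (ux >> i) & 1
        digits.append(RADIX2_CODE[(cur, prev)])
        prev = cur
    return digits


def booth_radix2_multiply(x: int, y: int, bits: int) -> int:
    """Sum the selected multiples of Y, each weighted by 2**i."""
    return sum(d * y * (1 << i) for i, d in enumerate(booth_radix2_digits(x, bits)))


print(booth_radix2_digits(-3, 4))        # [-1, 1, -1, 0]
print(booth_radix2_multiply(-3, 7, 4))   # -21
```

One digit is produced per multiplier bit, so radix-2 recoding does not reduce the number of partial products; it only changes which multiples of Y are added or subtracted.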

Table 3. Radix-4 Booth encoding.

x[i + 1]	x[i]	x[i − 1]	Booth Code
0	0	0	0
0	0	1	Y
0	1	0	Y
0	1	1	2Y
1	0	0	−2Y
1	0	1	−Y
1	1	0	−Y
1	1	1	0
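
Table 3 can likewise be read as a lookup over overlapping bit triples, taken two bits at a time. The sketch below is a software illustration under the same assumptions as before (hypothetical function names, two's-complement x of width `bits`, x[−1] = 0); it is not a model of the cell structure in Figure 5.

```python
# Table 3 as a lookup: (x[i+1], x[i], x[i-1]) -> multiple of Y
RADIX4_CODE = {
    (0, 0, 0): 0, (0, 0, 1): +1, (0, 1, 0): +1, (0, 1, 1): +2,
    (1, 0, 0): -2, (1, 0, 1): -1, (1, 1, 0): -1, (1, 1, 1): 0,
}


def booth_radix4_digits(x: int, bits: int) -> list:
    """Return the radix-4 Booth digits of x, least significant digit first."""
    if bits % 2:
        raise ValueError("radix-4 recoding consumes two bits per digit; bits must be even")
    ux = x & ((1 << bits) - 1)   # two's-complement bit pattern of x
    digits, prev = [], 0         # prev holds x[i-1]; x[-1] is assumed to be 0
    for i in range(0, bits, 2):
        lo = (ux >> i) & 1
        hi = (ux >> (i + 1)) & 1
        digits.append(RADIX4_CODE[(hi, lo, prev)])
        prev = hi
    return digits


def booth_radix4_multiply(x: int, y: int, bits: int) -> int:
    """Sum the selected multiples of Y, each weighted by 4**k."""
    return sum(d * y * (4 ** k) for k, d in enumerate(booth_radix4_digits(x, bits)))


print(booth_radix4_digits(-3, 4))              # [1, -1]
print(len(booth_radix4_digits(12345, 16)))     # 8
```

A 16-bit multiplier yields eight digits, half as many as the radix-2 recoding, and the check on `bits` reflects the fact that each digit covers two multiplier bits.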

Table 4. Synthesis results for the three types of bit-serial multipliers on different processes and clock frequencies.

Parameters	Original Bit-Serial Multiplier	Radix-2 Booth Bit-Serial Multiplier	Radix-4 Booth Bit-Serial Multiplier	Original Bit-Serial Multiplier	Radix-2 Booth Bit-Serial Multiplier	Radix-4 Booth Bit-Serial Multiplier
Clock (MHz)	500
Process	TSMC 90 nm
Precision (fixed)	16-bit	16-bit	16-bit	8-bit	8-bit	8-bit
Area (μm²)	2444.2	3231.6	2615.6	1162.8	1463.4	1263.7
Power (mW)	1.2305	1.2511	1.1631	0.5818	0.5837	0.5644
Delay (ns)	0.16	0.23	0.44	0.16	0.22	0.52
Energy Efficiency (GOP/s/W)	13.11	12.89	17.91	27.72	27.63	36.91
Area-power-product (APP)	3007.59	4043.05	3042.20	676.52	854.19	713.23
Process	SMIC 0.18 μm			SMIC 28 nm
Precision (fixed)	16-bit	16-bit	16-bit	16-bit	16-bit	16-bit
Area (μm²)	10,424.9	13,731.4	10,411.6	449.6	622.1	399.5
Power (mW)	6.0630	6.6189	5.7293	0.2249	0.2301	0.2057
Delay (ns)	0.31	0.54	0.99	0.19	0.24	0.43
Energy Efficiency (GOP/s/W)	2.66	2.44	3.64	71.72	70.10	101.28
Area-power-product (APP)	63,206.17	90,886.76	59,651.18	101.12	143.15	82.18
Clock (MHz)	1500
Process	TSMC 90 nm
Precision (fixed)	16-bit	16-bit	16-bit	8-bit	8-bit	8-bit
Area (μm²)	2444.2	3231.6	2643.2	1162.8	1562.2	1263.7
Power (mW)	3.7162	3.7732	3.4892	1.7192	1.7659	1.6662
Delay (ns)	0.16	0.23	0.52	0.16	0.23	0.52
Energy Efficiency (GOP/s/W)	13.02	12.82	17.91	28.15	27.40	37.51
Area-power-product (APP)	9083.14	12,193.47	9222.65	1999.09	2758.69	2105.58
Latency (cycles)	2P − 1	2P − 1	1.5P	2P − 1	2P − 1	1.5P
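
The derived rows of Table 4 can be cross-checked from the area, power, and latency entries. The sketch below assumes that APP is area multiplied by power and that energy efficiency is the operation rate (clock frequency divided by latency in cycles) divided by power. These formulas are inferred from the numbers: they reproduce the 16-bit, 500 MHz, TSMC 90 nm columns, but the function names are hypothetical and the reading is not confirmed for every column of the table.

```python
def area_power_product(area_um2: float, power_mw: float) -> float:
    """APP as area times power (assumed definition)."""
    return area_um2 * power_mw


def energy_efficiency(clock_mhz: float, latency_cycles: float, power_mw: float) -> float:
    """GOP/s/W, assuming one operation completes every `latency_cycles` cycles."""
    ops_per_second = clock_mhz * 1e6 / latency_cycles
    return ops_per_second / (power_mw * 1e-3) / 1e9


P = 16  # precision in bits
# name: (area in um^2, power in mW, latency in cycles), 500 MHz, TSMC 90 nm
rows = {
    "original": (2444.2, 1.2305, 2 * P - 1),
    "radix-2 Booth": (3231.6, 1.2511, 2 * P - 1),
    "radix-4 Booth": (2615.6, 1.1631, 1.5 * P),
}

for name, (area, power, latency) in rows.items():
    app = area_power_product(area, power)
    eff = energy_efficiency(500, latency, power)
    print(f"{name}: APP = {app:.2f}, efficiency = {eff:.2f} GOP/s/W")
```

Under these assumptions, the higher energy efficiency of the radix-4 design comes from both its lower power and its shorter latency of 1.5P cycles.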

Table 5. Comparison of the radix-4 Booth bit-serial multiplier and state-of-the-art multipliers.

Parameters	Radix-4 Booth Bit-Serial Multiplier	[12]	[16]	Traditional Multiplier
Clock (MHz)	150
Precision (fixed)	16-bit
Mode	bit-serial	bit-serial	parallel	parallel
Process	TSMC 90 nm	TSMC 90 nm	40 nm	TSMC 90 nm
Area (μm²)	2634.1	70,277.6	52,894.7	106,066.8
Power (mW)	0.285	0.2962	0.5388	0.7712
Energy Efficiency (GOP/s/W)	17.29	16.34	278.40	194.50
Area-power-product (APP)	750.72	20,816.23	28,499.66	81,798.72
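
The relative APP savings implied by Table 5 can be computed directly from the area and power rows. This is a minimal sketch with a hypothetical helper name; it assumes APP is area multiplied by power, and it assumes that the 96.39% reduction quoted in the abstract refers to the comparison with the bit-serial design of [12], which is the pairing that reproduces that figure.

```python
def reduction_percent(baseline: float, proposed: float) -> float:
    """Relative reduction of `proposed` with respect to `baseline`, in percent."""
    return (1 - proposed / baseline) * 100


# (area in um^2, power in mW) taken from Table 5
designs = {
    "radix-4 Booth": (2634.1, 0.285),
    "[12]": (70277.6, 0.2962),
    "[16]": (52894.7, 0.5388),
    "traditional": (106066.8, 0.7712),
}

app = {name: area * power for name, (area, power) in designs.items()}

for name in ("[12]", "[16]", "traditional"):
    saving = reduction_percent(app[name], app["radix-4 Booth"])
    print(f"APP reduction versus {name}: {saving:.2f}%")
```

The comparison with [16] spans different process nodes (40 nm versus 90 nm), so that percentage should be read with more caution than the same-process comparisons.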

Table 6. Synthesis results of two types of adders.

Parameters	Single-Precision Bit-Serial Adder	Double-Precision Bit-Serial Adder
Process	TSMC 90 nm
Clock (MHz)	1500
Area (μm²)	60.7	127.7
Power (mW)	0.0822	0.1650
Delay (ns)	0.16	0.18
Slack (ns)	0.47	0.44

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

