Article

An Accuracy-Improved Fixed-Width Booth Multiplier Enabling Bit-Width Adaptive Truncation Error Compensation

Information and Computer Engineering Department, Chung Yuan Christian University, Taoyuan 32023, Taiwan
*
Author to whom correspondence should be addressed.
Electronics 2021, 10(20), 2511; https://doi.org/10.3390/electronics10202511
Submission received: 26 August 2021 / Revised: 11 October 2021 / Accepted: 13 October 2021 / Published: 15 October 2021
(This article belongs to the Special Issue Recent Advances in CMOS Logic Circuits)

Abstract

Fixed-width Booth multipliers (FWBMs) generate a product with the same bit width as the operand and have been extensively employed in many digital systems. Various truncation error compensation (TEC) schemes have been presented for FWBM designs, aiming to reduce hardware costs while preserving operation accuracy. In general, the existing TEC methods function adequately for an exact bit width of the operand but fail to consider the TEC effect for FWBM inputs with various bit-width levels. To address this issue, we propose a bit-width adaptive TEC (BWATEC) scheme for providing high-accuracy TEC functions that are adaptive to the multiple L′-bit numerical ranges of input data for an L-bit FWBM (L′ ≤ L). We also present adjustable architecture for a 16-bit FWBM to enable the proposed BWATEC scheme and evaluate the hardware performance, using the TSMC 40 nm standard cell library. Relative to the contrast 16-bit FWBM approaches that use state-of-the-art TEC methods, the proposed BWATEC-enabled FWBM design can achieve reductions in the area-delay-error product of 7.9–50.9%, 17.1–69.5%, 29.9–82.2%, and 100% for the 14-bit, 12-bit, 10-bit, and 8-bit inputs, respectively. Moreover, the resultant 16-bit FWBM with BWATEC was verified by using the field-programmable gate array for convolutional neural network acceleration.

1. Introduction

Multipliers are widely used in many digital operation systems. To limit bit-width increases in data paths, fixed-width multipliers are accordingly employed as arithmetic modules for digital signal processing, communication baseband operations, and neural network acceleration [1,2,3,4]. L-bit fixed-width multipliers generate the same L-bit output width as the L-bit operand, of which the Baugh–Wooley (array) multiplier and Booth multiplier are two of the most popular types. Two convenient approaches to fixed-width Baugh–Wooley or Booth multipliers are post-truncation (PT) and direct-truncation (DT). The PT method calculates all partial products and rounds the 2L-bit full-width product to the L most significant bits (MSBs) to achieve high accuracy, but the hardware costs are high. The DT method truncates the partial products related to the least significant bits (LSBs) of the 2L-bit full-width product to reduce the hardware costs, but the accuracy is very low.
Considering both operation accuracy and hardware complexity, several schemes based on truncation error compensation (TEC) have been presented for fixed-width Baugh–Wooley multipliers [5,6,7,8] or fixed-width Booth multipliers (FWBMs) [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. The Booth multiplier has benefits in achieving high hardware efficiency because the number of rows of partial products is significantly reduced [23,24]. Moreover, the lower level of truncated partial products for Booth multipliers profits the fixed-width operation accuracy [17,18]. Therefore, the FWBM that enables a kind of TEC scheme is discussed in this study. A number of TEC schemes for FWBMs have been presented [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. In general, TEC schemes for FWBMs obtain a TEC value (bias) based on computer simulation [9,10,11,12,13,14] or probability-based estimation [14,15,16,17,18,19,20,21,22] to compensate for the truncation error associated with the curtailed partial products. For simulation-based methods, the work in Reference [9] used linear regression analysis and simulation to generate bias values. In Reference [10], bias terms were generated based on simulation outcomes, and simplified through the Karnaugh map processing. Moreover, the authors of References [11,12] derived formulas in a closed form for TEC biasing based on simulation results. In References [12,13,14], the bias terms through a simulation were determined utilizing the Booth encoded results to further improve accuracy. The computer simulation methods presented in References [9,10,11,12,13,14] are applicable but generally consume exhaustive simulation time to obtain the TEC bias value. Instead of exhaustive simulation, the authors of References [14,15,16,17,18,19,20,21,22] also presented the probability-based scheme to derive the TEC function. In this study, we aim to design the TEC-enabled FWBM based on the probabilistic estimation method. 
Further literature on state-of-the-art FWBM designs that use probability-based TEC schemes [14,15,16,17,18,19,20,21,22] is reviewed in Section 2.2.
In conventional TEC methods for FWBMs, TEC functions are generally operated based on a certain particular bit width of the FWBM operand. However, in practical applications, an L-bit FWBM might need to process input patterns with various L′-bit widths (L′ ≤ L; L and L′ are generally even). For example, a 16-bit FWBM might be employed to operate with 16-bit, 14-bit, 12-bit, 10-bit, or 8-bit numerical input patterns, as specified by different situations. Such an operation can be practically performed in several applications. Taking the convolutional neural network (CNN) as an example, an accelerator may employ an FWBM to process input data that have different settable bit widths from different CNN models or layers. Moreover, FWBMs used in a shared digital filter might operate with input data whose levels are various for multiple analog modules. To the best of our knowledge, no previous study has developed a TEC scheme for such an FWBM design to offer adaptive TEC biasing for various bit widths of input data. In this study, we propose a bit-width adaptive TEC (BWATEC) scheme for providing an adjustable TEC bias for the diverse bit widths of input patterns. For an L-bit FWBM, the proposed BWATEC method can enable a tailored and high-accuracy TEC function for each case of the L′-bit input pattern (where L′ ≤ L). In addition, an FWBM design for enabling the BWATEC is proposed based on a reconfigurable bias circuit with high hardware efficiency.
The remainder of this paper is organized as follows. Section 2 briefly introduces the background of FWBMs and the conventional probability-based TEC schemes for FWBMs. Section 3 outlines the proposed BWATEC scheme and its operations. In Section 4, the architecture of a 16-bit FWBM enabling the proposed BWATEC scheme is described. Section 5 evaluates the accuracy and hardware performances of our design and reports the experiment results, using a system-on-chip (SoC) field-programmable gate array (FPGA) platform. Finally, the conclusions are highlighted in Section 6.

2. Preliminaries and Design Issues

The abbreviations and acronyms frequently used in this study are listed in Table 1 for convenient reference.

2.1. Fixed-Width Booth Multiplier (FWBM)

Let A and B be two L-bit 2's complement operands, represented by “$a_{L-1}, a_{L-2}, \ldots, a_1, a_0$” and “$b_{L-1}, b_{L-2}, \ldots, b_1, b_0$” with the values shown below, respectively.
$$A = -a_{L-1}2^{L-1} + \sum_{i=0}^{L-2} a_i 2^i, \qquad B = -b_{L-1}2^{L-1} + \sum_{i=0}^{L-2} b_i 2^i \tag{1}$$
The Booth encoding maps three consecutive terms, b2j+1, b2j, and b2j−1 into dj, as tabulated in Table 2. The dj value can be associated with (b2j+1, b2j, b2j−1) terms as expressed in Equation (2), where Q = (1/2) × L. As a result, a 2L-bit full-width product (FP) for A × B can be obtained as shown in Equation (3).
$$B = \sum_{j=0}^{Q-1} d_j 2^{2j}, \qquad d_j = -2b_{2j+1} + b_{2j} + b_{2j-1} \tag{2}$$
$$FP = \left(\sum_{j=0}^{Q-1} d_j 2^{2j}\right) \times \left(-a_{L-1}2^{L-1} + \sum_{i=0}^{L-2} a_i 2^i\right) \tag{3}$$
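The encoding in Equations (2) and (3) can be checked numerically. The following is a minimal sketch, not the authors' implementation: the function names are illustrative, and the term b−1 is assumed to be 0, which is the usual Booth convention but is not stated explicitly in the text.

```python
def booth_digits(b, L=16):
    """Radix-4 Booth digits d_0..d_{Q-1} of an L-bit two's-complement integer b (Eq. (2))."""
    assert L % 2 == 0 and -(1 << (L - 1)) <= b < (1 << (L - 1))
    u = b & ((1 << L) - 1)  # raw L-bit pattern of b

    def bit(i):
        # b_{-1} is assumed to be 0 (standard Booth convention)
        return 0 if i < 0 else (u >> i) & 1

    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(L // 2)]


def full_product(a, b, L=16):
    """Full-width product per Eq. (3): the sum of d_j * A * 2^(2j) over all rows."""
    return sum(d * a * 4 ** j for j, d in enumerate(booth_digits(b, L)))
```

Each digit lies in {−2, −1, 0, 1, 2}, so every row of the P.P. array is a shifted copy of 0, ±A, or ±2A, which is why only Q = L/2 rows are needed.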
Using binary arithmetic for A × B, the partial products (P.P.) for each dj can be derived in terms of ai (i from 0 to L − 1), 0, or 1, as shown for the values of pi,j and nj in Table 2. Based on the P.P. terms in Table 2, Figure 1 depicts the structure of the P.P. array for an example of a 16-bit (L = 16) A × B full-width Booth multiplier. As shown in Figure 1, all P.P. terms can be divided into two groups: the main part (MP) and the truncation part (TP). The P.P. in the MP are calculated to generate the product, whereas the TP includes the P.P. that form the L LSBs of the full-width product, which are rounded off. The TP can be further divided into the TPmajor and TPminor subgroups. As indicated in Figure 1, TPmajor contains the P.P. in the most significant column (MSC) of the TP, which dominates the accuracy of the carry from the TP toward the MP. In general, the accuracy can be improved by increasing the column range of TPmajor [17,18,19,20]. However, a TEC function based on a TPmajor with one MSC usually offers adequate accuracy in many applications, and the one-MSC TPmajor serves as a sufficient baseline for evaluating the performance of different TEC schemes [12,14,16,22]. Thus, this study adopted the one-MSC TPmajor to develop and evaluate our BWATEC scheme and FWBM design. In an FWBM design with TEC, TPmajor is reserved for calculation, whereas TPminor is truncated, and an estimated bias is adopted to compensate for the truncation error [9,10,11,12,13,14,15,16,17,18,19,20,21,22]. Therefore, an L-bit FWBM with TEC produces an L-bit quantized FPq result, as expressed in Equation (4), where BTEC indicates the estimated bias value for TEC, TPmajor is mapped to the $2^{-1}$ digit, and R{·} is the rounding operation.
$$FP_q = MP + \sigma \cdot 2^{L}, \qquad \sigma = R\{TP_{major} + B_{TEC}\} \tag{4}$$
With regard to an L-bit FWBM whose operands can be assigned to the input data of multiple prespecified L′-bit widths (L′ ≤ L), the L′-bit input patterns are necessarily left-shifted by (L − L′) bits and are padded with zeros (i.e., Zero-Padding bits) to form the L-bit operand. The aforementioned processing for L = 16 is also described in Figure 1 for input patterns with multiple 14-bit, 12-bit, 10-bit, or 8-bit (i.e., L′-bit) widths.
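The zero-padding step can be illustrated with a short sketch. The helper name `pad_operand` is hypothetical and is used here only for illustration.

```python
def pad_operand(x, l_prime, L=16):
    """Left-shift an L'-bit two's-complement value by (L - L') bits to form an L-bit operand."""
    assert l_prime <= L
    assert -(1 << (l_prime - 1)) <= x < (1 << (l_prime - 1))  # x must fit in L' bits
    return x << (L - l_prime)  # the (L - L') low-order bits become Zero-Padding bits
```

For example, an 8-bit value of −3 becomes −768 as a 16-bit operand, and the product of two such padded operands always has 2(L − L′) trailing zero bits.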

2.2. Probability-Based TEC Schemes for FWBMs

Several FWBM designs with probability-based TEC have been presented [14,15,16,17,18,19,20,21,22]. The authors of Reference [14] presented a probability-based scheme together with their simulation-based work. Similarly, the work in Reference [15] used the expected value of the P.P. to derive bias values. Furthermore, the probabilistic analysis methods [16,17,18,19,20,21,22] derived closed-form TEC functions based on the expected value or the conditional probability of the P.P. terms. In Reference [16], the expected values for two groups of TPminor (i.e., the nj terms in Table 2 equal to 0 or 1) were separately derived to obtain the probabilistic estimation bias when a one-MSC TPmajor is specified. In addition, a generalized probabilistic estimation bias (GPEB) method [17] extended the work in Reference [16] to the cases in which TPmajor contains more P.P. columns. Using the GPEB methods [16,17], a simple TEC function of a 1-bit or 2-bit constant value was derived. The work in Reference [18] presented a TEC scheme based on the conditional probability depending on non-zero Booth encoder outputs (i.e., dj ≠ 0 in Table 2) for each row of TPminor. In Reference [19], a more complex method based on Reference [18] was presented that uses a conditional probability model for multiple TPminor rows. This design [19] slightly improved accuracy but increased hardware overhead. The authors of Reference [20] considered both expected values and conditional probability to develop a bias function that improves accuracy and area based on probability and computer simulation (PACS). In Reference [21], the concept of data scaling was presented and applied to conventional TEC-adapted FWBM designs to improve accuracy. A Booth-encoded sign-digit-based conditional probability (BSCP) method was presented in Reference [22] for the case of one-MSC TPmajor. The work in Reference [22] further took advantage of the sign of non-zero Booth encoder results to generate a TEC function achieving relatively high accuracy.
Considering a 16-bit FWBM design with TEC, the aforementioned conventional TEC schemes can be directly applied to the design example, as shown in Figure 1. However, such approaches cannot achieve optimized accuracy for input patterns with 14/12/10/8-bit widths, as the applied TEC functions are for 16-bit operands; thus, imprecise biasing might be introduced to 14-bit to 8-bit FWBM operations. Accordingly, the development of an enhanced and tailored TEC scheme (e.g., the proposed BWATEC method) that is adaptive to input patterns with values in multiple bit-width levels is considered to be useful and practical for the TEC-enabled FWBM design.

3. Proposed Bit-Width Adaptive TEC (BWATEC) Scheme

In Section 3.1 and Section 3.2, we use the 16-bit FWBM as an example for explaining the probability-based bias estimation and TEC operations for the proposed BWATEC scheme.

3.1. Derivation of Probabilistic Estimation for BWATEC

Referring to Figure 1, there are eight rows of TPminor (incl. nj) in the P.P. array for the 16-bit A × B Booth multiplication. The rows are indexed by j from 0 to 7 (the top row is the 0th row). The contents of the TP vary with the number of Zero-Padding (ZP) bits for different L′-bit input patterns of the operands A and B. Based on the mapping results from Table 2, Figure 2a–c illustrates the contents of TPminor for L′ = 14, 12, and 10, respectively. As described in Figure 2a–c, TPminor is classified into three regions: the zero region (RZ), hybrid region (RH), and deterministic-only region (RD). The RZ region only has zero-valued P.P. related to the Booth-encoded result of the ZP bits of the B operand, and thus can be trivially truncated. In Figure 2, the RH region includes P.P. with both deterministic and probabilistic values. For the jth row of TPminor in the RH, the sj terms are the P.P. associated with the ZP bits of the operand A. Both nj and sj can be exactly determined to be “0” or “1” (i.e., deterministic values) depending on dj, based on the contents in Table 2. The ej term in the RH (Figure 2) is the P.P. value of pr,j (where r = L − L′). From Table 2, it can be observed that the ej value is equal to “0” or “1” (for dj = ±2) or is determined by the LSB of the original L′-bit input data (for dj = ±1). In the RH region, the ej terms in the case of dj = ±1 and all other P.P. terms, excluding nj, sj, and ej, can be estimated by using an expected value of 1/2 (i.e., probabilistic values) [16,17,18,19,20]. In contrast to the RH, all P.P. in the RD region are sj and nj terms only, which are deterministic values. In addition to the cases of L′ = 14, 12, and 10 (shown in Figure 2a–c), the TPminor contents for the two contrasting cases of L′ = 16 and L′ = 8 are illustrated in Figure 3a,b. For L′ = 16, TPminor only has the RH region, whereas for L′ = 8, only the RZ and RD regions are included.
By mapping TPmajor to the $2^{-1}$ digit (i.e., the MSB of TPminor is $2^{-2}$), the expected value of all P.P. for the jth-row TPminor in the RH, $E[TP_{minor,j}^{(H)}]$, can be calculated as in Equation (5), where ns is the number of sj terms (refer to Figure 2). Based on Equation (5) and the mapping contents in Table 2, the values of $E[TP_{minor,j}^{(H)}]$ and (nj, sj, ej) according to dj are listed in Table 3.
$$E[TP_{minor,j}^{(H)}] = 2^{2j-16}\, n_j + \sum_{\alpha=0}^{n_s-1} 2^{2j+\alpha-16}\, s_j + 2^{2j+n_s-16}\, e_j + \sum_{\beta=n_s+1}^{14-2j} 2^{2j+\beta-16} \cdot \frac{1}{2} \tag{5}$$
The $E[TP_{minor,j}^{(H)}]$ values in Table 3 can be summarized by using the following expression, where a variable δj is defined by δj = 1 for dj ≠ 0; otherwise, δj = 0.
$$E[TP_{minor,j}^{(H)}] = 2^{-2}\delta_j - d_j\, 2^{-16+2j+n_s-1} \tag{6}$$
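The step from Equation (5) to Equation (6) can be verified with exact rational arithmetic. The sketch below is only a consistency check under stated assumptions: the per-digit values in `ROW_VALUES` are reconstructed from the description of nj, sj, and ej in the text (not copied from Table 3), an all-zero row is assumed for dj = 0, and all names are illustrative.

```python
from fractions import Fraction

HALF = Fraction(1, 2)

# Assumed (n_j, s_j, E[e_j], E[each remaining bit]) for a TPminor row in RH, by Booth digit.
ROW_VALUES = {
    0: (0, 0, 0, 0),          # all-zero row (assumption)
    1: (0, 0, HALF, HALF),    # e_j is the LSB of the original input
    -1: (1, 1, HALF, HALF),   # inverted bits, n_j = 1
    2: (0, 0, 0, HALF),       # e_j is a shifted-in ZP bit, i.e. 0
    -2: (1, 1, 1, HALF),      # e_j is an inverted ZP bit, i.e. 1
}


def pow2(e):
    """Exact power of two, also for negative exponents."""
    return Fraction(2) ** e


def expected_row_eq5(j, ns, d):
    """Term-by-term evaluation of Eq. (5) for row j with ns ZP bits and digit d."""
    n, s, e, rest = ROW_VALUES[d]
    total = pow2(2 * j - 16) * n
    total += sum(pow2(2 * j + alpha - 16) * s for alpha in range(ns))
    total += pow2(2 * j + ns - 16) * e
    total += sum(pow2(2 * j + beta - 16) * rest for beta in range(ns + 1, 14 - 2 * j + 1))
    return total


def expected_row_eq6(j, ns, d):
    """Closed form of Eq. (6)."""
    delta = 1 if d != 0 else 0
    return Fraction(1, 4) * delta - d * pow2(-16 + 2 * j + ns - 1)
```

Under these assumptions, the two functions agree for every Booth digit and for every row that lies in the RH when L′ = 16, 14, 12, or 10 (ns = 0, 2, 4, or 6).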
From observing the RH region for L′ = 14, 12, and 10 in Figure 2, it can be found that the RH includes the mth to the kth row of TPminor, in which m = ns/2 and k = 7 − (ns/2). By summing $E[TP_{minor,j}^{(H)}]$ for all rows in the RH, an overall $E[TP_{minor}^{(H)}]$ can be obtained as follows:
$$E[TP_{minor}^{(H)}] = \sum_{j=m}^{k} E[TP_{minor,j}^{(H)}] = \sum_{j=m}^{k} 2^{-2}\delta_j - \underbrace{\sum_{j=m}^{k} d_j\, 2^{-16+2j+n_s-1}}_{(a)} \tag{7}$$
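The row ranges of the three regions follow directly from m and k. The sketch below lists them for the L′ values discussed for the 16-bit example; writing k as (Q − 1) − ns/2 for a general L is an assumption made here for illustration (the text gives k = 7 − ns/2 for L = 16), and the function name is hypothetical.

```python
def tp_minor_regions(l_prime, L=16):
    """Row indexes of the RZ, RH, and RD regions of TPminor for an L'-bit input."""
    # Only the range L/2 <= L' <= L is covered, as in the 16-bit example of the text.
    assert L % 2 == 0 and l_prime % 2 == 0 and L // 2 <= l_prime <= L
    Q = L // 2                # number of P.P. rows
    ns = L - l_prime          # number of Zero-Padding bits
    m = ns // 2               # first RH row
    k = (Q - 1) - ns // 2     # last RH row
    return {
        "RZ": list(range(0, m)),
        "RH": list(range(m, k + 1)),
        "RD": list(range(k + 1, Q)),
    }
```

For L′ = 14, this gives one RZ row (row 0), six RH rows (rows 1 to 6), and one RD row (row 7); for L′ = 8, the RH is empty, in line with Figure 3b as described in the text.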
For an FWBM with TEC, the result of Equation (7) can be viewed as an ideal bias for the truncated TPminor in the RH; however, the calculation of (a) in Equation (7) is complex. The bottom row in the RH (i.e., the kth row; j = k) dominates the final calculation result. Moreover, the result of (a) in Equation (7) can be rounded to the $2^{-2}$ digit to be arithmetically added to δj. Therefore, Equation (7) can be approximated by Equation (8) by simplifying the (a) part to a $\sigma \cdot 2^{-2}$ term, where $R_{-2}\{\cdot\}$ represents rounding a value to the $2^{-2}$ digit.
$$E[TP_{minor}^{(H)}] \approx \sum_{j=m}^{k} 2^{-2}\delta_j - R_{-2}\{d_k\, 2^{-3}\} \approx \sum_{j=m}^{k} 2^{-2}\delta_j + \sigma \cdot 2^{-2}, \qquad \sigma = \begin{cases} 1, & d_k < 0 \\ -1, & d_k > 0 \\ 0, & d_k = 0 \end{cases} \tag{8}$$
However, the subtraction at the $2^{-2}$ digit (i.e., σ = −1 in Equation (8)) is also an issue in a P.P. array. This issue can be resolved by taking advantage of the following operational features. When dk is negative, both δk and σ are equal to 1; thus, a carry of “1” can be added to the $2^{-1}$ digit. If dk is positive, δk is 1, whereas σ is −1. Thus, δk can be eliminated at the $2^{-2}$ digit, owing to the offset by σ. As a result, Equation (8) can be further calculated by using Equation (9), where a variable, γ, is operated at the $2^{-1}$ digit only with an addition.
$$E[TP_{minor}^{(H)}] \approx \sum_{j=m}^{k-1} 2^{-2}\delta_j + 2^{-1}\gamma; \qquad \gamma = \begin{cases} 1, & d_k < 0 \\ 0, & d_k \ge 0 \end{cases} \tag{9}$$
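The equivalence between the σ form of Equation (8) and the addition-only γ form of Equation (9) can be confirmed by enumeration. This is a small illustrative sketch with hypothetical function names; the argument `d_rh` is assumed to hold the Booth digits dm, …, dk of the RH rows in order.

```python
from fractions import Fraction

QUARTER = Fraction(1, 4)


def bias_eq8(d_rh):
    """RH bias per Eq. (8): delta terms over rows m..k plus the signed sigma term."""
    dk = d_rh[-1]
    sigma = (dk < 0) - (dk > 0)  # 1 for d_k < 0, -1 for d_k > 0, 0 for d_k = 0
    return sum(QUARTER for d in d_rh if d != 0) + sigma * QUARTER


def bias_eq9(d_rh):
    """RH bias per Eq. (9): delta terms over rows m..k-1 plus gamma at the 2^-1 digit."""
    gamma = 1 if d_rh[-1] < 0 else 0
    return sum(QUARTER for d in d_rh[:-1] if d != 0) + Fraction(gamma, 2)
```

The two forms return the same value for every combination of digits, so the subtraction in Equation (8) never has to be implemented in the P.P. array.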
As shown in Figure 2a–c, the RD region only comprises P.P. in terms of sj and nj, as the number of P.P. within a row in the RD is less than the number of ZP bits (Figure 1). Similar to sj in the RH, the sj terms in the RD are also P.P. obtained from the ZP bits of the A operand and are equal to the dj-dependent deterministic “1” or “0” values. Setting sj and nj to “1” (for dj < 0) or “0” (for dj ≥ 0), the actual value of all P.P. for the jth row of TPminor in the RD, i.e., $E[TP_{minor,j}^{(D)}]$, can be obtained by the following derivation. An accumulated result from the jth row in the RD is introduced to the $2^{-1}$ digit for negative dj values.
$$E[TP_{minor,j}^{(D)}] = 2^{-16+2j}(n_j + s_j) + \cdots + 2^{-2} s_j = \begin{cases} 0, & d_j \ge 0 \\ 2^{-1}, & d_j < 0 \end{cases} \tag{10}$$
For the design examples illustrated in Figure 2a–c, the RD region includes rows with indexes from k + 1 to Q − 1, where Q equals 8 for the case of L = 16. The variable Q is defined as Q = L/2, which is the number of rows in a P.P. array. Thus, a global $E[TP_{minor}^{(D)}]$ can be derived, as shown in Equation (11), in which the variable λj is defined by λj = 1 for dj < 0 and λj = 0 for dj ≥ 0, corresponding to the execution results of Equation (10) for each row.
$$E[TP_{minor}^{(D)}] = \sum_{j=k+1}^{Q-1} E[TP_{minor,j}^{(D)}] = \sum_{j=k+1}^{Q-1} 2^{-1}\lambda_j \tag{11}$$
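The row sum in Equation (10) and the accumulated bias in Equation (11) can be checked with a few lines of exact arithmetic. The sketch assumes, as described above, that nj and every sj in an RD row equal 1 for dj < 0 and 0 otherwise; the function names are illustrative.

```python
from fractions import Fraction


def rd_row_value(j, d, L=16):
    """Exact value of the jth TPminor row in RD (Eq. (10)), in units of 2^L."""
    bit = 1 if d < 0 else 0                      # common value of n_j and all s_j
    total = Fraction(bit, 2 ** (L - 2 * j))      # n_j at column 2j
    # s_j terms at columns 2j .. L-2 (the top TPminor column maps to the 2^-2 digit)
    total += sum(Fraction(bit, 2 ** (L - c)) for c in range(2 * j, L - 1))
    return total


def rd_bias(d, k, L=16):
    """RD bias per Eq. (11): 2^-1 for every row k+1..Q-1 with a negative Booth digit."""
    return sum(Fraction(1, 2) for j in range(k + 1, L // 2) if d[j] < 0)
```

Because the nj bit fills in the missing LSB of the all-ones row, each negative row sums to exactly $2^{-1}$, so the RD bias is exact rather than estimated.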

3.2. BWATEC Synthesis and Operations

For FWBMs with TEC, the $E[TP_{minor}^{(H)}]$ and $E[TP_{minor}^{(D)}]$ values obtained by using Equations (9) and (11) can be employed as the TEC bias to compensate for the truncated P.P. of TPminor in the RH and RD, respectively. Moreover, the operations of Equations (9) and (11) vary with the input bit width (L′), as do the contents and range of the RH and RD regions (Figure 2). In addition to the cases of L′ = 14, 12, and 10 shown in Figure 2a–c, the proposed schemes based on Equations (9) and (11) can also be applied to the conditions of L′ = 16 and 8 shown in Figure 3. For L′ = 16, TPminor only has the RH region, and Equation (9) is used to generate the TEC bias. Alternatively, when L′ = 8, Equation (11) is used, as only the RD region is calculated. In practice, the TEC function for L′ = 16 can be further improved. From Figure 1, the deterministic p0,7 and n7 can be added to the δj terms at the $2^{-2}$ digit; thus, a more precise carry can be added at the $2^{-1}$ digit, instead of adding γ in Equation (9). The efficient use of Equations (9) and (11), as associated with multiple combinations of deterministic and probabilistic data, achieves the aims of the proposed BWATEC scheme. Considering a 16-bit FWBM, Figure 4 illustrates the TEC operations by using the proposed BWATEC scheme for various L′-bit inputs.
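A behavioral model helps to see how Equations (4), (9), and (11) combine. The following sketch is not the authors' circuit or the exact operation of Figure 4: it assumes a P.P. array in which a negative digit yields an inverted row plus nj = 1 and a zero digit yields an all-zero row, takes R{·} as round-half-up, and omits the p0,7/n7 refinement for L′ = 16. All function names are hypothetical.

```python
from fractions import Fraction
import math

L = 16       # operand width of the FWBM example
Q = L // 2   # number of partial-product rows


def booth_digits(b):
    """Radix-4 Booth digits of a 16-bit two's-complement integer (b_{-1} assumed 0)."""
    u = b & ((1 << L) - 1)
    bit = lambda i: 0 if i < 0 else (u >> i) & 1
    return [-2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1) for j in range(Q)]


def split_product(a, b):
    """Split A*B into (MP, TPmajor, TPminor), all expressed in units of 2^L."""
    major = minor = Fraction(0)
    for j, d in enumerate(booth_digits(b)):
        n = 1 if d < 0 else 0        # correction bit n_j
        p = d * a - n                # row bit pattern, chosen so that p + n == d * a
        width = L - 2 * j            # number of row bits that fall in columns 0..L-1
        low = p % (1 << width)       # those bits as a non-negative integer
        major += Fraction(low >> (width - 1), 2)                 # bit in column L-1
        minor_bits = (low & ((1 << (width - 1)) - 1)) + n        # columns 2j..L-2 plus n_j
        minor += Fraction(minor_bits << (2 * j), 1 << L)
    mp = Fraction(a * b, 1 << L) - major - minor
    assert mp.denominator == 1       # MP only holds columns >= L
    return int(mp), major, minor


def bwatec_bias(d, l_prime):
    """TEC bias from Eq. (9) over the RH rows plus Eq. (11) over the RD rows."""
    assert l_prime in (16, 14, 12, 10, 8)
    ns = L - l_prime
    m, k = ns // 2, (Q - 1) - ns // 2
    bias = Fraction(0)
    if m <= k:                                                        # RH region exists
        bias += sum(Fraction(1, 4) for j in range(m, k) if d[j] != 0)  # delta_j terms
        bias += Fraction(1, 2) if d[k] < 0 else 0                      # gamma term
    bias += sum(Fraction(1, 2) for j in range(k + 1, Q) if d[j] < 0)   # lambda_j terms
    return bias


def fwbm(a, b, l_prime):
    """Quantized product FP_q of Eq. (4) for zero-padded 16-bit operands a and b."""
    mp, major, _ = split_product(a, b)
    sigma = math.floor(major + bwatec_bias(booth_digits(b), l_prime) + Fraction(1, 2))
    return (mp + sigma) << L
```

Under these assumptions, calling `fwbm(x << 8, y << 8, 8)` for 8-bit values x and y returns the exact product, because every truncated bit lies in the RZ or RD region and is therefore deterministic.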

3.3. Design Scalability

Taking the 16-bit FWBM example as a base, the derived processing can also be applied to general L-bit FWBM designs (i.e., L is a scalable number other than 16). In general, an L-bit Booth multiplier operates with an even L. Considering the scalability of the proposed design, the targeted L-bit FWBMs can be categorized into two kinds of specifications. One is L = 2n with n an even integer; thus, the number of P.P. rows (i.e., the Q value) is even based on Q = L/2. The other is L = 2n with n an odd integer; thus, the number of P.P. rows is odd. Referring to the contents associated with Figure 2 and Figure 3, the TPminor P.P. corresponding to the RZ/RH/RD regions are illustrated for the design case of an L-bit Booth multiplier (L = 16) with various L′-bit inputs (L′ = 16, 14, 12, 10, and 8), which has an even number (i.e., 8) of P.P. rows. Moreover, the contents related to Figure 4 illustrate the proposed BWATEC operation for a 16-bit FWBM. As an extension of the illustration for the case of L = 16, Figure 5 depicts the RZ/RH/RD distribution of the TPminor rows of a general L-bit Booth multiplier (i.e., L is scalable) for different L′-bit inputs, and Figure 5a,b illustrates the specifications of “L = 2n (n and Q are even; even rows)” and “L = 2n (n and Q are odd; odd rows)”, respectively. In Figure 5a,b, the value shown inside each RZ/RH/RD block represents the number of rows in that region, and ⌈·⌉ refers to the ceiling operator.
As indicated in the previous section, the proposed BWATEC operations for the case of a 16-bit FWBM (i.e., Figure 4) can be synthesized based on the contents in Figure 2 and Figure 3 in common with Equations (9) and (11). By analogy with the derivation for the contents in Figure 4, the proposed BWATEC operations for various L′-bit input patterns of a general L-bit FWBM can be similarly synthesized based on Figure 5, Equations (9) and (11), as described in Figure 6, where the two specifications of L = 2n (n is even or odd) are also respectively illustrated. The contents in Figure 5 and Figure 6 further address the RZ/RH/RD distribution and BWATEC operations of a general L-bit FWBM operated with L′-bit inputs in small L′ values. For the cases of “L′ ≤ L/2–2 (L = 2n; n is even)” or “L′ ≤ L/2–1 (L = 2n; n is odd)”, only the RZ and RD regions are included and the range of RD is reduced with smaller L′ values. Such conditions allow only the RD P.P. to be calculated to obtain the TEC bias, as expressed in Equation (12), which is an extended form based on Equation (11).
$$E[TP_{minor}^{(D)}] = \sum_{j=t}^{Q-1} 2^{-1}\lambda_j; \qquad t = \begin{cases} Q/2 + 1, & L' \le L/2 - 2,\ Q \text{ is even} \\ Q/2 + 1, & L' \le L/2 - 1,\ Q \text{ is odd} \end{cases} \tag{12}$$

4. Proposed BWATEC-Enabled FWBM Architecture

There is also a need for an FWBM design with an efficient architecture for enabling the proposed BWATEC scheme. Figure 7 describes the hardware architecture of a 16-bit FWBM example enabling the BWATEC functions. As shown in Figure 7, the P.P. values are first produced through the Booth Encoder, and the P.P. Generator operates on two operands of A and B, which are already padded with ZP bits according to the prespecified bit width, L′ of input patterns (Figure 1). Depending on L′, the BWATEC-associated δj, γ, and λj terms are also set and sent to the P.P. array, along with the P.P. terms. The carry-save adder/carry-propagation adder (CSA/CPA) unit performs the array operations for the MP, TPmajor, and BWATEC biasing. Right shifting of bits can be optionally executed at the CPA output depending on the practical system design. As detailed in Figure 7, we used four groups (i.e., M1, M2, M3, and M4) of multiplexers controlled by the setting of L′ to enable data selection of the carry of δj accumulation, γ, λj, and “0” for the BWATEC operations described in Figure 4. A switch is also employed for selecting γ or an extra carry contributed by the addition of p0,7 and n7 for L′ = 16. Based on the configuration shown in Figure 7, similar approaches can be used to deduce the FWBM design for other bit widths of the operand. As a result, the BWATEC-enabled FWBM can be realized by using the originally required P.P. elements with additional multiplexers (incl. a switch) and control logics for adjustable TEC operations.
Considering a general L-bit FWBM (L is scalable) enabling the proposed BWATEC scheme, its hardware configuration for TEC biasing can also be developed based on the BWATEC operation shown in Figure 6, following the approach used for the 16-bit FWBM example (refer to Figure 4 and Figure 7). The hardware structure for BWATEC biasing of a general L-bit FWBM is described in Figure 8a,b for the two specifications of “L = 2n (n is even)” and “L = 2n (n is odd)”, respectively. As indicated in Figure 8, the addition of the biasing elements is performed by using full adders (FAs) or half adders (HAs). The mandatory multiplexers (i.e., MUX1 in Figure 8) are employed to select the carry of the δ accumulation or the γ and λ terms based on the BWATEC operations in Figure 6, and the optional multiplexers (i.e., MUX2) can set unadded δ terms to “0” for energy efficiency. If the devised FWBM is specified to process L′-bit inputs with small L′ values, corresponding levels of multiplexers (i.e., MUX3) might be employed to mask the uncalculated γ or λ terms (refer to Figure 6), as shown in Figure 8. Based on the configuration of a P.P. array and the BWATEC biasing (refer to Figure 8), the hardware (HW) resource usage, in terms of the number of FAs, HAs, and multiplexers (i.e., MUX1, 2, and 3), of a general L-bit FWBM using the proposed BWATEC scheme for various L′-bit inputs is listed in Table 4.

5. Evaluations and Experiments

Considering 16-bit FWBM designs with TEC based on one-MSC TPmajor, this section evaluates the accuracy and hardware performances for the proposed design and several representative works in previous studies. Moreover, the 16-bit FWBM with BWATEC was verified through the SoC-FPGA implementation for CNN inference operations.

5.1. Evaluations of Accuracy and Hardware Performances

For the accuracy performance, the signal-to-noise ratio (SNR) is the most important parameter and is defined as in Equation (13), where FP (refer to Equation (3)) is the product of the full-width Booth multiplier, and FPq (refer to Equation (4)) is the product of the FWBM with TEC, DT, or PT. Equation (13) also defines the mean square error (MSE).
$$SNR\,(\mathrm{dB}) = 10 \times \log_{10}\!\left(\frac{E[FP^2]}{E[(FP - FP_q)^2]}\right); \qquad MSE = \frac{E[(FP - FP_q)^2]}{2^{2L}} \tag{13}$$
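Both metrics in Equation (13) are straightforward to compute from paired samples of FP and FPq. The sketch below estimates the expectations by sample means; the function names are illustrative, and an infinite SNR is returned for the error-free case as a convention chosen here.

```python
import math


def snr_db(fp, fpq):
    """SNR in dB per Eq. (13), with expectations estimated by sample means."""
    signal = sum(x * x for x in fp) / len(fp)
    noise = sum((x - y) ** 2 for x, y in zip(fp, fpq)) / len(fp)
    return math.inf if noise == 0 else 10 * math.log10(signal / noise)


def mse(fp, fpq, L):
    """Mean square error per Eq. (13), normalized by 2^(2L)."""
    noise = sum((x - y) ** 2 for x, y in zip(fp, fpq)) / len(fp)
    return noise / 2 ** (2 * L)
```

Here `fp` and `fpq` are equally long sequences of full-width and quantized products obtained from the same operand pairs.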
For comparison, we select state-of-the-art TEC schemes whose functions have a closed form, i.e., the generalized probabilistic estimation bias (GPEB) [16,17], probability estimation and computer simulation (PACS) [20], Booth-encoded sign-digit-based conditional probability (BSCP) [22], and SC-generator-based (SCG) [12] methods, as well as the DT and PT approaches. Table 5 presents the accuracy (i.e., the SNR) and hardware performances (area, critical-path delay, and power consumption) for a 16-bit FWBM using the aforementioned TEC and proposed BWATEC schemes, respectively.
In Table 5, the SNR results were obtained for operations on 16-bit data (i.e., L′ = 16), based on the calculation of 30 K sets of 16-bit A × B Booth multiplications. Both the A and B operands were uncorrelated random 16-bit numbers with a uniform distribution. The hardware parameters were obtained by using the Synopsys Design Compiler, through logic synthesis with the TSMC 40 nm typical standard cell library, for FWBM designs with no optional bit-shift processing at the output. According to References [12,22], we used a general sorting circuit based on Reference [12] for the BSCP and SCG designs in order to avoid the addition of negative digit values. In Table 5, the BSCP method achieves a better SNR than all other TEC-enabled designs. However, this result of the BSCP approach was obtained by using a complex TEC formula (i.e., Equation (19) in Reference [22]), and this function is difficult to apply directly to a practical biasing circuit. The GPEB scheme outperforms the other TEC-based works in hardware cost owing to its use of a simple 1-bit or 2-bit constant TEC bias; however, its accuracy is comparatively lower. Referring to the hardware parameters listed in Table 5, the results for the “area” and “power” items basically exhibit the same trend. To benchmark both the area efficiency and the accuracy, a design metric of area-delay-error product (ADEP), defined as “ADEP = Area × Delay × MSE”, can be adopted to evaluate the overall design efficiency. As there is no TEC function involved in either DT or PT and the MSE magnitudes obtained by DT and PT are too extreme for the ADEP evaluation, these two schemes are excluded from the ADEP evaluation [22]. Table 6 lists the ADEP results in percentage values (normalized to that of the GPEB case) for TEC-enabled designs.
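The ADEP metric and its normalization to a baseline design can be expressed in a few lines. This is an illustrative sketch only: the function names are hypothetical, and any numbers passed to it are placeholders supplied by the caller, not values from Table 5 or Table 6.

```python
def adep(area, delay, mse):
    """Area-delay-error product: ADEP = Area x Delay x MSE."""
    return area * delay * mse


def normalized_adep(designs, baseline):
    """ADEP of each design as a percentage of the baseline design's ADEP.

    `designs` maps a design name to an (area, delay, mse) tuple.
    """
    base = adep(*designs[baseline])
    return {name: 100 * adep(*params) / base for name, params in designs.items()}
```

A design whose normalized ADEP is 60% corresponds to a 40% ADEP reduction relative to the baseline, which is how the percentage reductions in this section are expressed.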
As indicated in Table 6, the proposed design outperforms all listed schemes (i.e., a relatively small ADEP value) except the PACS method. This is because additional data-selection multiplexers/switch and control logics are required in our design to enable the adjustment of the TEC function for multiple L′ levels (refer to Figure 7). Such processing increases hardware costs and especially increases critical-path delay in our FWBM relative to other TEC designs, as shown in Table 5; however, the accuracy for L′ = 16 in our case is comparatively improved by using deterministic p0,7 and n7 (refer to Section 3.2, Figure 4).
Nevertheless, the actual accuracy performance and hardware efficiency of the proposed design are manifested in the accuracy improvement for operations on L′-bit input patterns with L′ < 16. Table 7 reports the SNR results for the TEC-enabled 16-bit FWBMs (i.e., the works in Table 6) for operations of L′ = 14, 12, 10, and 8. In Table 7, the SNR values were obtained based on the 16-bit product of the FWBMs relative to the PT outcomes. As indicated in Table 7, our design achieves the highest SNR among the TEC-based designs for all listed L′ cases because the proposed BWATEC scheme provides more precise TEC biasing for various L′-bit inputs. In addition, higher SNR results can be achieved with smaller L′ values by using the proposed design, owing to the larger number of deterministic RZ/RD elements. In practical designs, a slight improvement in the SNR can possibly result in an effective enhancement in the system operation accuracy [21].
Considering the overall design efficiency, Figure 9 illustrates the ADEP results from the TEC-based 16-bit FWBMs based on the MSE value relative to the PT products for operations of L′ = 14, 12, 10, and 8, with ZP bits added to the input operand. In Figure 9, the ADEP values are normalized to the GPEB results, and the annotated percentage values represent the reduction of the ADEP achieved by the proposed design relative to all other listed methods. Figure 9 demonstrates that our scheme outperforms its contenders in terms of the ADEP values, achieving reductions of 7.9–50.9%, 17.1–69.5%, 29.9–82.2%, and 100% for the operations of input patterns with 14-bit, 12-bit, 10-bit, and 8-bit widths, respectively. Figure 9 shows that our design can achieve a more significant TEC effect with smaller specified L′ values, as more ratios of deterministic values are used and associated with the RH and RD when using the proposed BWATEC scheme. For the case of L′ = 8, our approach equivalently counts all P.P. terms to obtain a full-width result that is the same as a PT outcome, and thus a 100% ADEP reduction can be achieved.
As discussed in Section 3.3 and Section 4, two specifications, “L = 2n (n is even)” and “L = 2n (n is odd)”, are considered for the design scalability of an L-bit FWBM using the proposed BWATEC scheme. Therefore, in addition to the case of L = 16 (even n), the case of L = 14 (odd n) was also evaluated for the ADEP performance in this section. Figure 10 illustrates the ADEP results of the TEC-enabled 14-bit FWBMs for operations with L′ = 12, 10, 8, and 6, obtained with the same procedure as that used for the 16-bit FWBM evaluation. Figure 10 indicates that our design outperforms all other listed methods, achieving significant ADEP reductions for inputs with 12-bit, 10-bit, 8-bit, and 6-bit widths. Compared to the ADEP results for the 16-bit FWBMs (Figure 9), the ADEP drops of all designs other than GPEB relative to the GPEB baseline are smaller in Figure 10; however, our design exhibits the same trend of ADEP reduction as a function of the relative values of L and L′ when compared with the other TEC-based works.
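To make the ADEP metric concrete, the following sketch recomputes a GPEB-normalized ADEP for L′ = 16 from the area, delay, and SNR values listed in Table 5. It assumes that the error term is the MSE and that, with equal signal power across designs, the MSE is proportional to 10^(−SNR/10); because the tabulated inputs are rounded, the results are only expected to agree approximately with Table 6.

```python
# (area in um^2, delay in ns, SNR in dB) for L' = 16, copied from Table 5.
designs = {
    "BSCP": (1330, 3.25, 81.91),
    "PACS": (1301, 3.24, 81.83),
    "GPEB": (1249, 3.20, 79.34),
    "SCG":  (1325, 3.24, 81.84),
    "Ours": (1312, 3.27, 81.87),
}

def relative_mse(snr_db):
    """Assumption: equal signal power, so MSE is proportional to 10^(-SNR/10)."""
    return 10.0 ** (-snr_db / 10.0)

def adep(area, delay, snr_db):
    """Area-delay-error product with the error term taken as relative MSE."""
    return area * delay * relative_mse(snr_db)

baseline = adep(*designs["GPEB"])
normalized_adep = {
    name: 100.0 * adep(*values) / baseline for name, values in designs.items()
}

for name, value in normalized_adep.items():
    print(f"{name}: {value:.2f}%")
```

With these inputs, the normalized values land within about one percentage point of the entries in Table 6; the residual differences are attributable to rounding and to the SNR-to-MSE assumption stated above.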

5.2. Design Verification and Experiments

5.2.1. CNN Acceleration Application

To verify an FWBM enabling the proposed BWATEC scheme, we implemented our design on a SoC-FPGA platform and demonstrated the hardware acceleration of CNN inference operations. In a typical CNN accelerator, fixed-point operations are usually considered, and a suitable bit width can be determined based on the CNN inference accuracy requirement [25,26]. Several studies have shown that a small bit width (e.g., 8 bits or fewer) is sufficient for the model coefficients and operation precision while preserving the inference accuracy [27,28,29]. However, a sufficiently high bit width (e.g., the common 16 bits) is adopted in several CNN accelerator approaches to ensure the precision required by various applications [30,31,32]. Moreover, different bit widths can be specified for different CNN layers (e.g., the intermediate layers) to adjust the CNN performance [33,34]. Accordingly, several works have proposed CNN processing units that support operations with variable bit widths (e.g., 4/8/16-bit or 1-bit to 16-bit) [28,35,36,37]. In this study, an L-bit FWBM (e.g., our design example of a 16-bit FWBM) capable of processing input patterns with multiple L′-bit (L′ ≤ L) widths lends support to the aforementioned practical approaches.

5.2.2. SoC-FPGA Implementation

The employed SoC-FPGA-based platform uses a Xilinx Zynq-7000 SoC-FPGA device, which integrates an ARM central processing unit (CPU) with the user-developed hardware side. Such a SoC-FPGA approach supports CNN inference operations through a software (SW)–hardware (HW) co-design scheme [38,39,40], in which the SW–HW work division is evaluated appropriately. For example, the computation-expensive two-dimensional (2D) convolution is often accelerated at the HW side, while other low-effort CNN operations, such as maximum pooling, fully-connected (FC) layer execution, and system control, are processed at the SW end [39,40]. Figure 11 shows the setup of our implementation, which uses a SoC-FPGA approach based on the SW–HW co-design for CNN acceleration.
Referring to Figure 11, the division of HW and SW responsibilities in our CNN setup was as follows. The HW side was responsible for the 2D (5 × 5) convolution, addition of an offset, activation function (i.e., rectified linear unit; ReLU), and maximum pooling, while the SW side performed the remaining low-complexity operations (e.g., FC execution), HW operation mode setting, and system control. As depicted in Figure 11, the ARM CPU executes SW commands and communicates with the HW side through the AXI bus, and data transfer between the external memory and the FPGA HW-side memory is executed through direct memory access (DMA). When the HW acceleration of a CNN layer was activated, the feature map data, kernel weights, and control parameters were fetched from the external memory (e.g., DRAM) to the HW side via DMA transmission and stored in the block RAMs, data registers, and control registers, respectively. The 2D convolution accelerator then accessed those stored values for the convolution operations and sent the calculated results to the next module for the offset-addition, ReLU, and pooling operations. The final data produced by the HW acceleration of each CNN layer were stored in the block RAMs and sent to the external memory through DMA for the follow-up SW processing.
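The HW-side processing order described above (5 × 5 convolution, offset addition, ReLU, and 2 × 2 maximum pooling) can be sketched as a small integer reference model. This is an illustrative software model only; it assumes zero padding at the borders so that the convolution output keeps the input size (as the feature map sizes in Table 9 suggest), a pooling stride of 2, and exact integer arithmetic rather than fixed-width multiplication. The function names are hypothetical.

```python
def conv2d_same(image, kernel):
    """2D sliding-window sum of products with zero padding (output size = input size).

    As is common in CNNs, the kernel is applied without flipping.
    """
    h, w, k = len(image), len(image[0]), len(kernel)
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            acc = 0
            for i in range(k):
                for j in range(k):
                    yy, xx = y + i - r, x + j - r
                    if 0 <= yy < h and 0 <= xx < w:  # outside pixels count as 0
                        acc += image[yy][xx] * kernel[i][j]
            out[y][x] = acc
    return out

def add_offset_relu(fmap, offset):
    """Add a per-layer offset, then apply ReLU (clamp negatives to 0)."""
    return [[max(0, v + offset) for v in row] for row in fmap]

def max_pool_2x2(fmap):
    """2 x 2 maximum pooling with stride 2 (assumes even height and width)."""
    return [
        [max(fmap[y][x], fmap[y][x + 1], fmap[y + 1][x], fmap[y + 1][x + 1])
         for x in range(0, len(fmap[0]) - 1, 2)]
        for y in range(0, len(fmap) - 1, 2)
    ]

def hw_layer(image, kernel, offset):
    """One accelerated layer: convolution -> offset -> ReLU -> pooling."""
    return max_pool_2x2(add_offset_relu(conv2d_same(image, kernel), offset))

# Toy usage: an 8 x 8 image of ones and an all-ones 5 x 5 kernel.
ones_image = [[1] * 8 for _ in range(8)]
ones_kernel = [[1] * 5 for _ in range(5)]
pooled = hw_layer(ones_image, ones_kernel, offset=-10)
```

In the toy usage, interior pixels accumulate all 25 kernel taps while corner pixels see only a 3 × 3 valid region, so the offset and ReLU stages clamp the corners to zero before pooling.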
In our design, the 2D (5 × 5) convolution accelerator employs 25 16-bit FWBMs with BWATEC, which can operate with 16-bit, 14-bit, 12-bit, 10-bit, or 8-bit numerical input data. Depending on the tiling for each CNN layer, the block RAMs can be configured to store the data of input and output feature maps (images) with sizes from 32 × 32 to 128 × 128. Table 8 lists the main HW resource usage on a Xilinx Zynq-7000 SoC-FPGA device for our FPGA design; the items include the lookup-table (LUT), flip-flop (FF), LUTRAM, and block RAM (BRAM) utilization. Table 8 also lists the HW performance in giga operations per second (GOPs), which is obtained at a 50 MHz clock rate with values converted from the giga multiplication and addition operations [40].

5.2.3. Electrocardiogram Classification Experiment

Based on our SoC-FPGA implementation and SW–HW co-design setup, an experiment was performed to demonstrate electrocardiogram (ECG) classification. In this work, we used the standard MIT–BIH arrhythmia dataset [41] for the CNN model training and inference. To operate on ECG data with a 2D CNN model [42,43], we transformed the one-dimensional MIT–BIH ECG signals into (128 × 128) 2D ECG images by using the signal preprocessing technique presented in Reference [42]. Rather than performing the clinical ECG classification into seven or five arrhythmia classes [42,44], our experiment merely classified ECG images into “normal” and “abnormal” heartbeats for wearable ECG monitor applications.
The experimental network for the targeted ECG classification was built by using a simplified LeNet-5 CNN model [45]. As summarized in Table 9, the CNN model includes two convolution and maximum pooling layers, followed by the FC layers. Our CNN model was first determined through a training process performed on a high-end computer in floating-point operations. For the CNN inference using the SW–HW co-design approach, the executions accelerated at the HW side were performed in fixed-point operations to achieve acceptable overall accuracy (Table 9). To verify the proposed design, we implemented a 2D convolution unit consisting of 25 16-bit FWBMs with the proposed BWATEC function on the SoC-FPGA device. In our experiment, the same 2D convolution unit, with one set of 25 16-bit FWBMs, performed the computation of both CNN convolution layers. To demonstrate the BWATEC operations of our design, two phases of L′ setting (i.e., L = 16 with L′-bit inputs) were adopted for this set of FWBMs, one for each convolution layer. In phase 1, all 16-bit FWBMs in the 2D convolution unit were set to operate with 12-bit input data (i.e., L′ = 12) for CNN layer-1 operations, consistent with the numerical level of the inputs. In phase 2, the same 16-bit FWBMs were set to process 16-bit input data (i.e., L′ = 16) for CNN layer-2 operations to preserve the computation precision.
For evaluation, we also implemented contrast 2D convolution units composed of 16-bit FWBMs using the BSCP, PACS, GPEB, and SCG TEC schemes on the same SoC-FPGA device. Table 10 lists the FPGA LUT resource utilization of a 2D convolution unit for the various TEC-based FWBM designs, using the aforementioned methods and our scheme. As shown in Table 10, our design achieves a medium level of area efficiency on the FPGA, which is basically consistent with the trend of the area parameters listed in Table 5. However, the ability of our design to support multiple settings of L′-bit operations benefits system developments that require flexible word lengths or improved accuracy. As a case study in addition to CNN acceleration, the devised 2D convolution unit can be restructured to realize a 25-tap finite impulse response (FIR) filter by inserting several multiplexers in the data paths of the 25 FWBMs. We also developed such an FIR filter with a slight HW overhead via FPGA implementation, so that our design can be used in digital signal processing applications.
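For reference, the arithmetic performed by a 25-tap FIR filter uses the same 25 multiplications per output sample as the 5 × 5 convolution window. The sketch below is a generic direct-form FIR in software, with hypothetical names and exact integer arithmetic; it does not describe the multiplexer-based restructuring of the HW data paths.

```python
def fir_filter(samples, taps):
    """Direct-form FIR: y[n] = sum over k of taps[k] * x[n - k].

    Samples before the start of the sequence are treated as zero.
    """
    out = []
    for n in range(len(samples)):
        acc = 0
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * samples[n - k]  # one multiplier per tap
        out.append(acc)
    return out

# Usage: a unit impulse reproduces the 25 tap values at the output.
taps_25 = list(range(1, 26))
impulse = [1] + [0] * 29
response = fir_filter(impulse, taps_25)
```

Each output sample consumes one product from each of the 25 taps, which is why a unit with 25 multipliers maps naturally onto either a 5 × 5 window or a 25-tap filter.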
The ECG classification was checked by using a modified CNN inference model with the 2D convolution performed in our two-phase fixed-point operations. Moreover, the experimental CNN operations (Table 9) were performed on the SoC-FPGA based on our SW–HW co-design approach to obtain the inference results. For the bit-width setting (i.e., L′) of the two phases, the SW side prepared the FWBM operands appropriately padded with ZP bits and set the BWATEC control for each round (i.e., layer 1 or 2) of CNN HW acceleration. The inference outcomes generated via the SoC-FPGA were further compared with the results generated by the aforementioned fixed-point CNN inference model for verification. After inference checking, the confusion matrix and performance of our ECG classification experiment are listed in Table 11. The performance results reported in Table 11 were obtained from the accuracy (Acc.), sensitivity (Sen.), and specificity (Spc.) statistical metrics extracted from the confusion matrix [42,44]. The terms TP, TN, FP, and FN denote true positive as “abnormal” (arrhythmia), true negative as “normal”, false positive as “abnormal”, and false negative as “normal” in the binary classification, respectively. The associated formulas are defined as follows:
$$\mathrm{Acc.} = \frac{TP + TN}{TP + TN + FP + FN}, \qquad \mathrm{Sen.} = \frac{TP}{TP + FN}, \qquad \mathrm{Spc.} = \frac{TN}{TN + FP}$$
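As a quick check of these definitions, the following sketch evaluates the three metrics from the confusion-matrix counts reported in Table 11. The function name `binary_metrics` is hypothetical; the counts are copied from the table.

```python
def binary_metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) for a binary confusion matrix."""
    acc = (tp + tn) / (tp + tn + fp + fn)
    sen = tp / (tp + fn)   # fraction of abnormal beats that were detected
    spc = tn / (tn + fp)   # fraction of normal beats that were kept as normal
    return acc, sen, spc

# Counts from Table 11 ("abnormal" is the positive class).
acc, sen, spc = binary_metrics(tp=27577, fn=705, fp=2266, tn=35316)
print(f"Acc. = {100 * acc:.1f}%, Sen. = {100 * sen:.1f}%, Spc. = {100 * spc:.1f}%")
```

Rounded to one decimal place, the computed values agree with the 95.5%, 97.5%, and 94.0% entries in Table 11.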

6. Conclusions

In this paper, we presented a BWATEC scheme capable of providing an adjusted TEC function adaptive to various L′-bit input patterns of an L-bit FWBM, in which L′ ≤ L. Using different combinations of hybrid deterministic/probabilistic values associated with the RH and RD regions, the proposed BWATEC scheme can generate a tailored high-accuracy TEC bias for an L-bit FWBM, depending on the setting of L′ (L and L′ are scalable). An FWBM enabling the proposed BWATEC scheme can be realized by using a reconfigurable bias circuit in a P.P. array with design scalability.
Taking a 16-bit FWBM as an example, we found that the approach using our BWATEC scheme exhibited design efficiency and different degrees of ADEP reduction for operations with 14-bit to 8-bit inputs, as compared to FWBM designs that used state-of-the-art TEC methods.
Moreover, the resultant 16-bit FWBM with BWATEC was verified by using the Xilinx Zynq-7000 SoC-FPGA based on the SW–HW co-design approach. The SoC-FPGA-based verification demonstrated the experimental CNN model for ECG classification.

Author Contributions

Conceptualization and methodology, S.-N.T.; review and editing, S.-N.T.; project administration, S.-N.T.; software and validation (experiment), J.-C.L., C.-K.C., P.-T.K. and Y.-S.C. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported in part by the Ministry of Science and Technology of Taiwan, under project MOST 110-2221-E-033-039.

Acknowledgments

The authors thank Yu-Shin Han and Jih-Hsiang Yeh for their work on C-language simulation and Verilog modeling support. We also thank Hong-Yu Ke and Jia-Nan Zhong for their work on the CNN model development for ECG classification.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Parhi, K.K. VLSI Digital Signal Processing Systems: Design and Implementation, 1st ed.; Wiley: New York, NY, USA, 1999.
  2. Lee, H.Y.; Park, I.C. Balanced Binary-Tree Decomposition for Area-Efficient Pipelined FFT Processing. IEEE Trans. Circuits Syst. I Reg. Pap. 2007, 54, 889–900.
  3. Chen, H.Y.; Lin, J.N.; Hu, H.S.; Jou, S.J. STBC-OFDM Downlink Baseband Receiver for Mobile WMAN. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2013, 21, 43–54.
  4. Tang, S.-N.; Han, Y.-S. A High-Accuracy Hardware-Efficient Multiply–Accumulate (MAC) Unit Based on Dual-Mode Truncation Error Compensation for CNNs. IEEE Access 2020, 8, 214716–214731.
  5. Van, L.-D.; Yang, C.-C. Generalized Low-Error Area-Efficient Fixed-Width Multipliers. IEEE Trans. Circuits Syst. I Reg. Pap. 2005, 52, 1608–1619.
  6. Tu, J.-H.; Van, L.-D. Power-efficient pipelined reconfigurable fixed-width Baugh-Wooley multipliers. IEEE Trans. Comput. 2009, 58, 1346–1355.
  7. Chang, C.-H.; Satzoda, R.K. A Low Error and High Performance Multiplexer-Based Truncated Multiplier. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2010, 18, 1767–1771.
  8. Petra, N.; Caro, D.D.; Garofalo, V.; Napoli, E.; Strollo, A.G.M. Design of Fixed-Width Multipliers with Linear Compensation Function. IEEE Trans. Circuits Syst. I Reg. Pap. 2011, 58, 947–960.
  9. Jou, S.-J.; Tsai, M.-H.; Tsao, Y.-L. Low-error reduced-width Booth multipliers for DSP applications. IEEE Trans. Circuits Syst. I Fundam. Theory Appl. 2003, 50, 1470–1474.
  10. Chen, Y.-H.; Chang, T.-Y.; Jou, R.-Y. A statistical error-compensated Booth multipliers and its DCT applications. In Proceedings of the TENCON IEEE Region 10 Conference, Fukuoka, Japan, 21–24 November 2010; pp. 1146–1149.
  11. Song, M.A.; Van, L.D.; Kuo, S.Y. Adaptive low-error fixed-width Booth multipliers. IEICE Trans. Fundam. Electron. Commun. Comput. Sci. 2007, 90, 1180–1187.
  12. Wang, J.-P.; Kuang, S.-R.; Liang, S.-C. High-Accuracy Fixed-Width Modified Booth Multipliers for Lossy Applications. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2011, 19, 52–60.
  13. Kuang, S.-R.; Wang, J.-P.; Guo, C.-Y. Modified Booth Multipliers with a Regular Partial Product Array. IEEE Trans. Circuits Syst. II Exp. Briefs 2009, 56, 404–408.
  14. Cho, K.-J.; Lee, K.-C.; Chung, J.-G.; Parhi, K.K. Design of low-error fixed-width modified Booth multiplier. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2004, 12, 522–531.
  15. Juang, T.-B.; Hsiao, S.-F. Low-error carry-free fixed-width multipliers with low-cost compensation circuits. IEEE Trans. Circuits Syst. II Exp. Briefs 2005, 52, 299–303.
  16. Li, C.-Y.; Chen, Y.-H.; Chang, T.-Y.; Chen, J.-N. A Probabilistic Estimation Bias Circuit for Fixed-Width Booth Multiplier and Its DCT Applications. IEEE Trans. Circuits Syst. II Exp. Briefs 2011, 58, 215–219.
  17. Chen, Y.-H.; Li, C.-Y.; Chang, T.-Y. Area-Effective and Power-Efficient Fixed-Width Booth Multipliers Using Generalized Probabilistic Estimation Bias. IEEE J. Emerg. Sel. Top. Circuits Syst. 2011, 1, 277–288.
  18. Chen, Y.-H.; Chang, T.-Y. A High-Accuracy Adaptive Conditional Probability Estimator for Fixed-Width Booth Multipliers. IEEE Trans. Circuits Syst. I Reg. Pap. 2012, 59, 594–603.
  19. Chen, Y.-H. An Accuracy-Adjustment Fixed-Width Booth Multiplier Based on Multilevel Conditional Probability. IEEE Trans. Very Large Scale Integr. (VLSI) Syst. 2015, 23, 203–207.
  20. He, W.-Q.; Chen, Y.-H.; Jou, S.-J. High-Accuracy Fixed-Width Booth Multipliers Based on Probability and Simulation. IEEE Trans. Circuits Syst. I Reg. Pap. 2015, 62, 2052–2061.
  21. Chen, Y.-H. Improvement of Accuracy of Fixed-Width Booth Multipliers Using Data Scaling Technology. IEEE Trans. Circuits Syst. II Exp. Briefs 2021, 68, 1018–1022.
  22. Zhang, Z.; He, Y. A Low-Error Energy-Efficient Fixed-Width Booth Multiplier with Sign-Digit-Based Conditional Probability Estimation. IEEE Trans. Circuits Syst. II Express Briefs 2018, 65, 236–240.
  23. Oklobdzija, V.G.; Villeger, D.; Liu, S.S. A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach. IEEE Trans. Comput. 1996, 45, 294–306.
  24. He, Y.; Chang, C.-H. A New Redundant Binary Booth Encoding for Fast 2n-Bit Multiplier Design. IEEE Trans. Circuits Syst. I Reg. Pap. 2009, 56, 1192–1201.
  25. Gong, L.; Wang, C.; Li, X.; Chen, H.; Zhou, X. MALOC: A Fully Pipelined FPGA Accelerator for Convolutional Neural Networks with All Layers Mapped on Chip. IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 2018, 37, 2601–2612.
  26. Zhang, L.; Li, B.; Liu, Y.; Zhao, X.; Wang, Y.; Wu, J. FPGA Acceleration of CNNs-Based Malware Traffic Classification. Electronics 2020, 9, 1631.
  27. Moons, B.; Verhelst, M. An Energy-Efficient Precision-Scalable ConvNet Processor in 40-nm CMOS. IEEE J. Solid-State Circuits 2017, 52, 903–914.
  28. Camus, V.; Mei, L.; Enz, C.; Verhelst, M. Review and Benchmarking of Precision-Scalable Multiply-Accumulate Unit Architectures for Embedded Neural-Network Processing. IEEE J. Emerg. Sel. Top. Circuits Syst. 2019, 9, 697–711.
  29. Chen, Q.; Fu, Y.; Song, W.; Cheng, K.; Lu, Z.; Zhang, C.; Li, L. An Efficient Streaming Accelerator for Low Bit-Width Convolutional Neural Networks. Electronics 2019, 8, 371.
  30. Chen, Y.; Krishna, T.; Emer, J.S.; Sze, V. Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks. IEEE J. Solid-State Circuits 2017, 52, 127–138.
  31. Du, L.; Du, Y.; Li, Y.; Su, J.; Kuan, Y.-C.; Liu, C.-C.; Chang, M.-C.F. A Reconfigurable Streaming Deep Convolutional Neural Network Accelerator for Internet of Things. IEEE Trans. Circuits Syst. I Regul. Pap. 2018, 65, 198–208.
  32. Jo, J.; Cha, S.; Rho, D.; Park, I. DSIP: A Scalable Inference Accelerator for Convolutional Neural Networks. IEEE J. Solid-State Circuits 2018, 53, 605–618.
  33. Zhou, S.; Wu, Y.; Ni, Z.; Zhou, X.; Wen, H.; Zou, Y. DoReFa-Net: Training Low Bitwidth Convolutional Neural Networks with Low Bitwidth Gradients. arXiv 2016, arXiv:1606.06160. Available online: https://arxiv.org/abs/1606.06160v2 (accessed on 11 October 2021).
  34. Park, E.; Ahn, J.; Yoo, S. Weighted-Entropy-Based Quantization for Deep Neural Networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Honolulu, HI, USA, 21–26 July 2017.
  35. Shin, D.; Lee, J.; Lee, J.; Yoo, H.J. DNPU: An 8.1 TOPS/W reconfigurable CNN-RNN processor for general-purpose deep neural networks. In Proceedings of the IEEE Int. Solid-State Circuits Conference (ISSCC), San Francisco, CA, USA, 5–9 February 2017; pp. 240–241.
  36. Garofalo, A.; Tagliavini, G.; Conti, F.; Rossi, D.; Benini, L. XpulpNN: Accelerating Quantized Neural Networks on RISC-V Processors Through ISA Extensions. In Proceedings of the Design, Automation & Test in Europe Conference & Exhibition (DATE), Grenoble, France, 9–13 March 2020; pp. 186–191.
  37. Lee, J.; Kim, C.; Kang, S.; Shin, D.; Kim, S.; Yoo, H. UNPU: An Energy-Efficient Deep Neural Network Accelerator with Fully Variable Weight Bit Precision. IEEE J. Solid-State Circuits 2019, 54, 173–185.
  38. Han, Y.; Virupakshappa, K.; Vitor Silva Pinto, E.; Oruklu, E. Hardware/Software Co-Design of a Traffic Sign Recognition System Using Zynq FPGAs. Electronics 2015, 4, 1062–1089.
  39. Guo, K.; Han, S.; Yao, S.; Wang, Y.; Xie, Y.; Yang, H. Software-Hardware Codesign for Efficient Neural Network Acceleration. IEEE Micro 2017, 37, 18–25.
  40. Moini, S.; Alizadeh, B.; Emad, M.; Ebrahimpour, R. A Resource-Limited Hardware Accelerator for Convolutional Neural Networks in Embedded Vision Applications. IEEE Trans. Circuits Syst. II Express Briefs 2017, 64, 1217–1221.
  41. Moody, G.B.; Mark, R.G. The impact of the MIT-BIH arrhythmia database. IEEE Eng. Med. Biol. Mag. 2001, 20, 45–50.
  42. Ju, T.J.; Nguyen, H.M.; Kang, D.; Kim, D.; Kim, D.; Kim, Y.-H. ECG arrhythmia classification using a 2-D convolutional neural network. arXiv 2018, arXiv:1804.06812. Available online: https://arxiv.org/abs/1804.06812 (accessed on 11 October 2021).
  43. Wu, Y.; Yang, F.; Liu, Y.; Zha, X.; Yuan, S. A Comparison of 1-D and 2-D Deep Convolutional Neural Networks in ECG Classification. arXiv 2018, arXiv:1810.07088. Available online: https://arxiv.org/abs/1810.07088 (accessed on 11 October 2021).
  44. Saadatnejad, S.; Oveisi, M.; Hashemi, M. LSTM-Based ECG Classification for Continuous Monitoring on Personal Wearable Devices. IEEE J. Biomed. Health Inform. 2020, 24, 515–523.
  45. Lecun, Y.; Bottou, L.; Bengio, Y.; Haffner, P. Gradient-based learning applied to document recognition. Proc. IEEE 1998, 86, 2278–2324.
Figure 1. Partial product (P.P.) array structure for a 16-bit full-width Booth multiplier with multiple 16-to-8-bit numerical ranges of input patterns.
Figure 2. Schematic for the TPminor contents in a P.P. array of a 16-bit Booth multiplier for various L′-bit input patterns: (a) L′ = 14, (b) L′ = 12, and (c) L′ = 10.
Figure 3. Schematic for the TPminor contents of a 16-bit Booth multiplier for other L′-bit inputs: (a) L′ = 16 and (b) L′ = 8.
Figure 4. Schematic of truncation error compensation (TEC) operations that use the proposed bit-width adaptive TEC (BWATEC) scheme for various L′-bit input patterns of a 16-bit fixed-width Booth multiplier (FWBM).
Figure 5. Schematic for the RZ/RH/RD distribution of TPminor rows of a general L-bit Booth multiplier for various L′-bit inputs: (a) L = 2n, and n is even; (b) L = 2n, and n is odd.
Figure 6. Schematic of BWATEC operations for various L′-bit input patterns of a general L-bit FWBM: (a) L = 2n, and n is even; (b) L = 2n, and n is odd.
Figure 7. Hardware architecture of 16-bit FWBM that uses the proposed BWATEC scheme.
Figure 8. Hardware configuration for BWATEC biasing of a general L-bit FWBM: (a) L = 2n, and n is even; (b) L = 2n, and n is odd.
Figure 9. Normalized ADEP values of a 16-bit FWBM for various TEC schemes and L′.
Figure 10. Normalized ADEP values of a 14-bit FWBM for various TEC schemes and L′.
Figure 11. Schematic of the setup of our design implementation that uses a SoC-FPGA approach.
Table 1. Mapping of abbreviations and acronym words.
| Abbreviation | Meaning |
|---|---|
| FWBM | Fixed-width Booth multiplier |
| TEC | Truncation error compensation |
| BWATEC | Bit-width adaptive truncation error compensation |
| P.P. | Partial products |
| MP/TP | Main part/truncation part |
| MSC | Most significant column |
| CNN | Convolutional neural network |
| ECG | Electrocardiogram |
| HW/SW | Hardware/software |
| SoC-FPGA | System-on-chip field-programmable gate array |
| ZP | Zero-padding |
| RZ/RH/RD | Zero region/hybrid region/deterministic-only region |
| ADEP | Area-delay-error product |
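Table 2. Mapping results for the Booth encoder and partial products.
| $(b_{2j+1}\ b_{2j}\ b_{2j-1})$ | $d_j$ | $p_{L,j}\ \ p_{L-1,j}\ \ p_{L-2,j}\ \cdots\ p_{2,j}\ \ p_{1,j}\ \ p_{0,j}$ | $n_j$ |
|---|---|---|---|
| (0 0 0)/(1 1 1) | 0 | $0\ \ 0\ \ 0\ \cdots\ 0\ \ 0\ \ 0$ | 0 |
| (0 0 1)/(0 1 0) | 1 | $a_{L-1}\ \ a_{L-1}\ \ a_{L-2}\ \cdots\ a_{2}\ \ a_{1}\ \ a_{0}$ | 0 |
| (1 0 1)/(1 1 0) | −1 | $\overline{a_{L-1}}\ \ \overline{a_{L-1}}\ \ \overline{a_{L-2}}\ \cdots\ \overline{a_{2}}\ \ \overline{a_{1}}\ \ \overline{a_{0}}$ | 1 |
| (0 1 1) | 2 | $a_{L-1}\ \ a_{L-2}\ \ a_{L-3}\ \cdots\ a_{1}\ \ a_{0}\ \ 0$ | 0 |
| (1 0 0) | −2 | $\overline{a_{L-1}}\ \ \overline{a_{L-2}}\ \ \overline{a_{L-3}}\ \cdots\ \overline{a_{1}}\ \ \overline{a_{0}}\ \ 1$ | 1 |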
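The encoder mapping in Table 2 corresponds to the standard radix-4 Booth recoding, in which each digit is obtained as d_j = −2·b_{2j+1} + b_{2j} + b_{2j−1} with b_{−1} = 0. The following sketch, with hypothetical function names, derives the digits of a signed multiplier and checks that they reconstruct the original value; it covers only the digit recoding, not the generation of the P.P. rows.

```python
def booth_radix4_digits(b, width=16):
    """Return the radix-4 Booth digits d_j (least significant first).

    `b` is a signed integer that fits in `width` bits (width must be even).
    Each digit is in {-2, -1, 0, 1, 2}.
    """
    assert width % 2 == 0
    assert -(1 << (width - 1)) <= b < (1 << (width - 1))
    pattern = b & ((1 << width) - 1)  # two's-complement bit pattern

    def bit(i):
        # b_{-1} is defined as 0 for the first overlapping triple.
        return 0 if i < 0 else (pattern >> i) & 1

    return [
        -2 * bit(2 * j + 1) + bit(2 * j) + bit(2 * j - 1)
        for j in range(width // 2)
    ]

def booth_value(digits):
    """Reconstruct the integer from its radix-4 digits: sum of d_j * 4^j."""
    return sum(d * 4 ** j for j, d in enumerate(digits))

# Usage: 6 = (0110)_2 gives d_0 = -2 and d_1 = 2, i.e. -2 + 2*4 = 6.
digits_of_6 = booth_radix4_digits(6)
```

A negative digit corresponds to the rows of Table 2 with n_j = 1, where the P.P. bits are complemented and the extra 1 completes the two's-complement negation.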
Table 3. Values of $(n_j, s_j, e_j)$ and $E[TP_{minor,j}(H)]$ according to $d_j$.
| $d_j$ | P.P. values | $E[TP_{minor,j}(H)]$ values |
|---|---|---|
| $d_j = 0$ | (all P.P. = 0) | zero |
| $d_j = 1$ | ($n_j = 0$; $s_j = 0$; $e_j = 1/2$) | $2^{-2} - 2^{-16+2j+n_s-1}$ |
| $d_j = -1$ | ($n_j = 1$; $s_j = 1$; $e_j = 1/2$) | $2^{-2} + 2^{-16+2j+n_s-1}$ |
| $d_j = 2$ | ($n_j = 0$; $s_j = 0$; $e_j = 0$) | $2^{-2} - 2^{-16+2j+n_s}$ |
| $d_j = -2$ | ($n_j = 1$; $s_j = 1$; $e_j = 1$) | $2^{-2} + 2^{-16+2j+n_s}$ |
Table 4. HW resources usage of a general L-bit FWBM for various L′-bit inputs using the BWATEC scheme (Q = L/2).
MP/TPmajor HW ResourcesBWATEC Biasing HW Resources
#FA/HA #FA/HA#MUX1#MUX2#MUX3
Q 2 × 4 + L + Q 1 L = 2n (even n) Q / 2 Q / 2 Q / 2 1 1 2 × L 2 L
L = 2n (odd n) Q / 2 Q / 2 Q / 2 1 1 2 × L 2 L + 1
Table 5. Comparison of accuracy and hardware performances of a 16-bit fixed-width Booth multiplier (FWBM) for PT, DT, and various truncation error compensation (TEC) schemes.
| | PT | BSCP | PACS | GPEB | SCG | Ours | DT |
|---|---|---|---|---|---|---|---|
| Accuracy performance (L′ = 16): SNR (dB) | 85.56 | 81.91 | 81.83 | 79.34 | 81.84 | 81.87 | 64.84 |
| Hardware performance: Area (µm2) | 2294 | 1330 | 1301 | 1249 | 1325 | 1312 | 1098 |
| Hardware performance: Delay (ns) | 3.62 | 3.25 | 3.24 | 3.20 | 3.24 | 3.27 | 2.96 |
| Hardware performance: Power (mW) | 1075 | 615.1 | 585.2 | 562.3 | 612.9 | 591.0 | 486.4 |
Table 6. Comparison of ADEP results of a 16-bit FWBM for various TEC schemes.
| | BSCP | PACS | GPEB | SCG | Ours |
|---|---|---|---|---|---|
| ADEP (%) | 59.80% | 59.35% | 100% | 60.32% | 59.71% |
Table 7. SNR results (dB) of a 16-bit FWBM for various TEC schemes and L′-bit inputs.
| | BSCP | PACS | GPEB | SCG | Ours |
|---|---|---|---|---|---|
| L′ = 14 | 80.62 | 81.06 | 78.10 | 81.04 | 81.69 |
| L′ = 12 | 80.77 | 81.25 | 76.67 | 80.96 | 82.12 |
| L′ = 10 | 81.01 | 81.57 | 75.40 | 81.28 | 83.18 |
| L′ = 8 | 80.14 | 81.36 | 74.65 | 80.38 | Inf. |
Table 8. HW resources and performances for our FPGA implementation.
LUT Util.LUTRAM
Util.
FF Util.BRAM Util.GOPs
678662257212.52.55
Table 9. Experimental CNN model architecture and SNR performances among layers.
| Layers | Input Feature Map Size | Input Channel No. | Kernel Size | Output SNR (dB) |
|---|---|---|---|---|
| Input | 128 × 128 | 1 | | |
| 1st Convolution | 128 × 128 | 1 | 5 × 5 | 34.37 |
| 1st Max. Pooling | 128 × 128 | 4 | 2 × 2 | 35.14 |
| 2nd Convolution | 64 × 64 | 4 | 5 × 5 | 29.48 |
| 2nd Max. Pooling | 64 × 64 | 8 | 2 × 2 | 29.86 |
| FC | 32 × 32 | 8 | | |
Table 10. Comparison of lookup-table (LUT) resource usage for FPGA implementation of a 2D convolution unit for various TEC schemes.
| | BSCP | PACS | GPEB | SCG | Ours |
|---|---|---|---|---|---|
| LUT Util. | 5225 | 4907 | 4459 | 5206 | 4957 |
Table 11. Confusion matrix and performances of the experimental CNN for ECG classification.
| Label \ Prediction | Abnormal | Normal | Metrics |
|---|---|---|---|
| Abnormal | 27,577 (TP) | 705 (FN) | 97.5% (Sen.) |
| Normal | 2266 (FP) | 35,316 (TN) | 94.0% (Spc.) |
| Metrics | | | 95.5% (Acc.) |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Tang, S.-N.; Liao, J.-C.; Chiu, C.-K.; Ku, P.-T.; Chen, Y.-S. An Accuracy-Improved Fixed-Width Booth Multiplier Enabling Bit-Width Adaptive Truncation Error Compensation. Electronics 2021, 10, 2511. https://doi.org/10.3390/electronics10202511
