An Optimized Method for Nonlinear Function Approximation Based on Multiplierless Piecewise Linear Approximation

Yu, Hongjiang; Yuan, Guoshun; Kong, Dewei; Lei, Lei; He, Yuefeng

doi:10.3390/app122010616


by Hongjiang Yu ^1,2, Guoshun Yuan ^1,*, Dewei Kong ^1,2, Lei Lei ^1,2 and Yuefeng He ^1,2

^1 Institute of Microelectronics of the Chinese Academy of Sciences, Beijing 100029, China

^2 University of Chinese Academy of Sciences, Beijing 100049, China

^* Author to whom correspondence should be addressed.

Appl. Sci. 2022, 12(20), 10616; https://doi.org/10.3390/app122010616

Submission received: 17 September 2022 / Revised: 14 October 2022 / Accepted: 17 October 2022 / Published: 20 October 2022

(This article belongs to the Section Electrical, Electronics and Communications Engineering)



Featured Application

Data analysis and processing system, Neural network system, Accelerator and Coprocessor.

Abstract

In this paper, we propose an optimized method for nonlinear function approximation based on multiplierless piecewise linear approximation computation (ML-PLAC), which we call OML-PLAC. OML-PLAC finds the minimum number of segments with the predefined fractional bit width of input/output, maximum number of shift-and-add operations, user-defined widths of intermediate data, and maximum absolute error (MAE). In addition, OML-PLAC minimizes the actual MAE as much as possible by iterating. As a result, under the condition of satisfying the maximum number of segments, the MAE can be minimized. Tree-cascaded 2-input and 3-input multiplexers are used to replace multi-input multiplexers in hardware architecture as well, reducing the depth of the critical path. The optimized method is applied to logarithmic, antilogarithmic, hyperbolic tangent, sigmoid and softsign functions. The results of the implementation prove that OML-PLAC has better performance than the current state-of-the-art method.

Keywords:

nonlinear function approximation; maximum absolute error (MAE); iterating; tree-cascaded multiplexer; logarithmic function; antilogarithmic function; hyperbolic tangent function; sigmoid function; softsign function

1. Introduction

Nonlinear functions are employed widely in data statistical analysis and pattern recognition [1]. In hardware circuits, the implementation of nonlinear functions often requires a lot of resources, which leads to high cost and low performance. In order to solve this problem, many efficient approximation methods have been proposed.

In general, there are two types of approximation methods. The first type is based on an iterative approach, such as the Newton iteration method [2], the coordinate rotation digital computer (CORDIC) [3,4] and the digit-recurrence algorithm [5]. However, implementing a particular function iteratively incurs a long delay, which seriously affects the performance of real-time systems such as recurrent neural networks (RNNs) [6]. The second type is based on look-up tables (LUTs) and includes the direct and indirect table lookup methods. The direct table lookup method obtains the output directly from the input, and the amount of storage explodes as the required accuracy increases. The indirect table lookup method stores a set of coefficients; for each input, some of these coefficients are selected to participate in the subsequent operations that produce the output.

The polynomial approximation is a common method of indirect table lookup, and this method is applied to Taylor series approximation [7] and Chebyshev polynomial approximation [8]. Depending on cascaded multiplication and addition (MAC) operations to get the result, this approach incurs high hardware overhead and delay. Another type of indirect look-up table approximation uses the special properties of functions and the Quine–McCluskey method to approximate the sigmoid and hyperbolic tangent functions [9]. It has good performance, but it is not suitable for high-precision situations. Different ways are used to optimize the implementation of logarithm function based on multipliers [10,11,12], but there is still a lot of room for improvement. Then the piecewise linear (PWL) method is proposed to reduce the computation complexity because only one MAC is required. PWL method uses (k × x + b) to approximate each segment, and slope k, y-intercept b, and endpoints for each segment are stored in memory. As the approximate accuracy improves the number of segments will increase, which means more memory resources are required. At the same time, it will also bring about the deterioration of the timing. Therefore, many studies are devoted to reducing the number of segments and simplifying the operation to improve performance.

The PWL method has developed from uniform segmentation to non-uniform segmentation and finally to error-flattening segmentation. The uniform segmentation method divides the input into segments of equal length. This method is used in [13] to approximate the logarithmic function by minimizing the maximum relative approximation error (MRE), while computing accuracy is improved by a piecewise linear interpolation and an LUT-based correction stage in [14]. In [15], the optimal coefficients of the approximation function are obtained by Linear-Lagrange interpolation to obtain the best accuracy. Sigmoid and hyperbolic tangent functions are approximated by using the properties of the functions at special points in [16,17]. The biggest problem with uniform segmentation is that each segment has a different error, resulting in more segments. To overcome this shortcoming, the non-uniform segmentation method is proposed. In [18,19], 15 and 24 non-uniform segments are used, respectively, to compute the logarithmic function; both increase the number of segments around input value 0 and achieve relatively high accuracy with an acceptable number of segments. In [20], a non-uniform-segment-based PWL method is used to approximate transcendental functions, and to enhance the accuracy, the method in [21] segments according to specific coefficients. A relative error equal distribution algorithm that divides the segments under the premise of MRE is proposed in [22], and finally achieves over 70% reduction in average relative errors compared to other methods. However, maximum absolute error (MAE) is a better metric for fixed-point implementation when considering the hardware efficiency. The error-flattened method is proposed for the first time in [23], minimizing MAE to approximate the logarithmic function by dividing the output range into equal regions. Additionally, the method in [23] optimizes the hardware implementation of the logarithmic function and achieves the theoretically best segmentation performance. The drawback of the method in [23] is that it is not universal and applies only to the logarithmic function. In [24], another logarithmic converter with a novel error-aware segmentation procedure is proposed, which approximates the logarithmic function by unity slope straight lines and maximizes the length of each segment under the MAE requirement.

The PWL methods mentioned above are all for specific functions. A universal PWL method for all transcendental functions is proposed in [25] for the first time, which uses software to select the minimum number of segments under the constraint of a controllable and predefined MAE. However, the method in [25] cannot achieve the best segmentation performance reached in [23], because of overlapping segment endpoints and inconsistencies between hardware and software. In addition, the hardware architecture in [25] still has redundant logic, resulting in unnecessary hardware overhead. A PWL approximation computation (PLAC) method is proposed in [26], which optimizes [25] and mitigates the problems mentioned above. Moreover, a bisection method is proposed in PLAC to reduce the segmentation time. In [27], PLAC is used to perform Nth root computations on floating-point numbers and performs much better than the methods based on CORDIC. However, the optimal intercept value should change after quantization, and PLAC does not address this. Based on PLAC, the method in [28] solves this remaining problem by inserting the quantization operations into the segmentor (SQ) for the logarithmic computation of floating-point numbers. Using the errors obtained after slope quantization, the SQ method optimizes the value of the y-intercept to obtain the minimum MAE. In [29], SQ is extended to other nonlinear functions and the multipliers are replaced with shift-and-add (SAA) parts, which is called multiplierless SQ (ML-SQ). In ML-SQ, the substitution process is simulated in software, taking into account the fractional bit widths of the intermediate data and the slope. The whole process is called multiplierless PLAC (ML-PLAC). When the fractional bit width of the slope is small, the ML-PLAC method can significantly reduce the delay and area of the hardware implementation.

The comparison of the different PWL methods is shown in Table 1 and it can be concluded that among the piecewise approximate methods of nonlinear functions, the piecewise method of error flattening has better universality and better performance on hardware circuits.

In this paper, we make some improvements based on the ML-PLAC method, which we call optimized ML-PLAC (OML-PLAC). Like the ML-PLAC method, OML-PLAC adopts the SQ segmentation method and a multiplierless hardware structure. The biggest difference is the definition of kw. In ML-PLAC, kw represents the fractional bit width of the slope, while in this paper it represents the number of ones in the slope value. To put it another way, kw represents the maximum number of multiplexers for the fractional bits in ML-PLAC and the maximum number of multiplexers overall in OML-PLAC, and the efficiency of the multiplexers differs between the two. When kw is the fractional bit width of the slope, the effective width of different segments may differ, so some inputs of the multiplexers are often 0, which wastes multiplexer resources. When kw is the number of ones in the slope value, however, the multiplexers are at full load (there are some special cases, which will be mentioned later).

For example, when a function is divided into four segments to approximate, each segment is a linear function of one variable, and the actual slope values are 0.10001, 0.00011, 0.11001, and 0.01001, respectively. We suppose kw = 2. As for ML-PLAC, the four slope values involved in the calculation are, respectively, 0.1, 0, 0.11, 0.01, while as for OML-PLAC the four values are 0.10001, 0.00011, 0.11, 0.01001. Where the fractional bit widths of the input and intermediate data are named iw and qw, assume that iw = qw = n and n > 5. The SAA parts of ML-PLAC and OML-PLAC are shown in Figure 1.

As can be seen from Figure 1, some inputs of multiplexers in ML-PLAC are 0, which does not happen in OML-PLAC. To some extent, 0 inputs cause the waste of the multiplexers. In fact, slope approximation in OML-PLAC is a better approximation of the actual slope value and has less error. Introducing this mechanism into quantization and segmentation can reduce the number of segments, which is a benefit for area and delay of hardware realization. Furthermore, decreasing the value of kw does not increase the number of segments in some cases compared to ML-PLAC, resulting in less shift-and-adder resources. Although the new segmentation method brings a slight increase in the number of multiplexer inputs in the SAA parts sometimes, the overall performance improvement is significant.

The second improvement is the MAE adaptation. In previous advanced segmentation techniques [26,29], the MAE of the last segment tends to be much smaller than the predefined error. This makes it possible to readjust segments and improve errors. We do this in the following segmentation steps:

1. Initialize the predefined maximum absolute error (MAE_def), the fractional bit widths of the input (iw) and intermediate data (qw), and the number of ones in the slope value (kw);
2. Calculate the segments, and record the original number of segments, the actual MAE and other information about each segment;
3. Take the actual MAE as the maximum error limit (EC_max), take 2^{−max(iw,qw)} as the minimum error limit (EC_min), and take the average of EC_max and EC_min as the new predefined MAE;
4. Recalculate the segments and record the number of segments;
5. If the number of segments is equal to the original number of segments, then take the new predefined MAE as EC_max and take the average of EC_max and EC_min as the new predefined MAE. Otherwise, take the new predefined MAE as EC_min and take the average of EC_max and EC_min as the new predefined MAE;
6. Repeat steps 4 and 5 for a predefined number of iterations (cycle_num), then record the final segments, slopes, y-intercepts and MAE.
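The MAE adaptation steps above can be sketched in Python as follows. This is a minimal illustration rather than the authors' MATLAB model: `segment` is a hypothetical toy greedy segmentor (linear scan, no quantization) standing in for the SQ segmentor, and `error_adaptation`, `min_err` and the tuple layout are names assumed here for illustration only.

```python
def segment(f, xs, mae_def):
    """Toy greedy segmentor (hypothetical stand-in for the SQ segmentor).

    Returns (segments, actual_mae); each segment is (sp, ep, k, b)."""
    segs, worst, sp = [], 0.0, 0
    while sp < len(xs) - 1:
        best = None
        for ep in range(sp + 1, len(xs)):
            k = (f(xs[ep]) - f(xs[sp])) / (xs[ep] - xs[sp])
            b = f(xs[ep]) - k * xs[ep]
            err = [f(x) - (k * x + b) for x in xs[sp:ep + 1]]
            shift = (max(err) + min(err)) / 2   # centre the error band
            mae = (max(err) - min(err)) / 2
            if mae > mae_def and ep > sp + 1:
                break                            # segment became too wide
            best = (sp, ep, k, b + shift, mae)
        segs.append(best[:4])
        worst = max(worst, best[4])
        sp = best[1]
    return segs, worst


def error_adaptation(f, xs, mae_def, min_err, cycle_num):
    """Bisect the error limit while keeping the original segment count."""
    segs, mae0 = segment(f, xs, mae_def)         # step 2
    n0 = len(segs)
    best_segs, best_mae = segs, mae0
    ec_max, ec_min = mae0, min_err               # step 3
    for _ in range(cycle_num):                   # step 6
        trial = (ec_max + ec_min) / 2
        trial_segs, trial_mae = segment(f, xs, trial)   # step 4
        if len(trial_segs) == n0:                # step 5
            ec_max = trial
            best_segs, best_mae = trial_segs, trial_mae
        else:
            ec_min = trial
    return best_segs, best_mae, n0, mae0
```

For instance, `error_adaptation(lambda x: 2.0 ** x, [i / 256 for i in range(257)], 1.29e-3, 2.0 ** -14, 7)` keeps the segment count of the first pass and returns an MAE no larger than the one obtained before adaptation.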

The third improvement concerns the hardware architecture. We adopt a tree-cascaded architecture to implement the multi-input multiplexer. Nodes in the tree network are 2-input and 3-input multiplexers. We try to balance the tree structure so that the delay of each input path is basically the same. This method helps reduce the delay of the critical path.

In addition to the improvements mentioned above, we point out the limitations of the method for finding the MAE in advanced segmentation methods. The minimizations of MAE are found in [25,26] by parallel shifting, which is not suitable for all nonlinear unary functions.

Take the sine function as an example, as shown in Figure 2.

Assume the segment is sin(x), where x∈[0, 2π]. If the shifting method of [25] or [26] is used, then the final approximate line is the x-axis, and the MAE is 1. However, it is easy to prove mathematically that Line L is the optimal approximate line. L1 and L2 are tangent lines to sin(x), and the tangent points are P1 and P2, respectively. L1 is parallel to L2, and L is the midline of L1 and L2. Line L passes through A. The MAE in this case is the distance between point P1 or P2 and line L, and it is obvious that the value is less than 1.

In fact, the above problem may occur when the line through the two endpoints intersects the function itself. This problem does not arise when the first derivative and the second derivative of the function are both greater than zero, which can be proved by the Cauchy mean value theorem and proof by contradiction. This issue is not the focus of this paper and will not be explored further. We offer just one solution here: traverse the lines through any two sampled points between the two endpoints, take each of them as the target line for parallel shifting, and finally select the line with the minimum MAE. The cost of this approach is an increase in the number of iterations.
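The sine example and the two-point traversal can be checked numerically with a short Python sketch. The function names and the 65-point sampling grid are assumptions made for illustration; the sketch only compares the endpoint line with the best line found by the traversal and does not reproduce Figure 2.

```python
import math


def shifted_line_mae(f, xs, k):
    """MAE of a line with slope k after the best parallel (vertical) shift."""
    err = [f(x) - k * x for x in xs]
    return (max(err) - min(err)) / 2


def best_two_point_line(f, xs):
    """Try the slope through every pair of samples; keep the smallest MAE."""
    best_k, best_mae = None, math.inf
    for i in range(len(xs)):
        for j in range(i + 1, len(xs)):
            k = (f(xs[j]) - f(xs[i])) / (xs[j] - xs[i])
            mae = shifted_line_mae(f, xs, k)
            if mae < best_mae:
                best_k, best_mae = k, mae
    return best_k, best_mae


# sin(x) sampled on [0, 2*pi]; the grid size is an arbitrary choice.
xs = [2 * math.pi * i / 64 for i in range(65)]
k_end = (math.sin(xs[-1]) - math.sin(xs[0])) / (xs[-1] - xs[0])
mae_end = shifted_line_mae(math.sin, xs, k_end)      # endpoint line only
best_k, best_mae = best_two_point_line(math.sin, xs)  # full traversal
```

On this grid the endpoint line is essentially the x-axis and its MAE is 1, whereas the traversal finds a line with negative slope and a smaller MAE, at the cost of a quadratic number of candidate lines.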

We use MATLAB to model the proposed improved segmentation and MAE adaptation method. Verilog HDL is then used to model the hardware architecture. To demonstrate the superiority of OML-PLAC over other methods, we synthesize the designs with TSMC 90-nm, TSMC 65-nm and TSMC 40-nm technologies. The synthesized results show that OML-PLAC performs better than ML-PLAC.

The contributions of this paper are summarized as follows:

By controlling the slope property of the piecewise approximation function, the multiplexers in the SAA part can be fully used. As a result, fewer segments are required under the same limitation of MAE. Shift-and-add operations can be reduced in some cases as well, and these contribute to less area and delay;
The proposed MAE adaptation method can reduce the actual MAE as much as possible on the premise of ensuring the predefined MAE and the maximum number of segments, and the percentage of the final MAE to the limit MAE is controllable;
The tree-cascaded method is used to implement multi-input multiplexers with the nodes of 2-input and 3-input multiplexers in hardware architecture, trying to flatten the delay from input to output. This method is beneficial to reduce the length of critical path;
The limitations of the method for finding the MAE in advanced segmentation methods are presented and a corresponding solution is given.

The rest of this paper is organized as follows: Section 2 introduces state-of-the-art research on the PWL so far. Section 3 introduces the proposed optimization method in detail, including the optimized segmentation method, the error adaptation and the hardware improvement. In Section 4, the proposed optimization method is applied to typical nonlinear functions, and the experimental results are compared and analyzed with those of the existing state-of-the-art methods and some other studies. Section 5 provides the conclusions of this paper.

2. State-of-the-Art PWL Method

In this section, we introduce the state-of-the-art PWL method proposed in [29] that mainly includes the basic principles of PWL, software-based segmentor and quantizer.

2.1. Basic Principles of PWL

The nonlinear function f(x) is divided into n segments with PWL method. The independent variables in the i^th section range from p_i to q_i, and the i^th segment of f(x) is approximated by:

h_{i} (x) = k_{i} \times x + b_{i}

(1)

The absolute approximation error (AE) is expressed by:

AE = | f (x) - h (x) |

(2)

MAE and AAE are the maximum and the average AE, defined, respectively, as:

MAE = \max (AE)

(3)

and

AAE = \frac{\sum AE}{N}

(4)

where N is the number of inputs that are sampled to participate in the computation.

Apart from AE, relative error distance (RED) is another standard and metric for approximate designs. RED is defined as:

RED = \frac{AE}{f (x)}

(5)

MRED and NMED represent the mean relative error distance and the normalized mean error distance, defined, respectively, as:

MRED = \frac{\sum RED}{N}

(6)

and

NMED = \frac{AAE}{\max (f (x))}

(7)
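As a minimal sketch of Equations (2) to (7), the metrics can be computed over a set of sampled inputs as follows. The function name `error_metrics` and the dictionary layout are assumptions for illustration, and the sketch assumes f(x) > 0 on the sampled range so that RED is well defined.

```python
def error_metrics(f, h, xs):
    """Compute MAE, AAE, MRED and NMED of approximation h against f on xs.

    Assumes f(x) > 0 for every sample so that RED = AE / f(x) is defined."""
    n = len(xs)
    ae = [abs(f(x) - h(x)) for x in xs]          # Equation (2)
    mae = max(ae)                                 # Equation (3)
    aae = sum(ae) / n                             # Equation (4)
    red = [a / f(x) for a, x in zip(ae, xs)]      # Equation (5)
    mred = sum(red) / n                           # Equation (6)
    nmed = aae / max(f(x) for x in xs)            # Equation (7)
    return {"MAE": mae, "AAE": aae, "MRED": mred, "NMED": nmed}


# Example: approximate 2**x on [0, 1) with the single line 1 + x.
samples = [i / 1024 for i in range(1024)]
metrics = error_metrics(lambda x: 2.0 ** x, lambda x: 1.0 + x, samples)
```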

2.2. ML-PLAC Segmentor and Quantizer

The segmentation method of ML-PLAC is similar to that of the method in [21,25], except that quantization is added to the process, which is called the SQ method [28]. The whole process of segmentation is shown in Figure 3. Compared to PLAC, ML-PLAC has more segments, but the computational burden is greatly reduced. Therefore, the performance of the hardware circuit is greatly improved and the area is smaller as well.

ML-PLAC searches for the minimum number of segments with the specified fractional bits of the input/output, intermediate data and slope, while ensuring that the error is less than predefined MAE (MAE_def). Using SQ method to replace the separated segmentation and quantization operations, the original function can be approximated better by piecewise linear function.

The segmentation steps of ML-PLAC based on software are briefly described below:

1. Predefine MAE, kw, iw and qw, and initialize i and j as 0 and 1, respectively, for the first segment;
2. Use the bisection method to find the segment width that meets the conditions. When the endpoint of a segment is found, the in-segment loop is complete. Then i is incremented by 1, and sp and ep are initialized as the start and end point indexes of the undivided zone to start a new segment. Before each new segment, the left and right pointers of the bisection window coincide with sp and ep, respectively;
3. Suppose that the independent variable of the i^th segment lies in [x(sp_i), x(ep_i)]; k_i and b_i are the slope and y-intercept of the linear function, calculated as:

k_{i} = \frac{f (x ({ep}_{i})) - f (x ({sp}_{i}))}{x ({ep}_{i}) - x ({sp}_{i})}

(8)

and

b_{i} = f (x ({ep}_{i})) - k_{i} \times x ({ep}_{i})

(9)

k_i is quantized as kq_i, which depends on kw and is expressed by:

{kq}_{i} = \frac{round (k_{i} \times 2^{kw})}{2^{kw}}

(10)

Then let mq represent the result of (x × kq_i). SAA operations are used to replace the multiplication, in which the data are truncated according to the fractional bit width of the intermediate data qw. The linear function in (1) then becomes:

h (x) = mq + b_{i}

(11)

So, the error is:

Err = f (x (sp : ep)) - h (x (sp : ep))

(12)

The intercept is corrected by Err to minimize the MAE, optimized to be:

b_{i}^{'} = b_{i} + \frac{\max (Err) + \min (Err)}{2}

(13)

Then b_i’ is quantized as bq_i, expressed by:

{bq}_{i} = \frac{round (b_{i}^{'} \times 2^{\max (iw, qw)})}{2^{\max (iw, qw)}}

(14)

Therefore, the optimized linear function and MAE are, respectively, calculated by:

{h (x)}^{'} = mq + {bq}_{i}

(15)

and

MAE = \max (| f (x ({sp}_{i} : {ep}_{i})) - {h (x)}^{'} |)

(16)

4. This step branches according to the value of MAE calculated by Equation (16) and the predefined MAE_def.

Condition a: If MAE meets the requirement of the predefined MAE_def, then confirm whether the width of the current segment reaches the width of the bisection window pointer (ep == rp or ep == rp − 1). If it is reached, the information of the segment, including the start pointer, end pointer, slope and y-intercept, is determined and stored. After that, go back to step 2 and start a new segment. Otherwise, the width of the current segment can be enlarged. Then the left pointer of the bisection window shifts to the end pointer and the end pointer shifts right to the middle of the new bisection window.

Condition b: If MAE does not meet the predefined MAE_def, the width of the segment is too large and needs to be reduced. Then the right pointer of the bisection window shifts to the end pointer and the end pointer shifts left to the middle of the new bisection window.

5. Repeat the in-segment and out-segment loops until all inputs are segmented.
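The per-segment computation in step 3, Equations (8) to (16), can be sketched in Python as below. This is a simplified model under stated assumptions, not the authors' implementation: `shift_and_add` is a hypothetical SAA model in which each shifted copy of x is truncated to qw fractional bits, the slope is assumed non-negative, and the function and variable names are chosen here for illustration.

```python
import math


def quantize(v, bits):
    """round(v * 2^bits) / 2^bits, cf. Equations (10) and (14); assumes v >= 0."""
    return math.floor(v * 2 ** bits + 0.5) / 2 ** bits


def shift_and_add(x, kq, kw, qw):
    """Hypothetical SAA model of x * kq for kq >= 0 with kw fractional bits.

    Each '1' bit of kq selects one shifted copy of x, truncated to qw bits."""
    n = int(round(kq * 2 ** kw))     # kq as an integer bit pattern
    total, bit = 0.0, 0
    while n:
        if n & 1:
            shift = kw - bit         # right-shift amount for this bit
            total += math.floor(x * 2.0 ** (-shift) * 2 ** qw) / 2 ** qw
        n >>= 1
        bit += 1
    return total


def fit_segment(f, xs, sp, ep, kw, iw, qw):
    """Return (kq, bq, mae) for the segment xs[sp..ep]."""
    k = (f(xs[ep]) - f(xs[sp])) / (xs[ep] - xs[sp])        # Equation (8)
    b = f(xs[ep]) - k * xs[ep]                              # Equation (9)
    kq = quantize(k, kw)                                    # Equation (10)
    seg = xs[sp:ep + 1]
    mq = [shift_and_add(x, kq, kw, qw) for x in seg]
    err = [f(x) - (m + b) for x, m in zip(seg, mq)]         # Equations (11), (12)
    b_opt = b + (max(err) + min(err)) / 2                   # Equation (13)
    bq = quantize(b_opt, max(iw, qw))                       # Equation (14)
    mae = max(abs(f(x) - (m + bq)) for x, m in zip(seg, mq))  # Equations (15), (16)
    return kq, bq, mae
```

The bisection loop of steps 2, 4 and 5 would call `fit_segment` repeatedly with different values of `ep` and compare the returned MAE with MAE_def.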

3. Proposed Method

In this section, we introduce and analyze the proposed OML-PLAC method, mainly including the optimized segmentation method, error adaptation and hardware improvement.

3.1. Optimized Segmentation Method

Based on ML-SQ, we propose an optimized segmentation method. We control the number of ones in the slopes of the segments instead of the fractional bit widths of the slopes. This method can make full use of the multiplexers in the SAA part. As a result, fewer segments are required and, in some cases, fewer SAA operations are needed. The difference in segmentation method between ML-PLAC and OML-PLAC is shown in green in Figure 3 and replaced by the pseudocode in Table 2.

In the segmentation process of OML-PLAC, the slope fractional bit width kw₁ is dynamic rather than fixed and depends on the slope value. Compared to ML-PLAC, restricting every slope to the same number of “1” bits makes the number of SAA operations more even. We illustrate with two different slopes: kp₁ = 110.1010110₂ and kp₂ = 0.1100101₂. If we take kw as 3, the two slopes truncated in ML-PLAC are kp_ml1 = 110.101₂ and kp_ml2 = 0.110₂, while in OML-PLAC they are kp_oml1 = 110.1₂ and kp_oml2 = 0.11001₂. In ML-PLAC, the operations with kp_ml1 and kp_ml2 generate 4 and 2 shifted data, respectively. As a result, three adders are required, even considering reuse in the hardware architecture. In OML-PLAC, however, both slope values produce the same number of shifted data (3) and require the same number of adders (2). The number of shift-and-add operations in OML-PLAC is not always strictly flat because of the rounding in the kp_i quantization (line 5_9 in Table 2). For example, when kp_i = 0.011101₂ and kw = 2, the truncated datum is 0.1₂ rather than 0.011₂. Although this scenario occurs from time to time, the overall flattening of OML-PLAC is still better than that of ML-PLAC. Moreover, the rounding operation brings the approximation closer to the actual value.

As we can see in Table 2, the new lines from 5_1 to 5_15 are used to replace the original lines 2 and 5.1 to 5.6 in OML-PLAC. Lines 5_1 to 5_8 of OML-PLAC are used to get the new fractional bit width of the slope kw₁ for each segment according to the number of “1” in the slope kw. It should be noted that kw₁ can be negative, in which case the magnitude represents the position of the least significant “1” in the integer part of the slope. Suppose k_i = 110.01₂ and kw = 2; then kw₁ = −1 and kq_i = 110₂. “−1” means that the least significant bit retained after truncation is bit [1] in the integer part of k_i.
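The two slope quantization rules can be contrasted with a small Python sketch using exact binary fractions. This is not the pseudocode of Table 2: `quantize_by_ones` is an assumed reading of the rule, in which kw₁ is taken as the position of the kw-th most significant “1” and the slope is then rounded at that position. It reproduces the worked examples given in the text; the function names and the `from_binary` helper are hypothetical.

```python
from fractions import Fraction
import math


def from_binary(text):
    """Parse a binary string such as '110.01' into an exact Fraction."""
    whole, _, frac = text.partition(".")
    value = Fraction(int(whole, 2)) if whole else Fraction(0)
    if frac:
        value += Fraction(int(frac, 2), 2 ** len(frac))
    return value


def round_half_up(v):
    """Round a non-negative Fraction to the nearest integer, halves up."""
    return math.floor(v + Fraction(1, 2))


def quantize_fixed_width(k, kw):
    """ML-PLAC style, Equation (10): keep kw fractional bits."""
    scale = Fraction(2) ** kw
    return round_half_up(k * scale) / scale


def quantize_by_ones(k, kw):
    """OML-PLAC-style sketch (assumed rule): returns (kq, kw1).

    kw1 is None when k has fewer than kw ones. Assumes k > 0 and kw >= 1."""
    p = 0                                   # exponent of the leading '1'
    while Fraction(2) ** p > k:
        p -= 1
    while Fraction(2) ** (p + 1) <= k:
        p += 1
    remaining, ones = k, 0
    while remaining > 0:
        weight = Fraction(2) ** p
        if remaining >= weight:
            remaining -= weight
            ones += 1
            if ones == kw:                  # position of the kw-th '1'
                kw1 = -p
                scale = Fraction(2) ** kw1
                return round_half_up(k * scale) / scale, kw1
        p -= 1
    return k, None
```

With kw = 2, `quantize_by_ones(from_binary("110.01"), 2)` yields kq = 110₂ with kw₁ = −1, and `quantize_by_ones(from_binary("0.011101"), 2)` yields 0.1₂ because of the rounding, matching the two cases discussed above.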

To show the improvement of the segmentation method in OML-PLAC, we summarize the segmentation results of OML-PLAC and ML-PLAC in Table 3. As can be seen from the table, the segmentation method in OML-PLAC can significantly reduce the number of segments under the same conditions. Note that kw can be zero in ML-PLAC but not in OML-PLAC, so we set kw to 1 when calculating the hyperbolic tangent and sigmoid functions. In addition, when kw is set to the same value in ML-PLAC and OML-PLAC, the latter needs fewer shifted data and adders, as can be seen from the previous example.

3.2. Error Adaptation

When we set an MAE_def, we can get the segments and the actual MAE with the OML-PLAC method. However, this MAE is usually not the smallest achievable for the same number of segments, so we propose an error adaptation (EA) method.

EA can ensure that the error is controllable when the MAE_def and the maximum number of segments are met. It means that the difference between the final AE and ideal AE as a percentage of MAE_def is controllable.

The flow diagram of the EA method is shown in Figure 4, and the contents of the yellow boxes in the process represent the operations in Figure 3. The EA steps have been briefly described in Section 1. EC_min and EC_max are, respectively, the left and right pointers of the error bisection window, and each iteration defines the middle of the error bisection window as the new MAE_def. The cycle_num is the user-defined number of iterations, and its value determines the final width of the error bisection window. The final width of the error bisection window is less than (MAE_def × 2^−cycle_num), so the percentage is at most (2^−cycle_num × 100%). Given a target percentage, the required cycle_num can be derived inversely. For example, to keep the percentage within 1%, cycle_num is set to 7, since 2^−7 ≈ 0.78% < 1%. We add a column in Table 3 to report the final MAE of the four functions adjusted by EA (MAE_EA). The EA process not only reduces the final MAE but also has little impact on hardware performance because the number of segments is not changed.

3.3. Hardware Improvement

The hardware implementation of this paper is basically the same as that of [29], which mainly includes comparators, multiplexers and adders. The difference is that tree-cascaded method is used to implement multi-input multiplexers with the nodes of 2-input and 3-input multiplexers in this paper. In [29], the implementation method of multi-input multiplexers is not mentioned, only that it is based on 2-input multiplexer. The libraries used in this paper all contain 2-input and 3-input multiplexers, and in general a 3-input multiplexer has roughly the same area as two 2-input multiplexers, but the timing is better. To reduce the critical path, carry-lookahead adders are used to replace normal adders.

In order to make a better comparison, the same functions and parameters as in [29] are adopted for experiment and analysis. We consider 2^x as a typical function to analyze the advantages of the proposed method in this paper.

Firstly, we set the following parameters: iw = 27, qw = 14, kw = 2, MAE_def = 1.29 × 10⁻³, cycle_num = 7 and the range of independent variables is [0,1). The results of OML-PLAC are shown in Table 4. The number of segments and the final MAE are 18 and 1.25 × 10⁻³, while the two data in ML-PLAC are 24 and 1.29 × 10⁻³, respectively. The performance gains are due to optimized segmentation method and EA, and the hardware architecture is shown in Figure 5.

Seventeen parallel comparators are used to determine the location of the input data. According to the results of the comparison, the y-intercept and shifted data are selected for addition. The three multiplexers for data selection also operate in parallel and are implemented by tree-cascaded 2-input and 3-input multiplexers. The cascaded circuit in the green box of Figure 5 is the multi-input multiplexer implementation circuit, and the red dashed line represents the critical path, including a 27 bit comparator, a multiplexer (MUX2) and two adders. MUX2 contains two 3:1 multiplexers and one 2:1 multiplexer; if it were implemented with 2:1 multiplexers only, five would be needed. With the former method, the delay of MUX2 is the sum of the delays of a 3-input multiplexer and a 2-input multiplexer, while with the latter method it is the sum of the delays of three 2-input multiplexers. Clearly, the former method has a smaller delay, and as the number of multiplexer inputs increases, the advantages of the tree-cascaded structure become more obvious.
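The node count and depth of the two MUX2 implementations can be reproduced with a short Python sketch. The greedy grouping in `mux_tree` is an assumption for illustration and is not necessarily the balancing strategy used in the synthesized design; it simply groups signals level by level with a given maximum fan-in.

```python
def mux_tree(n_inputs, max_fanin):
    """Greedy tree of multiplexers selecting one of n_inputs signals.

    Returns a list of levels; each level lists the fan-in of its nodes.
    A lone leftover signal is forwarded to the next level without a node."""
    levels = []
    signals = n_inputs
    while signals > 1:
        full, rest = divmod(signals, max_fanin)
        nodes = [max_fanin] * full
        passthrough = 0
        if rest == 1:
            passthrough = 1          # forwarded unchanged
        elif rest > 1:
            nodes.append(rest)       # smaller multiplexer for the remainder
        levels.append(nodes)
        signals = len(nodes) + passthrough
    return levels


# Six-input MUX2: 2- and 3-input nodes versus 2-input nodes only.
mixed = mux_tree(6, 3)
binary = mux_tree(6, 2)
```

Here `mixed` has two levels (two 3:1 multiplexers followed by one 2:1 multiplexer), while `binary` needs three levels and five 2:1 multiplexers, which agrees with the comparison in the text.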

In Figure 5, sp_i and bq_i are the endpoint and the y-intercept of the i^th segment, s_i is the comparison result between input data and segment starting point. The SAA operations can be represented as the following piecewise function:

\{\begin{matrix} 2^{- 1} \times x_{13} + 2^{- 2} \times x_{12} & i \in [1, 7] \\ x_{14} & i \in [8, 11] \\ x_{14} + 2^{- 7} \times x_{7} & i = 12 \\ x_{14} + 2^{- 3} \times x_{11} & i = 13 \\ x_{14} + 2^{- 2} \times x_{12} & i \in [14, 17] \\ x_{14} + 2^{- 1} \times x_{13} & i = 18 \end{matrix}

(17)

where i and x_m represent the segment index and the m most significant bits of the input x, respectively. We divide the SAA operations into six groups based on the value of the slope according to Table 4. Multiplying x_m by 2^−t is the same as shifting x_m to the right by t bits, and the sum of m and t is the fractional bit width of the intermediate data.
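Equation (17) can be modeled with integer shifts, as in the following Python sketch. It assumes the input is held as an unsigned integer with iw = 27 fractional bits and that the result is expressed in units of 2^−14 (qw = 14), so that each term 2^−t × x_m with m + t = 14 reduces to a single right shift of the input by (iw − m) bits. The names `SAA_TERMS` and `saa_product` are hypothetical.

```python
# (first segment, last segment, m values of the x_m terms), from Equation (17)
SAA_TERMS = [
    (1, 7, (13, 12)),
    (8, 11, (14,)),
    (12, 12, (14, 7)),
    (13, 13, (14, 11)),
    (14, 17, (14, 12)),
    (18, 18, (14, 13)),
]


def saa_product(x_int, segment, iw=27):
    """Return the SAA result in units of 2**-14 for a 1-based segment index.

    x_int is the input as an unsigned integer holding iw fractional bits.
    Each term keeps the top m bits of x, which equals x_int >> (iw - m)."""
    for first, last, ms in SAA_TERMS:
        if first <= segment <= last:
            return sum(x_int >> (iw - m) for m in ms)
    raise ValueError("segment index out of range")


# x = 0.5 in a 27-bit fraction; segment 1 has slope 2^-1 + 2^-2 = 0.75.
half = 1 << 26
product = saa_product(half, 1)
```

For x = 0.5 in segment 1, `product` is 6144, that is 6144 / 2^14 = 0.375 = 0.75 × 0.5, so the two shifted copies and one addition reproduce the multiplication by the quantized slope.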

Next, we give an example to illustrate a case in which the piecewise method in this paper reduces the SAA operations and their associated hardware resources. Consider the function softsign(x) as in [29], with the same iw, qw, independent variable range and MAE_def (12, 10, (−8,8) and 3.91 × 10⁻³, respectively), but with kw set to 2 rather than 4. We get 20 segments with the segmentation method in this paper, while the value in [29] is 42. The SAA operations are expressed by (18). The hardware architecture is shown in Figure 6. Due to the decrease in kw, the number of SAA operations is reduced. Compared to the architecture proposed in [29], one multiplexer and one adder are saved in Figure 6, and the critical path is also shorter. In addition, the comparators and the y-intercept selection multiplexer are greatly reduced due to the reduced number of segments. So, the performance of the circuit has been comprehensively improved:

\{\begin{array}{l} 2^{- 6} \times x_{8} + 2^{- 9} \times x_{5} & i = 1 \\ 2^{- 5} \times x_{9} + 2^{- 8} \times x_{6} & i = 2 \\ 2^{- 4} \times x_{10} + 2^{- 8} \times x_{6} & i = 3 \\ 2^{- 4} \times x_{10} + 2^{- 5} \times x_{9} & i = 4 \\ 2^{- 3} \times x_{11} + 2^{- 5} \times x_{9} & i = 5 \\ 2^{- 2} \times x_{12} + 2^{- 9} \times x_{5} & i = 6 \\ 2^{- 2} \times x_{12} + 2^{- 3} \times x_{11} & i = 7 \\ 2^{- 1} \times x_{13} + 2^{- 5} \times x_{9} & i = 8 \\ 2^{- 1} \times x_{13} + 2^{- 2} \times x_{12} & i = 9 \\ x_{14} & i = 10 \\ 2^{- 1} \times x_{13} + 2^{- 2} \times x_{12} & i = [11, 12] \\ 2^{- 1} \times x_{13} + 2^{- 5} \times x_{9} & i = 13 \\ 2^{- 2} \times x_{12} + 2^{- 3} \times x_{11} & i = 14 \\ 2^{- 2} \times x_{12} & i = 15 \\ 2^{- 3} \times x_{11} + 2^{- 5} \times x_{9} & i = 16 \\ 2^{- 3} \times x_{11} + 2^{- 4} \times x_{10} & i = 17 \\ 2^{- 4} \times x_{10} & i = 18 \\ 2^{- 5} \times x_{9} & i = 19 \\ 2^{- 6} \times x_{8} + 2^{- 11} \times x_{3} & i = 20 \end{array}

(18)

What calls for special attention is that the inputs of the multiplexers in the SAA part can be adjusted mutually. Taking the hardware structure in Figure 6 as an example, the number of MUX2 and MUX3 inputs can be adjusted by changing the grouping of partial sums. Since whichever of MUX2 and MUX3 has the larger delay lies on the critical path, we need to balance the number of inputs of the two multiplexers as much as possible.

4. Implementation and Comparison

To demonstrate the advantages of the OML-PLAC method, we compare the design of this paper with the current state-of-the-art design [29]. In addition, we include data from other references and supplementary experiments.

We hold many variables constant across the compared designs, including MAE_def, the range of the independent variable and the fractional bit widths of the input, output and intermediate data. All designs are coded in Verilog HDL and synthesized with Synopsys Design Compiler (DC) using TSMC 90-, 65- and 40-nm CMOS technologies. The comparison results are listed in Table 5, Table 6, Table 7, Table 8 and Table 9. The labels in these tables are defined in Figure 7. In all cases in the tables, the number of EA iterations is 7, so the actual MAE is within 1% of the ideal MAE. The experimental results for several typical functions are explained in separate parts below. The comparison between OML-PLAC and ML-PLAC is described in detail, and the remaining supplementary experiments are only briefly explained. Note that the signs “−” and “+” in the tables below indicate a percentage reduction and a percentage increase, respectively.
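
The percentage rows in the tables can be reproduced with a few lines of Python. This is a sketch under two assumptions that the tabulated values appear consistent with but that this section does not state explicitly: the frequency is the reciprocal of the delay, and the energy is the product of power and delay. The function names are hypothetical.

```python
def relative_change(ours: float, baseline: float) -> float:
    """Signed percentage change of `ours` relative to `baseline`.

    Negative values correspond to the "-" (reduction) entries in the tables,
    positive values to the "+" (increase) entries.
    """
    return (ours - baseline) / baseline * 100.0


def frequency_ghz(delay_ns: float) -> float:
    """Assumed relation: maximum frequency is the reciprocal of the delay."""
    return 1.0 / delay_ns


def energy_pj(power_mw: float, delay_ns: float) -> float:
    """Assumed relation: energy = power * delay (mW * ns = pJ)."""
    return power_mw * delay_ns


# First experiment of Table 5 (this paper vs. ML-PLAC)
print(round(relative_change(691.56, 783.36), 2))  # area change in percent
print(round(frequency_ghz(1.19), 2))              # GHz
print(round(energy_pj(0.098, 1.19), 3))           # pJ
```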

4.1. Performance Comparison for the Logarithmic Function

To compare the performance of the logarithmic function implemented by OML-PLAC and ML-PLAC, the function log₂(1 + x) is selected. In addition, we add two comparative experiments against [15,24], so four sets of parameters are used in total. The results are shown in Table 5.

Table 5. Performance comparison for log₂(1 + x), x ∈ [0, 1).

Method	No.S	IO.F	NO.M	NO.A	qw	kw	Node (nm)	Fre (GHz)	Area (um²)	Delay (ns)	Power (mW)	Energy (pJ)	MAE_def	MAE
This paper	14	26/26	3	2	14	2	65	0.84	691.56	1.19	0.098	0.117	1.8 × 10⁻³	1.71 × 10⁻³
ML-PLAC	17	26/26	3	2	14	2	65	0.78	783.36	1.30	0.127	0.165	1.8 × 10⁻³	1.79 × 10⁻³
ML-PLAC	17	26/26	3	2	14	2	65	+7.69%	−11.72%	−8.46%	−22.83%	−29.09%	-	−4.47%
This paper	15	14/14	3	2	14	2	65	2.22	1480.32	0.42	0.579	0.243	1.63 × 10⁻³	1.53 × 10⁻³
ML-PLAC	18	14/14	3	2	14	2	65	2.13	1690.56	0.47	1.020	0.479	1.63 × 10⁻³	1.63 × 10⁻³
ML-PLAC	18	14/14	3	2	14	2	65	+4.23%	−12.44%	−10.64%	−43.24%	−49.27%	-	−6.13%
This paper	11	27/27	3	2	12	2	90	0.78	1028.06	1.29	0.081	0.104	2.5 × 10⁻³	2.5 × 10⁻³
[15]	8	27/27	-	5	12	-	90	0.56	8600.68	1.77	0.66	1.168	2.5 × 10⁻³	2.5 × 10⁻³
[15]	8	27/27	-	5	12	-	90	+39.29%	−88.05%	−27.12%	−87.73%	−91.10%	-	0%
This paper	12	26/26	3	2	13	2	65	1	616.68	1	0.098	0.098	2.1 × 10⁻³	2.1 × 10⁻³
[24]	45	26/26	1	1	13	-	65	0.73	786.24	1.37	0.155	0.212	2.1 × 10⁻³	2.1 × 10⁻³
[24]	45	26/26	1	1	13	-	65	+36.99%	−21.57%	−27.01%	−36.77%	−53.77%	-	0%

In the first set of experiments, iw = 26, qw = 14, kw = 2 and MAE_def = 1.8 × 10⁻³. With the same MAE_def, the method in this paper requires fewer segments and less area, delay and power. In addition, the EA approach yields a smaller MAE. Specifically, the design reduces the number of segments, the area, the delay, the power and the MAE by 3, 11.72%, 8.46%, 22.83% and 4.47%, respectively. Compared to the hardware architecture in ML-PLAC, fewer comparators and fewer y-intercept selector inputs are needed. The critical path comprises one 26-bit comparator, one 5:1 multiplexer and two 14-bit adders.

In the second set of experiments, iw = 14, qw = 14, kw = 2 and MAE_def = 1.63 × 10⁻³. In this case, our design saves 3 segments and reduces the area by 12.44%, the delay by 10.64%, the power by 43.24% and the MAE by 6.13%. The critical path comprises one 14-bit comparator, one 5:1 multiplexer and two 14-bit adders.

The optimizations in [15,24] are specific to the logarithmic function. In the third and fourth experiments, we use the same values of iw, qw and MAE as those works. As Table 5 shows, the proposed method reduces the area by 88.05% and 21.57% and the delay by 27.12% and 27.01% relative to [15] and [24], respectively.

4.2. Performance Comparison for the Antilogarithmic Function

In this part, we perform three sets of experiments with TSMC 90-nm CMOS technology, using the configurations reported in [29]. The results are shown in Table 6. The first set of experiments is the example presented in Section 3, where we set iw = 27, qw = 14, kw = 2 and MAE_def = 1.29 × 10⁻³. Compared to [29], the number of segments, area, delay, power and MAE are reduced by 6, 6.26%, 3.70%, 9.49% and 3.10%, respectively. In the second set, where iw = 14, qw = 14, kw = 2 and MAE_def = 1.24 × 10⁻³, the reductions are 5, 12.13%, 1.64%, 15.20% and 0.81%. In the first set of experiments, the data in [21] also serve as a comparison. In both sets, the critical-path delay is the sum of the delays of one comparator, one multiplexer and two adders; the differences lie in the width of the comparators and the number of y-intercept inputs.

Table 6. Performance comparison for 2^x, x ∈ [0, 1).

Method	No.S	IO.F	NO.M	NO.A	qw	kw	Node (nm)	Fre (GHz)	Area (um²)	Delay (ns)	Power (mW)	Energy (pJ)	MAE_def	MAE
This paper	18	27/27	3	2	14	2	90	0.77	1679.33	1.30	0.124	0.161	1.29 × 10⁻³	1.25 × 10⁻³
ML-PLAC	24	27/27	3	2	14	2	90	0.74	1791.52	1.35	0.137	0.185	1.29 × 10⁻³	1.29 × 10⁻³
ML-PLAC	24	27/27	3	2	14	2	90	+4.05%	−6.26%	−3.70%	−9.49%	−12.97%	-	−3.10%
[21]	8	10/10	-	3	14	-	90	0.71	6098.74	1.41	0.153	0.216	1.29 × 10⁻³	1.29 × 10⁻³
[21]	8	10/10	-	3	14	-	90	+8.45%	−72.46%	−7.8%	−18.95%	−25.46%	-	−3.10%
This paper	19	14/14	3	2	14	2	90	1.67	2914.83	0.60	0.753	0.452	1.24 × 10⁻³	1.23 × 10⁻³
ML-PLAC	24	14/14	3	2	14	2	90	1.64	3317.03	0.61	0.888	0.542	1.24 × 10⁻³	1.24 × 10⁻³
ML-PLAC	24	14/14	3	2	14	2	90	+1.83%	−12.13%	−1.64%	−15.20%	−16.61%	-	−0.81%
This paper	21	10/12	4	3	12	3	90	1.29	1653.93	0.775	0.3188	0.247	4.79 × 10⁻⁴	4.79 × 10⁻⁴
ML-PLAC	36	10/12	4	3	12	3	90	1.28	1694.85	0.78	0.397	0.310	4.79 × 10⁻⁴	4.79 × 10⁻⁴
ML-PLAC	36	10/12	4	3	12	3	90	+0.78%	−2.41%	−0.64%	−19.70%	−20.32%	-	0%

In the third set of experiments, the input and output have different numbers of fractional bits, 10 and 12, respectively; qw, kw and MAE_def are 12, 3 and 4.79 × 10⁻⁴. In this case, the design reduces the number of segments, area, delay, power and MAE by 15, 2.41%, 0.64%, 19.70% and 0%, respectively. The MAE does not decrease because MAE_def is already close to the ideal MAE; a more accurate MAE can be obtained by increasing the number of iterations. Although the number of segments is much smaller, the reductions in area and delay are modest. We analyze the reason below.

Figure 8 compares the hardware architectures for the third set of experiments, where (a) shows the architecture in ML-PLAC and (b) shows the architecture in OML-PLAC. Compared to ML-PLAC, the OML-PLAC architecture has fewer comparators and a smaller y-intercept multiplexer. However, the multiplexers in the SAA part have more inputs. Therefore, the overall performance improvement of the optimized architecture is not very large. The critical path of OML-PLAC is even longer when the logic gate size is the same. However, when DC is used for optimization, some timing advantage can be gained at the cost of area. As a result, the final experimental results show a small improvement in all indicators.

4.3. Performance Comparison for the Hyperbolic Tangent Function

We conduct four sets of experiments with TSMC 90-nm, 65-nm and 40-nm CMOS technologies.

The first and second sets of experiments use the same parameter values, and the results are listed in Table 7. We set iw = 8, qw = 8, kw = 1 and MAE_def = 3.5 × 10⁻³. After the re-segmentation and error adaptation of the proposed method, the number of segments and the MAE are the same as in ML-PLAC, which means that the software improvements do not contribute to the circuit performance in these two sets of experiments. In general, the effect of the software on performance depends on the parameter values and the slope of the function. Therefore, in this part, the hardware performance improvement comes from the tree-cascaded multiplexer structure.

Table 7. Performance comparison for tanh(x).

Method	x	No.S	IO.F	NO.M	NO.A	qw	kw	Node (nm)	Fre (GHz)	Area (um²)	Delay (ns)	Power (mW)	Energy (pJ)	MAE_def	MAE
This paper	[0,1)	24	8/8	2	1	8	1	90	1.28	498.86	0.78	0.053	0.041	3.5 × 10⁻³	3.48 × 10⁻³
ML-PLAC	[0,1)	24	8/8	2	1	8	1	90	1.11	524	0.90	0.0663	0.060	3.5 × 10⁻³	3.48 × 10⁻³
ML-PLAC	[0,1)	24	8/8	2	1	8	1	90	+6.31%	−4.80%	−13.33%	−20.06%	−31.67%	-	0%
This paper	[0,1)	24	8/8	2	1	8	1	40	2.04	187.51	0.49	0.0727	0.036	3.5 × 10⁻³	3.48 × 10⁻³
ML-PLAC	[0,1)	24	8/8	2	1	8	1	40	2	215	0.50	0.0768	0.038	3.5 × 10⁻³	3.48 × 10⁻³
ML-PLAC	[0,1)	24	8/8	2	1	8	1	40	+2%	−12.79%	−2%	−5.34%	−5.26%	-	0%
This paper	(−8,8)	7	6/6	1	1	6	1	65	1.11	162.36	0.9	0.011	9.9 × 10⁻³	2 × 10⁻²	1.94 × 10⁻²
[9]	(−8,8)	-	6/6	-	-	-	-	65	0.59	220.41	1.69	-	-	2 × 10⁻²	2 × 10⁻²
[9]	(−8,8)	-	6/6	-	-	-	-	65	+88.14%	−26.34%	−46.75%	-	-	-	−3%
This paper	(−8,8)	8	8/8	2	1	16	1	90	1.43	1198.11	0.7	0.175	0.123	1.07 × 10⁻²	1.07 × 10⁻²
[16]	(−8,8)	12	8/8	-	-	16	-	90	1.01	2166.19	0.99	0.717	0.710	-	1.07 × 10⁻²
[16]	(−8,8)	12	8/8	-	-	16	-	90	+41.58%	−44.69%	−29.29%	−75.5%	−82.68%	-	0%

The hardware structure of the tanh(x) function includes 23 8-bit comparators, 1 24:1 8-bit multiplexer, 1 2:1 8-bit multiplexer and 1 8-bit adder in the first and second sets of experiments. The critical path consists of 1 8-bit comparator, 1 8-bit multiplexer and 1 8-bit adder. The proposed design saves 4.80% of the area, 13.33% of the delay and 20.06% of the power when implemented in TSMC 90-nm technology, while the three percentages are 12.79%, 2% and 5.34%, respectively, when implemented in TSMC 40-nm technology.

The third and fourth groups are supplementary comparative experiments, comparing the universal method proposed in this paper with the more specific methods in [9,16]. Compared with the first and second sets of experiments, our method has more obvious performance advantages in these two groups. In the circuit implementation, we exploit the odd symmetry of the function and only handle the case where the independent variable x is positive. When x is negative, its absolute value is used as the input to obtain an intermediate result, and the final result is obtained by inverting the intermediate result and adding 1 (two's complement negation).
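
The handling of negative inputs for an odd function such as tanh(x) can be sketched in Python as follows. This is an illustration, not the authors' circuit: `eval_positive` stands in for a positive-range approximator, values are modelled as two's complement bit patterns of a given width, and both function names are hypothetical.

```python
from typing import Callable


def negate_twos_complement(value: int, width: int) -> int:
    """Two's complement negation within `width` bits: invert, then add 1."""
    mask = (1 << width) - 1
    return ((~value) + 1) & mask


def eval_odd(eval_positive: Callable[[int], int], x: int, width: int) -> int:
    """Evaluate an odd function f(-x) = -f(x) with a positive-only core.

    `x` and the return value are two's complement bit patterns of `width` bits.
    """
    sign = (x >> (width - 1)) & 1
    if sign:
        magnitude = negate_twos_complement(x, width)      # |x|
        return negate_twos_complement(eval_positive(magnitude), width)
    return eval_positive(x)


# Toy core (halving) in place of the real piecewise linear approximator
half = lambda v: v >> 1
print(hex(eval_odd(half, 0xFC, 8)))  # input -4 -> output 0xfe, i.e. -2
```

Only the positive half of the input range needs segments, at the cost of the two negation stages around the core.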

4.4. Performance Comparison for the Sigmoid Function

In the performance comparison of sigmoid function implementation, we still take four sets of experiments to analyze, with the use of TSMC 90-nm, TSMC 65-nm and TSMC 40-nm processes, as shown in Table 8.

Table 8. Performance comparison for sigmoid(x).

Method	x	No.S	IO.F	NO.M	NO.A	qw	kw	Node (nm)	Fre (GHz)	Area (um²)	Delay (ns)	Power (mW)	Energy (pJ)	MAE_def	MAE
This paper	[0,1)	4	8/8	1	1	8	1	90	~1.92	~102	~0.52	~0.019	~0.010	~5.0 × 10⁻³	~4.93 × 10⁻³
ML-PLAC	[0,1)	4	8/8	1	1	8	2	90	~1.92	~102	~0.52	~0.019	~0.010	~5.0 × 10⁻³	~4.93 × 10⁻³
This paper	(−1,1)	22	13/13	4	3	13	3	40	1.54	563.77	0.65	0.179	0.116	3.79 × 10⁻⁴	3.79 × 10⁻⁴
ML-PLAC	(−1,1)	29	13/13	4	3	13	5	40	1.43	614.93	0.7	0.191	0.134	3.79 × 10⁻⁴	3.79 × 10⁻⁴
ML-PLAC	(−1,1)	29	13/13	4	3	13	5	40	+7.69%	−8.32%	−7.14%	−6.28%	−13.43%	-	0%
This paper	(−8,8)	4	6/6	1	1	6	1	65	1.2	119.88	0.83	0.009	0.007	2 × 10⁻²	1.89 × 10⁻²
[9]	(−8,8)	-	6/6	-	-	-	-	65	0.44	126.53	2.25	-	-	2 × 10⁻²	1.98 × 10⁻²
[9]	(−8,8)	-	6/6	-	-	-	-	65	+172.7%	−5.26%	−63.11%	-	-	-	−4.55%
This paper	(−8,8)	7	8/8	2	1	16	1	90	1.11	843.90	0.9	0.116	0.104	7.6 × 10⁻³	7.6 × 10⁻³
Scheme I in [17]	(−8,8)	12	8/8	-	-	16	-	90	1	1684.27	0.98	0.520	0.510	-	1.14 × 10⁻²
Scheme I in [17]	(−8,8)	12	8/8	-	-	16	-	90	+11%	−49.90%	−8.16%	−77.69%	−79.61%	-	−33.33%
Scheme II in [17]	(−8,8)	12	8/8	-	-	16	-	90	1	2024.37	0.98	0.667	0.654	-	7.6 × 10⁻³
Scheme II in [17]	(−8,8)	12	8/8	-	-	16	-	90	+11%	−58.31%	−8.16%	−82.61%	−84.10%	-	0%

In the first set of experiments, the data of ML-PLAC are obtained from [29]. After the software processing of the proposed method, the number of segments, the MAE, the segment boundaries and the quantized slope of each segment are the same as in ML-PLAC. In addition, the performance of the circuit is essentially the same after hardware implementation. In this set of experiments, the small number of segments leads to multiplexers with few inputs, so the advantages of the tree-cascaded structure do not show. There are only 3 8-bit comparators, 1 4:1 8-bit multiplexer, 1 2:1 8-bit multiplexer and 1 8-bit adder in the hardware circuit, and the critical path includes 1 8-bit comparator, 1 4:1 8-bit multiplexer and 1 8-bit adder.

To demonstrate the advantages of the method in this paper, we conduct the second set of experiments, in which the data of ML-PLAC are not taken from [29] but are obtained by implementing the method in [29]. We set the range of the independent variable, iw, qw and MAE_def to (−1,1), 13, 13 and 3.79 × 10⁻⁴, respectively. In ML-PLAC, kw is set to 5, which gives 29 segments. The critical path includes 1 14-bit comparator, 1 3:1 15-bit multiplexer and 3 15-bit adders. With the method in this paper, kw is set to 3, which gives 22 segments, and the critical path is the same as that in ML-PLAC. The hardware architecture implemented by the ML-PLAC method includes 28 14-bit comparators, 4 multiplexers (1 29:1, 2 3:1, and 1 6:1) and 3 15-bit adders, while the method of this paper requires 21 14-bit comparators, 4 multiplexers (1 21:1, 2 3:1, and 1 5:1) and 3 15-bit adders. In this case, in addition to optimizing the hardware structure, we spend a small amount of area to optimize the timing, and the OML-PLAC method improves all performance indicators of the circuit. The final implementation results show that our method reduces area, delay and power by 8.32%, 7.14% and 6.28%, respectively.

The third and fourth experiments compare our method with other specialized methods with good performance, and the results of the comparison are shown in the table. In these two sets of experiments, we take advantage of the symmetry of the function and only consider the case where the independent variable x is positive. Since the range of the dependent variable is (0,1), we only consider the fractional bits of the result. When x is negative, the corresponding positive number is used as input to get the intermediate result, and the final result can be obtained by inverting the fractional bits and adding 1.
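
The fractional-bit trick for negative inputs rests on the identity sigmoid(−x) = 1 − sigmoid(x). A short Python sketch, assuming the positive-side result is stored as an unsigned fraction with 8 fractional bits (the helper name `one_minus_fraction` is hypothetical):

```python
import math


def one_minus_fraction(frac_value: int, frac_width: int) -> int:
    """Return the fractional bits of 1 - y for a fraction y in (0, 1).

    Inverting the fractional bits and adding 1 gives 2**frac_width - frac_value
    modulo 2**frac_width, which is exactly (1 - y) scaled by 2**frac_width.
    """
    mask = (1 << frac_width) - 1
    return ((~frac_value) + 1) & mask


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


F = 8
y_pos = int(sigmoid(1.0) * (1 << F))      # truncated fraction for x = 1
y_neg = one_minus_fraction(y_pos, F)      # fraction for x = -1
print(y_pos, y_neg)                       # 187 69
print(abs(y_neg / (1 << F) - sigmoid(-1.0)) < 2.0 ** -F)  # True
```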

4.5. Performance Comparison for the Softsign Function

In this part, we compare the performance of the softsign function in the range of independent variables (−8,8). The data of ML-PLAC are obtained from [29] and the data from [30] are used as a supplementary comparison. We use the same iw, qw, Node and MAE_def (12, 10, 40 and 3.91 × 10⁻³, respectively, except that MAE_def in [30] is 5.86 × 10⁻³) for the comparison experiments.

Given the properties of the function, we implement it in two ways based on the OML-PLAC method. Method a is implemented in the same way as the previous functions, and the independent variable is segmented from −8 to 8. Method b exploits the odd symmetry of the function and only handles the case where the independent variable is positive. When the independent variable is negative, its absolute value is taken as the input, and the two's complement negation of the result is taken as the final output. Therefore, the range of the independent variable for method b is [0,8). The results are listed in Table 9.

Table 9. Performance comparison for softsign(x), x ∈ (−8, 8).

Method		No.S	IO.F	NO.M	NO.A	qw	kw	Node (nm)	Fre (GHz)	Area (um²)	Delay (ns)	Power (mW)	Energy (pJ)	MAE_def	MAE
This paper	a	20	12/12	3	2	10	2	40	1	382.08	1	0.0653	0.065	3.91 × 10⁻³	3.91 × 10⁻³
This paper	b	10	12/12	3	2	10	2	40	1.11	244.67	0.9	0.0443	0.040	3.91 × 10⁻³	3.91 × 10⁻³
ML-PLAC		42	12/12	4	3	10	4	40	0.83	600	1.2	0.0741	0.089	3.91 × 10⁻³	3.91 × 10⁻³
Compared to a									+20.48%	−36.32%	−16.67%	−11.88%	−26.97%	-	0%
Compared to b									+33.73%	−59.22%	−25%	−40.22%	−55.06%	-	0%
[30]		32	12/12	2	1	10	-	40	0.50	1354	2	0.2152	0.430	-	5.86 × 10⁻³
Compared to a									+100%	−71.78%	−50%	−69.66%	−84.88%		−33.28%
Compared to b									+122%	−81.93%	−55%	−79.41%	−90.70%		−33.28%

The kw of the ML-PLAC method is 4, while it is 2 in methods a and b. As a result, methods a and b each require one fewer multiplexer and one fewer adder than ML-PLAC, which has a positive effect on the area and delay of the circuit. Method a gets 20 segments, and the critical path includes 1 16-bit comparator, 1 16:1 multiplexer, 1 14-bit adder and 1 16-bit adder. The design saves 36.32% area, 16.67% delay and 11.88% power. Method b gets 10 segments, and the critical path includes 1 16-bit comparator, 1 8:1 multiplexer, 1 14-bit adder and 1 16-bit adder. The performance of this design is much better: it costs 59.22% less area, 25% less delay and 40.22% less power, and 55.06% less energy. The comparison results between the method in this paper and [30] are shown in Table 9 and are not described in detail.

5. Conclusions

In this paper, we propose an optimized method for nonlinear function approximation based on multiplierless piecewise linear approximation computation, which is called OML-PLAC. OML-PLAC is superior to the current state-of-the-art PWL method (ML-PLAC). To make the best use of shifters and multiplexers, we control the number of ones in the slope rather than the number of fractional bits of the slope, so that fewer segments and less hardware are needed to implement the target function. To obtain a controllable error percentage under the constraint of the predefined MAE, an EA method is proposed. In addition, to optimize the critical path, tree-cascaded 2-input and 3-input multiplexers are used to replace multi-input multiplexers. We also point out the limitation of the current optimal PWL segmentation method in Section 1 for readers to consider. We apply the OML-PLAC method to common activation functions and typical functions of the logarithmic number system, including sigmoid(x), tanh(x), softsign(x), log₂(1 + x) and 2^x, and we implement these functions in different CMOS technologies. Finally, we compare the implementation results with those of ML-PLAC. The data show that, compared with the current state-of-the-art methods, our method has advantages in area, delay, power consumption and other indicators.

Author Contributions

Conceptualization, H.Y. and G.Y.; data collection, H.Y.; data analysis, H.Y.; methodology, H.Y.; software, H.Y.; coding, H.Y., D.K. and L.L.; synthesis, H.Y.; writing—original draft, H.Y.; writing—review and editing, D.K., L.L. and Y.H.; final approval, H.Y., G.Y., D.K., L.L. and Y.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy.

Conflicts of Interest

The authors declare no conflict of interest.

References

[1] Liu, W.; Lombardi, F.; Schulte, M. A retrospective and prospective view of approximate computing. Proc. IEEE 2020, 108, 394–399.
[2] Seth, A.; Gan, W.-S. Fixed-point square roots using L-b truncation. IEEE Signal Process. Mag. 2011, 28, 149–153.
[3] Luo, Y.; Wang, Y.; Ha, Y.; Wang, Z.; Chen, S.; Pan, H. Generalized hyperbolic CORDIC and its logarithmic and exponential computation with arbitrary fixed base. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2019, 27, 2156–2169.
[4] Wang, Y.; Luo, Y.; Wang, Z.; Shen, Q.; Pan, H. GH CORDIC-Based Architecture for Computing Nth Root of Single-Precision Floating-Point Number. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2020, 28, 864–875.
[5] Montuschi, P.; Bruguera, J.D.; Ciminiera, L.; Piñeiro, J.-A. A digit-by-digit algorithm for mth root extraction. IEEE Trans. Comput. 2007, 56, 1696–1706.
[6] Wang, Z.; Lin, J.; Wang, Z. Accelerating recurrent neural networks: A memory-efficient approach. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2017, 25, 2763–2775.
[7] Nilsson, P.; Shaik, A.U.R.; Gangarajaiah, R.; Hertz, E. Hardware implementation of the exponential function using Taylor series. In Proceedings of the 2014 NORCHIP, Tampere, Finland, 27–28 October 2014; pp. 1–4.
[8] Sybis, M. Log-MAP equivalent Chebyshev inequality based algorithm for turbo TCM decoding. Electron. Lett. 2011, 47, 1049–1050.
[9] Chong, Y.S.; Goh, W.L.; Ong, Y.S.; Nambiar, V.P.; Do, A.T. Efficient Implementation of Activation Functions for LSTM accelerators. In Proceedings of the 2021 IFIP/IEEE 29th International Conference on Very Large Scale Integration (VLSI-SoC), Singapore, 4–7 October 2021; pp. 1–5.
[10] Chinta, C.; Deshmukh, R.B. High speed most significant bit first truncated multiplier. In Proceedings of the 2018 9th International Conference on Computing, Communication and Networking Technologies (ICCCNT), Bengaluru, India, 10–12 July 2018; pp. 1–4.
[11] Chandrashekara, M.; Rohith, S. Design of 8 bit Vedic multiplier using Urdhva Tiryagbhyam sutra with modified carry save adder. In Proceedings of the 2019 4th International Conference on Recent Trends on Electronics, Information, Communication & Technology (RTEICT), Bangalore, India, 17–18 May 2019; pp. 116–120.
[12] Pilipović, R.; Bulić, P. On the design of logarithmic multiplier using radix-4 booth encoding. IEEE Access 2020, 8, 64578–64590.
[13] De Caro, D.; Petra, N.; Strollo, A.G. Efficient logarithmic converters for digital signal processing applications. IEEE Trans. Circuits Syst. II Express Briefs 2011, 58, 667–671.
[14] Gutierrez, R.; Valls, J. Low cost hardware implementation of logarithm approximation. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2010, 19, 2326–2330.
[15] Ellaithy, D.M.; El-Moursy, M.A.; Ibrahim, G.H.; Zaki, A.; Zekry, A. Accurate Piecewise Uniform Approximation Logarithmic/Antilogarithmic Converters for GPU Applications. In Proceedings of the 2017 29th International Conference on Microelectronics (ICM), Beirut, Lebanon, 10–13 December 2017; pp. 1–4.
[16] Qin, Z. Optimizations and Implementations for Key Components of Deep Neural Networks; Nanjing University: Nanjing, China, 2020.
[17] Qin, Z.; Qiu, Y.; Sun, H.; Lu, Z.; Shen, Q.; Wang, Z.; Pan, H. A Novel Approximation Methodology and Its Efficient VLSI Implementation for the Sigmoid Function. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 3422–3426.
[18] Kim, H.; Nam, B.-G.; Sohn, J.-H.; Woo, J.-H.; Yoo, H.-J. A 231-MHz, 2.18-mW 32-bit logarithmic arithmetic unit for fixed-point 3-D graphics system. IEEE J. Solid-State Circuits 2006, 41, 2373–2381.
[19] Nam, B.-G.; Yoo, H.-J. An embedded stream processor core based on logarithmic arithmetic for a low-power 3-D graphics SoC. IEEE J. Solid-State Circuits 2009, 44, 1554–1570.
[20] Ellaithy, D.M.; El-Moursy, M.A.; Ibrahim, G.H.; Zaki, A.; Zekry, A. Double logarithmic arithmetic technique for low-power 3-D graphics applications. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2017, 25, 2144–2152.
[21] Ellaithy, D.M.; El-Moursy, M.A.; Ibrahim, G.H.; Zaki, A.; Zekry, A. Efficient Piecewise Non-Uniform Approximation Logarithmic and Antilogarithmic Converters. In Proceedings of the 2017 International Conference on Advanced Control Circuits Systems (ACCS) Systems & 2017 International Conference on New Paradigms in Electronics & Information Technology (PEIT), Alexandria, Egypt, 5–8 November 2017; pp. 149–152.
[22] Zhu, M.; Ha, Y.; Gu, C.; Gao, L. An optimized logarithmic converter with equal distribution of relative errors. IEEE Trans. Circuits Syst. II Express Briefs 2016, 63, 848–852.
[23] Liu, C.-W.; Ou, S.-H.; Chang, K.-C.; Lin, T.-C.; Chen, S.-K. A low-error, cost-efficient design procedure for evaluating logarithms to be used in a logarithmic arithmetic processor. IEEE Trans. Comput. 2015, 65, 1158–1164.
[24] Loukrakpam, M.; Choudhury, M. Error-aware design procedure to implement hardware-efficient logarithmic circuits. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 851–855.
[25] Sun, H.; Luo, Y.; Ha, Y.; Shi, Y.; Gao, Y.; Shen, Q.; Pan, H. A universal method of linear approximation with controllable error for the efficient implementation of transcendental functions. IEEE Trans. Circuits Syst. I Regul. Pap. 2019, 67, 177–188.
[26] Dong, H.; Wang, M.; Luo, Y.; Zheng, M.; An, M.; Ha, Y.; Pan, H. PLAC: Piecewise linear approximation computation for all nonlinear unary functions. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2020, 28, 2014–2027.
[27] Lyu, F.; Xu, X.; Wang, Y.; Luo, Y.; Wang, Y.; Pan, H. Ultralow-latency VLSI architecture based on a linear approximation method for computing Nth roots of floating-point numbers. IEEE Trans. Circuits Syst. I Regul. Pap. 2020, 68, 715–727.
[28] Lyu, F.; Mao, Z.; Zhang, J.; Wang, Y.; Luo, Y. PWL-based architecture for the logarithmic computation of floating-point numbers. IEEE Trans. Very Large Scale Integr. VLSI Syst. 2021, 29, 1470–1474.
[29] Lyu, F.; Xia, Y.; Mao, Z.; Wang, Y.; Wang, Y.; Luo, Y. ML-PLAC: Multiplierless Piecewise Linear Approximation for Nonlinear Function Evaluation. IEEE Trans. Circuits Syst. I Regul. Pap. 2022, 69, 1546–1559.
[30] Chang, C.-H.; Zhang, E.-H.; Huang, S.-H. Softsign Function Hardware Implementation Using Piecewise Linear Approximation. In Proceedings of the 2019 International Symposium on Intelligent Signal Processing and Communication Systems (ISPACS), Taipei, Taiwan, 3–6 December 2019; pp. 1–2.

Figure 1. SAA parts of ML-PLAC (a) and OML-PLAC (b).

Figure 2. The graph of the sine function.

Figure 3. The process of segmentation in ML-PLAC.

Figure 4. The flow diagram of the EA method.

Figure 5. The hardware architecture for 2^x (iw = 27, qw = 14, kw = 2, MAE_def = 1.29 × 10⁻³).

Figure 6. The hardware architecture for softsign(x) (iw = 12, qw = 10, kw = 2, MAE_def = 3.91 × 10⁻³).

Figure 7. Definitions of the labels in Table 5, Table 6, Table 7, Table 8 and Table 9.

Figure 8. The hardware architectures in ML-PLAC (a) and OML-PLAC (b) for the third set of experiments in Table 6.

Table 1. The comparison of the different PWL methods.

PWL Methods	Segmentation Criteria	Optimizations	Features
uniform segmentation	Same segment widths.	Optimize according to the properties of specific functions; Error modification.	Good performance for specific functions; Different errors for different segments; More segments with the increase in accuracy.
non-uniform segmentation (not error-flattened)	MRE; Control the number of segments according to the rate of change in the function; No fixed standard.	Optimize according to the properties of specific functions and segment points.	Good performance for specific functions; Poor hardware efficiency.
error flattening segmentation	MAE.	Software segmentation method optimization; Hardware structure optimization.	Good performance for general nonlinear functions; Fewer segments; Hardware efficient.

Table 2. Pseudocode for MAE calculation in ML-PLAC and OML-PLAC.

1	$k_{i} = \frac{f (x (epi)) - f (x (spi))}{x (epi) - x (spi)}$
2	${kq}_{i} = \frac{round (k_{i} \times 2^{kw})}{2^{kw}}$	%Quantification operation of $k_{i}$ in ML-PLAC
3	$b_{i} = f (x (epi)) - k_{i} \times x (epi)$
4	$mq = 0, j = 0$
5.1	$while (floor ({kq}_{i} \times 2^{kw - j}) \neq 0)$	%Simulation of SAA operations in ML-PLAC to replace the multiplier and the truncation of multiplier output.
5.2	$Δ k_{i} = \frac{floor ({kq}_{i} \times 2^{kw - j})}{2^{kw - j}} - \frac{floor ({kq}_{i} \times 2^{kw - j - 1})}{2^{kw - j - 1}}$
5.3	$Δ mq = \frac{floor (x (spi : epi) \times 2^{qw + j - kw})}{2^{qw + j - kw}} \times Δ k_{i}$
5.4	$mq = mq + Δ mq$
5.5	$j = j + 1$
5.6	$endwhile$
5_1	$cnt = 0, k_{s} = k_{i}, k_{int} = floor (k_{s})$	%Getting the new fractional bit widths of the slope ${kw}_{1}$ according to the number of ones in the slope kw in OML-PLAC. When ${kw}_{1}$ is negative, the magnitude represents the position of the least significant “1” in the integer part of the slope.
5_2	${kw}_{1} = - ceil (\log 2 (k_{int} + 1))$
5_3	$k_{s} = k_{s} \times 2^{{kw}_{1}}$
5_4	$while (cnt \neq kw)$
5_5	$if (floor (k_{s}) = = 1) cnt = cnt + 1$
5_6	$else\ k_{s} = (k_{s} - floor (k_{s})) \times 2, {kw}_{1} = {kw}_{1} + 1$
5_7	$endwhile$
5_8	${kw}_{1} = {kw}_{1} - 1$
5_9	${kq}_{i} = \frac{round (k_{i} \times 2^{{kw}_{1}})}{2^{{kw}_{1}}}$	%Quantification operation of $k_{i}$ in OML-PLAC.
5_10	$while (floor ({kq}_{i} \times 2^{{kw}_{1} - j}) \neq 0)$	%Simulation of the SAA process in OML-PLAC to replace the multiplier and the truncation of multiplier output.
5_11	$Δ k_{i} = \frac{floor ({kq}_{i} \times 2^{{kw}_{1} - j})}{2^{{kw}_{1} - j}} - \frac{floor ({kq}_{i} \times 2^{{kw}_{1} - j - 1})}{2^{{kw}_{1} - j - 1}}$
5_12	$Δ mq = \frac{floor (x (spi : epi) \times 2^{qw + j - {kw}_{1}})}{2^{qw + j - {kw}_{1}}} \times Δ k_{i}$
5_13	$mq = mq + Δ mq$
5_14	$j = j + 1$
5_15	$endwhile$
6	$h_{i} = mq + b_{i}$
7	$Err = f (x (spi : epi)) - h_{i}$
8	$b_{i}^{'} = b_{i} + \frac{\max (Err) + \min (Err)}{2}$
9	$if (iw > qw)$
10	${bq}_{i} = \frac{round (b_{i}^{'} \times 2^{iw})}{2^{iw}}$
11	$else$
12	${bq}_{i} = \frac{round (b_{i}^{'} \times 2^{qw})}{2^{qw}}$
13	$h_{i}^{'} = mq + {bq}_{i}$
14	$MAE = \max (abs (f (x (spi : epi)) - h_{i}^{'}))$
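
Rows 5_1 to 5_14 of Table 2 can be sketched in Python as follows. This is a hedged illustration rather than the authors' code. It assumes that the shift step of row 5_6 runs on every loop iteration, since otherwise the counter would advance without consuming a bit of the slope. It also adds a `max_bits` guard, which is not in the table, so that slopes with fewer than kw ones terminate. The function names are hypothetical.

```python
import math


def fractional_width_for_ones(k: float, kw: int, max_bits: int = 32) -> int:
    """Rows 5_1-5_8: fractional width kw1 at which slope k > 0 has kw ones."""
    k_int = math.floor(k)
    kw1 = -math.ceil(math.log2(k_int + 1))
    ks = k * 2.0 ** kw1                  # normalise so that ks < 1
    cnt = 0
    while cnt != kw and kw1 < max_bits:  # max_bits guard is an assumption
        if math.floor(ks) == 1:
            cnt += 1
        ks = (ks - math.floor(ks)) * 2   # assumed to run every iteration
        kw1 += 1
    return kw1 - 1


def quantize_slope(k: float, kw: int) -> tuple[float, int]:
    """Row 5_9. Python's round() rounds halves to even, unlike MATLAB."""
    kw1 = fractional_width_for_ones(k, kw)
    return round(k * 2.0 ** kw1) / 2.0 ** kw1, kw1


def saa_product(x: float, kq: float, kw1: int, qw: int) -> float:
    """Rows 5_10-5_14: one truncated partial product per bit of kq."""
    mq, j = 0.0, 0
    while math.floor(kq * 2.0 ** (kw1 - j)) != 0:
        hi = math.floor(kq * 2.0 ** (kw1 - j)) / 2.0 ** (kw1 - j)
        lo = math.floor(kq * 2.0 ** (kw1 - j - 1)) / 2.0 ** (kw1 - j - 1)
        dk = hi - lo                     # weight of the current bit, or 0
        scale = 2.0 ** (qw + j - kw1)
        mq += math.floor(x * scale) / scale * dk
        j += 1
    return mq


kq, kw1 = quantize_slope(0.7, kw=2)
print(kq, kw1)                           # 0.75 3
print(saa_product(0.5, 0.75, 2, qw=14))  # 0.375
```

Because kw bounds the number of ones rather than the fractional width, a slope such as 1.0078125 (ones at positions 0 and −7) is still representable with kw = 2.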

Table 3. Segmentation performance comparison between OML-PLAC and the method used in [29].

Function	Input Range	iw	MAE_def	Segmentation Method	NO. of Segments	kw	qw	MAE	MAE_EA Cycle_Num = 7
log₂(1 + x)	[0,1)	16	2.31 × 10⁻⁴	ML_PLAC	31	4	16	2.31 × 10⁻⁴	-
log₂(1 + x)	[0,1)	16	2.31 × 10⁻⁴	OML_PLAC	19	4	16	2.31 × 10⁻⁴	2.16 × 10⁻⁴
tanh(x)	[0,1)	8	5.72 × 10⁻³	ML_PLAC	22	0	8	5.71 × 10⁻³	-
tanh(x)	[0,1)	8	5.72 × 10⁻³	OML_PLAC	12	1	8	5.57 × 10⁻³	4.67 × 10⁻³
sigmoid(x)	[0,1)	8	6.54 × 10⁻³	ML_PLAC	20	0	8	6.39 × 10⁻³	-
sigmoid(x)	[0,1)	8	6.54 × 10⁻³	OML_PLAC	3	1	8	6.51 × 10⁻³	5.21 × 10⁻³
softsign(x)	(−8,8)	12	5.00 × 10⁻³	ML_PLAC	29	3	10	5.00 × 10⁻³	-
softsign(x)	(−8,8)	12	5.00 × 10⁻³	OML_PLAC	16	3	10	4.99 × 10⁻³	4.99 × 10⁻³

Table 4. The result of OML-PLAC for 2^x, x ∈ [0, 1).

$i$	1	2	3	4	5	6
${kq}_{i}$	0.75	0.75	0.75	0.75	0.75	0.75
${kq}_{i}$ dcp ^(a)	−1,−2	−1,−2	−1,−2	−1,−2	−1,−2	−1,−2
${bq}_{i} \times 2^{27}$	134,064,676	133,945,844	134,266,135	134,585,698	134,907,266	135,227,824
${sp}_{i} \times 2^{27}$	0	7,405,568	27,982,355	33,160,625	37,123,176	40,433,978
${ep}_{i} \times 2^{27}$	7,405,567	27,982,354	33,160,624	37,123,175	40,433,977	43,348,127
$i$	7	8	9	10	11	12
${kq}_{i}$	0.75	1	1	1	1	1.0078125
${kq}_{i}$ dcp ^(a)	−1,−2	0	0	0	0	0,−7
${bq}_{i} \times 2^{27}$	135,550,313	124,052,312	123,723,825	123,395,314	123,066,929	122,272,854
${sp}_{i} \times 2^{27}$	43,348,128	45,969,761	48,832,512	52,101,120	56,033,280	61,382,656
${ep}_{i} \times 2^{27}$	45,969,760	48,832,511	52,101,119	56,033,279	61,382,655	83,500,828
$i$	13	14	15	16	17	18
${kq}_{i}$	1.125	1.25	1.25	1.25	1.25	1.5
${kq}_{i}$ dcp ^(a)	0,−3	0,−2	0,−2	0,−2	0,−2	0,−1
${bq}_{i} \times 2^{27}$	112,508,522	99,488,945	99,809,534	100,128,867	100,448,749	67,144,445
${sp}_{i} \times 2^{27}$	83,500,829	104,193,313	124,058,498	128,089,921	131,169,714	133,757,944
${ep}_{i} \times 2^{27}$	104,193,312	124,058,497	128,089,920	131,169,713	133,757,943	134,217,727

^(a) Decomposition of the slope into powers of two. For example, kq_i = 0.75 = 0.11₂ = 2⁻¹ + 2⁻², so the kq_i dcp entry lists −1 and −2.
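
To show how a row of Table 4 turns into a multiplierless evaluation, the sketch below emulates the shift-and-add computation in Python integers for three of the eighteen segments (i = 1, 12 and 18, copied from the table). It rests on assumptions that are not stated in the table: the input is held as an unsigned value with 27 fractional bits, each right shift truncates, and the names `SEGMENTS`, `shift_add_eval` and `approx_pow2` are hypothetical. It is a software illustration only, not the hardware architecture of the paper.

```python
FRAC_BITS = 27  # Table 4 scales bq_i, sp_i and ep_i by 2^27

# (sp_i * 2^27, ep_i * 2^27, shift exponents of kq_i, bq_i * 2^27)
# Rows i = 1, 12 and 18 of Table 4.
SEGMENTS = [
    (0, 7_405_567, (-1, -2), 134_064_676),
    (61_382_656, 83_500_828, (0, -7), 122_272_854),
    (133_757_944, 134_217_727, (0, -1), 67_144_445),
]


def shift_add_eval(x_fixed):
    """Return kq_i * x + bq_i in fixed point using only shifts and adds."""
    for sp, ep, shifts, bq in SEGMENTS:
        if sp <= x_fixed <= ep:
            # An exponent s <= 0 contributes x * 2^s, i.e. x >> (-s).
            acc = sum(x_fixed >> -s for s in shifts)
            return acc + bq
    raise ValueError("x is outside the segments listed in this sketch")


def approx_pow2(x):
    """Approximate 2**x for x inside one of the listed segments."""
    x_fixed = int(x * (1 << FRAC_BITS))
    return shift_add_eval(x_fixed) / (1 << FRAC_BITS)
```

For instance, `approx_pow2(0.5)` falls in segment 12 and computes `x + (x >> 7) + bq`, which corresponds to the slope 1.0078125 = 2⁰ + 2⁻⁷ listed in the table.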

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
