1. Introduction
The term artificial neural network (ANN) refers to a family of mathematical models inspired by biology and neuroscience. These models simulate biological neural networks by abstracting the neural network of the human brain, constructing artificial neurons, and connecting these artificial neurons according to a certain topological structure. In the field of artificial intelligence, ANNs are usually referred to as neural networks or neural models. The basic constitutive unit of a neural network is the artificial neuron, which simulates the structure and characteristics of a biological neuron: it receives a group of input signals and produces an output. The output of a neuron is usually computed by an activation function, such as sigmoid, ReLU, or Softplus [
1]. Among them, the sigmoid function is widely used in various ANN models due to its simple expression and limited output range. However, the sigmoid function is a nonlinear function with exponent and division operations that consumes a large amount of hardware resources. It is therefore necessary to simplify and accelerate the sigmoid function when deploying neural networks on hardware platforms [
2,
3]. These hardware platforms include embedded devices, Internet of Things (IoT) applications, and field-programmable gate array (FPGA) boards [
4,
5]. It is also necessary to strike a balance between performance and functional flexibility when realizing a feasible neural network model design [
6,
7].
In order to solve the problem of simplifying and deploying sigmoid functions, various fitting methods, such as look-up table, the coordinate rotation digital computer (CORDIC) algorithm, Taylor series expansion, polynomial, the piecewise method, and the hybrid method, have been proposed to implement the sigmoid function on hardware. The look-up table [
8] is a direct method of fitting a sigmoid function according to preset values. It requires all input and output values to be saved in memory and reads outputs based on inputs. While the accuracy of the outputs can be extremely high, it consumes too much storage space to save high-precision values. The CORDIC method [
9] converts the sigmoid function into simple operations, such as addition and shifting through multiple iterations. Although this method does not involve multiplication, it requires the use of multiple lookup tables and additions; as a result, its hardware resource consumption is also too high for many applications. The Taylor series expansion method [
10] and polynomial method [
11] fit sigmoid functions with high-order expressions, which also consume a large amount of hardware resources. Moreover, the referenced hybrid method [
12] applies different kinds of fitting methods together, which requires substantial storage space and many complex operations.
Compared with the methods above, the piecewise fitting method has clear function expressions, consumes few hardware resources, and achieves high fitting accuracy [
13]. It has thus become a mainstream sigmoid function fitting method. The basic principle of the piecewise fitting method is to divide the sigmoid function into several regions in a specific piecewise manner, then use a different expression for each region to replace the original function, thereby achieving the purpose of fitting the original function [
14]. Piecewise fitting methods can be divided into piecewise linear fitting and piecewise nonlinear fitting methods. The latter has higher fitting accuracy but consumes more hardware resources, while the former can achieve the same fitting accuracy by using more segments, without needing high-order operations. Thus, the piecewise linear fitting method is better suited to achieving high speed with few hardware resources on an FPGA.
Savich divides the sigmoid function into five segments in the range of [−8, 8] and uses a linear fitting method with both adders and multipliers [
15]. Armato uses the area conservation method to divide the sigmoid function into 16 non-uniform segments for linear fitting [
16]. Ngah proposes a fitting method that combines piecewise binomial function and look-up table approaches. In more detail, this method uses a piecewise second-order nonlinear function method and look-up table method to perform the fitting for the first time, then adds and subtracts the output value to improve the fitting accuracy [
12]. Campo divides the sigmoid function into 12 segments and then uses the second-order Taylor expansion formula to fit every segment [
10]. Gomar uses the approximate calculation method for exponent to fit the sigmoid function [
17]. Pandit applies Chebyshev’s polynomial approximation for efficient hardware realization [
18]. Mitra proposes a 16-segment linear fitting algorithm based on adders, multipliers, and logic blocks [
14]. Zamanlooy uses a continuous valued number system to linearly fit the sigmoid and applies the continuous modular compression operation to reduce the width of the numbers [
19]. Nguyen divides the sigmoid function into 12 segments and chooses the parameters of the linear function based on the distribution probability of input values [
20]. Pan combines the piecewise linear (PWL) approximation, Taylor series approximation, and Newton–Raphson method-based approximation methods together to implement the sigmoid function efficiently [
21].
The piecewise fitting methods described above have advantages in terms of their utilization of hardware resources; however, their recognition accuracies and processing speed on hardware need some improvement [
22,
23]. This paper accordingly chooses several abscissas of potential piecewise points based on the curvatures of the sigmoid function. We solve the function expressions in every segment between piecewise points using the least squares method. We then compare the absolute errors of the different fitting functions obtained from the potential piecewise points and choose a single function expression as the hardware implementation scheme to achieve higher fitting accuracy and reduce hardware resource consumption. Finally, we realize the piecewise linear fitting function on the specified FPGA platform to simulate the inference stage of neural networks. The circuits for different segments work in parallel to calculate the outputs of these segments, and the multiplexer selects the result of one segment as the final output of the fitting function based on the range of the input value. The clock frequency of this circuit design and the recognition accuracies on the MNIST dataset show that our PieceWise Linear fitting method based on Curvatures (PWLC) has lower latency and achieves higher accuracies in the specified neural networks.
The contributions of this paper lie in the following three aspects:
This paper proposes a new method to select potential piecewise points based on the curvature values of the sigmoid function. Piecewise points are dynamically selected in the specified range according to curvature values.
This paper develops an approximate comparison scheme for PWLC to determine the proper expression of the piecewise linear function. The comparison is based on the values of maximum absolute errors and average absolute errors.
This paper presents a high-speed hardware design for PWLC. The circuit implemented on the FPGA development platform can achieve the lowest end-to-end latency at higher clock frequency with the use of additional hardware resources.
The remainder of this paper is organized as follows.
Section 2 presents the principle of solving expressions of the piecewise linear fitting function based on curvature analysis.
Section 3 outlines the comparison scheme for the expressions of piecewise function solutions.
Section 4 describes the module design of the piecewise linear function.
Section 5 presents the experimental results and compares them with those of other papers. Finally,
Section 6 concludes the paper.
2. Piecewise Linear Fitting Method Based on Curvature Values
The sigmoid function is continuous and monotonic over its domain. Unlike the piecewise linear ReLU function, the derivative of the sigmoid function changes continuously over its domain and exhibits nonlinear characteristics. The sigmoid function is close to its saturation values (0 or 1) at both ends, where its value barely changes. Between the saturation areas, the function graph is clearly nonlinear and can be fitted by several linear functions with different slopes. In the middle of this range, the sigmoid function changes more drastically and needs to be fitted with more linear functions with different slopes.
The derivative can describe how fast a function changes. As can be seen from
Figure 1, when $x = 0$, the derivative of the sigmoid function reaches its maximum value. Although the value of the sigmoid function changes drastically near 0, the shape is relatively straight and can be approximated as a linear function. When
, the derivative value of sigmoid function is small, and it can be observed that the sigmoid function has a greater degree of curvature near
. In particular, the derivative values of the two functions
and
are the same at
, but the curvature values of the two functions at zero are obviously different. Thus, the derivative cannot intuitively describe the deviation between a curve and a straight line, especially the degree of curvature at a single point. Unlike the derivative, the curvature is defined as the rate of change of the tangent direction angle at a point on the curve with respect to the arc length. This indicator describes the degree to which a curve deviates from a straight line: the larger the curvature value, the more sharply the curve bends. The expression of curvature is defined as follows.
$$K = \left| \frac{d\alpha}{d\ell} \right| \tag{1}$$
In the equation above, $\alpha$ is the tangent direction angle, while $\ell$ is the arc length of the curve. The original expression of the sigmoid function is given below.
$$f(x) = \frac{1}{1 + e^{-x}} \tag{2}$$
The expression of the first derivative for the sigmoid function is given below.
$$f'(x) = \frac{e^{-x}}{\left(1 + e^{-x}\right)^{2}} \tag{3}$$
The expression of the second derivative for the sigmoid function is given below.
$$f''(x) = \frac{e^{-x}\left(e^{-x} - 1\right)}{\left(1 + e^{-x}\right)^{3}} \tag{4}$$
Thus, the expression of the curvature value for the sigmoid function is as follows.
$$K(x) = \frac{\left| f''(x) \right|}{\left(1 + f'(x)^{2}\right)^{3/2}} \tag{5}$$
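As a quick numerical check of the curvature analysis, the following Python sketch evaluates the derivative and the curvature of the sigmoid using the standard plane-curve formula $K = |f''| / (1 + f'^2)^{3/2}$. The function names and the 0.001 search step are illustrative choices, not taken from the paper.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d1(x):
    """First derivative, written in the equivalent form f(x) * (1 - f(x))."""
    s = sigmoid(x)
    return s * (1.0 - s)

def d2(x):
    """Second derivative, written in the equivalent form f'(x) * (1 - 2 f(x))."""
    return d1(x) * (1.0 - 2.0 * sigmoid(x))

def curvature(x):
    """Curvature of the graph y = f(x): |f''| / (1 + f'^2)^(3/2)."""
    return abs(d2(x)) / (1.0 + d1(x) ** 2) ** 1.5

# Scan [0, 5] on a fine grid to locate the curvature peak on the positive side.
grid = [i * 0.001 for i in range(5001)]
x_peak = max(grid, key=curvature)

print(f"f'(0) = {d1(0.0):.4f}, K(0) = {curvature(0.0):.4f}")
print(f"peak: x = {x_peak:.3f}, K = {curvature(x_peak):.4f}")
```

Under these formulas the derivative is largest at the origin while the curvature vanishes there, and the curvature peaks a little above |x| = 1 with a value slightly below 0.1, which is why the derivative alone is a poor guide for placing piecewise points.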
2.1. Selection of Piecewise Points Based on Curvature Analysis
As shown in
Figure 1, the curvature graph is symmetric about the y-axis. When $x = 0$, the curvature has the minimum value of 0. At this point, the curvature of the sigmoid function is at its smallest and the shape is closest to a straight line. When
, the curvature value tends to be large. When
x is around -1 and 1, the curvature reaches its maximum value (close to 0.1). When
, the curvature value is close to 0. In this range, the sigmoid function can be approximately regarded as a linear function.
From
Figure 1, the derivative of the sigmoid function exhibits obvious changes in the range of [−8, 8] and peaks when $x = 0$, at which point the derivative of the sigmoid function achieves a maximum value of 0.25. If the range of [−8, 8] is subdivided into numerous small segments, the derivative change in each of these small segments will tend to be smooth. In these small areas, the shape of the sigmoid function becomes similar to that of a linear function. As a result, the nonlinear sigmoid function can be fitted by numerous linear functions, which decreases the fitting error. In practice, the sigmoid function can be fitted in the specified range according to the abscissas of the given piecewise points, with each segment range having an independent piecewise function expression. In the saturation regions at both ends, the original function can be approximated by 0 or 1 with little fitting error.
Due to the complexity of the sigmoid curvature function, it is complicated to obtain the maximum of this function. This paper applies systematic sampling for abscissas to reflect the curvature values corresponding to different coordinates of segment intervals. In the range of [−5, 5], the abscissa is equidistant and the scale value is 0.5. Systematic sampling results in a set of samples according to proportional abscissas. This sampling method, with its utilization of equal uniform spacing, can intuitively reflect the changes in curvature value. In the range of [0, 5], there is only one peak value of curvature, and there is also no periodic variability or monotonous change. The abscissas of the positive piecewise points and the corresponding curvature values are listed in
Table 1. A segment interval with a larger curvature requires more piecewise points for linear fitting if we are to reduce fitting errors and improve the numerical accuracy of the fitting function.
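The systematic sampling described above can be sketched as follows. The snippet evaluates the curvature at abscissas spaced 0.5 apart on [0, 5]; the values it prints are computed from the standard curvature formula and are not copied from Table 1, and the variable names are illustrative.

```python
import math

def curvature(x):
    """Curvature of the sigmoid graph at x: |f''| / (1 + f'^2)^(3/2)."""
    s = 1.0 / (1.0 + math.exp(-x))
    d1 = s * (1.0 - s)            # first derivative
    d2 = d1 * (1.0 - 2.0 * s)     # second derivative
    return abs(d2) / (1.0 + d1 * d1) ** 1.5

STEP = 0.5
# Equidistant abscissas 0, 0.5, ..., 5 with their curvature values.
samples = [(i * STEP, curvature(i * STEP)) for i in range(11)]

for x, k in samples:
    print(f"x = {x:3.1f}   K = {k:.5f}")

# Sampled abscissa with the largest curvature; segments around it would
# need the densest piecewise points.
x_max, k_max = max(samples, key=lambda pair: pair[1])
print(f"largest sampled curvature at x = {x_max}")
```

Because the curvature is an even function of x, sampling the positive half is enough; the negative piecewise points follow by mirroring.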
2.2. Solution for Function Fitting Based on Sample Points and Selected Piecewise Points
We select
n points on the sigmoid graph whose abscissas are
$x_1, x_2, \ldots, x_n$, arranged from small values to large values. The corresponding ordinates are
$y_1, y_2, \ldots, y_n$. Due to the monotonicity of the sigmoid function, the corresponding ordinates are in the same order. To weight all data pairs equally, we sample the data with uniform spacing. The data pairs are as presented below.
$$(x_1, y_1),\ (x_2, y_2),\ \ldots,\ (x_n, y_n) \tag{6}$$
Here,
n is the number of sample points of the sigmoid function. The discretized sample points are evenly distributed on the coordinate axis, and the subscripts indicate the ordering of the values. This systematic sampling guarantees sample coverage and accurately describes the data distribution of the original function. In the ranges enclosed by adjacent piecewise points, the sigmoid function can be fitted with linear functions. All
m piecewise points can be selected from the values listed in
Table 1, and the expressions of the piecewise linear fitting functions for the sigmoid function are as follows.
where
is the abscissa of the
m-th piecewise point chosen in
Table 1,
is the ordinate of the
m-th piecewise point,
is the slope of the
m-th segment of the linear fitting function, and the total number of piecewise points is
m. This formula also assumes that the order of the piecewise points is
, which is similar to the order of abscissas among sample points.
The expression in Equation (
7) presents the general form of the piecewise function. Notably, due to the continuity of the sigmoid function, we require the piecewise linear function to be continuous at each piecewise point. Under this premise, the slope and intercept of each linear region depend on the value of the former segment. The function value of the former interval plus the increment can generate the function of the latter interval. The piecewise function can therefore be expressed in the form of the previous function expression
and the numerical increment of this interval
. In other words, each interval builds on the former interval. Accordingly, the expression of the piecewise function is presented below.
By taking advantage of the continuity of the linear fitting function, the number of unknown parameters can be decreased from
to
. This paper applies a custom step function
to express the piecewise functions with a matrix form. The expression of
can be expressed as the following piecewise function with step values including 0 and 1.
Here,
and
is the abscissa of the step point. By using this step function, the relationship between data pairs can be described in matrix form rather than as a piecewise expression. We substitute the values of
n samples and
m piecewise points into the expression in Equation (
8) and obtain an equation containing an unknown
. The equation of the unknown parameter based on the specified data pairs is presented in Equation (
10).
In common cases, the relationship between all sample points
n and piecewise points
m is
. It is worth noting that the solution for
will reduce the fitting errors between the sample points and the corresponding sigmoid function values. To abbreviate the expression of the matrix form, this paper simplifies the matrix operation in the expression below.
where
is the regression matrix of size
,
is the vector consisting of
m parameters, and
Y is the vector comprising
n function values. We use the least squares method to minimize the sum of squared residuals in order to solve the unknown vector
. The solution of
can be expressed using
as follows.
After solving
, the expressions of the piecewise function become clear. To accelerate the processing speed and retain recognition accuracy, the different fitting schemes designed to replace the original sigmoid function need to be compared before hardware implementation occurs. This paper applies the error vector
to describe the differences between fitted values in the piecewise linear function and actual values of the original function in the given data pairs. The error vector
can be expressed as follows.
The error vector has n elements, each being the absolute difference between the fitted value and the actual function value for one data pair. The maximum absolute error is the largest of these deviations between the fitting function and the original function over the data pairs, while the average absolute error is the mean of all elements and reflects the overall deviation of the fitting model. The absolute value is taken element by element. Together, these two indexes measure the errors of the fitting models.
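The least squares procedure of this section can be illustrated with a small pure-Python sketch. It makes several assumptions that are not taken from the paper: the piecewise points in `POINTS` are hypothetical (they are not the values of Table 2), the continuous piecewise linear model is parameterized as the value at the first piecewise point plus one slope per segment, and the normal equations are solved by plain Gaussian elimination.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical piecewise points, symmetric about x = 0.
POINTS = [-8.0, -4.5, -3.0, -2.0, -1.0, 1.0, 2.0, 3.0, 4.5, 8.0]

def basis_row(x, pts):
    """One row of the regression matrix: 1 for the starting value, then the
    part of each segment already traversed at x (clamped increment)."""
    row = [1.0]
    for lo, hi in zip(pts, pts[1:]):
        row.append(min(max(x - lo, 0.0), hi - lo))
    return row

def solve(a, b):
    """Solve a square linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = s / m[r][r]
    return x

def fit(pts, step=0.01):
    """Least squares fit of a continuous piecewise linear function."""
    n = int(round((pts[-1] - pts[0]) / step)) + 1
    xs = [pts[0] + i * step for i in range(n)]
    ys = [sigmoid(x) for x in xs]
    rows = [basis_row(x, pts) for x in xs]
    k = len(pts)
    # Normal equations: (X^T X) theta = X^T Y
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(k)]
    theta = solve(xtx, xty)
    errs = [abs(sum(t * v for t, v in zip(theta, r)) - y)
            for r, y in zip(rows, ys)]
    return theta, max(errs), sum(errs) / len(errs)

theta, e_max, e_avg = fit(POINTS)
print("start value and slopes:", [round(t, 5) for t in theta])
print(f"max abs error = {e_max:.5f}, mean abs error = {e_avg:.5f}")
```

With m piecewise points this parameterization has m unknowns (one starting value and m − 1 slopes), and continuity at every piecewise point holds by construction. Rerunning `fit` with other candidate point sets and comparing the two error figures mirrors the comparison scheme described in the text.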
3. Realization Scheme of PWLC for the Sigmoid Function
Having demonstrated the principle of solving expressions of the linear fitting function with sample data points, we next present the detailed process used to solve the function expression with the PWLC method. To determine the abscissas of the piecewise points with the specified sample points, we need to analyze the characteristics of the sigmoid function.
The sigmoid function is centrally symmetric about the point (0, 0.5). There are obvious saturation areas for the sigmoid function at both ends of the x-axis. When $x = 8$, the sigmoid function's value is equal to 0.9997, which is very close to its upper limit of 1. When $x > 8$, the value of the fitting function can be set to 1. Correspondingly, when $x < -8$, the value of the fitting function can be set to 0. From the curvature graph of the sigmoid function, it can be seen that when $x = 0$, the curvature is also 0. Around $x = 0$, the sigmoid function is almost straight and can be approximated by a straight line. In addition, the ideal linear fitting function is also centrally symmetric about the point (0, 0.5). If $x = 0$ were set as a piecewise point, the slopes and intercepts of the segments on either side would be equal, so the two piecewise expressions would be the same; thus, it is unnecessary to set $x = 0$ as a piecewise point.
Due to the central symmetry of the linear fitting function, the piecewise points on the positive half of the x-axis and those on the negative half generally appear in pairs. Therefore, the number of piecewise points of the linear fitting function is set to an even number. The number of piecewise points here is set to 4, 6, 8, and 10, and the total number of segment intervals, including the two ranges of $x < -8$ and $x > 8$, is 5, 7, 9, and 11, respectively. The abscissas of the piecewise points need to be symmetric about $x = 0$ to satisfy the central symmetry condition of the fitting function. Under the premise of central symmetry, we can define the abscissas independently. It should be noted that the piecewise points need to be set relatively densely in the segment intervals with larger curvature values so that more linear functions can be used to fit the parts of the sigmoid function that bend more strongly. The number of abscissas is even, and the selection of abscissas is based on the analysis of curvatures. This paper selects several representative abscissas in the range of [−8, 8]. The abscissas of the piecewise points and the corresponding numbers of abscissas are shown in
Table 2, arranged from small numbers to large numbers.
According to the abscissas selected in
Table 2 and the method proposed to solve the piecewise linear fitting function, we can solve the expressions of the functions with different numbers of piecewise points based on the analysis of curvature values discussed above. The piecewise function expressions give the slopes and intercepts as decimals, accurate to five decimal places. In fact, the slopes and intercepts in each piecewise linear function interval have redundant mantissas in decimal format. Due to the limited storage space available for fixed-point numbers on the FPGA platform, it is sufficient to present the leading significant digits. These decimal numbers can be converted to binary numbers with a limited number of mantissa bits in the storage space of the FPGA platform.
The expression of a piecewise linear fitting function with four piecewise points
is as follows.
The expression of a piecewise linear fitting function with six piecewise points
is as follows.
The expression of a piecewise linear fitting function with eight piecewise points
is as follows.
The expression of a piecewise linear fitting function with 10 piecewise points
is as follows.
The expression of a piecewise linear fitting function with 12 piecewise points
is as follows.
The expression of a piecewise linear fitting function with 14 piecewise points
is as follows.
Based on the fitting function expressions above (with the interval length of sample points set to 0.01), we obtain the maximum absolute errors and the average absolute errors shown in
Figure 2. As can be seen from the figure, the absolute error decreases as the number of piecewise points increases. In more detail, when the number of piecewise points increases to 10, the maximum error converges to about 0.007, while the average error converges to about 0.001. The point at which the number of piecewise points is 10 is therefore the elbow point in
Figure 2. When the number of piecewise points is less than 10, the absolute error of the piecewise function is relatively large, and the fit to the original sigmoid function is not ideal. When the number of piecewise points is greater than 10, both absolute errors converge to small, nearly fixed values without significant change. Notably, a higher number of piecewise points increases the complexity of the piecewise function, raising the power consumption and consuming additional hardware resources with little improvement in the absolute fitting errors. Accordingly, considering the absolute error and the complexity of the function, this paper uses 10 as the number of piecewise points.
Apart from the selection of piecewise points, it is possible for the interval length of sample points to have effects on the maximum absolute errors and average absolute errors. This paper sets the interval lengths of sample points to 0.5, 0.1, 0.05, 0.01, and 0.005. The corresponding numbers of sample points
n in the range of [−8, 8] are 32, 160, 320, 1600, and 3200, respectively. The maximum absolute errors and average absolute errors between the fitting function and the original sigmoid function at different interval lengths are presented in
Table 3.
It can be seen that differences in sample point interval lengths have a significant impact on the absolute errors of the fitting function. As the sample point interval length is reduced, the selected sample points become more dense, the maximum error tends to be larger, and the average error becomes smaller. When the sample point interval length is 0.01, these two absolute errors converge to fixed values at the same time. It is worth noting that the maximum error does not accurately reflect all of the fitting function’s deviation from the original function, even when the sample point interval length is small. Due to the sparsity of the sample points, the maximum error cannot include other non-sample points. The average error can thus describe the relative deviation between the fitting function and the original function more generally.
When the sample point interval length is sufficiently small and the selection of sample points is sufficiently dense, both the maximum error and the average error can reflect the degree to which the linear fitting function deviates from the sigmoid function. When the sampling interval length is less than 0.01 in
Table 3, the values of the maximum error and the average error remain unchanged. More sampling points cannot improve the fitting effect; instead, they cause the unknown parameter matrix
to be larger, increase the amount of calculation required to solve unknown parameters, raise the complexity of the fitting function expression, and cost a large amount of additional computational time. Therefore, it is both appropriate and reliable to set the sampling interval to 0.01.
4. Hardware Design for the Circuit of PWLC Method
We implement our PWLC method on the Xilinx FPGA (XC7V2000) with the Vivado design suite [
24]. In the specified range of [−8, 8], all the values involved in computing the sigmoid function, including input values, slopes, intercepts, and output values, lie within [−8, 8]. The circuit uses 16-bit fixed-point numbers to store all values. Each fixed-point number consists of a 1-bit sign part, a 3-bit integer part, and a 12-bit mantissa (fractional) part. If the input value is outside the range of [−8, 8], it is stored as the saturation value of the 16-bit fixed-point format. The output value for an input of −8 is 0, while the output value for an input of 8 is 1. All input values have corresponding output values expressed in this 16-bit fixed-point format. The original decimal numbers are converted to 16-bit fixed-point numbers with minimal accuracy loss.
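The 16-bit format described above (1 sign bit, 3 integer bits, 12 fractional bits) can be modelled in a few lines of Python. This is a sketch under assumptions not stated in the paper: it uses two's-complement encoding and round-to-nearest, and the names `to_fixed` and `from_fixed` are illustrative.

```python
FRAC_BITS = 12
SCALE = 1 << FRAC_BITS                    # 4096 codes per unit
Q_MIN, Q_MAX = -(1 << 15), (1 << 15) - 1  # signed 16-bit code range

def to_fixed(value):
    """Quantize a real number to a signed 16-bit code, saturating at the ends."""
    code = int(round(value * SCALE))
    return max(Q_MIN, min(Q_MAX, code))

def from_fixed(code):
    """Convert a 16-bit code back to a real number."""
    return code / SCALE

for v in (0.25, 0.23105, -3.7, 8.0, 20.0):
    q = to_fixed(v)
    print(f"{v:>8} -> code {q:>6} -> {from_fixed(q):.6f}")
```

Under the two's-complement assumption the largest representable value is 8 − 2^(−12), so an input of exactly 8 saturates to that code, and the quantization error of any in-range value is at most half of 2^(−12).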
It is time consuming to calculate different segments of the optimized function in series and then output the result for the specified segment. This paper accordingly designs a structure that allows 11 segments to be processed in parallel, after which one result is chosen as the output according to the range of the input value. The hardware computes the arithmetic results of the input value in nine segments and selects one result based on comparisons between the input value and all piecewise points. The arithmetic (multiplying and adding) and range selection operations are performed in parallel to decrease the end-to-end latency. The input value is compared with 10 piecewise points, and the range of the input value is determined. As some slopes of the piecewise function are equal, we reuse some multipliers and connect two different adders after each multiplier. This design reduces the number of multipliers required by four and therefore reduces the hardware resources. The hardware realization structure of PWLC is shown in
Figure 3.
The comparators compare the input value with all piecewise points without the trigger of the clock. When the input value is larger than a piecewise point, the comparator output is set to 1; otherwise, the comparator outputs 0. The expression for the comparator output is as follows.
The relationship among the 10 outputs of the different comparators and the data ports of the multiplexer is summarized in
Table 4. The selected function value of the multiplexer is based on the range of input value. The multiplexer then outputs the calculation results according to the comparison results.
According to the linear function expression with 10 piecewise points, the operations in different ranges can be realized by some combination of multipliers and adders. These multipliers truncate the sign bit and the high 15-bit data as output when multiplying two pieces of 16-bit input data. For the multipliers and adders in different ranges, the slopes and the intercepts are inputs of the multiplier and adder, respectively. Ranges that are symmetrical on the y-axis are brought close to each other to reuse the multiplier with the same slope value. Five multipliers and nine adders work in parallel and output nine results corresponding to different ranges. Based on the encoding table, the multiplexer chooses the specified value as the final output according to its segment selection.
The circuit for the PWLC method has input and output registers, so it produces an output value within two clock cycles. The multipliers, adders, and comparators between these registers are combinational circuits, so the whole computation has a latency of two clock cycles. The parallel processing procedure eliminates the read-after-write dependency between the selection signal and the outputs of the nine adders. Moreover, the parallel operation scheme across two data paths makes full use of hardware resources and increases the computational efficiency. The computation modules thus work together at high speed and take little computation time.
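The datapath described in this section can be mimicked by a behavioral Python model: 10 comparators, five shared multipliers, nine adders, and an 11-way multiplexer. The sketch rests on assumptions that differ from the paper: the piecewise points in `POS` are hypothetical, the segment coefficients come from simple chord interpolation rather than the least squares solution, and floating-point arithmetic stands in for the 16-bit fixed-point hardware. Slope sharing follows from central symmetry: if a positive-side segment is y = kx + b, its mirrored segment is y = kx + (1 − b).

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical positive-side piecewise points (not the values of Table 2).
POS = [1.0, 2.0, 3.0, 4.5, 8.0]
POINTS = [-p for p in reversed(POS)] + POS          # 10 piecewise points

def chord(lo, hi):
    """Slope and intercept of the line through the curve at lo and hi."""
    k = (sigmoid(hi) - sigmoid(lo)) / (hi - lo)
    return k, sigmoid(lo) - k * lo

MID = chord(-POS[0], POS[0])                         # segment containing x = 0
RIGHT = [chord(a, b) for a, b in zip(POS, POS[1:])]  # four positive segments
SLOPES = [MID[0]] + [k for k, _ in RIGHT]            # five distinct slopes

def pwl_sigmoid(x):
    # Comparators: one bit per piecewise point, 1 when x is larger.
    bits = [1 if x > p else 0 for p in POINTS]
    select = sum(bits)               # thermometer code -> segment index 0..10

    # Five multipliers, one per distinct slope.
    prod = [k * x for k in SLOPES]

    # Nine adders; mirrored segments reuse a product with intercept 1 - b.
    results = [0.0]                                  # x <= -8: saturate to 0
    for i in range(4, 0, -1):                        # negative side, left first
        results.append(prod[i] + (1.0 - RIGHT[i - 1][1]))
    results.append(prod[0] + MID[1])                 # middle segment
    for i in range(1, 5):                            # positive side
        results.append(prod[i] + RIGHT[i - 1][1])
    results.append(1.0)                              # x > 8: saturate to 1

    # Multiplexer: pick the result of the segment that contains x.
    return results[select]

worst = max(abs(pwl_sigmoid(i * 0.01 - 8.0) - sigmoid(i * 0.01 - 8.0))
            for i in range(1601))
print(f"max abs error on the 0.01 grid: {worst:.4f}")
```

All nine sums are formed regardless of the input, as in the parallel hardware structure, and only the final selection depends on the comparator outputs. The actual mapping between comparator outputs and multiplexer ports is the one given in Table 4; summing the comparator bits is just one convenient software equivalent.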
5. Results and Comparisons
We list the timing characteristics and hardware resources from the referenced papers in
Table 5. The FPGA series names in
Table 5 are all Virtex. We present the detailed data of the minimum input arrival time before clock and the maximum output required time after clock [
25,
26]. The minimum input arrival time before clock of the proposed circuit is 8.559 ns and the maximum output required time after clock is 8.860 ns. Due to the high processing speed requirement, the timing characteristics include clock frequency and circuit latency. The clock frequency of our circuit design is 208.3 MHz, while the whole end-to-end latency is 9.6 ns. The comparisons of hardware resources include flip-flop (FF), look-up table (LUT), and digital signal processor (DSP). We find that our design can achieve high frequency; the primary reason for this relates to our circuit design with two parallel data paths. This design, with higher numbers of FFs and LUTs to realize parallel processing, can achieve the lowest latency when implementing our circuit without the use of DSPs. Given the advantages in terms of processing latency, the hardware resource of LUT overhead is acceptable in practical scenarios, while the design of Campo [
10] may exceed the DSP resources and that of Gomar [
17] has more FF usage.
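The reported clock frequency and end-to-end latency are consistent with each other if the latency is read as two clock periods (one for the input register stage and one for the output register stage). This reading is an assumption based on the two-clock operation described in Section 4; the symbols $T_{\mathrm{clk}}$ and $t_{\mathrm{lat}}$ are introduced here only for illustration.

```latex
T_{\mathrm{clk}} = \frac{1}{208.3\ \mathrm{MHz}} \approx 4.80\ \mathrm{ns},
\qquad
t_{\mathrm{lat}} \approx 2\,T_{\mathrm{clk}} \approx 9.6\ \mathrm{ns}
```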
This paper proposes maximum absolute error and average absolute error to describe the deviations between the sigmoid function and the piecewise linear fitting function. The maximum error presents the largest deviation, while the average error gives the overall deviation of all samples in the domain of definition. The comparisons of maximum absolute errors and average absolute errors among different methods [
20] are presented in
Table 6. As the table shows, the proposed method has the smallest maximum absolute error among all methods and the second smallest average absolute error. Moreover, compared with the method that achieves the smallest average absolute error (proposed by Armato [
16]), our method has fewer segments in the specified range. This lower number of segments can decrease the hardware complexity and reduce hardware resource consumption. In short, our hardware implementation design achieves high fitting accuracies with few FFs and no usage of DSPs.
This paper further applies the piecewise linear fitting function to recognize different handwritten numerals in the MNIST dataset with a specified deep neural network (DNN) and a convolutional neural network (CNN). The structures of the DNN (comprising five fully connected layers) and the CNN (consisting of two convolutional layers, two pooling layers, and two fully connected layers) are given in
Table 7. Based on the specified DNN and CNN structures, this paper compares the recognition accuracies of different fitting methods on the MNIST dataset. The recognition accuracy can intuitively reflect the actual effect of the proposed linear fitting method; a higher recognition accuracy indicates that the design is more trustworthy in practical use.
According to the contents of
Table 8, the hardware implementation of the linear fitting function proposed in this paper achieves a higher recognition rate than other methods. Compared with the second-highest recognition accuracy obtained by Nguyen [
20], our PWLC method increases the accuracy with DNN by 0.06% and the accuracy with CNN by 0.23%. Moreover, the recognition rate of the linear fitting function circuit applied in the deployment of DNN is even higher than that of the original sigmoid function. This may be because all the middle layers of the DNN are fully connected layers, and the linear fitting function is expressed in a piecewise hierarchical form, which facilitates precise feature extraction with discrete values and results in high recognition rates. For its part, the original nonlinear sigmoid function may aggravate the error transmission in the inference process. Thus, the recognition rates of DNN with the linear fitting function are superior to those of the original nonlinear sigmoid function.