Article

High Precision Multiplier for RNS {2n−1,2n,2n+1}

National Key Laboratory of Science and Technology on Communications, University of Electronic Science and Technology of China, 510, Main Building, No. 2006, Xiyuan Ave, West Hi-Tech District, Chengdu 611731, China
* Author to whom correspondence should be addressed.
These authors contributed equally to this work.
Electronics 2021, 10(9), 1113; https://doi.org/10.3390/electronics10091113
Submission received: 4 March 2021 / Revised: 1 May 2021 / Accepted: 3 May 2021 / Published: 8 May 2021
(This article belongs to the Section Circuit and Signal Processing)

Abstract

The Residue Number System (RNS) is a non-weighted number system. Benefiting from its inherent parallelism, RNS has been widely studied and used in Digital Signal Processing (DSP) systems and cryptography. However, since the dynamic range of an RNS is fixed by its moduli set, the overflow problem is hard to solve, whereas in the Two's Complement System (TCS) it can easily be solved by expanding the bit-width. For multiplication in RNS, the traditional way to deal with overflow is to scale down the inputs so that the result falls within the dynamic range, but this leads to a loss of precision. In this paper, we propose a high precision RNS multiplier for the three-moduli set $\{2^n-1, 2^n, 2^n+1\}$, which is the most commonly used moduli set. The proposed multiplier effectively improves the calculation precision by adding several compensation items to the result. The compensation items can be obtained directly from the preceding scalers with little extra effort. To the best of our knowledge, this is the first high precision RNS multiplier proposed for the moduli set $\{2^n-1, 2^n, 2^n+1\}$. Simulation results show that the proposed RNS multiplier achieves almost the same calculation precision as the TCS multiplier in terms of Mean Square Error (MSE) and Signal-to-Noise Ratio (SNR), outperforming the basic scaling RNS multiplier by about 2.6–3 times in SNR.

1. Introduction

The Residue Number System (RNS) is a non-weighted, parallel numerical representation system, which splits an integer into multiple independent residues through modular operations. Thus, the bit-width of each channel is greatly reduced, and RNS-based systems have the potential to achieve high calculation speed and low complexity. RNS is well suited to processing large integers, which makes it extremely useful in cryptography [1], such as Elliptic Curve Cryptography (ECC) [2] and Lattice-Based Cryptography (LBC) [3]. RNS has also been widely used in DSP units, such as digital Finite Impulse Response (FIR) filters [4,5], adaptive filters [6], 8-point [7,8] and 16-point [8] Discrete Cosine Transforms (DCT), and Discrete Fourier Transforms (DFT) [9]. However, since RNS calculation is defined on modular operations, some basic operations remain challenging, such as sign detection [10,11], magnitude comparison [12], residue-to-binary (R/B) conversion [13], and scaling [14,15], which limits the wider application of RNS.
In DSP applications, overflow in fixed-point representation is a common issue when the dynamic range is limited. It mostly happens in multiplication and addition operations, especially in applications with cascaded architectures, such as the Fast Fourier Transform (FFT). For TCS, this issue can easily be addressed by expanding the bit-width of intermediate computation results and then converting them back to the original bit-width. This means that the precision of the input operands is not lost, and the computation accuracy is determined only by the final conversion step. Meanwhile, the bit-width expansion step is very simple. In short, overflow can be handled simply and accurately in TCS fixed-point calculation.
However, since the dynamic range of an RNS is determined by its moduli set and expanding the dynamic range in RNS is difficult, it is much more complicated to avoid overflow than in TCS. Usually, there are three approaches to handle the overflow problem in RNS. Figure 1 illustrates the basic structure of these three approaches, taking a three-moduli RNS as an example.
(1) The first approach is based on scaling, as shown in Figure 1a. The input operands are first scaled down to ensure that the product falls within the dynamic range of the RNS [16]. We call this multiplier the basic scaling RNS multiplier. Unfortunately, the scaling operation inevitably reduces the precision of the inputs, resulting in a loss of precision in the product.
(2) The second approach is based on base conversion, as shown in Figure 1b. The input operands are first converted to a new RNS with a larger dynamic range to avoid overflow, and the multiplication results are then scaled back down to the original RNS. This approach is helpful in some applications, such as FIR filters. However, in DSP algorithms with cascaded multiplications or feedback structures, such as FFT computation and IIR filters, the overall dynamic range can vary significantly, and the overhead of base conversion becomes unacceptable.
(3) The last approach is based on base extension, as shown in Figure 1c. When the dynamic range is insufficient, one or more bases are added to extend the dynamic range of the original RNS. The base extension operation is still too complicated to be acceptable.
None of the above approaches achieves performance comparable to that of TCS: the first loses accuracy, while the latter two require complex algorithms and large hardware resources.
In RNS multiplier research, previous work mainly focuses on the efficient implementation of multipliers for specific moduli. Chen [17] proposed an efficient modulo $2^n+1$ multiplier. Muralidharan [18] proposed a high dynamic range modulo $2^n-1$ multiplier. Zimmermann [19] proposed a joint implementation of modulo $(2^n \pm 1)$ multipliers. Hiasat [20] proposed a generic multiplier for an arbitrary modulus. Although these implementations can efficiently and accurately calculate the modular multiplication in each RNS channel, the precision loss caused by scaling before the modular multiplier is ignored, and these designs did not consider the overflow problem caused by multipliers in specific DSP applications.
In this paper, we propose a high precision, overflow-free RNS multiplier design method. The proposed RNS multiplier follows an idea similar to overflow avoidance in TCS to achieve high computation accuracy and low complexity. Throughout this paper, we use the widely used three-moduli set $\{2^n-1, 2^n, 2^n+1\}$ to illustrate the proposed RNS multiplier design. The proposed multiplier improves the calculation precision by adding several compensation items to the result computed from the scaled inputs. The compensation items can be obtained directly from the preceding RNS scalers with little extra effort. Figure 2 illustrates the overall structure of the proposed RNS multiplier.
The rest of this paper is organized as follows. Section 2 introduces the basic theory of RNS. Section 3 proposes two joint scalers and two high precision RNS multipliers. Section 4 presents the hardware structures of the proposed multipliers and scalers. Section 5 analyzes the calculation performance of the proposed RNS multipliers and reports their Very Large Scale Integration (VLSI) implementation results. Finally, Section 6 concludes the paper.

2. Introduction to RNS

RNS is defined by a moduli set $\{m_1, m_2, \ldots, m_L\}$, where $m_i$ and $m_j$ ($i, j = 1, 2, \ldots, L$) are coprime when $i \neq j$. An integer $X$ can be represented as
$X \leftrightarrow \{x_1, x_2, \ldots, x_L\},$ (1)
where $x_i$ is the residue of $X$ modulo $m_i$, denoted as $x_i = |X|_{m_i}$. Let $M = \prod_{i=1}^{L} m_i$; then $M$ is called the dynamic range of the RNS, that is, $X \in [0, M-1]$. According to the rules of modular operations [21], for operands $X$ and $Y$, we have
$|X \pm Y|_m = \big| |X|_m \pm |Y|_m \big|_m, \quad |X Y|_m = \big| |X|_m |Y|_m \big|_m, \quad |k X|_{k m} = k |X|_m.$ (2)
For two coprime moduli $m_1$ and $m_2$, the modular operation has the property
$\big| |X|_{m_1 m_2} \big|_{m_1} = |X|_{m_1}.$ (3)
The Chinese Remainder Theorem (CRT) is one of the fundamental theorems of RNS and is commonly used in scaling, Residue-to-Binary (R/B) conversion, and so on. If an integer $X \in [0, M-1]$, then
$X = \left| \sum_{i=1}^{L} x_i M_i \big| M_i^{-1} \big|_{m_i} \right|_M,$ (4)
where $M_i = M/m_i$ and $\big| M_i^{-1} \big|_{m_i}$ is the multiplicative inverse of $M_i$ modulo $m_i$, that is, $\big| M_i \big| M_i^{-1} \big|_{m_i} \big|_{m_i} = 1$.
In this paper, our proposed scalers and multipliers are dedicated to the RNS $\{2^n-1, 2^n, 2^n+1\}$. For ease of notation, we let $m_1 = 2^n-1$, $m_2 = 2^n$, and $m_3 = 2^n+1$, so that the residues are $x_1 = |X|_{m_1}$, $x_2 = |X|_{m_2}$, and $x_3 = |X|_{m_3}$.
Scaling is essentially a constant division operation, and the divisor is called the scaling factor. The form of the moduli set and the scaling factor are the two main factors determining the complexity of the scaler. For an integer $X$ and a scaling factor $K$, the scaling result can be computed by
$Y = \left\lfloor \frac{X}{K} \right\rfloor = \frac{X - |X|_K}{K},$ (5)
where $\lfloor \cdot \rfloor$ denotes the floor operation. Different from TCS, the operand $X$ in RNS is represented by multiple residues, and the final scaling result should also be represented by multiple residues.
For the RNS with moduli set $\{m_1, m_2, m_3\} = \{2^n-1, 2^n, 2^n+1\}$, we have
$M = m_1 \times m_2 \times m_3 = 2^{3n} - 2^n, \quad M_i = \{2^{2n}+2^n,\ 2^{2n}-1,\ 2^{2n}-2^n\}, \quad \big| M_i^{-1} \big|_{m_i} = \{2^{n-1},\ -1,\ 2^{n-1}+1\},$ (6)
where $i = 1, 2, 3$. Substituting (6) into (4), we get
$X = \left| 2^{n-1}(2^{2n}+2^n)x_1 - (2^{2n}-1)x_2 + (2^{n-1}+1)(2^{2n}-2^n)x_3 \right|_M = 2^{n-1}(2^{2n}+2^n)x_1 - (2^{2n}-1)x_2 + (2^{n-1}+1)(2^{2n}-2^n)x_3 - I M,$ (7)
where $I$ is an integer chosen to ensure $0 \le X \le M-1$.
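To make the notation concrete, the following short Python sketch (illustrative only; the choice $n = 5$ and the function names are ours, not from the paper) checks the forward conversion (1), the constants in (6), and the CRT reconstruction (4) for this moduli set.

```python
n = 5
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3                                   # dynamic range, equals 2^(3n) - 2^n

def to_rns(x):
    """Forward conversion X -> {x1, x2, x3} as in (1)."""
    return (x % m1, x % m2, x % m3)

def from_rns(r):
    """CRT reconstruction (4) using the constants of (6)."""
    x1, x2, x3 = r
    M1, M2, M3 = M // m1, M // m2, M // m3         # 2^(2n)+2^n, 2^(2n)-1, 2^(2n)-2^n
    inv1, inv2, inv3 = 2**(n - 1), -1, 2**(n - 1) + 1
    return (x1 * M1 * inv1 + x2 * M2 * inv2 + x3 * M3 * inv3) % M

assert M == 2**(3 * n) - 2**n
for X in range(0, M, 97):                          # spot-check the round trip
    assert from_rns(to_rns(X)) == X
```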

3. Design of High Precision RNS Multiplier

3.1. Joint RNS Scaler for Moduli Set $\{2^n-1, 2^n, 2^n+1\}$

The proposed RNS multiplier needs the scaling results for the scaling factors $m_1$, $m_3$, $m_1 m_2$, and $m_2 m_3$. In general, four scalers would be required to implement them. To reduce the hardware complexity, we propose two joint scalers, each of which produces the scaling results for two scaling factors at the same time: one for scaling factors $m_1$ and $m_1 m_2$, and the other for scaling factors $m_3$ and $m_2 m_3$. As shown in the following derivation, the scaling results for scaling factors $m_1 m_2$ and $m_2 m_3$ can be obtained from intermediate products of the $m_1$ scaler and the $m_3$ scaler, respectively. Therefore, we can greatly reduce the hardware consumption by combining them into two joint scalers.
According to (7), we derive the calculation methods for these four scaling factors, $m_1$, $m_3$, $m_2 m_3$, and $m_1 m_2$, respectively.

3.1.1. Scaling Factor $K = m_3$

According to (7), the scaling operation can be represented as
$\frac{X}{m_3} = 2^{2n-1}x_1 - (2^n-1)x_2 + (2^{2n-1}-1)x_3 - I m_1 m_2 + \frac{x_3}{m_3}.$ (8)
In RNS, all residues are integers and $x_3 < m_3$, so $x_3/m_3 < 1$. Thus, applying the floor operation to (8), we get
$\left\lfloor \frac{X}{m_3} \right\rfloor = 2^{2n-1}x_1 - (2^n-1)x_2 + (2^{2n-1}-1)x_3 - I m_1 m_2 = \left| 2^{2n-1}x_1 - (2^n-1)x_2 + (2^{2n-1}-1)x_3 \right|_{m_1 m_2}.$ (9)
Mapping (9) to residue channels $m_1$ and $m_2$ and simplifying, we obtain
$X_{3,1} = \left| \left\lfloor \frac{X}{m_3} \right\rfloor \right|_{m_1} = \left| 2^{n-1}x_1 + (2^{n-1}-1)x_3 \right|_{m_1} = \left| 2^{n-1}x_1 - 2^{n-1}x_3 \right|_{m_1}, \quad X_{3,2} = \left| \left\lfloor \frac{X}{m_3} \right\rfloor \right|_{m_2} = \left| x_2 - x_3 \right|_{m_2}.$ (10)
Because $0 \le X \le M-1$, we have $0 \le \lfloor X/m_3 \rfloor \le m_1 m_2 - 1$. By using the CRT, $\lfloor X/m_3 \rfloor$ can be uniquely represented by the residues of channels $m_1$ and $m_2$. However, in most DSP-oriented applications, the residue of channel $m_3$ is also required to match the original three-moduli set. Then,
$\left\lfloor \frac{X}{m_3} \right\rfloor = \left| 2^n X_{3,1} - (2^n-1) X_{3,2} \right|_{m_1 m_2} = \left| 2^n \left| X_{3,1} - X_{3,2} \right|_{m_1} + X_{3,2} \right|_{m_1 m_2}.$ (11)
Because $0 \le 2^n \left| X_{3,1} - X_{3,2} \right|_{m_1} + X_{3,2} < m_1 m_2$, we can get
$X_{3,3} = \left| 2^n \left| X_{3,1} - X_{3,2} \right|_{m_1} + X_{3,2} \right|_{m_3} = \left| X_{3,2} - \left| X_{3,1} - X_{3,2} \right|_{m_1} \right|_{m_3}.$ (12)
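The following minimal Python sketch (an illustration with our own naming and $n = 5$, not the paper's hardware description) evaluates the channel results (10) and (12) of the $K = m_3$ scaler directly from the input residues and checks them against ordinary integer division.

```python
n = 5
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3

def scale_by_m3(x1, x2, x3):
    """Residues of floor(X / m3), computed only from (x1, x2, x3)."""
    X31 = (2**(n - 1) * (x1 - x3)) % m1       # channel m1, from (10)
    X32 = (x2 - x3) % m2                      # channel m2, from (10)
    X33 = (X32 - (X31 - X32) % m1) % m3       # channel m3, from (12)
    return X31, X32, X33

for X in range(0, M, 131):
    q = X // m3
    assert scale_by_m3(X % m1, X % m2, X % m3) == (q % m1, q % m2, q % m3)
```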

3.1.2. Scaling Factor $K = m_1$

Similarly to the case $K = m_3$, scaling by $K = m_1$ can be represented as
$\frac{X}{m_1} = (2^{2n-1}+2^n+1)x_1 - (2^n+1)x_2 + 2^n(2^{n-1}+1)x_3 - I m_2 m_3 + \frac{x_1}{m_1}.$ (13)
Then, the values in residue channels $m_2$ and $m_3$ are
$X_{1,2} = \left| \left\lfloor \frac{X}{m_1} \right\rfloor \right|_{m_2} = \left| x_1 - x_2 \right|_{m_2}, \quad X_{1,3} = \left| \left\lfloor \frac{X}{m_1} \right\rfloor \right|_{m_3} = \left| 2^{n-1}x_3 - 2^{n-1}x_1 \right|_{m_3}.$ (14)
For the precise representation of $\lfloor X/m_1 \rfloor$, the residues in channels $m_2$ and $m_3$ are sufficient. However, in most applications, the residue in channel $m_1$ is still needed in the subsequent processing. Thus, by using the CRT, we can get
$\left\lfloor \frac{X}{m_1} \right\rfloor = \left| 2^n \left| X_{1,2} - X_{1,3} \right|_{m_3} + X_{1,2} \right|_{m_2 m_3}.$ (15)
Because $0 \le 2^n \left| X_{1,2} - X_{1,3} \right|_{m_3} + X_{1,2} < m_2 m_3$, we can get
$X_{1,1} = \left| 2^n \left| X_{1,2} - X_{1,3} \right|_{m_3} + X_{1,2} \right|_{m_1} = \left| \left| X_{1,2} - X_{1,3} \right|_{m_3} + X_{1,2} \right|_{m_1}.$ (16)

3.1.3. Scaling Factor $K = m_2 m_3$

Low et al. proposed two scaling structures for scaling factor $K = m_2 m_3$ in [22,23], respectively. However, they produce a unit error when $x_1 < x_2$. In this paper, we propose a scaling structure that obtains the scaling result exactly. According to (7), we can get
$\frac{X}{m_2 m_3} = 2^{n-1}x_1 - x_2 - 2^{n-1}x_3 + x_3 - I m_1 + \frac{x_2 - x_3}{m_2} + \frac{x_3}{m_2 m_3}.$ (17)
Since $x_2$ and $x_3$ are non-negative integers defined on the ranges of $m_2$ and $m_3$, respectively, we have $(x_2 - x_3)/m_2 + x_3/(m_2 m_3) < 0$ if and only if $x_2 - x_3 < 0$. Then, from (17), we can get
$\left\lfloor \frac{X}{m_2 m_3} \right\rfloor = \begin{cases} \left| 2^{n-1}x_1 - x_2 - 2^{n-1}x_3 + x_3 \right|_{m_1}, & x_2 - x_3 \ge 0 \\ \left| 2^{n-1}x_1 - x_2 - 2^{n-1}x_3 + x_3 - 1 \right|_{m_1}, & x_2 - x_3 < 0. \end{cases}$ (18)
When $x_2 - x_3 < 0$, according to (18),
$\left| 2^{n-1}x_1 - x_2 - 2^{n-1}x_3 + x_3 - 1 \right|_{m_1} = \left| 2^{n-1}x_1 - 2^{n-1}x_3 - (x_2 - x_3 + 2^n) \right|_{m_1}.$ (19)
Thus, (19) can be converted to
$\left\lfloor \frac{X}{2^n(2^n+1)} \right\rfloor = \left| 2^{n-1}x_1 - 2^{n-1}x_3 - \left| x_2 - x_3 \right|_{m_2} \right|_{m_1} = \left| X_{3,1} - X_{3,2} \right|_{m_1}.$ (20)
When $x_2 - x_3 \ge 0$, it is obvious that $x_2 - x_3$ lies in the same range as $x_2$, so the modulo $m_2$ operation in (20) does not change the result. Therefore, both cases can be combined into (20).
From (20), we can see that the scaling result for scaling factor $m_2 m_3$ can be calculated directly from the intermediate results of the scaler with scaling factor $m_3$. Because $X \in [0, M-1]$, we have $0 \le \lfloor X/(2^n(2^n+1)) \rfloor < m_1$, so the residues in all channels of the moduli set $\{m_1, m_2, m_3\}$ are equal to $\lfloor X/(2^n(2^n+1)) \rfloor$ itself.
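A short sketch of this relationship (again illustrative; the names and the choice $n = 5$ are ours) confirms that (20) yields the scaling result for $K = m_2 m_3$ using only the intermediate values $X_{3,1}$ and $X_{3,2}$ of the $m_3$ scaler.

```python
n = 5
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3

for X in range(0, M, 113):
    x1, x2, x3 = X % m1, X % m2, X % m3
    X31 = (2**(n - 1) * (x1 - x3)) % m1        # intermediate value of the m3 scaler, (10)
    X32 = (x2 - x3) % m2                       # intermediate value of the m3 scaler, (10)
    assert (X31 - X32) % m1 == X // (m2 * m3)  # equation (20)
```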

3.1.4. Scaling Factor $K = m_1 m_2$

The scaler with scaling factor $m_1 m_2$ has a structure similar to that of $m_2 m_3$. From (7), we can get
$\frac{X}{2^n(2^n-1)} = x_1 - x_2 + 2^{n-1}x_1 - 2^{n-1}x_3 - I m_3 + \frac{x_1 - x_2}{m_2} + \frac{x_1}{m_1 m_2}.$ (21)
Similarly, $(x_1 - x_2)/m_2 + x_1/(m_1 m_2) < 0$ if and only if $x_1 - x_2 < 0$, in which case we have
$\left\lfloor \frac{X}{2^n(2^n-1)} \right\rfloor = \left| (2^n + x_1 - x_2) + 2^{n-1}x_1 - 2^{n-1}x_3 \right|_{m_3}.$ (22)
Then, (22) can be converted to
$\left\lfloor \frac{X}{2^n(2^n-1)} \right\rfloor = \left| \left| x_1 - x_2 \right|_{m_2} + 2^{n-1}x_1 - 2^{n-1}x_3 \right|_{m_3} = \left| X_{1,2} - X_{1,3} \right|_{m_3}.$ (23)
As with scaling factor $m_2 m_3$, the case $x_1 - x_2 \ge 0$ can also be combined into (23).
From (23), we can see that the scaling result for scaling factor $m_1 m_2$ can also be calculated directly from the intermediate results of the scaler with scaling factor $m_1$. Because $0 \le X < M$, we have $0 \le \lfloor X/(2^n(2^n-1)) \rfloor < m_3$, so the value in channel $m_3$ represents the result exactly. If it needs to be mapped to channels $m_1$ and $m_2$, this can be done simply by modular reductions.
In summary, the calculation methods of these four different scaling factors are shown in Table 1.
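The following Python sketch (our own test harness, not part of the paper) exercises all four rows of Table 1, computing every channel value from the input residues alone and comparing it with the true scaling result.

```python
n = 5
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3
half = 2**(n - 1)

def residues(v):
    return (v % m1, v % m2, v % m3)

for X in range(0, M, 101):
    x1, x2, x3 = residues(X)
    # K = m3: equations (10) and (12)
    X31, X32 = (half * (x1 - x3)) % m1, (x2 - x3) % m2
    X33 = (X32 - (X31 - X32) % m1) % m3
    # K = m1: equations (14) and (16)
    X12, X13 = (x1 - x2) % m2, (half * (x3 - x1)) % m3
    X11 = ((X12 - X13) % m3 + X12) % m1
    # K = m2*m3 and K = m1*m2: equations (20) and (23), reusing the values above
    S23 = (X31 - X32) % m1
    S12 = (X12 - X13) % m3
    assert (X31, X32, X33) == residues(X // m3)
    assert (X11, X12, X13) == residues(X // m1)
    assert residues(X // (m2 * m3)) == (S23, S23, S23)
    assert residues(X // (m1 * m2)) == (S12 % m1, S12 % m2, S12)
```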

3.2. High Precision RNS Multipliers for Moduli Set $\{2^n-1, 2^n, 2^n+1\}$

We propose two high precision RNS multipliers based on the CRT. For multiplicands $X$ and $Y$ defined on the moduli set $\{m_1, m_2, m_3\}$, the product is scaled by the fixed scaling factor $M = m_1 m_2 m_3$. Since $0 \le X \cdot Y / M < M$, the result is guaranteed to remain within the dynamic range of the RNS. As shown in Figure 2, we add several compensation items to the result of Figure 1a to obtain a high precision multiplication result; the following derivation shows how these compensation items improve the precision. As mentioned before, the compensation items can be obtained directly from the proposed scalers, so the proposed RNS multiplier needs little extra hardware in comparison with the basic scaling multiplier.
The derivation process is as follows.
According to (21), we have
$\frac{Y}{m_1 m_2} = \begin{cases} \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{y_1 - y_2}{m_2} + \frac{y_1}{m_1 m_2}, & y_1 \ge y_2 \\ \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{y_1 - y_2 + m_2}{m_2} + \frac{y_1}{m_1 m_2}, & y_1 < y_2 \end{cases} = \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{\left| y_1 - y_2 \right|_{m_2}}{m_2} + \frac{y_1}{m_1 m_2} = \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{m_1 Y_{1,2} + y_1}{m_1 m_2}.$ (24)
By using (8), (21), and (24), we can get
$\frac{X \cdot Y}{M} = \frac{X}{m_3} \cdot \frac{Y}{m_1 m_2} = \left( \left\lfloor \frac{X}{m_3} \right\rfloor + \frac{x_3}{m_3} \right)\left( \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{m_1 Y_{1,2} + y_1}{m_1 m_2} \right) = \left\lfloor \frac{X}{m_3} \right\rfloor \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{x_3}{m_3}\left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{X}{m_3}\left( \frac{Y_{1,2}}{m_2} + \frac{y_1}{m_1 m_2} \right).$ (25)
According to (17), we have
$\frac{X}{m_2 m_3} = \left\lfloor \frac{X}{m_2 m_3} \right\rfloor + \frac{X_{3,2}}{m_2} + \frac{x_3}{m_2 m_3}.$ (26)
Thus, the last additive term in (25) can be expanded as
$\frac{X}{m_3}\left( \frac{Y_{1,2}}{m_2} + \frac{y_1}{m_1 m_2} \right) = \frac{X}{m_2 m_3}\left( Y_{1,2} + \frac{y_1}{m_1} \right) = \left( \left\lfloor \frac{X}{m_2 m_3} \right\rfloor + \frac{X_{3,2}}{m_2} + \frac{x_3}{m_2 m_3} \right)\left( Y_{1,2} + \frac{y_1}{m_1} \right) = \left\lfloor \frac{X}{m_2 m_3} \right\rfloor Y_{1,2} + \frac{X_{3,2} Y_{1,2}}{m_2} + \frac{Y_{1,2} x_3}{m_2 m_3} + \frac{y_1}{m_1}\left\lfloor \frac{X}{m_2 m_3} \right\rfloor + \frac{X_{3,2} y_1}{m_1 m_2} + \frac{x_3 y_1}{m_1 m_2 m_3}.$ (27)
Then, (25) can be rewritten as
$\frac{X \cdot Y}{M} = \left\lfloor \frac{X}{m_3} \right\rfloor \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \frac{x_3}{m_3}\left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + \left\lfloor \frac{X}{m_2 m_3} \right\rfloor Y_{1,2} + \frac{X_{3,2} Y_{1,2}}{m_2} + \frac{Y_{1,2} x_3}{m_2 m_3} + \frac{y_1}{m_1}\left\lfloor \frac{X}{m_2 m_3} \right\rfloor + \frac{X_{3,2} y_1}{m_1 m_2} + \frac{x_3 y_1}{m_1 m_2 m_3}.$ (28)
Some of the divisions in (28) have divisors that are not powers of 2, which are difficult to implement. In order to reduce the hardware complexity, we approximate the divisors in $\frac{x_3}{m_3}\left\lfloor \frac{Y}{m_1 m_2} \right\rfloor$ and $\frac{y_1}{m_1}\left\lfloor \frac{X}{m_2 m_3} \right\rfloor$ by $m_2$. We keep $\frac{X_{3,2} Y_{1,2}}{m_2}$ and abandon $\frac{Y_{1,2} x_3}{m_2 m_3}$, $\frac{X_{3,2} y_1}{m_1 m_2}$, and $\frac{x_3 y_1}{m_1 m_2 m_3}$ as approximation errors, since each of these three items is smaller than 1. After these simplifications, (28) can be approximated as
$\frac{X \cdot Y}{M} \approx \left\lfloor \frac{X}{m_3} \right\rfloor \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor + Y_{1,2} \left\lfloor \frac{X}{m_2 m_3} \right\rfloor + \Delta_1,$ (29)
where
$\Delta_1 = \frac{\left\lfloor Y/(m_1 m_2) \right\rfloor x_3}{m_2} + \frac{\left\lfloor X/(m_2 m_3) \right\rfloor y_1}{m_2} + \frac{Y_{1,2} X_{3,2}}{m_2}.$ (30)
The first two additive terms in (29) involve only integer multiplications, and all the operands can be obtained directly from the scalers. Moreover, although (30) contains fractional values, all of its divisors are powers of 2, so the divisions can be implemented simply by shifting. The total approximation error of (29) is
$\delta_1 = \frac{x_3}{m_2 m_3}\left( Y_{1,2} - \left\lfloor \frac{Y}{m_1 m_2} \right\rfloor \right) + \frac{y_1}{m_1 m_2}\left( X_{3,2} + \left\lfloor \frac{X}{m_2 m_3} \right\rfloor \right) + \frac{x_3 y_1}{m_1 m_2 m_3}.$ (31)
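As a software illustration of (29) and (30) (not the hardware datapath; the scaler outputs are emulated here with ordinary integer division, and the divisions by $m_2 = 2^n$ in (30) are realized as truncating right shifts), the following sketch checks that the compensated result stays within a few integer units of the exact scaled product. The bound of 6 used in the assertion is a loose, hand-derived limit rather than a figure from the paper.

```python
n = 5
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3

def compensated_product(X, Y):
    """Approximate floor(X*Y / M) following (29) with the compensation (30)."""
    x3, y1 = X % m3, Y % m1
    Xs3 = X // m3                      # output of the K = m3 scaler
    Ys12 = Y // (m1 * m2)              # output of the K = m1*m2 scaler
    Xs23 = X // (m2 * m3)              # output of the K = m2*m3 scaler
    X32 = Xs3 % m2                     # X_{3,2}
    Y12 = (Y // m1) % m2               # Y_{1,2}
    delta1 = (Ys12 * x3 >> n) + (Xs23 * y1 >> n) + (Y12 * X32 >> n)
    return Xs3 * Ys12 + Y12 * Xs23 + delta1

for X in range(0, M, 257):
    for Y in range(0, M, 263):
        exact = (X * Y) // M
        assert abs(compensated_product(X, Y) - exact) <= 6   # loose error bound
```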
Moreover, this kind of multiplier can also be realized by another method, following a derivation similar to the one above. According to (13) and (17), we can get
$\frac{X \cdot Y}{M} = \frac{X}{m_1} \cdot \frac{Y}{m_2 m_3} = \left\lfloor \frac{X}{m_1} \right\rfloor \left\lfloor \frac{Y}{m_2 m_3} \right\rfloor + \frac{x_1}{m_1}\left\lfloor \frac{Y}{m_2 m_3} \right\rfloor + \frac{X}{m_1}\left( \frac{Y_{3,2}}{m_2} + \frac{y_3}{m_2 m_3} \right).$ (32)
In the same way, (32) can be approximated as
$\frac{X \cdot Y}{M} \approx \left\lfloor \frac{X}{m_1} \right\rfloor \left\lfloor \frac{Y}{m_2 m_3} \right\rfloor + Y_{3,2} \left\lfloor \frac{X}{m_1 m_2} \right\rfloor + \Delta_2,$ (33)
where
$\Delta_2 = \frac{\left\lfloor X/(m_1 m_2) \right\rfloor y_3}{m_2} + \frac{\left\lfloor Y/(m_2 m_3) \right\rfloor x_1}{m_2} + \frac{Y_{3,2} X_{1,2}}{m_2}.$ (34)
In addition, the approximation error of (33) is
$\delta_2 = \frac{x_1}{m_1 m_2}\left( Y_{3,2} + \left\lfloor \frac{Y}{m_2 m_3} \right\rfloor \right) + \frac{y_3}{m_2 m_3}\left( X_{1,2} - \left\lfloor \frac{X}{m_1 m_2} \right\rfloor \right) + \frac{x_1 y_3}{m_1 m_2 m_3}.$ (35)
Although the multipliers implemented by (29) and (33) have the same structure, they use different scalers, which leads to different hardware complexity. It is worth noting that the hardware complexity can be further reduced by reducing the number of additive items in the multipliers, at the cost of precision loss. For example, since the compensation items in (30) and (34) are relatively small but consume considerable hardware resources, some of them can be abandoned to obtain a simplified RNS multiplier. This provides a trade-off between calculation precision and hardware consumption. From now on, we refer to the RNS multipliers implemented fully according to (29) and (33) as full compensatory RNS multipliers, and to the simplified RNS multipliers obtained by abandoning compensation items of (30) and (34) as partial compensatory RNS multipliers. The following numerical example illustrates the proposed RNS multiplication algorithm: letting $X = 64$ and $Y = 460$ in the moduli set $\{7, 8, 9\}$ (i.e., $n = 3$), Table 2 shows the step-by-step calculation trace of the RNS multiplication based on (29).
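The following sketch reproduces the trace of Table 2 in Python (assuming $n = 3$ and the moduli set $\{7, 8, 9\}$; the variable names are ours).

```python
n = 3
m1, m2, m3 = 7, 8, 9
M = m1 * m2 * m3                        # 504

X, Y = 64, 460
x3, y1 = X % m3, Y % m1                 # x3 = 1, y1 = 5

Xs3 = X // m3                           # 7,  residues {0, 7, 7}
Ys1 = Y // m1                           # 65, residues {2, 1, 2}
Xs23 = X // (m2 * m3)                   # 0
Ys12 = Y // (m1 * m2)                   # 8
Y12 = Ys1 % m2                          # Y_{1,2} = 1
X32 = Xs3 % m2                          # X_{3,2} = 7

delta1 = (Ys12 * x3 >> n) + (Xs23 * y1 >> n) + (Y12 * X32 >> n)   # 1 + 0 + 0
result = Xs3 * Ys12 + Y12 * Xs23 + delta1                         # 56 + 0 + 1 = 57

assert result == 57
assert [result % m for m in (m1, m2, m3)] == [1, 1, 3]
assert X * Y - result * M == 712        # the calculation error listed in Table 2
```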

4. Hardware Structures

For the RNS with moduli set $\{2^n-1, 2^n, 2^n+1\}$, some numerical calculations and operations can be replaced by bit-wise logic operations to achieve an efficient hardware structure. Let the binary representation of an $n$-bit integer $x$ be $x_{n-1}x_{n-2}\cdots x_0$ with $0 \le x < 2^n - 1$. According to the properties of the modulo operation and the rules of Boolean logic [14], we can easily get
$\left| -x \right|_{2^n-1} = \left| \bar{x} \right|_{2^n-1}, \quad \left| 2^r x \right|_{2^n-1} = \mathrm{CLS}(x, r),$ (36)
where $\bar{x}$ denotes the bit-wise inversion of $x$, and $\mathrm{CLS}(x, r)$ denotes the circular left shift of $x$ by $r$ bits.
The corresponding two operations for modulo $2^n+1$ are a little more complicated. For an $(n+1)$-bit integer with binary representation $x = x_n x_{n-1} \cdots x_0$ and $0 \le x < 2^n + 1$, we have
$\left| -x \right|_{2^n+1} = \left| 2^n + 1 - \sum_{i=0}^{n} 2^i x_i \right|_{2^n+1} = \left| 2 + (2^n - 1) + x_n - \sum_{i=0}^{n-1} 2^i x_i \right|_{2^n+1} = \left| 2 + x_n + \overline{x_{n-1} x_{n-2} \cdots x_0} \right|_{2^n+1},$ (37)
and
$\left| 2^r x \right|_{2^n+1} = \left| \sum_{i=0}^{n} 2^{i+r} x_i \right|_{2^n+1} = \left| \sum_{i=0}^{n-r-1} 2^{i+r} x_i + \sum_{i=n-r}^{n} 2^{i+r} x_i \right|_{2^n+1} = \left| \sum_{i=0}^{n-r-1} 2^{i+r} x_i + 2^n \sum_{i=n-r}^{n} 2^{i-n+r} x_i \right|_{2^n+1} = \left| \sum_{i=0}^{n-r-1} 2^{i+r} x_i + \left( 2^n - 1 - \sum_{i=n-r}^{n} 2^{i-n+r} x_i \right) + 2 \right|_{2^n+1} = \left| x_{n-r-1} \cdots x_1 x_0 \underbrace{0 \cdots 0}_{r} + \underbrace{1 \cdots 1}_{n-r-1} \overline{x_n x_{n-1} \cdots x_{n-r}} + 2 \right|_{2^n+1}.$ (38)
The hardware structures of the scalers and multipliers in this paper are based on these operations.
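The bit-level identities can be checked with a few lines of Python (an illustration only; the correction form used for the modulo $2^n+1$ negation follows (37) as reconstructed above, and $n = 5$ is our choice).

```python
n = 5

def neg_mod_m1(x):
    """|-x|_{2^n - 1} via bit-wise inversion of the n-bit operand, see (36)."""
    return (~x) & (2**n - 1)

def cls(x, r):
    """Circular left shift by r bits, i.e. |2^r * x|_{2^n - 1}, see (36)."""
    return ((x << r) | (x >> (n - r))) & (2**n - 1)

for x in range(2**n - 1):
    assert neg_mod_m1(x) % (2**n - 1) == (-x) % (2**n - 1)
    for r in range(1, n):
        assert cls(x, r) == (2**r * x) % (2**n - 1)

# Modulo 2^n + 1 negation with the correction of (37): invert the lower n bits,
# then add the top bit and the constant 2.
for x in range(2**n + 1):
    assert (2 + (x >> n) + ((~x) & (2**n - 1))) % (2**n + 1) == (-x) % (2**n + 1)
```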

4.1. Multiplier Hardware Structure

The implementation block diagrams of the proposed compensatory scaling RNS multipliers are shown in Figure 3.
In Figure 3, the multiplications in the solid boxes represent the basic items of the proposed RNS multiplier, while the multiplications in the dotted boxes represent the compensation items. The scaler blocks in Figure 3 are the proposed joint scalers, whose implementation details are shown in Figure 4. As shown in Figure 3, the final result is calculated from the outputs of the two scalers without any further scaling effort; thus, the compensation items cost only a little extra hardware.

4.2. Scaler Hardware Structure

We designed the scaling structures of the two joint scalers proposed in this paper; their implementation block diagrams are shown in Figure 4.
In Figure 4, the proposed scaling structures of the four scaling factors are implemented using the logic operations in (36) and the modular adders proposed in [19]. The modules in the two dotted boxes can be used separately as scaler implementations for scaling factors $m_2 m_3$ and $m_1 m_2$, respectively. The overall implementation in Figure 4a can be used as a joint implementation for scaling factors $m_2 m_3$ and $m_3$, while Figure 4b can be used for scaling factors $m_1 m_2$ and $m_1$.
Figure 4a can be further optimized, since the cascaded modulo $m_1$ adders can be replaced by a Carry Save Adder (CSA) and a modulo $m_1$ adder. This optimized structure is shown in Figure 5. It is worth noting that we use this optimized structure in the implementation of Figure 4a as well as in the scaler for scaling factor $m_2 m_3$, in order to obtain better hardware performance.

5. Performance Analysis

5.1. Calculation Performance of the Proposed Multiplier

In order to verify the calculation performance of the proposed compensatory scaling multiplier, we first analyze its calculation precision in comparison with the TCS multiplier and the basic scaling RNS multiplier (see Figure 1a). Two metrics, Mean Square Error (MSE) and Signal-to-Noise Ratio (SNR), are used to evaluate the calculation precision of the three multipliers. The MSE is calculated by
$\mathrm{MSE} = \frac{1}{N} \sum_{i=0}^{N-1} \left( X_{\mathrm{real},i,\mathrm{norm}} - X_{\mathrm{result},i,\mathrm{norm}} \right)^2,$ (39)
while the SNR is defined as
$\mathrm{SNR}\,(\mathrm{dB}) = 10 \log_{10} \frac{\sum_{i=0}^{N-1} X_{\mathrm{real},i}^2}{\sum_{i=0}^{N-1} \left( X_{\mathrm{real},i} - X_{\mathrm{result},i} \right)^2}.$ (40)
In (39) and (40), $X_{\mathrm{real},i}$ represents the exact result of the $i$th calculation, $X_{\mathrm{result},i}$ represents the $i$th result calculated by each of the three multipliers, and $X_{\mathrm{real},i,\mathrm{norm}}$ and $X_{\mathrm{result},i,\mathrm{norm}}$ are the unit normalizations of $X_{\mathrm{real},i}$ and $X_{\mathrm{result},i}$, respectively. For each $i$, the inputs of the multiplier are randomly selected from the RNS dynamic range. $N$ is the total number of samples calculated for each $n$; in this paper, $N = 10{,}000$. The dynamic range of the RNS is varied by changing $n$ from 5 to 12. For comparison, the bit-width $k$ of the TCS is selected so that the dynamic range of the TCS is slightly larger than that of the RNS. For example, for $n = 5$, the moduli set of the RNS is $\{31, 32, 33\}$; then $M = 32{,}736$, and we choose $k = \lfloor \log_2 M + 1 \rfloor = 15$ bits for the TCS comparison. The simulation results are shown in Figure 6 and Figure 7.
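For reference, a possible software version of this evaluation (our own sketch with assumed details: the proposed multiplier is emulated as in Section 3, the normalization in (39) is taken with respect to $M$, and Python's random module supplies the operands) looks as follows.

```python
import math
import random

n, N = 5, 10000
m1, m2, m3 = 2**n - 1, 2**n, 2**n + 1
M = m1 * m2 * m3

def proposed(X, Y):
    """Full compensatory result of (29); scaler outputs emulated by integer division."""
    x3, y1 = X % m3, Y % m1
    Xs3, Ys12, Xs23 = X // m3, Y // (m1 * m2), X // (m2 * m3)
    Y12, X32 = (Y // m1) % m2, (X // m3) % m2
    delta1 = (Ys12 * x3 >> n) + (Xs23 * y1 >> n) + (Y12 * X32 >> n)
    return Xs3 * Ys12 + Y12 * Xs23 + delta1

sig = err = mse = 0.0
for _ in range(N):
    X, Y = random.randrange(M), random.randrange(M)
    exact = X * Y / M
    approx = proposed(X, Y)
    sig += exact**2
    err += (exact - approx)**2
    mse += ((exact - approx) / M)**2      # unit-normalized error, in the spirit of (39)
print(f"MSE = {mse / N:.3e}, SNR = {10 * math.log10(sig / err):.1f} dB")
```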
From Figure 6, for all $n$, the proposed RNS multiplier achieves almost the same MSE as the TCS multiplier, while the basic scaling RNS multiplier suffers a relatively large MSE when $n < 8$. This means that the proposed RNS multiplier has better calculation precision when the bit-width is not large.
As shown in Figure 7, for each $n$, the proposed RNS multiplier outperforms the traditional approach by about 50–140 dB as $n$ increases from 5 to 12, which shows that the proposed RNS multiplier can greatly improve the calculation precision of multiplication in RNS while avoiding overflow. In addition, the proposed multiplier achieves an SNR curve similar to that of the TCS multiplier, with an SNR loss of about 5 dB; this is because the dynamic range of the TCS is chosen to be slightly larger than that of the RNS. Moreover, as mentioned before, the hardware consumption of the proposed RNS multiplier can be reduced by abandoning the third additive item in (29) and (33). Although this leads to a loss of calculation precision, the results in Figure 7 show that the partial compensatory RNS multiplier still outperforms the basic scaling one.

5.2. Synthesis Results of RNS Multiplier Based on Design Compiler

In order to evaluate the hardware performance of the proposed RNS multipliers, we described them in VHDL and synthesized them with Synopsys Design Compiler (DC) using the SMIC 65 nm process. The partial compensatory RNS multiplier with only one compensation item was also considered. We synthesized the three multipliers under a clock timing constraint targeting the smallest achievable clock period. The results are shown in Table 3.
In Table 3, Structure 1 denotes the structure in (29) and its simplified version, and Structure 2 denotes the structure in (33) and its simplified version. Although the structures of the two multipliers are symmetrical, they use different numbers of modulo $m_1$ and modulo $m_3$ adders. Moreover, two cascaded modulo $m_1$ adders can benefit from the optimized hardware structure in Figure 5, so the hardware consumption of Structure 1 is slightly larger than that of Structure 2 for each $n$. BS denotes the basic scaling RNS multiplier, PC the partial compensatory RNS multiplier, and FC the full compensatory RNS multiplier. We use the $A \cdot T$ of the basic scaling RNS multiplier with Structure 1 as the basis for calculating the $A \cdot T$ ratio for each $n$. From Table 3, for each $n$, although the proposed multiplier needs about 2–2.3 times the hardware of the basic scaling multiplier, it achieves about 2.6–3 times the SNR of the basic scaling multiplier. To further evaluate the hardware efficiency of the proposed RNS multipliers, we calculate the $\mathrm{SNR}/(A \cdot T)$ of the three multipliers; the results are shown in Figure 8. The proposed partial compensatory RNS multiplier has the largest $\mathrm{SNR}/(A \cdot T)$ for each $n$, which means it has the highest hardware efficiency. In addition, the proposed RNS multiplier has a larger $\mathrm{SNR}/(A \cdot T)$ than the basic scaling one. This indicates that the proposed RNS multipliers outperform the basic scaling RNS multiplier in hardware efficiency.
Hiasat proposed efficient RNS scalers for the extended moduli set $\{2^n-1, 2^{n+p}, 2^n+1\}$ with scaling factors $2^n$ and $2^{n+p}$ [15]. This work is a recent development in RNS scaler design, and the hardware efficiency of its scalers is clearly better than that of ours. To the best of our knowledge, however, our work is the first to address the overall accuracy of the RNS multiplier rather than only the modular multiplier. In our design, we use the results scaled by $2^n-1$, $2^n(2^n-1)$, $2^n+1$, and $2^n(2^n+1)$ instead of $2^n$. The focus of this work is not on optimizing the scaler itself; any optimized scaler can be used in the proposed multiplier as long as it provides the required scaling factors.

6. Conclusions

This work presented a high precision, overflow-free multiplier design for the three-moduli set $\{2^n-1, 2^n, 2^n+1\}$. The proposed RNS multiplier avoids overflow via the scaling approach and achieves high precision by adding several compensation items that compensate for the precision loss caused by scaling. The proposed RNS multiplier achieves almost the same calculation precision as the TCS multiplier and outperforms the basic scaling RNS multiplier by about 2.6–3 times in SNR. In addition, the compensation items can be selected flexibly to trade off hardware resources against calculation precision. Synthesis results show that both the full compensatory and the partial compensatory RNS multipliers outperform the traditional basic scaling RNS multiplier in hardware efficiency.

Author Contributions

The contributions made by the authors are as follows: Conceptualization, S.M. and S.H.; Methodology, S.M. and S.H.; Software, Z.Y., X.W., M.L., and J.H.; Validation, X.W. and M.L.; Formal analysis, Z.Y., X.W., M.L., and J.H.; Investigation, S.M. and S.H.; Data curation, S.M. and S.H.; Writing—original draft preparation, S.M. and S.H.; Writing—review and editing, S.M. and S.H.; Visualization, S.M. and S.H.; Supervision, S.M. and S.H.; Project administration, S.M. and S.H.; Funding acquisition, S.M. and S.H. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by the National Natural Science Foundation of China under Grant No. 61571083.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Schoinianakis, D. Residue arithmetic systems in cryptography: A survey on modern security applications. J. Cryptogr. Eng. 2020, 10, 249–267.
2. Schinianakis, D.M.; Fournaris, A.P.; Kakarountas, A.P.; Stouraitis, T. An RNS Architecture of an Fp Elliptic Curve Point Multiplier. In Proceedings of the IEEE International Symposium on Circuits and Systems, Kos, Greece, 21–24 May 2006; Volume 56, pp. 1202–1213.
3. Bajard, J.C.; Eynard, J.; Merkiche, N.; Plantard, T. RNS Arithmetic Approach in Lattice-based Cryptography Accelerating the “Rounding-off” Core Procedure. In Proceedings of the 2015 IEEE 22nd Symposium on Computer Arithmetic, Lyon, France, 22–24 June 2015.
4. Jenkins, W.; Leon, B. The use of residue number systems in the design of finite impulse response digital filters. IEEE Trans. Circuits Syst. 1977, 24, 191–201.
5. Re, A.D.; Nannarelli, A.; Re, M. Implementation of digital filters in carry-save residue number system. In Proceedings of the Asilomar Conference on Signals, Systems and Computers, Pacific Grove, CA, USA, 4–7 November 2001; Volume 2, pp. 1309–1313.
6. Bernocchi, G.L.; Cardarilli, G.C.; Del Re, A.; Nannarelli, A.; Re, M. Low-power adaptive filter based on RNS components. In Proceedings of the IEEE International Symposium on Circuits and Systems, New Orleans, LA, USA, 27–30 May 2007; pp. 3211–3214.
7. Fernández, P.; García, A.; Ramírez, J.; Parrilla, L.; Lloris, A. A new implementation of the discrete cosine transform in the residue number system. In Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, USA, 24–27 October 1999; Volume 2, pp. 1302–1306.
8. Arai, Y.; Agui, T.; Nakajima, M. A fast DCT-SQ scheme for images. IEICE Trans. 1988, 71, 1095–1097.
9. Tseng, B.D.; Jullien, G.A.; Miller, W.C. Implementation of FFT structures using the residue number system. IEEE Trans. Comput. 1979, 100, 831–845.
10. Hiasat, A. Sign detector for the extended four-moduli set {2n-1,2n+1,22n+1,2n+k}. IET Comput. Digit. Tech. 2017, 12, 39–43.
11. Xu, M.; Yao, R.; Luo, F. Low-Complexity Sign Detection Algorithm for RNS {2n-1,2n,2n+1}. IEICE Trans. Electron. 2012, 95, 1552–1556.
12. Sousa, L. Efficient method for magnitude comparison in RNS based on two pairs of conjugate moduli. In Proceedings of the 18th IEEE Symposium on Computer Arithmetic (ARITH’07), Montpellier, France, 25–27 June 2007; pp. 240–250.
13. Hiasat, A. A residue-to-binary converter with an adjustable structure for an extended RNS three-moduli set. J. Circuits Syst. Comput. 2019, 28, 1950126.
14. Chang, C.H.; Low, J.Y.S. Simple, Fast, and Exact RNS Scaler for the Three-Moduli Set {2n-1,2n,2n+1}. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 58, 2686–2697.
15. Hiasat, A. Efficient RNS scalers for the extended three-moduli set {2n-1,2n+p,2n+1}. IEEE Trans. Comput. 2017, 66, 1253–1260.
16. Taylor, F.J. An Overflow-Free Residue Multiplier. IEEE Trans. Comput. 1983, 32, 501–504.
17. Chen, J.W.; Yao, R.H.; Wu, W.J. Efficient Modulo 2n+1 Multipliers. IEEE Trans. Very Large Scale Integr. Syst. 2011, 19, 2149–2157.
18. Muralidharan, R.; Chang, C.H. Radix-8 Booth Encoded Modulo 2n-1 Multipliers with Adaptive Delay for High Dynamic Range Residue Number System. IEEE Trans. Circuits Syst. I Regul. Pap. 2011, 58, 982–993.
19. Zimmermann, R. Efficient VLSI implementation of modulo (2n±1) addition and multiplication. In Proceedings of the IEEE Symposium on Computer Arithmetic, Adelaide, SA, Australia, 14–16 April 1999; p. 158.
20. Hiasat, A.A. New efficient structure for a modular multiplier for RNS. IEEE Trans. Comput. 2000, 49, 170–174.
21. Omondi, A.R.; Premkumar, A.B. Residue Number Systems: Theory and Implementation; World Scientific: Singapore, 2007; Volume 2.
22. Low, J.Y.S.; Chang, C.H. A new RNS scaler for {2n-1,2n,2n+1}. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Rio de Janeiro, Brazil, 15–18 May 2011; pp. 1431–1434.
23. Low, J.Y.S.; Tay, T.F.; Chang, C.H. A unified {2n-1,2n,2n+1} RNS scaler with dual scaling constants. In Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS), Kaohsiung, Taiwan, 2–5 December 2012.
Figure 1. Three traditional overflow-free RNS multipliers. (a) Scaling-based scheme; (b) Base Conversion-based scheme; (c) Base Extension-based scheme.
Figure 2. The proposed high precision overflow-free RNS multiplier structure.
Figure 3. Compensatory scaling RNS multiplier structure. (a) Multiplier structure corresponding to (29); (b) multiplier structure corresponding to (33).
Figure 4. Scaling structures of the four scaling factors. (a) Joint scaling structure for scaling factors $2^n+1$ and $2^n(2^n+1)$; (b) joint scaling structure for scaling factors $2^n-1$ and $2^n(2^n-1)$.
Figure 5. Optimized scaling structure for scaling factor $m_2 m_3$.
Figure 6. MSE of three kinds of multipliers.
Figure 7. SNR of three kinds of multipliers.
Figure 8. Hardware efficiency of two kinds of multipliers (higher is better).
Table 1. Calculation methods for four different scaling factors.

| $K$ | $\lvert \lfloor X/K \rfloor \rvert_{m_1}$ | $\lvert \lfloor X/K \rfloor \rvert_{m_2}$ | $\lvert \lfloor X/K \rfloor \rvert_{m_3}$ |
| --- | --- | --- | --- |
| $m_3$ | $\lvert 2^{n-1}x_1 - 2^{n-1}x_3 \rvert_{m_1}$ | $\lvert x_2 - x_3 \rvert_{m_2}$ | $\lvert X_{3,2} - \lvert X_{3,1} - X_{3,2} \rvert_{m_1} \rvert_{m_3}$ |
| $m_1$ | $\lvert \lvert X_{1,2} - X_{1,3} \rvert_{m_3} + X_{1,2} \rvert_{m_1}$ | $\lvert x_1 - x_2 \rvert_{m_2}$ | $\lvert 2^{n-1}x_3 - 2^{n-1}x_1 \rvert_{m_3}$ |
| $m_2 m_3$ | $\lvert X_{3,1} - X_{3,2} \rvert_{m_1}$ | $\lvert X_{3,1} - X_{3,2} \rvert_{m_1}$ | $\lvert X_{3,1} - X_{3,2} \rvert_{m_1}$ |
| $m_1 m_2$ | $\lvert \lvert X_{1,2} - X_{1,3} \rvert_{m_3} \rvert_{m_1}$ | $\lvert \lvert X_{1,2} - X_{1,3} \rvert_{m_3} \rvert_{m_2}$ | $\lvert X_{1,2} - X_{1,3} \rvert_{m_3}$ |
Table 2. Computation trace of $X \times Y = 64 \times 460$ for the proposed high precision RNS multiplier.

| Item | Value |
| --- | --- |
| Moduli set | $\{7, 8, 9\}$ |
| Residues $\{x_1, x_2, x_3\}$, $\{y_1, y_2, y_3\}$ | $\{1, 0, 1\}$, $\{5, 4, 1\}$ |
| $\lfloor X/m_3 \rfloor = \{X_{3,1}, X_{3,2}, X_{3,3}\}$ | $\{0, 7, 7\}$ |
| $\lfloor Y/m_1 \rfloor = \{Y_{1,1}, Y_{1,2}, Y_{1,3}\}$ | $\{2, 1, 2\}$ |
| $\lfloor X/(m_2 m_3) \rfloor$ | 0 |
| $\lfloor Y/(m_1 m_2) \rfloor$ | 8 |
| $\lfloor Y/(m_1 m_2) \rfloor \cdot \lfloor X/m_3 \rfloor$ | $\{0, 0, 2\}$ |
| $Y_{1,2} \cdot \lfloor X/(m_2 m_3) \rfloor$ | $\{0, 0, 0\}$ |
| $\lfloor Y/(m_1 m_2) \rfloor x_3 / m_2$ | 1 |
| $\lfloor X/(m_2 m_3) \rfloor y_1 / m_2$ | 0 |
| $Y_{1,2} X_{3,2} / m_2$ | 0 |
| Result of (29) | $\{0, 0, 2\} + 1 = \{1, 1, 3\} = 57$ |
| Calculation error | $64 \times 460 - 57 \times 504 = 712$ |
Table 3. Synthesis results of the multipliers.

| $n$ | Structure | Multiplier | $A$ (μm²) | $T$ (ns) | $A \cdot T$ (μm²·ns) | $A \cdot T$ Ratio |
| --- | --- | --- | --- | --- | --- | --- |
| 5 | Structure 1 | BS | 4500.36 | 2.19 | 9889.31 | 1 |
| 5 | Structure 1 | PC | 5197.68 | 2.39 | 12,458.73 | 1.26 |
| 5 | Structure 1 | FC | 6280.92 | 2.69 | 16,942.90 | 1.71 |
| 5 | Structure 2 | BS | 3651.84 | 2.29 | 8399.15 | 0.85 |
| 5 | Structure 2 | PC | 4745.52 | 2.49 | 11,848.47 | 1.20 |
| 5 | Structure 2 | FC | 7913.88 | 2.59 | 20,570.07 | 2.08 |
| 8 | Structure 1 | BS | 7743.96 | 2.69 | 20,906.37 | 1 |
| 8 | Structure 1 | PC | 10,293.12 | 2.89 | 29,823.28 | 1.43 |
| 8 | Structure 1 | FC | 13,224.60 | 3.09 | 40,949.84 | 1.96 |
| 8 | Structure 2 | BS | 7428.24 | 2.69 | 20,021.11 | 0.96 |
| 8 | Structure 2 | PC | 10,254.96 | 2.89 | 23,453.74 | 1.12 |
| 8 | Structure 2 | FC | 16,380.36 | 3.03 | 49,757.47 | 2.38 |
| 11 | Structure 1 | BS | 12,677.76 | 2.89 | 36,704.90 | 1 |
| 11 | Structure 1 | PC | 17,025.48 | 3.09 | 52,766.21 | 1.43 |
| 11 | Structure 1 | FC | 23,253.84 | 3.29 | 76,688.60 | 2.09 |
| 11 | Structure 2 | BS | 12,537.72 | 2.99 | 37,606.26 | 1.02 |
| 11 | Structure 2 | PC | 15,290.28 | 3.19 | 48,888.83 | 1.33 |
| 11 | Structure 2 | FC | 26,150.76 | 3.29 | 86,222.45 | 2.35 |
| 14 | Structure 1 | BS | 16,674.12 | 3.18 | 53,148.76 | 1 |
| 14 | Structure 1 | PC | 26,294.76 | 3.39 | 90,232.63 | 1.70 |
| 14 | Structure 1 | FC | 31,066.20 | 3.59 | 111,744.81 | 2.10 |
| 14 | Structure 2 | BS | 15,747.12 | 3.29 | 51,894.63 | 0.98 |
| 14 | Structure 2 | PC | 26,294.76 | 3.39 | 89,279.12 | 1.68 |
| 14 | Structure 2 | FC | 32,858.64 | 3.59 | 118,271.72 | 2.23 |
