Article

A Scalable Digit-Parallel Polynomial Multiplier Architecture for NIST-Standardized Binary Elliptic Curves

1 Department of Computer Science, College of Computer Science, King Khalid University, Abha 61413, Saudi Arabia
2 Department of Computer Engineering, Umm Al-Qura University, Makkah 21955, Saudi Arabia
3 Department of Computer Sciences, Faculty of Computing and Information Technology, Northern Border University, Rafha 91911, Saudi Arabia
4 Department of Aeronautical Engineering, Estonian Aviation Academy, 61707 Tartu, Estonia
5 Department of Information Science, King Abdulaziz University, Jeddah 21589, Saudi Arabia
6 Department of Information Systems, Umm Al-Qura University, Makkah 21955, Saudi Arabia
* Author to whom correspondence should be addressed.
Appl. Sci. 2022, 12(9), 4312; https://doi.org/10.3390/app12094312
Submission received: 4 March 2022 / Revised: 11 April 2022 / Accepted: 21 April 2022 / Published: 24 April 2022

Abstract

This work presents a scalable digit-parallel finite field polynomial multiplier architecture with a digit size of 32 bits for NIST-standardized binary elliptic fields. First, a dedicated digit-parallel architecture is proposed for each binary field recommended by NIST, i.e., 163, 233, 283, 409 and 571. Then, a scalable architecture supporting all variants of binary elliptic curve fields is proposed. For performance investigation, the dedicated multiplier architectures are compared with the scalable design. After this, the dedicated and scalable architectures are compared with the most relevant state-of-the-art multipliers. All multiplier architectures are implemented in Verilog HDL using the Vivado IDE tool. The implementation results are reported on a 28 nm Virtex-7 FPGA technology. The dedicated multipliers utilize 1182 (for m = 163), 1451 (for m = 233), 1589 (for m = 283), 2093 (for m = 409) and 3451 (for m = 571) slices. Moreover, our dedicated designs can operate at maximum frequencies of 500, 476, 465, 451 and 443 MHz, respectively. Similarly, for all supported binary fields, our scalable architecture (i) utilizes 3753 slices, (ii) achieves a 305 MHz clock frequency, (iii) takes 0.013 μs for one finite field multiplication and (iv) consumes 3.905 W of power. The proposed scalable digit-parallel architecture is more area-efficient than the most recent state-of-the-art multipliers. Consequently, the reported results and the comparison with the state of the art reveal that the proposed architectures are well suited for cryptographic applications.

1. Introduction

In modern cryptographic applications, the most widely used public-key cryptosystems, such as Rivest–Shamir–Adleman (RSA) [1] and elliptic curve cryptography (ECC) [2,3], are built on the hardness of integer factoring and the elliptic curve discrete logarithm problem, respectively. The latter offers equivalent security with shorter key lengths, which results in lower hardware resource utilization, power consumption and bandwidth requirements. Due to these benefits, ECC is now an attractive choice for area- and power-constrained cryptographic applications such as radio frequency identification (RFID) networks [4,5], wireless sensor nodes (WSNs) [6], etc. Consequently, ECC has been standardized by many organizations, including the National Institute of Standards and Technology (NIST) [7]. NIST is an American organization that defines standards to manage and reduce IT infrastructure security risk; it publishes standards, guidelines and practices that can be used to prevent, detect and respond to cyberattacks.
The hierarchical model of ECC contains four layers of operations [8]. The highest layer determines the protocol for the desired cryptographic operation, e.g., data encryption or decryption, signature generation or verification, etc. The third layer provides computation of the critical point multiplication operation [9,10]. It depends on the execution of layer-two operations, i.e., point addition and point doubling. Execution of each point addition and doubling depends in turn on the computation of finite field operations, i.e., addition, multiplication, squaring and inversion. NIST specifies various key lengths for prime GF(p) and binary GF(2^m) fields, which are commonly used to implement the four-layer model of ECC either as an application running in software or as a hardware accelerator [8,11,12]. It is important to note that, for each of the prime and binary fields, NIST specifies five different key lengths. The supported key sizes for the prime field are 192, 224, 256, 384 and 521. Similarly, the offered key lengths for the binary field are 163, 233, 283, 409 and 571. Comparatively, the GF(2^m) field is better suited to performance acceleration on hardware platforms such as field-programmable gate arrays (FPGAs) and application-specific integrated circuits (ASICs), since its carry-free arithmetic maps directly onto logic gates. Moreover, two types of basis representations, i.e., polynomial and normal, are available in the literature to represent the underlying points on the specified ECC curve. The polynomial basis representation is appropriate where the computation of frequent multiplications is concerned, while, for frequent squaring computations, a normal basis representation is more beneficial [13]. Therefore, we target binary fields and the polynomial basis representation in this work.

1.1. Existing Polynomial Multiplier Architectures and Their Limitations

Polynomial multiplication is the fundamental unit in modern ECC-based cryptographic devices. Several algorithms and hardware architectures have been proposed in the literature to optimize area and timing. Some examples of the most frequently used bit-serial designs are given in [14,15,16]. Bit-parallel architectures are adopted in [17,18,19,20,21,22,23,24]. Some digit-serial implementation architectures are proposed in [25,26,27]. Similarly, digit-parallel multipliers used for point multiplication (PM) computation in ECC are considered in [28,29]. It is important to note that the computational complexity of bit-serial multipliers is x clock cycles for two x-bit operands. The bit- and digit-parallel multipliers compute one multiplication in one clock cycle at the expense of area and power. The cost of digit-serial multipliers is ⌈x/y⌉ cycles, where x is the operand size and y is the digit size. An interesting evaluation study of various multiplication approaches is presented in [30]. Moreover, an open-source library of several polynomial multipliers is described in [31], where polynomial multiplications for binary elliptic curves are also presented. Additionally, various architectures, such as sequential, parallel, systolic, semi-systolic and pipelined, have been proposed based on the requirements of the targeted application. Consequently, bit-serial multiplication approaches are area- and power-efficient and are appropriate for constrained cryptographic applications.
Bit-serial multiplier designs. An efficient interleaved modular reduction multiplication algorithm, along with its bit-serial sequential architecture, is described in [14] for different cryptographic applications. This polynomial multiplication architecture is an appropriate choice for security provision in applications associated with the Internet of Things (IoT). Different multiplication architectures for both bit- and digit-serial Montgomery multiplication using linear feedback shift registers (LFSRs) are investigated in [15]. Their study demonstrates that the use of LFSRs results in a decrease in area resources while maintaining high performance. A low-complexity scalable serial architecture for polynomial multiplication over GF(2^m) is presented in [16]. Their design is convenient for deployment of cryptographic primitives in resource-constrained applications, such as smart cards, embedded devices related to medical applications, etc.
Bit-parallel multiplier architectures. A modular polynomial multiplication algorithm is depicted in [17] to provide an m-bit parallel systolic multiplier architecture for polynomial basis representations of ECC, where m defines the length of the polynomial operands. In [18], time-dependent and time-independent multiplication algorithms over the GF(2^m) field are presented. They employ interleaved conventional multiplication and folded techniques. Their algorithm allows the efficient realization of bit-parallel systolic multipliers. The implementation results are reported on a Virtex-7 FPGA. Their results reveal that the time-independent multiplier can save 54% of hardware resources as compared to other related multipliers. Versatile polynomial multiplication architectures over the GF(2^m) field using the Montgomery multiplication algorithm are reported in [19]. With the aim of achieving low power and low hardware resource utilization, a bit-parallel polynomial basis systolic multiplier architecture over GF(2^m) is presented in [20]. In [21], a bit-parallel multiplication method is proposed that exploits concurrency, which makes its execution faster. The multiplication result using this scheme is generated in every clock cycle, after an initial latency of m clock cycles. The implementation results are reported on FPGAs. A high-throughput, low-complexity systolic Montgomery multiplication architecture over GF(2^m) is presented in [22]. A bit-parallel systolic multiplication design over GF(2^m) is described in [23]. The performance results are given for FPGA and ASIC platforms. An efficient and high-speed overlap-free Karatsuba polynomial multiplier over GF(2^m) is described in [24] on an Artix-7 FPGA.
Digit-serial multiplier designs. A digit-based serial-in and serial-out semi-systolic multiplication architecture over GF(2^m) is presented in [25]. Their architecture can execute both polynomial multiplication and squaring operations together with shared hardware resources, which results in a decrease in execution time. Additionally, due to the shared hardware resources, their semi-systolic architecture reduces the hardware area as well as the power/energy consumption. An efficient implementation of a full-word Montgomery multiplication is presented in [26]. This incorporates the Karatsuba multiplication algorithm to reduce the computational complexity. The Karatsuba algorithm allows one to split the input operands into smaller segments. Two approaches to operand splitting were used: (i) four-part splitting and (ii) deep four-part (DFP) splitting. The implementation results are reported for Virtex-5, Virtex-6 and Virtex-7 FPGA devices over 192- to 521-bit prime fields. For various GF(2^m) fields, an optimized m-term polynomial multiplier similar to the Karatsuba multiplication method is proposed in [27]. The implementation results are reported for a Spartan-7 device.
Digit-parallel multiplier architectures. To accelerate the point multiplication operation of ECC, digit-parallel multipliers with digit sizes of 32 and 41 bits are utilized in [28,29], respectively. To perform digit multiplications, a simple least-significant-digit approach is used. These multiplication designs are more suitable for applications that demand a reasonable throughput with minimum hardware area, where throughput is the ratio of one over the computation time.
Although there are different polynomial multiplication architectures in the literature [14,15,16,17,25,28,29], these designs are specific to cryptographic applications having particular operand lengths. According to [13], scalability/flexibility is one of the critical requirements for 5G architecture development in order to incorporate enhanced mobile broadband, machine-type communication (MTC) and ultra-reliable MTC. In addition, various high-speed cryptographic applications (e.g., internet servers) that implement cryptographic algorithms for different security levels require scalable polynomial multiplication architectures. Therefore, there is a real demand for scalable multiplication designs.

1.2. Novelty

The splitting of input polynomials for multiplication is not a new idea and is commonly used in hierarchical/bit-parallel polynomial multipliers [32,33]. For example, the Karatsuba multiplier splits each input polynomial into two smaller polynomials. Then, the smaller polynomials are further split into two smaller polynomials each. This process can be repeated to obtain polynomials small enough to apply the schoolbook multiplication method. To generate the resultant polynomial for the Karatsuba method, the multiplications over the smaller polynomials need to be combined in reverse/recursive order (i.e., from smaller polynomials to larger). Variants of Toom–Cook multipliers are another example, where each input polynomial is split into three (three-way Toom–Cook multiplier) or four (four-way Toom–Cook multiplier) smaller polynomials. The ASIC implementations of Karatsuba and Toom–Cook multipliers are described in [31]. These multiplication approaches (i.e., Karatsuba and Toom–Cook) decrease the computational time for polynomial multiplication but with a hardware resource overhead.
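For context, one level of this splitting can be written as the standard Karatsuba identity over GF(2) (recalled here purely for illustration; it is not the method used by our architectures):
\[
A = A_1 x^{m/2} + A_0,\quad B = B_1 x^{m/2} + B_0,\qquad
A\,B = A_1B_1\,x^{m} + \big[(A_0{+}A_1)(B_0{+}B_1) + A_1B_1 + A_0B_0\big]\,x^{m/2} + A_0B_0 ,
\]
so three multiplications of half-length polynomials replace four, and over GF(2) all additions are simple XOR operations.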
Therefore, the novelty of this work lies in splitting only one input polynomial into smaller 32-bit digits. Moreover, to reduce the hardware resources, we reuse the hardware components of the larger 571-bit binary field to compute the multiplications for the smaller 163-, 233-, 283- and 409-bit fields using a schoolbook method. Although this idea is not entirely new, it is still relevant from the perspective of NIST-standardized binary elliptic curve fields. Moreover, the computation of a constant-time polynomial multiplication in four clock cycles for each binary field (i.e., 163, 233, 283, 409 and 571) further highlights the novelty of this work.

1.3. Our Contributions

The main objective of this work is to perform the polynomial multiplications using the same features in the dedicated and scalable architectures. This allows us to make a realistic comparison between the dedicated and scalable designs having the same digit lengths and reduction algorithms. Therefore, to achieve this, our contributions are as follows:
(i)
Five dedicated architectures. We have presented a dedicated digit-parallel architecture for each variant of the NIST-standardized binary elliptic curves, i.e., 163, 233, 283, 409 and 571. The experimental results are evaluated to estimate the area and timing characteristics of the implemented multiplier circuits.
(ii)
A scalable polynomial multiplication architecture. Similar to our dedicated designs, a scalable digit-parallel architecture supporting all variants of NIST-specified binary elliptic curves, i.e., 163, 233, 283, 409 and 571, is described. Our scalable architecture results in a decrease in the hardware resources when compared to the sum of resources of our dedicated architectures.
(iii)
Dedicated and flexible architecture(s) for NIST reduction algorithms. For each dedicated multiplication architecture, a dedicated architecture for NIST-defined reduction routines is provided. Moreover, a unified architecture for NIST-specified reduction algorithms is presented for our scalable design.
(iv)
A dedicated controller. In our scalable architecture, a finite state machine (FSM) controller is used to perform control functionalities over multiplication and reduction operations.
All our dedicated and scalable finite field polynomial multiplication architectures are implemented in Verilog HDL using the Vivado IDE tool on a Xilinx Virtex-7 FPGA device. Our implementation results show that the dedicated multipliers utilize 1182, 1451, 1589, 2093 and 3451 FPGA slices for m = 163, 233, 283, 409 and 571, respectively. Moreover, the dedicated architectures can operate at maximum frequencies of 500, 476, 465, 451 and 443 MHz. Furthermore, due to the higher operational frequency, the power consumption of our dedicated architectures is rather high (3.201 W for 163 and 3.726 W for 571). For all NIST-supported binary fields, our scalable architecture (i) uses 3753 slices, (ii) achieves a 305 MHz frequency, (iii) takes 0.013 μs for one finite field multiplication and (iv) consumes 3.905 W of power. Our scalable digit-parallel architecture is comparatively 8, 6 and 14 times more area-efficient than the most recent state-of-the-art multipliers of [17,21,23]. Moreover, our scalable design is 403.89 times faster in terms of computational time than [23]. The implementation results and comparison with the state of the art reveal that our multiplication architectures are good alternatives for competitive cryptographic applications.
The structure of this paper is formulated as follows. Section 2 describes the schoolbook multiplication method for polynomial multiplications. Dedicated and scalable multiplier architectures are presented in Section 3. We provide the implementation results and a comparison to the state of the art in Section 4. Finally, we present the critical findings of this work in Section 5.

2. Schoolbook Polynomial Multiplication

Schoolbook multiplication is a traditional way to multiply two input polynomials, i.e., A(x) and B(x), of degree m − 1, as shown in Equation (1).
\[
A(x) = \sum_{i=0}^{m-1} a_i x^i, \qquad B(x) = \sum_{i=0}^{m-1} b_i x^i \quad (1)
\]
Then, the product of A(x) and B(x) can be calculated using Equation (2).
\[
D(x) = \sum_{i=0}^{2m-2} \Bigg( \sum_{\substack{j+k=i \\ 0 \le j,k \le m-1}} a_j b_k \Bigg) x^i \quad (2)
\]
In Equations (1) and (2), A and B are two m-bit input polynomials, D is a (2 × m − 1)-bit output polynomial and m defines the length of the input/output polynomials. Moreover, the method requires m^2 multiplications and (m − 1)^2 additions. The complexity of schoolbook multiplication is O(m^2). For further mathematical descriptions, we direct readers to [34].
Amongst several other polynomial multiplication approaches, i.e., bit-parallel, digit-serial, systolic arrays, etc., the schoolbook method is an appropriate alternative when low hardware resources and low power consumption are expected [30,31]. To implement Equation (2) as a hardware accelerator, two different logic styles, i.e., sequential and combinational, can be used. The sequential logic implementation requires clk and reset signals for the timely driven execution of the required operation. Moreover, the computational cost of a sequential logic for the schoolbook polynomial approach is m clock cycles for m-bit input polynomials. On the other hand, a combinational logic without the clk and reset signals results in longer routing delays and ultimately reduces the clock frequency of the multiplier circuit. The computational cost of the schoolbook multiplier can be reduced to a single clock cycle when a fully unrolled (combinational) circuit is implemented, as achieved in [35].
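To make Equation (2) concrete, the following C sketch models the schoolbook product over GF(2) in a shift-and-XOR form, with operands stored as little-endian arrays of 32-bit words. It is a software reference model written for this discussion (the word layout, macro NW and function name are illustrative assumptions), not the hardware datapath of this work.

#include <stdint.h>
#include <stdio.h>

#define W 32                      /* word size in bits                  */
#define NW(m) (((m) + W - 1) / W) /* words needed for an m-bit value    */

/* Schoolbook (carry-less) multiplication over GF(2):
 * d(x) = a(x) * b(x), where a and b are m-bit polynomials and d has
 * 2*m-1 bits. Partial products are formed with AND (testing each bit
 * of b) and accumulated with XOR, mirroring Equation (2). */
static void gf2_schoolbook(const uint32_t *a, const uint32_t *b,
                           uint32_t *d, int m)
{
    for (int i = 0; i < 2 * NW(m); i++) d[i] = 0;

    for (int i = 0; i < m; i++) {                 /* for every bit b_i    */
        if ((b[i / W] >> (i % W)) & 1u) {         /* if b_i = 1 ...       */
            for (int j = 0; j < NW(m); j++) {     /* ... XOR a(x) << i    */
                uint64_t t = (uint64_t)a[j] << (i % W);
                d[i / W + j]     ^= (uint32_t)t;
                d[i / W + j + 1] ^= (uint32_t)(t >> W);
            }
        }
    }
}

int main(void)
{
    /* toy example: (x^3 + x + 1) * (x^2 + 1) = x^5 + x^2 + x + 1 */
    uint32_t a[1] = { 0x0B }, b[1] = { 0x05 }, d[2];
    gf2_schoolbook(a, b, d, 8);
    printf("product = 0x%08X%08X\n", (unsigned)d[1], (unsigned)d[0]); /* 0x...27 */
    return 0;
}

The bit loop over b(x) is the sequential view described above; the dedicated hardware evaluates all partial products concurrently within one clock cycle.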
A polynomial reduction must be performed after each polynomial multiplication. The most widely used polynomial reduction algorithms over the NIST-defined binary fields were originally described in [8]; in this work, these algorithms are reproduced in Appendix A.

3. Our Dedicated and Scalable Multiplier Architectures

As introduced in Section 1.3, we have used the digit-parallel multiplication approach for multiplying two m-bit polynomials. A digit-parallel method can be employed by dividing one input polynomial into smaller digits, with the digit length chosen by the designer. Then, the multiplication of each created digit with the m-bit polynomial is performed, processing the digits from either the least significant or the most significant side. In our dedicated and flexible architectures (presented in Section 3.1 and Section 3.2), we have created digits of the m-bit polynomial B(x) with a size of 32 bits each. Moreover, for simplicity, we have used the least-significant-digit multiplication approach to obtain the resultant polynomials. The flowchart of our polynomial multiplication architectures is illustrated in Figure 1. Similarly, the pseudocode is given in Algorithm 1.
Algorithm 1: Pseudocode of our dedicated and scalable multiplier architectures
Input: Polynomials A(x) and B(x) of m-bit length
Output: Polynomial C(x) of m-bit length
1. for (i from 1 to d = ⌈m/n⌉) do
   1.1 M_i ← D_i(x) · A(x)   (computation of polynomial multiplications using Equation (2))
   1.2 T_i ← M_i ∥ T_{i−1}   (concatenation to produce polynomial T(x) with (2 × m − 1)-bit length)
2. C ← RED(T_i)
3. Return (C(x))
Figure 1 describes the proposed methodology for multiplying polynomial coefficients. It starts with the register transfer level (RTL) implementations of our dedicated and flexible architectures and ends after reporting and comparing the implementation results with the state of the art. The pseudocode of the proposed architectures (shown in Algorithm 1) takes two m-bit polynomials A(x) and B(x) as input and produces a resultant polynomial C(x) of m-bit length as output. Line one in Algorithm 1 performs the multiplication (using Equation (2)) and concatenation operations iteratively (under a loop). The for loop in statement one of Algorithm 1 starts from 1 and continues until d = ⌈m/n⌉, where d is the total number of digits, m is the operand length and n is the digit size (32 in our architectures). Statement two, i.e., the RED() function, in Algorithm 1 performs the polynomial reduction using Algorithms A1–A5 for the targeted input operands of 163, 233, 283, 409 and 571 bits.
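The following C sketch mirrors Algorithm 1 at the word level: B(x) is split into 32-bit digits, every digit is multiplied with the full A(x) (as the M1–M18 blocks do, in parallel, in hardware), and the shifted partial products are XOR-combined into the (2 × m − 1)-bit polynomial T(x). The field-specific reduction RED() (Algorithms A1–A5) is left as a comment; function names, the MW constant and array sizes are illustrative assumptions, not the actual RTL.

#include <stdint.h>
#include <string.h>

#define W  32
#define MW 18                              /* words for up to 571-bit operands */

/* Carry-less product of one 32-bit digit with an m-bit operand a(x). */
static void digit_mul(uint32_t digit, const uint32_t *a, int aw, uint32_t *out)
{
    memset(out, 0, (aw + 1) * sizeof(uint32_t));
    for (int i = 0; i < W; i++) {
        if ((digit >> i) & 1u) {
            for (int j = 0; j < aw; j++) {
                uint64_t t = (uint64_t)a[j] << i;
                out[j]     ^= (uint32_t)t;
                out[j + 1] ^= (uint32_t)(t >> W);
            }
        }
    }
}

/* Digit-parallel multiplication following Algorithm 1: each 32-bit digit of
 * b(x) is multiplied with a(x) and the partial products are XOR-combined at
 * their word offsets to form the (2*m-1)-bit polynomial t(x). */
static void digit_parallel_mul(const uint32_t *a, const uint32_t *b, uint32_t *t, int m)
{
    int aw = (m + W - 1) / W;              /* words per operand = number of digits d */
    uint32_t mi[MW + 1];

    memset(t, 0, 2 * aw * sizeof(uint32_t));
    for (int i = 0; i < aw; i++) {         /* in hardware, M1..Md operate in parallel */
        digit_mul(b[i], a, aw, mi);
        for (int j = 0; j <= aw; j++)
            t[i + j] ^= mi[j];
    }
    /* RED(t) -> c(x) would follow here (Algorithms A1-A5). */
}

int main(void)
{
    uint32_t a[6] = { 3 }, b[6] = { 5 }, t[12];   /* toy GF(2^163)-sized operands */
    digit_parallel_mul(a, b, t, 163);             /* t[0] == 0x0F: (x+1)*(x^2+1)  */
    return (t[0] == 0x0F) ? 0 : 1;
}

For m = 163, this model uses d = 6 digits (five full 32-bit digits plus a top word carrying the remaining 3 bits), matching the digit decomposition of Figure 2; the hardware evaluates the digit products concurrently, whereas this software loop evaluates them sequentially.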

3.1. Dedicated Polynomial Multiplication Architectures

For the NIST-defined binary fields over GF(2^163) to GF(2^571), our digit-parallel polynomial multiplication architectures are illustrated in Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6. All our proposed dedicated multipliers take two m-bit polynomials as input and produce a (2 × m − 1)-bit polynomial as output. The value of m determines the underlying elliptic curve field (i.e., 163, 233, 283, 409 and 571). Further descriptions of these multipliers are given below.
In order to perform a polynomial multiplication over GF(2^m), our dedicated and scalable architectures take two m-bit polynomials, i.e., A(x) and B(x), as input and produce an m-bit polynomial (i.e., C(x)) as output. It is important to note that, for the digit-level polynomial multiplications, we have created digits of the polynomial B(x), while A(x) is kept at its full m-bit width. Therefore, a total of five 32-bit digits and one 3-bit digit are required for the implementation of a polynomial multiplication over GF(2^163), as given in Figure 2. Similarly, as shown in Figure 3, Figure 4, Figure 5 and Figure 6, a total of eight (seven 32-bit digits and one 9-bit digit), nine (eight 32-bit digits and one 27-bit digit), thirteen (twelve 32-bit digits and one 25-bit digit) and eighteen (seventeen 32-bit digits and one 27-bit digit) digits are required for polynomial multiplication over GF(2^m) with m = 233, 283, 409 and 571, respectively. After creating the digits of polynomial B(x), each digit is multiplied with the input polynomial A(x). The M1 to M18 multipliers used in Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6 perform the polynomial multiplication and produce an output polynomial whose length depends on the lengths of the inputs. For example, the M1 to M5 multipliers (in Figure 2) produce output polynomials of length 32 + 163 − 1 bits, while M6 produces 3 + 163 − 1 bits. This also applies to the remaining dedicated and scalable multiplication architectures. After the multiplication of each 32-bit digit with the m-bit input polynomial A(x), a concatenation is applied to generate a resultant polynomial of length (2 × m) − 1 bits. A resultant polynomial of m-bit size is then generated by employing the NIST-recommended reduction algorithms (shown in Algorithms A1–A5).

3.2. Scalable Multiplication Architecture

For the NIST-defined binary fields over GF(2^163) to GF(2^571), our scalable polynomial multiplication architecture is shown in Figure 7. It consists of a finite state machine (FSM), a multiplier block and a reduction block. Details of these blocks are as follows.
FSM. The FSM of our scalable architecture incorporates four states, i.e., (i) RESET, (ii) START, (iii) MULT and (iv) RED. Before starting a polynomial multiplication, the RESET state is responsible for resetting the pipeline registers of the input polynomials A(x) and B(x). The purpose of the START state is to load the correct m-bit input polynomials into the corresponding pipeline registers (A_REG and B_REG). After the correct input polynomials have been provided, the next state, i.e., MULT, performs the multiplication of the two m-bit polynomials to generate a (2 × m − 1)-bit polynomial. Finally, the RED state performs the polynomial reduction selected by a three-bit selRED signal. Each state of the FSM takes one clock cycle. Therefore, our scalable architecture performs one NIST-recommended polynomial multiplication in four clock cycles.
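For illustration only, the four-state control sequence can be modeled by the following minimal C sketch (the state names follow the text above, the datapath operations are only named in comments, and the actual controller is part of the Verilog RTL):

#include <stdio.h>

typedef enum { RESET, START, MULT, RED } state_t;

int main(void)
{
    state_t s = RESET;
    for (int cycle = 1; cycle <= 4; cycle++) {   /* one state per clock cycle */
        switch (s) {
        case RESET: printf("cycle %d: clear pipeline registers A_REG, B_REG\n", cycle);   s = START; break;
        case START: printf("cycle %d: load m-bit operands A(x), B(x)\n", cycle);          s = MULT;  break;
        case MULT:  printf("cycle %d: digit-parallel multiplication (M1..M18)\n", cycle); s = RED;   break;
        case RED:   printf("cycle %d: reduction selected by selRED\n", cycle);            s = RESET; break;
        }
    }
    return 0;
}

The model simply makes visible that one complete multiplication, including reduction, finishes every four cycles regardless of the selected field, which is the constant-time property claimed in Section 1.2.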
Multiplier block. Similar to our dedicated digit-parallel multiplication architecture over GF(2^571), a total of seventeen 32-bit digits and one 27-bit digit are used in our scalable architecture for the implementation of a polynomial multiplication over GF(2^163) to GF(2^571), as presented in Figure 7. The use of seventeen 32-bit digits and one 27-bit digit also allows us to perform multiplications over the remaining NIST-defined binary fields, i.e., 163, 233, 283 and 409. As mentioned earlier, before starting a multiplication, it is essential to reset the pipeline registers (A_REG and B_REG) used in our scalable architecture. After resetting these pipeline registers, the correct input operands are loaded. The length of the input operands does not change the datapath: for each respective binary field length (i.e., 163, 233, 283, 409 and 571), our scalable architecture always operates on 571 bits. For example, for a 163-bit operand length, our scalable architecture takes two 571-bit operands as input and produces a (2 × 571 − 1)-bit output. Out of these 571-bit inputs, the initial 163 bits (from the least significant side) carry the actual operand, while the remaining bits from 164 to 571 are 0. Similarly, in the output polynomial of length 2 × 571 − 1, the first 325 bits (i.e., 2 × 163 − 1) are valid, while the bits from 326 to the end are 0. Based on this principle, our scalable architecture performs polynomial multiplication.
Inside the M1 to M18 multipliers, the schoolbook logic of Equation (2) is inferred for the multiplications. More precisely, we generate the partial products using logical AND operations. Then, an exclusive-OR (XOR) operation is performed on the corresponding bits of the generated partial products to produce resultant polynomials of length 32 + 571 − 1 (for the M1 to M17 multipliers) and 27 + 571 − 1 (for the M18 multiplier), respectively. After the multiplication of each digit of polynomial B(x) with the 571-bit input polynomial A(x), we perform a concatenation operation to generate the resultant polynomial of length (2 × 571) − 1 bits.
Reduction block. The reduction block of our scalable architecture incorporates one instance of each of Algorithms A1–A5. Each instance takes the same (2 × m − 1)-bit polynomial as input and produces one reduced polynomial as output; the five instances therefore provide five candidate outputs. Based on the three-bit selRED signal, a 5 × 1 multiplexer selects the appropriate output of Algorithms A1–A5.

4. Implementation Results and Comparisons

The implementation results of our dedicated and scalable multiplier architectures are presented in Section 4.1. The comparison between our dedicated and scalable architectures is given in Section 4.2. The comparison with the state of the art is provided in Section 4.3.

4.1. Results

Using the Vivado IDE tool, our dedicated and scalable architectures are modeled in Verilog HDL. The implementation results on the Virtex-7 FPGA device are presented in Table 1. The underlying field m is given in column one. The FPGA slices and look-up tables (LUTs) are listed in columns two and three, respectively, in order to investigate the utilized hardware resources. The required clock cycles (CCs) and the time period (in ns) provided for logic synthesis are given in columns four and five, respectively. The operational frequency (in MHz) of the multiplier circuits is presented in column six. Column seven shows the latency (in μs), i.e., the time to perform one finite field multiplication. The total power consumption of our multiplier circuits is given in column eight. Finally, to provide a reasonable comparison, we have defined a performance metric as slices times latency, shown in the last column. The lower the value of slices × latency, the higher the performance. It is important to note that the slices, LUTs, operational frequency and power figures in Table 1 are obtained from the Vivado IDE tool. The values for latency are calculated using Equation (3). The performance metric is calculated using Equation (4).
\[
\text{Latency (in } \mu\text{s)} = \frac{\text{Clock cycles}}{\text{Clock frequency (in MHz)}} \quad (3)
\]
\[
\text{Performance metric} = \text{FPGA slices} \times \text{Latency (in } \mu\text{s)} \quad (4)
\]
For the dedicated designs, Table 1 reveals that an increase in the length of the underlying field increases the slices and LUTs. All the dedicated multipliers require one clock cycle for one finite field polynomial multiplication. The dedicated multipliers can operate at clock frequencies of 500, 476, 465, 451 and 443 MHz for m = 163, 233, 283, 409 and 571, respectively. The increase in the underlying field or operand length increases the computational time, i.e., latency (see column seven). Similar to latency, the power values increase with the length of the binary field. Moreover, due to the parallel multiplication architecture style, our multiplication circuits consume more power. Although our dedicated multiplication architectures are power-consuming, they can operate at more than 440 MHz even for the larger 571-bit binary field length. There is always a tradeoff. In short, with a clock cycle overhead, the power consumption of our architectures could be reduced by employing a digit-serial multiplication style instead of the fully parallel approach. The product of slices × latency, given in the last column of Table 1, increases with the length of the utilized binary field.
For all the NIST-recommended binary fields (163, 233, 283, 409 and 571), our scalable architecture utilizes 3753 slices and 7461 LUTs. It takes four clock cycles, corresponding to 0.013 μs, for one polynomial multiplication. It can operate at a maximum clock frequency of 305 MHz. The power consumption of our scalable architecture is 3.905 W.
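For example, substituting the scalable design's figures into Equations (3) and (4) reproduces the reported values:
\[
\text{Latency} = \frac{4}{305~\text{MHz}} \approx 0.01311~\mu\text{s}, \qquad
\text{Performance metric} = 3753 \times 0.01311 \approx 49.2 .
\]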

4.2. Comparison of Our Dedicated Multipliers with Scalable Architecture

Slices and LUTs comparison. As shown in columns two and three of Table 1, the slices and LUTs utilized by each individual dedicated multiplier are lower than the slices and LUTs of our flexible design. The reason is the support for all NIST-defined binary fields in our flexible architecture. For a fairer comparison, if we sum the slices and LUTs of all the binary fields together, the totals become 9766 and 25,987, which are 2.60 (ratio of 9766 to 3753) and 3.48 (ratio of 25,987 to 7461) times higher than the slices and LUTs of our flexible architecture.
Comparison over computational time in terms of clock cycles, frequency and latency. Due to its flexibility, our scalable design requires more clock cycles (i.e., four), as it supports all the NIST-defined binary fields. The dedicated architectures require only one clock cycle for computation, including the reduction over polynomials. As expected, the flexible architecture operates at a lower frequency of 305 MHz as compared to our dedicated multiplier designs of Figure 2, Figure 3, Figure 4, Figure 5 and Figure 6. Similarly, column seven in Table 1 provides the latency values, which indicate that the dedicated design over GF(2^571) is 5.82 (ratio of 0.01311 to 0.00225) times faster in terms of computational time than the flexible design of Figure 7. It is important to clarify that the flexible design has a critical path similar to that of the dedicated design of Figure 6. Even though the critical paths are almost similar, the latency of the flexible design is 5.82 times higher. The reasons are the requirement for four times as many clock cycles as the dedicated design of Figure 6 and the additional 5 × 1 multiplexer in the data path of Figure 7. Another cause is the lower operational frequency of our scalable design in Figure 7. Consequently, the latency is the ratio of clock cycles to operational frequency, as given in Equation (3); this ratio (i.e., 4 over 305) results in more computational time in terms of latency than the latency (i.e., the ratio of 1 over 443) calculated for our design in Figure 6.
Comparison over consumed power. For the dedicated architecture of the larger 571-bit binary field length, the consumed power is 3.726 W, which is 1.04 (ratio of 3.905 to 3.726) times lower than that of the flexible design. The reason is the longer routing delays in our flexible multiplier architecture as compared to our dedicated multiplication designs. To summarize the comparisons, the flexible architecture results in lower hardware resource utilization (slices and LUTs) than the dedicated designs combined, while the dedicated multiplication architectures result in lower computational time (latency) and consumed power. Therefore, there is always a tradeoff between the different design parameters.
Comparison over slices × latency. The calculated values of the product of slices × latency for the dedicated architectures are 2.364, 3.047, 3.416, 4.625 and 7.764. Similarly, the value for the flexible design is 49.201. As mentioned earlier in this article, the lower the value of slices × latency, the higher the performance. Thus, the performance (in terms of slices × latency) of our scalable architecture can be improved either by reducing the hardware resources or by improving the latency through a shorter critical path. The hardware resources of our flexible architecture can be reduced by employing a digit-serial multiplication approach instead of the digit-parallel one, but with a performance overhead. Another technique to reduce the hardware resources is the use of a unified reduction block for Algorithms A1–A5. On the other hand, the latency of our flexible design can be improved by reducing the clock cycles and shortening the critical path; in this context, pipelining could be beneficial.

4.3. Comparison to State of the Art

There are certain limitations when providing a realistic comparison to the state of the art. For example, the comparison to multipliers, published in [14,15,16], is not possible because these architectures are implemented in a bit-serial fashion, while we employed a digit-parallel implementation style. Additionally, the implementations of these designs are provided on different ASIC technologies, while we reported results on a 28 nm FPGA technology (Virtex-7). Similarly, the comparison to [25] could be more challenging as the complexity of multiplier design is offered in mathematical formulations rather than providing the actual FPGA implementation. The comparison to some digit-parallel multipliers, reported in [28,29], is also not possible as these are used for point multiplication computation in ECC without providing the area and timing details of the utilized multipliers. It is important to note that we have presented our comparison with only the most relevant bit-parallel state-of-the-art multiplication architectures in Table 2. Moreover, we have also synthesized our multiplication circuits for the FPGAs that are utilized in state-of-the-art implementations. Therefore, column one in Table 2 provides the reference design. The name of the corresponding multiplier is given in column two. The utilized binary field is shown in column three. Column four provides the implementation device. The area utilizations (slices) are provided in column five. Columns six to eight present the timing information in terms of clock cycles (CCs), clock frequency and latency. Column nine presents the power consumption of the multiplier architectures. Finally, we have used a symbol (–) in Table 2, where the relevant information is not provided.

4.3.1. Comparison to Bit-Parallel Systolic Multiplier Architectures

Comparison of our designs of Figure 2 and Figure 7 over Virtex-7 with [17,18,19,20]. A scalable polynomial multiplication architecture of [17] is 27.65 (ratio of 32,685 with 1182) times more area-consuming in terms of FPGA slices as compared to our dedicated architecture of Figure 2. The scalability in [17] is achieved with the division of processing nodes into k cells to perform bit-wise AND and exclusive(OR) operations. The splitting of processing nodes into several cells results in higher hardware resources. When we consider the slices of our Figure 7 for comparison, the proposed flexible design utilizes 8.70 (ratio of 32,685 with 3753) times lower hardware area. As far as the power consumption is concerned, the proposed multipliers of Figure 2 and Figure 7 are 1.64 (ratio of 5.277 with 3.201) and 1.35 (ratio of 5.277 with 3.905) times more power-efficient as compared to the bit-parallel systolic multiplier of [17]. Therefore, the division of processing nodes into k cells results in higher hardware resources and consumed power as compared to our dedicated and flexible designs, where we employ a digit-parallel-based schoolbook multiplication approach with a digit length of 32 bits. Another bit-parallel systolic multiplier over G F ( 2 m ) is illustrated in [18], where the used FPGA slices are 130.82 (ratio of 154,635 with 1182) times higher than our dedicated architecture of Figure 2. This is due to the use of interleaved conventional multiplication and folded techniques. When we consider the slices of our Figure 7 for comparison, the proposed flexible design utilizes 41.20 (ratio of 154,635 with 3753) times lower hardware area. The employed interleaved conventional multiplication and folded techniques in [18] consume 1.12 (ratio of 3.600 with 3.201) times more power as compared to the proposed multiplier of Figure 2. On the other hand, our flexible multiplier design is 1.08 (ratio of 3.905 with 3.600) times more power-consuming. This is due to the longer critical path as compared to the critical path of the parallel-systolic multiplier of [18].
The Montgomery-based versatile bit-parallel systolic multiplication architecture of [19] is 89.49 (ratio of 105,787 with 1182) times more area-consuming as compared to our digit-parallel schoolbook-based dedicated architecture of Figure 2. Likewise, the slices of [19] are 28.18 (ratio of 105,787 with 3753) times higher than our flexible digit-parallel-based schoolbook multiplier design of Figure 7. Due to the versatile nature in [19], the power consumption is 1.93 (ratio of 6.187 with 3.201) and 1.58 (ratio of 6.187 with 3.905) times higher as compared to our dedicated and flexible designs. With the aim of obtaining low-power and low-hardware resource utilizations, the bit-parallel polynomial basis systolic multiplier architecture of [20] utilizes 56.20 (ratio of 66,434 with 1182) and 17.70 (ratio of 66,434 with 3753) times more FPGA slices as compared with our dedicated and flexible architectures of Figure 2 and Figure 7, respectively. The cause of the higher hardware resource utilizations in [20] is the use of a modular interleaved multiplication method. As shown in the last column of Table 2, the architecture of [20] is more power-efficient as compared to our multipliers of Figure 2 and Figure 7. There is always a tradeoff between several design parameters such as area and power (in this case). It is essential to note that the additional design parameters, i.e., clock cycles, clock frequency and latency, cannot be compared as the relevant information is not provided (see the corresponding columns in Table 2).
Comparison of our architectures of Figure 3 and Figure 7 over Virtex-7 with [21,22,23]. Concerning the FPGA slices of [21,22] and [23] for comparison, our dedicated architecture of Figure 3 utilizes 15.75 (ratio of 22,864 with 1451), 65.12 (ratio of 94,498 with 1451) and 38.74 (ratio of 56,223 with 1451) times lower hardware resources. Likewise, our scalable design of Figure 7 utilizes 6.09 (ratio of 22,864 with 3753), 25.17 (ratio of 94,498 with 3753) and 14.98 (ratio of 56,223 with 3753) times lower hardware resources. The reason for the use of higher hardware resources in [21,22,23] is the systolic multiplication designs. In these architectures, several small arrays are employed in parallel to multiply and accumulate for obtaining the resultant polynomial after multiplication. In our dedicated and flexible architectures, we have performed several smaller polynomial multiplications in parallel to each other using a schoolbook multiplication method. The use of a schoolbook method results in lower hardware resources (in terms of FPGA slices). Apart from the hardware resources, the proposed dedicated and flexible architectures consume more power as compared to multiplier designs published in [21,22,23]. One factor behind the consumption of more power is the higher clock frequency in our dedicated and flexible architectures, while the state-of-the-art designs are operating on a lower clock frequency (given in column seven of Table 2). For example, the multiplier design of [23] can operate at a maximum frequency of 44 MHz, while our dedicated and flexible architectures can operate at 476 and 305 MHz. Due to the higher operational frequencies, our dedicated and flexible designs are comparatively 2521 (ratio of 5.295 μ s with 0.00210 μ s) and 403.89 (ratio of 5.295 μ s with 0.01311 μ s) times faster in terms of latency as compared to [23].

4.3.2. Comparison to Karatsuba and Montgomery Multiplier Architectures

Comparison of our architectures of Figure 5 and Figure 7 over Artix-7 with [24]. When considering the FPGA LUTs for comparison, our dedicated and flexible architectures of Figure 5 and Figure 7 utilize 8.03 (ratio of 49,211 with 6128) and 6.13 (ratio of 49,211 with 8019) times lower hardware resources, respectively. As we mentioned earlier in Section 1.2, the Karatsuba multiplication is a bit-parallel multiplication approach where the inputs to the Karatsuba multiplier are split into two smaller polynomials. Then, the smaller polynomials are further split into two additional polynomials. The repetition of this process is essential to obtain enough smaller polynomials to apply the schoolbook multiplication method for the polynomial multiplications. Therefore, to generate a resultant polynomial for the Karatsuba method, the multiplication over smaller polynomials must be performed in reverse order (i.e., from smaller polynomials to larger). Thus, splitting for the Karatsuba multiplier results in higher hardware resources as compared to our architectures, where we use the digit-parallel method instead of bit-parallel. As shown in Table 2, the comparison in terms of clock cycles, clock frequency, latency and power is not possible as the relevant information is not given.
Comparison of our architectures of Figure 5 and Figure 7 over Spartan-7 with [27]. Concerning the FPGA LUTs for comparison, our dedicated and flexible architectures of Figure 5 and Figure 7 utilize 5.90 (ratio of 40,056 with 6784) and 4.62 (ratio of 40,056 with 8653) times lower hardware resources, respectively. This is due to a novel architecture (termed a composite model) based on the m-term Karatsuba-like and schoolbook multiplication method. In their composite solution, the m-term Karatsuba-like method uses all polynomial splits to multiply polynomial coefficients rather than the least split polynomials. A simple schoolbook method is applied for one time to perform polynomial multiplications over the smallest splits. Consequently, the combined use of a schoolbook and Karatsuba multiplier results in higher hardware resources as compared to our designs, where we only use a schoolbook multiplication method. Similar to [24], the comparison in terms of clock cycles, clock frequency, latency and power parameters is not possible as the corresponding information is not available.
Comparison of our architectures of Figure 6 and Figure 7 over Virtex-7 with [26]. The full-word Montgomery modular multiplication architecture consumes 2.98 (ratio of 20,695 with 6943) and 2.77 (ratio of 20,695 with 7461) times higher hardware resources as compared to our dedicated and flexible architectures of Figure 6 and Figure 7, respectively. The use of a Karatsuba algorithm for splitting input polynomials into four parts and deep four parts is the reason for the use of higher hardware resources in the Montgomery multiplier design of [26]. In our multiplication architectures, we have not used any specific algorithm (or protocol) to split the input polynomials. Moreover, the use of a simple schoolbook multiplication method is another reason for the lower hardware resources in our designs. Despite the hardware resources, our dedicated and flexible designs are faster in terms of clock cycles, operational clock frequency and computational time (see columns six to eight of Table 2). The power comparison is not possible as the required information is not available in [26].

4.3.3. Overall Summary

Due to the use of a schoolbook multiplication method in our proposed digit-parallel multiplier architectures (i.e., dedicated and flexible/scalable), our designs are more area-efficient in terms of FPGA slices than the architectures reported in [17,18,19,20,21,22,23,24,26,27]. Moreover, our proposed dedicated and scalable multiplication designs consume less power than the multiplication architectures published in [17,18,19]. On the other hand, the architectures of [20,21,22,23] are more power-efficient than our multiplication architectures. For most state-of-the-art designs, a comparison in terms of timing parameters (i.e., clock cycles, frequency and latency) is not possible as the relevant information is not reported. Consequently, our dedicated and scalable multiplication architectures report competitive results in comparison to state-of-the-art multiplication designs.

5. Conclusions

This article has proposed a scalable/flexible finite field polynomial multiplier architecture for NIST-standardized binary elliptic fields. Initially, a dedicated digit-parallel architecture was proposed for each binary field recommended by NIST, i.e., 163, 233, 283, 409 and 571. Then, a scalable architecture supporting all variants of binary elliptic curve fields was proposed. For performance investigation, we compared the dedicated multiplier architectures with the scalable design. After this, the dedicated and scalable architectures were compared with the most relevant state-of-the-art multipliers. All multiplier architectures are implemented in Verilog HDL using the Vivado IDE tool. The implementation results are reported on a 28 nm Virtex-7 FPGA technology. The obtained results demonstrate that the dedicated multipliers utilize 1182, 1451, 1589, 2093 and 3451 FPGA slices for m = 163, 233, 283, 409 and 571, respectively. These can operate at maximum frequencies of 500, 476, 465, 451 and 443 MHz. Due to the higher clock frequency, the power consumption is relatively high (3.201 W for 163 and 3.726 W for 571). Similarly, for all supported binary fields, our scalable architecture (i) utilizes 3753 slices, (ii) achieves a 305 MHz clock frequency, (iii) takes 0.013 μs for one finite field multiplication and (iv) consumes 3.905 W of power. Our scalable digit-parallel architecture is comparatively 8, 6 and 14 times more area-efficient than the most recent state-of-the-art multipliers of [17,21,23]. Moreover, our scalable design is 403.89 times faster in terms of computational time than [23]. The implementation results and the comparison with the state of the art show the significance of this work.

Author Contributions

Conceptualization, M.R. and A.A.; Data extraction, S.Z.K. and S.S.A.; results compilation, M.R. and S.Z.K.; validation, M.R. and S.S.A.; writing—original draft preparation, S.Z.K. and H.K.; critical review, M.R. and I.B.; draft optimization, S.Z.K. and H.K.; supervision, H.K. and I.B.; funding acquisition, H.K. All authors have read and agreed to the published version of the manuscript.

Funding

We are thankful for the support of the Deanship of Scientific Research at King Khalid University, Abha, Saudi Arabia for funding this work under grant number R.G.P.2/132/42.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

Appendix A. Polynomial Reduction Algorithms over GF(2^m)

For each NIST-recommended binary field length, the reduction algorithms used in our work are given in Algorithms A1–A5. For additional mathematical descriptions and complete formulation constructions, we refer readers to [8].
Algorithm A1: Polynomial Reduction Algorithm over GF(2^163) (Algorithm 2.41 of [8])
Input: Polynomial D(x) with (2 × m − 1)-bit length
Output: Polynomial D(x) with m-bit length
1. for (i from 10 downto 6) do
   1.1 T ← D[i]
   1.2 D[i−6] ← D[i−6] ⊕ (T ≪ 29)
   1.3 D[i−5] ← D[i−5] ⊕ (T ≪ 4) ⊕ (T ≪ 3) ⊕ T ⊕ (T ≫ 3)
   1.4 D[i−4] ← D[i−4] ⊕ (T ≫ 28) ⊕ (T ≫ 29)
2. T ← D[5] ≫ 3
3. D[0] ← D[0] ⊕ (T ≪ 7) ⊕ (T ≪ 6) ⊕ (T ≪ 3) ⊕ T
4. D[1] ← D[1] ⊕ (T ≫ 25) ⊕ (T ≫ 26)
5. D[5] ← D[5] ∧ 0x7
6. Return (D[5], D[4], D[3], D[2], D[1], D[0])
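For illustration, a direct C transcription of Algorithm A1 on 32-bit words might look as follows. It is a reference sketch only (the function name and the assumption that D[] holds the (2 × 163 − 1)-bit product in eleven little-endian 32-bit words are ours), not the RTL used in this work; Algorithms A2–A5 translate analogously.

#include <stdint.h>

/* Reduction modulo f(z) = z^163 + z^7 + z^6 + z^3 + 1 (Algorithm A1).
 * D[0..10] holds the (2*163-1)-bit product, least significant word first;
 * on return, D[0..5] holds the reduced 163-bit result. */
void gf2_163_reduce(uint32_t D[11])
{
    uint32_t T;
    for (int i = 10; i >= 6; i--) {
        T = D[i];
        D[i - 6] ^= (T << 29);
        D[i - 5] ^= (T << 4) ^ (T << 3) ^ T ^ (T >> 3);
        D[i - 4] ^= (T >> 28) ^ (T >> 29);
    }
    T = D[5] >> 3;
    D[0] ^= (T << 7) ^ (T << 6) ^ (T << 3) ^ T;
    D[1] ^= (T >> 25) ^ (T >> 26);
    D[5] &= 0x7;                        /* clear bits 163..191 */
}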
Algorithm A2: Polynomial Reduction Algorithm over GF(2^233) (Algorithm 2.42 of [8])
Input: Polynomial D(x) with (2 × m − 1)-bit length
Output: Polynomial D(x) with m-bit length
1. for (i from 15 downto 8) do
   1.1 T ← D[i]
   1.2 D[i−8] ← D[i−8] ⊕ (T ≪ 23)
   1.3 D[i−7] ← D[i−7] ⊕ (T ≫ 9)
   1.4 D[i−5] ← D[i−5] ⊕ (T ≪ 1)
   1.5 D[i−4] ← D[i−4] ⊕ (T ≫ 31)
2. T ← D[7] ≫ 9
3. D[0] ← D[0] ⊕ T
4. D[2] ← D[2] ⊕ (T ≪ 10)
5. D[3] ← D[3] ⊕ (T ≫ 22)
6. D[7] ← D[7] ∧ 0x1FF
7. Return (D[7], D[6], D[5], D[4], D[3], D[2], D[1], D[0])
Algorithm A3: Polynomial Reduction Algorithm over GF(2^283) (Algorithm 2.43 of [8])
Input: Polynomial D(x) with (2 × m − 1)-bit length
Output: Polynomial D(x) with m-bit length
1. for (i from 17 downto 9) do
   1.1 T ← D[i]
   1.2 D[i−9] ← D[i−9] ⊕ (T ≪ 5) ⊕ (T ≪ 10) ⊕ (T ≪ 12) ⊕ (T ≪ 17)
   1.3 D[i−8] ← D[i−8] ⊕ (T ≫ 27) ⊕ (T ≫ 22) ⊕ (T ≫ 20) ⊕ (T ≫ 15)
2. T ← D[8] ≫ 27
3. D[0] ← D[0] ⊕ T ⊕ (T ≪ 5) ⊕ (T ≪ 7) ⊕ (T ≪ 12)
4. D[8] ← D[8] ∧ 0x7FFFFFF
5. Return (D[8], D[7], D[6], D[5], D[4], D[3], D[2], D[1], D[0])
Algorithm A4: Polynomial Reduction Algorithm over GF(2^409) (Algorithm 2.44 of [8])
Input: Polynomial D(x) with (2 × m − 1)-bit length
Output: Polynomial D(x) with m-bit length
1. for (i from 25 downto 13) do
   1.1 T ← D[i]
   1.2 D[i−13] ← D[i−13] ⊕ (T ≪ 7)
   1.3 D[i−12] ← D[i−12] ⊕ (T ≫ 25)
   1.4 D[i−11] ← D[i−11] ⊕ (T ≪ 30)
   1.5 D[i−10] ← D[i−10] ⊕ (T ≫ 2)
2. T ← D[12] ≫ 25
3. D[0] ← D[0] ⊕ T
4. D[2] ← D[2] ⊕ (T ≪ 23)
5. D[12] ← D[12] ∧ 0x1FFFFFF
6. Return (D[12], …, D[3], D[2], D[1], D[0])
Algorithm A5: Polynomial Reduction Algorithm over GF(2^571) (Algorithm 2.45 of [8])
Input: Polynomial D(x) with (2 × m − 1)-bit length
Output: Polynomial D(x) with m-bit length
1. for (i from 35 downto 18) do
   1.1 T ← D[i]
   1.2 D[i−18] ← D[i−18] ⊕ (T ≪ 5) ⊕ (T ≪ 7) ⊕ (T ≪ 10) ⊕ (T ≪ 15)
   1.3 D[i−17] ← D[i−17] ⊕ (T ≫ 27) ⊕ (T ≫ 25) ⊕ (T ≫ 22) ⊕ (T ≫ 17)
2. T ← D[17] ≫ 27
3. D[0] ← D[0] ⊕ T ⊕ (T ≪ 2) ⊕ (T ≪ 5) ⊕ (T ≪ 10)
4. D[17] ← D[17] ∧ 0x7FFFFFF
5. Return (D[17], …, D[3], D[2], D[1], D[0])

References

  1. Rivest, R.L.; Shamir, A.; Adleman, L. A Method for Obtaining Digital Signatures and Public-Key Cryptosystems. Commun. ACM 1978, 21, 120–126.
  2. Koblitz, N. Elliptic curve cryptosystems. Math. Comput. 1987, 48, 203–209.
  3. Miller, V.S. Use of elliptic curves in cryptography. In Conference on the Theory and Application of Cryptographic Techniques; Springer: Berlin/Heidelberg, Germany, 1985; pp. 417–426.
  4. Rashid, M.; Jamal, S.S.; Khan, S.Z.; Alharbi, A.R.; Aljaedi, A.; Imran, M. Elliptic-Curve Crypto Processor for RFID Applications. Appl. Sci. 2021, 11, 7079.
  5. Calderoni, L.; Maio, D. Lightweight Security Settings in RFID Technology for Smart Agri-Food Certification. In Proceedings of the 2020 IEEE International Conference on Smart Computing (SMARTCOMP), Bologna, Italy, 14–17 September 2020; pp. 226–231.
  6. Dyka, Z.; Langendörfer, P. Improving the Security of Wireless Sensor Networks by Protecting the Sensor Nodes against Side Channel Attacks. In Wireless Networks and Security: Issues, Challenges and Research Trends; Khan, S., Khan Pathan, A.S., Eds.; Springer: Berlin/Heidelberg, Germany, 2013; pp. 303–328.
  7. NIST. Recommended Elliptic Curves for Federal Government Use (1999). Available online: https://csrc.nist.gov/csrc/media/publications/fips/186/2/archive/2000-01-27/documents/fips186-2.pdf (accessed on 19 February 2022).
  8. Hankerson, D.; Menezes, A.J.; Vanstone, S. Guide to Elliptic Curve Cryptography. 2004, pp. 1–311. Available online: https://link.springer.com/book/10.1007/b97644 (accessed on 14 February 2022).
  9. Yeh, L.Y.; Chen, P.J.; Pai, C.C.; Liu, T.T. An Energy-Efficient Dual-Field Elliptic Curve Cryptography Processor for Internet of Things Applications. IEEE Trans. Circuits Syst. II Express Briefs 2020, 67, 1614–1618.
  10. Jafri, A.R.; Islam, M.N.; Imran, M.; Rashid, M. Towards an Optimized Architecture for Unified Binary Huff Curves. J. Circuits Syst. Comput. 2017, 26, 1750178.
  11. Rashid, M.; Imran, M.; Jafri, A.R.; Mehmood, Z. A 4-Stage Pipelined Architecture for Point Multiplication of Binary Huff Curves. J. Circuits Syst. Comput. 2020, 29, 2050179.
  12. Fournaris, A.P.; Koufopavlou, O. Affine Coordinate Binary Edwards Curve Scalar Multiplier with Side Channel Attack Resistance. In Proceedings of the 2015 Euromicro Conference on Digital System Design, Madeira, Portugal, 26–28 August 2015; pp. 431–437.
  13. Imran, M.; Rashid, M.; Raza Jafri, A.; Najam-ul-Islam, M. ACryp-Proc: Flexible Asymmetric Crypto Processor for Point Multiplication. IEEE Access 2018, 6, 22778–22793.
  14. Pillutla, S.R.; Boppana, L. An area-efficient bit-serial sequential polynomial basis finite field GF(2m) multiplier. AEU Int. J. Electron. Commun. 2020, 114, 153017.
  15. Morales-Sandoval, M.; Feregrino-Uribe, C.; Kitsos, P. Bit-serial and digit-serial GF(2m) Montgomery multipliers using linear feedback shift registers. IET Comput. Digit. Tech. 2011, 5, 86–94.
  16. Gebali, F.; Ibrahim, A. Efficient Scalable Serial Multiplier Over GF(2m) Based on Trinomial. IEEE Trans. Very Large Scale Integr. Syst. 2015, 23, 2322–2326.
  17. Devi, S.; Mahajan, R.; Bagai, D. Low complexity design of bit parallel polynomial basis systolic multiplier using irreducible polynomials. Egypt. Inform. J. 2022, 23, 105–112.
  18. Lee, C.Y. Low-complexity bit-parallel systolic multipliers over GF(2m). Integration 2008, 41, 106–112.
  19. Fournaris, A.P.; Koufopavlou, O. Versatile multiplier architectures in GF(2k) fields using the Montgomery multiplication algorithm. Integration 2008, 41, 371–384.
  20. Mathe, S.E.; Boppana, L. Low-power and low-hardware bit-parallel polynomial basis systolic multiplier over GF(2m) for irreducible polynomials. ETRI J. 2017, 39, 570–581.
  21. Devi, S.; Mahajan, R.; Bagai, D. A low complexity bit parallel polynomial basis systolic multiplier for general irreducible polynomials and trinomials. Microelectron. J. 2021, 115, 105163.
  22. Bayat-Sarmadi, S.; Farmani, M. High-throughput low-complexity systolic Montgomery multiplication over GF(2m) based on trinomials. IEEE Trans. Circuits Syst. II Express Briefs 2015, 62, 377–381.
  23. Mathe, S.E.; Boppana, L. Bit-parallel systolic multiplier over GF(2m) for irreducible trinomials with ASIC and FPGA implementations. IET Circuits Devices Syst. 2018, 12, 315–325.
  24. Heidarpur, M.; Mirhassani, M. An Efficient and High-Speed Overlap-Free Karatsuba-Based Finite-Field Multiplier for FPGA Implementation. IEEE Trans. Very Large Scale Integr. Syst. 2021, 29, 667–676.
  25. Ibrahim, A.; Gebali, F. Energy-Efficient Word-Serial Processor for Field Multiplication and Squaring Suitable for Lightweight Authentication Schemes in RFID-Based IoT Applications. Appl. Sci. 2021, 11, 6938.
  26. Khan, S.; Javeed, K.; Shah, Y.A. High-speed FPGA implementation of full-word Montgomery multiplier for ECC applications. Microprocess. Microsyst. 2018, 62, 91–101.
  27. Thirumoorthi, M.; Heidarpur, M.; Mirhassani, M.; Khalid, M. An Optimized M-term Karatsuba-Like Binary Polynomial Multiplier for Finite Field Arithmetic. IEEE Trans. Very Large Scale Integr. Syst. 2022, 1–12.
  28. Rashid, M.; Imran, M.; Kashif, M.; Sajid, A. An Optimized Architecture for Binary Huff Curves With Improved Security. IEEE Access 2021, 9, 88498–88511.
  29. Rashid, M.; Imran, M.; Sajid, A. An Efficient Elliptic-Curve Point Multiplication Architecture for High-Speed Cryptographic Applications. Electronics 2020, 9, 2126.
  30. Imran, M.; Rashid, M. Architectural review of polynomial bases finite field multipliers over GF(2m). In Proceedings of the 2017 International Conference on Communication, Computing and Digital Systems (C-CODE), Islamabad, Pakistan, 8–9 March 2017; pp. 331–336.
  31. Imran, M.; Abideen, Z.U.; Pagliarini, S. An Open-source Library of Large Integer Polynomial Multipliers. In Proceedings of the 2021 24th International Symposium on Design and Diagnostics of Electronic Circuits Systems (DDECS), Vienna, Austria, 7–9 April 2021; pp. 145–150.
  32. Kashif, M.; Cicek, I.; Imran, M. A Hardware Efficient Elliptic Curve Accelerator for FPGA Based Cryptographic Applications. In Proceedings of the 2019 11th International Conference on Electrical and Electronics Engineering (ELECO), Bursa, Turkey, 28–30 November 2019; pp. 362–366.
  33. Imran, M.; Abideen, Z.U.; Pagliarini, S. An Experimental Study of Building Blocks of Lattice-Based NIST Post-Quantum Cryptographic Algorithms. Electronics 2020, 9, 1953.
  34. Ilter, M.B.; Cenk, M. Efficient Big Integer Multiplication in Cryptography. Int. J. Inf. Secur. Sci. 2017, 6, 70–78. Available online: https://dergipark.org.tr/en/download/article-file/2160206 (accessed on 1 March 2022).
  35. Liu, W.; Fan, S.; Khalid, A.; Rafferty, C.; O’Neill, M. Optimized Schoolbook Polynomial Multiplication for Compact Lattice-Based Cryptography on FPGA. IEEE Trans. Very Large Scale Integr. Syst. 2019, 27, 2459–2463.
Figure 1. Flowchart of our dedicated and scalable multiplier architectures.
Figure 2. Digit-parallel multiplier over GF(2^163).
Figure 3. Digit-parallel multiplier over GF(2^233).
Figure 4. Digit-parallel multiplier over GF(2^283).
Figure 5. Digit-parallel multiplier over GF(2^409).
Figure 6. Least-significant digit-parallel multiplier over GF(2^571).
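Each dedicated multiplier in Figures 2–6 computes a complete GF(2^m) multiplication in a single clock cycle: a carry-free (XOR-based) polynomial product followed by reduction with the NIST-recommended irreducible polynomial. The following Python sketch is only a minimal software reference model of that operation, e.g., for generating test vectors for the Verilog designs; it assumes the standard NIST irreducible polynomials and is not the proposed hardware description.

# Minimal software reference model (not the RTL) of one GF(2^m) multiplication:
# carry-free schoolbook product followed by reduction with the NIST polynomial.

# NIST-recommended irreducible polynomials, given as exponent lists
NIST_POLYS = {
    163: [163, 7, 6, 3, 0],
    233: [233, 74, 0],
    283: [283, 12, 7, 5, 0],
    409: [409, 87, 0],
    571: [571, 10, 5, 2, 0],
}

def gf2m_mult(a: int, b: int, m: int) -> int:
    """Multiply a(x) * b(x) mod f(x) over GF(2); operands are m-bit integers."""
    f = sum(1 << e for e in NIST_POLYS[m])
    # carry-free (XOR) schoolbook multiplication
    p = 0
    for i in range(m):
        if (b >> i) & 1:
            p ^= a << i
    # reduce the (2m - 1)-bit product modulo f(x)
    for i in range(2 * m - 2, m - 1, -1):
        if (p >> i) & 1:
            p ^= f << (i - m)
    return p

# Example self-check in GF(2^163)
if __name__ == "__main__":
    a, b = (1 << 162) | 0b1011, (1 << 100) | 1
    print(hex(gf2m_mult(a, b, 163)))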
Figure 7. Block diagram of our proposed scalable multiplier architecture for NIST-recommended binary fields. Our scalable multiplier design incorporates the least-significant digit-parallel multiplication approach.
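As the caption notes, the scalable design of Figure 7 follows a least-significant digit-first schedule with a 32-bit digit size. The sketch below is a hedged software model of that digit-wise schedule, assuming the multiplier operand is split into 32-bit digits consumed from the least-significant side while the multiplicand is shifted and reduced once per digit; the four-cycle timing of the actual architecture is not modelled, and the function name and interface are illustrative only.

from math import ceil

D = 32  # digit size used throughout the paper

def lsd_mult(a: int, b: int, m: int, f: int) -> int:
    """Least-significant digit-first GF(2^m) multiplication with D-bit digits.

    a, b are m-bit operands; f is the full irreducible polynomial (bit m set),
    e.g. f = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1 for m = 163.
    """
    n_digits = ceil(m / D)                     # e.g. six digits for m = 163
    acc = 0
    for j in range(n_digits):
        digit = (b >> (j * D)) & ((1 << D) - 1)
        for i in range(D):                     # digit times the reduced multiplicand
            if (digit >> i) & 1:
                acc ^= a << i
        # advance the multiplicand by one digit and reduce it below degree m
        a <<= D
        for i in range(m + D - 1, m - 1, -1):
            if (a >> i) & 1:
                a ^= f << (i - m)
    # the accumulator may exceed degree m - 1 by up to D - 1 bits; reduce it
    for i in range(m + D - 2, m - 1, -1):
        if (acc >> i) & 1:
            acc ^= f << (i - m)
    return acc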
Table 1. Implementation results of our proposed dedicated and scalable least-significant digit-parallel multipliers over GF(2^m) with m = 163, 233, 283, 409 and 571.
m | Slices | LUTs | CCs | PP (in ns) | Freq. (MHz) | Latency (in μs) | Pwr. (in W) | Slices × Latency
Dedicated architectures
163 (Figure 2) | 1182 | 3925 | 1 | 2.000 | 500 | 0.00200 | 3.201 | 2.364
233 (Figure 3) | 1451 | 4464 | 1 | 2.100 | 476 | 0.00210 | 3.326 | 3.047
283 (Figure 4) | 1589 | 4927 | 1 | 2.150 | 465 | 0.00215 | 3.409 | 3.416
409 (Figure 5) | 2093 | 5728 | 1 | 2.215 | 451 | 0.00221 | 3.561 | 4.625
571 (Figure 6) | 3451 | 6943 | 1 | 2.255 | 443 | 0.00225 | 3.726 | 7.764
Scalable/flexible architecture
163, 233, 283, 409, 571 (Figure 7) | 3753 | 7461 | 4 | 3.275 | 305 | 0.01311 | 3.905 | 49.201
Pwr. is the total (static plus dynamic) power, CCs denotes the number of clock cycles, and PP is the provided clock period. The single Figure 7 row applies to all five field lengths, since the same scalable design serves every supported value of m.
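The derived columns of Table 1 follow directly from the reported values: latency is CCs multiplied by the clock period PP (equivalently CCs divided by the clock frequency), and the area–latency product multiplies the slice count by that latency. A short sketch of this bookkeeping, using the m = 163 row as an example (numbers copied from Table 1, not re-measured), is given below.

# Recompute the derived Table 1 columns from the reported raw values
# (m = 163 row); these are the table's own numbers, not new measurements.
slices, ccs, period_ns = 1182, 1, 2.000

latency_us = ccs * period_ns / 1000.0        # 1 x 2.000 ns = 0.00200 us
area_latency = slices * latency_us           # 1182 x 0.00200 = 2.364

print(f"latency = {latency_us:.5f} us, slices x latency = {area_latency:.3f}")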
Table 2. Comparison of our proposed dedicated and scalable least-significant digit-parallel multipliers over GF(2^m) with state-of-the-art multipliers.
Ref #/Year | Multiplier | m | Device | Slices | CCs | Freq. (MHz) | Latency | Pwr. (in W)
[17]/2022 | bit-parallel systolic | 163 | Virtex-7 | 32,685 | – | – | – | 5.277
[18]/2008 | bit-parallel systolic | 163 | Virtex-7 | 154,635 | – | – | – | 3.600
[19]/2008 | bit-parallel systolic | 163 | Virtex-7 | 105,787 | – | – | – | 6.187
[20]/2017 | bit-parallel systolic | 163 | Virtex-7 | 66,434 | – | – | – | 2.848
[21]/2021 | bit-parallel systolic | 233 | Virtex-7 | 22,864 | – | – | – | 0.717
[22]/2015 | bit-parallel systolic | 233 | Virtex-7 | 94,498 | – | – | – | 2.148
[23]/2018 | bit-parallel systolic | 233 | Virtex-7 | 56,223 | 233 | 44 | 5.295 μs | 1.192
[24]/2021 | overlap-free Karatsuba | 409 | Artix-7 | 49,211 * | – | – | – | –
[26]/2018 | Montgomery | 521 | Virtex-7 | 20,695 * | 7 | 99.68 | 0.070 μs | –
[27]/2022 | similar to Karatsuba | 409 | Spartan-7 | 40,056 * | – | – | – | –
Figure 2 | digit-parallel | 163 | Virtex-7 | 1182 | 1 | 500 | 0.00200 μs | 3.201
Figure 3 | digit-parallel | 233 | Virtex-7 | 1451 | 1 | 476 | 0.00210 μs | 3.326
Figure 5 | digit-parallel | 409 | Artix-7 | 6128 * | 1 | 468 | 0.00213 μs | 3.632
Figure 5 | digit-parallel | 409 | Spartan-7 | 6784 * | 1 | 457 | 0.00218 μs | 3.541
Figure 6 | digit-parallel | 571 | Virtex-7 | 6943 * | 1 | 443 | 0.00225 μs | 3.726
Figure 7 | scalable digit-parallel | BFL | Virtex-7 | 3753 | 4 | 305 | 0.01311 μs | 3.905
Figure 7 | scalable digit-parallel | BFL | Artix-7 | 8019 * | 4 | 331 | 0.01208 μs | 4.251
Figure 7 | scalable digit-parallel | BFL | Spartan-7 | 8653 * | 4 | 316 | 0.01265 μs | 4.016
BFL indicates that the results hold for all binary field lengths (m = 163, 233, 283, 409 and 571). * indicates that the reported area values are FPGA LUTs rather than slices; a dash (–) marks values not reported. The FPGA LUT count for Figure 7 on Virtex-7 is 7461.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
