Article

Fully Parallel Proposal of Naive Bayes on FPGA

by Wysterlânya K. P. Barros 1,†, Matheus T. Barbosa 1, Leonardo A. Dias 2 and Marcelo A. C. Fernandes 1,3,*,†

1 Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
2 Centre for Cyber Security and Privacy, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
3 Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Electronics 2022, 11(16), 2565; https://doi.org/10.3390/electronics11162565
Submission received: 14 July 2022 / Revised: 4 August 2022 / Accepted: 8 August 2022 / Published: 17 August 2022
(This article belongs to the Special Issue Digital Hardware Architectures: Systems and Applications)

Abstract

This work proposes a fully parallel hardware architecture of the Naive Bayes classifier to obtain high-speed processing and low energy consumption. The details of the proposed architecture are described throughout this work. In addition, a fixed-point implementation on a Stratix V Field-Programmable Gate Array (FPGA) is presented and evaluated regarding hardware area occupation, processing time (throughput), and dynamic power consumption. A comparative design analysis against state-of-the-art works shows that the proposed implementation achieved a speedup of up to 10^4×, power savings of up to 10^7×, and a hardware occupancy of up to 10^2× fewer logic cells.

1. Introduction

Machine learning (ML) is a set of techniques widely used to analyze large volumes of data, requiring intensive processing power [1]. Usually, the use of ML techniques results in high energy consumption, which can become a critical problem for large technology companies such as Microsoft, IBM, Amazon, and Google. Given that, servers and databases are being improved at the architectural level to adopt low-power alternatives that are also capable of high-speed processing [2].
Applications in emerging fields such as 5G communication, Advanced Driver Assistance Systems, and bioinformatics often require high-speed processing and low energy consumption. Given the complexity of ML techniques, parallelization of the algorithms has often been adopted to meet the computational demands of processing large volumes of data. In particular, parallel hardware solutions have proven very efficient at reaching these requirements. For instance, Field-Programmable Gate Arrays (FPGAs) allow high performance and reduced energy consumption compared to conventional software implementations, due to their high degree of parallelism and optimization at the level of logic gates, increasing the data processing speed of ML techniques by up to a thousand times [3,4,5,6,7,8,9,10].
Naive Bayes (NB) is a widely used ML technique for solving classification problems, such as facial recognition and network packet classification [11,12,13,14,15,16,17,18,19]. Several of these applications demand real-time processing, but reaching this performance in software becomes difficult due to the large volume of data usually handled. Thus, many works in the literature have proposed implementing the Naive Bayes method in hardware seeking to achieve high throughputs and, in addition, low power consumption [20,21,22,23,24,25].
Some works in the literature propose a hardware design for both stages of the NB method, training and inference [20,21]. However, they present serial implementations with floating-point representation, which reduces the architecture's performance, and they are designed for a fixed number of attributes and classes. In contrast, this paper proposes a fully parallel architecture with fixed-point representation for the training and classification steps of the NB method, easily adjustable to handle different numbers of attributes and classes. This architecture aims to increase the data processing speed while maintaining low energy consumption. The design was developed on a Stratix V 5SGXMBBR3H43C3 FPGA and evaluated for different numbers of attributes in the inference and training steps.
The remainder of this paper is organized as follows: Section 2 presents the related works in the literature; Section 3 addresses the theoretical foundation of the Naive Bayes method; Section 4 gives a detailed description of the architectures proposed in this paper; Section 5 provides validation data for the hardware implementation by comparing its results to the software implementation method, while Section 6 presents and analyzes the synthesis results obtained from the described implementation, including a comparison to other works; Section 7 presents the final considerations.

2. Related Works

Hardware implementations of the NB method have been developed to increase the processing speed of different applications that rely on massive datasets.
An NB classifier was implemented by [20] on a Virtex-4 XC4VLX160 FPGA to perform multiple-class object classification by analyzing vectors with binary attributes. The training and inference steps were serially implemented. The total area occupied by their architecture was 784 Slices, 730 Flip-Flops (FFs), 819 Look-Up Tables (LUTs), and 3 Block RAMs (BRAMs), which is less than the area occupied by a bSOM also implemented on an FPGA.
In [26], the NB method was implemented on the Raspberry Pi 2 development board, which has a Quad-Core ARMv7 900 MHz Central Processing Unit (CPU) and 1 GB of RAM. The method was validated with the Iris, Diabetes, Hepatitis, and other datasets, achieving accuracy levels between 73% and 99%, depending on the dataset used.
Meanwhile, in [21], a Gaussian Naive Bayes was proposed to optimize the algorithm's learning process. The implementation was developed using High-Level Synthesis (HLS) on a ZedBoard Zynq-7020 SoC Development Board with an Artix-7 FPGA to obtain a heterogeneous system. Regarding area occupation, 197 DSPs, 56 BRAMs, 37,929 LUTs, and 29,271 FFs were used for the training step, while the inference step required 93 DSPs, 112 BRAMs, 34,977 LUTs, and 31,779 FFs. The processing times for the training and inference steps were 1.1 s and 2.7 s, respectively, for a total of 784 attributes and 10 labels, using 2000 samples from the MNIST dataset.
The work presented by [22] implements three classifiers, Decision Tree, NB, and K-Nearest Neighbors, to classify packets sent within a network and identify possible malicious packets that could infect the target machine. The classifiers were developed on a Cyclone IV GX FPGA. Regarding the NB, six implementations are proposed (three combinational and three sequential) using floating-point notation with 32, 16, and 10 bits in each type of implementation. The work presents data on area occupancy, throughput, and power consumption for all designs.
In [23], a hardware architecture of an NB classifier is implemented on Altera’s Stratix EP1S10F780C5ES for email spam classification, using a logarithmic numerical system to reduce the computational complexity. The proposal was designed using floating-point notation, but can receive fixed-point inputs with 8, 12, and 16 bits. The floating-point implementation was able to classify 8889.375 emails per second using 15 attributes. The hardware was validated with SpamAssassin’s public corpus dataset and presented each architecture’s data processing time and area occupancy.
In [24], another hardware architecture of an NB classifier is presented for real-time classification of seven facial expressions: happy, surprised, sad, disgusted, afraid, angry, and neutral. The proposed design was developed in the fixed-point notation to reduce the area occupation and achieved an operating frequency of 241.55 MHz, with a classification accuracy of 81.94 % . The proposed architecture was designed using the Xilinx System Generator, and its target FPGA was Xilinx Virtex-II.
Meanwhile, in [25], a hardware system is proposed to increase the processing speed and reduce the power consumption of a malware detection application. The system was developed using HLS on a ZedBoard Zynq-7020 SoC Development Board with an Artix-7 FPGA, and it is ready to be integrated into mobile devices' CPUs. Three classification algorithms commonly used to detect malware were implemented: Logistic Regression, Naive Bayes, and Support Vector Machines. These classifiers were optimized for power consumption while maintaining an accuracy similar to software implementations.
Therefore, it is clear from the works mentioned that FPGA implementations of NB can meet the processing-speed and power-consumption requirements of such applications. Unlike the works mentioned above, this work proposes implementing both steps of Naive Bayes, training and inference, with a fully parallel architecture using fixed-point representation. In addition, the architecture was designed to be easily adapted to handle different numbers of attributes and classes. Through this architecture, the aim is to obtain better throughput and power savings.
Table 1 summarizes the characteristics of the proposal’s architecture and works in the literature. The first column shows the architecture analyzed. Meanwhile, the second and third columns indicate whether the work has implemented the training and inference phases of the NB algorithm in hardware. The fourth column displays the FPGA for which the implementation was synthesized. The fifth column indicates the arithmetic representation of the values in the hardware. Finally, the sixth column shows how the architecture implementation was accomplished.

3. Naive Bayes Method

The Naive Bayes method is based on Bayes’ Theorem, in which a “naive” premise of total independence between the attributes is adopted for a given sample. This premise is often unrealistic for most classification problems in which the technique is used; however, it is still possible to achieve high accuracy values compared to other ML techniques [1].
According to [27,28], the classification process calculates the probability of an input vector x belonging to the k-th class c_k of a set of K classes. Therefore, the probability value P(c_k | x) is obtained, which can be expressed as

P(c_k \mid \mathbf{x}) = \frac{P(c_k)\, P(\mathbf{x} \mid c_k)}{P(\mathbf{x})} \quad (1)

where P(c_k) and P(x) are the probabilities of the k-th class c_k and of x occurring, respectively, while P(x | c_k) is the probability of x occurring given that c_k has occurred. The input vector x is expressed as

\mathbf{x} = [x_1, \ldots, x_i, \ldots, x_N] \quad (2)

where each i-th element x_i represents an independent attribute.
Given the premise of independence between the attributes, that P(x) does not depend on c_k, and that each i-th input x_i is already given, P(x) can be ignored in (1). Hence, it can be rewritten as

P(c_k \mid \mathbf{x}) \propto P(c_k) \prod_{i=1}^{N} P(x_i \mid c_k). \quad (3)

Thus, after obtaining the NB probabilistic model, the classification can be performed. For this purpose, the classification process requires a decision mechanism, which can be defined as the highest class probability for a given input sample, i.e.,

\hat{k} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; P(c_k) \prod_{i=1}^{N} P(x_i \mid c_k) \quad (4)

where \hat{k} is the estimate of the k-th class c_k.
Figure 1 shows a flowchart illustrating the operation of the Naive Bayes method. Initially, the dataset is split into training and testing sets. Each sample in the training data contains the attribute vector, x, and its class, c_k, which are provided to the training stage. In this step, the prior probability of each k-th class, P(c_k), and the likelihood of each i-th attribute given each k-th class, P(x_i | c_k), are calculated. The values of P(c_k) and P(x_i | c_k) are then provided to the inference stage, which uses them to compute the posterior probabilities, P(c_k | x), of the samples from the test set. Finally, the highest posterior probability determines the analyzed sample's classification. A software sketch of this flow is given below.
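For reference, a minimal Python sketch of Equations (3) and (4) and of the flow in Figure 1 is shown below. It is a plain floating-point software model, not the hardware of Section 4, and all function and variable names are illustrative.

```python
import numpy as np

def nb_train(X, y, J, K):
    """Estimate P(c_k) and P(x_i = j | c_k) from integer-coded samples.
    X: (n, N) matrix of attribute values in {0, ..., J-1};
    y: (n,) vector of class labels in {0, ..., K-1}."""
    n, N = X.shape
    prior = np.bincount(y, minlength=K) / n                      # P(c_k)
    likelihood = np.zeros((N, J, K))                             # P(x_i = j | c_k)
    for k in range(K):
        Xk = X[y == k]                                           # samples of class k
        for i in range(N):
            likelihood[i, :, k] = np.bincount(Xk[:, i], minlength=J) / max(len(Xk), 1)
    return prior, likelihood

def nb_classify(x, prior, likelihood):
    """Decision rule of Equation (4): argmax_k P(c_k) * prod_i P(x_i | c_k)."""
    posterior = prior.copy()
    for i, xi in enumerate(x):
        posterior *= likelihood[i, xi, :]                        # one factor per attribute
    return int(np.argmax(posterior))
```

In software, the product in nb_classify is often computed as a sum of logarithms to avoid numerical underflow for large N; the proposed hardware instead computes the product directly with fixed-point multipliers, as described in Section 4.2.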

4. Hardware Description

The architecture was designed in fixed-point notation using 15 bits for the fractional part and from 3 to 17 bits in the integer part, depending on the pipeline stage, to reduce the hardware area occupation and increase the throughput compared to floating-point implementations. Two modules were implemented: training and inference, based on the NB method described in Section 3.
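As a software analogue of this number format, the helper below quantizes a real value to signed fixed-point with 15 fractional bits and a configurable integer width. The saturation behavior and the signed two's-complement interpretation are assumptions for illustration; the paper does not detail overflow handling.

```python
def to_fixed(value: float, int_bits: int = 3, frac_bits: int = 15) -> float:
    """Quantize `value` to a signed fixed-point grid and return the real
    number the stored word represents (handy for error analysis)."""
    scale = 1 << frac_bits                        # 2^15 steps per unit
    raw = int(round(value * scale))               # nearest representable code
    lo = -(1 << (int_bits + frac_bits - 1))       # most negative code
    hi = (1 << (int_bits + frac_bits - 1)) - 1    # most positive code
    raw = max(lo, min(hi, raw))                   # saturate on overflow
    return raw / scale
```

For example, to_fixed(0.3333) returns 0.33331298828125, the nearest multiple of 2^-15; this granularity is behind the absolute errors reported in Section 5.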

4.1. Training Module

Figure 2 presents the Training Module (TM) used to obtain the probabilities P(x | c_k) and P(c_k) in Expression (1). Therefore, at each n-th time sample, t_s, the TM receives as input the vector of attributes x(n), described in (2), and the k-th class, c_k(n), for the respective vector of attributes.
Figure 3 details the TM modules. As can be seen, the TM comprises two submodules: the Class Probability Module (CPM) and the Attribute Probability Module (APM), with the APM replicated N times for parallel processing. Hence, at each n-th time sample, the CPM receives one c_k(n) value, which is used for calculating the probabilities P(c_k) and the occurrence count of each k-th class, n_ck. Meanwhile, each i-th APM_i receives as input the i-th attribute, x_i(n), the k-th class, c_k(n), and its occurrence count, n_ck, to obtain the probabilities P(x_i^j | c_k). It is important to mention that each i-th attribute is represented by a j-th possible value, where J_i is the maximum number of distinct values assumed by the i-th attribute.
Figure 4 shows the internal structure of the CPM. Firstly, at each n-th time sample, the k-th class, c_k(n), is compared with the classes of the dataset (c_1, ..., c_K) through comparator circuits; the classes are represented in the architecture as constant values. Secondly, each comparator outputs the bit 1 if the classes are equal; otherwise, it outputs the bit 0. Thirdly, the counter increments its value by 1 at every time sample in which its comparator outputs the bit 1. Thereby, the counters define the occurrence count of each class, n_ck, and propagate these values to every APM_i. Lastly, each k-th n_ck is multiplied by 1/K to obtain the probability of occurrence of each k-th class, P(c_k), according to Expression (3). The calculated probabilities are stored in Registers (R) and provided as outputs of the CPM.
Meanwhile, each i-th APM_i operates in parallel, and its internal structure is shown in Figure 5. As can be observed, each i-th APM_i receives as input the i-th attribute, x_i(n), the k-th class, c_k(n), and its occurrence count, n_ck. Initially, x_i(n) is concatenated with c_k(n) by the CXU block, generating a z_kj^i value. This value represents the occurrence of a k-th class for an i-th attribute taking its j-th value.
Secondly, for each possible z_kj^i value there is a comparator block which, similar to the submodules in the CPM, outputs the bit 1 according to

\text{Comparator block} =
\begin{cases}
1, & \text{if } z_{kj}^{i} = \text{CXU output value} \\
0, & \text{otherwise.}
\end{cases}

Subsequently, the counter submodule (Counter) increments its value by 1 at every n-th time sample in which the comparator output equals 1. Hence, the counter defines the total occurrence count of each z_kj^i, which is stored in a Register (R).
Thirdly, the output of each counter has to be divided by the k-th n_ck related to z_kj^i. Due to the hardware complexity of division circuits that also provide the fractional part of the result, a multiplier was added before and after the division block: the counter output is multiplied by α_1 = 1000, divided by n_ck, and finally multiplied by α_2 = 0.001.
Lastly, to avoid the Naive Bayes zero-frequency problem, which arises when P(x_i^j | c_k) = 0, logic OR gates were added after the last multiplier blocks. The OR gates receive as input the multiplier value and a constant β, which has the smallest possible value representable in the defined data precision, e.g., β ≈ 0.00031. Thus, probabilities P(x_i^j | c_k) equal to zero are replaced by the value of β. The outputs of the logic gates, i.e., the probabilities P(x_i^j | c_k), are stored in Registers (R) and provided as the outputs of the module. At the end of this procedure, the NB training step is complete, and all the values of P(x_i^j | c_k) and P(c_k) are available. A behavioral sketch of the whole module is given below.
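The behavioral sketch below combines the CPM and the APMs: comparator-plus-counter accumulation, the α-scaled integer division, and the β floor. The OR-gate floor is modeled with max(), and the prior is computed as a relative frequency; this is a functional reading of the block diagrams, not a bit-accurate model.

```python
ALPHA1, ALPHA2 = 1000, 0.001  # pre- and post-scaling around the integer divider
BETA = 0.00031                # floor applied by the OR gates (zero-frequency fix)

def training_module(samples, N, J, K):
    """Behavioral model of the TM (one CPM plus N parallel APMs).
    samples: list of (x, c) pairs, with x a length-N tuple of attribute
    values in {0, ..., J-1} and c a class index in {0, ..., K-1}."""
    n_c = [0] * K                                        # CPM class counters
    z = [[[0] * K for _ in range(J)] for _ in range(N)]  # one APM counter per z_kj^i
    for x, c in samples:
        n_c[c] += 1                                      # comparator + counter
        for i in range(N):                               # the N APMs run in parallel
            z[i][x[i]][c] += 1                           # counter selected via CXU concat
    likelihood = [[[max(z[i][j][k] * ALPHA1 // max(n_c[k], 1) * ALPHA2, BETA)
                    for k in range(K)]
                   for j in range(J)]
                  for i in range(N)]                     # scaled division + beta floor
    prior = [n / len(samples) for n in n_c]              # P(c_k)
    return prior, likelihood
```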

4.2. Inference Module

The NB inference process consists of classifying a sample x(n) according to (4). For this purpose, it is necessary to calculate the probability P(c_k | x) defined in Expression (3). Figure 6 shows the Inference Module (IM) developed to perform this process. As can be seen, there is one submodule for each k-th P(c_k | x), and it receives as inputs the probabilities P(c_k) from the TM and the sample x(n) with its N attributes.
Figure 7 shows the internal structure of each k-th P(c_k | x) submodule. As illustrated, there are N 15-bit-wide LUTs of depth J_i, each storing the corresponding P(x_i^j | c_k) values. The sample input value, x_i(n), is used as the LUT address to select its corresponding probability, P(x_i^j | c_k). Subsequently, the probabilities addressed in each i-th LUT are multiplied through a tree of multipliers, and in the last multiplier of the tree, the result is multiplied by the k-th probability P(c_k), thus obtaining the value of P(c_k | x) according to (3).
According to (4), the decision process for the class corresponding to the sample under evaluation consists of selecting the highest probability P(c_k | x). This process is performed by a relational circuit that compares the P(c_k | x) values from all K submodules of the IM and outputs the class c_k with the highest calculated probability, therefore completing the inference process.
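A matching behavioral model of the IM is sketched below; each class submodule reads its N LUTs with the attribute values as addresses, multiplies the results together with the prior, and the relational circuit reduces to an argmax. Names are illustrative.

```python
def inference_module(x, prior, likelihood):
    """Behavioral model of the IM: one P(c_k | x) submodule per class.
    x: length-N tuple of attribute values, used directly as LUT addresses."""
    K, N = len(prior), len(likelihood)
    posteriors = []
    for k in range(K):
        p = prior[k]                        # P(c_k), provided by the TM
        for i in range(N):
            p *= likelihood[i][x[i]][k]     # LUT i returns P(x_i | c_k)
        posteriors.append(p)                # multiplier tree, flattened here
    return max(range(K), key=posteriors.__getitem__)  # relational circuit
```

With the Section 5 setup, this would be driven as prior, likelihood = training_module(train_set, N=4, J=5, K=2), followed by inference_module(x, prior, likelihood) for each test sample.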

5. Hardware Validation

A software implementation was used to validate the proposed NB hardware architecture. The software was developed using floating-point representation, according to the IEEE 754 standard, and both implementations were applied to the divorce predictors dataset [29]. This dataset contains 170 samples representing couples' responses to a survey used to predict divorce based on Gottman couples therapy. Each sample has 54 attributes, each corresponding to a question in the survey, which can receive five different answers represented numerically as: 0 = Never; 1 = Rarely; 2 = Averagely; 3 = Frequently; and 4 = Always. The target of this dataset is to predict, based on the answers given, whether the couple will divorce (Class 1) or not (Class 2).
The dataset was divided into two groups: one with 80% of the data for the training step and another with the remaining 20% for the classification test (inference step). For each data point, four discrete attributes were selected, each of which can assume five distinct values. Thus, the hardware parameters for the tests were set as N = 4, J_1 = J_2 = J_3 = J_4 = 5, and K = 2.

5.1. Training Step Validation

The absolute error between the hardware and software implementations was obtained for each probability P(x_i^j | c_k), as shown in Figure 8 and Figure 9, and is defined as

e_{i,k}^{j} = \left| P_H(x_i^j \mid c_k) - P_S(x_i^j \mid c_k) \right|

where P_H(x_i^j | c_k) and P_S(x_i^j | c_k) are the probability values obtained in hardware and software, respectively.
In addition, the absolute errors were also calculated for each probability P(c_k) as follows:

e_{c_k} = \left| P_H(c_k) - P_S(c_k) \right|

where P_H(c_k) and P_S(c_k) correspond, respectively, to the probability values obtained in hardware and software. The absolute errors calculated for the two classes were e_{c_1} = 0.00303 and e_{c_2} = 0.00379.
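These error metrics are straightforward to reproduce once a fixed-point and a floating-point model are both available; a sketch, assuming the likelihood tables are arrays of shape (N, J, K) as in the earlier reference code:

```python
import numpy as np

def validation_errors(lik_hw, lik_sw, prior_hw, prior_sw):
    """Absolute errors |P_H - P_S| for the likelihoods (e_{i,k}^j)
    and the class priors (e_{c_k})."""
    e_attr = np.abs(np.asarray(lik_hw) - np.asarray(lik_sw))
    e_class = np.abs(np.asarray(prior_hw) - np.asarray(prior_sw))
    return e_attr, e_class
```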

5.2. Inference Step Validation

As previously mentioned, 20% of the dataset was used for the classification test, that is, in the inference step. The classification result of each sample was compared to the expected class given in the dataset. The accuracy achieved by the proposed implementation was 97.06%. Figure 10 presents the confusion matrix obtained after performing the inference on the IM.

6. Results

6.1. Hardware Proposal Results

6.1.1. Area Occupation

Synthesis results were analyzed for the training and inference modules and are presented in Table 2 and Table 3, respectively. In both tables, the first column indicates the number of attributes, N_A, used in the implementation, while the second to fourth columns show the area occupation in the target FPGA: the Number of Logic Cells, N_LC, the Number of memory block Bits, N_Bits, and the Number of Multipliers implemented using DSP blocks, N_MULT.

6.1.2. Processing Speed (Throughput)

The processing speeds for the TM and IM are presented in Table 4 and Table 5, respectively. In both tables, the first column indicates the number of attributes in the analysis, N_A; the second column presents the Throughput (THPT) in Mega samples per second (Msps); and the third column shows the maximum Clock (CLK) frequency.
As is evident from Table 4, the throughput of the training step remains constant as N_A increases, thanks to the high parallelism obtained by replicating the modules. In contrast, as shown in Table 5, the inference throughput decreases as N_A increases due to the growing number of multipliers in the IM.

6.2. Comparison with State-of-the-Art Works

Regarding the area occupation, the number of logic cells and the number of memory bits were analyzed. For a fair comparison of the proposed architecture with state-of-the-art works, some conversions were necessary to obtain comparable hardware area metrics. This is usually required because each work uses a different FPGA and reports area in terms of that device's resources. The conversions considered the relationship of ALMs, LUTs, and Slices to their equivalents in number of logic cells, as stated in the documentation provided by the FPGA manufacturers.
Based on [30], the dynamic power, E_d, can be expressed as

E_d \propto N_g \times \mathrm{CLK} \times V_{DD}^2 \quad (7)

where N_g is the number of elements occupied in hardware, CLK is the clock frequency, and V_DD is the supply voltage. The frequency at which a CMOS circuit can operate is approximately proportional to its supply voltage [31], so Equation (7) can be expressed as

E_d \propto N_g \times \mathrm{CLK}^3. \quad (8)
For all comparisons, N_g was calculated as

N_g = N_{LC} + N_{Mult}.
Besides, only the inference module was compared, due to the scarcity of data about the training step in the state-of-the-art works. Table 3 shows the synthesis results for the inference module, which were used to derive equations for estimating the area occupation. For this purpose, linear regression was used, resulting in the following:

N_{LC} = 1.6333 \times N_A + 13.0833,

N_{Bits} = 256 \times N_A

and

N_{Mult} = 2 \times N_A.

Likewise, the data presented in Table 5 and a logarithmic fit were used to derive the following throughput metric:

\mathrm{THPT} = \frac{1}{3.5869 \times \log_2 N_A + 6.1680},
the behavior of which is observed in Figure 11.
The derived equations allow estimating the hardware area occupation, throughput, and dynamic power consumption for any number of attributes and classes, enabling a performance comparison with state-of-the-art works. A small script implementing these estimates is given below.
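The estimates are easy to script; the sketch below implements the fitted regressions and the power model of Expression (8). Treating the denominator of the throughput fit as a per-sample processing time in nanoseconds is an assumption, chosen because it reproduces the Table 5 values.

```python
import math

def inference_estimates(n_attr):
    """Area and throughput estimates for the IM from the fitted equations."""
    n_lc = 1.6333 * n_attr + 13.0833                          # logic cells
    n_bits = 256 * n_attr                                     # memory block bits
    n_mult = 2 * n_attr                                       # DSP multipliers
    thpt_msps = 1e3 / (3.5869 * math.log2(n_attr) + 6.1680)   # throughput in Msps
    return n_lc, n_bits, n_mult, thpt_msps

def dynamic_power(n_g, clk_mhz):
    """Relative dynamic power per Expression (8): E_d proportional to N_g * CLK^3."""
    return n_g * clk_mhz ** 3

# N_A = 15 gives ~37.6 logic cells, 3840 bits, 30 multipliers and ~49.6 Msps,
# matching Tables 3 and 5 within rounding.
```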

6.2.1. Area Occupation Comparison

Table 6 presents the area occupation for the hardware proposals of [20,22,23] and compares them to the proposed implementation. The second column gives the number of attributes, N_A, used in the referenced proposal and in this work. The third to fifth columns show the number of logic cells for the referenced work (N_LC Ref), for this work (N_LC Author), and the gain (N_LC Gain), respectively. Likewise, the last three columns present the number of memory block bits.
As can be seen, except for the architecture presented by [20], the proposed implementation achieved gains both in the number of logic cells (N_LC Gain) and in the number of memory block bits (N_Bits Gain). This gain can be attributed to the use of a fixed-point representation, while [22] used floating-point and [23] deployed a scheme mixing fixed- and floating-point. Concerning the number of memory block bits, it was not possible to compute gains against Designs I, II, and III of [22], as those designs use no memory blocks; these values were therefore omitted for the compared reference.

6.2.2. Throughput Comparison

Table 7 presents the processing time synthesis results. The second column shows the number of attributes (N_A), while the third and fourth columns show the throughput for the reference (THPT Ref) and this work (THPT Author), respectively. The fifth column presents the speedup obtained by this work with respect to the reference.
As is evident from Table 7, the proposed architecture obtained satisfactory performance, reaching speedups from 5× up to about 10^4× with respect to the other works. This speedup results from the high degree of parallelism adopted in the proposed architecture, which provides a greater processing speed for the technique. The throughput achieved was inferior only to the architecture proposed in [25].

6.2.3. Dynamic Power Consumption Comparison

Table 8 presents the synthesis results regarding dynamic power consumption. The second column shows the number of attributes of the reference, N_A, while the third to fifth columns show the clock frequency (CLK Ref), the N_g Ref, and the dynamic power (E_d Ref), respectively, for the reference. Likewise, the sixth to eighth columns show the clock frequency (CLK Author), the N_g Author, and the dynamic power (E_d Author), respectively, for this work. The last column shows the dynamic power saved by this work compared to the reference (PWR Save). To match each reference work's throughput, CLK Author was set equal to THPT Ref. It is important to mention that the dynamic power consumption values were estimated using Expression (8).
The work in [23] did not use multipliers implemented with DSP blocks, and [22] used multipliers of size 18 × 18 bits, the same size as the multipliers in this work, which allows a direct comparison of these quantities with both works. As can be observed in Table 8, this proposal presents a lower dynamic power than all designs analyzed, reducing the dynamic power by between 10^3× and 10^7×. This result was achieved due to the high throughput, low clock frequency, and reduced hardware area occupation.

7. Conclusions

This work presented a hardware implementation of the Naive Bayes technique, developed with a fully parallel architecture and fixed-point representation. Unlike the works presented in the literature, both the training and inference steps were implemented fully in parallel using fixed-point notation. The implementation was validated for the training and inference steps against a floating-point software implementation, and synthesis results were obtained for a Stratix V 5SGXMBBR3H43C3 FPGA. The occupancy, throughput, and power consumption results were compared with other results found in the literature: the proposed architecture achieved a speedup of up to 10^4×, a hardware occupancy of up to 35× lower, and a dynamic power consumption reduced by up to 10^7×.

Author Contributions

All the authors contributed to various degrees to ensure the quality of this work (e.g., W.K.P.B., M.T.B., L.A.D. and M.A.C.F. conceived of the idea and experiments; W.K.P.B., M.T.B., L.A.D. and M.A.C.F. designed and performed the experiments; W.K.P.B., M.T.B., L.A.D. and M.A.C.F. analyzed the data; W.K.P.B., M.T.B., L.A.D. and M.A.C.F. wrote the paper; M.A.C.F. coordinated the project). All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001.

Acknowledgments

The authors wish to acknowledge the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for their financial support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed.; Prentice Hall Press: Hoboken, NJ, USA, 2009. [Google Scholar]
  2. Caulfield, A.; Chung, E.S.; Putnam, A.; Angepat, H.; Fowers, J.; Haselman, M.; Heil, S.; Humphrey, M.; Kaur, P.; Kim, J.Y.; et al. A cloud-scale acceleration architecture. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–13. [Google Scholar] [CrossRef]
  3. Dias, L.A.; Damasceno, A.M.; Gaura, E.; Fernandes, M.A. A full-parallel implementation of Self-Organizing Maps on hardware. Neural Netw. 2021, 143, 818–827. [Google Scholar] [CrossRef] [PubMed]
  4. Dias, L.A.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation of K-Means Algorithm on FPGA. IEEE Access 2020, 8, 41071–41084. [Google Scholar] [CrossRef]
  5. Torquato, M.F.; Fernandes, M.A. High-performance parallel implementation of genetic algorithm on fpga. Circuits Syst. Signal Process. 2019, 38, 4014–4039. [Google Scholar] [CrossRef]
  6. Coutinho, M.G.F.; Torquato, M.F.; Fernandes, M.A.C. Deep Neural Network Hardware Implementation Based on Stacked Sparse Autoencoder. IEEE Access 2019, 7, 40674–40694. [Google Scholar] [CrossRef]
  7. Blaiech, A.G.; Ben Khalifa, K.; Valderrama, C.; Fernandes, M.A.; Bedoui, M.H. A Survey and Taxonomy of FPGA-based Deep Learning Accelerators. J. Syst. Archit. 2019, 98, 331–345. [Google Scholar] [CrossRef]
  8. Lopes, F.F.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation on FPGA of Support Vector Machines Using Stochastic Gradient Descent. Electronics 2019, 8, 631. [Google Scholar] [CrossRef]
  9. Noronha, D.H.; Torquato, M.F.; Fernandes, M.A. A parallel implementation of sequential minimal optimization on FPGA. Microprocess. Microsyst. 2019, 69, 138–151. [Google Scholar] [CrossRef]
  10. Lopes, F.F.; Silva, S.N.; Fernandes, M.A.C. FPGA Implementation of the Adaptive Digital Beamforming for Massive Array. In Proceedings of the 2021 IEEE 93rd Vehicular Technology Conference (VTC2021-Spring), Helsinki, Finland, 25–28 April 2021; pp. 1–5. [Google Scholar] [CrossRef]
  11. Chou, K.; Chen, Y. Real-Time and Low-Memory Multi-Faces Detection System Design With Naive Bayes Classifier Implemented on FPGA. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4380–4389. [Google Scholar] [CrossRef]
  12. Wickramasinghe, I.; Kalutarage, H. Naive Bayes: Applications, variations and vulnerabilities: A review of literature with code snippets for implementation. Soft Comput. 2021, 25, 2277–2293. [Google Scholar] [CrossRef]
  13. Blanquero, R.; Carrizosa, E.; Ramírez-Cobo, P.; Sillero-Denamiel, M.R. Variable selection for Naïve Bayes classification. Comput. Oper. Res. 2021, 135, 105456. [Google Scholar] [CrossRef]
  14. Chen, H.; Hu, S.; Hua, R.; Zhao, X. Improved naive Bayes classification algorithm for traffic risk management. EURASIP J. Adv. Signal Process. 2021, 2021, 30. [Google Scholar] [CrossRef]
  15. Khajenezhad, A.; Bashiri, M.A.; Beigy, H. A distributed density estimation algorithm and its application to naive Bayes classification. Appl. Soft Comput. 2021, 98, 106837. [Google Scholar] [CrossRef]
  16. Sethi, J.K.; Mittal, M. Efficient weighted naive bayes classifiers to predict air quality index. Earth Sci. Inform. 2022, 15, 541–552. [Google Scholar] [CrossRef]
  17. Deng, Z.; Han, T.; Cheng, Z.; Jiang, J.; Duan, F. Fault detection of petrochemical process based on space-time compressed matrix and Naive Bayes. Process Saf. Environ. Prot. 2022, 160, 327–340. [Google Scholar] [CrossRef]
  18. Kute, S.S.; Shreyas Madhav, A.; Kumari, S.; Aswathy, S. Machine Learning–Based Disease Diagnosis and Prediction for E-Healthcare System. Adv. Anal. Deep. Learn. Model. 2022, 127–147. [Google Scholar] [CrossRef]
  19. Triwiyanto, T.; Caesarendra, W.; Purnomo, M.H.; Sułowicz, M.; Wisana, I.D.G.H.; Titisari, D.; Lamidi, L.; Rismayani, R. Embedded machine learning using a multi-thread algorithm on a Raspberry Pi platform to improve prosthetic hand performance. Micromachines 2022, 13, 191. [Google Scholar] [CrossRef] [PubMed]
  20. Meng, H.; Appiah, K.; Hunter, A.; Dickinson, P. FPGA implementation of Naive Bayes classifier for visual object recognition. In Proceedings of the CVPR 2011 Workshops, Colorado Springs, CO, USA, 20–25 June 2011; pp. 123–128. [Google Scholar] [CrossRef]
  21. Tzanos, G.; Kachris, C.; Soudris, D. Hardware Acceleration on Gaussian Naive Bayes Machine Learning Algorithm. In Proceedings of the 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 13–15 May 2019; pp. 1–5. [Google Scholar] [CrossRef]
  22. França, A.; Jasinski, R.; Cemin, P.; Pedroni, V.A.; Santin, A.O. The energy cost of network security: A hardware vs. software comparison. In Proceedings of the 2015 IEEE International Symposium on Circuits and Systems (ISCAS), Lisbon, Portugal, 24–27 May 2015; pp. 81–84. [Google Scholar] [CrossRef]
  23. Marsono, M.; Watheq El-Kharashi, M.; Gebali, F. Binary LNS-based naive Bayes hardware classifier for spam control. In Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, Kos, Greece, 21–24 May 2006; p. 4. [Google Scholar] [CrossRef]
  24. Chaudhary, P.; Sharma, M. VLSI Hardware Architecture of Real Time Pattern Classification using Naive Bayes Classifier. In Proceedings of the ICMSSP: International Conference on Multimedia Systems and Signal Processing, Taichung, Taiwan, 13–16 August 2017; pp. 61–65. [Google Scholar] [CrossRef]
  25. Wahab, M.; Milosevic, J. Power & perfomance optimized hardware classifiers for efficient on-device malware detection. In Proceedings of the ICMSSP: International Conference on Multimedia Systems and Signal Processing, Guangzhou, China, 10–12 May 2019; pp. 23–26. [Google Scholar] [CrossRef]
  26. Seth, H.; Banka, H. Hardware implementation of Naïve Bayes classifier: A cost effective technique. In Proceedings of the 2016 3rd International Conference on Recent Advances in Information Technology (RAIT), Dhanbad, India, 3–5 March 2016; pp. 264–267. [Google Scholar] [CrossRef]
  27. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  28. Leung, K.M. Naive bayesian classifier. Polytech. Univ. Dep. Comput. Sci. Risk Eng. 2007, 2007, 123–156. [Google Scholar]
  29. Yöntem, M.K.; Kemal, A.; Ilhan, T.; Kiliçarslan, S. Divorce prediction using correlation based feature selection and artificial neural networks. Nevşehir Hacı Bektaş Veli Üniversitesi SBE Derg. 2019, 9, 259–273. [Google Scholar]
  30. Sarwar, A. CMOS Power Consumption and Cpd Calculation; Texas Instruments: Dallas, TX, USA, 1997. [Google Scholar]
  31. McCool, M.; Robison, A.D.; Reinders, J. Chapter 2—Background. In Structured Parallel Programming; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 39–75. [Google Scholar] [CrossRef]
Figure 1. Flowchart for the Naive Bayes procedure.
Figure 2. General architecture of the TM.
Figure 3. Modules that constitute the TM.
Figure 4. CPM module internal architecture.
Figure 5. Internal architecture of an i-th APM.
Figure 6. General architecture of the IM.
Figure 7. Internal architecture of a k-th P(c_k | x) submodule.
Figure 8. Absolute error values, e_{i,1}^j, for the probabilities P(x_i^j | c_1).
Figure 9. Absolute error values, e_{i,2}^j, for the probabilities P(x_i^j | c_2).
Figure 10. Confusion matrix for Classes 1 and 2.
Figure 11. Processing time function adjusted by a logarithmic function.
Table 1. Summary of the proposal's architectural characteristics and other works in the literature.

Reference | Training | Inference | FPGA       | Arithmetic     | Design
[20]      | Yes      | Yes       | Virtex-4   | Floating-point | Serial
[21]      | Yes      | Yes       | Artix-7    | Floating-point | Serial
[23]      | No       | Yes       | Stratix    | Floating-point | Serial
[25]      | No       | Yes       | Artix-7    | Floating-point | Serial
[22]      | No       | Yes       | Cyclone IV | Floating-point | Serial/Parallel
[24]      | No       | Yes       | Virtex-II  | Fixed-point    | Parallel
This work | Yes      | Yes       | Stratix V  | Fixed-point    | Parallel
Table 2. Area occupation for the TM.

N_A | N_LC   | N_Bits | N_MULT
4   | 13,950 | 2550   | 0
8   | 27,825 | 4590   | 0
16  | 55,459 | 8670   | 0
Table 3. Area occupation for the IM.

N_A | N_LC | N_Bits | N_MULT
4   | 24   | 1024   | 8
8   | 27   | 2048   | 16
16  | 37   | 4096   | 32
32  | 59   | 8192   | 64
64  | 121  | 16,384 | 128
Table 4. Processing speed of the TM.

N_A | THPT    | CLK
4   | 40 Msps | 40 MHz
8   | 40 Msps | 40 MHz
16  | 40 Msps | 40 MHz
Table 5. Processing speed of the IM.

N_A | THPT       | CLK
4   | 72.52 Msps | 75.50 MHz
8   | 59.00 Msps | 59.00 MHz
16  | 50.00 Msps | 50.00 MHz
32  | 43.01 Msps | 43.00 MHz
64  | 35.00 Msps | 35.00 MHz
Table 6. Hardware area occupation comparison.

Reference | N_A | N_LC Ref | N_LC Author | N_LC Gain | N_Bits Ref | N_Bits Author | N_Bits Gain
[20]      | 762 | 818      | 1258        | 0.65×     | 48         | 195,072       | 0.00×
[22] I    | 10  | 11,488   | 29.40       | 390.75×   | 0          | 2560          | —
[22] II   | 10  | 5480     | 29.40       | 186.40×   | 0          | 2560          | —
[22] III  | 10  | 2867     | 29.40       | 97.52×    | 0          | 2560          | —
[22] IV   | 10  | 1733     | 29.40       | 58.95×    | 10,112     | 2560          | 3.95×
[22] V    | 10  | 1000     | 29.40       | 34.01×    | 6656       | 2560          | 2.60×
[22] VI   | 10  | 705      | 29.40       | 23.98×    | 5120       | 2560          | 2.00×
[23] I    | 15  | 146      | 37.60       | 3.88×     | 38,656     | 3840          | 10.07×
[23] II   | 15  | 278      | 37.60       | 7.39×     | 81,920     | 3840          | 21.33×
[23] III  | 15  | 313      | 37.60       | 8.32×     | 135,168    | 3840          | 35.20×
[23] IV   | 15  | 144      | 37.60       | 3.83×     | 135,168    | 3840          | 35.20×
Table 7. Throughput comparison.

Reference | N_A | THPT Ref           | THPT Author | Speedup
[21]      | 784 | 7.41 × 10^-4 Msps  | 24.59 Msps  | 33,200×
[22] I    | 10  | 2.80 Msps          | 54.67 Msps  | 19.53×
[22] II   | 10  | 4.16 Msps          | 54.67 Msps  | 13.14×
[22] III  | 10  | 5.51 Msps          | 54.67 Msps  | 9.92×
[22] IV   | 10  | 0.46 Msps          | 54.67 Msps  | 118.85×
[22] V    | 10  | 0.77 Msps          | 54.67 Msps  | 71.00×
[22] VI   | 10  | 0.98 Msps          | 54.67 Msps  | 55.79×
[23] I    | 15  | 9.76 Msps          | 49.55 Msps  | 5.08×
[23] II   | 15  | 8.05 Msps          | 49.55 Msps  | 6.16×
[23] III  | 15  | 7.33 Msps          | 49.55 Msps  | 6.67×
[23] IV   | 15  | 8.89 Msps          | 49.55 Msps  | 5.57×
[25] I    | 7   | 4.18 × 10^2 Msps   | 61.58 Msps  | 0.15×
[25] II   | 7   | 2.11 × 10^3 Msps   | 61.58 Msps  | 0.03×
Table 8. Dynamic power consumption comparison.

Reference | N_A | CLK Ref    | N_g Ref | E_d Ref     | CLK Author | N_g Author | E_d Author  | PWR Save
[22] IV   | 10  | 33.46 MHz  | 1747    | 6.54 × 10^7 | 0.46 MHz   | 49.40      | 0.48 × 10^1 | 1.36 × 10^7×
[22] V    | 10  | 56.12 MHz  | 1004    | 1.77 × 10^8 | 0.77 MHz   | 49.40      | 2.25 × 10^1 | 7.87 × 10^6×
[22] VI   | 10  | 71.70 MHz  | 707     | 2.61 × 10^8 | 0.98 MHz   | 49.40      | 4.65 × 10^1 | 5.60 × 10^6×
[23] I    | 15  | 156.18 MHz | 146     | 5.56 × 10^8 | 9.76 MHz   | 67.60      | 6.28 × 10^4 | 8.86 × 10^3×
[23] II   | 15  | 128.78 MHz | 278     | 5.94 × 10^8 | 8.05 MHz   | 67.60      | 3.53 × 10^3 | 1.68 × 10^5×
[23] III  | 15  | 117.26 MHz | 313     | 5.05 × 10^8 | 7.33 MHz   | 67.60      | 2.66 × 10^4 | 1.90 × 10^4×
[23] IV   | 15  | 142.23 MHz | 144     | 4.14 × 10^8 | 8.89 MHz   | 67.60      | 4.75 × 10^4 | 8.72 × 10^3×
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
