Article

Fully Parallel Proposal of Naive Bayes on FPGA

by Wysterlânya K. P. Barros 1,†, Matheus T. Barbosa 1, Leonardo A. Dias 2 and Marcelo A. C. Fernandes 1,3,*,†

1 Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
2 Centre for Cyber Security and Privacy, School of Computer Science, University of Birmingham, Birmingham B15 2TT, UK
3 Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
* Author to whom correspondence should be addressed.
† These authors contributed equally to this work.

Electronics 2022, 11(16), 2565; https://doi.org/10.3390/electronics11162565
Submission received: 14 July 2022 / Revised: 4 August 2022 / Accepted: 8 August 2022 / Published: 17 August 2022
(This article belongs to the Special Issue Digital Hardware Architectures: Systems and Applications)

Abstract

This work proposes a fully parallel hardware architecture of the Naive Bayes classifier to obtain high-speed processing and low energy consumption. The details of the proposed architecture are described throughout this work. In addition, a fixed-point implementation on a Stratix V Field-Programmable Gate Array (FPGA) is presented and evaluated regarding hardware area occupation, processing time (throughput), and dynamic power consumption. A comparative design analysis against state-of-the-art works shows that the proposed implementation achieved a speedup of up to 10^4×, power savings of up to 10^7×, and a hardware occupancy of up to 10^2× fewer logic cells.

1. Introduction

Machine learning (ML) is a set of techniques widely used to analyze large volumes of data, requiring intensive processing power [1]. Usually, the use of ML techniques results in high energy consumption, which can become a critical problem for large technology companies such as Microsoft, IBM, Amazon, and Google. Given that, servers and databases are being improved at the architectural level to adopt low-power alternatives that are also capable of high-speed processing [2].
Applications in emerging fields such as 5G communication, Advanced Driver Assistance Systems, and bioinformatics often require high-speed processing and low energy consumption. Given the complexity of ML techniques, parallelization of the algorithms has often been adopted to meet the computational demands of processing large volumes of data. In particular, parallel hardware solutions have proven very efficient at reaching these requirements. For instance, Field-Programmable Gate Arrays (FPGAs) allow high performance and reduced energy consumption compared to conventional software implementations, due to their high degree of parallelism and optimization at the level of logic gates, increasing the data processing speed of ML techniques by up to a thousand times [3,4,5,6,7,8,9,10].
Naive Bayes (NB) is a widely used ML technique for solving classification problems, such as facial recognition and network packet classification [11,12,13,14,15,16,17,18,19]. Several of these applications demand real-time processing, but reaching this performance in software becomes difficult due to the large volume of data usually handled. Thus, many works in the literature have proposed implementing the Naive Bayes method in hardware seeking to achieve high throughputs and, in addition, low power consumption [20,21,22,23,24,25].
Some works in the literature propose a hardware design for both stages of the NB method, training and inference [20,21]. However, they present serial implementations with floating-point representation, which reduces the architecture's performance, and they are designed for a fixed number of attributes and classes. In contrast, this paper proposes a fully parallel architecture with fixed-point representation for the training and classification steps of the NB method, easily adjustable to handle different numbers of attributes and classes. This architecture aims to increase the data processing speed while maintaining low energy consumption. The design was developed on a Stratix V 5SGXMBBR3H43C3 FPGA and evaluated for different numbers of attributes in the inference and training steps.
The remainder of this paper is organized as follows: Section 2 presents the related works in the literature; Section 3 addresses the theoretical foundation of the Naive Bayes method; Section 4 gives a detailed description of the architectures proposed in this paper; Section 5 provides validation data for the hardware implementation by comparing its results to the software implementation method, while Section 6 presents and analyzes the synthesis results obtained from the described implementation, including a comparison to other works; Section 7 presents the final considerations.

2. Related Works

Hardware implementations of the NB method have been developed to increase the processing speed of different applications that rely on massive datasets.
An NB classifier was implemented by [20] on a Virtex-4 XC4VLX160 FPGA to perform multiple-class object classification by analyzing vectors with binary attributes. The training and inference steps were serially implemented. The total area occupied by their architecture was 784 Slices, 730 Flip-Flops (FFs), 819 Look-Up Tables (LUTs), and 3 Block RAMs (BRAMs), which is less than the area occupied by a bSOM also implemented on an FPGA.
In [26], the NB method was implemented on the Raspberry Pi 2 development board, which has a Quad-Core ARMv7 900 MHz Central Processing Unit (CPU) and 1 GB of RAM. The method was validated with the Iris, Diabetes, Hepatitis, and other datasets, achieving accuracy levels between 73% and 99%, depending on the dataset used.
Meanwhile, in [21], a Gaussian Naive Bayes was proposed to optimize the algorithm's learning process. The implementation was developed using High-Level Synthesis (HLS) on a ZedBoard Zynq-7020 SoC Development Board with an Artix-7 FPGA to obtain a heterogeneous system. Regarding area occupation, 197 DSPs, 56 BRAMs, 37,929 LUTs, and 29,271 FFs were used for the training step, while the inference step required 93 DSPs, 112 BRAMs, 34,977 LUTs, and 31,779 FFs. The processing times for the training and inference steps were 1.1 s and 2.7 s, respectively, for a total of 784 attributes and 10 labels, using 2000 samples from the MNIST dataset.
The work presented by [22] implements three classifiers, Decision Tree, NB, and K-Nearest Neighbors, to classify packets sent within a network and identify possible malicious packets that could infect the target machine. The classifiers were developed on a Cyclone IV GX FPGA. Regarding the NB, six implementations are proposed (three combinational and three sequential) using floating-point notation with 32, 16, and 10 bits in each type of implementation. The work presents data on area occupancy, throughput, and power consumption for all designs.
In [23], a hardware architecture of an NB classifier is implemented on Altera’s Stratix EP1S10F780C5ES for email spam classification, using a logarithmic numerical system to reduce the computational complexity. The proposal was designed using floating-point notation, but can receive fixed-point inputs with 8, 12, and 16 bits. The floating-point implementation was able to classify 8889.375 emails per second using 15 attributes. The hardware was validated with SpamAssassin’s public corpus dataset and presented each architecture’s data processing time and area occupancy.
In [24], another hardware architecture of an NB classifier is presented for real-time classification of seven facial expressions: happy, surprised, sad, disgusted, afraid, angry, and neutral. The proposed design was developed in the fixed-point notation to reduce the area occupation and achieved an operating frequency of 241.55 MHz, with a classification accuracy of 81.94 % . The proposed architecture was designed using the Xilinx System Generator, and its target FPGA was Xilinx Virtex-II.
Meanwhile, in [25], a hardware system is proposed to increase the processing speed and reduce the power consumption of a malware detection application. The system was developed using HLS on a ZedBoard Zynq-7020 SoC Development Board with an Artix-7 FPGA, and it is ready to be integrated into mobile devices' CPUs. Three classification algorithms commonly used to detect malware were implemented: Logistic Regression, Naive Bayes, and Support Vector Machines. These classifiers were optimized for power consumption while maintaining an accuracy similar to software implementations.
Therefore, it is clear from the works mentioned that FPGA implementations of NB can meet the processing-speed and power-consumption requirements of such applications. Unlike the works mentioned above, this work proposes implementing both steps of Naive Bayes, training and inference, with a fully parallel architecture using fixed-point representation. In addition, the architecture was designed to be easily adapted to handle different numbers of attributes and classes. Through this architecture, the aim is to obtain better throughput and power savings.
Table 1 summarizes the characteristics of the proposal’s architecture and works in the literature. The first column shows the architecture analyzed. Meanwhile, the second and third columns indicate whether the work has implemented the training and inference phases of the NB algorithm in hardware. The fourth column displays the FPGA for which the implementation was synthesized. The fifth column indicates the arithmetic representation of the values in the hardware. Finally, the sixth column shows how the architecture implementation was accomplished.

3. Naive Bayes Method

The Naive Bayes method is based on Bayes’ Theorem, in which a “naive” premise of total independence between the attributes is adopted for a given sample. This premise is often unrealistic for most classification problems in which the technique is used; however, it is still possible to achieve high accuracy values compared to other ML techniques [1].
According to [27,28], the classification process calculates the probability of an input vector x belonging to the k-th class c_k of a set of K classes. Therefore, the probability value P(c_k | x) is obtained, which can be expressed as

P(c_k \mid \mathbf{x}) = \frac{P(c_k)\, P(\mathbf{x} \mid c_k)}{P(\mathbf{x})} \quad (1)

where P(c_k) and P(x) are the probabilities of the k-th class c_k and of x occurring, respectively, while P(x | c_k) is the probability of x occurring given that c_k has occurred. The input vector x is expressed as

\mathbf{x} = [x_1, \ldots, x_i, \ldots, x_N] \quad (2)

where each i-th element x_i represents an independent attribute.
Given the premise of independence between the attributes, that P(x) does not depend on c_k, and that each i-th input x_i is already given, P(x) can be ignored in (1). Hence, it can be rewritten as

P(c_k \mid \mathbf{x}) \propto P(c_k) \prod_{i=1}^{N} P(x_i \mid c_k). \quad (3)

Thus, after obtaining the NB probabilistic model, the classification can be performed. For this purpose, the classification process requires a decision mechanism, which can be defined as the highest class probability for a given input sample, i.e.,

\hat{k} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; P(c_k) \prod_{i=1}^{N} P(x_i \mid c_k) \quad (4)

where \hat{k} is the estimate of the k-th class c_k.
Figure 1 shows a flowchart illustrating the operation of the Naive Bayes method. Initially, the dataset is split into training and testing sets. Each sample in the training data contains the attribute vector, x, and its class, c_k, which are provided to the training stage. In this step, the prior probability of each k-th class, P(c_k), and the likelihood of each i-th attribute given each k-th class, P(x_i | c_k), are calculated. The values of P(c_k) and P(x_i | c_k) are then provided to the inference stage, which uses them to compute the posterior probabilities, P(c_k | x), of the samples from the test set. Finally, the highest posterior probability determines the analyzed sample's classification. A software sketch of this flow is given below.
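For reference, a minimal Python sketch of Equations (3) and (4) and of the flow in Figure 1 is shown below. It is a plain floating-point software model, not the hardware of Section 4, and all function and variable names are illustrative.

```python
import numpy as np

def nb_train(X, y, J, K):
    """Estimate P(c_k) and P(x_i = j | c_k) from integer-coded samples.
    X: (n, N) matrix of attribute values in {0, ..., J-1};
    y: (n,) vector of class labels in {0, ..., K-1}."""
    n, N = X.shape
    prior = np.bincount(y, minlength=K) / n                      # P(c_k)
    likelihood = np.zeros((N, J, K))                             # P(x_i = j | c_k)
    for k in range(K):
        Xk = X[y == k]                                           # samples of class k
        for i in range(N):
            likelihood[i, :, k] = np.bincount(Xk[:, i], minlength=J) / max(len(Xk), 1)
    return prior, likelihood

def nb_classify(x, prior, likelihood):
    """Decision rule of Equation (4): argmax_k P(c_k) * prod_i P(x_i | c_k)."""
    posterior = prior.copy()
    for i, xi in enumerate(x):
        posterior *= likelihood[i, xi, :]                        # one factor per attribute
    return int(np.argmax(posterior))
```

In software, the product in nb_classify is often computed as a sum of logarithms to avoid numerical underflow for large N; the proposed hardware instead computes the product directly with fixed-point multipliers, as described in Section 4.2.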

4. Hardware Description

The architecture was designed in fixed-point notation using 15 bits for the fractional part and from 3 to 17 bits in the integer part, depending on the pipeline stage, to reduce the hardware area occupation and increase the throughput compared to floating-point implementations. Two modules were implemented: training and inference, based on the NB method described in Section 3.
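As a software analogue of this number format, the helper below quantizes a real value to signed fixed-point with 15 fractional bits and a configurable integer width. The saturation behavior and the signed two's-complement interpretation are assumptions for illustration; the paper does not detail overflow handling.

```python
def to_fixed(value: float, int_bits: int = 3, frac_bits: int = 15) -> float:
    """Quantize `value` to a signed fixed-point grid and return the real
    number the stored word represents (handy for error analysis)."""
    scale = 1 << frac_bits                        # 2^15 steps per unit
    raw = int(round(value * scale))               # nearest representable code
    lo = -(1 << (int_bits + frac_bits - 1))       # most negative code
    hi = (1 << (int_bits + frac_bits - 1)) - 1    # most positive code
    raw = max(lo, min(hi, raw))                   # saturate on overflow
    return raw / scale
```

For example, to_fixed(0.3333) returns 0.33331298828125, the nearest multiple of 2^-15; this granularity is behind the absolute errors reported in Section 5.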

4.1. Training Module

Figure 2 presents the Training Module (TM) used to obtain the probabilities P(x | c_k) and P(c_k) in Expression (1). Therefore, at each n-th time sample, t_s, the TM receives as input the vector of attributes x(n), described in (2), and the k-th class, c_k(n), for the respective vector of attributes.
Figure 3 details the TM modules. As can be seen, the TM comprises two submodules: the Class Probability Module (CPM) and the Attribute Probability Module (APM), with the APM replicated N times for parallel processing. Hence, at each n-th time sample, the CPM receives one c_k(n) value, which is used for calculating the probabilities P(c_k) and the occurrence count of each k-th class, n_ck. Meanwhile, each i-th APM_i receives as input the i-th attribute, x_i(n), the k-th class, c_k(n), and its occurrence count, n_ck, to obtain the probabilities P(x_i^j | c_k). It is important to mention that each i-th attribute is represented by a j-th possible value, where J_i is the maximum number of distinct values assumed by the i-th attribute.
Figure 4 shows the internal structure of the CPM. Firstly, at each n-th time sample, the k-th class, c_k(n), is compared with the classes of the dataset (c_1, ..., c_K) through comparator circuits; the classes are represented in the architecture as constant values. Secondly, each comparator outputs the bit 1 if the classes are equal; otherwise, it outputs the bit 0. Thirdly, the counter increments its value by 1 at every time sample in which its comparator outputs the bit 1. Thereby, the counters define the occurrence count of each class, n_ck, and propagate these values to every APM_i. Lastly, each k-th n_ck is multiplied by 1/K to obtain the probability of occurrence of each k-th class, P(c_k), according to Expression (3). The calculated probabilities are stored in Registers (R) and provided as outputs of the CPM.
Meanwhile, each i-th APM_i operates in parallel, and its internal structure is shown in Figure 5. As can be observed, each i-th APM_i receives as input the i-th attribute, x_i(n), the k-th class, c_k(n), and its occurrence count, n_ck. Initially, x_i(n) is concatenated with c_k(n) by the CXU block, generating a z_kj^i value. This value represents the occurrence of a k-th class for an i-th attribute taking its j-th value.
Secondly, for each possible z_kj^i value there is a comparator block which, similar to the submodules in the CPM, outputs the bit 1 according to

\text{Comparator block} =
\begin{cases}
1, & \text{if } z_{kj}^{i} = \text{CXU output value} \\
0, & \text{otherwise.}
\end{cases}

Subsequently, the counter submodule (Counter) increments its value by 1 at every n-th time sample in which the comparator output equals 1. Hence, the counter defines the total occurrence count of each z_kj^i, which is stored in a Register (R).
Thirdly, the output of each counter has to be divided by the k-th n_ck related to z_kj^i. Due to the hardware complexity of division circuits that also provide the fractional part of the result, a multiplier was added before and after the division block: the counter output is multiplied by α_1 = 1000, divided by n_ck, and finally multiplied by α_2 = 0.001.
Lastly, to avoid the Naive Bayes zero-frequency problem, which arises when P(x_i^j | c_k) = 0, logic OR gates were added after the last multiplier blocks. The OR gates receive as input the multiplier value and a constant β, which has the smallest possible value representable in the defined data precision, e.g., β ≈ 0.00031. Thus, probabilities P(x_i^j | c_k) equal to zero are replaced by the value of β. The outputs of the logic gates, i.e., the probabilities P(x_i^j | c_k), are stored in Registers (R) and provided as the outputs of the module. At the end of this procedure, the NB training step is complete, and all the values of P(x_i^j | c_k) and P(c_k) are available. A behavioral sketch of the whole module is given below.
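The behavioral sketch below combines the CPM and the APMs: comparator-plus-counter accumulation, the α-scaled integer division, and the β floor. The OR-gate floor is modeled with max(), and the prior is computed as a relative frequency; this is a functional reading of the block diagrams, not a bit-accurate model.

```python
ALPHA1, ALPHA2 = 1000, 0.001  # pre- and post-scaling around the integer divider
BETA = 0.00031                # floor applied by the OR gates (zero-frequency fix)

def training_module(samples, N, J, K):
    """Behavioral model of the TM (one CPM plus N parallel APMs).
    samples: list of (x, c) pairs, with x a length-N tuple of attribute
    values in {0, ..., J-1} and c a class index in {0, ..., K-1}."""
    n_c = [0] * K                                        # CPM class counters
    z = [[[0] * K for _ in range(J)] for _ in range(N)]  # one APM counter per z_kj^i
    for x, c in samples:
        n_c[c] += 1                                      # comparator + counter
        for i in range(N):                               # the N APMs run in parallel
            z[i][x[i]][c] += 1                           # counter selected via CXU concat
    likelihood = [[[max(z[i][j][k] * ALPHA1 // max(n_c[k], 1) * ALPHA2, BETA)
                    for k in range(K)]
                   for j in range(J)]
                  for i in range(N)]                     # scaled division + beta floor
    prior = [n / len(samples) for n in n_c]              # P(c_k)
    return prior, likelihood
```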

4.2. Inference Module

The NB inference process consists of classifying a sample x(n) according to (4). For this purpose, it is necessary to calculate the probability P(c_k | x) defined in Expression (3). Figure 6 shows the Inference Module (IM) developed to perform this process. As can be seen, there is one submodule for each k-th P(c_k | x), and it receives as inputs the probabilities P(c_k) from the TM and the sample x(n) with its N attributes.
Figure 7 shows the internal structure of each k-th P(c_k | x) submodule. As illustrated, there are N 15-bit-wide LUTs of depth J_i, each storing the corresponding P(x_i^j | c_k) values. The sample input value, x_i(n), is used as the LUT address to select its corresponding probability, P(x_i^j | c_k). Subsequently, the probabilities addressed in each i-th LUT are multiplied through a tree of multipliers, and in the last multiplier of the tree, the result is multiplied by the k-th probability P(c_k), thus obtaining the value of P(c_k | x) according to (3).
According to (4), the decision process for the class corresponding to the sample under evaluation consists of selecting the highest probability P(c_k | x). This process is performed by a relational circuit that compares the P(c_k | x) values from all K submodules of the IM and outputs the class c_k with the highest calculated probability, therefore completing the inference process.
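A matching behavioral model of the IM is sketched below; each class submodule reads its N LUTs with the attribute values as addresses, multiplies the results together with the prior, and the relational circuit reduces to an argmax. Names are illustrative.

```python
def inference_module(x, prior, likelihood):
    """Behavioral model of the IM: one P(c_k | x) submodule per class.
    x: length-N tuple of attribute values, used directly as LUT addresses."""
    K, N = len(prior), len(likelihood)
    posteriors = []
    for k in range(K):
        p = prior[k]                        # P(c_k), provided by the TM
        for i in range(N):
            p *= likelihood[i][x[i]][k]     # LUT i returns P(x_i | c_k)
        posteriors.append(p)                # multiplier tree, flattened here
    return max(range(K), key=posteriors.__getitem__)  # relational circuit
```

With the Section 5 setup, this would be driven as prior, likelihood = training_module(train_set, N=4, J=5, K=2), followed by inference_module(x, prior, likelihood) for each test sample.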

5. Hardware Validation

A software implementation was used to validate the proposed NB hardware architecture. The software was developed using floating-point representation, according to the IEEE 754 standard, and both implementations were applied to the divorce predictors dataset [29]. This dataset contains 170 samples representing couples' responses to a survey used to predict divorce based on Gottman couples therapy. Each sample has 54 attributes, each corresponding to a question in the survey, which can receive five different answers represented numerically as: 0 = Never; 1 = Rarely; 2 = Averagely; 3 = Frequently; and 4 = Always. The target of this dataset is to predict, based on the answers given, whether the couple will divorce (Class 1) or not (Class 2).
The dataset was divided into two groups: one with 80% of the data for the training step and another with the remaining 20% for the classification test (inference step). For each data point, four discrete attributes were selected, each of which can assume five distinct values. Thus, the hardware parameters for the tests were set as N = 4, J_1 = J_2 = J_3 = J_4 = 5, and K = 2.

5.1. Training Step Validation

The absolute error between the hardware and software implementations was obtained for each probability P(x_i^j | c_k), as shown in Figure 8 and Figure 9, and is defined as

e_{i,k}^{j} = \left| P_H(x_i^j \mid c_k) - P_S(x_i^j \mid c_k) \right|

where P_H(x_i^j | c_k) and P_S(x_i^j | c_k) are the probability values obtained in hardware and software, respectively.
In addition, the absolute errors were also calculated for each probability P(c_k) as follows:

e_{c_k} = \left| P_H(c_k) - P_S(c_k) \right|

where P_H(c_k) and P_S(c_k) correspond, respectively, to the probability values obtained in hardware and software. The absolute errors calculated for the two classes were e_{c_1} = 0.00303 and e_{c_2} = 0.00379.
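These error metrics are straightforward to reproduce once a fixed-point and a floating-point model are both available; a sketch, assuming the likelihood tables are arrays of shape (N, J, K) as in the earlier reference code:

```python
import numpy as np

def validation_errors(lik_hw, lik_sw, prior_hw, prior_sw):
    """Absolute errors |P_H - P_S| for the likelihoods (e_{i,k}^j)
    and the class priors (e_{c_k})."""
    e_attr = np.abs(np.asarray(lik_hw) - np.asarray(lik_sw))
    e_class = np.abs(np.asarray(prior_hw) - np.asarray(prior_sw))
    return e_attr, e_class
```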

5.2. Inference Step Validation

As previously mentioned, 20% of the dataset was used for the classification test, that is, in the inference step. The classification result of each sample was compared to the expected class given in the dataset. The accuracy achieved by the proposed implementation was 97.06%. Figure 10 presents the confusion matrix obtained after performing the inference on the IM.

6. Results

6.1. Hardware Proposal Results

6.1.1. Area Occupation

Synthesis results were analyzed for the training and inference modules and are presented in Table 2 and Table 3, respectively. In both tables, the first column indicates the number of attributes, N_A, used in the implementation, while the second to fourth columns show the area occupation in the target FPGA: the Number of Logic Cells, N_LC, the Number of memory block Bits, N_Bits, and the Number of Multipliers implemented using DSP blocks, N_MULT.

6.1.2. Processing Speed (Throughput)

The processing speeds for the TM and IM are presented in Table 4 and Table 5, respectively. In both tables, the first column indicates the number of attributes in the analysis, N_A; the second column presents the Throughput (THPT) in Mega samples per second (Msps); and the third column shows the maximum Clock (CLK) frequency.
As is evident from Table 4, the throughput of the training step remains constant as N_A increases, thanks to the high parallelism obtained by replicating the modules. In contrast, as shown in Table 5, the inference throughput decreases as N_A increases due to the growing number of multipliers in the IM.

6.2. Comparison with State-of-the-Art Works

Regarding the area occupation, the number of logic cells and the number of memory bits were analyzed. For a fair comparison of the proposed architecture with state-of-the-art works, some conversions were necessary to obtain comparable hardware area metrics. This is usually required because each work uses a different FPGA and reports area in terms of that device's resources. The conversions considered the relationship of ALMs, LUTs, and Slices to their equivalents in number of logic cells, as stated in the documentation provided by the FPGA manufacturers.
Based on [30], the dynamic power, E_d, can be expressed as

E_d \propto N_g \times \mathrm{CLK} \times V_{DD}^2 \quad (7)

where N_g is the number of elements occupied in hardware, CLK is the clock frequency, and V_DD is the supply voltage. The frequency at which a CMOS circuit can operate is approximately proportional to its supply voltage [31], so Equation (7) can be expressed as

E_d \propto N_g \times \mathrm{CLK}^3. \quad (8)
For all comparisons, N_g was calculated as

N_g = N_{LC} + N_{Mult}.
Besides, only the inference module was compared, due to the scarcity of data about the training step in the state-of-the-art works. Table 3 shows the synthesis results for the inference module, which were used to derive equations for estimating the area occupation. For this purpose, linear regression was used, resulting in the following:

N_{LC} = 1.6333 \times N_A + 13.0833,

N_{Bits} = 256 \times N_A

and

N_{Mult} = 2 \times N_A.

Likewise, the data presented in Table 5 and a logarithmic fit were used to derive the following throughput metric:

\mathrm{THPT} = \frac{1}{3.5869 \times \log_2 N_A + 6.1680},
the behavior of which is observed in Figure 11.
The derived equations allow estimating the hardware area occupation, throughput, and dynamic power consumption for any number of attributes and classes, enabling a performance comparison with state-of-the-art works. A small script implementing these estimates is given below.
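The estimates are easy to script; the sketch below implements the fitted regressions and the power model of Expression (8). Treating the denominator of the throughput fit as a per-sample processing time in nanoseconds is an assumption, chosen because it reproduces the Table 5 values.

```python
import math

def inference_estimates(n_attr):
    """Area and throughput estimates for the IM from the fitted equations."""
    n_lc = 1.6333 * n_attr + 13.0833                          # logic cells
    n_bits = 256 * n_attr                                     # memory block bits
    n_mult = 2 * n_attr                                       # DSP multipliers
    thpt_msps = 1e3 / (3.5869 * math.log2(n_attr) + 6.1680)   # throughput in Msps
    return n_lc, n_bits, n_mult, thpt_msps

def dynamic_power(n_g, clk_mhz):
    """Relative dynamic power per Expression (8): E_d proportional to N_g * CLK^3."""
    return n_g * clk_mhz ** 3

# N_A = 15 gives ~37.6 logic cells, 3840 bits, 30 multipliers and ~49.6 Msps,
# matching Tables 3 and 5 within rounding.
```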

6.2.1. Area Occupation Comparison

Table 6 presents the area occupation for the hardware proposals of [20,22,23] and compares them to the proposed implementation. The second column gives the number of attributes, N_A, used in the referenced proposal and in this work. The third to fifth columns show the number of logic cells for the referenced work (N_LC Ref), for this work (N_LC Author), and the gain (N_LC Gain), respectively. Likewise, the last three columns present the number of memory block bits.
As can be seen, except for the architecture presented by [20], the proposed implementation achieved gains both in the number of logic cells (N_LC Gain) and in the number of memory block bits (N_Bits Gain). This gain can be attributed to the use of a fixed-point representation, while [22] used floating-point and [23] deployed a scheme mixing fixed- and floating-point. Concerning the number of memory block bits, it was not possible to compute gains against Designs I, II, and III of [22], as those designs use no memory blocks; these values were therefore omitted for the compared reference.

6.2.2. Throughput Comparison

Table 7 presents the processing time synthesis results. The second column shows the number of attributes (N_A), while the third and fourth columns show the throughput for the reference (THPT Ref) and this work (THPT Author), respectively. The fifth column presents the speedup obtained by this work with respect to the reference.
As is evident from Table 7, the proposed architecture obtained satisfactory performance, reaching speedups from 5× up to about 10^4× with respect to the other works. This speedup results from the high degree of parallelism adopted in the proposed architecture, which provides a greater processing speed for the technique. The throughput achieved was inferior only to the architecture proposed in [25].

6.2.3. Dynamic Power Consumption Comparison

Table 8 presents the synthesis results regarding dynamic power consumption. The second column shows the number of attributes of the reference, N_A, while the third to fifth columns show the clock frequency (CLK Ref), the N_g Ref, and the dynamic power (E_d Ref), respectively, for the reference. Likewise, the sixth to eighth columns show the clock frequency (CLK Author), the N_g Author, and the dynamic power (E_d Author), respectively, for this work. The last column shows the dynamic power saved by this work compared to the reference (PWR Save). To match each reference work's throughput, CLK Author was set equal to THPT Ref. It is important to mention that the dynamic power consumption values were estimated using Expression (8).
The work in [23] did not use multipliers implemented with DSP blocks, and [22] used multipliers of size 18 × 18 bits, the same size as the multipliers in this work, which allows a direct comparison of these quantities with both works. As can be observed in Table 8, this proposal presents a lower dynamic power than all designs analyzed, reducing the dynamic power by between 10^3× and 10^7×. This result was achieved due to the high throughput, low clock frequency, and reduced hardware area occupation.

7. Conclusions

This work presented a hardware implementation of the Naive Bayes technique, developed with a fully parallel architecture and fixed-point representation. Unlike the works presented in the literature, both the training and inference steps were implemented fully in parallel using fixed-point notation. The implementation was validated for the training and inference steps against a floating-point software implementation, and synthesis results were obtained for a Stratix V 5SGXMBBR3H43C3 FPGA. The occupancy, throughput, and power consumption results were compared with other results found in the literature: the proposed architecture achieved a speedup of up to 10^4×, a hardware occupancy of up to 35× lower, and a dynamic power consumption reduced by up to 10^7×.

Author Contributions

All the authors contributed to various degrees to ensure the quality of this work (e.g., W.K.P.B., M.T.B., L.A.D. and M.A.C.F. conceived of the idea and experiments; W.K.P.B., M.T.B., L.A.D. and M.A.C.F. designed and performed the experiments; W.K.P.B., M.T.B., L.A.D. and M.A.C.F. analyzed the data; W.K.P.B., M.T.B., L.A.D. and M.A.C.F. wrote the paper; M.A.C.F. coordinated the project). All authors have read and agreed to the published version of the manuscript.

Funding

This study was financed in part by the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES)—Finance Code 001.

Acknowledgments

The authors wish to acknowledge the Coordenação de Aperfeiçoamento de Pessoal de Nível Superior (CAPES) for their financial support.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Russell, S.; Norvig, P. Artificial Intelligence: A Modern Approach, 3rd ed.; Prentice Hall Press: Hoboken, NJ, USA, 2009. [Google Scholar]
  2. Caulfield, A.; Chung, E.S.; Putnam, A.; Angepat, H.; Fowers, J.; Haselman, M.; Heil, S.; Humphrey, M.; Kaur, P.; Kim, J.Y.; et al. A cloud-scale acceleration architecture. In Proceedings of the 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO), Taipei, Taiwan, 15–19 October 2016; pp. 1–13. [Google Scholar] [CrossRef]
  3. Dias, L.A.; Damasceno, A.M.; Gaura, E.; Fernandes, M.A. A full-parallel implementation of Self-Organizing Maps on hardware. Neural Netw. 2021, 143, 818–827. [Google Scholar] [CrossRef] [PubMed]
  4. Dias, L.A.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation of K-Means Algorithm on FPGA. IEEE Access 2020, 8, 41071–41084. [Google Scholar] [CrossRef]
  5. Torquato, M.F.; Fernandes, M.A. High-performance parallel implementation of genetic algorithm on fpga. Circuits Syst. Signal Process. 2019, 38, 4014–4039. [Google Scholar] [CrossRef]
  6. Coutinho, M.G.F.; Torquato, M.F.; Fernandes, M.A.C. Deep Neural Network Hardware Implementation Based on Stacked Sparse Autoencoder. IEEE Access 2019, 7, 40674–40694. [Google Scholar] [CrossRef]
  7. Blaiech, A.G.; Ben Khalifa, K.; Valderrama, C.; Fernandes, M.A.; Bedoui, M.H. A Survey and Taxonomy of FPGA-based Deep Learning Accelerators. J. Syst. Archit. 2019, 98, 331–345. [Google Scholar] [CrossRef]
  8. Lopes, F.F.; Ferreira, J.C.; Fernandes, M.A.C. Parallel Implementation on FPGA of Support Vector Machines Using Stochastic Gradient Descent. Electronics 2019, 8, 631. [Google Scholar] [CrossRef]
  9. Noronha, D.H.; Torquato, M.F.; Fernandes, M.A. A parallel implementation of sequential minimal optimization on FPGA. Microprocess. Microsyst. 2019, 69, 138–151. [Google Scholar] [CrossRef]
  10. Lopes, F.F.; Silva, S.N.; Fernandes, M.A.C. FPGA Implementation of the Adaptive Digital Beamforming for Massive Array. In Proceedings of the 2021 IEEE 93rd Vehicular Technology Conference (VTC2021-Spring), Helsinki, Finland, 25–28 April 2021; pp. 1–5. [Google Scholar] [CrossRef]
  11. Chou, K.; Chen, Y. Real-Time and Low-Memory Multi-Faces Detection System Design With Naive Bayes Classifier Implemented on FPGA. IEEE Trans. Circuits Syst. Video Technol. 2020, 30, 4380–4389. [Google Scholar] [CrossRef]
  12. Wickramasinghe, I.; Kalutarage, H. Naive Bayes: Applications, variations and vulnerabilities: A review of literature with code snippets for implementation. Soft Comput. 2021, 25, 2277–2293. [Google Scholar] [CrossRef]
  13. Blanquero, R.; Carrizosa, E.; Ramírez-Cobo, P.; Sillero-Denamiel, M.R. Variable selection for Naïve Bayes classification. Comput. Oper. Res. 2021, 135, 105456. [Google Scholar] [CrossRef]
  14. Chen, H.; Hu, S.; Hua, R.; Zhao, X. Improved naive Bayes classification algorithm for traffic risk management. EURASIP J. Adv. Signal Process. 2021, 2021, 30. [Google Scholar] [CrossRef]
  15. Khajenezhad, A.; Bashiri, M.A.; Beigy, H. A distributed density estimation algorithm and its application to naive Bayes classification. Appl. Soft Comput. 2021, 98, 106837. [Google Scholar] [CrossRef]
  16. Sethi, J.K.; Mittal, M. Efficient weighted naive bayes classifiers to predict air quality index. Earth Sci. Inform. 2022, 15, 541–552. [Google Scholar] [CrossRef]
  17. Deng, Z.; Han, T.; Cheng, Z.; Jiang, J.; Duan, F. Fault detection of petrochemical process based on space-time compressed matrix and Naive Bayes. Process Saf. Environ. Prot. 2022, 160, 327–340. [Google Scholar] [CrossRef]
  18. Kute, S.S.; Shreyas Madhav, A.; Kumari, S.; Aswathy, S. Machine Learning–Based Disease Diagnosis and Prediction for E-Healthcare System. Adv. Anal. Deep. Learn. Model. 2022, 127–147. [Google Scholar] [CrossRef]
  19. Triwiyanto, T.; Caesarendra, W.; Purnomo, M.H.; Sułowicz, M.; Wisana, I.D.G.H.; Titisari, D.; Lamidi, L.; Rismayani, R. Embedded machine learning using a multi-thread algorithm on a Raspberry Pi platform to improve prosthetic hand performance. Micromachines 2022, 13, 191. [Google Scholar] [CrossRef] [PubMed]
  20. Meng, H.; Appiah, K.; Hunter, A.; Dickinson, P. FPGA implementation of Naive Bayes classifier for visual object recognition. In Proceedings of the CVPR 2011 Workshops, Colorado Springs, CO, USA, 20–25 June 2011; pp. 123–128. [Google Scholar] [CrossRef]
  21. Tzanos, G.; Kachris, C.; Soudris, D. Hardware Acceleration on Gaussian Naive Bayes Machine Learning Algorithm. In Proceedings of the 2019 8th International Conference on Modern Circuits and Systems Technologies (MOCAST), Thessaloniki, Greece, 13–15 May 2019; pp. 1–5. [Google Scholar] [CrossRef]
  22. França, A.; Jasinski, R.; Cemin, P.; Pedroni, V.A.; Santin, A.O. The energy cost of network security: A hardware vs. software comparison. In Proceedings of the 2015 IEEE International Symposium on Circuits and Systems (ISCAS), Lisbon, Portugal, 24–27 May 2015; pp. 81–84. [Google Scholar] [CrossRef]
  23. Marsono, M.; Watheq El-Kharashi, M.; Gebali, F. Binary LNS-based naive Bayes hardware classifier for spam control. In Proceedings of the 2006 IEEE International Symposium on Circuits and Systems, Kos, Greece, 21–24 May 2006; p. 4. [Google Scholar] [CrossRef]
  24. Chaudhary, P.; Sharma, M. VLSI Hardware Architecture of Real Time Pattern Classification using Naive Bayes Classifier. In Proceedings of the ICMSSP: International Conference on Multimedia Systems and Signal Processing, Taichung, Taiwan, 13–16 August 2017; pp. 61–65. [Google Scholar] [CrossRef]
  25. Wahab, M.; Milosevic, J. Power & perfomance optimized hardware classifiers for efficient on-device malware detection. In Proceedings of the ICMSSP: International Conference on Multimedia Systems and Signal Processing, Guangzhou, China, 10–12 May 2019; pp. 23–26. [Google Scholar] [CrossRef]
  26. Seth, H.; Banka, H. Hardware implementation of Naïve Bayes classifier: A cost effective technique. In Proceedings of the 2016 3rd International Conference on Recent Advances in Information Technology (RAIT), Dhanbad, India, 3–5 March 2016; pp. 264–267. [Google Scholar] [CrossRef]
  27. Rish, I. An empirical study of the naive Bayes classifier. In Proceedings of the IJCAI 2001 Workshop on Empirical Methods in Artificial Intelligence, Seattle, WA, USA, 4–10 August 2001; Volume 3, pp. 41–46. [Google Scholar]
  28. Leung, K.M. Naive bayesian classifier. Polytech. Univ. Dep. Comput. Sci. Risk Eng. 2007, 2007, 123–156. [Google Scholar]
  29. Yöntem, M.K.; Kemal, A.; Ilhan, T.; Kiliçarslan, S. Divorce prediction using correlation based feature selection and artificial neural networks. Nevşehir Hacı Bektaş Veli Üniversitesi SBE Derg. 2019, 9, 259–273. [Google Scholar]
  30. Sarwar, A. CMOS Power Consumption and Cpd Calculation; Texas Instruments: Dallas, TX, USA, 1997. [Google Scholar]
  31. McCool, M.; Robison, A.D.; Reinders, J. Chapter 2—Background. In Structured Parallel Programming; Morgan Kaufmann: Boston, MA, USA, 2012; pp. 39–75. [Google Scholar] [CrossRef]
Figure 1. Flowchart for the Naive Bayes procedure.
Figure 2. General architecture of the TM.
Figure 3. Modules that constitute the TM.
Figure 4. CPM module internal architecture.
Figure 5. Internal architecture of an i-th APM.
Figure 6. General architecture of the IM.
Figure 7. Internal architecture of a k-th P(c_k | x) submodule.
Figure 8. Absolute error values, e_{i,1}^j, for the probabilities P(x_i^j | c_1).
Figure 9. Absolute error values, e_{i,2}^j, for the probabilities P(x_i^j | c_2).
Figure 10. Confusion matrix for Classes 1 and 2.
Figure 11. Processing time function adjusted by a logarithmic function.
Table 1. Summary of the proposal's architectural characteristics and other works in the literature.

Reference | Training | Inference | FPGA       | Arithmetic     | Design
[20]      | Yes      | Yes       | Virtex-4   | Floating-point | Serial
[21]      | Yes      | Yes       | Artix-7    | Floating-point | Serial
[23]      | No       | Yes       | Stratix    | Floating-point | Serial
[25]      | No       | Yes       | Artix-7    | Floating-point | Serial
[22]      | No       | Yes       | Cyclone IV | Floating-point | Serial/Parallel
[24]      | No       | Yes       | Virtex-II  | Fixed-point    | Parallel
This work | Yes      | Yes       | Stratix V  | Fixed-point    | Parallel
Table 2. Area occupation for the TM.

N_A | N_LC   | N_Bits | N_MULT
4   | 13,950 | 2550   | 0
8   | 27,825 | 4590   | 0
16  | 55,459 | 8670   | 0
Table 3. Area occupation for the IM.

N_A | N_LC | N_Bits | N_MULT
4   | 24   | 1024   | 8
8   | 27   | 2048   | 16
16  | 37   | 4096   | 32
32  | 59   | 8192   | 64
64  | 121  | 16,384 | 128
Table 4. Processing speed of the TM.

N_A | THPT    | CLK
4   | 40 Msps | 40 MHz
8   | 40 Msps | 40 MHz
16  | 40 Msps | 40 MHz
Table 5. Processing speed of the IM.

N_A | THPT       | CLK
4   | 72.52 Msps | 75.50 MHz
8   | 59.00 Msps | 59.00 MHz
16  | 50.00 Msps | 50.00 MHz
32  | 43.01 Msps | 43.00 MHz
64  | 35.00 Msps | 35.00 MHz
Table 6. Hardware area occupation comparison.

Reference | N_A | N_LC Ref | N_LC Author | N_LC Gain | N_Bits Ref | N_Bits Author | N_Bits Gain
[20]      | 762 | 818      | 1258        | 0.65×     | 48         | 195,072       | 0.00×
[22] I    | 10  | 11,488   | 29.40       | 390.75×   | 0          | 2560          | —
[22] II   | 10  | 5480     | 29.40       | 186.40×   | 0          | 2560          | —
[22] III  | 10  | 2867     | 29.40       | 97.52×    | 0          | 2560          | —
[22] IV   | 10  | 1733     | 29.40       | 58.95×    | 10,112     | 2560          | 3.95×
[22] V    | 10  | 1000     | 29.40       | 34.01×    | 6656       | 2560          | 2.60×
[22] VI   | 10  | 705      | 29.40       | 23.98×    | 5120       | 2560          | 2.00×
[23] I    | 15  | 146      | 37.60       | 3.88×     | 38,656     | 3840          | 10.07×
[23] II   | 15  | 278      | 37.60       | 7.39×     | 81,920     | 3840          | 21.33×
[23] III  | 15  | 313      | 37.60       | 8.32×     | 135,168    | 3840          | 35.20×
[23] IV   | 15  | 144      | 37.60       | 3.83×     | 135,168    | 3840          | 35.20×
Table 7. Throughput comparison.

Reference | N_A | THPT Ref           | THPT Author | Speedup
[21]      | 784 | 7.41 × 10^-4 Msps  | 24.59 Msps  | 33,200×
[22] I    | 10  | 2.80 Msps          | 54.67 Msps  | 19.53×
[22] II   | 10  | 4.16 Msps          | 54.67 Msps  | 13.14×
[22] III  | 10  | 5.51 Msps          | 54.67 Msps  | 9.92×
[22] IV   | 10  | 0.46 Msps          | 54.67 Msps  | 118.85×
[22] V    | 10  | 0.77 Msps          | 54.67 Msps  | 71.00×
[22] VI   | 10  | 0.98 Msps          | 54.67 Msps  | 55.79×
[23] I    | 15  | 9.76 Msps          | 49.55 Msps  | 5.08×
[23] II   | 15  | 8.05 Msps          | 49.55 Msps  | 6.16×
[23] III  | 15  | 7.33 Msps          | 49.55 Msps  | 6.67×
[23] IV   | 15  | 8.89 Msps          | 49.55 Msps  | 5.57×
[25] I    | 7   | 4.18 × 10^2 Msps   | 61.58 Msps  | 0.15×
[25] II   | 7   | 2.11 × 10^3 Msps   | 61.58 Msps  | 0.03×
Table 8. Dynamic power consumption comparison.

Reference | N_A | CLK Ref    | N_g Ref | E_d Ref     | CLK Author | N_g Author | E_d Author  | PWR Save
[22] IV   | 10  | 33.46 MHz  | 1747    | 6.54 × 10^7 | 0.46 MHz   | 49.40      | 0.48 × 10^1 | 1.36 × 10^7×
[22] V    | 10  | 56.12 MHz  | 1004    | 1.77 × 10^8 | 0.77 MHz   | 49.40      | 2.25 × 10^1 | 7.87 × 10^6×
[22] VI   | 10  | 71.70 MHz  | 707     | 2.61 × 10^8 | 0.98 MHz   | 49.40      | 4.65 × 10^1 | 5.60 × 10^6×
[23] I    | 15  | 156.18 MHz | 146     | 5.56 × 10^8 | 9.76 MHz   | 67.60      | 6.28 × 10^4 | 8.86 × 10^3×
[23] II   | 15  | 128.78 MHz | 278     | 5.94 × 10^8 | 8.05 MHz   | 67.60      | 3.53 × 10^3 | 1.68 × 10^5×
[23] III  | 15  | 117.26 MHz | 313     | 5.05 × 10^8 | 7.33 MHz   | 67.60      | 2.66 × 10^4 | 1.90 × 10^4×
[23] IV   | 15  | 142.23 MHz | 144     | 4.14 × 10^8 | 8.89 MHz   | 67.60      | 4.75 × 10^4 | 8.72 × 10^3×
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
