Homomorphic Model Selection for Data Analysis in an Encrypted Domain

Hong, Mi Yeon; Yoo, Joon Soo; Yoon, Ji Won

doi:10.3390/app10186174

Homomorphic Model Selection for Data Analysis in an Encrypted Domain^†

by Mi Yeon Hong ^1,2, Joon Soo Yoo ^1,2 and Ji Won Yoon ^1,2,*

¹ School of Cyber Security, Korea University, Seoul 02841, Korea

² Institute of Cyber Security and Privacy (ICSP), Korea University, Seoul 02841, Korea

^* Author to whom correspondence should be addressed.

^† This paper is an extended version of our paper published in Hong, M.Y.; Yoon, J.W. Model Selection for Data Analysis in Encrypted Domain: Application to Simple Linear Regression. In Proceedings of the 20th International Conference, WISA 2019, Jeju Island, Korea, 21–24 August 2019.

Appl. Sci. 2020, 10(18), 6174; https://doi.org/10.3390/app10186174

Submission received: 20 July 2020 / Revised: 21 August 2020 / Accepted: 28 August 2020 / Published: 4 September 2020

(This article belongs to the Special Issue Design and Security Analysis of Cryptosystems)

Abstract:

Secure computation, a methodology of computing on encrypted data, has become a key factor in machine learning. Homomorphic encryption (HE) enables computation on encrypted data without leaking any information to untrusted servers. In machine learning, model selection is a crucial step that determines predictive performance and mitigates under-fitting and over-fitting. Despite the importance of finding the optimal model, no previous study has considered model selection when performing data analysis through an HE scheme. The HE-based model selection we propose finds the optimal complexity that best describes given data that is encrypted and whose distribution is unknown. Since this process requires matrix calculations, we constructed matrix multiplication and matrix inversion based on bitwise operations. Based on these, we designed two model selection methods for homomorphic machine learning: an HE cross-validation approach and an HE Bayesian approach. Our focus was on evidence approximation for linear models to find the goodness-of-fit that maximizes the evidence. We conducted an experiment on a dataset of age and Body Mass Index (BMI) from Kaggle to compare their capabilities, and our model showed that regression can be performed homomorphically on encrypted data without decrypting it.

Keywords:

fully homomorphic encryption; bitwise operation; model selection; cross validation; evidence approximation; Gauss-Jordan elimination

1. Introduction

Due to the continuous increase in computational power and the rapid development of processing and storage technologies, people use cloud computing to manage and analyze massive amounts of biomedical data generated from many sensors. Using cloud computing requires one to trust and share their data with the cloud server. Since all the data resides with the cloud server, a critical data issue arises if the cloud provider misuses the information. One of the best ways to guarantee data privacy is to encrypt data before sending it to the cloud server. However, there is still the problem of the cloud servers calculating the encrypted data to respond to the request from the client. The client needs to offer the secret key to the server to decrypt the data before executing the calculations, which could lead to breach of confidentiality or invasion of privacy. Homomorphic encryption (HE) is the appropriate solution for these issues as it allows the implementation of operations on encrypted data without decrypting it.

Machine learning provides systems with the ability to learn and improve automatically from experience without being programmed to do so. HE can be used to preserve privacy and can also be applied to machine learning to find useful information from big data. For example, HE based machine learning prevents leakage of personal information, while predicting a patient’s disease in the healthcare industry or creating a model for assessing an individual’s creditworthiness in the banking industry. Various machine learning methods have been studied using various HE libraries [1,2,3]. Several studies have been conducted about linear regression, logistic regression, K-Nearest Neighbor, and K-means in References [4,5] based on the proposed HE scheme.

In our previous work, we treated HE-based model selection from a validation-technique perspective. In this paper, we extend the prior work by adding an HE-based evidence approach in a Bayesian framework and by proposing several ways to construct the final algorithm, including different approaches to matrix inversion. To determine the best model for HE, we conducted experiments with the proposed model using the same dataset and compared the results with those of the previously proposed model. The key contributions of this study are as follows:

The proposed model implements basic dynamic operations with bit-by-bit operations that take accuracy and speed into account. We introduced isomorphic logarithms and HE logarithms used for Bayesian model selection.
We also employed a matrix multiplication method based on our operations and expanded our computation to form matrix inverse operation using the Gauss-Jordan elimination method.
Finally, we proposed two ways of the first HE based model selection: Homomorphic cross-validation (HCV) and Homomorphic Bayesian model selection (HBMS), to determine the complexity of models that can explain the data well when given the encrypted data.

2. Background

2.1. Homomorphic Encryption

Homomorphic encryption (HE) is a scheme that allows arbitrary computations on encrypted data. It is based on the property that the result of operations between ciphertexts corresponds to the result of operations between plaintexts, hence the server can operate on the client’s data without decrypting it. If the following equation holds, then the encryption scheme is called homomorphic over the operation ★, where ∘ denotes the corresponding operation on ciphertexts:

Enc (m_{1}) \circ Enc (m_{2}) = Enc (m_{1} ★ m_{2}), \forall m_{1}, m_{2} \in M,

where Enc is an encryption algorithm, and M is a set of plaintexts.
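As a concrete illustration of this property, the following minimal sketch uses unpadded textbook RSA, which is homomorphic over multiplication. The tiny primes, the exponent and the function names are assumptions chosen only for illustration; this is not the scheme used in this paper, and it is not secure.

```python
# Toy textbook RSA (unpadded, insecure) illustrating Enc(m1) o Enc(m2) = Enc(m1 * m2).
# All parameters are illustrative assumptions, not values from the paper.
P, Q = 61, 53
N = P * Q                           # public modulus
E = 17                              # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def enc(m: int) -> int:
    """Encrypt a plaintext integer 0 <= m < N."""
    return pow(m, E, N)


def dec(c: int) -> int:
    """Decrypt a ciphertext integer."""
    return pow(c, D, N)


m1, m2 = 7, 12
# Ciphertext-side operation: multiplication modulo N.
c_prod = (enc(m1) * enc(m2)) % N
# Decrypting the combined ciphertext yields the product of the plaintexts.
recovered = dec(c_prod)
```

Here the ciphertext operation ∘ is multiplication modulo N and the plaintext operation ★ is also multiplication modulo N; a fully homomorphic scheme supports enough operations to evaluate arbitrary circuits.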

In 1978, the concept of privacy homomorphism was originally proposed as a modification to the RSA cryptosystem, which describes the concept of preserving the computations between ciphertexts as presented by Rivest et al. [6]. This concept has led to numerous attempts by researchers around the world to search for a homomorphic scheme with various sets of operations. However, the main hindrance that researchers faced for about 30 years was the limitation on the number of operations that could be evaluated. During arithmetic on ciphertexts, the noise level grows, and at some point it becomes so large that it is impossible to proceed with the operation without losing correctness.

In 2009, Craig Gentry presented the idea of a fully homomorphic encryption (FHE) scheme that is based on NTRU (N-th degree Truncated polynomial Ring Units), a lattice-based cryptosystem that is considered somewhat homomorphic encryption (SHE). It can hold up to a limited number of operations, that is, a few multiplications and many additions. A public-key FHE scheme is composed of a 4-tuple of probabilistic polynomial-time algorithms, namely (KeyGen, Enc, Eval, Dec):

KeyGen( $1^{λ}$ ) → ( $pk, evk, sk$ ): The key generation algorithm KeyGen takes a security parameter $λ$ and outputs a public encryption key $pk$ , a public evaluation key $evk$ and a secret decryption key $sk$ .
Enc( $pk$ , $m$ ) → $ct$ : The encryption algorithm Enc takes $pk$ and a plaintext message $m \in \{0, 1\}$ , and outputs a ciphertext $ct$ .
Eval( $evk$ , $f$ , < ${ct}_{1}$ , ⋯, ${ct}_{k}$ >) → ${ct}^{*}$ : The evaluation algorithm Eval takes $evk$ , a k-input function $f$ : $\{0, 1\}^{k}$ → $\{0, 1\}$ and a set of $k$ ciphertexts ${ct}_{1}$ , ⋯, ${ct}_{k}$ . It outputs a new ciphertext ${ct}^{*}$ that is an encryption of $f$ ( $m_{1}$ , ⋯, $m_{k}$ ), where ${ct}_{i}$ ← Enc( $pk$ , $m_{i}$ ) for i = 1, ⋯, k.
Dec( $sk$ , $ct$ ) → $m^{*}$ or ⊥: The decryption algorithm Dec takes $sk$ and a ciphertext $ct$ , and outputs a message $m^{*} \in \{0, 1\}$ if $ct$ is the result of Enc( $pk$ , $m$ ) and $pk$ matches $sk$ ; otherwise, it outputs ⊥.

The key idea proposed in Gentry’s work is bootstrapping, which is the process of refreshing a ciphertext to produce a new ciphertext with a lower level of noise so that more homomorphic operations can be evaluated. However, the bootstrapping procedure required adding many ciphertexts to the public key and encrypting the individual bits (or coefficients) of the secret key. This resulted in very large public keys, and because the plaintext has to be encrypted bit by bit, the size of the ciphertext increases at each step, making the scheme too expensive in terms of computation.

2.2. Our Novel Approach Based on TFHE Library

Various HE libraries (e.g., References [7,8,9,10,11]) have been suggested based on Gentry’s scheme [12]. Chillotti et al. released the TFHE (Fast Fully Homomorphic Encryption over the Torus) [13] library, which was designed from FHEW (Fastest Homomorphic Encryption in the West). It is based on both the learning with errors (LWE) [14] assumption and its ring variant [15]. It significantly improves the performance of the bootstrapping operation, to less than 0.1 s, and facilitates an unlimited number of operations based on a gate-by-gate bootstrapping procedure. Moreover, the bootstrapping key size was scaled down from 1 GB to 24 MB while maintaining the security level and reducing the overhead caused by noise.

The library supports the homomorphic evaluation of the binary gates (AND, OR, XOR, NOR, NAND, etc.), as well as the negation and the MUX gate, which can be used for various operations. Additionally, there is no restriction on the number of gates or their compositions, making it possible to perform any computations over encrypted data.

2.2.1. Bitwise Representation of Number

We performed homomorphic encryption of plaintext bits, that is, each bit within the plaintext is encrypted with the same key. Therefore, using our notation, given the plaintext a, the result of the encryption stage yields a ciphertext ct.a, which is an array containing encrypted bits. This is different from the mainstream approach since it is lattice-based and initiated from the input’s integer values. Specifically, integer-based FHE requires the encoding process through the rounding operation of plaintexts for the conversion of real number input. Through the rounding process, the outcome contains an error.

Our approach is designed to avoid the accuracy problem of converting real numbers to integers. In general, a floating-point number system guarantees a broader range of input; however, it can also lead to more complex algorithms when using an HE scheme. HE gate operations take a significant amount of time compared to plaintext gate operations, thus using floating-point is less efficient. On the other hand, the major advantage of using a fixed-point representation is that it reduces to simple integer arithmetic with much less logic than floating-point, which improves performance by reducing the number of bootstrapping procedures on encrypted bits. Reducing this cost is crucial in FHE, so we adopt only a fixed-point number system in this paper.

We assigned $\frac{r}{2}$ , $\frac{r}{2} - 1$ and 1 bit for the integer part, the fractional part and the sign bit, respectively. The encryption of each bit is placed in the same position as in the plaintext. Consequently, the values of a plaintext and its corresponding ciphertext are precisely equal. By assigning a longer input, we guarantee higher accuracy at the cost of a dramatically increased execution time. Typically, HE gate operations take a significant amount of time compared to plaintext gate operations. This is due to the heavy load of noise accumulated through HE gates. Therefore, realistically, we set the length of the input to 8, 16, and 32 bits for experimentation. Figure 1 shows an example of an 8-bit fixed-point representation of the number 6.75 using our approach.
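The bit layout described above can be sketched in plaintext as follows. This is a minimal sketch under stated assumptions: bits are stored least significant first so that the sign bit sits at index r − 1, negative numbers use 2’s complement, and the helper names `to_fixed_bits` and `from_fixed_bits` are hypothetical. In the encrypted setting each of these bits would be encrypted individually.

```python
def to_fixed_bits(value: float, r: int = 8) -> list[int]:
    """Encode value as an r-bit two's-complement fixed-point number, LSB first.

    Assumed layout: r/2 - 1 fractional bits, r/2 integer bits, and one sign
    bit at index r - 1.
    """
    frac_bits = r // 2 - 1
    scaled = round(value * (1 << frac_bits))
    if not -(1 << (r - 1)) <= scaled < (1 << (r - 1)):
        raise ValueError("value out of range for this bit length")
    scaled &= (1 << r) - 1               # two's-complement wrap for negatives
    return [(scaled >> i) & 1 for i in range(r)]


def from_fixed_bits(bits: list[int]) -> float:
    """Decode an LSB-first two's-complement fixed-point bit list."""
    r = len(bits)
    frac_bits = r // 2 - 1
    raw = sum(b << i for i, b in enumerate(bits))
    if bits[r - 1]:                      # sign bit set: negative number
        raw -= 1 << r
    return raw / (1 << frac_bits)


bits = to_fixed_bits(6.75)               # sign 0, integer 0110, fraction 110
msb_first = bits[::-1]
```

Read from the most significant bit, the 8-bit encoding of 6.75 consists of the sign bit 0, the integer bits 0110 and the fractional bits 110.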

2.2.2. Bitwise Operation

We illustrate some of the critical concepts, such as our basic scheme and functions, to aid understanding. For a full account of our approach, interested readers should refer to our previous works [4,5,16,17]; here, for the sake of flow, we introduce only the basics.

The goal of HE operations is to construct an HE function that yields an encrypted result matching the plaintext operation. For instance, the result of the plaintext addition a + b should match the result of the HE addition ct.a $+_{HE}$ ct.b, where $+_{HE}$ represents HE addition. These basic HE operations are built from combinations of HE gate operations. We provide a simple illustration of our method by constructing the HE absolute value operation to show the difference between plaintext and ciphertext.

Suppose we have a plaintext value a that is converted to a fixed-point number consisting of 0s and 1s. If the goal is to derive the absolute value of a in plaintext, one can easily obtain it with the 2’s complement method, depending on the sign of a. However, in the encrypted state, the sign bit of the ciphertext, ct.a[r − 1], where r is the length of the input, is encrypted and therefore not known, so we have to compute both the positive and the negative case of ct.a. Consider the case when ct.a is positive; then the sign bit is Enc[0]. Applying the NOT gate to the sign bit yields Enc[1], which we call ct.s. If we perform AND gate operations between ct.s and every bit of ct.a, we obtain ct.a when ct.a is positive. If ct.a is negative, then ct.s is Enc[0], which, in the same manner, yields Enc[0] in every position after executing the AND gate operations on every bit of ct.a.

For the negative case, we also perform the 2’s complement operation on ct.a and denote the outcome by ct.n. Likewise, AND gate operations between every bit of ct.n and the sign bit provide the result of the negative case: the sign bit is Enc[1] when ct.a is negative, so the result is ct.n, and it is Enc[0] when ct.a is positive, so the result is Enc[0] in every position. Therefore, by adding the two results of the positive and negative cases, one can obtain the encrypted absolute value starting from the encrypted number a. In this way, we constructed various HE operations from HE bootstrapping gates. Now, we provide basic notations of gates and operations that are used in our work. Table 1 demonstrates our homomorphic operations and homomorphic functions used in this study.
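The branch-free absolute value described above can be mimicked on plaintext bits. The sketch below is only an analogue under stated assumptions: plain Python integers 0 and 1 stand in for encrypted bits, the bit lists are least significant bit first, and the helper names are hypothetical. The point is that no step inspects the sign bit in a conditional; both cases are computed and then masked.

```python
def twos_complement(bits):
    """Negate an LSB-first two's-complement bit list (invert, then add 1)."""
    out, carry = [], 1
    for b in bits:
        s = (1 - b) + carry
        out.append(s & 1)
        carry = s >> 1
    return out


def add_bits(a, b):
    """Ripple-carry addition of two LSB-first bit lists; overflow is dropped."""
    out, carry = [], 0
    for x, y in zip(a, b):
        s = x + y + carry
        out.append(s & 1)
        carry = s >> 1
    return out


def abs_branch_free(a):
    """Absolute value without branching on the sign bit."""
    sign = a[-1]                       # plays the role of ct.a[r - 1]
    s = 1 - sign                       # NOT gate on the sign bit: ct.s
    n = twos_complement(a)             # ct.n, computed unconditionally
    pos = [s & bit for bit in a]       # a if a >= 0, otherwise all zeros
    neg = [sign & bit for bit in n]    # -a if a < 0, otherwise all zeros
    return add_bits(pos, neg)          # exactly one summand is non-zero


minus_twenty = [0, 0, 1, 1, 0, 1, 1, 1]   # -20 as 8-bit two's complement
result = abs_branch_free(minus_twenty)    # bits of +20
```

Because both branches are always evaluated, the cost is the sum of the two cases, which is the price of not knowing the encrypted sign.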

2.2.3. HE Bitwise Logarithm

We introduced the absolute value function in the previous section to demonstrate that the plaintext algorithm needs additional measures in order to obtain the same result in the encrypted domain. Now, we present another key HE function, the logarithm, which is used in the derivation of the evidence function. The details of the HE function are explained in Reference [16]. In this paper, we briefly explain its method and show an example to illustrate our approach.

To design the logarithm function in plaintext, we derive $y = \log_{2} x$ in the first stage. Next, we obtain the general form of the logarithm with a different base. We assume x and y to be real numbers s.t. $x \in [1, 2)$ and $y \in [0, 1)$ . Then, y can be written as

\begin{matrix} y & = b_{0} \times 2^{- 1} + b_{1} \times 2^{- 2} + \dots + b_{\frac{r}{2} - 2} \times 2^{- (\frac{r}{2} - 1)} \\ & = 2^{- 1} \times (b_{0} + 2^{- 1} \times (b_{1} + 2^{- 1} \times (b_{2} + \dots))), \end{matrix}

(1)

where the binary representation of $y \in [0, 1)$ is $[0, 0, \dots, 0, b_{\frac{r}{2} - 2}, \dots, b_{1}, b_{0}]$ . Since $y = \log_{2} x$ can be rewritten as $x = 2^{y}$ , we obtain the nested form of the equation as follows.

\begin{matrix} x & = 2^{y} \\ & = 2^{2^{- 1} \times (b_{0} + 2^{- 1} \times (b_{1} + 2^{- 1} \times (b_{2} + \dots)))} . \end{matrix}

(2)

By squaring x, the formula becomes $x^{2} = 2^{b_{0} + 2^{- 1} \times (b_{1} + 2^{- 1} \times (b_{2} + \dots))}$ . Since $b_{0}$ is either 0 or 1 and the remaining part of the exponent is less than 1, $b_{0}$ is equal to 1 if $x^{2} \geq 2$ and 0 otherwise. For the case of $b_{0} = 1$ , we divide $x^{2}$ by 2 so that the value again lies between 1 and 2. Following the procedure recursively provides the bits $b_{i}$ , that is, the fractional part of y.

So far, the above procedure obtains the fractional part of y, which is less than 1. If $y \geq 1$ , we perform the above procedure for the fractional part of y, whereas the integer part of y is obtained simply from the index of the position of the most significant bit of x. In conclusion, the sum of the two outcomes is the result of $y = \log_{2} x$ .
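The plaintext procedure above (normalize, then repeatedly square and compare with 2) can be sketched as follows. This is a minimal illustration on ordinary floating-point numbers, not the encrypted algorithm; the function name `log2_square_compare` and the default of 16 fractional bits are assumptions made for the example.

```python
def log2_square_compare(x: float, frac_bits: int = 16) -> float:
    """Approximate log2(x) using only comparison, halving/doubling and squaring."""
    if x <= 0:
        raise ValueError("x must be positive")
    # Integer part: normalise x into [1, 2) and count the shifts,
    # which corresponds to locating the most significant bit of x.
    integer_part = 0
    while x >= 2.0:
        x /= 2.0
        integer_part += 1
    while x < 1.0:
        x *= 2.0
        integer_part -= 1
    # Fractional part: each squaring exposes the next bit b_i.
    frac = 0.0
    weight = 0.5                      # weight of b_0 is 2^-1
    for _ in range(frac_bits):
        x *= x                        # x^2 lies in [1, 4)
        if x >= 2.0:                  # b_i = 1
            frac += weight
            x /= 2.0                  # bring the value back into [1, 2)
        weight /= 2.0
    return integer_part + frac


approx = log2_square_compare(10.0)    # close to 3.3219, truncated to 16 bits
```

The result is truncated rather than rounded, so it never exceeds the true value by more than floating-point error and falls short of it by less than 2^-frac_bits.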

This plaintext mechanism is useful for our scheme in the encrypted domain since the process of obtaining the fractional bits mainly involves two fast operations: comparison and shift. However, the approach to solving the problem has to be different. Given the problem of ct.y = $\log_{2}$ (ct.x), our first task is to obtain a normalized ct.x $\in [Enc (1), Enc (2))$ . This is not as easy as in the plaintext situation, where we can simply shift the bits of x to normalize it s.t. x lies between 1 and 2. Instead, we make a detour using our HE functions to normalize a ciphertext. This is the first step that differs from the plaintext domain. Algorithm 1 shows the normalization of ct.x.

Algorithm 1 HomNorm(ct.x)

1: for $i = 0$ to $r - 2$ do
2: $ct . d_{i} \leftarrow$ Ciphertext arrays where $ct . d_{i} [i] = Enc (1)$ , else $Enc (0)$
3: $ct . s_{i} \leftarrow$ Ciphertext arrays where $ct . s_{i} = \frac{r}{2} - i$
4: $ct . o [i] \leftarrow HomCompareLarge (ct . x, ct . d_{i})$
5: $ct . a_{i} \leftarrow$ $HomShift (ct . x)$ by $\frac{r}{2} - i$
6: end for
7: $ct . p$ ← Add all the elements of $ct . o [i]$
8: $ct . p$ ← Subtract $Enc (\frac{r}{2})$ from $ct . p$
9: $ct . e_{i}$ ← HomEqualCompare $(ct . s_{i}, ct . p)$
10: $ct . r_{i}$ ← Bitwise $bootsAND (ct . e_{i}, ct . a_{i})$
11: $ct . r$ ← Add all the elements of $ct . r_{i}$
12: return $ct . r$

Next, we proceed to obtain the fractional bits of ct.y from the normalized ct.x returned by Algorithm 1. The problem in the encrypted domain is to decide whether the square of ct.x is greater than or equal to $Enc (2)$ . We use the HomLargeCompare function to compare the two ciphertexts $Enc (2)$ and $ct . x^{2}$ ; it returns $Enc (0)$ if the former is larger than the latter. With this ciphertext, we can also decide whether to shift $ct . x^{2}$ by 1 bit or not. This method of obtaining the encrypted fractional bits of ct.y is listed in Algorithm 2.

Algorithm 2 HomSquareShift(ct.x)

1: for $i = 0$ to $\frac{r}{2} - 1$ do
2: $ct . x^{2} \leftarrow HomMulti (ct . x, ct . x)$
3: $ct . r [i] \leftarrow ct . x^{2} [\frac{r}{2} - 1]$
4: $ct . n [i] \leftarrow bootsNOT (ct . r [i])$
5: $\frac{ct . x^{2}}{2} \leftarrow$ Left shift 1 bit of $ct . x^{2}$
6: $ct . a [i] \leftarrow$ $bootsAND (ct . r [i], \frac{ct . x^{2}}{2})$
7: $ct . b [i] \leftarrow$ $bootsAND (ct . n [i], ct . x^{2})$
8: $ct . v [i] \leftarrow HomAdd (ct . a [i], ct . b [i])$
9: end for
10: return $ct . v$
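The selection step in Algorithm 2 (keep the square, or the halved square, depending on an encrypted comparison bit) can be imitated on plaintext integers without any data-dependent branch. The sketch below is an analogue under stated assumptions: `x_fixed` is x scaled by 2^frac_bits, multiplying by a 0/1 integer stands in for the bitwise AND, and the function name is hypothetical. It does not reproduce the bit layout or shift direction of the authors' implementation.

```python
def log2_fraction_bits(x_fixed: int, frac_bits: int, n_bits: int) -> list[int]:
    """Fraction bits of log2(x) for x in [1, 2), most significant bit first.

    The comparison result r selects between x^2 and x^2 / 2 arithmetically,
    mirroring the AND / NOT / add selection of the encrypted algorithm.
    """
    one = 1 << frac_bits
    if not one <= x_fixed < 2 * one:
        raise ValueError("x must be normalised into [1, 2)")
    bits = []
    x = x_fixed
    for _ in range(n_bits):
        sq = (x * x) >> frac_bits          # x^2 in [1, 4), truncated
        r = (sq >> (frac_bits + 1)) & 1    # 1 exactly when x^2 >= 2
        n = 1 - r                          # NOT r
        x = r * (sq >> 1) + n * sq         # select x^2 / 2 or x^2
        bits.append(r)
    return bits


FRAC = 30
bits_1_5 = log2_fraction_bits(3 << (FRAC - 1), FRAC, 8)   # x = 1.5
value = sum(b / 2 ** (i + 1) for i, b in enumerate(bits_1_5))
```

Both candidate values are computed in every iteration and exactly one survives the masking, which is why the encrypted version pays for two AND passes and one addition per extracted bit.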

Through Algorithms 1 and 2, we obtain the fractional part of ct.y. Since the integer part of ct.y is given by the position of the most significant bit of ct.x, it is $ct . p$ at line 8 of Algorithm 1. Therefore, $HomAdd (ct . v, ct . p)$ is the result of the HE logarithm $y = \log_{2} x$ . For the general result $ct . (\log_{a} x)$ , we first derive $ct . (\log_{2} a)$ and then compute $\frac{ct . (\log_{2} x)}{ct . (\log_{2} a)}$ , which is equal to $ct . (\log_{a} x)$ .

2.2.4. Time Complexity of Designed Homomorphic Operation

The experimental environment was configured as follows. All computations ran on a computer with 32 GB RAM and an Intel Core i7-8700 CPU at 3.2 GHz (Intel, Santa Clara, CA, USA) under Ubuntu 18.04, and we used TFHE library version 1.0.1. We measured the computational speed of 1-bit basic gate operations over 1000 runs in TFHE. All bootstrapping HE gates except the NOT and MUX gates took about 10.8 ms to evaluate. The MUX gate took 20.9 ms and the NOT gate took 0.000154 ms, which is significantly lower than the other gates. Therefore, we ignored the NOT gate when calculating execution time. We denoted the time complexity of all binary gates as $T_{B}$ , except the MUX gate, which was denoted as $T_{X}$ . Table 2 shows the performance time for each operation in detail.

As more bits are assigned to the input value, execution time increases dramatically because it involves many operations, including addition, subtraction and comparison that increase linearly with the length of data and their interactions.

2.3. Model Selection

Most statistical inference approaches for analyzing data aim to make “good” predictions. However, “good” predictions cannot be obtained if proper models are not chosen in the first place. Worse, there exists no perfect model that is generally suitable for any data. Therefore, a crucial step in data analysis is to consider a set of candidate models and then select the most appropriate one. This is called model selection, one of the most important and essential steps to obtain stably accurate results in data analysis. Model selection can be divided into two main branches because the meaning of ‘model’ is interpreted differently in various fields.

A Model regarded as an algorithm: It selects the most appropriate process among different machine learning approaches—for example, support vector machine (SVM), KNN, logistic regression, and so forth.
A Model regarded as a complexity: It selects among different hyperparameters in a set of features for the same machine learning approach—for example, determining an order between polynomial models in linear regression.

In this study, we focus on model selection with various complexities, by adapting numerical solutions for the regression analysis.

Polynomial regression is a type of linear regression in which the relationship between the independent variable x and the dependent variable y is modeled as an Mth-degree polynomial:

y (x, θ) = θ_{0} + θ_{1} x + θ_{2} x^{2} + \dots + θ_{M} x^{M} = \sum_{i = 0}^{M} θ_{i} x^{i} .

(3)

In Equation (3), $θ$ is the set of polynomial coefficients, which is determined by fitting the polynomial to the training data by minimizing the errors. M is the order of the polynomial, so we need to estimate the optimal order $M^{*}$ to obtain the best prediction results.

Generally, model selection criteria are based on an estimator of generalization performance evaluated over the data. The models following M can be evaluated as estimated errors that are decomposed by bias and variance. If the model is too simple to describe the data, there is a high probability of a high bias and low variance called under-fitting. On the contrary, over-fitting occurs when a complex model has low bias and high variance. In machine learning, an over-fitted model may fit perfectly into training data but may not be suitable for new data.

To avoid under-fitting and over-fitting, one must select an appropriate model with optimal complexity. Thus, we need a way to choose among models with different complexity. A considerable number of selection procedures have been proposed in the literature, for example, the AIC (Akaike Information Criterion) method [18], the Cp method [19], the BIC (Bayesian Information Criterion) method [20], the Cross-Validation (CV) method [21], Bayesian evidence methods [22], and so forth. In this study, we developed two model selection algorithms that can work in the encrypted domain: cross-validation (CV) and Bayesian model selection.

2.3.1. Cross Validation

CV is one of the most widely used methods to evaluate predictive performances of a candidate model in model selection. CV estimates the expected error, and does not require the models to be parametric. It makes full use of data without leaking information into the training phase. Regarding data splitting, some of the data is used for fitting each model to be compared, and the rest of the data is used to measure the predictive performances of the models by the validation errors. Through these processes, the model with the best overall performance is selected.
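To make the procedure concrete, the following plaintext sketch selects a polynomial order by k-fold cross-validation. It is an illustration under stated assumptions, not the homomorphic algorithm of this paper: the data are synthetic and noiseless, folds are formed by taking every k-th point, the fit uses the normal equations solved by Gauss-Jordan elimination, and all function names are hypothetical.

```python
def fit_poly(xs, ys, order):
    """Least-squares polynomial coefficients via the normal equations."""
    n = order + 1
    # Augmented normal-equation matrix [X^T X | X^T y].
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(n)]
    for col in range(n):                      # Gauss-Jordan with partial pivoting
        piv = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[piv] = a[piv], a[col]
        p = a[col][col]
        a[col] = [v / p for v in a[col]]
        for row in range(n):
            if row != col:
                f = a[row][col]
                a[row] = [v - f * w for v, w in zip(a[row], a[col])]
    return [row[n] for row in a]


def predict(theta, x):
    """Evaluate the polynomial with coefficients theta at x."""
    return sum(t * x ** i for i, t in enumerate(theta))


def cv_error(xs, ys, order, k):
    """Mean squared validation error over k folds (fold = every k-th point)."""
    total, count = 0.0, 0
    for fold in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [xy for i, xy in pairs if i % k != fold]
        valid = [xy for i, xy in pairs if i % k == fold]
        theta = fit_poly([x for x, _ in train], [y for _, y in train], order)
        total += sum((predict(theta, x) - y) ** 2 for x, y in valid)
        count += len(valid)
    return total / count


def select_order(xs, ys, orders, k=5):
    """Return the candidate order with the smallest cross-validation error."""
    return min(orders, key=lambda m: cv_error(xs, ys, m, k))


# Synthetic, noiseless quadratic data (an assumption for illustration only).
xs = [i / 10 for i in range(-10, 11)]
ys = [1 + 2 * x - 3 * x ** 2 for x in xs]
best = select_order(xs, ys, orders=[0, 1, 2])
```

Each candidate order is fitted k times, once per fold, which is the repeated training cost that becomes expensive when every arithmetic operation is a sequence of bootstrapped gates.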

2.3.2. Bayesian Model Selection

Bayesian model selection is also a well-known approach for choosing an appropriate model. The Bayesian paradigm offers a principled approach that addresses the model choice by considering the posterior probability given a model. The Bayesian view of model comparison includes the consistent application of the sum and product rules of probability, as well as the use of probabilities to represent uncertainty in model comparison [23]. More precisely, suppose that the models to be compared can be enumerated and indexed by the set { $M_{i}$ : $i = 1, \dots, N$ }. Each model $M_{i}$ represents a probability distribution for the observed data $D$ . The posterior distribution for a set of model parameters is:

p (θ | D, M_{i}) = \frac{p (D | θ, M_{i}) p (θ | M_{i})}{p (D | M_{i})},

(4)

where $p (D | θ, M_{i})$ is the likelihood and $p (θ | M_{i})$ represents the prior distribution of the parameters of $M_{i}$ . The model evidence for model $M_{i}$ , obtained from the product and sum rules as

p (D | M_{i}) = \int p (D | θ, M_{i}) p (θ | M_{i}) d θ

(5)

represents the preference shown by the data for different models. It is also called the marginal likelihood because it can be viewed as a likelihood function over the space of models in which the parameters have been marginalized out.
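As a small worked instance of Equation (5), the sketch below compares two models of a coin-flip sequence: one with the bias fixed at 0.5 and one with a uniform prior over the bias. This toy Bernoulli example is an assumption chosen because its evidence has a closed form; it is not the linear-model evidence approximation developed in this paper, and the function names are hypothetical.

```python
from math import factorial


def evidence_fixed(h: int, t: int, theta: float = 0.5) -> float:
    """Evidence of a model with no free parameter: just the likelihood."""
    return theta ** h * (1 - theta) ** t


def evidence_uniform_prior(h: int, t: int, steps: int = 100_000) -> float:
    """Evidence under a Uniform(0, 1) prior on theta, by the midpoint rule."""
    width = 1.0 / steps
    total = 0.0
    for i in range(steps):
        theta = (i + 0.5) * width
        likelihood = theta ** h * (1 - theta) ** t
        prior = 1.0                          # uniform density on (0, 1)
        total += likelihood * prior * width
    return total


def evidence_uniform_exact(h: int, t: int) -> float:
    """Closed form of the same integral: h! t! / (h + t + 1)!."""
    return factorial(h) * factorial(t) / factorial(h + t + 1)


# A sequence with 9 heads and 1 tail favours the flexible model,
# while a balanced sequence (5 heads, 5 tails) favours the fixed fair coin.
skewed = (evidence_uniform_prior(9, 1), evidence_fixed(9, 1))
balanced = (evidence_uniform_prior(5, 5), evidence_fixed(5, 5))
```

The flexible model spreads its prior mass over many biases, so it is penalized when the simpler model already explains the data; this automatic trade-off is what makes the evidence usable for model comparison on the training set alone.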

3. Problems

Our approach is different from the mainstream of the homomorphic encryption scheme. We addressed several critical underlying problems that the current research holds.

3.1. Model Selection in Homomorphic Encryption

First, the majority of the current literature that links HE to machine learning is confined to solving a “modeled” problem. Based on their HE operations, these studies look for a suitable data distribution that can show high performance of their functions. Kim et al. [1] and Aono et al. [3] achieved high performance of HE logistic regression on encrypted data in terms of accuracy and speed, using the iDASH dataset and the Pima diabetes dataset. However, they achieved this high performance mainly because they chose a data distribution from which their HE logistic regression benefits significantly.

However, if logistic regression were performed on an unknown (or encrypted) data distribution, it is unlikely that the same results would be achieved. Therefore, we propose model selection as a first step to decide the appropriate model for fitting an unknown data distribution. We claim that our approach is more fundamental because it first provides an adequate model for the encrypted data distribution.

Furthermore, implementing matrix inversion (and division) in the model selection process is a particularly challenging task in HE. Reference [24] discusses inverse matrix computation through the Schulz algorithm in the HE domain; however, their scheme does not have a division operation. Therefore, they inevitably must adopt an approximation method in order to perform the inverse matrix operation during the model selection process. In this process, the error increases linearly through iterations. In contrast, our approach, Gauss-Jordan elimination based on the bitwise operations mentioned in Section 2, can accurately derive the inverse matrix. The most useful property of the method is that the reduced row echelon form is unique, which means that row reduction on a matrix produces the same answer regardless of how the row operations are performed; this is essential to get a definite answer in the encrypted state. In the following, we propose homomorphic matrix operations to discuss a method of obtaining an inverse matrix.
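For reference, a plaintext Gauss-Jordan inversion can be sketched as below. It uses exact rational arithmetic from the Python standard library and a hypothetical function name; it is not the homomorphic version. In particular, this sketch branches on whether a pivot is zero, which an encrypted implementation cannot do directly, and how pivoting is handled homomorphically is not covered here.

```python
from fractions import Fraction


def gauss_jordan_inverse(matrix):
    """Invert a square matrix by reducing [A | I] to [I | A^-1]."""
    n = len(matrix)
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # Find a row with a non-zero entry in this column and swap it up.
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Scale the pivot row so the pivot becomes 1 (requires division).
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        # Eliminate the column from every other row.
        for r in range(n):
            if r != col and aug[r][col] != 0:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


inverse = gauss_jordan_inverse([[2, 1], [5, 3]])
```

The method needs only addition, subtraction, multiplication and division of entries, which is why an exact division operation on ciphertexts makes it applicable where approximation-based schemes are not.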

Through our previous works [5,16], we designed nonlinear functions such as logarithm and exponential functions that can be applied more fundamentally and used them in the Bayesian model selection process. However, the mainstream approach does not support nonlinear functions such as the exponential function and the logarithm function [4,25]. Instead, it approximates these types of functions by a Taylor series or a least-squares approximation in the form of a sum of polynomials. This approach is limited to solving the logistic regression problem and cannot be used to design other types of algorithms.

3.2. Problems of Cross-Validation, an Existing Model Selection Method

In the previous study [17], we used CV as an evaluation method for model selection. When performing CV using two sets, the training and the test sets, the number of folds (k) affects the speed of our approach. It is known that the larger the value of k, the more accurate the results. However, in our research, the execution time of HE operations is significantly high, so accuracy must be balanced against speed. Thus, we need other methods that offset the long computation time of our HE functions.

To overcome these shortcomings, we proposed model selection from a Bayesian perspective. This method avoids the over-fitting problem by performing marginalization based on the parameter instead of the point estimation of the model’s parameter values. In this case, the models can be directly compared based on the training set, hence a verification set is not required, which avoids multiple training for each model needed to implement the CV method, reducing computational time in the long run.

4. Methods

4.1. Homomorphic Matrix Operations

We begin with matrix multiplication and the inverse of a square matrix before implementing the model selection algorithm.

4.1.1. Homomorphic Matrix Multiplication

As more complex algorithms require many matrix operations, performing matrix operations in HE has become equally important. Let A = $[a_{i k}]$ and B = $[b_{k j}]$ be two square matrices of size $n \times n$ . Using the component-wise definition of matrix multiplication, the product C = $[c_{i j}]$ can be expressed as

c_{i j} = \sum_{k = 1}^{n} a_{i k} \otimes b_{k j}, \otimes : HomMulti, \forall i, j = 1, 2, \dots, n .

(6)
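The component-wise definition in Equation (6) depends only on an entry-wise multiplication and an entry-wise addition, so it can be written with those two operations passed in as parameters. The sketch below is a plaintext illustration with hypothetical names; in the encrypted setting the `mul` and `add` arguments would be replaced by the homomorphic multiplication and addition routines.

```python
def mat_mul(a, b, mul=lambda x, y: x * y, add=lambda x, y: x + y):
    """Multiply two n x n matrices using caller-supplied entry operations."""
    n = len(a)
    c = []
    for i in range(n):
        row = []
        for j in range(n):
            acc = mul(a[i][0], b[0][j])
            for k in range(1, n):
                acc = add(acc, mul(a[i][k], b[k][j]))
            row.append(acc)
        c.append(row)
    return c


# Count how many entry-level operations one 2 x 2 product needs.
calls = {"mul": 0, "add": 0}


def counting_mul(x, y):
    calls["mul"] += 1
    return x * y


def counting_add(x, y):
    calls["add"] += 1
    return x + y


product = mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]], counting_mul, counting_add)
```

For n × n matrices this schoolbook scheme performs n³ entry multiplications and n²(n − 1) entry additions, which shows why the cost of the underlying homomorphic multiplication dominates.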

The matrix multiplication process requires multiplication and addition of its entries. First, homomorphic multiplication method is introduced in Algorithm 3.

When entries of the encrypted matrices ct.A and ct.B are given as input, multiplication runs mainly on addition. Specifically, assuming the multiplication is between r-bit values, the multiplication operation is converted to r additions using only AND gates and Right-Shift operations. Therefore, when we multiply r-bit entries, the result is obtained by j right shifts and $i \in \{0, 1, \dots, r - 2\}$ summations. To obtain results regardless of the sign of the given data, we calculate the positive product using the absolute value operation and then perform the 2’s complement operation on the result according to its sign. Finally, the partial results are combined, and the calculated value is returned as the result. The outcome is double the length of the input; therefore, we limit the length of the result to r to obtain an adjusted result in which the lengths of the fractional and integer parts are equal to our setting.
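The shift-and-add idea with separate sign handling can be sketched on plaintext integers as follows. This is an analogue under stated assumptions, not a transcription of Algorithm 3: operands are Python integers holding fixed-point values with r/2 − 1 fractional bits, multiplying by a 0/1 bit stands in for the AND gate, the shift direction follows ordinary integer arithmetic rather than the paper's bit layout, and the function name is hypothetical.

```python
def fixed_mul(a: int, b: int, r: int = 16) -> int:
    """Multiply two signed fixed-point integers with r/2 - 1 fractional bits."""
    frac_bits = r // 2 - 1
    abs_a, abs_b = abs(a), abs(b)          # work on magnitudes first
    acc = 0
    for i in range(r - 1):                 # one partial product per magnitude bit
        bit = (abs_b >> i) & 1
        acc += (abs_a << i) * bit          # mask the shifted operand, then add
    acc >>= frac_bits                      # double-length product back to r-bit scale
    acc &= (1 << (r - 1)) - 1              # keep only r - 1 magnitude bits
    negative = (a < 0) ^ (b < 0)           # XOR of the two signs
    return -acc if negative else acc       # negate according to the sign


SCALE = 1 << 7                             # 7 fractional bits when r = 16
p = fixed_mul(int(1.5 * SCALE), int(-2.25 * SCALE))
as_float = p / SCALE
```

The final masking step corresponds to limiting the double-length product to r bits, so products that exceed the representable range wrap around rather than saturate in this sketch.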

Algorithm 4 shows the addition operation on encrypted input values, which is the last part of the dot product in the matrix multiplication operation. As with the method in plaintext, it is based on a full adder scheme. Addition can be designed simply from basic gate operations with bootstrapping, so only 2 XOR, 2 AND, and 1 OR gates are used per bit in this process.
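A plaintext full-adder chain with the same gate budget can be sketched as below. Python's bit operators stand in for the bootstrapped gates, bit lists are least significant bit first, and the function name and the time estimate are assumptions: the estimate simply multiplies the gate count by the 10.8 ms per binary gate reported in Section 2.2.4 and ignores everything else, so it is a rough figure rather than a measurement.

```python
GATE_MS = 10.8   # per bootstrapped binary gate, as reported in Section 2.2.4


def full_adder_add(a, b):
    """Add two LSB-first bit lists; return (sum_bits, gate_count)."""
    carry, out, gates = 0, [], 0
    for x, y in zip(a, b):
        t = x ^ y                 # XOR 1
        g = x & y                 # AND 1
        out.append(t ^ carry)     # XOR 2: sum bit
        p = t & carry             # AND 2
        carry = p | g             # OR: carry into the next position
        gates += 5
    return out, gates


five = [1, 0, 1, 0]               # 5, LSB first
three = [1, 1, 0, 0]              # 3, LSB first
sum_bits, gate_count = full_adder_add(five, three)

# Rough cost of one 16-bit homomorphic addition under the assumption above.
_, gates_16 = full_adder_add([0] * 16, [0] * 16)
estimated_ms = gates_16 * GATE_MS
```

The gate count grows linearly with the bit length r, so doubling the precision roughly doubles the cost of every addition.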

Thus, Algorithm 5 outputs the product of the matrices $ct.A$ and $ct.B$ through homomorphic multiplication and homomorphic addition.

Algorithm 3 HomMulti(ct.a, ct.b, r): Homomorphic multiplication for matrix operation

1:: $ct . \dot{a} \leftarrow HomAbs (ct . a)$ , $ct . \dot{b} \leftarrow HomAbs (ct . b)$
2:: for $i = 1$ to r do
3:: $ct . d [i] \leftarrow bootsAND (ct . \dot{a} [i], ct . \dot{b} [i])$
4:: end for
5:: for $i = 1$ to $r - 2$ do
6:: if $i < \frac{r}{2} - 1$ then
7:: $ct . e \leftarrow HomShift (ct . \dot{a}, \frac{r}{2} - i - 1)$
8:: for $j = 1$ to r do
9:: $ct . f [j] \leftarrow bootsAND (ct . e [j], ct . \dot{b} [i + 1])$
10:: end for
11:: $ct . d \leftarrow HomAdd (ct . d, ct . f)$
12:: else if $i = \frac{r}{2} - 1$ then
13:: for $j = 1$ to r do
14:: $ct . f [j] \leftarrow bootsAND (ct . \dot{a} [j], ct . \dot{b} [i + 1])$
15:: end for
16:: $ct . d \leftarrow HomAdd (ct . d, ct . f)$
17:: else
18:: $ct . e \leftarrow HomShift (ct . \dot{a}, i - (\frac{r}{2} + 1))$
19:: for $j = 1$ to r do
20:: $ct . f [j] \leftarrow bootsAND (ct . e [j], ct . \dot{b} [i + 1])$
21:: end for
22:: $ct . d \leftarrow HomAdd (ct . d, ct . f)$
23:: end if
24:: end for
25:: $ct . f \leftarrow HomTwosComp (ct . d)$
26:: $ct . g \leftarrow bootsXOR (ct . a, ct . b)$
27:: $ct . h \leftarrow bootsNOT (ct . g)$
28:: for $i = 1$ to r do
29:: $ct . o \leftarrow bootsAND (ct . d, ct . h)$
30:: $ct . p \leftarrow bootsAND (ct . f, ct . g)$
31:: end for
32:: $ct . q \leftarrow HomAdd (ct . o, ct . p)$
33:: Return $ct . q$

Algorithm 4 HomAdd(ct.a, ct.b, r): Homomorphic addition for matrix operation

1:: $ct . c [0] \leftarrow bootsCONSTANT [0]$
2:: for $i = 0$ to $r - 2$ do
3:: $ct . m [0] \leftarrow bootsXOR (ct . a [i], ct . b [i])$
4:: $ct . m [1] \leftarrow bootsAND (ct . a [i], ct . b [i])$
5:: $ct . m [2] \leftarrow bootsAND (ct . m [0], ct . c [i])$
6:: $ct . c [i] \leftarrow bootsXOR (ct . m [0], ct . c [i])$
7:: $ct . c [i + 1] \leftarrow bootsOR (ct . m [2], ct . m [1])$
8:: end for
9:: $ct . m [0] \leftarrow bootsXOR (ct . a [r - 1], ct . b [r - 1])$
10:: $ct . c [r - 1] \leftarrow bootsXOR (ct . m [0], ct . c [r - 1])$
11:: Return $ct . c$

Algorithm 5 HomMatMulti(ct.A, ct.B)

1:: $ct . C \leftarrow Enc [0]$
2:: for $i = 1$ to n do
3:: for $j = 1$ to n do
4:: for $k = 1$ to n do
5:: $ct . c_{ij}$ ← $HomMulti (ct . A_{ik}, ct . B_{kj})$
6:: $ct . C_{ij}$ ← $HomAdd (ct . C_{ij}, ct . c_{ij})$
7:: end for
8:: end for
9:: end for
10:: Return $ct . C$
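The loop structure of Algorithm 5 can be sketched in plaintext as follows. The scalar multiplication and addition are passed in as parameters so that, in principle, they could be replaced by homomorphic counterparts; the function name `mat_mul` and its parameters are assumptions for this example, and the defaults simply use ordinary integer arithmetic.

```python
import operator


def mat_mul(A, B, mul=operator.mul, add=operator.add, zero=0):
    """Component-wise product of two n x n matrices with pluggable scalar operations."""
    n = len(A)
    C = [[zero for _ in range(n)] for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                # Accumulate a_ik * b_kj into entry (i, j) of the result
                C[i][j] = add(C[i][j], mul(A[i][k], B[k][j]))
    return C


if __name__ == "__main__":
    print(mat_mul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The triple loop performs $n^3$ scalar multiplications and $n^3$ scalar additions, which is where the $n^3$ factor in the time complexity comes from.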

The performance of this approach is evaluated as follows. The order n of the square matrix and the bit length r of the data are the factors that determine the cost of matrix multiplication. The time complexity $T_{matmulti}$ of homomorphic matrix multiplication can be expressed as follows:

$$T_{matmulti} = (8 r^{2} + r + 1) n^{3} T_{M} + (r^{2} + 2 r) n^{3} T_{X} .$$

(7)

4.1.2. Accurate Homomorphic Matrix Inversion

If a matrix $W$ is invertible, then every equation $W x = P$ has a unique solution, and the Gauss-Jordan algorithm computes $W^{-1}$ given $W$. There are many ways to find the inverse matrix, including Cramer's rule, Gaussian elimination, and Gauss-Jordan elimination. Cramer's rule is computationally intensive compared to the other methods, and roundoff error may become significant on large problems with non-integer coefficients. Gauss-Jordan elimination is a modification of Gaussian elimination that reduces the computations involved in back substitution by performing additional row operations to transform the matrix from row echelon form to reduced row echelon form ($rref$). In general, the Gauss-Jordan method may require a larger number of arithmetic operations as the size of the matrix increases; however, the $rref$ of a matrix has the advantage of being unique. The elimination process consists of three elementary row operations: swapping two rows, scaling a row, and subtracting a multiple of one row from another. Before constructing our algorithm, we first review how it works. If $I$ is the identity matrix, the equation can be expressed as follows.

$$W x = I P .$$

(8)

The equation still holds if we multiply both sides by a non-singular matrix $V_{0}$:

$$V_{0} W x = V_{0} I P .$$

(9)

The trick of Gauss-Jordan elimination consists of finding a sequence of matrices $V_{0}, V_{1}, \dots, V_{n - 1}$ such that

$$V_{n - 1} \cdots V_{1} V_{0} W x = V_{n - 1} \cdots V_{1} V_{0} I P = x .$$

(10)

Because this expression must hold for every $P$ and $x$ is the solution of $W x = P$, it follows by definition that $V_{n - 1} \cdots V_{1} V_{0} I \equiv W^{-1}$. The Gauss-Jordan algorithm works exactly this way: given $W$, it computes $W^{-1}$.

Algorithm 6 demonstrates the process of Gauss-Jordan elimination in an encrypted state. First, we choose the element with the largest absolute value in an unvisited column and row using the $HomAbs$ operation. Next, we use $HomMux$ to swap the $k$th and $i^{★}$th rows. $HomMux(ct.S, ct.a, ct.b)$ is a Mux gate that homomorphically outputs the message of either $ct.a$ or $ct.b$, depending on the Boolean value of $ct.S$, without decrypting any of the ciphertexts. If $ct.S = Enc[1]$, it outputs $ct.a$; otherwise, it outputs $ct.b$.
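To make the role of the Mux gate concrete, the following plaintext sketch shows how a selector bit can choose between two words, and how two such selections swap a pair of rows without ever branching on the selector. The bits are ordinary integers standing in for ciphertexts, and the names `mux_bit`, `mux_word`, and `oblivious_swap` are hypothetical.

```python
def mux_bit(s, a, b):
    """Return a if s == 1 and b if s == 0, using only AND, OR and NOT on bits."""
    not_s = 1 - s                      # NOT gate on a single bit
    return (s & a) | (not_s & b)


def mux_word(s, a_bits, b_bits):
    """Apply the one-bit mux to every position of two equal-length words."""
    return [mux_bit(s, a, b) for a, b in zip(a_bits, b_bits)]


def oblivious_swap(s, row_a, row_b):
    """Swap two rows of words when s == 1, leave them in place when s == 0."""
    new_a = [mux_word(s, y, x) for x, y in zip(row_a, row_b)]
    new_b = [mux_word(s, x, y) for x, y in zip(row_a, row_b)]
    return new_a, new_b


if __name__ == "__main__":
    row_k = [[1, 0], [0, 0]]
    row_i = [[0, 1], [1, 1]]
    print(oblivious_swap(1, row_k, row_i))
```

The same sequence of gates is evaluated whatever the value of the selector, which is what allows the pivot swap to be carried out when the selector is itself encrypted.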

Algorithm 6 $ct.W^{-1} = HomGJ(ct.W)$: Homomorphic Matrix Inversion using Gauss-Jordan Elimination

Require:: Encrypted Matrix $ct . W_{n \times n}$
Ensure:: Encrypted Inverse Matrix $ct . W_{n \times n}^{- 1}$
1:: for each row k do
2:: $ct . i^{★}$ ← $a r g m a x_{k \leq i \leq n} HomAbs (ct . W_{ik})$
3:: if $ct . W_{i^{★} k} = 0$ then
4:: Matrix is not invertible
5:: end if
6:: Swap rows k and $i^{★}$ ← $HomMux (ct . k, ct . i^{★}, ct . temp)$
7:: for each row j below k (i.e., $j = k + 1, \dots, n)$ do
8:: $ct . τ \leftarrow HomDiv (ct . W_{jk}, ct . W_{kk})$
9:: $ct . W_{K} \leftarrow HomMulti (ct . τ, ct . W_{k})$
10:: $ct . W_{j} \leftarrow HomSubt (ct . W_{j}, ct . W_{K})$
11:: end for
12:: end for
13:: for each row $k = n, \dots, 1$ (i.e., in reverse) do
14:: $ct . W_{k} \leftarrow HomDiv (ct . W_{k}, ct . W_{kk})$
15:: for each row j above k (i.e., $j = k - 1, \dots, 1)$ do
16:: $ct . τ \leftarrow HomDiv (ct . W_{jk}, ct . W_{kk})$
17:: $ct . W_{K} \leftarrow HomMulti (ct . τ, ct . W_{k})$
18:: $ct . W_{j} \leftarrow HomSubt (ct . W_{j}, ct . W_{K})$
19:: end for
20:: end for
21:: Return $ct . W^{- 1}$

Gaussian elimination then proceeds in two phases: the first eliminates all components below the diagonal, and the second eliminates all components above the diagonal. Homomorphic operations are responsible for both phases.

phase1: Convert $ct.W$ into row echelon form ($ref$) by eliminating all $ct.W_{jk}$ below the diagonal. After this phase, all elements below the diagonal are eliminated, and $ct.W$ becomes an upper triangular matrix.
phase2: Convert the $ref$ obtained into a diagonal matrix by eliminating all $ct.W_{jk}$ in row $ct.W_{j}$ above the diagonal. After this phase, all elements above the diagonal are eliminated, and $ct.W$ becomes a diagonal matrix, which can be turned into $ct.I$, the $rref$, by scaling each row.
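The two phases can be sketched in plaintext as follows. This is an illustrative implementation, not the encrypted algorithm: it uses exact rational arithmetic from Python's `fractions` module, branches freely on the data, and tracks the inverse explicitly by augmenting $W$ with the identity matrix. The function name `gauss_jordan_inverse` is an assumption for this example.

```python
from fractions import Fraction


def gauss_jordan_inverse(W):
    """Invert a square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(W)
    # Augment W with the identity: the right half becomes the inverse
    aug = [[Fraction(x) for x in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(W)]

    # Phase 1: eliminate below the diagonal (row echelon form)
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(aug[i][k]))
        if aug[pivot][k] == 0:
            raise ValueError("matrix is not invertible")
        aug[k], aug[pivot] = aug[pivot], aug[k]          # swap rows k and pivot
        for j in range(k + 1, n):
            tau = aug[j][k] / aug[k][k]
            aug[j] = [x - tau * y for x, y in zip(aug[j], aug[k])]

    # Phase 2: scale each pivot row and eliminate above the diagonal
    for k in range(n - 1, -1, -1):
        d = aug[k][k]
        aug[k] = [x / d for x in aug[k]]
        for j in range(k - 1, -1, -1):
            tau = aug[j][k]
            aug[j] = [x - tau * y for x, y in zip(aug[j], aug[k])]

    return [row[n:] for row in aug]


if __name__ == "__main__":
    print(gauss_jordan_inverse([[2, 1], [5, 3]]))
```

In the encrypted setting, the pivot search, the row swap, and the zero test cannot branch on plaintext values, which is why the algorithm above relies on operations such as HomAbs and HomMux instead.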

The performance of this approach is evaluated as follows. The order n of the square matrix and the bit length r of the data are the factors that determine the cost of matrix inversion. The time complexity $T_{inverse}$ of homomorphic matrix inversion can be expressed as follows:

$$\begin{aligned} T_{inverse} = {} & \left[ \left(2 r^{2} + \tfrac{7}{3} r + \tfrac{10}{3}\right) n^{3} + (3 r^{2} + 4 r + 7) n^{2} + \left(3 r^{2} - \tfrac{19}{3} r - \tfrac{10}{3}\right) n \right] T_{B} \\ & + \left[ \tfrac{1}{3} r n (2 n^{2} + 3 n + 7) \right] T_{X} . \end{aligned}$$

(11)

4.2. Design of Homomorphic Model Selection

4.2.1. Homomorphic Cross Validation Approach

The method we use to construct the model selection algorithm from Equation (3) is polynomial regression. Given a set of N points $(x_{t}, y_{t})_{t = 0}^{N - 1}$, the goal is to fit the data with a polynomial of degree M, and the least-squares method determines the coefficients $θ_{i}$ that minimize the squared error function

$$R (θ_{0}, \dots, θ_{M}) = \sum_{t = 0}^{N - 1} \left( \sum_{i = 0}^{M} θ_{i} x_{t}^{i} - y_{t} \right)^{2} .$$

(12)

R attains its global minimum where its gradient is zero, that is, where the partial derivatives of R are zero for $0 \le j \le M$:

$$\frac{\partial R}{\partial θ_{j}} = 2 \sum_{t = 0}^{N - 1} \left( \sum_{i = 0}^{M} θ_{i} x_{t}^{i} - y_{t} \right) x_{t}^{j} = 0 .$$

(13)

This simplifies to

$$\sum_{i = 0}^{M} \left( \sum_{t = 0}^{N - 1} x_{t}^{i + j} \right) θ_{i} = \sum_{t = 0}^{N - 1} x_{t}^{j} y_{t} .$$

(14)

This is a linear system of $(M + 1)$ equations in $(M + 1)$ unknown coefficients and has the form

$$X^{T} X θ = X^{T} y ,$$

(15)

where $X = [a_{ti}]$ is an $N \times (M + 1)$ matrix whose general entry is $a_{ti} = x_{t}^{i}$, $θ = [θ_{j}]$ is an $(M + 1) \times 1$ column vector, and $y = [y_{t}]$ is an $N \times 1$ column vector. The entries of $X^{T} X$ and $X^{T} y$ are

$$X^{T} X = \begin{bmatrix} \sum_{t = 0}^{N - 1} 1 & \sum_{t = 0}^{N - 1} x_{t} & \dots & \sum_{t = 0}^{N - 1} x_{t}^{M} \\ \sum_{t = 0}^{N - 1} x_{t} & \sum_{t = 0}^{N - 1} x_{t}^{2} & \dots & \sum_{t = 0}^{N - 1} x_{t}^{M + 1} \\ \vdots & \vdots & \ddots & \vdots \\ \sum_{t = 0}^{N - 1} x_{t}^{M} & \sum_{t = 0}^{N - 1} x_{t}^{M + 1} & \dots & \sum_{t = 0}^{N - 1} x_{t}^{2 M} \end{bmatrix}, \quad X^{T} y = \begin{bmatrix} \sum_{t = 0}^{N - 1} y_{t} \\ \sum_{t = 0}^{N - 1} x_{t} y_{t} \\ \vdots \\ \sum_{t = 0}^{N - 1} x_{t}^{M} y_{t} \end{bmatrix} .$$

(16)

The assumption that the $x_{t}$ are strictly increasing, and hence distinct, guarantees that $X^{T} X$ is invertible (given $N \ge M + 1$), so the least-squares estimator of $θ$ is given by

$$θ = (X^{T} X)^{-1} X^{T} y$$

(17)

and Algorithm 7 describes the details of this computation. The K-fold CV method is used to evaluate the model selection performance. The values of the coefficients $θ$ are determined by fitting the polynomial to the data, which is done by minimizing an error function called the mean-square-error (MSE). It averages the squared errors between the prediction $y (x_{t}, θ)$ for each data point $x_{t}$ and the corresponding target value $z_{t}$,

$$R (θ) = \frac{1}{N} \sum_{t = 1}^{N} (y (x_{t}, θ) - z_{t})^{2} .$$

(18)

For K subsets, we fit the model on the training folds, record the error $R (θ)_{k}$ of each fold on its validation set, and then compute the average error over all folds. Finally, we choose the value of the parameter that minimizes the average error.

$$\hat{R} (θ) = \frac{1}{K} \sum_{k = 1}^{K} R (θ)_{k}$$

(19)

Based on TFHE, we implement K-fold homomorphic CV for model selection. We first initialize the elements of the training and test error matrices to zero and then update them using polynomial regression as the model in the model selection algorithm. During this process, the values of the elements are computed through homomorphic operations, and the inverse of $ct.X^{T} ct.X$ is multiplied by $ct.X^{T}$ and then by the vector $ct.y$. After K-fold CV, the final return value is the average MSE at each polynomial degree.
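The K-fold procedure can be sketched in plaintext as follows. This is a minimal illustration under stated assumptions: the folds are formed by taking every K-th point (the text does not specify how the partitions are drawn), only two candidate models are compared (a constant and a straight line, both with closed-form fits), and all function names are hypothetical.

```python
def k_fold_mse(xs, ys, k, fit, predict):
    """Average validation MSE over k folds for one candidate model."""
    n = len(xs)
    fold_errors = []
    for f in range(k):
        val_idx = set(range(f, n, k))                    # every k-th point forms a fold
        train_x = [x for i, x in enumerate(xs) if i not in val_idx]
        train_y = [y for i, y in enumerate(ys) if i not in val_idx]
        model = fit(train_x, train_y)                    # fit on the other k - 1 folds
        errors = [(predict(model, xs[i]) - ys[i]) ** 2 for i in val_idx]
        fold_errors.append(sum(errors) / len(errors))    # MSE on the held-out fold
    return sum(fold_errors) / k


def fit_constant(xs, ys):
    return sum(ys) / len(ys)


def predict_constant(model, x):
    return model


def fit_line(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return (my - slope * mx, slope)


def predict_line(model, x):
    return model[0] + model[1] * x


CANDIDATES = {
    "constant": (fit_constant, predict_constant),
    "line": (fit_line, predict_line),
}


def select_by_cv(xs, ys, k=5):
    """Return the candidate with the smallest average validation MSE, plus all scores."""
    scores = {name: k_fold_mse(xs, ys, k, fit, predict)
              for name, (fit, predict) in CANDIDATES.items()}
    return min(scores, key=scores.get), scores


if __name__ == "__main__":
    xs = list(range(10))
    ys = [2 * x + 1 for x in xs]
    print(select_by_cv(xs, ys))
```

Each candidate is refitted K times, so the cost of CV scales with both the number of folds and the number of candidate models.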

Algorithm 7 Homomorphic Regression Coefficients for each model order

Require:: Encrypted input features $ct . x, ct . y$
1:: $ct . X \leftarrow Enc [0]$
2:: for $i = 1$ to $l e n g t h (x)$ do
3:: for $j = 1$ to $o r d e r + 1$ do
4:: $ct . X_{ij} \leftarrow HomPow (ct . x (i), (j - 1))$
5:: end for
6:: end for
7:: $ct . A \leftarrow HomGJ (HomMatMulti (ct . X^{T}, ct . X))$
8:: $ct . B \leftarrow HomMatMulti (ct . A, ct . X^{T})$
9:: $ct . θ \leftarrow HomMatMulti (ct . B, ct . y)$
10:: Return $ct . θ$
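For comparison, the same coefficients can be computed in plaintext. The sketch below builds $X^{T} X$ and $X^{T} y$ directly from the power sums and solves the resulting linear system by elimination rather than by forming the inverse explicitly, which is a different route to the same estimator. It uses exact rational arithmetic, and the names `normal_equations`, `solve`, and `fit_polynomial` are assumptions for this example.

```python
from fractions import Fraction


def normal_equations(xs, ys, M):
    """Build X^T X and X^T y for a degree-M polynomial from power sums."""
    xs = [Fraction(x) for x in xs]
    ys = [Fraction(y) for y in ys]
    A = [[sum(x ** (i + j) for x in xs) for i in range(M + 1)]
         for j in range(M + 1)]
    b = [sum(x ** j * y for x, y in zip(xs, ys)) for j in range(M + 1)]
    return A, b


def solve(A, b):
    """Solve A theta = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(aug[i][k]))
        if aug[pivot][k] == 0:
            raise ValueError("normal equations are singular")
        aug[k], aug[pivot] = aug[pivot], aug[k]
        d = aug[k][k]
        aug[k] = [v / d for v in aug[k]]                 # scale the pivot row
        for j in range(n):
            if j != k:
                t = aug[j][k]
                aug[j] = [u - t * v for u, v in zip(aug[j], aug[k])]
    return [row[n] for row in aug]


def fit_polynomial(xs, ys, M):
    """Least-squares coefficients theta_0..theta_M of a degree-M polynomial."""
    A, b = normal_equations(xs, ys, M)
    return solve(A, b)


if __name__ == "__main__":
    xs = [0, 1, 2, 3]
    ys = [1 + 2 * x + 3 * x * x for x in xs]
    print(fit_polynomial(xs, ys, 2))
```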

4.2.2. Homomorphic Bayesian Model Selection Approach

Another way to overcome the over-fitting associated with maximum likelihood is to marginalize (sum or integrate) over the model parameters instead of making point estimates of their values. Models can then be compared directly on the training data without the need for a validation set, avoiding the multiple training runs for each model associated with CV. To do so, we focus on the fully Bayesian predictive distribution for our regression models.

The evidence function directly links the hyperparameters $α$, $β$ and the observations $D$ by marginalizing the parameters $θ$. In the fully Bayesian framework, we adopt prior distributions over the hyperparameters and make predictions by marginalizing with respect to $α$, $β$, and the parameters $θ$,

$$p (z \mid \mathbf{z}) = \iiint p (z \mid θ, β) \, p (θ \mid \mathbf{z}, α, β) \, p (α, β \mid \mathbf{z}) \, d θ \, d α \, d β,$$

(20)

where $z$ is the target value to be predicted and $\mathbf{z}$ is the vector of observed target values, $p (z \mid θ, β)$ is defined by $N (z \mid y (x, θ), β^{-1})$, and $p (θ \mid \mathbf{z}, α, β)$ is defined by $N (θ \mid μ_{N}, V_{N})$ with $μ_{N} = β V_{N} H^{T} \mathbf{z}$ and $V_{N}^{-1} = α I + β H^{T} H$. According to Bayes' theorem in the previous section, the posterior distribution for the hyperparameters is given by

$$p (α, β \mid \mathbf{z}) \propto p (\mathbf{z} \mid α, β) \, p (α, β) .$$

(21)

The evidence function $p (\mathbf{z} \mid α, β)$ is obtained by integrating over the parameters, so

$$p (\mathbf{z} \mid α, β) = \int p (\mathbf{z} \mid θ, β) \, p (θ \mid α) \, d θ = \left( \frac{β}{2 π} \right)^{N / 2} \left( \frac{α}{2 π} \right)^{M / 2} \int \exp \{ - R (θ) \} \, d θ,$$

(22)

where $p (\mathbf{z} \mid θ, β)$ is the likelihood function, written using the standard form for the univariate Gaussian, and $p (θ \mid α)$ is defined by $N (θ \mid 0, α^{-1} I)$. M is the number of parameters in the model and N is the number of data points. $R (θ)$ is defined as

$$R (θ) = β R_{D} (θ) + α R_{W} (θ) = R (μ_{N}) + \frac{1}{2} (θ - μ_{N})^{T} L (θ - μ_{N})$$

(23)

with

$$R (μ_{N}) = \frac{β}{2} \| \mathbf{z} - H μ_{N} \|^{2} + \frac{α}{2} μ_{N}^{T} μ_{N},$$

(24)

where $H$ is the design matrix. $R (θ)$ is equal, up to a constant of proportionality, to the regularized SSE function, and the right-hand side of Equation (23) is obtained by completing the square over $θ$. The Hessian matrix $L$ is the same as $V_{N}^{-1}$; it is the square matrix of second-order partial derivatives of the scalar-valued function $R (θ)$, $L = \nabla \nabla R (θ)$. $μ_{N}$ is also given by $μ_{N} = β L^{-1} H^{T} \mathbf{z}$. Based on the equation above, the integral over the parameters can be evaluated by appealing to the standard result for the normalization coefficient of a multivariate Gaussian. Taking the logarithm of Equation (22) then gives the log marginal likelihood, which is the required expression for the evidence function.
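For reference, the two steps just described can be written out explicitly. The block below restates the standard evidence approximation for linear models (see Bishop [23]) in the notation of this section, with $\mathbf{z}$ denoting the vector of observed targets and $L = α I + β H^{T} H$; it is included as a worked form of the derivation rather than as a new result.

```latex
% Gaussian integral after completing the square in Equation (23)
\int \exp\{-R(\theta)\}\, d\theta
  = \exp\{-R(\mu_N)\}\,(2\pi)^{M/2}\,|L|^{-1/2}

% Substituting into Equation (22) and taking the logarithm
\ln p(\mathbf{z} \mid \alpha, \beta)
  = \frac{M}{2}\ln\alpha + \frac{N}{2}\ln\beta
    - R(\mu_N) - \frac{1}{2}\ln|L| - \frac{N}{2}\ln(2\pi)
```

The terms $\frac{M}{2} \ln α$ and $- \frac{1}{2} \ln |L|$ act as a complexity penalty that grows with the number of parameters, while $- R (μ_{N})$ rewards goodness of fit.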

The optimal model complexity determined by the maximum evidence is given through the equation above. Based on TFHE, the evidence approximation for the model selection is presented in Algorithm 8.

Algorithm 8 Homomorphic Bayesian Model Selection for Linear Regression

Require:: Encrypted input features ct.x, ct.z and hyperparameters $α$ and $β$
Ensure:: $ct . m^{*} = \arg\max_{M} [ct . Evi (M)]$
1:: $ct . H$ ← $Enc [0]$
2:: for $M = 1$ to $M_{max}$ do
3:: for $i = 1$ to N do
4:: for $j = 1$ to M do
5:: $ct . H_{ij} \leftarrow HomPow (ct . x (i), (j - 1))$
6:: end for
7:: end for
8:: Compute Hessian Matrix $ct . L_{N}$
9:: Compute mean of the posterior distribution $ct . μ_{N}$ using $HomGJ$
10:: Compute error function $ct . R (μ_{N})$
11:: Compute Evidence function $ct . Evi (M) \leftarrow HomLog [p (z | α, β)]$
12:: end for
13:: Return $ct . m^{*} \leftarrow \arg\max_{M} [ct . Evi (M)]$
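A plaintext sketch of the steps in Algorithm 8 is given below. It assumes the standard closed form of the log evidence for linear models (Bishop [23]), uses floating-point arithmetic, and obtains $μ_{N}$ and $\ln |L|$ from a single Gaussian elimination instead of an explicit inverse. As in the loops of Algorithm 8, M here counts the columns of the design matrix, so the polynomial order is M - 1. All function names are hypothetical.

```python
import math


def solve_with_logdet(A, b):
    """Solve A x = b by Gaussian elimination and return (x, ln|det A|)."""
    n = len(A)
    aug = [row[:] + [rhs] for row, rhs in zip(A, b)]
    log_det = 0.0
    for k in range(n):
        pivot = max(range(k, n), key=lambda i: abs(aug[i][k]))
        aug[k], aug[pivot] = aug[pivot], aug[k]
        log_det += math.log(abs(aug[k][k]))      # L is positive definite, so det > 0
        for j in range(k + 1, n):
            t = aug[j][k] / aug[k][k]
            aug[j] = [u - t * v for u, v in zip(aug[j], aug[k])]
    x = [0.0] * n
    for k in range(n - 1, -1, -1):               # back substitution
        s = sum(aug[k][j] * x[j] for j in range(k + 1, n))
        x[k] = (aug[k][n] - s) / aug[k][k]
    return x, log_det


def log_evidence(xs, zs, M, alpha, beta):
    """Log marginal likelihood of a polynomial model with M basis columns."""
    N = len(xs)
    H = [[x ** j for j in range(M)] for x in xs]                 # design matrix
    L = [[beta * sum(row[i] * row[j] for row in H) + (alpha if i == j else 0.0)
          for j in range(M)] for i in range(M)]                  # alpha I + beta H^T H
    rhs = [beta * sum(row[i] * z for row, z in zip(H, zs)) for i in range(M)]
    mu, log_det = solve_with_logdet(L, rhs)                      # posterior mean
    residual = sum((z - sum(h * m for h, m in zip(row, mu))) ** 2
                   for row, z in zip(H, zs))
    R = 0.5 * beta * residual + 0.5 * alpha * sum(m * m for m in mu)
    return (0.5 * M * math.log(alpha) + 0.5 * N * math.log(beta)
            - R - 0.5 * log_det - 0.5 * N * math.log(2 * math.pi))


def select_model(xs, zs, max_m, alpha, beta):
    """Return the M in 1..max_m with the largest log evidence."""
    return max(range(1, max_m + 1),
               key=lambda M: log_evidence(xs, zs, M, alpha, beta))


if __name__ == "__main__":
    xs = [-1.0, -0.5, 0.0, 0.5, 1.0]
    zs = [1.1, 0.2, 0.0, 0.3, 0.9]
    print(select_model(xs, zs, 4, 5e-3, 1.5))
```

Unlike CV, each candidate model is fitted only once on the full training set, which is consistent with the lower running time reported for the evidence approach.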

5. Implementation

In this section, we present the results of model selection on encrypted data. Two datasets were used to implement the model selection algorithm. Each input value is encrypted with a length of 16 bits and used for calculation. In the first experiment, we confirmed that the HE Gauss-Jordan elimination algorithm works correctly with various HE functions. To expand the scope, we then tested the algorithm on a real-world dataset from Kaggle to see that the algorithm works well for many features.

5.1. Toy Dataset

We use an artificially created dataset comprising 20 observations of x, together with the corresponding observations of y, to evaluate the performance of our algorithm. Figure 2 shows a plot of a training set comprising $N = 10$ data points and a test set comprising $N = 10$ data points.

We then evaluated the two approaches in terms of generalization, that is, how accurately they predict new data. In this process, we use root-mean-square-error (RMSE) as opposed to MSE because the square root ensures that RMSE is measured on the same scale as the variable y, so it is easy to interpret. Figure 3 shows the results of implementing the two approaches of our HE model selection.
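The relation between MSE and RMSE can be shown with a short plaintext example; the function names are chosen for this sketch only.

```python
import math


def mse(predictions, targets):
    """Mean of the squared differences between predictions and targets."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


def rmse(predictions, targets):
    """Square root of the MSE, expressed in the same units as the targets."""
    return math.sqrt(mse(predictions, targets))


if __name__ == "__main__":
    print(mse([3, 3], [0, 0]), rmse([3, 3], [0, 0]))
```

If every prediction is off by 3 units, the MSE is 9 (in squared units) while the RMSE is 3, which can be read directly against the scale of y.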

The toy dataset is too small for the CV procedure; therefore, we focused on the training and test error for each order M. Values of M in the range 3 ⩽ M ⩽ 7 give small values of the test set error in Figure 3a. For the 9th-order polynomial, the regression model fits all training data points exactly, so the training RMSE converges to 0. In this case, the 6th-order polynomial gives a smaller test error than the other orders, making M = 6 the best predictor of new data.

In the evidence approach, we fixed $α = 5 \times 10^{-3}$ and $β = 1.5$ and ran our algorithm; M = 3 gave the highest evidence of all polynomials. Further increases in the value of M make only small improvements in the fit to the data but increase the complexity penalty, which decreases the overall evidence value. Table 3 shows the evidence values for each order in plaintext and in ciphertext, obtained with the same experimental method. The small difference between the two sets of evidence values indicates that the experiment in the encrypted state introduces little error. The error arises when a real-number plaintext is converted into a ciphertext, and it can be reduced by increasing the number of input bits r.
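The effect of the bit length on the conversion error can be illustrated in plaintext. The sketch below assumes a fixed-point format in which half of the r bits are fractional and rounds to the nearest representable value; it ignores overflow of the integer part, and the function names are hypothetical.

```python
def quantize(x, r):
    """Round x to the nearest fixed-point value with r // 2 fractional bits."""
    scale = 1 << (r // 2)
    return round(x * scale) / scale


def max_error(values, r):
    """Largest absolute rounding error over a list of real numbers."""
    return max(abs(v - quantize(v, r)) for v in values)


if __name__ == "__main__":
    samples = [0.1, -35.75, 3.14159]
    for r in (8, 16, 32):
        print(r, max_error(samples, r))
```

With rounding to nearest, the error of a single value is at most $2^{-(r/2 + 1)}$, so each additional fractional bit halves the worst-case conversion error.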

5.2. Body Mass Index (BMI) Dataset

BMI, which can be measured using various sensors including ultrasonic ones, is a measure of a person's weight relative to his/her height. BMI is calculated by dividing the body weight in kilograms by the square of the height in meters, and the resulting value indicates whether the person is obese, overweight, regular, or underweight. BMI classification does not depend on age; therefore, we examined the tendency between the two through regression. For experimental evaluation, we use the "Medical Cost Personal Datasets" from Kaggle [26]. The dataset consists of age, sex, BMI, number of children, region, and so on; however, in this experiment, we used only two variables, age and BMI, to find the correlation between them. The total number of data points for each variable is 1339, and we sampled 1000 of each for model selection.

Before starting the homomorphic CV approach, we standardized our dataset because variables measured on different scales do not contribute equally to the analysis. Our dataset contains features that vary widely in magnitude, unit, and range. Standardization (or Z-score normalization) is the process of rescaling the features so that they have the properties of a Gaussian distribution with $μ = 0$ and $σ = 1$, where $μ$ is the mean and $σ$ is the standard deviation from the mean. Figure 4a shows the 9th-order polynomial regression for the standardized data, and Figure 4b shows the standardized data distribution of each dataset.
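Z-score normalization can be sketched in a few lines of plaintext code. The example assumes the population standard deviation (dividing by the number of samples), which the text does not specify, and the function name `standardize` is chosen for illustration.

```python
import math


def standardize(values):
    """Rescale values to zero mean and unit (population) standard deviation."""
    n = len(values)
    mu = sum(values) / n
    sigma = math.sqrt(sum((v - mu) ** 2 for v in values) / n)
    return [(v - mu) / sigma for v in values]


if __name__ == "__main__":
    print(standardize([2, 4, 4, 4, 5, 5, 7, 9]))
```

In this example the mean is 5 and the standard deviation is 2, so the value 9 is mapped to 2.0 and the value 2 to -1.5.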

For the training and test sets of randomly selected age and BMI values, we assumed that each set represents the entire dataset. We performed 5-fold CV to estimate the RMSE. A problem associated with implementing nth-order polynomial regression on a computer is that the normal equations tend to be ill-conditioned, especially for higher orders. Thus, as with the toy dataset, we varied the complexity by running our algorithm with model orders ranging from 0 (least complex) to 9 (most complex).

Figure 5a shows the training, test, validation, and 5-fold CV results. The validation dataset is a part of the training set that is set aside to evaluate the performance of the model when the CV method is not used. To use all the data in the 5-fold method, we put the training, test, and validation sets together, divide them into five equal partitions, and in turn use four of them as a training set and the remaining one as a test set. The performance measure reported by 5-fold CV is then the average of the values computed in the loop. We repeated the procedure 50 times to obtain the total RMSE values of each dataset. The boxplot in Figure 5b reveals the distribution of these values. In the CV approach, the best estimator corresponds to a polynomial model of order 4.

To maximize the model evidence over the model order M, we set the hyperparameters to $α = 1 \times 10^{-6}$ and $β = 18.3$. As with the CV approach, we computed the evidence approximation up to the 9th order; the best estimator corresponded to the polynomial model of order 2, as shown in Figure 6. We measured the running time of each HE model selection approach, and the execution times of the CV and evidence approaches were 1764 and 480 min, respectively.

6. Limitation

Although our proposed approach improves the use of computational resources, it still has a few limitations. In an encrypted domain, each operation is very slow, as the required memory space is large. However, this problem can be addressed by accelerating the calculations with state-of-the-art technologies. Using an FPGA (Field Programmable Gate Array), ASIC (Application Specific Integrated Circuit), or GPU (Graphics Processing Unit) [27,28,29], the computations can be offloaded to an HE co-processor, which improves the performance of certain HE tasks by performing them directly in hardware. CPU vector extensions using SIMD (Single Instruction Multiple Data) [30] also improve efficiency. Therefore, these approaches can be combined into a hybrid system to create a fully optimized solution.

7. Conclusions

To date, privacy-preserving machine learning is one of the key research areas in computer security and machine learning. Homomorphic encryption is one of the most theoretically sound approaches for privacy-preserving machine learning and data mining. However, most algorithms built on homomorphic encryption do not consider the model's complexity and stability. It is not straightforward to evaluate the stability of a model under homomorphic encryption since all data are encrypted. Therefore, homomorphic model selection methods are an essential tool for data analysis in the encrypted domain, especially for big data obtained from biomedical sensors. If the model is not properly chosen, the results cannot be trusted, which can be disastrous in the biomedical domain. Homomorphic model selection algorithms that provide both privacy and stability are therefore needed.

Therefore, we introduced two methods for constructing homomorphic model selection: homomorphic Cross-Validation (HCV) and homomorphic Bayesian model selection (HBMS). Both model selection approaches require homomorphic matrix inversion, so we designed and developed a homomorphic matrix inversion using Gauss-Jordan elimination in the homomorphic encryption scheme, specifically using a homomorphic mux operation. Finally, based on the results of the two approaches, we confirmed that the HBMS approach is more efficient in encrypted domains.

Author Contributions

Conceptualization, M.Y.H. and J.W.Y.; methodology, M.Y.H.; software, M.Y.H. and J.S.Y.; validation, M.Y.H.; formal analysis, M.Y.H. and J.S.Y.; investigation, M.Y.H.; data curation, M.Y.H.; writing–original draft preparation, M.Y.H.; writing–review and editing, J.S.Y. and J.W.Y.; visualization, M.Y.H.; supervision, J.W.Y.; project administration, J.W.Y.; funding acquisition, J.W.Y. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Acknowledgments

This study was supported by the Institute for Information and Communications Technology Promotion (IITP) grant funded by the Korea government (MSIT) (no. 2017-0-00545).

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Kim, A.; Song, Y.; Kim, M.; Lee, L.; Cheon, J.H. Logistic regression model training based on the approximate homomorphic encryption. BMC Med. Genom. 2018, 11, 83.
2. Giacomelli, I.; Jha, S.; Joye, M.; Page, C.D.; Yoon, K. Privacy-preserving ridge regression with only linearly-homomorphic encryption. In Proceedings of the International Conference on Applied Cryptography and Network Security, Leuven, Belgium, 2–4 July 2018; pp. 243–261.
3. Aono, Y.; Hayashi, T.; Trieu Phong, L.; Wang, L. Scalable and secure logistic regression via homomorphic encryption. In Proceedings of the Sixth ACM Conference on Data and Application Security and Privacy, New Orleans, LA, USA, 9–11 March 2016; pp. 142–144.
4. Song, B.K.; Yoo, J.S.; Hong, M.Y.; Yoon, J.W. A Bitwise Design and Implementation for Privacy-Preserving Data Mining. Secur. Commun. Netw. 2019, 2019, 3648671.
5. Yoo, J.S.; Hwang, J.H.; Song, B.K.; Yoon, J.W. A Bitwise Logistic Regression Using Binary Approximation and Real Number Division in Homomorphic Encryption Scheme. In Proceedings of the International Conference on Information Security Practice and Experience, Tokyo, Japan, 25–27 September 2018; pp. 20–40.
6. Rivest, R.L.; Shamir, A.; Adleman, L. A method for obtaining digital signatures and public-key cryptosystems. Commun. ACM 1978, 21, 120–126.
7. Halevi, S.; Shoup, V. Algorithms in HElib. In Proceedings of the Advances in Cryptology, Santa Barbara, CA, USA, 17–21 August 2014; Volume 8616, pp. 554–571.
8. Cheon, J.H.; Kim, A.; Kim, M.; Song, Y. Homomorphic encryption for arithmetic of approximate numbers. In Proceedings of the Advances in Cryptology – ASIACRYPT, Hong Kong, China, 3–7 December 2017; Volume 10624, pp. 409–437.
9. Ducas, L.; Micciancio, D. FHEW: Bootstrapping homomorphic encryption in less than a second. In Proceedings of the Advances in Cryptology – EUROCRYPT, Sofia, Bulgaria, 26–30 April 2015; Volume 9056, pp. 617–640.
10. Chillotti, I.; Gama, N.; Georgieva, M. Faster fully homomorphic encryption: Bootstrapping in less than 0.1 s. In Proceedings of the Advances in Cryptology – ASIACRYPT, Hanoi, Vietnam, 4–8 December 2016; Volume 10031, pp. 3–33.
11. Chen, H.; Laine, K.; Player, R. Simple encrypted arithmetic library-SEAL v2.1. Financ. Cryptogr. Data Secur. 2017, 10323, 3–18.
12. Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, Bethesda, MD, USA, 31 May–2 June 2009; pp. 169–178.
13. Chillotti, I.; Gama, N.; Georgieva, M. TFHE: Fast fully homomorphic encryption over the torus. J. Cryptol. 2020, 33, 34–91.
14. Regev, O. On lattices, learning with errors, random linear codes, and cryptography. In Proceedings of the Thirty-Seventh Annual ACM Symposium on Theory of Computing, Baltimore, MD, USA, 22–24 May 2005; pp. 84–93.
15. Lyubashevsky, V.; Peikert, C.; Regev, O. On ideal lattices and learning with errors over rings. In Proceedings of the Advances in Cryptology, Monaco and Nice, France, 30 May–3 June 2010; pp. 1–23.
16. Yoo, J.S.; Song, B.K.; Yoon, J.I. Logarithm design on encrypted data with bitwise operation. In Proceedings of the International Workshop on Information Security Applications, Jeju Island, Korea, 23–25 August 2018; pp. 105–116.
17. Hong, M.Y.; Yoon, J.W. Model Selection for Data Analysis in Encrypted Domain: Application to Simple Linear Regression. In Proceedings of the International Workshop on Information Security Applications, Jeju Island, Korea, 21–24 August 2019; pp. 155–166.
18. Akaike, H. A new look at the statistical model identification. IEEE Trans. Autom. Control 1974, 19, 716–723.
19. Mallows, C.L. Some comments on C_p. Technometrics 1973, 15, 661–675.
20. Schwarz, G. Estimating the dimension of a model. Ann. Stat. 1978, 6, 461–464.
21. Breiman, L.; Friedman, J.H.; Olshen, R.; Stone, C.J. Classification and Regression Trees; CRC Press: Boca Raton, FL, USA, 1984.
22. MacKay, D.J.C. Bayesian Methods for Adaptive Models. Ph.D. Thesis, California Institute of Technology, Pasadena, CA, USA, 1992.
23. Bishop, C.M. Pattern Recognition and Machine Learning; Springer: Berlin/Heidelberg, Germany, 2006; pp. 161–181.
24. Cheon, J.H.; Kim, A. Homomorphic Encryption for Approximate Matrix Arithmetic; Report 2018/565; 2018; Cryptology ePrint Archive.
25. Chen, H.; Gilad-Bachrach, R.; Han, K.; Huang, Z.; Jalali, A.; Lauter, K.L.; Lauter, K. Logistic regression over encrypted data from fully homomorphic encryption. BMC Med. Genom. 2018, 11, 4–81.
26. Medical Cost Personal Datasets. Available online: https://www.kaggle.com/mirichoi0218/insurance/data (accessed on 21 December 2019).
27. Cao, X.; Moore, C.; O’Neill, M.; Hanley, N.; O’Sullivan, E. High-speed fully homomorphic encryption over the integers. In Proceedings of the Financial Cryptography and Data Security, Christ Church, Barbados, 7 March 2014; pp. 169–180.
28. Doröz, Y.; Öztürk, E.; Sunar, B. Accelerating fully homomorphic encryption in hardware. IEEE Trans. Comput. 2015, 64, 1509–1521.
29. Wang, W.; Hu, Y.; Chen, L.; Huang, X.; Sunar, B. Accelerating fully homomorphic encryption using gpu. In Proceedings of the 2012 IEEE Conference on High Performance Extreme Computing, Waltham, MA, USA, 10–12 September 2012; pp. 1–5.
30. Migliore, V.; Seguin, C.; Real, M.M.; Lapotre, V.; Tisser, A.; Fontaine, C.; Gogniat, G.; Tessier, R. A high-speed accelerator for homomorphic encryption using the karatsuba algorithm. ACM Trans. Embed. Comput. Syst. 2017, 16, 1–17.

Figure 1. Fixed point representation of 8-bit real number.

Figure 2. Plot of a training and test dataset. The black curve shows the function used to generate the data.

Figure 3. (a): Graphs of the root-mean-square error (RMSE) evaluated on the training set and independent test set for various values of M. (b): Plot of a model evidence for various values of M.

Figure 4. (a): Graphs of the 9th polynomial regression with 100 data (b): Plot of 1000 data sets for use in training set, test set, and validation set divided by 6:2:2.

Figure 5. (a): Graphs of RMSE of each order in 600 training, 200 test, 200 validation and cross validation dataset. Cross validation performs with 800 training(training + validation) dataset and 200 test dataset. (b): Boxplot of total sum of RMSE on the dataset.

Figure 6. Plot of a model evidence for various values of M.

Table 1. The basic notation of homomorphic operations and homomorphic functions.

Gate Operation	Homomorphic Operation	Operation	Homomorphic Function
AND gate	bootsAND	Addition	HomAdd
NOT gate	bootsNOT	Subtraction	HomSub
OR gate	bootsOR	Multiplication	HomMulti
XOR gate	bootsXOR	Division	HomDiv
CONSTANT trivial gate	bootsCONSTANT	Absolute value	HomAbs
		2’s complement	HomTwosComp
		Mux	HomMux
		Power	HomPow
		Logarithm	HomLog
		Matrix inverse	HomGJ
		Matrix Multiplication	HomMatMulti

Table 2. Execution time (sec) and time complexity of homomorphic operation.

	8-bit	16-bit	32-bit	r-bit
Addition	0.38	0.83	1.69	$(5 r - 3) T_{B}$
Subtraction	0.36	0.82	1.67	$(5 r - 3) T_{B}$
Multiplication	4.85	17.96	69.07	$(6 r^{2} + 4) T_{B} + 4 r T_{X}$
Division	6.89	27.48	109.86	$(8 r^{2} - 4 r + 4) T_{B} + (r^{2} + 2 r) T_{X}$
Absolute value	0.33	0.64	1.35	$2 r T_{B} + r T_{X}$
2’s complement	0.14	0.31	0.66	$(2 r - 3) T_{B}$
Power	8.78	34.09	135.53	$(12 r^{2} + 5 r + 5) T_{B} + 8 r T_{X}$
Logarithm	23.85	152.55	1123.13	$3 r^{3} T_{B} + 55 r T_{B} - 33 T_{B} + 2 r T_{X}$

Table 3. Comparison of error between plaintext and ciphertext evidence.

	M = 0	M = 1	M = 2	M = 3	M = 4	M = 5	M = 6	M = 7	M = 8	M = 9
Plaintext	−89.47	−92.83	−64.04	−35.75	−40.71	−41.93	−46.09	−50.34	−56.01	−62.34
16-bit Ciphertext	−89.88	−93.26	−65.47	−37.11	−42.34	−43.78	−48.11	−52.78	−58.56	−65.03
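As a quick check of Table 3, the short Python sketch below computes the absolute gap between the plaintext and 16-bit ciphertext evidence for each M and finds the order that maximizes the evidence in each row. The values are copied from the table; the variable names are illustrative, and the M = 5 ciphertext entry is read as −43.78.

```python
orders = list(range(10))  # M = 0 .. 9

# Evidence values copied from Table 3.
plaintext = [-89.47, -92.83, -64.04, -35.75, -40.71,
             -41.93, -46.09, -50.34, -56.01, -62.34]
# Assumption: the M = 5 entry is read as -43.78.
ciphertext = [-89.88, -93.26, -65.47, -37.11, -42.34,
              -43.78, -48.11, -52.78, -58.56, -65.03]

# Absolute difference between plaintext and ciphertext evidence per order.
abs_errors = [abs(p - c) for p, c in zip(plaintext, ciphertext)]

# Model order selected by each row (largest evidence).
best_plain = max(orders, key=lambda m: plaintext[m])
best_cipher = max(orders, key=lambda m: ciphertext[m])

# Order at which the two rows disagree the most.
max_error_order = max(orders, key=lambda m: abs_errors[m])

for m in orders:
    print(f"M = {m}: |error| = {abs_errors[m]:.2f}")
print("selected order (plaintext): ", best_plain)
print("selected order (ciphertext):", best_cipher)
```

Both rows peak at M = 3, so the encrypted computation selects the same model order as the plaintext one. The gap is largest at M = 9 (about 2.69) and tends to grow with M.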

© 2020 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


Hong, M.Y.; Yoo, J.S.; Yoon, J.W. Homomorphic Model Selection for Data Analysis in an Encrypted Domain. Appl. Sci. 2020, 10, 6174. https://doi.org/10.3390/app10186174

