Building Trusted Federated Learning on Blockchain

Oktian, Yustus Eko; Stanley, Brian; Lee, Sang-Gon

doi:10.3390/sym14071407


by Yustus Eko Oktian, Brian Stanley and Sang-Gon Lee *

College of Software Convergence, Dongseo University, 47 Jurye-ro, Sasang-gu, Busan 47011, Korea

* Author to whom correspondence should be addressed.

Symmetry 2022, 14(7), 1407; https://doi.org/10.3390/sym14071407

Submission received: 20 June 2022 / Revised: 1 July 2022 / Accepted: 2 July 2022 / Published: 8 July 2022

(This article belongs to the Special Issue Blockchain-Enabled Technology for IoT Security, Privacy and Trust)


Abstract: Federated learning enables multiple users to collaboratively train a global model using the users’ private data on users’ local machines. This way, users are not required to share their training data with other parties, maintaining user privacy; however, the vanilla federated learning proposal is mainly assumed to be run in a trusted environment, while the actual implementation of federated learning is expected to be performed in untrusted domains. This paper aims to use blockchain as a trusted federated learning platform to realize the missing “running on untrusted domain” requirement. First, we investigate vanilla federated learning issues such as clients’ low motivation, client dropouts, model poisoning, model stealing, and unauthorized access. From those issues, we design building block solutions such as an incentive mechanism, a reputation system, a peer-reviewed model, a commitment hash, and model encryption. We then construct the full-fledged blockchain-based federated learning protocol, including client registration, training, aggregation, and reward distribution. Our evaluations show that the proposed solutions make federated learning more reliable. Moreover, the proposed system can motivate participants to be honest and perform best-effort training to obtain higher rewards while punishing malicious behaviors. Hence, running federated learning in an untrusted environment becomes possible.

Keywords: federated learning; artificial intelligence; blockchain; smart contract

1. Introduction

Many companies or organizations have recently utilized Machine Learning (ML) to gain knowledge from their data. These data are mainly obtained from users when they use companies or organizations’ products in their daily lives. The more data gathered from the users, the more accurate the company analytics may become, further driving the data collection practice; however, this data collection often faces public scrutiny from the user’s side, as data privacy has gained public awareness lately (e.g., through GDPR law [1]). Thus, a privacy-preserving ML scheme must be built to comply with the user’s data privacy requirement.

Federated Learning (FL) [2] allows ML models to be trained in users’ local devices instead of companies’ centralized servers; hence, avoiding the user data gathering in the first place; however, the vanilla FL is assumed to be run in a trusted environment, where the trainers are always honest. Meanwhile, the real applications of FL are in an untrusted domain, in which trainers can become malicious. This mismatch highlights the necessity of a secure platform to run FL, where multiple conflicting participants can train models honestly and fairly.

On the other hand, blockchain technology has gained traction lately due to the popularity of Bitcoin [3]. The premise of blockchain is to allow users to store data or process data (e.g., through smart contract [4]) securely in a distributed manner without third-party intervention. If we look closely, the “distributed storage and computing” of blockchain are what we need for “decentralized training” in FL; therefore, many researchers have adopted blockchain to their FL system to secure the collaborative training efforts made by users [5].

Driven by the same background, we propose a blockchain-based FL system in this paper. Our motivation to create yet another blockchain FL system is that most previous papers only use blockchain as a platform to store and audit the trained models. Meanwhile, the FL system is a complex collaboration system involving many other issues, such as how to motivate clients to perform training, how to ensure the quality of the trained model, and how to guarantee the security and fairness of the FL system. So far, only a few studies have addressed those issues [6,7,8]. Our paper exists to (i) solve those unresolved issues and (ii) provide alternative solutions to those of the mentioned studies.

In summary, our contributions are as follows.

We analyze several issues of vanilla FL such as low motivation to perform local training, client dropouts, model poisoning, model stealing, and unauthorized access.
Based on the previously mentioned problems, we create initial building blocks for our proposal, including an incentive mechanism, peer-to-peer reviewed trained model, model encryption, stage timeout, deposit, and reputation system.
We design a full-fledged blockchain-based FL protocol to run fair, secure, and trusted FL tasks.
We provide a proof-of-concept implementation of our protocol and analyze the results.

The rest of this paper is organized as follows. We first explain our proposal’s problem statement and building blocks in Section 2. We then explain the inner workings of our proposal in Section 3 and evaluate our proposed method in Section 4. Literature reviews on the state-of-the-art blockchain-based FL protocols are presented in Section 5. Finally, we conclude in Section 6.

2. Preliminaries

2.1. Problem Statement

The vanilla FL [2] still relies heavily on a centralized and trusted environment. Because of that, we find several general problems (GP) as follows.

GP1: The model owner cannot recruit enough workers to train their model due to a lack of incentives for workers.

The success of FL training depends heavily on the workers’ willingness to perform the FL tasks. The model may become less accurate if it is trained with only a small amount of data. On the other hand, having more workers may result in more data available for training and increase the data diversity, assuming that each worker trains the model with a unique dataset. Unfortunately, from the workers’ point of view, performing local training consumes their own resources. Workers will most likely not perform the FL tasks in the vanilla FL because it has no incentive mechanism to attract their participation.

GP2: The workers may perform malicious local training that may corrupt the global model.

Because all local models will be aggregated into a global model, corrupt or invalid local models may disrupt the global model. The vanilla FL does not have any protection against malicious workers. Lazy workers, who perform training with low effort, may produce low-accuracy local models that decrease the overall global model accuracy [9]. Malicious workers may perform the local model training with adversarial examples to make the global model misclassify [10]. Finally, workers who joined the FL tasks may suddenly drop out and discontinue the training process. When many workers stop the local training simultaneously, it may drastically affect the quality of the updated global model [9].

GP3: Malicious actors may gain access to the updated local or global models.

FL encourages model sharing among entities, where models will be passed on from one entity to another. An unauthorized party may steal the global or local models by eavesdropping on the communication channels. When attackers steal a local model, they can slightly modify it (e.g., train the stolen model with their dataset for a few more epochs), then claim the model as theirs and submit it to the system. Furthermore, they can even resubmit the stolen model without any modification, which increases their advantage. If the attackers can obtain the global model, they hit the jackpot and obtain the most reward without training. Since vanilla FL focuses on the training process, it does not protect the communication channels; therefore, the vanilla FL is susceptible to those mentioned attacks.

GP4: The FL actors may crash and jeopardize the whole training process.

Similar to any collaboration system, we cannot guarantee that all participants behave as intended at all times in FL. The local workers may crash and not submit the training result for a given round on time. The aggregator server may be down and fail to generate a global model for a given round. Finally, the model owner may crash and cannot send the training rewards to workers. The vanilla FL will likely fail when such circumstances happen because it does not consider its robustness and fault tolerance.

2.2. Design Considerations

We came up with several design decisions to solve those mentioned general problems.

2.2.1. Incentive Mechanism

The incentive mechanism is one of the features to solve GP1. With enough rewards, a particular FL task should appeal to workers and make it easier for the model owner to find worker candidates. An important detail is that the incentive mechanism should be fair and reliable so that participants can trust it. In particular, the system must reward workers according to their contributions to the global model. For example, if we reward workers using flat rewards, hard-working workers may feel that they are subsidizing others. Most workers then become reluctant to train a model using their maximum capabilities, as they will obtain the same reward as the low-effort ones.

2.2.2. Peer-Reviewed Local Model

All submitted local models must be reviewed to assess their quality and solve GP2. For this purpose, we can employ third-party reviewers to perform model validations.

After the local model training completes, the workers must disclose their training results to the reviewers. The reviewers must evaluate whether the submitted local model yields low accuracy on their test dataset. A low-accuracy model may indicate a poor training effort or an imperfect training dataset. The reviewers must also try to aggregate the evaluated local model with the current global model to see whether the aggregation improves or reduces the model quality. A quality degradation may indicate that data poisoning took place during the training. After completing the evaluation, the reviewers must submit their evaluation scores for the trained local models to the system. The system then determines each worker’s contributions based on the submitted training and evaluation scores.

During the evaluation, malicious reviewers (e.g., colluding with or paid by the worker) may send a fake evaluation score that does not represent the evaluated models. Hence, the system must guarantee that the fake scores will not influence the contribution scores. Moreover, a lazy reviewer can do nothing but wait for other reviewers to reveal and submit their evaluation scores to the system. After that, the reviewers copy (or modify slightly) the evaluation score and submit the stolen score as theirs; therefore, the system must also prevent such stealing possibilities from happening.

2.2.3. Model Encryption

The encryption over the distributed models is required to solve GP3. FL actors can perform encryption on the application level (e.g., encrypting the model parameters directly) or build a secure channel using the TLS protocol. With encryption, outside entities will not understand the global or local models being exchanged in the system; therefore, we can minimize the potential model leakage from outside parties.

2.2.4. Stage Timeout

We usually employ stages to manage the training state in synchronous FL. In particular, the system creates a timeout in which participants must complete their work before the given deadline. Furthermore, the system will not count late contributions, thus, ignoring failed or unresponsive FL actors and eliminating GP4.

2.2.5. Deposit Mechanism

We can impose a deposit mechanism to economically punish malicious actors and help solve GP2 and GP4. When the system detects malicious activity, it penalizes all culprit actors by depleting their deposits. This mechanism is arguably a simple yet effective approach; however, a static flat deposit system may not appeal to participants.

A dynamic deposit mechanism can provide perks to attract trustable actors to join our system. For example, the more trustable the actor is, the smaller the deposit they need to join the FL tasks. With this lower deposit, actors can join more FL tasks, gaining more profits from those tasks. This is attractive because solving machine learning problems is similar to solving a black-box system: we do not know whether our data can produce optimal accuracy results without a trial-and-error approach; thus, being able to join more FL tasks with smaller deposits is desirable. This approach should also encourage actors to always behave honestly to maintain their lower deposit benefits. Hence, it indirectly solves GP1.

2.2.6. Reputation System

Without a reputation system, the system will always treat all participants equally. A veteran player is indistinguishable from a new player because there is no way to know the history of player activities and assess their credibility. While this “join and forget” nature may be helpful to preserve the actors’ privacy, attackers become more easily able to perform Sybil attacks in this environment; therefore, the system must determine which actors are considered veteran players and propose more benefits to them when joining our FL tasks.

However, veteran players are not always trustable. In particular, they may intentionally or unintentionally (e.g., when the account is compromised) become malicious. Hence, the reputation system must also track actors’ activity history in the environment and determine whether they are currently behaving honestly or maliciously. This history of activities must become one factor in determining a given actor’s credibility.

3. Proposed Protocol

Our proposed framework is shown in Figure 1 and Table 1 presents the description of important notations and variables in this paper.

Objectives: Model owners want to train their global model but do not have the necessary data and hardware to perform training; therefore, they outsource the training process to multiple clients using Federated Learning (FL). Clients are willing to use their private data and resources to train the model for given incentives. The training scores and trained models from clients will be reviewed, and rewards will be distributed according to the evaluation scores. This way, model owners can expect to receive high-accuracy models while clients are compensated based on their performance. This results in a win-win solution for model owners and clients.

3.1. Accountability Metric

We design our FL environment to be public such that anyone can join the FL tasks; therefore, we must guarantee that only credible clients join and perform the task to produce accurate global models. For this reason, we use two metrics to assess clients’ trustability: leveling and credit score mechanism.

3.1.1. Leveling System

The leveling system is used to judge the clients’ experience of performing the FL tasks, which is summarized in Algorithm 1. The admin first sets the $\gamma$ value, a multiplier that determines how much experience is required to reach the next level, $e^{\mathrm{next}}$. A new client is given zero experience ($e = 0$), which puts them at level one ($l = 1$) by default. Every time clients successfully perform a task (e.g., training or evaluating), the GainExperience(·) method increases their experience. The method also increments the clients’ level once the experience required to level up is reached. Higher-level clients are veteran players and are most likely to be trustable.

Algorithm 1 Leveling system for client i

on startup:
    admin initiates $\gamma = 5$ ▹ experience growth multiplier
for each new client, admin initiates:
    $l = 1$ and $e = 0$ ▹ default level and experience
    $e^{\mathrm{next}} = l \times \gamma$ ▹ experience required to level up
procedure GainExperience(i)
    get the current experience $e$ for client i
    $e \leftarrow e + 1$
    if $e = e^{\mathrm{next}}$ then ▹ level up is possible
        $l \leftarrow l + 1$
        $e^{\mathrm{next}} = l \times \gamma$
        $e = 0$
function CalculateLevel(i)
    return the current level $l$ for client i
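The leveling rule can be sketched in a few lines of Python. This is only an illustration of Algorithm 1 under stated assumptions: the class `ClientLevel` and the helper `tasks_to_reach` are hypothetical names, the state is kept in memory rather than in a smart contract, and $\gamma = 5$ is the example value from the listing.

```python
from dataclasses import dataclass

GAMMA = 5  # experience growth multiplier (example value from Algorithm 1)


@dataclass
class ClientLevel:
    level: int = 1       # l, default level for a new client
    experience: int = 0  # e, experience gained since the last level up

    @property
    def next_threshold(self) -> int:
        # e^next = l * gamma
        return self.level * GAMMA

    def gain_experience(self) -> None:
        # One successfully completed task (training or evaluating) adds one point.
        self.experience += 1
        if self.experience == self.next_threshold:
            self.level += 1
            self.experience = 0  # the counter restarts at every level


def tasks_to_reach(level: int) -> int:
    """Count how many successful tasks a fresh client needs to reach `level`."""
    client = ClientLevel()
    tasks = 0
    while client.level < level:
        client.gain_experience()
        tasks += 1
    return tasks
```

Because the threshold grows linearly with the level and the counter resets on each level up, reaching level 2 takes 5 tasks, level 3 takes 15, and level 4 takes 30 in this sketch, so higher levels become progressively more expensive to reach.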

3.1.2. Credit Score System

Unfortunately, high-level players are not always trustworthy because they can still act maliciously, whether intentionally or not (e.g., when the account is hacked by attackers). Because of that, we propose a credit score mechanism as our second trustability factor, which monitors the history of clients’ honest/malicious activities throughout their lifetime. We summarize the logic in Algorithm 2.

On startup, the admin first needs to set three parameters: $C^{\mathrm{evt}}$, $P$, and $t^{\mathrm{expired}}$. $C^{\mathrm{evt}}$ is a set of credit events containing the credit values given when the client performs a particular task. We give positive scores for meaningful actions such as joining, training, and evaluating FL tasks. Meanwhile, we give negative values to punish malicious behaviors. $k = 1, 2, 3, \dots, K$ is the index of $C^{\mathrm{evt}}$, where $K$ is the total number of distinct $C^{\mathrm{evt}}$ types available in the system. $P$ is a threshold that controls how many cumulative credit events $C^{\mathrm{cum}}$ can be stored for each client. $t^{\mathrm{expired}}$ is a threshold used to judge the freshness of $C^{\mathrm{cum}}$. We measure $t^{\mathrm{expired}}$ in a block timestamp format.

The admin calls the SaveCreditEvent(·) procedure to insert a new score for client i. This method should be called at the end of the FL task. First, we calculate $C^{\mathrm{cum}}$ for a given client i. $C^{\mathrm{cum}}$ is the sum of all $C^{\mathrm{evt}}$ that the client receives during the FL task. $t^{\mathrm{cum}}$ indicates the block timestamp when $C^{\mathrm{cum}}$ is calculated and stored in the blockchain.

Any entity (e.g., model owner, client, or reviewer) can call CalculateCreditScore(·) to receive the credit score of a given client i. The system first gathers the last $P$ values of $C^{\mathrm{cum}}$ and determines their freshness by comparing each timestamp $t_{p}^{\mathrm{cum}}$ to $t^{\mathrm{now}} - t^{\mathrm{expired}}$. Outdated cumulative credit events are ignored. Each remaining $C_{p}^{\mathrm{cum}}$ is then scaled by $\frac{t^{\mathrm{expired}} - (t^{\mathrm{now}} - t_{p}^{\mathrm{cum}})}{t^{\mathrm{expired}}}$ before being summed into the total credit score. This way, the system gives higher weight to the most recent events, which are more critical than obsolete ones.

Algorithm 2 Credit score system for client i

on startup:
    admin initiates $C^{\mathrm{evt}}$, where $C^{\mathrm{evt}} = \{C_{1}^{\mathrm{evt}}, C_{2}^{\mathrm{evt}}, C_{3}^{\mathrm{evt}}, \dots, C_{k}^{\mathrm{evt}}, \dots, C_{K}^{\mathrm{evt}}\}$
    $C_{1}^{\mathrm{evt}} = C^{\mathrm{join}} = +1$ ▹ successfully join the FL task
    $C_{2}^{\mathrm{evt}} = C^{\mathrm{train}} = +3$ ▹ successfully train the FL model
    $C_{3}^{\mathrm{evt}} = C^{\mathrm{eval}} = +3$ ▹ successfully evaluate the FL model
    $C_{4}^{\mathrm{evt}} = C^{\mathrm{punish}} = -10$ ▹ punishment for performing malicious actions
    admin sets $P = 5$ ▹ the number of $C^{\mathrm{cum}}$ objects to store
    admin defines $t^{\mathrm{expired}} = 86400$ ▹ epoch time till $C^{\mathrm{cum}}$ expires
procedure SaveCreditEvent(i)
    gather all $C^{\mathrm{evt}}$ for client i
    calculate $c_{k}$, where $c_{k}$ is the number of occurrences of $C_{k}^{\mathrm{evt}}$
    $C^{\mathrm{cum}} = \sum_{k = 1}^{K} c_{k} \times C_{k}^{\mathrm{evt}}$
    $t^{\mathrm{cum}} = t^{\mathrm{now}}$
    store $C^{\mathrm{cum}}$ and $t^{\mathrm{cum}}$ ▹ can save up to P times
function CalculateCreditScore(i)
    gather the last P values of $C^{\mathrm{cum}}$ for client i
    for $1, 2, 3, \dots, p, \dots, P$ do
        if $t_{p}^{\mathrm{cum}} < (t^{\mathrm{now}} - t^{\mathrm{expired}})$ then
            $C_{p}^{\mathrm{cum}} = 0$ ▹ ignore outdated $C^{\mathrm{cum}}$ score
    $C = \sum_{p = 1}^{P} C_{p}^{\mathrm{cum}} \times \frac{t^{\mathrm{expired}} - (t^{\mathrm{now}} - t_{p}^{\mathrm{cum}})}{t^{\mathrm{expired}}}$
    return C
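A minimal Python sketch of Algorithm 2 may help to see how the freshness weighting behaves. It rests on assumptions that are not part of the protocol: `CreditHistory` is a hypothetical in-memory stand-in for the on-chain storage, timestamps are plain integers in seconds, and "can save up to P times" is interpreted as keeping only the last $P$ records.

```python
from collections import deque

# Example parameters taken from Algorithm 2.
CREDIT_EVENTS = {"join": 1, "train": 3, "eval": 3, "punish": -10}
P = 5              # number of cumulative records kept per client
T_EXPIRED = 86400  # freshness window


class CreditHistory:
    def __init__(self):
        # Assumption: older records are dropped once P records are stored.
        self.records = deque(maxlen=P)  # items are (c_cum, t_cum)

    def save_credit_event(self, counts, t_now):
        # C^cum = sum_k c_k * C_k^evt, where counts maps event name -> c_k
        c_cum = sum(n * CREDIT_EVENTS[name] for name, n in counts.items())
        self.records.append((c_cum, t_now))
        return c_cum

    def calculate_credit_score(self, t_now):
        score = 0.0
        for c_cum, t_cum in self.records:
            age = t_now - t_cum
            if age > T_EXPIRED:
                continue  # outdated record, contributes nothing
            # Linear decay: weight 1 when fresh, 0 when about to expire.
            score += c_cum * (T_EXPIRED - age) / T_EXPIRED
        return score
```

For instance, a task with one join, one training, and one evaluation event stores $C^{\mathrm{cum}} = 7$; half a window later it only counts as 3.5, and a single punishment recorded at that moment pulls the total below zero.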

3.2. Federated Learning Stage

The FL tasks can be divided into seven stages: start, commit-train, reveal-train, commit-eval, reveal-eval, aggregate, and end stage. All parameters presented here are provided as bare minimum requirements. They can be customized further according to the actual FL use cases.

3.2.1. Start Stage

FL Task Creation: The model owner must first create a training smart contract along with its training properties. The model owner sets up the training information and uploads it to the IPFS network, including, but not limited to: the initial global model $M^{(0)}$ with its sample dataset for testing $d^{\mathrm{test}}$; the task type description $\tau^{\mathrm{desc}}$ (e.g., supervised, unsupervised, or reinforcement learning); a timestamp $t^{\mathrm{now}}$; and a nonce $\eta$. Formally, this process can be defined as follows.

$$Y_{1}^{\mathrm{ipfs}} = \mathrm{IPFS}(M^{(0)} \,\|\, d^{\mathrm{test}} \,\|\, \tau^{\mathrm{desc}} \,\|\, t^{\mathrm{now}} \,\|\, \eta) \tag{1}$$

The model owner then uploads the task metadata to the smart contract, including, but not limited to: the IPFS hash of the task $Y_{1}^{\mathrm{ipfs}}$ (this hash also becomes the task id $\tau^{\mathrm{id}}$); the task target $\tau^{\mathrm{target}}$ (e.g., achieving 90% accuracy); the base deposit value $\beta$; the total reward for this task $R$; and the minimum client level $l^{\mathrm{min}}$ and minimum reputation $C^{\mathrm{min}}$ required to join the task. Finally, the owner also sets a registration timeout $t^{\mathrm{regis}}$, which indicates when the registration will be closed.

Client Assignment: At any given time, FL clients may join the tasks they are interested in. They can download the FL task information from the corresponding IPFS and smart contract. Based on all that information, they can analyze their capabilities to determine whether they can perform the task and earn profits.

For all clients $i = 1, 2, 3, \dots, I$, where $I$ is the total number of available clients, client $i$ must submit their address $\alpha_{i}$ and public key $PK_{i}$ to register for the FL task. Moreover, the clients also need to pay a deposit $D_{i}$ to the smart contract $S$, denoted $F_{i \to S}(D_{i})$. The deposit is calculated as follows, where $l_{i}$ and $C_{i}$ are the client’s current level and credit score (obtainable from Algorithms 1 and 2).

$$D_{i} = \beta + \frac{l^{\mathrm{min}}}{l_{i}} \times \beta + \frac{C^{\mathrm{min}}}{C_{i}} \times \beta \tag{2}$$

Each client pays a different deposit depending on their trustworthiness. In general, clients with higher levels and more honest behavior (i.e., higher credit scores) receive discounts on their deposits, while lower-level and dishonest clients pay larger deposits. The discounts or markups scale with the clients’ gaps toward $l^{\mathrm{min}}$ and $C^{\mathrm{min}}$.
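To make the scaling in Equation (2) concrete, the following Python sketch evaluates the deposit for three illustrative clients. The function name `required_deposit` and the example numbers are hypothetical; the guard against non-positive levels or credit scores is also an assumption, since the text does not say how the contract treats a client whose credit score is zero or negative.

```python
def required_deposit(base, level, credit, min_level, min_credit):
    """Deposit D_i = beta + (l_min / l_i) * beta + (C_min / C_i) * beta."""
    if level <= 0 or credit <= 0:
        # Assumption: the ratio is undefined here, so the sketch simply rejects it.
        raise ValueError("level and credit score must be positive")
    return base + (min_level / level) * base + (min_credit / credit) * base


# Illustrative task parameters: beta = 100, l_min = 2, C_min = 10.
at_minimum = required_deposit(100, level=2, credit=10, min_level=2, min_credit=10)
veteran = required_deposit(100, level=4, credit=20, min_level=2, min_credit=10)
newcomer = required_deposit(100, level=1, credit=5, min_level=2, min_credit=10)
```

With these numbers, a client exactly at the thresholds pays 300, a client with twice the required level and credit pays 200, and a client with half of each pays 500. Note that the deposit never drops below $\beta$, however trustworthy the client is.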

When nearing the $t^{\mathrm{regis}}$ timeout, the model owner has the option to prolong the registration step. The owner can extend the timeout if they cannot find enough suitable clients; however, it cannot be prolonged forever because the deposits must be returned to clients. If the owner is satisfied with the registered clients, they can move on to the next steps. Otherwise, they can cancel the task, and all clients will receive their deposits back.

3.2.2. Commit-Train Stage

The model owner queries the registration results from the smart contract, which include a list of addresses $\alpha_{i}$ and public keys $PK_{i}$ of all registered clients.

Distribution of Initial Global Model: The model owner $O$ creates a random secret key $k$. They then sign the key with $SK_{O}$, which is $Z_{1} = \mathrm{SIGN}_{SK_{O}}(k)$, and encrypt the key with $PK_{i}$. In particular, for all $i$, the owner performs $X_{1, i} = \mathrm{PKE}_{PK_{i}}(k)$. The owner then uploads the signature and the encrypted keys to IPFS.

$$Y_{2}^{\mathrm{ipfs}} = \mathrm{IPFS}\Big(Z_{1} \,\|\, \bigcup_{i \in I} X_{1, i}\Big) \tag{3}$$

The model owner then uploads $Y_{2}^{\mathrm{ipfs}}$ to the smart contract, which eventually notifies all clients that the key is ready at the given IPFS address.

Local Training: All clients query $Z_{1}$ and $X_{1, i}$ from IPFS using $Y_{2}^{\mathrm{ipfs}}$. They then decrypt the key with $SK_{i}$ and verify that $O$ is the signer of $Z_{1}$. Formally, clients calculate $k = \mathrm{PKD}_{SK_{i}}(X_{1, i})$ and check that $\mathrm{VER}_{PK_{O}}(k, Z_{1}) = \mathrm{True}$. After that, the clients can begin to train $M^{(0)}$ using their own private dataset $d_{i}^{\mathrm{train}}$.

Once the training completes, the clients produce the updated local models $M_{i}^{'}$ with their associated results $T_{i}^{\mathrm{score}}$. Before submitting the results, the clients must encrypt the trained model using $X_{2, i} = E_{k}(M_{i}^{'})$ and sign the model with their secret key using $Z_{2, i} = \mathrm{SIGN}_{SK_{i}}(M_{i}^{'})$. They then upload the encrypted model along with its signature to IPFS and generate commitment hashes $Y_{1, i}^{\mathrm{commit}}$, which can be described as follows.

$$Y_{3, i}^{\mathrm{ipfs}} = \mathrm{IPFS}(X_{2, i} \,\|\, Z_{2, i} \,\|\, t^{\mathrm{now}} \,\|\, \eta) \tag{4}$$

$$Y_{1, i}^{\mathrm{commit}} = H(Y_{3, i}^{\mathrm{ipfs}}) \tag{5}$$

After that, clients submit $Y_{1, i}^{\mathrm{commit}}$ and $T_{i}^{\mathrm{score}}$ to the smart contract; however, clients keep $Y_{3, i}^{\mathrm{ipfs}}$ secret for now.

This stage completes when one of the following cases happens. First, when all clients have submitted $Y_{1, i}^{\mathrm{commit}}$ and $T_{i}^{\mathrm{score}}$ to the smart contract. Second, when a stage timeout is triggered, which is indicated by $t_{\mathrm{commit}}^{\mathrm{train}}$.

3.2.3. Reveal-Train Stage

Before the $t_{\mathrm{reveal}}^{\mathrm{train}}$ timeout, clients need to disclose their $Y_{3, i}^{\mathrm{ipfs}}$ from Equation (4) to the smart contract. The smart contract verifies that the revealed IPFS hashes match the commitments submitted in the previous stage. When the hash matches, the smart contract records $Y_{3, i}^{\mathrm{ipfs}}$ in its storage and continues to the next stage. If the hash does not match, the smart contract punishes the sender by depleting their deposit.

$$\text{with } Y_{1, i}^{\mathrm{commit}'} = H(Y_{3, i}^{\mathrm{ipfs}}), \quad \begin{cases} \text{continue}, & \text{if } Y_{1, i}^{\mathrm{commit}'} = Y_{1, i}^{\mathrm{commit}} \\ \text{punish } i, & \text{otherwise} \end{cases} \tag{6}$$
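The commit and reveal steps of Equations (5) and (6) can be sketched as follows. This is a simplified model under stated assumptions: $H(\cdot)$ is taken to be SHA-256 (the text does not fix the hash function), `TrainCommitRegistry` is a hypothetical in-memory stand-in for the smart contract state, and the IPFS hashes are placeholder strings.

```python
import hashlib


def commitment(ipfs_hash: str) -> str:
    # Assumption: H(.) is SHA-256 over the UTF-8 encoded IPFS hash.
    return hashlib.sha256(ipfs_hash.encode("utf-8")).hexdigest()


class TrainCommitRegistry:
    """Toy stand-in for the contract state used in the two training stages."""

    def __init__(self):
        self.commits = {}      # client -> (commitment hash, claimed training score)
        self.revealed = {}     # client -> revealed IPFS hash
        self.punished = set()  # clients whose reveal did not match

    def submit_commit(self, client, commit_hash, train_score):
        # Commit-train stage: only the hash and the claimed score are public.
        self.commits[client] = (commit_hash, train_score)

    def reveal(self, client, ipfs_hash):
        # Reveal-train stage: recompute the hash and compare, as in Equation (6).
        expected, _ = self.commits[client]
        if commitment(ipfs_hash) == expected:
            self.revealed[client] = ipfs_hash
            return True
        self.punished.add(client)
        return False
```

Since the location of a model only becomes public after every commitment is fixed, a client who waits for others to reveal cannot substitute a copied model without failing the hash check.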

3.2.4. Commit-Eval Stage

Local Model Evaluation: During this stage, all clients must validate each other’s training results and thus become reviewers for other clients. We define the reviewers as $j = 1, 2, 3, \dots, J$, where $J$ is the total number of reviewers.

First of all, reviewers download other clients’ models from IPFS using $Y_{3, i}^{\mathrm{ipfs}}$. They then perform the necessary decryption to obtain the model and verify its signature. Formally, for all $i$, the reviewers obtain $M_{i}^{'} = D_{k}(X_{2, i})$ and make sure that $\mathrm{VER}_{PK_{i}}(M_{i}^{'}, Z_{2, i}) = \mathrm{True}$. All invalid models are ignored in the system. The reviewers then evaluate $M_{i}^{'}$ using the test dataset $d^{\mathrm{test}}$. Once completed, the reviewers produce $T_{i, j}^{\mathrm{eval}}$ for all $i$, which are the evaluation scores for client $i$’s model from reviewer $j$. Before submitting the result, reviewers generate commitment hashes $Y_{2, j}^{\mathrm{commit}}$ as follows.

$$Y_{2, j}^{\mathrm{commit}} = H(T_{i, j}^{\mathrm{eval}} \,\|\, \eta) \tag{7}$$

After that, the reviewer submits $Y_{2, j}^{\mathrm{commit}}$ to the smart contract but keeps the value of $T_{i, j}^{\mathrm{eval}}$ secret for now.

This stage completes when one of the following cases happens. First, when all reviewers have submitted $Y_{2, j}^{\mathrm{commit}}$ to the smart contract. Second, when a timeout is triggered, which is indicated by $t_{\mathrm{commit}}^{\mathrm{eval}}$.

3.2.5. Reveal-Eval Stage

Before the $t_{\mathrm{reveal}}^{\mathrm{eval}}$ timeout, reviewers need to disclose their $T_{i, j}^{\mathrm{eval}}$ and $\eta$ to the smart contract. The smart contract validates whether the revealed evaluation scores match the commitments submitted in the previous stage. If they do not match, the smart contract punishes the reviewer by depleting their deposit. When the hash matches, the smart contract saves $T_{i, j}^{\mathrm{eval}}$ in the blockchain and continues to the next stage.

$$\text{with } Y_{2, j}^{\mathrm{commit}'} = H(T_{i, j}^{\mathrm{eval}} \,\|\, \eta), \quad \begin{cases} \text{continue}, & \text{if } Y_{2, j}^{\mathrm{commit}'} = Y_{2, j}^{\mathrm{commit}} \\ \text{punish } j, & \text{otherwise} \end{cases} \tag{8}$$
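One way to see why the nonce $\eta$ is part of the evaluation commitment: evaluation scores come from a small range, so a bare hash of a score could be guessed by trying every candidate. The Python sketch below illustrates this under explicit assumptions: scores are integer percentages from 0 to 100, $H(\cdot)$ is SHA-256, and the byte encoding of $T_{i, j}^{\mathrm{eval}} \,\|\, \eta$ is a made-up convention; `commit_scores` and `guess_single_score` are hypothetical helpers.

```python
import hashlib
import secrets
from typing import List, Optional


def commit_scores(scores: List[int], nonce: bytes) -> str:
    # Assumed encoding of T || eta: comma-separated scores, a separator, then the nonce.
    payload = ",".join(str(s) for s in scores).encode("utf-8") + b"|" + nonce
    return hashlib.sha256(payload).hexdigest()


def guess_single_score(commit_hash: str, nonce_guess: bytes) -> Optional[int]:
    # A lazy reviewer tries every possible percentage for a given nonce guess.
    for candidate in range(101):
        if commit_scores([candidate], nonce_guess) == commit_hash:
            return candidate
    return None


# Without a nonce, the committed score is recovered in at most 101 attempts.
weak_commit = commit_scores([87], b"")
leaked = guess_single_score(weak_commit, b"")

# With a random 16-byte nonce, the same search fails unless the nonce is known.
nonce = secrets.token_bytes(16)
strong_commit = commit_scores([87], nonce)
not_leaked = guess_single_score(strong_commit, b"")
```

The reviewer later reveals both the scores and the nonce, and the contract recomputes `commit_scores(scores, nonce)` to check the match, which corresponds to Equation (8).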

3.2.6. Aggregate Stage

After the smart contract obtains all the evaluation scores from all clients, the system can begin the model aggregation stage.

Calculating Contribution Scores: The CalculateWorkerContribution(·) function in Algorithm 3 is used to calculate the training contribution of client $i$ with respect to its corresponding evaluation scores $T_{i, j}^{\mathrm{eval}}$. We calculate the first quartile $Q_{1}$, the second quartile (the median) $Q_{2}$, and the third quartile $Q_{3}$ of all $T_{i, j}^{\mathrm{eval}}$. We then check whether the previously claimed training score $T_{i}^{\mathrm{score}}$ lies outside the range bounded by $(Q_{1} - IQR)$ and $(Q_{3} + IQR)$, where $IQR = Q_{3} - Q_{1}$ is the interquartile range. We assume that 50% of the clients are always honest and that 50% of the submitted evaluation scores can be trusted. If $T_{i}^{\mathrm{score}}$ is out of range, we punish client $i$. Finally, we return the median of all evaluation scores as the accepted training score for $M_{i}^{'}$.

The CalculateReviewerContribution(·) function is used to measure the evaluation contributions of reviewer $j$ towards the model of client $i$. Similar to the previous function, we first calculate $Q_{1}$, $Q_{2}$, and $Q_{3}$ of all $T_{i, j}^{\mathrm{eval}}$. We then calculate $d_{i, j}$, which is the distance between the evaluation score $T_{i, j}^{\mathrm{eval}}$ submitted by each $j$ and the median $Q_{2}$. Because we assume that 50% of the clients are honest, we expect 50% of the submitted evaluation scores to be close to the median. We then normalize $d_{i, j}$ and flip the score so that higher $d_{i, j}^{'}$ values correspond to scores closer to the median instead of farther from it. Finally, evaluation scores that fall outside the boundary of $(Q_{1} - IQR)$ and $(Q_{3} + IQR)$ are treated as fake, and their $d_{i, j}^{'}$ is given the lowest value possible.

Algorithm 3 Processing contributions from client i and reviewer j

function CalculateWorkerContribution(i)
    $\forall j$, gather all $T_{i, j}^{\mathrm{eval}}$
    calculate $Q_{1}$, $Q_{2}$, and $Q_{3}$ from all $T_{i, j}^{\mathrm{eval}}$
    calculate $IQR = Q_{3} - Q_{1}$
    get previously claimed training score $T_{i}^{\mathrm{score}}$
    if $T_{i}^{\mathrm{score}} < (Q_{1} - IQR)$ or $T_{i}^{\mathrm{score}} > (Q_{3} + IQR)$ then
        punish worker i ▹ client submitted fake training score
    return $Q_{2}$
function CalculateReviewerContribution(j)
    for all i do
        $\forall j$, gather all $T_{i, j}^{\mathrm{eval}}$
        calculate $Q_{1}$, $Q_{2}$, and $Q_{3}$ from all $T_{i, j}^{\mathrm{eval}}$
        calculate $IQR = Q_{3} - Q_{1}$
        if $T_{i, j}^{\mathrm{eval}} < (Q_{1} - IQR)$ then
            $d_{i, j} = | (Q_{1} - IQR) - Q_{2} |$
            punish reviewer j ▹ client submitted fake evaluation score
        else if $T_{i, j}^{\mathrm{eval}} > (Q_{3} + IQR)$ then
            $d_{i, j} = | (Q_{3} + IQR) - Q_{2} |$
            punish reviewer j ▹ client submitted fake evaluation score
        else
            $d_{i, j} = | T_{i, j}^{\mathrm{eval}} - Q_{2} |$
        normalize and flip, $d_{i, j}^{'} = 1 - \frac{d_{i, j} - \min (d_{i})}{\max (d_{i}) - \min (d_{i})}$
    return $d_{i, j}^{'}$
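The two functions of Algorithm 3 can be sketched in Python for a single model $i$. Several details here are assumptions rather than part of the protocol: quartiles are computed with the standard library's `statistics.quantiles` using the inclusive method (the text does not specify a quartile convention), `reviewer_distances` returns the flipped distances of all reviewers for one model at once instead of looping over models, and the case where all distances are equal is mapped to 1.0 to avoid a division by zero.

```python
from statistics import quantiles


def bounds(scores):
    # Q1, Q2, Q3 and the acceptance range (Q1 - IQR, Q3 + IQR).
    q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - iqr, q2, q3 + iqr


def worker_contribution(eval_scores, claimed_score):
    """Return (accepted score Q2, whether the worker's claim is out of range)."""
    low, median, high = bounds(eval_scores)
    punished = claimed_score < low or claimed_score > high
    return median, punished


def reviewer_distances(eval_scores):
    """eval_scores[j] is reviewer j's score for one model; returns (d', punished)."""
    low, median, high = bounds(eval_scores)
    distances, punished = [], []
    for score in eval_scores:
        if score < low:
            distances.append(abs(low - median))   # clamp to the lower bound
            punished.append(True)
        elif score > high:
            distances.append(abs(high - median))  # clamp to the upper bound
            punished.append(True)
        else:
            distances.append(abs(score - median))
            punished.append(False)
    d_min, d_max = min(distances), max(distances)
    if d_max == d_min:
        return [1.0] * len(distances), punished   # assumption: all reviewers agree
    flipped = [1 - (d - d_min) / (d_max - d_min) for d in distances]
    return flipped, punished


scores_for_model = [0.80, 0.82, 0.81, 0.79, 0.20]  # last reviewer is an outlier
accepted, worker_punished = worker_contribution(scores_for_model, claimed_score=0.95)
flipped, reviewer_punished = reviewer_distances(scores_for_model)
```

In this example the accepted score is the median 0.80, the worker who claimed 0.95 is flagged, and the reviewer who reported 0.20 is flagged and receives the lowest flipped distance, while the reviewer who reported exactly the median receives the highest.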

Aggregating the Global Model: In vanilla FL [2], we perform aggregation as follows.

$$w_{t + 1} \leftarrow \sum_{i = 1}^{I} \frac{n_{i}}{n} \times w_{t + 1}^{i} \tag{9}$$

Here, $n_{i}$ is the number of data samples owned by client $i$, while $n$ is the total number of data samples across all clients. Using this formula, the aggregation weight is calculated based on dataset ownership.

Meanwhile, we slightly modify the formula to adjust the weight based on the models’ accuracy. More specifically, for all $i$, the owner calculates the following.

$$\begin{aligned} x_{i} &= \mathrm{CalculateWorkerContribution}(i) \\ x_{i}^{'} &= \frac{x_{i} - \min (x)}{\max (x) - \min (x)} \\ w_{t + 1} &\leftarrow \sum_{i = 1}^{I} \frac{x_{i}^{'}}{\sum_{i = 1}^{I} x_{i}^{'}} \times w_{t + 1}^{i} \end{aligned} \tag{10}$$

The model owner gathers the median scores $x_{i}$ derived from the reviewers’ evaluations. The owner then performs min-max normalization on the scores to obtain $x_{i}^{'}$ and aggregates the global model based on each client’s accuracy, which is weighted as $\frac{x_{i}^{'}}{\sum_{i=1}^{I} x_{i}^{'}}$. In this way, a model with a higher $x_{i}^{'}$ (i.e., a more accurate model) contributes more to the global model than a less accurate one.
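The accuracy-weighted aggregation of Equation (10) can be illustrated with a short Python sketch. It assumes that each local model is a flat list of floats and that `x` holds the median scores returned by CalculateWorkerContribution(·). The function names and the equal-weight fallback for identical scores are assumptions of this example, not part of the paper.

```python
def accuracy_weights(x):
    """Min-max normalize the scores x_i, then scale them so the weights sum to 1 (Eq. 10)."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # Assumption: fall back to equal weights when all scores are identical,
        # because the min-max normalization is undefined in that case.
        return [1 / len(x)] * len(x)
    normalized = [(xi - lo) / (hi - lo) for xi in x]
    total = sum(normalized)
    return [v / total for v in normalized]


def aggregate(local_models, x):
    """Weighted sum of local models; local_models is a list of equal-length float lists."""
    weights = accuracy_weights(x)
    size = len(local_models[0])
    return [
        sum(w * model[k] for w, model in zip(weights, local_models))
        for k in range(size)
    ]


if __name__ == "__main__":
    scores = [60, 80, 100]
    models = [[0.0, 0.0], [3.0, 3.0], [6.0, 6.0]]
    print(accuracy_weights(scores))
    print(aggregate(models, scores))
```

One consequence of Equation (10) as written is that the worker with the lowest score receives a normalized value of zero, so its model does not contribute to the aggregate.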

Distributing Reward: The model owner $O$ splits the total reward $R$ between workers and reviewers as follows.

$$R^{train} = ϕ \times R, \quad R^{eval} = (1 - ϕ) \times R \tag{11}$$

$ϕ$ is a parameter determined by the owner, where $0 \leq ϕ \leq 1$.

To distribute the rewards for all workers, for each i, the owner performs:

$$F_{O \to i} \left( \frac{x_{i}^{'}}{\sum_{i=1}^{I} x_{i}^{'}} \times R^{train} \right) \tag{12}$$

Note that the reward is weighted as $\frac{x_{i}^{'}}{\sum_{i=1}^{I} x_{i}^{'}}$, similar to the aggregation rule. Hence, clients that submit more accurate models are rewarded more than those that submit less accurate ones.

To distribute the rewards for all reviewers, for each j, the owner performs:

$$\begin{aligned} d_{i,j}^{'} &= \text{CalculateReviewerContribution}(j) \\ z_{j} &= \frac{1}{I} \sum_{i=1}^{I} d_{i,j}^{'} \\ &F_{O \to j} \left( \frac{z_{j}}{\sum_{j=1}^{J} z_{j}} \times R^{eval} \right) \end{aligned} \tag{13}$$

$z_{j}$ is the average of reviewer j’s normalized and flipped distances $d_{i,j}^{'}$ over all workers, so a higher $z_{j}$ indicates evaluation scores that are closer to the medians. The reward is then weighted as $\frac{z_{j}}{\sum_{j=1}^{J} z_{j}}$, so reviewers that submitted scores closer to the median scores receive larger rewards.
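The reward split in Equations (11) to (13) can be sketched as follows. This is a hedged illustration using floating-point numbers. The function names and the zero-total guard are assumptions of this example, and an on-chain implementation would presumably use integer arithmetic in gwei and an actual fund transfer instead of returning a list.

```python
def split_reward(total_reward, phi):
    """Eq. (11): split R into the worker pool R_train and the reviewer pool R_eval."""
    if not 0 <= phi <= 1:
        raise ValueError("phi must lie in [0, 1]")
    r_train = phi * total_reward
    return r_train, total_reward - r_train


def proportional_shares(scores, pool):
    """Give each participant a share of the pool proportional to its score."""
    total = sum(scores)
    if total == 0:
        # Assumption: nobody is paid when every score is zero.
        return [0.0] * len(scores)
    return [s / total * pool for s in scores]


def worker_rewards(x_normalized, r_train):
    """Eq. (12): x_normalized[i] is the min-max normalized score x'_i of worker i."""
    return proportional_shares(x_normalized, r_train)


def reviewer_rewards(d_flipped, r_eval):
    """Eq. (13): d_flipped[i][j] is d'_{i,j} for worker i and reviewer j."""
    n_workers = len(d_flipped)
    n_reviewers = len(d_flipped[0])
    z = [
        sum(d_flipped[i][j] for i in range(n_workers)) / n_workers
        for j in range(n_reviewers)
    ]
    return proportional_shares(z, r_eval)


if __name__ == "__main__":
    r_train, r_eval = split_reward(20.0, 0.5)  # in million gwei
    print(worker_rewards([0.0, 0.5, 1.0], r_train))
    print(reviewer_rewards([[1.0, 0.5, 0.0], [1.0, 0.5, 0.0]], r_eval))
```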

Updating Clients’ Level and Credit Scores: The model owner gives experience values to all clients that contribute to the FL task by invoking the GainExperience(·) procedure in Algorithm 1. When adding experience, the owner may also increase a client’s level once the required amount of experience is met. Note that malicious clients (e.g., those that drop out or submit fake scores) will not receive any experience.

Moreover, the owner gathers all of the credit events $C^{evt}$ for all joined clients, calculates the cumulative credit events $C^{cum}$, and then saves the score in the smart contract using the SaveCreditEvent(·) procedure in Algorithm 2. Honest clients will receive positive scores, while malicious clients will receive negative scores.

The clients’ updated levels and credit scores will determine the amount of deposit that the clients need to pay when joining future tasks.

3.2.7. End Stage

When all the steps in the aggregation stage finish, the owner will receive the updated global model, and the clients receive compensation for their efforts in training or evaluating the local models. The task now moves to the last stage, where no one can modify this FL task state in the smart contract. This frozen state is used for auditing or as a reference for future tasks related to this task.

4. Experimental Results and Analysis

Setup: We built a Docker container with 2 CPU cores and 2 GB of RAM to run Ganache [11] and our decentralized application (dapp). The smart contract was written in Solidity and deployed to Ganache using Truffle JS [12]. Node JS was used as the programming language to implement our dapp.

4.1. Assessing Client Credibility

We first analyze our proposed client credibility metrics, including leveling, credit score, deposits, model training score, model evaluation score, and reward distribution. Note that the initial values given in this paper are merely examples. There are no right or wrong values here, so developers can tune and calibrate them according to their desired results.

4.1.1. Leveling System Results

Figure 2 plots the growth rate of clients’ level based on the given $γ$ value (cf. Algorithm 1). With a higher $γ$, the amount of experience required to level up a client increases. Hence, developers can tweak this value to determine how fast clients can level up in their system. For example, with $γ = 1$, clients need to complete at least 200 FL tasks to reach level 20. In contrast, they need to perform more than 1600 FL tasks to reach the same level with $γ = 8$.

4.1.2. Credit Score System Results

Setup: We simulate a scenario where a client performs 20 FL tasks over a one-week timeframe. During that period, the client receives 20 score updates, as presented in Table 2. Based on the CalculateCreditScore(·) method in Algorithm 2, P and $t^{expired}$ are the main variables that determine the freshness of the given credit events, and those values influence the total credit score for a given client; therefore, we run experiments that vary those values to see their impact on the total credit score. In the first two experiments, we set $t^{expired}$ to two days and varied P between 2 and 10. In the last two experiments, we set P to 5 and varied $t^{expired}$ between one day and three days. At each hour, we call the CalculateCreditScore(·) method and plot the results in Figure 3 and Figure 4. The 20 straight vertical lines in both figures indicate when the 20 score updates from Table 2 are applied in the system.

When we set P to a higher value, the system includes more events from the history when calculating the current credit score. For $P = 10$, the system uses the last ten $C^{cum}$ values to calculate the credit score, while only the last two are used for $P = 2$; therefore, in Figure 3, we can see that the $P = 10$ line becomes more positive than the $P = 2$ line from 23 April, 1 AM to 24 April, 1 AM. Similarly, we can see that the $P = 10$ line suffers a smaller credit score drop than the $P = 2$ line from 24 April, 1 AM to 25 April, 1 AM.

When we set $t^{expired}$ to a higher value, a credit score takes longer to return to its default value (zero). We can see this behavior clearly in Figure 4 from 24 April, 6 PM to 26 April, 9 PM. In that period, the $t^{expired} = 3$ days line takes longer to reach positive values than the $t^{expired} = 1$ day line.

Based on the results in Figure 3 and Figure 4, we can conclude that our credit score reacts to the dynamics of a client’s activity history by giving positive or negative scores. Moreover, we emphasize the freshness of the events: when clients become inactive, their credit score eventually returns to zero.

Malicious clients with negative scores can improve their scores in two ways: (i) behave honestly to increase their score or (ii) do nothing and let the credit score eventually reset to zero. This behavior is intentional, as we want to discourage clients from performing Sybil attacks by creating a new account with a zero credit score. Creating such an account is not beneficial because the adversary loses the previous account’s level and starts over from level 1. The level is vital in our system because it determines the amount of deposit the client needs to stake, which is explained in the following subsection.

4.1.3. Deposit Mechanism Results

Setup: We first create a new FL task in which the minimum level $l^{min}$ is set to 4, the minimum credit score $C^{min}$ is set to 60, and the base deposit $β$ is set to 1 million gwei. The amount of deposit that a client has to stake depends on the client’s current level and credit score; therefore, we simulate a scenario where multiple clients join the created FL task, varying the client level from 1 to 10 and the credit score from 10 to 100.

Figure 5 shows that clients with a lower level or credit score than the minimum requirements must pay a considerably larger deposit than qualified and over-qualified clients. The deposit can scale up to about three times the base deposit, i.e., close to $3 \times β$ (cf. Equation (2)). The yellow area (between 2 and 3 million) is a good deposit value.

4.1.4. Contribution Score Results

Setup: To simulate the effectiveness of our approach in determining workers’ and reviewers’ contributions, we deploy a simulation with ten clients. Client 3 is a malicious worker who sends fake training results. Furthermore, Client 3 and Client 6 act as malicious reviewers who submit fake evaluation scores to the system. We then run steps in Algorithm 3 and plot the results in Figure 6 and Figure 7.

From Figure 6, we can see that the medians of each worker are placed relatively close to the submitted training scores previously claimed by the workers (except for Client 3). This condition remains true even though two malicious reviewers are in the system. Those reviewers submit fake evaluation scores far from the ground truth, which explains the vast distance in each worker’s min and max distribution in Figure 7. A greater distance means inaccurate evaluation and vice versa; therefore, as long as the majority of the reviewers (i.e., more than 50% clients) are honest, we can trust the result of our worker and reviewer contribution scores.

Moreover, from Figure 6, we can see that the claimed training score for Client 3 is far from the median, which serves as the ground truth. This condition implies that Client 3 lied about the actual value of the training score and submitted a fake score to the system. Hence, the system can mark this client as a malicious worker. Furthermore, from Figure 7, the distances for Clients 3 and 6 are larger than those of the other clients. This fact indicates that those two clients submitted inaccurate evaluation results. Thus, they are punished by the system.

Finally, based on the contribution results, we can distribute rewards to workers and reviewers as shown in Figure 8. In this case, we assume that the total reward is 20 million gwei: 10 million for workers and 10 million for reviewers. Clients 7, 8, and 9 receive low rewards because their models’ accuracy is among the worst of all ten workers. Similarly, Client 3 does not receive any training rewards due to its fake training score, while Clients 3 and 6 do not receive any reviewer rewards due to their inaccurate evaluations.

4.2. Assessing Blockchain Performance

We analyze parts of our implementation that are related to blockchain.

4.2.1. Gas Consumption

All Ethereum smart contract executions that modify the blockchain network state are subject to a unit called “gas”. Generally, the more complex the methods become, the more gas is required to execute them. Table 3 shows gas consumption of our writable methods. Read functions are free and do not require gas; hence, we do not include them in the table.

First, the gas usage of every implemented method is well below the Ethereum block gas limit of 30 million [13]. This indicates that all of our methods are feasible to execute on Ethereum networks. Second, the gas usage is reported per single invocation. For example, Add Experience refers to adding new experience values to one client, and Save Credit Event refers to submitting the cumulative credit events for one client. Similarly, each submission method covers one submission of a training score, an evaluation score, or a commitment hash. The total number of transactions for each method that must be executed in one FL task depends heavily on the number of clients.
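To make the dependence on the number of clients concrete, the following Python sketch combines the per-transaction gas usage and the transaction counts listed in Table 3. It covers only the five methods in that table. The dictionary keys and function names are illustrative assumptions, not identifiers from the authors’ contract.

```python
# Gas usage per transaction, taken from Table 3.
GAS_PER_TX = {
    "add_experience": 44_801,
    "save_credit_event": 102_944,
    "submit_training_score": 42_443,
    "submit_evaluation_score": 43_123,
    "submit_commitment_hash": 48_234,
}


def tx_counts(n):
    """Number of transactions per method for one FL task with n clients (Table 3)."""
    return {
        "add_experience": n,
        "save_credit_event": n,
        "submit_training_score": n,
        "submit_evaluation_score": n * (n - 1),
        "submit_commitment_hash": n + n * (n - 1),
    }


def total_gas(n):
    """Total gas of the Table 3 methods for one FL task with n clients."""
    counts = tx_counts(n)
    return sum(GAS_PER_TX[method] * counts[method] for method in counts)


if __name__ == "__main__":
    for n in (5, 10, 20):
        print(n, sum(tx_counts(n).values()), total_gas(n))
```

Summing the last column of Table 3 gives $2n^{2} + 2n$ transactions per task, so the load grows quadratically with the number of clients. For example, ten clients require 220 transactions under these counts.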

4.2.2. Transaction Throughput

The transaction throughput of a blockchain is the number of transactions the network can process per second (TPS). This metric depends on two factors: (i) the number of transactions that can be included in a block (which corresponds to the gas limit per block $g_{limit}$ in the Ethereum case), and (ii) the time taken to generate one block (also known as the block interval $b_{interval}$). With the gas usage $g_{usage}$ per method in Table 3, we can estimate the projected TPS using the following formula.

$$tps = \frac{g_{limit} / g_{usage}}{b_{interval}} \tag{14}$$

In this case, we assume that one block contains only transactions of the same method.

Because block intervals vary among different blockchain networks, we consider three networks to measure the throughput: Mainnet (Ethereum main network using PoW [14]), Kovan Testnet (Ethereum test network using PoA [15]), and Klaytn (private Ethereum network using PBFT [16]). Mainnet processes one block every 13 s [17], Kovan Testnet can do it in 4 s [18], while Klaytn can form a block within 1 s [19].
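As a worked illustration of Equation (14), the sketch below computes the projected TPS for one method on the three networks. It assumes the same 30 million gas limit per block for every network (the value used in Table 3), which may not match each network’s actual configuration. The names in the code are hypothetical.

```python
GAS_LIMIT = 30_000_000  # block gas limit assumed in Table 3

# Block intervals in seconds, as stated in the text.
BLOCK_INTERVAL_S = {"Mainnet": 13, "Kovan": 4, "Klaytn": 1}


def projected_tps(gas_usage, block_interval_s, gas_limit=GAS_LIMIT):
    """Eq. (14): transactions per block divided by the block interval."""
    return (gas_limit / gas_usage) / block_interval_s


if __name__ == "__main__":
    gas_usage = 43_123  # Submit Evaluation Score, from Table 3
    for network, interval in BLOCK_INTERVAL_S.items():
        print(f"{network}: {projected_tps(gas_usage, interval):.1f} tx/s")
```

Because the gas limit and gas usage are held fixed, the ratio between two networks in this sketch reduces to the inverse ratio of their block intervals.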

As shown in Figure 9, the lower the block interval, the higher the transaction throughput, and thus the more FL tasks we can complete. These results show that our proposal depends heavily on the consensus algorithm that the blockchain employs, which is most likely to become the main bottleneck in our system. In addition, the overall performance of our system also depends on how many clients join the FL task. From Table 3, we can see that, because each client must review the models trained by the other clients, clients must upload evaluation scores and commitment hashes once for every reviewed client. This requirement creates a significant bottleneck in our system; in exchange, the FL system becomes robust and fair.

4.3. Assessing FL Accuracy

Previous FL studies have evaluated the accuracy of FL [2]. Moreover, previous blockchain-based FL systems have shown that FL and the blockchain run independently of each other [6,20,21]; therefore, integrating blockchain into FL does not influence the accuracy of FL. Because we do not propose a new FL algorithm in this paper, we refrain from discussing FL performance.

5. Related Works

In recent years, many FL systems have been implemented in the Ethereum blockchain. Baffle [21] proposes a blockchain-based FL platform where workers can submit parts of the model to the smart contract for aggregation. The aggregation occurs on-chain; therefore, it does not need any centralized FL organizer. On the other hand, Morsbach [22] takes an off-chain aggregation approach by employing all workers to synchronize the global model through the IPFS network. Since storing the same model to the IPFS network will generate identical IPFS hashes, malicious entities can be detected, and most participants can quickly agree on the same global model. In [23], a blockchain-based FL system for healthcare is proposed. The system employs secure and private aggregation using multi-party computation implemented in AMD Secure Encrypted Virtualization for maximum secrecy. While those studies have their own merits, they lack the deposit, incentive, or reputation system to control the workers’ actions. Our proposal exists to provide all of the missing items.

The deposit system was used in several blockchain-based FL studies; however, most of them employ a static deposit system in which participants must submit a fixed amount of money to join the FL task, such as in [7,8,24]. CrowdSFL [6] proposes a dynamic deposit mechanism, where the required deposit is measured based on the age of the workers. A veteran worker may submit a lower deposit compared to a new worker. One limitation of this approach is that being veteran players does not guarantee they will not become malicious in the future; therefore, instead of relying on age alone, we also consider the worker’s credit score to determine the deposit in our proposal.

Several studies have also explored the use of reviewers to judge trained local models. CrowdSFL [6] employs a Crowdsourcing Platform (CSP) to review the submitted local models from workers. The CSP evaluates the models and provides a guideline for the FL organizer on choosing the best model candidates for the model aggregation. Similarly, Kumar et al. [20] mandate all trained models to be reviewed. Models below a given threshold will be excluded from the aggregation process; however, these mentioned works employ a single reviewer approach, which suffers from centralization issues. In contrast, our proposal uses multiple reviewers.

When multiple reviewers exist, the workers disclose their models to all reviewers. The reviewers then evaluate the models using the test dataset and submit them to the smart contract. BlockFLA [24] proposes a penalty smart contract in which users can invoke it anytime they detect malicious training by providing proof of their malicious behaviors. The contract will determine which party is honest. Learning Markets [7] compares the difference between the claimed accuracy from workers and the ones from reviewers. The accuracy must surpass a given acceptance threshold to be included in the aggregation process and obtain rewards. BlockFlow [8] calculates the median of the evaluation scores from reviewers as a base truth to determine a valid accuracy value. This value is then compared with other scores to determine the quality of trained models. Though we share the same multi-reviewer architecture, the algorithm we use to calculate the contribution scores differs from those studies.

Finally, the reputation system is commonly used to track workers’ actions throughout the workflow of FL tasks. CrowdSFL [6] and Learning Markets [7] have reputation systems built into their FL platforms; however, their reputation calculation is simplistic, as it only adds or subtracts a fixed value from the score. We propose a more sophisticated credit score mechanism that also considers the freshness of the data.

6. Conclusions and Future Works

This paper investigated several vanilla FL weaknesses that hindered its applications in untrusted environments and proposed counter solutions. We designed a blockchain-based FL aggregation protocol divided into seven stages and implemented the protocol in the Ethereum smart contract.

Through our evaluation results, we have shown that the leveling system can determine the longevity of a client, and the credit score evaluation could adapt to the history of client activities. Based on those two metrics, the proposed deposit mechanism could differentiate between trusted and untrusted clients by imposing different deposit requirements. Untrusted clients may have to pay deposits up to three times higher than trusted clients. Furthermore, the proposed contribution assessment algorithm could detect malicious workers and reviewers to ensure the quality of training models. Proven malicious behavior will result in workers and reviewers losing 100% of their reward for a given FL task. Regarding scalability, we could complete up to 12 times more FL tasks in Klaytn than in Mainnet and about 2.25 times more in Kovan than in Mainnet; however, this performance would decrease rapidly as more clients join the FL tasks, because the number of required transactions grows quadratically with the number of clients (Table 3).

While our protocol is robust and fair, it creates a communication bottleneck in which clients need to send multiple transactions to the blockchain because of our peer-reviewed model. An improvement can be made such that instead of mandating all trained models to be reviewed, we can select to only review models from clients with low reputation scores; therefore, future works are required to find a good balance between robustness and performance. Furthermore, malicious clients can still collude outside of the blockchain. They may share their training or evaluation results off-chain, bypassing the commitment hash scheme we propose. Our system cannot defend against this attack because it is performed outside our system. For future work, we can combine the concept of homomorphic encryption with our proposal. All models become encrypted; thus, it becomes pointless for clients to share their trained models off-chain because they are encrypted with the system’s homomorphic encryption public key and cannot be decrypted by clients.

Author Contributions

Conceptualization, Y.E.O. and B.S.; methodology, Y.E.O.; software, Y.E.O. and B.S.; validation, Y.E.O., B.S. and S.-G.L.; formal analysis, Y.E.O.; investigation, Y.E.O. and B.S.; resources, S.-G.L.; data curation, Y.E.O. and B.S.; writing—original draft preparation, Y.E.O.; writing—review and editing, Y.E.O., B.S. and S.-G.L.; visualization, Y.E.O.; supervision, S.-G.L.; project administration, S.-G.L.; funding acquisition, S.-G.L. All authors have read and agreed to the published version of the manuscript.

Funding

This work was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant Number: 2018R1D1A1B07047601).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We want to thank the anonymous reviewers for their valuable comments.

Conflicts of Interest

The authors declare no conflict of interest.

References

1. Goddard, M. The EU General Data Protection Regulation (GDPR): European regulation that has a global impact. Int. J. Mark. Res. 2017, 59, 703–705.
2. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial Intelligence and Statistics, Ft. Lauderdale, FL, USA, 20–22 April 2017; pp. 1273–1282.
3. Nakamoto, S. Bitcoin: A Peer-to-Peer Electronic Cash System. Decentralized Bus. Rev. 2008, 21260. Available online: https://www.debr.io/article/21260.pdf (accessed on 19 June 2022).
4. Buterin, V. A Next-Generation Smart Contract and Decentralized Application Platform. White Pap. 2014, 3. Available online: https://nft2x.com/wp-content/uploads/2021/03/EthereumWP.pdf (accessed on 19 June 2022).
5. Nguyen, D.C.; Ding, M.; Pham, Q.V.; Pathirana, P.N.; Le, L.B.; Seneviratne, A.; Li, J.; Niyato, D.; Poor, H.V. Federated learning meets blockchain in edge computing: Opportunities and challenges. IEEE Internet Things J. 2021, 8, 12806–12825.
6. Li, Z.; Liu, J.; Hao, J.; Wang, H.; Xian, M. CrowdSFL: A secure crowd computing framework based on blockchain and federated learning. Electronics 2020, 9, 773.
7. Ouyang, L.; Yuan, Y.; Wang, F.Y. Learning Markets: An AI Collaboration Framework Based on Blockchain and Smart Contracts. IEEE Internet Things J. 2020.
8. Mugunthan, V.; Rahman, R.; Kagal, L. BlockFLow: An Accountable and Privacy-Preserving Solution for Federated Learning. arXiv 2020, arXiv:2007.03856.
9. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. arXiv 2019, arXiv:1912.04977.
10. Goodfellow, I.J.; Shlens, J.; Szegedy, C. Explaining and harnessing adversarial examples. arXiv 2014, arXiv:1412.6572.
11. Truffle Suite. Ganache: One Click Blockchain. Available online: https://trufflesuite.com/ganache/ (accessed on 13 June 2022).
12. Truffle Suite. Truffle: Smart Contracts Made Sweeter. Available online: https://trufflesuite.com/truffle/ (accessed on 13 June 2022).
13. ethereum.org. Gas and Fees. Available online: https://ethereum.org/en/developers/docs/gas/ (accessed on 6 June 2022).
14. Sompolinsky, Y.; Zohar, A. Secure high-rate transaction processing in bitcoin. In Proceedings of the International Conference on Financial Cryptography and Data Security, San Juan, Puerto Rico, 26–30 January 2015; pp. 507–527.
15. OpenEthereum. Aura - Authority Round - Wiki. Available online: https://openethereum.github.io/Aura (accessed on 6 June 2022).
16. Castro, M.; Liskov, B. Practical byzantine fault tolerance. In Proceedings of the OSDI, Cambridge, MA, USA, 22–25 February 1999; Volume 99, pp. 173–186.
17. Etherscan. Ethereum Average Block Time Chart. Available online: https://etherscan.io/chart/blocktime (accessed on 13 June 2022).
18. Etherscan. Kovan Testnet Explorer. Available online: https://kovan.etherscan.io/ (accessed on 13 June 2022).
19. Klaytn. Klaytn Overview. Available online: https://docs.klaytn.foundation/klaytn (accessed on 13 June 2022).
20. Kumar, S.; Dutta, S.; Chatturvedi, S.; Bhatia, M. Strategies for enhancing training and privacy in blockchain enabled federated learning. In Proceedings of the 2020 IEEE Sixth International Conference on Multimedia Big Data (BigMM), New Delhi, India, 24–26 September 2020; pp. 333–340.
21. Ramanan, P.; Nakayama, K. Baffle: Blockchain based aggregator free federated learning. In Proceedings of the 2020 IEEE International Conference on Blockchain (Blockchain), Rhodes, Greece, 2–6 November 2020; pp. 72–81.
22. Morsbach, F.J. Hardened Model Aggregation for Federated Learning Backed by Distributed Trust towards Decentralizing Federated Learning Using a Blockchain. 2020. Available online: https://www.diva-portal.org/smash/record.jsf?pid=diva2%3A1480881&dswid=890 (accessed on 19 June 2022).
23. Passerat-Palmbach, J.; Farnan, T.; Miller, R.; Gross, M.S.; Flannery, H.L.; Gleim, B. A blockchain-orchestrated federated learning architecture for healthcare consortia. arXiv 2019, arXiv:1910.12603.
24. Desai, H.B.; Ozdayi, M.S.; Kantarcioglu, M. Blockfla: Accountable federated learning via hybrid blockchain architecture. In Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy, Virtual Event, 26–28 April 2021; pp. 101–112.

Figure 1. The proposed blockchain-based FL collaboration architecture, which includes seven stages. IPFS network is used to share the models between model owners, clients, and reviewers. All FL task interaction will be recorded in the blockchain and serve as proof of collaborations.

Figure 2. The accumulated experiences on each client’s level depending on the $γ$ value.

Figure 3. The fluctuating behavior of the client’s credit score over a one-week timeframe when using $P = 2$ or $P = 10$ with $t^{expired}$ set to two days.

Figure 4. The fluctuating behavior of the client’s credit score over a one-week timeframe when setting the expiry time to one or three days with $P = 5$.

Figure 5. The amount of deposit (in million gwei) that clients need to pay depending on their current level and total credit score.

Figure 6. The distribution of model accuracy evaluation for each worker from the reviewers compared to the workers’ claimed training score.

Figure 7. The distribution of submitted evaluation score from each reviewer compared to the distance toward the true median from CalculateWorkerContribution(·).

Figure 8. The amount of reward (in million gwei) that each trainer and reviewer receives depending on their contribution.

Figure 9. The total number of FL tasks that can be completed per minute when executed on the Mainnet, Kovan, and Klaytn networks. This metric is calculated based on the block interval of each network for one round of local model aggregation.

Table 1. List of notations used in this paper.

Notation	Description
$S, O, i, j$	Smart contract, Model owner, Client, Reviewer
$l, l^{min}, C, C^{min}$	Level, Minimum level, Credit score, Minimum credit score
$τ^{id}, τ^{desc}, τ^{target}$	Task id, Task description, Task target
$M^{(0)}, M^{'}$	Initial global model, Updated local model
$d^{train}, d^{test}$	Train dataset, Test dataset
$T^{score}, T^{eval}$	Training score, Evaluation score
$x, d$	Accepted training score, Distance of evaluation score towards x
$D, β$	Total deposit, Base deposit value
$R, R^{train}, R^{eval}$	Total reward, Reward pool for workers, Reward pool for reviewers
$t^{now}, η$	Current epoch timestamp, Random string nonce
$Y^{commit}, Y^{ipfs}$	Commitment hash, IPFS hash generated from the $IPFS(\cdot)$ method
$H(J)$	Hash of payload J using the KECCAK-256 hash function
$E_{k}(J)$	A symmetric encryption of payload J with secret key k
$D_{k}(L)$	A symmetric decryption of payload L with secret key k
$PKE_{PK}(J)$	A public-key encryption of payload J using public key $PK$
$PKD_{SK}(L)$	A public-key decryption of payload L with private key $SK$
$SIGN_{SK}(J)$	Generate a public-key signature of payload J using private key $SK$
$VER_{PK}(L, S)$	Verify the digital signature S of payload L using public key $PK$
$X ‖ Y$	A concatenation of X with Y
$⋃_{n=1}^{N} X_{n}$	A concatenation of data X for all indices n
$F_{A \to B}(m)$	Transfer m amount of funds from entity A to B
$IPFS(J)$	Store payload J to the IPFS network

Table 2. The cumulative credit events added during a one-week timeframe using SaveCreditEvent(·) in Algorithm 2. These values are used to plot Figure 3 and Figure 4.

No	Credit ( $C^{c u m}$ )	Date & Time ( $t^{c u m}$ )	No	Credit ( $C^{c u m}$ )	Date & Time ( $t^{c u m}$ )
1	7	23 April 2021 6:00 AM	11	7	27 April 2021 2:00 AM
2	7	23 April 2021 10:00 AM	12	14	27 April 2021 5:00 AM
3	14	23 April 2021 2:00 PM	13	−6	27 April 2021 9:00 AM
4	14	23 April 2021 6:00 PM	14	−6	27 April 2021 5:00 PM
5	7	24 April 2021 12:00 AM	15	−19	27 April 2021 7:00 PM
6	−6	24 April 2021 8:00 AM	16	7	27 April 2021 10:00 PM
7	−6	24 April 2021 1:00 PM	17	7	28 April 2021 6:00 AM
8	−19	24 April 2021 7:00 PM	18	7	28 April 2021 9:00 PM
9	7	26 April 2021 9:00 PM	19	7	29 April 2021 10:00 AM
10	7	26 April 2021 11:00 PM	20	7	29 April 2021 4:00 PM

Table 3. List of writable methods in our smart contract. We assume a block gas limit of 30 million. n indicates the total number of clients.

Methods	Gas Usage per Tx	% Limit	# Tx per FL Task
Add Experience	44,801	0.15	n
Save Credit Event	102,944	0.34	n
Submit Training Score	42,443	0.14	n
Submit Evaluation Score	43,123	0.14	$n (n - 1)$
Submit Commitment Hash	48,234	0.16	$n + n (n - 1)$

Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Oktian, Y.E.; Stanley, B.; Lee, S.-G. Building Trusted Federated Learning on Blockchain. Symmetry 2022, 14, 1407. https://doi.org/10.3390/sym14071407

