Multidimensional Epidemiological Survey Data Aggregation Scheme Based on Personalized Local Differential Privacy

Liu, Xueyan; Liu, Qiong; Wang, Jia; Sun, Hao

doi:10.3390/sym16030294

College of Computer Science and Engineering, Northwest Normal University, Lanzhou 730070, China

* Author to whom correspondence should be addressed.

Symmetry 2024, 16(3), 294; https://doi.org/10.3390/sym16030294

Submission received: 1 February 2024 / Revised: 27 February 2024 / Accepted: 28 February 2024 / Published: 2 March 2024

(This article belongs to the Special Issue Mathematical Modeling of the Infectious Diseases and Their Controls)

Abstract

In recent years, with the rapid development of intelligent technology, information security and privacy issues have become increasingly prominent. Epidemiological survey data (ESD) research plays a vital role in understanding the laws and trends of disease transmission. However, epidemiological investigations (EI) involve a large amount of privacy-sensitive data which, once leaked, will cause serious harm to individuals and society. Collecting EI data is also a huge task. To solve these problems and meet personalized privacy protection requirements in EIs, we improve the uOUE protocol based on utility-optimized local differential privacy to improve the efficiency and accuracy of data coding. At the same time, aiming at the collection and processing of ESD, a multidimensional epidemiological survey data aggregation scheme based on uOUE is designed. By using Paillier homomorphic encryption and an identity-based signature scheme to further prevent differential attacks and achieve multidimensional data aggregation, the safe, efficient, and accurate aggregation processing of ESD is executed. Through security proof and performance comparison, it is verified that our algorithm meets the requirements of local differential privacy and unbiased estimation. The experimental evaluation results on two data sets show that the algorithm has good practicability and accuracy in ESD collection and provides reliable and effective privacy protection.

Keywords:

privacy protection; data encryption sharing; personalized local differential privacy; signcryption

1. Introduction

In recent years, various infectious diseases, such as SARS, swine flu, Ebola, novel coronavirus epidemics, and influenza virus, have had a significant impact [1,2]. During these outbreaks, EIs have emerged as a crucial measure to curb their spread. EIs typically involve tracking patients, close contacts, and potential contacts and are conducted by health departments or other organizations. An investigation identifies close contacts based on patients’ basic information and behavior patterns; then, it tracks and screens them and takes the necessary actions in such settings as hospitals, communities, and workplaces [3]. Accordingly, ensuring a timely and accurate EI is vital, with a focus on safeguarding personal privacy and information security.

In analyzing infectious disease transmission, the data used require higher privacy protection compared with general health data. These data encompass the identities, health statuses, and close contact details of epidemiological survey objects (ESOs), which are highly sensitive to data owners. Emergencies intensify the urgency and volume of epidemic investigations, which are often recorded physically, thereby heightening information exposure risks. Information leaks have severe consequences, including repeated privacy breaches, misinformed public opinions, social panic, and potential harm. An excessive or insufficient desensitization of patient information by authorities may also lead to irrelevant data disclosure. Consequently, citizens’ privacy must be prioritized during epidemic investigations, including by employing effective encryption and security measures to prevent sensitive information leaks.

1.1. Related Work

In epidemiological studies, there have been many research efforts on the encryption protection of ESD. Blumenberg C et al. [4] used the REDCap system to explore the advantages and limitations of the electronic data collection environment and solved data inconsistency through real-time reporting and field verification, which is expected to play a role in time-saving and data quality improvement. However, there are shortcomings in its method of detail description, limitation discussion, replicability, and data quality analysis. Dong E et al. [5] introduced the COVID-19 dashboard, which provides real-time epidemic data for the world, and described the data collection process in detail. However, the inconsistency and time delay of the data affects the reliability of the data and the limitations of their application. Sperber A D et al. [6] compared two data collection methods, face-to-face interviews and internet surveys, concluding that the preferences and participation of different audiences may affect the bias of the method. Moreover, these authors did not provide a detailed design and description of the data collection method. In summary, most studies only encrypt ESD themselves, ignoring the privacy protection of non-medical data, and may not be able to fully protect users’ privacy information; their methods cannot be replicated, the limitations are significant, and the data cannot be applied. For data involving correlations, such as influenza status, sending data separately may lead to increased errors and reduce data accuracy and availability. In addition, correlations between non-sensitive attributes and sensitive attributes may result in the disclosure of personally sensitive information. Therefore, when dealing with ESOs, we must consider their relevance and take privacy protection measures to ensure data security and privacy.

Differential privacy (DP) [7,8] is pivotal in balancing individual privacy and data availability, and it can also be used to resist membership inference attacks. In 2013, local differential privacy (LDP) [9,10] was proposed as a variant of DP [11]; it inherits the advantages of DP and removes the dependence on trusted third parties, thus improving the practicality of the model. LDP can add noise to all data and effectively protect data privacy. It has a wide range of applications, including machine learning, network services, data statistics, and optimization. LDP has been widely used in industry. For example, Apple uses it to protect users’ mobile phone usage data, and Google uses LDP-based components (such as randomized aggregatable privacy-preserving ordinal response, RAPPOR) to collect user behavior data. In 2017, Wang et al. [12] proposed a framework that incorporates most LDP protocols into a pure LDP protocol framework to optimize and generalize existing protocols, and compared the accuracy and communication cost of different LDP protocols. They also introduced the optimized unary encoding (OUE) protocol with higher accuracy. In addition, the study compared a variety of encoding methods and provided suggestions for selecting protocols. Histogram encoding (HE) and unary encoding (UE) required $O (d)$ communication costs, and direct encoding (DE) and local hashing (LH) required $O (\log d)$ or $O (\log n)$ communication costs ($d$ is the number of values that a user might input, and $n$ is the number of users). For all protocols except DE, the computational cost of estimating the frequency of all values was $O (n \cdot d)$. When the number of values that users might input is large, UE was an effective coding method, and its communication cost $O (d)$ was equivalent to that of HE. Therefore, when the number of values that a user may input satisfies $d > 3 e^{ε} + 2$ and $d < n$, to avoid the high computational cost of DE and LH, the OUE coding mechanism can be used; it offers improved accuracy and unbiased estimation results. Furthermore, UE, known for its simplicity and intuitiveness, is easy to implement and offers superior performance in terms of computational and communication costs. This made it a more practical choice for adoption in various applications. However, many existing LDP mechanisms overlook the varying privacy protection requirements of different data in practical scenarios. Consequently, they may increase estimation errors by overprotecting non-sensitive data. Therefore, customizing LDP for specific domains is essential. Some protocols designed for protecting private location data, such as Geo-Indistinguishability [13] and Private Drop [14], may not directly suit EI frequency estimation. In addition, some personalized DP protocols may introduce excessive noise or data cuts, affecting data quality. In 2019, to address these issues, Murakami et al. [15] introduced the utility-optimized LDP (ULDP) model to reduce the protection of non-sensitive data according to the privacy requirements of different data, thereby improving data utility while protecting the privacy of sensitive data. However, the current ULDP protocols, such as utility-optimized generalized random response (uGRR) and utility-optimized RAPPOR (uRAP), mainly rely on generalized random response (GRR) and symmetric unary encoding (SUE), and face challenges of utility and communication cost in the field of big data. In 2022, He et al. [16] proposed the utility-optimized optimized local hashing (uOLH) protocol for big data domains, which aims to achieve low communication costs and high data utility. Nonetheless, it focuses on big data and does not apply to all scenarios, and complex data hashing increases the complexity and cost of implementation. Additionally, Cao et al. [17] devised a frequency estimation mechanism conforming to the set-valued data ULDP model, offering a privacy protection solution for set data. However, its applicability may be limited when dealing with different data types, thereby increasing the complexity and cost of practical implementations. These studies aimed to enhance data utility while addressing privacy concerns. EIs present unique challenges, demanding rigorous data collection and privacy safeguards. Current LDP technology falls short, particularly in terms of data aggregation, analysis efficiency, accuracy, and compatibility with diverse data types. Hence, in-depth research is vital to develop customized protocols for EI that effectively balance data privacy and utilization.

Homomorphic encryption [18], notably the Paillier variant, facilitates calculations on encrypted data, thereby enhancing data privacy. Paillier homomorphic encryption (PHE) [19] permits encrypted data aggregation, ensuring privacy and security while enabling efficient data transmission. In epidemiological survey data collection, it safeguards privacy and also enhances transmission efficiency. This ensures that individual data remain encrypted during sensitive data aggregation and analysis, preventing the exposure of personal sensitive information. Consequently, PHE offers an efficient and dependable solution for data collaboration and privacy protection.

Given the particular nature of EI data, the LDP perturbation process must ensure that ESD do not lose information and that their privacy is fully protected. Building on existing mechanisms, we improve the OUE protocol under the ULDP model to transmit EI data set information. This avoids the uniform perturbation of all data used by traditional LDP methods, and the use of OUE coding avoids complex data hashing, which would increase the complexity and cost of implementation. As a result, lower communication costs and higher data utility can be achieved in the EI data domain.

Therefore, based on the OUE protocol [12], we improve the utility-optimized OUE (uOUE) protocol that conforms to the ULDP model, aiming to further improve the efficiency and accuracy of data coding. A ULDP scheme based on the uOUE protocol is designed. The uOUE protocol is used to deal with situations in which ESO data contain both sensitive values and non-sensitive values to ensure the security and privacy of data, improve the accuracy of ESD frequency estimation results, and avoid the privacy risks posed to patients’ original data during transmission. In addition, the scheme also uses PHE and identity-based signature schemes to protect data from differential attacks. PHE can aggregate encrypted data in a ciphertext state to ensure data security and privacy. The identity-based signature scheme can digitally sign the user’s ESD to ensure data integrity and source credibility.

Our main contributions are as follows:

For users’ personalized privacy protection requirements, we improve the uOUE protocol that conforms to the ULDP model based on the OUE mechanism. By proving that the uOUE protocol satisfies the ULDP model and calculating the theoretical variance of the frequency estimation results, the proposed protocol has been deemed to have a low communication cost and high data utility;
Considering the collection and processing of data in the EI scenario, we design a multidimensional ESD aggregation scheme based on PLDP. This scheme ensures the security and integrity of personal privacy data while maintaining the availability of data and achieves the secure, efficient, and accurate aggregation of ESD;
Through the comparative analysis of mean square error (MSE) and communication cost with the other five LDP protocols, as well as the experimental results on two data sets, our scheme shows higher practicability and performance in ESD aggregation. In terms of multidimensional data aggregation, the scheme shows strong computing performance and more comprehensive functions and cleverly balances computing efficiency and privacy protection in the ESD aggregation scenario, providing a valuable practical solution for this field.

1.2. Organization

The rest of this paper is organized as follows: Section 2 introduces some preliminary knowledge; Section 3 gives the specific content of the uOUE mechanism; in Section 4, considering the personalized privacy requirements of users in the EI scenario, an ESD aggregation scheme based on uOUE is designed; Section 5 gives the theoretical proof and comparative analysis of the scheme; and finally, Section 6 contains a summary of our results.

2. Preliminary Knowledge

In this section, we will briefly introduce the concepts of utility-optimized local differential privacy (ULDP) and Paillier homomorphic encryption (PHE). Finally, we further introduce the utility evaluation mechanism MSE used in this paper.

2.1. Utility Optimization LDP

ULDP [15] divides the original data set $X$ into a sensitive data set $X_{S}$ and a non-sensitive data set $X_{N}$ and divides the output set into a protected data set $Y_{S}$ and a reversible data set $Y_{N}$. Its formal definition is as follows.

Definition 1. ($(X_{S}, Y_{S}, ε)$-ULDP). Given $ε \geq 0$, a perturbation mechanism $M : X \to Y$ with input domain $X$ and output domain $Y$ satisfies $(X_{S}, Y_{S}, ε)$-ULDP if and only if it satisfies the following properties.

(1): For any $y \in Y_{N}$ , there is exactly one input $x_{1} \in X_{N}$ such that

$\Pr [M (x_{1}) = y] > 0$

(1)

and, for any $x_{2} \neq x_{1}$ ,

$\Pr [M (x_{2}) = y] = 0$

(2)

(2): For any inputs $x_{1}, x_{2} \in X$ and any output $y \in Y_{S}$ ,

$\Pr [M (x_{1}) = y] \leq e^{ε} \Pr [M (x_{2}) = y]$

(3)

$ε$-LDP guarantees that no attacker can infer the exact original input from the output, and when the privacy budget $ε$ approaches 0, all data in $X$ produce the same output with almost the same probability. Since the privacy budget $ε$ controls the degree of privacy protection in LDP, the smaller (or larger) its value, the stronger (or weaker) the privacy guarantee.
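The two properties of Definition 1 can be checked numerically for a mechanism over small finite domains. The following Python sketch is illustrative only: the function name `satisfies_uldp`, the matrix representation `P[x][y]` for $\Pr [M (x) = y]$, and the two-value example mechanism are assumptions introduced here and are not part of the original scheme. For brevity, the check of property (1) only counts the inputs that can produce a reversible output and does not verify that this input is non-sensitive.

```python
import math

def satisfies_uldp(P, protected_outputs, eps, tol=1e-12):
    """Check the two ULDP properties for a matrix P[x][y] = Pr[M(x) = y]."""
    bound = math.exp(eps)
    num_outputs = len(P[0])
    for y in range(num_outputs):
        column = [row[y] for row in P]
        if y in protected_outputs:
            # Property (2): for a protected output, any two inputs have
            # probabilities within a factor of e^eps of each other.
            if any(a > bound * b + tol for a in column for b in column):
                return False
        else:
            # Property (1): a reversible output is produced by exactly one input.
            if sum(1 for p in column if p > 0) != 1:
                return False
    return True

eps = 1.0
# Hypothetical toy mechanism: input 0 is sensitive, input 1 is non-sensitive;
# output 0 is protected, output 1 is reversible.
P = [
    [1.0, 0.0],
    [math.exp(-eps), 1.0 - math.exp(-eps)],
]
print(satisfies_uldp(P, protected_outputs={0}, eps=1.0))
print(satisfies_uldp(P, protected_outputs={0}, eps=0.5))
```

The same matrix passes the check for $ε = 1$ but fails for the smaller budget $ε = 0.5$, which reflects the statement that a smaller $ε$ is a stronger requirement.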

2.2. PHE Algorithm

PHE [19] is widely used in many privacy-preserving data aggregation schemes. Suppose $E (\cdot)$ is an encryption function, $K$ is an encryption key, and $a$ and $b$ are two arbitrary plaintext messages. The additive homomorphism of the PHE algorithm is as follows: $E_{K} (a) \cdot E_{K} (b) = E_{K} (a + b)$. The PHE algorithm consists of three parts: key generation, encryption, and decryption. The detailed process is as follows:

Key generation: randomly select two large primes $p$ and $q$ and calculate $N = p q$ and $λ = \mathrm{lcm} (p - 1, q - 1)$ ; then select a generator $g \in Z_{N^{2}}^{*}$ such that $\gcd (L (g^{λ} \mod N^{2}), N) = 1$ , where $L (u) = (u - 1) / N$ , and compute $μ = {(L (g^{λ} \mod N^{2}))}^{- 1} \mod N$ . Obtain the public key $P K = (N, g)$ and private key $S K = (λ, μ)$ ;
Encryption: for any plaintext $m \in Z_{N}$ , select a random number $r \in Z_{N}^{*}$ , that is, $\gcd (r, N) = 1$ . Encrypt to obtain the ciphertext $c = g^{m} r^{N} \mod N^{2}$ ;
Decryption: the plaintext is obtained by the formula $m = L (c^{λ} \mod N^{2}) μ \mod N$ .
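The three parts above can be sketched in Python as follows. This is a toy illustration under stated assumptions, not a secure implementation: the primes 1019 and 1031 are far too small for real use, the generator is fixed to $g = N + 1$ (one common valid choice), and the function names are introduced here for illustration. The modular inverse via `pow(x, -1, N)` requires Python 3.8 or later.

```python
import math
import secrets

def L(u, N):
    # L(u) = (u - 1) / N, an exact integer division for the values used here.
    return (u - 1) // N

def keygen(p, q):
    N = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = N + 1  # assumption: a simple generator satisfying the gcd condition
    mu = pow(L(pow(g, lam, N * N), N), -1, N)
    return (N, g), (lam, mu)

def encrypt(pk, m):
    N, g = pk
    while True:
        r = secrets.randbelow(N - 1) + 1  # r in [1, N - 1]
        if math.gcd(r, N) == 1:
            break
    return pow(g, m, N * N) * pow(r, N, N * N) % (N * N)

def decrypt(pk, sk, c):
    N, _ = pk
    lam, mu = sk
    return L(pow(c, lam, N * N), N) * mu % N

pk, sk = keygen(1019, 1031)  # toy primes, for illustration only
c_a = encrypt(pk, 20)
c_b = encrypt(pk, 22)
# Additive homomorphism: multiplying ciphertexts adds the plaintexts.
c_sum = c_a * c_b % (pk[0] ** 2)
print(decrypt(pk, sk, c_sum) == 20 + 22)
```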

2.3. Utility Evaluation

The MSE is used to evaluate the effectiveness of a protocol and experiment. It measures how far the estimated data deviate from the real data: the smaller the MSE, the more accurately the estimates describe the experimental data. The formal definition of the mean square error is shown in (4):

$\mathrm{MSE} (\hat{F}) = E [\sum_{x \in X} {(F_{x} - {\hat{F}}_{x})}^{2}]$

(4)

where $F_{x}$ represents the real frequency and ${\hat{F}}_{x}$ represents the estimated frequency of value $x$.
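As a minimal sketch of how (4) can be evaluated empirically, the expectation can be approximated by averaging the summed squared error over repeated runs. The function names and the dictionary representation of frequencies below are assumptions made for illustration.

```python
def squared_error(true_freq, est_freq):
    """Sum of squared differences between real and estimated frequencies."""
    return sum((true_freq[x] - est_freq[x]) ** 2 for x in true_freq)

def empirical_mse(true_freq, estimates):
    """Average the squared error over several independent estimation runs."""
    return sum(squared_error(true_freq, est) for est in estimates) / len(estimates)

true_freq = {"a": 0.5, "b": 0.5}
runs = [{"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}]
print(empirical_mse(true_freq, runs))
```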

3. uOUE Mechanism

In this section, we propose a uOUE protocol based on the OUE protocol, which conforms to the ULDP model and protects the privacy of categorical data. Different from the traditional OUE protocol, the uOUE protocol considers the privacy budget and introduces the ULDP mechanism to reduce communication costs and effectively reduce the risk of information leakage while maintaining data accuracy.

3.1. Introduction to uOUE

The uOUE protocol encodes values as binary vectors, matching each data item to candidate values. Privacy perturbation uses vector operations to maintain original data attributes. It treats sensitive and non-sensitive sets differently, applying distinct privacy protection methods. Sensitive data undergo random response interference, and individual candidate values of non-sensitive data are disturbed. Frequency estimation enhances accuracy and utility. This method reduces non-sensitive data protection, improves frequency estimation accuracy, maintains privacy, and enhances data availability.

3.2. Mechanism Description

Participants in the uOUE protocol include three parties: users, servers, and data users. Users hold the original data, encode and perturb them, and send them to the server. The server aggregates and statistically analyzes the perturbation data of all users, estimates the frequency distribution results of all the original data, and finally sends the results to the corresponding data users.

The original data set is recorded as $X = \{x_{1}, x_{2}, \dots, x_{|X|}\}$ ; the dimension size is $d = |X|$. The original data set is divided into two parts: the sensitive data set $X_{S}$ and the non-sensitive data set $X_{N}$. The two do not intersect, that is, $|X| = |X_{S}| + |X_{N}|$.

The uOUE scheme is divided into three steps: encoding, perturbation, and aggregation. The specific steps are as follows:

uOUE encoding

In uOUE, the data are first UE-encoded, that is, the categorical data held by the user are encoded as a $d$-bit vector $v$. Each bit corresponds to one value in the original data domain. If the user data contain value $k$, the $k$-th bit of $v$ is set to 1, and all other bits are 0.

Assume that there are $n$ users, and each user holds raw data $x$. To reduce the subsequent communication cost, UE is used to encode it, and the encoding result $v = Encode (x)$ is obtained. Suppose $\{x_{1}, x_{2}, \dots, x_{|X_{S}|}\}$ is the sensitive data $X_{S}$ and $\{x_{|X_{S}| + 1}, x_{|X_{S}| + 2}, \dots, x_{|X|}\}$ is the non-sensitive data $X_{N}$ ; then the sensitive (protected) output is $Y_{S} = \{(y_{1}, y_{2}, \dots, y_{|X|}) | y_{1}, y_{2}, \dots, y_{|X|} \in \{0, 1\}\}$ and the reversible output is $Y_{N} = X_{N}$. When $x$ is sensitive data, it is encoded as data in $\{x_{1}, x_{2}, \dots, x_{|X_{S}|}\}$ , and when $x$ is non-sensitive data, its encoding result is data in $\{x_{|X_{S}| + 1}, x_{|X_{S}| + 2}, \dots, x_{|X|}\}$.
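The encoding step can be sketched as follows. This is an illustrative sketch only: the function name `encode` and the convention that the domain list places the sensitive values first are assumptions, and the region names are taken from the travel destination example used later in the paper.

```python
def encode(x, domain):
    """Unary-encode value x as a d-bit vector over an ordered domain.

    Assumption: the first |X_S| entries of `domain` are the sensitive values
    and the remaining entries are the non-sensitive values.
    """
    if x not in domain:
        raise ValueError("value is not in the data domain")
    return [1 if value == x else 0 for value in domain]

# Hypothetical domain: two sensitive values followed by two non-sensitive values.
domain = ["Beijing", "Shanghai", "Guangxi", "Hubei"]
print(encode("Guangxi", domain))
```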

uOUE perturbation method

Users encode and perturb their original data locally to generate the perturbed data $v^{'}$. Depending on the sensitivity of $x$, $Perturb (v)$ applies different perturbation methods. Specifically, each bit $v_{k}$ in $v$ is perturbed to obtain ${v_{k}}^{'}$ , as shown in (5). The perturbed data $v^{'}$ are then sent to the server for aggregation and analysis.

$\Pr [{v_{k}}^{'} | v_{k}] = \begin{cases} α, & v_{k} = 1, {v_{k}}^{'} = 1; v_{k} \in X_{S} \\ 1 - α, & v_{k} = 1, {v_{k}}^{'} = 0; v_{k} \in X_{S} \\ β, & v_{k} = 0, {v_{k}}^{'} = 1; v_{k} \in X_{S} \\ 1 - β, & v_{k} = 0, {v_{k}}^{'} = 0; v_{k} \in X_{S} \\ γ, & v_{k} = 1, {v_{k}}^{'} = 1; v_{k} \in X_{N} \\ 1 - γ, & v_{k} = 1, {v_{k}}^{'} = 0; v_{k} \in X_{N} \\ 1, & v_{k} = 0, {v_{k}}^{'} = 0; v_{k} \in X_{N} \\ 0, & v_{k} = 0, {v_{k}}^{'} = 1; v_{k} \in X_{N} \end{cases}$

(5)

For sensitive data $x$, a bit equal to 1 remains unchanged with probability $α = \frac{1}{2}$ , and a bit equal to 0 is flipped to 1 with probability $β = \frac{1}{1 + e^{ε}}$ ; for non-sensitive data $x$, a bit equal to 1 remains unchanged with probability $γ = \frac{e^{ε} - 1}{2 e^{ε}}$ , and a bit equal to 0 is never flipped. If $v_{k} \in X_{N}$ and ${v_{k}}^{'} = 1$ , then the reversible data $v_{k}$ are output, that is, we can use ${v_{k}}^{'}$ to represent $(y_{1}, y_{2}, \dots, y_{t}, t \leq |X|)$ directly. Perturbation examples are shown in Figure 1 and Figure 2. After the perturbation is completed, the perturbed data are sent to the server.
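A minimal Python sketch of the bitwise perturbation in (5) is given below. The function name `perturb`, the `num_sensitive` argument (which assumes that the first `num_sensitive` bit positions correspond to $X_{S}$), and the optional `rng` argument are assumptions introduced for illustration.

```python
import math
import random

def perturb(v, num_sensitive, eps, rng=random):
    """Perturb a unary-encoded vector bit by bit following Equation (5)."""
    alpha = 0.5
    beta = 1.0 / (1.0 + math.exp(eps))
    gamma = (math.exp(eps) - 1.0) / (2.0 * math.exp(eps))
    perturbed = []
    for k, bit in enumerate(v):
        if k < num_sensitive:
            # Sensitive position: keep a 1 with probability alpha,
            # flip a 0 to 1 with probability beta.
            p_one = alpha if bit == 1 else beta
        else:
            # Non-sensitive position: keep a 1 with probability gamma,
            # a 0 always stays 0.
            p_one = gamma if bit == 1 else 0.0
        perturbed.append(1 if rng.random() < p_one else 0)
    return perturbed

# A user holding the third value (non-sensitive) in a domain of four values.
print(perturb([0, 0, 1, 0], num_sensitive=2, eps=1.0))
```

Because a non-sensitive 0 bit is never set to 1, a 1 observed at a non-sensitive position can only come from a user who actually holds that value, which is what makes the non-sensitive output reversible.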

uOUE aggregation

After the server receives the perturbed data sent by the users, it counts, for each bit position, how many of the received vectors $v^{'}$ have that bit set to 1. Suppose that the number of occurrences of 1 in the bit corresponding to $x$ is $F_{x}$ ; from these counts, the server can estimate the frequency of each original data item $x$, and the statistical analysis result ${\hat{F}}_{x}$ , which is close to the frequency distribution of the original data, is as follows:

${\hat{F}}_{x} = \begin{cases} \frac{F_{x} / n - β}{α - β}, & \text{if } x \in X_{S} \\ \frac{F_{x}}{n γ}, & \text{if } x \in X_{N} \end{cases}$

(6)
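The estimator in (6) can be sketched as below; the function name `estimate` and the `num_sensitive` convention (sensitive positions first) are assumptions for illustration. The short check that follows feeds in the expected counts for a hypothetical population in which 30% of users hold a sensitive value and 70% hold a non-sensitive value, and the estimator recovers these frequencies, which is consistent with the unbiasedness claimed for the protocol.

```python
import math

def estimate(counts, n, num_sensitive, eps):
    """Turn per-position counts of 1s into frequency estimates via Equation (6)."""
    alpha = 0.5
    beta = 1.0 / (1.0 + math.exp(eps))
    gamma = (math.exp(eps) - 1.0) / (2.0 * math.exp(eps))
    estimates = []
    for k, count in enumerate(counts):
        if k < num_sensitive:
            estimates.append((count / n - beta) / (alpha - beta))
        else:
            estimates.append(count / (n * gamma))
    return estimates

eps, n = 1.0, 1000
beta = 1.0 / (1.0 + math.exp(eps))
gamma = (math.exp(eps) - 1.0) / (2.0 * math.exp(eps))
# Expected counts of 1s when the true frequencies are 0.3 (sensitive) and 0.7 (non-sensitive).
expected_counts = [n * (0.3 * 0.5 + 0.7 * beta), n * 0.7 * gamma]
print(estimate(expected_counts, n, num_sensitive=1, eps=eps))
```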

4. ESD Aggregation Scheme Based on uOUE

ESD sensitivity analysis involves sensitive and non-sensitive attributes. For example, in an epidemiological record, attributes like {gender, age, symptoms, allergic drugs, chronic diseases} include sensitive (allergic drugs) and non-sensitive attributes (gender and chronic diseases). Sensitive attributes also have sensitive and non-sensitive candidate values. For instance, when examining regional attributes, an individual’s travel destinations might include {Beijing, Shanghai, Guangxi, Hubei}, where Beijing and Shanghai are sensitive candidate values, and the others are not.

To enhance user data privacy, we designed an aggregation scheme based on the uOUE mechanism outlined in Section 3. This approach improves data utility by reducing non-sensitive data protection. It also introduces PHE and BLS-based short signatures [20] to boost data security. PHE enables the Epidemiological Data Control Center (EDCC) to merge encrypted data from multiple ESOs without decryption, ensuring strong data privacy and security while enhancing transmission efficiency. Meanwhile, the BLS-based short signature scheme offers efficient signing and verification, reducing communication and storage costs while maintaining high security and anonymity. This facilitates the efficient collection and aggregation of ESD while safeguarding authentication and data transmission privacy.

4.1. Scheme Model

This scheme is mainly composed of ESOs, epidemiological survey workers (ESWs), and the three-tier structure of the EDCC. The specific scheme model is shown in Figure 3.

ESOs $(U_{i}, i = 1, 2, \dots, n)$ : individuals or institutions in the EI survey that send locally perturbed, encrypted, and signed data to ESWs. To protect ESO privacy, we apply the uOUE mechanism from Section 3. This involves mapping ESD to binary vectors with UE encoding and perturbing them to obtain perturbed data. ESOs then sign the perturbed data with identity-based signatures, ensuring data confidentiality and integrity;
ESW $(E S W_{h}, h = 1, 2, \dots, z)$ : a personal, system, or private cloud that pre-aggregates ESD and receives perturbed and signed data from ESOs. $E S W_{h}$ aggregates these data after verifying signatures, employs the PHE algorithm, and forwards the report to the data control center, streamlining interactions and communication with the EDCC;
EDCC $(E D C C)$ : the EDCC is central to the ESD aggregation scheme, acting as the aggregator. It possesses a pair of public and private keys for homomorphic encryption and semantic security, as well as a pair of identity-based public and private keys. Its responsibilities include generating public and private key pairs for ESOs and ESWs, as well as managing the transmission and verification of their data. To facilitate the separation of aggregated ESD, the EDCC constructs a super-increasing sequence for this purpose.

4.2. Scheme Contents

4.2.1. Scheme Initialization

EI user $U_{i} (i = 1, 2, \dots, n)$ owns the attribute set $M = (m_{1}, m_{2}, \dots, m_{l})$ , with attributes $m_{j} (j = 1, 2, \dots, l)$ . For each attribute $m_{j}$ , user $U_{i}$ has a $d$ -dimensional candidate value set $\{m_{j 1}, m_{j 2}, \dots, m_{j d}\}$ , where $m_{j k} (k = 1, 2, \dots, d)$ is a candidate value of the attribute $m_{j}$ . Each $m_{j k}$ is 1 or 0: 1 indicates that the user has the candidate value, 0 indicates that the user does not have it, and $k$ indicates the position corresponding to the candidate value;
The EDCC selects security parameter $k$ and two primes $p, q$ , and calculates $N = p q$ and $λ = \mathrm{lcm} (p - 1, q - 1)$ , where $p = 2 \overset{⌢}{p} + 1$ , $q = 2 \overset{⌢}{q} + 1$ , $| p | = | q | = k$ ; $\overset{⌢}{p}$ and $\overset{⌢}{q}$ are also two primes. Then, it selects generator $g \in Z_{N^{2}}^{*}$ , defines function $L (x) = (x - 1) / N$ , and calculates the public key $(N = p q, g)$ and private key $(λ, μ)$ of the PHE algorithm, where $μ = {(L (g^{λ} \mod N^{2}))}^{- 1} \mod N$ ;
The EDCC generates a super-increasing sequence $\{a_{1}, a_{2}, \dots, a_{l}\}$ , $a_{j} \in Z_{N^{2}}^{*}$ , which satisfies $a_{j} > \sum_{k = 1}^{j - 1} a_{k} (j = 2, 3, \dots, l)$ ; the length of $a_{j}$ is $|a_{j}| \geq k$ , $g_{j} = g^{a_{j}}$ , $j = 1, 2, \dots, l$ ;
$G_{1}$ and $G_{2}$ are cyclic multiplicative groups with the same prime order $q_{1}$ , where $G_{1}$ is generated by $P$ , and $e : G_{1} \times G_{1} \to G_{2}$ is a bilinear map. The EDCC randomly selects a scheme private key $s \in Z_{q_{1}}^{*}$ and calculates the scheme public key $P_{p u b} = P^{s}$ ;
The EDCC selects three hash functions: $H_{1}, H_{2} : {\{0, 1\}}^{*} \to G_{1}$ , and $H_{3}$ ( $H_{3}$ is SHA-256 hash algorithm);
The EDCC publishes the scheme parameters, as follows:

$S P = \{M, N, g, P, q_{1}, e, P_{p u b}, H_{1}, H_{2}, H_{3}, n, h, (g^{a_{1}}, g^{a_{2}}, \dots, g^{a_{l}})\}$
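The super-increasing sequence used in the initialization can be generated as in the following sketch. The function name and parameters are assumptions for illustration. With the default `bound=1`, the sketch enforces the condition stated above, $a_{j} > \sum_{k = 1}^{j - 1} a_{k}$ ; the optional `bound` argument is an added assumption that widens the gap, which the later separation of aggregated values needs whenever a packed total can exceed 1. The sketch does not check membership in $Z_{N^{2}}^{*}$ .

```python
import secrets

def super_increasing(length, min_bits, bound=1):
    """Generate a_1, ..., a_l with a_j > bound * (a_1 + ... + a_{j-1}).

    Every element has at least `min_bits` bits.
    """
    sequence = []
    prefix_sum = 0
    for _ in range(length):
        floor = max(bound * prefix_sum, 1 << (min_bits - 1))
        a_j = floor + 1 + secrets.randbelow(1 << min_bits)
        sequence.append(a_j)
        prefix_sum += a_j
    return sequence

seq = super_increasing(length=5, min_bits=16)
print(all(seq[j] > sum(seq[:j]) for j in range(1, len(seq))))
```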

4.2.2. Public–Private Key Pair Generation

ESO $U_{i} (i = 1, 2, \dots, n)$ registration, managed by the EDCC, yields a user’s pseudonym and generates their public and private keys using the BLS short signature, as follows:

$U_{i}$ obtains the current timestamp $T_{i}$ , calculates the hash value $H_{3}^{i} = H_{3} (i d_{i}^{U} ‖T_{i})$ using their own real identity $i d_{i}^{U}$ , and sends a registration request $\{i d_{i}^{U}, T_{i}, H_{3}^{i}\}$ to the EDCC;
After receiving the user’s registration request, the EDCC checks whether $H_{3}^{i} = H_{3} (i d_{i}^{U} ‖T_{i})$ is established. If yes, the EDCC computes pseudonym $P S$ for user $U_{i}$ based on their real identity $i d_{i}^{U}$ : the EDCC randomly selects $y_{i} \in Z_{q_{1}}^{*}$ , computes $P P_{i} = P^{s + y_{i}}$ , $P S_{i} = H_{2} (P P_{i} ‖i d_{i}^{U} ‖T_{i})$ , and returns pseudonyms $P S_{i}$ and $P Y_{i} = P^{y_{i}}$ to $U_{i}$ ;
$U_{i}$ verifies pseudonym $P S_{i}$ : ${P S}_{i}^{'} \overset{?}{=} P S_{i}$ . If equal, use the pseudonym. (7) proves the process.

$\begin{aligned} {P S}_{i}^{'} &= H_{2} ((P_{p u b} \cdot P Y_{i}) ‖i d_{i}^{U} ‖T_{i}) \\ &= H_{2} ((P^{s} \cdot P^{y_{i}}) ‖i d_{i}^{U} ‖T_{i}) \\ &= H_{2} (P P_{i} ‖i d_{i}^{U} ‖T_{i}) = P S_{i} \end{aligned}$

(7)
ESO selects a random number $s_{i}^{U} \in Z_{q_{1}}^{*}$ as its private key and publishes its corresponding public key ${P K}_{i}^{U} = P^{s_{i}^{U}}$ ;
$E S W_{h}$ also selects private key $s_{h}^{E S W} \in Z_{q_{1}}^{*}$ and computes public key $P K_{h}^{E S W} = P^{s_{h}^{E S W}}$ .
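The pseudonym check in (7) relies only on the identity $P^{s} \cdot P^{y_{i}} = P^{s + y_{i}}$ . The following sketch illustrates this with strong simplifications that are assumptions of the example rather than part of the scheme: the pairing group $G_{1}$ is replaced by the multiplicative group modulo the Mersenne prime $2^{127} - 1$ , the base $P = 3$ is arbitrary, and $H_{2}$ is replaced by SHA-256 over a delimited string. It is not a BLS implementation and offers no security.

```python
import hashlib
import secrets

Q = 2 ** 127 - 1  # toy prime modulus standing in for the group G_1
P = 3             # toy base element standing in for the generator P

def h2(*parts):
    """Stand-in for H_2: hash the concatenated parts with SHA-256."""
    data = "|".join(str(part) for part in parts).encode()
    return hashlib.sha256(data).hexdigest()

def issue_pseudonym(s, identity, timestamp):
    """EDCC side: pick y_i, compute PP_i = P^(s + y_i), return (PS_i, PY_i)."""
    y = secrets.randbelow(Q - 2) + 1
    pp = pow(P, s + y, Q)
    return h2(pp, identity, timestamp), pow(P, y, Q)

def verify_pseudonym(p_pub, ps, py, identity, timestamp):
    """User side: recompute PS_i' from P_pub * PY_i and compare with PS_i."""
    return h2(p_pub * py % Q, identity, timestamp) == ps

s = 123456789          # hypothetical scheme private key
p_pub = pow(P, s, Q)   # scheme public key P_pub = P^s
ps, py = issue_pseudonym(s, "user-001", 1700000000)
print(verify_pseudonym(p_pub, ps, py, "user-001", 1700000000))
```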

4.2.3. uOUE Perturbation

In this stage, the coverage area of $E S W_{h}$ has $n (0 \leq n \leq η)$ ESOs, each of which has EI information. $U_{i}$ fills in $x_{i} = (m_{1}^{i}, m_{2}^{i}, \dots, m_{l}^{i}), m_{j}^{i} \leq x$ , uses the uOUE protocol to encode and perturb it to generate noisy data $v_{i}^{'}$ , and transmits $v_{i}^{'}$ to $E S W_{h}$ :

$U_{i}$ fills in $x_{i}$ and converts each $m_{j}^{i}$ into a vector $v_{j}^{i} = Encode (m_{j}^{i})$ of length $d = |v|$ , as shown in (8).

$v_{j}^{i} = Encode (m_{j}^{i}), \quad v_{j k}^{i} = \begin{cases} 1, & m_{j}^{i} = k \\ 0, & m_{j}^{i} \neq k \end{cases}$

(8)
$U_{i}$ perturbs the encoded data according to (5) to obtain $v_{i}^{'}$ , that is, it perturbs each candidate value bit $v_{j k}^{i}$ according to its sensitivity to obtain ${v_{j k}^{i}}^{'} = Perturb (v_{j k}^{i})$ : for sensitive data $x$ , a bit equal to 1 remains unchanged with probability $α = \frac{1}{2}$ , and a bit equal to 0 is flipped with probability $β = \frac{1}{1 + e^{ε}}$ . $v_{i}^{'}$ is thus obtained;
$U_{i}$ computes $ϑ_{i}^{U} = H_{1} (v_{i}^{'}, P S_{i}, T_{i}^{U})$ , signs $σ_{i}^{U} = {ϑ_{i}^{U}}^{s_{i}^{U}}$ , and $T_{i}^{U}$ is the current timestamp, which can resist message replay attacks;
$U_{i}$ sends report $\{P S_{i} ‖v_{i}^{'}‖ σ_{i}^{U} ‖T_{i}^{U}\}$ to $E S W_{h}$ through a secure channel.

4.2.4. Data Pre-Processing

$E S W_{h}$ collects and processes the ESD $v_{i}^{'}$ and aggregates the data of its coverage area to obtain $C_{h}$ . The specific steps are as follows:

After $E S W_{h}$ receives the report, it checks the validity of $T_{i}^{U}$ . Let $T_{h}^{E S W}$ be the time at which the report arrives at $E S W_{h}$ , and let $Δ T$ be the allowed delay of the scheme. If $T_{h}^{E S W} - T_{i}^{U} \leq Δ T$ holds, then $T_{i}^{U}$ is valid; otherwise, the process terminates;
$E S W_{h}$ verifies the signature: $e (σ_{i}^{U}, P) \overset{?}{=} e (ϑ_{i}^{U}, P K_{i}^{U})$ . If it is equal, this indicates that the report from legal $U_{i}$ is received by $E S W_{h}$ , otherwise it is terminated. (9) proves the correctness of signature verification:

$\begin{aligned} e (σ_{i}^{U}, P) &= e ({(ϑ_{i}^{U})}^{s_{i}^{U}}, P) \\ &= e (ϑ_{i}^{U}, P^{s_{i}^{U}}) \\ &= e (ϑ_{i}^{U}, P K_{i}^{U}) \end{aligned}$

(9)
$E S W_{h}$ computes frequency statistics over the perturbed data $v_{i}^{'} (i = 1, 2, \dots, n)$ : $V_{j k}^{h} = \sum_{i = 1}^{n} {v_{j k}^{i}}^{'}$ . They are stored in the form of an array, $V_{j k}^{h} = \{V^{h} [j] [k], (k = 1, 2, \dots, d), (j = 1, 2, \dots, l)\}$ , $V_{j}^{h} = \{V^{h} [j], (j = 1, 2, \dots, l)\}$ ;
$E S W_{h}$ randomly selects $r_{h} \in Z_{N}^{*}$ and calculates the ciphertext according to (10):

$C_{h} = g_{1}^{V_{1}^{h}} g_{2}^{V_{2}^{h}} \dots g_{l}^{V_{l}^{h}} {r_{h}}^{N} \mod N^{2}$

(10)
$E S W_{h}$ calculates $ϑ_{h}^{E S W} = H_{1} (C_{h}, i d_{h}^{E S W}, T_{h}^{E S W})$ , $σ_{h}^{E S W} = {(ϑ_{h}^{E S W})}^{s_{h}^{E S W}}$ ;
$E S W_{h}$ sends reports $\{i d_{h}^{E S W} ‖C_{h} ‖σ_{h}^{E S W}‖ T_{h}^{E S W}\}$ to the EDCC through a secure channel.

4.2.5. Data Aggregation

The EDCC receives the report $\{i d_{h}^{E S W} ‖C_{h} ‖σ_{h}^{E S W}‖ T_{h}^{E S W}\}$ and checks the validity of $T_{h}^{E S W}$ . If $T_{h}^{E S W}$ is valid, the EDCC goes on to verify the signature $σ_{h}^{E S W}$ ; otherwise, a replay attack is detected and the process terminates;
The EDCC verifies $σ_{h}^{E S W}$ by checking $e (σ_{h}^{E S W}, P) \overset{?}{=} e (ϑ_{h}^{E S W}, P K_{h}^{E S W})$ . If the equality holds, the EDCC accepts the report;
The EDCC aggregates the data according to (11) and (12) to obtain ciphertext $C$ .

$C_{h} = g^{a_{1} V_{1}^{h} + a_{2} V_{2}^{h} + \dots + a_{l} V_{l}^{h}} {r_{h}}^{N} \mod N^{2}$

(11)

$\begin{aligned} C & = \prod_{h = 1}^{z} C_{h} \mod N^{2} \\ & = g^{a_{1} \sum_{h = 1}^{z} V_{1}^{h} + a_{2} \sum_{h = 1}^{z} V_{2}^{h} + \dots + a_{l} \sum_{h = 1}^{z} V_{l}^{h}} {(\prod_{h = 1}^{z} r_{h})}^{N} \mod N^{2} \\ & = g^{a_{1} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{1 k}^{h} + a_{2} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{2 k}^{h} + \dots + a_{l} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{l k}^{h}} {(\prod_{h = 1}^{z} r_{h})}^{N} \mod N^{2} \end{aligned}$

(12)
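The homomorphic step behind (11) and (12) can be illustrated with a toy Paillier implementation in Python (3.8 or later, for modular inverses via `pow`). This is a sketch under stated assumptions rather than the scheme's actual code: the generator is fixed to $g = N + 1$, the primes 1009 and 1013 are far too small for real use, the blinding values are passed in explicitly instead of being drawn at random, and all function names and the weights in the demo are hypothetical.

```python
import math


def keygen(p: int, q: int):
    """Toy Paillier key generation with g = N + 1 (assumption of this sketch)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    mu = pow(lam, -1, n)  # for g = N + 1, L(g^lam mod N^2) = lam mod N
    return (n, n + 1), (lam, mu)


def pack(weights, values) -> int:
    """Exponent of (11): a_1 * V_1 + a_2 * V_2 + ... + a_l * V_l."""
    return sum(a * v for a, v in zip(weights, values))


def encrypt(pub, m: int, r: int) -> int:
    """C = g^m * r^N mod N^2; r must be coprime to N."""
    n, g = pub
    n2 = n * n
    return pow(g, m, n2) * pow(r, n, n2) % n2


def aggregate(pub, ciphertexts) -> int:
    """Product of ciphertexts mod N^2, as in the first line of (12)."""
    n, _ = pub
    n2 = n * n
    c = 1
    for c_h in ciphertexts:
        c = c * c_h % n2
    return c


def decrypt(pub, priv, c: int) -> int:
    """W = L(C^lam mod N^2) * mu mod N with L(u) = (u - 1) / N."""
    n, _ = pub
    lam, mu = priv
    u = pow(c, lam, n * n)
    return (u - 1) // n * mu % n


if __name__ == "__main__":
    pub, priv = keygen(1009, 1013)
    weights = [1, 1000]                 # hypothetical a_1, a_2
    station_values = [[3, 5], [4, 6]]   # per-station (V_1^h, V_2^h)
    blinders = [17, 23]                 # fixed r_h for reproducibility
    cts = [encrypt(pub, pack(weights, v), r)
           for v, r in zip(station_values, blinders)]
    print(decrypt(pub, priv, aggregate(pub, cts)))  # 11007 = 7 + 1000 * 11
```

Multiplying the ciphertexts adds the exponents, so the decrypted value is the weighted sum of the per-station statistics, provided that sum stays below $N$.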

4.2.6. Data Acquisition

The EDCC decrypts and analyzes the ESD. The specific steps are as follows:

Let $S = a_{1} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{1 k}^{h} + a_{2} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{2 k}^{h} + \dots + a_{l} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{l k}^{h}$ denote the exponent of $g$ in $C$ , let $a_{j} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{j k}^{h} = U_{j k}$ $(j = 1, 2, \dots, l; k = 1, 2, \dots, d)$ and $\prod_{h = 1}^{z} r_{h} = R$ , so that $C = g^{S} R^{N} \mod N^{2}$ . The EDCC uses $(λ, μ)$ to decrypt $C$ according to (13):

$W = L (C^{λ} \mod N^{2}) μ \mod N$

(13)

Decryption yields $W = S = a_{1} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{1 k}^{h} + a_{2} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{2 k}^{h} + \dots + a_{l} \sum_{h = 1}^{z} \sum_{k = 1}^{d} V_{l k}^{h}$ . Let $ℚ_{l} = S$ , $ℚ_{j - 1} = ℚ_{j} \mod a_{j}$ , and $ℝ_{j} = (ℚ_{j} - ℚ_{j - 1}) / a_{j} = \sum_{h = 1}^{z} V_{j}^{h}$ $(j = l, l - 1, \dots, 2)$ . Then $ℝ_{1} = ℚ_{1} = \sum_{h = 1}^{z} V_{1}^{h}$ and $ℝ_{j k} = \sum_{h = 1}^{z} V_{j k}^{h}$ .
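The recursion on $ℚ_{j}$ and $ℝ_{j}$ can be traced with a short Python sketch. It assumes, consistently with $ℝ_{1} = ℚ_{1}$ , that $a_{1} = 1$ and that each $a_{j}$ is larger than the largest possible value of the lower-order part of $S$ . The function name `unpack` and the weights in the demo are hypothetical.

```python
from typing import List


def unpack(s: int, a: List[int]) -> List[int]:
    """Recover [R_1, ..., R_l] from S = a_1*R_1 + ... + a_l*R_l.

    Assumes a[0] == 1 and a[j] > a[0]*R_1 + ... + a[j-1]*R_j for every j,
    so that each remainder isolates the lower-order part.
    """
    l = len(a)
    r = [0] * l
    q = s                           # Q_l = S
    for j in range(l - 1, 0, -1):   # zero-based index j stands for a_{j+1}
        q_prev = q % a[j]           # Q_{j-1} = Q_j mod a_j
        r[j] = (q - q_prev) // a[j] # R_j = (Q_j - Q_{j-1}) / a_j
        q = q_prev
    r[0] = q                        # R_1 = Q_1
    return r


if __name__ == "__main__":
    weights = [1, 1_000, 1_000_000]
    s = 7 * 1 + 11 * 1_000 + 13 * 1_000_000
    print(unpack(s, weights))  # [7, 11, 13]
```

If any aggregated value reaches the next weight, the digits overlap and the recovered values are wrong, so the weights have to be chosen with the maximum number of reports in mind.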

2.: The EDCC performs frequency statistics on the ESD.

The candidate value frequency of $m_{j}$ is $ℝ_{j k}$ , and the frequency estimate can be calculated by the method shown in (14) and (15).

$x$ is sensitive data:

$F_{j, k} = E ({\hat{F}}_{x}) = \frac{α ℝ_{j k} + β (1 - ℝ_{j k}) - β}{α - β}$

(14)
$x$ is non-sensitive data:

$F_{j, k} = E ({\hat{F}}_{x}) = ℝ_{j k}$

(15)

5. Scheme Analysis

This section provides a theoretical and comparative analysis of the uOUE protocol, demonstrating its low communication cost and high data utility. It also includes a security proof and comparative analysis of the ESD aggregation scheme based on uOUE.

5.1. Theoretical Analysis of uOUE Protocol

This section will introduce some related properties of the uOUE protocol and give the corresponding theoretical proof.

Theorem 1.

The perturbation process of uOUE conforms to the ULDP model.

Proof of Theorem 1.

For any $v_{1}, v_{2} \in X_{m}$ , the probability of outputting the same result $A$ satisfies (16):

$\begin{aligned} \frac{\Pr [A | v_{1}]}{\Pr [A | v_{2}]} & = \frac{\prod_{k \in [d]} \Pr [A [k] | v_{1}]}{\prod_{k \in [d]} \Pr [A [k] | v_{2}]} \\ & \leq \frac{\Pr [A [v_{1}] = 1 | v_{1}] \cdot \Pr [A [v_{2}] = 0 | v_{1}]}{\Pr [A [v_{1}] = 1 | v_{2}] \cdot \Pr [A [v_{2}] = 0 | v_{2}]} \\ & = \frac{α}{β} \cdot \frac{1}{1 - γ} = e^{ε} \end{aligned}$

(16)

Inequality (16) shows that uOUE satisfies the second property of the ULDP model, stated in (3).

In uOUE, for any output $A \in Y_{N}$ , there is only one original data item $x \in X_{N}$ that can be perturbed into reversible data; that is, the reversible data are output with probability $γ$ if and only if $x \in X$ . Therefore, uOUE satisfies the properties stated in (1) and (2) of the ULDP model definition.

In summary, the perturbation process of uOUE conforms to the ULDP model. □

Theorem 2.

The result of uOUE frequency estimation is an unbiased estimation.

Proof of Theorem 2.

For original data $x$ , the estimated frequency is denoted by ${\hat{F}}_{x}$ .

If $x$ is sensitive data, it follows from the perturbation process that:

$\begin{aligned} E ({\hat{F}}_{x}) & = E (\frac{F_{x} / n - β}{α - β}) \\ & = \frac{α F_{x} + β (1 - F_{x}) - β}{α - β} = F_{x} \end{aligned}$

(17)

If $x$ is non-sensitive data, it follows from the perturbation process that:

$E ({\hat{F}}_{x}) = E (\frac{F_{x}}{n γ}) = \frac{n γ F_{x}}{n γ} = F_{x}$

(18)

In summary, the frequency estimation result of uOUE is unbiased. □
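The unbiasedness argument for sensitive data can be checked numerically. The sketch below is illustrative only: it assumes the standard OUE parameters $α = 1/2$ and $β = 1/(e^{ε} + 1)$ , which are not restated in this section, it models only the single bit of the report that corresponds to $x$ , and the function names are hypothetical.

```python
import math
import random


def oue_params(eps: float):
    """Assumed OUE parameters: alpha = 1/2, beta = 1 / (e^eps + 1)."""
    return 0.5, 1.0 / (math.exp(eps) + 1.0)


def estimate_sensitive(count: int, n: int, alpha: float, beta: float) -> float:
    """Estimator of the form (count / n - beta) / (alpha - beta)."""
    return (count / n - beta) / (alpha - beta)


def expected_estimate(f: float, alpha: float, beta: float) -> float:
    """Exact expectation: the bit is 1 w.p. alpha for holders of x, beta otherwise."""
    expected_rate = f * alpha + (1.0 - f) * beta
    return (expected_rate - beta) / (alpha - beta)


def simulate(f: float, n: int, eps: float, seed: int = 0) -> float:
    """Monte Carlo run over n users, a fraction f of whom hold x."""
    rng = random.Random(seed)
    alpha, beta = oue_params(eps)
    holders = round(f * n)
    count = sum(rng.random() < alpha for _ in range(holders))
    count += sum(rng.random() < beta for _ in range(n - holders))
    return estimate_sensitive(count, n, alpha, beta)


if __name__ == "__main__":
    alpha, beta = oue_params(1.0)
    print(expected_estimate(0.3, alpha, beta))  # equals 0.3 up to rounding
    print(simulate(0.3, 20_000, 1.0))           # close to 0.3, varies with seed
```

The exact expectation reproduces the true frequency for any $α \neq β$ , which is the content of (17); the simulation only shows that a single run scatters around that value.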

Theorem 3.

In the uOUE protocol, the mean square error of the estimated frequency ${\hat{F}}_{x}$ is given by (19):

$M S E [{\hat{F}}_{x}] = \begin{cases} \frac{4 e^{ε}}{n {(e^{ε} - 1)}^{2}} + \frac{F_{x}}{n}, & x \in X_{S} \\ \frac{e^{ε} + 1}{n (e^{ε} - 1)} F_{x}, & x \in X_{N} \end{cases}$

(19)

Proof of Theorem 3.

By Theorem 2, the estimate given by Equation (6) is unbiased, so the MSE is equal to the variance of ${\hat{F}}_{x}$ .

$x$ is sensitive: $\begin{aligned} M S E [{\hat{F}}_{x}] & = V a r [{\hat{F}}_{x}] = V a r [\frac{F_{x} / n - β}{α - β}] \\ & = \frac{n F_{x} α (1 - α) + n (1 - F_{x}) β (1 - β)}{n^{2} {(α - β)}^{2}} \\ & = \frac{4 e^{ε}}{n {(e^{ε} - 1)}^{2}} + \frac{F_{x}}{n} \end{aligned}$
$x$ is non-sensitive: $\begin{aligned} M S E [{\hat{F}}_{x}] & = V a r [{\hat{F}}_{x}] = V a r [\frac{F_{x}}{n γ}] \\ & = \frac{n F_{x} γ (1 - γ)}{n^{2} γ^{2}} = \frac{1 - γ}{n γ} F_{x} \\ & = \frac{e^{ε} + 1}{n (e^{ε} - 1)} F_{x} \end{aligned}$

□
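The two simplifications in the proof can be verified numerically. The following sketch assumes the standard OUE values $α = 1/2$ and $β = 1/(e^{ε} + 1)$ , and takes $γ = (e^{ε} - 1)/(2 e^{ε})$ , the value implied by the last equality of the non-sensitive case; these values are not restated in this section, and the function names are hypothetical.

```python
import math


def params(eps: float):
    """Assumed parameters: alpha = 1/2, beta = 1/(e^eps + 1), gamma = (e^eps - 1)/(2 e^eps)."""
    e = math.exp(eps)
    return 0.5, 1.0 / (e + 1.0), (e - 1.0) / (2.0 * e)


def var_sensitive(eps: float, n: int, f: float) -> float:
    """Variance expression from the proof, before simplification."""
    alpha, beta, _ = params(eps)
    num = n * f * alpha * (1 - alpha) + n * (1 - f) * beta * (1 - beta)
    return num / (n ** 2 * (alpha - beta) ** 2)


def mse_sensitive(eps: float, n: int, f: float) -> float:
    """Closed form from (19) for x in X_S."""
    e = math.exp(eps)
    return 4 * e / (n * (e - 1) ** 2) + f / n


def var_nonsensitive(eps: float, n: int, f: float) -> float:
    """(1 - gamma) / (n * gamma) * F_x from the proof."""
    _, _, gamma = params(eps)
    return (1 - gamma) / (n * gamma) * f


def mse_nonsensitive(eps: float, n: int, f: float) -> float:
    """Closed form from (19) for x in X_N."""
    e = math.exp(eps)
    return (e + 1) / (n * (e - 1)) * f


if __name__ == "__main__":
    for eps in (0.5, 1.0, 2.0):
        print(eps, mse_sensitive(eps, 1000, 0.2), mse_nonsensitive(eps, 1000, 0.2))
```

Both error terms shrink as $ε$ or $n$ grows, which matches the trend discussed in the experimental comparison.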

5.2. Comparative Analysis of uOUE Protocol

5.2.1. Comparison of Theoretical Results

We compare the utility of the uOUE protocol with that of traditional LDP perturbation methods (GRR [12] and RAPPOR [12]) and existing ULDP perturbation methods (uGRR [15], uRAP [15], and uOLH [16]). Using $M S E [\hat{F}] = \sum_{x \in X_{S}} M S E [{\hat{F}}_{x}] + \sum_{x \in X_{N}} M S E [{\hat{F}}_{x}]$ together with Theorem 3, we compute the MSE of the proposed and existing mechanisms when $ε = O (1)$ . The results are shown in Table 1, in which $F_{X_{N}}$ represents the actual total frequency of non-sensitive data.

In practical applications, most of the MSE comes from sensitive data, although sensitive data usually account for only part of the data set. By optimizing utility in this personalized way, a ULDP mechanism therefore achieves a significantly smaller MSE than a mechanism without utility optimization. Our improved mechanism is easy to implement in practice and performs better in terms of computational and communication cost. It protects sensitive data more precisely and reduces their impact on the overall error.

In addition to data utility, communication cost is an important criterion for evaluating a mechanism. Table 2 summarizes the communication cost of the existing protocols.

As the privacy budget increases, the error of the six protocols gradually decreases. However, in terms of data utility, uGRR is significantly behind the other two protocols. This difference is mainly due to the use of large data sets in the original data domain in the experiment, which poses a challenge to the adaptability of uGRR. For a wide range of raw data domains, uOLH shows superior communication cost performance. In practical scenarios, especially when the original data domain is moderate, the uOUE protocol is superior in communication overhead. Considering the unique application scenarios and data characteristics of ESD, the uOUE protocol effectively meets the needs of actual scenarios while ensuring privacy protection.

5.2.2. Comparison of Experimental Results

Experimental Settings:

Our experimental environment is set as follows: the operating system is Windows 10, the processor is an Intel i7-1165G7, the memory is 16.0 GB, and the development environment is PyCharm 2021.3.

We conducted experiments on two data sets: the COVID-19 data set [21,22,23] and the SARS virus data set [24]. Their relevant parameter settings are also given in Table 3.

2.: The effect of $ε$ on MSE

In this subsection, we compare the four mechanisms under different privacy budgets, as shown in Figure 4 and Figure 5.

When the data domain size is $d = 256$ , the MSE of the four mechanisms decreases gradually as the privacy budget $ε$ increases. A higher privacy budget leads to more accurate frequency estimation and improved data utility but weaker privacy protection. This implies that, within a given privacy budget, we must strive for precise statistical results while preserving data privacy. Balancing data utility and privacy protection is essential.

In the experiment, we observed that data utility with the uGRR mechanism was notably lower than with the other three mechanisms. This occurred because the experiment used relatively large data domains, leading to a higher likelihood of data perturbation in uGRR, which reduced frequency estimation accuracy, in line with GRR’s characteristics.

From the chart, it is evident that uOLH excels with particularly large data domains. However, for mid-sized data domains, the enhanced uOUE mechanism proves superior in practical applications. It offers higher practicality and performance for the ESD aggregation scheme.

3.: The effect of $d$ on MSE

The data domain size $d$ also has a certain impact on data utility. Since the data domain of a real data set is fixed, this section evaluates simulated data sets of different sizes. In the experiment, $d$ ranges over $\{16, 32, 64, \dots, 1024\}$ , the proportion of sensitive data is 0.5, and the privacy budget is $ε = 1$ . The results are shown in Figure 6.
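The trend that Theorem 3 predicts for this setting can be computed directly, independently of the experiment. The sketch below sums the per-item MSE of (19) over the domain. It is a theoretical illustration only and does not reproduce Figure 6: it assumes a uniform true distribution ( $F_{x} = 1/d$ for every $x$ ), interprets the proportion 0.5 as the fraction of domain values that are sensitive, and uses $n = 100{,}000$ ; the function name `total_mse` is hypothetical.

```python
import math


def total_mse(d: int, eps: float, n: int, sensitive_ratio: float = 0.5) -> float:
    """Sum of the per-item MSE in (19) over a domain of size d.

    Assumes a uniform true distribution (F_x = 1/d), which is an
    illustrative choice and not a setting taken from the paper.
    """
    e = math.exp(eps)
    n_sensitive = int(d * sensitive_ratio)
    n_nonsensitive = d - n_sensitive
    f_item = 1.0 / d
    mse_s = n_sensitive * (4 * e / (n * (e - 1) ** 2) + f_item / n)
    mse_n = n_nonsensitive * ((e + 1) / (n * (e - 1)) * f_item)
    return mse_s + mse_n


if __name__ == "__main__":
    for d in (16, 32, 64, 128, 256, 512, 1024):
        print(d, total_mse(d, eps=1.0, n=100_000))
```

Under these assumptions the non-sensitive contribution stays constant while the sensitive contribution grows linearly in the number of sensitive values, so the total error increases with $d$ .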

5.3. Security Proof and Analysis

The security of our scheme is based on the BLS signature. Under the Computational Diffie–Hellman (CDH) assumption and random oracle model, the ESD aggregation scheme based on uOUE is unforgeable under an adaptive chosen message attack.

Theorem 4.

If the CDH problem on $G_{1}$ is hard, then the ESD aggregation scheme based on uOUE achieves Existential Unforgeability Against Adaptive Chosen Message Attacks (EUF-CMA).

Proof of Theorem 4.

Let $H_{1}$ be a random oracle. The adversary $B$ is given $(P, P K_{A} = P^{a}, ϑ)$ , uses $A$ (an attacker on the BLS short signature scheme) as a subroutine, and aims to compute $ϑ^{a}$ . Suppose that ① $A$ never issues the same query to the random oracle twice; ② if $A$ requests a signature on message $\bar{M}$ , it has previously queried $H_{1} (\bar{M})$ ; ③ if $A$ outputs $(\bar{M}, σ_{A})$ , it has previously queried $H_{1} (\bar{M})$ .

Analysis: $B$ regards $P K_{A}$ as its public key and $a$ as its private key (although $B$ does not actually know $a$ ). Then $ϑ^{a}$ is $B$ ’s signature on some message, that is, $σ_{A} = H_{1} {(\bar{M})}^{a} = ϑ^{a}$ , where $(\bar{M}, σ_{A})$ is the forgery produced by $A$ . $B$ can set $ϑ$ as the hash value of a message ${\bar{M}}_{j}$ , but $B$ does not know which message’s signature $A$ will forge, so it has to guess. $B$ also wants to hide the problem instance $(P, P K_{A} = P^{a}, ϑ)$ , so it first selects a random number $r$ and sends $P K_{A} \cdot P^{r}$ to $A$ as the public key.

The specific process is as follows:

1. Initialization: $B$ sends the generator $P$ of the multiplicative group $G_{1}$ and the public key $P K_{A} \cdot P^{r} \in G_{1}$ to $A$ , where $r \in Z_{q_{1}}^{*}$ . The private key corresponding to $P K_{A} \cdot P^{r} = P^{a + r}$ is $a + r$ . In addition, $B$ randomly selects $j \in \{1, 2, \dots, q_{H}\}$ as its guess: the $j$ -th $H_{1}$ query of $A$ is assumed to correspond to the final forgery of $A$ ;

2. $H_{1}$ query (up to $q_{H}$ times): $B$ creates a list $H^{list}$ , initially empty, whose elements are triples $({\bar{M}}_{i}, y_{i}, b_{i})$ . When $A$ issues its $i$ -th query (with query value ${\bar{M}}_{i}$ ), $B$ answers as follows:

(1): If $H^{list}$ contains a triple $({\bar{M}}_{i}, y_{i}, b_{i})$ corresponding to ${\bar{M}}_{i}$ , then $y_{i}$ is the answer;
(2): Otherwise, $B$ randomly selects $b_{i} \in Z_{q_{1}}^{*}$ . If $i = j$ , it computes $y_{i} = ϑ P^{b_{i}} \in G_{1}$ ; otherwise, it computes $y_{i} = P^{b_{i}} \in G_{1}$ . $B$ returns $y_{i}$ as the response to the query and stores $({\bar{M}}_{i}, y_{i}, b_{i})$ in $H^{list}$ ;

3. Signature query (up to $q_{H}$ times): When $A$ requests a signature on message $\bar{M}$ , let $i$ satisfy $\bar{M} = {\bar{M}}_{i}$ , where ${\bar{M}}_{i}$ is the query value of the $i$ -th $H_{1}$ query. $B$ answers as follows:

(1): If $i \neq j$ , then there is a triple $({\bar{M}}_{i}, y_{i}, b_{i})$ in $H^{list}$ . $B$ computes $σ_{i} = {(P K_{A} P^{r})}^{b_{i}}$ and returns $σ_{i}$ to $A$ . Because $σ_{i} = {(P K_{A} P^{r})}^{b_{i}} = P^{b_{i} (a + r)} = y_{i}^{(a + r)}$ , $σ_{i}$ is a valid signature on ${\bar{M}}_{i}$ under the private key $a + r$ ;
(2): If $i = j$ , $B$ aborts;

4. Output: $A$ outputs $(\bar{M}, σ)$ . If $\bar{M} \neq {\bar{M}}_{j}$ , $B$ aborts; otherwise, $B$ outputs $\frac{σ}{ϑ^{r} P {K_{A}}^{b_{j}} P^{b_{j} r}}$ as $ϑ^{a}$ , because $σ = y_{j}^{(a + r)} = {(ϑ P^{b_{j}})}^{a + r} = ϑ^{a} ϑ^{r} {(P^{a})}^{b_{j}} P^{b_{j} r} = ϑ^{a} ϑ^{r} P {K_{A}}^{b_{j}} P^{b_{j} r}$ .
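The algebra of the output step can be sanity-checked in a toy group. The sketch below checks only the identity $σ / (ϑ^{r} \cdot P {K_{A}}^{b_{j}} \cdot P^{b_{j} r}) = ϑ^{a}$ , which follows from the expansion of $σ$ in the output step; it says nothing about security and does not model pairings. In place of $G_{1}$ it uses the order- $q$ subgroup of $Z_{p}^{*}$ with the small primes $p = 2039$ and $q = 1019$ , and all numeric values and the function name `extract` are hypothetical. It requires Python 3.8 or later for modular inverses via `pow`.

```python
# Toy group: the order-q subgroup of Z_p^* with p = 2q + 1 (illustrative sizes only).
P_MOD = 2039
Q_ORDER = 1019
GEN = 4  # 2^2 is a quadratic residue, so it generates the order-q subgroup


def extract(sigma: int, theta: int, pk_a: int, r: int, b_j: int) -> int:
    """B's output: sigma / (theta^r * PK_A^{b_j} * P^{b_j * r}) mod p."""
    denom = pow(theta, r, P_MOD) * pow(pk_a, b_j, P_MOD) * pow(GEN, b_j * r, P_MOD) % P_MOD
    return sigma * pow(denom, -1, P_MOD) % P_MOD


if __name__ == "__main__":
    a = 123                      # secret exponent; B never uses it directly
    r, b_j, t = 45, 67, 89       # B's random values and the discrete log of theta
    pk_a = pow(GEN, a, P_MOD)    # PK_A = P^a
    theta = pow(GEN, t, P_MOD)   # CDH challenge element
    y_j = theta * pow(GEN, b_j, P_MOD) % P_MOD   # programmed hash value
    sigma = pow(y_j, a + r, P_MOD)               # A's forgery under key a + r
    print(extract(sigma, theta, pk_a, r, b_j) == pow(theta, a, P_MOD))  # True
```

In the demo the secret $a$ is used only to play the role of the forger $A$ ; the `extract` function itself takes nothing that $B$ would not know.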

In the above process, if $B$ does not abort, then the simulation by $B$ is complete.

When the guess is correct, the view of $A$ in the above reduction is identically distributed to its view in the real attack, for the following two reasons:

(1): Each of the $q_{H}$ $H_{1}$ queries of $A$ is answered with a random value. The response to ${\bar{M}}_{i} (i = 1, 2, \dots, q_{H})$ is as follows:

When $i = j$ , the answer is $y_{i} = ϑ P^{b_{i}} \in G_{1}$ ; by the randomness of $b_{i}$ , $y_{i}$ is randomly distributed in $G_{1}$ ;
When $i \neq j$ , the answer is $y_{i} = P^{b_{i}} \in G_{1}$ , and $y_{i}$ is likewise randomly distributed in $G_{1}$ .

In real attacks, $H_{1}$ is regarded as a random oracle. Therefore, the responses to $A$ ’s hash queries have the same distribution as the responses in the real attack.

(2): The response obtained by $A$ to the signature query of ${\bar{M}}_{i} (i \neq j)$ is signed by the private key $a + r$ corresponding to the public key $P K_{A} \cdot P^{r} = P^{a + r}$ ( $A$ has obtained this), so the signature response obtained by $A$ is valid (relative to the public key it obtains).

Therefore, the view of $A$ in the above reduction is identically distributed to its view in the real attack, that is, the simulation by $B$ is complete.

If the guess of $B$ is correct, then $B$ solves the CDH problem in group $G_{1}$ . Because the CDH problem in group $G_{1}$ is hard, the signature scheme used in our scheme is unforgeable under adaptive chosen-message attacks. □

5.4. Performance Analysis

5.4.1. Functional Comparison

Our scheme achieves multidimensional ESD aggregation, guarantees user identity anonymity, and effectively defends against eavesdropping, active attacks, and differential attacks. Among existing schemes, homomorphic-based multiple data aggregation (HB-MDA) [25] can aggregate multidimensional data but not multisubset data. Fault-tolerant and flexible privacy-preserving multisubset data aggregation (FF-PPMA) [26] allows the control center to aggregate subsets, but clients can only report one type of data. Moreover, the signature mechanism of the DP-based multidimensional and multisubset data aggregation scheme (DP-MMDA) [27] is susceptible to adaptive chosen-message attacks. As shown in Table 4, our proposed scheme offers significant functional advantages and a more comprehensive defense against various security threats.

5.4.2. Computational Overhead

In this section, we will analyze the computational overhead of our scheme, comparing it to HB-MDA, FF-PPMA, and DP-MMDA. Table 5 presents the core operations and their execution times. As per Table 6, our scheme accomplishes multidimensional and multiregion aggregation, with less computational expense and greater efficiency during the aggregation and decryption phases compared to the HB-MDA and FF-PPMA schemes.
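To make the symbolic costs concrete, the expressions of Table 6 can be evaluated with the unit times of Table 5. The sketch below is a small calculator, not a benchmark: the column names follow the headers of Table 6 as printed, the function name `overhead_ms` is hypothetical, and the values $l = 4$ and $z = 10$ in the demo are arbitrary illustrative choices.

```python
# Unit times from Table 5, in milliseconds.
T_E = 11.256  # exponentiation
T_M = 1.032   # multiplication


def overhead_ms(l: int, z: int) -> dict:
    """Evaluate the cost expressions of Table 6 for l dimensions and z regions.

    Each value is a tuple ordered like the columns of Table 6:
    (encryption stage, decryption phase, aggregation phase).
    """
    return {
        "HB-MDA": ((l + 1) * T_E + l * T_M, (z - 1) * T_M, T_E + T_M),
        "FF-PPMA": (4 * T_E + 2 * T_M, 2 * (z - 1) * T_M, 2 * T_E),
        "DP-MMDA": (2 * T_E + T_M, (z - 1) * T_M, T_E),
        "Our scheme": (l * T_E + T_M, (z - 1) * T_M, T_E),
    }


if __name__ == "__main__":
    for scheme, costs in overhead_ms(l=4, z=10).items():
        print(scheme, [round(c, 3) for c in costs])
```

Because several expressions depend on $l$ or $z$ , the ranking between schemes can change with the number of dimensions and regions, so it is worth evaluating the table for the sizes of interest.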

The HB-MDA scheme takes more time owing to the construction of a super-increasing sequence and the use of the PHE algorithm for encryption. By contrast, the DP-MMDA scheme efficiently reduces computational overhead by employing the Chinese remainder theorem to merge multidimensional data into composite data. The FF-PPMA scheme, not accounting for authentication, does not consider signature generation computational overhead. In this study, we solely compare computational overhead for implementing multidimensional and multisubset data aggregation.

With an increasing number of data aggregations, the DP-MMDA scheme and our scheme excel in multidimensional data aggregation, outperforming the HB-MDA and FF-PPMA schemes in computational performance. Note that the DP-MMDA scheme aggregates grid data and is suitable for accumulating electricity, whereas our ESD aggregation involves multiattribute data, primarily used for statistical analysis. Nevertheless, we effectively integrate and analyze ESD across multiple regions and attributes.

We include a ULDP mechanism, enhancing privacy protection for dimension attributes, which is crucial for ESD analysis. With the LDP mechanism, our scheme demonstrates strong computational performance in multidimensional and multisubset data aggregation, delivering heightened data privacy. Our comprehensive analysis showcases that our scheme adeptly balances computational efficiency and privacy protection in large-scale data aggregation scenarios, thereby offering a valuable practical solution.

6. Summary

In the context of ESD aggregation, we enhance the uOUE protocol aligned with the ULDP model to boost data coding efficiency and accuracy. Concurrently, we devise an ESD aggregation scheme grounded in the uOUE protocol, ensuring heightened accuracy in ESD frequency estimation without compromising sensitive data protection. We aim to mitigate the privacy risk associated with raw patient data traversing channels and servers, ensuring data integrity during processing while preserving privacy. These enhancements enable secure, efficient, and precise ESD aggregation, bolstering result accuracy and reliability while thwarting data tampering and forgery. Our scheme also has potential practical application value in multifunctional data aggregation [27,28,29] and data aggregation combined with wearable medical devices.

Our future research will focus on data aggregation technology in big data environments, with the aim of improving data utility in a personalized model and developing a multilevel privacy ESD aggregation scheme. In the personalized model, we will explore adaptive data processing methods to meet different user scenarios and data needs. In designing the multilevel privacy ESD aggregation scheme, we will account for varying data sensitivity and privacy requirements, optimizing the use of data while preserving privacy. These studies will advance the field, providing comprehensive, optimized solutions for data aggregation and privacy protection.

Author Contributions

Conceptualization, Q.L.; methodology, X.L. and Q.L.; software, J.W. and Q.L.; verification, J.W. and H.S.; formal analysis, J.W. and Q.L.; investigation, Q.L.; resources, Q.L.; data monitoring, Q.L. and H.S.; writing—original draft preparation, Q.L.; writing—review and editing, X.L. and Q.L.; supervision, X.L. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant numbers 62262060 and 61662071; the Industrial Support Plan Project of the Gansu Provincial Department of Education, grant number 2022CYZC-17; and the Gansu Science and Technology Program, grant number 22JR5RA158.

Data Availability Statement

Data are contained within the article.

Acknowledgments

The authors would like to express their sincere thanks to the referees for their careful reading and suggestions which helped us to improve the paper.

Conflicts of Interest

The authors declare no conflicts of interest.

References

1. Giabicani, M.; Le Terrier, C.; Poncet, A.; Guidet, B.; Jean-Philippe, B. Limitation of life-sustaining therapies in critically ill patients with COVID-19: A descriptive epidemiological investigation from the COVID-ICU study. Crit. Care 2023, 27, 103.
2. Song, Q.X. New Coronavirus Pneumonia Epidemic-related Rumors and Its Mechanism of Generation and Dissemination—Discussion on the Cooperative Principle of Emergency Information Release. Lang. Plan. Res. 2021, 57–66.
3. Feng, B.; Chao, L. Analysis of epidemic prevention and control behavior and influencing factors of employees in public places in the normalized prevention and control stage of COVID-19. Anhui J. Prev. Med. 2022, 28, 406–409.
4. Blumenberg, C.; Barros, A.J. Electronic data collection in epidemiological research. Appl. Clin. Inform. 2016, 7, 672–681.
5. Dong, E.; Ratcliff, J.; Goyea, T.D.; MS, A.K. The Johns Hopkins University Center for systems science and engineering COVID-19 Dashboard: Data collection process, challenges faced, and lessons learned. Lancet Infect. Dis. 2022, 22, e370–e376.
6. Sperber, A.D.; Bor, S.; Fang, X. Face-to-face interviews versus Internet surveys: Comparison of two data collection methods in the Rome foundation global epidemiology study: Implications for population-based research. Neurogastroenterol. Motil. 2023, 35, e14583.
7. Dwork, C.; Roth, A. The algorithmic foundations of differential privacy. Found. Trends Theor. Comput. Sci. 2014, 9, 211–407.
8. Lecuyer, M.; Atlidakis, V.; Geambasu, R.; Hsu, D.; Jana, S. Certified robustness to adversarial examples with differential privacy. In Proceedings of the 2019 IEEE Symposium on Security and Privacy (SP), San Francisco, CA, USA, 19–23 May 2019; IEEE: Piscataway, NJ, USA, 2019; pp. 656–672.
9. Erlingsson, Ú.; Feldman, V.; Mironov, I.; Raghunathan, A.; Talwar, K.; Thakurta, A. Amplification by shuffling: From local to central differential privacy via anonymity. In Proceedings of the Thirtieth Annual ACM-SIAM Symposium on Discrete Algorithms, San Francisco, CA, USA, 6–9 January 2019; Society for Industrial and Applied Mathematics: Philadelphia, PA, USA, 2019.
10. Duchi, J.C.; Jordan, M.I.; Wainwright, M.J. Local privacy and statistical minimax rates. In Proceedings of the 2013 IEEE 54th Annual Symposium on Foundations of Computer Science, Berkeley, CA, USA, 26–29 October 2013; IEEE: Piscataway, NJ, USA, 2013.
11. Dwork, C. Differential privacy: A survey of results. In Proceedings of the International Conference on Theory and Applications of Models of Computation, Berlin, Germany, 25–29 April 2008.
12. Wang, T.; Blocki, J.; Li, N.H. Locally differentially private protocols for frequency estimation. In Proceedings of the 26th USENIX Security Symposium (USENIX Security 17), Vancouver, BC, Canada, 16–18 August 2017.
13. Liu, X.; Xia, G.; Xia, X.; Zong, C.; Zhu, R.; Li, J. Personalized privacy protection for spatio-temporal data. J. Comput. Appl. 2021, 9, 643–650.
14. Tian, F.; Wu, Q.Z.; Lu, L.F.; Liu, H.; Gui, X.L. Personalized differential privacy protection mechanism for trajectory data publishing. Chin. J. Comput. 2021, 44, 709–723.
15. Murakami, T.; Kawamoto, Y. Utility-optimized local differential privacy mechanisms for distribution estimation. In Proceedings of the 28th USENIX Security Symposium (USENIX Security 19), Santa Clara, CA, USA, 14–16 August 2019.
16. He, X.Y.; Zhu, Y.W.; Zhang, Y. Utility optimization of local differential privacy mechanism based on OLH. J. Cryptogr. 2022, 9, 820–833.
17. Cao, Y.R.; Zhu, Y.W.; He, X.Y.; Zhang, Y. Utility-optimized local differential privacy set data frequency estimation mechanism. Comput. Res. Dev. 2022, 59, 2261–2274.
18. Gentry, C. Fully homomorphic encryption using ideal lattices. In Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, Bethesda, MD, USA, 31 May–2 June 2009.
19. Paillier, P. Public-key cryptosystems based on composite degree residuosity classes. In Proceedings of the EUROCRYPT’99, Prague, Czech Republic, 2–6 May 1999; pp. 223–238.
20. Boneh, D.; Lynn, B.; Shacham, H. Short signatures from the weil pairing. J. Cryptol. J. Int. Assoc. Cryptologic Res. 2004, 17, 297–319.
21. Jihoo, K. Data Science for COVID-19 (DS4C). Available online: https://www.kaggle.com/datasets/kimjihoo/coronavirusdataset (accessed on 18 May 2023).
22. World Health Organization. Coronavirus 2019 (COVID-19). Available online: https://covid19.who.int/ (accessed on 3 May 2023).
23. U.S. National Library of Medicine, ClinicalTrials.gov. Available online: https://www.clinicaltrials.gov/ (accessed on 6 June 2023).
24. Hugging Face. Available online: https://huggingface.co/datasets?sort=trending&search=SARS (accessed on 13 May 2023).
25. Chen, Y.W.; Martínez-Ortega, J.F.; Castillejo, P.; López, L. A homomorphic-based multiple data aggregation scheme for smart grid. IEEE Sens. J. 2019, 19, 3921–3929.
26. Chien, H.Y.; Su, C. A fault-tolerant and flexible privacy-preserving multisubset data aggregation in smart grid. Comput. Sci./Intell. Appl. Inform. 2020, 848, 165–175.
27. Xu, S.H. Research on Privacy Protection Data Aggregation Scheme for Smart Grid. Master’s Thesis, Zhejiang Gongshang University, Hangzhou, China, 2022.
28. Ren, H.; Li, H.W.; Liang, X.H.; He, S.B.; Dai, Y.S.; Zhao, L. Privacy-Enhanced and Multifunctional Health Data Aggregation under Differential Privacy Guarantees. Sensors 2016, 16, 1463.
29. Thantharate, P.; Thantharate, A. GeneticSecOps: Harnessing Heuristic Genetic Algorithms for Automated Security Testing and Vulnerability Detection in DevSecOps. In Proceedings of the 2023 6th International Conference on Contemporary Computing and Informatics (IC3I), Gautam Buddha Nagar, India, 14–16 September 2023.

Figure 1. Example of uOUE mechanism disturbing sensitive data.

Figure 2. Example of uOUE mechanism disturbing non-sensitive data.

Figure 3. Scheme model of ESD aggregation based on uOUE mechanism.

Figure 4. The effect of $ε$ on MSE.

Figure 5. MSE of different ε on SARS virus.

Figure 6. The effect of $d$ on MSE.

Table 1. Comparison of MSE when $ε = O (1)$ .

Mechanism	MSE	Mechanism	MSE
GRR	$O (\frac{d^{2}}{n ε^{2}})$	uRAP	$O (\frac{|X_{S}|}{n ε^{2}} + \frac{F_{X_{N}}}{n ε})$
RAPPOR	$O (\frac{d}{n ε^{2}})$	uOLH	$O (\frac{|X_{S}|}{n ε^{2}} + \frac{F_{X_{N}}}{n})$
uGRR	$O (\frac{{|X_{S}|}^{2}}{n ε^{2}} + \frac{|X_{N}| F_{X_{N}}}{n ε})$	uOUE	$O (\frac{|X_{S}|}{n ε^{2}} + \frac{F_{X_{N}}}{n ε})$

Table 2. Communication cost comparison.

Mechanism	Communication Cost	Mechanism	Communication Cost
GRR	$O (\log d)$	uRAP	$O (|X_{S}| + |X_{N}|)$
RAPPOR	$O (d)$	uOLH	$O (\log |H| + \log (g + |X_{N}|))$
uGRR	$O (\log |X_{S}| + |X_{N}|)$	uOUE	$O (|X_{S}| + |X_{N}|)$

Table 3. Simulation parameter settings in Figure 4 and Figure 5.

Synthetic Data Set	COVID-19 Data Set	SARS Virus Data Set
$ε$	$(0.0, 2.0)$	$(0.0, 2.0)$
Data size N	100K	50K
Data domain size d	$\{16, 32, 64, \dots, 1024\}$	$\{16, 32, 64, \dots, 1024\}$

Table 4. Comparison of privacy-preserving data aggregation schemes.

Function	HB-MDA	FF-PPMA	DP-MMDA	Our Scheme
Multidimensional data aggregation	√	×	√	√
Multidimensional and multisubset data aggregation	×	×	√	√
Identity anonymity	×	×	√	√
Eavesdropping attack	√	√	√	√
Active threat	√	×	√	√
Differential attack	×	×	√	√
EUF-CMA	×	×	×	√

Table 5. Applications in each class.

Sign	Description	Time (ms)
$T_{e}$	Exponential operation on $Z_{N}$	11.256
$T_{m}$	Multiplication operation on $Z_{N}$	1.032

Table 6. Comparison of computational overhead.

Scheme	Encryption Stage	Decryption Phase	Aggregation Phase
HB-MDA	$(l + 1) T_{e} + l T_{m}$	$(z - 1) T_{m}$	$T_{e} + T_{m}$
FF-PPMA	$4 T_{e} + 2 T_{m}$	$2 (z - 1) T_{m}$	$2 T_{e}$
DP-MMDA	$2 T_{e} + T_{m}$	$(z - 1) T_{m}$	$T_{e}$
Our scheme	$l T_{e} + T_{m}$	$(z - 1) T_{m}$	$T_{e}$

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Share and Cite

MDPI and ACS Style

Liu, X.; Liu, Q.; Wang, J.; Sun, H. Multidimensional Epidemiological Survey Data Aggregation Scheme Based on Personalized Local Differential Privacy. Symmetry 2024, 16, 294. https://doi.org/10.3390/sym16030294

