Article

A Communication-Efficient Federated Text Classification Method Based on Parameter Pruning

Zheng Huo, Yilin Fan and Yaxin Huang
1 Information Technology School, Hebei University of Economics and Business, Shijiazhuang 050061, China
2 Hebei Cross-Border E-Commerce Technology Innovation Center, Shijiazhuang 050061, China
* Author to whom correspondence should be addressed.
Mathematics 2023, 11(13), 2804; https://doi.org/10.3390/math11132804
Submission received: 17 May 2023 / Revised: 10 June 2023 / Accepted: 20 June 2023 / Published: 21 June 2023

Abstract

Text classification is an important application of machine learning. This paper proposes a communication-efficient federated text classification method based on parameter pruning. Since the data distributions of different participants in a federated learning architecture are not independent and identically distributed, a federated word embedding model, FedW2V, is proposed. The TextCNN model is then extended to the federated architecture. To reduce the communication cost of the federated TextCNN model, a parameter pruning algorithm called FedInitPrune is proposed, which reduces the amount of communication data in both the uplink and the downlink during the parameter transmission phase. The algorithms are tested on real-world datasets. The experimental results show that, with a reduction of less than 2% in text classification accuracy, the volume of federated learning communication parameters can be reduced by 74.26%.

1. Introduction

The development of deep learning has brought breakthroughs to natural language processing. However, high-quality natural language processing models cannot be built without large amounts of training data [1]. Large volumes of text data are stored on user terminals or at service providers. Due to data security and privacy protection regulations, as well as limited computing and storage resources, data from multiple sources cannot be stored in a centralized manner for model training. How to enable multiple data holders to jointly use deep learning models for text classification is therefore an urgent problem. Federated learning [2] provides a new way to solve this problem. Federated learning adopts a distributed learning framework in which each data holder controls its own training data, and a central server coordinates multiple data holders to train a unified global model. There is no data exchange or centralized data storage; only model parameters are interchanged during training, which protects the privacy and security of the data.
The primary task of deep learning-based text classification is to generate the text representation, that is, to convert text into a form that a computer can process. The most popular approach is the word embedding model. Such a model requires a large document corpus as input, and a high-quality word embedding model cannot be obtained by training on a single data source; thus, it cannot provide an effective guarantee for downstream tasks. Many leading IT enterprises have launched natural language processing applications based on federated learning. For example, Google took the lead in using federated learning to predict the next input word in the Gboard mobile keyboard; Apple uses federated learning to detect wake-up words in Siri; and Snips introduced cross-device federated learning for hot-word detection [3].
Federated learning provides a new idea for training text classification models, which can break through the limitation of data islands and make it possible to train models on larger datasets. However, during federated training, the aggregation server needs to exchange parameters frequently with the data holders. Researchers aiming to improve model quality keep increasing the breadth or depth of the models, so the volume of model parameters keeps growing. The large number of parameters imposes a heavy cost on the parameter communication of federated learning, and this has become a major factor restricting the extension of deep learning models to the federated architecture.
In view of the above problems, this paper aims to train high-quality word embedding models for text representation under the horizontal federated learning architecture, cooperating with different data holders under the premise that the data are non-IID (non-independent and identically distributed) and stored in a distributed manner. The main contributions are as follows.
  • We extend the Word2Vec (W2V) model to a federated version, called FedW2V, for the case where the data are non-IID and the document corpora are not allowed to be shared. In FedW2V, the aggregation algorithm is redesigned.
  • We then extend the TextCNN deep learning model for text classification to the federated architecture.
  • To deal with the large number of model parameters and the high communication cost of federated TextCNN, a parameter pruning algorithm, FedInitPrune, is proposed, which reduces the amount of uplink and downlink communication data during training at the cost of only a small loss in model performance.
  • We run a set of experiments on real-world datasets; the accuracy of our proposal reaches 91.71%, and the communication cost is reduced by 74.26% in the best case, indicating the effectiveness of our method.
The remainder of this paper is organized as follows. Section 2 introduces the related works, Section 3 illustrates system architecture and algorithms, and Section 4 analyzes the experiment results. At last, Section 5 concludes the paper.

2. Related Works

Deep learning is widely used in many fields and has shown excellent results in natural language processing. Text representation is the starting point of natural language processing tasks. With the development of deep learning, text representation based on word embedding models has emerged. Compared with the traditional discrete bag-of-words model, a distributed word embedding model represents each word as a dense vector that contains rich semantic information and maps words into a low-dimensional space that captures the relationships between words, which mitigates the sparsity of text representations. In 2013, Google proposed the Word2Vec method [4], which obtains word embedding representations by training a neural network model. Subsequently, researchers proposed the GloVe model [5] and the state-of-the-art BERT model [6]. Word2Vec and GloVe generate word vectors from different perspectives, while the BERT model uses a more complex network to obtain word embedding representations. Due to its huge number of parameters, the BERT model is not suitable for mobile terminals with limited computing capability. At present, the Word2Vec model still has a wide range of applications [7,8,9,10].
Classical deep learning models for text classification tasks include TextCNN, TextRNN, LSTM, etc. Researchers are committed to continuously improving model performance, typically by changing the depth and structure of the model, which increases the number of model parameters. When a deep learning model is applied to the federated learning architecture, frequent parameter exchange between participants and the server is required, resulting in high communication costs. As the number of participants grows, communication cost has become a bottleneck of federated learning. There are two ways to save the communication cost of federated learning: reducing the number of communication rounds or reducing the amount of communication data between participants and the server in each round [11,12]. The FedAvg algorithm proposed by McMahan et al. [13] is currently a widely used model updating algorithm for federated learning. Based on the FedSGD algorithm, it reduces the total amount of uploaded data and the communication frequency by reducing the number of participants involved in each global model update while increasing the number of local training rounds; FedAvg is thus an efficient approach to federated communication. Bathla et al. [14] proposed a fake review detection method based on CNN and LSTM. In their approach, aspects are extracted from reviews, and only these aspects and their respective sentiments are used for fake review detection: the extracted aspects are fed into a CNN for aspect replication learning, and the replicated aspects are fed into an LSTM for fake review detection.
Regarding the reduction of the amount of communication data in each round, there are two approaches: reducing the number of participants involved in each update round and reducing the amount of data each participant transmits. Yin et al. [15] proposed that, in the model update phase, participants upload only the gradients whose magnitude exceeds a certain threshold, reducing the amount of data transmitted per round. Dong et al. [16] adopted a Top-k gradient selection method: instead of relying on a fixed threshold, they choose the k gradients with the largest absolute values for uploading. Konečný et al. [17] and Reisizadeh et al. [18] proposed using quantization methods to lossily compress model parameters before uploading, so that the parameters can be represented with fewer bits; however, after the parameter precision is reduced, the parameters recovered by the server through dequantization contain certain errors. Most of the above methods only optimize the communication cost of the uplink (i.e., the upload of the participants' local model parameters) in the model update phase of federated learning, while the downlink (i.e., the download of the central server's model parameters) is not optimized. We are inspired by Wang et al.'s [19] work on deep learning model compression, which argues that the final model performance is determined by the pruned model structure rather than by the weights inherited from a pre-trained large model; therefore, it is possible to prune directly from randomly initialized weights without pre-training a model with redundant parameters. Research shows that the weight parameters of deep neural networks are significantly redundant, and a small fraction of the weights can be used to predict the remaining ones, so most of the weights in the network do not need to be learned. In the first phase of federated learning, we therefore prune the model first and then train the pruned model. Through model pruning, we reduce the amount of communication data in both the uplink and the downlink. Li et al. [20] proposed FedTCR, a federated learning approach via taming computing resources. FedTCR includes a coarse-grained logical computing cluster construction algorithm and a fine-grained intra-cluster collaborative training mechanism as part of the FL process. The heterogeneity of computing resources among devices and the communication frequency between devices and the server are indirectly tamed during this process, which substantially resolves the straggler problem and significantly improves communication efficiency.

3. System Architecture and Algorithms

3.1. System Architecture

The method proposed in this paper is deployed under the federated learning architecture, which includes an aggregation server S and k participants P_1, P_2, …, P_k, as shown in Figure 1. The training process is as follows:
Step 1: Server S builds a global dictionary from the local dictionaries uploaded by the participants P_i.
Step 2: Each P_i pre-processes its local training data based on the global dictionary.
Step 3: Server S cooperates with the k participants to train a high-quality word vector model with FedW2V, which serves as the input of the TextCNN model.
Step 4: Server S conducts the federated TextCNN training. In the first round of training, the importance of the parameters of the initialized model is judged: the participants compute the parameter connection sensitivities and upload the results to the central server for aggregation.
Step 5: Server S selects a certain proportion of redundant parameters for deletion through aggregation, normalization and sorting operations, and sends the lightweight model parameters to the participants.

3.2. Federated Word Embedding Model

We extend the Word2Vec model to the federated learning framework; the resulting model is called FedW2V. We first introduce the aggregation algorithm of FedW2V and then present the whole algorithm.

3.2.1. FedW2V Aggregation Method

As we know, the aggregation algorithm is at the core of federated learning. The FedAvg algorithm uses the data size ratio |D_i| / |D| as the weight of each participant's parameters during aggregation. However, we found that using the ratio of data sizes as the weight of the parameter update is not accurate for word embeddings, because the frequency of a word's occurrence does not depend on the amount of data a participant holds: some participants have small amounts of data, yet certain words occur in their data with high frequency. Therefore, this paper redesigns the aggregation weights of the parameters. We use the proportion of a word's local frequency in the global vocabulary as the aggregation weight, agg_weight_i(w_t) = local_vocab_i(w_t) / global_vocab(w_t), where agg_weight_i(w_t) is the update weight of participant P_i for word w_t in the global vocabulary, and local_vocab_i(w_t) and global_vocab(w_t) denote the frequency of w_t on P_i's side and in the global vocabulary, respectively.
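To make the aggregation concrete, the following is a minimal sketch (not the authors' code; names such as local_embeddings and local_vocabs are our own) of how the server could compute per-word aggregation weights and combine the participants' embedding matrices:

```python
import numpy as np

def aggregate_embeddings(local_embeddings, local_vocabs):
    """Aggregate per-participant embedding matrices with per-word weights.

    local_embeddings: list of (V, d) arrays, one per participant, rows indexed
                      by the global vocabulary.
    local_vocabs:     list of dicts mapping word index -> local frequency.
    """
    V, d = local_embeddings[0].shape
    # Global frequency of each word = sum of its local frequencies.
    global_freq = np.zeros(V)
    for vocab in local_vocabs:
        for w, freq in vocab.items():
            global_freq[w] += freq

    aggregated = np.zeros((V, d))
    for emb, vocab in zip(local_embeddings, local_vocabs):
        # agg_weight_i(w_t) = local_vocab_i(w_t) / global_vocab(w_t)
        weights = np.zeros(V)
        for w, freq in vocab.items():
            weights[w] = freq / global_freq[w]
        aggregated += weights[:, None] * emb
    return aggregated
```

Rows for words a participant never observed receive a weight of zero, so they do not dilute the aggregated vectors of the participants that do observe those words.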

3.2.2. FedW2V Algorithm

The FedW2V algorithm is shown in Algorithm 1. Participant P_i calculates word frequencies from its local training data, performs subsampling, and obtains its local vocabulary (Line 1). The idea of subsampling is to reduce the frequency of high-frequency words, which speeds up model training and yields better word vectors. The local vocabulary of P_i is then sent to the server (Line 2). The central server collects the local vocabularies, sums the word frequencies over all participants, and sorts the words by total word frequency. The top-n words with the highest total word frequency are selected as the global dictionary for the subsequent word embedding model training (Lines 3–5).
Algorithm 1: FedW2V algorithm
Input: P_i's dataset D_i, initial parameter ω_0 of the federated Skip-gram model, global training rounds G;
Output: P_i's local vocabulary local_vocab_i, global vocabulary global_vocab, global word embedding model global_model;
P_i's side
    1. Build a local vocabulary based on the local dataset: local_vocab_i ← D_i;
    2. Upload the local vocabulary local_vocab_i to the server;
Server's side
    3. Collect the local vocabularies from the participants;
    4. Calculate the global vocabulary: global_vocab ← ∪_{i=1}^{k} local_vocab_i;
    5. Send global_vocab to the participants;
Participants' side
    6. Build the training set S_i based on global_vocab;
Server's side
    7. Calculate each participant's aggregation weight: agg_weight_i(w_t) = local_vocab_i(w_t) / global_vocab(w_t);
    8. Initialize the federated Skip-gram model parameter ω_0;
    9. for g = 0, 1, 2, …, G do
    10.     for each participant P_i do
    11.         ω_g^i ← ClientUpdate(ω_g);
    12.     end
    13.     ω_{g+1} ← Σ_{i=1}^{k} agg_weight_i × ω_g^i;
    14. end
Then, P_i generates training data S_i according to the global vocabulary, extracts the training headwords and background words, and uses the negative sampling strategy to accelerate model training (Line 6). In the model training phase, P_i uses a three-layer neural network to build a Skip-gram model for word vector training. Assume that the total number of words in the global dictionary is V and the word vector dimension is d; the three-layer neural network can be described as follows:
  • Input layer: the input is the one-hot encoding of a word's index in the dictionary, corresponding to the headword, background words and noise words.
  • Hidden layer: the number of neurons in the hidden layer equals the dimension of the word vector; that is, the hidden-layer parameter is a matrix of size V × d, called the center word vector matrix W_1. The low-dimensional representation of the headword is obtained by multiplying the headword's one-hot vector from the input layer by W_1.
  • Output layer: the aim is to predict the scores of the background words and noise words given the headword. The one-hot vectors of the background words and noise words select the corresponding rows of the background word vector matrix W_2 of size V × d, which serves as the weight from the hidden layer to the output layer. Multiplying the hidden-layer representation by these weights gives the predicted scores of the background words Pos_p and noise words Neg_q. A sigmoid is applied to each score to obtain the prediction probabilities, which are compared with the ground truth to compute the model loss; backpropagation is then carried out to update matrices W_1 and W_2.
After local training, P_i uploads the local parameters W_1 and W_2. In each round, P_i trains the local model on batches of local training data and sends the model parameters to S, which aggregates them with the update weights and then distributes the new global model parameters. The process is repeated until training ends (Lines 8–14).
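For reference, a minimal Skip-gram model with negative sampling could look as follows in PyTorch (an illustrative sketch under our own naming, not the authors' implementation; W1 and W2 correspond to the center and background word matrices described above):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipGramNS(nn.Module):
    """Minimal Skip-gram model with negative sampling.
    W1 holds the center word vectors, W2 the background word vectors,
    both of size V x d as described in the text."""

    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.W1 = nn.Embedding(vocab_size, embed_dim)  # center word vectors
        self.W2 = nn.Embedding(vocab_size, embed_dim)  # background word vectors

    def forward(self, center, context, negatives):
        # center: (B,), context: (B,), negatives: (B, K) word indices
        v_c = self.W1(center)                                   # (B, d)
        u_pos = self.W2(context)                                # (B, d)
        u_neg = self.W2(negatives)                              # (B, K, d)

        pos_score = (v_c * u_pos).sum(dim=1)                    # (B,)
        neg_score = torch.bmm(u_neg, v_c.unsqueeze(2)).squeeze(2)  # (B, K)

        # Negative-sampling loss: push positive scores up and negative scores down.
        loss = -(F.logsigmoid(pos_score).mean()
                 + F.logsigmoid(-neg_score).mean())
        return loss
```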
Algorithm 2 shows the training process on P_i's side. P_i first initializes its local model with the global parameters ω_g, partitions the training set S_i into batches, and performs E rounds of local training (Lines 1–7). After training is done, the local parameters are uploaded to the server (Line 8).
Algorithm 2: ClientUpdate(ω_g)
Input: the global model parameter ω_g of the g-th training round, local training rounds E;
Output: the local model parameter ω_E^i of the g-th round;
1. Update the local model parameters: ω_e^i ← ω_g;
2. B ← partition the training set S_i by batch_size;
3. for e = 0, 1, 2, …, E do
4.     for b ∈ B do
5.         ω_e^i ← ω_e^i − η∇l(ω_e^i; b);
6.     end
7. end
8. Upload ω_E^i to the server;
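A corresponding sketch of the ClientUpdate procedure in PyTorch is shown below (helper names such as local_batches and loss_fn, and the learning rate, are placeholders of ours, not values from the paper):

```python
import copy
import torch

def client_update(global_state, model, local_batches, loss_fn,
                  lr=0.01, local_epochs=1):
    """Run E local epochs of SGD starting from the global parameters and
    return the updated local parameters (a sketch of Algorithm 2)."""
    model.load_state_dict(global_state)                       # omega_i <- omega_g
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(local_epochs):                             # for e = 0, ..., E
        for inputs, targets in local_batches:                 # for b in B
            optimizer.zero_grad()
            loss = loss_fn(model(inputs), targets)
            loss.backward()
            optimizer.step()                                  # omega <- omega - eta * grad
    return copy.deepcopy(model.state_dict())                  # parameters to upload
```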

3.3. Communication-Efficient Federated Text Classification

We extend the TextCNN deep learning model to the federated architecture for accurate text classification. The word embedding model trained by the aforementioned FedW2V algorithm is the input of TextCNN. Compared with traditional machine learning methods, deep learning alleviates the problems of data sparsity and high feature dimensionality in text feature extraction. In the TextCNN model, the parameters mainly come from the convolution layers and the fully connected layer. As the word vector dimension and the number of users participating in federated learning increase, a large amount of data is transmitted in the parameter communication phase of federated learning, and as deep neural network models grow deeper and wider, the cost of federated communication increases severely. In this paper, a model pruning algorithm is proposed to reduce the communication cost while maintaining model accuracy.
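As a point of reference, a generic TextCNN can be sketched in PyTorch as follows (the filter sizes (3, 4, 5) and 100 filters per size are common defaults assumed here, not hyperparameters reported by the paper):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TextCNN(nn.Module):
    """Generic TextCNN: embedding -> parallel 1D convolutions ->
    max-over-time pooling -> fully connected classifier."""

    def __init__(self, embedding_matrix, num_classes,
                 filter_sizes=(3, 4, 5), num_filters=100):
        super().__init__()
        vocab_size, embed_dim = embedding_matrix.shape
        # Initialize the embedding layer with the FedW2V word vectors.
        self.embedding = nn.Embedding.from_pretrained(
            torch.tensor(embedding_matrix, dtype=torch.float), freeze=False)
        self.convs = nn.ModuleList(
            [nn.Conv1d(embed_dim, num_filters, k) for k in filter_sizes])
        self.fc = nn.Linear(num_filters * len(filter_sizes), num_classes)

    def forward(self, x):                            # x: (B, seq_len) word indices
        e = self.embedding(x).transpose(1, 2)        # (B, d, seq_len)
        pooled = [F.relu(conv(e)).max(dim=2).values  # max-over-time pooling
                  for conv in self.convs]
        return self.fc(torch.cat(pooled, dim=1))     # (B, num_classes)
```

Most of the trainable parameters sit in the convolution and fully connected layers, which is why those layers are the target of the pruning algorithm described next.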

3.3.1. Parameter Pruning

Parameter pruning refers to designing evaluation criteria for network parameters, usually on the basis of a pre-trained large-scale model, and deleting the parameters judged to be redundant. We design an unstructured parameter pruning method, FedInitPrune, that is applied at the beginning of the federated model training phase, since unstructured pruning has little impact on model accuracy. First, we define connection sensitivity as the evaluation criterion.
Definition 1. 
Connection sensitivity. Suppose the model parameter set is θ, and the loss function is L(). The change of loss function caused by the removal of a certain parameter θ j is defined as the connection sensitivity of parameter θ j , as shown in Formula (1).
Sens(θ_j) = ΔL(θ_j) = (1/|D_i|) |L(w | θ_j = 0; D_i) − L(w; D_i)|    (1)
where L(w | θ_j = 0; D_i) is the loss after parameter θ_j is removed.
Optimization problem: The parameter pruning problem can be formulated as an optimization problem; that is, find the parameter set whose removal causes the minimum change in the loss function, as represented by Formula (2).
min_{θ*} (1/|D_i|) |L(w | θ* = 0; D_i) − L(w; D_i)|    (2)
where L(w | θ* = 0; D_i) is the model loss after removing the parameter set θ* from the convolution layers and the fully connected layer. Based on the absolute change of the loss function before and after removing parameters, we find the parameter set θ* that achieves the target pruning rate with the minimum loss change, remove it, and reset the corresponding parameter weights to zero.
We add a mask to each parameter to indicate its current status: 0 means the parameter is removed and 1 means it is retained. The masks are treated as trainable only in order to compute sensitivities; the model parameters themselves are not updated. The optimization problem can then be expressed as Formula (3).
ΔL(m_j) = |L(w, m_j = 0; D_i) − L(w, m_j = 1; D_i)|    (3)
Solutions: A first-order Taylor expansion can be used to approximate the problem and reduce the computational complexity. The first-order Taylor expansion of L(w, m_j = 1; D_i) is Formula (4).
L(w, m_j = 1; D_i) = L(w, m_j = 0; D_i) + ∂L(w, m_j = 1; D_i)/∂m_j · 1 + R_1(m_j = 1)    (4)
Substituting the first-order Taylor expansion (4) into Formula (3) gives Formula (5).
ΔL(m_j) = |∂L(w, m_j = 1; D_i)/∂m_j + R_1(m_j = 1)| ≈ |∂L(w, m_j = 1; D_i)/∂m_j|    (5)
Therefore, only one backpropagation pass with respect to the masks is needed, and the connection sensitivity of each parameter can be computed efficiently by Formula (5).
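A possible implementation of this computation in PyTorch is sketched below (our own illustration, not the authors' code). It uses the identity that, at m_j = 1, ∂L/∂m_j = θ_j · ∂L/∂θ_j, so the sensitivities can be read off from one ordinary backpropagation pass over the convolutional and fully connected layers:

```python
import torch
import torch.nn as nn

def connection_sensitivity(model, inputs, targets, loss_fn):
    """Estimate the connection sensitivity of Formula (5) for every weight of
    the convolutional and fully connected layers. At m_j = 1 the mask gradient
    equals theta_j * dL/dtheta_j, so ordinary weight gradients suffice;
    no parameter update is performed."""
    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()

    sens = {}
    for name, module in model.named_modules():
        if isinstance(module, (nn.Conv1d, nn.Conv2d, nn.Linear)):
            # |dL/dm_j| = |theta_j * dL/dtheta_j| evaluated at m_j = 1
            sens[name] = (module.weight.detach() * module.weight.grad).abs()
    model.zero_grad()
    return sens
```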

3.3.2. FedInitPrune Algorithm

We now introduce the FedInitPrune algorithm; see Algorithm 3 for details. Each participant P_i uses the shared initialization parameter ω_0 and performs one round of connection sensitivity calculation on its local training dataset D_i. Assume that the convolution layers and the fully connected layer of the TextCNN model have m parameters in total. P_i computes the connection sensitivity Sens_i(θ) of each of its parameters, θ = {θ_1, …, θ_m}, through one backpropagation pass over the masks, and uploads Sens_i(θ) to S. The server aggregates the sensitivities as global_Sens(θ_j) = Σ_{i=1}^{k} Sens_i(θ_j), j ∈ {1, 2, …, m}, and normalizes them as global_Sens(θ_j) = global_Sens(θ_j) / Σ_{n=1}^{m} global_Sens(θ_n).
Algorithm 3: FedInitPrune
Input: P_i's dataset D_i, global model initial parameters ω_0, pruning rate α, global training rounds G, local training rounds E, local batch size batch_size;
Output: pruned global model global_model
Server's side:
    1. Initialize the TextCNN model parameter ω_0;
    2. for g = 0, 1, …, G do
    3.     if g == 0 then
    4.         for each participant P_i do
    5.             Sens_i(θ) ← ClientUpdate(ω_g)
    6.         global_Sens(θ) ← Σ_{i=1}^{k} Sens_i(θ)
    7.         global_Sens(θ_j) ← global_Sens(θ_j) / Σ_{n=1}^{m} global_Sens(θ_n)   // normalization
    8.         GS(θ) ← SortedDescending(global_Sens(θ))
    9.         δ ← GS(θ)[⌈m · α⌉]   // sensitivity threshold determined by the pruning rate α
    10.        θ* ← {θ_j | global_Sens(θ_j) ≥ δ}
    11.        ω* ← θ*
    12.    else
    13.        for each participant P_i do
    14.            ω_g^{i*} ← ClientUpdate(ω_g*)
    15.        ω_{g+1}* ← Σ_i (|D_i| / |D|) · ω_g^{i*}
P_i's side: ClientUpdate(ω_g)
    16. B ← partition the dataset D_i based on batch_size;
    17. if g == 0 then
    18.     update the local model parameter: ω_i ← ω_g
    19.     for b ∈ B do
    20.         calculate Sens_i(θ) based on Formula (5) and ω_i
    21.     Upload Sens_i(θ) to the server;
    22. else
    23.     update the local model parameter: ω_e^{i*} ← ω_g*
    24.     for e = 0, …, E do
    25.         for b ∈ B do
    26.             ω_e^{i*} ← ω_e^{i*} − η∇l(ω_e^{i*}; b)
    27.     Upload ω_E^{i*} to the server;
The server then sorts the parameters by global_Sens(θ_j) in descending order, obtains the pruned global model parameter structure ω* according to the pruning rate α, and sends the pruned global model to each participant for subsequent training.
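The server-side aggregation, normalization and threshold selection of Algorithm 3 could be sketched as follows (hypothetical names; the fraction of parameters kept is our interpretation of the pruning rate α described above):

```python
import numpy as np

def server_prune(client_sens, keep_ratio):
    """Aggregate per-client sensitivities, normalize them, and build a binary
    mask that keeps the most sensitive fraction of parameters (a sketch of
    the server side of FedInitPrune).

    client_sens: list of 1-D arrays, flattened Sens_i(theta) per client.
    keep_ratio:  fraction of parameters to keep, e.g. 0.25.
    """
    global_sens = np.sum(client_sens, axis=0)          # sum over clients
    global_sens = global_sens / global_sens.sum()      # normalization
    m = global_sens.size
    k = max(1, int(m * keep_ratio))
    # Threshold delta = k-th largest normalized sensitivity.
    delta = np.sort(global_sens)[::-1][k - 1]
    mask = global_sens >= delta                        # True = keep this parameter
    return mask

# Example: three clients, ten parameters, keep 30%.
rng = np.random.default_rng(0)
sens = [rng.random(10) for _ in range(3)]
print(server_prune(sens, 0.3))
```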

4. Experiments and Results Analysis

This section mainly introduces the experimental setup and analysis of experimental results.

4.1. Dataset

We use the THUCNews and Fudan Chinese text classification datasets as the experimental datasets. From THUCNews we select 12 categories of news data, each of which contains 5000 training samples and 5000 test samples. The Fudan Chinese text classification dataset contains 20 categories, with 9936 training samples and 9802 test samples. In a federated learning environment, each participant attracts users with different interests and characteristics, resulting in different distributions of data categories. Therefore, we sample the original datasets to obtain non-IID data for the participants. This paper uses the Dirichlet distribution partitioning method to assign each participant a non-IID, label-skewed training dataset, that is, each participant's training data has a different label distribution. The non-IID data distributions of 3 participants are shown in Figure 2.
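A common way to produce such label-skewed shards is sketched below (a generic construction, not the authors' script; the concentration parameter alpha = 0.5 is an assumption, as the paper does not report the value it used):

```python
import numpy as np

def dirichlet_partition(labels, num_clients, alpha=0.5, seed=0):
    """Partition sample indices into non-IID client shards by drawing, for
    each class label, a Dirichlet distribution over clients."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        idx = rng.permutation(np.where(labels == c)[0])
        # Proportion of class c assigned to each client.
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client, shard in enumerate(np.split(idx, cuts)):
            client_indices[client].extend(shard.tolist())
    return [np.array(ci) for ci in client_indices]

# Example: 12 classes, 3 participants, label-skewed shards.
labels = np.repeat(np.arange(12), 100)
shards = dirichlet_partition(labels, num_clients=3)
print([len(s) for s in shards])
```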

4.2. Evaluation Measures

We use internal tasks and external tasks to evaluate the quality of the word vectors. Internal tasks evaluate the quality of word vectors from the perspective of word semantics by calculating similarity scores and analogy accuracy between words; the external task is classification accuracy.
In this paper, the WordSim-240 and WordSim-297 datasets [21] are used for the word similarity internal task. These datasets contain 240 and 297 groups of manually labeled word similarity scores, respectively. The similarity score of each word pair is calculated from the word vectors trained with the FedAvg and FedW2V algorithms under the federated framework, and the word vector quality of the two algorithms is compared using the Spearman correlation coefficient. The Analogy dataset [21], which contains 1125 groups of word analogy data, is used for the word analogy internal task. For the external task, the TextCNN text classification model is used for classification evaluation in a centralized environment.
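For illustration, the similarity evaluation could be implemented as follows (a sketch with assumed input formats; word_vectors and pairs_with_scores are our own names):

```python
import numpy as np
from scipy.stats import spearmanr

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def evaluate_similarity(word_vectors, pairs_with_scores):
    """Compare model similarities against human-annotated scores with the
    Spearman correlation coefficient. `word_vectors` maps a word to its
    embedding; `pairs_with_scores` holds (word1, word2, human_score) triples
    such as WordSim-240/297 entries."""
    model_scores, human_scores = [], []
    for w1, w2, score in pairs_with_scores:
        if w1 in word_vectors and w2 in word_vectors:
            model_scores.append(cosine(word_vectors[w1], word_vectors[w2]))
            human_scores.append(score)
    rho, _ = spearmanr(model_scores, human_scores)
    return rho
```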

4.3. Experimental Results and Analysis

4.3.1. Training Results of Word Embedding Model

We use the FedAvg algorithm and the FedW2V aggregation algorithm proposed in this paper to train a word embedding model containing 10,000 words under the federated architecture. The evaluation results are shown in Table 1 and Table 2.
From the experimental results, we can see that the FedW2V algorithm proposed in this paper improves on the FedAvg algorithm in both the internal and external tasks. This is because, in the federated learning environment, the central server builds the global dictionary from the participants' local dictionaries, so some words in the global dictionary appear in only a few participants, or even in a single participant. When the FedAvg algorithm updates the global model on the central server, each participant applies the same update weight to the vector parameters of every word in the global dictionary, which makes the results imprecise. In contrast, the FedW2V algorithm proposed in this paper updates the global word vector model at word granularity, weighting each participant according to its share of each word's global frequency. Therefore, the FedW2V algorithm achieves better results.

4.3.2. Classification Model Communication Cost Results

In this experiment, the FedAvg algorithm and the Top-k parameter selection algorithm are compared with the FedInitPrune algorithm proposed in this paper. The word vectors are taken from the word embedding model trained by the FedW2V algorithm, and the classification model is TextCNN. To speed up model training, the text input length is set to 200 words. Figure 3 shows how accuracy changes over communication rounds in the federated training stage, and Table 3 shows the communication parameter volumes of the different algorithms under the federated architecture. One round of training includes the distribution of the global model by the central server and the upload of local model parameters by the participants; the total traffic is the cumulative communication volume at the end of training. Two metrics are used to evaluate performance as training rounds change: accuracy and communication cost.
Figure 3 shows the accuracy over communication rounds. It can be seen from Figure 3 and Table 3 that the FedAvg algorithm has relatively high accuracy because it transmits all model parameters in every round. With the Top-k algorithm at a 25% compression ratio and FedInitPrune at a 50% compression ratio, there is almost no impact on model performance. On the THUCNews dataset, when the model parameter compression ratio is 50%, the accuracy of the model is slightly higher than that of FedAvg, indicating that fewer model parameters may improve the generalization of the final model. When the compression is stronger, for example, the Top-k algorithm at 10% and FedInitPrune at 25%, the accuracy of the model decreases; in particular, in Figure 3a the accuracy drops sharply between communication rounds 45 and 85. This is because the Fudan Chinese text dataset, whose distribution is shown in Figure 2b, is imbalanced not only in label distribution but also in the data volume of each category. When only the top 10% most important parameters are selected, categories with smaller data volumes have no influence on the final result and are classified incorrectly, which reduces the accuracy. Table 4 shows the relationship between the different algorithms, model accuracy and compression ratio in more detail. Compared with the FedAvg algorithm, the accuracy of Top-k/10% and FedInitPrune/25% on the Fudan Chinese text classification dataset is reduced from 88.28% to 87.20% and 86.84%, respectively; on the THUCNews dataset, it decreases from 91.09% to 90.37% and 91.02%, respectively.
Figure 4 shows the change in parameter volume over the training rounds of the federated learning architecture. It can be seen that the FedInitPrune algorithm proposed in this paper significantly reduces the volume of federated learning communication parameters. The FedAvg algorithm reduces the federated communication cost by increasing the number of local training rounds, but all model parameters are still transmitted between participants and the server, so the communication volume per round remains large. The Top-k parameter selection algorithm uploads only the k largest parameter updates from each participant; since the parameters transmitted by each participant are not fixed, additional parameter location information must also be transmitted, and the central server still has to distribute all parameters of the aggregated global model to the participants for the next round of training. Therefore, its total parameter transmission is only slightly lower than that of the FedAvg algorithm, owing to the reduction in the upload phase alone. FedInitPrune, by contrast, trains a simplified model that has already been pruned, so each participant uploads a fixed set of parameters, no additional location information is needed, and the distributed global model is also simplified; the number of parameters transmitted in the model distribution phase is therefore relatively small, and the total transmission volume is significantly reduced. Compared with the FedAvg algorithm, FedInitPrune/25% reduces the communication parameters by 74.26% and 72.76% on the Fudan Chinese text classification dataset and the THUCNews dataset, respectively, with only a small loss in model performance. FedInitPrune/50% reduces the parameter volume by 49.34% and 47.84% on Fudan Chinese text classification and THUCNews, respectively, with almost no loss in model performance.
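As a sanity check on Tables 3 and 4 (our own arithmetic, not an additional experiment), the reported reduction ratios follow directly from the total communication volumes:

```python
# Reduction ratio = 1 - (total volume of pruned method / total volume of FedAvg).
fedavg_total = {"Fudan": 9.4674e9, "THUCNews": 2.3078e9}
prune25_total = {"Fudan": 2.4365e9, "THUCNews": 6.2872e8}

for ds in fedavg_total:
    reduction = 1 - prune25_total[ds] / fedavg_total[ds]
    print(f"FedInitPrune/25% on {ds}: {reduction:.2%} fewer communicated parameters")
# Prints roughly 74.26% (Fudan) and 72.76% (THUCNews), matching Table 4.
```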

5. Conclusions

We propose a communication-efficient text classification method based on a deep learning model and the federated learning architecture. In the word embedding training phase, the FedW2V algorithm is used to improve the quality of the word embedding model and provide a high-quality input for the TextCNN model. In the federated training phase, a global parameter pruning strategy applied at initialization is used to reduce the communication cost of federated training. The experimental results on the Fudan Chinese text classification dataset and the THUCNews dataset show that the FedW2V algorithm improves the quality of the word embedding model. Compared with the classical FedAvg algorithm, the FedInitPrune algorithm reduces communication costs by 70–75%, while accuracy decreases by less than 2%. The method reduces both uplink and downlink traffic in the federated communication phase and can easily be combined with quantization methods: on top of FedInitPrune, quantization techniques could be used to compress the model parameters transmitted in each communication round to further reduce the cost of federated communication.

Author Contributions

Conceptualization: Z.H.; data curation: Y.H.; formal analysis: Y.F.; funding acquisition: Z.H.; methodology: Z.H. and Y.F.; validation: Y.H.; writing—original draft: Y.F.; writing—review and editing: Z.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research was funded by the National Natural Science Foundation of China, grant number 62002098, and the Natural Science Foundation of Hebei Province, grant numbers F2020207001 and F2021207005.

Data Availability Statement

The data presented in this study are available on request from the corresponding author. The data are not publicly available due to privacy issues.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Zhou, L.; Pan, S.; Wang, J.; Vasilakos, A.V. Machine learning on big data: Opportunities and challenges. Neurocomputing 2017, 237, 350–361.
  2. Yang, Q.; Liu, Y.; Chen, T.J.; Tong, Y. Federated machine learning: Concept and applications. ACM Trans. Intell. Syst. Technol. 2019, 10, 1–19.
  3. Yin, X.; Zhu, Y.; Hu, J. A Comprehensive Survey of Privacy-preserving Federated Learning: A Taxonomy, Review, and Future Directions. ACM Comput. Surv. 2022, 54, 1–36.
  4. Mikolov, T.; Chen, K.; Corrado, G.; Dean, J. Efficient Estimation of Word Representations in Vector Space. In Proceedings of the International Conference on Learning Representations, ICLR (Workshop Poster), Scottsdale, AZ, USA, 2–4 May 2013.
  5. Pennington, J.; Socher, R.; Manning, C.D. GloVe: Global Vectors for Word Representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1532–1543.
  6. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 17th Annual Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL-HLT 2019), Minneapolis, MN, USA, 2–7 June 2019; pp. 4171–4186.
  7. Kim, S.; Park, H.; Lee, J. Word2vec-based latent semantic analysis (W2V-LSA) for topic modeling: A study on blockchain technology trend analysis. Expert Syst. Appl. 2020, 152, 113401.
  8. Pablos, A.G.; Cuadros, M.; Rigau, G. W2VLDA: Almost unsupervised system for Aspect Based Sentiment Analysis. Expert Syst. Appl. 2018, 91, 127–137.
  9. Sharma, A.; Kumar, S. Ontology-based semantic retrieval of documents using Word2vec model. Data Knowl. Eng. 2023, 144, 102110.
  10. Ma, J.; Wang, L.; Zhang, Y.-R.; Yuan, W.; Guo, W. An integrated latent Dirichlet allocation and Word2vec method for generating the topic evolution of mental models from global to local. Expert Syst. Appl. 2023, 212, 118695.
  11. Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Found. Trends Mach. Learn. 2021, 14, 1–210.
  12. Li, Q.; Wen, Z.; Wu, Z.; Hu, S.; Wang, N.; Li, Y.; Liu, X.; He, B. A survey on federated learning systems: Vision, hype and reality for data privacy and protection. IEEE Trans. Knowl. Data Eng. 2021, 35, 3347–3366.
  13. McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of Artificial Intelligence and Statistics (AISTATS); Proc. Mach. Learn. Res. 2017, 54, 1273–1282.
  14. Bathla, G.; Singh, P.; Singh, R.K.; Cambria, E.; Tiwari, R. Intelligent fake reviews detection based on aspect extraction and analysis using deep learning. Neural Comput. Appl. 2022, 34, 20213–20229.
  15. Yin, L.; Feng, J.; Xun, H.; Sun, Z.; Cheng, X. A privacy-preserving federated learning for multiparty data sharing in social IoTs. IEEE Trans. Netw. Sci. Eng. 2021, 8, 2706–2718.
  16. Dong, Y.; Hou, W.; Chen, X.; Zeng, X. Efficient and Secure Federated Learning Based on Secret Sharing and Selection. J. Comput. Res. Dev. 2020, 57, 10.
  17. Konečný, J.; McMahan, H.B.; Yu, F.X.; Richtárik, P.; Suresh, A.T.; Bacon, D. Federated learning: Strategies for improving communication efficiency. arXiv 2016, arXiv:1610.05492.
  18. Reisizadeh, A.; Mokhtari, A.; Hassani, H.; Jadbabaie, A.; Pedarsani, R. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. Proc. Mach. Learn. Res. 2020, 108, 2021–2031.
  19. Wang, Y.; Zhang, X.; Xie, L.; Zhou, J.; Su, H.; Zhang, B.; Hu, X. Pruning from Scratch. In Proceedings of the AAAI Conference on Artificial Intelligence, New York, NY, USA, 7–12 February 2020; Volume 34, pp. 12273–12280.
  20. Li, K.; Wang, H.; Zhang, Q. FedTCR: Communication-efficient federated learning via taming computing resources. Complex Intell. Syst. 2023, 1–21.
  21. Chen, X.; Xu, L.; Liu, Z.; Sun, M.; Luan, H. Joint Learning of Character and Word Embeddings. In Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence, Buenos Aires, Argentina, 25–31 July 2015.
Figure 1. Federated Learning System Architecture.
Figure 2. Data distributions of datasets.
Figure 3. Accuracy under changing communication rounds. (a) is the result of the Fudan Chinese text classification dataset, and (b) is the result of the THUCNews dataset.
Figure 4. Parameter volume on different communication rounds. (a) is the result of the Fudan Chinese text classification dataset, and (b) is the result of the THUCNews dataset.
Table 1. Quality evaluation of the word embedding model on the Fudan Chinese text classification dataset.

Aggregation Algorithm | External Task: Classification Accuracy | Internal Task: Similarity (WordSim-240) | Internal Task: Similarity (WordSim-297) | Internal Task: Analogy
FedAvg | 88.68% | 34.98 | 52.70 | 20.88%
FedW2V | 90.25% | 39.27 | 60.99 | 28.57%
Table 2. Quality evaluation of the word embedding model on the THUCNews dataset.

Aggregation Algorithm | External Task: Classification Accuracy | Internal Task: Similarity (WordSim-240) | Internal Task: Similarity (WordSim-297) | Internal Task: Analogy
FedAvg | 91.08% | 32.77 | 51.54 | 34.77%
FedW2V | 91.29% | 56.48 | 54.29 | 56.48%
Table 3. Communication data volume of different algorithms in the FL architecture (bit).

Algorithms | Volume/Round (Fudan) | Volume/Round (THUCNews) | Total Volume (Fudan, 200 Rounds) | Total Volume (THUCNews, 50 Rounds)
FedAvg | 4.7337 × 10^7 | 4.6156 × 10^7 | 9.4674 × 10^9 | 2.3078 × 10^9
Top-k/10% | 2.8700 × 10^7 | 2.7991 × 10^7 | 5.7399 × 10^9 | 1.3995 × 10^9
Top-k/25% | 3.5800 × 10^7 | 3.4914 × 10^7 | 7.1600 × 10^9 | 1.7457 × 10^9
FedInitPrune/25% | 1.1948 × 10^7 | 1.1651 × 10^7 | 2.4365 × 10^9 | 6.2872 × 10^8
FedInitPrune/50% | 2.3744 × 10^7 | 2.3153 × 10^7 | 4.7962 × 10^9 | 1.2038 × 10^9
Table 4. Model accuracy and compression ratio.

Algorithms | Accuracy (Fudan) | Accuracy (THUCNews) | Compression Ratio (Fudan) | Compression Ratio (THUCNews)
FedAvg | 88.28% | 91.09% | −0% | −0%
Top-k/10% | 87.20% | 90.37% | −39.37% | −39.36%
Top-k/25% | 88.25% | 90.88% | −24.37% | −24.36%
FedInitPrune/25% | 86.84% | 91.02% | −74.26% | −72.76%
FedInitPrune/50% | 88.15% | 91.71% | −49.34% | −47.84%
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
