1. Introduction
Recently, deep neural networks (DNNs) and convolutional neural networks (CNNs) have been widely applied to complicated signal processing tasks, such as classification and signal regression problems, due to their outstanding performance in nonlinear adaptability and feature extraction ([1,2,3] and references therein), and they have also been extended to distributed sensing systems (e.g., object recognition using distributed micro-Doppler radars in [4] and data-driven digital healthcare applications [5,6,7]). In distributed sensing systems, centralized training strategies may be adopted to train a common DNN or CNN module by sharing the sensing data. However, due to the data size and the privacy issues of the locally collected data, centralized training is not desirable, especially when the capacity of the backhaul link for the data exchange is limited.
The federated learning approach has been extensively investigated as an alternative distributed machine learning method [8,9] in which, rather than sharing their locally collected datasets, the clients report their stochastic gradient information (minimizing the loss function with respect to their local datasets) to the main server. The main server then aggregates the stochastic gradient information and broadcasts it to the clients. Accordingly, to achieve an unbiased stochastic gradient at the main server, training data sampling methods have been investigated [10,11]. Furthermore, to reduce the communication overhead of transmitting the updated gradient information (proportional to the number of weights in the DNNs and CNNs), an efficient weight aggregation protocol for federated learning is proposed in [12], and a structured updating method for communication cost reduction is proposed in [13]. However, these works assume that the stochastic gradient information is perfectly transferred from the multiple clients to the main server without any distortion.
In the federated learning process, when the clients are wirelessly connected, the local gradient information needs to be transferred from the distributed clients to the server over the wireless backhaul link and can be distorted due to wireless channel fading. In [14,15,16,17], federated learning strategies over the wireless backhaul are proposed for the MNIST handwriting image classification and the associated wireless resources are efficiently optimized. In [14,15], the average of the local stochastic gradient vectors is recovered at the server when the preprocessed local gradient vectors are transferred from the clients. In [16], a compressive sensing approach is proposed to estimate the local gradient vectors at the server. In [17], a joint communication and federated learning model is developed, where resource allocation and client selection methods are proposed such that the packet error rates of the communication links between the server and the clients are optimized. We note that most of the previous works have focused on the estimation of the local stochastic gradient vectors at the server.
In this paper, we also consider a federated learning system in which distributed clients are connected to the server via a wireless backhaul link, and we develop efficient training strategies for federated learning over this link. Differently from the previous works, where the average of the local stochastic gradient vectors (i.e., equal-weight combining) is recovered at the server, we propose an efficient gradient updating method in which the local gradients are combined such that the effective signal-to-noise ratio (SNR) is maximized at the server. In addition, we propose a binary gradient updating strategy based on thresholding, in which a round in which all channels have small channel gains is excluded from the federated learning. That is, when the backhaul links of all clients simultaneously have channel gains smaller than a predefined threshold, the server may receive severely distorted gradient vectors, which can be avoided through the proposed updating with thresholding. Furthermore, because each client has limited transmission power, it is effective to allocate more power to the channel slots carrying important information, rather than allocating power equally to all channel resources (equivalently, slots). Accordingly, we also propose an adaptive power allocation method, in which each client allocates its transmit power proportionally to the magnitude of the gradient information. This is because, when training a deep learning model, gradient elements with large values imply a large change of the corresponding weights to decrease the loss function.
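The gradient combining and thresholding described above can be illustrated in a few lines of code. This is a minimal sketch under simplifying assumptions (real-valued per-element transmissions, channel gains known at the server); the exact formulation is developed in Section 4, and the names `aggregate_mrc_threshold` and `eps` are ours, not the paper's.

```python
import numpy as np

def aggregate_mrc_threshold(grads, h, sigma_n, eps, rng):
    """Combine noisy local gradients with MRC-style channel-gain weights.

    grads: list of local gradient vectors g^(l); h: per-client channel gains.
    Returns None when every |h_l| < eps, i.e., the round is discarded.
    """
    if max(abs(hl) for hl in h) < eps:
        return None  # thresholding: skip rounds with uniformly weak channels
    # each client's gradient arrives scaled by its channel gain plus AWGN
    received = [hl * g + sigma_n * rng.standard_normal(g.shape)
                for hl, g in zip(h, grads)]
    # weight each observation by its channel gain and normalize by sum |h_l|^2,
    # so stronger (higher-SNR) links contribute more to the aggregate
    return sum(hl * y for hl, y in zip(h, received)) / sum(hl ** 2 for hl in h)

rng = np.random.default_rng(0)
grads = [np.ones(4), 2 * np.ones(4)]
print(aggregate_mrc_threshold(grads, [0.05, 0.08], 0.1, 1.0, rng))  # None (round skipped)
```

With noiseless unit-gain channels the rule reduces to the plain average of the local gradients, so equal-weight combining is recovered as a special case.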
Through extensive computer simulations, we find that the proposed gradient updating methods improve the federated learning performance over the wireless channel. Specifically, due to the distortion over the wireless channel, the classification accuracy of equal-weight combining decreases drastically as the rounds of the federated learning increase. In contrast, the proposed effective SNR maximizing scheme with thresholding exhibits accuracy comparable to that of federated learning over an error-free backhaul link. We note that, as the threshold level increases, the federated learning proceeds more stably, because a highly distorted gradient update vector caused by a small channel gain can be discarded by a large threshold level. However, a large threshold level may incur gradient updating delay; the adaptive power allocation strategy can improve the trade-off between the federated learning performance and the learning delay due to the threshold level.
The rest of this paper is organized as follows. In Section 2, the system model for the federated learning system with the wireless backhaul is presented, in which the distributed clients share a common CNN module for handwriting character recognition. In Section 3, the federated learning process for handwriting character recognition is described. In Section 4, gradient updating methods are proposed; in addition, an adaptive power allocation method is developed considering the importance of the gradient information. In Section 5, we provide several simulation results, and in Section 6, we give our conclusions.
2. System Model
In Figure 1, we consider the federated learning system with a wireless backhaul, where the L clients have their own datasets to train their local networks. Here, a common neural network model is shared by all clients and is trained through federated learning over the wireless backhaul connected to the main server. The common neural network is designed for the classification problem, in which the label $\hat{d}$ is inferred from the network output for the l-th client's measured data with label d, ${\mathbf{S}}_{d}^{\left(l\right)}$. That is,

$\hat{d}=\underset{{d}^{\prime}\in \{1,\dots ,D\}}{\arg\max}\phantom{\rule{0.2em}{0ex}}{\mathbf{x}}_{out}\left[{d}^{\prime}\right],\phantom{\rule{1em}{0ex}}{\mathbf{x}}_{out}=\mathbf{f}({\mathbf{S}}_{d}^{\left(l\right)};\mathbf{\theta}),\phantom{\rule{2em}{0ex}}(1)$

where
$\mathbf{f}(\cdot;\mathbf{\theta})$ denotes the nonlinear neural network function with the model parameter $\mathbf{\theta}(\in {\mathbb{R}}^{P\times 1})$ that gives an estimate of the categorical label probability vector as its output. Here, P denotes the number of weights in the common neural network model and ${\mathbf{x}}_{out}\left[d\right]$ is the d-th element of the vector ${\mathbf{x}}_{out}$. We note that the size of the model parameter (P) is determined by the structure of the neural network model. Specifically, in the case of a convolutional layer with K filters of size ${K}_{f1}\times {K}_{f2}$, the number of weights is given as ${K}_{f1}\times {K}_{f2}\times K+K$, which accounts for the kernel size (${K}_{f1}\times {K}_{f2}$), the number of kernels (K) and the number of biases (K). In the case of a single fully-connected layer, the number of weights is calculated as ${N}_{in}\times {N}_{nr}+{N}_{nr}$, where ${N}_{in}$ and ${N}_{nr}$ denote the input size and the number of neurons, respectively. See also Section 2.1. We note that, because the data collected at each client are generally of large dimension and subject to privacy concerns, it is not desirable to report the collected data to the server. Furthermore, the large dimension of the data may place a significant burden on a typical backhaul link when a number of training datasets are transmitted. Instead, the neural network model
$\mathbf{f}(\cdot;\mathbf{\theta})$ is shared over all clients and $\mathbf{\theta}$ can be locally trained with the data obtained at each client. By denoting ${\mathbf{\theta}}^{\left(l\right)}$ as the model parameter trained at the l-th client, ${\mathbf{\theta}}^{\left(l\right)}$ is reported to the server through the wireless uplink backhaul for the federated learning. The associated federated learning strategies and power allocation over the wireless backhaul are discussed in more detail in Section 4.
2.1. CNN Architecture for Handwriting Character Recognition
Throughout the paper, the multiple clients have a common neural network for handwriting character recognition. Specifically, a typical CNN module is considered for the character image classification as in Figure 2, but the proposed federated learning strategy can be applied to other CNN models. The nonlinear neural network function $\mathbf{f}({\mathbf{S}}_{i}^{\left(l\right)};\mathbf{\theta})$ in (1) is composed of an input layer, convolutional layers, activation layers, a max pooling layer, a fully-connected layer, and an output layer. See Section 5 for the specific values of the hyperparameters of the CNN module.
– Convolution layer: The handwriting image matrix ${\mathbf{S}}_{i}^{\left(l\right)}\in {\mathbb{R}}^{{N}_{width}\times {N}_{height}}$ is exploited as the input of the convolution layers. In addition, each element of their output is computed through the convolution operation with a ${K}_{fi1}\times {K}_{fi2}$ filter (equivalently, kernel) for the i-th layer. Specifically, the output of the i-th convolution layer can be given as:
where ${X}_{(i-1)}\left[m,n,k\right]$ is the $(m,n,k)$-th element of ${\mathbf{X}}_{(i-1)}\in {\mathbb{R}}^{{m}_{i-1}\times {n}_{i-1}\times {k}_{i-1}}$, the input of the i-th layer, and ${f}_{a}\left(\cdot\right)$ is an activation function. In addition, ${W}_{\left(i\right)}[p,q,k]$ is the $(p,q,k)$-th element of the filter matrix ${\mathbf{W}}_{\left(i\right)}$ at the i-th layer and ${b}_{\left(i\right)}\left[k\right]$ is the k-th element of a bias vector ${\mathbf{b}}_{\left(i\right)}$. Throughout the paper, the rectified linear unit (ReLU) function is used as the activation function, which is given as

${f}_{a}\left(x\right)=\mathrm{max}\left(0,x\right).$
– Max pooling layer: In the pooling layer, to reduce the dimension of the input data without losing useful information, the elements of the input are downsampled [18]. In the max pooling layer, after dividing the input matrix into multiple blocks, the maximum value in each block is sampled and forwarded to the dimension-reduced output matrix.
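As an illustration, non-overlapping max pooling can be written in a few lines of NumPy (our own sketch, independent of the specific pooling size used in the paper):

```python
import numpy as np

def max_pool2d(x, p):
    """Non-overlapping p x p max pooling; assumes both dims are divisible by p."""
    m, n = x.shape
    # split into (m/p x n/p) blocks of size p x p and take each block's maximum
    return x.reshape(m // p, p, n // p, p).max(axis=(1, 3))

x = np.arange(16.0).reshape(4, 4)
print(max_pool2d(x, 2))
# [[ 5.  7.]
#  [13. 15.]]
```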
– Flatten and fully-connected (FC) layers: The flatten layer changes the shape of the output of the convolution layers into a vector, which is used as the input of the FC layer. We note that, in the case of a single fully-connected layer with ${N}_{in}$ input elements and ${N}_{nr}$ neurons, the number of weights is given as ${N}_{in}\times {N}_{nr}+{N}_{nr}$. In the FC layer, the output of the convolution layers is associated with a proper loss function such that the label is correctly identified after the training.
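The two weight-counting formulas above can be checked with a short script (a plain sketch; as in the paper's formula, the convolutional count assumes a single input channel, e.g., the first layer applied to a grayscale image):

```python
def conv_weights(kf1, kf2, k):
    # K filters of size Kf1 x Kf2 plus K biases: Kf1 * Kf2 * K + K
    return kf1 * kf2 * k + k

def fc_weights(n_in, n_nr):
    # N_in inputs fully connected to N_nr neurons plus N_nr biases
    return n_in * n_nr + n_nr

print(conv_weights(3, 3, 16))  # 160
print(fc_weights(128, 10))     # 1290
```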
Throughout the paper, the cross entropy (CE) is used as the loss function, which is given as

${L}_{CE}({\mathbf{x}}_{out},{\overline{\mathbf{L}}}_{d})=-\sum_{i=1}^{D}{\overline{\mathbf{L}}}_{d}\left[i\right]\,\mathrm{log}\,{\mathbf{x}}_{out}\left[i\right],$

where ${\mathbf{x}}_{out}\in {\mathbb{R}}^{D\times 1}$ is the output of the FC layer and ${\overline{\mathbf{L}}}_{d}$ is a one-hot encoded label vector of size D that has zeros in all elements except the d-th element, which is assigned a value of 1. Then, by using the local training datasets (${\Phi}_{tr}^{\left(l\right)}={\{{\mathbf{S}}_{d,tr}^{\left(l\right)},{\overline{\mathbf{L}}}_{d,tr}\}}_{t=1}^{{N}_{tr}}$) at the l-th client, the network function parameter can be updated as:

${\mathbf{\theta}}_{t}^{\left(l\right)}={\mathbf{\theta}}_{t-1}^{\left(l\right)}+{\mathbf{g}}_{t-1}^{\left(l\right)},\phantom{\rule{2em}{0ex}}(3)$

where ${\mathbf{g}}_{t-1}^{\left(l\right)}(\in {\mathbb{R}}^{P\times 1})$ denotes the gradient step such that the loss function is minimized for the local training datasets ${\Phi}_{tr}^{\left(l\right)}$ and is given as ${\mathbf{g}}_{t-1}^{\left(l\right)}\triangleq -\eta {\nabla}_{\mathbf{\theta}}\phantom{\rule{0.166667em}{0ex}}{L}_{CE}({\mathbf{x}}_{out},{\overline{\mathbf{L}}}_{d,tr};\mathbf{\theta}){|}_{\mathbf{\theta}={\mathbf{\theta}}_{t-1}^{\left(l\right)}}$ with a learning rate $\eta$.
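Numerically, the CE loss and the local update can be sketched as follows (a toy illustration with hypothetical values; `ce_loss` and `local_step` are our names, and the step already folds in the factor $-\eta$ so that it is added to the parameters, as in Algorithm 1):

```python
import numpy as np

def ce_loss(x_out, label_onehot):
    # cross entropy: -sum_i Lbar[i] * log(x_out[i]); x_out is a probability vector
    return -float(np.sum(label_onehot * np.log(x_out)))

def local_step(theta, grad_ce, eta=0.001):
    # g = -eta * grad(L_CE); the update adds g to the current parameters
    return theta + (-eta) * grad_ce

p = np.array([0.7, 0.2, 0.1])       # hypothetical network output probabilities
label = np.array([1.0, 0.0, 0.0])   # one-hot label vector
print(round(ce_loss(p, label), 4))  # 0.3567, i.e., -log(0.7)
```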
2.2. Signal Model for Wireless Backhaul
As in Figure 1, the clients are connected to the server through the wireless backhaul link. For the federated learning, the model parameters aggregated at the server are broadcast at each iteration of the training phase through the wireless downlink channel, while the model parameters trained at the l-th client are reported to the server through the wireless uplink backhaul link. Throughout the paper, we focus only on the uplink phase of the multiple access channel and assume that the broadcast channel for the downlink phase is error-free, as done in [15,16,19].
Assuming that the clients and the server each have a single antenna for the backhaul link, when a total of B channel resources with narrowband signal bandwidth are available (we note that the channel resources may be given in the frequency axis or in the time axis), the received signal at the server for the t-th round of the gradient update can be given as

${y}_{t}\left[b\right]=\sum_{l=1}^{L}{h}_{l,t}\left[b\right]{x}_{l,t}\left[b\right]+{n}_{t}\left[b\right],\phantom{\rule{2em}{0ex}}(4)$

for $b=1,\dots ,B$, where ${x}_{l,t}\left[b\right]$ is the precoded transmit signal of the l-th client at the b-th channel resource with $E[|{x}_{l,t}\left[b\right]{|}^{2}]=1$ for the t-th round. Here, ${h}_{l,t}\left[b\right]$ and ${n}_{t}\left[b\right]$ denote the aggregated Rayleigh fading channel and the zero-mean additive white Gaussian noise (AWGN) at the b-th channel resource, respectively. That is, ${h}_{l,t}\left[b\right]$ follows a Gaussian distribution with zero mean and variance ${\sigma}_{{h}_{l}}^{2}$ (that is, ${h}_{l,t}\left[b\right]\sim \mathcal{N}(0,{\sigma}_{{h}_{l}}^{2})$). Likewise, ${n}_{t}\left[b\right]\sim \mathcal{N}(0,{\sigma}_{n}^{2})$. In addition, the wireless channel is constant over each round of the federated learning process, but changes independently from round to round. By concatenating ${y}_{t}\left[b\right]$ in (4), the received signal at the server can be vectorized as:

${\mathbf{y}}_{t}=\sum_{l=1}^{L}{\mathbf{H}}_{l,t}{\mathbf{x}}_{l,t}+{\mathbf{n}}_{t},\phantom{\rule{2em}{0ex}}(5)$

where ${\mathbf{H}}_{l,t}=diag\{{h}_{l,t}\left[1\right],\dots ,{h}_{l,t}\left[B\right]\}$, ${\mathbf{x}}_{l,t}={[{x}_{l,t}\left[1\right],\dots ,{x}_{l,t}\left[B\right]]}^{T}$ and ${\mathbf{n}}_{t}={[{n}_{t}\left[1\right],\dots ,{n}_{t}\left[B\right]]}^{T}$. Here, $diag\left\{{a}_{1},\dots ,{a}_{B}\right\}$ denotes a $B\times B$ diagonal matrix having ${a}_{1},\dots ,{a}_{B}$ as its diagonal elements.
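The received-signal model can be simulated directly. The sketch below draws an independent real Gaussian gain for every resource b purely for illustration (in the paper the channel is constant within a round); L = 3 clients and the variances {0.3, 1.0, 3.0} mirror the simulation setup of Section 5, while B = 128 matches the gradient-splitting size used there.

```python
import numpy as np

rng = np.random.default_rng(1)
L, B = 3, 128                        # clients and channel resources
sigma_h2 = np.array([0.3, 1.0, 3.0]) # per-client channel variances sigma_{h_l}^2
sigma_n = 0.1                        # AWGN standard deviation

x = rng.standard_normal((L, B))                               # unit-power x_{l,t}[b]
h = np.sqrt(sigma_h2)[:, None] * rng.standard_normal((L, B))  # h_{l,t}[b] ~ N(0, sigma_{h_l}^2)
n = sigma_n * rng.standard_normal(B)                          # n_t[b]

# y_t[b] = sum_l h_{l,t}[b] * x_{l,t}[b] + n_t[b] (uplink multiple access channel)
y = (h * x).sum(axis=0) + n
print(y.shape)  # (128,)
```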
3. Federated Learning for Handwriting Character Recognition
Note that, as in (3), the CNN parameter ${\mathbf{\theta}}^{\left(l\right)}$ can be trained with the local training datasets at each client, which limits the adaptability of the CNN due to the lack of globally measured data. Accordingly, to train the parameters globally, the federated learning strategy is exploited, which is known as an efficient learning strategy suitable for multi-client environments such as our system model shown in Figure 1.
Specifically, during the t-th round of the training phase, each client receives the gradient of the model parameter ${\mathbf{g}}_{t-1}$ from the server via the backhaul link. Then, by exploiting ${\mathbf{g}}_{t-1}$ instead of ${\mathbf{g}}_{t-1}^{\left(l\right)}$ in (3), the network function parameter can be updated as:

${\mathbf{\theta}}_{t}^{\left(l\right)}={\mathbf{\theta}}_{t-1}^{\left(l\right)}+{\mathbf{g}}_{t-1}.\phantom{\rule{2em}{0ex}}(6)$
We note that ${\mathbf{g}}_{t-1}$ is the globally aggregated gradient computed at the server, which tends to minimize the loss function with respect to the data collected at all clients. Then, each client can compute its next local gradient ${\mathbf{g}}_{t}^{\left(l\right)}$ such that the local loss function is minimized for the locally collected datasets ${\Phi}_{tr}^{\left(l\right)}$. The locally updated gradient vector is then reported to the server via the backhaul link. The server can then aggregate the local gradient vectors to get ${\mathbf{g}}_{t}$ as:

${\mathbf{g}}_{t}={\mathbf{f}}_{g}({\mathbf{g}}_{t}^{\left(1\right)},\dots ,{\mathbf{g}}_{t}^{\left(L\right)}),\phantom{\rule{2em}{0ex}}(7)$

where the function ${\mathbf{f}}_{g}\left(\cdot\right)$ represents the gradient aggregation function. In [20], the FederatedAveraging technique (i.e., equal-weight combining) is proposed, which is given as:

${\mathbf{g}}_{t}=\frac{1}{L}\sum_{l=1}^{L}{\mathbf{g}}_{t}^{\left(l\right)}.\phantom{\rule{2em}{0ex}}(8)$

The aggregated gradient ${\mathbf{g}}_{t}$ is again broadcast to the multiple clients and exploited to update the neural network model at each client. The above steps are repeated for a given number of rounds, T.
At the beginning of the training phase, the server needs to initialize the global model parameters; throughout the paper, the parameters are initialized based on the He normal weight initialization method [21], which is advantageous when used with the ReLU activation function. Based on the above description, the generalized federated learning process is summarized in Algorithm 1.
Algorithm 1. Generalized federated learning training process.
1: Initialize ${\mathbf{\theta}}_{0}$ based on the He normal weight initialization method
2: ${\mathbf{g}}_{0}\leftarrow \mathbf{0}$
3: for $t\leftarrow 1$ to T do
4:  (Clients) ${\mathbf{\theta}}_{t}^{\left(l\right)}\leftarrow {\mathbf{\theta}}_{t-1}^{\left(l\right)}+{\mathbf{g}}_{t-1}$
5:  (Clients) Update ${\mathbf{g}}_{t}^{\left(l\right)}$ from the datasets ${\Phi}_{tr}^{\left(l\right)}$
6:  (Clients) Report ${\mathbf{g}}_{t}^{\left(l\right)}$ to the server via the backhaul link
7:  (Server) ${\mathbf{g}}_{t}\leftarrow {\mathbf{f}}_{g}({\mathbf{g}}_{t}^{\left(l\right)},\phantom{\rule{3.33333pt}{0ex}}l=1,\dots ,L)$ as in (7)
8:  (Server) Broadcast ${\mathbf{g}}_{t}$ to the clients
9: end for
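Algorithm 1 can be mirrored in a few lines of code. This is a toy sketch: the "clients" below return gradient steps of simple quadratic losses instead of CNN/CE gradients, the aggregation is the equal-weight FederatedAveraging combining, and all names and constants are illustrative.

```python
import numpy as np

def federated_train(clients, T, p_dim, rng):
    """Run Algorithm 1 with callables mapping parameters to local gradient steps."""
    theta = rng.standard_normal(p_dim) * np.sqrt(2.0 / p_dim)  # He-style init (line 1)
    g = np.zeros(p_dim)                                        # line 2
    for t in range(1, T + 1):                                  # line 3
        theta = theta + g                                      # line 4: apply broadcast gradient
        local = [c(theta) for c in clients]                    # lines 5-6: local steps, reported
        g = np.mean(local, axis=0)                             # line 7: equal-weight aggregation
    return theta

# toy clients: each wants to minimize ||theta - target||^2, so its
# gradient step is -eta * 2 * (theta - target) with eta = 0.1
rng = np.random.default_rng(0)
targets = [np.full(4, v) for v in (1.0, 2.0, 3.0)]
clients = [lambda th, tg=tg: -0.1 * 2 * (th - tg) for tg in targets]
theta = federated_train(clients, 200, 4, rng)
print(np.round(theta, 2))  # [2. 2. 2. 2.], the mean of the client targets
```

With an ideal (error-free) link, the parameters converge to the minimizer of the aggregate client loss; this is the baseline against which the wireless-distorted cases in Section 5 are compared.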
Differently from centralized learning, the datasets collected by each client do not need to be reported to the main server in Algorithm 1. We note that, in many cases, data sharing is not free from security, regulatory and privacy issues [8]. We also note that the communication cost of centralized learning depends on the number and size of the collected data [22,23]. In contrast, the communication cost of federated learning is independent of the data size, but depends on the CNN architecture (specifically, the number of weights in the CNN).
5. Experiment Results
To validate the proposed federated learning training strategy discussed in Section 4, we develop the CNN module for handwriting character recognition with the architecture shown in Figure 2. Specifically, the CNN module has three two-dimensional convolutional layers, and the values of the hyperparameters exploited in the computer simulations are summarized in Table 1. The number of elements in the gradient vector ${\mathbf{g}}^{\left(l\right)}$ is then given as $5.26\times {10}^{4}$. The CNN module is shared by three clients connected to the server over the wireless channel. Throughout the simulations, we exploit the handwritten MNIST dataset, where ${N}_{width}={N}_{height}=28$. The received SNR at the server is defined as:
where ${\sigma}_{n}^{2}$ is the variance of the AWGN. In addition, we split the gradient vector into multiple vectors having 128 elements (i.e., $\overline{P}=128$ in (9)).
In Figure 3 (respectively, Figure 4), we evaluate the classification accuracy and CE loss of the conventional gradient updating method based on equal-weight combining and of the proposed updating method based on MRC discussed in Section 4.2 for high SNR ($SN{R}_{rec}=15$ dB) (respectively, low SNR ($SN{R}_{rec}=-10$ dB)). For comparison purposes, the performance of federated learning with an error-free backhaul link is also evaluated. Here, the channel gains of the clients are set as ${\sigma}_{{h}_{l}}^{2}=\{0.3,1.0,3.0\}$ and the threshold level is given as $\epsilon =1.0$; this value was determined experimentally. For the local training of the commonly shared CNN module, the ADAM optimizer is adopted [27] at each client with a fixed learning rate $\eta =0.001$.
From Figure 3, when the backhaul link is perfect and noise-free, the classification accuracy increases with the number of rounds and an accuracy of up to 0.97 can be achieved. In contrast, due to the channel fading and noise in the wireless backhaul link, training does not proceed stably when the conventional equal-weight combining is exploited. At round 120, there is a sharp increase in the loss curve from $0.28$ to $2.75$, resulting in a decrease in the accuracy from $0.92$ to $0.11$. In contrast, the proposed updating method based on MRC in Section 4.2 exhibits a performance similar to that with the perfect backhaul link. In Figure 4, it can be found that, for low SNR, the classification accuracy of the equal-weight combining does not improve as the rounds increase and stays below 0.15. In addition, the associated CE loss diverges. At low SNR, it is difficult to recover from the distortion caused over the wireless backhaul link when transmitting the gradient for the model update. In particular, under channel distortion, the equal-weight combining does not reflect the received SNR in the gradient update and fails to train the distributed CNN modules. Interestingly, the updating method based on MRC and thresholding shows unstable peaks in the CE loss, but it can avoid CE loss divergence and improves the classification accuracy as the learning rounds increase.
In Figure 5, we evaluate the classification accuracy for various threshold levels $\epsilon$ with (a) $SN{R}_{rec}=15$ dB and (b) $SN{R}_{rec}=-10$ dB when the updating method with MRC and thresholding in Section 4.2.2 is exploited. From Figure 5a, at high SNR, the federated learning operates well with the gradient updating method with MRC and thresholding, regardless of the threshold level. However, for $\epsilon =10.0$, the accuracy does not effectively increase as the learning rounds increase. That is, for a larger threshold level, more local gradient vectors transferred through the wireless channel may be discarded. From Figure 5b, it can be found that the classification performance is more sensitive to the threshold level at low SNR than in the high SNR case. Specifically, as $\epsilon$ becomes larger, the federated learning is performed more stably. This is because the gradient update vectors containing noise amplified by small channel gains can be discarded for large $\epsilon$. We note that a large $\epsilon$ may incur gradient updating delay, which leads to a trade-off between the federated learning performance and the learning delay.
In Figure 6, to validate the adaptive power allocation strategy in Section 4.3, we evaluate the classification accuracy of various gradient updating methods with and without the adaptive power allocation strategy when the received SNR is low, for different threshold levels (i.e., (a) $\epsilon =1.0$ and (b) $\epsilon =0.1$). It can be found that the accuracy of the MRC-based gradient updating method with $\epsilon =1.0$ in Figure 6a is more stable than that with $\epsilon =0.1$ in Figure 6b, which coincides with the observation in Figure 5. Interestingly, by exploiting the adaptive power allocation strategy jointly with the MRC-based gradient updating method in Figure 6a, the accuracy can be improved to 96.7% and becomes comparable to the performance with an error-free backhaul link. In addition, from Figure 6b, the adaptive power allocation strategy drastically stabilizes the federated learning performance during the learning process over the wireless channel, even for the small threshold $\epsilon =0.1$. Accordingly, the adaptive power allocation strategy improves the trade-off between the federated learning performance and the learning delay due to the threshold level discussed in Figure 5.
In Table 2 and Table 3, the confusion matrices for the test dataset are evaluated after the federated learning is completed, where the proposed gradient updating method (Table 2) and the conventional updating method (Table 3) are exploited, respectively. From Table 2, the proposed gradient updating method shows a classification accuracy of 0.9 or more for all labels. However, from Table 3, the CNN module trained through the conventional gradient updating method over the wireless channel misclassifies most test data with specific labels.
6. Conclusions
In this paper, efficient gradient updating strategies are developed for federated learning when distributed clients are connected to the server via a wireless backhaul link. That is, a common CNN module is shared by all the distributed clients and is trained through federated learning over the wireless backhaul connected to the main server. During the training phase, the local gradients need to be transferred from the distributed clients to the server over a noisy wireless backhaul link. To overcome the distortion due to wireless channel fading, an effective SNR maximizing gradient updating method is proposed, in which the gradients are combined such that the effective SNR is maximized at the server. In addition, when the backhaul links of all clients simultaneously have small channel gains, the server may receive severely distorted gradient vectors. Accordingly, we propose a binary gradient updating strategy based on thresholding, in which a round in which all channels have small channel gains is excluded from the federated learning; this results in a trade-off between the federated learning performance and the learning delay. Due to the channel fading and noise in the wireless backhaul link, training does not proceed stably with the conventional equal-weight combining, especially at low SNR. In contrast, the updating method based on MRC and thresholding improves the classification accuracy as the learning rounds increase by avoiding CE loss divergence. Finally, we also propose an adaptive power allocation method, in which each client allocates its transmit power proportionally to the magnitude of the gradient information; note that gradient elements with large values imply a large change of the corresponding weights to decrease the loss function. Through computer simulations, it is confirmed that the adaptive power allocation strategy can improve the trade-off between the federated learning performance and the learning delay due to the threshold level.