Abstract

Federated learning (FL) is a distributed machine learning (ML) framework. In FL, multiple clients collaborate to solve traditional distributed ML problems under the coordination of a central server, without sharing their local private data with others. This paper mainly surveys FL methods based on machine learning and deep learning. First, it introduces the development process, definition, architecture, and classification of FL and explains the concept of FL by comparing it with traditional distributed learning. Then, it describes typical problems of FL that still need to be solved. On the basis of the classical FL algorithms, several federated machine learning algorithms are briefly introduced, with an emphasis on deep learning, and those algorithms are classified and compared. Finally, the paper discusses possible future directions of FL based on deep learning.

1. Introduction

In the era of big data, people pay more and more attention to data security and user privacy, and the protection of data has become a focus of enterprises and individuals. In addition, data leakage has attracted the attention of governments and public media in recent years. Major countries and unions have written the supervision of citizens' data security and privacy into law; the General Data Protection Regulation (GDPR) issued by the European Union [1] came into effect on May 25, 2018. China's Cyber Security Law, promulgated in 2017, requires Internet companies not to disclose or tamper with the personal information they collect from users and, when conducting data transactions with third parties, to ensure that both the Internet company and the third party comply with user data protection obligations [24]. Data privacy protection in various countries is becoming stricter, so large-scale transfers of private user data between different companies will no longer be allowed. The promulgation of these laws and regulations, on the one hand, protects the privacy of users and, on the other hand, prohibits big data from being mined arbitrarily, which restricts the development of artificial intelligence. Big data is the basis of large-scale distributed ML. Under the restrictions of the abovementioned laws and regulations, data often exist in the form of isolated islands among different enterprises, and even among different subsidiaries of one group.

The term “federated learning” was put forward by McMahan et al. [5] in 2016: “We call our approach FL because learning tasks are solved through a loose federation of participating devices (what we call clients) coordinated by a central server.” FL was originally defined as a distributed ML method that uses data from multiple users to train a central model [6]. The purpose of FL is to carry out efficient distributed ML between multiple participants or computing nodes on the premise of ensuring the information security of big data exchange, protecting on-device data and personal privacy, and ensuring legal compliance. FL uses the framework of classical distributed ML and adopts distributed ML technology, but the role of the central server differs from that in distributed ML. Researchers can mine and utilize data without violating laws and regulations. In a broad sense, FL refers to a method by which data owners can train models without uploading their local data [2]. FL modeling is based on the local models uploaded by the participants, and the jointly trained model is then returned to each participant; results similar to those of traditional ML are obtained without violating the law, which gives FL the advantage of confidentiality.

However, the classical FL algorithm has shortcomings in dealing with nonindependent and identically distributed (non-IID) data, communication transmission, and model establishment, and the resulting solutions are too numerous to enumerate. Therefore, after consulting the relevant literature, this paper introduces classical FL and FL algorithms that have been improved in certain aspects. Moreover, in the era of big data, FL based on deep learning is more effective. This paper focuses on recent developments in federated deep learning algorithms, which are sorted out and summarized. We hope this article makes it easier for readers to quickly review the whole FL field, especially the federated deep learning subfield.

The content of this paper is organized as follows: Section 2 introduces the basic knowledge of FL; Section 3 introduces some unsolved problems of FL; Section 4 introduces FL algorithms based on ML; Section 5 introduces FL algorithms based on pan-deep learning; Section 6 introduces attacks on FL; Section 7 describes future challenges of FL; and finally, Section 8 summarizes this paper.

2. Basic Knowledge of Federated Learning

2.1. Machine Learning

With the rapid development of ML, its models are becoming more and more complex and effective [7, 8]. The core idea of ML is that the computer learns the mapping between input and output from existing data samples, $y = f(x; \theta)$, where $x$ is the input, $y$ is the output, $f$ is the corresponding rule, and $\theta$ is the parameter to be learned. According to this correspondence, the model predicts the output value of the next input. The purpose of ML is to make the gap between the predicted value and the real value as small as possible. Mathematically, this is expressed as

$$\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} L\bigl(y_i, f(x_i; \theta)\bigr),$$

where $L$ is a loss function measuring the gap between the prediction $f(x_i;\theta)$ and the true value $y_i$ over the $n$ training samples.
In traditional ML, such as the backpropagation neural network (BPNN) and convolutional neural networks (CNNs), the learning of these parameters is concentrated on one computer, and the commonly used methods are gradient descent and a series of improved algorithms. The core algorithm of FL is very similar to Stochastic Gradient Descent (SGD) [7]. In SGD, a sample is randomly selected from all samples to participate in the computation at each iteration.
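As a minimal illustration of the SGD update described above, the following sketch picks one random sample per iteration; the linear model, quadratic loss, and all names are our own illustrative assumptions, not taken from the cited papers:

```python
import numpy as np

def sgd(X, y, lr=0.01, iters=1000, seed=0):
    """Minimal SGD for a linear model y ~ X @ w: one random sample per update."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(iters):
        i = rng.integers(n)                   # pick one sample at random
        grad = (X[i] @ w - y[i]) * X[i]       # gradient of 0.5 * (x_i @ w - y_i)**2
        w -= lr * grad                        # descent step on that single sample
    return w
```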

2.2. Distributed Machine Learning

Distributed machine learning combines multiple computers for computing. Its core goal is to break computing tasks into multiple small tasks and perform the computation on multiple local processors. Its final training requires a central server to process the data uploaded by local clients; as a result, communication efficiency and privacy security are difficult to guarantee. Algorithm 1 shows the distributed machine learning algorithm.

Distributed Machine Learning
Server side
(1) Input: data samples X, Y; initial model parameters w_0; iterative step size η;
(2) Divide (X, Y) into subsets D_1, ..., D_m in units of records, where m indicates the number of clients;
(3) Send D_i to client i;
(4) For each iteration t: send the current parameters w_t to every client;
(5) Receive the gradient update g_i^t from each client i; execute w_{t+1} = w_t − η Σ_i g_i^t;
(6) Determine whether the termination condition is met: if so, terminate; otherwise, go back to step (4);
Client side
(1) Input: local data subset D_i and current parameters w_t;
(2) Randomly select a batch B from the local records as training data;
(3) Calculate the gradient g_i^t on B with respect to w_t;
(4) Send g_i^t to the server side.
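To make the server/client exchange of Algorithm 1 concrete, the following single-process sketch simulates it with a quadratic local loss; the model, loss, and names are illustrative assumptions, not the algorithm's reference implementation:

```python
import numpy as np

def client_gradient(w, X_local, y_local, batch_size=32, rng=None):
    """Client side: compute a mini-batch gradient of a squared loss on local data."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(X_local), size=min(batch_size, len(X_local)), replace=False)
    Xb, yb = X_local[idx], y_local[idx]
    return Xb.T @ (Xb @ w - yb) / len(idx)

def distributed_sgd(shards, dim, lr=0.1, rounds=50):
    """Server side: broadcast w, collect a gradient from every client, apply their average."""
    w = np.zeros(dim)
    rng = np.random.default_rng(0)
    for _ in range(rounds):
        grads = [client_gradient(w, Xi, yi, rng=rng) for Xi, yi in shards]
        w -= lr * np.mean(grads, axis=0)      # aggregate the clients' gradients and update
    return w
```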
2.3. Federated Learning

FL is different from distributed ML; in FL, the information uploaded by each participant to the server is no longer the original data but a trained submodel. At the same time, FL also allows asynchronous transmission [9], so the communication requirements can be appropriately reduced. On this basis, the objective of federated machine learning can be written as

$$\min_{w}\ F(w) = \sum_{k=1}^{K} p_k F_k(w),$$

where $K$ is the number of clients, $p_k$ is the weight value of the $k$-th client, and $F_k$ is the local objective of client $k$. The scenario for FL is a decentralized set of users $\{u_1, \ldots, u_K\}$; each client user $u_k$ has its own data set $D_k$. In deep learning, these data would normally be pooled into a single data set $D = D_1 \cup D_2 \cup \cdots \cup D_K$, but the practice of FL is no longer to simply aggregate them into a new data set to complete the next stage of training. Suppose the global model obtained after a federated modeling task is $M_{\text{FED}}$ and the model trained on the aggregated data is $M_{\text{SUM}}$. Generally speaking, the global model suffers from the parameter exchange and aggregation operations, so there is a loss of accuracy during training; that is, the performance of the global model $M_{\text{FED}}$ is not as good as the performance of the aggregate model $M_{\text{SUM}}$. To quantify this difference, we denote the performance of the global model on the test set by $V_{\text{FED}}$ and the performance of the aggregate model on the test set by $V_{\text{SUM}}$. The $\delta$-loss of accuracy [10] of the model is then defined by

$$\lvert V_{\text{FED}} - V_{\text{SUM}} \rvert < \delta,$$

where $\delta$ is a nonnegative number. However, in actual situations, the aggregate model cannot be obtained, because the basic requirement of FL is privacy protection. According to Professor Yang’s book “Federated Learning” [10], the federated averaging algorithm of FL can be expressed as in Algorithm 2.

Federated averaging algorithm.
(1) Execute in the coordinator:
(2) Initialize the model parameter w_0 and broadcast it to all participants;
(3) For each global model update round t = 1, 2, ..., do
(4) The coordinator determines C_t, that is, the set of randomly selected participants;
(5) For each participant k ∈ C_t, do in parallel
(6) Update the model parameters locally: participant k computes w_{t+1}^k (see line 13);
(7) Send the updated model parameter w_{t+1}^k to the coordinator;
(8) end for
(9) The coordinator aggregates the received model parameters using a weighted average: w_{t+1} = Σ_k (n_k / n) w_{t+1}^k;
(10) The coordinator checks whether the model parameters have converged; if so, the coordinator signals all participants to stop model training;
(11) The coordinator broadcasts the aggregated model parameter w_{t+1} to all participants;
(12) end for
(13) Update in participant k (participants execute in parallel):
(14) Get the latest model parameters from the coordinator, that is, set w = w_t;
(15) For each local iteration from 1 to the number of local epochs S, do
(16) Randomly divide the local data set into batches of size M;
(17) Obtain the local model parameters from the previous iteration, set w_k = w;
(18) For batch number b from 1 to the number of batches B, do
(19) Calculate the batch gradient g_b;
(20) Update the model parameters locally: w_k = w_k − η g_b;
(21) end for
(22) end for
(23) Get the local model parameter update w_{t+1}^k = w_k and send it to the coordinator (for participants k ∈ C_t).
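A compact single-process sketch of Algorithm 2 is given below; the quadratic local objective, the client-sampling fraction, and all names are illustrative assumptions rather than the book's reference implementation:

```python
import numpy as np

def local_update(w_global, X, y, lr=0.05, epochs=1, batch=16, rng=None):
    """Participant side: start from the global weights and run local mini-batch SGD."""
    rng = rng or np.random.default_rng()
    w = w_global.copy()
    for _ in range(epochs):
        order = rng.permutation(len(X))
        for i in range(0, len(X), batch):
            idx = order[i:i + batch]
            grad = X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
            w -= lr * grad
    return w

def fedavg(clients, dim, rounds=20, frac=0.5, rng=None):
    """Coordinator side: sample participants, collect local models, average by data size."""
    rng = rng or np.random.default_rng(0)
    w = np.zeros(dim)
    for _ in range(rounds):
        k = max(1, int(frac * len(clients)))
        chosen = rng.choice(len(clients), size=k, replace=False)
        local_models, sizes = [], []
        for c in chosen:
            X, y = clients[c]
            local_models.append(local_update(w, X, y, rng=rng))
            sizes.append(len(X))
        sizes = np.array(sizes, dtype=float)
        w = np.average(local_models, axis=0, weights=sizes / sizes.sum())  # weighted average
    return w
```

Here `clients` is a list of `(X, y)` pairs, one per participant, so the whole federation is simulated in a single process.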
2.4. Federated Learning Classification

In FL, data are distributed among the participants in the form of isolated islands, and each participant can use a matrix to represent its own data. At present, according to the distribution of data feature space and sample ID space, researchers divide FL into three categories: horizontal federated learning, vertical federated learning, and federated transfer learning [11, 12].

2.4.1. Horizontal Federated Learning

Horizontal federated learning is similar to traditional distributed ML. It applies when the data features of the clients overlap; that is, the participants have the same data features but different data samples. It is mainly used in Business-to-Business (B2B) scenarios [10].

2.4.2. Vertical Federated Learning

Vertical federated learning applies when the data samples of the clients overlap; that is, the data samples of the participants are the same, but the data features are different. It is mainly used in Business-to-Client (B2C) scenarios [13].

2.4.3. Federated Transfer Learning

Federated transfer learning is used when the overlap of both the data features and the data samples of the participants is relatively small. There are three types of federated transfer learning: instance-based, feature-based, and model-based. It is mainly used in retail e-commerce, financial investment, and medical research [10].

Figure 1 shows three categories of federated learning.

The main differences and problems of horizontal federated learning, vertical federated learning, and federated transfer learning are shown in Table 1.

3. Unsolved Problems

The definition and classification of FL have been described above. This section mainly discusses five of its unsolved problems.

3.1. The Problem of Nonindependent and Identically Distributed Data Samples

In distributed ML, local data samples are usually independently and identically distributed. Although FL is a kind of distributed ML, most of its data are nonindependent and identically distributed (non-IID). Moreover, unlike the batch training in traditional distributed ML, the training data obtained by FL differ somewhat in each round of training. Some scholars have tried local data sharing or model migration to address this, such as federated semisupervised learning and unsupervised learning, which we mention in Section 4.

3.2. The Problem That Different Participants Have Different Amounts of Data

The amount of data owned by different clients is different; it is determined by the participants themselves and cannot be controlled. There is a similar problem in Business-to-Business (B2B) settings: some large companies hold a large share of the data resources. How to get such large companies to participate in joint modeling is the first question. The key is to establish a reasonable incentive mechanism that shares the generated profits fairly and equitably with the participants. Federated blockchain technology can solve the problem of incentive mechanisms well. Paper [13] describes the combination of FL and blockchain technology and how to reward participants.

3.3. Robustness of Participants

In FL, many participants are mobile devices, and different participants have different network structures for data communication. When participating in joint modeling, some methods need to be adopted to ensure the robustness of the model. In addition, some fake participants may attack the global model established by FL from within; we call them fake local clients. Some scholars have proposed intrusion detection methods based on federated convolutional neural networks. In the deep learning part (Section 5), we specifically introduce how deep learning can be used to improve the robustness of the model.

3.4. Communication and Computing Problems

FL means that large-scale data are trained locally, and most of its real application cases rely on wireless communication, so the exchange process requires stable communication conditions. However, the task model and data distribution frequently change over time, and the structure of the federated network, target data characteristics, feature extractors, business labels, etc. also change, which leads to communication and computing problems. At present, a large number of papers in this field have been proposed, covering improving communication bandwidth, increasing transmission stability, and ensuring communication security. In Section 5, we introduce some deep learning algorithms that improve on this problem.

3.5. Privacy and Security of Federated Learning

Although the purpose of FL is to protect the privacy and security of users, in the process of participating in the joint training of FL, even though there is no need to obtain the information of local users, the privacy of users cannot be 100% guaranteed. When constructing a joint model, participating devices need to upload model parameters or gradient values; these parameters come from local models, and the partially trained local models contain information about all of the local data. There are many attack models, and part or all of the original data can be deduced from model parameters or gradients [14, 15], for example by attacks on local devices or by adversaries disguised as local model training participants. Therefore, many encryption methods have been proposed; the common ones are Secure Multiparty Computation (SMC) [16], Homomorphic Encryption (HE) [17], Data Disturbance (DD), Differential Privacy (DP) [18], and so on.
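As a toy illustration of one of the listed protections, the sketch below applies differential-privacy-style clipping and Gaussian noise to a gradient before it is uploaded; the clip bound, noise scale, and names are illustrative assumptions and are not calibrated to a formal privacy budget:

```python
import numpy as np

def dp_sanitize_gradient(grad, clip_norm=1.0, noise_std=0.1, rng=None):
    """Clip the gradient to a maximum L2 norm, then add Gaussian noise before upload."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(grad)
    if norm > clip_norm:
        grad = grad * (clip_norm / norm)   # bound each client's influence on the update
    return grad + rng.normal(0.0, noise_std, size=grad.shape)
```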

In view of the above problems, researchers have put forward various solutions. In this paper, the methods are divided into two categories: one based on ML and the other based on pan-deep learning.

4. Federated Learning Algorithm Based on Machine Learning

The most classical FL algorithm, Federated Averaging (FedAvg), was proposed by McMahan et al. [5]; it was shown that FedAvg can achieve the expected results when tested on benchmark image classification data sets (such as MNIST [19] and CIFAR-10 [20]). Since then, many FL algorithms have been proposed. Here are several common FL algorithms based on ML; they are classified into federated supervised learning, federated semisupervised learning, and federated unsupervised learning. Figure 2 shows the classification of federated ML [21].

4.1. Federated Supervised Learning

Supervised learning is a classic ML method that infers a function from labeled training data. The training data consist of a set of training examples. In supervised learning, each example consists of an input object and the desired output value.

4.1.1. Federated Linear Algorithm

Yang et al. [22] put forward a logistic regression method for center-based vertical FL, which realizes logistic regression in vertical learning; the objective function is

$$\min_{\theta}\ \frac{1}{n}\sum_{i=1}^{n} \ell\bigl(\theta; x_i, y_i\bigr),$$

where $\ell$ is the loss function, $\theta$ is the parameter of the model, $x_i$ is the feature, $y_i$ is the label, and $n$ is the amount of data. In the framework of this optimized federated algorithm, homomorphic encryption is added to encrypt the data and gradients of both sides. The whole training process can be described as follows. The unlabeled data holder A holds model parameters $\theta_A^t$ in round $t$, and $[\![\theta_A^t]\!]$ denotes the homomorphic encryption of $\theta_A^t$. The unlabeled data holder A first sends its encrypted intermediate results to the labeled data holder B, which calculates the gradient and loss and sends them back after homomorphic encryption. After receiving the encrypted gradients from A and B, the central server assists A and B in updating their models.
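A plaintext sketch of the vertically split logistic regression gradient exchange between an unlabeled party A and a labeled party B is shown below; homomorphic encryption is deliberately omitted for brevity, and the variable names are illustrative assumptions rather than the notation of [22]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def vertical_lr_round(Xa, wa, Xb, wb, y, lr=0.1):
    """One coordination round of vertically partitioned logistic regression.
    A and B exchange partial linear predictors, B (the labeled side) forms the
    residual, and both parties update their own parameter blocks locally.
    In the protocol of [22] these exchanged quantities would be homomorphically
    encrypted; here they are sent in the clear purely for illustration."""
    ua = Xa @ wa                                # party A's partial predictor (sent to B)
    ub = Xb @ wb                                # party B's partial predictor
    residual = sigmoid(ua + ub) - y             # computed on the labeled side B
    wa = wa - lr * Xa.T @ residual / len(y)     # gradient w.r.t. A's features
    wb = wb - lr * Xb.T @ residual / len(y)     # gradient w.r.t. B's features
    return wa, wb
```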

4.1.2. Federated Support Vector Machine

A federated support vector machine was proposed by Hartmann et al. [23] in 2019. The method optimizes and protects the parameters by updating blocks of local modules, hashing features, and other means. The objective function is

$$\min_{w}\ \frac{\lambda}{2}\lVert w\rVert^{2} + \frac{1}{N}\sum_{i=1}^{N}\max\bigl(0,\, 1 - y_i w^{\top}x_i\bigr),$$

where $N$ is the number of training samples, $w$ is the model parameter, $\max(0, 1 - y_i w^{\top}x_i)$ is the loss at the point $(x_i, y_i)$, $\lVert w\rVert^{2}$ is the regularization term of the loss function, and $\lambda$ is the hyperparameter that controls the penalty. This takes the same hinge-loss form as the objective function of the support vector machine in traditional ML.

The federated Support Vector Machine (SVM) performs dimensionality-reducing hashing on the feature values to hide the actual features. The federated support vector machine updates the model parameters through gradient updates at the central server, which better protects the privacy of the model parameters. In practical application cases, the federated support vector machine does not add extra computation, so its actual performance is even better.
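A toy sketch of the two local-side ingredients mentioned above, feature hashing and a regularized hinge-loss subgradient step, is given below; the hashing scheme and names are illustrative assumptions, not the exact construction of [23]:

```python
import numpy as np

def hash_features(X, out_dim=64, seed=0):
    """Feature hashing: bucket and sign the original columns to hide the raw feature values."""
    rng = np.random.default_rng(seed)
    cols = rng.integers(out_dim, size=X.shape[1])      # each input column maps to one bucket
    signs = rng.choice([-1.0, 1.0], size=X.shape[1])
    H = np.zeros((X.shape[0], out_dim))
    for j in range(X.shape[1]):
        H[:, cols[j]] += signs[j] * X[:, j]
    return H

def svm_subgradient_step(w, X, y, lam=0.01, lr=0.1):
    """One subgradient step on (lam/2)*||w||^2 + mean hinge loss, labels in {-1, +1}."""
    margins = y * (X @ w)
    mask = margins < 1                                 # samples that violate the margin
    grad = lam * w - (X[mask].T @ y[mask]) / len(y)
    return w - lr * grad
```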

4.1.3. Federated Decision Tree Algorithm

Liu et al. [24] proposed a decision-tree-oriented vertical federated learning method, a random forest implementation based on a centralized FL framework, named the Federated Decision Tree (FDT). Its local participants upload the performance ranking of their model parameters rather than the model parameters themselves, which the original FL constantly uploads. Thus, it can greatly reduce the communication frequency and the large amount of storage and computing resources consumed by encryption. In the joint modeling, the model structure of the whole random forest is stored in a scattered way: the central server holds the original complete structural information, and each participating node holds only its own information [25]. When the federated decision tree model is used, the node information of the local tree is first obtained, and then the other local node information of the tree model is called jointly through the central server. Among federated decision tree models, the SecureBoost model [26] is a decentralized vertical FL framework based on gradient boosting decision trees. According to the common gradient boosting decision tree algorithm, the objective function (in its second-order approximation) is

$$\mathcal{L}^{(t)} \approx \sum_{i=1}^{n}\Bigl[g_i f_t(x_i) + \tfrac{1}{2}\, h_i f_t^{2}(x_i)\Bigr] + \Omega(f_t),$$

where $\mathcal{L}^{(t)}$ is the loss value of the objective function to be minimized, $f_t$ is the regression tree of the $t$-th iteration, $f_t(x_i)$ is the contribution of the tree on each leaf node, and $g_i$ and $h_i$ are the first and second derivatives of the loss with respect to the prediction residual. In order to prevent overfitting, a regularization term is usually added to the loss function:

$$\Omega(f) = \gamma T + \tfrac{1}{2}\lambda \lVert w \rVert^{2},$$

where $\gamma$ and $\lambda$ are hyperparameters that adjust the tree complexity and the number of leaves, $w$ is the vector of leaf weights, and $T$ is the number of leaves. In the original distributed ML, joint modeling is realized by sending $g_i$ and $h_i$ to the participants, but a participant in distributed ML can use $g_i$ and $h_i$ to infer the data labels backwards, resulting in data leakage, which violates the basic requirement of FL in principle [27]. The federated tree model is based on the SecureBoost [26] encryption algorithm: the samples that need joint training are trained with the encrypted first- and second-order gradient statistics to obtain the prediction model of the decision tree. Because the decision tree is built on these encrypted statistics rather than on the raw sample labels, it can be ensured that the data cannot be deduced or calculated in reverse.
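The following small sketch shows how the first- and second-order statistics $g_i$, $h_i$ and the closed-form leaf weight behind the boosting objective above are computed; a logistic loss and all names are illustrative assumptions, and the encryption of these statistics used by SecureBoost is omitted:

```python
import numpy as np

def logistic_grad_hess(y_true, y_pred_raw):
    """First and second derivatives of the logistic loss w.r.t. the raw prediction."""
    p = 1.0 / (1.0 + np.exp(-y_pred_raw))
    g = p - y_true          # first derivative (labels in {0, 1})
    h = p * (1.0 - p)       # second derivative
    return g, h

def optimal_leaf_weight(g_leaf, h_leaf, lam=1.0):
    """Closed-form leaf weight minimizing the second-order objective: -G / (H + lambda)."""
    return -np.sum(g_leaf) / (np.sum(h_leaf) + lam)
```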

Li et al. [28] proposed a decentralized horizontal FL framework for multiple parties, named Gradient Boosting Decision Tree (GBDT) modeling—a learning model based on the degree of similarity between data. The encryption strength of hash-table encryption is not high, not as good as that of differential privacy or a federated blockchain [29], but it compensates with better communication efficiency during the upload and download transmission of modeling. This is a new research direction for algorithms under the federated tree model. If data disturbance is added, its confidentiality can be comparable to differential privacy protection and federated blockchain technology.

4.2. Federated Semisupervised Learning

Semisupervised learning is a key issue in the field of ML. It can use as much unlabeled data as possible to complete the task [30]. After FL is combined with semisupervised learning, on the one hand, FL can be used to ensure that sufficient training data are available, and, on the other hand, semisupervised learning can alleviate the high cost of labeling the scattered data on the client side.

Jeong et al. [31] proposed a federated semisupervised learning framework according to the number of data labels. Its generative model mainly obtains the data reconstruction from the perspective of probability, so it can be estimated by a mixture model. Recently, VAEs [32] and GANs [33] have produced more complex generative models for semisupervised learning, which further improve its efficiency.

According to how the sample ID space and the feature space are split, federated semisupervised learning can be divided into two categories: horizontal federated semisupervised learning and vertical federated semisupervised learning [31].

In horizontal federated semisupervised learning, the participating parties have the same feature space but different sample ID spaces; that is, for any two participants $i$ and $j$, $X_i = X_j$, $Y_i = Y_j$, and $I_i \neq I_j$. Each participant $k$ holds its own data set $D_k$, which contains both labeled and unlabeled samples.

Vertical federated semisupervised learning has the same sample ID space for all parties involved, but each party holds a different feature space; that is, for any two participants $i$ and $j$, $I_i = I_j$ and $X_i \neq X_j$. Each participant $k$ holds its own data set $D_k$.

Yang et al. [34] proposed a logistic regression method for decentralized vertical FL; in effect, the labeled data holder replaces the central server. In decentralized vertical FL, data are divided into labeled data and unlabeled data, in which labeled data are dominant. Assuming that there is an agreement between the unlabeled data holder A and the labeled data holder B to cooperate in modeling, B first sends the modeling key to A, both parties initialize their parameters respectively, and each calculates its partial result on its own features. After A finishes its calculation, the results are sent to B. B combines both calculation results and then uses the logistic regression equation to obtain the final prediction. Finally, both the labeled and unlabeled parties are updated by gradient descent. Table 2 shows the articles on the three types of federated machine learning algorithms.

4.3. Federated Unsupervised Learning

Unsupervised learning is an ML method mainly used to discover potential patterns in data. Its input data have no labels: only the input variable (X) is provided, with no corresponding output variable (Y).

In unsupervised learning, the algorithm needs to find the pattern structure in the data by itself [35]. The data on each participating client of FL are basically collected in a non-IID way, so there is a problem of domain shift between clients. This domain shift makes it difficult to extend the model and its training to new devices. Based on the FL framework and without user supervision, knowledge is transferred from decentralized nodes to new nodes with different data domains. Peng et al. [36] defined an Unsupervised Federated Domain Adaptation (UFDA) method; it can align the representations learned among the different nodes with the data distribution of the target node. In the federated domain adaptation setting, models on different nodes have different convergence rates. In addition, the domain shift between each source domain and the target domain is different; as a result, some nodes may not contribute to the target domain or may even contribute negatively [36].

5. Federated Learning Algorithm Based on Pan-Deep Learning

Federated learning combined with deep learning is one of the mainstream directions of FL. This section focuses on this area; Figure 3 shows a classification of federated pan-deep learning.

5.1. Federated Neural Network

McMahan et al. [37] proposed a federated neural network model and carried out tests of neural networks on the MNIST data set. In the paper [37], five groups of experiments are introduced; this section only introduces the neural network (NN) part. The model has a four-layer network structure, including one input layer, two hidden layers, and an output layer; each hidden layer has 200 neurons. The MNIST data set is assigned to the clients, and these client data sets do not intersect. Federated training was then carried out, and the experiment was divided into two groups: Experiment 1 uses the same random seed to initialize the local model parameters allocated to the two clients, and Experiment 2 uses different random seeds to initialize the local model parameters assigned to the two clients. The different local model parameters of the two groups of experiments are weighted and combined proportionally to obtain the final federated neural network model, namely,

$$w_{\text{FED}} = \theta\, w_{1} + (1-\theta)\, w_{2},$$

where $w_{\text{FED}}$ is the federated model parameter, $w_1$ and $w_2$ are the model parameters at the different nodes, and $\theta$ is the weight, which varies between 0 and 1. The experiments in the paper show that, when using FL, the federated model with the same random initialization seed has the best effect, and the optimal loss is achieved when the ratio of the model parameters is 1 : 1.
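A small sketch of the proportional weighting described above, blending two clients' parameter lists, could look as follows (all names are illustrative assumptions):

```python
import numpy as np

def interpolate_models(params_1, params_2, theta=0.5):
    """Blend two clients' parameters: w = theta * w1 + (1 - theta) * w2.
    theta = 0.5 corresponds to the 1:1 ratio reported as optimal in [37]."""
    return [theta * w1 + (1.0 - theta) * w2 for w1, w2 in zip(params_1, params_2)]

# Example: two layers of parameters from two clients
w_client1 = [np.ones((4, 3)), np.zeros(3)]
w_client2 = [np.zeros((4, 3)), np.ones(3)]
w_fed = interpolate_models(w_client1, w_client2, theta=0.5)
```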

5.2. Federated Convolutional Neural Network

Zhu et al. [38] proposed a federated CNN; it uses a simple CNN to do text recognition work in unclassified scenarios, and the whole model is built on TensorFlow and PySyft to test the impact of the FL infrastructure and local clients [39]. The network built in reference [38] is a simple CNN with four convolutional layers, two fully connected layers, a ReLU activation function, and output layers defined by the authors. The structure of the convolutional neural network is shown in Figure 4. The CNN classifier is used for dictionary-free text recognition on a Chinese character corpus, and the parameters of the CNN are optimized to minimize the aggregate negative log-likelihood of the character sequences:

$$\mathcal{L}(\theta) = -\sum_{i=1}^{N}\sum_{k=1}^{M} y_{i,k}\log p_{i,k},$$

where $N$ is the size of the training data set, $M$ is the total number of classes, and $p_{i,k}$ and $y_{i,k}$ are, respectively, the predicted probability and the indicator that the $k$-th character of sample $i$ carries the corresponding label. In their experiments, they compared two prevalent federated learning frameworks, namely, TensorFlow Federated and PySyft. The results show that federated text recognition models can achieve accuracy similar to or even higher than that of models trained with a conventional deep learning framework. Figure 4 shows the convolutional neural network diagram from [38].

Rong et al. [40] proposed an intrusion detection method based on a federated CNN. The paper uses the data of multiple participants to jointly train the model and thereby expands the amount of data available to each local participant. Based on the original FL framework, an intrusion detection model based on deep learning is designed. First of all, the data dimension is reconstructed by data padding to form a two-dimensional array. Then, Diffusion-Convolutional Neural Networks (DCNN) are used to extract and learn the feature parameters under the FL mechanism. Finally, the model is combined with a Softmax classifier for detection. This method greatly reduces the training time and maintains a high detection rate. In addition, compared with the general intrusion detection model, the improved model also ensures data security and privacy [40]. Federated convolutional neural networks are generally implemented with a simple CNN model. References [38–40] use a CNN model with four convolutional layers and two fully connected layers. This model is suitable for horizontal FL. The sample ID is used as the basis, and the data set is randomly assigned to different clients to form different subsets that simulate distributed data. During training, each client first carries out gradient calculation and parameter updates on its local data set. At the end of each training iteration, the accumulated parameter updates of each client are aggregated to update the final federated model.

Three groups of experiments are carried out in paper [40]. Experiment 1 verifies the effectiveness of transforming one-dimensional data into two-dimensional data for the intrusion detection network; this method not only improves the accuracy of the model but also reduces its operational cost. Experiment 2 determines the depth of the DCNN model. The experimental results show that the different models show little change in training and testing time, but in terms of accuracy, the model with two hidden layers improves accuracy by 1% on average; when it is increased to three hidden layers, the performance does not improve significantly, so simply increasing the number of hidden layers has little effect on performance. In experiment 3, the intrusion detection model is constructed with the federated CNN (FC) algorithm, and a multiclassification experiment is carried out on the NSL-KDD standard data set. The accuracy on the test set shows no obvious change, and the recall rate and false alarm rate are improved, but the most obvious improvement is in training time. Because the FC model only needs to transmit a small number of parameters during training, it has certain advantages over other centralized training models in terms of data security. Generally speaking, the federated CNN can not only improve security in deep learning but also improve the computing power of the model by using GPUs.

For the model parameter transfer between clients and the server, the CNN is generally compressed in order to reduce bandwidth consumption. Sattler et al. [41] proposed a new framework, Sparse Ternary Compression (STC), which is specially designed for the FL setting. The training process of FL includes downloading the model, training the model locally, and uploading the trained model to the server for aggregation. The number of bits transmitted is

$$b \in \mathcal{O}\bigl(N_{\text{iter}} \cdot f \cdot \lvert \mathcal{W} \rvert \cdot (H + \eta)\bigr),$$

where $N_{\text{iter}}$ is the total number of training iterations performed by each client, $f$ is the communication frequency, $\lvert \mathcal{W} \rvert$ is the size of the model, $H$ is the entropy of the weight updates exchanged during upload and download, and $\eta$ is the inefficiency of the encoding, that is, the difference between the real update size and the minimal update size (given by the entropy). STC extends the existing top-$k$ gradient sparsification technique with a new mechanism to achieve downstream compression, ternarization, and optimal Golomb encoding of the weight updates. Existing compression algorithms assume that the local data are independently and identically distributed, whereas most of the training data in FL are non-IID. With IID data, the local gradient is considered an unbiased estimate of the global gradient; that is,

$$\mathbb{E}_{x\sim p_i}\bigl[\nabla_{w}\, \ell(x, w)\bigr] = \nabla_{w} R(w),$$

where $p_i$ is the data distribution of client $i$ and $R(w)$ is the empirical risk over the whole data. However, this IID assumption is difficult to satisfy in FL, and we can only expect the mean over the clients to be unbiased; that is,

$$\mathbb{E}_{i}\Bigl[\mathbb{E}_{x\sim p_i}\bigl[\nabla_{w}\, \ell(x, w)\bigr]\Bigr] = \nabla_{w} R(w).$$

The gradient of a single client will be biased towards its local data set:

$$\mathbb{E}_{x\sim p_i}\bigl[\nabla_{w}\, \ell(x, w)\bigr] = \nabla_{w} R_i(w) \neq \nabla_{w} R(w).$$

Experiments show that if each edge device sees a unique data distribution, the quality of model training declines. For neural networks trained with highly skewed non-IID data, the accuracy of FL is reduced significantly, by up to about 55%. It is further shown that the accuracy reduction can be explained by weight divergence and can be quantified by the earth mover's distance (EMD) between the distribution of each class on each device and the overall distribution. The paper proposes a strategy: the training on non-IID data is improved by creating a small subset of data that is globally shared among all edge devices. Experiments show that, with only 5% globally shared data, the accuracy on the CIFAR-10 data set can be improved by 30%.
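Returning to the compression scheme of [41], the sketch below captures the top-$k$ sparsification plus ternarization idea in a drastically simplified form: only the largest-magnitude entries of a weight update are kept, and they are replaced by a signed mean magnitude. The Golomb coding stage is omitted, and the details are our simplification, not the authors' exact algorithm:

```python
import numpy as np

def sparse_ternary_compress(delta_w, sparsity=0.01):
    """Keep the top-k largest-magnitude entries of a weight update and
    ternarize them to {-mu, 0, +mu}, where mu is their mean magnitude."""
    flat = delta_w.ravel().astype(float)
    k = max(1, int(sparsity * flat.size))
    top_idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest entries
    mu = np.mean(np.abs(flat[top_idx]))
    compressed = np.zeros_like(flat)
    compressed[top_idx] = mu * np.sign(flat[top_idx])
    return compressed.reshape(delta_w.shape)
```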

5.3. Federated Bayesian Network

Yurochkin et al. [42] proposed applying Bayesian methods to FL. Under the assumption that both local data and local models are available, a probabilistic FL framework is developed and studied, with special emphasis on training and aggregating neural network models. When only pretrained local models are available, the estimated local model parameters (in the case of a neural network, a set of weight vectors) are matched across data sources to build a global network [43, 44]. When the data are available, the method trains the local model for each data source in parallel and then matches the estimated local model parameters across data sources to build the global network. Parameter matching is governed by the posterior of a Beta-Bernoulli Process (BBP), a Bayesian Nonparametric (BNP) model that allows local parameters either to match existing global parameters or, if no existing global parameter matches, to create new global parameters [42].

The federated Bayesian structure provides several advantages over existing methods [40]. First of all, it decouples the learning of the local models from their fusion into a global federated model. This decoupling allows the local learning algorithms to remain unspecified and to be adjusted as needed; each data source may even use a different learning algorithm. Secondly, given only pretrained models, the BBP matching process can combine them into a joint global model without requiring additional data or knowledge of the learning algorithms used to generate the pretrained models. Last but not least, the federated Bayesian approach can effectively learn a compressed federated network from the pretrained local networks and, under a moderate communication budget, can outperform state-of-the-art FL algorithms that use neural networks. In order to apply the probabilistic neural matching method to FL, the feature extractors (neurons) of the Multilayer Perceptrons (MLPs) must be grouped and combined in the process of constructing the global feature extractors. The goal of the Bayesian nonparametric mechanism is to identify the subsets of neurons in each of the $J$ local models that match neurons in the other local models; the matched neurons are then combined to form the global model. Suppose we train $J$ Multilayer Perceptrons (MLPs), each with a single hidden layer. Let $W_j^{(0)}$ and $b_j^{(0)}$ denote the weights and biases of the hidden layer, and let $W_j^{(1)}$ and $b_j^{(1)}$ denote the weights and biases of the softmax layer; $D$ is the data dimension, $L_j$ is the number of neurons in the hidden layer of MLP $j$, and $K$ is the number of classes. We consider a simple architecture:

$$f_j(x) = \operatorname{softmax}\bigl(\sigma(x W_j^{(0)} + b_j^{(0)})\, W_j^{(1)} + b_j^{(1)}\bigr),$$

where $\sigma$ is a nonlinear activation function. The collections of local weights and biases $\{W_j^{(0)}, b_j^{(0)}, W_j^{(1)}, b_j^{(1)}\}$ are used to learn a global neural network with weights and biases $\{W^{(0)}, b^{(0)}, W^{(1)}, b^{(1)}\}$. Figure 5 shows the Bayesian network diagram of the single-hidden-layer probabilistic federated neural matching algorithm; the nodes in the figure represent neurons, and neurons of the same color are matched. The paper converts the neurons in each of the $J$ batches into weight vectors referenced to the output layer by means of the corresponding output-layer neurons.

5.4. Federated LSTM

LSTM [45] was proposed in 1997. Owing to its unique design, LSTM is suitable for processing and predicting important events with long intervals and delays in time series. Some researchers have applied LSTM to centralized FL models to predict characters [46, 47]. LSTM is specially designed to avoid the long-term dependency problem; memorizing long-term information is the default behavior of LSTM in practice [48]. The LSTM is used in the local model training: its input gate determines the next input parameters, the forget gate discards some parameters, and the output gate outputs the required parameters, which improves the iterative effect. In the LSTM unit, the leftmost $\sigma$ is the activation function of the forget gate; the middle $\sigma$ and $\tanh$ are the activation functions of the input gate; the rightmost $\sigma$ and the following $\tanh$ are the activation functions of the output gate; $x_t$ is the input, $h_t$ is the output, $h_{t-1}$ is the output at the previous time step, $c_{t-1}$ is the cell state at the previous time step, and $c_t$ is the cell state at the current time step. Figure 6 shows the internal structure of the LSTM network unit.

The study in [45] proposed segmenting the data sets of multiple participating clients. When LSTM is placed in the FL framework, the data are non-IID, and appropriate hyperparameters are selected so that the model trained on non-IID data reaches the accuracy of the conventional setting [46, 47]. Li et al. [49] trained LSTM classifiers on federated data sets and proposed an FL framework with a federated proximal term (FedProx) to address statistical heterogeneity in sentiment analysis and character prediction. Compared with traditional FedAvg, FedProx converges faster. Under system heterogeneity, local clients based on the FedAvg framework cannot adapt their amount of work to local conditions. The FedProx framework proposed in reference [49] introduces a regularization term to improve the stability of the whole framework. The essence of this modified term is to limit the difference between the parameters of the local model and the parameters of the global model, providing a theoretical basis for handling the heterogeneity between global and local information. The traditional FedAvg objective function is

$$\min_{w}\ f(w) = \sum_{k=1}^{N} p_k F_k(w),$$

where $p_k \geq 0$ reflects that there are $n_k$ samples on the $k$-th device; generally, it is set to $p_k = n_k/n$, where $n$ is the sum of all $n_k$, and the local objective functions $F_k$ are minimized locally. The number of local epochs $E$ in FedAvg plays an important role in the convergence of the global objective function. The higher $E$ is, the more local computation is performed and the less communication occurs between devices, which can effectively improve the overall convergence speed of the global objective. On the other hand, for heterogeneous local objectives $F_k$, a value of $E$ that is too large may cause each device to strive to optimize its local objective rather than the global one, which affects the convergence of the global objective and may even lead to divergence. The FedProx framework proposed in the paper [49] is similar to FedAvg in that it selects a subset of devices to participate in the update in each round, performs local updates, and then averages these updates to form the global update. However, FedProx makes some simple but critical modifications to guarantee convergence. The improved local objective of FedProx is

$$\min_{w}\ h_k\bigl(w; w^{t}\bigr) = F_k(w) + \frac{\mu}{2}\bigl\lVert w - w^{t}\bigr\rVert^{2},$$

where $w^{t}$ is the global model of round $t$ and $\mu$ controls the strength of the proximal term.

A two-layer LSTM classifier with 100 hidden units and an 8-dimensional embedding layer is used in FedProx.

Its task is to predict the next character, over a total of 80 character classes. The model takes a sequence of 80 characters as input, embeds each character into an 8-dimensional space, and outputs one character for each training sample after two LSTM layers and a densely connected layer [46]. The experimental results show that FedProx converges faster than FedAvg. In particular, in highly heterogeneous environments, FedProx shows more stable and more accurate convergence behavior than FedAvg, improving the absolute test accuracy by 22% on average.
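A minimal sketch of the FedProx local update, adding the gradient of the proximal term $(\mu/2)\lVert w - w^t\rVert^2$ to an ordinary local gradient step, is given below (the quadratic local loss and all names are illustrative assumptions):

```python
import numpy as np

def fedprox_local_update(w_global, X, y, mu=0.01, lr=0.05, epochs=5):
    """Local FedProx update: local-loss gradient plus the proximal pull toward w_global."""
    w = w_global.copy()
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / len(y)     # gradient of the local loss F_k(w)
        grad += mu * (w - w_global)           # gradient of (mu/2) * ||w - w_global||^2
        w -= lr * grad
    return w
```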

5.5. Federated Reinforcement Learning

Nadiger et al. [50] first proposed an overall framework for Federated Reinforcement Learning (FRL), which includes a grouping policy, a learning policy, and a federation policy. Reinforcement Learning (RL) and other artificial-intelligence-based techniques have recently been used to achieve personalization, but RL faces the challenge of long personalization times. In the paper [50], the authors propose a federated reinforcement technique whose main goal is to reduce the personalization time. FL applied to reinforcement learning is an example of hierarchical learning, which enables agents at lower levels to communicate their findings; local clients with similar environments can be federated more efficiently [51]. The article proposes using the Deep Q-Network reinforcement learning algorithm in a federated environment to achieve faster personalization. The client models and the shared model are regarded as one large Q-network and optimized by the Bellman equation. However, in the current work, there is a separate Q-learning process on each client, and a federation policy determines the shared model parameters. The personalization measure in the article is computed from game statistics, where PM refers to the set of games with a higher personalization measure, defined from the set of rallies with a length greater than or equal to 4 rounds relative to the total number of rallies of various lengths in a game. The server sends the global model to all clients, which provides a "warm start" for each client; the global model is built offline. Then, each client updates the weights of its Nonplayer Character (NPC) model according to the local RL algorithm. The server waits until the NPC models are received from all client groups. The global model is a weighted combination of the client models, in which $w_G$ denotes the global model, $w_k$ the model of client $k$, $\alpha$ the global model regularization factor, $p_k$ the percentage of rallies with a length greater than or equal to 4 on client $k$, and $K$ the number of clients. The experimental results show that the proposed method speeds up the personalization of agents by using federated reinforcement learning. The paper also puts forward the grouping policy, learning policy, and federation policy that make up the whole FRL architecture. The effectiveness of this method is shown by testing with 3, 4, and 5 human players, in which the personalization time is reduced by about 17%.

Anwar et al. [52] analyzed multitask federated reinforcement learning from an adversarial perspective, analyzed the attack performance of many common attack methods, and proposed an adaptive attack method. General countermeasures are not enough to attack the mobile terminals effectively, so a model poisoning attack method based on minimizing the information gain of training is proposed. In FL, there are multiple local clients; in addition to preventing data poisoning and policy poisoning, we must also consider that the model itself may be attacked. Because there is more than one local client, a complete local client can play the role of an attacker.

Attackers can inject false data and deliberately corrupt the federated model. In an attack on the federated model, the attacker tries to directly modify the learned model parameters by providing erroneous information that intentionally damages the global model [53, 54]. Because classical FL uses an averaging algorithm to merge the local model parameters learned by individual clients, such an attack seriously affects the performance of the global model. In Multitask Federated Reinforcement Learning (MT-FedRL), each client runs in its own environment, which can be characterized by a different Markov Decision Process (MDP). Each agent acts and observes only in its own environment. The goal of MT-FedRL is to learn a unified policy that is jointly optimal across all $n$ environments. Each agent shares its information with a centralized server. The state and action spaces do not need to be the same in each of the $n$ environments; if the state spaces do not intersect across the environments, the joint problem decouples into a set of $n$ independent problems. The goal of the MT-FedRL problem is to find a unified policy that maximizes the sum of the long-term discounted returns over all environments, namely,

$$\max_{\pi}\ \sum_{i=1}^{n} \mathbb{E}_{s\sim\beta_i}\bigl[V_i^{\pi}(s)\bigr].$$

Solving the above equation produces a unified policy $\pi^{*}$, thus achieving balanced performance in all environments, where $V_i^{\pi}(s)$ is the value function of the policy $\pi$ in the states of the $i$-th environment and $\beta_i$ represents the initial state distribution of the $i$-th environment. In the article [55], it is proved that multitask federated reinforcement learning can converge to a unified policy that achieves good performance in every environment. If the clients' goals are positively correlated, this jointly optimal policy works best when evaluated in each environment; if the clients' goals are not positively correlated, a unified policy may not produce a near-optimal policy for an individual environment. The article discusses three common attack models in detail: the random policy attack model, the reversed-target policy attack model, and the adversarial attack model with minimum information gain. Finally, a modification of the general federated reinforcement learning algorithm is proposed to address the adversarial attack problem, which is equally effective with and without attacks. The federated reinforcement learning process and algorithm are given in reference [52], in which several cooperative models try to maximize the sum of discounted returns in the presence of hostile models in different environments. Figure 7 shows the flow chart of federated reinforcement learning.
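As a toy illustration of the federated reinforcement learning loop, the sketch below lets each client run tabular Q-learning on its own logged transitions while the server averages the Q-tables; this is a drastic simplification of the DQN-based schemes above, and the environment format, the plain averaging, and all names are our assumptions:

```python
import numpy as np

def local_q_learning(Q, transitions, alpha=0.1, gamma=0.9):
    """One local pass of tabular Q-learning over logged (s, a, r, s_next) tuples."""
    Q = Q.copy()
    for s, a, r, s_next in transitions:
        td_target = r + gamma * np.max(Q[s_next])
        Q[s, a] += alpha * (td_target - Q[s, a])
    return Q

def federated_rl_round(Q_global, client_transitions):
    """Server side: broadcast the global Q-table, collect local updates, average them."""
    local_tables = [local_q_learning(Q_global, t) for t in client_transitions]
    return np.mean(local_tables, axis=0)
```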

5.6. Federated Meta-Learning

Chen et al. [56] proposed a Federated Meta-learning (FedMeta) framework, which shares a parameterized algorithm (meta-learner) rather than a global model as in previous work. The article evaluates the LEAF data sets and a real-world data set and shows that, compared with FedAvg, the communication cost required by FedMeta is reduced by 2.82–4.33 times, convergence is faster, and accuracy is increased by 3.23%–14.84%. In the field of FL, the local model is trained with SGD to achieve high accuracy while balancing the computation and communication costs; in the field of meta-learning, the MAML algorithm is used to converge quickly on new tasks and shows good generalization; on this basis, a federated meta-learning framework is built. The FedMeta framework integrates the MAML and Meta-SGD algorithms into FL, which improves the accuracy of the jointly trained model and reduces the communication overhead. The meta-learning algorithm $A_{\varphi}$, parameterized by $\varphi$, is updated over a set of tasks in the meta-training process; each task in meta-training consists of a support set $D_S^T$ and a query set $D_Q^T$, both containing labeled data points [57]. Algorithm $A_{\varphi}$ trains a model $f_{\theta}$ on the support set and outputs the adapted parameters $\theta_T$, which is called the inner update; it then evaluates the model on the query set $D_Q^T$ and calculates the test loss $\mathcal{L}_{D_Q^T}(\theta_T)$ to reflect the training ability of $A_{\varphi}$ [58]. Finally, $\varphi$ is updated to minimize the test loss, which is called the outer update. In each episode, the meta-learning algorithm samples a batch of tasks from the meta-training set, so the optimization goal of meta-training can be expressed as

$$\min_{\varphi}\ \mathbb{E}_{T\sim p(T)}\Bigl[\mathcal{L}_{D_Q^{T}}\bigl(A_{\varphi}(D_S^{T})\bigr)\Bigr].$$

For each task $T$, the algorithm sets $\theta = \varphi$, so that the parameters of the model equal those of the algorithm. Then the parameters of the model $f$ are trained on the support set and updated according to the loss function:

$$\theta_T = \theta - \alpha\, \nabla_{\theta}\, \mathcal{L}_{D_S^{T}}(\theta).$$

Finally, the adapted model parameters are tested on the query set, and the test loss is calculated and used to update $\varphi$:

$$\varphi \leftarrow \varphi - \beta\, \nabla_{\varphi} \sum_{T}\mathcal{L}_{D_Q^{T}}(\theta_T).$$

The experiments are carried out on the LEAF data sets and show that, compared with traditional FL, convergence is faster and accuracy is greatly improved, while the communication cost is also reduced. The goal of meta-learning is to train an algorithm. Federated meta-learning means that many devices jointly train the same meta-learner: each device has its own meta-learner, the parameters are aggregated on the server, and the global meta-learner is then trained. The global model trained by FL is the same on every device; because of the strong data heterogeneity of each device, meta-learning is needed to personalize the model. Meta-learning generates a metamodel locally, and the metamodel then generates a personalized model locally that is suitable for the local heterogeneous data. Figure 8 shows the federated meta-learning framework.
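A minimal first-order sketch of the inner/outer updates used in this meta-learning setup is shown below; the quadratic toy loss, the first-order approximation of MAML, and all names are illustrative assumptions:

```python
import numpy as np

def maml_meta_step(theta, tasks, inner_lr=0.05, outer_lr=0.01):
    """One meta-update: adapt on each task's support set, evaluate on its query set,
    and move the meta-parameters against the averaged query-set gradient
    (first-order approximation: the inner update is not differentiated through)."""
    outer_grad = np.zeros_like(theta)
    for (Xs, ys), (Xq, yq) in tasks:                           # (support, query) pairs
        inner_grad = Xs.T @ (Xs @ theta - ys) / len(ys)        # gradient on the support set
        theta_task = theta - inner_lr * inner_grad             # inner (task-specific) update
        outer_grad += Xq.T @ (Xq @ theta_task - yq) / len(yq)  # query-set gradient
    return theta - outer_lr * outer_grad / len(tasks)          # outer update of meta-parameters
```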

5.7. Federated Residual Network

Huang et al. [59] proposed a new compression strategy, the Residual Pooling Network (RPN) [60], in order to improve the communication efficiency of FL. Compared with traditional FL, RPN alleviates the communication and computation overhead by selecting appropriate parameters and can maintain the original performance while reducing data transmission. RPN is an end-to-end process, and it can also be applied to CNN-based model training scenarios to improve the communication efficiency of federated models. The total number of bits that must be transmitted during model training depends on $T$, the total number of iterations; $M$, the number of clients that the server chooses to update in round $t$; the global model after $t$ aggregations; the selected parameter bits downloaded to each client; and, similarly, the selected parameter bits each client uploads to the server. The article improves communication efficiency from four aspects: iteration frequency, pruning, importance-based updates, and quantization. The quantity uploaded by each client is defined as a residual network, that is, the parameter difference between the locally updated model and the global model it started from. The experiments in the article include classification, object detection, and semantic segmentation. They show that RPN not only effectively reduces data transmission but also achieves almost the same performance as traditional FL. Most importantly, RPN is an end-to-end process, which makes it easy to deploy in real-world applications without human intervention. The federated residual network workflow includes (1) selecting clients for local model updates, (2) restoring local models, (3) training the local model on the local data set, (4) calculating the residual networks, (5) spatial aggregation, (6) sending the RPN to the server for aggregation, and (7) sending the RPN back to the selected clients and repeating the cycle. Figure 9 is a schematic diagram of the federated residual network.
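Under the assumption that the quantity each client uploads is the difference between its locally trained parameters and the global parameters it started from (our reading of the RPN workflow, not the authors' exact operator), a residual exchange can be sketched as follows:

```python
import numpy as np

def compute_residual(local_params, global_params):
    """Client side: the residual to upload is the element-wise difference of the weights."""
    return [wl - wg for wl, wg in zip(local_params, global_params)]

def apply_residuals(global_params, residuals_list):
    """Server side: average the clients' residuals and add them back to the global model."""
    avg = [np.mean(stack, axis=0) for stack in zip(*residuals_list)]
    return [wg + r for wg, r in zip(global_params, avg)]
```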

Table 3 shows the current federated learning methods based on deep learning.

6. Privacy and Security Issues of Federated Learning

Although FL ensures that the data are trained locally on the client, it still has privacy and security issues in the event of malicious attacks, which are mainly reflected in the following three aspects. Firstly, the data collector may collect user data privately without permission, leading to direct data leakage during data collection; secondly, indirect privacy leakage may occur because of the insufficient generalization ability of the model; finally, the model may be polluted because of a lack of safety precautions [61]. This section discusses both the defense and attack aspects of FL.

6.1. Byzantine Prevention of Federated Learning

In recent years, security issues in FL have attracted widespread attention; especially in some decentralized environments, some unstable clients may behave abnormally and even exhibit Byzantine failures—arbitrary and potentially hostile behaviors [62]. Byzantine-robust FL aims to learn an accurate global model on the server side when a limited number of clients are malicious. The key idea of existing Byzantine-robust FL is that the service provider performs statistical analysis on the clients' local model updates and removes suspicious ones before aggregating them to update the global model [63]. At present, the main vulnerability of FL concerns SGD: how to ensure the robustness of distributed SGD against hostile Byzantine clients that send poisoned updates in the training phase is a hot research topic [64]. With hostile Byzantine clients, the learned model may be biased owing to data corruption, communication failures, or incorrect information maliciously sent to the server [65]. For defense against the Byzantine problem, Blanchard [53] proposed Krum, the first provably Byzantine-tolerant algorithm for distributed SGD, which satisfies a resilience property of the aggregation rule. Facing potentially abnormal clients, Yin et al. [62] proposed two robust distributed gradient descent algorithms based on median and trimmed-mean operations, gave a sharp analysis, and proved that the median-based distributed algorithm is robust and achieves the optimal fault-tolerance rate of distributed gradient descent. Li et al. [65] proposed the Byzantine-Robust Stochastic Aggregation (RSA) method; RSA regularizes the objective function to enhance the robustness of the learning task. Compared with most algorithms, the RSA method does not rely on the data being independent and identically distributed across clients, so it is suitable for a wider range of applications. Shejwalkar and Houmansadr [66] proposed divide-and-conquer (DnC) and demonstrated that DnC outperforms all existing Byzantine-robust FL algorithms in defeating model poisoning attacks.

6.2. Local Model Attack of Federated Learning

In addition, some researchers study the robustness of FL from the attacker's perspective. Attacks on FL mainly come from internal attackers participating in the FL process and from its particular model training strategy. Malicious adversaries may interfere with, or plant backdoors in, the distributed learning process. Baruch et al. [67] proposed a new attack method based on small but coordinated changes to many parameters; among existing defenses, a variant of the trimmed mean produced the best results against this convergence attack, excluding naive averaging, which is obviously vulnerable to even simpler attacks [67].

Bhagoji et al. [68] explored the threat of model poisoning attacks on federated learning initiated by a single, noncolluding malicious agent whose adversarial objective is to cause the model to misclassify a set of chosen inputs with high confidence. They used a suite of interpretability techniques to generate visual explanations of model decisions for both benign and malicious models and showed that the explanations are nearly visually indistinguishable. Their results indicate that even a highly constrained adversary can carry out model poisoning attacks while maintaining stealth, thus highlighting the vulnerability of the FL setting and the need to develop effective defense strategies [68]. Bagdasaryan et al. [69] exploited the privacy protection mechanism of FL and injected abnormal data to attack the model maliciously, making the attack unrecognizable by existing Byzantine anomaly detection. How to design robust FL systems is therefore an important topic for future research. Fang et al. [70] performed the first systematic study of local model poisoning attacks on FL. They assume that an attacker has compromised some client devices and manipulates the local model parameters on those devices during the learning process so that the global model has a large testing error rate. The experiments show that designing new defenses against local model poisoning attacks, new methods to detect compromised local models, and new adversarially robust aggregation rules is valuable future work [70].

7. Future Challenges

7.1. Data Privacy Issues

Under the framework of FL, although the users' local data do not need to be uploaded to the server, they are used directly in local modeling. If noise is not independently added to these local data to protect them, a malicious user may mount an attack [71]. There are two modes of attack: active attacks and passive attacks.
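
One common way to add such protective noise on the client side is a differential-privacy-style clip-and-perturb step applied to the local update before upload; the sketch below is only a minimal illustration with made-up parameter names and values, not the specific mechanism discussed in [71].

import numpy as np

def privatize_update(update, clip_norm=1.0, noise_std=0.5, rng=None):
    """Minimal sketch of client-side noise addition before upload:
    clip the update to a fixed L2 norm, then add Gaussian noise.
    Parameter names and values are illustrative only."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(update)
    clipped = update * min(1.0, clip_norm / (norm + 1e-12))  # bound sensitivity
    return clipped + rng.normal(0.0, noise_std, size=update.shape)

raw_update = np.random.default_rng(1).normal(0.0, 0.3, size=100)
safe_update = privatize_update(raw_update)   # what the client actually uploads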

When designing the FL algorithm protocol, if an active participant launches a malicious attack that undermines the security of the model, we call that participant an active attacker of FL.

The server can obtain the model update parameters from the various devices and can attack the FL model by analyzing the parameters of each round of updates; we call such semihonest-but-curious server-side attacks passive attackers. The main difference between the active attacker and the passive attacker is who initiates the attack: the initiating user of the active attacker is a client, and the initiating user of the passive attacker is the server. Both types of attacks damage the confidentiality, integrity, and availability of the FL model [72, 73]. An attacked federated model and the jointly trained model no longer stay consistent, and in the worst case, the jointly established model cannot be returned to the local clients.

7.2. Data Communication Issues

In the framework of FL, client-side and server-side devices communicate to transmit model parameters or gradients, and this communication is more frequent than in traditional distributed ML. However, the clients participating in joint training do not all have the same computing power or a stable transmission rate, which often causes communication instability. For example, when a mobile phone input method uses FL, some phones participate in joint modeling over mobile data and some over Wi-Fi; the stability of transmission over mobile data is usually worse than over Wi-Fi, so communication is easily interrupted while uploading or downloading model parameters. Even for the same phone in the same network state, communication can be unstable because the number of transmitted parameters differs. Therefore, in the modeling of FL, data communication is a problem worth pondering for researchers. In addition, the communication bandwidth problems raised in [74], the convergence of the joint training model, and the communication between cloud service providers all need further research.
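
One frequently studied way to shrink each upload, and thus make unstable connections less of a bottleneck, is to transmit only the largest-magnitude coordinates of the update; the following top-k sparsification sketch is purely illustrative and is not the bandwidth method of [74].

import numpy as np

def topk_sparsify(update, k):
    """Keep only the k largest-magnitude coordinates of a flattened update,
    returning (indices, values) so the client uploads k pairs instead of
    the full dense vector.  Purely illustrative."""
    idx = np.argpartition(np.abs(update), -k)[-k:]
    return idx, update[idx]

def densify(indices, values, dim):
    """Server-side reconstruction of the sparse update."""
    dense = np.zeros(dim)
    dense[indices] = values
    return dense

update = np.random.default_rng(2).normal(size=10_000)
idx, vals = topk_sparsify(update, k=100)      # upload about 1% of coordinates
recovered = densify(idx, vals, update.size)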

7.3. Data Heterogeneity Issues

The data in distributed ML are often independent and identically distributed, but FL differs from traditional distributed ML: devices in FL usually exist in the network in a nonindependent, nonuniformly distributed way, and the data participating in training are generally not independent and identically distributed. For example, banks and Internet shopping platforms share some of the same customers, but their data storage structures are heterogeneous. In addition, the uneven distribution of data held by cross-device data holders also leads to data heterogeneity. Therefore, many common algorithms designed for independent and identically distributed data cannot be used directly. How to design algorithms that are more compatible with heterogeneous FL data is a very important development direction of FL.
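
To see what such nonindependent and identically distributed client data look like, researchers often simulate label-skewed partitions, for example by drawing per-client class proportions from a Dirichlet distribution; the sketch below is a common experimental device for generating heterogeneous partitions, not an algorithm from this survey's references.

import numpy as np

def dirichlet_partition(labels, n_clients, alpha=0.5, rng=None):
    """Split sample indices across clients with label skew controlled by alpha:
    small alpha gives highly non-i.i.d. partitions, large alpha is close to i.i.d."""
    rng = rng or np.random.default_rng(0)
    clients = [[] for _ in range(n_clients)]
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_idx, shard in zip(range(n_clients), np.split(idx, cuts)):
            clients[client_idx].extend(shard.tolist())
    return clients

labels = np.random.default_rng(3).integers(0, 10, size=5_000)  # 10 classes
parts = dirichlet_partition(labels, n_clients=20, alpha=0.1)    # strongly skewed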

7.4. Data Overhead Issues

In the application scenario of FL, most of the local models that participate in training need to perform computing and communication tasks on mobile terminals. Because the number of participating local models is very large, this is not only a challenge for communication but also a great test for computing. FL is not only a technical label but also a business model. Encryption is a very important link in the financial industry, and the original cloud computing model has been challenged on encryption; adding an encryption algorithm to cloud computing data transmission is a common protection method. Some researchers have proposed secure mixing methods [74, 75], but these increase the cost of communication. After an encryption method is added, the data must also be decrypted, which further increases the computational cost of the model data. Literature [76] raises the problem of keeping a balance between communication cost and accuracy and guides this balance by evaluating the distributed statistics and learning rate under a given bandwidth. At present, no researchers have applied it to FL, and there is no up-to-date method that solves the problem of high data computing overhead, so this field still needs to be opened up and improved. The problem of data computing overhead urgently needs to be solved.
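
The extra overhead introduced by protecting uploads can be seen even in a toy version of mask-based secure aggregation, in which each pair of clients agrees on a random mask that cancels in the server's sum; the sketch below is a simplified illustration of this general idea (with no dropout handling or key agreement), not the secure mixing protocols of [74, 75].

import numpy as np

def masked_updates(updates, rng=None):
    """Each pair of clients (i < j) shares a random mask; client i adds it and
    client j subtracts it.  Individual uploads then look random, but the masks
    cancel in the sum.  A toy illustration only."""
    rng = rng or np.random.default_rng(0)
    n, dim = len(updates), updates[0].size
    masked = [u.copy() for u in updates]
    for i in range(n):
        for j in range(i + 1, n):
            mask = rng.normal(0.0, 10.0, size=dim)
            masked[i] += mask
            masked[j] -= mask
    return masked

clients = [np.full(4, float(v)) for v in (1, 2, 3)]
uploads = masked_updates(clients)
print(sum(uploads))   # close to [6, 6, 6, 6]: the sum is preserved
print(uploads[0])     # an individual upload reveals little about client 1

Even in this toy form, every client must generate and exchange masks with every other client, which is exactly the kind of added communication and computation cost discussed above.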

7.5. Lack of a Trusted Central Server

In the process of FL, a trusted central server is needed to ensure the privacy and security of users. Some scholars have proposed decentralized algorithms based on local update schemes with decentralized training on heterogeneous data. FL requires a central server to coordinate the training process and receive the models uploaded by all clients; the server is therefore a central participant and may become a single point of failure. Although large companies or organizations can play this role in certain application scenarios, in more collaborative learning scenarios a reliable and powerful central server may not always be available. Even if centralized differential privacy is adopted to protect the data, the central server must be trusted by users; otherwise, data leakage may occur. Future researchers can start with how to build a trusted central server to further improve the server structure of FL, making it less vulnerable to attacks and failures. Existing trusted-server designs mainly include ARM's TrustZone architecture and Intel's SGX-enabled CPU architecture [59]. Table 4 summarizes the federated learning problems.
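
As a sketch of what decentralized training without a central server can look like, the example below performs simple neighbor averaging (gossip) over a ring topology, so each client mixes its model with its neighbors instead of uploading to a coordinator; it is a conceptual illustration rather than an algorithm from the cited works.

import numpy as np

def gossip_round(models, mix_weight=0.5):
    """One round of decentralized averaging on a ring: each client mixes its
    model with the average of its two neighbors' models instead of sending
    it to a central server."""
    n = len(models)
    new_models = []
    for i in range(n):
        neighbors = 0.5 * (models[(i - 1) % n] + models[(i + 1) % n])
        new_models.append((1 - mix_weight) * models[i] + mix_weight * neighbors)
    return new_models

models = [np.full(3, float(v)) for v in range(5)]   # 5 clients, divergent models
for _ in range(20):
    models = gossip_round(models)
print(models[0])   # all clients converge toward the mean model [2, 2, 2]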

8. Conclusion

This paper discusses the classification and development of FL and several of its existing problems. It expounds FL from the algorithmic point of view, focusing on federated deep learning on the basis of an introduction to federated ML. In the chapter on federated deep learning, existing deep learning algorithms are discussed from the perspectives of communication, data heterogeneity, privacy protection, and trusted servers in FL. At present, FL is still in a stage of rapid development, and there are still many unsolved problems concerning ML and deep learning algorithms under the FL framework. With the further expansion of data volumes in the future, deep learning under FL is not only a feasible scheme for practice in the field of artificial intelligence but also a more efficient and comprehensive way to use distributed ML and edge data. In the future, FL will develop in coordination with multiple fields, such as edge computing, blockchain, and privacy protection, to improve the performance of FL and, at the same time, increase its commercial value. To help readers understand the common symbols in this paper, we have added a symbol table, shown in Table 5.

Data Availability

No data were used to support this study.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

All authors drafted, read, and approved the final manuscript.

Acknowledgments

This research was supported by the National Natural Science Foundation of China (42075130, 61773219, and 61701244) and the Key Special Project of the National Key R&D Program (2018YFC1405703).