Abstract

Federated learning (FL), a distributed machine-learning framework, can effectively protect data privacy and security, and it has been widely applied in a variety of fields in recent years. However, the system heterogeneity and statistical heterogeneity of FL pose serious obstacles to the quality of the global model. This study investigates server and client resource allocation from the perspective of FL system resource efficiency and proposes the FedAwo optimization algorithm. The approach combines adaptive learning with federated learning and makes full use of the computing resources of the server to calculate the optimal weight value for each client. The global model is then aggregated according to these optimal weights, which significantly reduces the detrimental effects of statistical and system heterogeneity. In traditional FL, we found that the local training of many clients converges earlier than the specified number of epochs; nevertheless, under the traditional protocol, these clients must still train for the specified epochs, which makes a large number of client-side computations meaningless. To further lower the training cost, the enhanced algorithm FedAwo* is proposed. FedAwo* takes the heterogeneity of clients into account and sets a criterion for local convergence; when a client's local model reaches this criterion, it is returned to the server immediately. In this way, the number of client epochs is adjusted dynamically and adaptively. Extensive experiments on the MNIST and Fashion-MNIST public datasets show that the global model converges faster and reaches higher accuracy with the FedAwo and FedAwo* algorithms than with the FedAvg, FedProx, and FedAdp baseline algorithms.

1. Introduction

Federated learning, a distributed machine-learning framework that can effectively protect the privacy and security of user data, has received extensive attention from academia and industry in recent years. Federated learning involves co-training a machine-learning model by a server and clients. The server sends the global model to clients, receives the local models trained by clients, and aggregates them to generate a new global model until training of the global model ends. Clients use local data to train the global model given by the server and return the trained local model to the server [1]. Federated learning effectively protects the privacy and security of data by transmitting only model parameters between the server and the clients (data do not leave the client) and is used in many fields. The most typical example is Google’s keyboard input method, which uses a federated learning platform to train a recurrent neural network (RNN) for next-word prediction. In addition, federated learning is widely used in clinical auxiliary diagnosis, new drug development, and precision medicine in the medical industry, as well as portrait recognition and voiceprint recognition in the security industry. Although federated learning effectively solves the problem of data privacy and security, it differs from traditional distributed machine learning and brings serious challenges in the form of system heterogeneity and statistical heterogeneity. Traditional distributed machine learning is usually deployed in the same data center or in a network with a good communication environment, and the clients used for model training have similar hardware conditions. However, the clients of federated learning are often widely distributed geographically; they differ greatly in network conditions, hardware environment, and computing power, and the times at which clients can participate in model training also differ. This phenomenon is called system heterogeneity, and it may lead to straggler problems (nodes that cannot complete the specified training rounds within the specified time) and fault-tolerance problems [2]. In addition, the data distribution and data volume of the local data held by different clients also differ, which is the statistical heterogeneity of the data. Both statistical heterogeneity and system heterogeneity have a negative impact on the convergence speed and final accuracy of the global model [3].

At present, most researchers try to reduce the negative impact of heterogeneity by sampling clients or by modifying the clients’ loss function. In the sampling approach, the server filters out the local models that are more conducive to global-model convergence and aggregates them. Among sampling algorithms [4, 5], importance-based methods are widely used [4, 6–8]; they select the “important” clients by comparing client gradient information and aggregate their local gradients. Modifying the loss function of the client is currently the more mainstream approach [2, 9, 10]. Its idea is to modify the client’s loss function, for example by adding a proximal term [2] or by normalizing with the previous round’s global model [11, 12]. However, the above methods ignore a crucial phenomenon: the imbalance of computing power between servers and clients in the federated learning system. In actual application scenarios, the computing power of clients is relatively weak, and modifying the client loss function further increases the computing burden of the client. Servers often have strong computing power and network conditions, yet they only undertake the task of aggregating local models and generating global models.

Obviously, in the federated learning system, the clients have weak computing power and poor network environments, yet they are responsible for the heavy work of model training, while the server, with strong computing power and a good network environment, undertakes light work that does not match its ability. In order to make better use of system resources and improve performance, this paper studies how to use server resources to solve the problems of statistical heterogeneity and system heterogeneity without increasing the load on clients. This paper proposes the federated learning algorithm for automatic weight optimization (FedAwo) and its enhancement algorithm (FedAwo*) and verifies the feasibility of the methods from both theoretical and experimental aspects. Our main contributions in this paper are as follows:
(1) We design a federated learning algorithm for automatic weight optimization (FedAwo). In this algorithm, the server calculates the optimal weight for each local model through a machine-learning algorithm to address the statistical and system heterogeneity in federated learning. The FedAwo algorithm effectively utilizes server resources and does not increase the burden on clients.
(2) We prove the convergence of FedAwo and propose the enhancement algorithm FedAwo* to further reduce the training cost. FedAwo* builds on the heterogeneity of clients and reduces the training cost by dynamically adjusting the number of local-training epochs.
(3) We use the MNIST and Fashion-MNIST public datasets as test datasets and use FedAvg, FedProx, and FedAdp as baseline algorithms to compare their performance with that of FedAwo and FedAwo* under IID and non-IID conditions. The results show that the FedAwo and FedAwo* algorithms converge faster and obtain a better global model. The experimental code of this article has been uploaded to GitHub (https://github.com/amazing-yx/FedAwo).

The rest of this paper is organized as follows: the next section introduces related work on handling heterogeneity in federated learning; we then introduce the federated learning algorithm for automatic weight optimization (FedAwo) in detail; after that, we prove the convergence of the FedAwo algorithm; we then propose the enhanced algorithm FedAwo*; next, we verify the performance of FedAwo and FedAwo* through experiments. Finally, we summarize this paper.

1.1. Related Work

Studies on the convergence of federated learning [2, 9, 11, 13] show that the system heterogeneity and statistical heterogeneity in federated learning have a strongly negative impact on the convergence speed and accuracy of the global model.

The optimization methods for heterogeneity mainly focus on modifying the loss function of clients or on sampling clients. For modifying the client loss function, literature [2] proposed the FedProx algorithm, which adds a proximal term to improve the stability of federated learning; FedProx also dynamically adjusts the number of client-training epochs to address the straggler problem caused by system heterogeneity, and its effect is more pronounced in strongly heterogeneous environments. However, because of the introduced proximal term, the computing overhead of the client increases, and in some cases the straggler problem becomes even more serious. Literature [11] proposed the SCAFFOLD algorithm, which corrects the client-drift phenomenon of the FedAvg algorithm by introducing a correction term. Literature [10] proposed the FedNova algorithm, which eliminates objective inconsistency and maintains fast convergence by normalizing local models. The SCAFFOLD and FedNova algorithms are similar to FedProx in this respect: although the communication overhead is further optimized and the model quality is improved, they still increase the computing overhead of the client. Literature [14] proposed the FedDyn algorithm, which keeps the local-model and global-model distributions approximately consistent by assigning a dynamic regularization optimizer to each client in each round. All of these methods can reduce the influence of heterogeneity on convergence speed and model accuracy, but they all increase the computational overhead of the clients, even though the computing power of the server is better than that of the client: in practice, most clients are always busy while the server is often idle.

For the sampling approach, the authors in [4] established a general sampling federated learning system and obtained an unbiased optimal sampling probability to alleviate the influence of heterogeneity on the global model. Literature [15] proposed the FedL algorithm, a graph convolutional network (GCN)-based sampling method that maximizes the accuracy of the global model by learning the relationship between network attributes, sampled nodes, and the generated offloads. Literature [16] classified local models according to the importance of clients in each round, aggregated the “important” local models, and proposed an approximately unbiased sampling optimization algorithm. Literature [17] proposed the FOLB algorithm, which estimates the gradient information of local models, infers the performance of each client, and performs weighted sampling based on it; this method copes with system heterogeneity and makes the global model converge quickly. Although sampling methods can make the global model converge quickly, the quality of the final global model is poor.

In addition, literature [18] proposed the FedHQ algorithm to handle system heterogeneity by minimizing the upper bound of the convergence speed as a function of the heterogeneous quantization errors of all clients and assigning different aggregation weights to different clients. To address heterogeneity, literature [19] proposed an algorithm with periodic compressed communication, which introduces a local gradient tracking scheme and obtains a fast convergence rate matching the communication complexity. Literature [20] analyzed the convergence bound of gradient-descent-based federated learning from a theoretical perspective and obtained a novel convergence bound; using this bound, it proposed a control algorithm that learns the data distribution, system dynamics, and model characteristics and, based on them, dynamically adapts the frequency of global aggregation in real time to minimize the learning loss under a fixed resource budget. Literature [18–20] addressed the system heterogeneity caused by external factors such as system configuration and hardware conditions but did not pay attention to the statistical heterogeneity caused by differences in local data.

Due to the limitations of the above two methods, this paper aims to solve the heterogeneity problem by introducing adaptive learning. Before this, literature [13, 21, 22] tried to combine adaptive learning with federated learning. Literature [21] proposed a federated learning optimization scheme with an adaptive gradient-descent function; this algorithm improves the privacy of the local training process through differential privacy and scaling of the update volume. It can enhance the privacy security of each client during joint learning but cannot effectively suppress the negative impact of heterogeneity. Literature [22] proposed the adaptive personalized federated learning (APFL) algorithm, in which each client trains its local model while contributing to the global model. APFL adaptively learns the model by leveraging the relatedness between local and global models as learning proceeds, which effectively improves the convergence speed of the global model. Literature [13] proposed federated adaptive weighting (FedAdp), which assigns different weights to nodes for global model aggregation in each round of communication. The FedAdp algorithm allocates client weights by calculating the angle between the global model and the local model. However, when the performance of the local model is superior to that of the global model, FedAdp still assigns a lower weight to the local model according to this angle, which is obviously unreasonable. We summarize the limitations of the above methods in Table 1.

In summary, modifying the client loss function increases the computational overhead of the client, the sampling methods suffer from low accuracy of the final global model, and current federated learning algorithms combined with adaptive learning do not focus on solving the heterogeneity problem. This paper differs from the above methods: from the perspective of resource allocation in the federated learning system, it makes full use of the advantageous resources of the server and combines adaptive learning to reduce the negative impact of heterogeneity. To the best of our knowledge, this paper is the first work aimed at using server computing resources to compute the optimal weight allocation.

2. Federated Learning Algorithm for Automatic Weight Optimization (FedAwo)

In this section, we establish the system architecture, propose the automatic weight optimization algorithm FedAwo, and then describe its workflow in detail.

2.1. System Model

A federated learning system generally includes one server and a set of clients. The server coordinates the training of each client and aggregates and distributes the global model. Each client holds its own local dataset, and the total amount of data is the sum of the data held by all clients [23–27]. Clients perform local learning under the coordination of the server. We first define a loss function evaluated at the model parameters. Thus, the global loss function over all clients can be defined as

The local loss function of each client is defined as the average of the loss evaluated at each of its local data samples, and each client's share of the training data serves as its weight in the global objective.

The global model aggregation rule is defined as a data-size-weighted average of the clients' local models.
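For concreteness, the standard FedAvg-style forms of the global loss, the local loss, and the aggregation rule can be written as follows; the notation (K clients, local datasets D_k with n_k = |D_k| samples, n the total number of samples, per-sample loss f, and model parameters w) is assumed here rather than taken verbatim from the original equations:

F(w) = \sum_{k=1}^{K} (n_k / n) F_k(w)    (global loss),
F_k(w) = (1 / n_k) \sum_{(x_i, y_i) \in D_k} f(w; x_i, y_i)    (local loss),
w^{t+1} = \sum_{k=1}^{K} (n_k / n) w_k^{t+1}    (aggregation).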

The purpose of federated learning is to find the model parameters that minimize (1), and the FedAvg algorithm repeats the processes of (3) and (4) until the global model converges. The most popular and de facto optimization algorithm for solving (1) is FedAvg [1]. Denoting t as the index of a federated learning round, one round (the t-th) of the FedAvg algorithm proceeds as follows: (1) The server uniformly broadcasts the global model to each client. (2) Each client performs local SGD on its local data to calculate an updated local model and sends it back to the server. (3) The server aggregates the clients’ updated models, each weighted by its share of the training data, and computes a new global model.

The above process repeats for many rounds until the global loss converges.
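For illustration, the following is a minimal PyTorch-style sketch of one FedAvg round; the function and variable names (fedavg_round, clients, and so on) are ours and are not taken from the paper's released code:

import copy
import torch

def fedavg_round(global_model, clients, lr=0.01, local_epochs=5):
    # One FedAvg round: broadcast, local SGD on each client, weighted aggregation.
    # `clients` is assumed to be a list of (dataloader, n_k) pairs.
    local_states, sizes = [], []
    for loader, n_k in clients:
        local = copy.deepcopy(global_model)            # step (1): broadcast w^t
        opt = torch.optim.SGD(local.parameters(), lr=lr)
        loss_fn = torch.nn.CrossEntropyLoss()
        for _ in range(local_epochs):                  # step (2): local SGD
            for x, y in loader:
                opt.zero_grad()
                loss_fn(local(x), y).backward()
                opt.step()
        local_states.append(local.state_dict())
        sizes.append(n_k)
    n = float(sum(sizes))
    new_state = {}                                     # step (3): weighted aggregation
    for key in local_states[0]:
        new_state[key] = sum((n_k / n) * st[key].float()
                             for st, n_k in zip(local_states, sizes))
    global_model.load_state_dict(new_state)
    return global_model

The body of the loop corresponds directly to steps (1)–(3) above, with each client's weight given by its share n_k/n of the training data.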

At present, research on the negative effects of heterogeneity mostly uses sampling methods or modifies the loss function of clients. Different from previous algorithms, we modify the aggregation weights in (1) to reduce the influence of heterogeneity on the global model by finding a correction value for each client’s weight. The global model aggregation is therefore rewritten as

As shown in Table 2, the loss function of the global model is updated accordingly.

2.1.1. Federated Learning Algorithm for Automatic Weight Optimization

We design a federated learning algorithm, FedAwo, for automatic weight optimization to obtain the optimal weight correction values. The FedAwo algorithm aims to reduce the negative impact of statistical and system heterogeneity on federated learning while making full use of the computing resources of the server. Compared with traditional federated learning, this algorithm requires the server to hold a certain amount of high-quality data, which is achievable in most federated learning tasks. We use these high-quality data as the server’s dataset and apply machine learning to calculate the optimal weight correction values. The specific process of the federated learning algorithm for automatic weight optimization is as follows:
(1) The server establishes a federated learning global model and a weight allocation model. Then, the server calculates an initial weight value for each client according to its data quantity; these initial weights form the initial client weight allocation vector. At the same time, the global model is broadcast to each client. The server holds a dataset whose samples are independent and identically distributed (IID) high-quality data, and each sample has a unique corresponding one-hot label. For example, in the MNIST dataset, the one-hot label of the digit zero is [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]. Stacking all labels gives the label matrix of the server dataset.
(2) Each client performs SGD on the received global model with its own local data until the specified criterion is reached and then sends the resulting local model to the server.
(3) For each data sample in the server dataset, the server feeds the sample into a client’s local model and obtains a one-hot prediction. Feeding all samples of the server dataset into one local model yields a prediction matrix, and repeating this for every client model yields the full set of prediction matrices.
(4) The server computes the product of the weight vector and the prediction matrices, in which each element represents the average prediction result for the corresponding sample of the server dataset. The server then calculates the cross-entropy loss between these weighted average predictions and the true label matrix; this loss reflects the prediction loss under the current weights. By minimizing this cross-entropy loss, the server obtains the optimal weights. In this paper, we adopt a machine-learning-based approach on the server: a neural network model is trained so that this cross-entropy loss is minimized.
(5) The server aggregates the local models according to the updated weight correction values of the current round to obtain the global model of the next round.
(6) The server broadcasts the new global model to each client, and steps (1)–(6) are repeated until the global model converges.

Require: initialized global model, initialized weight distribution model, and server dataset
Ensure: final global model
(1) for t = 0 to T do
(2)   Broadcast the global model to the clients
(3)   for e = 0 to I do
(4)     Local SGD update on the client
(5)   end for
(6)   Obtain the updated local model
(7)   Pass the local model to the server
(8)   Calculate the optimal weights through steps (3)–(5)
(9)   Aggregate the new global model
      Server updates the global model
(10) end for

For Algorithm 1, we need to define the initial global model, the initial adaptive learning model, and the initial weight values. The server broadcasts the global model to all clients within the specified time of the system. Each client uses local data to train the model for the specified number of epochs and then returns the model to the server. This process corresponds to the local-training steps of Algorithm 1. Then, on the server, the optimal weight values are obtained through the adaptive learning model, model aggregation is carried out according to these optimal weights, and the latest global model is obtained. This process is shown in lines 6–9 of Algorithm 1, which represent the model aggregation [28–33].
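To make the server-side step concrete, the following is a minimal sketch of the weight optimization, assuming the server dataset is available as tensors xs (inputs) and ys (integer class labels) and parameterizing the weights through a softmax so that they remain positive and sum to one; this parameterization is a design choice of the sketch, not necessarily the paper's exact neural-network formulation:

import torch
import torch.nn.functional as F

def optimize_weights(local_models, xs, ys, steps=200, lr=0.1):
    # Learn aggregation weights on the server (FedAwo-style sketch).
    with torch.no_grad():
        # Per-client class-probability predictions on the server data, shape (K, N, C).
        preds = torch.stack([F.softmax(m(xs), dim=1) for m in local_models])
    logits_a = torch.zeros(len(local_models), requires_grad=True)  # unconstrained weights
    opt = torch.optim.Adam([logits_a], lr=lr)
    for _ in range(steps):
        a = F.softmax(logits_a, dim=0)                   # positive weights summing to one
        mixed = torch.einsum('k,knc->nc', a, preds)      # weighted average prediction
        loss = F.nll_loss(torch.log(mixed + 1e-12), ys)  # cross-entropy vs. true labels
        opt.zero_grad()
        loss.backward()
        opt.step()
    return F.softmax(logits_a, dim=0).detach()           # learned weights

def aggregate_with_weights(global_model, local_models, weights):
    # Aggregate the local models with the learned weights (step (5)).
    new_state = {}
    for key in global_model.state_dict():
        new_state[key] = sum(w * m.state_dict()[key].float()
                             for w, m in zip(weights, local_models))
    global_model.load_state_dict(new_state)
    return global_model

Because the client predictions are computed once and frozen, the only trainable parameters on the server are the weights themselves, so this step is cheap relative to local training.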

The federated learning algorithm for automatic weight optimization adds an adaptive weight allocation algorithm to FedAvg. In the traditional weight allocation method (3), the weight of each client is allocated according to its amount of data, which is fully applicable under the IID condition. However, under the influence of heterogeneity, the amount of data alone cannot reflect the quality of client data, because in practice, affected by statistical heterogeneity, the data of most clients tend to shift toward certain features. In other words, most data on one client often share similar features. If such a client holds more data, aggregation according to (3) often leads to a poor result. The correct approach is to adjust the weights to minimize the cross-entropy: when the cross-entropy is smallest, the predicted distribution is closest to the global distribution, which is the biggest advantage of FedAwo over the traditional weight allocation algorithm. FedAwo converges quickly and improves the accuracy of the global model, and it remains applicable under IID conditions.

3. Proof of Convergence

3.1. Nonconvex Loss Functions

It is well known that, for the convergence of nonconvex loss functions, the expected gradient norm is usually taken as the convergence index to ensure convergence to a stationary point [15–17, 34]. Therefore, this article takes the norm of the expected gradient as the convergence index.

As is common in the literature [20–22], the following assumptions are adopted in this article.

Assumption 1. The loss functions are L-smooth: for any two model parameters, inequality (10) holds, where L denotes the Lipschitz constant.

Assumption 2. The stochastic gradients computed by clients are unbiased, and the second raw moment of the stochastic gradients of all local loss functions is bounded.
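In the standard form commonly used in the federated learning convergence literature (notation assumed here rather than copied from the original equations, with g_k(w) denoting a stochastic gradient of the k-th local loss and G^2 the bound on its second raw moment), these assumptions can be written as

F_k(v) <= F_k(w) + <\nabla F_k(w), v - w> + (L/2) ||v - w||^2    for all v, w    (Assumption 1),
E[g_k(w)] = \nabla F_k(w),    E ||g_k(w)||^2 <= G^2    (Assumption 2).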

Theorem 1. Suppose Assumptions 1 and 2 hold. When the step size is set appropriately, the convergence of FedAwo with nonconvex loss functions satisfies a bound whose right-hand side involves the gap between the initialized value of (1) and its minimum value.

Proof. To prove that the convergence bound of Theorem 1 holds, an intermediate bound is deduced first.
According to (5), the aggregated update can be expanded and the amount of data expressed accordingly (Appendix A). From this expansion, a per-round descent inequality can be derived. The expectation of the inner product in inequality (14) can be derived as inequality (15) (Appendix B), and the remaining term of inequality (14) can be rewritten as inequality (16) (Appendix C). Substituting inequalities (15) and (16) into inequality (14), dividing both sides of (17) by the step size, and rearranging terms yields a per-round bound. Summing over all rounds, dividing both sides by the number of rounds, and finally substituting the chosen step size into (18) yields the desired result (12). So Theorem 1 is true.

3.2. Strongly Convex Loss Functions

Compared with nonconvex loss functions, the convergence analysis of convex loss functions usually adds Assumption 3 [13, 16, 22].

Assumption 3. The loss functions are strongly convex; for any two model parameters, the strong-convexity inequality holds.

Theorem 2. Suppose Assumptions 1 to 3 hold. For any t > t0 (where t0 is a constant), when the step size is set appropriately, the convergence of FedAwo with strongly convex loss functions satisfies inequality (21), in which the constants are defined in the proof.

Proof. Substituting the step size into inequality (22) (Appendix D) yields inequality (23). For the sake of simplicity, the constants are abbreviated, and inequality (23) is rewritten as inequality (24). Next, induction is used to derive Theorem 2. The base case clearly holds; then, assuming that inequality (21) is true at a given round, it follows from inequalities (24) and (25) that it also holds at the next round. Therefore, inequality (21) is true; i.e., Theorem 2 is true.

4. Federated Learning Enhancement Algorithm for Automatic Weight Optimization (FedAwo*)

System heterogeneity arises from differences in clients’ computing power, storage capacity, load capacity, and network environment. Under the traditional FL protocol, clients whose local models have already converged must still carry out model training for the specified number of epochs, which wastes computing resources and energy on those clients. Therefore, we further optimize the FedAwo algorithm and propose an enhanced algorithm, FedAwo*. Based on the FedAwo algorithm, FedAwo* adds an adaptive training-round optimization algorithm on the client side, which can effectively reduce the model-training overhead of clients.

The above phenomenon is common in federated learning, but traditional federated learning algorithms do not pay attention to it, and it is aggravated as federated learning proceeds, which leads to a large number of invalid calculations on the client and adds a lot of meaningless computational overhead. Therefore, it is necessary to add a convergence criterion to local training. This is where FedAwo* improves on the FedAwo algorithm. The method returns to the server any local model that satisfies the convergence criterion, even if the specified number of epochs has not been completed. This idea appears similar to that of the FedProx algorithm [2], but their starting points are completely different: FedProx aims to solve the straggler problem, while FedAwo* aims to reduce training cost. When the model trained by a client reaches the convergence criterion we set, local training stops automatically even if the number of completed epochs is less than the epoch limit set by the system, and the converged local model is returned to the server, thereby reducing the computational overhead of clients.

The specific process of FedAwo* is as follows:
(1) In each epoch, the client saves the loss of the current epoch and subtracts the loss of the previous epoch to obtain the difference.
(2) The client then judges convergence: if the loss difference is smaller than a very small threshold and the current loss is below a threshold close to the global-model convergence loss, the local model is considered converged and is immediately returned to the server. These threshold values are adjusted according to the specific situation; the values used in our experiments are given in the Experiments section.
(3) If the conditions in (2) are not met within the specified number of epochs, the local model is returned to the server after training for the specified epochs.

Require: initialized global model, initialized weight distribution model, server dataset, and initialized convergence thresholds
Ensure: final global model
(1) for t = 0 to T do
(2)   Broadcast the global model to the clients
(3)   for e = 0 to I do
(4)     Local SGD update on the client
(5)     Update the loss difference between consecutive epochs // optimization part of the FedAwo* algorithm
(6)     Record the current local loss
(7)     Calculate whether the local model converges:
(8)     if the loss difference and the current loss are both below their thresholds, or the epoch limit is reached, then
(9)       Break
(10)    else
(11)      Continue
(12)    end if
(13)    Pass the local model to the server
(14)  end for
(15)  Server side:
(16)  Calculate the optimal weights through steps (3)–(5) of FedAwo
(17)  Aggregate the new global model
      Server updates the global model
(18) end for

In Algorithm 2, we need to define the initial global model, the initial adaptive learning model, the initial weight values, the initial loss-function difference, and the convergence thresholds. The server broadcasts the global model to all clients within the specified time of the system. Each client uses local data to train the model for up to the specified number of epochs and then returns the model to the server. At the same time, in each local training epoch, we record the difference between the loss of this epoch and that of the previous epoch. When the loss difference between two consecutive epochs is very small and the current loss is close to the convergence loss, we consider that the local model has converged and immediately return it to the server. This process is shown in lines 2–14 of Algorithm 2, which represent the local model training. Then, on the server, the optimal weight values are obtained through the adaptive learning model, model aggregation is carried out according to these optimal weights, and the latest global model is obtained. This process is shown in lines 15–18 of Algorithm 2, which represent the model aggregation.
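A minimal sketch of this client-side logic is given below; epsilon and delta are hypothetical names for the loss-difference threshold and the convergence-loss threshold described above, and their default values are illustrative only:

import torch

def local_train_adaptive(model, loader, lr=0.01, max_epochs=5,
                         epsilon=1e-3, delta=0.1):
    # Client-side training with FedAwo*-style early stopping (sketch).
    # epsilon: threshold on the loss change between consecutive epochs.
    # delta: threshold close to the expected convergence loss.
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    prev_loss = None
    for epoch in range(max_epochs):
        total, batches = 0.0, 0
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            opt.step()
            total += loss.item()
            batches += 1
        cur_loss = total / max(batches, 1)
        # Stop early once the loss has stabilized and is already small.
        if prev_loss is not None and abs(prev_loss - cur_loss) < epsilon and cur_loss < delta:
            break
        prev_loss = cur_loss
    return model  # returned to the server as soon as the criterion is met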

The FedAwo* algorithm reduces the computational overhead of clients by dynamically adjusting the number of local-training epochs. In a large number of experiments, we found that during federated learning some clients converge before completing the specified number of epochs. Under previous federated learning algorithms, these clients still need to train until the specified epoch, which inevitably wastes computing resources [2]. Therefore, the FedAwo* algorithm adaptively judges whether the SGD process has converged during client training. If the convergence conditions are reached before the specified epoch, SGD is stopped and the converged local model is returned to the server; otherwise, SGD continues and stops after reaching the specified epoch.

5. Experiments

5.1. Experimental Environment

In order to analyze the performance of the FedAwo and FedAwo* algorithms, we established an experimental environment based on PyTorch 1.10.1 and CUDA 10.2. The software environment is Python 3.8. The hardware environment is a 3.60 GHz AMD Ryzen 7 3700X 8-core CPU, 16.00 GB of RAM, Windows 10 64-bit, and an NVIDIA GeForce RTX 2070. The simulation experiment strictly follows the protocols and rules that may be used in distributed federated learning [35]. More details of the experimental environment are shown in Table 3.

5.2. Experimental Setup

In this paper, the MNIST and Fashion-MNIST datasets are selected as experimental datasets to verify the performance and stability of the FedAwo and FedAwo* algorithms. MNIST and Fashion-MNIST are two image datasets, and we normalize each of them. For the IID dataset partition, data samples are evenly and randomly distributed to clients. For the non-IID dataset partition, data samples are sorted by their labels and divided into 2K groups, and each client receives two groups (i.e., samples corresponding to two labels), as sketched below.
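A minimal sketch of this label-sorted partition, assuming K = 100 clients and that labels is a NumPy array of the training labels (all names are illustrative):

import numpy as np

def noniid_partition(labels, num_clients=100, shards_per_client=2, seed=0):
    # Sort samples by label, split into 2K shards, and give each client two shards,
    # so that each client mostly holds samples from two labels.
    rng = np.random.default_rng(seed)
    order = np.argsort(labels)
    shards = np.array_split(order, num_clients * shards_per_client)
    shard_ids = rng.permutation(len(shards))
    client_idx = {}
    for k in range(num_clients):
        picked = shard_ids[k * shards_per_client:(k + 1) * shards_per_client]
        client_idx[k] = np.concatenate([shards[s] for s in picked])
    return client_idx  # maps client id -> indices of its local samples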

For MNIST, the dataset has 60000 training samples and 10000 test samples. It is an image dataset of handwritten digits 0–9, and each sample contains 28 × 28 pixels. We set a total of K = 100 clients and allocate 600 training samples to each client. In addition, when using the FedAwo algorithm, we take 2000 samples from the 10000 test samples as the server dataset for adaptive learning of the weight distribution and use the remaining 8000 samples as the test set. For comparison, we configured the same CNN model according to the method proposed in [1]. The model has two 5 × 5 convolution layers (the first with 32 channels, the second with 64 channels, each followed by 2 × 2 max pooling), a fully connected layer with 512 units and ReLU activation, and a final softmax output layer.

For Fashion-MNIST, the dataset also has 60000 training samples and 10000 test samples. It is an image dataset of fashion products, and each sample also contains 28 × 28 pixels. The other experimental settings are consistent with those for the MNIST dataset.

The specific experimental settings are as follows: we set the learning rate to 0.01, the batch size to 64, and the number of local epochs to 5. Since the MNIST and Fashion-MNIST datasets have the same input and output dimensions and are both image datasets, we use the same CNN model for both. The details of the model settings are shown in Table 4.
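A PyTorch sketch of a CNN matching this description is shown below (two 5 × 5 convolution layers with 32 and 64 channels, each followed by 2 × 2 max pooling, a 512-unit fully connected layer with ReLU, and a softmax output applied by the loss during training); the "same"-style padding is an assumption of this sketch:

import torch.nn as nn

class MnistCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                       # 28 x 28 -> 14 x 14
            nn.Conv2d(32, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.MaxPool2d(2),                       # 14 x 14 -> 7 x 7
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 7 * 7, 512), nn.ReLU(),
            nn.Linear(512, num_classes),           # softmax is applied by CrossEntropyLoss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

Such a model can be trained with plain SGD at the learning rate 0.01 and batch size 64 stated above.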

5.3. Results of the Experiment

We chose the most classic and widely used FedAvg, FedProx, and FedAdp algorithms as the baselines of the experiments.

For the MNIST dataset, we first used the IID data distribution to compare the FedAwo, FedAwo*, FedAvg, FedProx, and FedAdp algorithms. As shown in Figures 1 and 2, under the IID data distribution the five algorithms converge within 10–15 communication rounds. FedAvg converges more slowly and reaches a lower global-model accuracy than the other four algorithms, but the gap is not obvious.

The non-IID experiments distribute heavily skewed data to individual clients, and the results are shown in Figures 3 and 4. We can see that the convergence rates of all five algorithms were affected by statistical heterogeneity. The FedAvg algorithm was seriously affected, which led to a significant decrease in convergence speed; it converged only after the 70th communication round, and the quality of its global model was clearly inferior to that under the IID condition. Due to the proximal term added to the loss function, the quality of the global model was not affected in the FedProx algorithm, but its convergence speed was still slowed down; the same is true for the FedAdp algorithm. For the FedAwo and FedAwo* algorithms, both the convergence speed and the quality of the global model were minimally affected by statistical heterogeneity, and they reached convergence around the 30th communication round.

We also simulated system and statistical heterogeneity of federated learning simultaneously, which further increased the influence of heterogeneity on the global model. As seen in Figures 5 and 6, the FedAvg algorithm suffered a large drop in convergence speed and global-model quality; the model did not fully converge until round 80. The convergence speed of FedProx and FedAdp was not significantly slower than under statistical heterogeneity alone, but the quality of their global models degraded. For FedAwo, both the convergence rate and the quality of the global model were still minimally affected, while FedAwo* showed some fluctuations under the influence of system heterogeneity. The convergence speed and global-model quality of the FedAwo and FedAwo* algorithms were better than those of the FedAvg, FedProx, and FedAdp baseline algorithms.

In order to confirm the superiority of the FedAwo and FedAwo* algorithms, we conducted the Friedman test on the model accuracy and loss of these five algorithms and obtained stat = 14.68, p value = 0.00184 and stat = 10.24, p value = 0.02626. The Friedman test can only show that there are differences among the models’ accuracy and loss; it cannot show which model is better. Therefore, we conducted the Nemenyi test on the above algorithms to further verify whether there is a significant difference between each pair of models. According to the results shown in Table 5, it can be concluded that the FedAwo and FedAwo* algorithms are superior to the other three algorithms.
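As an illustration, the Friedman test can be computed with SciPy from matched measurements of the five algorithms; the accuracy series below are placeholder values only, not the recorded experimental results:

from scipy.stats import friedmanchisquare

# Placeholder per-run accuracy measurements for the five algorithms
# (illustrative numbers only; replace with the recorded results).
acc_fedavg      = [0.82, 0.85, 0.87, 0.88, 0.89]
acc_fedprox     = [0.84, 0.88, 0.90, 0.91, 0.91]
acc_fedadp      = [0.85, 0.88, 0.90, 0.91, 0.92]
acc_fedawo      = [0.90, 0.93, 0.94, 0.95, 0.95]
acc_fedawo_star = [0.89, 0.92, 0.94, 0.94, 0.96]

stat, p_value = friedmanchisquare(acc_fedavg, acc_fedprox, acc_fedadp,
                                  acc_fedawo, acc_fedawo_star)
print(f"Friedman statistic = {stat:.2f}, p value = {p_value:.5f}")
# A post hoc Nemenyi test (e.g., posthoc_nemenyi_friedman from scikit-posthocs)
# can then identify which algorithm pairs differ significantly.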

In addition to accuracy and loss, we also report the final precision, recall, AUC, and F1 values of the global model as performance indicators to compare the five algorithms under statistical heterogeneity, as shown in Table 6.

For the Fashion-MNIST dataset, we obtained conclusions similar to those for MNIST. According to Figures 7 and 8, the convergence speed of the FedAvg algorithm is slower under the IID condition, while the other four algorithms do not differ much.

According to Figures 9 and 10, the experimental results under statistical heterogeneity alone were also similar to those on the MNIST dataset.

In Figures 9–12, we can see that although the FedAwo and FedAwo* algorithms show some fluctuations, their convergence speed and model accuracy are better than those of the baseline algorithms.

For the Fashion-MNIST dataset, we conducted the Friedman test on the model accuracy and loss of the five algorithms and obtained stat = 13.27, p value = 0.00181 and stat = 10.24, p value = 0.02626. We then conducted the Nemenyi test on the above algorithms to further verify whether there is a significant difference between each pair of models. According to the results in Table 7, it can be concluded that the FedAwo and FedAwo* algorithms are superior to the other three algorithms.

Similarly, for the Fashion-MNIST dataset, we also report the final precision, recall, AUC, and F1 values of the global model as performance indicators to compare the five algorithms under statistical heterogeneity, as shown in Table 8.

In addition, we tested the computational overhead of the algorithms under the IID and non-IID conditions. The results for the IID condition are shown in Figures 13 and 14. The algorithms other than FedAwo* do not judge whether the local model has converged, so their clients train 5 epochs in each round; the total amount of computation across 100 clients in one communication round is therefore 500 epochs. Because the dataset is IID, the local and global models converge quickly, and as they approach convergence, the clients of FedAwo* save more computing resources.

Under the non-IID condition (statistical heterogeneity), FedAwo* can still save client computing resources. However, compared with the IID condition, convergence is slower under statistical heterogeneity, so the computing-resource savings of the FedAwo* algorithm are slightly smaller, as shown in Figures 13–16.

5.4. Discussion on Experiment

According to Figures 1 and 2, on the MNIST dataset every federated learning algorithm performs similarly in the absence of heterogeneity. When we use local datasets with statistical heterogeneity, as shown in Figures 3 and 4, the global-model accuracy and convergence speed of the FedAvg algorithm drop significantly. The global-model accuracy of the FedProx and FedAdp algorithms is not affected, but their convergence speed decreases noticeably, reaching convergence around the 70th round. In contrast, the global-model accuracy and convergence speed of the FedAwo and FedAwo* algorithms are almost unaffected by statistical heterogeneity, and they reach convergence within about 20 rounds. On this basis, we add system heterogeneity. When the two types of heterogeneity exist at the same time, heterogeneity has a more pronounced negative impact on model aggregation. As shown in Figures 5 and 6, the global accuracy and convergence speed of the FedAvg, FedProx, and FedAdp algorithms decrease significantly. The FedAwo and FedAwo* algorithms are only slightly affected; their global-model accuracy still reaches 90%, and they converge within 20 rounds. On the Fashion-MNIST dataset, we obtain consistent results, as shown in Figures 7–12. These experiments show that the optimal weights calculated by the adaptive learning algorithm have significant advantages over the weights assigned by traditional federated learning according to each client’s data volume. The FedAwo* algorithm further optimizes the client-side computational cost of FedAwo: as shown in Figures 13–16, FedAwo* can significantly reduce the computing overhead of clients under both the IID and heterogeneous settings.

Through the above experiments, we can clearly see that the ability of the FedAwo and FedAwo* algorithms to handle the heterogeneity of federated learning is better than that of the three baseline algorithms. Even under both system and statistical heterogeneity, the algorithms in this paper still converge quickly and maintain excellent global-model quality. In addition, they remain applicable in the IID setting. Therefore, the FedAwo and FedAwo* algorithms are general and can be applied to most federated learning scenarios. The FedAwo* algorithm further optimizes the convergence criterion of the local model: as shown in Figures 13 and 16, FedAwo* saves significantly more client computing overhead than the other algorithms. Therefore, FedAwo* is an adaptive weight-optimization federated learning algorithm that can effectively address heterogeneity and save computational overhead, and it has great advantages over existing algorithms.

6. Conclusion

We investigate an automatic local-model weight optimization strategy to reduce the negative effects of system and statistical heterogeneity in federated learning and propose the federated learning algorithms FedAwo and FedAwo*. The FedAwo algorithm improves the convergence speed of the global model and obtains a global model with higher accuracy, and the enhancement algorithm FedAwo* reduces the training overhead. Experimental results verify the superiority of the proposed schemes in terms of convergence speed and global-model accuracy, as well as the effectiveness of FedAwo* in saving client computing overhead. In this paper, we combine adaptive learning with federated learning to solve the heterogeneity problem and achieve remarkable results, putting forward a new idea for reducing the negative impact of heterogeneity in federated learning.

7. Future Work

However, the FedAwo and FedAwo* algorithms also show some instability. As shown in Figures 9 and 10 in the Experiments section, the global model exhibits a zig-zag spike phenomenon when it is close to convergence. The reason is that the learning rate remains too high when the algorithm is about to converge. In future work, we hope to mitigate this zig-zag spike phenomenon by dynamically adjusting the learning rate. In addition, we will further improve the adaptive learning model $\vartheta^0$ to further improve the performance of the FedAwo algorithm.

Appendix

A. Proof of Equation (13)

We expand equation (5) according to the SGD update.

B. Proof of Inequality (15)

We expand the right half of equation (14)

C. Proof of Inequality (16)

The corresponding term of inequality (14) can be rewritten as inequality (16).

D. Proof of inequality (22)

We substitute into inequality (21).

Data Availability

The MNIST and Fashion-MNIST datasets used to support the findings of this study are publicly available: MNIST at https://www.kaggle.com/datasets/oddrationale/mnist-in-csv and Fashion-MNIST at https://www.kaggle.com/datasets/zalando-research/fashionmnist. The experimental code for this manuscript is available at https://github.com/amazing-yx/FedAwo.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the Major Science and Technology Special Project of Henan Province (No. 201300210400), the Science and Technology Department of Henan Province (No. 222102520006), and the Key R&D and Promotion Special Project of Henan Province (No. 212102210094).