Abstract

Federated learning is a new machine learning framework: models are trained locally on multiple clients, and the local models are then uploaded to the server for aggregation, iterating until the global model converges. In most cases, the number of local epochs is set to the same value for all clients in federated learning. In practice, however, the clients are usually heterogeneous, which leads to inconsistent training speeds. The faster clients remain idle for a long time waiting for the slower clients, which prolongs model training. Since the time cost of a client's local training reflects its training speed and can be used to guide the dynamic setting of local epochs, we propose a deep-learning-based method to predict the training time of models on heterogeneous clients. First, a neural network is designed to extract the influence of different model features on training time. Second, we propose a dimensionality reduction rule to extract, based on this influence, the key features which have a great impact on training time. Finally, we use the key features extracted by the dimensionality reduction rule to train the time prediction model. Our experiments show that, compared with the current prediction method, our method reduces the model features by 30% and the training data by 25% for the convolutional layer, and the model features by 20% and the training data by 20% for the dense layer, while maintaining the same level of prediction error.

1. Introduction

In recent years, with the rapid development of 5G, “Interconnection of All Things” has become a trend of information technology. The speed of information circulation on the Internet has reached an unprecedented level. The rapid information circulation has brought a dramatic increase in data and promoted the development of big data and artificial intelligence technology. However, the interconnection of more devices brings higher security risks. How to enjoy the benefits brought by artificial intelligence on the premise of ensuring data privacy has become a challenge.

In 2016, McMahan et al. [1] first proposed the concept of federated learning in response to this challenge. In federated learning, multiple clients jointly train a model under the coordination of a central server or service provider. Each client uses its data to train a local model and uploads the parameters of its local model to the server for model aggregation to obtain the global model. Therefore, the original data of each client is stored locally without exchange or transmission. At present, federated learning has been widely applied to optimize the user experience while protecting privacy at Google, Apple, and other enterprises. For example, Google has widely used federated learning in Gboard [2], Pixel phones [3], and Android Messages [4], as has Apple in iOS 13 [5].

In the FedAvg algorithm proposed by McMahan et al., the central server generates a global model by aggregation in each synchronization and then distributes the global model to a randomly selected subset of clients. After receiving the global model, the clients use their own data to train their local models, initialized with the parameters of the global model, for the specified number of epochs. After the local models are trained, the clients send them to the server, and the server performs model aggregation with a weighted average strategy. The global model gradually converges after several synchronizations. According to FedAvg, the update of the global model in each synchronization can be expressed by formula (1):

$$ w_{t+1} = w_t - \eta \sum_{i=1}^{N} \frac{n_i}{n} \sum_{k=1}^{\tau_i} g_i\!\left(w_{t,k}^{(i)}\right), \qquad \tau_i = \frac{E \, n_i}{B} \qquad (1) $$

where $w_t$ represents the parameters of the global model in the $t$-th synchronization of model aggregation, $N$ is the number of clients participating in aggregation, $n_i$ is the amount of client $i$'s data, and $n$ denotes the amount of all clients' data. $\eta$ is the client learning rate, $\tau_i$ represents the number of client $i$'s local updates, which can be obtained by $\tau_i = E n_i / B$, $E$ is the number of local epochs, $B$ is the mini-batch size of the client, $w_{t,k}^{(i)}$ represents the parameters of the local model in the $k$-th local update on the $i$-th client after the $t$-th synchronization, and $g_i(\cdot)$ denotes the gradient of client $i$'s local model parameters.
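For concreteness, the server-side weighted aggregation implied by formula (1) can be sketched in a few lines of NumPy. This is a minimal illustration assuming each client's parameters are already flattened into a single array; the function name and data layout are ours, not part of the FedAvg reference implementation.

import numpy as np

def fedavg_aggregate(local_params, data_sizes):
    """Weighted average of client parameters, as in the aggregation of formula (1)."""
    n = float(sum(data_sizes))                      # total amount of data over all clients
    aggregated = np.zeros_like(local_params[0])
    for w_i, n_i in zip(local_params, data_sizes):
        aggregated += (n_i / n) * w_i               # client i is weighted by n_i / n
    return aggregated

# Example: three clients with unequal data volumes
locals_ = [np.array([1.0, 2.0]), np.array([2.0, 0.0]), np.array([0.0, 4.0])]
global_params = fedavg_aggregate(locals_, data_sizes=[100, 300, 600])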

In each synchronization of global model updating, FedAvg sets the same number of local epochs for all clients participating in model aggregation, which leads to inefficient training because of the differences in training speed. Many other federated learning algorithms are set up in the same way, such as FedProx [6]. Although FedNova [7] can train the global model with different numbers of local epochs, it sets the number of local epochs randomly, which may assign a small number of local epochs to clients with high training speed and thus still cause idling. However, FedNova gives us an inspiration: if we can predict the model training time of clients, the local epochs can be set dynamically according to the predicted training time, which reflects the training speed of the clients.

Since the idling of faster clients in federated learning prolongs the training of the global model, we propose to predict the training time of deep learning models on heterogeneous clients to guide the dynamic setting of the number of local epochs. In a deep learning task, the training time may be affected by the amount of training data and the setting of hyperparameters. We call the factors in training data and hyperparameters that may affect the training time model features. Justus et al. [8] trained a multilayer perceptron (MLP) to predict the training time of layers in a neural network using model features and training times collected on different GPUs. They divide model features into layer features and predict the training time of the complete model by accumulating the prediction results of multiple layers. However, when the structure of the model is very complex or the number of layers is very large, collecting many features in each layer increases the burden on the system and hinders the convergence of the global model. What is more, when a new device is added to the federated learning system, a large amount of high-dimensional training data has to be collected on this device to tune the current prediction model, which is very time-consuming and leaves training time prediction unavailable for the new device for a long period.

To solve these problems, we propose a deep-learning-based method to reduce the number of model features and the amount of training data required by training time prediction. We first design a neural network to extract the influence of model features on training time, which can accurately interpret the relationship between model features and training time. Then, we propose a dimensionality reduction rule to extract the key features based on this influence. Extensive experiments show that, compared with the current prediction method, the features of the convolutional layer and the dense layer can be reduced by 30% and 20%, respectively, while the training time prediction error remains at the original level. At the same time, the convolutional-layer training data is reduced by 25% and the dense-layer training data by 20%, which speeds up the adaptation of the time prediction model to new devices.

The rest of this paper is arranged as follows: in Section 2, we discuss the related work in time prediction; in Section 3, we introduce the method proposed in this work, including our neural network and dimensionality reduction rule. We also provide an algorithm to dynamically set the number of local epochs in this section. To verify our work, we set up a large number of experiments and interpret the experimental results in Section 4. Finally, we provide the summary of all work in Section 5.

2. Related Work

Machine learning regression algorithms such as linear regression, random forest, and GBDT were first used to predict time series. Edelman et al. [9] use a linear regression model to predict the execution time of surgery; Wang et al. [10] train regression decision trees to predict the arrival time of buses using a nearest-neighbor-based random forest algorithm; Cheng et al. [11] use GBDT to predict traffic time over different time ranges and identify the variables which have a great impact on prediction error. These regression methods have good universality and can be applied in many fields, but their prediction error range is very large, so they can only be applied to scenarios with low sensitivity to time fluctuations.

To narrow the error range of time series prediction, some scholars have proposed methods that exploit domain-specific knowledge, constructing mathematical models of the calculation characteristics of a specific domain, such as PALEO [12] and Optimus [13]. PALEO predicts computing time by counting floating-point operations: it counts the number of floating-point operations required in one training epoch and multiplies this number by a scale factor to predict the training time. However, PALEO assumes that the whole training process is linearly related to the number of floating-point operations, ignoring operations that are not, such as parameter transmission. Unlike PALEO, Optimus mathematically summarizes the factors that affect model training, establishes a performance model to evaluate the training speed, and can predict model convergence according to online resources. Compared with the regression methods, these works reduce the prediction error range of model training time to a certain extent, but the mathematical model established for training is coarse and ignores some factors which contribute greatly to the training time, resulting in unstable predictions.

Because of the excellent performance of deep learning models, researchers began to use deep learning methods to predict time series and to further reduce the prediction error. Xu et al. [14] creatively combine linear regression and a deep belief network (DBN) to predict time series; PreVIous [15] trains an MLP model to predict the inference time of convolutional neural networks according to the throughput and energy consumption of Internet of Things vision devices; Petersen et al. [16] design a neural network mixing convolutional and LSTM layers to accurately predict bus arrival times. These works achieve high prediction accuracy, but their application to the training time prediction of deep learning models is limited by the specific model structure: their time prediction models can only predict the network structures contained in their training data (such as VGG [17], ResNet [18], or user-defined networks). When a network with a new structure is encountered, their models need to be retrained; that is, they cannot be applied to other new deep learning models. Although Fathom [19] has shown that the inference time of a model can be estimated by another model with a similar structure and known performance, its prediction is very rough, and whether this method can be used to predict training time remains to be proved. In order to accurately predict the training time of networks with different structures, Justus et al. divide the neural network into layers, classify these layers (such as convolutional layers and dense layers) according to their structural characteristics, then collect the layer features and train an MLP model to predict the training time of a single layer, which achieves high prediction accuracy. This method has good generality: when a network with a new structure is encountered, it only needs to predict the training time of the layers from the layer model features, and the training time of the whole model can then be predicted by accumulating the training time of the layers. However, there are some problems when Justus et al.'s method is applied to federated learning: (1) the relationship between model features and training time is not accurately characterized; they assume that almost every model feature is necessary for training time prediction, including features that have no or little impact on the final result. (2) Too many unimportant features need to be collected; when the neural network is very deep, collecting features in every layer increases the burden on the federated learning system, and it is usually difficult to obtain all details of clients' models. (3) High training cost is caused by redundant training data; unimportant features produce a lot of redundant training data, which increases the transfer training cost of the time prediction model on newly added devices and reduces the training efficiency of the global model.

To solve the above problems, we propose a training time prediction method based on deep learning, which can reduce the required model features and training data while ensuring low prediction error, improving the feasibility of practical application in federated learning. The contributions of this paper are as follows:
(1) We design a neural network to extract the influence of model features on training time according to the characteristics of the deep learning model, which provides an effective analysis of the relationship between model features and training time.
(2) We propose a dimensionality reduction rule to extract the key features that have a great impact on training time according to the influence of features, which can reduce the number of features required for predicting model training time without loss of prediction accuracy. By using the dimensionality reduction rule, 7 dimensions are extracted from the convolutional layer features (10 dimensions in total), and 4 dimensions are extracted from the dense layer features (5 dimensions in total).
(3) We train the time prediction model using dimension-reduced datasets. Compared with the method of Justus et al., the training data of the convolutional layer is reduced by 25%, and the training data of the dense layer is reduced by 20%, with the prediction error remaining at the same level.

3. Methodology

In this section, we introduce the overall process and technical details of extracting the influence of model features on training time, of dimensionality reduction, and of the algorithm for dynamically setting the number of local epochs. First, we prove the feasibility of accumulating the layers' training time to predict the whole network by interpreting the calculation process of training, and we describe the layer features based on the work of Justus et al. in detail. Second, we introduce the structure of the neural network (hereafter called the weights model), which is designed to extract the influence of model features on training time. Third, we propose the dimensionality reduction rule to extract the key features which have a great impact on training time, based on the influence of model features. Finally, we provide a representative algorithm for dynamically setting the number of local epochs.

3.1. Feature Analysis

One training step of a neural network consists of forward propagation and backward propagation. With the widespread use of Batch Normalization [20], which can speed up the convergence of neural networks, a batch of forward propagations is usually performed before one backward propagation. A complete round of training (including multiple batches) of a neural network on the training set is called an epoch. Generally, a model with high accuracy needs to be trained for many epochs until it converges. At present, the number of epochs is set based on the experience of deep learning engineers, and different numbers of epochs are needed for different models to achieve a specified accuracy. Therefore, it is a challenge to accurately predict the training time of models with different numbers of epochs. In addition, since the structures of models are heterogeneous, the training time differs significantly; for example, the training of a convolutional layer is usually more time-consuming than that of a dense layer. Therefore, it is also a challenge to accurately predict the training time of models with different structures. To solve these problems, Justus et al. proposed a method to predict the training time of different layers in a batch: the training time of the whole model in one batch can be obtained by accumulating the prediction results of the layers, and the total training time can be predicted by accumulating the training time over batches. Furthermore, we prove the feasibility of predicting the whole model's training time through layers by combining the training characteristics and structural characteristics of the deep learning model.

Usually, a neural network needs to be trained repeatedly on the training set many times, which has obvious iteration characteristics. According to this iteration, the time required for model training can be expressed as formula (2):

$$ T_{\text{model}} = \sum_{e=1}^{E} \sum_{b=1}^{N_b} t_{\text{batch}}, \qquad E \cdot N_b = \frac{E \cdot D}{B} \qquad (2) $$

where $E$ represents the number of epochs, $N_b$ represents the number of batches in an epoch, and $t_{\text{batch}}$ denotes the training time of a batch. The total number of batches in $E$ epochs can be obtained by $E \cdot N_b = E \cdot D / B$, where $D$ is the amount of training data and $B$ is the size of a batch (i.e., the batch size).

One training step of the neural network consists of a batch of forward propagations and one backward propagation. Therefore, the time cost of one batch can be expressed as formula (3):

$$ t_{\text{batch}} = \sum_{j=1}^{B} t_{\text{fp}}(x_j) + t_{\text{bp}} \qquad (3) $$

where $t_{\text{fp}}(x_j)$ represents the time cost of forward propagation for $x_j$, the $j$-th training data in a batch, and $t_{\text{bp}}$ represents the time cost of backward propagation.

Combining formula (2) and formula (3), the training time of a deep learning model can be described as the following formula:

$$ T_{\text{model}} = \sum_{e=1}^{E} \sum_{b=1}^{N_b} \left( \sum_{j=1}^{B} t_{\text{fp}}(x_j) + t_{\text{bp}} \right) \qquad (4) $$

A neural network is composed of many layers, and the output of the current layer is used as the input of the next layer. Besides the iteration characteristic of the training process, the neural network also has an obvious hierarchical structure. According to this hierarchy, a complete neural network can be divided into layers, and its forward and backward propagation can be expressed at the layer level as formulas (5) and (6):

$$ t_{\text{fp}}(x_j) = \sum_{l=1}^{L} t_{\text{fp}}^{(l)}(x_j) \qquad (5) $$

$$ t_{\text{bp}} = \sum_{l=1}^{L} t_{\text{bp}}^{(l)} \qquad (6) $$

where $t_{\text{fp}}^{(l)}$ and $t_{\text{bp}}^{(l)}$, respectively, represent the time cost of layer $l$'s forward and backward propagation, and $L$ denotes the number of model layers.

Combining formula (4) with formulas (5) and (6), the training time formula of the complete model with the layer's training time as the unit can be derived, as shown in the following formula:

$$ T_{\text{model}} = \sum_{e=1}^{E} \sum_{b=1}^{N_b} \left( \sum_{j=1}^{B} \sum_{l=1}^{L} t_{\text{fp}}^{(l)}(x_j) + \sum_{l=1}^{L} t_{\text{bp}}^{(l)} \right) \qquad (7) $$

In summary, the training time of a single layer can be used as the basic unit of the whole model's training time. Therefore, for models with different numbers of epochs, the total training time can be obtained by accumulating the training time of batches; for models with different structures, one batch of training can be obtained by accumulating the forward and backward propagation time of the layers. This resolves the two challenges in training time prediction of models.
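As an illustration, the accumulation in formula (7) reduces to a few lines of Python once the per-layer forward and backward times are available (e.g., from the layer-level time prediction model). The function and argument names below are illustrative, and for simplicity every sample in a batch is assumed to have the same forward-propagation time.

def total_training_time(fp_layer_times, bp_layer_times, num_epochs, num_batches, batch_size):
    """Accumulate per-layer times into the whole model's training time (formula (7)).

    fp_layer_times: forward-propagation time of each layer for one training sample
    bp_layer_times: backward-propagation time of each layer for one batch
    """
    t_fp_sample = sum(fp_layer_times)                # formula (5): one forward pass over all layers
    t_bp_batch = sum(bp_layer_times)                 # formula (6): one backward pass over all layers
    t_batch = batch_size * t_fp_sample + t_bp_batch  # formula (3)
    return num_epochs * num_batches * t_batch        # formulas (2) and (4)

# Example: a toy 3-layer model, 5 epochs, 100 batches of size 32
t = total_training_time([0.4, 1.2, 0.1], [0.8, 2.5, 0.2], num_epochs=5, num_batches=100, batch_size=32)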

In order to predict the training time of a single layer, it is necessary to analyze the layer features of neural networks. We first classify the layer features into common features, dense features, convolutional features, and hardware features according to device and computing characteristics, and then extract the features by category. The layer features are shown in Table 1. Since deep learning models contain a large number of convolutional layers and dense layers, we mainly focus on the convolutional layer and the dense layer in this paper. We train the time prediction model for the convolutional layer and the dense layer, respectively, based on the layer features and training times collected on six different types of GPUs (P100, V100, K40, K80, M60, and 1080ti).

3.2. Model Design

In the previous section, we analyzed the layer features which may affect the training time of neural networks by parsing the structure of different layers. However, we find that collecting layer features increases the system overhead for deep neural networks. For example, for ResNet101 with 100 convolutional layers and 1 dense layer, if we collect 10 features per convolutional layer and 5 features for the dense layer, we need to collect 1005 features to predict the training time of ResNet101. For clients with low computing power, the prediction process will take a lot of time.

What is more, since there is a large number of heterogeneous devices in federated learning, the time prediction model cannot predict the training time of models for newly added devices whose types are not in the set of preset device types. To predict the training time for a new device, we need to tune the parameters of the time prediction model based on training data collected on this device. But collecting a large amount of high-dimensional training data for tuning takes a long time. The prediction model cannot quickly adapt to the new device, and the number of local epochs remains a fixed value for a long time, which may leave clients with high training speed idle.

To cut down the time cost of prediction and accelerate the adaptation of the time prediction model to new devices, it is necessary to reduce the dimension of the layer features and the redundant training data. To keep the prediction accuracy high, we choose to exclude the features that have no or little impact on training time. Therefore, the influence of model features on training time needs to be analyzed.

In order to extract the influence of features, we abstract the relationship between model features and training time as $t = W \cdot X$, where $X$ represents the model features, $t$ denotes the training time of a single layer, and $W$ denotes the weights of the features, which can be treated as the influence of the features. It should be noted that, due to the different value ranges of the features, $W$ cannot represent the real influence of the features when the original feature data is simply taken as $X$. The value of $X$ should be the standardized feature data.

In the related work, we introduced some machine learning regression models, including linear models and nonlinear models. The linear regression model can directly extract the feature weights $W$, but it underperforms in time prediction. For the nonlinear models, the weights are difficult to extract since they are implicitly dispersed in the model parameters. To obtain the weights of features accurately, we design a weights model which uses a neural network to extract the feature weights explicitly. The structure of the weights model is shown in Figure 1. See Table 2 for the hyperparameter settings of the weights model.

As can be seen from Figure 1, the input of the weights model is the standardized feature data, and the output is the predicted training time. Each neuron of the hidden layers is activated by ReLU before its output. To ensure the convergence of the model, the settings of layer 1 to layer $n$ are consistent with the hidden layers of Justus et al.'s time prediction model. In order to learn the weights of the features, we multiply the output of the weights layer with the input layer and then apply ReLU activation to produce the output. The weights of the features calculated from layer 1 to the weights layer are multiplied by the feature data to form $W \cdot X$.

Different from a simple linear model whose $W$ is fixed, the weights extracted by the weights model change with the input data, which fits the training data better. In the weights model, for each input data $X$, the output is $t = W(X) \cdot X$, where $W(X)$ is a weight function whose value changes with the input data. The weights are obtained by a deep learning model, which means that the weights model not only benefits from the high accuracy of a nonlinear model but also exposes the weight data explicitly through the output of the weights layer. Justus et al. have proved that their MLP model has higher prediction accuracy than the linear regression model, but this does not characterize the performance of our weights model, which is a reconstructed MLP. Therefore, we conduct comparative experiments to prove the superiority of the weights model (see Section 4.2 for the results).
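The structure described above can be realized, for example, with tf.keras as sketched below. This is a minimal sketch under our reading of Figure 1: the hidden layer sizes are placeholders rather than the hyperparameters of Table 2, and the final summation layer is one possible way to turn the element-wise product into the scalar $W(X) \cdot X$.

import tensorflow as tf
from tensorflow.keras import layers, Model

def build_weights_model(num_features, hidden_units=(64, 64)):
    """Sketch of the weights model: an MLP whose last hidden layer ("weights layer")
    outputs one weight per input feature, so the prediction is the data-dependent
    weighted sum W(X) . X followed by a ReLU activation."""
    x_in = layers.Input(shape=(num_features,), name="standardized_features")
    h = x_in
    for units in hidden_units:                                 # ReLU-activated hidden layers
        h = layers.Dense(units, activation="relu")(h)
    w = layers.Dense(num_features, name="weights_layer")(h)    # W(X): one weight per feature
    weighted = layers.Multiply()([w, x_in])                    # element-wise W(X) * X
    t_pred = layers.Lambda(
        lambda z: tf.nn.relu(tf.reduce_sum(z, axis=1, keepdims=True)),
        name="predicted_time")(weighted)                       # sum to W(X) . X, then ReLU
    return Model(x_in, t_pred)

# The per-feature weights for any input can be read out explicitly, e.g.:
# model = build_weights_model(10)
# extractor = Model(model.input, model.get_layer("weights_layer").output)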

3.3. Dimensionality Reduction Rule

We introduced the weights model, which is used to extract the weights of features, in the last section. As the weights $W$ change with the input data $X$, the order of the features' influence (i.e., the weights ranking) may fluctuate. For example, for input data $X_1$, the feature batchsize may have the greatest influence, but for data $X_2$, its influence may be the smallest. Because of this fluctuation of $W$, the prediction error would be further enlarged and the influence of features could not be measured uniformly. Therefore, we use the average ranking of feature weights and the average standard deviation of the weights ranking to comprehensively analyze the influence of features: the average ranking of feature weights represents the overall contribution of features to training time, and the average standard deviation of the weights ranking measures the fluctuation of the weights. According to these two metrics, we propose a dimensionality reduction rule to extract the key features that have a great overall influence on training time. Our analysis method and dimensionality reduction rule are introduced in detail as follows.

Before analyzing the influence of the feature weights, we first extract the weights of the features with the weights model based on the multiple datasets described in Section 3.1. Then, we use formula (8) to calculate the ranking $r_i^{(j)}$ of the $i$-th feature's weight in weights data $d_j$:

$$ r_i^{(j)} = 1 + \sum_{k=1}^{|F|} \mathbb{1}\!\left[\, w_k^{(j)} > w_i^{(j)} \,\right] \qquad (8) $$

where $w_i^{(j)}$ represents the weight of the $i$-th feature in the weights data $d_j$, and $|F|$ represents the total number of features.

The standard deviation is an indicator which can reflect the extent of data dispersion. To measure the fluctuation of the ranking of feature weights, we calculate the standard deviation of the weights ranking and use $\sigma_i^{(m)}$ to represent the standard deviation of the $i$-th feature weight's ranking on dataset $D_m$.

Because of the device heterogeneity in federated learning, it is not universal to extract key features based only on the dataset generated by a single device. Therefore, we analyze the weights of features in many datasets collected from different GPUs. We use formula (9) to calculate the average ranking of feature weights, which represents the overall contribution of features on heterogeneous devices, and formula (10) to obtain the average standard deviation of the weights ranking, which measures the overall extent of fluctuation of the weights ranking:

$$ \overline{r}_i = \frac{1}{M} \sum_{m=1}^{M} \frac{1}{|D_m|} \sum_{j=1}^{|D_m|} r_i^{(j,m)} \qquad (9) $$

$$ \overline{\sigma}_i = \frac{1}{M} \sum_{m=1}^{M} \sigma_i^{(m)} \qquad (10) $$

where $M$ represents the number of datasets, $|D_m|$ represents the data volume of dataset $D_m$, and $r_i^{(j,m)}$ is the ranking of the $i$-th feature's weight for the $j$-th data of dataset $D_m$.
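The two metrics can be computed with NumPy as sketched below. We assume here that a larger weight is assigned rank 1 and that each dataset yields a matrix of extracted weights with one row per input; the variable names are illustrative.

import numpy as np

def ranking_statistics(weight_matrices):
    """Compute the average weight ranking (formula (9)) and the average standard
    deviation of the ranking (formula (10)) over several weights datasets.

    weight_matrices: list of arrays of shape (num_inputs, num_features), one per
                     dataset; entry [j, i] is the weight of feature i extracted
                     by the weights model for input j.
    """
    mean_ranks, rank_stds = [], []
    for W in weight_matrices:
        # rank 1 = largest weight for that input (formula (8))
        ranks = np.argsort(np.argsort(-W, axis=1), axis=1) + 1
        mean_ranks.append(ranks.mean(axis=0))
        rank_stds.append(ranks.std(axis=0))
    avg_rank = np.mean(mean_ranks, axis=0)       # \bar{r}_i, formula (9)
    avg_rank_std = np.mean(rank_stds, axis=0)    # \bar{sigma}_i, formula (10)
    return avg_rank, avg_rank_std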

$\overline{r}$ and $\overline{\sigma}$ reflect the overall influence of features on training time from the perspectives of contribution and fluctuation. In order to reduce the feature dimension, we formulate a unified dimensionality reduction rule according to $\overline{r}$ and $\overline{\sigma}$ for extracting the key features which have a great overall impact on training time. The dimensionality reduction rule can be expressed as formula (11):

$$ K = \left\{ f \in F : \overline{\sigma}_f > \alpha \right\} \cup \left\{ f \in F : \overline{\sigma}_f \le \alpha \ \text{and}\ \overline{r}_f < \beta \right\} \qquad (11) $$

where $K$ represents the set of key features, $F$ is the collection of all features, and $\alpha$ and $\beta$ are constants whose values need to be set according to the distributions of $\overline{\sigma}$ and $\overline{r}$. Note that the values of $\alpha$ and $\beta$ are set empirically. After many experiments, we found that the effect of dimensionality reduction is best when $\alpha$ is set to 1.55 and $\beta$ is set to 8 for convolutional layers in this paper; for dense layers, $\alpha$ should be set to 2 and $\beta$ should be set to 8.

The selection of key features by the dimensionality reduction rule can be divided into the following two steps:

(1) Select the features with greater $\overline{\sigma}$ (greater than $\alpha$).

From the metric $\overline{\sigma}$, the overall stability of the weights ranking can be judged intuitively. Features with smaller $\overline{\sigma}$ have a relatively stable overall influence on training time, while those with greater $\overline{\sigma}$ usually fluctuate wildly. For features with strong ranking fluctuations, we find that they may have a small influence on some input data but a large influence on other data; without such features, the prediction of training time would deviate seriously. So, we choose to keep the features with greater $\overline{\sigma}$.

(2) Select the features with smaller $\overline{r}$ (smaller than $\beta$).

After the screening in step (1), the remaining features have stable rankings (smaller $\overline{\sigma}$). Among them are features with smaller $\overline{r}$ which have a greater overall impact on training time, such as the feature input channels, whose influence is the largest for almost every input data. Therefore, we select the features with smaller $\overline{r}$ from the rest of the feature collection.

The process of extracting key features using the dimensionality reduction rule can be described by Algorithm 1.

Input: weights model $M_w$, the number of datasets $M$, feature set $F$, datasets $D_1, \ldots, D_M$, the amount of data $|D_m|$ of each dataset, thresholds $\alpha$ and $\beta$
Output: the set of key features $K$
1: Initialize $K \leftarrow \emptyset$
2: For each dataset $D_m$ do
3:    $W^{(m)} \leftarrow$ GetWeights($M_w$, $D_m$)
4:    $R^{(m)} \leftarrow$ rank($W^{(m)}$)  // formula (8)
5: End for
6: For each feature $f$ in $F$ do
7:    $\overline{r}_f \leftarrow$ average ranking of $f$'s weight over all $R^{(m)}$  // formula (9)
8:    $\overline{\sigma}_f \leftarrow$ average standard deviation of $f$'s ranking over all $R^{(m)}$  // formula (10)
9:    If $\overline{\sigma}_f > \alpha$ or $\overline{r}_f < \beta$ then
10:      $K \leftarrow$ AddNewFeature($K$, $f$)
11:   End if
12: End for
13: Return $K$
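Steps 6 to 12 of Algorithm 1 amount to the simple filter below; alpha and beta correspond to the thresholds of formula (11), and the feature names and value types are illustrative.

def select_key_features(features, avg_rank, avg_rank_std, alpha, beta):
    """Select key features following formula (11) / Algorithm 1.

    features:     list of feature names, aligned with avg_rank and avg_rank_std
    avg_rank:     average ranking (smaller = larger overall contribution)
    avg_rank_std: average ranking standard deviation (larger = stronger fluctuation)
    alpha, beta:  thresholds, e.g. alpha=1.55 and beta=8 for convolutional layers
    """
    key = []
    for f, r, s in zip(features, avg_rank, avg_rank_std):
        if s > alpha:          # step 1: keep strongly fluctuating features
            key.append(f)
        elif r < beta:         # step 2: keep stable features with a high average ranking
            key.append(f)
    return key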

By using the dimensionality reduction rule, the layer features with no or little impact on training time are eliminated while high prediction accuracy is kept, so the dimension of the layer features and the redundant training data for the time prediction model are reduced. Specifically, the feature dimension and training data are reduced by 30% and 25% for the convolutional layer, and both are reduced by 20% for the dense layer. We take Justus et al.'s time prediction model as the baseline to verify the accuracy of our dimensionality reduction rule: we retrain the baseline with the dimension-reduced dataset and compare it with the original baseline. The results show that the prediction error of the model trained with the dimension-reduced dataset remains at the same level as the original baseline. See the experimental analysis in Section 4.3 for details.

3.4. Dynamically Setting Number of Local Epochs

In federated learning, each client only uses its own data for model training, and most of the training data is generated by the client itself. Due to the different characteristics of clients, the distribution of data generated by different clients is usually different; that is, the training data of federated learning is non-independent and identically distributed (Non-IID). Unfortunately, Non-IID data causes divergence between the local models and the global model, resulting in erroneous convergence of the local models, and setting a different number of local epochs for each client undoubtedly aggravates this divergence. FedNova eliminates this fast erroneous convergence by normalized averaging of the local models' gradients and ensures the convergence of the global model when different numbers of local epochs are set for the clients. The update of the global model in FedNova can be described by the following formula:

$$ w_{t+1} = w_t - \tau_{\text{eff}}\,\eta \sum_{i=1}^{N} \frac{n_i}{n} \cdot \frac{1}{\tau_i} \sum_{k=1}^{\tau_i} g_i\!\left(w_{t,k}^{(i)}\right), \qquad \tau_{\text{eff}} = \sum_{i=1}^{N} \frac{n_i}{n}\,\tau_i \qquad (12) $$

Compared with the update of the global model in FedAvg (formula (1)), FedNova uses the normalized averaged gradient $\frac{1}{\tau_i} \sum_{k=1}^{\tau_i} g_i(w_{t,k}^{(i)})$ to replace the accumulated gradient $\sum_{k=1}^{\tau_i} g_i(w_{t,k}^{(i)})$ of the local model, which restricts the erroneous convergence of the local models. The factor $\tau_{\text{eff}}\,\eta$ in formula (12) can be treated as the learning rate of the global model. It turns out that, compared with FedAvg, FedNova can improve the performance of the global model while reducing the number of communications between the clients and the server.
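The normalized-averaging step can be sketched as below, following our reading of formula (12); this is an illustration with flattened parameter and gradient arrays, not the reference FedNova implementation.

import numpy as np

def fednova_aggregate(global_params, grad_sums, local_steps, data_sizes, lr):
    """Normalized averaging in the spirit of formula (12).

    grad_sums:   accumulated local gradients sum_k g_i of each client
    local_steps: tau_i, the number of local updates performed by each client
    """
    n = float(sum(data_sizes))
    tau_eff = sum((n_i / n) * tau_i for n_i, tau_i in zip(data_sizes, local_steps))
    normalized = np.zeros_like(global_params)
    for g_sum, tau_i, n_i in zip(grad_sums, local_steps, data_sizes):
        normalized += (n_i / n) * (g_sum / tau_i)   # each client's update is normalized by tau_i
    return global_params - tau_eff * lr * normalized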

Although we can set different numbers of local epochs for clients while ensuring good performance of the global model by using FedNova, the problem of clients with high training speed waiting for clients with low training speed still exists. This is because FedNova adopts random sampling within a given value range to set the number of local epochs, which may assign a small number of epochs to the faster clients and a large number of epochs to the slower clients.

In order to solve the idling problem and improve the training efficiency of the global model, we predict the training time of local models in FedNova to guide the dynamic setting of local epochs. It should be noted that training time prediction can be combined with any algorithm that supports setting different numbers of local epochs for clients; we choose FedNova for its excellent performance. We use the time prediction model trained with the dimension-reduced datasets to predict the training time of one epoch of the local models, and then we calculate the number of local epochs for each client according to the time window of communication between the client and the server. There are many ways to combine FedNova and training time prediction: for example, the server can predict the training time of local models and calculate the number of local epochs, or the time prediction model can be sent to the clients so that they calculate the number of local epochs by themselves. In actual scenarios, the way of dynamically setting the number of local epochs needs to be determined according to specific needs. Next, we introduce the method based on FedNova and training time prediction; the details are described in Algorithm 2.

First, the server randomly selects some clients according to the proportion $C$ and collects their model updates. Second, the server aggregates the updates of the local models into the global model and prepares the time prediction model for the clients. Third, each client obtains the global model and the time prediction model from the server, divides its local dataset into batches, and extracts the features required for time prediction. Then, the client uses the time prediction model to predict the training time of one epoch of the local model and calculates the number of local epochs according to the preset communication time window $T$. Finally, the client uses FedNova to update the local model and sends it to the server.

Input: the clients indexed by $i$; $C$ the proportion of clients selected per round; $B$ local mini-batch size; $T$ communication time window; $M$ time prediction model; $\eta$ learning rate.
1: Server executes:
2: initialize the global model $w_0$
3: for each round $t = 1, 2, \ldots, N$ do
4:   $m \leftarrow \max(\lfloor C \cdot K \rfloor, 1)$ ($K$ is the total number of clients)
5:   $S_t \leftarrow$ random set of $m$ clients
6:   for each client $i \in S_t$ in parallel: do
7:      $d_i, \tau_i \leftarrow$ ClientUpdate$(i, w_{t-1}, M)$
8:   end for
9:   $w_t \leftarrow w_{t-1} - \tau_{\text{eff}} \sum_{i \in S_t} \frac{n_i}{n} d_i$  // FedNova aggregation, formula (12)
10: end for
11: ClientUpdate$(i, w, M)$:
12: // Client receives the global model $w$ and the time prediction model $M$
13: $N_b \leftarrow \lceil n_i / B \rceil$ (number of batches per epoch: local data amount divided by $B$)
14: $f \leftarrow$ extract the layer features of the local model
15: $t_{\text{epoch}} \leftarrow$ GetTrainingTime$(M, f)$  // predicted training time of one local epoch
16: $E_i \leftarrow \lfloor T / t_{\text{epoch}} \rfloor$, $\tau_i \leftarrow E_i \cdot N_b$
17: for each epoch $e = 1, \ldots, E_i$ do
18:   for each batch $b = 1, \ldots, N_b$ do
19:    compute the gradient $g_i$ of the local loss on batch $b$
20:    $w \leftarrow w - \eta\, g_i$
21:    accumulate $g_i$ into the normalized update $d_i$
22:   end for
23: end for
24: return $d_i$ and $\tau_i$ to the Server
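The client-side step that turns a time prediction into a number of local epochs (lines 13 to 16 of Algorithm 2) can be sketched as follows; predict_layer_time stands for the trained time prediction model applied to one layer's feature vector, and all names are illustrative.

def local_epochs_for_client(predict_layer_time, layer_features, num_batches, time_window):
    """Derive the number of local epochs from the predicted training time.

    predict_layer_time: callable mapping one layer's feature vector to a predicted
                        per-batch training time (same unit as time_window)
    layer_features:     feature vectors of the local model's layers
    num_batches:        number of batches per local epoch
    time_window:        communication time window T
    """
    t_batch = sum(predict_layer_time(f) for f in layer_features)  # accumulate layer predictions
    t_epoch = t_batch * num_batches                                # predicted time of one epoch
    return max(1, int(time_window // t_epoch))                     # at least one local epoch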

4. Experiments

In this section, we arrange experiments to verify the effectiveness of our work, including the weights model and the dimension reduction rule. First, the experimental settings are introduced, including the selection and description of datasets, the setting of model structure, the evaluation metrics, and the introduction of the experimental environment. Second, we use the datasets collected on heterogeneous GPUs to train the weights model for verifying its convergence, and then we conduct a large number of comparative experiments to compare the prediction error between weights model and baseline. We also compare the performance of weights model and linear regression model to prove the superiority of our weights model. Finally, for verifying the effectiveness of the dimensionality reduction rule, we train the baseline based on the complete dataset and the dimension-reduced dataset, respectively, and then compare the error level of prediction between them.

4.1. Experiment Settings
4.1.1. Datasets

We use the layer features described in Section 3.1 and 8 different datasets for the experiments. Six datasets consist of common features, convolutional features, dense features, and training times collected on different GPUs, which we call single-GPU datasets. The other 2 datasets are the stacking of these 6 datasets, which we call stacked datasets. In order to distinguish different types of GPU, we add three hardware features to the stacked datasets: GPU bandwidth, number of GPU processing units, and GPU clock cycle speed. See Table 3 for the description of the datasets.

The layer features selected for the convolutional layer are shown in Table 4. One-hot encoding is used to represent the different optimizers and activation functions so as to measure the feature distance more appropriately.
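For illustration, a one-hot encoding of the categorical fields can be produced as below; the optimizer vocabulary shown here is an assumption, while the activation fields match the act_relu, act_tanh, and act_sigmoid columns used later.

ACTIVATIONS = ["relu", "tanh", "sigmoid"]           # act_relu, act_tanh, act_sigmoid
OPTIMIZERS = ["sgd", "adam", "rmsprop", "adagrad"]  # illustrative vocabulary

def one_hot(value, vocabulary):
    """Represent a categorical feature as a one-hot vector."""
    return [1.0 if value == v else 0.0 for v in vocabulary]

# e.g. the activation field of one convolutional-layer sample
act_features = one_hot("tanh", ACTIVATIONS)   # -> [0.0, 1.0, 0.0]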

The number of features in the dense layer is less than that in the convolutional layer, and some features are different, such as the dimension of input and the dimension of output. The features selected in the dense layer are shown in Table 5.

4.1.2. Model Settings

Compared with the baseline, we only add a weights layer at the end of the hidden layers to extract the weights of features; the rest of the layers are the same as the baseline. Note that the data volume of the stacked datasets (All_Conv and All_Dense) is larger than that of the single-GPU datasets, and three more hardware features are added to distinguish the GPUs. Therefore, the number of layers of the weights model is different for the single-GPU datasets and the stacked datasets. Figure 2 shows the settings of the baseline and the weights model for the single-GPU datasets and the stacked datasets.

4.1.3. Metrics

We use the root mean square error (RMSE) and mean absolute percentage error (MAPE) to measure the error between the predicted training time and the observed training time. The calculation formula of RMSE is shown in equation (13), and its unit is milliseconds (ms); the calculation formula of MAPE is shown in equation (14):

$$ \text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 } \qquad (13) $$

$$ \text{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \qquad (14) $$

where $n$ is the number of samples, $y_i$ represents the observed training time of the $i$-th input data, and $\hat{y}_i$ represents the training time predicted by the model for the $i$-th input data.
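Both metrics are straightforward to compute; a minimal NumPy version of equations (13) and (14) is given below.

import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, equation (13), in milliseconds."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mape(y_true, y_pred):
    """Mean absolute percentage error, equation (14), in percent."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs((y_true - y_pred) / y_true)) * 100.0)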

4.1.4. Experimental Environment

The hardware environment of our experiments is a stand-alone machine trained on a CPU with 6 cores and 12 threads, a main frequency of 3.1 GHz, and 16 GB of memory. We use the TensorFlow 1.13.2 framework on the 64-bit Windows 10 operating system to implement the program.

4.2. Analysis of Weights Model
4.2.1. Convergence Verification of Weights Model

In the convergence verification of weights model, we train the weights model using all datasets. Figure 3 shows the comparison of predicted time and observed time on different datasets. On the P100_Conv, V100_Conv, K40_Conv, and All_Conv, the RMSE of weights model is 2.894 ms, 1.660 ms, 9.051 ms, and 4.273 ms, respectively; on the P100_Dense, V100_Dense, K40_Dense, and All_Dense, the RMSE of weights model is 0.032 ms, 0.046 ms, 0.159 ms, and 0.077 ms, respectively. Therefore, the weights model has good convergence whether based on single-GPU datasets (P100, V100, and K40) or stacked datasets (All_Conv and All_Dense).

4.2.2. Weights Model vs. Linear Regression Model

In order to verify the comparative analysis of the weights model and the linear regression model in Section 3.2, we use the P100_Conv, V100_Conv, and K40_Conv datasets to train the weights model and the linear regression model, respectively. Table 6 shows the RMSE comparison between the linear model and the weights model. It can be seen that the error of the weights model is far lower than that of the linear regression model: on the P100_Conv, V100_Conv, and K40_Conv datasets, the RMSE of the linear regression model is 9.5 ms, 7.61 ms, and 35.74 ms larger than that of the weights model, respectively.

4.2.3. Weights Model vs. Baseline

Since the prediction accuracy of the weights model determines the authenticity of the feature weights it extracts, we divide the dataset into training, test, and validation sets at a ratio of 8:1:1 and calculate the test error and the validation error of the weights model to prove its accuracy. The comparison of test error and validation error between the weights model and the baseline shows that the weights model reaches the same error level as the baseline on the P100, V100, K40, and stacked datasets (see Table 7). It is worth mentioning that on the test sets of K40_Conv, K40_Dense, P100_Dense, and All_Conv, the MAPE of the weights model is lower than the baseline by 0.1%, 0.19%, 0.46%, and 1.04%, respectively, and on the test and validation sets of P100_Conv, the RMSE of the weights model is lower than the baseline by 0.068 ms and 0.364 ms.

4.3. Analysis of Dimensionality Reduction Rule

In Section 4.2, we showed that the weights model has good convergence and a prediction accuracy no lower than the baseline. In this section, we verify the effectiveness of the dimensionality reduction rule. We use the weights model to extract the feature weights of different test sets to form the weights datasets; then we use the dimensionality reduction rule described in Section 3.3 to reduce the dimension of the features. After that, we compare the prediction error of the model trained on the dimension-reduced data with the original baseline.

In our experiments, we found that the ranking of feature weights may fluctuate significantly for different input feature data. We think this is because the influence of layer features on training time does not necessarily change linearly. Assuming the influence of channels_in and elements_kernel on the final training time is as shown in Figure 4, when the value of channels_in gradually decreases, its influence on the final result is not necessarily greater than that of elements_kernel.

Before applying the dimensionality reduction rule, we use formula (8) to calculate the ranking of feature weights in each weights dataset. According to the first step of the dimensionality reduction rule, we need to measure the fluctuation of each feature weight, that is, the standard deviation. Therefore, the standard deviation of the weights ranking and the average standard deviation $\overline{\sigma}$ of the weights ranking are calculated; see Tables 8 and 9. For the second step of the dimensionality reduction rule, we calculate the average ranking of each feature's weight on every weights dataset and obtain the overall average ranking $\overline{r}$ on all datasets according to formula (9). Tables 10 and 11, respectively, show the average ranking of feature weights on each dataset, as well as the overall average ranking $\overline{r}$.

Taking the convolutional features as an example, since the optimizers and activation functions are represented by one-hot encoding, we regard all optimizer fields (feature name starting with opt_) as one feature opt and all activation function fields (feature name starting with act_) as one feature act.

According to step 1 of the dimensionality reduction rule, we select the features with $\overline{\sigma}$ greater than 1.55; the results are elements_matrix, elements_kernel, channels_out, strides, and opt (all optimizer fields). The remaining features with small $\overline{\sigma}$ are batchsize, channels_in, padding, use_bias, and act (all activation fields).

According to step 2 of the dimensionality reduction rule, we select the features with $\overline{r}$ less than 8 from batchsize, channels_in, padding, use_bias, and act; the results are batchsize and channels_in.

Finally, by using the dimensionality reduction rule, we obtain batchsize, channels_in, elements_matrix, elements_kernel, channels_out, strides, and opt. The convolutional layer features are reduced by 3 dimensions (padding, use_bias, and act), and the training data is reduced by 5 dimensions (padding, use_bias, act_relu, act_tanh, and act_sigmoid).

For the dense layer features, we use the same method. According to the dimensionality reduction rule, we exclude features that have less influence on training time, including act_relu, act_tanh, act_sigmoid, and batchsize. It should be noted that the dense layer has the parameter-intensive characteristic, which means that the transmission of parameters takes a long time. However, the time overhead of parameter transmission was ignored in our datasets. According to the forward propagation process of the neural network, the dense layer must perform one forward propagation calculation for one input data, and it must perform a batch of forward propagation for a batch of input data. Therefore, the batchsize determines the number of times the parameters are transmitted, which should not be excluded.

After extracting the key features with the dimensionality reduction rule, we filter the training sets and obtain the dimension-reduced datasets of P100_Conv, V100_Conv, K40_Conv, All_Conv, P100_Dense, V100_Dense, K40_Dense, and All_Dense (we append _small to denote the dimension-reduced datasets; for example, the dimension-reduced dataset of P100_Conv is P100_Conv_small). To verify the validity of the dimension-reduced datasets, we use them to train the baseline, and the trained model is called baseline_small. Our experiments show that baseline_small also has good convergence. Figure 5 shows the comparison of the predicted time and observed time of baseline_small on the test sets.
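Producing the _small datasets is essentially a column filter over the collected data; a pandas sketch is given below, in which the file paths and the name of the label column are assumptions.

import pandas as pd

# Key feature columns of the dimension-reduced convolutional-layer datasets
# (the one-hot optimizer fields are kept as a group for the "opt" feature).
CONV_KEY_COLUMNS = ["batchsize", "channels_in", "channels_out",
                    "elements_matrix", "elements_kernel", "strides"]

def reduce_dataset(path_in, path_out, key_columns, label="time"):
    """Keep only the key feature columns and the training-time label."""
    df = pd.read_csv(path_in)
    opt_columns = [c for c in df.columns if c.startswith("opt_")]
    df[key_columns + opt_columns + [label]].to_csv(path_out, index=False)

# e.g. reduce_dataset("P100_Conv.csv", "P100_Conv_small.csv", CONV_KEY_COLUMNS)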

To determine whether the dimension-reduced datasets cause a loss of prediction accuracy, we calculate the RMSE and MAPE of baseline and baseline_small on all datasets; the results are shown in Table 12. For the convolutional layer datasets, the test RMSE of baseline_small is 0.1385 ms lower than the baseline on average, and the validation RMSE is 0.3664 ms higher than the baseline on average; for the dense layer datasets, the test RMSE of baseline_small is 0.0078 ms lower than the baseline on average, and the validation RMSE is 0.015 ms lower than the baseline. It is worth mentioning that for the dense layer datasets, both the RMSE and MAPE of baseline_small are lower than the baseline. We believe this is because some features of the dense layer contribute nothing to the training time; these features disturb the data distribution and increase the prediction error of the model.

The experiments show that after reducing the convolutional layer features by 30% and the training data by 25%, the error level of prediction is still consistent with the original baseline; after reducing the dense layer features by 20% and the training data by 20%, the error level is generally lower than the baseline, which proves the effectiveness of our weights model and dimensionality reduction rule.

5. Conclusion

For the problem of setting the number of local epochs for heterogeneous clients in federated learning, we propose to predict the training time of deep learning tasks on clients to guide the dynamic setting of the number of local epochs. We design the weights model to extract the weights of features and accurately interpret the relationship between model features and training time. This paper focuses on the combination of the weights model and the dimensionality reduction rule to extract the key features, reducing the dimension of the features and the redundant training data required by the time prediction model. The purpose of our work is to improve the feasibility of predicting the training time of deep learning models on heterogeneous clients in federated learning, so as to dynamically set the number of local epochs for clients. Compared with the existing method, our experimental results show that (1) the weights model has good convergence on heterogeneous devices; (2) the training time predicted by the weights model reaches the same error level as the baseline; (3) the dimensionality reduction rule in this paper can reduce the features by 30% and the redundant data by 25% for the convolutional layer, and reduce the features by 20% and the redundant data by 20% for the dense layer, while maintaining high prediction accuracy.

Data Availability

Previously reported [Model Features] data were used to support this study and are available at [10.1109/BigData.2018.8622396]. These prior studies (and datasets) are cited at relevant places within the text as references [8].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work is supported by the National Natural Science Foundation of China under Grant no. 62072146 and no. 61972358 and the Key Research and Development Program of Zhejiang Province (2019C01059, 2019C03135, and 2019C03134).