Abstract

Logistic regression has been widely used in artificial intelligence and machine learning due to its deep theoretical basis and good practical performance. Its training process aims to solve a large-scale optimization problem characterized by a likelihood function, for which gradient descent is the most commonly used approach. However, when the data size is large, gradient descent is very time-consuming because it computes the gradient using all the training data in every iteration. Although this difficulty can be addressed by random sampling, an appropriate sample size is difficult to predetermine, and the resulting estimate may not be robust. To overcome this deficiency, we propose a novel algorithm for fast training of logistic regression via adaptive sampling. The proposed method decomposes the problem of gradient estimation into several subproblems according to its dimension; each subproblem is then solved independently by adaptive sampling. Each element of the gradient estimate is obtained by repeatedly sampling a fixed number of training examples until its stopping criterion is satisfied. The final estimate is assembled from the results of all the subproblems. It is proved that the obtained gradient estimate is robust and that it keeps the objective function value decreasing during the iterative calculation. Compared with representative algorithms using random sampling, the experimental results show that this algorithm obtains comparable classification performance with much less training time.

1. Introduction

Supervised learning trains a learner on a labelled training set so that it correctly determines the outputs for unseen instances [1]. As one of the most famous classification algorithms in supervised learning, logistic regression is a generalized linear regression model in which the output is discrete [2]. Logistic regression has been widely used in various kinds of applications owing to its good performance, such as process tomography [3], customer churn prediction [4], spatial prediction [5], major chronic diseases and clinical risk prediction [6, 7], and so on [8–10]. Training a logistic regression model amounts to solving an unconstrained convex optimization problem, for which gradient descent is one of the most important solution methods [11]. Because the gradient is computed using all the training instances, gradient descent (GD) is very time-consuming when the data size is large [12].

To speed up GD, many improved algorithms have been developed. According to the volume of data used to obtain the gradient estimate, these algorithms can be divided into two groups: stochastic gradient descent and batch gradient descent [13]. Stochastic gradient descent (SGD) uses only one randomly selected training example to compute the gradient, which can be very efficient for large datasets [14]. SGD is therefore much faster per iteration and well suited to online learning. However, the gradient estimated by SGD is often not a descent direction at a given iteration, so a vast number of iterations are needed. Furthermore, SGD is difficult to adapt to a parallel environment [15].

Different from SGD, batch gradient descent (BGD) obtains the gradient estimate from a randomly chosen set of training examples. In this way, BGD can largely reduce the error and instability of the estimate, and it also obtains an effective solution [16–18]. As the sampled examples play an important role in estimating the gradient, BGD needs an appropriate sample size to be chosen before sampling. However, it is difficult to predetermine an appropriate sample size for different datasets. Furthermore, samples of the same size can vary in quality, because some examples are more representative of, or more closely resemble, the original data than others [19].

This paper presents an improved adaptive sampling (AS) algorithm for accelerating the logistic regression training process. The method first gives a rule for estimating the gradient from a set of examples such that the resulting estimate guarantees that the objective function value keeps decreasing during the iterative calculation of GD. Then, the problem of obtaining a vector that meets the rule is decomposed into several subproblems, where each subproblem determines one component of the vector that satisfies its stopping rule. Finally, examples are drawn successively from the training set into the sample, and sampling terminates as soon as every component of the gradient estimated over the obtained sample satisfies its own rule. To speed up this process, components that already satisfy their stopping rules are not re-estimated in subsequent iterations; they are taken directly as the corresponding components of the final gradient estimate. The main contributions of this paper are as follows:

(1) Giving a rule to judge whether the direction of a vector is a descent direction of the current objective function, which is critical for the execution efficiency of the gradient descent method.
(2) Providing an adaptive sampling method that overcomes the difficulty of predetermining the sample size before sampling; the method adaptively determines the sample size according to the characteristics of the dataset and avoids the influence of subjective human factors.
(3) Applying a divide-and-conquer strategy to efficiently obtain the gradient estimate on the sampled examples: the gradient vector estimation problem is divided into several one-dimensional estimation subproblems, each of which can be solved independently.
(4) Proving, using probably approximately correct theory, that the obtained gradient estimate is robust and that it can be a descent direction of the current objective function at each iteration.
(5) Designing an efficient mechanism to solve the multivariate estimation problem for large-scale data.

The rest of the paper is organized as follows: Section 2 reviews related methods according to their characteristic. Section 3 proposes a sampling on-demand algorithm for logistic regression and proves its effectiveness. Section 4 reports the experimental results through the comparison with existing methods. Section 5 gives the conclusion of this paper and shows some future research work.

2. Related Work

Many improvements to GD have been developed. According to the amount of data used to obtain the gradient estimate, existing GD algorithms can be divided into two groups: SGD algorithms and BGD algorithms.

The original SGD (OSGD) algorithm computes the gradient with only one sample from the training set. The OSGD algorithm does not consider the effect of different dimensions on its convergence, so its rate of convergence can be slow when the surface of the objective function curves much more steeply in some dimensions than in others. Qian [20] proposed the Momentum algorithm to accelerate the OSGD algorithm, where the current update vector is augmented with a fraction of the update vector from the previous iteration. To handle sparse data efficiently, Duchi et al. [21] proposed the Adagrad method for online learning tasks, which sets different learning rates for different components of the vector. Besides, there are many algorithms that determine adaptive learning rates for different components [22–24]. Although these kinds of algorithms can deal with large-scale data, they must perform a vast number of iterations before the objective function improves appreciably, and they are difficult to parallelize.
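The two update rules mentioned above can be illustrated with a minimal sketch. The function names, the default learning rate `lr`, the momentum fraction `gamma`, and the stabilizer `eps` are illustrative assumptions in the commonly used textbook form, not values taken from [20] or [21].

```python
import math


def momentum_step(w, grad, velocity, lr=0.1, gamma=0.9):
    """One momentum update: the new step adds a fraction gamma of the previous step."""
    new_velocity = [gamma * v + lr * g for v, g in zip(velocity, grad)]
    new_w = [wi - vi for wi, vi in zip(w, new_velocity)]
    return new_w, new_velocity


def adagrad_step(w, grad, sq_sum, lr=0.1, eps=1e-8):
    """One Adagrad update: each component is scaled by its own gradient history."""
    new_sq_sum = [s + g * g for s, g in zip(sq_sum, grad)]
    new_w = [
        wi - lr * g / (math.sqrt(s) + eps)
        for wi, g, s in zip(w, grad, new_sq_sum)
    ]
    return new_w, new_sq_sum


# Two momentum steps with the same gradient: the second step is larger.
w1, v1 = momentum_step([1.0, 1.0], [1.0, 0.0], [0.0, 0.0])
w2, v2 = momentum_step(w1, [1.0, 0.0], v1)

# One Adagrad step: components with very different gradient magnitudes
# move by roughly the same amount (about lr) on the first step.
wa, sa = adagrad_step([1.0, 1.0], [2.0, 0.5], [0.0, 0.0])
```

In the momentum example the step grows from 0.1 to 0.19 because the previous update is reused, while Adagrad normalizes each component by its accumulated squared gradient.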

Different from the SGD algorithm, the BGD algorithm computes the gradient with some randomly sampled examples from the training set at each iteration. The BGD algorithm can therefore reduce the variance of the gradient estimate and achieve more stable convergence. To converge reliably to the optimal solution, the gradient estimate used by the BGD algorithm needs to enforce descent in the objective function at every iteration, so the sample size must be chosen carefully. Byrd et al. [25] proposed a dynamic sample gradient (DSG) algorithm, which dynamically determines the sample size before sampling. For convex optimization problems, the DSG algorithm can reach the optimal solution. However, the sample size determined by the DSG algorithm may grow as the iterations proceed, so the total running time of the DSG algorithm also increases. Furthermore, because samples of the same size can vary in quality, a gradient estimate computed from a fixed-size sample may fail to enforce descent in the objective function. On the other hand, choosing a proper learning rate is important for the performance of the BGD algorithm. A smaller learning rate slows convergence, whereas a larger one can cause the iterates to fluctuate around the optimal solution. Robbins and Sutton [26] proposed a schedule for selecting the learning rate during training, in which a predefined schedule gradually reduces the learning rate. Liang et al. [27] proposed a sampling on-demand method to speed up logistic regression, but no theoretical proof of the robustness of its result is given, and its classification performance is not compared with other state-of-the-art approaches. Many similar algorithms exist that yield high accuracy in the solution of the optimization problem [28].

3. Main Content

3.1. Preliminary

The logistic regression classifier is built on the posterior probabilities of the two labels ({0, 1}), which are expressed through a linear function in $x$ and sum to one. The form of this model is

$$P(y = 1 \mid x) = \frac{\exp(w^{\top}x)}{1 + \exp(w^{\top}x)}, \qquad P(y = 0 \mid x) = \frac{1}{1 + \exp(w^{\top}x)}, \quad (1)$$

where $w \in \mathbb{R}^{d}$ is the weight vector, $w^{\top}x$ is the inner product between $w$ and $x$, $y$ is the label of $x$, and $d$ is the dimensional size.

Let be a training set, each training instance is represented by a -dimensional vector , and its label is . Therefore, the optimal weight vector is obtained by minimizing the following problem:where . Clearly, the objective function is a strictly convex function. Let be the optimal solution of the optimization problem (2), and the classifier of logistic regression is :where . With these notations, the predicted label for a given test instance can be derived from the function , where .

The GD algorithm is an iterative optimization algorithm that updates the current solution using the solution from the previous step and the gradient of the objective function at that solution. Owing to its simple implementation and effectiveness, GD is widely adopted to solve the unconstrained optimization problem (2). Let be the optimal solution at the th iteration. The gradient estimation is obtained as follows:where , . Problem (2) is a convex optimization problem and the obtained solution is the global optimal solution. The computation of the gradient estimation needs all the data, so that it is very time-consuming for large-scale data. To overcome the deficiency, we propose a novel algorithm for fast training logistic regression via adaptive sampling (LLR-AS).
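As a point of reference, full-batch GD for logistic regression can be sketched as below. The sketch assumes the standard averaged log-loss, whose gradient is the mean of the per-example terms (h(x_i) − y_i)·x_i; the names `sigmoid`, `full_gradient`, `gradient_descent`, the step size, and the toy data are illustrative assumptions rather than the paper's notation. The inner loop over all n examples in `full_gradient` is the cost that becomes prohibitive for large datasets.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def full_gradient(w, X, y):
    """Average log-loss gradient over all n examples: (1/n) * sum((h(x_i) - y_i) * x_i)."""
    n, d = len(X), len(w)
    grad = [0.0] * d
    for xi, yi in zip(X, y):
        err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi))) - yi
        for j in range(d):
            grad[j] += err * xi[j]
    return [g / n for g in grad]


def gradient_descent(X, y, lr=0.5, n_iter=200):
    """Plain GD: every iteration touches the whole training set."""
    w = [0.0] * len(X[0])
    for _ in range(n_iter):
        g = full_gradient(w, X, y)
        w = [wj - lr * gj for wj, gj in zip(w, g)]
    return w


# Toy data: a constant bias feature plus one input feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 1.0], [1.0, 2.0]]
y = [0, 0, 1, 1]
g0 = full_gradient([0.0, 0.0], X, y)
w = gradient_descent(X, y)
predictions = [1 if sigmoid(sum(a * b for a, b in zip(w, xi))) >= 0.5 else 0 for xi in X]
```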

3.2. LLR-AS Algorithm

In fact, the LLR-AS algorithm is also a type of minibatch gradient descent algorithm. It randomly samples a subset from to compute the gradient estimation to replace computed with all the data in every iteration. According to the feature of the GD algorithm, its convergence efficiency is closely related to the quality of the gradient estimation . The value of is difficult to be smaller all the time and the final execution time becomes longer if the objective function value cannot keep decreasing using the gradient estimation in each iteration. Because gradient is the fastest direction in which the current objective function value decreases, a similar direction between ,, and should be considered to achieve this aim. In the following, a rule is given to obtain an estimate with high quality. This results in the following inequality:where is a norm of the vector and the relaxation parameter . According to the conclusion [29], the vector must keep the objective function value decreasing in the iterative calculation of GD if it satisfies inequality (5). Moreover, parameter controls the descent speed for the function . The larger the value of , the smaller the directional derivative, and the lower the descent speed for the function . Because the optimization problem (2) is convex, the LLR-AS algorithm can get the globally optimal solution. With the rule, our algorithm is outlined in Algorithm 1.

Input: Dataset , the stepsize .
Output: The optimal vector .
(1)Initialize: and
repeat
(2) Obtain the vector satisfying inequality (5)
(3)
(4)
until a convergence test is satisfied;
(5)Return
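The outer loop of Algorithm 1 can be sketched as follows. This is a minimal illustration under stated assumptions: `estimate_gradient` is a hypothetical callable standing in for step (2), the update is the usual step along the negative estimate, and the convergence test is the Euclidean distance between successive solutions with an iteration cap, using the values 0.001 and 5000 reported in the experimental setup.

```python
import math


def descend(w0, estimate_gradient, step_size=0.1, tol=1e-3, max_iter=5000):
    """Iterate w <- w - step_size * g_hat until successive solutions are closer than tol."""
    w = list(w0)
    for it in range(1, max_iter + 1):
        g_hat = estimate_gradient(w)  # hypothetical stand-in for step (2)
        w_new = [wj - step_size * gj for wj, gj in zip(w, g_hat)]
        moved = math.sqrt(sum((a - b) ** 2 for a, b in zip(w_new, w)))
        w = w_new
        if moved < tol:
            break
    return w, it


# Usage on f(w) = 0.5 * ||w - c||^2, whose exact gradient is w - c.
c = [1.0, -2.0]
w_opt, n_iter = descend([0.0, 0.0], lambda w: [wj - cj for wj, cj in zip(w, c)])
```

Any estimator that returns a descent direction can be plugged in for `estimate_gradient`; the exact gradient is used here only to keep the example deterministic.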
3.3. AS Algorithm

In order to get the vector meeting inequality (5), a novel adaptive sampling algorithm is proposed in the following: let be the new dataset generated by the set , where , . According to inequality (5), the simplest way of getting the vector is to sample a subset from the set and obtain an estimate from as , where is the size of subset . However, the sample size is difficult to determine for different tasks, although there exist some theoretical and empirical results in the literature such as PAC [30] learning theory and learning curves [31]. Moreover, the theoretical results are usually worst-case and learning curves are average-case, so they are not necessarily consistent with each other [19]. To tackle this difficulty, we propose an adaptive sampling algorithm. Our method obtains the target subset by continually sampling examples from until the gradient estimate on the sampled subset satisfies the stopping rule. It decides the sample size from the information in the sampled examples and thereby removes the difficulty of predetermining the sample size. Therefore, the key issue becomes how to design the stopping rule for our adaptive sampling so that inequality (5) is satisfied.

From a statistical point of view, the estimation of gradient satisfying inequality (5) is a -dimensional vector estimation problem. However, the existing sampling procedures mainly focus on one-dimensional estimation problems, and they cannot be directly applied to the multidimensional problem. Although there exists a close relationship between the components of the gradient , each component of can still be seen as a one-dimensional estimation subproblem. Therefore, the multivariate estimation problem (5) can be solved by solving these one-dimensional estimation subproblems. Inequality (5) is equivalent to according to the formula of vector inner product. In other words, inequality (5) must hold if each component of simultaneously satisfies its own inequality , where . So, the problem of seeking the gradient estimation satisfying inequality (5) can be approximately divided into subproblems, where each subproblem is solved by at the same time.

Each component of can be considered as the expectation of the population , and is the estimation of on the subset sampled from the population . According to the central limit theorem, the difference in value between and continually becomes smaller as the sampled subset grows. There exists a critical value of the sample subset size for satisfying inequality with a large probability. Therefore, both the early stopping rule and consecutive sampling are adopted to get the sampled subset and the estimation over it. Given the target vector and the objective function , each element can be a constant value and the upper bound of its absolute value could be estimated by a function , where and and are the size of the sampled subset per sampling and the total number of samplings, . In the next section, we will show that such a function can achieve this goal, and the stopping condition can be finally transformed into for each component . However, this requires recomputing on the sampled subset and testing whether all the components satisfy their own stopping rules during each round of sampling, and this computational burden could cost much more execution time. Two improvements are made to solve this problem (Algorithm 2).

Input: Dataset , a weight vector , parameters .
Output: The vector .
(1)Initialize: , ,
(2)Compute , where
whiledo
(3)
(4)Sample a random subset with the size of from the set ,
  for eachdo
(5)Compute
   ifthen
(6)
(7)
   end
  end
end
(8)Return
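The component-wise adaptive sampling idea behind Algorithm 2 can be sketched as follows. This is an illustration under explicit assumptions, not the paper's exact procedure: the rows of `G` play the role of per-example gradient terms, the stopping radius is a Hoeffding-style bound chosen here for illustration (with the failure budget `delta` split over components and rounds), and `theta`, `batch`, `value_range`, and `max_rounds` are hypothetical parameter names. A component is frozen as soon as its radius is at most `theta` times the magnitude of its current estimate, and frozen components are not updated again.

```python
import math
import random


def adaptive_gradient_estimate(G, theta=0.5, delta=0.05, batch=50,
                               value_range=2.0, max_rounds=200, rng=None):
    """Estimate the column means of G by repeated fixed-size sampling with per-component stopping."""
    rng = rng or random.Random(0)
    d = len(G[0])
    sums = [0.0] * d          # running sums, reused across rounds
    est = [0.0] * d
    active = set(range(d))    # components whose stopping rule is not yet met
    n_seen = 0
    for t in range(1, max_rounds + 1):
        if not active:
            break
        sample = [rng.choice(G) for _ in range(batch)]  # sampling with replacement
        n_seen += batch
        # Assumed Hoeffding-style radius; not the bound function used in the paper.
        radius = value_range * math.sqrt(
            math.log(2.0 * d * t * (t + 1) / delta) / (2.0 * n_seen)
        )
        for j in list(active):
            sums[j] += sum(g[j] for g in sample)
            est[j] = sums[j] / n_seen
            if radius <= theta * abs(est[j]):
                active.discard(j)  # frozen: kept as the final value for component j
    return est, n_seen


# Toy population with entries in [-1, 1]; the true column means are 0.8 and -0.8.
G = [[1.0, -1.0]] * 8 + [[1.0, 1.0]] + [[-1.0, -1.0]]
est, n_seen = adaptive_gradient_estimate(G)
```

The `max_rounds` cap is a safeguard added for this sketch: a component whose true mean is close to zero would otherwise never satisfy a relative stopping rule of this form.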

Let be the cumulatively sampled subset obtained by the first iterations sampling. The set , where is the sampled subset at the th iteration. We haveThe computation of can be largely reduced using the result on the set at each iteration. On the other hand, we adopt an asynchronous way to get each component of . If one or more components satisfy their own stopping rules, then they are directly the corresponding components of the final result without considering in subsequent iteration. Algorithm 2 describes the multivariate adaptive sampling method.
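The saving from reusing the result on the previously sampled subset can be illustrated with a running-mean update. The function name `merge_batch` and its arguments are assumptions made for this sketch; the point is only that the estimate over the cumulative sample follows from the previous estimate and the new batch, without revisiting earlier examples.

```python
def merge_batch(prev_mean, prev_count, batch):
    """Update a running mean with a new batch without revisiting earlier samples."""
    new_count = prev_count + len(batch)
    new_mean = (prev_mean * prev_count + sum(batch)) / new_count
    return new_mean, new_count


# Two rounds of sampling for a single gradient component.
mean1, count1 = merge_batch(0.0, 0, [1.0, 3.0])
mean2, count2 = merge_batch(mean1, count1, [5.0, 7.0])
```

Each round therefore costs time proportional to the batch size rather than to the size of the cumulative sample.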

3.4. The Effectiveness of AS Algorithm

In this subsection, we study the effectiveness of AS algorithm. To derive our main result, we need the following lemmas and theorems.

Lemma 1. Let be random events. For any positive integer ,

Proof. The lemma is proved by mathematical induction.(1)Consider [32]. We have and for , where is the complementary set of the set . Meanwhile, . So, .(2)Assume that the inequality holds for ; that is, .(3)Consider . We haveThe lemma follows immediately from mathematical induction.

Lemma 2. Let be a subset, which is obtained through independent and random sampling from the training set with the size of . For any given weight vector , and , we have

Proof. According to and , we have , where , , . Moreover,Therefore, for any , it follows from inequality (10) and the Hoeffding inequality [32] thatFor the convenience of the following developments, we make the following remarks. Let be the number of iterations until the AS algorithm satisfies the inequality , where is a subset obtained through random and independent sampling from with the size of , . Note that is a random variable depending on the sample drawn from . Let be the smallest integer meeting the following inequalities:Since is a strictly decreasing function with and is fixed under any given , hence, is uniquely determined for .
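For intuition about the role of the Hoeffding inequality, the following sketch evaluates the standard two-sided bound for the mean of independent samples whose values lie in an interval of width `value_range`: the probability that the sample mean deviates from the true mean by at least `eps` is at most 2·exp(−2·n·eps²/value_range²). The function names and parameters are illustrative assumptions and do not reproduce the specific bound of Lemma 2.

```python
import math


def hoeffding_sample_size(eps, delta, value_range=1.0):
    """Smallest n with 2 * exp(-2 * n * eps**2 / value_range**2) <= delta."""
    return math.ceil(value_range ** 2 * math.log(2.0 / delta) / (2.0 * eps ** 2))


def hoeffding_radius(n, delta, value_range=1.0):
    """Deviation eps guaranteed with probability at least 1 - delta after n samples."""
    return value_range * math.sqrt(math.log(2.0 / delta) / (2.0 * n))


n_needed = hoeffding_sample_size(eps=0.1, delta=0.05)
radius_at_n = hoeffding_radius(n_needed, delta=0.05)
```

The required sample size grows only logarithmically in 1/`delta` but quadratically in 1/`eps`, which is why a fixed worst-case sample size tends to be conservative.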

Lemma 3. If , then we have with probability , where , .

Proof. We get from the AS algorithm the estimate satisfying at the -th step. When , we have andThus, it always holds that as long as and have the same sign. On the other hand, if and have different signs, then they are quite different as . Next, we show the difference is large enough, and it is conducive to prove the probability that this situation occurs is small.
In the following, we will give this difference from two cases. If and , then . If and , then . Therefore, the difference is as . Then, the probability that and have different labels is estimated using Lemma 2 as follows:

Lemma 4. for .

Proof. When , it always holds that and . Then,It follows from the triangle inequality that . Thus, . Combining inequality (15) with Lemma 2, we haveCombining with Lemma 3 and 4, we can easily get the following theorem.

Theorem 1. For any and , the estimation generated by AS algorithm satisfies the following inequality:where .

Theorem 2. For any and , final estimation , generated by the AS algorithm satisfies inequality .

Proof. Combining Theorem 1 and Lemma 1, then we haveIn Theorem 2, the obtained estimation using the AS algorithm could keep the decreasing value of the current objective function in each iteration calculation of GD. Therefore, it could guarantee that LLR-AS algorithm gets the optimal solution of the convex optimization problem (2), and this conclusion is verified in the experiment. Besides the optimal solution, we also pay attention to the number of sampled examples (NSE). AS algorithm is an iterative sampling algorithm that samples a fixed number of samples each time, then we can estimate NSE using the total number of iterations. We have already proved that AS algorithm terminates finally within steps from Lemmas 3 and 4, where . It is well known that could be the minimum number of iterations satisfying condition (12); then, we could assume that , where . Finally, we can estimate NSE as follows:Next, we will discuss the above formula. The effect of the parameter and on formula (19) is small because they are in the logarithmic function. Namely, the AS algorithm has a low possibility to sample too many examples. So, it is useful for sampling examples from the large-scale data.

4. Experiments

4.1. Experimental Setup

Two representative gradient descent methods were selected in this study: common gradient descent with all the data and a dynamic sample gradient algorithm [25]. These two algorithms are used to solve the logistic regression problem, and they are named LR-GD and LR-DSG. Seven benchmark datasets are selected for making a fair comparison between our proposal and the others [33, 34]; their information is shown in Table 1. All of the selected datasets have more than 50000 instances.

Owing to their simplicity and successful application, we select the classification accuracy () and training time as the performance measures. LLR-AS and LR-DSG are both accelerated algorithms of LR-GD, so we compare their relative speeds () and the difference in their solutions. is a ratio of training time between LLR-AS and each of the others. For estimating these three performance measures and , we used a 10-fold cross-validation method. To compare the difference between our method and the others under a performance measure, we adopt the Wilcoxon signed-rank test (WSRT) [35]. WSRT is selected because it does not require a strict hypothesis about the data distribution and has stable performance. It is empirically considered to be stronger than other tests [36]. The null hypothesis of WSRT is that there exists no significant difference between our algorithm and each one of the others under a performance measure, while the alternative is that there exists a significant difference.
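The 10-fold cross-validation protocol can be sketched as an index generator. The function name `k_fold_indices`, the shuffling seed, and the round-robin assignment of shuffled indices to folds are assumptions made for this illustration; the paper does not state how its folds were formed.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists for k-fold cross-validation over n examples."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]  # fold sizes differ by at most one
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


splits = list(k_fold_indices(25, k=10))
```

Each example appears in exactly one test fold, so every performance measure is averaged over ten disjoint held-out sets.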

In the following experiments, we fix an initial sample of size 1 of the total training set and for DSG algorithm according to [25], and , for LLR-AS. Each of the three iterative algorithms stops when the Euclidean distance between the current and the previous solution is smaller than 0.001 or when the maximum of 5000 iterations is reached. A significance level of 0.05 is used. All the experiments are executed in Python 3.8 on the same computer with an Intel Xeon E5-2650 CPU and 32 GB of RAM.

4.2. Experimental Results and Analysis

In this section, we adopt classification ability and training efficiency as the two measurements for evaluating the performance of these algorithms, and we give a detailed comparative analysis together with the reasons for the experimental results.

4.2.1. Classification Performance Analysis

Under the theoretical hypothesis of logistic regression, its classification performance depends on the solution obtained by gradient descent. The LR-GD algorithm is trained by gradient descent with all the training data, so its weight vector is the optimal solution, and so is its classification performance. In the following, we compare the predictive ability of these algorithms in terms of the difference in the obtained solution vectors and the classification accuracy.

(1) The analysis of the difference in solution vectors: the Pearson correlation coefficient is chosen to evaluate the difference between two solution vectors for its high effectiveness. Its value is inversely related to the difference. The larger the value of , the smaller the difference between the two vectors. The correlation coefficients between the solution vector obtained by LLR-AS and that of each of the LR-GD and LR-DSG algorithms on the test data of each dataset are computed, and all the statistical results on the different datasets are listed in Table 2.
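The Pearson correlation coefficient between two weight vectors can be computed as in the sketch below. The function name `pearson` and the example vectors are illustrative assumptions; the formula is the standard sample correlation.

```python
import math


def pearson(u, v):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


# A vector and a positively scaled copy are perfectly correlated.
r_same_direction = pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_opposite = pearson([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
```

Because the coefficient is invariant to positive scaling and to shifts, a value close to 1 indicates that two solution vectors agree in their pattern across components, not necessarily in magnitude.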

Table 2 shows that the correlation coefficient of the weight vector between LLR-AS and LR-GD is close to 1 on all datasets except for the Cifa dataset, and the same holds between LLR-AS and LR-DSG. The means of the correlation coefficients on all the datasets are 0.986 and 0.989, and their medians are 0.996 and 0.986. A more detailed comparison can be drawn from the descriptive statistics. The following can be seen: (1) there exists a negligible difference in solution vector between the LLR-AS algorithm and each one of these two algorithms, so the LLR-AS algorithm can get nearly the same solution vector as the LR-GD algorithm. (2) The standard deviation of the correlation coefficients on each dataset is tiny, so this experimental result validates that the proposed algorithm can get a robust solution.

Owing to the properties of the vector estimate obtained at each iteration, the LLR-AS algorithm performs multiple iterations that continually reduce the objective function value. Meanwhile, the original optimization problem has a unique optimal solution because it is a convex optimization problem. So, the LLR-AS algorithm is able to guarantee convergence and obtain the optimal solution, as the LR-GD algorithm does. Furthermore, Theorem 2 is also verified by this experimental result, and the gradient estimation is stable across different datasets.

(2) The analysis of the difference in classification: the classification accuracy is adopted to compare the performance of the proposed algorithm with two state-of-the-art approaches, and WSRT is performed to test whether there exists a significant difference among them. Table 3 lists the descriptive statistics of the classification accuracy of each algorithm obtained by 10-fold cross-validation, and the results are also plotted in Figure 1.

The results on each dataset plotted in Figure 1 show that LLR-AS achieves better classification accuracy than the LR-GD algorithm and the LR-DSG algorithm on the Ijcnn1 dataset, and it does not have the worst classification accuracy on the rest of the datasets. To assess the overall classification performance on all the datasets, the mean and median of the results of each algorithm on the seven datasets are given in the last two rows of Table 3. Their mean values of classification accuracy are 0.750, 0.731, and 0.747, and the median values are 0.722, 0.711, and 0.713. Therefore, there exists a negligible difference in classification accuracy among these algorithms. Finally, the p values obtained using WSRT between LLR-AS and each one of the other algorithms are 0.1563 and 0.688, both larger than the given significance level of 0.05. It follows that (1) the LLR-AS algorithm has no significant difference in classification accuracy from the LR-GD algorithm and the LR-DSG algorithm on the selected datasets. (2) The LLR-AS algorithm has stable classification performance because the standard deviation of its classification accuracy on every dataset is relatively small.

The reason for the similar classification results is that the classification performance of logistic regression depends on the solution vector, and the solution vector of the LLR-AS algorithm is not significantly different from those of the LR-GD algorithm and the LR-DSG algorithm. Moreover, the solution vector obtained by the algorithm is robust according to Theorem 2, and the small standard deviation of the correlation coefficient on different datasets also verifies this fact.

4.2.2. Training Efficiency Analysis

Besides the classification performance, training speed is another important measurement for evaluating the training performance of an algorithm. The relative speed evaluates how much the LLR-AS algorithm accelerates training relative to each of the other algorithms. Table 4 lists on different datasets.

Table 4 shows that the value of between the LLR-AS algorithm and the LR-GD algorithm is larger than 20 on all the datasets, and its value is larger than 109 on the Cifa, Cod-rna, and Covtype datasets. So, the LLR-AS algorithm can largely reduce the training time of the LR-GD algorithm. On the other hand, the value of between the LLR-AS algorithm and the LR-DSG algorithm is larger than one on these seven datasets, and its average value of on all the datasets is 1.843. Therefore, the LLR-AS algorithm indeed needs less training time than the LR-DSG algorithm.

Three reasons explain why the proposed mechanism achieves good training efficiency. (1) The obtained gradient estimate can be a descent direction of the current objective function at each iteration. Thus, the total number of iterations, which is positively correlated with the training time, is reduced. (2) The divide-and-conquer approach is adopted to compute each component of the gradient vector at each iteration, and it can be executed in a parallel environment. (3) It is proved that there is a low probability of sampling too many examples to estimate the gradient, so the time for estimating the gradient at each iteration can be short. Therefore, the proposed algorithm performs better than the other representative algorithms owing to these three merits.

5. Summary

A novel algorithm for fast training of logistic regression via adaptive sampling has been proposed in this paper to handle massive datasets effectively. The proposed algorithm removes the need to fix the sample size before sampling, and it also offers an idea of dividing the multivariate estimation problem into several easy-to-solve subproblems. Experimental results on real datasets demonstrate that LLR-AS obtains similar classification performance with less execution time in comparison with other representative algorithms. Although the proposed algorithm solves the binary classification problem, it can be used for the multiclassification problem using the one-vs-all scheme, and paper [37] has shown that this scheme is as accurate as any other approach. This requires training several different classifiers, where each classifier is obtained by distinguishing the instances of one class from those of all the remaining classes. Given an unlabeled instance, the final output is the class whose classifier produces the largest result among all the classifiers.
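The one-vs-all scheme described above can be sketched as follows. The helper names are assumptions for this illustration, and `centroid_scorer` is a deliberately simple toy stand-in for a binary learner (any trainer that returns a scoring function, such as a logistic regression model, could be passed in its place).

```python
def one_vs_all_labels(labels, positive_class):
    """Relabel a multiclass problem as binary: 1 for the chosen class, 0 for the rest."""
    return [1 if c == positive_class else 0 for c in labels]


def train_one_vs_all(X, labels, train_binary):
    """Train one binary scorer per class; train_binary(X, y) must return a scoring function."""
    return {c: train_binary(X, one_vs_all_labels(labels, c)) for c in sorted(set(labels))}


def predict_one_vs_all(models, x):
    """Return the class whose scorer gives the largest result."""
    return max(models, key=lambda c: models[c](x))


def centroid_scorer(X, y):
    """Toy binary learner: score is the negative squared distance to the positive-class centroid."""
    pos = [x for x, yi in zip(X, y) if yi == 1]
    center = [sum(col) / len(pos) for col in zip(*pos)]
    return lambda x: -sum((a - b) ** 2 for a, b in zip(x, center))


X = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.2], [0.0, 9.0], [0.2, 9.0]]
labels = ["a", "a", "b", "b", "c", "c"]
models = train_one_vs_all(X, labels, centroid_scorer)
predicted = predict_one_vs_all(models, [4.8, 5.1])
```

The number of binary classifiers equals the number of classes, so the training cost scales linearly with the class count.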

Although the proposed algorithm performs well on large-scale data, it has two limitations in dealing with various kinds of real datasets. The gradient estimation needs all the features of the data at each iteration, so its training efficiency may be seriously challenged by high-dimensional data [38–40]. Furthermore, the algorithm does not consider the label distribution of the data, so its performance on imbalanced data could decrease. In the future, we will study how to combine sampling and feature selection to scale up machine learning algorithms and design an effective mechanism to deal with the class imbalance problem.

Data Availability

This publication was supported by LIBSVM datasets, which are openly available at location cited in [33].

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by Shandong Provincial Natural Science Foundation, China (no. ZR2020MF146), Major Scientific and Technological Innovation Project of Shandong Province (no. 2019JZZY010716), and Key R&D Plan of Shandong Province (no. 2019GGX101061).