Abstract
The use of conditional probabilities has gained popularity in various fields such as medicine, finance, and image processing. This is especially true with the availability of large datasets, which allow the estimation algorithms to be exploited to their full potential. Nevertheless, such a large volume of data often comes with a significant need for computational capacity and, consequently, long computation times. In this article, we propose a low-cost estimation method: we first prove analytically that the method converges to the desired probability, and we then perform a simulation to support this result.
1. Introduction
The likelihood that an event $A$ will occur given that another event $B$ has already occurred is called the conditional probability of $A$ given $B$, denoted by $P(A \mid B)$ or $P_B(A)$. For example, if a card is randomly drawn from a deck, there is a one-in-four chance of getting a heart, but if a red reflection is seen on the table, there is now a one-in-two chance of getting one. If events $A$ and $B$ have nonzero probabilities, then Bayes' theorem states that $$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}.$$ Beyond this formal definition, conditional probability is also useful in daily life and is attracting more and more interest in various fields. For example, banks estimate the probability of default of a borrower or bond issuer using conditional probability estimation methods based on the Basel II regulations (see [1] for more information). The estimation of this probability is crucial since it allows banks to compute their expected losses and therefore to cover the consequences. Another area where the estimation of conditional probabilities is important is marketing, where it is used to estimate the interest of a customer in a given product or service; marketers can thus focus on the most receptive population in order to optimize marketing costs [2]. The estimation of this probability is also frequently used in medicine, as doctors need to estimate the likelihood of a patient being affected by a given disease based on the symptoms the patient presents [3]. Many more areas rely on it as well, such as drug discovery, computer vision, speech recognition, handwriting recognition, biometric identification, document classification, Internet search engines, pattern recognition, and recommender systems [4–11].
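To make the opening card example concrete, here is Bayes' theorem applied to a standard 52-card deck, with $A$ the event "the card is a heart" and $B$ the event "the card is red":
$$P(A) = \frac{13}{52} = \frac{1}{4}, \qquad P(B) = \frac{26}{52} = \frac{1}{2}, \qquad P(B \mid A) = 1,$$
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)} = \frac{1 \times \frac{1}{4}}{\frac{1}{2}} = \frac{1}{2}.$$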
In practice, we can divide conditional probability estimation methods into two categories: linear and nonlinear classifiers. Linear classifiers can be split into two subcategories, generative and discriminative models [12, 13], and the most commonly used are
(i) Fisher's linear discriminant
(ii) Logistic regression
(iii) Naive Bayes classifier
Nonlinear classifiers can be grouped into the following families of methods:
(i) Linear classifiers applied to transformed data, such as a discretization of continuous variables
(ii) Support vector machines
(iii) Quadratic classifiers
(iv) K-nearest neighbors
(v) Decision trees
(vi) Neural networks
(vii) Learning vector quantization
To learn more about these different algorithms, see [14–20].
Let us consider an observable binary random variable $Y$ with values in $\{0, 1\}$ and a random variable $X$ with values in $\mathbb{R}^d$. We define the function $\pi$ such that $$\pi(x) = P(Y = 1 \mid X = x).$$
We wish to estimate the vector $\theta \in \mathbb{R}^d$ such that the conditional probability can be written in the form $$\pi_\theta(x) = P(Y = 1 \mid X = x) = \frac{\exp(\theta^\top x)}{1 + \exp(\theta^\top x)}.$$
We are looking for a simple method of estimating the parameter $\theta$ that is less demanding in terms of computational capacity. This is especially useful in the Big Data era, where datasets can be massive and any common iterative estimation can take a long time. To do this, we use stochastic approximation, which was introduced by Herbert Robbins and Sutton Monro in 1951 [21]. The goal is to find the unique root of a function $M$ that cannot be directly observed; yet, we assume that we can observe a random variable $N(x)$ such that $E[N(x)] = M(x)$. According to [21], there exists a sequence of positive steps $(a_n)$ satisfying $$\sum_{n \ge 1} a_n = +\infty \quad \text{and} \quad \sum_{n \ge 1} a_n^2 < +\infty$$ such that the process defined by $$x_{n+1} = x_n - a_n N(x_n)$$ converges to the unique root of $M$. In our case, we start from the work of Bennar et al. [22], who established the conditions for almost sure convergence, as well as the quadratic mean convergence, of a stochastic gradient process to the parameter that allows us to estimate the conditional probability. Here, we are interested in the case of binary random variables, where $E(Y \mid X = x)$ is equivalent to $P(Y = 1 \mid X = x)$, as we can see in the following: $$E(Y \mid X = x) = 1 \cdot P(Y = 1 \mid X = x) + 0 \cdot P(Y = 0 \mid X = x) = P(Y = 1 \mid X = x).$$
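As an illustration (not part of the original derivation), the following R sketch applies the Robbins–Monro recursion to a hypothetical function $M(x) = x - 2$ observed only through noisy evaluations; the step sizes $a_n = 1/n$ satisfy the two conditions above.

```r
# Robbins-Monro sketch (illustrative): find the root of M(x) = x - 2 when only
# noisy evaluations N(x), with E[N(x)] = M(x), are available.
set.seed(123)
noisy_M <- function(x) (x - 2) + rnorm(1)   # unbiased noisy observation of M(x)

x <- 0                                      # arbitrary starting point
for (n in 1:5000) {
  a_n <- 1 / n                              # sum(a_n) diverges, sum(a_n^2) converges
  x <- x - a_n * noisy_M(x)                 # Robbins-Monro recursion
}
x                                           # close to the true root, 2
```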
We also chose these results as the basis of our work because the stochastic gradient process draws a sample at each iteration, building the estimates without relying on all the available data.
In this article, we first present the convergence results established by Bennar et al.; we then show that these results remain valid in the framework of conditional probability estimation. We also present a simulation to illustrate the obtained results, and finally we conclude by discussing perspectives for future development.
2. Preliminaries
Let us consider an observable random variable $X$ and a binary random variable $Y$, the couple $(X, Y)$ having values in $\mathbb{R}^d \times \{0, 1\}$ with law $\mu$. We try to estimate the parameter $\theta$ in $\mathbb{R}^d$ such that $\pi_\theta(X)$ approaches $P(Y = 1 \mid X)$ in the least squares sense. It should also be noted that least squares estimation of the parameters of a logistic regression is already achieved through the iteratively reweighted least squares method [23], which, unlike our approach, is computationally heavy and requires huge computing capacities in the case of large datasets.
Let $g$ be the positive real function defined on $\mathbb{R}^d$ by $$g(\theta) = E\big[\big(\pi_\theta(X) - P(Y = 1 \mid X)\big)^2\big];$$ we are looking for the value $\theta^*$ of $\theta$ that minimizes the function $g$.
Let us also define the positive real function $Q$ on $\mathbb{R}^d$ by $$Q(\theta) = E\big[\big(Y - \pi_\theta(X)\big)^2\big].$$
We have $$Q(\theta) = E\big[\big(Y - P(Y = 1 \mid X)\big)^2\big] + g(\theta),$$ since the cross term $2\,E\big[\big(Y - P(Y = 1 \mid X)\big)\big(P(Y = 1 \mid X) - \pi_\theta(X)\big)\big]$ vanishes after conditioning on $X$; the first term does not depend on $\theta$, and thus the problem reduces to looking for the value of $\theta$ that minimizes the function $Q$. We have $$\nabla_\theta Q(\theta) = -2\, E\big[\big(Y - \pi_\theta(X)\big)\, \nabla_\theta\, \pi_\theta(X)\big].$$
To estimate $\theta^*$ in a sequential way, we use a stochastic gradient algorithm. We consider a random process $(\theta_n)_{n \ge 0}$ in $\mathbb{R}^d$ defined by $$\theta_{n+1} = \theta_n - a_n H(\theta_n, X_{n+1}, Y_{n+1}),$$ with
(i) $(a_n)_{n \ge 1}$ a sequence of positive real numbers
(ii) $(X_n, Y_n)_{n \ge 1}$ a sample of independent random variable couples with the same probability law as $(X, Y)$
(iii) $H$ a known real measurable function on $\mathbb{R}^d \times \mathbb{R}^d \times \{0, 1\}$ (a numerical sketch of this recursion is given after this list)
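As an illustration only, the following R sketch writes one update of this recursion when $H$ is taken as the per-observation gradient of the squared error $(y - \pi_\theta(x))^2$ under the logistic form of Section 1; the function names pi_theta, H, and sgd_step, as well as this explicit choice of $H$, are ours and are not prescribed by the general scheme.

```r
# One update of the stochastic gradient recursion, with H taken as the
# per-observation gradient of (y - pi_theta(x))^2 under the logistic form.
pi_theta <- function(theta, x) 1 / (1 + exp(-sum(theta * x)))

# d/d theta of (y - pi_theta(x))^2 = -2 * (y - p) * p * (1 - p) * x,
# since the gradient of the logistic function with respect to theta is p * (1 - p) * x.
H <- function(theta, x, y) {
  p <- pi_theta(theta, x)
  -2 * (y - p) * p * (1 - p) * x
}

# theta_{n+1} = theta_n - a_n * H(theta_n, X_{n+1}, Y_{n+1})
sgd_step <- function(theta, x, y, a_n) theta - a_n * H(theta, x, y)
```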
In the following, the abbreviation a.s. means almost sure convergence and q.m. means quadratic mean convergence.
2.1. Almost Sure Convergence
Bennar et al. have considered the following assumptions:
(1) $\sum_{n \ge 1} a_n = +\infty$
(2) $\sum_{n \ge 1} a_n^2 < +\infty$
(3) there exist … and … such that, for all …, …
(4) there exists … such that, for all …, $\theta^*$ is a local minimum of $Q$
(5) $\theta^*$ is the unique stationary point of $Q$
Lemma 1. Under assumptions (1)–(5), we have $\theta_n \xrightarrow{\text{a.s.}} \theta^*$.
Proof. See [22].
2.2. Quadratic Mean Convergence
Bennar et al. have considered the following additional assumptions:
(6) … and … are uniformly bounded in … and …
(7) there exist two real positive functions … and … defined on … such that …
(8) $Y$ is a bounded real random variable
Lemma 2. Under assumptions (1)–(8), we have $\theta_n \xrightarrow{\text{q.m.}} \theta^*$.
Proof. See [22].
3. Application
3.1. Proof of Process Convergence
Let us assume that … are measurable real-valued functions. We note
In order to estimate the value $\theta^*$ of $\theta$ that minimizes this function, we consider the following stochastic approximation process $(\theta_n)_{n \ge 0}$ in $\mathbb{R}^d$, defined by $$\theta_{n+1} = \theta_n - a_n H(\theta_n, X_{n+1}, Y_{n+1}),$$ with …, where $(X_n, Y_n)_{n \ge 1}$ is a sample of independent and identically distributed random variable couples.
We assume the following assertions:
(i) the … are observed in a finite way
(ii) … is a random variable such that …
Theorem 3. Under the above assumptions, we have $\theta_n \xrightarrow{\text{a.s.}} \theta^*$ and $\theta_n \xrightarrow{\text{q.m.}} \theta^*$.
Proof. Let … be the real function of … defined by …. Let us first prove that assumption 3 holds.
We have …. For …, we have …. Thus, for …, we have …, as the … are observed in a finite way, and …. Then, there exists … such that, for all …, …. Let us now prove that assumption 6 holds.
We have … with …;
then … and …, and since the … are observed in a finite way, there exists … such that, for all … and …, …. Then, … and … are uniformly bounded in … and ….
Let us now prove that assumption 7 holds. To do this, we use the following result.
Lemma 4 (mean value inequality). Let $E$ and $F$ be two real normed vector spaces, $U$ an open subset of $E$, and $f : U \to F$ a differentiable map. For any segment $[a, b]$ included in $U$, we have $$\|f(b) - f(a)\|_F \le \sup_{x \in [a, b]} \|Df(x)\|_{\mathrm{op}}\, \|b - a\|_E,$$ where, for any point $x$ of $U$, $\|Df(x)\|_{\mathrm{op}}$ is the operator norm of the differential of $f$ at the point $x$.
Proof. See [24], p. 31.
Then, there exist two real positive functions … and … defined on … such that
…. Let us prove that … and ….
We have already seen that …, and since the … are observed in a finite way, ….
Furthermore, we have …, hence …, and since the … are observed in a finite way, ….
Moreover, since $Y$ is a binary random variable, it is bounded, and assumption 8 is therefore true.
Then, under assumptions (1)–(8), we have $\theta_n \xrightarrow{\text{a.s.}} \theta^*$ and $\theta_n \xrightarrow{\text{q.m.}} \theta^*$.
3.2. Simulation
In order to illustrate our work, we perform a simulation in which we estimate the parameters of a logistic regression. Our simulations are performed using the programming language R. We simulate observations of the random variable $X$, and we define $Y$ such that …, with a noise term added to avoid having a perfectly fitted model. We then fit a classical logistic regression using the Fisher scoring algorithm, which converges in 12 iterations. We define the accuracy rate as the number of correctly classified observations over the total number of observations; the classical model has an accuracy of 90.34%. Table 1 shows the remaining outputs of the model.
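The following R sketch illustrates, with hypothetical sample size, coefficients, and noise level, the kind of data simulation and classical logistic fit described above; it does not reproduce the article's exact design.

```r
# Illustrative simulation with hypothetical sizes, coefficients, and noise level.
set.seed(42)
n_obs      <- 10000
X          <- cbind(1, rnorm(n_obs), rnorm(n_obs))            # intercept + two covariates
theta_true <- c(-0.5, 2, -1.5)
eta        <- drop(X %*% theta_true) + rnorm(n_obs, sd = 0.5)  # noise: no perfect fit
Y          <- rbinom(n_obs, size = 1, prob = 1 / (1 + exp(-eta)))

# Classical logistic regression; glm fits it by Fisher scoring (IRLS) by default.
fit <- glm(Y ~ X - 1, family = binomial)
fit$iter                          # number of Fisher scoring iterations
mean((fitted(fit) > 0.5) == Y)    # accuracy: share of correctly classified observations
```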
Regarding the proposed process, we initialize it with randomly chosen values …, and we choose …; as … and … are finite, assumption … is verified. We also randomly draw a sample of one observation at each iteration to perform our calculations. Finally, we set an accuracy of …. Following the simulations, we obtain the results below.
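Continuing the sketch above (reusing the simulated X and Y), the proposed process can be run as follows; the starting point, step sizes $a_n = 1/n$, and stopping tolerance are hypothetical choices made for illustration, not the exact settings used in the article.

```r
# Illustrative run of the proposed process, reusing X and Y simulated above;
# starting point, step sizes a_n = 1/n, and tolerance are hypothetical choices.
pi_theta <- function(theta, x) 1 / (1 + exp(-sum(theta * x)))
H <- function(theta, x, y) {                # gradient of (y - pi_theta(x))^2
  p <- pi_theta(theta, x)
  -2 * (y - p) * p * (1 - p) * x
}

set.seed(7)
theta <- rnorm(ncol(X))                     # randomly chosen starting values
tol   <- 1e-4                               # stop when the update becomes negligible
for (n in 1:50000) {
  i     <- sample(nrow(X), 1)               # sample of one observation per iteration
  step  <- (1 / n) * H(theta, X[i, ], Y[i])
  theta <- theta - step
  if (sqrt(sum(step^2)) < tol) break
}

probs <- 1 / (1 + exp(-drop(X %*% theta)))
mean((probs > 0.5) == Y)                    # accuracy of the sequential estimate
```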
We can see through Figures 1 and 2, as well as Figure 3, that the process converged in 10 iterations. Therefore, we only needed 10 samples of one observation each to obtain a robust estimation of the coefficients. Moreover, we can see in Figure 3, as well as in the summary of the process in Table 2, that the latter achieves a prediction accuracy of 89% on the set of simulated observations, hence a loss of about 1 percentage point in accuracy; in return, we gain greatly in terms of computational cost.



4. Conclusion
In this work, we have demonstrated the convergence of the studied process towards the values that minimize the objective function, and our simulations show that this theoretical result also holds at the empirical level. Nevertheless, the simulation required an arbitrarily chosen starting point, which can lead to slow convergence of the process when the initial point is far from the targeted value. Moreover, the speed of convergence is also greatly affected by the choice of the sequence $(a_n)$. Thus, a possible improvement would be to find the optimal sequence $(a_n)$ that provides the fastest convergence.
Data Availability
No data were used to support this study.
Conflicts of Interest
The authors declare no conflicts of interest.