Abstract

Simple and multiple linear regression analyses are statistical methods used to investigate the link between the activity/property of active compounds and their structural chemical features. One assumption of linear regression is that the errors follow a normal distribution. This paper introduces a new approach to solving the simple linear regression in which no assumptions about the distribution of the errors are made. The proposed approach maximizes the probability of observing the event according to the random error. The use of the proposed approach is illustrated on ten classes of compounds with different activities or properties. The proposed method proved reliable and was shown to fit the observed data properly compared to the conventional approach, which assumes normally distributed errors.

1. Introduction

The quantitative structure activity/property relationships (QSARs/QSPRs) are computational techniques that quantitatively relate chemical features (such as descriptors) to a biological activity or property [1]. Linear regression is one of the earliest methods [2] used to link the activity/property with structural information and is frequently used because it is relatively easy to interpret [3]. Sometimes, linear regression is misused because it is applied without checking its assumptions (such as linearity, independence of the errors, normality, homoscedasticity, and absence of multicollinearity [4]).

The error, “a measure of the estimated difference between the observed or calculated value of a quantity and its true value” [5], was first used in mathematics/statistics in 1726 in Astronomiae Physicae & Geometricae Elementa [6]. In the late 1800s, Adcock [7, 8] suggested that the errors must pass through the centroid of the data. The method proposed by Adcock, named orthogonal regression, explores the distance between a point and the line in a direction perpendicular to the line [7, 8]. Kummell [9] investigated directions other than perpendicular between the points and the line. The regression slope (“”) was described by Galton in 1894 based on an experiment with sweet pea seeds [10]. Two years later, Pearson generalized the errors in the variable and published a rigorous description of correlation and regression analysis [11] (Pearson recognized the contribution of Bravais [12] to the mathematical formula of correlation). Because it produces the best linear unbiased estimates [13], the coefficients in simple linear regression (SLR) models are estimated by minimizing the sum of squared deviations (least squares estimation, a method introduced by Legendre in 1805 [14] and applied by Gauss in 1809 [15]). Furthermore, Fisher introduced the concept of maximum likelihood within linear models [16, 17].

The generic equation of simple linear regression (1) between the observed dependent variable $y$ and the observed independent variable $x$ is

$$\hat{y} = a x + b \qquad (1)$$

where $a$ and $b$ are unknown constant values (estimators of the statistical parameters of simple linear regression), $\hat{y}$ is the value of the dependent variable estimated by the model, $y$ is the observed value of the dependent variable, and $x$ is the observed value of the predictor variable.

The array used to estimate the residuals is given by the formula $|y_i - \hat{y}_i|^{q}$, where $i$ indexes the observations in the sample ($1 \le i \le n$, where $n$ is the sample size) and $q$ is an unknown coefficient. The unknown coefficient $q$ is an estimator of the power of the errors in simple linear regression.

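In the SLR-LS (simple linear regression least squares), the residuals ($\varepsilon_i = y_i - \hat{y}_i$, where $\varepsilon$ = residual) follow the Gauss-Laplace distribution with $\mu$, $\sigma$, and $q$ being unknown statistical parameters:

$$GL(x; \mu, \sigma, q) = \frac{c_1}{\sigma} \exp\left(-\left|c_0 \frac{x - \mu}{\sigma}\right|^{q}\right), \quad c_0 = \left(\frac{\Gamma(3/q)}{\Gamma(1/q)}\right)^{1/2}, \quad c_1 = \frac{q\, c_0}{2\, \Gamma(1/q)} \qquad (2)$$

where $\mu$ is the population mean, $\sigma$ is the population standard deviation (estimated by the sample standard deviation $s$), $q$ is the power of the errors, and $\Gamma$ is the gamma function.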
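The role of the power of the errors can be checked numerically. The sketch below is a minimal illustration, assuming the exponential power parametrization in which the scale parameter is the standard deviation; the function names (`gauss_laplace_pdf`, `normal_pdf`, `laplace_pdf`) are hypothetical and are not taken from the authors' program. Under that assumption, a power of 2 gives the Gauss density and a power of 1 gives the Laplace density.

```python
import math


def gauss_laplace_pdf(x, mu=0.0, sigma=1.0, q=2.0):
    """Exponential power density with mean mu, standard deviation sigma, power q.

    Assumed parametrization (not confirmed by the source text):
    c0 = sqrt(Gamma(3/q) / Gamma(1/q)), c1 = q * c0 / (2 * Gamma(1/q)).
    """
    c0 = math.sqrt(math.gamma(3.0 / q) / math.gamma(1.0 / q))
    c1 = q * c0 / (2.0 * math.gamma(1.0 / q))
    z = abs(c0 * (x - mu) / sigma)
    return c1 / sigma * math.exp(-(z ** q))


def normal_pdf(x, mu, sigma):
    """Gauss density, used only as a reference for q = 2."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def laplace_pdf(x, mu, sigma):
    """Laplace density written in terms of its standard deviation sigma."""
    b = sigma / math.sqrt(2.0)  # classical Laplace scale parameter
    return math.exp(-abs(x - mu) / b) / (2.0 * b)


if __name__ == "__main__":
    for x in (-1.0, 0.3, 2.5):
        print(x, gauss_laplace_pdf(x, 0.1, 1.5, 2.0), normal_pdf(x, 0.1, 1.5))
        print(x, gauss_laplace_pdf(x, 0.1, 1.5, 1.0), laplace_pdf(x, 0.1, 1.5))
```

Intermediate or larger powers interpolate between and beyond these two cases, which is why the power of the errors can be treated as a free parameter.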

The Gauss-Laplace distribution is symmetrical and has three statistical parameters (population mean, population standard deviation, and power of the errors) [15, 18] and two main particular cases. The first particular case is the Gauss distribution [15], often observed in arrays of biochemical data [19–21], while the second particular case is the Laplace distribution (with mean of zero and variance ) [22, 23], commonly seen in astrophysical data [24, 25].

The problem of estimating the parameters of the SLR (1) for the first particular case (Gauss distribution) considers residuals with $q = 2$ (where $q$ is the power of the errors, related to the experimental errors). The coefficients of regression for this particular case are obtained by solving the system of linear equations under the assumption that $\sum_{i} (y_i - \hat{y}_i)^2 = \min$ [26] (the partial derivatives of this sum with respect to $a$ and $b$ are set to zero, where $a$ and $b$ are unknown parameters).
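For the Gauss case, the system of linear equations has a closed-form solution. The following sketch shows the textbook least squares solution for a line with slope and intercept; the function name `least_squares_line` is illustrative, and the code assumes that the predictor values are not all identical.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Centered sums of squares and cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # assumes sxx > 0
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    print(least_squares_line([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0]))
```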

The second particular case is when the residuals follow the Laplace distribution. Because the absolute value function “is not differentiable everywhere” [27], the solution is more difficult to obtain for this particular case.

One question can be asked: “what is the proper value of $q$ that should be used in the simple linear regression analysis (1)?” A previous study showed that, for different sets of biologically active compounds, the distribution of the dependent variable can be approximated by the Gauss distribution ($q = 2$) in only a relatively small number of cases when the whole Gauss-Laplace family is investigated [28]. Based on this result, the aim of the present study was to formulate the problem of solving the simple linear regression equation (1) without making any assumptions about the power of the errors $q$.

2. Materials and Methods

2.1. Mathematical Approach

The problem of regression (1) is transformed into a problem of estimation if the residuals are introduced in (2) with a slight modification: in the quantity $y - (a x + b) - \mu$ the constants $b$ and $\mu$ are equivalent, and just one ($\mu$) will be further used. The Gauss-Laplace distribution is symmetrical, and the observed mean is an unbiased estimator of the population mean $\mu$. This can be expressed in terms of (1) as presented in

$$\hat{y} = a x + \mu, \quad \mu = M(y - a x) \qquad (3)$$

where $\mu$ is the population mean of the Gauss-Laplace quantity (2), $y$ is the observed/measured dependent variable, $\hat{y}$ is the dependent variable estimated by the regression model, $x$ is the independent/predictor variable, and $M$ is the mean operator.

For certain arrays of paired observations $(x_i, y_i)$, the problem of regression expressed in (1) is transformed into a problem of estimating the parameters of the bidimensional Gauss-Laplace distribution as presented in

$$GL(x, y; a, \mu, \sigma, q) = GL(y - a x; \mu, \sigma, q) \qquad (4)$$

that is, the density (2) evaluated at $y - a x$.

An efficient instrument to solve (4) is maximum likelihood estimation (MLE), a method proposed by Fisher [16, 17]. The main assumption of the MLE is that the array has been observed because it had the highest chance of being observed (simultaneously and independently). This can be translated as $P\big((x_1, y_1), \ldots, (x_n, y_n)\big) = \max$, and thus $\prod_{i} P(x_i, y_i) = \max$, which leads to the expression in

$$\prod_{i=1}^{n} GL(x_i, y_i; a, \mu, \sigma, q) = \max \qquad (5)$$

By including (4) in (5) and using the natural logarithm, the problem presented in (1) becomes a problem of optimization:

$$\sum_{i=1}^{n} \ln GL(y_i - a x_i; \mu, \sigma, q) = \max \qquad (6)$$

where $n$ is the number of pairs.
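The quantity to be maximized can be written as a short function. This is a minimal sketch, not the authors' PHP code: it assumes the exponential power density with standard deviation `sigma` and power `q`, a model of the form slope times predictor plus `mu`, and the hypothetical name `log_likelihood`.

```python
import math


def log_likelihood(xs, ys, a, mu, sigma, q):
    """Sum of log densities of the residuals y - a*x - mu.

    Assumed density: (c1 / sigma) * exp(-|c0 * r / sigma| ** q), with
    c0 = sqrt(Gamma(3/q) / Gamma(1/q)) and c1 = q * c0 / (2 * Gamma(1/q)).
    """
    c0 = math.sqrt(math.gamma(3.0 / q) / math.gamma(1.0 / q))
    c1 = q * c0 / (2.0 * math.gamma(1.0 / q))
    total = len(xs) * (math.log(c1) - math.log(sigma))
    for x, y in zip(xs, ys):
        residual = y - a * x - mu
        total -= abs(c0 * residual / sigma) ** q
    return total


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0, 3.0]
    ys = [0.1, 0.9, 2.2, 2.9]
    # A slope close to the data scores higher than a poor slope.
    print(log_likelihood(xs, ys, 1.0, 0.0, 0.5, 2.0))
    print(log_likelihood(xs, ys, 3.0, 0.0, 0.5, 2.0))
```

Working with the sum of logarithms rather than the product of densities avoids numerical underflow when the number of pairs grows.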

The optimization problem presented in (5) can be solved iteratively if the start point is a good initial solution (situated near the optimal solution). In this research, the start point of the optimization was the solution of a particular case of (6) as presented in

$$q = 2, \quad a = \frac{M(xy) - M(x)M(y)}{Var(x)}, \quad \mu = M(y - a x), \quad \sigma^2 = Var(y - a x) \qquad (7)$$

where $q$ is the power of the errors, $\mu$ is the population mean, $\sigma$ is the population standard deviation, $M$ is the average (central tendency operator), and $Var$ is the variance (dispersion operator).
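The reason the least squares solution is a natural start point can be made explicit. The derivation below is a sketch under the assumption that the density has the exponential power form with constants $c_0$ and $c_1$ defined through the gamma function and that the model is $a x + \mu$; it shows that fixing the power of the errors at 2 reduces the likelihood problem to least squares.

```latex
% Assumed density: GL(r) = (c_1/\sigma) \exp(-|c_0 r/\sigma|^q),
% c_0^2 = \Gamma(3/q)/\Gamma(1/q), \quad c_1 = q c_0 / (2\Gamma(1/q)).
\begin{align*}
q = 2 &\;\Rightarrow\; c_0^2 = \frac{\Gamma(3/2)}{\Gamma(1/2)} = \frac{1}{2},
  \qquad c_1 = \frac{1}{\sqrt{2\pi}}, \\
\ln L(a, \mu, \sigma) &= -n \ln\!\left(\sigma\sqrt{2\pi}\right)
  - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - a x_i - \mu)^2, \\
\frac{\partial \ln L}{\partial \mu} = 0 &\;\Rightarrow\; \mu = M(y) - a\,M(x), \\
\frac{\partial \ln L}{\partial a} = 0 &\;\Rightarrow\;
  a = \frac{M(xy) - M(x)\,M(y)}{M(x^2) - M(x)^2}, \\
\frac{\partial \ln L}{\partial \sigma} = 0 &\;\Rightarrow\;
  \sigma^2 = \frac{1}{n} \sum_{i=1}^{n} (y_i - a x_i - \mu)^2 .
\end{align*}
```

For a fixed $\sigma$, maximizing $\ln L$ over $a$ and $\mu$ is the same as minimizing the sum of squared residuals, so the start point coincides with the classical least squares fit (with the variance computed using divisor $n$).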

2.2. Algorithm Implementation

The classical simple linear regression uses the least squares method to estimate the $a$, $\mu$, and $\sigma$ coefficients in (7), using the fixed value of 2 for the power of the errors $q$. In our approach, starting with the optimal solutions for the $a$, $\mu$, and $\sigma$ coefficients obtained by (7), the optimal solution of (6) was obtained iteratively by making small changes to the values of the coefficients and selecting the coefficients that increase the MLE value. The implemented weights of the changes were more or less arbitrary, and the selected ones are a compromise for convergence speed in the convergence space.
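The iterative procedure described above can be sketched as a greedy coordinate search. This is an illustration only and not the authors' PHP implementation: the step sizes, the halving rule, the stopping threshold, and all function names are assumptions, and the density is assumed to be the exponential power form with standard deviation `sigma` and power `q`. The data must not lie exactly on a line, otherwise the starting `sigma` is zero.

```python
import math


def log_likelihood(xs, ys, a, mu, sigma, q):
    """Log-likelihood of residuals y - a*x - mu; -inf on numeric overflow."""
    try:
        c0 = math.exp(0.5 * (math.lgamma(3.0 / q) - math.lgamma(1.0 / q)))
        c1 = q * c0 / (2.0 * math.exp(math.lgamma(1.0 / q)))
        total = len(xs) * (math.log(c1) - math.log(sigma))
        for x, y in zip(xs, ys):
            total -= abs(c0 * (y - a * x - mu) / sigma) ** q
        return total
    except (OverflowError, ValueError):
        return float("-inf")


def start_point(xs, ys):
    """Least squares start: slope, intercept, residual sd (divisor n), q = 2."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    a = sxy / sxx
    mu = mean_y - a * mean_x
    sigma = math.sqrt(sum((y - a * x - mu) ** 2 for x, y in zip(xs, ys)) / n)
    return a, mu, sigma, 2.0


def maximize_likelihood(xs, ys, step=0.01, min_step=1e-6, max_iter=5000):
    """Greedy search: try small changes, keep the one that raises the likelihood."""
    params = list(start_point(xs, ys))  # [a, mu, sigma, q]
    best = log_likelihood(xs, ys, *params)
    for _ in range(max_iter):
        if step < min_step:
            break
        candidates = []
        for i in range(4):
            delta = step * max(abs(params[i]), 1.0)
            for sign in (1.0, -1.0):
                trial = params[:]
                trial[i] += sign * delta
                if trial[2] > 0.0 and trial[3] > 0.0:  # sigma and q stay positive
                    candidates.append((log_likelihood(xs, ys, *trial), trial))
        top_value, top_params = max(candidates, key=lambda c: c[0])
        if top_value > best:
            best, params = top_value, top_params
        else:
            step /= 2.0  # no improving move: refine the step size
    return params, best


if __name__ == "__main__":
    noise = [0.5, -0.3, 0.1, -0.8, 1.2, -0.1, 0.05, -0.4, 0.3, -0.55]
    xs = [float(i) for i in range(1, 11)]
    ys = [2.0 * x + 1.0 + e for x, e in zip(xs, noise)]
    print(maximize_likelihood(xs, ys))
```

Because a move is accepted only when it raises the log-likelihood, the final value can never be lower than the value at the least squares start point.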

The flowchart of the proposed approach is presented in Figure 1.

A PHP program was developed to find the optimal solution for (6). As input data, the implemented program needs a .txt file with three columns (mol, $x$, and $y$, where mol is the identification of the molecule and can be text or a number, $x$ is the independent variable, and $y$ is the dependent variable). The program generates the output file specified by the user (a .txt file can be used), which contains for each iteration the values of the following: $a$, $\mu$, $\sigma$, $q$, and MLE.
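Reading such an input file is straightforward. The sketch below is a hypothetical reader, not part of the authors' program: it assumes whitespace-separated columns, a molecule identifier without spaces, and it silently skips blank lines and any row whose second or third field is not numeric (for example, a header row).

```python
def parse_observations(lines):
    """Parse rows of the form 'mol x y' into (mol, x, y) tuples."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # blank or malformed line
        mol, x_text, y_text = fields
        try:
            rows.append((mol, float(x_text), float(y_text)))
        except ValueError:
            continue  # non-numeric values, for example a header row
    return rows


if __name__ == "__main__":
    sample = ["mol x y", "m1 1.0 2.5", "", "2\t3\t4.5"]
    print(parse_observations(sample))
```

The same function accepts an open file object, since iterating over a text file yields its lines.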

The source code of the implemented algorithm is free to be used and is presented in the Supplementary Material available online at http://dx.doi.org/10.1155/2015/360752. The full program can be obtained upon request from the authors.

2.3. Data Sets

Ten classes of previously investigated compounds were used to assess the proposed method. The class of compounds, the activity/property of interest along with the number of compounds in the dataset and the reference to the paper from where the independent and dependent variables were collected are given in Table 1.

Simple linear regression (SLR) models under the assumption of linear relationship between structural descriptors and activity/property of chemical compounds were identified using the values of descriptors previously published in the literature (see reference in Table 1). The characteristics of the models with the highest goodness-of-fit for each class of compounds are presented in Table 2.

3. Results and Discussion

The proposed solution for solving the simple linear regression without making any assumptions about the power of the errors has been successfully implemented and reliable solutions were obtained.

The developed algorithm was successfully tested on ten different data sets. The number of iterations needed to find the optimal solution varied from 9 (set10) to 185 (set4b) and does not seem to be related to the number of compounds in the sample when the same class of compounds is investigated (63 iterations (set1a), 51 iterations (set1b), and 86 iterations (set1c)). The number of iterations needed to obtain the optimal solution was 173 for the smallest dataset (set2) and 86 for the dataset with the highest number of compounds (set1c). Accordingly, the maximum number of iterations was almost 21 times the minimum number of iterations.

The results of the simulation study obtained for the conventional solution ($q = 2$, residuals follow the Gaussian distribution) and for the solution that satisfies (6) are presented in Table 3. The values of the calculated coefficients ($a$, $\mu$, and $\sigma$) are provided with three decimals; equal values for $q = 2$ and the optimal $q$ were obtained as follows: , coefficient in set1b, set3, and set6; , coefficient in set3, set6, set8, and set10; and , coefficient in the following sets: 1b, 1c, 3, 4a, 5, 6, 8, 9, and 10.

The analysis of the coefficients presented in Table 3 revealed the following.
(i) In 9 out of 13 cases, at least one coefficient ($a$, $\mu$, or $\sigma$) proved equal for the conventional $q = 2$ and for the $q$ determined to satisfy (6).
(ii) In 6 out of 13 cases, the power of the errors obtained by MLE proved significantly higher than 2. The difference varied from 0.8099 (set4a) to 7.5176 (set1a).
(iii) In just one case, the difference between the powers of the errors proved not statistically significant (set3, ).
(iv) In 6 out of 13 cases, the difference between the powers of the errors (SLR-LS and SLR-MLE) proved lower than 1.
(v) The smallest difference between the powers of the errors (from SLR-LS and SLR-MLE) was 0.2613 (set10) and was identified as statistically significant.
(vi) Two classes of compounds (set3 and set6) showed identical values of $a$, $\mu$, and $\sigma$ regardless of the method used in the regression analysis (SLR-LS or SLR-MLE).
(vii) The $q$ obtained by SLR-MLE proved significantly different from the conventional value, with one exception represented by set3.

The most probable distribution of the power of the errors obtained by MLE is the Fatigue Life or Birnbaum-Saunders distribution [44] (Kolmogorov-Smirnov statistics = 0.1245, ; Anderson-Darling statistics = 0.2753 ; the p-value associated with the Anderson-Darling statistics was calculated taking into account the value of the statistics and the sample size [45]). The Fatigue Life distribution of the power of the errors is characterized by two parameters, a continuous shape parameter () and a continuous scale parameter (). The median of the power of the errors is close to the conventional value of 2, with a mean of 2.68. Nevertheless, the normal distribution of the obtained power of the errors could not be rejected at a significance level of 5% (Kolmogorov-Smirnov statistics = 0.278, ; Anderson-Darling statistics = 1.178, ).

The value of the power of the errors changed in both directions across iterations and, as expected, never reached negative values (see Figure 2). The analysis of the evolution of the power of the errors as a function of iteration revealed that, even if identical values of $q$ are obtained in the first 29 iterations for the first two related samples (set1a and set1b, Figure 2), the pattern is not representative of the class of compounds. Thus, the pattern for set1c is significantly different from those observed on subsets of the whole class of compounds (set1a and set1b). Opposite behavior is also observed for the other two related samples (set4a and set4b): the value of $q$ increased to a maximum (iteration 10 for set4a) and decreased afterwards, while the value of $q$ decreased in steps for set4b.

Overall, two distinct patterns are observed in Figure 2. In the first pattern, the value of the power of the errors increases with iteration up to a peak and then decreases (sometimes in steps (set6, set7, and set9)); see set1a, set1b, set4a, set6, and set9 (Figure 2). In the second pattern, the power of the errors decreases in steps as the iteration number increases, as for set1c, set2, set3, set4b, set5, set8, and set10 (see Figure 2).

The plot of both regression lines (the simple linear regression line with its associated 95% confidence interval and the MLE regression line) for each investigated data set is presented in Figure 3.

The analysis of the regression lines presented in Figure 3 revealed that, in one case represented by set7, the assumption of linearity of the dependent variable with n-rings is violated and, for this dataset, simple linear regression is not the proper analysis. In 4 out of 13 cases, the SLR-MLE line is partly outside the 95% confidence boundaries of the SLR-LS line (set1a, set1c, set2, and set4b; Figure 3). Accordingly, it could be considered in all these cases that the SLR-MLE model is significantly different from the SLR-LS model. The SLR-MLE and SLR-LS lines overlap for set3, and no visual distinction between them is possible (Figure 3). For this set, the $q$ obtained by SLR-MLE was equal to 1.34 and proved not significantly different from the conventional value of 2 (see Table 3). For all other sets, the SLR-MLE line is within the boundaries of the 95% confidence interval of the SLR-LS line; thus, even if the powers of the errors proved significantly different from the conventional value of 2, these SLR-MLE models could not be considered significantly different from the SLR-LS models.

To conclude, the proposed approach of maximizing the probability of observing the event according to the random error fits the observed data well, and the power of the errors is frequently significantly different from the conventional value $q = 2$. However, no pattern could be identified between iteration and sample size on the investigated sets of pairs. It is expected that the recognized behavior of the power of the errors will be identified on other pairs, an analysis which is currently being conducted by our team. The relation presented in (6) thereby defines a new general approach to treat the relationships. Practically, the linear expression could be replaced with any expression of dependency (not just linear), such as
(i) exponential: for ;
(ii) double exponential: for ;
(iii) power: for ;
(iv) inversed: for .

The relation presented in (6) may also be extended to the multiple linear regression () when the expression becomes . If in the case of multiple linear regressions the classical method (minimizing the squared error) maximizes the correlation coefficient, the proposed approach (6) maximizes the probability of observing the event according to the random error. In view of that, (6) has a significant advantage compared to the classical approach. The classical approach that maximizes the correlation coefficient is exposed to type I errors; a model of regression could be accepted even if the model does not exist. On the contrary, the proposed approach that maximizes just the chance of observation (the approach has just one hypothesis: the error between the observation and the model must be random and its value does not depend on the size of the observed value) is not affected by a type I error. In the case of simple linear regression, application of (6) did not change the correlation coefficient between the dependent and independent variables but offers a solution in regard to the estimated value of $q$ and of the unknown coefficients (estimators of the population coefficients) that enter the relation between $x$ and $y$.
The relation proposed in this paper (6) introduced an additional parameter in the estimation, namely, the power of the errors of the Gauss-Laplace distribution (this decreases by one unit the degrees of freedom in the analysis of variance in the regression model).
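The idea of replacing the linear expression with another dependency can be sketched by passing the model as a function. This is an illustration only: the exponential form used here is a hypothetical example and not one of the paper's expressions, the density is assumed to be the exponential power form with standard deviation `sigma` and power `q`, and all names are assumptions.

```python
import math


def log_likelihood(model, theta, xs, ys, mu, sigma, q):
    """Log-likelihood of residuals y - model(x, *theta) - mu."""
    c0 = math.exp(0.5 * (math.lgamma(3.0 / q) - math.lgamma(1.0 / q)))
    c1 = q * c0 / (2.0 * math.exp(math.lgamma(1.0 / q)))
    total = len(xs) * math.log(c1 / sigma)
    for x, y in zip(xs, ys):
        residual = y - model(x, *theta) - mu
        total -= abs(c0 * residual / sigma) ** q
    return total


def linear(x, a):
    """Linear dependency used in the simple linear regression case."""
    return a * x


def exponential(x, a, b):
    """Illustrative non-linear dependency (hypothetical form)."""
    return a * math.exp(b * x)


if __name__ == "__main__":
    xs = [0.0, 1.0, 2.0]
    ys = [1.0, 2.7, 7.4]  # roughly exp(x)
    print(log_likelihood(exponential, (1.0, 1.0), xs, ys, 0.0, 0.5, 2.0))
    print(log_likelihood(linear, (3.0,), xs, ys, 0.0, 0.5, 2.0))
```

Only the residual definition changes between models, so the same search over the coefficients, `sigma`, and `q` applies to any dependency that can be evaluated at each observation.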

The MLE approach is frequently used in the estimation of unknown parameters and is known to be sensitive to outliers (influential compounds) in the data [46–48]. No outliers have been identified in the dependent variable of set2 and set3 [42, 46, 47]. Therefore, for these two sets of compounds, the proposed approach was not affected by outliers in the dependent variable. Evaluation of how the values in the investigated sets could lead to the identification of outliers (influential compounds [4, 31, 49]) was beyond the aim of the present study. The proposed approach proved useful in the estimation of SLR parameters and is now under evaluation by our team on different classes of compounds and relations to assess its behavior and robustness.

4. Conclusions

The proposed approach proved feasible for estimating the parameters of the simple linear regression without the assumption that the errors are normally distributed; this assumption is replaced by the more general one that the errors are Gauss-Laplace distributed. The obtained results demonstrated that in 12 out of 13 investigated cases the power of the errors is significantly different from the conventional value of two. However, the plot of the SLR-MLE and SLR-LS lines showed that, in just 3 out of 12 cases, the models are significantly different. The proposed approach can be further extended from simple linear regressions to multiple linear regressions.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgment

Dr. Alina C. Cozma is a fellow of POSDRU Grant no. 159/1.5/S/138776 entitled “Model colaborativ instituţional pentru translatarea cercetării ştiinţifice biomedicale în practica clinică, TRANSCENT.”

Supplementary Materials

The classical simple linear regression (SLR) uses the least squares method to estimate the a, µ, and σ coefficients (see Eq. (7)) using a value of the power of the errors equal to 2. The supplementary material contains lines of the program implemented in PHP to find the solutions of Eq. (6) (maximum likelihood estimation, MLE) starting with the values of the coefficients identified by Eq. (7). The program makes small changes to the values of the coefficients and selects the coefficients that maximize the MLE value.
