Abstract

The polymerase chain reaction is a central component of current molecular biology. It is a cyclic process, in each early cycle of which the template DNA approximately doubles. An indicator which fluoresces when bound to DNA quantifies the DNA present at the end of each cycle, giving rise to a fluorescence curve which is characteristically sigmoid in shape. The fluorescence curve quantifies the amount of DNA initially present; the more the initial DNA, the earlier the rise in the fluorescence. Accordingly the amount of DNA initially present in two samples can be compared: the sample with the less DNA gives rise to a relatively delayed fluorescence curve and the ratio of the DNAs can be deduced from the separation of the curves. There is, however, a second determinant of this separation, the fold increase in DNA per cycle: ideally a twofold increase but frequently less. Current guidelines recommend that this be determined experimentally by carrying out PCR on a series of dilutions. If the value of the fold increase is known, then the algorithm for determining the separation can be reduced to a relatively simple computation, rather than employing a multidimensional nonlinear optimization such as the Marquardt-Levenberg as currently employed.

1. Introduction

The polymerase chain reaction (PCR) introduced by Mullis et al. [1] is a cyclic process, each cycle of which involves three stages: denaturation, annealing (to a primer, possibly also to some forms of fluorescent probe), and extension. Optimally each strand of template DNA gives rise to its complement in each cycle: the amount of DNA doubles each cycle. In reality each strand is copied with probability $p \le 1$. If $p$ were constant, then the DNA present after $n$ cycles would be
$$D_n = D_0 (1 + p)^n, \tag{1}$$
where $D_0$ is the amount initially present. If two samples, A and B, which initially comprise $A_0$ and $B_0$ strands, have the same value of the probability $p$ and if they exhibit the same fluorescence after $n_A$ and $n_B$ cycles, respectively, then $A_0(1+p)^{n_A} = B_0(1+p)^{n_B}$, implying
$$\frac{A_0}{B_0} = E^{\,n_B - n_A}, \tag{2}$$
where $E = 1 + p$ is the fold increase per cycle.
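
As a numerical illustration (our example, for concreteness): with perfect doubling, $E = 2$, a separation of $n_B - n_A = 3.32$ cycles corresponds to a ratio $A_0/B_0 = 2^{3.32} \approx 10$; sample B, whose curve rises later, initially contained about one-tenth of the DNA in sample A.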

In practice this simple “exponential” model for the growth in fluorescence is complicated by three factors. As resources are depleted and as template DNA strands bind to their complements rather than to primer, the value of $p$ in (1) diminishes: the fluorescence curve is sigmoid rather than exponential. There is a background fluorescence present initially and not indicative of DNA, and this background varies between replicates. There is also, between replicates, a variation in scale, in that the plateau level that the fluorescence approaches also varies. These considerations give rise to a collection of models for the fluorescence curve seen in PCR. The open-source facility “qpcR” [2] offers ten nonlinear sigmoidal models (and nine others) that can be fitted to the fluorescence data using the Marquardt-Levenberg [3–5] algorithm.

Notwithstanding the utility of the Marquardt-Levenberg algorithm and of open-source utilities such as qpcR which implement it, many scientists prefer to write and understand their own analytic software. To this end we develop, in the following sections, justification for a four-parameter model. When, as recommended by MIQE guidelines [6], the amplification efficiency is known, this reduces to a three-parameter model in which two of the parameters are linear; there is only one nonlinear parameter. The calculations are then greatly simplified and the multidimensional Marquardt-Levenberg algorithm becomes unnecessary; nonlinear optimization in a single dimension is trivial, as the problem reduces to maximizing a correlation. Finding a zero of a function is, however, even more elementary than finding a maximum, and we derive the function, the zero of which defines the value of the required parameter.

Regardless of whether the adopted approach is finding the maximum correlation or the zero of a function, the linear parameters are obtained algebraically by simple linear regression. The nonlinear search, whether for a maximum or a zero, is in a single dimension and its convergence can be guaranteed.

2. Theoretical Development

In the following we review briefly the exponential model for amplification and the justification of a more realistic sigmoid model. We note that finding the single nonlinear parameter is a “partially linear” problem, and this gives rise to several approaches to its estimation: an approximate algebraic method (without iteration) and more exact methods involving iteration. A modicum of algebra then simplifies the problem to one of finding a zero rather than a maximum. We then apply the methods to the estimation of the nonlinear parameter using real data in order to establish three things. First, do the maximum method and the zero method arrive at the same parameter value, and is this the same value returned by the more usual Marquardt-Levenberg approach? Second, to what extent does iteration improve the algebraic approximate estimator: how much is gained by the extra computation required? Finally we determine computational times: our algorithm is easier to code, but can it compete with the established Marquardt-Levenberg algorithm?

2.1. The Exponential Model

If, as envisaged during the early cycles, the amount of DNA increases geometrically, the increase is proportional to the amount present, and the corresponding continuous function, $F(n)$, satisfies the differential equation
$$\frac{dF}{dn} = kF \tag{3}$$
for some constant $k$. Then it is elementary that a solution is
$$F(n) = F_0 E^n, \tag{4}$$
where $F_0$ is the amount initially present and $E = e^k$ is the fold increase per cycle.

2.2. Resource-Limited Growth: A Sigmoid Model

If limiting resources, or other factors, limit growth such that fluorescence asymptotically approaches some limit, then the differential equation
$$\frac{dF}{dn} = kF(1 - F) \tag{5}$$
is perhaps the simplest to describe such a situation; for small $F$, growth per cycle corresponds to (4) above, but growth then diminishes as $F$ approaches an asymptote (here, an asymptote of one). Equation (5) has the solution (Bernoulli’s method, see Appendix A)
$$F(n) = \frac{1}{1 + e^{-k(n - n_{1/2})}} \tag{6}$$
for some $n_{1/2}$. Looking at (6), $n_{1/2}$ is the value of $n$ for which $F = 1/2$. For early cycles ($F \ll 1$) the fold increase per cycle is $E = e^k$ as before. If we add to this a baseline fluorescence, $y_0$, which may vary between replicates, and a scale, $a$, which may also vary between replicates (both of which are therefore essentially nuisance parameters), we have been describing the fluorescence curve
$$y(n) = y_0 + \frac{a}{1 + e^{-k(n - n_{1/2})}}, \tag{7}$$
which is a four-parameter sigmoid curve and one of the more widely used models of PCR fluorescence, corresponding to model b4 of the qpcR package. As before, $E = e^k$ is the fold increase per cycle when $F$ is small, and if this is known from a dilution series, then $E$, or equivalently $k = \ln E$, is known and there are only three unknown parameters: $y_0$, $a$, and $n_{1/2}$.
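
To make (7) concrete, here is a minimal R sketch of the four-parameter curve; the names (y0 for the baseline, a for the scale, k = log(E), n12 for the half-rise cycle) are our illustrative choices, not the authors' code or the qpcR parameterization.

## Four-parameter sigmoid of (7): baseline plus scaled logistic.
b4_curve <- function(n, y0, a, k, n12) {
  y0 + a / (1 + exp(-k * (n - n12)))
}

## With E known from a dilution series, k = log(E) is fixed and only
## y0, a, and n12 remain to be estimated.
E <- 1.9                      # e.g. a 1.9-fold increase per cycle
k <- log(E)
curve(b4_curve(x, y0 = 0.1, a = 1, k = k, n12 = 29),
      from = 1, to = 45, xlab = "cycle", ylab = "fluorescence")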

2.3. The Partially Linear Problem

Prior knowledge of $E$ radically alters our approach, because there is then only one parameter entering the problem nonlinearly. This situation arises not only following the prior determination of $E$ by dilution series, as in the MIQE guidelines, but also during the analysis of a dilution series where $E$ is determined by fixed-point convergence [7]. Note that we can write (7) in the form
$$y(n) = y_0 + a\,\phi(n), \tag{8}$$
where $\phi$ denotes the logistic curve
$$\phi(n) = \frac{1}{1 + e^{-k(n - n_{1/2})}}, \tag{9}$$
so that, for any given value of $n_{1/2}$, the value of $\phi$ at any cycle is known, and fitting the model values to the observed fluorescence values is only a problem of simple linear regression. The optimal value of $n_{1/2}$ is then the value for which the residual sum of squares is minimized.
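
As an illustration of this partially linear structure, a minimal R sketch (the function and variable names are ours): for a trial value of n12, with k known, phi is a fully known regressor, so y0 and a follow by ordinary linear regression and only the residual sum of squares needs to be computed.

## Residual sum of squares as a function of the single nonlinear
## parameter n12; 'cycles' and 'y' hold one curve's cycle numbers
## and fluorescences.
rss_for_n12 <- function(n12, cycles, y, k) {
  phi <- 1 / (1 + exp(-k * (cycles - n12)))   # known regressor, eq. (9)
  fit <- lm(y ~ phi)                          # linear in y0 and a
  sum(residuals(fit)^2)
}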

Regardless of the algorithm used in the computation, all the iterative procedures used here converge on the least-squares estimates of $y_0$, $a$, and $n_{1/2}$. That the least-squares estimate of $n_{1/2}$ also maximizes the Pearson product-moment correlation coefficient follows from a well-known identity. Briefly, if the data are pairs $(x_i, y_i)$ and we regress the dependent variate $y$ on the independent $x$, then the coefficient of determination is denoted $R^2$ and defined by
$$R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2},$$
and the Pearson correlation coefficient is denoted $r$ and defined by
$$r = \frac{\sum_i (x_i - \bar x)(y_i - \bar y)}{(N - 1)\, s_x s_y},$$
where $\bar x$, $\bar y$, $s_x$, and $s_y$ are the means and standard deviations, respectively. The identity states that the coefficient of determination is, under these circumstances, equal to the square of the correlation coefficient. That is,
$$R^2 = r^2.$$
It is immediate that, where the residual sum of squares depends also on a parameter $n_{1/2}$, the value which minimizes the residual sum of squares must maximize the correlation coefficient. An elementary proof of the identity appears in our Github repository [8].
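
The identity is easily checked numerically in R (simulated data, for demonstration only):

## For simple linear regression, R-squared equals the squared Pearson
## correlation between x and y.
set.seed(1)
x <- 1:20
y <- 3 + 2 * x + rnorm(20)
summary(lm(y ~ x))$r.squared   # coefficient of determination
cor(x, y)^2                    # squared correlation: the same value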

But having determined $n_{1/2}$, what is its experimental significance? Looking at (9), note that when $n = n_{1/2}$, we have $\phi = 1/2$; that is, $n_{1/2}$ is the fractional cycle at which the growth of fluorescence above baseline is half of its maximum. We can also show that it is the point of maximum slope of the fluorescence curve. Two samples being compared have the same template DNA and the same value of $k$; they differ only in their values of $n_{1/2}$ (and in the nuisance parameters $y_0$ and $a$, which do not influence $n_{1/2}$). For two samples A and B having values $n_{1/2,A}$ and $n_{1/2,B}$, the shift of one curve relative to the other is $\Delta n = n_{1/2,B} - n_{1/2,A}$.

2.4. The Approximate Estimator of $n_{1/2}$

Reducing the problem to finding the optimal value of a single nonlinear parameter does not avoid the necessity of finding an approximate starting value. Given that $n_{1/2}$ is the value at which the slope of the curve is greatest, we conjecture that the appropriate starting value for a curve such as that in Figure 1, from the publicly available batsch1 data set, would be at about cycle 29. Figure 2 shows the correlation between the observed fluorescence and $\phi$ for values of $n_{1/2}$ between 1 and 45. As expected, the maximum correlation is around the expected value of 29. More precisely, if we take the curve near its maximum to be quadratic, we can interpolate between the maximum cycle and the two adjacent ones to derive an interpolated fractional cycle at which the correlation would be greatest; this interpolated value is taken as the approximate estimate of $n_{1/2}$.
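
A minimal R sketch of this interpolation (names ours; it assumes the maximum correlation occurs strictly inside the grid):

## CorQuad: correlation at whole-cycle values of n12, then the vertex
## of the parabola through the best cycle and its two neighbours.
corquad <- function(cycles, y, k, grid = seq_along(cycles)) {
  r <- sapply(grid, function(n12)
    cor(y, 1 / (1 + exp(-k * (cycles - n12)))))
  i <- which.max(r)
  grid[i] + 0.5 * (r[i - 1] - r[i + 1]) /
    (r[i - 1] - 2 * r[i] + r[i + 1])
}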

In the following analysis we will refer to this approach, locating the approximate position of the maximum correlation by quadratic interpolation, as the CorQuad algorithm: it is algebraic rather than iterative and gives only an approximation to the least-squares estimate to which all the other (iterative) algorithms converge.

2.5. The Optimal $n_{1/2}$ by Iterated Maximum

Finding the maximum (or minimum) of a nonlinear function by iteration is routine once a good first approximation is known. In Figure 2, for instance, iteration arrives at the maximum correlation at a fractional value of $n_{1/2}$ close to the interpolated estimate above. The nonlinear optimization here is in one dimension, for which we have used the R function optimize() (a mixture of golden section search and successive quadratic interpolation), and we refer to this algorithm as CorOpt. Once the least-squares estimate of $n_{1/2}$ is known, the least-squares estimates of $y_0$ and $a$ are found algebraically by simple linear regression.
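
A sketch of CorOpt along the same lines (names ours; the bracketing interval around the CorQuad estimate is an arbitrary illustrative choice):

## CorOpt: one-dimensional refinement of n12 with optimize(),
## maximizing the correlation between y and phi.
coropt <- function(cycles, y, k) {
  start <- corquad(cycles, y, k)   # CorQuad sketch above
  optimize(function(n12) cor(y, 1 / (1 + exp(-k * (cycles - n12)))),
           interval = c(start - 2, start + 2), maximum = TRUE)$maximum
}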

2.6. The Optimal $n_{1/2}$ as the Root of a Function

Even simpler than finding the maximum of a function is finding the zero of a function. The optimal value of $n_{1/2}$ is that which, in maximizing the correlation, equivalently minimizes the residual sum of squares between the expected fluorescence and that found experimentally. Denoting by $r_i$ the error or residual by which the $i$th observation differs from the model value $y_0 + a\,\phi(n_i)$, the outcome of Appendix B is that if we define the function
$$\psi(n) = \phi(n)\,\bigl(1 - \phi(n)\bigr),$$
where $\phi$ is the logistic as defined in (9), then the optimal value of $n_{1/2}$ satisfies
$$\sum_i \psi(n_i)\, r_i = 0,$$
which is to say that the vector $\boldsymbol\psi = (\psi(n_1), \ldots, \psi(n_N))$ is orthogonal to the vector of residuals $\mathbf r$.
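
A sketch of this zero approach in R (names ours): the dot product psi . r as a function of n12, with the residuals obtained from lm() as in SlowZ; uniroot(), which implements Brent's method, then locates the sign change, provided the supplied bracket straddles the root.

## psi . r as a function of the trial half-rise cycle n12.
zdot <- function(n12, cycles, y, k) {
  phi <- 1 / (1 + exp(-k * (cycles - n12)))
  res <- residuals(lm(y ~ phi))
  sum(phi * (1 - phi) * res)       # zero at the least-squares n12
}
n12_root <- function(cycles, y, k, lower, upper)
  uniroot(zdot, c(lower, upper), cycles = cycles, y = y, k = k)$root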

Figure 3 is a plot of the dot product $\boldsymbol\psi \cdot \mathbf r$ for values of $n_{1/2}$ from 1 to 45, demonstrating its zero at the appropriate value of $n_{1/2}$. Two algorithms using this approach, but with minor coding differences to illustrate variations in computational time, are hereafter referred to as SlowZ and FastZ.

3. Materials and Methods

Analysis was carried out using the R programming environment [9] under GNU/Linux Ubuntu 14.04 LTS together with the R graphics package ggplot2 [10] and the analysis package qpcR [2]. The Marquardt-Levenberg algorithm used for comparison uses the function nls.lm() found in the minpack.lm package [11], and computational time calculations employed the Rprof() function in the R.utils package [12]. Computational time depends heavily upon coding technique, particularly in high-level languages like R. We take the liberty of coding the zero approach in two ways. One (SlowZ) uses the notoriously slow mean() and lm() functions in R for calculating the mean and performing simple linear regression. The other (FastZ) codes these directly, using $\sum_i x_i / N$ for the mean and the usual closed-form formula for simple linear regression, avoiding the computational overhead of the various checks carried out by the above R functions. The other three approaches are Marquardt-Levenberg (ML) and the maximum correlation developed above, either with a quadratic estimate for the maximum (CorQuad) or with an iterative approach (CorOpt) to the maximum using the optimize() function in R; both use the cor() function of R, known to be very slow.
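
For illustration, the FastZ inner loop might look like the following (our sketch, not the authors' code): the mean and the simple-regression coefficients are coded directly, avoiding the checks performed by mean() and lm().

## psi . r with hand-coded mean and regression, as in FastZ.
zdot_fast <- function(n12, cycles, y, k) {
  phi  <- 1 / (1 + exp(-k * (cycles - n12)))
  N    <- length(y)
  mphi <- sum(phi) / N                                        # mean regressor
  my   <- sum(y) / N                                          # mean fluorescence
  a    <- sum((phi - mphi) * (y - my)) / sum((phi - mphi)^2)  # slope
  y0   <- my - a * mphi                                       # intercept
  sum(phi * (1 - phi) * (y - y0 - a * phi))                   # psi . residuals
}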

We use as data thirteen files from the publicly available data sets distributed with the qpcR package (batsch 1–5, boggy, guescini1, lievens1, reps, reps2, reps3, rutledge, and sisti1) together with ten from our own laboratory available through our online repository at Github [8].

In all there are 836 fluorescence curves. For each we have five ways of determining $n_{1/2}$, given a known $E$.

We aim to address three issues.
(i) We must confirm that the four iterative approaches all converge towards the same least-squares estimates of the parameters $y_0$, $a$, and $n_{1/2}$.
(ii) We must then determine the extent to which the approximate approach CorQuad, which involves no iteration, differs from the more precise iterative methods.
(iii) We should compare computational times for the five algorithms.

We ran all five methods for each fluorescence curve in each of the data sets. For determining computational time using Rprof() we cycled through 2000 iterations of each data set. Times quoted are from an Intel Core 2 Duo 3.0 GHz machine.

4. Results and Discussion

As expected, the differences between the values of $n_{1/2}$ estimated by the iterative approaches were very small: over the 836 curves analysed, both the median difference and the biggest difference amounted to negligible fractions of a cycle. As expected theoretically, the iterative approaches all converge on the same least-squares parameter estimates. The only differences between the algorithms are ease of understanding, ease of coding, and computational time or rapidity of convergence.

We were surprised to find very little difference between the algebraic approximation and the iterative methods: across the 836 comparisons, both the median and the biggest difference in the estimate of $n_{1/2}$ were small fractions of a cycle.

We would emphasize, however, that the data analysed here are of the quality that researchers are happy to make publicly available. The day-to-day results from a working laboratory may well include data sets exhibiting more pathological behaviour. Of course, the algebraic approach could be refined by taking steps smaller than a whole cycle in calculating the correlation curve, but even with whole-cycle steps the method is more than acceptably accurate. It is trivially easy to implement on a spreadsheet.

As expected, finding a zero of a function in one dimension is computationally faster than finding the maximum of a function in three dimensions; our algorithm, even with elementary coding, is faster than the Marquardt-Levenberg. The average times per analysed curve were 14.65 msec for SlowZ, 4.126 msec for CorOpt, 2.448 msec for CorQuad, 0.4280 msec for Marquardt-Levenberg, and 0.3344 msec for FastZ; the Marquardt-Levenberg took 28% more time than FastZ.

These times will vary, depending among other things on the shape of the experimental curves being analysed. We have therefore made comparisons individually for each of the 23 data sets; Figure 4 shows a boxplot of the computational times for the algorithms relative to that of the fastest, the FastZ algorithm. The interquartile ranges are 19.5 to 20.64 for CorOpt, 10.62 to 11.52 for CorQuad, 1.21 to 1.37 for ML, and 43.6 to 46.6 for SlowZ. Of course, one can generate any relative difference in computational time by modifying the tolerance at which iteration terminates. Here, Marquardt-Levenberg tolerance was such that its mean residual was greater than that for FastZ. That is to say, even after, on average, 28% more time, it was still further from the optimum than was FastZ.

5. Conclusions

Although the proof of the root-finding iterative method (Appendix B) is more demanding, it is much less demanding than the logic underlying the Marquardt-Levenberg algorithm. The extra programming needed to calculate the vector $\boldsymbol\psi$ is, for many high-level languages like R, a single line of code. The logic behind Brent’s method [13] for root-finding is outlined in three pages of Press et al. [5]. We suggest that the algorithms developed here allow groups with a modicum of programming ability to analyse their qPCR data with less recourse to proprietary “black-box” software. Although our approach is faster than the Marquardt-Levenberg commonly used, we do not see speed as its principal advantage. If speed is important, the solution should be a faster computer, the use of compiled code, or programming in a faster language such as C++.

Appendices

A. Bernoulli’s Method for (5)

The differential equation
$$\frac{dF}{dn} = kF(1 - F)$$
is a Bernoulli equation. Making the substitution $u = 1/F$, so that $du/dn = -(1/F^2)\,dF/dn$, gives
$$\frac{du}{dn} = k(1 - u).$$

Separating variables and integrating, we have
$$-\ln|1 - u| = kn + c, \quad\text{that is,}\quad u = 1 + Ce^{-kn}.$$

Recasting the constant of integration as $C = e^{k n_{1/2}}$ for some $n_{1/2}$ yields
$$F(n) = \frac{1}{u} = \frac{1}{1 + e^{-k(n - n_{1/2})}}.$$

B. Derivation of the SlowZ and FastZ Algorithms

We seek the $n_{1/2}$ that minimizes the sum of squares of residuals
$$S = \sum_i r_i^2$$
and have asserted that the required $n_{1/2}$ satisfies $\boldsymbol\psi \cdot \mathbf r = 0$, where
$$\psi(n) = \phi(n)\,\bigl(1 - \phi(n)\bigr).$$
Let $\boldsymbol\psi$ denote the vector $(\psi(n_1), \ldots, \psi(n_N))$.

By construction the linear regression has resolved the observed vector of fluorescences, $\mathbf y$, into components
$$\mathbf y = X\boldsymbol\beta + \mathbf r,$$
where $X\boldsymbol\beta$ lies in the solution space spanned by the columns of $X$ and $\mathbf r$ is orthogonal to that space. The vector of parameters is $\boldsymbol\beta = (y_0, a)^T$, the first column of $X$ is the vector of ones, $\mathbf 1$, and the second is $\boldsymbol\phi = (\phi(n_1), \ldots, \phi(n_N))^T$, where
$$\phi(n) = \frac{1}{1 + e^{-k(n - n_{1/2})}}.$$
The vector of residuals can be expressed as $\mathbf r = \mathbf y - y_0\mathbf 1 - a\boldsymbol\phi$.

Then
$$\frac{dS}{dn_{1/2}} = \frac{d}{dn_{1/2}}\,(\mathbf r \cdot \mathbf r) = 2\,\mathbf r \cdot \frac{d\mathbf r}{dn_{1/2}}.$$

So
$$\frac{dS}{dn_{1/2}} = 2\,\mathbf r \cdot \left( \frac{d\mathbf y}{dn_{1/2}} - \frac{dX}{dn_{1/2}}\,\boldsymbol\beta - X\,\frac{d\boldsymbol\beta}{dn_{1/2}} \right).$$
But the first term above is zero because $\mathbf y$ is the vector of observations and does not depend on $n_{1/2}$, and the third term is zero because by construction $\mathbf r$ is orthogonal to the space spanned by the columns of $X$. Considering the middle term, the first column of $X$ is independent of $n_{1/2}$, so
$$\frac{dX}{dn_{1/2}}\,\boldsymbol\beta = a\,\frac{d\boldsymbol\phi}{dn_{1/2}}.$$

But
$$\frac{d\phi(n)}{dn_{1/2}} = -k\,\phi(n)\,\bigl(1 - \phi(n)\bigr) = -k\,\psi(n).$$

Hence
$$\frac{dS}{dn_{1/2}} = 2ak\,(\boldsymbol\psi \cdot \mathbf r),$$
which is zero precisely when $\boldsymbol\psi \cdot \mathbf r = 0$.
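
The result is readily verified numerically; in R, on a simulated curve (all names and values ours), the analytic gradient $2ak\,(\boldsymbol\psi \cdot \mathbf r)$ agrees with a central finite difference of $S$:

## Analytic gradient of the residual sum of squares versus a
## central finite difference (simulated curve).
rss <- function(n12, cycles, y, k) {
  phi <- 1 / (1 + exp(-k * (cycles - n12)))
  sum(residuals(lm(y ~ phi))^2)
}
grad <- function(n12, cycles, y, k) {
  phi <- 1 / (1 + exp(-k * (cycles - n12)))
  fit <- lm(y ~ phi)
  2 * coef(fit)[2] * k * sum(phi * (1 - phi) * residuals(fit))
}
set.seed(2)
k <- log(1.9); cycles <- 1:45
y <- 0.1 + 1 / (1 + exp(-k * (cycles - 29))) + rnorm(45, sd = 0.01)
h <- 1e-4
(rss(20 + h, cycles, y, k) - rss(20 - h, cycles, y, k)) / (2 * h)
grad(20, cycles, y, k)   # should match the finite difference above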

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This work was funded by Grant APP1022720 from the National Health and Medical Research Council of Australia. The authors gratefully acknowledge the contributions of two anonymous reviewers whose suggestions, we hope, clarified several issues.