Abstract

Atomic force microscopy (AFM) is a high-resolution scanning technology, and the measured data are a set of force curves, which can be fitted with a piecewise curve model and be analyzed further. Most methods usually follow a two-step strategy: first, the discontinuities (or breakpoints) are detected as the boundaries of two consecutive pieces; second, each piece separated by the discontinuities is fitted with a parametric model, such as the well-known worm-like chain (WLC) model. The disadvantage of this method is that the fitting (the second step) accuracy depends largely on the discontinuity detection (the first step) accuracy. In this study, a sparse representation model is proposed to jointly detect discontinuities and fit curves. The proposed model fits the curve with a linear combination of parametric functions, and the estimation of the parameters in the model can be formulated as an optimization problem with -norm constraint. The performance of the proposed model is demonstrated by the fitting of AFM retraction force curves with the WLC model. Results shows that the proposed method can segment the force curve and estimate the parameter jointly with better accuracy, and hence, it is promising for automatic AFM force curve processing.

1. Introduction

The atomic force microscopy (AFM) is a high-resolution scanning probe microscopy [13], which is widely used in life science [4], chemistry [5], and nanotechnology [6]. A typical AFM instrument consists of five components: laser generator, probe, cantilever, piezoelectric scanner, and photodiode detector. The testing sample is mounted above the piezoelectric scanner, which can move in an approaching manner toward the probe or a retracting manner away from the probe. When the sample surface and the probe are close enough, the interaction force between them yields the cantilever deflecting toward (attractive force) or away (repulsive force) from the probe. The deflection can be calculated from the output of the photodiode detector, which measures the laser reflected by the cantilever. The characteristics (e.g., the Hooke constant) of the cantilever are usually known in advance, so the interaction force can be calculated from the deflection of the cantilever. As a result, the measurement of AFM is the interaction force versus piezoelectric scanner indentation .

There are several parametric models describing the relationship between the interaction force and indentation. For the approaching force curve, there are models such as the Hertz model, Johnson–Kendall–Roberts model, and Derjaguin–Müller–Toporov model [3]. For the retracting force curve, the worm-like chain (WLC) model and freely jointed chain (FJC) received a wide range of interests [3], which describe the elastic behavior of polymers. As mentioned in the previous study [7], the low-load, moderate-load, and high-load regions of an approaching force curve can be modeled as the exponential model, Hertz model, and Hooke model, respectively; each piece between the two neighboring jumps of a retracting force curve can be modeled as either WLC or FJC.

The processing of AFM data is a typical curve fitting problem, which aims to estimate the parameters of the given models that can best fit the curve. Those parameters can help investigators better understand the biochemical and biophysical properties of the biochemical samples or nanomaterials. There are lots of methods to estimate parameters specific to different models. When the model is linear with respect to its parameters, the noisy data can be transformed into the parametric space, in which the parameters are estimated, e.g., the classical Fourier analysis. When the model is nonlinear with respect to its parameters, one can minimize the difference between the noisy output and the ideal model, which is known as regression analysis.

Since the AFM force curves usually consist of piecewise segments, we focus on the piecewise curve fitting with given parametric models. One well-known technique is the spline fitting [8, 9] for piecewise smoothing. However, spline is constrained to polynomial functions and needs to have the prior knowledge of discontinuities (or break points, change points, and knots in spline terminology), which is often unknown. Polyakov et al. [7] proposed a two-step strategy. In the first step, discontinuities were detected from noisy curves. To detect the discontinuities, a set of functions consisting of piecewise polynomials were used to fit the force curve. The piecewise polynomials are the ensemble of Heaviside step functions, the first, the second, , and the th degree integration of Heaviside step functions, where is the highest degree of the polynomial. In the second step, the noisy curve was segmented into pieces by the detected discontinuities, and then, each piece was fitted with a parametric model by nonlinear optimization.

Another way of parameter estimation is clustering in the space of model parameters directly, and hence, curve fitting can be avoided. A method of such kind was proposed by Maity et al. [10], in which the contour length of the WLC model is estimated for each data point on a force curve by solving a third-order polynomial equation, and then, the force versus contour length scatter plot is obtained, and the contour lengths are finally estimated by clustering. Since the segmentation procedure in the curve fitting is replaced by clustering, the performance of such method highly depends on that of clustering, which is still an open question.

This study is organized as follows: in Section 2, the curve fitting problem is formulated as a sparse representation model, and then, propose a systematic method to find the best fitting. In Section 3, the proposed method is applied to process a retracting force curve from AFM, which can be fitted with the WLC model. The study is concluded in Section 4.

2. Methodology

2.1. Worm-Like Chain (WLC)

In stretching experiments, several jumps may present in the retracting force curve. This phenomenon is caused by the bound between macromolecule and the probe (similar to that of Velcro). Each jump represents a detachment from the surface.

There are several models describing this phenomenon, among which the WLC and FJC models received wide interests. Since the FJC model was intensively studied [7], this study only focuses on the WLC model. However, the method can be applied to the FJC model straightforward. The WLC model considers a polymer as an elastic cylinder with a constant bending elasticity and of constant length. The relationship between the interaction force and extension readswhere is the Boltzmann constant, is the absolute temperature, is the contour length, and is the persistence length. In the stretching experiments, one would like to estimate the persistence length and contour length from experimental data and . Figure 1(a) shows the WLC model, which shows that the WLC model exhibits an asymptotic behavior, with asymptote . Figure 1(b) shows a typical AFM retracting force curve. Each piece between two jumps can be fitted with a WLC model. Because of this asymptotic behavior, two-step methods suffer from modeling error, since the polynomial function set was used to detect the discontinuities. Instead, in this study, the set of WLC signals is proposed to detect discontinuities and to fit curves jointly.

2.2. From 2D to 1D Estimation

It is noticeable that, for a given , the term in the parenthesis is fixed. So from experimental data and , the coefficient can be easily estimated by solving a least square problem, yielding estimate . Thus, the 2D (two parameters and ) problem (1) can be simplified to a 1D function . Based on this, the principle of the proposed method is as follows: first, a signal set (or dictionary) is built by sampling contour length within the feasible region with high intensity (large ), i.e.,whereis the term in the parenthesis in function (1). Then, the force curve can be fitted with a linear combination of : , where is the coefficient and can be formulated as the problem in equation (5), where each scalar is an entry of the vector x.

Once the estimate of , i.e., is known, for those who are nonzeros, the corresponding estimate of contour length is , and the corresponding persistence length can be estimated as

2.3. Sparse Representation Model

In a sparse representation model, one would like to approximate a given signal using a limited number of functions from a dictionary . Here, signal f and dictionary F are the vectorized version of and by discretization . This approximation problem is usually formulated as an constrained least square optimization problem:where is the vector which gathers the fitting coefficients, and denotes the number of nonzero entries in vector x, i.e., the −0 norm of x; is the sparsity level (or model degree), which controls the tradeoff between the model complexity and fitting quality. It is easy to see that larger yields better fitting quality, while smaller yields less complexity of the model (use less columns in matrix F). For discrete Fourier transform (DFT) and discrete cosine transform (DCT) [11], the columns of F are complex exponential and cosinusoids functions with different frequencies, respectively; for deconvolution problems, F is a Toeplitz matrix formed from the impulse response function [12]; and in compressed sensing, F is a random sampling matrix [13, 14]. In our WLC fitting problem mentioned above, each column in dictionary F is generated by sampling the parametric model given by equation (2).

2.4. Model Solver

The optimization problem (5) was proved to be NP-hard [15], and the optimal solution can only be found by performing an exhaustive search of all the combination of the columns in F. The exhaustive search is time-consuming even when the dimension of the solution vector is moderate (in the magnitude of hundreds). In real applications, the dimension is often more than thousands, so exhaustive search is unrealistic. Instead, heuristical methods, such as matching pursuit (MP) [16], orthogonal matching pursuit (OMP) [17], and orthogonal least square (OLS) [18], were proposed to get the suboptimal solutions. However, when the correlation between the columns in the matrix F (or mutual coherence [19]) is high ( for F), the performance of these methods are not satisfactory.

Therefore, single best replacement (SBR) [12] and continuation of SBR [20] were developed to tackle the problem (5). SBR is a forward-backward algorithm. In each iteration, one column in the matrix F is included into or excluded from the so-called active set. Only the columns in the active set are used to fit the signal f by least squares, yielding nonzero coefficients in x. The coefficients corresponding to columns excluded from the active set are assigned to be zeros. The criterion to determine which column shall be included into or excluded from the active set is the deepest-descent of the cost function.

The highlight of continuation of SBR is an efficient way to explore the solution set by increasing the model degree , which controls the tradeoff between the model complexity and the fitting quality. The output of the continuation algorithm is a set of solutions with respect to model degree , and each solution is a tradeoff profile, which obeys . For the WLC fitting problem, actually represents the number of segments. For more details of SBR and continuation SBR, readers are referred to our previous publications.

Once the solution set is ready, the following issue is to select a proper solution or to determine the best model degree .

2.5. Model Degree Determination

The choice of the model degree , which is usually unknown in advance, is an important issue. A relative high model degree yields overfitting, while a low model degree yields underfitting. For the curve fitting problem is this study, represents the number of pieces in an AFM force curve.

There are several ways to determine ; most notable one is the information criteria such as Akaike information criterion (AIC) [21], Bayesian/Schwarz information criterion (SIC) [22] (also known as minimal description length (MDL)), Hannan and Quinn criterion (HQC) [23], Draper information criterion (DIC) [24], and others [25]. All these criteria are essentially penalized logarithm likelihood estimates with the following formulation [26, 27]:where is the logarithm likelihood estimate and is equal to , and is the fitted curve. is equal to 2, , and , for AIC, BIC, and HQC, respectively. The best model degree is achieved where an information criterion reaches its minimum. Since the results of previous subsection is a set of solutions , problem (6) can be solved straightforward.

3. Real Data Processing

AFM is widely used in molecular and cell biology [28]. In this section, the proposed method described above is shown to process a real AFM force curve (Figure 1(b)) sampled on Xenopus laevis oocytes with single-molecule force spectroscopy (SMFS) at room temperature (, so average ), which aims to study the cyclic nucleotide-gated channel subunit alpha 1 (CNGA1) [10, 29]. CNGA1 consists of several secondary structures, and its conformation changes by the stretching force, yielding several jumps in the force curve. Each piece between the two neighboring jumps can be described with the worm-like chain model.

3.1. Raw Data Preprocessing

The raw data were first preprocessed with the following steps: (1) reorder the data points with respect to , such that they arrange in an ascent manner; (2) resample the force curve at an uniform grid of (four samples per nanometer since the raw force curve’s sample step is nanometer) using linear interpolation; (3) align the “contact point” to origin; (4) remove the baseline by subtracting a linear function, such that the null interaction force region (right part of the force curve) is horizontal, and the mean of this region is zero; and (5) keep only the right part of the force curve whose adjusted position () is nonnegative. A force curve f of length after preprocessing is shown in Figure 1(b).

3.2. Worm-Like Chain Dictionary Building

In a real situation, the force curve cannot reach the asymptote , where . In order to avoid numerical instability, when building the dictionary F, an additional parameter is introduced, such that function in (3) is evaluated only at the interval and linearly declines to zero at the interval . In our real data, is  nm with a step of 1 nm. For contour length , the feasible region is and is sampled with a step of 1 nm. Following these settings, for each given and , a vector of length is sampled and appended to the matrix F as a column. Each vector is sampled as follows (Figure 1(a), the red bold line): at the interval , function in (3) is sampled; at the interval , the samples are set to be zero; and in between, i.e., at the interval , the samples are linearly sampled, where is the maximum of in the force curves ( nm in our experiment). All the data points are sampled at the uniform grid with a step of four points each nanometer, yielding a vector of length . Above all, after exploring all the configurations of and , a WLC dictionary of size is built.

3.3. Curve Fitting and Parameter Estimation

Once the force curve f and the dictionary F are given, the continuation of SBR [12, 20] was used to solve the sparse representation problem (5), yielding a set of solutions . For each solution , the fitted curve can be calculated as . Note that the sparsity level equals to the number of discontinuities in the fitted force curve in our WLC fitting problem. Since each column in F contains a discontinuity, the total number of discontinuities in equals to the number of selected columns in F, i.e.,, or . Information criteria aforementioned were employed, but neither provide a meaningful fitting results; hence, was used; equivalently, . Figures 1(d) and 1(e) show the fitted curves with and 6, respectively.

The contour length of each piece can be retrieved from the corresponding column in F, and the persistence length can be estimated following equation (4). Table 1 provides the estimates of contour length and persistence length of six WLC pieces in Figure 1(e).

The fitting results show that the proposed method successfully detects the discontinuities with good fitting quality. Comparing Figures 1(d) and 1(e), one can see that with the increase of , one more discontinuity was detected, yielding an improved fitting which consists of six WLC pieces.

The proposed method was also compared with the method proposed by Maity et al. [10], which is introduced briefly in the Appendix, and whose result is shown in Figure 1(f). In this panel, six clusters are shown, indicating the six WLC pieces with contour lengths as the red vertical lines. A t-test shows that the estimates of contour length with both methods are consistent. However, Maity et al.’s method cannot estimate persistence length, and hence, it assumes a constant.

4. Conclusion and Discussion

This study proposed a method to fit a piecewise curve with parametric models. Different from many existing methods, a linear combination of several model functions (with parameters to be determined) from the dictionary was used to fit the curve, which was formulated as a sparse representation problem. By employing the continuation of the SBR algorithm, a set of solutions with different sparsity levels was obtained. The proposed method was applied to process the retracting force curve from AFM and was compared with a state-of-art method.

One advantage of the proposed method is that it can perform both the discontinuity detection and curve fitting at the same time. Most common methods use two separate steps, i.e., discontinuity detecting followed by curve fitting, and hence, the fitting result largely depends on the detection of discontinuities. For example, when using the piecewise polynomial functions to detect the discontinuities in a retracting force curve (which could be fitted with the WLC model), false detection of discontinuity is inevitable because of the presence of modeling error. However, this problem is solved by our proposed method.

Another advantage of our proposed method is its flexibility. If one wants to use other parametric models, e.g., freely jointed chain (FJC), which reads [3, 7]where is the Kuhn length; other notations are the same as in the WLC model (equation (2)). By comparing the FJC with WLC model and following the method in Section 3.2, one can simply build a dictionary F with FJC functions. And by replacing the dictionary in optimization problem (5), one can finally fit the data with the FJC model. Figure 2 demonstrates both WLC and FJC models.

The disadvantage of the proposed method is the large storage space of the matrix F and the corresponding computational cost introduced by fine sampling of model parameters. In our experiment, is sampled  nm with a step of 1 nm (276 configurations), and is sampled  nm with a step of 1 nm (25 configurations). So the total number of columns in F is . f is of size 1357. Then, the size of F is . If one would like to improve the resolution, e.g., , the step has to be reduced, e.g., 0.5 nm, which doubles the size of F. Hopefully, with the rapid developing of computing resources in recent years, better performance is expected.

The fitting at the boundary region of the force curve can be further improved. Note that there seems a WLC segment between 20 and 40 nm, but neither Maity’s method nor ours discovered it. The reason might be that the right slope is not steep, which is the characteristic of the WLC model, and hence, this segment was treated as inference.

As a perspective, at the beginning of the force curve, it can be modeled as a truncated linear function, while at the end of the force curve, it can be modeled as a step function. By appending more columns to matrix F and each newly added column representing the model at the boundary region, better fitting results can be expected to achieve.

Appendix

Maity et al.’s Method

Maity et al. [10] proposed a method to estimate contour length of the WLC model by solving equations. However, their description has a minor error ( should not depend on ), so here is a short summary of their method with correction.

For each data point on a force curve , a third order polynomial equation of is solved, which readswhere , , and .

This equation has three roots, and the real one is denoted as , which is between 0 and 1. Finally, the estimate of contour length reads .

For all the data points on a curve, force versus contour length scatter plot is obtained, and the contour lengths are estimated as the means of scatters, which is shown in Figure 1(f) as red vertical lines.

Data Availability

The dataset used to support the findings of this study is available at https://github.com/galvanetto/Fodis/blob/master/Datasets/TXT%20universal%20import%20files/104_CNGA1_selected.txt.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

This study was partially supported by Provincial Science Foundation of Shaanxi (2021JM-128).