#### Abstract

This research presents an accurate and efficient contour length estimation method developed for DNA digital curves acquired from Atomic Force Microscopy (AFM) images. The automated method is calibrated against different AFM resolutions and is well suited for extension to other kinds of biopolymer samples with different stiffnesses. The methodology considers the local geometric relationships of the digital curve, since the digitized shape segments and pixel connections represent the actual morphology of the biopolymer sample as imaged by AFM scanning. To incorporate the true local geometry embedded in the continuous form of the original sample, one needs to find its counterpart in the digitized image. This counterpart is obtained by taking the skeleton backbone of the sample contour and using the connection relationships of its pixels to find a local shape representation. This research uses the 8-connect Freeman Chain Code (CC) to describe the directional connections between DNA image pixels, in order to account for the local shapes of four connected pixels. The result is a novel shape number (SN) system derived from the CC, giving a fully automated algorithm that can be applied to DNA samples of any length at low computational cost. Each local length is weighted according to its shape, accounting for the different morphologies of the biopolymer sample. The resulting length estimation error falls below 0.07%, an order of magnitude improvement over previous findings.

#### 1. Introduction

The Atomic Force Microscopy (AFM) system can probe samples at the nanometer scale, owing to its ability to sense the sample surface and resolve force interactions at the pico-Newton level [1]. This feature makes AFM systems a useful imaging device in nanotechnology, molecular biology, and many other fields. AFM is well established in biological applications for imaging biopolymers, thanks to its ability to image in liquid, the biopolymer's natural environment [2].

One characteristic of particular interest is the contour length of a single DNA strand. This contour length can be applied to identify genome editing results and other application outcomes [3]. Accuracy in the contour length is essential at this scale, as there is little room for error in genome editing: one base-pair distance for DNA is only 0.34 nm. AFM images of DNA samples, like the one illustrated in Figure 1, therefore provide a means for such studies of accurate DNA length estimation.

There are two ways to find the contour length from AFM images. One is manual fitting, and the other is automatic skeleton tracing with image processing. Manual fitting relies on a human operator picking specific positions along the DNA contour by examining the acquired image, so the mapped contour length depends on the trained eye of the operator [5, 6], as illustrated in Figure 2.

On the other hand, the automatic estimation traces the DNA image along its backbone skeleton. This is done by thinning the strand image to its median position from the overall acquired outline and retaining only the skeleton of the DNA, as is illustrated in Figure 3.

The backbone extracted from the original AFM image is a *single-width* chain of connected pixels arranged by the following rule: each pixel connects onward to only one adjacent pixel, either directly (horizontally or vertically) or diagonally, so that the chain forms a continuous contour, as illustrated in Figure 4.

From this single-width pixel arrangement, a continuous *chain code* (CC) can be formed by tracing the connectedness of adjacent pixels from one end of the skeleton to the other, according to the eight directions of the 8-connect Freeman code [7].

Researchers have used the Freeman CC to estimate the contour length by tracing along the chain code of the DNA skeleton and tallying the occurrences of even-numbered codes and of odd-numbered codes.

Since an even chain code connects adjacent pixels directly (vertically or horizontally) and an odd chain code connects them diagonally, one first finds the Euclidean length (norm) of all the connections between pixel centers, in pixels, and then multiplies by the pixel resolution to obtain the contour length. This is defined as the Freeman estimator [8].
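The Freeman estimator described above can be sketched in a few lines. This is a minimal illustration, assuming the chain code is given as a list of integers from 0 to 7; the function name `freeman_length` and the sample values are hypothetical and not taken from the article.

```python
import math

def freeman_length(chain_code, resolution_nm_per_px):
    """Freeman estimate: each even code counts as 1 pixel length,
    each odd (diagonal) code as sqrt(2) pixel lengths."""
    n_even = sum(1 for c in chain_code if c % 2 == 0)
    n_odd = len(chain_code) - n_even
    return (n_even + math.sqrt(2) * n_odd) * resolution_nm_per_px

# Hypothetical chain code with three direct and two diagonal connections.
example_codes = [0, 0, 1, 1, 2]
example_length_nm = freeman_length(example_codes, 5.0)
```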

However, the Freeman estimator lacks the accuracy required in these microscopy systems, and several studies have modified it. These include the Kulpa estimator and the corner estimator. The Kulpa estimator modifies the diagonal values to account for the inclination of digital slopes, and the corner estimator further accounts for tight turns geometrically. Both therefore end up with coefficients different from those of the Freeman estimator [9].
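As a hedged sketch of how such modified coefficients enter the calculation, the snippet below uses the coefficient values commonly quoted in the digital-curve literature for the Kulpa and corner-count estimators. These numbers are not taken from this article and should be checked against [9]. The corner count is assumed here to be the number of consecutive code pairs that differ.

```python
def count_codes(chain_code):
    """Return (even count, odd count, corner count) for a chain code."""
    n_even = sum(1 for c in chain_code if c % 2 == 0)
    n_odd = len(chain_code) - n_even
    # Assumption: a "corner" is any pair of consecutive codes that differ.
    n_corner = sum(1 for a, b in zip(chain_code, chain_code[1:]) if a != b)
    return n_even, n_odd, n_corner

def kulpa_length(chain_code, resolution):
    # Commonly quoted Kulpa coefficients (assumed, not from this article).
    n_even, n_odd, _ = count_codes(chain_code)
    return (0.948 * n_even + 1.343 * n_odd) * resolution

def corner_count_length(chain_code, resolution):
    # Commonly quoted corner-count coefficients (assumed, not from this article).
    n_even, n_odd, n_corner = count_codes(chain_code)
    return (0.980 * n_even + 1.406 * n_odd - 0.091 * n_corner) * resolution
```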

Further research has sought to improve accuracy. One approach smooths out the digitized pixelation of the contour skeleton backbone by applying a spatial Fourier transform to the image. By tuning a 2-D Gaussian filter, a smoother contour length is estimated [10].

Rather than modifying the Euclidean length of the pixel connections, another approach adjusts the coordinate representation of the pixel centers. A weight is added to modify each coordinate location by considering three consecutive points, and the length estimator then calculates the contour length from the modified coordinates [11].

Another estimator is designed specifically for DNA strand samples. This DNA estimator introduces a nominal coefficient for different DNA lengths. The coefficient is calculated inversely from simulated data, so a table of coefficients is used to match the expected contour length [12].

More recently, a machine learning approach used feature extraction to fit the occurrences of different cubic spline segments with the following features: *horizontal*, *vertical*, *diagonal*, *perpendicular*, and varying *height* and *thickness*, as defined by [13]. This machine learning estimator is trained on DNA of known contour length to generate coefficients for the abovementioned features.

Table 1 summarizes the abovementioned estimators.

In this paper, the authors propose an estimator based on the shape of the imaged DNA contour, hence named the shape estimator. It is designed to be robust to image resolution and to use minimal computational resources. This is achieved by considering the neighboring shape around the original two-pixel connection used by the earlier chain-code estimators. Because all the local DNA morphology shapes are considered when estimating the contour length, the resultant accuracy is shown to improve by more than an order of magnitude.

The detailed methodology of the shape estimator is explained in Section 2, from the general image preprocessing to the identification of the twelve local 4-pixel segment shapes. The correction coefficients of the twelve shapes are then calibrated in Section 3, with different resolutions considered. Finally, the shape estimator is compared with the DNA estimator and the Freeman estimator in Section 3.3.

#### 2. Contour Length Estimation with Local Shape Consideration

The shape estimator essentially takes local shape into account. Where two neighboring pixels are connected in the AFM image, the overall shape around the two connected pixels represents a different local length depending on the observed DNA morphology. In a tight turn, i.e., a "kink," the local length will certainly be longer than in a smooth linear local profile.

Thus, the shape estimator considers the two additional pixels extending from the central two-pixel connection and identifies the different 4-pixel segment shapes along the DNA skeleton backbone. It then makes shape-corrected length adjustments by multiplying each local connection length by the coefficient corresponding to its shape. The segment need not be limited to 4 pixels; longer elements, such as 5-pixel segments, could also be considered. However, given the trade-off between computational cost and performance, this research investigates the estimator with 4-pixel elements.

##### 2.1. Pixel Resolution and Image Preprocessing

A standard preprocessing step reduces the DNA image to its skeleton backbone by thinning the DNA strand to the centerline of the biopolymer. The automatic image processing used in this research is illustrated in Figure 5.

First, the DNA image is prefiltered and mapped into a binary image by thresholding. Next, 2-D filters remove isolated pixel islands, ensuring that a single DNA contour is captured. Finally, an iterative debranching and thinning morphology is applied to find the skeleton backbone that can be chain-coded [14].
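The first two preprocessing steps can be sketched as follows. This is an illustration only: it assumes a height image stored as a list of rows, uses a plain breadth-first flood fill in place of the article's 2-D filters, and omits the thinning step entirely. The function name `largest_component` is hypothetical.

```python
from collections import deque

def largest_component(image, threshold):
    """Threshold to a binary mask and keep only the largest 8-connected blob."""
    rows, cols = len(image), len(image[0])
    binary = [[image[r][c] > threshold for c in range(cols)] for r in range(rows)]
    seen = [[False] * cols for _ in range(rows)]
    best = []
    for r in range(rows):
        for c in range(cols):
            if not binary[r][c] or seen[r][c]:
                continue
            # Flood-fill one blob starting from this unvisited foreground pixel.
            blob, queue = [], deque([(r, c)])
            seen[r][c] = True
            while queue:
                y, x = queue.popleft()
                blob.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] and not seen[ny][nx]):
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) > len(best):
                best = blob
    mask = [[0] * cols for _ in range(rows)]
    for y, x in best:
        mask[y][x] = 1
    return mask

# Hypothetical 4x5 height map with one strand and one isolated bright pixel.
heights = [
    [0, 0, 0, 0, 9],
    [0, 5, 6, 0, 0],
    [0, 0, 7, 8, 0],
    [0, 0, 0, 0, 0],
]
strand_mask = largest_component(heights, threshold=2)
```

A thinning routine would then be applied to `strand_mask` to obtain the single-width backbone.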

It is well known that AFM systems exhibit a tip broadening effect when imaging, which expands the apparent width of the DNA strand. Given an AFM image with enough resolution across the DNA width, a repeated thinning process converges the single-width pixel contour, on average, towards the midline of the DNA strand automatically [15].

##### 2.2. Identification of 4-Pixel Segment Shape Connectivity

Given the resultant single-width pixels of the contour's skeleton backbone, its CC is coded from one end to the other. Note that this research utilizes the 8-connect chain code, so every code is an integer from 0 to 7, and that the number of chain codes is one less than the number of pixels, since each code describes one connection between two adjacent pixels.
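A minimal sketch of this coding step is given below. It assumes the backbone is already available as an ordered list of (row, column) pixels, and it assumes the usual Freeman orientation in which code 0 points to the right and codes increase counter-clockwise, with rows increasing downward. The article's own orientation may differ, but even codes are direct and odd codes diagonal either way.

```python
# (row step, column step) -> Freeman code; orientation is an assumption.
DIRECTION_TO_CODE = {
    (0, 1): 0, (-1, 1): 1, (-1, 0): 2, (-1, -1): 3,
    (0, -1): 4, (1, -1): 5, (1, 0): 6, (1, 1): 7,
}

def chain_code(pixels):
    """Convert an ordered list of 8-connected pixels into a chain code."""
    codes = []
    for (r0, c0), (r1, c1) in zip(pixels, pixels[1:]):
        step = (r1 - r0, c1 - c0)
        if step not in DIRECTION_TO_CODE:
            raise ValueError(f"pixels {(r0, c0)} and {(r1, c1)} are not adjacent")
        codes.append(DIRECTION_TO_CODE[step])
    return codes

backbone = [(2, 0), (2, 1), (1, 2), (0, 2)]
forward = chain_code(backbone)         # [0, 1, 2]
backward = chain_code(backbone[::-1])  # [6, 5, 4]
```

Four pixels give three codes, and tracing the same backbone from the other end gives a different code sequence for the same geometry.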

With the 4-pixel segment setup, there are up to a total of 64 ways to connect the 4 pixels into single-pixel width arrangements. This research paper has fully outlined all the possibilities, and the full table of all 64 different single-width 4-pixel segments is arranged in Figures 6 and 7. They are arranged by the assigned types, with all the same types grouped together.

Each shape type therefore groups together a 4-pixel segment with its mirrored and rotated images. Take for example the shape shown in Figure 8, where the segment is rotated clockwise and counterclockwise by 90 degrees and mirrored about an axis.

With these segment shapes distinguished, the distance of the original inner 2-pixel connection can now be corrected by considering the shape of the 4-pixel segment that extends outward from it. This takes the local geometric features into account according to the categorized shape. Since the skeleton backbone is composed of consecutive 4-pixel segments all along its contour, the estimator traces from one end to the other and matches every 4-pixel segment to one of the *twelve* unique shapes shown in Figure 9.

###### 2.2.1. Chain Code, Shape Number, and Identifier

In order to identify the different 4-pixel segment shapes along the contour of a skeleton backbone, this research takes its chain code, formed as a series of integers, and develops a novel algorithm called shape number (SN) identification. The SN is then used to derive an exclusive identifier (ID) number for matching the abovementioned unique shapes.

A typical CC collection is a series of integers from 0 to 7, derived from the single-width skeleton backbone pixels. Note that the number of chain codes is one less than the number of pixels. This research emphasizes the general ability to distinguish any skeleton backbone. Any given backbone produces two distinct CCs, depending on which end of the pixel chain the tracing starts from, and the algorithm is shown to converge on the same distinguished 4-pixel segment shapes in either case.

As the algorithm needs to identify the 4-pixel segments continuously along the contour backbone, a rolling window starts from either end of the chain code and collects consecutive segments, as shown in Figure 10.

Each segment comprises three consecutive chain codes, since a 4-pixel segment consists of three connections. The exceptions are the first and last connections of the contour skeleton, where there are not enough pixels to form a 4-pixel segment; there the algorithm simply takes the original two connecting pixels, i.e., the first or last chain code on its own. The pixel and geometric representation of the rolling-window CC segmentation is shown in Figure 11.

Notice that the rolling window moves the 4-pixel segment consecutively from *Head* to *Tail*, and each segment is coded by its three chain codes, excluding the head and tail connections.
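The rolling window can be sketched as below, assuming the chain code is a plain list of integers. The helper name `rolling_segments` is hypothetical. Each returned segment is centered on one interior connection, and the first and last connections are returned separately as head and tail.

```python
def rolling_segments(chain_code):
    """Split a chain code into head, overlapping 3-code segments, and tail."""
    if len(chain_code) < 3:
        raise ValueError("need at least 3 chain codes (4 pixels)")
    segments = [tuple(chain_code[i:i + 3]) for i in range(len(chain_code) - 2)]
    head, tail = chain_code[0], chain_code[-1]
    return head, segments, tail

head, segments, tail = rolling_segments([0, 1, 2, 2, 1])
# segments == [(0, 1, 2), (1, 2, 2), (2, 2, 1)], head == 0, tail == 1
```

With this windowing, a backbone with a given number of chain codes yields two fewer segments than codes, one per interior connection.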

This research now defines a shape number (SN) for each of the rolling segments. In short, the SN is the collection of ordered cyclic differences between the three consecutive chain codes of a segment, so the SN of each segment is composed of three integers.

Since every CC is an integer from 0 to 7, the SN digits are also kept between 0 and 7. Thus, whenever a difference comes out negative, its 8's complement is taken automatically, i.e., 8 is added to the negative value.
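One plausible reading of this definition is sketched below. The defining equation is not reproduced in this text, so the ordering of the three differences (second minus first, third minus second, first minus third) is an assumption, as is the function name `shape_number`. What the sketch does reflect is the wrap-around into the range 0 to 7, which Python's modulo operator performs directly.

```python
def shape_number(segment):
    """Cyclic differences of three chain codes, wrapped into 0..7.

    The order of the three differences is an assumption, not the
    article's confirmed definition.
    """
    c1, c2, c3 = segment
    # Python's % always returns a non-negative result, which is equivalent
    # to adding 8 whenever the raw difference is negative.
    return ((c2 - c1) % 8, (c3 - c2) % 8, (c1 - c3) % 8)

sn_forward = shape_number((0, 1, 2))   # (1, 1, 6)
sn_backward = shape_number((6, 5, 4))  # (7, 7, 2)
```

Under this reading the three digits always sum to a multiple of 8, because the raw differences cancel around the cycle.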

One example of a derived SN is illustrated in the lower part of Figure 11, where the SN is calculated from both directions of the chain code inside each segment: the *Start to End Chain Code* (SECC) and the *End to Start Chain Code* (ESCC). The SECC and ESCC differ, so the resulting *Start to End Shape Number* (SESN) and *End to Start Shape Number* (ESSN) also differ, although they represent exactly the same segment.

To ensure exclusive identification of the same 4-pixel segment under bidirectional CC and SN coding, as well as under all the mirrored and rotated configurations of the same shape, a simple unique identifier (ID) number is needed to match the rolling 4-pixel segments to the twelve shapes.

###### 2.2.2. Unique Identifier (ID) Matching

To deal with this bidirectional, mirroring, and rotational ambiguity, the following rule is applied so that a single SN identifier (ID) matches exactly one shape, for any shape number encountered along the contour skeleton backbone.

This is done by examining the SESN and ESSN of all twelve shapes and capturing the relative adjacent arrangements of the 4-pixel segment geometry, i.e., reordering the numerals of the SN, allowing for direct and diagonal connections, so that they represent the given shape.

Since the SN of every rolling-window segment falls into one of the twelve recognizable shapes, the ID number can be derived from the known segment numbers specified in Figure 9. The identifier is thus a unique number for each shape, such that whenever the rolling window covers a 4-pixel segment, performing this numeral operation (algorithm) yields the identifier.

Algorithm 1 outlines the reordering method that derives the unique identifier (ID) for all twelve shapes.

Algorithm 1: ID number derived from shape number.

The rules for the identifier are stated as follows:

*Unique*: four of the twelve shapes each have their own unique ID number.

*Common*: the remaining eight shapes form four pairs, and the two shapes of each pair share a *common* ID number.

*Distinguish*: the two shapes of each pair are told apart by the connection type of the center pixels, {direct or diagonal}, by checking the original center CC number, with even and odd numbers representing direct and diagonal connections, respectively.

Take for example the shape in the last row of Figure 8. Four different SN combinations, 116, 772, 727, and 161, arise from its eight different segments, owing to the coding of all the mirrored and rotated configurations and the bidirectional CC.

After performing the abovementioned ID algorithm, all of these SNs are transformed to the same identifier (ID) number, 116, as shown in Table 2.

Finally, the algorithm matches the four unique shapes to their respective ID numbers: 026, 116, 206, and 367. In addition, it matches the four shape pairs to the common ID numbers 000, 017, 107, and 277, respectively. To distinguish between the two shapes of a pair, the original {direct, diagonal} connection is used once again: by checking whether the original center chain code is *even* or *odd*, the segment can be trivially matched to the correct shape in the pair (Table 3).
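The final matching step can be illustrated as follows. The ID numbers are those quoted above; the function name `shape_key` and the string labels are hypothetical, and the sketch takes the ID as an input rather than deriving it, since the reordering of Algorithm 1 is not reproduced here.

```python
UNIQUE_IDS = {"026", "116", "206", "367"}
COMMON_IDS = {"000", "017", "107", "277"}

def shape_key(id_number, center_code):
    """Resolve an ID number to one of the twelve shape labels.

    Common IDs are split by the parity of the center chain code:
    even means a direct connection, odd means a diagonal one.
    """
    if id_number in UNIQUE_IDS:
        return id_number
    if id_number in COMMON_IDS:
        parity = "even" if center_code % 2 == 0 else "odd"
        return f"{id_number}-{parity}"
    raise ValueError(f"unknown ID number: {id_number}")

all_keys = {shape_key(i, c) for i in UNIQUE_IDS | COMMON_IDS for c in range(8)}
```

The four unique IDs and the four common IDs split by parity together give twelve distinct labels.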

All the ID numbers for the twelve shapes are listed in Figure 9, derived from all 64 segments in Figures 6 and 7. Note that the common ID numbers are annotated with (-even/-odd) to distinguish them.

##### 2.3. Parameter Equation Representation

Now that the unique ID number is obtained, a collection of different samples of a given length can be amassed in order to retrieve the correction parameters for the different shapes, provided the samples share the same DNA characteristics.

###### 2.3.1. Length Calculation with Coefficients

This is first done by identifying the contour of one individual DNA sample and summing the length contribution of each shape occurrence along it. In other words, one traces the skeleton backbone and tallies the individual occurrences of the twelve shapes, each multiplied by its corresponding connection length (1 or √2 pixel lengths) and its correction coefficient. The contour length, equation (4), is then the sum of all these length contributions plus the head and tail lengths, multiplied by the AFM image pixel resolution. Each shape type has its own number of occurrences, found by identification along the skeleton backbone, its own correction coefficient, and its own connection length (either 1 or √2 according to the shape).
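The length calculation described in this paragraph can be written out as a short function. This is a sketch under stated assumptions: the shape labels, coefficient values, and argument names are hypothetical placeholders, and only two shapes are shown for brevity.

```python
import math

def shape_corrected_length(counts, coefficients, lengths_px,
                           head_px, tail_px, resolution_nm_per_px):
    """Sum of (occurrences x coefficient x connection length) over all shapes,
    plus head and tail lengths, scaled by the pixel resolution."""
    body_px = sum(counts[k] * coefficients[k] * lengths_px[k] for k in counts)
    return (body_px + head_px + tail_px) * resolution_nm_per_px

# Hypothetical tallies and coefficients for two of the twelve shapes.
counts = {"000-even": 10, "000-odd": 4}
coefficients = {"000-even": 1.0, "000-odd": 0.9}
lengths_px = {"000-even": 1.0, "000-odd": math.sqrt(2)}

length_nm = shape_corrected_length(
    counts, coefficients, lengths_px,
    head_px=1.0, tail_px=math.sqrt(2), resolution_nm_per_px=5.0,
)
```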

The connection lengths of the shapes in Figure 9 are listed in Table 4.

###### 2.3.2. Matrix Form for Inverse Calculation

The second step is to collect a sufficiently representative number of samples of the same type of biopolymer and to list the length equations for all of them. The logic is that across multiple samples of the same kind, imaged at the same pixel resolution, the shapes collectively represent the same types of twists and turns, so each shape makes the same length contribution within the same class of biopolymers.

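Combining equation (4) with the associated connection lengths in Table 4 gives equation (5), the length equation for the backbone skeleton of each individual sample. Stacking equation (5) over all samples gives the matrix form of equation (6): the matrix of shape occurrences, weighted by their connection lengths, multiplies the vector of correction coefficients to give the vector of known contour lengths less the *head* and *tail* (boundary) connection length summation.

With one row per sample, the occurrence matrix has as many rows as there are samples and 12 columns, the coefficient vector is 12 by 1, and the length and boundary vectors each have one entry per sample. The system is thereby reduced to a 12-by-12 square matrix from which the coefficient vector can be solved.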

The final procedure is to derive the coefficient vector using a standard linear regression and find its best-fit values. The final results are presented in the next section.
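A least-squares solve of this kind can be sketched with NumPy on synthetic data. Everything below is an assumption made for illustration: the variable names, the alternating connection lengths, the random occurrence counts, and the "true" coefficients are invented, and `numpy.linalg.lstsq` stands in for whatever regression routine the authors used.

```python
import numpy as np

def fit_coefficients(occurrences, lengths_px, known_nm, boundary_px, resolution):
    """Least-squares fit of the 12 shape coefficients.

    occurrences: (samples, 12) shape counts per sample
    lengths_px:  (12,) connection length of each shape, in pixels
    known_nm:    (samples,) known contour lengths
    boundary_px: (samples,) head + tail length per sample, in pixels
    """
    design = np.asarray(occurrences, dtype=float) * np.asarray(lengths_px, dtype=float)
    target = np.asarray(known_nm, dtype=float) / resolution - np.asarray(boundary_px, dtype=float)
    coeffs, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coeffs

# Synthetic check: generate lengths from known coefficients, then recover them.
rng = np.random.default_rng(0)
resolution = 5.1
lengths_px = np.array([1.0, np.sqrt(2)] * 6)
true_coeffs = np.linspace(0.9, 1.1, 12)
occurrences = rng.integers(0, 30, size=(200, 12))
boundary_px = rng.uniform(2.0, 2.83, size=200)
known_nm = resolution * ((occurrences * lengths_px) @ true_coeffs + boundary_px)

fitted = fit_coefficients(occurrences, lengths_px, known_nm, boundary_px, resolution)
```

On noise-free synthetic data the fitted vector matches the generating coefficients to numerical precision; with simulated images the fit would instead average over the sample-to-sample scatter.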

#### 3. Contour Parameter Calibration Result

In order to guarantee convergence of the coefficients, single-pixel-width AFM images were simulated for calibration at different known contour lengths and pixel resolutions. With the combinations of lengths and resolutions, plus a large number of samples for each pair, a total of 58,800,000 images were generated.

All the simulated images are based on DNA characteristics, and all the samples have the same persistence length.

The contour lengths ranged from 340 to 1020 nm in steps of 34 nm, and the resolution was simulated between 5.1 and 7.8 nm/pixel in steps of 0.1 nm/pixel. Thus, there are 21 different lengths and 28 different resolutions, making a total of 21 × 28 = 588 test cases. Each case is studied with 100,000 DNA images, consistent with the total of 58,800,000 images, for sufficient representation of the shapes. In other words, these images supply the samples indexed in equation (6).
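The case count can be checked with a few lines of arithmetic. The sketch assumes 100,000 images per case, which is the per-case figure implied by the stated total of 58,800,000 images and quoted for the error averages in Section 3.3. Resolutions are generated from integer tenths to avoid floating-point drift.

```python
lengths_nm = list(range(340, 1021, 34))                    # 340, 374, ..., 1020
resolutions_nm_per_px = [t / 10 for t in range(51, 79)]    # 5.1, 5.2, ..., 7.8
images_per_case = 100_000                                  # assumed per-case count

n_cases = len(lengths_nm) * len(resolutions_nm_per_px)
n_images = n_cases * images_per_case
```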

##### 3.1. Convergence of the Coefficient

This research first checks the convergence of all coefficients given a growing number of image files, i.e., a growing number of samples in equation (6). The results are illustrated in Figure 12.

All coefficients converge once more than 0.5 million samples are given and remain constant, with fluctuations of less than 0.01%, after 1 million samples. This result is verified for all resolutions up to 7.6 nm/pixel, with a similar trend for all coefficients.

##### 3.2. Linear Variation of Dependence on Resolution

With the convergence of all the coefficients confirmed at every resolution, each coefficient is then fitted as a linear function of the pixel resolution; the results are shown in Figure 13.

Using the converged values at each specified resolution, the linear fit of equation (10) yields a slope and an intercept for each of the twelve coefficients; these are provided in Table 5.
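A per-coefficient linear fit of this kind can be sketched with `numpy.polyfit`. The slope and intercept used to generate the data below are invented for illustration and are not values from Table 5; the function name is likewise hypothetical.

```python
import numpy as np

def fit_coefficient_vs_resolution(resolutions, coefficient_values):
    """Degree-1 least-squares fit: coefficient = slope * resolution + intercept."""
    slope, intercept = np.polyfit(resolutions, coefficient_values, 1)
    return slope, intercept

# Hypothetical converged values of one coefficient at each resolution.
resolutions = np.arange(51, 79) / 10.0          # 5.1 ... 7.8 nm/pixel
converged = 0.012 * resolutions + 0.95          # made-up linear trend

slope, intercept = fit_coefficient_vs_resolution(resolutions, converged)
coefficient_at_6nm = slope * 6.0 + intercept    # evaluate the fit at 6.0 nm/pixel
```

Repeating the fit for each of the twelve coefficients gives one slope and one intercept per shape, so a coefficient can be evaluated at any resolution inside the calibrated range.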

##### 3.3. Performance with Shape Modification Coefficient

The above sections result in the calibration correction of equation (10), which can be used to estimate unknown contour lengths. To demonstrate its performance, the coefficients from equation (10) are used in the shape estimator, which is compared alongside the DNA estimator and the Freeman estimator.

All three estimators are compared at different contour lengths and resolutions. They are applied to the same simulated pixel images and compared against the known contour length for error calculation.

Tables 6, 7, and 8 outline the calculation error for the different settings. They show that the shape estimator has an averaged relative error of at most 0.07%, an order of magnitude smaller than that of the DNA estimator and two orders of magnitude smaller than that of the Freeman estimator. This relative error translates to a maximum absolute error of 0.20 nm, well below the pixel resolution, making the shape estimator well suited to contour length estimation.

Since the error is averaged over the 100,000 samples provided, its standard deviation (STD) in nm is also an indicator for quantitative analysis. The shape estimator has a smaller standard deviation than both the DNA estimator and the Freeman estimator as the estimated contour length grows.
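The two reported statistics can be computed as in the sketch below. It assumes that the averaged relative error is the mean signed error divided by the known length and that the STD is the sample standard deviation of the errors in nm; the article does not spell out either convention, and the function name and sample values are hypothetical.

```python
import statistics

def error_summary(estimates_nm, true_nm):
    """Return (mean relative error in percent, standard deviation in nm)."""
    errors = [estimate - true_nm for estimate in estimates_nm]
    mean_relative_pct = 100.0 * statistics.fmean(errors) / true_nm
    std_nm = statistics.stdev(errors)
    return mean_relative_pct, std_nm

# Hypothetical estimates for four images of a 340 nm strand.
mean_rel_pct, std_nm = error_summary([339.0, 341.0, 340.5, 339.5], 340.0)
```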

#### 4. Conclusion and Future Direction

This research provides a novel way to estimate digitized contour length that is general and applicable to all kinds of contour curvature. By taking a localized shape approach and correcting the local connectivity between pixels, the algorithm accounts for both the resolution and the sample stiffness.

The local 4-pixel segment identification method is general and can be extended to elements of more pixels. The digital shape recognition of a single-width pixel contour is applicable to images acquired from different systems, not only the AFM family but also optical microscopy systems, electron microscopy systems, and many others.

Experimental verification is also needed in future research, using DNA or other biopolymer samples of accurately calibrated length imaged with AFM systems.

#### Data Availability

The data used to support the findings of this study are included within the article.

#### Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.

#### Acknowledgments

The authors would like to thank the funding provided from the Ministry of Science and Technology, Taiwan. The research is supported through grant number MOST 105-2221-E-011-056.