Artificial Intelligence and Data Mining 2014 (Special Issue)
A Solution to Reconstruct Cross-Cut Shredded Text Documents Based on Character Recognition and Genetic Algorithm
The reconstruction of destroyed paper documents has attracted growing interest in recent years. The topic is relevant to forensics, investigative sciences, and archeology. Previous research on the reconstruction of cross-cut shredded text documents (RCCSTD) has relied mainly on likelihood measures and traditional heuristic algorithms. In this paper, we present a feature-matching algorithm based on character recognition, supported by a database of letters, which reconstructs the shredded document in three steps: row clustering, intrarow splicing, and interrow splicing. Row clustering is performed by a clustering algorithm that operates on the clustering vectors of the fragments. Intrarow splicing is modeled as a travelling salesman problem and solved by an improved genetic algorithm. Finally, the document is reconstructed by interrow splicing according to the line spacing and the proximity of the fragments. Computational experiments suggest that the presented algorithm achieves high precision and efficiency and that it may be applicable to cross-cut shredded text documents of different sizes.
1. Introduction
Reconstruction technology for shredded documents is commonly used to obtain judicial exhibits, repair relics, and acquire military intelligence. It plays an important role in judicial investigation, where deliberately damaged sensitive documents must be repaired, and in archaeological research, where cultural relics must be recognized. Generally, the documents, which may be single sided or double sided, are cut into pieces by hand or by a paper shredder. Schauer et al. regarded the reconstruction of shredded documents as a variation of the typical jigsaw puzzle. They distinguished three types of fragments: manually torn documents, cross-cut shredded documents, and strip shredded documents (see Figure 1).
Figure 1: (a) The manually torn document. (b) The cross-cut shredded document. (c) The strip shredded document.
Recombining a document by hand is tedious and time-consuming because of the large number of fragments and the information missing from each fragment. An automatic system is the more efficient way to reconstruct a shredded document. The approaches for manually torn and machine-shredded documents differ. The reconstruction of manually torn paper with irregular boundaries is based on the similarity of boundary features. Fragments produced by a paper shredder, however, have smooth boundaries, so methods designed for irregular boundaries do not apply. For smooth boundaries, reconstruction depends on the characters along the paper edges.
Most of the existing literature investigates the reconstruction of manually shredded fragments. Justino et al. used a polygonal approximation to simplify the complex contours of the fragments and reconstructed manually torn documents by matching polygon features. This method can only reconstruct documents on a small scale. Kesarkar et al. joined torn pieces of paper with a semiautomatic technique that compares edge lengths and angles, and Richter et al. introduced an algorithmic framework for the automatic assembly of shredded documents based on shape and content information.
Some of the literature focuses on the reconstruction of strip shredded documents. Because the boundaries of strip shredded fragments carry adequate information, it is easy to calculate the correlation of two fragments and splice them if their proximity is high. Lin and Fan-Chiang presented an algorithm based on image feature matching that uses a graph-based sorting scheme to reassemble the pieces of the shredded document. They merged the fragments using the average word length and the highest correlation of the binary codes. Other researchers have used character features to match two fragments. Perl et al. proposed an algorithm based on optical character recognition (OCR) that matches the contours of two fragments through the probability histograms of the characters in the border region. The precision of this reconstruction drops as the number of text lines decreases. Diem and Sablatnig proposed an OCR method to recognize the characters in ancient manuscripts.
For cross-cut shreds, the reconstruction of cross-cut shredded text documents (RCCSTD) problem was proved to be NP-complete by Prandtstetter. Biesinger et al. investigated this problem with an improved genetic algorithm that uses no pattern recognition technique. Schauer et al. applied a memetic algorithm with a cost function that decides whether two shreds are adjacent according to the likelihood of the gray values of the pixels along the shred edges. Prandtstetter and Raidl developed an ant colony optimization and a variable neighborhood search for RCCSTD. Sleit et al. put forward a different approach for RCCSTD based on iteratively building clusters of shreds. An example of a cross-cut shredded document, together with its correct reconstruction, is shown in Figure 2.
The reconstruction of cross-cut shredded text documents (RCCSTD) problem is defined as an injective map that assigns each fragment of one single-sided printed document, cut into grid-shaped images, to exactly one position in two-dimensional (Euclidean) space. Virtual fragments are represented by the white rectangles in Figure 3, and each virtual fragment may be used only once [9, 12]. Furthermore, we assume that the orientation of the fragments is consistent and that there is no character adhesion in the text.
This paper studies the reconstruction of such fragments. First, statistical features of the English letters, including the letter size and the space between the letter and the ruled lines, are measured, and a numeric database of 52 letters (26 lowercase and 26 capital letters) in different fonts is built. The letter database supports character recognition, which identifies the characters in the fragments and is used during row clustering and intrarow splicing. Next, because all characters in a common row share the same ruled lines, the fragments are clustered by a clustering algorithm with the help of the ruled line positions determined from the identified characters. The ruled lines in this paper are similar to the top lines and the base lines of the text. The fragments of each row are then sorted, with the rows processed concurrently. In the first stage, fragments are spliced when the character on the joint of their boundaries is identifiable. The combined fragments are then regarded as new fragments, or vertices, and the edge between two new fragments is weighted by the Euclidean distance between their border matrices. In the second stage, the TSP model and a genetic algorithm are used to sort the fragments. Finally, the whole document is reconstructed using the fixed line spacing and the proximity of the fragment strings. Our methodology is shown in detail in Figure 4.
This paper is organized as follows. Section 2 describes how the letter database for the character recognition system is established. Section 3 introduces the character recognition procedure. Section 4 proposes the row clustering algorithm based on the database, and Section 5 presents the intrarow splicing process. Section 6 completes the reconstruction with interrow splicing. Section 7 reports the simulation. Finally, Section 8 concludes the paper and presents some ideas for future work.
2. Letter Database
A database of letters is needed first. We obtain the gray image matrices of the 26 English capital letters and the 26 lowercase letters. We then derive the binary image matrix of each letter by setting the threshold at 205: pixels whose gray value is above the threshold are set to 0, and all other pixels are set to 1. The letter database is used later to recognize the characters of the fragments during the reconstruction.
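The thresholding rule can be illustrated with a short Python sketch. It is only an illustration and not the authors' MATLAB code: the function name `binarize` and the list-of-rows image representation are assumptions made here, while the threshold value 205 is taken from the text.

```python
def binarize(gray, threshold=205):
    """Convert a gray image (rows of 0-255 integers) into a binary matrix.

    Pixels brighter than the threshold become 0 (background);
    all other pixels become 1 (ink), following the rule in the text.
    """
    return [[0 if px > threshold else 1 for px in row] for row in gray]


# Hypothetical 3x3 gray image used only for illustration.
gray = [
    [255, 255, 210],
    [255, 40, 205],
    [12, 200, 255],
]
binary = binarize(gray)
```

Note that a pixel exactly at the threshold (205) is treated as ink, because only values strictly above the threshold are mapped to 0.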
2.1. Letter Extraction
Character recognition is the key to document reconstruction. All given fragments must be converted into binary image matrices by the same process used for the letters above.
We adopt the letter extraction approach of Zhang et al.
For the binary matrix of a letter, the leftmost and rightmost sides of the letter are the first and last columns that contain a nonzero element, and the top and bottom edges of the letter are the first and last rows that contain a nonzero element (see Figure 5).
2.2. Size Feature Extraction
We obtain the height and width of a letter, which is the so-called size feature of a letter, through the approach in Section 2.1. The space between the top edge and the bottom edge of a letter is defined as the height of a letter. The space between the leftmost side and the rightmost side of a letter is defined as the width of a letter. The size feature of letters is one crucial aspect of the letter database.
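The letter extraction and the size feature can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the binary matrix is a list of rows in which 1 marks ink, and the names `bounding_box` and `size_feature` are hypothetical helpers introduced here, not names from the paper.

```python
def bounding_box(binary):
    """Return (top, bottom, left, right) indices of the ink pixels (value 1).

    Returns None when the matrix is empty or contains no ink.
    """
    if not binary or not binary[0]:
        return None
    rows = [i for i, row in enumerate(binary) if any(row)]
    cols = [j for j in range(len(binary[0])) if any(row[j] for row in binary)]
    if not rows:
        return None
    return rows[0], rows[-1], cols[0], cols[-1]


def size_feature(binary):
    """Return (height, width) of the letter, or None if there is no ink."""
    box = bounding_box(binary)
    if box is None:
        return None
    top, bottom, left, right = box
    return bottom - top + 1, right - left + 1


# Hypothetical 5x5 letter image: a small cross surrounded by background.
letter = [
    [0, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 1, 0, 0],
    [0, 0, 0, 0, 0],
]
box = bounding_box(letter)
size = size_feature(letter)
```

The height and width are counted in pixels and include both boundary rows and columns, which is one possible reading of "the space between" the edges.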
2.3. Ruled Line Space Extraction
Lu et al. used the local maxima of the horizontal projection histogram to identify the top lines and the base lines of the text lines. Similarly, we locate the ruled lines using the horizontal projection of the binary image of the whole alphabet. Once the ruled lines are found, the space between each letter and the overline or underline is measured (Figure 6). We set the overline at the uppermost line of the horizontal histogram and the underline at the bottom of the horizontal histogram. Because this space differs from letter to letter, the space between a letter and the ruled lines (the overline and the underline) is an important feature in the letter database. In other words, once a letter in a fragment is recognized, the exact ruled lines can be found from the space between that letter and the ruled lines.
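A minimal Python sketch of the ruled line extraction follows. It assumes that the overline and underline are the first and last rows with a nonzero horizontal projection, which is one reading of "the uppermost line" and "the bottom" of the histogram; the function names are hypothetical.

```python
def horizontal_projection(binary):
    """Count the ink pixels (value 1) in every row of a binary image."""
    return [sum(row) for row in binary]


def ruled_lines(binary):
    """Return (overline, underline) row indices, or None if there is no ink."""
    profile = horizontal_projection(binary)
    inked = [i for i, count in enumerate(profile) if count > 0]
    if not inked:
        return None
    return inked[0], inked[-1]


def ruled_line_space(lines, letter_top, letter_bottom):
    """Return the gaps (to overline, to underline) for one letter."""
    overline, underline = lines
    return letter_top - overline, underline - letter_bottom


# Hypothetical alphabet strip: row 1 holds an ascender, row 4 a descender.
alphabet = [
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 0],
    [0, 0, 0, 0],
]
profile = horizontal_projection(alphabet)
lines = ruled_lines(alphabet)
# A letter without ascender or descender that occupies rows 2 to 3.
space = ruled_line_space(lines, letter_top=2, letter_bottom=3)
```

Stored per letter, the two gaps let the ruled lines of a text row be recovered from a single recognized character by reversing the subtraction.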
3. Character Recognition
First, the character to be recognized is extracted within its bounding rectangle using the method described in Section 2.1. If the character is cut by the boundary line, the fragment must be merged with another fragment before extraction (Figure 7).
After the character extraction, the character recognition begins. Character recognition consists of two steps: height-width ratio judgment and binary matrix judgment (Figure 8).
A character is recognized as a letter if it has the same height-width ratio as a certain letter in the letter database and a similar binary matrix. To measure the similarity of the binary matrices, we set a threshold rate: a character is recognizable if it shares the height-width ratio of a letter in the database and the similarity of its shape to that letter exceeds the threshold rate. Moreover, once a character is identified, the ruled lines of the text line containing it can be found from the space between the letter and the ruled lines stored in the database.
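The two-step judgment can be sketched in Python as follows. The paper does not state how the similarity of two binary matrices is computed, so the sketch assumes the fraction of matching pixels between equally sized matrices; the names `similarity` and `recognize`, the tolerance `ratio_tol`, and the toy 3x3 templates are hypothetical. The default rate 0.83 is the character recognition threshold reported in the simulation.

```python
def size_ratio(matrix):
    """Height-width ratio of an already cropped binary matrix."""
    return len(matrix) / len(matrix[0])


def similarity(a, b):
    """Fraction of equal pixels; 0.0 when the shapes differ (assumption)."""
    if len(a) != len(b) or len(a[0]) != len(b[0]):
        return 0.0
    same = sum(pa == pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return same / (len(a) * len(a[0]))


def recognize(char, database, rate=0.83, ratio_tol=1e-9):
    """Return the best matching letter, or None if nothing passes both tests."""
    best_letter, best_score = None, 0.0
    for letter, template in database.items():
        # Step 1: height-width ratio judgment.
        if abs(size_ratio(char) - size_ratio(template)) > ratio_tol:
            continue
        # Step 2: binary matrix judgment against the threshold rate.
        score = similarity(char, template)
        if score > rate and score > best_score:
            best_letter, best_score = letter, score
    return best_letter


# Toy database with hypothetical 3x3 and 3x1 templates.
database = {
    "l": [[1], [1], [1]],
    "o": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
    "x": [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
}
noisy_o = [[1, 1, 1], [1, 0, 1], [1, 1, 0]]  # one pixel differs from "o"
junk = [[1, 1, 0], [0, 0, 0], [0, 1, 1]]
```

With these toy templates, `noisy_o` agrees with "o" on 8 of 9 pixels, which is above 0.83, while `junk` stays below the threshold for every template and is therefore reported as unidentifiable.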
4. Row Clustering
4.1. The Clustering Vector
We use the ruled lines to cluster the fragments. Because the gray image matrix contains a large amount of data, we describe each fragment by a compact clustering vector obtained from its binary image matrix through feature extraction. The clustering vector has four components: the lower position of the unidentified row at the top of the fragment, the upper position of the unidentified row at the bottom of the fragment, the position of the overline of the last identified row, and the position of the underline of the last identified row.
The steps of the feature extraction are as follows (see Figure 9).
S1. If the border-top of binary image matrix is equal to 0, there is no unidentified row on the top of the fragment; let ; if not, continue to the next step.
S2. If the upper part of the fragment is identified, there is no unidentified row on the top; let ; if not, let , where is the lower position of the unidentified row.
S3. If the border-bottom of binary image matrix is equal to 0, there is no unidentified row at the bottom of the fragment; let ; if not, continue to the fourth step.
S4. If the foot of the fragment is identifiable, let ; if not, let , where is the upper position of the unidentified row.
S5. If there is any identified row in the fragment, identify one character in the identified row nearest to the border-bottom by the character matching algorithm. Let and , where is the overline position of the identified row and is the underline position of the identified line; if not, let .
S6. Finally, we get the clustering vector of the fragment .
4.2. The Cluster Center
Define the first fragment in each row of the original document as the cluster center of that row. Because the left margin of the document is the largest blank area, these first-column fragments have a fixed blank region on their left side that is wider than the blank on the left of any other fragment, so they are easily found (Figure 10).
The steps of finding the cluster center are as follows.
S1. Initialize .
S2. Initialize the number of cluster center and .
S3. If the pixels, which mean the element in the gray image matrix of fragment , from 1 to th column are equal to 0, let .
S4. If , , return to S3.
S5. If , , return to S2. If , we obtain fragments, also called cluster centers. The pixels from 1 to th column of their gray image matrices are equal to 0. Denote the cluster centers as the cluster centers .
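The core test in these steps, whether the leftmost columns of a fragment are entirely blank, can be sketched in Python as follows. The sketch takes the margin width as a given parameter and does not reproduce the iteration over widths in S1 to S5; the function names and the tiny fragments are hypothetical.

```python
def left_margin_is_blank(binary, width):
    """True when the first `width` columns of a binary fragment contain no ink."""
    return all(px == 0 for row in binary for px in row[:width])


def find_cluster_centers(fragments, width):
    """Return the indices of fragments whose left margin of `width` columns is blank."""
    return [i for i, frag in enumerate(fragments)
            if left_margin_is_blank(frag, width)]


# Hypothetical 2x3 binary fragments (1 marks ink).
fragments = [
    [[0, 0, 1], [0, 0, 1]],  # two blank columns on the left
    [[1, 0, 0], [0, 0, 0]],  # ink in the first column
    [[0, 0, 0], [0, 0, 1]],  # two blank columns on the left
    [[0, 1, 0], [0, 0, 0]],  # ink in the second column
]
centers_wide = find_cluster_centers(fragments, width=2)
centers_narrow = find_cluster_centers(fragments, width=1)
```

A narrow margin admits too many candidates, which suggests why the width has to be tuned until the number of candidates matches the number of text rows.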
4.3. The Distance between Other Fragments and Cluster Center
All fragments are assigned to their closest cluster centers according to the similarity of their feature vectors. Assume that the feature vector of the cluster center is . The similarity of two fragments is estimated by comparing their clustering vectors. If both fragments contain an identified row and the ruled lines of the two fragments are in the same position, that is, , the two fragments are likely to be in the same cluster. If the two ruled lines are not in the same position, the two fragments are not adjacent. The distance vector measures the distance between the clustering vector of the fragment and the clustering vector of the cluster center , and it is defined as
The steps of the row clustering algorithm are as follows.
S1. Compute the distance vector .
S2. If both the components of the vector are no more than the threshold, it shows that the fragment and the cluster center are in the same cluster.
S3. Cluster all the fragments into clusters, and find the number of the fragments of each cluster.
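The row clustering steps can be sketched in Python as follows. Because the definition of the distance vector is not recoverable from the text, the sketch assumes it consists of the absolute differences between the overline positions and between the underline positions of a fragment and a cluster center. It also simplifies each clustering vector to the pair (overline, underline). All names and numbers are hypothetical.

```python
def distance_vector(frag_lines, center_lines):
    """Absolute differences of overline and underline positions (assumption)."""
    return (abs(frag_lines[0] - center_lines[0]),
            abs(frag_lines[1] - center_lines[1]))


def cluster_rows(lines_by_fragment, centers, threshold):
    """Assign every fragment to the first center within the threshold.

    lines_by_fragment maps a fragment name to (overline, underline).
    Returns (clusters, unassigned).
    """
    clusters = {c: [c] for c in centers}
    unassigned = []
    for name, lines in lines_by_fragment.items():
        if name in clusters:
            continue
        for c in centers:
            d = distance_vector(lines, lines_by_fragment[c])
            if d[0] <= threshold and d[1] <= threshold:
                clusters[c].append(name)
                break
        else:
            unassigned.append(name)
    return clusters, unassigned


# Hypothetical ruled line positions in pixels.
lines_by_fragment = {
    "c1": (10, 30), "c2": (50, 70),
    "f1": (11, 30), "f2": (50, 72), "f3": (90, 110),
}
clusters, unassigned = cluster_rows(lines_by_fragment, ["c1", "c2"], threshold=2)
```

Fragments that match no center are returned separately so that they can be handled by another rule or by manual inspection.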
5. Intrarow Splicing
After row clustering, the set of the fragments in the same row is supposed to be . Intrarow splicing can be modeled as the problem of finding the optimal path in a graph, which is a form of the travelling salesman problem (TSP). In the classic TSP, a salesman seeks the shortest route that visits each city exactly once and finally returns to the starting point.
Because intrarow splicing is treated as a travelling salesman problem, the task is to find a shortest path in the undirected graph that visits each vertex exactly once. Each fragment is a vertex of the graph, and the edges between vertices are weighted by the adjacency correlation of the corresponding fragments. Unlike the classic TSP, the path does not return to its origin: the goal is the shortest path that connects the start vertex (the leftmost fragment) to the end vertex (the rightmost fragment).
Intrarow splicing consists of three steps: constructing the distance matrix, first stage splicing based on character recognition, and second stage splicing.
5.1. Constructing Distance Matrix
There are two boundaries in a fragment (the left edge and the right edge) when splicing the fragments to the others. Calculate the edges correlation according to the ordinary Euclidean distance metric. For fragments in the set of fragments , we define an correlation matrix to show the adjacency among the fragments. Let be the gray image matrix of the rightmost edge of fragment and let be the gray image matrix of the leftmost edge of fragment , where , represents the fragment and the fragment , respectively . Define the distance between the fragment and the fragment as where
And represents whether the character on the joint of two adjacent fragments is identifiable or not:
For example, in Figure 11(a), the character on the joint boundary is identifiable because the feature of the character is almost the same as that of the letter “e” in the database. In Figure 11(b), the character on the joint boundary is unidentifiable because the feature of the character is different from that of any letters in the database.
Therefore, we obtain the distance matrix of fragments in the same row.
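The Euclidean part of the distance matrix can be sketched in Python as follows. The sketch compares only the last pixel column of one fragment with the first pixel column of another and leaves out the identifiability indicator, whose exact role in the distance cannot be recovered from the text. The function names and the tiny fragments are hypothetical.

```python
import math


def right_edge(fragment):
    """Last pixel column of a fragment given as a list of rows."""
    return [row[-1] for row in fragment]


def left_edge(fragment):
    """First pixel column of a fragment given as a list of rows."""
    return [row[0] for row in fragment]


def edge_distance(a, b):
    """Euclidean distance between the right edge of a and the left edge of b."""
    return math.sqrt(sum((x - y) ** 2
                         for x, y in zip(right_edge(a), left_edge(b))))


def distance_matrix(fragments):
    """Entry [i][j] is the cost of placing fragment j directly right of i."""
    n = len(fragments)
    return [[math.inf if i == j else edge_distance(fragments[i], fragments[j])
             for j in range(n)] for i in range(n)]


# Hypothetical 3x2 binary fragments.
A = [[0, 1], [0, 0], [0, 1]]
B = [[1, 0], [0, 0], [1, 1]]
C = [[0, 0], [1, 0], [0, 0]]
D = distance_matrix([A, B, C])
```

The matrix is not symmetric in this sketch, because placing B to the right of A compares different edges than placing A to the right of B.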
5.2. The First Stage Splicing
The fragments and are possibly adjacent when their correlation . Select the pairs of the adjacent fragments to verify whether they can be merged into each other. Two fragments are adjacent when one character on the joint of the two fragments is identifiable according to the character database. Figure 11 is an example to explain whether the characters on the boundaries are identifiable or not.
In the first stage splicing, according to distance matrix, we denote the set of fragments as follows.
(a) For each given , if , the fragment is denoted as .
(b) If the following equations exist , and are pairwise different, the fragments are denoted as .
Finally, the set of fragments can be rewritten as follows: where , , , and .
Furthermore, , can be denoted as , where is represented by . After the first stage splicing, the set of fragments is represented by .
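The effect of the first stage splicing, merging fragments with verified joints into longer strings, can be sketched in Python as follows. The sketch assumes the verified joints are given as (left, right) index pairs, that each fragment has at most one verified neighbor on each side, and that the pairs contain no cycle; the function name `merge_chains` is hypothetical.

```python
def merge_chains(n, pairs):
    """Link verified (left, right) pairs into ordered chains of fragments.

    n is the number of fragments in the row; fragments without any
    verified joint remain as chains of length one.
    """
    right_of = dict(pairs)             # left fragment -> its right neighbor
    has_left = set(right_of.values())  # fragments that have a left neighbor
    chains = []
    for start in range(n):
        if start in has_left:
            continue  # not the head of a chain
        chain = [start]
        while chain[-1] in right_of:
            chain.append(right_of[chain[-1]])
        chains.append(chain)
    return chains


# Hypothetical row of six fragments with three verified joints.
chains = merge_chains(6, [(0, 3), (3, 1), (4, 5)])
```

Each chain then acts as one new fragment in the second stage, which reduces the number of vertices the genetic algorithm has to order.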
5.3. The Second Stage Splicing
After the first stage splicing, the fragments in the same row are denoted by . Calculate the distance as defined in the first stage splicing, where .
Suppose that the fragment contains the clustering center of the row, so the goal of TSP in the intrarow splicing is transferred and modeled as
In other words, we find the solution of the TSP problem with as the starting point. We apply the genetic algorithm to solve this problem.
With the usual coding method, generate an array randomly within the interval . The components in the array correspond to , respectively. The sort in ascending order of the array represents the position of . In addition, adopting the multiple point crossover method, select two random positions of the crossover point, and exchange the gene segments. Moreover, adopting the multiple point mutation method, select some random position of the mutation point and replace the value on the mutation point. The fitness function of the genetic algorithm is
The genetic algorithm with multiple point crossover operators and multiple point mutation operators is applied in order to increase the diversity of the individuals. In addition, the first stage splicing based on character recognition supplies a good initial solution to the genetic algorithm, so the algorithm converges quickly. The genetic algorithm then yields the sequence of the fragments in a row.
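A compact Python sketch of a random-key genetic algorithm for the second stage follows. It is not the authors' MATLAB implementation. It assumes that the fitness is the total path length (lower is better), that fragment 0 is the cluster center and always comes first, that the better half of each generation survives, and that a mutation replaces a random subset of keys. All function names are hypothetical. The defaults for population size (100), iterations (50), and mutation rate (0.6) are the values reported in the simulation.

```python
import random


def decode(keys):
    """Random keys -> fragment order; fragment 0 (the cluster center) is first."""
    rest = sorted(range(len(keys)), key=lambda i: keys[i])
    return [0] + [i + 1 for i in rest]


def path_length(order, dist):
    """Sum of distances between consecutive fragments in the order."""
    return sum(dist[a][b] for a, b in zip(order, order[1:]))


def two_point_crossover(p1, p2, rng):
    """Exchange the gene segment between two random cut points."""
    i, j = sorted(rng.sample(range(len(p1) + 1), 2))
    return p1[:i] + p2[i:j] + p1[j:]


def mutate(keys, rng):
    """Replace the keys at a random, nonempty set of positions."""
    keys = list(keys)
    for pos in rng.sample(range(len(keys)), rng.randint(1, len(keys))):
        keys[pos] = rng.random()
    return keys


def solve_row(dist, pop_size=100, generations=50, mutation_rate=0.6, seed=0):
    """Return an order of fragments for one row, starting with fragment 0."""
    if len(dist) <= 1:
        return list(range(len(dist)))
    rng = random.Random(seed)
    n_keys = len(dist) - 1

    def cost(keys):
        return path_length(decode(keys), dist)

    pop = [[rng.random() for _ in range(n_keys)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        elite = pop[: pop_size // 2]  # the better half survives unchanged
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            child = two_point_crossover(p1, p2, rng)
            if rng.random() < mutation_rate:
                child = mutate(child, rng)
            children.append(child)
        pop = elite + children
    return decode(min(pop, key=cost))


# Hypothetical distances: the chain 0 -> 1 -> 2 -> 3 costs 0, other steps cost 5.
dist = [
    [0, 0, 5, 5],
    [5, 0, 0, 5],
    [5, 5, 0, 0],
    [5, 5, 5, 0],
]
order = solve_row(dist)
```

Sorting the random keys always yields a valid permutation, so crossover and mutation never need a repair step. This is the main reason for choosing this encoding in the sketch.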
6. Interrow Splicing
In the text document, suppose that the overline and the underline of line are denoted as and , respectively. The interrow space is defined as the space between and (Figure 12). Moreover, all of the interrow spaces in the text document are consistent. Finally, we can splice the string of fragments according to the interrow space and the proximity of the fragments.
7. Simulation
A shredded document set in Times New Roman at font size 20, shown in Figure 2, is simulated to test the efficiency of the model. The threshold for character recognition is set to 0.83, and the threshold for row clustering is set to 232. In the genetic algorithm, the population size is set to 100, the number of iterations is set to 50, and the mutation rate is set to 0.6. We implemented our approach in MATLAB and performed all tests on a dual-core Intel Core CPU at 2.10 GHz with 2 GB RAM.
Now, we cut the document into fragments and use it as an example to illustrate the reconstruction process in detail. First, a database that fits the document is applied, as shown in Table 2.
Secondly, with the help of the database and the character recognition, the clustering vector of each fragment is given through the row clustering algorithm (shown in Table 3).
Based on the algorithm of row clustering, 130 fragments are distributed into 13 rows according to their clustering vectors (Table 4).
Splicing in the other rows is similar to Row 13.
After intrarow splicing and interrow splicing, as the end of the reconstruction algorithm, the document is finally reconstructed. The reconstruction result is shown in Table 7, corresponding to Figure 13(a).
The precision of the reconstruction is calculated by
In our simulation, the effective information per fragment and the precision of the document reconstruction decrease overall as the number of fragments increases. Locally, however, the precision passes through three stages (see Figure 14). When the number of fragments is less than 55, the precision of the algorithm is 100%; when the number is between 55 and 100, the precision is above 90%; when there are more than 100 fragments, the precision fluctuates between about 75% and 85%.
Because the initial solution is improved by combining adjacent fragments in the first stage splicing, the search that the genetic algorithm must perform to find the optimal path is simplified. In addition, the fragment splicing algorithm can run on all clusters at the same time, which improves efficiency. As a result, the total running time for reconstructing the documents cut into fragments is 0.8276 seconds, which is much faster than restoring the fragments by hand.
To test the adaptability of the algorithm, we compute the precision of reconstruction for a second document (Figure 13(b)). This document is shredded in the same way as the one in the previous experiment and is also set in Times New Roman at font size 20. On the one hand, the results in Figure 14 and Figure 15 show a similar trend, which indicates the robustness of the algorithm. On the other hand, the precision values in Figure 14 differ from those in Figure 15, which indicates that the result of the algorithm is influenced by the content of the document.
The algorithm proposed in this paper reconstructs the document without human intervention. When the result is unsatisfactory, that is, when the precision does not reach 100%, human intervention is necessary, because the goal is the complete document.
8. Conclusion and Future Work
This paper studies the reconstruction of text documents that have been cross-cut by a paper shredder. With the letter database in place, character recognition helps locate the ruled lines accurately during row clustering. In addition, character recognition shortens the convergence time in intrarow splicing by offering a better initial solution to the genetic algorithm. The simulation results show that the document is reconstructed precisely within that short convergence time. Hence, the feature-matching algorithm based on character recognition can splice the fragments through row clustering, intrarow splicing, and interrow splicing with high efficiency and high precision.
As a solution to the reconstruction of cross-cut shredded text documents (RCCSTD) problem, the algorithm proposed in this paper can be improved in the future. More feature information about the characters in the same row could be extracted and used to improve the precision of row clustering. More advanced OCR technology may also help solve the RCCSTD problem. We plan to use it to splice larger numbers of fragments from cross-cut shredded text documents while keeping the high precision that is the main goal of the RCCSTD problem.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
This work was partly supported by the National NSF of China (no. 11071089).
References
P. D. Kesarkar, M. RC. Prasad, and S. L. Tade, “Reconstruction of torn page using corner and segment matching,” Reconstruction, vol. 2, no. 6, 2013.
J. Perl, M. Diem, F. Kleber, and R. Sablatnig, “Strip shredded document reconstruction using optical character recognition,” in Proceedings of the 4th International Conference on IET Imaging for Crime Detection and Prevention (ICDP '11), pp. 1–6, London, UK, 2011.
M. Diem and R. Sablatnig, “Recognizing characters of ancient manuscripts,” in Computer Vision and Image Analysis of Art, 753106, vol. 7531 of Proceedings of SPIE, p. 12, IST/SPIE Electronic Imaging. International Society for Optics and Photonics, San Jose, Calif, USA, January 2010.
M. Prandtstetter, Hybrid optimization methods for warehouse logistics and the re-construction of destroyed paper documents [Ph.D. thesis], Vienna University of Technology, 2009.
B. Biesinger, C. Schauer, B. Hu, and G. R. Raidl, “Enhancing a genetic algorithm with a solution archive to reconstruct cross cut shredded text document,” in Computer Aided Systems Theory-EUROCAST, pp. 380–387, Springer, Berlin, Germany, 2013.
C. Schauer, Reconstructing cross-cut shredded documents by means of evolutionary algorithms [M.S. thesis], Vienna University of Technology, Institute of Computer Graphics and Algorithms, Vienna, Austria, 2010.
Z. Zhang, S. Yu, Z. Fang, and Y. Qiao, “Real-time recognition algorithm of Arabic numerals and English letters based on field-programmable gate array design,” Journal of Shanghai Jiaotong University, vol. 40, no. 1, pp. 12–15, 2006 (Chinese).