Inverse Problems: Theory and Application to Science and Engineering (Special Issue)
Research Article | Open Access
Reconstruction of Shredded Paper Documents by Feature Matching
Splicing shredded paper refers to the technology of designing an algorithm that, given the shredded pieces of a document, splices them together to recover the original. This paper introduces an algorithm for splicing shredded paper based on texture feature matching. Using this algorithm, we model and solve the problem of splicing shredded paper, and we apply it to both English and Chinese shredded documents. The recovered documents demonstrate the accuracy of the algorithm.
1. Introduction
Repairing fragments is important because it applies to many areas, such as the restoration of archaeological findings and the acquisition of modern military intelligence. A direct and valid approach is to splice the fragments manually. However, when the number of fragments is huge, manual splicing consumes vast human resources, and valuable fragments may be damaged unintentionally by workers during the process. With the development of technology, more and more researchers have tried to solve this type of problem, which can be cast as the problem of repairing 2D fragments. As a typical instance of 2D fragment repair, the splicing of shredded paper has received increasing attention. In many cases, people need to recover shredded paper that may carry important information; the paper may have been shredded by a shredder or in some other way. Researchers have proposed various algorithms for this problem, but many mainstream splicing techniques have limitations. This paper introduces an algorithm based on texture feature matching. The algorithm is modeled for paper that has been shredded either lengthwise only or both horizontally and vertically. In solving the problem, we also designed different details for the two kinds of shredded paper, whose text is printed in English and in Chinese, respectively.
This paper describes the main problems of splicing shredded paper and proposes the core algorithm of texture feature matching. It then discusses the solutions to these problems for the two cases in which the text is printed in English and in Chinese. The results show that the proposed algorithm is efficient. Finally, the recovered results are summarized.
2. Main Problems and Assumptions
Depending on the way in which the paper was shredded, the following problems need to be solved.
(1) For paper cut lengthwise from the same page of a file, build the corresponding mathematical model and algorithm.
(2) For paper shredded both horizontally and vertically, build the corresponding mathematical model and algorithm, which necessarily differ from those of Problem 1.
(3) Under the condition of Problem 2, consider further that the document was printed double-sided, which makes the reconstruction of the shredded paper more difficult. Based on the model for Problem 2, propose an advanced algorithm to deal with this case.
(4) Account for a decisive difference between English and Chinese text: the height of the words is not consistent between the two.
In this paper, the models for these problems are based on the following necessary and reasonable assumptions.
(1) The text format within the same page is basically consistent.
(2) No piece of shredded paper is lost.
(3) All pieces of shredded paper have identical size.
(4) The paper was shredded regularly.
(5) All pieces of shredded paper have a consistent orientation.
All conclusions of this paper rest on the above assumptions. It is worth noting that the shredded document may need to be preprocessed to make sure that all pieces satisfy them. For instance, when the text on a piece is oblique or distorted because the paper got wet, further preprocessing is needed to correct it before the pieces are scanned into the computer.
3. Modeling and Solving the Problem: Repairing Shredded English Paper
3.1. Analysis for Problem 1
For paper cut lengthwise from the same page of a file, we take the shredded paper in Figure 1 as an example for building the mathematical model and algorithm. Figure 1 shows that the paper has been shredded into 19 pieces.
The scans shown in Figure 1 had not been processed with any preprocessing software. In fact, more than half of the pieces need to be preprocessed for reasons such as a wrinkled paper surface. The scans referred to in the rest of this paper are the results of such preprocessing.
To solve this problem, we first quantify the pixels on the left and right edges of each piece, which yields a matching-degree matrix over the pieces of shredded paper. Here the matching degree expresses the similarity of two edges for splicing. Examining the left edge of each piece lets us determine which piece comes from the left edge of the original file; this is the first piece. We then select the piece whose left edge has the highest matching degree with the right edge of the first piece and splice the two together. By analogy, at the $k$th splicing we again select the adjacent piece by its matching degree. In the end, we obtain the correct splicing of the original file.
For the shredded paper in Figure 1, we binarize, by program, the pixels on the edges of each piece. Suppose that each of the left and right edges of a piece consists of $m$ pixels, that the vector $L_i = (l_{i1}, l_{i2}, \ldots, l_{im})$ is the binarization of the pixels on the left edge of the $i$th piece, and that the vector $R_i = (r_{i1}, r_{i2}, \ldots, r_{im})$ is the binarization of the pixels on its right edge. In addition,
$$l_{ik}, r_{ik} \in \{0, 1\}, \quad k = 1, 2, \ldots, m.$$
We then define the numbers of pixel matches between the right edge of the $i$th piece and the left edge of the $j$th piece as
$$M_{11} = \sum_{k=1}^{m} r_{ik} l_{jk}, \quad M_{01} = \sum_{k=1}^{m} (1 - r_{ik}) l_{jk}, \quad M_{10} = \sum_{k=1}^{m} r_{ik} (1 - l_{jk}), \quad M_{00} = \sum_{k=1}^{m} (1 - r_{ik})(1 - l_{jk}).$$
In the result of the binarization, the value “1” represents a black pixel and “0” a white pixel. Here $M_{11}$ is the (1-1) matching number of the vectors $R_i$ and $L_j$. Analogously, $M_{01}$ is the (0-1) matching number, $M_{10}$ is the (1-0) matching number, and $M_{00}$ is the (0-0) matching number.
The matching measure of the binary vectors is obtained by the simple matching coefficient:
$$s_{ij} = \frac{M_{11} + M_{00}}{M_{11} + M_{01} + M_{10} + M_{00}}.$$
In the above equation, the numerator is the sum of the (1-1) and (0-0) matching numbers. The denominator is the sum of all four matching numbers.
By analogy, we obtain the matching degree between the right and left edges of any two of the 19 pieces of shredded paper:
$$S = (s_{ij})_{19 \times 19}.$$
The diagonal entry $s_{ii}$ would be the matching degree between the right and left edges of the same $i$th piece; it has no meaning.
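The simple matching coefficient between edge vectors can be sketched in a few lines of Python. This is a minimal illustration, not the authors' program: the function names `simple_matching` and `matching_matrix` are hypothetical, edges are assumed to be equal-length lists of 0/1 values (1 = black), and the meaningless diagonal is filled with 0.0 purely as a convention of this sketch.

```python
from typing import List, Sequence


def simple_matching(right_edge: Sequence[int], left_edge: Sequence[int]) -> float:
    """Simple matching coefficient of two equal-length binary vectors."""
    if len(right_edge) != len(left_edge):
        raise ValueError("edges must have the same number of pixels")
    if len(right_edge) == 0:
        raise ValueError("edges must not be empty")
    # Positions where both pixels agree are exactly the (1-1) and (0-0) matches.
    agree = sum(1 for r, l in zip(right_edge, left_edge) if r == l)
    # The denominator (1-1)+(0-1)+(1-0)+(0-0) is simply the vector length.
    return agree / len(right_edge)


def matching_matrix(left_edges: List[Sequence[int]],
                    right_edges: List[Sequence[int]]) -> List[List[float]]:
    """Entry [i][j] compares the right edge of piece i with the left edge of piece j."""
    n = len(left_edges)
    return [
        # A piece is never spliced to itself, so the diagonal is set to 0.0
        # (an arbitrary choice made for this sketch).
        [0.0 if i == j else simple_matching(right_edges[i], left_edges[j])
         for j in range(n)]
        for i in range(n)
    ]
```

For instance, `simple_matching([1, 0, 1, 1], [1, 0, 0, 1])` agrees in three of four positions and evaluates to 0.75.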
By running the program, we identified one piece (piece “008”) as part of the left edge of the original paper, because it has a regular blank margin on its left edge. We therefore chose piece 008 as the starting point of the splicing.
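One way to detect a piece with a blank left margin can be sketched as follows. The representation (a piece as a list of pixel rows), the threshold of 128, and the names `binarize`, `blank_left_margin`, and `find_leftmost_piece` are assumptions made for illustration; the paper does not specify how its program performs this check.

```python
from typing import List

Image = List[List[int]]  # a piece as rows of pixel values


def binarize(gray: Image, threshold: int = 128) -> Image:
    """Map grayscale pixels to 1 (black) or 0 (white); the threshold is an assumption."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]


def blank_left_margin(piece: Image) -> int:
    """Number of leading columns of a binarized piece that contain no black pixel."""
    width = len(piece[0])
    for col in range(width):
        if any(row[col] for row in piece):
            return col
    return width  # the whole piece is blank


def find_leftmost_piece(pieces: List[Image]) -> int:
    """Index of the piece with the widest blank left margin."""
    margins = [blank_left_margin(p) for p in pieces]
    return max(range(len(pieces)), key=margins.__getitem__)
```

Choosing the widest margin is a heuristic: it presumes the page has a left margin wider than any gap between letters that happens to fall on a cut.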
We then computed the vector of matching degrees between piece 008 and the other pieces of shredded paper.
From this matching vector, piece 014 has the maximum similarity with piece 008, so splicing piece 014 to piece 008 is the first step.
Next, we computed the vector of matching degrees between piece 014 and the remaining pieces of shredded paper.
Repeating these steps until all pieces of shredded paper had been spliced recovers the original paper. The correct order was 3-6-2-7-15-18-11-0-5-1-9-13-10-8-12-14-17-16-4.
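The repeated "pick the best right-hand neighbour" step can be sketched as a greedy loop over a matching-degree matrix. The function name `greedy_order` and the tie-breaking rule (lowest index wins) are choices made for this illustration; `match[i][j]` is assumed to hold the matching degree between the right edge of piece `i` and the left edge of piece `j`.

```python
from typing import List, Sequence


def greedy_order(match: Sequence[Sequence[float]], first: int) -> List[int]:
    """Order pieces left to right, starting from `first`, by greedy best match."""
    order = [first]
    remaining = set(range(len(match))) - {first}
    while remaining:
        current = order[-1]
        # Among the unused pieces, take the one whose left edge best matches
        # the right edge of the piece placed last (ties: lowest index).
        best = max(sorted(remaining), key=lambda j: match[current][j])
        order.append(best)
        remaining.remove(best)
    return order


if __name__ == "__main__":
    toy = [
        [0.0, 0.2, 0.9],
        [0.8, 0.0, 0.1],
        [0.3, 0.7, 0.0],
    ]
    print(greedy_order(toy, first=0))  # [0, 2, 1]
```

Because each choice is final, one wrong match propagates to every later position; this is why a reliable similarity measure and clean edges matter for the greedy strategy.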
The result of repairing the English shredded paper is shown in Figure 2.
3.2. Analysis for Problem 2
For this problem, we design a model for paper that has been shredded both horizontally and vertically. We take the shredded paper in Figure 3 as an example. The paper has been shredded into 11 (lines) × 19 (rows) pieces.
3.2.1. Solving Idea and Method
Suppose that all pieces of the shredded paper have been spliced regularly, so that they form a grid. Each piece can then be located by its coordinate $(i, j)$, where $i$ is the line coordinate and $j$ is the row coordinate, that is, the position of the piece within its line. To simplify the problem, we ignore the row coordinate at first and determine only the line coordinate of each piece.
We suppose further that the pieces in each line have already been spliced in the correct sequence, so that only the sequence of the 11 lines of shredded paper remains to be found. We define $d$ as the pixel distance between the top of a piece and the bottom of the first line of text in it. Using an inequality and enumeration, we can obtain the $d$ values of all 11 lines. By comparing these values with the $d$ value of each piece, we obtain the line coordinate of the piece (based on the actual situation, this paper assumes that pieces from the same line of the original paper have identical $d$ values).
At this point, Problem 2 has essentially been reduced to Problem 1, and the shredded paper can be recovered using the model and algorithm proposed for Problem 1.
3.2.2. Preprocessing of Data
To obtain the $d$ values of all 11 lines, the data of the shredded paper must first be preprocessed. We define the row height $h$ as the pixel distance between the bottom of the text in the $k$th line and the bottom of the text in the $(k+1)$th line. For the English shredded paper, the measured values of $h$ differ significantly from one another. Based on this characteristic, we locate the areas with a high concentration of black pixels by scanning the pixels row by row, and from these areas we obtain the row height $h$.
As is well known, some letters, such as those with ascenders or descenders, occupy additional vertical space. In a sentence made up of letters, the black pixels are therefore not distributed uniformly over the height of the text line. Taking the word “public” as an example, the frequency distribution of black pixels is shown in Figure 4.
Figure 4 shows that the black pixels are concentrated in the middle part of the word. In view of this, we scan each piece from top to bottom to find the pixel row with the maximum sudden change in the number of black pixels. From this row we can judge the location of the text line in the original paper. In this case, we define the bottom of the text as the third dotted line in Figure 4, which gives the height of the row in pixels.
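The top-to-bottom scan for the largest sudden change can be illustrated with a row-wise black-pixel profile. This sketch interprets "maximum sudden change" as the largest drop in black-pixel count from one pixel row to the next, which is an assumption; the names `row_profile` and `baseline_row` are hypothetical, and a piece is assumed to be a list of rows of 0/1 values (1 = black).

```python
from typing import List


def row_profile(piece: List[List[int]]) -> List[int]:
    """Number of black pixels in each pixel row of a binarized piece."""
    return [sum(row) for row in piece]


def baseline_row(piece: List[List[int]]) -> int:
    """Index of the pixel row after which the black-pixel count drops the most."""
    profile = row_profile(piece)
    if len(profile) < 2:
        raise ValueError("need at least two pixel rows")
    # drops[y] is how many fewer black pixels row y+1 has than row y.
    drops = [profile[y] - profile[y + 1] for y in range(len(profile) - 1)]
    return max(range(len(drops)), key=drops.__getitem__)


if __name__ == "__main__":
    toy = [
        [0, 0, 0, 0],
        [0, 1, 0, 0],  # ascender
        [1, 1, 1, 1],  # body of the letters
        [1, 1, 1, 0],
        [0, 0, 1, 0],  # descender
        [0, 0, 0, 0],
    ]
    print(baseline_row(toy))  # 3
```

In the toy image the densest rows are the letter bodies, and the largest drop occurs just below them, so the descender row does not shift the detected bottom of the text.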
3.2.3. Modeling and Solving
As previously stated, we first ignore the sequence of the pieces within each line of shredded paper. We again suppose that all pieces of the shredded paper have been spliced correctly. We model 9 pieces of shredded paper for analysis.
As Figure 5 shows, each rectangular box drawn with thick lines represents a piece of shredded paper, while the rectangular boxes drawn with dotted lines represent the lines of text.
The symbol $H$ is the pixel height of a line of shredded paper. From the pieces of shredded paper we find that $H$ is 180 pixels for every piece, and the line coordinate is $i = 1, 2, \ldots, 11$. The symbol $h$ is the row height of the text, that is, the pixel distance between the bottoms of two consecutive lines of text. The pixel distance $d_i$ of the $i$th line of shredded paper can then be obtained by the following method.
Suppose that the lines of text are numbered $k = 1, 2, \ldots$; we find the minimum positive integer $k$ that satisfies the inequality $d_1 + kh - (i-1)H \ge 0$ and substitute it into the following formula:
$$d_i = d_1 + kh - (i-1)H.$$
This gives the value of $d$ for the $i$th line of shredded paper.
For example, given the value $d_1$ of the first line of shredded paper, the values $d_2$ and $d_3$ are obtained as follows.
When $i = 2$, the inequality becomes $d_1 + kh - H \ge 0$. Solving it for the minimum $k$ and substituting into the formula gives $d_2 = 24$ pixels.
When $i = 3$, the same method gives $d_3 = 48$ pixels.
By analogy, we obtain the $d$ value of every line of shredded paper and thus the correct order of the lines.
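The enumeration of the per-line offsets can be sketched as below. The sketch assumes that the offset of the $i$th line of shredded paper is $d_i = d_1 + kh - (i-1)H$, with $k$ the smallest positive integer that makes this value nonnegative. The function names are hypothetical, and the sample values $d_1 = 0$ and $h = 68$ are illustrative inputs chosen so that the output reproduces the 24 and 48 pixels quoted above for a piece height of 180 pixels; they are not measurements from the paper.

```python
from typing import Dict, Hashable, List


def line_offsets(d1: int, h: int, H: int, n_lines: int) -> List[int]:
    """Expected offset d_i for lines i = 1..n_lines (all lengths in pixels)."""
    if h <= 0 or H <= 0:
        raise ValueError("h and H must be positive")
    offsets = [d1]
    for i in range(2, n_lines + 1):
        top = (i - 1) * H                  # distance from page top to this line of pieces
        k = max(1, -(-(top - d1) // h))    # smallest positive k with d1 + k*h >= top
        offsets.append(d1 + k * h - top)
    return offsets


def assign_lines(piece_offsets: Dict[Hashable, int],
                 expected: List[int]) -> Dict[Hashable, List[int]]:
    """Map each piece to the 1-based line numbers whose expected offset it matches."""
    by_offset: Dict[int, List[int]] = {}
    for line, d in enumerate(expected, start=1):
        by_offset.setdefault(d, []).append(line)
    return {pid: by_offset.get(d, []) for pid, d in piece_offsets.items()}


if __name__ == "__main__":
    expected = line_offsets(d1=0, h=68, H=180, n_lines=3)
    print(expected)  # [0, 24, 48]
    print(assign_lines({"a": 24, "b": 0, "c": 48}, expected))
```

`assign_lines` returns a list of candidate lines per piece because two different lines can share the same offset, in which case the offset alone does not determine the line coordinate.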
Using the model of Problem 1, we obtain the row coordinate $j$. Once all coordinates $(i, j)$ are known, every piece of the shredded paper can be located. The final result is shown in Figure 6.
3.3. Analysis for Problem 3
Building on Problem 2, the purpose of Problem 3 is to repair shredded paper that was printed double-sided. This poses a new problem: before repairing the shredded paper, we must identify which side of each piece belongs to the front and which to the back of the original paper. Through the relation between the value $d$ and the line coordinate, we can obtain the line coordinate of each piece of shredded paper. The problem then reduces to an easier one: sorting and splicing double-sided pieces that come from the same line of the original paper.
3.3.1. Solving Idea and Method
As previously stated, we must identify the front and back of each piece while sorting and splicing. Given that the text on both sides has the same line height, we can still use the model of Problem 2 to obtain the correspondence between the line coordinate and the value of $d$. By calculating the value of $d$ for each piece, we obtain the line coordinate of each of them. The problem is then reduced to an easier one: how to sort and splice both sides of the pieces that lie in the same line. This paper introduces a new concept, the sum-similarity. When two pieces of shredded paper are spliced, each of the two sides produces a splicing similarity along the shared edge, and the sum-similarity is the sum of these two similarity values. For paper printed on both sides, there are two possible cases, which produce two sets of sum-similarity values. Consequently, identifying the front and back of a piece turns into comparing the two sets of sum-similarity values. At every turn, we take the piece of shredded paper that has the maximum sum-similarity with the current piece and splice it on. Finally, we obtain the recovered paper.
3.3.2. Modeling and Solving
From the result of Problem 2, we obtain the value of $d$ for each piece of shredded paper, which yields the line coordinate of each piece. Suppose that the two sides of each piece, like those of the original paper, are named side A and side B, and suppose further that the text on side A of a piece ought to belong to side A of the original paper. There are then two cases to consider when the $k$th piece is spliced with the $(k+1)$th piece, given that both pieces come from the same line of the original paper.
Case 1. Side A of the $k$th piece is spliced with side A of the $(k+1)$th piece, and likewise their B sides. This produces two splicing similarities, and the overall similarity is their sum. The process is shown in Figure 7.
Case 2. Side A of the $k$th piece is spliced with side B of the $(k+1)$th piece, while side B of the $k$th piece is spliced with side A of the $(k+1)$th piece. As in Case 1, this produces two splicing similarities, and the overall similarity is obtained in the same way.
After that, we let the program choose a distinctive piece of shredded paper, such as the piece from the left edge of the original paper, and define it as the first piece. When we try splicing each of the remaining 18 pieces with the first piece, there are 36 values of overall similarity. Following the algorithm of Problem 1, we choose the piece with the maximum similarity and splice it to the first piece.
By analogy, when we place the $k$th piece of a line of 19 pieces, $k - 1$ pieces are already in place, so there are $2(20 - k)$ values of overall similarity. At every turn, we take the piece with the maximum sum-similarity with the current piece, until all pieces of shredded paper have been spliced. The resulting reconstruction of the shredded paper is shown in Figures 8 and 9.
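The two cases and the choice of the best neighbour can be sketched as follows. Several details here are assumptions of the sketch rather than statements of the paper: each piece is a dictionary with sides "A" and "B", each side stores its own "left" and "right" binary edge vectors as seen when that side faces the viewer, the current piece is oriented with side A up, and the back sides meet in mirrored order (the neighbour's back right edge touches the current piece's back left edge). The names `smc`, `sum_similarity`, and `best_neighbor` are hypothetical.

```python
from typing import Dict, List, Optional, Sequence, Tuple

Side = Dict[str, Sequence[int]]   # {"left": [...], "right": [...]}
Piece = Dict[str, Side]           # {"A": Side, "B": Side}


def smc(u: Sequence[int], v: Sequence[int]) -> float:
    """Simple matching coefficient of two equal-length binary vectors."""
    if len(u) != len(v) or len(u) == 0:
        raise ValueError("vectors must be non-empty and of equal length")
    return sum(1 for a, b in zip(u, v) if a == b) / len(u)


def sum_similarity(cur: Piece, cand: Piece, flipped: bool) -> float:
    """Sum of the front and back splicing similarities for one orientation of `cand`."""
    # Case 1 (not flipped): A joins A and B joins B.
    # Case 2 (flipped): cur's A joins cand's B and cur's B joins cand's A.
    front, back = ("B", "A") if flipped else ("A", "B")
    s_front = smc(cur["A"]["right"], cand[front]["left"])
    # Seen from behind, the neighbour lies to the left of the current piece.
    s_back = smc(cand[back]["right"], cur["B"]["left"])
    return s_front + s_back


def best_neighbor(cur: Piece, candidates: List[Piece]) -> Tuple[int, bool, float]:
    """Index, orientation, and score of the candidate with maximum sum-similarity."""
    best: Optional[Tuple[int, bool, float]] = None
    for idx, cand in enumerate(candidates):
        for flipped in (False, True):
            score = sum_similarity(cur, cand, flipped)
            if best is None or score > best[2]:
                best = (idx, flipped, score)
    if best is None:
        raise ValueError("no candidates given")
    return best
```

If the winning candidate is flipped, its A and B sides would be swapped before it becomes the current piece for the next step, so that the side-A-up convention of the sketch is preserved.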
Finally, the time performance of the algorithm is summarized in Figure 10.
Figure 10 shows the relation between the number of fragments in a line and the time consumed when the algorithm splices double-sided shredded paper. For an 11 (lines) × 19 (rows) double-sided shredded document, the algorithm takes no longer than 10 milliseconds. Figure 10 also shows that the relationship between the time consumption and the number of fragments is nonlinear. Because each step compares the current piece with all remaining pieces, the time complexity of the algorithm can be expressed as $O(n^2)$, where $n$ is the number of fragments in a line.
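As a rough check on this growth rate, the number of overall-similarity evaluations in one line can be counted directly. The count below assumes $n$ fragments per line, a known first piece, and two orientations for every remaining candidate at each step, as in the greedy procedure described above; it ignores the cost of each individual edge comparison.

```latex
\[
  \sum_{k=2}^{n} 2\,(n - k + 1) \;=\; 2 \sum_{r=1}^{n-1} r \;=\; n\,(n-1),
  \qquad\text{so for } n = 19:\quad 19 \times 18 = 342 .
\]
```

The first term of the sum ($k = 2$) is $2(n-1)$, which gives the 36 values quoted for 19 fragments, and the total grows quadratically with $n$.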
4. Application of the Improved Algorithm to Chinese Shredded Paper
The algorithm above was used to repair English shredded paper. It is not fully appropriate for Chinese shredded paper, however, because of a crucial difference between the two kinds of text: the height of the row.
More specifically, the Chinese characters in the text have essentially identical heights. We can therefore obtain the row height $h$ directly by program, rather than by calculating the frequency distribution of black pixels. Once $h$ is known, the problem can be modeled and solved in the same way as for English shredded paper. On the whole, the algorithms for Chinese and English shredded paper are similar in essence. Both kinds of shredded document must be preprocessed to satisfy the five assumptions stated in Section 2, and the two algorithms are executed in similar ways. The main difference lies in how the row height $h$ is obtained. The concrete steps are illustrated in Figures 11, 12, and 13.
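For text whose characters all have nearly the same height, the row height can be read off from the bands of inked pixel rows. The following is a sketch under assumptions, not the authors' program: a piece is a list of rows of 0/1 values (1 = black), the names `text_bands` and `row_height` are hypothetical, and the row height is taken as the most common distance between the bottoms of consecutive bands.

```python
from collections import Counter
from typing import List, Optional, Tuple


def text_bands(piece: List[List[int]]) -> List[Tuple[int, int]]:
    """(top, bottom) pixel-row indices of each maximal run of rows containing ink."""
    bands, start = [], None
    for y, row in enumerate(piece):
        inked = any(row)
        if inked and start is None:
            start = y
        elif not inked and start is not None:
            bands.append((start, y - 1))
            start = None
    if start is not None:
        bands.append((start, len(piece) - 1))
    return bands


def row_height(piece: List[List[int]]) -> Optional[int]:
    """Most common distance between bottoms of consecutive text bands, if any."""
    last_row = len(piece) - 1
    # A band touching the bottom edge may be a cut-off line of text,
    # so its bottom is not a reliable text bottom and is skipped.
    bottoms = [b for _, b in text_bands(piece) if b != last_row]
    gaps = [b2 - b1 for b1, b2 in zip(bottoms, bottoms[1:])]
    if not gaps:
        return None
    return Counter(gaps).most_common(1)[0][0]
```

A band touching the top edge is kept because its bottom is still the true bottom of that line of text, even when its upper part was cut away.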
5. Conclusion
This paper has discussed models and algorithms for the reconstruction of shredded paper. For the condition of Problem 1, the algorithm compares the similarity between the rightmost pixels of a given piece and the leftmost pixels of the other pieces. Its time complexity is low, and with this model and algorithm the problem was solved without any manual intervention. For the condition of Problem 2, the pieces were sorted according to the location of each piece, and a satisfactory result was obtained. For the condition of Problem 3, the double-sided pieces were sorted by optimum matching; this model and algorithm recovered the shredded paper with only a little manual intervention. Finally, given the difference between English and Chinese text, this paper proposed an improved algorithm to complete the reconstruction of Chinese shredded paper.
It must also be pointed out that the proposed algorithms have their own limitations. Extensive tests show that, in practical problems, the pieces produced by various shredders are not all exactly identical. In addition, the text color, the texture of the paper, the cutting direction (for example, documents shredded askew or irregularly), and large stains on the paper all affect the repair of the shredded document to some degree. In some cases, proper preprocessing reduces this impact. In many more serious cases, however, the algorithms introduced in this paper do not perform precisely enough. To improve the general applicability of the algorithm, we have been trying other methods. As we explore more kinds of shredded paper, we believe that the technology will become more mature.
Conflict of Interests
The authors declare that there is no conflict of interests regarding the publication of this paper.
Acknowledgments
The authors gratefully acknowledge the WUT Mathematical Computation and Modeling Simulation Center for supplying the related materials. This work was inspired by the activities of IMP Action of “the Fundamental Research Funds for the Central Universities (2012-IV-057)” and was also supported by the National Natural Science Foundation of China (nos. 10672128, 50878169).
Copyright © 2014 Peng Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.