Abstract

The CAPTCHA has become an important issue in multimedia security. Focusing on the commonly used text-based CAPTCHA, this paper outlines typical methods and summarizes the technological progress in text-based CAPTCHA breaking. First, the paper presents a comprehensive review of recent developments in the text-based CAPTCHA breaking field. Second, a framework for text-based CAPTCHA breaking is proposed; it mainly consists of preprocessing, segmentation, combination, recognition, postprocessing, and other modules. Third, the research progress of the techniques involved in each module is introduced, and some typical segmentation and recognition methods are compared and analyzed. Lastly, the paper discusses some problems worth further research.

1. Introduction

As a multimedia security mechanism, the CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart [1]), also called a Human Interactive Proof (HIP [2]), can protect multimedia privacy. It has been successfully deployed by Google, Yahoo, Microsoft, and other major websites. Breaking technology came into being in order to verify the security and reliability of CAPTCHAs. It involves image processing, pattern recognition, image understanding, artificial intelligence, computer vision, and many other disciplines. Research on CAPTCHA breaking has great value in both research and application. First, CAPTCHA breaking can verify the security of existing CAPTCHAs and thereby promote the development of CAPTCHA design techniques. Second, the CAPTCHA is an integral part of artificial intelligence and an important prerequisite for natural human-computer interaction. Finally, research on breaking CAPTCHAs not only continually pushes the limits of the Turing test but can also be applied in other fields such as digital paper-based media, speech recognition, and image labeling.

In recent decades, with the continuous development of CAPTCHA technology, the relevant literature has grown steadily. In 2013, [3] introduced the CAPTCHAs of the time and the attacks against them; the authors investigated the robustness and usability of those CAPTCHAs and discussed ideas for developing more robust and usable ones. Five years later, it is necessary to reorganize the literature that has emerged since. Based on research into text-based CAPTCHA breaking techniques, this paper reviews the related work and discusses future trends.

The remainder of this paper is organized as follows: Section 2 briefly introduces the text-based CAPTCHA. Section 3 provides an overview of the text-based CAPTCHA breaking technique. Sections 4–8 describe the main steps in the overall framework of the text-based CAPTCHA breaking technique. Section 9 points out some problems that can be studied further. Section 10 concludes the paper.

2. Overview on Text-Based CAPTCHA

In September 2000, a research team at Carnegie Mellon University (CMU) designed the first commercial CAPTCHAs, the Gimpy series of text-based CAPTCHAs, to resist malicious advertisements spread by illegal scripts in the Yahoo chat room. Research on CAPTCHA design and breaking started at the same time. International seminars on HIP were held in 2002 and 2005, and a large number of related research results were published. In subsequent years, many research results were reported at international conferences including CVPR, NIPS, CCS, and NDSS. Many internationally renowned universities and research institutions have established research groups on CAPTCHA technology, such as CMU [1, 8–14], PARC [15–19], UCB [16, 17, 20, 21], Microsoft [2, 22–27], Google [28–30], Bell Laboratory [31, 32], Yan et al. [4, 33–42], Xidian University [41–47], and University of Science and Technology of China [48, 49]. In addition, many websites offer public CAPTCHA services, such as CAPTCHA [10], BotBlock [50], JCAPTCHA [51], and HCaptcha [52]. Some research groups focus on CAPTCHA recognition, such as PWNtcha [53], Captchacker [54], aiCaptcha [55], and Greg Mori [56].

The security of text-based CAPTCHA mainly depends on the visual interference effects [25], including rotation, twisting, adhesion, and overlap. The typical types of text-based CAPTCHA and their features are shown in Table 1.

To resist machine recognition, the text-based CAPTCHA’s security is often protected by a series of technologies. From Table 1, we can sum up the following main features of the text-based CAPTCHA.

(1) A large enough character set. Only when the character set is large enough is the total number of possible CAPTCHA strings large enough to resist brute-force attacks.

(2) Characters with distortion, adhesion, and overlap. When characters are distorted, adherent, and overlapping, breaking methods cannot easily segment a CAPTCHA image into single characters.

(3) The characters are different in size, width, angle, location, and fonts. When comparing features of different characters, the various transformations may reduce recognition accuracy.

(4) The strings with unfixed length. In a CAPTCHA scheme, strings with unfixed length can increase breaking difficulty to a certain extent.

(5) Hollow characters and broken contours. Compared with solid characters, hollow characters offer fewer features, and broken contours can effectively resist the filling attack.

(6) The color and shape of complex backgrounds are similar to those of characters. If the images meet these conditions, the noise is difficult to remove. This may reduce recognition accuracy.

The above features effectively enhance text-based CAPTCHAs’ security and bring great challenges to the CAPTCHA breaking research at the same time.
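Features (1) and (4) can be made concrete with a little arithmetic. The following is a minimal sketch, assuming a hypothetical 36-symbol alphabet and hypothetical string lengths (neither is taken from Table 1); it counts how many distinct strings a brute-force attacker would face.

```python
def captcha_keyspace(charset_size: int, min_len: int, max_len: int) -> int:
    """Count the distinct strings of length min_len..max_len over the character set."""
    return sum(charset_size ** n for n in range(min_len, max_len + 1))


# Hypothetical scheme parameters, chosen only for illustration.
fixed_length = captcha_keyspace(36, 6, 6)     # exactly 6 characters
variable_length = captcha_keyspace(36, 4, 8)  # 4 to 8 characters
print(fixed_length, variable_length)
```

Under these assumptions the variable-length scheme has a strictly larger keyspace, because it contains every length-6 string plus those of the other lengths.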

3. Research Progress of Breaking Text-Based CAPTCHA

Because text-based CAPTCHA schemes vary widely, so do the methods for breaking them. According to whether segmentation is used, the existing breaking methods fall into two categories.

3.1. Text-Based CAPTCHA Breaking Methods Based on Segmentation

The text-based CAPTCHA breaking based on segmentation has different processing methods for different objects and results. When there is no adherent character, individual characters are obtained using vertical projection and connected component with good effect. As shown in Table 2, the success rates of nonadherent character CAPTCHA range from 78% to 100%.

However, these methods had little success with adherent characters. Therefore, more complicated methods, based on character width, character features, and character contours, were proposed one after another. As more and more antisegmentation technologies appear in the CAPTCHA field, obtaining individual characters is becoming harder and harder. Researchers then proposed segmentation methods that obtain character components by character structure, filters, and so forth. As can be seen from Table 3, the success rates of CAPTCHA breaking are generally low, with only a few higher than 80%.

3.2. Text-Based CAPTCHA Breaking Methods Based on Nonsegmentation

The text-based CAPTCHA breaking methods based on nonsegmentation directly recognize preprocessed CAPTCHA images, so their success rate relies on the recognition technique. In the early stage, pattern matching algorithms such as shape context [20] and similarity [57] were used for recognition. Later, as the success rates of individual character recognition improved, researchers focused on character segmentation. However, text-based CAPTCHA designs use antisegmentation techniques, which prevent complete individual characters from being obtained. Now, with the advantages of deep learning, breaking based on nonsegmentation is expected to return to prominence. The success rates of typical text-based CAPTCHA breaking methods based on nonsegmentation are shown in Table 4.

3.3. The Framework of Text-Based CAPTCHA Breaking Technique

With the improvement of text-based CAPTCHA design, the breaking technique changes to meet it. The early text-based CAPTCHA contains nonadherent characters. The breaking technique is the traditional framework of “preprocessing + segmentation + recognition.” In recent years, most of the text-based CAPTCHAs use CCT (Crowded Characters Together). Therefore, various breaking frameworks come into being, for example, “preprocessing + recognition,” “preprocessing + recognition + postprocessing,” “preprocessing + segmentation + combination + recognition,” and “preprocessing + segmentation + combination + recognition + postprocessing.”

In this paper, the existing frameworks are integrated into an overall framework of text-based CAPTCHA breaking, as shown in Figure 1. The framework mainly consists of preprocessing, segmentation, combination, recognition, postprocessing, and other modules. The research progress of each module will be described in the following.

4. Preprocessing Methods of Breaking Text-Based CAPTCHA

The CAPTCHA preprocessing is the first step of CAPTCHA image processing before segmentation and recognition. Its main purpose is to highlight the information related to characters in a given image and to weaken or eliminate interfering information. The preprocessing of existing CAPTCHA breaking methods mainly includes image binarization, image thinning, denoising, and so on.

4.1. Image Binarization

Image binarization highlights the contours of the objects of interest and removes background noise. The key to binarization is selecting an appropriate threshold. When a single threshold is applied to the whole image, it is called the global threshold method; otherwise, it is called the local threshold method. If the threshold is not fixed during processing, it is called the variable threshold method or dynamic threshold method. Common thresholding methods include Sauvola and Pietikainen’s method [65] and Otsu’s method [66].
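As an illustration of a global threshold method, the following is a minimal pure-Python sketch of Otsu's criterion, which picks the threshold that maximizes the between-class variance of the gray-level histogram. The input representation (a list of rows of integers in 0..255) and the assumption that characters are darker than the background are choices made for this example only.

```python
def otsu_threshold(gray):
    """Return the threshold t in 0..255 that maximizes between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Mark pixels with value <= t as object (1); assumes dark text on a light background."""
    return [[1 if v <= t else 0 for v in row] for row in gray]


gray = [[10, 10, 200],
        [12, 198, 202]]
t = otsu_threshold(gray)
binary = binarize(gray, t)
```

A local method such as that of [65] would instead compute a separate threshold per neighborhood; that variant is not shown here.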

4.2. Image Thinning

Image thinning reduces a character’s contour to its skeleton. It must not change the character’s adhesion. Its purpose is to highlight the image contour and to simplify subsequent processing. Thinning algorithms fall into two categories: noniterative algorithms and iterative algorithms. Common thinning algorithms include the Hilditch algorithm [67] and the Zhang and Suen algorithm [68].
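To show what an iterative thinning algorithm looks like in practice, here is a compact sketch of the two-subiteration scheme commonly attributed to Zhang and Suen [68]. It assumes a binary image stored as a list of lists with 1 for object pixels and treats pixels outside the image as background; it is an illustration, not a tuned implementation.

```python
def zhang_suen_thin(img):
    """Thin a binary image (1 = object) toward a one-pixel-wide skeleton."""
    img = [row[:] for row in img]
    rows, cols = len(img), len(img[0])
    # Neighbours P2..P9, clockwise, starting with the pixel directly above.
    offsets = [(-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1), (-1, -1)]

    def neighbours(r, c):
        return [img[r + dr][c + dc] if 0 <= r + dr < rows and 0 <= c + dc < cols else 0
                for dr, dc in offsets]

    changed = True
    while changed:
        changed = False
        for step in (0, 1):
            to_clear = []
            for r in range(rows):
                for c in range(cols):
                    if img[r][c] != 1:
                        continue
                    p = neighbours(r, c)  # p[0]=P2, p[2]=P4, p[4]=P6, p[6]=P8
                    b = sum(p)            # number of object neighbours
                    # number of 0 -> 1 transitions around the ring P2, P3, ..., P9, P2
                    a = sum(1 for i in range(8) if p[i] == 0 and p[(i + 1) % 8] == 1)
                    if not (2 <= b <= 6 and a == 1):
                        continue
                    if step == 0:
                        ok = p[0] * p[2] * p[4] == 0 and p[2] * p[4] * p[6] == 0
                    else:
                        ok = p[0] * p[2] * p[6] == 0 and p[0] * p[4] * p[6] == 0
                    if ok:
                        to_clear.append((r, c))
            # Pixels are cleared only after the whole pass, as the algorithm requires.
            for r, c in to_clear:
                img[r][c] = 0
            changed = changed or bool(to_clear)
    return img


# A 3-pixel-thick horizontal bar inside a 5 x 10 image.
bar = [[1 if 1 <= r <= 3 and 1 <= c <= 8 else 0 for c in range(10)] for r in range(5)]
skeleton = zhang_suen_thin(bar)
```

For this bar the skeleton is a single horizontal run on the middle row, slightly shorter than the original bar because the scheme erodes the ends.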

4.3. Image Denoising

In order to resist breaking, CAPTCHA images contain noise and interference lines. In addition, some noise is generated during grayscale conversion and binarization. Therefore, the CAPTCHA image needs to be denoised. Typical methods are shown in Table 5. The denoising method should be chosen according to the actual situation.
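One of the simplest denoising ideas can be sketched as follows: delete object pixels that have too few object pixels in their 8-neighborhood. This is only an illustrative example under the assumption of a binary image with 1 for object pixels; it is not claimed to be one of the methods listed in Table 5, and the parameter `min_neighbours` is hypothetical.

```python
def remove_isolated_pixels(binary, min_neighbours=1):
    """Clear object pixels that have fewer than min_neighbours object neighbours."""
    rows, cols = len(binary), len(binary[0])
    out = [row[:] for row in binary]
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1:
                continue
            count = 0
            for dr in (-1, 0, 1):
                for dc in (-1, 0, 1):
                    if (dr or dc) and 0 <= r + dr < rows and 0 <= c + dc < cols:
                        count += binary[r + dr][c + dc]
            if count < min_neighbours:
                out[r][c] = 0
    return out


noisy = [[1, 0, 0, 0],
         [0, 0, 1, 1],
         [0, 0, 1, 1]]
clean = remove_isolated_pixels(noisy)
```

Interference lines usually need stronger tools than this, since a line pixel always has neighbors along the line.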

5. Segmentation Methods of Breaking Text-Based CAPTCHA

The segmentation aims to get individual characters or character components. There are the segmentation methods based on individual characters and the segmentation methods based on character components.

5.1. Segmentation Methods Based on Individual Characters

The segmentation methods based on individual characters segment a CAPTCHA image into individual characters. For nonadherent characters, we can use segmentation methods based on character projection and connected components. For CCT characters, we can use segmentation methods based on character width, character feature, and character contour.

5.1.1. Segmentation Methods Based on Character Projection

The segmentation methods based on character projection determine the optimal segmentation position by analyzing the number of pixels projected under different conditions. This approach applies to CAPTCHA characters with no adhesion or only slight adhesion. However, it is not effective for seriously adherent and distorted characters. Typical methods include vertical projection segmentation, horizontal projection segmentation, and guideline projection segmentation.

Using (1), [61] defines a three-color bar code to segment reCAPTCHA images:

$$c_j = \begin{cases} \text{blue}, & n_j = 0, \\ \text{white}, & n_j = 1, \\ \text{black}, & n_j > 1, \end{cases} \tag{1}$$

where $n_j$ represents the total number of object pixels in the $j$th column and $c_j$ is the color assigned to that column. In the three-color bar code, a column is colored blue if no pixel in the column belongs to a character ($n_j = 0$). If there is only one object pixel in the column ($n_j = 1$), the column is encoded as white. Finally, black corresponds to a column with more than one object pixel ($n_j > 1$), as shown in Figure 2(a). After denoising, the optimal segmentation line is placed in the middle of a blue bar or white bar, as shown in Figure 2(b).
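The column-coloring rule described above can be sketched in a few lines. The function names are hypothetical, and one simplification is assumed: adjacent blue and white columns are merged into a single gap, and a cut is placed at the middle of every gap that lies between two black runs. The denoising step of [61] is omitted.

```python
def bar_code(binary):
    """Colour each column by its object-pixel count: 0 -> blue, 1 -> white, more -> black."""
    cols = len(binary[0])
    counts = [sum(row[j] for row in binary) for j in range(cols)]
    return ["blue" if n == 0 else "white" if n == 1 else "black" for n in counts]


def cut_positions(colours):
    """Return the middle column of each non-black run enclosed by black runs."""
    cuts, start, seen_black = [], None, False
    for j, colour in enumerate(colours):
        if colour == "black":
            if start is not None and seen_black:
                cuts.append((start + j - 1) // 2)
            start, seen_black = None, True
        elif start is None:
            start = j
    return cuts


image = [[0, 1, 1, 0, 0, 0, 1, 1, 0],
         [0, 1, 1, 1, 0, 0, 1, 1, 0],
         [0, 1, 0, 0, 0, 0, 0, 1, 0]]
colours = bar_code(image)
cuts = cut_positions(colours)
```

In this toy image the two characters occupy columns 1 to 3 and 6 to 7, and the single cut falls in the gap between them.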

5.1.2. Segmentation Methods Based on Connected Components

The segmentation methods based on connected components segment individual characters by exploiting the different connected components in an image. This approach is effective for slanted and distorted characters. However, it is limited by adherent characters.

Reference [4] tried to segment Microsoft MSN CAPTCHA by combining connected components and vertical projection, as shown in Figure 3. First, different connected components are marked with different colors. And then the character blocks are generated according to different colors. Finally, strings are segmented to individual characters using the vertical projection feature, with a success rate of more than 90%.
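The first step of such a pipeline, marking connected components, can be sketched with a breadth-first search. The sketch assumes 8-connectivity and a binary image with 1 for object pixels; it returns a label map (the "colors") and the bounding box of each component sorted from left to right. It does not reproduce the subsequent vertical projection step of [4].

```python
from collections import deque


def connected_components(binary):
    """Label 8-connected object regions; return (labels, boxes sorted left to right).

    Each box is (left, top, right, bottom), inclusive.
    """
    rows, cols = len(binary), len(binary[0])
    labels = [[0] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if binary[r][c] != 1 or labels[r][c]:
                continue
            label = len(boxes) + 1
            labels[r][c] = label
            queue = deque([(r, c)])
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and binary[ny][nx] == 1 and not labels[ny][nx]):
                            labels[ny][nx] = label
                            queue.append((ny, nx))
            boxes.append((left, top, right, bottom))
    return labels, sorted(boxes)


image = [[1, 1, 0, 0, 1],
         [1, 0, 0, 0, 1],
         [0, 0, 0, 1, 1]]
labels, boxes = connected_components(image)
```

Two adherent characters would come out as a single component, which is exactly the limitation noted above.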

5.1.3. Segmentation Methods Based on Character Width

The segmentation methods based on character width are suitable for CAPTCHA images that are not easily segmented into individual characters. Reference [60] used different widths (0.75, 1, 1.5, and 2 times the average character width) to segment an image. Thus, each character corresponds to four recognition results, from which the optimal segment is chosen as the final recognition result. In contrast, [5] did not take the average width as the standard; the authors generated a set of character segments between the minimum width and the maximum width and then determined the optimal segmentation scheme using dynamic programming, as shown in Figure 4.
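The multi-width idea can be sketched as a function that proposes candidate column ranges for the next character. The function name, the clipping at the image border, and the example numbers are assumptions made for illustration; each returned window would then be passed to a recognizer, and the best-scoring one kept.

```python
def width_candidates(left, avg_width, image_width, factors=(0.75, 1.0, 1.5, 2.0)):
    """Return candidate (left, right) column ranges, one per width factor.

    Ranges are clipped to the image width, and duplicates caused by clipping are dropped.
    """
    windows = []
    for f in factors:
        right = min(image_width, left + round(avg_width * f))
        if right > left and (left, right) not in windows:
            windows.append((left, right))
    return windows


first = width_candidates(0, 20, 100)
near_edge = width_candidates(80, 20, 100)
```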

5.1.4. Segmentation Methods Based on Character Feature

The segmentation method based on character features uses the features of CAPTCHA string, including inside features and outside features. Reference [38] classifies characters according to their own inside features, and each class contains the characters as shown in Table 6.

Reference [6] segments characters according to the outside features among them. That work proposes a new segmentation algorithm for CAPTCHAs called middle-axis point separation. The algorithm uses the central background pixel between two disconnected object pixels as a segmentation point (see Figure 5).

5.1.5. Segmentation Methods Based on Character Contour

The segmentation method based on character contours is to analyze geometric features of character contours, so as to determine the appropriate segmentation lines. Reference [7] tried to connect connection edge points between two merged characters and determined the optimal segmentation line by confidence, as shown in Figure 6.

5.2. Segmentation Methods Based on Character Components

The segmentation methods based on character components produce multiple character components, rather than individual characters. These segmentation methods are mainly based on character structure or filters.

5.2.1. Segmentation Methods Based on Character Structure

Using structural feature of characters with black components and white components, [36] segmented a seriously overlapped string to multiple components. First, locate black components, as shown in Figure 7(b). And then, locate white components, as shown in Figure 7(c). Finally, identify black components of each character and the shared white components.

In [41], a CAPTCHA image contains several hollow characters, whose contours naturally form several closed regions (see Figure 8(a)). According to this structural feature, a character is segmented to several character components by color filling (see Figure 8(b)).

5.2.2. Segmentation Methods Based on Filter

Reference [42] is the first to apply Gabor filters for breaking CAPTCHAs, which extracts character components along four directions by convolving a CAPTCHA image with each of four filters, respectively, as shown in Figure 9. The segmentation method is not limited by adhesion, distortion, and overlap and is suitable for many kinds of characters.
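To make the filtering step concrete, the following sketch builds the real part of a Gabor kernel and applies it to an image. The four orientations (0, 45, 90, and 135 degrees), the kernel size, and the parameters `sigma`, `lambd`, `gamma`, and `psi` are assumptions for illustration and are not taken from [42].

```python
import math


def gabor_kernel(size, theta, sigma=2.0, lambd=4.0, gamma=0.5, psi=0.0):
    """Real part of a Gabor filter: a Gaussian envelope times a cosine carrier."""
    half = size // 2
    kernel = []
    for y in range(-half, half + 1):
        row = []
        for x in range(-half, half + 1):
            xr = x * math.cos(theta) + y * math.sin(theta)
            yr = -x * math.sin(theta) + y * math.cos(theta)
            envelope = math.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
            row.append(envelope * math.cos(2 * math.pi * xr / lambd + psi))
        kernel.append(row)
    return kernel


def filter_image(image, kernel):
    """Correlate image with kernel (zero padding, same-size output).

    With psi = 0 the kernel is point-symmetric, so this equals convolution.
    """
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for ky, krow in enumerate(kernel):
                for kx, k in enumerate(krow):
                    y, x = r + ky - half, c + kx - half
                    if 0 <= y < rows and 0 <= x < cols:
                        acc += k * image[y][x]
            out[r][c] = acc
    return out


# Assumed orientations: 0, 45, 90, and 135 degrees.
kernels = [gabor_kernel(7, k * math.pi / 4) for k in range(4)]

# A vertical stroke responds most strongly to the 0-degree kernel.
stroke = [[1 if c == 3 else 0 for c in range(7)] for _ in range(7)]
resp_0 = filter_image(stroke, kernels[0])[3][3]
resp_90 = filter_image(stroke, kernels[2])[3][3]
```

Thresholding each of the four responses would yield stroke fragments in that direction, which is the kind of character component that the combination stage later reassembles.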

In summary, a comparison of the segmentation methods is given in Table 7. As can be seen, each segmentation method applies to different types of characters. Only a segmentation method tailored to the specific CAPTCHA scheme can obtain good results.

6. Combination Methods of Breaking Text-Based CAPTCHA

An individual character after segmentation can be recognized directly. But character components need to be combined into an individual character to be recognized. According to the number of generated candidate characters, combination technologies can be divided into two categories: the combination technique based on redundancy and the combination technique based on nonredundancy.

6.1. Combination Methods Based on Redundancy

The number of candidate characters generated by combination technique based on redundancy is more than the number of real characters. In [42], each character fragment is labeled in order from top to bottom and left to right, and then the components are combined on the idea of jigsaw puzzle to generate candidate characters.

6.2. Combination Methods Based on Nonredundancy

The number of candidate characters generated by combination technique based on nonredundancy is equal to the number of actual characters. In [36], the character components are nonredundant. The overlap area strokes may be reused to compose a complete character. Figure 7(a) shows a Megaupload CAPTCHA image. Figure 10 gives the combined four characters. The final success rate of combination is 78.25%.

7. Recognition Methods of Breaking Text-Based CAPTCHA

Nowadays, the recognition methods used in text-based CAPTCHA system include three categories: template matching, character feature, and machine learning.

7.1. Recognition Methods Based on Template Matching

Template matching compares a character with every template pixel by pixel and selects the template with the highest similarity. According to the matching range, there are matching recognition methods based on global property and matching recognition methods based on local feature.

7.1.1. Matching Recognition Methods Based on Global Property

The matching recognition methods based on global property rely on traverse scanning. Within the search area, the optimal match point for each pixel is found by regional correlation matching. Because matching many templates against each pixel is slow, [45] proposes a two-stage template matching algorithm to improve efficiency: an exact match is attempted only if a rough match succeeds.
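The coarse-then-exact idea can be illustrated as follows. This is a sketch of the general strategy, not the algorithm of [45]: the similarity measure (fraction of agreeing pixels), the subsampling step, and the threshold `coarse_min` are all assumptions, and the character is assumed to be already aligned and the same size as the templates.

```python
def similarity(a, b, step=1):
    """Fraction of sampled pixels on which two equally sized binary images agree."""
    agree = total = 0
    for r in range(0, len(a), step):
        for c in range(0, len(a[0]), step):
            agree += a[r][c] == b[r][c]
            total += 1
    return agree / total


def two_stage_match(char, templates, coarse_step=2, coarse_min=0.6):
    """Return (label, score) of the best template that passes the rough match."""
    best_label, best_score = None, -1.0
    for label, tmpl in templates.items():
        if similarity(char, tmpl, coarse_step) < coarse_min:
            continue  # rough match failed, so the exact comparison is skipped
        score = similarity(char, tmpl)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


templates = {
    "I": [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 0, 0]],
    "-": [[0, 0, 0, 0], [1, 1, 1, 1], [0, 0, 0, 0], [0, 0, 0, 0]],
}
noisy_i = [[0, 1, 0, 0], [0, 1, 0, 0], [0, 1, 1, 0], [0, 1, 0, 0]]
label, score = two_stage_match(noisy_i, templates)
```

The saving comes from the rough pass touching only a fraction of the pixels, so most templates are discarded cheaply.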

7.1.2. Matching Recognition Methods Based on Local Feature

The shape context is a simple local feature shape descriptor. Its basic idea is to convert the matching problem of image into the matching problem of feature point set. In 2003, Mori and Malik [20] used shape context to break the CAPTCHA of Gimpy and EZ-Gimpy. For good robustness to image scaling and affine transformation, it is widely used in face recognition, CAPTCHA recognition, shape matching, and other fields.
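The descriptor itself can be sketched as a log-polar histogram: for one reference point, count where all other sampled contour points fall by distance band and direction. The bin counts and the radial bin edges below are assumptions chosen for brevity; the matching stage that compares histograms between two shapes is not shown.

```python
import math


def mean_pairwise_distance(points):
    """Average Euclidean distance over all point pairs, used to normalize scale."""
    total, count = 0.0, 0
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            total += math.hypot(points[i][0] - points[j][0], points[i][1] - points[j][1])
            count += 1
    return total / count


def shape_context(points, index, n_r=3, n_theta=8):
    """Log-polar histogram of the other points relative to points[index]."""
    px, py = points[index]
    scale = mean_pairwise_distance(points)
    hist = [[0] * n_theta for _ in range(n_r)]
    for i, (x, y) in enumerate(points):
        dx, dy = x - px, y - py
        if i == index or (dx == 0 and dy == 0):
            continue
        ratio = math.hypot(dx, dy) / scale
        # Radial bins double in width; the outermost bin holds ratio >= 1.
        r_bin = min(n_r - 1, max(0, math.floor(math.log2(ratio)) + n_r - 1))
        angle = math.atan2(dy, dx) % (2 * math.pi)
        t_bin = int(angle / (2 * math.pi / n_theta)) % n_theta
        hist[r_bin][t_bin] += 1
    return hist


points = [(0, 0), (4, 0), (0, 3), (4, 3), (2, 6)]
h = shape_context(points, 0)
h_scaled = shape_context([(3 * x, 3 * y) for x, y in points], 0)
h_shifted = shape_context([(x + 10, y - 7) for x, y in points], 0)
```

Because distances are divided by the mean pairwise distance and only relative positions are used, the histogram is unchanged by scaling and translation, which matches the robustness described above.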

7.2. Recognition Methods Based on Character Feature

Because the character of each CAPTCHA mechanism varies in design, we can define different methods according to the feature of characters, which is mainly based on character structural feature and character statistical feature.

7.2.1. Recognition Methods Based on Character Structural Feature

The structural feature can describe the details and structural information of characters, such as the number of loops, inflection point, convexo-concave degree, and cross points. For example, [46] uses the guidelines of characters (see Figure 11(a)) and closed loop detection (see Figure 11(b)) to break Yahoo CAPTCHA.

7.2.2. Recognition Methods Based on Character Statistical Feature

The recognition method based on character statistical feature uses common statistical features, including the pixel feature, projection feature, contour feature, and coarse mesh feature. These features are robust to noise interference and are widely used in the CAPTCHA recognition field. Reference [34] used the distinct pixel count of each of the letters A to Z (see Figure 12) to break the captchaservice.org CAPTCHA with a success rate near 100%.
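The pixel-count attack amounts to a lookup table. The sketch below uses tiny made-up glyphs rather than the real counts from Figure 12, and the function names are hypothetical; it returns a label only when the pixel count identifies the character uniquely.

```python
def pixel_count(binary):
    """Number of object pixels in a segmented character image."""
    return sum(sum(row) for row in binary)


def build_lookup(samples):
    """Map each pixel count to the set of labels that produce it."""
    table = {}
    for label, glyph in samples.items():
        table.setdefault(pixel_count(glyph), set()).add(label)
    return table


def recognise(glyph, table):
    """Return the label if the pixel count is unambiguous, else None."""
    labels = table.get(pixel_count(glyph), set())
    return next(iter(labels)) if len(labels) == 1 else None


# Toy glyphs, invented for illustration only.
samples = {
    "I": [[1], [1], [1]],
    "L": [[1, 0], [1, 0], [1, 1]],
    "T": [[1, 1, 1], [0, 1, 0], [0, 1, 0]],
}
table = build_lookup(samples)
shifted_l = [[0, 1, 0], [0, 1, 0], [0, 1, 1]]
```

The count is invariant to where the character sits in its box, but the approach only works when the scheme renders each character with a fixed, distinctive number of pixels.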

7.3. Recognition Methods Based on Machine Learning

The recognition methods based on machine learning essentially use machine learning algorithms to classify CAPTCHA characters correctly. In the chronological order in which they became mainstream, they can be divided into three categories: traditional methods, neural networks, and deep learning.

7.3.1. Recognition Methods Based on Traditional Methods

In the field of text-based CAPTCHA recognition, the most widely used traditional classifiers include SVM and KNN.

The idea of SVM is to separate classes via a hyperplane. The key is the kernel function, which maps the original features into a high-dimensional space in a nonlinear way, thereby improving the separability of the data. Reference [5] compared four kernel functions: RBF (radial basis function), POLY (polynomial), LINEAR, and SIGMOID. The experimental results showed that the first two kernel functions performed best.
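For reference, the four kernel families can be written out directly. The sketch below uses their usual textbook forms; the parameter names `gamma`, `coef0`, and `degree` and their default values are assumptions and do not reflect the settings used in [5].

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def linear_kernel(x, y):
    return dot(x, y)


def poly_kernel(x, y, gamma=1.0, coef0=1.0, degree=3):
    return (gamma * dot(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def sigmoid_kernel(x, y, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(x, y) + coef0)


u, v = [1.0, 2.0], [3.0, 4.0]
values = (linear_kernel(u, v), poly_kernel(u, v, degree=2), rbf_kernel(u, u))
```

Each function returns an inner product in some implicit feature space, which is what lets the SVM draw a linear hyperplane there while the decision boundary in the original space is nonlinear.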

KNN is based on the category of the nearest samples to determine the category of a sample. Reference [42] tested SVM, BPNN (back-propagation neural network), template matching, CNN, and KNN. Among these classifiers, KNN achieved higher success rates on most of the schemes, but CNN was faster most of the time.
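A minimal KNN classifier takes only a few lines. The sketch assumes that each character has already been converted to a numeric feature vector and uses squared Euclidean distance with majority voting; the value of `k` and the toy data are illustrative.

```python
from collections import Counter


def knn_classify(sample, training, k=3):
    """Return the majority label among the k training vectors nearest to sample.

    training is a list of (feature_vector, label) pairs.
    """
    def sq_dist(item):
        return sum((a - b) ** 2 for a, b in zip(sample, item[0]))

    nearest = sorted(training, key=sq_dist)[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


training = [([0, 0], "a"), ([0, 1], "a"), ([1, 0], "a"),
            ([5, 5], "b"), ([5, 6], "b"), ([6, 5], "b")]
prediction = knn_classify([0.2, 0.1], training)
```

KNN needs no training phase, but every query scans the whole training set, so its speed depends directly on how many labeled samples are kept.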

7.3.2. Recognition Methods Based on Neural Network

Owing to its parallel distributed operation across a large number of neurons, its efficient learning algorithms, and its ability to imitate human cognitive systems, the neural network is well suited to problems such as speech recognition and text recognition.

In [62], a BPNN used cross entropy for calculating the performance of a network with targets and outputs. Eventually, the system achieved an overall precision of 51.3%, 27.1%, and 53.2% for the CCT CAPTCHAs of Taobao, MSN, and eBay, respectively.
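For readers unfamiliar with the measure, the standard definition of cross entropy for a single sample with a one-hot target is sketched below. This is the generic textbook form, with a softmax output layer assumed for illustration; the exact formulation used in [62] is not specified here.

```python
import math


def softmax(scores):
    """Convert raw output scores to probabilities (shifted by the max for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(target_index, probs):
    """Loss for one sample whose true class is target_index."""
    return -math.log(probs[target_index])


confident = cross_entropy(0, softmax([4.0, 0.0, 0.0]))
unsure = cross_entropy(0, softmax([0.0, 0.0, 0.0]))
```

The loss is small when the network assigns high probability to the correct character and grows without bound as that probability approaches zero.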

However, when applying neural network, we need to extract character features first. The quality of extracted features limits the final recognition rate to a certain extent.

7.3.3. Recognition Methods Based on Deep Learning

In recent years, deep learning has achieved remarkable achievements in recognition fields of text, image, audio, and so forth. The deep learning models commonly used in CAPTCHA recognition field are CNN, RNN, LSTM-RNN, and so forth.

CNN recognizes character images without feature extraction and has a certain degree of robustness in displacement, scale, and deformation. In the existing research results, a typical CNN is widely used [2, 4, 36, 38, 41] with a good recognition accuracy. Reference [30] trained large, distributed deep convolutional neural networks and achieved 99.8% accuracy in recognizing CAPTCHA images of reCAPTCHA.

However, due to lack of time dimension, CNN cannot combine context information in recognition. So RNN with feedback and time parameters was proposed to process time series data. Later, in order to solve vanishing gradient problem of RNN, LSTM was proposed in machine learning field. Reference [62] applied 2D LSTM-RNN in CCT CAPTCHAs recognition with a success rate of 55.2%. It innovatively obtained relative information not only in horizontal context, but also in vertical context.

In summary, a contrast among recognition methods is given, as shown in Table 8. According to the features of different networks, we should attempt to construct a new deep learning model by combining multiple networks. It will be a research trend in the field of text-based CAPTCHA recognition.

8. Postprocessing Methods of Breaking Text-Based CAPTCHA

In previous steps, some of character recognition results may be taken as final results directly, while others need to be further postprocessed. In postprocessing stage, the final result’s reliability is ensured by simplification, selection, and optimization. According to different objects and methods, there are the postprocessing methods based on selection and the postprocessing methods based on rejection.

8.1. Postprocessing Methods Based on Selection

Usually, there are many redundant individual characters generated in previous steps. This requires selecting the most likely combined string as the final recognition result of CAPTCHA image. The selection strategies include the local optimization and the global optimization.

The local optimization selection only takes into account the recognition confidence optimality of an individual character. In [60], each character corresponds to several candidate characters with different widths. Therefore, the candidate character with the highest recognition confidence is selected as the final character.

The global optimization selection strives for the best results for all characters in an image. In [41], all candidate characters are found by the graph traversal, and then the string with the highest sum of characters recognition confidence values is taken as the final result, while in [5, 42], to avoid enumerating all candidate characters, a dynamic programming is used to determine the final result with the highest sum of characters’ recognition confidence values directly. Compared with graph traversal, the dynamic programming is more effective and accurate.
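The dynamic programming selection can be sketched as a best-path search over candidate segments. The data layout (tuples of start column, end column, recognized character, and confidence) and the requirement that segments tile the image exactly are assumptions for this example and are not taken from [5, 42].

```python
def best_string(candidates, width):
    """Choose a chain of segments covering columns 0..width with maximal total confidence.

    candidates: list of (start, end, char, confidence) with start < end.
    Returns (total_confidence, text), or None if no chain covers the image.
    """
    best = {0: (0.0, "")}  # best[x] = best (score, text) for chains ending at column x
    # Processing by end column guarantees best[start] is final when it is read.
    for start, end, char, conf in sorted(candidates, key=lambda c: c[1]):
        if start not in best:
            continue
        score = best[start][0] + conf
        if end not in best or score > best[end][0]:
            best[end] = (score, best[start][1] + char)
    return best.get(width)


candidates = [(0, 10, "m", 0.6), (0, 5, "r", 0.9), (5, 10, "n", 0.8), (10, 15, "a", 0.7)]
result = best_string(candidates, 15)
```

Here the chain "r" + "n" outscores the single wide segment "m", so the selected string is "rna". Each candidate is examined once after sorting, whereas enumerating every chain by graph traversal can grow exponentially with the number of overlapping candidates.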

8.2. Postprocessing Methods Based on Rejection

The purpose of postprocessing methods based on rejection is to determine whether the tested sample belongs to the types of training set by analyzing character recognition results. Therefore, the postprocessing methods based on rejection are a key to ensure high reliability of CAPTCHA recognition.

At present, researchers have not paid enough attention to the postprocessing methods based on rejection. To the best of our knowledge, there is only one such paper [62] in the CAPTCHA field. It considers multiple features, such as confidence, string length, character spaces, and the first and last characters of a string, to determine whether a candidate character should be rejected.
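A rejection rule of this kind can be illustrated with two of the listed features, confidence and string length. The thresholds and the function name below are hypothetical, and the sketch rejects whole strings rather than reproducing the per-character decision procedure of [62].

```python
def should_reject(text, confidences, min_conf=0.5, allowed_lengths=range(4, 9)):
    """Reject a result whose length is implausible or whose weakest character is unsure."""
    if len(text) not in allowed_lengths:
        return True
    return min(confidences) < min_conf


accepted = should_reject("abcd", [0.9, 0.8, 0.95, 0.7])
weak_char = should_reject("abcd", [0.9, 0.2, 0.95, 0.7])
too_short = should_reject("ab", [0.9, 0.9])
```

Rejecting doubtful answers lowers the number of attempts that are submitted but raises the reliability of those that are, which is the trade-off that the rejection stage controls.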

9. Some Problems Worth Further Research

As stated above, many achievements have been acquired. However, in view of the complexity of text-based CAPTCHA, there are still some issues worth exploring in depth in this field.

(1) Construction of Standard Test Database for Text-Based CAPTCHA. A rich and high quality text-based CAPTCHA image database is the necessary foundation for the research of text-based CAPTCHA breaking. At present, the researchers get CAPTCHA images mainly by web access and software generation. However, due to the diversity and timeliness of text-based CAPTCHA, it has not been possible to construct a common image database in the field of text-based CAPTCHA recognition. It is necessary to collect, classify, organize, and establish the text-based CAPTCHA images database. The database can provide the reliable training and testing data for research work and also provide the premise and basis of unified evaluation for various methods in this field.

(2) Multitype CAPTCHA Recognition. At present, a classifier can effectively recognize CAPTCHAs only when the training set and the test set belong to the same type. In fact, a CAPTCHA exhibits a variety of character changes. Therefore, designing a classifier that can recognize various types of CAPTCHAs is an arduous and important task.

(3) Segmentation-Free CAPTCHA Recognition. After more than ten years of development, text-based CAPTCHA breaking has achieved a high success rate on individual characters. However, the breaking success rate on whole CAPTCHA strings is generally low, and few results have been reported. With the wide application of CCT strings in text-based CAPTCHAs, the problem of segmentation-free CAPTCHA recognition needs to be solved urgently. Deep learning may now provide new ideas and technical means to solve this problem.

(4) Application of Deep Learning Model. At present, in the field of CAPTCHA recognition, deep learning model can achieve better results than traditional methods. The most frequently used methods are based on CNN and its improved methods, while other deep learning models such as DBN (Deep Belief Networks), RNN, LSTM/BLSTM/MDLSTM, and DRL (Deep Reinforcement Learning) were not well used in text-based CAPTCHA recognition. Furthermore, the study of the interrelationships and fusion applications between the various deep learning models is not thorough. We hope that newer and better deep learning models are proposed to make a breakthrough in CAPTCHA recognition, which will certainly promote the development in this field.

(5) Rejection of Text-Based CAPTCHA. With the development of CAPTCHA breaking techniques, the reliability required of recognition results is also increasing. In this regard, on one hand, we should improve the recognition accuracy; on the other hand, we should guarantee correct rejection. In the field of CAPTCHA recognition, the concept of rejection is not yet well known to researchers. Therefore, this topic has room for further development.

(6) Misrecognition of Confusable Characters. When using the deep learning network to extract character features automatically, the characters with similar features are easily confused. It has practical significance to improve the precision of feature extraction and the training methods in the deep learning network.

10. Conclusions

Based on detailed investigation and in-depth analysis, this paper reviews the progress of text-based CAPTCHA breaking technique. First of all, this paper introduces various text-based CAPTCHAs and focuses on their features. Second, according to whether there is segmentation or not, we classify the existing breaking methods of text-based CAPTCHA and summarize their features. Meanwhile, we propose a framework of text-based CAPTCHA breaking technique and introduce the modules contained in the framework one by one. Next, we compare and analyze the basic principles, advantages, and disadvantages of the existing methods from five aspects: preprocessing, segmentation, combination, recognition, and postprocessing. Finally, some problems worth further research are discussed.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (nos. 61379151, 61401512, 61572052, and U1636219), the National Key R&D Program of China (nos. 2016YFB0801303 and 2016QY01W0105), and the Key Technologies R&D Program of Henan Province (no. 162102210032).