Abstract

To distinguish between computers and humans, CAPTCHA is widely used in scenarios such as website login and registration. Traditional CAPTCHA recognition methods have poor recognition ability and robustness across different types of verification codes. For this reason, this paper proposes a CAPTCHA recognition method based on a convolutional neural network with the focal loss function. The method improves the traditional VGG network structure and introduces the focal loss function to produce a new CAPTCHA recognition model. First, we perform preprocessing such as grayscale conversion, binarization, denoising, segmentation, and annotation and then use the Keras library to build a simple neural network model. In addition, we build an end-to-end neural network model for recognizing complex CAPTCHA with high adhesion and more interference pixels. Tests on the CNKI CAPTCHA, Zhengfang CAPTCHA, and randomly generated CAPTCHA show that the proposed method achieves better recognition and robustness on the three datasets and has certain advantages over traditional deep learning methods. The recognition rates are 99%, 98.5%, and 97.84%, respectively.

1. Introduction

CAPTCHA is an algorithm for distinguishing human behavior from machine behavior [1]. With the rapid development of Internet technology, network security issues continue to grow. CAPTCHA is an effective way to maintain network security and prevent malicious attacks from computer programs, and it has been widely used on major mainstream websites [2]. CAPTCHA is generally considered to be a reverse Turing test that classifies humans and computers [3].

The mainstream CAPTCHA is based on visual representation, using images of letters and other text. Traditional CAPTCHA recognition [4–6] includes three steps: image preprocessing, character segmentation, and character recognition. Traditional methods have poor generalization capability and robustness for different types of CAPTCHA, and they cope poorly with character adhesion. As a kind of deep neural network, the convolutional neural network (CNN) has shown excellent performance in the field of image recognition, and it is much better than traditional machine learning methods. Compared with traditional methods, the main advantage of CNN lies in the convolutional layer, whose extracted image features have strong expressive ability, avoiding the data preprocessing and hand-designed features of traditional recognition technology. Although CNN has achieved certain results, its recognition of complex CAPTCHA is still insufficient [7].

This paper introduces the focal loss function into the CNN model to solve the problem of complex CAPTCHA recognition and alleviates problems of traditional convolutional neural network training such as model complexity and redundancy of the output layer parameters. The test results on three different datasets show the effectiveness of the proposed method. The rest of the paper is arranged as follows: Section 2 introduces the related work, and Section 3 focuses on the proposed method based on the convolutional neural network. In Section 4, the performance of the proposed method is verified by experiments. Finally, the summary and prospect are given.

2. Related Work

CAPTCHA mainly includes text CAPTCHA [8], image CAPTCHA [9], and sound CAPTCHA [10], among which text CAPTCHA is the most widely used. Text CAPTCHA is mainly composed of numbers and English letters, and its security is mainly guaranteed by two factors: background interference information and character adhesion. Both of these security features increase the difficulty of recognition and segmentation to varying degrees. According to whether characters need to be segmented in the recognition process, text CAPTCHA recognition methods can be divided into segmentation recognition and overall recognition. Segmentation recognition is a common method for CAPTCHA cracking. Chellapilla and Simard [11] proved that effective segmentation of the characters in a CAPTCHA can greatly improve the recognition accuracy. In the early stage, the CAPTCHA service website was representative of CAPTCHA design: it had little or no background interference information, its characters lacked complex transformations such as distortion, rotation, and adhesion, and its defense effect was limited. Yan and El Ahmad [12] completely cracked this kind of CAPTCHA by counting pixels. Since then, CAPTCHA designers have improved the generation algorithm and added background interference information, but Yan and El Ahmad [13] used the projection algorithm to segment it effectively, with a segmentation accuracy of up to 90% and a cracking success rate of up to 60%. After two consecutive rounds of attack and defense, in order to better resist segmentation algorithms, designers further improved the CAPTCHA, adding multiple complex transformations such as character twist, rotation, and adhesion, as well as more complex background interference information [14, 15]. For this kind of CAPTCHA, Gao et al. [16] used Gabor filtering to extract character strokes and used graph search to find the optimal combination for character segmentation, and the accuracy of reCAPTCHA cracking reached 77%.

With the development of deep learning technology, CAPTCHA recognition based on deep learning has become widely used. Qing and Zhang [17] proposed a multilabel convolutional neural network for text CAPTCHA recognition without segmentation and achieved better results for distorted and complex CAPTCHA. Shi et al. [18] combined CNN with a recurrent neural network and proposed a convolutional recurrent neural network to realize the overall recognition of CAPTCHA. Du et al. [19] used fast RCNN for overall recognition, which has a better recognition effect for CAPTCHA of variable-length sequences. Lin et al. [20] used a convolutional neural network to learn stroke and character features of CAPTCHA, greatly improving the recognition accuracy of CAPTCHA with distortion, rotation, and background noise. Compared with traditional methods, deep neural networks have better learning ability and can effectively improve the efficiency of classification and recognition [21–23]. For example, AlexNet [24] further improves the CNN architecture and significantly improves the classification effect, and it has been widely used to train CNNs on GPUs. However, deep learning technology is currently limited in the face of severe AI image processing problems (such as symmetry [25] and adversarial examples [26]). Most end-to-end recognition algorithms directly use an existing convolutional neural network structure, which has deep network layers and a large number of training parameters. When the number of effective samples is limited, such a network easily overfits and lacks generalization ability [27]. Therefore, how to design CAPTCHA that exploits the defects of deep learning is the key problem to be solved.

3. The Proposed Method

3.1. Preprocessing

Traditional processing methods are used to preprocess the CAPTCHA image, including grayscale conversion, binarization, image denoising, image segmentation, and image annotation. Firstly, the weighted average method is used to convert the image to gray level, with the formula Y = 0.30R + 0.59G + 0.11B, where R, G, and B are the values of the red, green, and blue components in the color image. Then, the image is binarized. The Otsu algorithm is used to obtain the optimal threshold value of each image. The pixels higher than the threshold value are set to 255, and the pixels below the threshold value are set to 0. Next, an average filter is used to denoise the image by setting the current pixel value to the average value of its eight neighboring pixels. Finally, the image is segmented, and the specific process is shown in Figure 1.
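
The grayscale, binarization, and denoising steps described above can be sketched in plain Python as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, images are assumed to be nested lists of 8-bit values, pixels equal to the threshold are assumed to map to 0, and border pixels are assumed to be left unchanged by the filter.

```python
def to_gray(rgb):
    """Weighted-average grayscale: Y = 0.30R + 0.59G + 0.11B."""
    return [[int(round(0.30 * r + 0.59 * g + 0.11 * b)) for (r, g, b) in row]
            for row in rgb]


def otsu_threshold(gray):
    """Return the threshold that maximizes the between-class variance."""
    hist = [0] * 256
    for row in gray:
        for v in row:
            hist[v] += 1
    total = sum(hist)
    sum_all = sum(i * h for i, h in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, variance undefined
        mu0 = sum0 / w0
        mu1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mu0 - mu1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(gray, t):
    """Pixels above the threshold become 255, all others 0 (assumption for ties)."""
    return [[255 if v > t else 0 for v in row] for row in gray]


def mean_filter(img):
    """Replace each interior pixel by the mean of its eight neighbours."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]  # borders are copied unchanged (assumption)
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            s = sum(img[y + dy][x + dx]
                    for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                    if (dy, dx) != (0, 0))
            out[y][x] = s // 8
    return out
```

With this filter, an isolated white pixel surrounded by black neighbours is set to 0, which is the noise-removal behaviour the paragraph describes.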

3.2. Focal Loss

Focal loss [28] was proposed to solve the problem of low accuracy in one-stage target detection. This loss function reduces the weight of the large number of simple negative samples in training, so it can also be regarded as a form of difficult sample mining. Focal loss is a modification of the cross-entropy loss function that reduces the weight of easy-to-classify samples so that the model focuses more on difficult-to-classify samples during training.

For binary classification, the cross-entropy function is as follows:

$$\mathrm{CE}(p, y) = \begin{cases} -\log(p), & y = 1, \\ -\log(1 - p), & \text{otherwise}, \end{cases}$$

where $p$ is the estimated probability that the sample belongs to class 1 (the range is 0–1) and $y$ is the label, whose value is in $\{+1, -1\}$. For the convenience of representation, the variable $p_t$ is introduced. The formula is as follows:

$$p_t = \begin{cases} p, & y = 1, \\ 1 - p, & \text{otherwise}. \end{cases}$$

The binary cross entropy can then be expressed as $\mathrm{CE}(p, y) = \mathrm{CE}(p_t) = -\log(p_t)$. The common method to address class imbalance is to introduce a weight factor, which is $\alpha \in [0, 1]$ for category 1 and $1 - \alpha$ for category −1; the weight $\alpha_t$ is defined from $\alpha$ in the same way as $p_t$ is defined from $p$.

Then, the balanced cross entropy is $\mathrm{CE}(p_t) = -\alpha_t \log(p_t)$. Although $\alpha$ can balance the importance of positive and negative samples, it cannot distinguish difficult samples from easy ones. Therefore, focal loss reduces the weight of easy samples and focuses training on difficult negative samples. A parameter $\gamma$ is introduced to control the weight difference between difficult and easy samples; the greater the $\gamma$, the greater the difference. The focal loss is defined as follows:

$$\mathrm{FL}(p_t) = -\alpha_t (1 - p_t)^{\gamma} \log(p_t).$$

Therefore, focal loss is a cross-entropy loss function with a dynamically adjustable scale, which has two parameters $\alpha_t$ and $\gamma$, where $\alpha_t$ addresses the imbalance between positive and negative samples and $\gamma$ addresses the imbalance between difficult and easy samples.
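
To make the effect of the two parameters concrete, the following sketch evaluates the binary focal loss for a single sample and compares it with plain cross entropy. The function names are illustrative, and the default values alpha = 0.25 and gamma = 2.0 are assumptions taken from common practice, not values reported in this paper.

```python
import math


def cross_entropy(p, y):
    """Binary cross entropy for one sample; p in (0, 1), y in {+1, -1}."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)


def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Binary focal loss: -alpha_t * (1 - p_t)**gamma * log(p_t).

    alpha and gamma defaults are assumed values, not taken from the paper.
    """
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


# Ratio of focal loss to cross entropy for an easy and a hard positive sample.
easy_ratio = focal_loss(0.95, 1) / cross_entropy(0.95, 1)
hard_ratio = focal_loss(0.30, 1) / cross_entropy(0.30, 1)
print(easy_ratio, hard_ratio)
```

Under these assumed settings the easy sample (p = 0.95) keeps roughly 0.06% of its cross-entropy loss, while the hard sample (p = 0.30) keeps about 12%, so the hard sample dominates the gradient. Setting gamma = 0 recovers the balanced cross entropy.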

3.3. Simple CAPTCHA

For simple CAPTCHA, the image is small and carries little information after preprocessing, so the model is relatively simple. The network structure (shown in Table 1 and Figure 2) repeats a block of two convolution layers followed by one pooling layer and then adds a flatten layer and a dense layer. The sigmoid function is the activation function of the fully connected layer, and the output with the maximum probability is transformed into the one-hot coding matrix of the label. It is worth noting that each convolution layer uses the ReLU activation function, followed by a batch normalization layer.

(1) Input layer: the input data is single-channel image data after binarization with a size of 25 × 12.
(2) Convolutional layer C1: 8 convolution kernels of size 3 × 3 are used with "same" padding, which fills the edge of the input image data matrix with a ring of 0 values, and the convolution stride is 1. Each convolution kernel contains 9 parameters plus a bias parameter, so the required parameters for this layer are (3 × 3 + 1) × 8 = 80. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 25 × 12 × 8 feature maps.
(3) Convolutional layer C2: 8 convolution kernels are used, but the kernel size becomes 3 × 3 × 8. Padding is still "same", the convolution stride is 1, and the total parameters are (3 × 3 × 8 + 1) × 8 = 584. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 25 × 12 × 8 feature maps.
(4) Pooling layer P3: the maximum pooling algorithm is applied to the output of the C2 layer. This layer is also called the downsampling layer; the downsampling window size is 2 × 2, and the output is 12 × 6 × 8 feature maps.
(5) Convolutional layer C4: 16 convolution kernels of size 3 × 3 × 8 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 8 + 1) × 16 = 1168. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 12 × 6 × 16 feature maps.
(6) Convolutional layer C5: 16 convolution kernels of size 3 × 3 × 16 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 16 + 1) × 16 = 2320. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 12 × 6 × 16 feature maps.
(7) Pooling layer P6: the maximum pooling algorithm is applied to the output of the C5 layer; the downsampling window size is 2 × 2, and the output is 6 × 3 × 16 feature maps.
(8) Flattening layer F7: the feature maps output by the P6 layer are flattened into a total of 6 × 3 × 16 = 288 nodes.
(9) Dense layer: this layer is fully connected with the F7 layer. It uses the sigmoid activation function and is trained with the focal loss function. All features are classified, and the classification result corresponds to the character category of the CAPTCHA, including 10 numbers and 26 English uppercase characters, which means 36 possible results. A total of 36 × (6 × 3 × 16 + 1) = 10404 parameters are required.

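
The parameter counts quoted in the list above follow from two simple formulas, which the sketch below evaluates layer by layer. The helper names are illustrative; the counts cover only convolution and dense weights and biases, since the parameters of the batch normalization layers are not included in the figures given in the text.

```python
def conv_params(kernel_h, kernel_w, in_channels, filters):
    """Weights plus one bias per filter: (kh * kw * c_in + 1) * filters."""
    return (kernel_h * kernel_w * in_channels + 1) * filters


def dense_params(inputs, units):
    """One weight per input plus one bias, for every unit."""
    return units * (inputs + 1)


simple_net = {
    "C1": conv_params(3, 3, 1, 8),            # single-channel 25 x 12 input
    "C2": conv_params(3, 3, 8, 8),
    "C4": conv_params(3, 3, 8, 16),
    "C5": conv_params(3, 3, 16, 16),
    "Dense": dense_params(6 * 3 * 16, 36),    # 288 flattened nodes, 36 classes
}

for name, count in simple_net.items():
    print(name, count)
# C1 80
# C2 584
# C4 1168
# C5 2320
# Dense 10404
```
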
3.4. Complex CAPTCHA

Complex CAPTCHA refers mainly to image CAPTCHA that is difficult to segment because of heavy adhesion, slanted fonts, complex colors, and many disturbing pixels. It is widely used on major Internet websites. For complex CAPTCHA, an end-to-end neural network is used for recognition. The model structure is shown in Figure 3. This kind of problem is multilabel classification. A block of two convolution layers and one pooling layer is repeated five times, followed by one flatten layer, and finally four classifiers are connected. Each classifier is fully connected, including 62 neural nodes. The sigmoid function is the activation function of the fully connected layer, so each classifier outputs the probability of each character, and the final output is the complete one-hot encoding of the 4 characters of the image.
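
The multilabel output described above can be decoded by taking the most probable class from each of the four classifiers. The sketch below is illustrative only: the alphabet (digits followed by uppercase letters, 36 classes) and its ordering are assumptions, and a different `alphabet` can be passed for a larger character set.

```python
import string

# Assumed class order: '0'-'9' followed by 'A'-'Z'.
ALPHABET = string.digits + string.ascii_uppercase


def one_hot(text, alphabet=ALPHABET):
    """Encode each character of the CAPTCHA as a one-hot vector."""
    return [[1 if ch == a else 0 for a in alphabet] for ch in text]


def decode(head_outputs, alphabet=ALPHABET):
    """Pick the most probable class from each classifier head."""
    chars = []
    for probs in head_outputs:
        best = max(range(len(probs)), key=probs.__getitem__)
        chars.append(alphabet[best])
    return "".join(chars)


# Round trip: four one-hot vectors decode back to the original string.
print(decode(one_hot("7K2Q")))  # 7K2Q
```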

(1) Input layer: the input data is RGB image data with a size of 27 × 72 and 3 channels.
(2) Convolutional layer C1: 32 convolution kernels of size 3 × 3 × 3 are used with "same" padding; that is, a ring of 0 values is filled into the edge of the input image data matrix, and the convolution stride is 1. Each convolution kernel contains 27 parameters plus a bias parameter. Therefore, the required parameters of this layer are (3 × 3 × 3 + 1) × 32 = 896. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 27 × 72 × 32 feature maps.
(3) Convolutional layer C2: 32 convolution kernels are used, but the kernel size becomes 3 × 3 × 32. Padding is still "same", the convolution stride is 1, and the total parameters are (3 × 3 × 32 + 1) × 32 = 9248. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 27 × 72 × 32 feature maps.
(4) Pooling layer P3: the maximum pooling algorithm is applied to the output of the C2 layer; the downsampling window size is 2 × 2, and the output is 13 × 36 × 32 feature maps.
(5) Convolutional layer C4: 64 convolution kernels of size 3 × 3 × 32 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 32 + 1) × 64 = 18496. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 13 × 36 × 64 feature maps.
(6) Convolutional layer C5: 64 convolution kernels of size 3 × 3 × 64 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 64 + 1) × 64 = 36928. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 13 × 36 × 64 feature maps.
(7) Pooling layer P6: the maximum pooling algorithm is applied to the output of the C5 layer; the downsampling window size is 2 × 2, and the output is 6 × 18 × 64 feature maps.
(8) Convolutional layer C7: 128 convolution kernels of size 3 × 3 × 64 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 64 + 1) × 128 = 73856. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 6 × 18 × 128 feature maps.
(9) Convolutional layer C8: 128 convolution kernels of size 3 × 3 × 128 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 128 + 1) × 128 = 147584. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 6 × 18 × 128 feature maps.
(10) Pooling layer P9: the maximum pooling algorithm is applied to the output of the C8 layer; the downsampling window size is 2 × 2, and the output is 3 × 9 × 128 feature maps.
(11) Convolutional layer C10: 128 convolution kernels of size 3 × 3 × 128 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 128 + 1) × 128 = 147584. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 3 × 9 × 128 feature maps.
(12) Convolutional layer C11: 128 convolution kernels of size 3 × 3 × 128 are used with "same" padding and a convolution stride of 1. The total required parameters are (3 × 3 × 128 + 1) × 128 = 147584. The activation function is the ReLU function, followed by a batch normalization layer, and the output is 3 × 9 × 128 feature maps.
(13) Pooling layer P12: the output of the C11 layer is pooled by the max pooling algorithm; the downsampling window size is 2 × 2, and the output is 1 × 4 × 128 feature maps.
(14) Flattening layer F13: the feature maps output by the P12 layer are flattened into a total of 1 × 4 × 128 = 512 nodes.
(15) Dense layer: this layer is fully connected with the F13 layer and connects 4 classifiers. Each classifier contains 36 neural nodes, the activation function is the sigmoid function, and the maximum-probability one-hot encoding of one CAPTCHA character is output. Each classifier requires a total of 36 × (1 × 4 × 128 + 1) = 18468 parameters.
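
The feature-map sizes in the list above can be checked by propagating the input shape through the network. The sketch below is a hypothetical helper, not the authors' code; it assumes 3 × 3 "same" convolutions with stride 1 and 2 × 2 max pooling that rounds odd sizes down, and it follows the four conv-conv-pool blocks enumerated in the list. Batch normalization parameters are not counted.

```python
def trace(height, width, channels, block_filters, classes, heads):
    """Return final feature-map shape, flattened size, per-head and total parameters."""
    total = 0
    for filters in block_filters:
        for _ in range(2):  # two 3x3 'same' convolutions keep height and width
            total += (3 * 3 * channels + 1) * filters
            channels = filters
        # 2x2 max pooling halves each dimension, rounding down
        height, width = height // 2, width // 2
    flat = height * width * channels
    per_head = classes * (flat + 1)
    total += heads * per_head
    return (height, width, channels), flat, per_head, total


shape, flat, per_head, total = trace(27, 72, 3, [32, 64, 128, 128], classes=36, heads=4)
print(shape, flat, per_head, total)
# (1, 4, 128) 512 18468 656048
```

The same helper reproduces the simple network of Section 3.3 when called as `trace(25, 12, 1, [8, 16], classes=36, heads=1)`.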

4. Experiments

4.1. Dataset

The CAPTCHA dataset used in this article includes CNKI CAPTCHA, Zhengfang CAPTCHA, and randomly generated CAPTCHA. All CAPTCHA datasets are composed of uppercase and lowercase letters and numbers, including 33 categories.

CNKI CAPTCHA contains common CAPTCHA interference methods, such as character scale change, linear noise, and character adhesion, which makes it well suited to testing the applicability of a CAPTCHA recognition method. The image dataset includes 4000 images in the training set and 600 images in the test set. The sample image is shown in Figure 4.

The Zhengfang educational administration system CAPTCHA has the characteristics of point noise and partial distortion and adhesion, so it can be used to evaluate the performance of a recognition method on verification codes with adhesive characters. We use 2000 such CAPTCHA images as the training set and 200 as the test set, and we manually label some of the renamed pictures. The sample image is shown in Figure 5.

The randomly generated CAPTCHA has the characteristics of moderate distortion and adhesion and cannot be recognized by traditional CAPTCHA recognition methods. We generate 10,000 CAPTCHA images as the training set and 2000 as the test set. Each file is named with the characters represented by the image plus a sequential serial number, which prevents collisions between images that show the same characters. The sample image is shown in Figure 6.
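
With such a naming scheme, the label of each training image can be recovered from its file name. The sketch below assumes a hypothetical layout `<characters>_<serial>.png`; the separator and extension are not specified in the text and are illustrative only.

```python
from pathlib import Path


def label_from_filename(path, sep="_"):
    """Return the CAPTCHA text stored before the separator in the file stem.

    The '<characters><sep><serial>' layout is an assumed convention.
    """
    stem = Path(path).stem
    return stem.split(sep, 1)[0]


# Two images showing the same characters keep distinct names but share a label.
print(label_from_filename("train/A7KQ_000123.png"))  # A7KQ
print(label_from_filename("train/A7KQ_004567.png"))  # A7KQ
```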

4.2. Performance

Firstly, the CAPTCHA is preprocessed, then the dataset is input into the network model for training and parameter adjustment, and then the test samples are predicted. We count the number of true positives (TP) and the number of true negatives (TN) and finally calculate the accuracy according to the statistical results, $\text{Accuracy} = (TP + TN)/N$, where $N$ is the total number of test samples. The accuracy and loss curves of the convolutional neural network on the three datasets are shown in Figures 7–9, and the test results on each dataset are given in Figures 10–12. It can be seen from the figures that the method proposed in the paper has a higher recognition rate and better robustness.
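
As an illustration of how a recognition rate can be computed from predictions, the sketch below reports both whole-CAPTCHA accuracy and per-character accuracy. The text does not state which granularity the reported rates use, so both functions and the sample strings are assumptions made for demonstration.

```python
def captcha_accuracy(predicted, expected):
    """Fraction of CAPTCHAs whose whole string is predicted correctly."""
    correct = sum(p == e for p, e in zip(predicted, expected))
    return correct / len(expected)


def char_accuracy(predicted, expected):
    """Fraction of individual characters predicted correctly."""
    pairs = [(pc, ec)
             for p, e in zip(predicted, expected)
             for pc, ec in zip(p, e)]
    return sum(pc == ec for pc, ec in pairs) / len(pairs)


# Made-up predictions: one of four CAPTCHAs has a single wrong character.
predicted = ["AB12", "CD34", "EF56", "GH78"]
expected = ["AB12", "CD34", "EF56", "GH70"]
print(captcha_accuracy(predicted, expected))  # 0.75
print(char_accuracy(predicted, expected))     # 0.9375
```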

For CAPTCHA that contains complex interference information or adhesions, traditional methods based on image segmentation struggle, because segmentation destroys character information and causes errors to accumulate. End-to-end deep learning predicts the result directly from the input end to the output end. The prediction error is propagated back through the network and used to adjust each layer until the expected result is obtained. By introducing this self-learning network architecture into CAPTCHA recognition, the character segmentation step can be removed, and the preprocessing operations can be selected according to the interference complexity of the training samples, so as to better highlight and retain the characteristic information between characters.

To further verify the performance of the method proposed in the paper, Table 2 shows the recognition rates of different deep learning methods, including AlexNet, VGG, GoogleNet, and ResNet, on the three verification codes. As can be seen from the table, the recognition rate of the proposed method for CNKI CAPTCHA is 2.05%, 2.42%, 1.66%, and 0.24% higher than that of AlexNet, VGG-16, GoogleNet, and ResNet, respectively, and the recognition rate for Zhengfang CAPTCHA is higher by 2.25%, 2.61%, 2.85%, and 2.3%, respectively. For the randomly generated CAPTCHA, it is higher by 3.46%, 1.95%, 0.98%, and 0.59%, respectively. The proposed method has a high recognition rate, robustness, and good generalization ability on the three datasets.

5. Conclusion

This paper proposes a convolutional neural network method based on focal loss for CAPTCHA recognition. The focal loss function is introduced to address the imbalance between positive and negative samples and between difficult and easy samples. Firstly, preprocessing such as grayscale conversion, binarization, denoising, segmentation, and labeling is carried out, and then a simple neural network model is constructed using the Keras library; in addition, an end-to-end neural network model is constructed for complex CAPTCHA with high adhesion and more interfering pixels. The test results on three different CAPTCHA datasets show that the proposed method has certain advantages over the traditional methods and has a higher recognition rate, robustness, and good generalization ability. In the future, we will study the recognition of more types of CAPTCHA.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the National Natural Science Foundation of China (61976198), the Natural Science Research Key Project for Colleges and University of Anhui Province (KJ2019A0726), High-Level Scientific Research Foundation for the Introduction of Talent of Hefei Normal University (2020rcjj44), and the Anhui Province Key Laboratory of Big Data Analysis and Application Open Project.