Abstract

Text detection is increasingly in demand and poses significant challenges in trading off detection accuracy, memory consumption, and inference speed when applied to portable devices such as mobile phones. Current methods mainly focus on detection accuracy but neglect either running speed or memory consumption. To this end, an agile and efficient neural network for scene text detection that balances detection performance, running speed, and model size is hereby proposed. To reduce the network parameters and speed up inference, the text detection network is first pruned; the pruned network is then trained with structured knowledge distillation to improve its detection performance. The method is evaluated on three benchmark text datasets, i.e., ICDAR2015, Total-Text, and MSRA-TD500. The experimental results demonstrate that the proposed method achieves the best comprehensive performance, with a faster running speed and much lower memory consumption, while its text detection accuracy remains comparable to that of excellent text detection methods.

1. Introduction

Recently, given the practical applications of scene text images such as machine reading, mobile translation, and license plate recognition, related studies have become a hot topic in the field of computer vision. However, the varied shapes and styles of scene text still make detection a challenging task.

As the primary premise of recognition and translation, scene text detection aims at locating text and giving its bounding box. In recent years, arbitrary-shaped text detection has attracted increasing attention. Researchers keep attempting to design complex semantic segmentation networks to handle scene texts of assorted shapes, which generally demand huge computation time and redundant model parameters. Recent research results [1, 2] show that lightweight network structures can settle scene text detection tasks, indicating that current networks are subject to sparsity and redundancy in this task.

To solve this problem, a simple and compact text detector that can trade off between speed and performance is hereby proposed based on DBNet. Inspired by [3], a pruning scheme is also proposed, which helps obtain a compact backbone network without affecting the feature fusion network. Additionally, structured distillation is adopted to improve the detection performance of the compact network. The proposed method is compared with excellent methods on MSRA-TD500, as shown in Figure 1; it achieves the fastest speed and successfully strikes a balance between speed and performance.

Contributions of the present study can be summarized in three aspects: (i) an efficient scene text detection method is proposed, which strikes a balance between speed and accuracy; (ii) a pruning scheme is proposed for the detection network, which helps obtain an agile and efficient detector; and (iii) a distillation structure is designed for the text detector, which transfers the dark knowledge of a complex text detector to the compact network and improves the latter's detection performance.

The rest of this paper is organized as follows: Section 2 briefly reviews related works on text detection; Section 3 details the proposed deep model; Section 4 gives experimental results and discusses the limitations; and finally, the conclusion is summarized in Section 5.

2. Related Work

In recent years, scene text detection based on deep learning has achieved great results. Most methods can be roughly divided into two categories, i.e., regression-based methods and segmentation-based methods. Heavy networks are generally adopted for high detection accuracy, while a few other methods design agile structures to maintain a balance between speed and accuracy.

Regression-based methods are usually inspired by generic object detectors such as Faster R-CNN [4] and SSD [5]. Based on SSD, TextBoxes [6] adjusts the anchor size to fit long and narrow text shapes; TextBoxes++ [7] and DMPNet [8] detect multidirectional texts by regressing quadrilaterals; EAST [9] regresses the text rotation angle to detect multidirectional texts; SegLink [10] regresses text segments and their links to handle long texts; RRPN [11] proposes rotated region proposals based on Faster R-CNN; and SR_DeepText [12] constructs a scale-robust network to solve the problem of text scaling. Regression-based methods can achieve remarkable results on regular-shaped texts but have a hard time handling text instances of arbitrary shapes.

Segmentation-based methods combine segmentation results with a postprocessing algorithm to obtain the text bounding box. PixelLink [13] predicts the category and relationship of pixels to get the text bounding box; Mask Text Spotter [14] segments text semantics based on regression of text instance locations to process arbitrarily shaped texts; TextSnake [15] treats arbitrary-shaped text detection as a search for text centerlines and text regions; and PSENet [16] effectively solves the text proximity problem by fusing a postprocessing algorithm that predicts text center fields of different scales. Segmentation-based approaches often require more complex postprocessing algorithms but can effectively handle text instances of arbitrary shapes.

Fast text detection methods seek a balance between speed and performance. EAST [9] uses a lightweight backbone network and simple NMS for faster speed; PANNet [2] designs a lightweight feature fusion module with high efficiency and powerful feature expression; and DBNet [1] proposes differentiable binarization for pixel prediction. Both PANNet and DBNet rely on a lightweight backbone network to run quickly and efficiently, illustrating the sparsity and redundancy of current networks. In this context, the proposed method aims at solving the network redundancy problem and striking a balance between speed and performance.

3. Methods

The pipeline of this method is shown in Figure 2. First, a streamlined PrunedNet is obtained by pruning the baseline; next, the input image is fed into both PrunedNet and TeacherNet, with a series of feature maps and prediction probability maps obtained during the process; finally, dark knowledge in TeacherNet enhances the detection performance of PrunedNet through structured distillation in training.

3.1. Baseline

To better verify the efficiency and performance of this approach, DBNet [1], an efficient text detector with few parameters that is easy to implement, is chosen as the baseline, with its structure illustrated in Figure 3. First, four feature maps of different sizes are extracted through a backbone; second, these feature maps are spliced together after bottom-up fusion; third, the network predicts the probability map and the threshold map (the edge map of the predicted text) from the fused features, and the target binary map is obtained through differentiable binarization; finally, the text bounding box is obtained by the postprocessing algorithm based on the binary map.
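For clarity, a minimal PyTorch-style sketch of this forward flow is given below; the module names (backbone, fuse, prob_head, thresh_head) and shapes are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn

class DBNetSketch(nn.Module):
    """Illustrative skeleton of the baseline's forward pass."""

    def __init__(self, backbone, fuse, prob_head, thresh_head):
        super().__init__()
        self.backbone = backbone        # e.g., ResNet18, yields 4 feature maps
        self.fuse = fuse                # bottom-up fusion + splicing
        self.prob_head = prob_head      # predicts the probability map
        self.thresh_head = thresh_head  # predicts the threshold (text-edge) map

    def forward(self, x):
        c2, c3, c4, c5 = self.backbone(x)           # four scales
        fused = self.fuse([c2, c3, c4, c5])         # spliced fused feature
        P = torch.sigmoid(self.prob_head(fused))    # probability map
        T = torch.sigmoid(self.thresh_head(fused))  # threshold map
        # the binary map is then derived by differentiable binarization
        # (Section 3.1.2) and postprocessed into text boxes
        return P, T
```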

3.1.1. Deformable Convolution

Following [17], DBNet replaces the second convolutional layer of each residual block in stages conv3, conv4, and conv5 with deformable convolution. Deformable convolution can change the receptive field of the convolution kernel and adapt it to changeable text instances [18].

3.1.2. Differentiable Binarization

Differentiable binarization is proposed to solve the problem that standard binarization cannot be optimized during training. It is formulated as

\[ \hat{B}_{i,j} = \frac{1}{1 + e^{-k\left(P_{i,j} - T_{i,j}\right)}}, \]

where \(\hat{B}\) is the target binary map; \(P\) and \(T\) are the probability map and the threshold map, respectively; and \(k\) is the amplifying factor, set to 50 during training. Differentiable binarization is similar to the sigmoid function and makes the target binary map differentiable, which helps better distinguish text boundaries.
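A minimal sketch of this function in PyTorch, together with a quick numerical check of how the large amplifying factor saturates the output on either side of the threshold (the values here are illustrative):

```python
import torch

def differentiable_binarization(P, T, k=50.0):
    """B-hat = 1 / (1 + exp(-k * (P - T))), i.e., sigmoid(k * (P - T))."""
    return torch.sigmoid(k * (P - T))

# With k = 50, pixels clearly above/below the threshold saturate to 1/0,
# while pixels near the threshold stay soft and remain differentiable:
P = torch.tensor([0.30, 0.49, 0.51, 0.90])
T = torch.full_like(P, 0.50)
print(differentiable_binarization(P, T))  # ~[0.00, 0.38, 0.62, 1.00]
```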

3.2. Prune

The BN (Batch Normalization) layer, a data normalization operation in the network, pulls the input data back to a standard normal distribution so that the activated input values fall in the region where the nonlinear function is sensitive to its input. The learnable parameters (\(\gamma\), \(\beta\)) then perform a linear transformation on the normalized data, i.e., \(y = \gamma \hat{x} + \beta\). [3] points out that a channel feature plays a more limited role in subsequent operations such as convolution when its scaling factor \(\gamma\) is small. To this end, the convolution kernels of the network should be pruned according to the size of \(\gamma\) to obtain a compact network.

Inspired by [3], this paper proposes a pruning method for the detector. Simplifying the ResNet18 network structure with this pruning method improves the detection speed while keeping the performance comparable to that of ResNet50. The algorithm flow is as follows (a minimal sketch is given after the list):
(1) Select DBNet's backbone (ResNet18) and take the first BN layer of each residual block (yellow module in Figure 4).
(2) Collect the scaling factor \(\gamma\) of each such BN layer into the set \(\Gamma\).
(3) Sort the set \(\Gamma\) and, according to the pruning ratio \(p\), take the \(k\)-th largest scaling factor \(\gamma_k\) as the threshold, where \(k = \lfloor (1-p)\,|\Gamma| \rfloor\).
(4) Traverse each BN layer and remove the convolution kernels of the corresponding upper and lower layers if \(\gamma < \gamma_k\).
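A minimal sketch of steps (1)-(4) in PyTorch, using torchvision's ResNet18 as a stand-in for DBNet's backbone; it only derives the threshold and per-layer keep masks, while the actual removal of convolution kernels would follow these masks:

```python
import torch
from torchvision.models import resnet18
from torchvision.models.resnet import BasicBlock

def bn_prune_masks(model, prune_ratio=0.5):
    # (1)-(2): gather the scaling factors gamma of the first BN layer
    # of every residual block into one set
    bn_layers = [m.bn1 for m in model.modules() if isinstance(m, BasicBlock)]
    gammas = torch.cat([bn.weight.detach().abs() for bn in bn_layers])

    # (3): sort and take the k-th largest gamma as the global threshold
    k = max(int((1.0 - prune_ratio) * gammas.numel()), 1)
    threshold = torch.sort(gammas, descending=True).values[k - 1]

    # (4): a kernel survives only if its gamma reaches the threshold; the
    # convolutions above and below each BN layer are trimmed accordingly
    return [bn.weight.detach().abs() >= threshold for bn in bn_layers]

masks = bn_prune_masks(resnet18(), prune_ratio=0.5)
print([int(m.sum()) for m in masks])  # kernels kept per pruned layer
```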

The pruning process is shown in Figure 4. The parameter sizes of the upper and lower convolution kernels are reduced after processing the yellow BN module. The number of kernels in the first convolutional layer is reduced from 64 to 32, which means that only 32 scaling factors in the BN layer are greater than \(\gamma_k\). The second convolutional layer only changes the depth of its convolution kernels, because the number of channels in the preceding feature map is now 32, and keeping the number of its kernels constant ensures that the output of the residual block is unaffected.

Compared with [3], the pruning method in this paper retains the output channels of the residual block, without affecting the subsequent feature fusion and knowledge distillation; it can also be extended to other areas, such as semantic segmentation and object tracking.

3.3. Knowledge Distillation

Knowledge distillation extracts the abstract knowledge learned by a complex network to improve the performance of a compact network. To drive the two networks toward the same performance, loss functions are imposed during training to guarantee the consistency of their feature expressions. Inspired by [19, 20], the distillation framework is designed as shown in Figure 2, where the teacher network is DBNet with ResNet50 as the backbone.

For the baseline, the four feature maps from the backbone network are unified in dimension through convolution and placed in a one-to-one correspondence between the student network and the teacher network. In addition, there is a corresponding relationship between the fused feature maps, so the knowledge of the teacher network is distilled into the student network by constraining these features with a loss. The feature map loss can be expressed as

\[ L_{fea} = \frac{1}{W \times H \times C} \sum_{i=1}^{W} \sum_{j=1}^{H} \sum_{c=1}^{C} \left( F^{t}_{i,j,c} - F^{s}_{i,j,c} \right)^{2}, \]

where \(W\), \(H\), and \(C\) represent the width, height, and channel number of the feature map, respectively; \(F^{t}\) denotes the feature value of the teacher network; and \(F^{s}\), that of the student network.
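A minimal sketch of this term, assuming the student feature is first projected to the teacher's channel dimension by a 1x1 convolution (the projection and the channel sizes are illustrative assumptions):

```python
import torch
import torch.nn as nn

def feature_loss(f_student, f_teacher):
    """Mean squared error between aligned feature maps; averaging over
    W x H x C matches the normalization in the formula above."""
    assert f_student.shape == f_teacher.shape
    return torch.mean((f_student - f_teacher) ** 2)

proj = nn.Conv2d(64, 256, kernel_size=1)   # unify channel dimensions
f_s = proj(torch.randn(1, 64, 160, 160))   # student backbone feature
f_t = torch.randn(1, 256, 160, 160)        # corresponding teacher feature
print(feature_loss(f_s, f_t))
```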

For the prediction map, most methods adopt KL Loss (Kullback-Leibler Divergence Loss) to make the class prediction distribution of the student network approximate that of the teacher network. For text detection, however, the classification of pixels is binary (text and background), so only a floating-point number from 0 to 1 is required to represent the probability. In this paper, Dice loss is used to constrain the predicted probability map and the binary map. The probability loss is formulated as

\[ L_{pro} = 1 - \frac{2 \sum_{i,j} S_{i,j} T_{i,j}}{\sum_{i,j} S_{i,j}^{2} + \sum_{i,j} T_{i,j}^{2}}, \]

where \(S\) denotes the prediction map of the student network and \(T\) represents the prediction map of the teacher network. During training, the feature loss enhances the feature expression ability of the network, while the probability loss enables the network to flexibly predict each pixel. Combined with pruning, the simplified network still achieves excellent performance after removing redundant parameters.
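A sketch of this Dice-based distillation term (the small epsilon for numerical stability is an added assumption):

```python
import torch

def dice_distill_loss(S, T, eps=1e-6):
    """Dice loss between student map S and teacher map T, both in [0, 1]."""
    inter = (S * T).sum()
    union = (S * S).sum() + (T * T).sum() + eps
    return 1.0 - 2.0 * inter / union

S = torch.rand(1, 1, 160, 160)  # student probability (or binary) map
T = torch.rand(1, 1, 160, 160)  # teacher probability (or binary) map
print(dice_distill_loss(S, T))
```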

3.4. Loss Functions

In this article, the proposed method inherits three losses from the baseline, i.e., the probability map loss, the binary map loss, and the threshold map loss, whose loss functions are denoted as \(L_p\), \(L_b\), and \(L_t\), where \(L_p\) and \(L_b\) adopt binary cross-entropy (BCE) loss:

\[ L_p = L_b = -\sum_{i} \left( y_i \log x_i + (1 - y_i) \log (1 - x_i) \right), \]

where \(x_i\) is the predicted value and \(y_i\) represents the true value. \(L_t\) adopts \(L_1\) loss:

\[ L_t = \sum_{i} \left| y_i^{*} - x_i^{*} \right|, \]

where \(x_i^{*}\) and \(y_i^{*}\) denote the predicted and true values of the threshold map, respectively.
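The two building blocks as a minimal PyTorch sketch (the mean reduction is an assumption):

```python
import torch
import torch.nn.functional as F

def bce_map_loss(pred, gt):
    """BCE for the probability and binary maps (pred, gt in [0, 1])."""
    return F.binary_cross_entropy(pred, gt)

def l1_map_loss(pred, gt):
    """L1 loss for the threshold map."""
    return torch.abs(pred - gt).mean()
```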

Combined with the supervisory losses of knowledge distillation, the total experimental loss can be expressed as

\[ L = L_p + \alpha L_b + \beta L_t + \sum_{i} L_{fea}^{i} + L_{pro}^{P} + L_{pro}^{B}, \]

where \(\alpha\) and \(\beta\) are set as 1 and 10, respectively, in the experiment; \(L_{fea}^{i}\) represents the feature loss of the \(i\)th feature map; \(L_{pro}^{P}\), the probability loss of the probability map; and \(L_{pro}^{B}\), the probability loss of the binary map.
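How the pieces combine in a training step, as a hedged sketch (unit weights on the distillation terms are an assumption consistent with the formula above):

```python
def total_loss(L_p, L_b, L_t, L_fea_list, L_pro_P, L_pro_B,
               alpha=1.0, beta=10.0):
    """Total objective: baseline DBNet losses plus distillation terms."""
    return (L_p + alpha * L_b + beta * L_t
            + sum(L_fea_list) + L_pro_P + L_pro_B)
```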

4. Experiment

In order to evaluate the performance of this method, the proposed method is implemented on three public benchmark datasets, i.e., Total-Text [21], ICDAR2015 [22], and MSRA-TD500 [23], and is compared with excellent text detection methods. Then, an ablation study is conducted to investigate the effects of the pruning ratio and knowledge distillation.

4.1. Datasets

Total-Text is a word-level English curved text dataset, known as the first relatively large scene text dataset with three different text orientations, i.e., horizontal, multidirectional, and curvilinear, and contains 1,255 training images and 300 test images.

ICDAR2015 is a multidirectional text dataset from Challenge 4 of the ICDAR2015 Robust Reading Competition and contains 1,500 training images and 500 test images collected from natural scenes. All text instances are labeled as multidirectional quadrilaterals.

MSRA-TD500 is a multilingual dataset containing both Chinese and English texts, involving 300 training images and 200 test images. Text instances are labeled as rotated rectangles. In the experiment, 400 images from HUST-TR400 [24] are added to expand the training samples.

4.2. Experimental Details

For all models, the two DBNet models (with ResNet18 and ResNet50 backbones) are first trained according to [1]; then, the ResNet18 model is pruned and fine-tuned; finally, distillation training is carried out with the reduced model. During training, the batch size is set to 16 when a model is trained alone and 8 for distillation training due to memory limitations. In all trainings, the number of epochs is 1,200; the initial learning rate is 0.007; and the learning rate is decreased during training with a decay rate of 0.9. The input size is \(640 \times 640\), obtained by random flipping, rotating, and cropping.
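The decay schedule as a sketch; reading the stated rate of 0.9 as the power of the poly schedule used by the DBNet baseline [1] is an assumption:

```python
def poly_lr(iteration, max_iters, base_lr=0.007, power=0.9):
    """Poly decay: base_lr * (1 - iteration / max_iters) ** power."""
    return base_lr * (1.0 - iteration / max_iters) ** power

# e.g., halfway through training the learning rate has decayed to ~0.0037
print(poly_lr(600, 1200))
```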

During testing, the same test image size as the baseline is kept. Although GPUs differ slightly across implementations, the speed of this implementation is tested without any acceleration techniques, using a single 2080 Ti GPU in a single thread.

4.3. Ablation Study

An ablation study is conducted on Total-Text to investigate the effectiveness of pruning and knowledge distillation.

Table 1 shows the experimental results of the proposed method under different pruning ratios compared with the baseline. The accuracy begins to decrease as the pruning ratio increases, but the speed rises. When the pruning ratio is 0.1 or 0.2, this method improves both the performance and the speed over the baseline; when the pruning ratio reaches 0.5, this method performs as well as the benchmark method in terms of F-score and gains 7 FPS.

As shown in Table 2, the F-score is only 81.1% right after pruning the baseline, a decrease from the baseline. After fine-tuning, the F-score reaches 82.7%, close to the baseline; with the feature loss added, the F-score reaches 83.0%, exceeding the baseline. Finally, under the supervision of the probability loss, the performance improves to 83.3%, a 0.5% gain in F-score over the baseline.

4.4. Comparisons with the Excellent Methods

The hereby proposed method is compared with previous methods on three standard benchmarks, including a benchmark for curved text and two benchmarks for multioriented text. Some qualitative results are visualized in Figure 5.

The proposed method is compared with 10 text detection methods on Total-Text, and the results are shown in Table 3. This method reaches 83.3% in F-score, ranking third. In terms of speed, compared with the baseline (DB-ResNet-18), this method runs faster without decreasing the F-score and achieves 57 FPS, the fastest among all compared methods.

Table 4 shows the experimental results of the proposed method and previous methods on MSRA-TD500. DB-ResNet-18 (512) denotes an input image with a short edge of 512, while DB-ResNet-18 (736) denotes a short edge of 736. The proposed method comes in second with an F-score of 83.4%, just 1.5% behind the teacher network. In addition, its speed reaches 95 FPS, the fastest among all the methods and 13 FPS higher than the baseline.

Figure 1 visualizes the combined F-score and FPS on MSRA-TD500. The proposed methods are located on the outermost side of the figure, away from their baseline, which means that they hold a strong competitive advantage over other methods in both speed and performance.

The proposed method is compared with 12 methods on the ICDAR2015 dataset. As shown in Table 5, the F-score of this method is 82.8%, ranking 8th, but it is still competitive. This method improves the F-score of the benchmark by 0.5% and the FPS by 3, and remains the fastest detector among the compared methods.

4.5. Parameter Size

The proposed method is compared with segmentation-based methods [35], including DBNet [1], PSENet [16], TextSnake [15], and CRAFT [28], in terms of parameter size [36]. All are publicly available source codes, some of which are reimplementations. The comparison results are shown in Table 6: the proposed model, at 6.8 M, is only half the size of DBNet and the smallest among the five methods.

5. Conclusions

An efficient scene text detection method that balances speed and performance is hereby proposed. A pruning algorithm for the text detection network is first introduced, which simplifies the network parameters and speeds up computation. Furthermore, a structured knowledge distillation method is proposed to improve the detection performance of the compact network. These two features make the proposed method an agile and efficient arbitrary-shape text detector. Moreover, this method presents competitive performance and the fastest detection speed on three datasets.

Data Availability

The English curve text dataset used to support the findings of this manuscript is a public benchmark dataset named Total-Text. Copies of these data can be obtained free of charge from https://github.com/cs-chan/Total-Text-Dataset. The multidirectional text dataset used to support the findings of this manuscript is a public benchmark dataset named ICDAR2015. Copies of these data can be obtained free of charge from https://pan.baidu.com/s/1pN-St9-aTpmxcHsgFXl8jA?pwd=i11a. The multilingual dataset containing both Chinese and English used to support the findings of this manuscript is a public benchmark dataset named MSRA-TD500. Copies of these data can be obtained free of charge from http://www.iapr-tc11.org/dataset/MSRA-TD500/MSRA-TD500.zip.

Conflicts of Interest

The authors declare that they have no conflicts of interest in the work.

Acknowledgments

This work is supported by the Natural Science Foundation of Fujian Province, China (Nos. 2019J01889 and 2020J018751); the “Tiancheng Huizhi” Innovation and Education Promotion Fund, China (No. 2018A02005); and the National Natural Science Foundation of China (No. 62172095).