Abstract

The performance of text detection is crucial for the subsequent recognition task. Currently, the accuracy of text detectors still needs improvement, particularly for text with irregular shapes in complex environments. We propose a pixel-wise method based on instance segmentation for scene text detection. Specifically, a text instance is split into five components: a Text Skeleton and four Directional Pixel Regions; the instance is then restored from these elements, and when one region fails, it receives supplementary information from the others. In addition, a Confidence Scoring Mechanism is designed to filter out symbols and characters that merely resemble text instances. Experiments on several challenging benchmarks demonstrate that our method achieves state-of-the-art results in scene text detection, with an F-measure of 84.6% on Total-Text and 86.3% on CTW1500.

1. Introduction

Detecting text in the real world is a fundamental computer vision task that directly determines the subsequent recognition results. Many real-world applications depend on accurate text detection, such as photo translation [1] and autonomous driving [2]. Today, horizontal [3–5] and oriented [6–10] methods no longer meet practical requirements, and more flexible pixel-wise detectors [11, 12] have become mainstream. However, precisely locating text instances remains a challenge because of arbitrary angles, shapes, and complex backgrounds.

The first challenge involves text instances with irregular shapes. Unlike other common objects, such an instance often cannot be accurately described by a horizontal box or an oriented quadrilateral. Some typical methods (e.g., EAST [8] and TextBoxes++ [10]) perform well on common benchmarks (e.g., ICDAR 2013 [13] and ICDAR 2015 [14]) but degrade on curved text challenges, as shown in Figure 1(a).

The second challenge is separating the boundaries of adjacent text instances. Although pixel-wise methods are not constrained to a particular shape, they may still fail to separate text areas whose edges are adjacent, as shown in Figure 1(b).

The third challenge is that text identification may face the false-positive dilemma [15] because of the lack of context information. Some symbols or characters that merely resemble text may be misclassified.

To overcome the aforementioned challenges, we propose a novel method called TextCohesion. As shown in Figure 2, our method treats a text instance as a combination of a Text Skeleton and four Directional Pixel Regions, where the former roughly represents the shape and profile, and the latter are responsible for refining the original region from four directions. Notably, a pixel may belong to more than one Directional Pixel Region (e.g., up and left), which gives the instance more chances to be recovered. Furthermore, the confidence score of every Text Skeleton is reviewed, and only those scoring higher than a threshold are considered candidates.

2. Related Work

Detecting text in the wild has been widely studied in the past few years. Before the deep learning era, most detectors adopted Connected Components Analysis [16–21] or Sliding Window-based classification [22–25].

Today, detectors are mainly based on deep neural networks, and there are two main trends in the field of text detection: regression-based and pixel-based methods. Inspired by promising object detection architectures such as Faster R-CNN [26] and SSD [27], a number of regression-based detectors have been proposed, which directly regress the coordinates of candidate bounding boxes as the final prediction. TextBoxes [7] adopts SSD and adjusts the default boxes to relatively long shapes to match text instances. PyrBoxes [28] proposes an SSD-based detector equipped with a grouped pyramid to enrich features. Sheng [29] proposes a novel text detector with learnable anchors to cover the wide variety of texts in natural scenes. Lyu [30] detects scene text by localizing the corner points of text bounding boxes and segmenting text regions at relative positions. By modifying Faster R-CNN, Rotation Region Proposal Networks [31] insert a rotation branch to fit the oriented shapes of text in natural images. These methods achieve satisfying performance on horizontal or multioriented text; however, they may still suffer from the rigidity of the bounding box, even with rotations.

Mainstream pixel-wise methods draw inspiration from the fully convolutional network (FCN) [32], which removes all fully connected layers and is widely used to generate semantic segmentation maps; transposed convolutions then restore the shrunken features to their original size. TextSnake [11] treats a text instance as a sequence of ordered, overlapping disks centered on the symmetric axis, each associated with a potentially variable radius and orientation, and made significant progress on curved text benchmarks. TextField [33] learns a direction field pointing away from the nearest text boundary at each text point, represented as an image of two-dimensional vectors. SPCNET [34], based on FPN [35] and Mask R-CNN [36], inserts a Text Context Module and a Rescore mechanism to remedy the lack of context clues and inaccurate classification scores. PSENet [37] projects features into several maps and gradually expands the detected areas from small kernels to large, complete instances. These pixel-based methods significantly improve performance on curved benchmarks; however, detection failures are still possible in complex situations.

Different from previous methods, the proposed method has more opportunities to recover an instance. Specifically, the Text Skeleton represents the profile of the instance, which is smaller and less sticky than the original form. Pixels in text areas are divided into two groups according to four directions: up-down and left-right. Ideally, a TS can be integrated with either group to restore itself, and when some regions fail to reproduce, it can still obtain supplementary information from the others. We conduct extensive experiments on standard benchmarks, including horizontal, oriented, and curved text datasets. Evaluations demonstrate that TextCohesion achieves state-of-the-art or very competitive performance.

3. Methodology

The architecture of TextCohesion is depicted in Figure 2 and consists of a feature extraction section and a postprocessing section. For image feature extraction, an FCN-based convolutional backbone followed by an up-sampling step is employed. After up-sampling, five feature maps are generated, containing a Text Skeleton (TS) and four Directional Pixel Regions (DPRs). The TS candidates are evaluated by a Confidence Scoring Mechanism (CSM), and the predicted text regions are finally obtained by incorporating the DPRs. To optimize the proposed network, a corresponding loss function over the TS and DPRs is designed. More details are introduced in the following sections.

3.1. Network

The proposed method inherits the popular VGG16 network, keeping the layers from Conv1 to Conv5 and converting the last fully connected layers into convolution layers. The input images are first downsampled through the five convolution blocks, producing five multilevel feature maps (denoted $f_1, f_2, f_3, f_4, f_5$). Then, these features are gradually upsampled back to the original size and mixed with the corresponding output of the previous convolution block. The upsampling process can be described by

$$h_1 = f_5, \qquad h_i = \mathrm{UpSample}_{\times 2}(h_{i-1}) \oplus f_{6-i}, \quad i = 2, \dots, 5,$$

where $h_i$ is the output of the $i$-th merging stage, "$\oplus$" refers to feature concatenation, and $\mathrm{UpSample}_{\times 2}(\cdot)$ is the upsample function used to resize a feature map to match the other layers. After the upsampling step, five feature maps with the same resolution are leveraged as the prediction of the network (the blue box shown in Figure 2). Each prediction is composed of a TS and four DPRs used in the postprocessing. The DPRs contain four feature maps according to different directions: $D_{up}$, $D_{down}$, $D_{left}$, and $D_{right}$. The TS is the skeleton of the text instance and is adopted to separate instances from each other. The CSM is introduced to reduce false positives by evaluating each TS. For clarity, we take a curved text as an example to demonstrate the process of label generation in the rest of Section 3.
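To make this merging scheme concrete, the following PyTorch sketch implements a decoder of the form given above. It is a minimal illustration under stated assumptions, not the authors' released code: the module names (MergeUnit, Decoder), the channel widths, and the placement of $f_1,\dots,f_5$ at scales 1/2 to 1/32 of the input are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MergeUnit(nn.Module):
    """One decoder stage: upsample the running feature h by 2, concatenate the
    matching backbone map f (the "⊕" above), and fuse with 1x1 + 3x3 convs."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.conv3x3 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, h, f):
        h = F.interpolate(h, scale_factor=2, mode="bilinear", align_corners=False)
        x = torch.cat([h, f], dim=1)          # feature concatenation on channels
        return F.relu(self.conv3x3(F.relu(self.conv1x1(x))))

class Decoder(nn.Module):
    """Fuses VGG16 features f1..f5 (assumed at 1/2..1/32 scale) back to input
    resolution and predicts 5 maps: 1 TS + 4 Directional Pixel Regions."""
    def __init__(self, channels=(64, 128, 256, 512, 512)):
        super().__init__()
        c = channels
        self.merge4 = MergeUnit(c[4] + c[3], 256)   # h2 = UpSample(f5) ⊕ f4
        self.merge3 = MergeUnit(256 + c[2], 128)
        self.merge2 = MergeUnit(128 + c[1], 64)
        self.merge1 = MergeUnit(64 + c[0], 32)
        self.head = nn.Conv2d(32, 5, kernel_size=1)  # TS + D_up/down/left/right

    def forward(self, f1, f2, f3, f4, f5):
        h = self.merge4(f5, f4)
        h = self.merge3(h, f3)
        h = self.merge2(h, f2)
        h = self.merge1(h, f1)
        h = F.interpolate(h, scale_factor=2, mode="bilinear",
                          align_corners=False)       # back to full resolution
        return self.head(h)                          # logits; sigmoid downstream
```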

3.2. Text Skeleton

Text Skeleton (TS) is an essential component representing the center part of the text instance. As shown in Figure 3(b), the gray area is the TS of the instance. The first step of generating the TS is to find the head and tail of the text. Similar to [11], we use the cosine of adjacent vertices to find the head and tail of the text instance, leaving the remaining two longest sides. These two longest sides running along the text instance are called sidelines in the proposed method. Then, $n$ evenly distributed vertices are sampled from each of the two sidelines (i.e., the Top Sideline and Bottom Sideline in Figure 3(a)). After that, the vertices of the center line (Figure 3) can be averaged from these sampled vertices:

$$c_i = \frac{t_i + b_i}{2}, \quad i = 1, 2, \dots, n, \qquad (1)$$

where $t_i$ and $b_i$ are the sampled vertices on the two sidelines of the text instance, respectively, and the $c_i$ form the set of vertices belonging to the center line. Finally, the TS is bolded from the center line by

$$p_i^{t} = c_i + \alpha\,(t_i - c_i), \qquad p_i^{b} = c_i + \alpha\,(b_i - c_i), \qquad (2)$$

where $p_i^{t}$ and $p_i^{b}$ are the pixels that represent the expansion of the center line toward both sidelines. The region enclosed by these expanded vertices forms a part of the TS, as shown in Figure 3(b). $\alpha$ is a parameter that controls the bold rate, and we set it to 0.2 experimentally. When all vertices have been processed, the TS is generated accordingly.
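For illustration, the short NumPy sketch below generates a TS polygon from the sampled sideline vertices, following the reconstruction of equations (1) and (2) above; the function name and array layout are our own, not part of a released implementation.

```python
import numpy as np

def make_text_skeleton(top, bottom, alpha=0.2):
    """Build a Text Skeleton (TS) polygon from n vertices sampled on the two
    sidelines.  top, bottom: (n, 2) arrays of evenly sampled vertices on the
    Top/Bottom Sidelines; alpha: bold rate (0.2 in the paper)."""
    top = np.asarray(top, dtype=np.float32)
    bottom = np.asarray(bottom, dtype=np.float32)
    center = (top + bottom) / 2.0               # Eq. (1): c_i = (t_i + b_i) / 2
    p_top = center + alpha * (top - center)     # Eq. (2): expand toward the top
    p_bot = center + alpha * (bottom - center)  #          and bottom sidelines
    # TS polygon: expanded top vertices, then expanded bottom vertices reversed
    return np.concatenate([p_top, p_bot[::-1]], axis=0), center
```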

3.3. Directional Pixel Region

Directional Pixel Regions (DPRs) are used to restore the original form of a text instance and include $D_{up}$, $D_{down}$, $D_{left}$, and $D_{right}$. Pixels that lie in the text instance but not in the TS are considered as falling into the DPRs. In Figure 3(b), the colored areas illustrate fractions of the DPRs. The direction of every fraction is determined by the tangent angle between its corresponding center vertex ($c_i$) and the next one ($c_{i+1}$). More specifically, the tangent angle of two adjacent center vertices is calculated by the following equation:

$$\theta_i = \arctan\left(\frac{y_{i+1} - y_i}{x_{i+1} - x_i}\right), \qquad (3)$$

where $(x_i, y_i)$ and $(x_{i+1}, y_{i+1})$ refer to the coordinates of the center vertices. By comparing the $\theta_i$ of the center vertices with a boundary parameter $\theta_{range}$, the surrounding regions are labeled as DPRs or background. If $\theta_i$ falls into a specific range (e.g., $[-\theta_{range}, \theta_{range}]$), the pixels within the corresponding quadrilaterals above and below the TS are considered as belonging to $D_{up}$ or $D_{down}$. $D_{up}$ can be calculated as follows:

$$D_{up} = \{\, p \mid |\theta_i| \le \theta_{range},\; y_s \le y_p < y_c \,\}, \qquad (4)$$

where $|\theta_i| \le \theta_{range}$ distinguishes the angle of adjacent center vertices and $y_s \le y_p < y_c$ ensures that the selected pixels are above the TS. $\theta_{range}$ is a parameter that controls the boundary of the specific directional regions, which is discussed in detail in the experiment section. $y_s$ and $y_c$ are the vertical coordinates of the vertices on the sideline and the center line, respectively. The generating process of $D_{down}$ is similar to that of $D_{up}$; the only difference is that the pixels are located below the TS, so the vertical condition is reversed naturally:

$$D_{down} = \{\, p \mid |\theta_i| \le \theta_{range},\; y_c < y_p \le y_{s'} \,\}, \qquad (5)$$

where $y_{s'}$ and $y_c$ are logically equivalent to $y_s$ and $y_c$ in equation (4), with $y_{s'}$ being the vertical coordinate of the sampled vertices on the opposite sideline. $D_{left}$ and $D_{right}$ are generated in the same way, as shown below:

$$D_{left} = \{\, p \mid |\theta_i| > \theta_{range},\; x_s \le x_p < x_c \,\}, \qquad D_{right} = \{\, p \mid |\theta_i| > \theta_{range},\; x_c < x_p \le x_{s'} \,\}, \qquad (6), (7)$$

where $x_s$ and $x_c$ are the horizontal coordinates of the vertices on the sideline and the center line, respectively.
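A minimal sketch of the direction assignment, under our reconstruction of equation (3), is given below. The default theta_range value is an illustrative placeholder (the paper's chosen angle is not stated here), and the ±π wrap-around handling is our addition for segments traversed in the opposite direction.

```python
import numpy as np

def tangent_angles(center):
    """Eq. (3): tangent angle between each center vertex c_i and the next."""
    d = np.diff(np.asarray(center, dtype=np.float32), axis=0)
    return np.arctan2(d[:, 1], d[:, 0])      # (y_{i+1}-y_i, x_{i+1}-x_i)

def is_up_down_segment(center, theta_range=np.pi / 4):   # placeholder value
    """True where a center-line segment runs near-horizontally, so its
    off-skeleton pixels go to D_up/D_down; False means D_left/D_right."""
    theta = tangent_angles(center)
    return (np.abs(theta) <= theta_range) | (np.abs(theta) >= np.pi - theta_range)
```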

3.4. Confidence Scoring Mechanism

To filter out false positives, a confidence score is computed for every TS. If the score of a TS is lower than a threshold, all components of the instance are discarded:

$$S(TS) = \frac{1}{N}\sum_{i=1}^{N} v_i, \qquad \text{keep the TS only if } S(TS) \ge \lambda, \qquad (8)$$

where $N$ is the total number of pixels in the TS, $v_i$ is the predicted value of the $i$-th pixel in the TS region, and the score can be interpreted in terms of the true-positive and false-positive pixels within the TS. $\lambda$ is the threshold value used to filter out a TS with a low confidence score, and we set it to 0.6 empirically. A TS with high confidence is retained and processed to form the final prediction together with its corresponding DPRs; otherwise, the TS and its components are filtered out directly. The TS, as the central area of a text instance, contains the key features of the whole text, which are more reliable to score than the features of the entire instance.
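The CSM step itself reduces to an average-and-threshold test, as the following NumPy sketch shows; the function and argument names, and the boolean-mask layout, are our assumptions.

```python
import numpy as np

def csm_filter(ts_masks, score_map, lam=0.6):
    """Confidence Scoring Mechanism: keep a TS only if the mean predicted
    confidence over its pixels reaches the threshold lam (Eq. (8), lam = 0.6).
    ts_masks: list of boolean (H, W) masks, one per candidate TS;
    score_map: (H, W) per-pixel text confidence from the network."""
    kept = []
    for mask in ts_masks:
        n = mask.sum()                      # N: number of pixels in the TS
        if n == 0:
            continue
        score = score_map[mask].mean()      # average pixel confidence
        if score >= lam:                    # below lam: discard TS and its DPRs
            kept.append(mask)
    return kept
```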

3.5. Loss Function

The proposed method is trained with a loss function composed of three objectives:

$$L = w_{ts} L_{ts} + L_{dpr} + L_{csm}, \qquad (9)$$

where $L_{dpr}$ is a Smooth L1 [26] loss and $L_{ts}$ and $L_{csm}$ are cross-entropy classification loss functions. The loss of the TS is computed as follows:

$$L_{ts} = -\frac{1}{N}\sum_{i} w_i \left[ y_i \log \hat{y}_i + (1 - y_i)\log(1 - \hat{y}_i) \right], \qquad (10)$$

which is a self-adjusting cross-entropy loss, with $w_i$ in equation (10) being a self-adjusting weight [9]. For an instance with area $A_k$, every positive pixel within it has a weight of $w_i = \bar{A}/A_k$, where $\bar{A}$ is the average area of all text instances in one image. In that case, the pixels in text instances with small areas receive a bigger weight than the pixels in large text areas. In our experiments, the weight $w_{ts}$ is set to 3, as the TS is more essential than the other components. The losses for the DPRs and the CSM are calculated as

$$L_{dpr} = \sum_{d \in \{up,\,down,\,left,\,right\}} \mathrm{SmoothL1}(\hat{D}_d, D_d), \qquad L_{csm} = \mathrm{CE}(\hat{s}, s), \qquad (11), (12)$$

where $L_{dpr}$ is optimized by a Smooth L1 loss and the pixel losses in the four directional maps are calculated separately, which means that one pixel can be simultaneously categorized into two regions (e.g., $D_{up}$ and $D_{left}$). $\mathrm{CE}$ is a standard cross-entropy function, $D_d$ and $s$ are ground-truth labels, and $\hat{D}_d$ and $\hat{s}$ are predicted values.
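A compact PyTorch version of this three-part objective, under the reconstruction above, might look as follows; the tensor layouts, function name, and the use of binary cross-entropy for the TS and confidence branches are our assumptions.

```python
import torch
import torch.nn.functional as F

def textcohesion_loss(pred_ts, gt_ts, inst_area, pred_dpr, gt_dpr,
                      pred_score, gt_score, w_ts=3.0):
    """Total loss of Eq. (9): weighted TS cross-entropy (Eq. (10)), Smooth-L1
    over the four DPR maps (Eq. (11)), and cross-entropy for the confidence
    branch (Eq. (12)).  Assumed layouts: pred_ts/gt_ts (B, H, W) with sigmoid
    probabilities; inst_area (B, H, W): area A_k of the instance each positive
    pixel belongs to (at least one positive pixel assumed); pred_dpr/gt_dpr
    (B, 4, H, W)."""
    # Self-adjusting weight: pixels of small instances weigh more (A_bar / A_k)
    pos = gt_ts > 0
    mean_area = inst_area[pos].mean()
    weight = torch.ones_like(gt_ts)
    weight[pos] = mean_area / inst_area[pos]

    l_ts = F.binary_cross_entropy(pred_ts, gt_ts, weight=weight)
    l_dpr = F.smooth_l1_loss(pred_dpr, gt_dpr)   # pixels may lie in two DPRs
    l_csm = F.binary_cross_entropy(pred_score, gt_score)
    return w_ts * l_ts + l_dpr + l_csm
```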

3.6. Postprocessing

TextCohesion treats every text instance as a TS and four DPRs, as described previously; hence, these components must be grouped to form the final prediction. The postprocessing algorithm is depicted in Algorithm 1:

Input: TS and the four DPRs (D_up, D_down, D_left, D_right) of one candidate
Output: text region R
(1) R ← TS
(2) Function Grouping (p)
(3)  for each pixel q adjacent to p with q ∉ R do
(4)   d ← Direction (p, q)
(5)   if d = up and q ∈ D_up then
(6)    R ← R ∪ {q}
(7)    Grouping (q)
(8)   else if d = down and q ∈ D_down then
(9)    R ← R ∪ {q}
(10)   Grouping (q)
(11)  else if d = left and q ∈ D_left then
(12)   R ← R ∪ {q}
(13)   Grouping (q)
(14)  else if d = right and q ∈ D_right then
(15)   R ← R ∪ {q}
(16)   Grouping (q)
(17)  else
(18)   Return
(19)  end if
(20) end for

Every TS represents a text instance, and after passing through the CSM, instances with higher confidence are reserved as candidates. Based on these candidates, the corresponding DPRs can be obtained. The postprocessing mainly includes three steps. (1) The TS is used to differentiate the different text instances. (2) For each TS, the outer pixels are used as initial points to iteratively search the corresponding pixels in the DPRs. (3) The TS is eventually merged with the searched regions to form the final prediction. The entire postprocessing is shown in Algorithm 1, where Direction(·) refers to a function that obtains the directional relation between adjacent pixels.
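Because deep recursion over pixels easily exceeds stack limits, an equivalent iterative (queue-based) form of Grouping() is often preferable in practice. The sketch below is our illustration of steps (2) and (3), not the authors' implementation; image coordinates are assumed, so "up" means decreasing row index.

```python
import numpy as np
from collections import deque

# Directional neighbors: stepping up from a pixel should land in D_up, etc.
STEPS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def group_instance(ts_mask, dpr_masks):
    """Grow one TS into a full text region by iteratively absorbing DPR pixels
    that lie in the matching directional map (iterative flood fill replacing
    the recursive Grouping() of Algorithm 1).
    ts_mask: (H, W) bool; dpr_masks: dict direction -> (H, W) bool."""
    h, w = ts_mask.shape
    region = ts_mask.copy()
    queue = deque(zip(*np.nonzero(ts_mask)))     # start from the TS pixels
    while queue:
        y, x = queue.popleft()
        for direction, (dy, dx) in STEPS.items():
            ny, nx = y + dy, x + dx              # Direction(): where the step goes
            if 0 <= ny < h and 0 <= nx < w and not region[ny, nx] \
                    and dpr_masks[direction][ny, nx]:
                region[ny, nx] = True
                queue.append((ny, nx))
    return region
```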

4. Experiment

To evaluate TextCohesion, we conduct extensive experiments on both oriented and curved benchmarks. Below, we describe the datasets used for model training and inference, the experimental implementation, the results with comparisons, and an ablation study.

4.1. Datasets

SynthText [38] is a large-scale dataset that contains about 800K synthetic images, created by blending natural images with text rendered in random fonts, sizes, colors, and orientations. These texts look realistic because the overlay follows carefully designed configurations and a well-tuned rendering algorithm.

ICDAR 2015 [14] contains 1000 training images and 500 test images captured by wearable cameras at relatively low resolution. Each image includes several oriented texts annotated by the four vertices of a quadrangle.

ICDAR 2017 MLT (IC17-MLT) [39] is a large-scale multilingual text dataset, which includes 7200 training images, 1800 validation images, and 9000 testing images. The dataset is composed of complete scene images in 9 languages. As in ICDAR 2015, the text regions in ICDAR 2017 MLT are annotated by the four vertices of a quadrangle.

CTW1500 [40] is a challenging dataset for curved text detection constructed by Yuliang et al. [18]. It consists of 1000 training images and 500 testing images. Different from traditional text datasets (e.g., ICDAR 2015 and ICDAR 2017 MLT), the text instances in SCUT-CTW1500 are labeled by polygons with 14 points, which can describe the shape of arbitrarily curved text.

Total-Text [41] is another word-level English curved text dataset, which is split into training and testing sets with 1255 and 300 images, respectively (Figure 4).

4.2. Implementation Details

Training: TextCohesion is optimized by SGD with backpropagation [42]. Momentum is set to 0.9 with weight decay, and the learning rate is decayed by a factor of 0.1 every 30 epochs. Following [11], all training images are augmented online by rotating and cropping with areas ranging from 0.24 to 1.69 and aspect ratios ranging from 0.33 to 3. After that, noise, blur, and lightness are randomly adjusted, and the images are finally resized to a fixed input size. We ensure that text on the augmented images is still legible if it was legible before augmentation. TextCohesion is first pretrained on SynthText for 2 epochs and then fine-tuned on the other datasets. All implementations are deployed on a PC (CPU: Intel(R) Core(TM) i7-7800X CPU @ 3.50 GHz; GPU: GTX 1080).
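The optimization schedule described above might be set up as follows. The momentum (0.9) and the ×0.1 decay every 30 epochs follow the paper; the learning rate and weight decay values are placeholders, since their exact values are not given in the text, and the model is a stand-in module.

```python
import torch
import torch.nn as nn

# Sketch of the optimization schedule; lr and weight decay are PLACEHOLDERS.
model = nn.Conv2d(3, 5, kernel_size=1)              # stand-in for TextCohesion
optimizer = torch.optim.SGD(model.parameters(),
                            lr=1e-3,                # placeholder value
                            momentum=0.9,           # as stated in the paper
                            weight_decay=5e-4)      # placeholder value
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=30, gamma=0.1)

for epoch in range(90):
    # ... one pass over the augmented training data would go here ...
    scheduler.step()                 # decay the lr by 0.1 every 30 epochs
```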

Inference: The inference time of the proposed method is also compared with that of other methods, e.g., DB [43]. For all comparison experiments, the input images are resized to the same testing scale and the batch size is set to 1. The main results are reported in Tables 1–4, where an acceptable inference time can be observed.
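A rough way to reproduce such a batch-size-1 latency measurement is sketched below; the protocol (no warm-up, wall-clock timing) is our simplification, not the benchmark script behind Tables 1–4.

```python
import time
import torch

@torch.no_grad()
def mean_inference_time(model, images, runs=50):
    """Rough single-image latency: forward each image with batch size 1 and
    average the wall-clock time over `runs` samples."""
    model.eval()
    start = time.perf_counter()
    for img in images[:runs]:
        model(img.unsqueeze(0))      # batch size 1, as in the comparison
    return (time.perf_counter() - start) / min(runs, len(images))
```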

4.3. Experiments on Curved Text Benchmarks

To test the ability to detect arbitrarily shaped text, we evaluate our method on Total-Text and CTW1500, both of which contain curved instances. Images in the test stage are also resized to the same fixed size. We report the performance on CTW1500 in Table 1, where the Precision (88.0%), Recall (84.6%), and F-measure (86.3%) achieved by TextCohesion significantly outperform those of the other competitors. Remarkably, the Recall and F-measure surpass the second-best record by 4.7% and 2.7%, respectively.

Our method achieves 88.1%, 81.4%, and 84.6% in Precision, Recall, and F-measure, respectively, outperforming the second-best competitor by 1.0% in F-measure on Total-Text. We attribute this performance to the proposed flexible representation: instead of taking the text as a whole, it treats text as a series of components and integrates them to form the final prediction.

4.4. Experiments on Oriented Text Benchmarks

In this section, we evaluate TextCohesion on oriented text datasets. The performance on ICDAR 2015 and ICDAR 2017 MLT is shown in Tables 3 and 4, where our method achieves F-measures of 89.1% and 73.1%, respectively. From these results, it can be observed that our method also achieves very competitive performance on oriented text. Meanwhile, thanks to the robust feature representation, TextCohesion can also locate small text instances under complex illumination and at variable scales.

4.5. Analyses and Discussion
4.5.1. Influence of the Number of Samples (n)

We sample n points on the top sideline and the bottom sideline of each text instance and use these points to better split text instances. To further study the influence of the number of sampled points, an ablation experiment is performed, as shown in Figure 5(a). Theoretically, the performance of the model should improve as the sampling precision increases. In the experiment, we found that the performance of the model hardly improves further (around 85%) once the sampling number n is greater than 10. n is set to 40 in all experiments.

4.5.2. Influence of α in Equation (2)

The parameter α controls the ratio of the TS area to the DPR area. As shown in Figure 5(b), the network performs well when the value of α is within the range [0.1, 0.6]. In all experiments, α is set to 0.2.

4.5.3. Influence of θ_range in Equation (3)

The parameter $\theta_{range}$ is used to delineate the top, bottom, left, and right regions. Three specific angles are evaluated to investigate the influence of $\theta_{range}$. As shown in Table 1, the F-measure is relatively good at one particular setting, which we therefore use for $\theta_{range}$ in all experiments.

4.5.4. Influence of the Confidence Scoring Mechanism

The CSM is used to filter out false positives (e.g., symbols or characters that are similar to text). The influence of the CSM on the results of the model is shown in Table 5. The precision improves after the CSM (λ = 0.6) is applied. To test the robustness of the proposed model when changing λ in equation (8), a comparison experiment is also reported in Table 5; the F-measure is relatively good when λ is 0.6. In all experiments, λ is set to 0.6.

5. Conclusion and Outlook

In this paper, we propose a novel text detector that achieves up to 86.3% F-measure on common text benchmarks, including those with irregularly shaped text instances. The text instance modeling method in this detector can precisely detect text with arbitrary boundaries by splitting one text instance into four DPRs and a TS region. Moreover, a Confidence Scoring Mechanism is incorporated to filter out false positives, which further improves detection precision. Experimental results show that the proposed text detector performs well in scene text detection. The proposed method has potential applications in photo translation, autonomous driving, and product identification.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Weijia Wu and Jici Xing contributed equally to this work.

Acknowledgments

This work was supported by the National Key Research and Development Project (Grant no. 2019YFC0118202), National Natural Science Foundation of China (Grant no. 61803332), and Scientific Research Fund of Zhejiang Provincial Education Department (Grant no. Y201941642).