Abstract

Multilingual text detection in natural scenes remains a challenging task in computer vision. In this paper, we apply an unsupervised learning algorithm to learn language-independent stroke features and combine unsupervised stroke feature learning with automatic multilayer feature extraction to improve the representational power of text features. We also develop a novel nonlinear network based on the traditional Convolutional Neural Network that is able to detect multilingual text regions in images. The proposed method is evaluated on standard benchmarks and a multilingual dataset and demonstrates improvement over previous work.

1. Introduction

Texts within images contain rich semantic information, which can be very beneficial for visual information retrieval and image understanding. Driven by a variety of real-world applications, scene text detection and recognition have become active research topics in computer vision. To efficiently read text from photographs, the majority of methods follow an intuitive two-step process: text detection followed by text recognition [1]. To a large extent, the performance of text detection affects the accuracy of text recognition. Extracting textual information from natural scenes is a critical prerequisite for text recognition and other image analysis tasks.

Text detection has been considered in many studies, and considerable progress has been achieved in recent years [2–14]. However, most text detection methods have focused on English; few investigations have addressed the problem of multilingual text detection. In our daily lives, multilingual texts coexist everywhere; many environments contain text in two or more scripts within a single image, for example, product tags, street signs, license plates, billboards, and guide information. More and more applications need to detect text regardless of language type.

For different languages, the characters take many different forms and have inconsistent heights, strokes, and writing formats. There are thousands of languages in the world, and representative and universal features of multilingual text are still unknown. In addition, text embedded in images can vary in font style and size, alignment and orientation, color, and lighting conditions. Due to these factors, multilingual text detection in natural scenes is a challenging and important problem.

Our study focuses on learning general stroke feature representations and detecting text in images even in a multiscript environment. Unlike traditional methods, which mainly rely on combinations of hand-engineered features, we aim to test the feasibility of building a common text detector using only automatically learned text features, obtained by improving a discriminative clustering algorithm to produce language-independent stroke features. The learned stroke features, incorporated into a nonlinear neural network, provide an alternative way to effectively increase the character representational power. Using the deep-learned text features, we are able to locate text with simple nonmaximal suppression.

In the following, we first review recently published literature and then present the proposed multilingual text detection method in Sections 3 and 4. In Section 5, the experimental evaluation is presented. The paper is concluded in Section 6.

2. Related Work

Existing methods proposed for text detection in natural scenes can be broadly categorized into two groups: connected component methods and sliding window methods.

Connected component methods separate text and nontext information at the pixel level and group text pixels to construct character candidates by connected component analysis. Epshtein et al. [2] leveraged the idea of recovering stroke width and proposed using the CCs in a stroke-width-transformed image. Yao et al. [3] extracted regions in the Stroke Width Transform (SWT) domain. Neumann and Matas [4] posed the character detection problem as an efficient sequential selection from the set of Extremal Regions (ERs). Chen et al. [5] determined the stroke width using the distance transform of edge-enhanced Maximally Stable Extremal Regions (MSERs). Using MSERs as CCs representing characters has become the focus of several recent works [6–9].

Sliding window-based methods, also known as region-based methods, scan a subwindow through the image to search for possible text and then use machine learning techniques to locate it. Wang et al. [10], extending their previous work [11], built an end-to-end scene text recognition system based on a sliding window character classifier using Random Ferns. Wang et al. [12] used multilayer neural networks for text detection. Jaderberg et al. [13] achieved state-of-the-art performance by implementing sliding window detection as a byproduct of a Convolutional Neural Network (CNN).

In the task of multilingual text detection, previous studies are mostly sliding window-based. In [14, 16], the authors proposed similar methods that use hand-engineered features to describe text; a subwindow is scanned over different scales and positions of the image pyramid in order to classify text in images. To achieve text detection that is invariant to language type, the feature representation is therefore very important. However, little research has attempted to apply deep learning to learn multilingual text features. The CNN is a special kind of neural network whose deep nonlinear structure shows a strong ability to learn discriminative features from observed samples. We therefore investigate the problem of multilingual text detection within the framework of CNNs.

3. Stroke Feature Learning

According to the study of linguistics, the basic features of text are strokes, such as those of the Latin alphabet and the Chinese basic strokes, and different languages share similar characteristics in appearance. Inspired by these observations, it should be possible to design language-independent stroke features.

In order to cope with multilingual scenes, we seek to learn a bank of universal low-level stroke features directly from raw images. The learned stroke features should capture the essential substructures of strokes while remaining representative and discriminative. Many unsupervised learning algorithms can be used to learn hidden data prototypes from a dataset, such as $K$-means clustering and sparse coding. The goal of sparse coding is to construct a dictionary $D$ and minimize the reconstruction error $\|Ds^{(i)} - x^{(i)}\|^2$, so that a data vector $x^{(i)}$ ($i = 1, \ldots, m$) can be mapped to a code vector $s^{(i)}$. For every $s^{(i)}$, the sparse coding algorithm must repeatedly solve a convex optimization problem, which becomes very expensive when applied to large scale image data. By contrast, the optimal code in the classic $K$-means algorithm is simply

$$s_j^{(i)} = \begin{cases} D_j^\top x^{(i)}, & j = \arg\max_l \left| D_l^\top x^{(i)} \right|, \\ 0, & \text{otherwise}. \end{cases} \quad (1)$$

In addition, $K$-means has been identified by computer vision researchers as a fast and effective method for learning features from images. Therefore, we improve the variant of $K$-means clustering proposed by Coates et al. [18] and use it to learn stroke feature representations, since it learns representative stroke features from large collections while being much faster.

In particular, we first collect a set of training images: gray scale images extracted from the ICDAR 2003, ICDAR 2011, and ICDAR 2013 datasets, our multilingual dataset, and Google, each containing a character in the middle of the image. Characters in the training images include 26 uppercase letters, 26 lowercase letters, 10 digits, 20 Chinese basic strokes, and 28 Arabic letters. Some training images used for stroke feature learning are illustrated in Figure 1. We randomly extract fixed-size pixel patches from the images and apply contrast normalization to each patch before training. In order to avoid generating many highly correlated stroke features, ZCA whitening is then applied to the patches to yield the vectors $x^{(i)}$, $i = 1, \ldots, m$.
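To make the preprocessing concrete, the sketch below implements random patch extraction, contrast normalization, and ZCA whitening in Python with NumPy. The patch size, patch count, and the two regularization constants are illustrative assumptions, not values fixed in this paper; the normalization and whitening steps follow the style of Coates et al. [18].

import numpy as np

def extract_patches(images, patch_size=8, n_patches=100000, seed=0):
    # Randomly crop square grayscale patches; the 8x8 size and the
    # patch count here are assumptions for illustration.
    rng = np.random.default_rng(seed)
    out = []
    for _ in range(n_patches):
        img = images[rng.integers(len(images))]
        y = rng.integers(img.shape[0] - patch_size + 1)
        x = rng.integers(img.shape[1] - patch_size + 1)
        out.append(img[y:y + patch_size, x:x + patch_size].ravel())
    return np.stack(out).astype(np.float64)            # shape (m, n)

def normalize_and_whiten(X, eps_norm=10.0, eps_zca=0.1):
    # Per-patch contrast normalization followed by ZCA whitening.
    X = (X - X.mean(1, keepdims=True)) / np.sqrt(X.var(1, keepdims=True) + eps_norm)
    U, S, _ = np.linalg.svd(np.cov(X, rowvar=False))   # eigendecomposition of covariance
    W = U @ np.diag(1.0 / np.sqrt(S + eps_zca)) @ U.T  # ZCA whitening matrix
    return X @ W                                       # whitened vectors x^(i)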

Because the $K$-means algorithm is highly dependent on initialization, different initial guesses of the centroids can seriously affect the clustering result. In order to obtain a desirable clustering result, we propose a novel initialization method to choose suitable initial stroke features. We introduce a dispersion metric over the local information of the data, guaranteeing that the initial centroids are selected from locally data-dense regions and that the centroids are separated from each other by a certain distance. Our initialization framework includes three steps: estimating the local dispersion metric for each data point, selecting the data whose metric exceeds a threshold as candidates for initial features, and determining the initial stroke features from the candidates. The proposed initialization method is implemented as follows. We first construct an adjacency graph and a Gram matrix $G$, computed according to

$$G_{ij} = \exp\left( -\frac{\left\| x^{(i)} - x^{(j)} \right\|^2}{2\sigma^2} \right), \quad (2)$$

where $\|x^{(i)} - x^{(j)}\|$ is the distance between patches $x^{(i)}$ and $x^{(j)}$. Secondly, we introduce the dispersion metric $p$ with components $p_i = \sum_j G_{ij}$ ($i = 1, \ldots, m$). We set a threshold $\varepsilon$; if the value $p_i$ associated with a data point is larger than the threshold, $x^{(i)}$ is marked as a candidate initial feature. Then, we use an algorithm similar to [19] to select $K$ initial stroke features from the candidates. Because we use the stroke features as the first-layer convolution kernels of our proposed CNN, $K$ is the number of first-layer convolutional filters. The detailed steps of the proposed initialization method are presented in Algorithm 1.

Input: the patches $x^{(i)}$, $i = 1, \ldots, m$
Output: initial features $D^{(0)}$
(1) $C \leftarrow \emptyset$
(2) construct $G$ based on (2)
(3) compute the dispersion metric $p$ and the threshold $\varepsilon$
(4) for all data $x^{(i)}$ in $X$
(5) if data $x^{(i)}$ has dispersion metric $p_i > \varepsilon$
(6)     $C \leftarrow C \cup \{x^{(i)}\}$
(7) end if
(8) end for
(9) $D^{(0)} \leftarrow \{d_1\}$, where $d_1$ is randomly selected from $C$
(10) $C \leftarrow C \setminus \{d_1\}$
(11) set $k = 1$
(12) repeat
(13) $d_{k+1} \leftarrow \arg\max_{x \in C} \min_{d \in D^{(0)}} \|x - d\|$
(14) $D^{(0)} \leftarrow D^{(0)} \cup \{d_{k+1}\}$, $C \leftarrow C \setminus \{d_{k+1}\}$, $k \leftarrow k + 1$
(15) until $k = K$
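A minimal Python sketch of Algorithm 1, under our reading of the procedure above: density (dispersion) scores from the Gaussian Gram matrix (2), thresholding, and farthest-point selection of the $K$ initial features so that the centroids stay well separated. The kernel width sigma and the quantile used for the threshold are assumptions for illustration.

import numpy as np

def init_stroke_features(X, K, sigma=1.0, quantile=0.5, seed=0):
    rng = np.random.default_rng(seed)
    # Pairwise squared distances and Gram matrix, cf. (2).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    G = np.exp(-sq / (2.0 * sigma ** 2))
    p = G.sum(axis=1)                      # dispersion metric p_i per point
    eps = np.quantile(p, quantile)         # threshold epsilon (assumed: median)
    cand = np.flatnonzero(p > eps)         # candidate set C
    chosen = [int(rng.choice(cand))]       # d_1: random candidate
    for _ in range(K - 1):
        # Distance of each candidate to its nearest already-chosen centroid.
        d = sq[np.ix_(cand, np.array(chosen))].min(axis=1)
        chosen.append(int(cand[np.argmax(d)]))   # farthest-point selection
    return X[np.array(chosen)]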

After initializing $D$, we learn the stroke features by minimizing

$$\min_{D, s} \sum_{i} \left\| D s^{(i)} - x^{(i)} \right\|^2 \quad \text{s.t.} \quad \left\| s^{(i)} \right\|_0 \le 1 \;\; \forall i, \quad \left\| D_j \right\|_2 = 1 \;\; \forall j. \quad (3)$$

For all $i$, we compute the inner products $D_j^\top x^{(i)}$ and keep only the maximal one: if $j = \arg\max_l |D_l^\top x^{(i)}|$, then $s_j^{(i)} = D_j^\top x^{(i)}$; otherwise $s_j^{(i)} = 0$. Then, fixing $s$, we minimize (3) to obtain $D$. The optimization is carried out by alternating minimization over $D$ and $s$. The full stroke feature learning algorithm with $K$-means is summarized in Algorithm 2.

Input: input patches $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, m$
Output: learned stroke features $D$
Procedure:
(1) Normalize the input:
       $x^{(i)} := \left( x^{(i)} - \operatorname{mean}(x^{(i)}) \right) / \sqrt{\operatorname{var}(x^{(i)}) + \epsilon_{\text{norm}}}$
(2) ZCA whiten the input:
       $[V, \Lambda] := \operatorname{eig}(\operatorname{cov}(x))$
       $x^{(i)} := V (\Lambda + \epsilon_{\text{zca}} I)^{-1/2} V^\top x^{(i)}$
(3) Initialize $D$, following the steps in Algorithm 1
(4) Repeat
   Set $s_j^{(i)} := D_j^\top x^{(i)}$ for $j = \arg\max_l |D_l^\top x^{(i)}|$
   Set $s_j^{(i)} := 0$ for all other $j$
   Fix $s$, update $D := X s^\top + D$, and normalize $D_j := D_j / \|D_j\|_2$
   Until convergence or the iteration limit is reached
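The main loop of Algorithm 2 can be sketched as follows, following the gain-shape vector quantization variant of $K$-means from Coates et al. [18]; the iteration count is an arbitrary illustration, and X and D0 are assumed to come from the preprocessing and initialization sketches above.

import numpy as np

def learn_stroke_features(X, D0, n_iter=10):
    # X: (m, n) whitened patches; D0: (K, n) initial stroke features.
    D = D0 / np.linalg.norm(D0, axis=1, keepdims=True)
    m = len(X)
    for _ in range(n_iter):
        P = X @ D.T                        # inner products D_j^T x^(i), shape (m, K)
        j = np.abs(P).argmax(axis=1)       # best-matching feature per patch
        S = np.zeros_like(P)
        S[np.arange(m), j] = P[np.arange(m), j]   # one-hot code s^(i), cf. (1)
        D = S.T @ X + D                    # dictionary update, fixing s
        D /= np.linalg.norm(D, axis=1, keepdims=True) + 1e-8  # renormalize rows
    return D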

For a general clustering algorithm, the number of clusters is known in advance or set by prior knowledge. In our method, the learned stroke features are incorporated into a Convolutional Neural Network classifier for text detection. Therefore, we further study how to choose the appropriate number of features to achieve the highest text/nontext classification accuracy. In order to analyze the impact of the number of learned stroke features, we learned four stroke feature sets of different sizes, up to $K = 320$. Since the stroke features serve as the first-layer convolution kernels of our CNNs, we train a detector with each stroke feature set and evaluate the performance of the detection model on a subset of the ICDAR 2003 test images. As shown in Figure 2, the $F$-measure increases as $K$ gets larger. Once $K$ reaches 256, the recall is at its maximum value, and most of the detected text matches the ground truth. When $K$ is greater than 256, the $F$-measure does not increase and is even slightly reduced. Based on this analysis, we select $K = 256$ in our method.
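This model-selection loop can be summarized by the sketch below; train_detector and evaluate_f are hypothetical stand-ins for the detector training and ICDAR-style evaluation described in the text, and the candidate list is illustrative (only 256 and 320 are explicit above).

def select_feature_count(patches, candidate_K, train_set, val_set):
    best_K, best_f = None, -1.0
    for K in candidate_K:
        D = learn_stroke_features(patches, init_stroke_features(patches, K))
        detector = train_detector(D, train_set)   # hypothetical training routine
        f = evaluate_f(detector, val_set)         # hypothetical F-measure evaluation
        if f > best_f:
            best_K, best_f = K, f
    return best_K

# Illustrative candidate sizes; the paper is explicit only about 256 and 320:
# best_K = select_feature_count(patches, [64, 128, 256, 320], train_set, val_set)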

4. Multilingual Text Detection

The idea of our text detection is to design a "feature learning" pipeline that leads to representative text features and to use these features for detecting multilingual text. The two main components in this pipeline are as follows: use the unsupervised clustering algorithm to generate a set of stroke features $D$; build a hierarchical network and combine it with the stroke features to learn a high-level text feature. The first component has been described in detail in Section 3. How to build and train the multilayer neural network is presented in this section.

By making several technical changes to the traditional Convolutional Neural Network architecture [20], we develop a novel classifier for multilingual text detection. We make two major improvements: unlike the traditional approach, in which the convolution kernels of a CNN are randomly initialized, we select the unsupervised learned stroke features as the first-layer convolution kernels of our network; and the intermediate features obtained from the learning process, which serve as the second-layer convolution kernels, can be used to extract text features more efficiently.

Our network has two convolutional layers. We fix the filters in the first convolutional layer to the stroke features learned in Section 3, so the low-level filters are fixed and number 256. We build a labeled training set of fixed-size cropped images (8877 positive, 9000 negative). Starting from the first layer, the input is a grayscale cropped training image. The input is convolved with the 256 first-layer filters, resulting in a response map with 256 channels (valid convolution is used to avoid boundary effects). The first convolutional layer output is a new feature map computed by a nonlinear response function. Convolutional layers can be intertwined with pooling layers, which simplify the system parameters by statistical aggregation of features; we average pool over the first convolutional layer response map to reduce its size. The sequence continues with another convolutional and pooling layer, resulting in feature maps with 256 channels whose size equals the dimension of the second-layer convolutional filters. The second-layer outputs are fully connected to the classification layer. An SVM classifier is used as a binary classifier to estimate whether an image contains text. We train the network using stochastic gradient descent and back-propagation. The classification error function includes a loss term and a regularization term: the loss term is a squared hinge loss, and the penalty uses the L2 norm. We also use dropout in the second convolutional layer to help prevent overfitting. The structure of the proposed neural network is presented in Figure 3.
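The following PyTorch sketch shows one plausible instantiation of this architecture. The 32 x 32 input, the 8 x 8 and 4 x 4 kernel sizes, the pooling sizes, the ReLU nonlinearity, and the regularization weight are assumptions for illustration; only the filter count (256), the fixed first layer, average pooling, dropout, and the squared hinge loss with L2 penalty come from the text.

import torch
import torch.nn as nn

class StrokeCNN(nn.Module):
    def __init__(self, stroke_features):  # assumed shape: (256, 1, 8, 8)
        super().__init__()
        self.conv1 = nn.Conv2d(1, 256, kernel_size=8)    # assumed: 32x32 -> 25x25
        # First-layer kernels fixed to the learned stroke features.
        self.conv1.weight = nn.Parameter(stroke_features, requires_grad=False)
        self.pool1 = nn.AvgPool2d(5)                     # assumed: 25x25 -> 5x5
        self.conv2 = nn.Conv2d(256, 256, kernel_size=4)  # assumed: 5x5 -> 2x2
        self.drop = nn.Dropout(0.5)                      # dropout in second conv layer
        self.pool2 = nn.AvgPool2d(2)                     # assumed: 2x2 -> 1x1
        self.fc = nn.Linear(256, 1)                      # SVM-style linear score

    def forward(self, x):
        x = torch.relu(self.conv1(x))   # ReLU assumed as the nonlinear response
        x = self.pool1(x)
        x = self.drop(torch.relu(self.conv2(x)))
        x = self.pool2(x).flatten(1)
        return self.fc(x)

def svm_loss(scores, y, model, lam=1e-4):
    # Squared hinge loss plus L2 penalty on the trainable parameters;
    # labels y are assumed to be +1 (text) / -1 (nontext).
    hinge = torch.clamp(1 - y * scores.squeeze(1), min=0) ** 2
    l2 = sum((p ** 2).sum() for p in model.parameters() if p.requires_grad)
    return hinge.mean() + lam * l2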

After the network has been trained, the detection process starts from a large raw-pixel input image and leverages the convolutional structure of the CNN to process the entire image. We slide a fixed-size window across the input image and feed these windows to the learned classifier, using the intermediate hidden layers as features to classify text/nontext and generate text bounding boxes. We use 12 different scales in our detection method. At a certain scale $s$, the input image is rescaled and the sliding window scans through the scaled image. At each point, if the window contains a single centered character, the detector produces a positive response. In each row of the scaled image, we check whether there are positive detector responses; if so, we form a line-level bounding box with the same height as the sliding window at scale $s$, whose left and right boundaries are given by the leftmost and rightmost positive responses. At each scale, the input image is resized and a set of candidate text bounding boxes is generated independently. The above procedure is repeated 12 times, yielding groups of possibly overlapping bounding boxes. We then apply nonmaximal suppression (NMS) to score each box and remove every box that overlaps by more than 50% with a higher-scoring box, obtaining the final text bounding boxes.
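A standard greedy NMS matching the 50% overlap criterion above can be sketched as follows; boxes are assumed to be in [x1, y1, x2, y2] form.

import numpy as np

def nms(boxes, scores, thresh=0.5):
    order = np.argsort(scores)[::-1]      # highest-scoring box first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        rest = order[1:]
        # Intersection of box i with every remaining box.
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_r = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_r - inter)
        order = rest[iou <= thresh]       # drop lower-scoring boxes overlapping > 50%
    return keep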

5. Experiments

5.1. Dataset

To evaluate the effectiveness and robustness of the proposed text detection algorithm, we have conducted experiments on standard benchmarks, including the challenging datasets ICDAR 2003 [21], MSRA-TD500 [3], and KAIST [17].

The ICDAR 2003 Robust Reading and Text Locating database is a widely used public dataset for scene text detection algorithms. The database contains 258 training images and 251 testing images, with the path to each image and its text bounding box annotations.

The MSRA-TD500 dataset contains images with text in English and Chinese. The dataset contains 500 images in total, with resolutions varying from 1296 × 864 to 1920 × 1280. These images were taken from indoor (office and mall) and outdoor (street) scenes using a pocket camera.

KAIST provides a scene text dataset consisting of 3000 images of indoor and outdoor scenes. Word and character bounding boxes are provided, as well as segmentation maps of characters. Texts in KAIST images are in English, Korean, or a mixture of English and Korean.

We also created a new multilingual dataset composed of three representative languages: English, Chinese, and Arabic. These three languages represent three types of writing systems: English represents alphabetic writing, Chinese represents ideographic writing, and Arabic represents an abjad. Each group, corresponding to one language, contains 80 images.

To learn the stroke features, the training samples include 5980 English text samples, 800 Chinese text samples, and 1100 Arabic text samples. In addition, 3000 nontext samples were extracted from 200 negative images using a bootstrap method. All samples are normalized to a fixed size consistent with the detection window.

5.2. Results

The proposed algorithm is implemented in MATLAB R2014b on an Intel Core i5 processor at 2.9 GHz with 8 GB RAM.

To validate the performance of our proposed algorithm, we use the definitions from the ICDAR 2003 competition [21] to calculate the text detection precision, recall, and $F$-measure:

$$p = \frac{\sum_{r_e \in E} m(r_e, T)}{|E|}, \qquad r = \frac{\sum_{r_t \in T} m(r_t, E)}{|T|},$$

where $m(r, R)$ is the best match between a rectangle $r$ and a set of rectangles $R$, and $E$ and $T$ are our estimated rectangles and the ground truth rectangles, respectively. We adopt the $F$-measure to combine the precision and recall figures into a single measure of quality, $f = 1 / (\alpha / p + (1 - \alpha) / r)$, where $\alpha = 0.5$.
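For reference, a small Python sketch of these ICDAR 2003-style measures, where the match between two rectangles is the area of their intersection divided by the area of the minimum bounding rectangle containing both:

def match(r1, r2):
    # Rectangles as [x1, y1, x2, y2].
    ix = max(0, min(r1[2], r2[2]) - max(r1[0], r2[0]))   # intersection width
    iy = max(0, min(r1[3], r2[3]) - max(r1[1], r2[1]))   # intersection height
    bx = max(r1[2], r2[2]) - min(r1[0], r2[0])           # bounding-rect width
    by = max(r1[3], r2[3]) - min(r1[1], r2[1])           # bounding-rect height
    return (ix * iy) / (bx * by)

def precision_recall_f(estimated, ground_truth, alpha=0.5):
    best = lambda r, R: max((match(r, r2) for r2 in R), default=0.0)
    p = sum(best(r, ground_truth) for r in estimated) / max(len(estimated), 1)
    r = sum(best(r, estimated) for r in ground_truth) / max(len(ground_truth), 1)
    f = 1.0 / (alpha / p + (1 - alpha) / r) if p > 0 and r > 0 else 0.0
    return p, r, f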

Experiments are carried out on a set of images containing text in four different languages: English, Chinese, Arabic, and Korean. English text images are selected from ICDAR 2003, ICDAR 2011, and ICDAR 2013; Korean images from KAIST; Chinese images from MSRA-TD500 and our multilingual dataset; and Arabic images from our multilingual dataset and Google. The results of these evaluations are summarized in Table 1. As can be seen from Table 1, the $F$-measures for the different languages are close to each other, except for Arabic, whose continuous writing style lowers the recall for that script. The experimental results indicate that our method is not tuned to any particular language and performs approximately equally well on all the scripts.

Figure 4 shows some successful text detections by our system on images containing text in different languages. Although the training samples contain only English, Chinese, and Arabic text, our method can detect text not only in these three representative languages but also in a number of others, such as French, German, Korean, and Japanese. This shows that our method is robust to some degree.

We also selected the methods proposed by Zhou et al. [16], Pan et al. [15], and Lee et al. [17] for further comparison. These algorithms achieve good results on the standard benchmarks and use different approaches to detect text. The performance comparison can be seen in Table 2. Our method achieves high recall on the different benchmarks, which also reflects the strong representative power of our learned features, successfully capturing the information associated with text in images.

However, our test results on the MSRA-TD500 dataset are not good, with precision/recall/$F$-measure of 0.30/0.32/0.31. Shi et al. [6] achieved state-of-the-art text detection performance of 0.52/0.53/0.5 on the same dataset. The main reason is that the MSRA-TD500 dataset was created for the study of multioriented text detection and contains many images with nonhorizontal text lines, whereas our method produces text bounding boxes based on the horizontal direction.

Figure 5 shows some other test samples. The results show that our method is effective when a single image contains text and numbers in two or more different languages. The bottom row in Figure 5 shows some failure cases. Some of these are missed detections of parts of Arabic text, because Arabic words are mostly linked by a continuous line; in this case, using the stroke feature alone is not sufficient, and incorporating stroke width is essential for languages such as Arabic. Other failures are caused by interfering objects whose appearance is similar to text.

6. Conclusion

The aim of this study is to propose a multilingual text detection method. Traditional methods in this area mainly rely on large amounts of hand-engineered features or prior knowledge. Our work is distinct in two ways: we use primitive stroke features learned by an unsupervised learning algorithm as the network's convolutional kernels, and we leverage the trained multilayer neural network to learn high-level abstract text features for the detector. Experiments on public benchmarks and a multilingual dataset show that our method is able to localize text regions of different scripts in natural scene images. The experimental results demonstrate the robustness of the proposed method.

From the failed samples in the experiments, we analyzed the limitations of our technique for further improvement. On the one hand, some languages, like Arabic, have a continuous writing style for which automatically learned features are not sufficient; connected component analysis will be added to our method to improve the precision of the final results. On the other hand, the multioriented text problem will be considered.

Conflict of Interests

The authors declare that they have no conflict of interests regarding this work.