Abstract

Multioriented text detection and recognition in natural scene images are still challenges in the document analysis and computer vision communities. In particular, character segmentation plays an important role in the performance of a complete end-to-end recognition system. In this work, a robust multioriented text detection and segmentation method based on a biological visual system model is proposed. The proposed method exploits the local energy model instead of a common approach based on variations of local image pixel intensities. Features such as lines and edges are obtained by searching for the maximum local energy utilizing the scale-space monogenic signal framework. The candidate text components are extracted from maximally stable extremal regions of the local phase information of the image. The candidate regions are filtered by their phase congruency and classified as text and nontext components by the AdaBoost classifier. Finally, misclassified characters are restored, and all final characters are grouped into words. Experimental results show that the proposed text detection and segmentation method is invariant to scale and rotation changes and robust to perspective distortions, blurring, low resolution, and illumination variations (low contrast, high brightness, shadows, and nonuniform illumination). Besides, the proposed method often achieves better performance than state-of-the-art methods on typical natural scene datasets.

1. Introduction

Nowadays, imagery has become an indispensable source of human communication and interaction. Millions of images are shared every day, and new content-based image applications have been developed. In particular, digital images with textual content provide useful information for tasks related to document classification, multimedia retrieval, language translation, text-to-speech conversion, robotic navigation, and augmented reality, to name a few [1, 2]. The analysis of this textual information basically involves three stages: text detection, character segmentation, and word recognition. The fundamental goal of text detection is to determine whether there is text in a given image, while character segmentation considers the extraction and localization of characters from background pixels. Word recognition considers character grouping and error correction in order to recognize the final words.

Since the text localization, character segmentation, and word recognition stages are not necessarily applied in a fixed order, performing character segmentation first could improve the performance of the subsequent processes. However, text localization and character segmentation are still open challenges in the document analysis and computer vision communities (http://rrc.cvc.uab.es/?com=introduction). Natural text scenes contain different types of fonts, symbols, colors, scales, and character orientations, which make text detection a complicated task. Moreover, natural scenes are commonly captured under uncontrolled conditions (illumination changes, partial occlusion, low resolution, sensor noise, blur, and misalignment) and can contain complex backgrounds (people, buildings, fences, bricks, grass, trees, and cars) [13].

Over the last decades, several techniques have been explored to solve the text detection and segmentation problem. These methods can be broadly divided into four categories: sliding window-based, connected component-based, deep learning-based, and hybrid methods [1]. Sliding window-based methods, also called texture-based methods, slide a window over the entire image at different scales to identify text regions. Fourier-statistical features (FSF) [4], discrete cosine transform (DCT) coefficients [5], spatial filters [6], and wavelet coefficients [7] are commonly used as textural properties. Nevertheless, sliding window methods are sensitive to scale and rotation variations and are computationally expensive. Connected component-based methods use connected component properties such as color, stroke width, aspect ratio, and size to distinguish between character and noncharacter regions. Usually, connected components are obtained by color clustering [8, 9], image binarization [10, 11], edge detection [12], stroke width transform (SWT) computation [13], and maximally stable extremal region (MSER) extraction [14, 15].

In recent years, the MSER and SWT techniques have become the most widely used techniques for the text detection process due to their invariance to scale and rotation transformations. Besides, not only the MSERs but also all extremal regions (ERs) are used for text segmentation [16–20]. However, ER-based methods need to process multiple repeated regions to obtain correct character segmentation, generating classification errors and a high computational cost. Furthermore, SWT-based techniques depend on accurate edge detection, which is not feasible in many cases.

Recently, deep learning-based techniques have become popular for pattern recognition. In particular, for the multioriented text detection task, different neural networks (NNs) and configurations have been proposed [21–24]. However, NNs need to be pretrained on thousands of images in order to achieve good performance, and in many cases, final fine-tuning is performed with the training images of the dataset to be evaluated. Moreover, it has been shown that this kind of approach can be easily fooled by modifying a few image pixel values [25].

Lastly, hybrid methods combine sliding window, connected component, and neural network-based techniques [26–30]. Until now, most of the proposed methods for natural scene text detection have been based on pixel intensity values. As a consequence, their performance is affected by the presence of nonuniform illumination, low contrast, blur, or noise degradations. In contrast, we propose a robust multioriented text detection and segmentation method based on a biological visual system model. Psychophysical evidence suggests that the human visual system decomposes visual information into border and line components by using phase information. Furthermore, it is known that different groups of cells in V1 extract particular image features such as frequency, orientation, and phase [31].

In this work, a new multioriented text detection and segmentation method based on the biological energy model is suggested. This paper is an extended version of the conference papers [32, 33]. Unlike the previous works, we utilize the phase-based MSER approach and the AdaBoost classifier instead of applying only heuristic rules for the character filtering, retrieval, and grouping stages.

The main contributions of this work are as follows. First, the proposed character segmentation method is based on a biologically inspired model rather than on local intensities. Thus, the proposed text segmentation is robust to variations of the image pixel values (nonuniform illumination, low contrast, and shadows), and it is invariant to slight scale and rotation changes. Second, the phase congruency approach is utilized for character filtering and noise control, which significantly reduces the number of generated components while preserving the most relevant regions. Third, AdaBoost classifiers are used rather than heuristic rules at the character filtering, retrieval, and grouping stages. Finally, the computational complexity of the proposed system at the training stage is much lower than that of deep learning techniques, while the performance of the system with a small training set is competitive and, in some cases, better than that of state-of-the-art algorithms.

The paper is organized as follows. In Section 2, a brief description of the related works is presented. In Section 3, the proposed text detection and segmentation method is described. In Section 4, experimental results are presented and discussed. Section 5 summarizes our conclusions.

2. Related Work

To date, two representative connected component-based techniques have been used for text segmentation, namely, the SWT [13] and the MSER [14].

The local operator SWT computes the character stroke width for each edge map pixel. Strokes that have nearly constant width can be considered as characters, and components with similar stroke width values can be grouped into words. Since the original SWT is invariant to rotation and scale variations, several SWT-based methods have been developed. In [34, 35], an SWT-based method is proposed for multioriented text detection. The Canny edge detector is used to calculate the SWT map of the image. The image pixels are associated according to their SWT ratio and grouped into connected components. The obtained components are classified into character and noncharacter elements using a two-layer filtering scheme: a set of heuristic rules is considered, and a trained random forest (RF) classifier is applied. Finally, the character candidates are aggregated into text chains satisfying a certain set of rules. In [36], an extended version of the SWT, called the stroke feature transform (SFT), is proposed. In addition to stroke width constraints, the SFT considers color uniformity and local relationships of edge pixels during ray tracking. Then, two text covariance descriptors are defined for component-level and text-line RF classifier training. In [37], an efficient stroke width computation is proposed. The obtained stroke width value is used together with a perceptual divergence cue and an edge histogram of oriented gradients (HOG) descriptor to measure the properties of characters under a Bayesian framework.

In contrast, the MSER method basically extracts image regions that remain stable under a range of thresholds, which are considered as potential character candidates. The MSER technique was first introduced by Matas and Zimmermann [15] for character detection and was recently extended for text detection and recognition [18]. In [16], an MSER-based text segmentation method is proposed. The character candidates are extracted using the MSER algorithm and grouped by orientation, morphology, and projection clustering via adaptive hierarchical clustering. Then, the text candidates are classified into text and nontext components. In [17], a subpath division of the ER tree is performed. Multiple subpaths are created according to the size and position similarities of ER regions. Then, an AdaBoost classifier is trained using mean local binary patterns (MLBP) for text and nontext classification. Finally, heuristic rules are used for filtering misclassified characters. In [20], the character candidates are extracted from low-variation ERs and classified using a support vector machine (SVM) and geometrical features. The obtained characters are grouped into text lines using heuristic rules, and a final restoration stage is applied if adjacent regions satisfy a set of predefined conditions. In [19], a similar ER-based method is proposed, but instead of geometrical features, HOG and local binary pattern (LBP) features are selected for character classification and recognition. Then, characters are grouped into text lines, and a CNN model is used to verify the text lines, removing noncharacter components. In [28], a multichannel and multiresolution (MC-MR) strategy is proposed. The text candidates are extracted using the MSER technique in the RGB and YUV color spaces at different resolutions. Then, the candidates are filtered by a coarse-to-fine strategy and classified as text and nontext components by an NN classifier.

3. Proposed Text Detection and Segmentation Method

In this section, the methodology for the proposed text detection and segmentation method is described. Connected components are obtained from the local image phase information. In order to extract the local phase-based image features, the scale-space monogenic signal framework [38, 39] is utilized. Basically, connected component regions are extracted from the local phase image using the MSER approach. Then, the obtained connected components are filtered considering geometrical properties, and the remaining components are considered as character candidates. Using an AdaBoost classifier, the character candidates are predicted as a character or noncharacter component. Finally, a second AdaBoost classifier is applied to restore misclassified characters. Figure 1 shows a block diagram of the proposed method.

3.1. Image Preprocessing

Morrone and Owens [40, 41] proposed the local energy model. This model argues that the biological visual system locates features of interest by searching for points of maximum local energy and identifies the feature type (shadow, edge, or line) by evaluating the local phase (argument) at those points. That is, edges, lines, and shadows occur at points where the Fourier components of the signal are maximally in phase; this condition is called phase congruency. Following this approach, in [42], a dimensionless measure of phase congruency ($PC$) is proposed as follows:

$$PC(\mathbf{x}) = \frac{\sum_n W(\mathbf{x})\left\lfloor A_n(\mathbf{x})\,\Delta\Phi_n(\mathbf{x}) - T\right\rfloor}{\sum_n A_n(\mathbf{x}) + \varepsilon}, \qquad (1)$$

where $W(\mathbf{x})$ is a weight for the frequency spread; $A_n$ and $\Delta\Phi_n$ are the amplitude and phase deviation of the $n$th frequency component; $\varepsilon$ is a small constant to avoid division by zero; $T$ is a noise threshold parameter; and $\lfloor\cdot\rfloor$ denotes that the enclosed quantity equals itself when positive and zero otherwise. $PC$ goes from 0 to 1. The $PC$ value indicates the significance of the current feature: unity means the most significant feature, and zero indicates the lowest significance. We refer to papers [42, 43] for more details.

In practice, local frequency information is obtained via banks of oriented 2D filters, which are computationally expensive. Instead, we used the scale-space monogenic signal framework to compute the local phase information of the image.

Let $f$ be an image and $F = \mathcal{F}\{f\}$ be its Fourier transform. The scale-space monogenic signal ($f_M$) representation is defined as [38]

$$f_M = \left(f_p, f_x, f_y\right) = \mathcal{F}^{-1}\left\{\left(1,\; H_1(\mathbf{u}),\; H_2(\mathbf{u})\right) B_k(\mathbf{u})\, F(\mathbf{u})\right\}, \qquad (2)$$

where $(H_1, H_2)$ is the transfer function of the first-order Riesz transform in the frequency domain:

$$H_1(\mathbf{u}) = i\,\frac{u_1}{|\mathbf{u}|}, \qquad H_2(\mathbf{u}) = i\,\frac{u_2}{|\mathbf{u}|}, \qquad (3)$$

and $f_p$ represents the image filtered by the band-pass (difference-of-Poisson) filter:

$$B_k(\mathbf{u}) = e^{-2\pi|\mathbf{u}|\, s_0 \lambda^{k+1}} - e^{-2\pi|\mathbf{u}|\, s_0 \lambda^{k}}, \qquad (4)$$

where $\lambda$ indicates the relative bandwidth, $s_0$ indicates the coarsest scale, and $k$ indicates the band-pass number. Figure 2 shows a block diagram of the scale-space monogenic signal framework.

Then, the local amplitude $A$, local orientation $\theta$, and local phase $\varphi$ can be computed as follows:

$$A = \sqrt{f_p^2 + f_x^2 + f_y^2}, \qquad (5)$$

$$\theta = \arctan\left(\frac{f_y}{f_x}\right), \qquad (6)$$

$$\varphi = \operatorname{sign}(f_y)\arctan\left(\frac{\sqrt{f_x^2 + f_y^2}}{f_p}\right), \qquad (7)$$

where the factor $\operatorname{sign}(f_y)$ indicates the direction of rotation.
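For illustration, a minimal NumPy sketch of this computation is given below. The difference-of-Poisson band-pass form and all parameter values are assumptions for the example, not the paper's exact settings.

```python
import numpy as np

def monogenic_phase(img, s0=4.0, lam=0.5, k=0):
    """Local amplitude, orientation, and phase via equations (2)-(7).

    s0 (coarsest scale), lam (relative bandwidth), and k (band-pass
    number) are illustrative values, not the paper's settings.
    """
    rows, cols = img.shape
    u1, u2 = np.meshgrid(np.fft.fftfreq(cols), np.fft.fftfreq(rows))
    radius = np.hypot(u1, u2)
    radius[0, 0] = 1.0  # avoid division by zero at the DC component
    H1 = 1j * u1 / radius  # first-order Riesz transform, equation (3)
    H2 = 1j * u2 / radius
    # Difference-of-Poisson band-pass filter, equation (4)
    B = (np.exp(-2 * np.pi * radius * s0 * lam ** (k + 1))
         - np.exp(-2 * np.pi * radius * s0 * lam ** k))
    F = np.fft.fft2(img.astype(float))
    fp = np.real(np.fft.ifft2(B * F))        # even (band-pass) component
    fx = np.real(np.fft.ifft2(B * H1 * F))   # odd (Riesz) components
    fy = np.real(np.fft.ifft2(B * H2 * F))
    amplitude = np.sqrt(fp**2 + fx**2 + fy**2)               # equation (5)
    orientation = np.arctan2(fy, fx)                         # equation (6)
    phase = np.sign(fy) * np.arctan2(np.hypot(fx, fy), fp)   # equation (7)
    return amplitude, orientation, phase
```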

3.2. Phase-Based Character Candidate Generation

As we mentioned earlier, the local image phase describes the image structural information, while the local amplitude gives an intensity measure of the structure. Furthermore, the local phase allows us to distinguish between edge, edge-line, and line features. A phase value of 0 indicates an upward-going step, $\pi/2$ a bright line feature, $\pm\pi$ a downward-going step, and $-\pi/2$ a dark line feature [43]. However, we are not interested in distinguishing between dark and bright lines but in finding upward- and downward-going step features for region detection. For this reason, we consider the phase range from 0 to $\pi$, mapping angles outside this interval back into the range.

In turn, the MSER method [14] was first introduced for grayscale images, but it can be applied to any type of image as long as two conditions hold: the pixel values form a totally ordered set, and an adjacency relation exists between pixels. Thus, the proposed phase-MSER method is described as follows.

Let $I$ be a grayscale image and $\varphi$ its local phase (equation (7)). The binary image $B_t$ is defined as

$$B_t(\mathbf{x}) = \begin{cases} 1, & \text{if } \varphi(\mathbf{x}) \le t, \\ 0, & \text{otherwise}, \end{cases} \qquad (8)$$

where $t$ denotes a threshold value. An extremal region $R_t$ with threshold $t$ is defined as a connected component of $B_t$:

$$R_t = \{\mathbf{x} : \varphi(\mathbf{x}) \le t\}. \qquad (9)$$

The extremal region $R_t$ is maximally stable if and only if the stability function

$$q(t) = \frac{\left|R_{t+\Delta} \setminus R_{t-\Delta}\right|}{\left|R_t\right|} \qquad (10)$$

has a local minimum at $t$, with $|\cdot|$ denoting cardinality, and $\Delta$ is a parameter that controls the stability of the region over a range of thresholds. The obtained regions are called character candidates (CC). Figure 3 shows an example of the MSER technique and the proposed phase-MSER method.
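As a sketch of this step, the local phase image can be quantized and fed to an off-the-shelf MSER implementation. The OpenCV calls below are real, but the parameter values are illustrative, and `monogenic_phase` is the sketch from Section 3.1.

```python
import cv2
import numpy as np

img = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)
_, _, phase = monogenic_phase(img)

# Map the local phase to [0, pi] and quantize to 8 bits so that the
# pixel values form the totally ordered set required by MSER.
phase8 = np.uint8(255 * np.abs(phase) / np.pi)

# delta, min/max area, max variation, min diversity (illustrative values)
mser = cv2.MSER_create(5, 60, 14400, 0.25, 0.2)
regions, bboxes = mser.detectRegions(phase8)  # character candidates (CC)
print(len(regions), "candidates")
```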

It is important to note that the local phase information is scale- and rotation-invariant. Moreover, due to the invariance-equivariance property, local phase information is independent of the local intensity; therefore, it is robust to contrast and illumination variations.

3.3. Character Candidate Feature Computation

Once the character candidate generation stage is done, a morphological closing operation is applied to each candidate in order to eliminate small holes. The size of the structuring element was defined experimentally. Next, geometrical connected component properties are computed for each candidate.

Table 1 summarizes the computed properties.

Then, the obtained properties are used to compute the suggested candidate features:

(1) The mean phase congruency value $PC_{mean}$ is computed to consider the phase congruency of the candidate. As mentioned above, the $PC$ value indicates the significance of the current feature. Thus, one means the most significant edge component, and zero indicates the lowest significance. $PC_{mean}$ is computed as follows:

$$PC_{mean} = \frac{1}{|E|}\sum_{\mathbf{x} \in E} PC(\mathbf{x}), \qquad (11)$$

where $E$ is the set of edge pixels of the candidate and $|\cdot|$ denotes cardinality.

(2) The phase congruency ratio $PC_{ratio}$ is computed to consider the contribution of the edge pixels of the candidate. One means a complete contribution from all the edge pixels, and zero indicates the lowest contribution. $PC_{ratio}$ is obtained as

$$PC_{ratio} = \frac{|E_{pc}|}{|E|}, \qquad (12)$$

where

$$E_{pc} = \{\mathbf{x} \in E : PC(\mathbf{x}) > T_{pc}\}, \qquad (13)$$

and $T_{pc}$ is a threshold from 0 to 1.

(3) The filled convHull ratio is computed to consider the convexity of the candidate, that is, the ratio between the filled candidate area and the area of its convex hull.

(4) The approximated area ratio considers the stroke uniformity of the candidate. One means a complete uniformity of the candidate stroke, and zero indicates the lowest uniformity.

(5) The contour length ratio considers the difference between the external and internal candidate contours, that is, the complexity of the candidate edge, where the external contour is the outer boundary of the candidate.
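A minimal sketch of the two phase congruency features (11)–(13), assuming `pc` is the phase congruency map and `edges` is the candidate's boolean edge-pixel mask (the threshold value is illustrative):

```python
import numpy as np

def pc_features(pc, edges, t_pc=0.5):
    """Equations (11)-(13); t_pc is an illustrative threshold."""
    pc_edge = pc[edges]                  # PC values on candidate edge pixels
    pc_mean = pc_edge.mean()             # equation (11)
    pc_ratio = (pc_edge > t_pc).mean()   # equations (12)-(13)
    return pc_mean, pc_ratio
```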

In addition, the features used in [37, 44] are also considered:

(1) the filled area ratio;
(2) the solidity (the ratio of the candidate area to the area of its convex hull);
(3) the compactness;
(4) the occupancy (the ratio of the candidate area to the area of its bounding box);
(5) the eccentricity;
(6) the aspect ratio;
(7) the stroke width value, computed from the mean $\mu$ and variance $\sigma^2$ of the candidate stroke widths;
(8) the minimum stroke width ratio;
(9) the maximum stroke width ratio;
(10) the skeleton perimeter ratio.

The exact definitions of these features follow [37, 44].
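Several of these shape features can be computed with scikit-image, as in the sketch below; the definitions used are the standard ones, which we assume match those of [37, 44].

```python
import numpy as np
from skimage.measure import label, regionprops

def shape_features(mask):
    """A few of the listed features for one candidate binary mask."""
    props = regionprops(label(mask))[0]
    min_r, min_c, max_r, max_c = props.bbox
    return {
        "solidity": props.solidity,        # area / convex hull area
        "compactness": 4 * np.pi * props.area / props.perimeter ** 2,
        "occupancy": props.extent,         # area / bounding box area
        "eccentricity": props.eccentricity,
        "aspect_ratio": (max_c - min_c) / (max_r - min_r),
    }
```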

All the described features are used for AdaBoost classifier training to classify character candidates into text and nontext components. The text-component AdaBoost classifier was trained using the ICDAR2013 training dataset (299 images).

3.4. Character Candidate Classification

In this stage, the character candidate classification is performed. As a first step, coarse candidate filtering is applied taking into account the following noncharacter properties (a sketch of these rules is given after the list):

(1) The candidate area: noncharacter candidates whose area is either larger or smaller than a predefined fraction of the image area are eliminated.

(2) The aspect ratio: noncharacter candidates that are too narrow or too wide are eliminated.

(3) The phase congruency value: candidates with a low phase congruency value are eliminated. If $PC_{mean}$ (equation (11)) is lower than a predefined threshold, the candidate is discarded. Figure 4 shows an example of the phase-based candidates under different threshold values.
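A minimal sketch of the coarse filtering, with all threshold values chosen for illustration only (the paper's actual values are not reproduced here):

```python
def coarse_filter(candidates, img_area, t_pc=0.3):
    """Each candidate is a dict with precomputed 'area', 'width',
    'height', and 'pc_mean' entries; thresholds are illustrative."""
    kept = []
    for c in candidates:
        rel_area = c["area"] / img_area
        aspect = c["width"] / c["height"]
        if not 1e-4 < rel_area < 0.25:      # rule (1): relative area
            continue
        if not 0.1 < aspect < 10.0:         # rule (2): aspect ratio
            continue
        if c["pc_mean"] < t_pc:             # rule (3): phase congruency
            continue
        kept.append(c)
    return kept
```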

After the filtering stage, the remaining candidates are classified as text and nontext components using the already trained AdaBoost classifier. A candidate is considered as a text character if the sum of votes of the classifier is positive. The remaining candidates with a negative vote sum are considered as candidate neighbors and are used in the next stage of character retrieval.
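The sketch below shows this decision rule with scikit-learn's AdaBoost; the feature matrices are random placeholders standing in for the features of Section 3.3.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier

rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 15))      # placeholder feature vectors
y_train = rng.integers(0, 2, size=500)    # placeholder text/nontext labels

clf = AdaBoostClassifier(n_estimators=100).fit(X_train, y_train)

# The signed decision value plays the role of the vote sum:
# positive -> text character, negative -> candidate neighbor.
X_cand = rng.normal(size=(10, 15))
votes = clf.decision_function(X_cand)
text_chars = X_cand[votes > 0]
cand_neighbors = X_cand[votes <= 0]
```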

3.5. Character Retrieval

During the classifier training stage, some characters were purposely mislabelled as noncharacters (“I,” “i,” “L,” and “1”) to reduce classification errors since these characters are usually similar to noncharacter structures in the image. The retrieval stage seeks to recover these characters and others that have been misclassified. The character retrieval method is described as follows.

For each text character, a neighborhood of a given radius is defined. All the candidate neighbors inside the radius are considered as its character neighbors. If a text character has no candidate neighbors, it is excluded from the retrieval stage but kept as a final character; that is, isolated characters are not discarded.

Next, each candidate neighbor is evaluated to determine whether it is a misclassified character. For this, a second AdaBoost classifier is applied. The classifier is trained using the following features computed between the text character and its candidate neighbor (a sketch is given after the list):

(1) the area difference;
(2) the rotated rectangle area difference;
(3) the mean grayscale value difference;
(4) the height ratio;
(5) the width ratio;
(6) the mean stroke width difference.
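A sketch of these pairwise features, assuming each component is a dict of precomputed properties (the key names are hypothetical):

```python
def pair_features(ch, cn):
    """Comparison features between a text character `ch` and a
    candidate neighbor `cn` (dict keys are hypothetical)."""
    return [
        abs(ch["area"] - cn["area"]),
        abs(ch["rot_rect_area"] - cn["rot_rect_area"]),
        abs(ch["mean_gray"] - cn["mean_gray"]),
        ch["height"] / cn["height"],
        ch["width"] / cn["width"],
        abs(ch["mean_stroke_width"] - cn["mean_stroke_width"]),
    ]
```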

The character retrieval AdaBoost classifier was also trained using the ICDAR2013 training dataset.

Once the character retrieval AdaBoost classifier is trained, a candidate neighbor is retrieved as a character if the classifier vote sum is positive. The retrieved neighbors are then considered as characters, and they are also used to retrieve their own candidate neighbors recursively. The method stops when no new neighbor component is classified as a new character.
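The recursion can be organized as a simple worklist, sketched below; `within_radius` and `is_misclassified` are placeholders for the neighborhood test and the retrieval classifier, respectively.

```python
def retrieve_characters(chars, candidates, within_radius, is_misclassified):
    """Recursive retrieval: promoted neighbors retrieve further neighbors."""
    queue = list(chars)
    while queue:                            # stops when nothing new appears
        ch = queue.pop()
        for cn in [c for c in candidates if within_radius(ch, c)]:
            if is_misclassified(ch, cn):    # positive retrieval vote sum
                candidates.remove(cn)       # promote neighbor to character
                chars.append(cn)
                queue.append(cn)
    return chars
```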

Note that, unlike many related works, no alignment feature is computed. Considering horizontal alignment helps to avoid character misclassification but restricts a method to horizontal text only. Thus, the proposed method can be applied to nonhorizontal text images.

3.6. Character Grouping

Since most of the state-of-the-art text detection methods evaluate word localization instead of character segmentation, a character grouping stage for text detection is considered. Similar closest characters are grouped together and considered as candidate words. Then, the Hough transform is applied to obtain the final candidate word lines. The character grouping method is described as follows.

First, for each character, the distance between the character and all its neighbors within a radius is computed. The distance is obtained as the minimum Euclidean distance between the convex hull of the character and its neighbors. All the characters are grouped into pairs, and a minimum region containing both components is created. The region is expanded to the minimum distance between characters.
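The character-to-character distance can be sketched as the minimum distance between convex hull vertices (an approximation of the exact hull-to-hull distance, used here for brevity):

```python
import cv2
import numpy as np
from scipy.spatial.distance import cdist

def char_distance(mask_a, mask_b):
    """Minimum Euclidean distance between the convex hulls of two
    character masks, approximated via the hull vertices."""
    pts_a = np.column_stack(np.nonzero(mask_a)).astype(np.int32)
    pts_b = np.column_stack(np.nonzero(mask_b)).astype(np.int32)
    hull_a = cv2.convexHull(pts_a).reshape(-1, 2)
    hull_b = cv2.convexHull(pts_b).reshape(-1, 2)
    return cdist(hull_a, hull_b).min()
```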

All intersecting regions are considered as candidate words. Then, the Hough transform is applied to obtain the candidate word lines. Each of these lines is processed individually to verify whether all the selected characters belong to a single word. This is done by applying the AdaBoost classifier used in the retrieval stage. All the characters from the candidate word are compared with each other. Characters that are classified as nonword characters with respect to all other characters form a new word, and so on. The method stops when no new word is created. At the end, final words that contain only one element whose AdaBoost vote sum is negative are eliminated. Figure 5 shows a character grouping example.

4. Experimental Results

4.1. Evaluation Protocol

The performance evaluation of the proposed method was performed using the following metrics. Two evaluation types are selected: text segmentation and text localization. For text segmentation, the character-level recall-similarity rate [17] and the pixel- and atom-based measures [45] are utilized.

For the character candidate generation evaluation, the recall-similarity rate is utilized. The recall-similarity is defined as the ratio between the total number of correctly detected candidate regions and the number of ground-truth characters. A region is considered as a character candidate if the similarity value is at least 50%. The similarity value is defined as follows [17]:

$$sim(D, G) = \frac{|D \cap G|}{|D \cup G|},$$

where $D$ and $G$ represent the detected and ground-truth bounding boxes, respectively.
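Assuming the intersection-over-union form above, the similarity can be computed as:

```python
def similarity(d, g):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix = max(0.0, min(d[2], g[2]) - max(d[0], g[0]))
    iy = max(0.0, min(d[3], g[3]) - max(d[1], g[1]))
    inter = ix * iy
    union = ((d[2] - d[0]) * (d[3] - d[1])
             + (g[2] - g[0]) * (g[3] - g[1]) - inter)
    return inter / union if union > 0 else 0.0
```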

For the pixel-level segmentation evaluation, the pixel- and atom-based measures are utilized. These measures not only consider pixel-level accuracy but also take into account the morphological properties of characters. In [45], the minimal and maximal coverage criteria are introduced, which measure the degree of overlap between the ground-truth area and the obtained segmented component. The minimal coverage criterion is fulfilled if a predefined fraction of the ground-truth skeleton pixels is covered by the segmented component. Similarly, for the maximal criterion, the distance from the segmented pixels to the ground-truth edge pixels should not exceed a threshold proportional to the maximum stroke width of the character.
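A minimal sketch of the minimal coverage test (the coverage fraction is an assumed value):

```python
import numpy as np
from skimage.morphology import skeletonize

def minimal_coverage_ok(gt_mask, seg_mask, frac=0.9):
    """True if at least `frac` of the ground-truth skeleton pixels
    is covered by the segmented component; `frac` is illustrative."""
    skel = skeletonize(gt_mask.astype(bool))
    covered = np.logical_and(skel, seg_mask.astype(bool)).sum()
    return covered / max(skel.sum(), 1) >= frac
```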

In turn, although the proposed method is designed specifically for the text segmentation task, a text localization evaluation is carried out to compare its performance with that of the state-of-the-art methods. The recall ($R$), precision ($P$), and F-measure ($F$) are defined as follows [46]:

$$R = \frac{\sum_i \mathrm{Match}_G(G_i, D, t_r, t_p)}{|G|}, \qquad P = \frac{\sum_j \mathrm{Match}_D(D_j, G, t_r, t_p)}{|D|}, \qquad F = \frac{2PR}{P + R},$$

where $G$ and $D$ represent the ground-truth rectangle set and detection rectangle set, respectively, and $t_r$ and $t_p$ are the recall and precision constraints, respectively. For more details, we refer to Wolf and Jolion [46].

For the MSER algorithm, the simulations were carried out using the MSER parameter values reported in [20] for the step size, the maximum variation, and the minimum diversity.

4.2. Computer Simulations

First, to analyze the tolerance of the proposed segmentation method to low contrast, high brightness, shadows, and nonuniform illumination degradations, computer simulations using synthetic images were performed. For the experiments, ten representative images from the ICDAR2013 dataset were selected. The selected images contain different symbols, font types, colors, sizes, and backgrounds. Each image was scaled, rotated, and synthetically degraded, obtaining 1000 synthetic images per degradation (see Figure 6). Table 2 shows the obtained results compared with the MSER method in terms of recall-similarity measure.

The proposed method shows high candidate generation performance. The recall-similarity measure reached 90% in most of the cases, except for the brightness degradations. That is because the brightness variations caused the loss of regions with low contrast (see Figure 6, second row, fifth column). Besides, the proposed segmentation method outperforms the MSER technique by up to 30% for nonuniform illumination and shadow degradations and by up to 10% for brightness and contrast variations.

4.3. Typical Dataset Evaluation
4.3.1. Datasets

For the performance evaluation of the proposed method, the ICDAR2013 (http://rrc.cvc.uab.es/), USTB-SV1K [16], OSTD [47], and MSRA-TD500 [34] datasets are used. The ICDAR2013 dataset consists of 462 complex scenes divided into training (299) and test (233) images. Note that the ICDAR2013 dataset contains images with horizontally aligned texts. The images contain different complex backgrounds, font types, sizes, blurring, illumination, contrast, etc., and their sizes vary widely. The USTB-SV1K dataset consists of 1000 Google Street View images divided into training (500) and test (500) images. The images contain multioriented and perspective-distorted text. The OSTD dataset includes 89 multioriented text images with different font types, sizes, and orientations. Finally, MSRA-TD500 contains 500 natural images divided into training (300) and test (200) images, which are taken from indoor and outdoor scenes. The images contain English and Chinese texts with different fonts, sizes, colors, and orientations, and their resolutions vary.

4.3.2. Text Segmentation Evaluation

Since text segmentation depends on the quality of connected component generation, the proposed phase-based character candidate generation method is evaluated first. Table 3 shows the obtained results in terms of the recall-similarity measure and the mean number of generated candidate regions. The results show that the proposed method produces fewer character candidates with a higher similarity rate than the other methods. Our method outperforms the results obtained in [8, 17], even though those methods utilize grayscale, RGB, Cb, and Cr channels. Although the recent methods [19, 28] report good similarity results for the given dataset, their mean number of candidates per image is too high, almost 30 and 15 times that of the proposed method, respectively. It is important to note that there exists a trade-off between the number of generated candidate regions and the computational complexity.

For the text segmentation evaluation, the precision and recall metrics were computed, as well as the F-measure. Table 4 shows the proposed method results on the ICDAR2013 dataset. The proposed method outperforms the methods [20, 48], which utilize grayscale images for character candidate extraction.

Both results, character candidate generation and text segmentation, show that the proposed method obtains fewer candidate regions with a more accurate pixel-level segmentation result.

Now, we provide the performance of the proposed method at different stages of its pipeline. Table 5 presents character-level results in terms of recall, precision, and F-measure. We can observe that, after the classification of candidates, the precision improves by 58%, while the recall decreases by almost 24%. This is because, at the classifier training stage, some characters were purposely mislabelled as noncharacters. As expected, the retrieval stage recovers some characters that were misclassified; however, some nontext components are also restored. Finally, the grouping stage discards noncharacters recovered at the retrieval stage, as well as some correct characters.

4.3.3. Text Localization Evaluation

Since most of the existing methods report text localization evaluation instead of character segmentation, we also carry out the same evaluation. Table 6 shows the text localization performance of the MSER-based techniques on the ICDAR2013 dataset. It can be seen that the proposed method shows better F-measure results than most other methods, except the techniques [17, 28], in which multiple image channels are used. However, the method [17] is designed for horizontal text only, which decreases its performance for multioriented text, while the method [28] yields a lower F-measure than the proposed method when only grayscale images are used. Besides, the proposed method outperforms the latter on the multioriented USTB-SV1K dataset (see Table 7).

Next, the performance of the proposed method and state-of-the-art algorithms [16, 20, 24, 28–30, 34, 37] on four datasets is evaluated using the protocol given in [34]. The results are shown in Table 7. One can observe that the proposed technique, using only 299 training images, outperforms the state-of-the-art methods on the USTB and OSTD multioriented datasets. The performance of the methods [28, 29] drops by almost 30% compared with their performance on the ICDAR2013 dataset, which contains horizontally aligned texts. Since the MSRA dataset contains Chinese characters, on which our classifiers were not trained, we perform two evaluations of the proposed method: over the entire MSRA dataset and over its English text images only. Note that the classifiers used in our method were trained using Latin-based characters only. For a fair comparison with other methods on this dataset, the proposed technique would need additional training with Chinese characters. It is of interest to note that the proposed method can detect parts of Chinese texts (see Figure 7). Although the deep learning-based method [30] outperforms the proposed method (on the complete test set), its authors report a decrease of 20% in F-measure when using only the MSRA training set (300 images), thereby obtaining a lower F-measure than the proposed method.

Figures 8 and 9 show examples of correct text detection and common errors of the proposed method on the USTB dataset, respectively. Three types of errors were found: the Google logo error (first row), where the proposed method detected the Google watermark in the images; the unmarked text error (second row), where the proposed method detected text that is not annotated in the dataset ground truth; and false positive and false negative errors (third row).

Finally, the average processing time of the proposed method was estimated using the ICDAR2013 dataset on a 2.8 GHz Intel Xeon E5-1603 PC with 16 GB of RAM. Table 8 summarizes the running time of all tested algorithms, as well as the hardware features reported by their authors. Note that the processing time of the algorithms depends on various factors, such as hardware features, the specific implementation, and the size and contextual complexity of the processed images, which makes a fair comparison difficult. One can observe that the methods [28, 30] achieved the best runtimes since a GPU was utilized for their implementation. The methods [18, 48] work only for horizontal text, which reduces their computational complexity (runtime). Note that all deep learning algorithms require significantly longer training time compared with the proposed method, which is reasonably fast for detection and segmentation even on a conventional computer without a graphics processor. Further optimization of the implementation, as well as the use of GPU technology, can definitely reduce the overall processing time of our method.

5. Conclusion

In this paper, a novel multioriented text detection and segmentation method inspired by the human vision system was proposed. The method is based on the local energy model and the scale-space monogenic signal framework to extract essential local phase information. The proposed method consists of phase-based text segmentation, character retrieval, and character grouping stages. The phase-based candidate regions are extracted by applying the MSER algorithm to the local phase image; meanwhile, character retrieval and grouping are done by applying AdaBoost classifiers to avoid the use of heuristic rules.

The proposed method proved to be robust to geometric distortions, font variations, complex backgrounds, low contrast, high brightness, shadows, and illumination changes. The method achieves high character segmentation performance while possessing low computational complexity (a small number of extracted components). The method outperforms state-of-the-art algorithms on typical databases in terms of character segmentation, text localization, and the number of candidate regions. Besides, unlike most existing methods, our method is not restricted to horizontal texts and also handles multioriented texts.

Finally, the proposed method can be used for text detection in different languages or handwritten texts.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the RFBR (grant 18-08-00782).