Special Issue: Mathematical Methods Applied to Digital Image Processing
Research Article | Open Access
Fuzzy-Based Segmentation for Variable Font-Sized Text Extraction from Images/Videos
Textual information embedded in multimedia can provide a vital tool for indexing and retrieval. A great deal of work has been done in the field of text localization and detection because of its fundamental importance. One of the biggest challenges of text detection is dealing with variation in font sizes and image resolution. This problem is aggravated by undersegmentation or oversegmentation of the regions in an image. This paper addresses the problem by proposing a novel fuzzy-based postprocessing segmentation method that can handle variation in text sizes and image resolution. The methodology is tested on the ICDAR 2011 Robust Reading Challenge dataset, and the results demonstrate the strength of the proposed method.
Recently there has been a rapid surge in multimedia repositories, which has raised the need for efficient retrieval, indexing, and browsing of multimedia information. Several methodologies presented in the literature retrieve image and video data by exploiting color, texture, shape, the relation between objects, and so forth. However, embedded text in images can be extraordinarily useful for data retrieval, as visual text in multimedia communicates information such as news headlines, movie titles, product trade names, summaries of sports contests, and the date and time of events. Such information can be influential for the understanding and retrieval of images or videos.
Text embedded in images may be categorized into two classes, namely, caption text and scene text. Caption text is imposed over the image in the editing process, for example, news headings and match summaries or scores. It is also referred to as artificial text or superimposed text. Scene text, in contrast, is an actual part of the scene, for example, the brand name of a product during a commercial break, text on a signboard or name plate, and text visible on clothing or products.
One of the key challenges in the text detection process is dealing with text size variations. The variation may be classified into two types: first, the variation of spatial resolution of images and, second, the variation of font sizes within an image. This paper focuses on this problem in text detection and provides viable solutions for both categories.
The rest of the paper is organized as follows. Section 2 highlights some related work in the field. Section 3 introduces the proposed method to segment text in images. Section 4 presents the dataset used and the results of the text segmentation algorithm. Section 5 provides some concluding remarks.
2. Literature Review
A variety of techniques for text extraction have appeared in the recent past [1–6]. Comprehensive surveys can be found in [7–9]. These techniques can be categorized into two types according to the text features they use, that is, region-based and texture-based methods. Texture-based methods rely on the textural properties of text that distinguish it from the background. These techniques mostly use Gabor filters, wavelets, the fast Fourier transform, spatial variance, and so forth. This approach further uses machine learning methods such as the support vector machine (SVM), multilayer perceptron (MLP), and AdaBoost [11–15]. Region-based methods use distinct region features to extract text content. This methodology relies on the color dissimilarity between the text and its surrounding pixels. Procedures based on color, edges, and connected components are frequently used in this category [16–19]. These techniques typically work in a bottom-up fashion, initially segmenting small regions and later grouping the potential text regions. Region-based methods are generally composed of three modules: (1) segmenting the image into small regions, which aims at separating the character regions from the background, (2) merging and grouping small regions to form words and sentences, and (3) differentiating between text and nontext objects.
Segmentation identifies the different regions in the image but does not recognize the relation between these regions. It is essential to merge the characters of a word to form a text object, because most text detection techniques work on groups of characters and it is very difficult to detect an isolated character [20, 21]. This grouping can use pixel-level features or exploit high-level features.
At present, a few pixel-level merging methods have been introduced in the literature on text detection. Dilation is the most commonly used merging technique [22–26], wherein the dimensions of the morphological operator intrinsically determine the extent of the homogeneous segmented regions. Consequently, large text blocks tend to be oversegmented, whereas small text areas may be skipped. A structuring element of fixed size works only for a limited spatial resolution and a small range of font sizes. The size of the structuring element should depend on the size of the text, but it usually has a fixed value, which cannot cope with variation in image resolution and text size. Some methodologies in the literature use a pyramid approach to solve this problem and extend the range of detectable text sizes [23, 27, 28]. This greatly increases the computational requirements or calls for parallel processing mechanisms.
Object-level merging is closer to human vision and deals with objects and regions instead of pixels. It connects potential character objects to form text strings. Hence, grouping and merging depend on high-level features, which gives better performance.
Wolf and Jolion used the disparity in heights and positions of the connected components to merge characters. Minetto et al. developed a grouping step based on the space between two text areas relative to their height. Pan et al. built component relations using a minimum spanning tree; their text detection method merges characters into words using shape and spatial differences. Gonzalez and Bergasa suggested that characters of the same word should have several similar characteristics, for instance, stroke size, height, position, adjacency, and constant interletter and interword spacing.
Shi et al. used a graph model to merge neighboring regions into text strings. The adjoining nodes of each node are those that satisfy certain conditions based on differences in color, position, width ratio, and height ratio. In the method of Yao et al., character candidates are linked into pairs. If two regions have similar stroke widths (the ratio between the mean stroke widths is less than 2.0), similar sizes (the ratio between their characteristic scales does not exceed 2.5), and similar colors and are closely placed (the distance between them is less than two times the sum of their characteristic scales), they are labeled as a pair. Subsequently, a greedy hierarchical agglomerative clustering approach is used to combine the pairs into candidate chains.
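The pairing rule quoted above can be illustrated with a minimal Python sketch. The `Candidate` class and its field names are hypothetical, and the color test is left out because the text gives no numeric threshold for it; only the three stated thresholds (2.0, 2.5, and two times the sum of scales) come from the description.

```python
import math
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical character candidate; field names are illustrative only."""
    x: float             # centre x coordinate
    y: float             # centre y coordinate
    stroke_width: float  # mean stroke width
    scale: float         # characteristic scale


def ratio(a, b):
    """Larger-over-smaller ratio of two positive values."""
    return max(a, b) / min(a, b)


def is_pair(p, q):
    """Crisp pairing test using the three numeric thresholds quoted in the text."""
    similar_stroke = ratio(p.stroke_width, q.stroke_width) < 2.0
    similar_size = ratio(p.scale, q.scale) <= 2.5
    close = math.hypot(p.x - q.x, p.y - q.y) < 2.0 * (p.scale + q.scale)
    return similar_stroke and similar_size and close
```

Each condition is a hard cutoff, which is exactly the property that the fuzzy approach in this paper is designed to relax.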
Though these features are defined by strict boundaries in the existing techniques, the relation between neighboring characters is not crisp. It is unreasonable to declare a character a neighbor if its distance-to-height ratio is 1.50 or less and to reverse that verdict when the ratio becomes 1.51. The parameters that define the proximity of potential characters should be fuzzy rather than crisp. Thus, there is a need to design a merging process in which the rules of inference are formulated in a general way, using fuzzy categories. Such a system should give some weight to each of the features used for measuring the degree of neighborhood. Moreover, the similarity obtained by the currently reported features mostly does not correspond to human perception. Human perception of proximity, similar heights, and similar color cannot be fully expressed using discrete and rigid boundaries or thresholds. These linguistic variables can be better defined by fuzzy logic.
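The contrast between a crisp threshold and a graded membership can be sketched in a few lines of Python. The decreasing sigmoid and its `center` and `slope` values are illustrative assumptions, not parameters from the paper; only the 1.50 versus 1.51 example comes from the text.

```python
import math


def crisp_neighbor(ratio, threshold=1.5):
    """Crisp rule: full membership up to the threshold, none beyond it."""
    return 1.0 if ratio <= threshold else 0.0


def fuzzy_neighbor(ratio, center=1.5, slope=4.0):
    """Graded rule: a decreasing sigmoid (center and slope are assumed values)."""
    return 1.0 / (1.0 + math.exp(slope * (ratio - center)))


for r in (1.50, 1.51):
    print(r, crisp_neighbor(r), round(fuzzy_neighbor(r), 3))
```

With the crisp rule the verdict flips from 1 to 0 between the two ratios, whereas the graded membership changes by only about 0.01.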
Component extraction or segmentation is the procedure of dividing a digital image into multiple fragments, called superpixels [34, 35]. The objective of segmentation is to reduce the computational complexity of processing the image and to make its representation easier to analyze. Image segmentation is classically used to locate objects and boundaries in images. In particular, image segmentation is the process of labeling the pixels of an image such that pixels with the same label share common characteristics such as color, intensity, and texture. Moreover, edge detection is a basic tool used in most image processing applications to find sharp changes in intensity at region boundaries.
The proposed segmentation method consists of two processes: splitting and merging. Splitting is performed by traditional region-based segmentation techniques, whereas merging is based on the novel fuzzy-based method. Figure 1 provides the architecture of the proposed work.
There is a sharp transition between text and its background. Edge detection is a promising segmentation tool for text images because a sharp intensity transition is a feature common to all text objects. Exploiting this feature, the proposed methodology uses edge detection along with connected component labeling for segmentation: the Sobel operator is used for edge detection, and image dilation is applied to connect broken edges.
The size of the dilation operator is adapted to the resolution of the image and ranges between 3% and 5% of the image width. Dilation is performed prior to fuzzy merging only to reduce the computational effort; the proposed fuzzy merging method can work without this morphological operation.
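The splitting stage described above (Sobel edges, adaptive dilation, connected component labeling) can be sketched in plain Python as follows. This is a minimal illustration under several assumptions: the image is a grayscale list of rows, the edge threshold is applied to the sum of absolute gradients, the structuring element is a square, and all function names are hypothetical. Only the 3% to 5% width rule comes from the text.

```python
def sobel_edges(img, thresh):
    """Binary edge map from |gx| + |gy| of the Sobel operator (border pixels skipped)."""
    h, w = len(img), len(img[0])
    out = [[0] * w for _ in range(h)]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            gx = (img[y - 1][x + 1] + 2 * img[y][x + 1] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y][x - 1] - img[y + 1][x - 1])
            gy = (img[y + 1][x - 1] + 2 * img[y + 1][x] + img[y + 1][x + 1]
                  - img[y - 1][x - 1] - 2 * img[y - 1][x] - img[y - 1][x + 1])
            if abs(gx) + abs(gy) >= thresh:
                out[y][x] = 1
    return out


def dilation_size(width, fraction=0.03):
    """Side of the structuring element as a fraction of image width (3-5% in the text)."""
    return max(1, round(width * fraction))


def dilate(mask, size):
    """Binary dilation with a size x size square structuring element."""
    h, w = len(mask), len(mask[0])
    lo, hi = size // 2, size - size // 2 - 1
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - lo), min(h, y + hi + 1)):
                    for xx in range(max(0, x - lo), min(w, x + hi + 1)):
                        out[yy][xx] = 1
    return out


def label_components(mask):
    """8-connected labeling; returns bounding boxes as (left, top, right, bottom)."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    boxes = []
    for y in range(h):
        for x in range(w):
            if mask[y][x] and not seen[y][x]:
                stack, seen[y][x] = [(y, x)], True
                left, top, right, bottom = x, y, x, y
                while stack:
                    cy, cx = stack.pop()
                    left, right = min(left, cx), max(right, cx)
                    top, bottom = min(top, cy), max(bottom, cy)
                    for ny in range(max(0, cy - 1), min(h, cy + 2)):
                        for nx in range(max(0, cx - 1), min(w, cx + 2)):
                            if mask[ny][nx] and not seen[ny][nx]:
                                seen[ny][nx] = True
                                stack.append((ny, nx))
                boxes.append((left, top, right, bottom))
    return boxes
```

The bounding boxes returned by `label_components` are the candidate regions that a merging stage would then group into words.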
3.2. Fuzzy Merging
This section explains the fuzzy merging process.
Let $I$ be the input image and $R$ the set of all the regions of $I$ extracted by the above-mentioned method. Let $R = \{r_1, r_2, \ldots, r_n\}$, where $n$ is the total number of regions in the image $I$.
The merging problem can be defined using graph theory. Let $G = (V, E)$ denote an undirected graph whose vertices $V$ represent the regions in $R$. The edges of the graph are $E = \{e_{ij}\}$, where $e_{ij}$ gives the probability of joining vertices $i$ and $j$. Initially, all $e_{ij}$ are set to null. This probability is calculated by fuzzy logic based on four factors, which are explained later in the paper.
Fuzzy-based methods assign each object a gradual membership value, measured as a degree in [0, 1], for joining with other text instances. This gives the flexibility to connect objects based on more than one feature, depending on the membership values of all the parameters.
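One way to turn such a graph into merged text objects is to keep the edges whose join degree is high enough and take the connected components. The sketch below does this with a small union-find; the function name and the 0.5 cutoff are assumptions for illustration, since the text does not state how edge degrees are converted into a final grouping.

```python
def merge_regions(n, join_degree, threshold=0.5):
    """Group n regions into connected components of the thresholded graph.

    join_degree maps a pair (i, j) to a degree in [0, 1]; the threshold
    is an assumed value, not one given in the paper.
    """
    parent = list(range(n))

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for (i, j), degree in join_degree.items():
        if degree >= threshold:
            parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())
```

For example, with degrees `{(0, 1): 0.9, (1, 2): 0.7, (2, 3): 0.2}` over four regions, regions 0, 1, and 2 form one group and region 3 stays isolated.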
3.2.1. Feature Extraction
The merging of character candidates relies on a number of factors. Four features are extracted for the decision of joining objects into words or sentences. These features are color, height, position, and distance.
Color. Color is taken as a parameter for joining two text objects. The color of the characters of a single word or sentence is mostly the same. If the colors of two text objects are similar, then these objects are candidates for merging. To obtain the degree of similarity, the difference between the two colors is calculated.
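Lab color coding is used to describe the color of the object. Unlike RGB and CMYK, Lab color coding approximates the human vision system. The color difference $\Delta E^*_{94}$ is described in the $L^*C^*h^*$ color space with differences in lightness, chroma, and hue calculated from $L^*a^*b^*$ coordinates. The difference of two colors having coordinates $(L_1^*, a_1^*, b_1^*)$ and $(L_2^*, a_2^*, b_2^*)$ can be defined as
$$\Delta E^*_{94} = \sqrt{\left(\frac{\Delta L^*}{k_L S_L}\right)^2 + \left(\frac{\Delta C^*_{ab}}{k_C S_C}\right)^2 + \left(\frac{\Delta H^*_{ab}}{k_H S_H}\right)^2}.$$
Here,
$$\Delta L^* = L_1^* - L_2^*, \quad C_1^* = \sqrt{{a_1^*}^2 + {b_1^*}^2}, \quad C_2^* = \sqrt{{a_2^*}^2 + {b_2^*}^2}, \quad \Delta C^*_{ab} = C_1^* - C_2^*,$$
$$\Delta H^*_{ab} = \sqrt{{\Delta a^*}^2 + {\Delta b^*}^2 - {\Delta C^*_{ab}}^2}, \quad \Delta a^* = a_1^* - a_2^*, \quad \Delta b^* = b_1^* - b_2^*,$$
$$S_L = 1, \quad S_C = 1 + K_1 C_1^*, \quad S_H = 1 + K_2 C_1^*,$$
where $k_L$, $k_C$, $k_H$, $K_1$, and $K_2$ are weighting factors.
Geometrically, the amount $\Delta H^*_{ab}$ represents the arithmetic mean of the chord lengths of the equal chroma circles of the two colors.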
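The Lab color difference described above matches the wording of the CIE94 formula, so a minimal Python sketch under that assumption is given here. The default weights (`k_L = k_C = k_H = 1`, `K1 = 0.045`, `K2 = 0.015`) are the commonly quoted graphic-arts values and are an assumption, as the weights used in the paper are not recoverable from the text.

```python
import math


def delta_e94(lab1, lab2, k_L=1.0, k_C=1.0, k_H=1.0, K1=0.045, K2=0.015):
    """CIE94 colour difference between two (L*, a*, b*) triples.

    The weighting factors default to graphic-arts values (an assumption).
    """
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    dL = L1 - L2
    C1 = math.hypot(a1, b1)
    C2 = math.hypot(a2, b2)
    dC = C1 - C2
    da, db = a1 - a2, b1 - b2
    # Squared hue difference; clamped at zero to absorb rounding error.
    dH_sq = max(da * da + db * db - dC * dC, 0.0)
    S_L = 1.0
    S_C = 1.0 + K1 * C1
    S_H = 1.0 + K2 * C1
    return math.sqrt((dL / (k_L * S_L)) ** 2
                     + (dC / (k_C * S_C)) ** 2
                     + dH_sq / (k_H * S_H) ** 2)
```

Two identical colors give a difference of 0, and two neutral grays that differ only in lightness give a difference equal to the lightness gap.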
Height. The difference of heights is the second input parameter of the fuzzy system. Only objects with similar heights should be merged, because characters of the same word or sentence have the same or similar heights. The difference of heights of two objects is computed from the heights of the $i$th and $j$th objects.
Position. The two objects should have the same position to be merged. The merging process is proposed for horizontal text only, as most caption text is horizontally aligned; it can be extended to other directions by considering position at different angles. The position difference is computed from the bottom coordinates of the bounding boxes of the $i$th and $j$th objects.
Distance. Characters of the same word or sentence are placed close together. The distance between characters varies with font size and is highly dependent on the heights of the characters. The distance between two regions in an image is calculated from the left and right coordinates of the bounding boxes of the objects. Figure 2 illustrates height, position, and distance pictorially.
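The three geometric features can be sketched from bounding boxes as follows. The exact formulas of the paper are not recoverable from the text, so the expressions below are assumptions: each quantity is normalized by the larger of the two heights, motivated by the remark that spacing depends on character height. The `Box` type and function names are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical bounding box in image coordinates (y grows downward)."""
    left: float
    top: float
    right: float
    bottom: float


def box_height(b):
    return b.bottom - b.top


def geometric_features(bi, bj):
    """Return (height difference, position difference, distance).

    All three are divided by the larger height; this normalization is an
    assumption made for illustration.
    """
    hi, hj = box_height(bi), box_height(bj)
    ref = max(hi, hj)
    d_height = abs(hi - hj) / ref
    d_position = abs(bi.bottom - bj.bottom) / ref
    # Horizontal gap between the boxes; zero if they overlap horizontally.
    gap = max(bj.left - bi.right, bi.left - bj.right, 0.0)
    return d_height, d_position, gap / ref
```

For boxes `Box(0, 0, 10, 20)` and `Box(14, 2, 24, 20)` this yields a height difference of 0.1, a position difference of 0, and a distance of 0.2.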
This step takes the inputs and determines, by means of membership functions, the degree to which they belong to each of the appropriate fuzzy sets. The input must be a crisp numerical value bounded to the universe of discourse of the input variable, and the output is a fuzzy degree of membership in the qualifying linguistic set. Fuzzification of the input amounts to either a table lookup or a function evaluation.
Let the inputs to the fuzzy system be represented in vector notation as $x = (x_1, x_2, x_3, x_4)^T$, where $x \in U \subset \mathbb{R}^4$ represents real-valued points. We define a symmetric Gaussian function and a sigmoid function for the inputs.
The symmetric Gaussian function is defined by two parameters $\sigma$ and $c$:
$$\mu_{A_i^k}(x_i) = \exp\left(-\frac{(x_i - c_i^k)^2}{2(\sigma_i^k)^2}\right),$$
where $i = 1, 2, 3, 4$ indexes the inputs and $k$ represents the number of the fuzzy set; $c_i^k$ represent the means of the fuzzy sets and $(\sigma_i^k)^2$ represent the variances of the fuzzy sets.
The third membership function of every input exhibits a progression that starts small, accelerates, and then levels off at its maximum. A sigmoid function is used to express this behavior:
$$\mu_{A_i^3}(x_i) = \frac{1}{1 + \exp\left(-a_i (x_i - c_i)\right)},$$
where $i = 1, 2, 3, 4$ indexes the inputs and the superscript 3 represents the fuzzy set's number; $a_i$ and $c_i$ are the model parameters to be fitted.
The input $x^* \in U$ is mapped into a fuzzy set $A'$, and the minimum $t$-norm operator is used for fuzzification.
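A minimal sketch of fuzzifying one input with two Gaussian sets and one sigmoid set is shown below. The Gaussian and sigmoid forms are the standard ones; the set labels follow the linguistic values used later in the text, and every numeric parameter in `DEFAULT_SETS` is a made-up placeholder, since the fitted parameters of the paper are not given.

```python
import math


def gaussmf(x, sigma, c):
    """Symmetric Gaussian membership with spread sigma and mean c."""
    return math.exp(-((x - c) ** 2) / (2.0 * sigma ** 2))


def sigmf(x, a, c):
    """Sigmoid membership with slope a and crossover point c."""
    return 1.0 / (1.0 + math.exp(-a * (x - c)))


# Placeholder parameters for one input (assumed values, not from the paper).
DEFAULT_SETS = {
    "same": lambda x: gaussmf(x, 0.1, 0.0),
    "similar": lambda x: gaussmf(x, 0.1, 0.3),
    "different": lambda x: sigmf(x, 20.0, 0.5),
}


def fuzzify(x, sets=DEFAULT_SETS):
    """Map a crisp input to its degree of membership in each fuzzy set."""
    return {name: mf(x) for name, mf in sets.items()}
```

Calling `fuzzify(0.0)` gives full membership in "same" and near-zero membership in the other two sets, while a large difference such as `fuzzify(1.0)` is dominated by "different".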
3.2.3. Product Inference Engine
A multiple-input, single-output fuzzy rule base is employed for the current merging problem. The product inference engine (PIE) makes use of the fuzzy rule base and linguistic rules. PIE encompasses individual-rule-based inference with union combination, min implication, the min operator for the $t$-norm, and the max operator for the $s$-norm. The input fuzzy sets carry the linguistic labels same, similar, and different and the labels minimum, average, and maximum, and the output membership functions correspond to Not join and Join. A triangular curve is used as the output membership function, which can be defined by
$$f(x; a, b, c) = \begin{cases} 0, & x \le a, \\ \dfrac{x - a}{b - a}, & a \le x \le b, \\ \dfrac{c - x}{c - b}, & b \le x \le c, \\ 0, & c \le x. \end{cases}$$
More compactly, it can be expressed as
$$f(x; a, b, c) = \max\left(\min\left(\frac{x - a}{b - a}, \frac{c - x}{c - b}\right), 0\right).$$
The parameters $a$ and $c$ define the feet of the triangle and the parameter $b$ defines the peak. Figure 3 shows the different membership functions used in the system.
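The triangular output membership function can be written in both its piecewise and compact forms; the short Python sketch below implements each so that they can be checked against one another. It assumes `a < b < c`, and the function names are chosen for illustration.

```python
def trimf_piecewise(x, a, b, c):
    """Triangular membership written case by case (assumes a < b < c)."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def trimf(x, a, b, c):
    """Compact form: the lower of the two slopes, clipped at zero."""
    return max(min((x - a) / (b - a), (c - x) / (c - b)), 0.0)
```

With feet at 0 and 1 and the peak at 0.5, both forms give 1.0 at the peak, 0.5 at `x = 0.25`, and 0.0 outside the feet.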
PIE can be fully defined by
Defuzzification is the mapping of fuzzy values into real-world values. The center average defuzzifier (CAD) is used, which takes the weighted average of the centers of the fuzzy sets, as it provides a reasonable approximation:
$$y^* = \frac{\sum_{l=1}^{M} \bar{y}^l w_l}{\sum_{l=1}^{M} w_l},$$
where $\bar{y}^l$ and $w_l$ are the center and height of the $l$th output fuzzy set and $M$ is the number of output fuzzy sets. CAD is chosen because it is computationally less expensive and has better accuracy and continuity compared to other defuzzifiers.
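Rule firing with the min operator followed by center average defuzzification can be sketched as below. The rule representation, the output centers (1.0 for Join, 0.0 for Not join), and the function names are assumptions for illustration; the actual rule base of the paper is not reproduced here.

```python
def firing_strength(memberships):
    """Min t-norm over the antecedent membership degrees of one rule."""
    return min(memberships)


def center_average(rules):
    """Center average defuzzifier.

    rules is a list of (antecedent_memberships, output_center) pairs. With
    min implication and a normal output set, the height of each clipped
    output set equals the firing strength of its rule.
    """
    num = 0.0
    den = 0.0
    for memberships, center in rules:
        w = firing_strength(memberships)
        num += center * w
        den += w
    return num / den if den > 0.0 else 0.0


# Two hypothetical rules over the four inputs (color, height, position, distance).
rules = [
    ([0.9, 0.8, 0.7, 0.6], 1.0),  # consequent "Join", assumed center 1.0
    ([0.1, 0.2, 0.3, 0.4], 0.0),  # consequent "Not join", assumed center 0.0
]
print(center_average(rules))
```

Here the firing strengths are 0.6 and 0.1, so the crisp output is 0.6 / 0.7, a value close to the Join center.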
4. Results and Experiments
The dataset of the ICDAR 2011 Robust Reading Competition, Challenge 1: “Reading Text in Born-Digital Images (Web and Email),” is used in this research; it comprises 102 test images and 420 training images. The dataset exhibits vast variation in font size, resolution, background complexity, and font type. Moreover, the ICDAR dataset is recognized as the most widely used benchmark for text detection.
The ranking metric used for the text segmentation task is accuracy. In the text detection and localization problem, an isolated character is also considered undersegmentation. The proposed method obtained 90.7% accuracy for segmentation of text objects. A comparison of the segmentation results with and without fuzzy merging is shown in Figure 4. Segmentation without fuzzy merging is tested for adaptive and fixed-size structuring elements. The results show that fuzzy merging plays a very effective role in segmentation for text detection.
To demonstrate the practicability of the proposed segmentation method, fuzzy merging is added as a postsegmentation process to textorter, which was the best technique in the ICDAR Robust Reading Competition 2011; the results show a major improvement in the detection rate of textorter. Without merging, many isolated characters are not detected as text by textorter, as they are not merged into complete words. The ranking metric used for the text localization task is the harmonic mean, computed according to a previously published evaluation methodology. It is a combination of two measures: precision and recall. Table 1 shows the comparison of results for different text detection methods.
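For reference, the harmonic mean of precision and recall can be computed as in the short Python sketch below. How precision and recall themselves are obtained from detected and ground-truth boxes depends on the evaluation protocol and is not covered here; the function name is illustrative.

```python
def harmonic_mean(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)
```

The harmonic mean is pulled toward the lower of the two values, so a method with precision 0.5 and recall 1.0 scores about 0.667 rather than the arithmetic average of 0.75.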
Figure 5 shows the superiority of the proposed method. The results show that fuzzy merging enhances the segmentation process for text detection.
Different combinations of input and output membership functions were tested, and the results show that the combination specified in the proposed methodology gives the best outcome. Gaussian, triangular, sigmoid, trapezoidal, and bell-shaped functions are commonly used membership functions. These functions were tested in different combinations in the fuzzy inference engine. Gaussian, triangular, and sigmoid functions are defined in Section 3.2.2. The bell-shaped function can be defined as
$$f(x; a, b, c) = \frac{1}{1 + \left|\dfrac{x - c}{a}\right|^{2b}}.$$
The generalized bell function is defined using three parameters $a$, $b$, and $c$; here the parameter $b$ is usually positive, whereas the parameter $c$ locates the center of the curve.
The trapezoidal curve is a function of a vector $x$ and depends on four scalar parameters $a$, $b$, $c$, and $d$:
$$f(x; a, b, c, d) = \begin{cases} 0, & x \le a, \\ \dfrac{x - a}{b - a}, & a \le x \le b, \\ 1, & b \le x \le c, \\ \dfrac{d - x}{d - c}, & c \le x \le d, \\ 0, & d \le x, \end{cases}$$
or it can be defined compactly as
$$f(x; a, b, c, d) = \max\left(\min\left(\frac{x - a}{b - a}, 1, \frac{d - x}{d - c}\right), 0\right).$$
The parameters $a$ and $d$ locate the “feet” of the trapezoid and the parameters $b$ and $c$ locate the “shoulders.”
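The bell-shaped and trapezoidal membership functions can be sketched in Python using their standard compact forms. The function names are chosen for illustration, and the trapezoid assumes `a < b <= c < d`.

```python
def gbellmf(x, a, b, c):
    """Generalized bell membership: width a, shape b, center c."""
    return 1.0 / (1.0 + abs((x - c) / a) ** (2 * b))


def trapmf(x, a, b, c, d):
    """Trapezoidal membership with feet a, d and shoulders b, c."""
    return max(min((x - a) / (b - a), 1.0, (d - x) / (d - c)), 0.0)
```

The bell function equals 1 at its center and 0.5 at a distance `a` from it, while the trapezoid equals 1 between its shoulders and falls linearly to 0 at its feet.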
Figure 6 shows the comparison of different membership functions regarding four inputs.
The paper addresses a crucial problem of text detection, namely, variation in font size and resolution. Earlier approaches are primarily dataset specific and unable to deal with large variation in font sizes. This paper devises a fuzzy-based postprocessing method for segmentation that can be combined with any segmentation method. Four factors are used for joining characters into words. These factors are fed into the fuzzy system, which decides whether or not to join regions. The method is evaluated on the dataset of the ICDAR 2011 Robust Reading Competition, Challenge 1: “Reading Text in Born-Digital Images (Web and Email),” and the results achieved are promising for the text detection problems described above.
Conflict of Interests
The authors declare that they have no conflict of interests regarding the publication of this paper.
References
[1] H. Li, D. Doermann, and O. Kia, “Automatic text detection and tracking in digital video,” IEEE Transactions on Image Processing, vol. 9, no. 1, pp. 147–156, 2000.
[2] K. I. Kim, K. Jung, and J. H. Kim, “Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1631–1639, 2003.
[3] M. Zhao, S. Li, and J. Kwok, “Text detection in images using sparse representation with discriminative dictionaries,” Image and Vision Computing, vol. 28, no. 12, pp. 1590–1599, 2010.
[4] K. Wang and S. Belongie, “Word spotting in the wild,” in Proceedings of the European Conference on Computer Vision (ECCV '10), pp. 591–604, Springer, 2010.
[5] L. Neumann and J. Matas, “A method for text localization and recognition in real-world images,” in Proceedings of the Asian Conference on Computer Vision (ACCV '11), pp. 770–783, Springer, 2011.
[6] P. Shivakumara, T. Q. Phan, and C. L. Tan, “A Laplacian approach to multi-oriented text detection in video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 2, pp. 412–419, 2011.
[7] K. Jung, K. I. Kim, and A. K. Jain, “Text information extraction in images and video: a survey,” Pattern Recognition, vol. 37, no. 5, pp. 977–997, 2004.
[8] J. Liang, D. Doermann, and H. Li, “Camera-based analysis of text and documents: a survey,” International Journal on Document Analysis and Recognition, vol. 7, no. 2-3, pp. 84–104, 2005.
[9] C. P. Sumathi, T. Santhanam, and G. Gayathri, “A Survey on various approaches of text extraction in images,” International Journal of Computer Science & Engineering Survey, vol. 3, no. 4, 2012.
[10] R. Lienhart, Video OCR: A Survey and Practitioner's Guide, Video mining, Springer, Burlingame, Calif, USA, 2003.
[11] C. Li, X. G. Ding, and Y. S. Wu, “An algorithm for text location in images based on histogram features and Ada-boost,” Journal of Image and Graphics, vol. 3, article 003, 2006.
[12] K. I. Kim, K. Jung, and J. H. Kim, “Texture-based approach for text detection in images using support vector machines and continuously adaptive mean shift algorithm,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 25, no. 12, pp. 1631–1639, 2003.
[13] R. Lienhart and A. Wernicke, “Localizing and segmenting text in images and videos,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 4, pp. 256–268, 2002.
[14] J. Gllavata, E. Qeli, and B. Freisleben, “Detecting text in videos using fuzzy clustering ensembles,” in Proceedings of the 8th IEEE International Symposium on Multimedia (ISM '06), pp. 283–290, IEEE, December 2006.
[15] D. Chen, J.-M. Odobez, and H. Bourlard, “Text detection and recognition in images and video frames,” Pattern Recognition, vol. 37, no. 3, pp. 595–608, 2004.
[16] J. Fabrizio, M. Cord, and B. Marcotegui, Text Extraction from Street Level Images, City Models, Roads and Traffic (CMRT), 3, 2009.
[17] M. León Cristóbal, V. Vilaplana Besler, A. Gasull Llampallas, and F. Marqués Acosta, Region-Based Caption Text Extraction, 2012.
[18] B. Epshtein, E. Ofek, and Y. Wexler, “Detecting text in natural scenes with stroke width transform,” in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR '10), pp. 2963–2970, IEEE, June 2010.
[19] M. Anthimopoulos, B. Gatos, and I. Pratikakis, “A two-stage scheme for text detection in video images,” Image and Vision Computing, vol. 28, no. 9, pp. 1413–1426, 2010.
[20] L. Neumann and J. Matas, “A method for text localization and recognition in real-world images,” in Proceedings of the Asian Conference on Computer Vision (ACCV '10), pp. 770–783, Springer, 2011.
[21] S. T. Deepa and S. P. Victor, “A novel method for text extraction,” International Journal of Engineering Science & Advanced Technology, no. 4, pp. 961–964, 2013.
[22] R. Farhoodi and S. Kasaei, “Text segmentation from images with textured and colored background,” in Proceedings of 13th Iranian Conference on Electrical Engineering, Zanjan, Iran, May 2005.
[23] M. S. Das, B. H. Bindhu, and A. Govardhan, “Evaluation of text detection and localization methods in natural images,” International Journal of Emerging Technology and Advanced Engineering, vol. 2, no. 6, pp. 277–282, 2012.
[24] S. T. Deepa and S. P. Victor, “A novel method for text extraction,” International Journal of Engineering Science Advanced Technology, vol. 2, no. 4, pp. 961–964, 2013.
[25] S. Li and J. T. Kwok, “Text extraction using edge detection and morphological dilation,” in Proceedings of the International Symposium on Intelligent Multimedia, Video and Speech Processing (ISIMP '04), pp. 330–333, IEEE, October 2004.
[26] J. Poignant, L. Besacier, G. Quenot, and F. Thollard, “From text detection in videos to person identification,” in Proceedings of the IEEE International Conference on Multimedia and Expo (ICME '12), pp. 854–859, IEEE, July 2012.
[27] R. Minetto, N. Thome, M. Cord, J. Fabrizio, and B. Marcotegui, “Snoopertext: a multiresolution system for text detection in complex visual scenes,” in Proceedings of the 17th IEEE International Conference on Image Processing (ICIP '10), pp. 3861–3864, IEEE, September 2010.
[28] M. Anthimopoulos, B. Gatos, and I. Pratikakis, “Multiresolution text detection in video frames,” in Proceedings of the 2nd International Conference on Computer Vision Theory and Applications (VISAPP '07), pp. 161–166, March 2007.
[29] C. Wolf and J.-M. Jolion, “Extraction and recognition of artificial text in multimedia documents,” Pattern Analysis and Applications, vol. 6, no. 4, pp. 309–326, 2004.
[30] Y.-F. Pan, X. Hou, and C.-L. Liu, “A hybrid approach to detect and localize texts in natural scene images,” IEEE Transactions on Image Processing, vol. 20, no. 3, pp. 800–813, 2011.
[31] A. Gonzalez and L. M. Bergasa, “A text reading algorithm for natural images,” Image and Vision Computing, vol. 31, pp. 255–274, 2013.
[32] C. Shi, C. Wang, B. Xiao, Y. Zhang, and S. Gao, “Scene text detection using graph model built upon maximally stable extremal regions,” Pattern Recognition Letters, vol. 34, pp. 107–116, 2012.
[33] C. Yao, X. Bai, W. Liu, Y. Ma, and Z. Tu, “Detecting texts of arbitrary orientations in natural images,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR '12), pp. 1083–1090, IEEE, June 2012.
[34] O. J. Tobias and R. Seara, “Image segmentation by histogram thresholding using fuzzy sets,” IEEE Transactions on Image Processing, vol. 11, no. 12, pp. 1457–1465, 2002.
[35] N. Senthilkumaran and R. Rajesh, “Edge detection techniques for image segmentation-a survey of soft computing approaches,” International Journal of Recent Trends in Engineering, vol. 1, no. 2, pp. 250–254, 2009.
[36] L. X. Wang, A Course in Fuzzy Systems, Prentice-Hall Press, Upper Saddle River, NJ, USA, 1999.
[37] S. Tehsin, A. Masood, S. Kausar, and Y. Javed, “Text localization and detection method for born-digital images,” IETE Journal of Research, vol. 59, no. 4, pp. 343–349, 2013.
[38] D. Karatzas, S. R. Mestre, J. Mas, F. Nourbakhsh, and P. P. Roy, “ICDAR 2011 robust reading competition—challenge 1: reading text in born-digital images (web and email),” in Proceedings of the 11th International Conference on Document Analysis and Recognition (ICDAR '11), pp. 1485–1490, IEEE, September 2011.
[39] C. Wolf and J.-M. Jolion, “Object count/area graphs for the evaluation of object detection and segmentation algorithms,” International Journal on Document Analysis and Recognition, vol. 8, no. 4, pp. 280–296, 2006.
Copyright © 2014 Samabia Tehsin et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.