Abstract

Any document in the Serbian language can be written in two different scripts: Latin or Cyrillic. Although the characteristics of these scripts are similar, some of their statistical measures differ considerably. This paper proposes a method for recognizing the script of a document according to the occurrence and co-occurrence of script types. First, each letter is modeled with a certain script type according to its position in the baseline area. Then, a frequency analysis of the script-type occurrence is performed. Due to the diversity of the Latin and Cyrillic scripts, the occurrence of the modeled letters shows substantial statistical dissimilarity. Furthermore, the co-occurrence matrix is computed. Its analysis yields a strong margin as a criterion to distinguish and recognize a certain script. The proposed method is evaluated on a database which includes different types of printed and web documents. The experiments gave encouraging results.

1. Introduction

Cryptography studies the problems concerning the conversion of information from a readable into some other state. The initial information represents a plain text. When the information becomes encrypted, it is referred to as a cipher text [1]. A substitution cipher is an encoding method in which units of plain text are replaced with cipher text [2]. These units can be single letters, pairs of letters, triplets of letters, mixtures of the above, and so forth. In our application the encryption function need not be injective [3], owing to the nature of the subsequent statistical analysis. It does not matter if two different plain texts are encrypted into the same cipher text, because decryption of the cipher text is not considered. Hence, cryptography is used only as a basis for modeling and analyzing documents written in the Serbian language. Serbian is a European minority language. It is distinctive in that it can be written in the Latin and Cyrillic scripts interchangeably. According to the baseline characteristics [4], each letter in the text file is replaced with a cipher taken from a set of only four counterparts. The basic idea is to distinguish the script (Latin or Cyrillic) by statistical analysis of the cipher text. This is accomplished with a frequency analysis of the occurrence [5] as well as with statistical measures extracted from the gray-level co-occurrence matrix [6]. The letter frequency distribution is a function which assigns to each letter the frequency of its occurrence in a text sample [7]. The gray-level co-occurrence matrix (GLCM) has been used for the extraction of features needed for texture classification [8]. Nevertheless, it can also be exploited for letter co-occurrence in a text document [9]. At the final stage, an experiment is carried out on a custom-oriented database containing text from printed and web documents.

The rest of the paper is organized as follows. Section 2 describes the full procedure of the proposed algorithm. Section 3 defines the experiment. Section 4 presents and discusses the experimental results. Section 5 concludes the paper.

2. Proposed Algorithm

The proposed algorithm converts a document written in the Latin or Cyrillic script, which represents the plain text, into a cipher text according to a predefined encryption based on the text line structure definition. Then, the resulting cipher texts are subjected to frequency and co-occurrence analysis. The results of the frequency analysis indicate a substantial difference between cipher texts obtained from Latin and Cyrillic text. Similarly, the co-occurrence analysis shows an obvious quantitative disparity in some measures. This draws a strong margin as a criterion to distinguish and recognize a certain script type (Figure 1).

2.1. Text Line Structure

Text in printed and web documents is defined as a well-formed text type. It is characterized by strong regularity in shape. The distances between the text lines are sufficient for the lines to be split apart. The words are formed regularly, with similar inter-word spacing. However, within a given script, letters or signs occupy different positions with respect to the baseline. This is shown in Figure 2.

From Figure 2, four virtual lines can be defined [4]: (i) the top-line, (ii) the upper-line, (iii) the base-line, and (iv) the bottom-line.

Accordingly, a text line can be considered as being composed of three vertical zones [4]: (i) the upper zone, (ii) the middle zone, and (iii) the lower zone.

Each text line has at least a middle zone. The upper zone depends on capital letters and letters with ascenders, while the lower zone depends on letters with descenders. Only a few letters occupy the upper and lower zones.

2.2. Encryption

Two different sets are produced: one for the Latin and one for the Cyrillic alphabet.

Each of them consists of 60 elements, that is, the letters which are valid for the Serbian language. Furthermore, both sets are mapped into a common set of script types.

These mappings are achieved in accordance with the text line area definition. The structure of the text line allows the definition of the following script types [4]: (i) full letter (F), where the letter is present in all three zones; (ii) ascender letter (A), where character parts are present in the upper and middle zones; (iii) descender letter (D), where character parts are present in the lower and middle zones; and (iv) short letter (S), where character parts are present in the middle zone only.

Accordingly, all letters will be replaced with a cipher from the set {F, A, D, S}.

Each letter occupies a certain position, which corresponds to an element of this set with a unique designation according to Table 1.

It should be noted that the above mappings are surjective.

The Serbian language contains 30 letters. Each letter in the Latin script has a corresponding equivalent letter in Cyrillic. Table 2 shows the Latin and Cyrillic letters as well as their designations according to Table 1.
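The letter-to-script-type substitution can be sketched as follows. The type assignments below are an illustrative subset chosen by the editors, not the paper's full Table 2 mapping; full (F) letters are omitted here since they are rare in the sample documents.

```python
# Sketch of the letter -> script-type encryption (an illustrative subset,
# not the paper's full Table 2 mapping).
SCRIPT_TYPE = {
    # short letters (middle zone only)
    "a": "S", "c": "S", "e": "S", "m": "S", "n": "S",
    "o": "S", "r": "S", "s": "S", "u": "S", "v": "S",
    # ascender letters (upper + middle zones)
    "b": "A", "d": "A", "h": "A", "k": "A", "l": "A", "t": "A",
    # descender letters (middle + lower zones)
    "g": "D", "j": "D", "p": "D",
}

def encrypt(text):
    """Replace each recognized letter with its script-type cipher symbol."""
    return "".join(SCRIPT_TYPE[ch] for ch in text.lower() if ch in SCRIPT_TYPE)

print(encrypt("dogadjaj"))  # → ASDSADSD
```

Because several letters map to the same cipher symbol, the mapping is surjective but not injective, exactly as noted above.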

A statistical analysis of the letters and their corresponding types for the Latin and Cyrillic scripts is shown in Table 3.

2.3. Frequency Analysis of the Occurrence

In the proposed algorithm, all letters of a certain script are substituted with the equivalent members of the script-type set according to Table 2. These circumstances for a Latin document are shown in Figure 3.

Figure 3(b) shows the cipher text obtained from the Latin document according to the modeling given in Table 2. Figures 3(c)–3(f) show the subsets of the cipher text for each element of the set, that is, S, A, D, and F, respectively. Statistical analysis of the cipher text shows the following: 2217 elements of type S, 598 of type A, 261 of type D, and 8 of type F. Accordingly, the distribution of the set elements for the Latin document is shown in Figure 4.

Next, the same Latin document is converted into a Cyrillic one. As for the Latin document, all letters of the Cyrillic document are exchanged with the equivalent members of the script-type set according to Table 2. These circumstances for the Cyrillic document are shown in Figure 5.

Figure 5(b) shows the cipher text obtained from the Cyrillic document according to the modeling given in Table 2. Figures 5(c)–5(f) show the subsets of the cipher text for each element of the set, that is, S, A, D, and F, respectively.

Statistical analysis of the Cyrillic document shows the following: 2516 elements of type S, 53 of type A, 445 of type D, and 26 of type F. It should be noted that the total number of set elements in the Latin and Cyrillic documents is not identical. This is due to the difference in the definition of letters in the two scripts. In the Cyrillic script, each letter is given one and only one sign. In the Latin script, however, letters such as dž, lj, and nj are represented by two characters. The distribution of the set elements for the Cyrillic document is presented in Figure 6.

According to Figures 4 and 6, a comparison chart is drawn. It is shown in Figure 7.

Quantification of the script type appearance in a document written in Latin and Cyrillic is shown in Table 4.

It is obvious that the Latin document, compared to the Cyrillic one, has a slightly smaller number of short (S), descender (D), and full (F) letters. However, the crucial margin is seen in the ascender (A) letters. Hence, this can serve as a measure of confidence for the detection of the script in a document given in the Serbian language.
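The frequency criterion can be reproduced numerically. The counts below are those reported above for the example Latin and Cyrillic documents (Figures 4 and 6); the helper function is a minimal sketch for computing such a distribution from any cipher text.

```python
from collections import Counter

def script_type_distribution(cipher_text):
    """Relative frequency of each script type in a cipher text."""
    counts = Counter(cipher_text)
    total = sum(counts.values())
    return {t: counts.get(t, 0) / total for t in "SADF"}

# Counts reported for the example Latin and Cyrillic documents:
latin = {"S": 2217, "A": 598, "D": 261, "F": 8}
cyrillic = {"S": 2516, "A": 53, "D": 445, "F": 26}

# The ascender (A) counts give the crucial margin between the scripts:
ratio = latin["A"] / cyrillic["A"]
print(round(ratio, 2))  # → 11.28
```

For this example the ascender ratio is about 11.3, comfortably above the margin of 8 established later in Section 4.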

2.4. Co-Occurrence Analysis

Let I be the grayscale image under consideration. It has M rows and N columns, while G is the total number of gray levels. The spatial relationship of gray levels in the image is expressed by the gray-level co-occurrence matrix (GLCM) [6, 10]. Hence, the GLCM is a matrix C that describes the frequency of one gray level appearing in a specified spatial linear relationship with another gray level within the area of investigation [11]. In order to compute the co-occurrence matrix C, we consider a central pixel with a neighborhood defined by the window of interest. This window is defined by two parameters: the inter-pixel distance d and the orientation θ. Typically, d is chosen as 1 (one pixel), while the value of θ depends on the neighborhood. Each pixel therefore has 8 neighbors, at the angles 0°, 45°, 90°, 135°, 180°, 225°, 270°, and 315°. However, neighbors at θ and at θ + 180° are equivalent under the GLCM definition [12]. So, the choice may fall to the 4 neighboring pixels at 0°, 45°, 90°, and 135°, that is, horizontal, right diagonal, vertical, and left diagonal [13]. For each pixel of the neighborhood, the number of times a pixel pair specified by the distance and orientation parameters appears is counted. The entry (i, j) of C represents the number of occasions on which a pixel with intensity i is adjacent to a pixel with intensity j. Hence, for the given image I, the co-occurrence matrix C is defined as [14]:

C(i, j) = Σ_{x=1..M} Σ_{y=1..N} [1 if I(x, y) = i and I(x + Δx, y + Δy) = j; 0 otherwise],

where i and j are image intensity values, and (x, y) and (x + Δx, y + Δy) are spatial positions in the image I. The offset (Δx, Δy) specifies the distance between the pixel of interest and its neighbor; it depends on the direction θ and the distance d at which the matrix is computed. The square matrix C is of order G × G. Using a statistical approach like the GLCM provides valuable information about the relative positions of neighboring pixels in an image [12]. In order to normalize the matrix C, the matrix P is calculated as [10]:

P(i, j) = C(i, j) / Σ_{i=1..G} Σ_{j=1..G} C(i, j).

That is, the normalized co-occurrence matrix P is obtained by dividing each element of C by the total number of co-occurrence pairs counted in C.

To illustrate the computation of the GLCM, a four-gray-level image is used. The window parameters are d = 1 and θ = 0° (horizontal). The initial matrix is shown in Figure 8.

The procedure for calculating the co-occurrence matrix of the grayscale matrix (d = 1 and θ = 0°) [12] is given in Figure 9.
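The computation just described can be sketched directly. The 4 × 4 four-level test image below is an assumed example, not necessarily the matrix of Figure 8; the loop counts each pixel paired with its right-hand neighbor (d = 1, θ = 0°).

```python
import numpy as np

def glcm(image, levels=4):
    """Co-occurrence matrix for d = 1, theta = 0 deg (right horizontal
    neighbor). Returns a levels x levels matrix of pair counts."""
    C = np.zeros((levels, levels), dtype=int)
    rows, cols = image.shape
    for x in range(rows):
        for y in range(cols - 1):
            # count the pair (current pixel, its right neighbor)
            C[image[x, y], image[x, y + 1]] += 1
    return C

img = np.array([[0, 0, 1, 1],
                [0, 0, 1, 1],
                [0, 2, 2, 2],
                [2, 2, 3, 3]])
C = glcm(img)
P = C / C.sum()   # normalized co-occurrence matrix
print(C)
```

Each row of the 4 × 4 image contributes three horizontal pairs, so C sums to 12 and P sums to 1.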

In order for the GLCM to be applied in our case, the set {S, A, D, F} is mapped into a set of four gray levels by a bijective function. Furthermore, the neighborhood is given as 2-connected (at 0° and 180° around each element, with d = 1). Accordingly, the same document in the Latin and Cyrillic scripts is converted into cipher text. It is shown in Figure 10.

To evaluate these cipher documents, the GLCM method is employed. Furthermore, various statistical measures obtained from the co-occurrence matrix are introduced. The primary goal is to characterize the cipher text. Five descriptors can be used to describe the image [15]: (i) uniformity (UNI), (ii) entropy (ENT), (iii) maximum probability (MAX), (iv) dissimilarity (DIS), and (v) contrast (CON).

Uniformity (UNI), which is sometimes called the angular second moment (ASM) or energy (ENG), measures the image homogeneity. It attains its highest value when the GLCM has few entries of large magnitude. In contrast, it is low when all entries are nearly equal. The equation of the uniformity is [15]:

UNI = Σ_{i} Σ_{j} P(i, j)².

Entropy (ENT) measures the disorder or the complexity of the image. Its highest value is found when the values of P(i, j) are allocated quite uniformly throughout the matrix. This happens when the image has no pairs of gray levels with particular preference over others. The equation of the entropy is [15, 16]:

ENT = −Σ_{i} Σ_{j} P(i, j) log₂ P(i, j).

Maximum probability (MAX) extracts the largest entry of the normalized co-occurrence matrix, that is, the most probable gray-level pair in the image. It is defined as [15]:

MAX = max_{i, j} P(i, j).

Dissimilarity (DIS) is a measure of the variation in the gray-level pairs of the image. It depends on the distance from the matrix diagonal, weighted by the corresponding probability. The equation of the dissimilarity is [15]:

DIS = Σ_{i} Σ_{j} |i − j| P(i, j).

Contrast (CON), or inertia, is a measure of the intensity contrast between a pixel and its neighbor over the entire image. Hence, it shows the amount of local variation present in the image. If the image is constant, the contrast equals 0. The highest value of contrast is obtained when the image has random intensity and the pixel intensity and neighbor intensity are very different. The equation of the contrast is [15, 16]:

CON = Σ_{i} Σ_{j} (i − j)² P(i, j).
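The five descriptors can be computed directly from a normalized matrix P. This is a generic implementation of the standard formulas above, assuming a base-2 logarithm for the entropy.

```python
import numpy as np

def glcm_descriptors(P):
    """Five GLCM descriptors of a normalized co-occurrence matrix P."""
    i, j = np.indices(P.shape)
    nz = P[P > 0]                                    # skip log(0) terms
    return {
        "UNI": float(np.sum(P ** 2)),                # uniformity / energy
        "ENT": float(-np.sum(nz * np.log2(nz))),     # entropy
        "MAX": float(P.max()),                       # maximum probability
        "DIS": float(np.sum(np.abs(i - j) * P)),     # dissimilarity
        "CON": float(np.sum((i - j) ** 2 * P)),      # contrast
    }

# A perfectly ordered matrix (all mass on the diagonal) has zero
# dissimilarity and zero contrast:
P = np.eye(2) / 2
print(glcm_descriptors(P))
```

For this diagonal example the descriptors evaluate to UNI = 0.5, ENT = 1.0, MAX = 0.5, DIS = 0, and CON = 0, matching the interpretations given above.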

A brief look at the normalized co-occurrence matrices for the same document written in the Latin and Cyrillic scripts (the text representing the first four paragraphs of the document given in Figure 10) shows quite different characterizations. The test results are given in Table 5.

Furthermore, the calculation of five co-occurrence descriptors shows the values given in Table 6.

3. Experiments

For the sake of the experiment, a custom-oriented database is created. It consists of 10 documents. These documents represent excerpts from printed and web documents written in the Serbian language. The documents are created in both scripts: Latin and Cyrillic. Printed documents are created from PDF documents, while web documents are extracted from web news. The total length of the documents in the database is approximately 75,000 letter characters per script (approximately 40 pages). The length of the printed documents ranges from 2273 to 15840 letter characters. The web documents are shorter than the printed documents; their length ranges from 1231 to 2502 letter characters. It should be noted that all documents contain more than 1000 letter characters. Examples of a printed and a web document from the database are shown in Figure 11.

4. Results and Discussion

According to the proposed algorithm, all documents from the database are converted into equivalent cipher texts and subjected to the frequency and co-occurrence analysis. First, the frequency analysis of the script type occurrence in Latin as well as in Cyrillic documents is examined (Table 7). The obtained results for each document are given in Table 8.

The final processing of the results is based on cumulative measures such as the sum, average, maximum, and minimum of the script-type occurrences in the database. On this basis the criteria are established. All of these are shown in Table 9.

From the cumulative results given in Table 9, some criteria can be established. The biggest margin between the results is seen in the ratio of ascender letters, which has a value of at least 8. Hence, it is the strongest point for the qualitative characterization and recognition of a certain script. Furthermore, smaller numbers of short and descender letters are common in Latin compared to Cyrillic documents. Finally, full letters are quite rare in a Latin document. However, their characterization in the form of a criterion is quite problematic, because they are occasionally absent from Latin documents.

Furthermore, the analysis of the script type co-occurrence in Latin as well as in Cyrillic documents is examined according to GLCM method. The obtained results for each document are given in Table 10.

The co-occurrence descriptors for the Latin and Cyrillic texts and their ratios are presented in Figure 13.

From the above results, some criteria can be established. It is clear that uniformity and maximum probability attain the most distinct values for Latin and Cyrillic text. Hence, these descriptors are suitable for the qualitative characterization of Latin and Cyrillic text as well as for creating criteria to distinguish a certain script type. From the above results, the margin criteria should be a uniformity of 0.3 and a maximum probability of 0.5. These values of the two descriptors represent a strong margin for qualifying the script in a given Serbian text. If we combine them with the criteria obtained from the frequency analysis of the script-type occurrence, then the full decision-making criteria can be established. This leads to correct recognition of the script in Serbian text.
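As a sketch, the frequency part of such a decision rule might look as follows. The ascender-share threshold of 0.05 is an illustrative assumption chosen between the observed Latin (about 0.19) and Cyrillic (about 0.02) shares; it is not a value tuned in the paper.

```python
def classify_script(type_distribution, ascender_threshold=0.05):
    """Toy decision rule based on the frequency criterion: Latin cipher
    text contains roughly an order of magnitude more ascender ('A')
    symbols than Cyrillic. The threshold is an illustrative assumption,
    not the paper's tuned margin."""
    if type_distribution["A"] >= ascender_threshold:
        return "Latin"
    return "Cyrillic"

# Script-type distributions computed from the example documents in
# Section 2.3 (2217/598/261/8 and 2516/53/445/26 occurrences):
latin = {"S": 0.719, "A": 0.194, "D": 0.085, "F": 0.003}
cyrillic = {"S": 0.828, "A": 0.017, "D": 0.146, "F": 0.009}
print(classify_script(latin), classify_script(cyrillic))  # → Latin Cyrillic
```

In the full method this rule would be combined with the co-occurrence margins (uniformity 0.3, maximum probability 0.5) for a more robust decision.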

5. Conclusion

This paper proposed an algorithm for the recognition of the exact script in a Serbian document. Documents in the Serbian language can be written in two different scripts: Latin or Cyrillic. The proposed algorithm converts a document written in the Latin or Cyrillic script into cipher text. In this way, all alphabetic characters are exchanged with only four different encrypted signs according to a predefined encryption based on the text line structure definition. Such cipher texts are then subjected to frequency and co-occurrence analysis. According to the obtained results, criteria for the recognition of a certain script are proposed. The proposed method is applied to a custom-oriented database which includes different types of printed and web documents. The experiment shows encouraging results. Possible applications can be seen in the area of web page recognition.

Future work will be directed toward the recognition of related languages as well as of different languages written in the same script.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this article (“Recognition of the Script in Serbian Documents using Frequency Occurrence and Co-occurrence Analysis” by Darko Brodić, Zoran N. Milivojević, Čedomir A. Maluckov).