Abstract

A new method for clustering spam messages collected in the databases of an antispam system is proposed. A genetic algorithm is developed to solve the clustering problem. The objective function maximizes the similarity between messages within clusters, where similarity is defined using the $k$-nearest neighbor algorithm. Applying a genetic algorithm to constrained problems requires constantly maintaining the feasibility of chromosomes, which slows convergence. Therefore, to accelerate the convergence of the genetic algorithm, a penalty function is used that suppresses infeasible chromosomes when fitness values are ranked. After classification, knowledge extraction is applied to obtain information about the classes. A multidocument summarization method is used to obtain an information portrait of each cluster of spam messages. By classifying and parametrizing spam templates, it also becomes possible to determine how the themes of spam depend on geography (e.g., which subjects prevail in spam messages sent from certain countries). Thus, the proposed system will be able to reveal purposeful information attacks if they occur. By analyzing the origins of the spam messages in the collection, it is possible to identify organized social networks of spammers.

1. Introduction

E-mail is an effective, fast, and cheap means of communication, which is why spammers prefer it for sending spam. Nowadays almost every second user has an E-mail account and is consequently faced with the spam problem. E-mail spam is unsolicited information sent to E-mail boxes. Spam is a big problem both for users and for ISPs. The causes are the growing importance of electronic communications on the one hand and the improvement of spam-sending technology on the other. According to Symantec's spam reports, the average global spam rate in 2010 was 89.1%, an increase of 1.4% compared with 2009. The proportion of spam sent from botnets was much higher in 2010, accounting for approximately 88.2% of all spam. Despite many attempts to disrupt botnet activities throughout 2010, by the end of the year the total number of active bots had returned to roughly the same level as at the end of 2009, with approximately five million spam-sending botnets in use worldwide [1].

Spam messages lower productivity; occupy space in mailboxes; spread viruses, trojans, and material that is potentially harmful for certain categories of users; and undermine the stability of mail servers. As a result, users spend a lot of time sorting incoming mail and deleting unwanted correspondence. According to a report from Ferris Research, global losses from spam amounted to about 130 billion dollars in 2009, including 42 billion in the USA [2]. Besides the expenses for acquiring, installing, and maintaining protective tools, users are forced to bear additional expenses connected with the overload of mail traffic, server failures, and productivity loss. We can therefore conclude that spam is not only an irritating factor but also a direct threat to business. Considering the enormous quantity of spam messages arriving in E-mail boxes, it is reasonable to assume that spammers do not operate alone: their activity is global and organized, and it forms virtual social networks. They attack the mailboxes of individual users, whole corporations, and even states.

Every day E-mail users receive hundreds of spam messages with new content, sent from new addresses that are generated automatically by robot software. Filtering spam with traditional methods such as black and white lists (of domains, IP addresses, and mailing addresses) is almost impossible. Applying text mining methods to E-mail can raise the efficiency of spam filtering. Classifying spam messages also makes it possible to establish how the themes of spam depend on geography (e.g., which subjects prevail in spam messages sent from certain countries). Methods of text clustering and classification have been applied successfully to the spam problem over the last decade. Separating E-mail into legitimate messages and spam with the help of cluster analysis is considered in [3–9].

This paper focuses on the classification of textual spam E-mails using data mining techniques. Our purpose is not only to filter messages into spam and nonspam but also to divide spam messages into thematically similar groups and to analyze them in order to identify the social networks of spammers [10].

The rest of the paper is organized as follows. Section 2 presents related work. Section 3 describes how spam messages are represented in databases for analysis and defines the similarity measure between spam messages. The proposed clustering method is presented in Section 4. A genetic algorithm for solving the clustering problem is proposed in Section 5. Classification of the collected spam messages using the $k$NN method is described in Section 6. The document summarization method used to obtain information about the clusters is applied in Section 7. Conclusions and future work are given in Section 8.

2. Related Work

Spam messages are one of the weapons of information war. Since 2003, the notions of spam and war have appeared in the same context in the scientific literature [11, 12]. However, the problem of spammers' social networks has been considered in articles only since 2009. In [13], clustering of spammers into groups is proposed. In [14, 15], a spectral clustering method is applied to the set of spam messages collected by Project Honey Pot in order to identify and trace social networks of spammers. The authors represent a social network of spammers as a graph whose nodes correspond to spammers and whose edges correspond to social relations between spammers.

In this paper, a document clustering method is applied to the clustering and analysis of spam messages. In our case, the text documents are textual E-mails. Although there are many approaches to representing text documents, the most widespread is the vector model, which was proposed in Salton's works [16, 17]. In the simplest case, the vector model associates each document with the frequency spectrum of its words and, accordingly, with a vector in lexical space. In more advanced vector models, the dimension of the space is reduced by discarding the most widespread or the most infrequent words, thereby increasing the relative importance of the basic words. The main advantage of the vector model is that documents can be ranked according to their similarity in vector space [18].

Clustering is one of the most useful approaches in data mining for detecting natural groups in a data set. Clustering problems are usually solved with traditional algorithms such as the $k$-means algorithm [19, 20], hierarchical clustering, the differential evolution algorithm, the particle swarm optimization algorithm, artificial bee colony optimization, the ant colony algorithm, and the neural network algorithm GEM (Gaussian expectation-maximization) [21–26]. An up-to-date survey of evolutionary algorithms for clustering, especially partitional algorithms, is given in [27]. That paper also discusses advanced topics such as multiobjective and ensemble-based evolutionary clustering, as well as overlapping (soft, fuzzy) clustering. Each of the surveyed algorithms is described with respect to fixed or variable number of clusters; cluster-oriented or nonoriented operators; context-sensitive or context-insensitive operators; guided or unguided operators; binary, integer, or real encodings; and centroid-based, medoid-based, label-based, tree-based, or graph-based representations.

Clustering of spam messages means automatic grouping of thematically close spam messages. For information streams such as E-mail, the problem is complicated by the need to carry out this process in real time. Further complications arise from the large choice of algorithms available for clustering spam messages. Different methodologies use different similarity algorithms for electronic documents when the number of features is large. Once classes have been defined by a clustering method, they need to be maintained, because spam constantly changes and the collection of spam messages is continually replenished. In this work, a new algorithm for defining the criterion function of the spam message clustering problem is proposed. The clustering problem itself is solved by a genetic algorithm [28]. Genetic algorithms are the subject of many scientific works. For example, [29] surveys genetic algorithms designed for clustering ensembles, together with their genotypes, fitness functions, and genetic operations, and concludes that using genetic algorithms in clustering ensembles improves clustering accuracy.

In this work, the $k$-nearest neighbor method is applied to classify spam messages, and the multidocument summarization method proposed in [30–32] is applied to the clusters to determine the subjects of the spam messages.

3. Problem Statement

Assume there is a collection of spam messages gathered on the servers of the hierarchical spam filtration system described in [33]. Before applying any clustering method to these spam messages and analyzing them, one should specify the input data. There are several approaches to representing information in databases so that it can be analyzed later. We consider the most popular approach to representing text information that arrives dynamically in the databases of information systems, namely, the vector space model. Let $S = \{s_1, \ldots, s_n\}$ be a collection of spam messages and $T = \{t_1, \ldots, t_m\}$ be the set of terms (spam keywords) in the collection. In the vector model, any message can be represented as a point in $m$-dimensional space, where $m$ is the number of terms. Each spam message is identified with the weighted vector

$$s_i = [w_{i1}, \ldots, w_{im}], \tag{1}$$

where $w_{ij}$ is the weight of term $t_j$ in spam message $s_i$, $i = 1, \ldots, n$, $j = 1, \ldots, m$, and $n$ is the number of spam messages in the collection. There are different ways to calculate the weights of terms [34]. We use the most popular one, the TF*IDF (term frequency-inverse document frequency) weighting method, to determine the weight $w_{ij}$. This method considers the frequency of occurrence of a term in the messages of the sample and its discriminating ability. In the TF*IDF weighting scheme, the weight of term $t_j$ in message $s_i$ is calculated by the formula

$$w_{ij} = n_{ij} \log\frac{n}{n_j}, \tag{2}$$

where $n_{ij}$ is the frequency of term $t_j$ in spam message $s_i$ and $n_j$ is the number of spam messages containing the term $t_j$.
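The weighting scheme (2) can be illustrated with a minimal sketch. The function name `tfidf_weights`, the toy messages, and the choice of the natural logarithm are assumptions made for illustration; the paper does not specify tokenization or the base of the logarithm.

```python
import math
from collections import Counter

def tfidf_weights(messages):
    """Return (terms, W) with W[i][j] = n_ij * log(n / n_j), following (2).

    `messages` is a list of token lists (tokenization is assumed done).
    """
    n = len(messages)
    counts = [Counter(tokens) for tokens in messages]   # n_ij for every message
    terms = sorted({t for c in counts for t in c})       # term set T, sorted for stable order
    # n_j: number of messages that contain term t_j
    doc_freq = {t: sum(1 for c in counts if t in c) for t in terms}
    weights = [[c[t] * math.log(n / doc_freq[t]) for t in terms] for c in counts]
    return terms, weights

# Hypothetical toy collection of three tokenized spam messages.
messages = [
    ["cheap", "pills", "cheap"],
    ["cheap", "loans"],
    ["win", "prize", "prize"],
]
terms, W = tfidf_weights(messages)
```

A term that occurs in every message gets weight zero under this scheme, since $\log(n/n_j) = 0$ when $n_j = n$.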

After the spam messages have been represented, one should define a similarity measure between them. The similarity measure can be defined by one of the following metrics: cosine measure, Euclidean distance, or Jaccard measure. In this paper, the similarity between spam messages $s_i$ and $s_j$ is defined by the cosine measure, which is the cosine of the angle between the vectors $s_i$ and $s_j$ [31]:

$$\mathrm{sim}(s_i, s_j) = \frac{\sum_{l=1}^{m} w_{il} w_{jl}}{\sqrt{\sum_{l=1}^{m} w_{il}^2 \cdot \sum_{l=1}^{m} w_{jl}^2}}, \quad i, j = 1, \ldots, n. \tag{3}$$
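The cosine measure (3) can be sketched in a few lines. The name `cosine_sim` and the convention of returning 0 for an all-zero vector are assumptions; the paper does not discuss zero vectors.

```python
import math

def cosine_sim(a, b):
    """Cosine of the angle between weight vectors a and b, as in (3)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    # Assumption: a message with an all-zero vector is similar to nothing.
    return dot / norm if norm > 0 else 0.0

same = cosine_sim([1, 0], [1, 0])        # identical direction
orthogonal = cosine_sim([1, 0], [0, 1])  # no shared terms
```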

Once spam messages are represented in the vector model and a metric for the similarity between spam messages is chosen, the proposed content analysis algorithm can be applied. The algorithm consists of the following steps:
(1) clustering method,
(2) algorithm for solving the clustering problem,
(3) classification of collected spam messages,
(4) knowledge extraction from classes.

Below the detailed description of each step is given.

4. Clustering Method

Data clustering is the process of dividing data elements into classes or clusters so that items in the same class are as similar as possible, and items in different classes are as dissimilar as possible. Depending on the nature of the data and the purpose for which clustering is being used, different measures of similarity may be used to place items into classes, where the similarity measure controls how the clusters are formed. Two types of clustering are defined: hard or fuzzy clustering. In fuzzy clustering (also referred to as soft clustering), data elements can belong to more than one cluster, and associated with each element is a set of membership levels. In hard clustering, data is divided into distinct clusters, where each data element belongs to exactly one cluster [35].

The goal of clustering is to split the set $S = \{s_1, \ldots, s_n\}$ into nonoverlapping clusters $C = \{C_1, C_2, \ldots, C_q\}$, $q > 1$, so as to maintain maximum similarity between the messages of one cluster, which correspond to a certain semantic subject, and maximum distinction between clusters. That is, the following conditions of hard clustering should hold:

$$C_p \neq \emptyset \ \text{for } p = 1, \ldots, q; \qquad C_p \cap C_z = \emptyset \ \text{for } p \neq z,\ p, z = 1, \ldots, q; \qquad \bigcup_{p=1}^{q} C_p = S. \tag{4}$$

Let us introduce the following notation:

$$O_{k\mathrm{NN}}(s_i) = \{\, s_j \mid \mathrm{sim}(s_i, s_j) \ge \mathrm{sim}(s_i, s_i^k) \,\} \tag{5}$$

is the set of $k$ nearest neighbors of spam message $s_i$, where $s_i^k$ is the $k$th nearest neighbor of $s_i$, and

$$u_{ij} = \begin{cases} 1 & \text{if } s_j \in O_{k\mathrm{NN}}(s_i), \\ 0 & \text{otherwise,} \end{cases} \qquad v_{ij} = \begin{cases} 1 & \text{if } s_i \in O_{k\mathrm{NN}}(s_j), \\ 0 & \text{otherwise.} \end{cases} \tag{6}$$

If $u_{ij} = 1$ and $v_{ij} = 1$, then $s_i$ and $s_j$ are mutual nearest neighbors. If $u_{ij} = 1$ and $v_{ij} = 0$, or $u_{ij} = 0$ and $v_{ij} = 1$, then $s_i$ and $s_j$ are nearest neighbors. If $u_{ij} = 0$ and $v_{ij} = 0$, then $s_i$ and $s_j$ are not nearest neighbors.
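The neighbor sets (5) and the indicators (6) can be sketched as follows. The helper names, the exclusion of a message from its own neighbor set, and the toy similarity matrix are assumptions made for illustration; the sketch also assumes $k \le n - 1$.

```python
def knn_sets(sim, k):
    """For each message i, return the index set of its k nearest neighbors, as in (5).

    Ties with the k-th neighbor are all included, following the >= in (5).
    Assumption: a message is not counted as its own neighbor.
    """
    n = len(sim)
    neighbors = []
    for i in range(n):
        others = sorted((sim[i][j] for j in range(n) if j != i), reverse=True)
        threshold = others[k - 1]  # similarity to the k-th nearest neighbor
        neighbors.append({j for j in range(n) if j != i and sim[i][j] >= threshold})
    return neighbors

def neighbor_indicators(sim, k):
    """Return the 0/1 matrices u and v defined in (6)."""
    nn = knn_sets(sim, k)
    n = len(sim)
    u = [[1 if j in nn[i] else 0 for j in range(n)] for i in range(n)]
    v = [[1 if i in nn[j] else 0 for j in range(n)] for i in range(n)]
    return u, v

# Hypothetical symmetric similarity matrix for four messages.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
u, v = neighbor_indicators(sim, k=1)
```

With $k = 1$ in this toy matrix, messages 1 and 2 are mutual nearest neighbors, as are messages 3 and 4. By construction $v$ is the transpose of $u$.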

Let $x_{ip}$ be a Boolean variable that is equal to 1 if the spam message $s_i$ belongs to cluster $C_p$ and to 0 otherwise:

$$x_{ip} = \begin{cases} 1 & \text{if } s_i \in C_p, \\ 0 & \text{if } s_i \notin C_p, \end{cases} \quad i = 1, \ldots, n;\ p = 1, \ldots, q. \tag{7}$$

Taking into account the above notation, the criterion function of clustering can be defined as follows:

$$f(x) = \sum_{p=1}^{q} \sum_{i=1}^{n} \sum_{j=1}^{n} (u_{ij} + v_{ij})\, \mathrm{sim}(s_i, s_j)\, x_{ip} x_{jp} \longrightarrow \max. \tag{8}$$

As the clusters are nonoverlapping, that is, each of the $n$ messages belongs to only one of the $q$ clusters, the following condition should be satisfied:

$$\sum_{p=1}^{q} x_{ip} = 1, \quad i = 1, \ldots, n. \tag{9}$$

On the other hand, it is supposed that each cluster contains more than one spam message and does not contain all spam messages:

$$1 < \sum_{i=1}^{n} x_{ip} < n, \quad p = 1, \ldots, q, \tag{10}$$

where

$$x_{ip} \in \{0, 1\} \ \text{for any } i, p. \tag{11}$$

So, the clustering problem for spam messages is formalized as the Boolean quadratic programming problem (8)–(11), whose solution provides nonoverlapping clusters. Problems of this kind are NP-complete, so finding an exact solution requires considerable time and computing resources. Such a problem can be solved with algorithms such as the genetic algorithm, the differential evolution algorithm, the particle swarm optimization algorithm, artificial bee colony optimization, the ant colony algorithm, and neural networks. Because the number of spam messages is huge and the collection is replenished dynamically, solving the problem is computationally expensive; therefore, a genetic algorithm is proposed for solving the problem (8)–(11).
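The criterion function (8) and the constraints (9)–(11) can be evaluated directly for a given assignment matrix. The sketch below is a naive $O(q n^2)$ evaluation; the function names and the toy matrices are assumptions made for illustration.

```python
def objective(x, u, v, sim):
    """Criterion function f(x) from (8) for an n-by-q 0/1 assignment matrix x."""
    n, q = len(x), len(x[0])
    return sum(
        (u[i][j] + v[i][j]) * sim[i][j] * x[i][p] * x[j][p]
        for p in range(q) for i in range(n) for j in range(n)
    )

def is_feasible(x):
    """Check constraints (9), (10), and (11)."""
    n, q = len(x), len(x[0])
    binary = all(val in (0, 1) for row in x for val in row)           # (11)
    one_cluster_each = all(sum(row) == 1 for row in x)                # (9)
    sizes_ok = all(1 < sum(x[i][p] for i in range(n)) < n
                   for p in range(q))                                 # (10)
    return binary and one_cluster_each and sizes_ok

# Hypothetical data: four messages, k = 1 neighbor indicators, two clusters.
sim = [
    [1.0, 0.9, 0.1, 0.2],
    [0.9, 1.0, 0.3, 0.1],
    [0.1, 0.3, 1.0, 0.8],
    [0.2, 0.1, 0.8, 1.0],
]
u = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
v = [[0, 1, 0, 0], [1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]

x_good = [[1, 0], [1, 0], [0, 1], [0, 1]]   # mutual neighbors kept together
x_bad = [[1, 0], [0, 1], [1, 0], [0, 1]]    # mutual neighbors split apart
f_good = objective(x_good, u, v, sim)
f_bad = objective(x_bad, u, v, sim)
```

In this toy case the partition that keeps mutual nearest neighbors together scores higher than the one that separates them, which is the behavior the criterion is meant to reward.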

5. Genetic Algorithm for Solving the Clustering Problem

Genetic algorithms are powerful tools for solving high-dimensional problems, but they do not guarantee the optimality of the solution found. In genetic algorithms, the first step is to encode solutions as chromosomes, and the encoding depends on the character of the problem being solved. Therefore, before a genetic algorithm is used, solutions of the problem must first be designed in the form of a chromosome. Given the character of the problem (8)–(11), a chromosome in the population is represented as $X = (x_{11}, \ldots, x_{1q}, x_{21}, \ldots, x_{2q}, \ldots, x_{n1}, \ldots, x_{nq})$, where the genes (variables) $x_{ip}$ ($i = 1, \ldots, n$; $p = 1, \ldots, q$) take the values 0 or 1 according to (11). With this encoding, the size of a chromosome is $n \cdot q$, where the first $q$ positions correspond to the first spam message, the following $q$ positions to the second spam message, and so on. For example, for $n = 7$ and $q = 3$, the encoding $X = (0,0,1,1,0,0,0,1,0,1,0,0,1,0,0,0,1,0,0,0,1)$ describes the clustering in which spam messages $s_1$ and $s_7$ belong to cluster $C_3$ ($x_{13} = x_{73} = 1$); spam messages $s_2$, $s_4$, and $s_5$ belong to cluster $C_1$ ($x_{21} = x_{41} = x_{51} = 1$); and spam messages $s_3$ and $s_6$ belong to cluster $C_2$ ($x_{32} = x_{62} = 1$).
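The binary encoding can be checked by decoding the example chromosome. The helper `decode_binary` is a hypothetical name; returning `None` for a block that is not one-hot is an assumption used here to flag a violation of (9).

```python
def decode_binary(chromosome, n, q):
    """Map a length n*q 0/1 chromosome to one cluster number (1..q) per message.

    A block of q genes that does not contain exactly one 1 violates (9);
    such a message is reported as None.
    """
    labels = []
    for i in range(n):
        block = chromosome[i * q:(i + 1) * q]   # genes x_i1, ..., x_iq
        labels.append(block.index(1) + 1 if sum(block) == 1 else None)
    return labels

# The example chromosome from the text (n = 7, q = 3).
X = [0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1]
labels = decode_binary(X, n=7, q=3)
# labels[i] is the cluster of message s_(i+1): s1 -> C3, s2 -> C1, s3 -> C2, ...
```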

It should be noted that genetic algorithms are easily applied to unconstrained optimization problems, but when solving constrained problems they face the problem of infeasible solutions. Infeasible solutions are solutions that violate the constraints (in our case, conditions (9) and (10)). When solving a problem with a genetic algorithm, it is most important to maintain the feasibility of solutions continuously while the algorithm runs, that is, to keep chromosomes from violating the constraints of the problem. There are different approaches to preventing the occurrence of infeasible chromosomes [36–38]. In this paper, the penalty functions method is applied so that this continuous maintenance of feasibility is not needed. The penalty functions method is very effective in constrained optimization problems [39, 40].

The idea of the penalty functions method is to worsen the fitness value whenever an infeasible chromosome occurs. If a minimization problem is considered, the penalty function should sharply increase the fitness value of an infeasible chromosome. Conversely, if a maximization problem is considered, the penalty function should be constructed so that it sharply reduces the fitness value of an infeasible chromosome. Before constructing the penalty functions, we introduce the following notation:

$$u(x_{i\bullet}) = \sum_{p=1}^{q} x_{ip}, \quad i = 1, \ldots, n; \qquad v(x_{\bullet p}) = \sum_{i=1}^{n} x_{ip}, \quad p = 1, \ldots, q. \tag{12}$$

As the problem (8)–(11) is a maximization problem, the penalty function should sharply reduce the fitness value when an infeasible chromosome occurs.

Taking this into account, the penalty functions are defined as follows:

$$U(x) = \prod_{i=1}^{n} e^{-\alpha\, |u(x_{i\bullet}) - 1|}, \tag{13}$$

$$V(x) = \prod_{p=1}^{q} e^{-\alpha\, h(v(x_{\bullet p}))}, \tag{14}$$

where $\alpha \ge 1$ is a deterioration coefficient and

$$h(t) = \begin{cases} 0 & \text{if } 1 < t < n, \\ 2 - t & \text{if } t \le 1, \\ 1 & \text{if } t = n. \end{cases} \tag{15}$$

The function $U(x)$ in (13) prevents the occurrence of infeasible chromosomes that violate condition (9), and the function $V(x)$ in (14) prevents the occurrence of infeasible chromosomes that violate condition (10).

It is easy to show that the functions (13) and (14) have the following properties:
(i) if condition (9) is satisfied, then $U(x) = 1$;
(ii) if condition (9) is not satisfied for some $i$, then $U(x) \le e^{-\alpha}$;
(iii) if condition (10) is satisfied, then $V(x) = 1$;
(iv) if condition (10) is not satisfied for some $p$, then $V(x) \le e^{-\alpha}$.

Hence, if both conditions (9) and (10) are satisfied, that is, the solution is feasible, then $U(x)V(x) = 1$; conversely, if at least one of these conditions is not satisfied, that is, the solution is infeasible, then $U(x)V(x) \le e^{-\alpha}$.

Considering these properties of the penalty functions and multiplying the criterion function (8) by $U(x)V(x)$, the constrained problem (8)–(11) can be reduced to the following unconstrained problem:

$$E(x) = f(x)\, U(x)\, V(x) \longrightarrow \max. \tag{16}$$
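The penalty functions (13)–(15) and the penalized criterion (16) can be sketched as follows. The function names, the default $\alpha = 1$, and the toy assignment matrices are assumptions made for illustration; `math.prod` requires Python 3.8 or later.

```python
import math

def h(t, n):
    """Piecewise term from (15) for a cluster of size t out of n messages."""
    if 1 < t < n:
        return 0
    if t <= 1:
        return 2 - t
    return 1  # remaining case for a 0/1 matrix: t == n

def penalty_U(x, alpha=1.0):
    """U(x) from (13): equals 1 when every row of x sums to 1."""
    return math.prod(math.exp(-alpha * abs(sum(row) - 1)) for row in x)

def penalty_V(x, alpha=1.0):
    """V(x) from (14): equals 1 when every cluster size satisfies (10)."""
    n, q = len(x), len(x[0])
    return math.prod(
        math.exp(-alpha * h(sum(x[i][p] for i in range(n)), n))
        for p in range(q)
    )

def penalized_fitness(f_value, x, alpha=1.0):
    """E(x) = f(x) U(x) V(x) from (16); f_value is the precomputed f(x)."""
    return f_value * penalty_U(x, alpha) * penalty_V(x, alpha)

x_ok = [[1, 0], [1, 0], [0, 1], [0, 1]]        # feasible
x_rows = [[1, 1], [1, 0], [0, 0], [0, 1]]      # rows 1 and 3 violate (9)
x_single = [[1, 0], [1, 0], [1, 0], [0, 1]]    # cluster 2 has one member, violates (10)
```

For the feasible matrix both penalties equal 1, so the criterion value passes through unchanged; each violated row or cluster multiplies it by a factor of at most $e^{-\alpha}$.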

Alternatively, the chromosome can be designed as a row $X = (y_1, y_2, \ldots, y_n)$ of length $n$, where the alleles $y_i$ are cluster numbers taking values from the set $\{1, 2, \ldots, q\}$ and the loci (positions of genes) correspond to the numbers of the spam messages. With this representation, the solution of the problem is defined as

$$x_{ip} = \begin{cases} 1 & \text{if } y_i = p, \\ 0 & \text{if } y_i \neq p, \end{cases} \quad i = 1, \ldots, n;\ p = 1, \ldots, q. \tag{17}$$

For example, $X = (2, 3, 1, 3, 4, 2, 3)$ corresponds to the partition in which spam messages $s_1$ and $s_6$ belong to cluster $C_2$ ($x_{12} = x_{62} = 1$), spam messages $s_2$, $s_4$, and $s_7$ belong to cluster $C_3$ ($x_{23} = x_{43} = x_{73} = 1$), spam message $s_3$ belongs to cluster $C_1$ ($x_{31} = 1$), and spam message $s_5$ belongs to cluster $C_4$ ($x_{54} = 1$).

Note that with this design of chromosomes condition (9) is always satisfied. In [37], it has been shown that applying such operators to feasible chromosomes does not lead to infeasible chromosomes. This is effective only when the initial population is generated in compliance with condition (10). Since the number of spam messages to be clustered is much greater than the number of clusters ($n \gg q$), the equality $y_{i_1} = y_{i_2}$ holds for some $i_1 \neq i_2$, and the probability that infeasible chromosomes occur when the initial population is generated is very high. Thus, the time spent on enforcing condition (10) while generating the initial population can be comparable with the time needed to solve the problem itself. Therefore, it is reasonable to apply the penalty functions method. Since condition (9) is always satisfied for this encoding, the criterion function takes the form

$$E(x) = f(x)\, V(x) \longrightarrow \max. \tag{18}$$
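The label encoding and the fact that it satisfies condition (9) by construction can be sketched as follows. The helper `labels_to_matrix` is a hypothetical name, and the conversion rule assumed here is that $x_{ip} = 1$ exactly when the $i$th allele equals $p$.

```python
def labels_to_matrix(labels, q):
    """Expand a label chromosome (y_1, ..., y_n) into the n-by-q 0/1 matrix x."""
    return [[1 if y == p else 0 for p in range(1, q + 1)] for y in labels]

# The example chromosome from the text (n = 7, q = 4).
X = [2, 3, 1, 3, 4, 2, 3]
x = labels_to_matrix(X, q=4)

# Every row has exactly one 1, so condition (9) needs no penalty term.
row_sums = [sum(row) for row in x]
# Cluster sizes still have to be checked against (10) through V(x).
cluster_sizes = [sum(row[p] for row in x) for p in range(4)]
```

Only the cluster sizes remain to be controlled, which is why the penalized criterion for this encoding involves $V(x)$ alone.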

The selection operator is proportional selection, in which a chromosome from the current population $Z = (X_1, X_2, \ldots, X_D)$ is selected with the probability

$$z_d = \frac{F(X_d)}{\sum_{l=1}^{D} F(X_l)}, \quad d = 1, \ldots, D. \tag{19}$$

Here $D$ is the size of the population and $F(X_d)$ is the fitness value of chromosome $X_d$. The fitness function depends on the character of the problem being solved. Since in our case the aim is to maximize the function $E(x)$, a chromosome with a larger value of the criterion function $E(x)$ should have a better chance of surviving into the next generation. According to (19), this means that a chromosome with a larger value of $E(x)$ should have a larger fitness value. Taking this into account, the fitness value $F(X_d)$ of chromosome $X_d$ is defined as

$$F(X_d) = E(X_d). \tag{20}$$
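Proportional (roulette-wheel) selection according to (19) and (20) can be sketched with the standard library. The function names are hypothetical, and the sketch assumes nonnegative fitness values with a positive sum.

```python
import random

def selection_probabilities(fitness):
    """Probabilities z_d = F(X_d) / sum of all F, as in (19)."""
    total = sum(fitness)
    return [f / total for f in fitness]

def select(population, fitness, rng=random):
    """Draw one chromosome with probability proportional to its fitness."""
    probs = selection_probabilities(fitness)
    return rng.choices(population, weights=probs, k=1)[0]

# Hypothetical population of three label chromosomes with their E(x) values.
population = [[1, 1, 2, 2], [1, 2, 1, 2], [2, 2, 1, 1]]
fitness = [6.0, 1.0, 3.0]
probs = selection_probabilities(fitness)
parent = select(population, fitness, random.Random(0))
```

Because the penalty multiplies the fitness of an infeasible chromosome by at most $e^{-\alpha}$, such a chromosome receives a correspondingly small share of the selection probability.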

So, when the penalty functions method is applied, the infeasible solutions generated by the operators of the genetic algorithm are eliminated as fitness values are ranked, and feasible solutions have a better chance of surviving; that is, the penalty functions method accelerates the convergence of the genetic algorithm. This is because, when infeasible chromosomes occur, the penalty functions method does not require additional operations (rolling a chromosome back to its previous state, repairing infeasible chromosomes, etc.). It should be noted that any type of crossover and mutation operator can be used together with the penalty functions.

The stopping criterion, which is an important step of a genetic algorithm, should now be defined.

The maximization of compactness causes the points in each cluster to be very similar to the corresponding center. Therefore, we define the coordinates of the cluster centers. The $j$th coordinate of the center $O_p$ of cluster $C_p$ is calculated by the formula

$$o_{pj} = \frac{1}{n_p} \sum_{s_i \in C_p} w_{ij}, \quad p = 1, \ldots, q;\ j = 1, \ldots, m, \tag{21}$$

where $n_p$ is the number of points in cluster $C_p$. Obviously, $\sum_{p=1}^{q} n_p = n$.

The compactness of cluster $C_p$ is calculated by the following formula:

$$r_p = \frac{1}{n_p} \sum_{i=1}^{n} \mathrm{sim}(s_i, O_p)\, x_{ip}. \tag{22}$$

The average similarity of cluster $C_p$ to the other clusters is defined as follows:

$$R_p = \frac{1}{q - 1} \sum_{\substack{z = 1 \\ z \neq p}}^{q} \mathrm{sim}(O_p, O_z), \quad p = 1, \ldots, q, \tag{23}$$

where $\mathrm{sim}(O_p, O_z)$ is the similarity between the centers of clusters $C_p$ and $C_z$.

If the condition $\max_p (R_p / r_p) < 1$ is satisfied, then the genetic algorithm stops.
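The stopping test built from (21)–(23) can be sketched as follows. The function names and toy vectors are assumptions made for illustration; the sketch assumes that every cluster is nonempty, that $q > 1$, and that each $r_p$ is positive.

```python
import math

def cosine_sim(a, b):
    """Cosine measure (3); zero vectors are assumed to have similarity 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def cluster_centers(W, labels, q):
    """Centers O_p from (21): coordinate-wise mean of the members of C_p."""
    centers = []
    for p in range(1, q + 1):
        members = [w for w, y in zip(W, labels) if y == p]
        centers.append([sum(col) / len(members) for col in zip(*members)])
    return centers

def should_stop(W, labels, q):
    """True when max_p (R_p / r_p) < 1, with r_p from (22) and R_p from (23)."""
    centers = cluster_centers(W, labels, q)
    ratios = []
    for p in range(q):
        members = [w for w, y in zip(W, labels) if y == p + 1]
        r_p = sum(cosine_sim(w, centers[p]) for w in members) / len(members)
        R_p = sum(cosine_sim(centers[p], centers[z])
                  for z in range(q) if z != p) / (q - 1)
        ratios.append(R_p / r_p)
    return max(ratios) < 1

# Hypothetical weight vectors for four messages over two terms.
W = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
stop_good = should_stop(W, [1, 1, 2, 2], q=2)   # compact, well-separated clusters
stop_bad = should_stop(W, [1, 2, 1, 2], q=2)    # mixed clusters with coinciding centers
```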

6. Classification Using the $k$NN Method

Because the collection of spam messages changes permanently and is replenished with new types of spam messages after clustering, the clusters need to be maintained. New messages arriving in the collection should therefore be classified. For classification, the $k$NN method is used. The $k$NN method is used in many problems to determine the class to which an object belongs. This classification method relies on an already available set of classified objects. In our case, the objects are spam messages; denoting each new spam message arriving in the collection by $s_{n+1}$, we determine its $k$ nearest spam messages, which already belong to one of the classes.

For each cluster $C_p$, the $k$NN classifier calculates the relevance score

$$\mathrm{score}(s_{n+1}, C_p) = \sum_{s' \in O_{k\mathrm{NN}}(s_{n+1}) \cap S_p} \cos(s_{n+1}, s') = \sum_{i \in I_{k\mathrm{NN}}(s_{n+1})} \cos(s_{n+1}, s_i)\, x_{ip}, \tag{24}$$

where $O_{k\mathrm{NN}}(s_{n+1})$ and $I_{k\mathrm{NN}}(s_{n+1})$ are, respectively, the set of $k$ nearest neighbors of the spam message $s_{n+1}$ and the set of their indexes, $S_p$ is the set of spam messages in cluster $C_p$, and, as in (7),

$$x_{ip} = \begin{cases} 1 & \text{if } s_i \in C_p, \\ 0 & \text{if } s_i \notin C_p. \end{cases} \tag{25}$$

The spam message $s_{n+1}$ is assigned to the class $C_p$ for which the value $\mathrm{score}(s_{n+1}, C_p)$ is maximal. If $\mathrm{score}(s_{n+1}, C_p) < \theta$ for this class, where $\theta$ is a predefined threshold, then the spam message does not belong to any of the clusters $C_p$, and a new cluster $C_{q+1}$ is created.
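The scoring rule (24) and the threshold test can be sketched as follows. The function names, the toy vectors, and the values of $k$ and $\theta$ are assumptions made for illustration; the sketch keeps exactly $k$ neighbors and breaks ties arbitrarily.

```python
import math

def cosine_sim(a, b):
    """Cosine measure (3); zero vectors are assumed to have similarity 0."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0

def classify(new_vec, W, labels, q, k, theta):
    """Return the cluster number for a new message, or q + 1 for a new cluster.

    W holds the vectors of already clustered messages, labels their clusters (1..q).
    """
    # k most similar clustered messages, each paired with its cluster label
    nearest = sorted(((cosine_sim(new_vec, w), y) for w, y in zip(W, labels)),
                     reverse=True)[:k]
    scores = [0.0] * q
    for s, y in nearest:
        scores[y - 1] += s            # score(s_new, C_y) from (24)
    best = max(range(q), key=lambda p: scores[p])
    if scores[best] < theta:          # no cluster is relevant enough
        return q + 1
    return best + 1

# Hypothetical clustered collection over three terms.
W = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 1.0, 0.0], [0.1, 0.9, 0.0]]
labels = [1, 1, 2, 2]
near_first = classify([1.0, 0.1, 0.0], W, labels, q=2, k=3, theta=0.5)
near_second = classify([0.1, 1.0, 0.0], W, labels, q=2, k=3, theta=0.5)
novel = classify([0.0, 0.0, 1.0], W, labels, q=2, k=3, theta=0.5)
```

The third message shares no terms with the collection, so every score is zero and it opens a new cluster.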

7. Knowledge Extraction from Classes Using Summarization Technique

At this stage, after the messages have been clustered by solving the problem (8)–(11), it is necessary to determine the themes and descriptions of the clusters. To obtain information about the clusters, a document summarization method is applied. In our case, the documents are the spam messages that belong to the same theme and are in the same cluster. It is necessary to extract the content from these sets of spam messages, deleting unnecessary information and taking into account what is similar and what differs in the content, and to present the most important information in condensed form. The multidocument summarization method described in [30] can therefore help to find the informative sentences in each cluster. Multidocument summarization is the process of automatically creating a compressed version of a set of documents that gives the user useful information. At the first stage, spam messages are clustered in order to determine thematic sections. At the second stage, sentences are ranked in order to determine their informativeness and to extract the informative ones. This makes it possible to determine the representative sentences and their number for each thematic section while avoiding redundancy in the summary.

The representativeness of a sentence is defined by its similarity to the corresponding cluster centroid; that is, the smaller the Euclidean distance between a sentence and the corresponding cluster centroid, the more representative the sentence is. To include sentences in the summary, they are ranked in ascending order of their distance to the corresponding cluster centroid. In this paper, each cluster consists of thematically close messages. Some messages contain many sentences and hence form the main content of the cluster. Other themes may be mentioned briefly to complement the main subjects. Hence, the numbers of sentences in different clusters differ. This approach gives maximum coverage of the main content of the cluster and avoids redundancy.

In general, the number of sentences included in the summary depends on the compression factor. The compression factor $\alpha_{\mathrm{comp}}$ is determined by the lengths of the summary and of the message,

$$\alpha_{\mathrm{comp}} = \frac{\mathrm{len}(\mathrm{summ})}{\mathrm{len}(\text{E-mail})}, \tag{26}$$

and is the main factor influencing the quality of the summary, where $\mathrm{len}(\mathrm{summ})$ and $\mathrm{len}(\text{E-mail})$ are the lengths of the summary and of the message, respectively. At a small value of the compression factor, the summary is shorter and the main part of the information is lost. At a large value of the compression factor, the summary is extensive but contains insignificant sentences.

Considering the above, the number of representative sentences $N_p$ selected from each cluster $C_p$ is calculated by the following formula:

$$N_p = \mathrm{INT}\left[\frac{\mathrm{len}(C_p) \cdot \alpha_{\mathrm{comp}}}{\mathrm{len}_{\mathrm{aver}}}\right], \quad p = 1, \ldots, q, \tag{27}$$

where $\mathrm{len}(C_p)$ is the length of cluster $C_p$, $\mathrm{len}_{\mathrm{aver}} = \mathrm{len}(\text{E-mail})/m$ is the average length of the sentences in a message (here $m$ denotes the number of sentences in the message), and $\mathrm{INT}[\cdot]$ denotes the integer part.
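Formula (27) can be illustrated with a short worked example. The function name, the cluster lengths, the compression factor, and the average sentence length are all hypothetical values chosen for illustration; lengths are assumed to be measured in words, which the paper does not specify.

```python
def sentences_per_cluster(cluster_lengths, alpha_comp, avg_sentence_len):
    """N_p = INT[len(C_p) * alpha_comp / len_aver] for every cluster, as in (27)."""
    return [int(length * alpha_comp / avg_sentence_len) for length in cluster_lengths]

# Hypothetical cluster lengths in words, compression factor 0.25,
# and an average sentence length of 15 words.
N = sentences_per_cluster([1200, 450, 90], alpha_comp=0.25, avg_sentence_len=15)
```

Larger clusters contribute more sentences to the summary, which matches the statement that the numbers of sentences in different clusters differ.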

8. Conclusion and Future Work

In this paper, the problem of clustering a collection of spam messages is formalized. The criterion function maximizes the similarity between messages within clusters, where similarity is defined using the $k$-nearest neighbor algorithm. A genetic algorithm with a penalty function is proposed for solving the clustering problem. Classification of new spam messages arriving in the databases of the antispam system is also described. After classification, knowledge extraction from the resulting classes is considered. A multidocument summarization method is applied to extract knowledge from the clusters. The information extracted from the clusters and the dependence of the themes of spam messages on their origin can also help in detecting social networks of spammers, if they exist.

Although many works propose methods for classifying E-mails into spam and nonspam, we are not aware of any previous scientific work or experimental study on classifying spam messages into thematic groups. We therefore plan to carry out experiments on this subject in future work, in particular to compare the efficiency of the clustering method and genetic algorithm used in this paper with that of other methods.

Acknowledgment

The authors would like to express their appreciation to the anonymous reviewers for their very useful comments and suggestions.