Abstract

Fraud activities have contributed to heavy losses suffered by telecommunication companies. In this paper, we attempt to use the Gaussian mixed model, a probabilistic model normally used in speech recognition, to identify fraud calls in the telecommunication industry. We look at several issues encountered when calculating the maximum likelihood estimates of the Gaussian mixed model using the Expectation Maximization algorithm. Firstly, we look at a mechanism for determining the initial number of Gaussian components and the initial values of the algorithm using the kernel method. We show via simulation that this technique improves the performance of the algorithm. Secondly, we develop a procedure for determining the order of the Gaussian mixed model using the log-likelihood function and the Akaike information criterion. Finally, for illustration, we apply the improved algorithm to real telecommunication data. The modified method will pave the way for a comprehensive method for detecting fraud calls in future work.

1. Introduction

Every year telecommunication companies register losses amounting to millions of dollars due to fraud activities. Examples of such activities given by [1, 2] include the use of a customer's line for Premium Rate Services without the customer's knowledge, or with autodialers and no intention of paying for the outgoing calls; the misuse of a PABX for international calls; an unregistered user accessing the network with an assigned number (an activity known as stolen line unknown); and international roaming manipulation. Vendors, seeing the above as an opportunity not to be missed, compete to provide data mining applications that can detect such activities effectively using various methods such as OLAP, deviation-based outlier detection, and hidden Markov models. The focus of this paper is the Gaussian mixed model, henceforth GMM.

A GMM is best known for providing a robust speaker representation for the difficult task of speaker identification on short-time speech spectra [3]. Here, its use is extended to detecting fraud activities based on the number (as well as the length) of domestic and international calls made on a daily basis during office, evening, and night hours. Reference [4] presented three approaches to fraud detection in communication networks: neural networks with supervised learning, probability density estimation methods, and Bayesian networks. Information describing a subscriber's behavior kept in toll tickets was used. For example, supervised learning used summary statistics over the whole observed time period (especially the number of times fraud activities were recorded in the data), while the latter two approaches used a subscriber's daily behavior. To improve the fraud detection system, they recommended combining the three presented methods and incorporating rule-based systems.

The maximum likelihood estimate for a GMM is generally difficult to obtain directly, but the task is made easier by the Expectation Maximization (EM) algorithm, first introduced by [5]. Since then, its use has increased significantly, especially for finding maximum likelihood estimates in probabilistic models. For example, [6, 7] developed an online system for detecting fraud calls using a hierarchical switching generative model. The model is trained by applying the EM algorithm to an incomplete data set and is further improved by a gradient-based discriminative method. In this paper, we propose an improved EM algorithm for the GMM for detecting fraud calls in telecommunications.

This paper is organized as follows. Section 2 gives an introduction to the GMM. We then describe the EM algorithm for a GMM, the kernel method, and the proposed modified EM algorithm for the GMM in Section 3. In Section 4, we study the performance of the modified algorithm in estimating the parameters and the effect of overlapping areas of Gaussian components in the GMM. In Section 5, we propose graphical plots to identify the "best" number of components in a GMM. For illustration, an application of the improved algorithm to a real data set is presented in Section 6.

2. Gaussian Mixed Model

Let $x_1, x_2, \ldots, x_n$ be the observed data and let $K$ be the number of components, where each component $k$ has its own prior probability $\pi_k$ and probability density function $f_k(x \mid \mu_k, \Sigma_k)$ with mean $\mu_k$ and covariance $\Sigma_k$. A Gaussian mixed model is then given by

$$f(x \mid \Theta) = \sum_{k=1}^{K} \pi_k f_k(x \mid \mu_k, \Sigma_k), \qquad \sum_{k=1}^{K} \pi_k = 1,$$

where $\Theta = (\pi_1, \ldots, \pi_K, \mu_1, \ldots, \mu_K, \Sigma_1, \ldots, \Sigma_K)$. We next define the likelihood function and the log-likelihood function by

$$L(\Theta \mid x) = \prod_{i=1}^{n} f(x_i \mid \Theta) \quad \text{and} \quad \ell(\Theta \mid x) = \sum_{i=1}^{n} \log\!\left( \sum_{k=1}^{K} \pi_k f_k(x_i \mid \mu_k, \Sigma_k) \right),$$

respectively. The maximum likelihood estimation (m.l.e.) method aims at finding the $\Theta$ that maximizes $\ell(\Theta \mid x)$; see [8]. This maximization is difficult to carry out directly. We use the Expectation Maximization (EM) algorithm to overcome this problem.
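As a concrete illustration of the model just defined, the following minimal Python sketch evaluates the mixture density and the log-likelihood of a univariate GMM. It is an illustrative addition; the function names and parameter layout are ours, not part of the original development.

```python
import numpy as np
from scipy.stats import norm

def gmm_density(x, weights, means, sds):
    """Mixture density f(x) = sum_k pi_k * N(x; mu_k, sigma_k^2)."""
    x = np.asarray(x, dtype=float)
    comps = [w * norm.pdf(x, loc=m, scale=s) for w, m, s in zip(weights, means, sds)]
    return np.sum(comps, axis=0)

def gmm_loglik(x, weights, means, sds):
    """Log-likelihood l(Theta) = sum_i log f(x_i | Theta)."""
    return float(np.sum(np.log(gmm_density(x, weights, means, sds))))

# Example: a two-component mixture evaluated at a few points.
ll = gmm_loglik([0.1, 0.5, 2.3], weights=[0.4, 0.6], means=[0.0, 2.0], sds=[1.0, 1.0])
```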

3. Expectation Maximization (EM) Algorithm

3.1. EM for Gaussian Mixed Model

In the general setup of the EM algorithm given in [5], the authors considered an unobservable variable $x$ in a sample space $\mathcal{X}$, which is indirectly observed through an observed variable $y$ in a sample space $\mathcal{Y}$. Assuming that $f(x \mid \phi)$ is the sampling density depending on the parameter $\phi$, the corresponding family of sampling densities for $y$, say $g(y \mid \phi)$, can be derived from

$$g(y \mid \phi) = \int_{\mathcal{X}(y)} f(x \mid \phi)\, dx, \quad (2)$$

where $\mathcal{X}(y)$ is the subset of $\mathcal{X}$ determined by the mapping $x \mapsto y(x)$ from $\mathcal{X}$ to $\mathcal{Y}$. The main objective of the EM algorithm is to find the value of $\phi$ that maximizes (2). Consider the expected value of the complete-data log-likelihood $\log f(x \mid \phi)$ given $y$ and the current parameter value $\phi'$, denoted by $Q(\phi \mid \phi')$, where

$$Q(\phi \mid \phi') = E\!\left[\log f(x \mid \phi) \mid y, \phi'\right],$$

with the expectation assumed to exist for all pairs $(\phi, \phi')$. According to [5], the EM iteration consists of two steps, namely, the E-step and the M-step. At the $(t+1)$th iteration, with the current estimate of $\phi$ denoted by $\phi^{(t)}$, the E-step computes $Q(\phi \mid \phi^{(t)})$ and the M-step finds a new estimate of $\phi$, say $\phi^{(t+1)}$, that maximizes $Q(\phi \mid \phi^{(t)})$. The steps are repeated until convergence is achieved.

For the case of a GMM, we define the complete data as $(x_i, z_i)$, $i = 1, \ldots, n$, where $z_i = (z_{i1}, \ldots, z_{iK})$ and $z_{ik} = 1$ if the $i$th sample is generated by the $k$th mixture component and $z_{ik} = 0$ otherwise. The expected complete-data log-likelihood is simplified, by applying, amongst others, the Bayes formula

$$P(k \mid x_i, \Theta) = \frac{\pi_k f_k(x_i \mid \mu_k, \Sigma_k)}{\sum_{j=1}^{K} \pi_j f_j(x_i \mid \mu_j, \Sigma_j)},$$

where $P(k \mid x_i, \Theta)$ is the posterior probability, $f_k(x_i \mid \mu_k, \Sigma_k)$ is the likelihood function, and $\pi_k$ is the prior probability, to the following equation (see [9, 10]):

$$Q(\Theta \mid \Theta^{(t)}) = \sum_{i=1}^{n}\sum_{k=1}^{K} \tau_{ik}^{(t)} \left[\log \pi_k + \log f_k(x_i \mid \mu_k, \Sigma_k)\right],$$

where

$$\tau_{ik}^{(t)} = P(k \mid x_i, \Theta^{(t)}) = \frac{\pi_k^{(t)} f_k(x_i \mid \mu_k^{(t)}, \Sigma_k^{(t)})}{\sum_{j=1}^{K} \pi_j^{(t)} f_j(x_i \mid \mu_j^{(t)}, \Sigma_j^{(t)})}. \quad (5)$$

Hence, the EM iteration for a GMM is defined by the following.

E-step: use (5).

M-step: use the following formulas, the first of which is derived using Lagrange multipliers (for details, see the appendices):

$$\pi_k^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n}\tau_{ik}^{(t)}, \qquad \mu_k^{(t+1)} = \frac{\sum_{i=1}^{n}\tau_{ik}^{(t)} x_i}{\sum_{i=1}^{n}\tau_{ik}^{(t)}}, \qquad \Sigma_k^{(t+1)} = \frac{\sum_{i=1}^{n}\tau_{ik}^{(t)}\left(x_i - \mu_k^{(t+1)}\right)\left(x_i - \mu_k^{(t+1)}\right)^{\top}}{\sum_{i=1}^{n}\tau_{ik}^{(t)}}. \quad (7)$$

The above steps (i.e., the E-step and the M-step) are repeated until convergence is achieved.
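To make the iteration explicit, here is a minimal sketch of a single EM pass for a univariate GMM, following the responsibilities in (5) and the updates in (7); the code is our illustration, not the authors' implementation.

```python
import numpy as np
from scipy.stats import norm

def em_step(x, weights, means, sds):
    """One EM iteration for a univariate GMM (E-step then M-step)."""
    x = np.asarray(x, dtype=float)
    # E-step: posterior responsibilities tau[i, k] as in (5).
    dens = np.array([w * norm.pdf(x, m, s) for w, m, s in zip(weights, means, sds)]).T
    tau = dens / dens.sum(axis=1, keepdims=True)
    # M-step: closed-form updates as in (7).
    nk = tau.sum(axis=0)
    new_weights = nk / len(x)
    new_means = (tau * x[:, None]).sum(axis=0) / nk
    new_vars = (tau * (x[:, None] - new_means) ** 2).sum(axis=0) / nk
    return new_weights, new_means, np.sqrt(new_vars)
```

Iterating `em_step` until the change in the log-likelihood falls below a small tolerance yields the m.l.e. discussed in Section 2.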

3.2. The Kernel Method

The kernel method can be used to find the probability density estimate for univariate data; see, for example, [11]. Let $x_1, x_2, \ldots, x_n$ be the data with $a \le x_i \le b$, $i = 1, \ldots, n$; let $\delta = (b - a)/M$ be the grid spacing for some integer $M = 2^r$, $r > 0$; and let $t_k = a + k\delta$ be the $k$th grid point, where $k = 0, 1, \ldots, M - 1$. The density estimate at a point $x$ with bandwidth $h$ is represented by the following equation:

$$\hat{f}(x) = \frac{1}{nh}\sum_{i=1}^{n} K\!\left(\frac{x - x_i}{h}\right),$$

where $K$ is a kernel function. For the Gaussian kernel, the density estimate is defined by

$$\hat{f}(x) = \frac{1}{nh\sqrt{2\pi}}\sum_{i=1}^{n}\exp\!\left(-\frac{(x - x_i)^2}{2h^2}\right),$$

where $K(u) = (2\pi)^{-1/2} e^{-u^2/2}$. To compute $\hat{f}$ at a grid of points, a method which makes use of the Fourier transform is employed. Let $\tilde{\zeta}(s)$ be the Fourier transform of the kernel density estimate $\hat{f}$. It can be shown that $\tilde{\zeta}(s) = e^{-h^2 s^2/2}\,\tilde{u}(s)$, where $e^{-h^2 s^2/2}$ is the Fourier transform of the Gaussian kernel and $\tilde{u}(s) = n^{-1}\sum_{i=1}^{n} e^{isx_i}$ is the Fourier transform of the data. Thus, $\hat{f}$ is the convolution of the data with the kernel.

We will use the following algorithm by [11] to discretize the data onto a very fine grid and to find $\hat{f}$ by convolving the data with the kernel.

Step A. Discretize the data to find the weight sequence $\{\xi_k\}$. If a data point $x$ falls between two grid points $t_k$ and $t_{k+1}$, it is split into a weight proportional to $(t_{k+1} - x)$ at $t_k$ and a weight proportional to $(x - t_k)$ at $t_{k+1}$; these weights are accumulated over all the data points to give a sequence of $M$ weights summing to unity.

Step B. Find the sequence $\{Y_l\}$ given by the discrete Fourier transform of the weights, $Y_l = \sum_{k=0}^{M-1} \xi_k e^{2\pi i k l / M}$, $l = -M/2, \ldots, M/2$. It can be shown that, when $\delta$ is small, $Y_l$ approximates the Fourier transform $\tilde{u}(s_l)$ of the data, where $s_l = 2\pi l/(b - a)$.

Step C. Find the sequence $\{\zeta_l^{*}\}$, where $\zeta_l^{*} = e^{-h^2 s_l^2/2}\, Y_l$. Here, $h = 0.9\,A\,n^{-1/5}$, where $A = \min(\hat{\sigma}, \mathrm{IQR}/1.34)$, $\hat{\sigma}$ is the standard deviation, and IQR is the interquartile range of the data. This choice of bandwidth follows [11], who claimed that it is useful for a wide range of densities.

Step D. Let $\hat{f}(t_k)$ be the inverse discrete Fourier transform of $\{\zeta_l^{*}\}$, that is, $\hat{f}(t_k) = \frac{1}{M}\sum_{l} \zeta_l^{*} e^{-2\pi i k l / M}$.

It can be shown that, as $\delta \to 0$, $\hat{f}(t_k)$ converges to the kernel density estimate at $t_k$. We then identify the grid points $t_k$ whose density estimates $\hat{f}(t_k)$ are greater than those of their nearest neighbours $t_{k-1}$ and $t_{k+1}$. In other words, $\hat{f}(t_k) > \hat{f}(t_{k-1})$ and $\hat{f}(t_k) > \hat{f}(t_{k+1})$; refer to Figure 1, where the vertical line that touches the curve shows the location of a peak. Note that we may obtain more than one maximum point, which means that the data may consist of more than one Gaussian component. These results form a very important part of the improved EM algorithm for GMM to be described next.
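The peak-finding idea above can be sketched as follows. For clarity, the kernel estimate is evaluated directly on the grid rather than through the FFT shortcut of [11], which produces the same curve more efficiently; the bandwidth follows the rule of thumb quoted in Step C. The function name and grid size are our own choices.

```python
import numpy as np

def kde_peaks(x, grid_size=512):
    """Gaussian kernel density estimate on a grid and the locations of its local maxima.

    The estimate is evaluated directly on the grid; the FFT-based algorithm of [11]
    gives the same curve more efficiently.
    """
    x = np.asarray(x, dtype=float)
    n = len(x)
    # Rule-of-thumb bandwidth: h = 0.9 * min(sd, IQR/1.34) * n^(-1/5).
    iqr = np.subtract(*np.percentile(x, [75, 25]))
    h = 0.9 * min(x.std(ddof=1), iqr / 1.34) * n ** (-0.2)
    t = np.linspace(x.min() - 3 * h, x.max() + 3 * h, grid_size)
    f = np.exp(-0.5 * ((t[:, None] - x[None, :]) / h) ** 2).sum(axis=1) / (n * h * np.sqrt(2 * np.pi))
    # A grid point is a peak if its estimate exceeds both nearest neighbours.
    peaks = np.where((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:]))[0] + 1
    return t[peaks], f, t
```

The number of returned peak locations gives the initial number of components, and the locations themselves serve as the initial means in Step 1 of the algorithm described next.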

3.3. Improved EM Algorithm for GMM

A number of authors have highlighted the importance of identifying the right number, say $K$, of components in a GMM and subsequently choosing good initial values for the model parameters $\pi_k$, $\mu_k$, and $\Sigma_k$, $k = 1, \ldots, K$, in the EM algorithm. Reference [12] noted the difficulty of using log-likelihood-ratio statistics to test the number of components and suggested using a nonparametric bootstrapping approach instead. Similarly, [13] pointed out the same concerns and introduced an algorithm called the stepwise-split-and-merge EM algorithm to solve the said problem. In addition, [14] investigated the possibility of estimating the number of components by minimizing the Kullback-Leibler distance between the fitted mixture model and the true density, where the said distance was estimated using cross-validation. Reference [15] viewed the mixture distribution as a contaminated Gaussian density and proposed a recursive algorithm called the Gaussian Mixture Density Decomposition algorithm for identifying each Gaussian component in the mixture. Other works on this topic can be found, for example, in [16, 17].

In this paper, we propose an improved EM algorithm for GMM which can perform both tasks: identifying the initial number of components and providing automatic initial values for the EM algorithm. The full improved EM algorithm for a GMM is now presented.

Step 1. The kernel method as described in Section 3.2 is used to determine the initial number, say $K_0$, of components and also the corresponding initial mean $\mu_k^{(0)}$ of each component, $k = 1, \ldots, K_0$. The initial estimates of the standard deviations are set to unity while the prior weights are set to $\pi_k^{(0)} = 1/K_0$.

Step 2. The EM algorithm for a GMM as described in Section 3.1 is executed to give the final estimates of the parameters $\hat{\pi}_k$, $\hat{\mu}_k$, and $\hat{\sigma}_k$, $k = 1, \ldots, K_0$. The log-likelihood function and the Akaike information criterion (AIC) are calculated using the said parameters.

Step 3. Step 2 is repeated for other possible numbers of components $K = K_0 + 1, K_0 + 2, \ldots$, with the initial standard deviations set to unity for the additional $K - K_0$ components and the prior weights set to $\pi_k^{(0)} = 1/K$.

Step 4. The log-likelihood function and AIC values for $K = K_0, K_0 + 1, \ldots$ are plotted. The final number of components is chosen at the point where adding extra components to the model no longer significantly increases the log-likelihood function or decreases the AIC.
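Putting Steps 1 to 4 together, a minimal sketch of the improved algorithm might look as follows. It reuses `kde_peaks`, `em_step`, and `gmm_loglik` from the earlier sketches, counts $3K - 1$ free parameters for a univariate $K$-component GMM when computing the AIC, and initializes the means of any additional components at zero purely for illustration; none of these coding choices are prescribed by the paper.

```python
import numpy as np

def fit_gmm_over_k(x, k_max=6, max_iter=200, tol=1e-6):
    """Run the kernel-initialized EM for K = K0, ..., k_max; record log-likelihood and AIC."""
    peaks, _, _ = kde_peaks(x)                  # Step 1: initial number of components and means
    k0 = len(peaks)
    results = {}
    for k in range(k0, k_max + 1):
        means = np.concatenate([peaks, np.zeros(k - k0)])  # extra components: illustrative zero means
        sds = np.ones(k)
        weights = np.full(k, 1.0 / k)
        ll_old = -np.inf
        for _ in range(max_iter):                # Step 2: EM iterations until convergence
            weights, means, sds = em_step(x, weights, means, sds)
            ll = gmm_loglik(x, weights, means, sds)
            if abs(ll - ll_old) < tol:
                break
            ll_old = ll
        m = 3 * k - 1                            # free parameters of a univariate K-component GMM
        results[k] = {"loglik": ll, "aic": 2 * m - 2 * ll}
    return results                               # Steps 3-4: plot loglik and AIC against K
```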

4. Simulation

We use simulation to investigate the performance of the proposed modified algorithm.

4.1. Simulation Scheme

Simulation data were generated using the Box-Muller transformation [18] as defined by (9) below:

$$z_1 = \mu + \sigma\sqrt{-2\ln u_1}\,\cos(2\pi u_2), \qquad z_2 = \mu + \sigma\sqrt{-2\ln u_1}\,\sin(2\pi u_2), \quad (9)$$

where $u_1, u_2 \sim U(0, 1)$. For the case of two components, we start by generating a random number $u \sim U(0, 1)$. If $u < \pi_1$, we generate two random numbers $u_1$ and $u_2$ and calculate $z_1$ and $z_2$ using the first and second equations of (9) with $\mu = \mu_1$ and $\sigma = \sigma_1$. Otherwise, we use $\mu = \mu_2$ and $\sigma = \sigma_2$. The process continues until the required sample size is obtained. The scheme is easily extended to any number of components. For further details, refer to [19].
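A sketch of this sampling scheme in Python is given below; `simulate_gmm` is our illustrative name, and the component is selected by comparing a uniform draw with the cumulative prior probabilities, which reduces to the two-component rule described above.

```python
import numpy as np

def simulate_gmm(n, weights, means, sds, rng=None):
    """Draw n observations from a univariate GMM using the Box-Muller transformation (9)."""
    rng = np.random.default_rng() if rng is None else rng
    weights = np.asarray(weights, dtype=float)
    out = []
    while len(out) < n:
        # Choose a component according to the prior probabilities.
        k = min(int(np.searchsorted(np.cumsum(weights), rng.uniform())), len(weights) - 1)
        u1 = 1.0 - rng.uniform()          # in (0, 1], avoids log(0)
        u2 = rng.uniform()
        r = np.sqrt(-2.0 * np.log(u1))
        z1 = means[k] + sds[k] * r * np.cos(2.0 * np.pi * u2)
        z2 = means[k] + sds[k] * r * np.sin(2.0 * np.pi * u2)
        out.extend([z1, z2])
    return np.array(out[:n])
```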

4.2. Study of Performance Based on the Log-Likelihood Function

We first look at the performance of the standard method, called Method 1, followed by that of the modified method, called Method 2. For Method 1, in place of Step 1 of the modified method, we assign the values zero and unity, respectively, to the means and variances of all components. We compare the performance of the two methods by examining the log-likelihood function in a simulation study.

Following [20], we consider two cases with two components and one case with three components, with the true values of the parameters given in Table 1. For each case, we generate 100 samples of size 1000, where the chosen sample size reflects the large data sets found in the telecommunication industry, the focus of our interest. We then apply Method 1 and Method 2 to the simulated data. For each case, for better quality viewing, we plot only 50 values of the log-likelihood function for both methods on the same plot, as given in Figure 2. It can be seen that, for Samples 1 and 3, the proposed Method 2 clearly outperforms the standard Method 1, with the values of the log-likelihood function corresponding to Method 2 always larger than those of Method 1. However, some values overlap for Sample 2, though the proposed Method 2 still generally performs better. In this case, the prior probabilities are distinctly different from those chosen in Sample 1 while the other true values remain the same, which leads to a different percentage of overlapping of the Gaussian components in the GMM. Hence, we will investigate the performance of the improved EM algorithm in estimating the parameters of the GMM by taking into account the effect of different percentages of overlapping between the components observed in the data.

4.3. The Effects of Different Overlapping Percentages on Performance

The main objective here is to investigate the performance of the modified EM algorithm for different overlapping percentages of the components in the GMM. For simplicity, we restrict our attention to two components, so that the parameters $\pi_1, \pi_2, \mu_1, \mu_2, \sigma_1$, and $\sigma_2$ are to be estimated. Data are simulated using the scheme described in Section 4.1.

After performing Steps 1 and 2, we find $d_{ij} = \theta_j - \hat{\theta}_{ij}$, where $\theta_j$ is the true value of the $j$th parameter and $\hat{\theta}_{ij}$ is its EM estimate from the $i$th simulated sample, $i = 1, \ldots, 100$. The sample mean and standard deviation of the $d_{ij}$ are computed using the formulas $\bar{d}_j = \frac{1}{100}\sum_{i=1}^{100} d_{ij}$ and $s_j = \sqrt{\frac{1}{99}\sum_{i=1}^{100}\left(d_{ij} - \bar{d}_j\right)^2}$. The estimates are considered good if $\bar{d}_j$ is close to zero, indicating small biases observed in the simulation results, and $s_j$ is also close to zero, indicating that the parameter estimates are concentrated around their respective true values.
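The summary statistics $\bar{d}_j$ and $s_j$ can be computed directly; the sketch below assumes the EM estimates from the 100 simulated samples have been collected into a matrix with one row per sample.

```python
import numpy as np

def bias_summary(true_params, estimates):
    """Mean and standard deviation of d_ij = theta_j - theta_hat_ij over simulated samples.

    `estimates` is an (n_samples, n_params) array of EM estimates; `true_params`
    holds the corresponding true values.
    """
    d = np.asarray(true_params, dtype=float) - np.asarray(estimates, dtype=float)
    return d.mean(axis=0), d.std(axis=0, ddof=1)
```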

We determine the area of overlapping between the two components for each model by using the misclassification concept given in [21], the details of which are provided in Appendix B. The formula used to estimate the overlapping areas depends on the means and standard deviations of the components. The prior probabilities should not affect the estimates greatly, as their sum equals unity.

We consider three cases with different combinations of parameter values which give different percentages of overlapping of the GMM components. The results are tabulated in Tables 2–4. Table 2 deals with case 1, in which the parameter values are chosen so that the separation of the means is rather large relative to the small values of dispersion; consequently, the percentage of overlapping is 0% in all the combinations considered. We can see that the mean biases are close to zero, with standard deviations of less than unity for all parameters considered. Table 3 gives the results for case 2, where the parameter values are chosen to give 25% of overlapping. The bias is still considered small but is generally larger than that for case 1, and the estimates are also more dispersed. Finally, Table 4 shows the results of case 3 with 45% of overlapping. As expected, the results deteriorate when the percentage of overlapping increases. We conclude that the modified EM algorithm for GMM performs well when the percentage of overlapping is small, but its performance is affected when the percentage increases. More comprehensive simulation results can be obtained from the authors upon request.

5. Determination of the Final Number of Components in the GMM

In the last two steps of the modified algorithm, we confirm that the initial number of components chosen for the GMM using the kernel method is final. This is done by considering extra components in the model. As stated in Section 3.3, we repeat Step 2 for other possible numbers of components $K > K_0$, setting the initial standard deviations of the additional components to unity and the prior weights to $1/K$. The final number of components is determined when adding an extra component neither increases the log-likelihood nor decreases the AIC values significantly. The changes can easily be seen on a line plot of the values.
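As a simple numerical proxy for the visual rule above, one can select the number of components at which the AIC from `fit_gmm_over_k` (see the sketch in Section 3.3) attains its minimum; the line plots remain the primary diagnostic.

```python
def choose_final_k(results):
    """Return the number of components with the smallest AIC (a proxy for the plot-based rule)."""
    return min(results, key=lambda k: results[k]["aic"])
```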

6. Real Example: Phone Call Data

The call detail record, supplied by Telekom Malaysia Berhad (henceforth, TM), consists of calls made by customers who fell victim to fraud activities. Table 5 shows the format of the call detail record for each TM customer. We performed several steps on the original data to put it into the desired format: we grouped the records according to service number, found the country that matches the country code as well as the dialed digits, and sorted the records according to seize time. The column entitled "Seize time" gives the time when the call was made; the fourth column details the duration of the calls in the format hour (hh), minute (mm), and second (ss); and the fifth column is the result of converting the information in the fourth column into day format.
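A sketch of this preprocessing is given below. The column names (`service_number`, `country_code`, `seize_time`, `duration`) are hypothetical placeholders, since the actual layout of TM's call detail record is not reproduced here, and the matching of dialed digits to countries is simplified to a single merge on the country code.

```python
import pandas as pd

def prepare_cdr(cdr: pd.DataFrame, country_lookup: pd.DataFrame) -> pd.DataFrame:
    """Attach the destination country, sort by service number and seize time,
    and convert the call duration (hh:mm:ss) into fractions of a day."""
    out = cdr.merge(country_lookup, on="country_code", how="left")
    out = out.sort_values(["service_number", "seize_time"])
    dur = pd.to_timedelta(out["duration"])        # duration given as hh:mm:ss
    out["duration_days"] = dur.dt.total_seconds() / 86400.0
    return out
```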

We consider real data consisting of the converted duration of each call made by Customer A (the fifth column of Table 5), whose identity is not revealed to ensure confidentiality, on March 31, 2011. Seventeen (17) calls were made, and the data are displayed in Figure 3. Step 1 of the improved EM algorithm for GMM identifies two initial components. The plots of the log-likelihood function and AIC in Figures 4(a) and 4(b) are the results of performing Steps 2, 3, and 4 of the improved EM algorithm for GMM; they reveal that the EM algorithm fails to achieve convergence when the number of components equals five or more. It can also be seen that a GMM with 2 components is identified as the "best" model, since the inclusion of more components neither increases the value of the log-likelihood nor decreases the value of the AIC. The final EM estimates of the prior probabilities, means, and standard deviations of the two-component GMM represent the behavior of calls made by Customer A on March 31, 2011. In a future paper, we will show how the information produced by the improved EM algorithm for GMM can be used in the process of detecting fraud activities in the telecommunication industry.

7. Conclusion

In this paper, we proposed a modified EM algorithm which uses the kernel method to identify the number of components of a GMM and to provide initial values for the estimation of the model parameters. We showed via simulation that the performance of the algorithm is generally good but, as expected, is affected by increasing percentages of overlapping of the Gaussian components. We then used line plots of the log-likelihood and AIC values to identify the final number of GMM components. This number can be determined clearly from the U-shape of the AIC plot, which decreases to a minimum value and then increases as the number of components increases. Finally, the modified EM algorithm for GMM was tested on real telecommunication data. The results attest to the effectiveness of the improved EM algorithm for GMM and should be useful when considering the problem of fraud calls faced by telecommunication companies.

Appendices

A. Derivation of the First, Second, and Third Equations of (7)

Using the method of Lagrange multipliers, which maximizes (or minimizes) a function $f(\Theta)$ subject to a constraint $g(\Theta) = 0$ [22], on the maximization of $Q(\Theta \mid \Theta^{(t)})$ subject to $\sum_{k=1}^{K}\pi_k = 1$ (or $\sum_{k=1}^{K}\pi_k - 1 = 0$), we get the first equation of (7). From $\partial Q(\Theta \mid \Theta^{(t)})/\partial \mu_k = 0$, we get the second equation of (7) by using the matrix property $\frac{\partial}{\partial x}\left(x^{\top} A x\right) = 2Ax$ for symmetric $A$. Setting $\partial Q(\Theta \mid \Theta^{(t)})/\partial \Sigma_k^{-1} = 0$ and using the matrix properties $\frac{\partial}{\partial A}\log\lvert A\rvert = \left(A^{-1}\right)^{\top}$ and $\frac{\partial}{\partial A}\operatorname{tr}(AB) = B^{\top}$, we get the third equation of (7) [8].
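For completeness, the Lagrange-multiplier step for the mixture weights can be written out as follows (a standard derivation; the notation matches Section 3.1):

$$\begin{aligned}
\mathcal{L} &= \sum_{i=1}^{n}\sum_{k=1}^{K}\tau_{ik}^{(t)}\left[\log \pi_k + \log f_k(x_i \mid \mu_k, \Sigma_k)\right] + \lambda\left(\sum_{k=1}^{K}\pi_k - 1\right),\\
\frac{\partial \mathcal{L}}{\partial \pi_k} &= \frac{1}{\pi_k}\sum_{i=1}^{n}\tau_{ik}^{(t)} + \lambda = 0 \;\Longrightarrow\; \pi_k = -\frac{1}{\lambda}\sum_{i=1}^{n}\tau_{ik}^{(t)},\\
\sum_{k=1}^{K}\pi_k &= 1 \;\Longrightarrow\; \lambda = -n \;\Longrightarrow\; \pi_k^{(t+1)} = \frac{1}{n}\sum_{i=1}^{n}\tau_{ik}^{(t)},
\end{aligned}$$

which is the first equation of (7).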

B. The Value of Intersections

For the case when $\sigma_1 \neq \sigma_2$, the two intersection points $x_1$ and $x_2$ of the weighted component densities $\pi_1 f_1(x \mid \mu_1, \sigma_1^2)$ and $\pi_2 f_2(x \mid \mu_2, \sigma_2^2)$ are obtained from

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a},$$

where $a = \dfrac{1}{\sigma_2^2} - \dfrac{1}{\sigma_1^2}$, $b = \dfrac{2\mu_1}{\sigma_1^2} - \dfrac{2\mu_2}{\sigma_2^2}$, and $c = \dfrac{\mu_2^2}{\sigma_2^2} - \dfrac{\mu_1^2}{\sigma_1^2} - 2\ln\!\left(\dfrac{\pi_2\sigma_1}{\pi_1\sigma_2}\right)$ are the coefficients of the quadratic equation obtained by equating the two weighted densities.

Firstly, using the above formula together with the standard normal distribution function $\Phi$, we find the area under each component density between $x_1$ and $x_2$ (and convert it into a percentage); refer to Figure 5(a). Secondly, we take the minimum of the two areas. This value represents the percentage of overlapping between the two components (which is an approximation). For the case when $\sigma_1 = \sigma_2 = \sigma$ and $\mu_1 < \mu_2$, the two weighted densities cross at a single point. The value of the intersection, say $x_0$, is obtained from

$$x_0 = \frac{\mu_1 + \mu_2}{2} + \frac{\sigma^2 \ln(\pi_1/\pi_2)}{\mu_2 - \mu_1}.$$

Taking similar steps, the area for the component on the left hand side of Figure 5(b) is obtained from $1 - \Phi\!\left(\frac{x_0 - \mu_1}{\sigma}\right)$ and that of the component on the right hand side of Figure 5(b) from $\Phi\!\left(\frac{x_0 - \mu_2}{\sigma}\right)$. We convert them into percentages before adding them up to represent the percentage of overlapping between the two components (which is an approximation).
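A sketch of this overlap calculation is given below; it follows the two cases of the appendix (assuming $\mu_1 < \mu_2$) and is our illustration rather than the authors' code.

```python
import numpy as np
from scipy.stats import norm

def overlap_percentage(pi1, mu1, s1, pi2, mu2, s2):
    """Approximate overlapping percentage of two weighted normal components,
    following the intersection-based construction of Appendix B (assumes mu1 < mu2)."""
    if np.isclose(s1, s2):
        # Equal spreads: a single intersection point x0.
        x0 = 0.5 * (mu1 + mu2) + s1 ** 2 * np.log(pi1 / pi2) / (mu2 - mu1)
        left_tail = 1.0 - norm.cdf((x0 - mu1) / s1)   # right tail of the left component
        right_tail = norm.cdf((x0 - mu2) / s2)        # left tail of the right component
        return 100.0 * (left_tail + right_tail)
    # Unequal spreads: two intersections from the quadratic a x^2 + b x + c = 0.
    a = 1.0 / s2 ** 2 - 1.0 / s1 ** 2
    b = 2.0 * (mu1 / s1 ** 2 - mu2 / s2 ** 2)
    c = (mu2 ** 2 / s2 ** 2 - mu1 ** 2 / s1 ** 2
         - 2.0 * np.log(pi2 * s1 / (pi1 * s2)))
    x1, x2 = sorted(np.roots([a, b, c]).real)
    area1 = norm.cdf((x2 - mu1) / s1) - norm.cdf((x1 - mu1) / s1)
    area2 = norm.cdf((x2 - mu2) / s2) - norm.cdf((x1 - mu2) / s2)
    return 100.0 * min(area1, area2)
```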

Conflict of Interests

Mohd Izhan Mohd Yusoff, the main author of the paper, has worked at Telekom Research & Development Sdn. Bhd. (a subsidiary of Telekom Malaysia Berhad) for almost 20 years, the same organization that is sponsoring his Ph.D. study at the Institute of Mathematical Sciences, University of Malaya. The title of his study was proposed by his two supervisors, Ibrahim Mohamed and Mohd Rizam Abu Bakar, who are the coauthors of the paper. Owing to the affiliation of the first author with Telekom Malaysia Berhad, the authors were able to obtain the relevant data sets to illustrate the development of the theory considered in this paper. In other words, Telekom Malaysia Berhad, as mentioned in the paper, merely supplied the customer's call detail record and did not provide any financial contribution towards the project. The data analysis and the preparation of the paper have been carried out independently. In conclusion, there is no potential conflict of interests in the study.

Acknowledgments

The authors would like to express their thanks to Telekom Malaysia Berhad for their cooperation in this study. This research is financially supported by the Fundamental Research Grant Scheme (no. FP012-2013A).