Advances in Decision Sciences

Volume 2016, Article ID 7546963, 7 pages

http://dx.doi.org/10.1155/2016/7546963

## Classification of the Entities Represented by Samples from Gaussian Distribution

Amar Rebbouh

Université des Sciences et de la Technologie Houari Boumediene, BP 32, El Alia, 16111 Bab Ezzouar, Algiers, Algeria

Received 29 February 2016; Accepted 12 May 2016

Academic Editor: Panos Pardalos

Copyright © 2016 Amar Rebbouh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper aims to cluster entities that are described by a data matrix. Under the assumption that the observations in each table are normally distributed, each entity is represented by a sample from a Gaussian distribution, that is, by the number of measurements in the data matrix, the sample mean vector, and the sample covariance matrix. We propose a new distance based on Mahalanobis's discriminant score to measure the similarity between objects. The problem is important not only in the quest for an adequate model of data representation but also in the choice of a distance index between entities that justifies the homogeneity of the observed classes.

#### 1. Introduction

One of the fundamental problems in automatic classification is the development and validation of similarity indices between the objects to be classified. These indices must be adapted to the objects and must allow measuring the adequacy between an object and a class of objects. When the objects to be classified are described by matrices of repeated observations of individuals, for the variables that describe them, over a finite period of time, we propose a new distance based on Mahalanobis's discriminant score to measure the similarity between objects.

Usually, for this type of data, a reduction step precedes the classification stage: we can summarize each table by a vector or by a hyperrectangle, or we can use factorial techniques to reduce each table.

However, these reduction techniques require assumptions that are difficult to satisfy in practice. Indeed, the first type of reduction makes sense only if the mean (or another central value) perfectly summarizes the observations of each individual, and it does not take the variability of the observations into account. The hyperrectangles are Cartesian products of intervals. The estimated intervals depend on the variability of the observations but do not consider the possible relationships between the variables; this type of reduction requires the variables to be uncorrelated. Several distances between interval objects have been extended to distances between hyperrectangles and remain a subject of research in automatic classification. These include the distance based on the city block distance [1], the Hausdorff distance between hyperrectangles, the Wasserstein-based distance [2], and the single adaptive distance [3]. Finally, the third type of reduction leads to new uncorrelated variables but poses significant mathematical problems, such as the search for a compromise space and the number of observations to be used for the reduction of each entry table (see [4, 5]). If the number of observations of each variable is the same for each object, the input data can be considered as a structure of data matrices (see [6]).
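To make one of the interval-based indices mentioned above concrete, the following sketch (not code from the paper; the function name and the corner-vector representation are ours) computes an L1-type Hausdorff distance between two hyperrectangles, summing over variables the interval Hausdorff distance $\max(|a-c|, |b-d|)$ between $[a,b]$ and $[c,d]$:

```python
import numpy as np

def hausdorff_hyperrect(lo_a, hi_a, lo_b, hi_b):
    """L1-type Hausdorff distance between two hyperrectangles given by
    their lower and upper corner vectors: per variable, the Hausdorff
    distance between the two intervals is max(|lo_a-lo_b|, |hi_a-hi_b|),
    and the distances are summed over the variables."""
    lo_a, hi_a, lo_b, hi_b = map(np.asarray, (lo_a, hi_a, lo_b, hi_b))
    return float(np.sum(np.maximum(np.abs(lo_a - lo_b), np.abs(hi_a - hi_b))))
```

As the text notes, such indices use only the interval bounds, so any correlation between the variables is ignored.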

This paper aims to cluster entities that are described by a data matrix. Under the assumption that the observations in each table are normally distributed, each entity is represented by a sample from a Gaussian distribution, that is, by the number of measurements in the data matrix, the sample mean vector, and the sample covariance matrix. We define a new distance based on Mahalanobis's discriminant score to measure the similarity between objects and propose an extension of the K-means algorithm to this case. The approach can also be applied to cluster objects described by variables subject to measurement errors.

In analogy with the classical squared-error criterion and the K-means algorithm, clustering here proceeds by defining and minimizing a joint clustering (heterogeneity) criterion over a partition (with a given number of classes) and a set of class prototypes, namely, the sum over classes of the dissimilarities between the class elements and the corresponding class prototype.

The paper is organized as follows. In Section 2, we present the data structure and some references. In Section 3, we recall the classical approach. In Section 4, we introduce the distance index between objects and the steps of the algorithm. In Section 5, we provide a numerical example and a comparative study with the classical approach. In Section 6, we explain how the algorithm is applied to cluster workdays according to the degree of traffic pollution at a major roundabout; in combination with six weather parameters measured on the same days, the resulting classes are analyzed and described in terms of these meteorological characteristics. We then draw the corresponding conclusions.

#### 2. The Data Structure

Let $\Omega$ be a set of $N$ objects described by a set of $p$ quantitative variables $X_1, \ldots, X_p$.

Each variable $X_j$ is a map defined on the observations of the objects: $x^i_{tj}$ is the value taken by the individual $w_i$, at observation $t$, for the variable $X_j$.

We assume that each individual $w_i$ is described by a matrix $X^i$ of repeated observations:

(i) for example, $X^i$ may represent the medical record of patient $i$, containing the values of the variables measured during the day; $x^i_{tj}$ represents, in this case, the value taken by the patient at hour $t$ for the variable $X_j$;

(ii) in our study, $X^i$ contains the values of the seven pollution parameters for day $i$ over the 24 hours of the day.

The input data are therefore the collection of matrices $X^1, \ldots, X^N$.

#### 3. Classical Approach

(i) A standardized principal component analysis on each table leads to the construction of orthogonal factor axes on which we project the observations of each individual; we obtain new uncorrelated variables, which give one system of axes per table.
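This per-table reduction can be sketched in a few lines. The following is a generic illustration of a standardized PCA keeping the leading $q$ axes, not the paper's code; the function name is ours:

```python
import numpy as np

def standardized_pca(X, q):
    """Standardized PCA of one table: center and scale each column, then
    project the standardized data on the leading q eigenvectors of the
    correlation matrix, yielding q uncorrelated new variables."""
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = (Z.T @ Z) / (len(X) - 1)       # sample correlation matrix
    _, vecs = np.linalg.eigh(corr)        # eigenvalues in ascending order
    axes = vecs[:, ::-1][:, :q]           # leading q factor axes
    return Z @ axes                       # projected (uncorrelated) scores
```

Applied independently to each table, this produces one axis system per object, which is precisely what raises the compromise-space problem discussed next.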

In order to compare the objects, we must place them in the same reference frame. Thus the basic problem of the search for a compromise axis system is posed. This problem also concerns other areas of mathematics, especially differential geometry [5]. The criteria proposed in the literature for the choice of a compromise space, on which the objects are projected to compare them in terms of proximity, are not really justified; the proposed techniques are purely heuristic [7]. Relations between tables are also analyzed with Procrustes analysis and compromise factorial axes in the context of multiple factorial analysis; an important reference is Gardner et al. [8].


(ii) If the matrices all have the same dimension, an algorithm of K-means type based on the Hilbert-Schmidt inner product can be used to classify these matrix objects [6]. If they do not have the same dimension, we can envisage a completion step in order to obtain a structure of juxtaposed data tables of the same dimension.

Suppose that the tables $X^i$ have different numbers of rows $n_1, \ldots, n_N$. We can use the following procedure to complete the tables. Let $m$ be the least common multiple of $n_1, \ldots, n_N$.

For each $i$ there exists an integer $q_i$ such that $m = q_i n_i$.

Now, by duplicating each table $X^i$ $q_i$ times, we obtain a new table with $m$ rows. If $N$ is large, the least common multiple typically becomes very large, and the procedure leads to a structure of very large tables. Moreover, this completion destroys any chronological order of the data. It therefore seems more reasonable to carry out the classification without this completion step and to study the case where the tables do not have the same dimension, without a reduction stage. If the hypothesis of normality of the observations in each column of a table is verified, the table can be considered as grouping realizations of a normal random vector whose distribution parameters are estimated empirically from the observations in the entry tables. The aim of the present paper is therefore to present a new classification approach based on the K-means algorithm, using a new distance index based on Mahalanobis discriminant scores. The proposed algorithm extends to tables of different dimensions and is validated on real traffic-pollution data.
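The completion step above can be made concrete. A minimal sketch (the function name is ours), representing each table as a list of observation rows:

```python
from math import lcm  # multiple-argument lcm requires Python 3.9+

def complete_tables(tables):
    """Duplicate each table so that all tables share the same number of
    rows: with m the least common multiple of the row counts n_i, table i
    is repeated q_i = m / n_i times."""
    counts = [len(t) for t in tables]
    m = lcm(*counts)
    return [t * (m // len(t)) for t in tables]
```

As the text observes, $m$ grows quickly with the number of tables, which is one reason the paper avoids this step altogether.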

#### 4. Proposed Approach

##### 4.1. Estimating the Distribution Parameters

For each individual $w_i$ with observations $x^i_1, \ldots, x^i_{n_i}$ (the rows of the table $X^i$), the mean vector and the components of the covariance matrix are estimated by

$$\hat{\mu}_i = \frac{1}{n_i} \sum_{t=1}^{n_i} x^i_t, \qquad \hat{\Sigma}_i = \frac{1}{n_i - 1} \sum_{t=1}^{n_i} \left(x^i_t - \hat{\mu}_i\right)\left(x^i_t - \hat{\mu}_i\right)^{\top}.$$

These estimators are unbiased and consistent, whatever the number of observations.
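These empirical estimates are straightforward to compute. A minimal sketch (our naming), with rows as observations and columns as variables:

```python
import numpy as np

def estimate_parameters(X):
    """Sample mean vector and unbiased sample covariance matrix of an
    (n_i x p) table of observations describing one entity."""
    mu = X.mean(axis=0)
    # rowvar=False: columns are the variables; ddof=1 gives the
    # unbiased 1/(n-1) estimator used in the text
    sigma = np.cov(X, rowvar=False, ddof=1)
    return mu, sigma
```

Each entity is then summarized by the triple (number of observations, mean vector, covariance matrix) used in the rest of the paper.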

##### 4.2. Classification Algorithm

We wish to gather the $N$ individuals into $K$ homogeneous classes. The heterogeneity of the classes is measured by a within-class inertia criterion,

$$W(P, L) = \sum_{k=1}^{K} \sum_{w_i \in C_k} d(w_i, g_k),$$

where $g_k$ is the prototype, or kernel, of the class $C_k$ and $d$ is an index of distance between an object and the prototype (representative element) of a class. This criterion expresses the adequacy between the individuals and the classes to which they are assigned.

##### 4.3. Description of Individuals

We suppose that every table $X^i$ groups a sample of size $n_i$ of a Gaussian random vector with parameters $(\mu_i, \Sigma_i)$. For example, in the case of data with measurement errors, each table groups the repeated observations of the variables describing the individual; these observations are realizations of the Gaussian random vector. These observations are independent, so the estimated variance-covariance matrix is of full rank and thus nonsingular. Each individual $w_i$ is described by the triple $(n_i, \hat{\mu}_i, \hat{\Sigma}_i)$, where

(i) $n_i$ is the number of observations of the individual $w_i$;

(ii) $\hat{\mu}_i$ is the vector containing the estimated means of the variables;

(iii) $\hat{\Sigma}_i$ belongs to the set of real symmetric positive definite matrices of order $p$.

##### 4.4. Distance between Individuals

Let $w_i$ and $w_{i'}$ be two individuals described, respectively, by $(n_i, \hat{\mu}_i, \hat{\Sigma}_i)$ and $(n_{i'}, \hat{\mu}_{i'}, \hat{\Sigma}_{i'})$. We wish to build a distance index that takes the distribution parameters into account. To do this, we use the notion of discriminant score. For a realization $x^i_t$ of the individual $w_i$, the Mahalanobis discriminant score of this observation with regard to the realizations of the individual $w_{i'}$ is given by

$$D^2(x^i_t, w_{i'}) = \left(x^i_t - \hat{\mu}_{i'}\right)^{\top} \hat{\Sigma}_{i'}^{-1} \left(x^i_t - \hat{\mu}_{i'}\right),$$

where $\hat{\mu}_i$ and $\hat{\mu}_{i'}$ are the mean vectors of the individuals $w_i$ and $w_{i'}$, respectively. Assimilating each observation of the individual $w_i$ to its mean vector (the empirical center of the distribution of its observations), the score with regard to all the realizations of the individual $w_{i'}$ becomes

$$D^2(w_i, w_{i'}) = \left(\hat{\mu}_i - \hat{\mu}_{i'}\right)^{\top} \hat{\Sigma}_{i'}^{-1} \left(\hat{\mu}_i - \hat{\mu}_{i'}\right).$$

Similar arguments lead to

$$D^2(w_{i'}, w_i) = \left(\hat{\mu}_{i'} - \hat{\mu}_i\right)^{\top} \hat{\Sigma}_i^{-1} \left(\hat{\mu}_{i'} - \hat{\mu}_i\right).$$

These scores are nonnegative quantities and express the dissimilarity between two individuals.
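The two one-sided scores, and a symmetric index built from them, can be sketched as follows. The symmetrization by a plain sum is our illustrative simplification of the paper's weighted index:

```python
import numpy as np

def discriminant_score(mu_a, mu_b, sigma_b):
    """Mahalanobis discriminant score of entity a (reduced to its mean
    vector mu_a) with regard to entity b's distribution (mu_b, sigma_b)."""
    d = mu_a - mu_b
    return float(d @ np.linalg.inv(sigma_b) @ d)

def score_distance(a, b):
    """Symmetric index between entities a = (mu_a, sigma_a) and
    b = (mu_b, sigma_b): unweighted sum of the two one-sided scores."""
    (mu_a, sig_a), (mu_b, sig_b) = a, b
    return (discriminant_score(mu_a, mu_b, sig_b)
            + discriminant_score(mu_b, mu_a, sig_a))
```

Each one-sided score weights the mean difference by the inverse covariance of the reference entity, so the same mean gap counts for less along directions of high variability.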

The map $d$ is then defined by a weighted combination of the two scores,

$$d(w_i, w_{i'}) = \pi_i\, D^2(w_i, w_{i'}) + \pi_{i'}\, D^2(w_{i'}, w_i),$$

where $\pi_i$ and $\pi_{i'}$ are positive weights, so that $d$ is an index of weighted distance.

Without loss of generality, we assume that all objects are observed the same number of times, $n_i = n$ for all $i$. We also assume that (1) $n > p$ and (2) the observations of each individual are in general position. These hypotheses imply that the matrices $\hat{\Sigma}_i$ and $\hat{\Sigma}_{i'}$ are nonsingular.

##### 4.5. Criteria and Optimization Problem

Let $\mathcal{P}_K$ be the set of partitions of $\Omega$ into $K$ clusters and let $L = (g_1, \ldots, g_K)$ be a set of class prototypes. For $P = (C_1, \ldots, C_K) \in \mathcal{P}_K$, the criterion writes

$$W(P, L) = \sum_{k=1}^{K} \sum_{w_i \in C_k} d(w_i, g_k).$$

We search for the pair $(P^*, L^*)$ which realizes

$$W(P^*, L^*) = \min \left\{ W(P, L) : P \in \mathcal{P}_K,\ L \right\}.$$

The algorithms used to solve such problems are of K-means type. They are based on a representation function and an affectation function, which are applied alternately to decrease the criterion. The representation function assigns to each class the prototype minimizing the sum of the distances between that prototype and the elements of the class.

##### 4.6. Characterization of a Class of Individuals

We seek the kernel of each of the classes generated by the algorithm. Let $w_1, \ldots, w_m$ be the individuals of a class and let $f$ be the map defined, without loss of generality, by $f(\mu, \Sigma) = \sum_{i=1}^{m} d\big(w_i, (n, \mu, \Sigma)\big)$, the total distance from a candidate kernel to the members of the class.

Proposition 1. *The $\mu$ and $\Sigma$ which minimize $f$ admit closed-form expressions, obtained below from the first-order conditions.*

*Proof.* We focus on the case where, for all $i$, $n_i = n$.

We search for the $\mu$ and $\Sigma$ which minimize $f$. Writing the necessary first-order conditions (the derivatives of $f$ with respect to $\mu$ and then with respect to $\Sigma$ set to zero), we note that the expression obtained for $\mu$ does not depend on $\Sigma$. Expanding the second condition and substituting this expression then yields the stated closed forms for $\mu$ and $\Sigma$.

*Remark 2.* In the case of measurement errors, the characterization of the obtained classes is not distorted and is given exactly. This seems quite natural.

###### 4.6.1. The Distance between Individual and a Class

The individual $w_i$ is described by $(n_i, \hat{\mu}_i, \hat{\Sigma}_i)$, and the class $C_k$, which contains $m_k$ individuals, is characterized by its kernel $g_k = (n, \mu_k, \Sigma_k)$ given by Proposition 1. The distance between the individual $w_i$ and the class $C_k$ is then $d(w_i, g_k)$.

The affectation function assigns each individual to the closest class: $w_i$ is assigned to the class $C_{k^*}$ with $k^* = \arg\min_{k} d(w_i, g_k)$.

###### 4.6.2. Classification Algorithm

We choose an initial partition (or an initial set of prototypes) and alternately apply the representation function and the affectation function. The algorithm runs as follows.

The algorithm stops as soon as the partition does not change. It builds two sequences: a sequence of partitions and the corresponding sequence of criterion values.
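The alternation described above can be sketched end to end. This is a generic K-means-type loop under simplifying assumptions of ours: the distance is the plain sum of the two Mahalanobis scores, and the class kernel is taken as the mean of the member means with an averaged covariance, an illustrative stand-in for the exact kernel of Proposition 1:

```python
import numpy as np

def score_distance(mu_a, sig_a, mu_b, sig_b):
    """Symmetric sum of the two one-sided Mahalanobis scores."""
    d = mu_a - mu_b
    return float(d @ np.linalg.inv(sig_b) @ d + d @ np.linalg.inv(sig_a) @ d)

def cluster(entities, K, n_iter=100, seed=0):
    """K-means-type alternation: representation step (recompute one
    kernel per class), then affectation step (assign each entity to its
    closest kernel), until the partition no longer changes.
    `entities` is a list of (mean vector, covariance matrix) pairs."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(0, K, size=len(entities))
    for _ in range(n_iter):
        # representation step: one prototype per class
        protos = []
        for k in range(K):
            members = [e for e, l in zip(entities, labels) if l == k]
            if not members:  # re-seed an empty class with a random entity
                members = [entities[rng.integers(len(entities))]]
            protos.append((np.mean([mu for mu, _ in members], axis=0),
                           np.mean([s for _, s in members], axis=0)))
        # affectation step: each entity goes to the nearest prototype
        new = np.array([np.argmin([score_distance(mu, s, pm, ps)
                                   for pm, ps in protos])
                        for mu, s in entities])
        if np.array_equal(new, labels):
            break
        labels = new
    return labels
```

Both steps can only decrease the criterion, which is the argument behind Propositions 3 and 4 below.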

Proposition 3. *The sequence of criterion values is decreasing and converges.*

*Proof.* The representation step can only decrease the criterion, by definition of the representation function, and the affectation step can only decrease it further, by definition of the affectation function. Since the criterion is bounded below by zero, the sequence converges.

Proposition 4. *The sequence of partitions is stationary from a given rank onwards.*

*Proof.* The number of partitions of a finite set into $K$ classes is finite, and the criterion decreases strictly as long as the partition changes; hence the partition must stop changing after a finite number of steps.

#### 5. Numerical Illustrative Example

We wish to cluster six objects into 2 clusters. Each object is described by three variables $X_1$, $X_2$, and $X_3$. The three variables are generic, and we assume that the condition of normality of the observations is verified. The artificial input data are as follows.

*Data Input*

The tables in (32) do not have the same dimension.

##### 5.1. Classical Approach

Usually, we summarize the observations of each individual by a central value, such as the mean. This method of data reduction can lead to erroneous results, as this numerical example shows.

(i) The mean value of each variable and the coordinates of the final cluster centers, obtained with the K-means algorithm, are as follows:

##### 5.2. The Proposed Approach

(i) The means are given as follows:

(ii) The covariance matrices are given as follows:

##### 5.3. Final Partitions and Distances between Each Object and Its Class Prototype, for the Classical (Reduction) and the Proposed Approaches

Distances between objects and kernel of the class in the classical approach, after reduction step, are as follows:

Final partition is as follows:

Distances between objects and kernel of the class in the proposed approach are as follows:

Final partition is as follows:

Taking into account the variability due to measurement errors and running the K-means algorithm with the weighted Mahalanobis distance, it appears that individuals 2 and 4 must belong to the same class, which the preceding procedure fails to detect. When the variability of the observations plays an important part in the description of the individuals, a classification made without taking this variability into account leads to results inconsistent with the reality of the data.

#### 6. Application

Two files were used. The first file contains the observations of seven parameters measuring the pollution caused by gases emitted by cars at a major intersection of a city. The seven measured pollutants are *carbon monoxide*, *nitrogen monoxide*, *nitrogen dioxide*, *PM10 dust*, *sulfur dioxide*, *volatile organic compounds*, and *ozone*. These pollutants were measured each hour of each day. The observations cover 420 days without gaps over the past three years, so this file contains 420 tables of dimension 24 × 7 each. For these 420 days, we built a second file by measuring the daily averages of 6 meteorological parameters: *temperature*, *rainfall*, *atmospheric pressure*, *humidity*, *wind speed*, and *hours of sunshine*; this table is of dimension 420 × 6. The interest is in the possible relationships between the pollution variables and the meteorological variables. We classify the days into three classes according to the degree of pollution and explain the classes using the meteorological variables. Each day is described by seven curves corresponding to the pollutants; the proposed algorithm, written in Matlab, gathered the days into classes without a reduction step: *class 1* "low-pollution days," *class 2* "days of average pollution," and *class 3* "days of high pollution." The results are convincing; the profile of each class was explained by the meteorological variables.

As a result, several questions arise: we want to study the relationship between the pollution variables and the weather variables, explain the classes in connection with the weather variables, and determine the profile of each class in connection with these variables.

A first approach consists in summarizing the pollution file (420 tables of dimension 24 × 7) into a single table of dimension 420 × 7 by measuring the daily average of each pollutant. The variability of the measurements is thereby removed. We have studied the relationship between the two groups of variables, pollution and weather conditions, but this reduced approach is not the one of interest here.


Table 1 shows the discriminating variables of each class. It describes the pollution classes obtained according to the weather parameters.