Research Article | Open Access

Amar Rebbouh, "Clustering Objects Described by Juxtaposition of Binary Data Tables", Advances in Decision Sciences, vol. 2008, Article ID 125797, 13 pages, 2008. https://doi.org/10.1155/2008/125797

Clustering Objects Described by Juxtaposition of Binary Data Tables

Academic Editor: Khosrow Moshirvaziri
Received: 03 April 2008; Revised: 14 August 2008; Accepted: 28 October 2008; Published: 14 January 2009

Abstract

This paper associates 0/1 data matrices with physical systems and clusters them using a Kullback-Leibler distance between probability distributions; the distributions are estimated from the contents of the data matrices. We discuss an ascending hierarchical classification method, present a numerical example, and mention an application with survey data concerning the level of development of the departments of a given territory of a country.

1. Introduction

The automatic classification of the components of a structure of multiple tables remains a vast field of research and investigation. The components are matrix objects, and the difficulty lies in the definition and the delicate choice of an index of distance between these objects (see Lerman and Tallur [1]). If the matrix objects are tables of measurements of the same dimension, we introduced in Rebbouh [2] an index of distance based on the Hilbert-Schmidt inner product and built a classification algorithm of k-means type. In this paper, we are interested in the case where the matrix objects are tables gathering the description of the individuals by nominal variables. These objects are transformed into complete disjunctive tables containing 0/1 data (see Agresti [3]). This is a particular structure of multiple data tables frequently encountered in practice whenever several observations are carried out on the individuals one wishes to classify (see [4, 5]). Consider, for example, the classification of administrative departments according to indices measuring the levels of economic development $id_E$ and human development $id_H$, which are weighted averages calculated from measurements of selected parameters. Each department $i$ gathers a number $L_i$ of subregions classified as rural or urban, so each department is described by a matrix with 2 columns and $L_i$ lines of numbers between 0 and 1. But since the values of the indices do not have the same meaning according to the geographical position of the subregion and its urban or rural character, the specialists are interested instead in the membership of each subregion in quintile intervals. Each department is then described by a matrix object with 10 columns and $L_i$ lines; this matrix object is a juxtaposition of 2 tables of the same dimension, each line containing a single 1 in the column of the class to which the subregion is assigned and 0 elsewhere. The use of conventional statistical techniques to analyze this kind of data requires a reduction step. Several criteria of good reduction exist; the criterion that gives the most easily usable and interpretable results is undoubtedly the least squares criterion (see, e.g., [6]). It summarizes each table of data describing each object, for each variable, by a vector or a subspace. Several mathematical problems arise at this stage:

(1) the choice of the value that summarizes the various observations of the individual for each variable: do we take the most frequent value or another value, for example an interval [7]? Why this choice, and which links exist between the variables?

(2) the use of homogeneity analysis or multiple correspondence analysis (MCA) to reduce the data tables. We perform an MCA for each of the $n$ data tables describing, respectively, the $n$ individuals, and we obtain $n$ factorial axis systems. To compare elements of the structure, we must seek a common or compromise system for the $n$ of them. This issue involves other mathematical disciplines such as differential geometry (see [8]). The proposed criteria for the choice of the common space are hardly justified (see Bouroche [9]), and this problem is not fully resolved;

(3) the number of observations may vary from one individual to another. We can use the following procedure to complete the tables (a small worked illustration follows this list). Assume $L_i>1$ for all $i=1,\dots,n$, where $L_i$ is the number of observations of the individual $\omega_i$, and define $L$ as the least common multiple of the $L_i$. Hence, there exists $r_i$ such that $L=L_i\times r_i$. We duplicate each table $T_i$ $r_i$ times and obtain a new table $T_i^*$ of dimension $L\times d$, where $d$ is the number of variables. But if the $L_i$ are large, the least common multiple becomes large itself, and the procedure leads to a structure of very large tables. Moreover, this completion destroys any chronological structure of the data. It is therefore not a good completion process, and it seems more reasonable to carry out the classification without it.
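As a hypothetical illustration of this completion step (the numbers are made up): with three objects observed $L_1=2$, $L_2=3$, and $L_3=4$ times, the least common multiple is $L=12$, so the first table must be duplicated $r_1=6$ times, the second $r_2=4$ times, and the third $r_3=3$ times; the structure grows from $2+3+4=9$ lines to $3\times 12=36$ lines, and any within-table ordering of the observations is lost.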

To overcome these difficulties, whose proposed solutions are not rigorously justified, we introduce a formalism borrowed from the theory of signal and communication (see Shannon [10]) and use it to classify the elements of the data structure [11]. Our approach is based on simple statistical tools and on techniques used in information theory (physical system, entropy, conditional entropy, etc.) and requires the introduction of the concept of discrete physical systems as models for the observations of each individual for the variables that describe them. If we consider an observation as a realization of a random vector, it appears reasonable to consider that each value of the variable or the random vector represents a state of the system, characterized by its frequency or its probability. If the variable is discrete, the number of states is finite, and each state is measured by its frequency or probability. This approach gives a new version, and at the same time an explanation, of the distance introduced by Kullback [12]. This index makes it possible to build an indexed hierarchy on the elements of the structure and can be used even if the matrix objects do not have the same dimension.

In Section 2, we introduce an adapted formalism and the notion of physical random system as a model of description of the objects. We define in Section 3 a distance between the elements of the structure. The numerical example and an application are presented in Section 4. Concluding remarks are made in Section 5.

2. Adapted Formalism

Let $\Omega=\{\omega_1,\dots,\omega_n\}$ be a finite set of $n$ elementary objects and let $\{V_1,\dots,V_d\}$ be $d$ discrete variables defined over $\Omega$, taking a finite number of values in $D_1,\dots,D_d$, respectively, where $D_j=\{m_{j1},\dots,m_{jr_j}\}$ and $m_{jt}$ is the $t$th modality (value) taken by $V_j$. We suppose that the observations of the individual $\omega_i$ for the variable $V_j$ are given in the table
$$E_i^j=\begin{bmatrix}V_j(1)\\ V_j(2)\\ \vdots\\ V_j(l)\\ \vdots\\ V_j(L_i)\end{bmatrix},\quad l=1,\dots,L_i,\tag{2.1}$$
where $L_i$ represents the number of observations of the individual $\omega_i$, and $V_j(l)=m_{jt}$ if the $l$th observation of the individual $\omega_i$ for the variable $V_j$ is $m_{jt}$, $t=1,\dots,r_j$. Thus $E_i^j$ is the vector with $L_i$ components corresponding to the different observations of $\omega_i$ for $V_j$.

The structure of a juxtaposition of categorical data tables is
$$E=\big(E_1,\dots,E_n\big)\quad\text{with } E_i=\big(E_i^1,\dots,E_i^d\big),\tag{2.2}$$
where $E_i$ is a matrix of order $L_i\times d$. For the sake of simplicity, we transform each vector $E_i^j$ into a 0/1 data matrix $\Delta_i^j$ whose columns are indexed by the modalities $m_{j1},m_{j2},\dots,m_{jr_j}$, whose lines are indexed by the observations $l=1,\dots,L_i$, and whose entries are
$$\big(Z_i^j\big)_{lt}=\begin{cases}1,&\text{if at the $l$th observation $\omega_i$ takes the modality }m_{jt},\\[2pt] 0,&\text{otherwise.}\end{cases}\tag{2.3}$$

The structure of a juxtaposition of 0/1 data tables is
$$\Delta=\big(\Delta_1,\dots,\Delta_n\big)\quad\text{with }\Delta_i=\big(\Delta_i^1,\dots,\Delta_i^d\big),\tag{2.4}$$
where $\Delta_i$ is a matrix of order $L_i\times M$ with $M=\sum_{j=1}^{d}r_j$.

$p^j_{it}=\Pr(V_j=m_{jt})$ is estimated by the relative frequency of the value 1 observed in the $t$th column of the matrix $\Delta_i^j$.

Let $S_i^j$ be the single random physical system associated to $\omega_i$ for $V_j$:
$$S_i^j=\bigwedge_{t=1}^{r_j}\Big\{(S)\rightsquigarrow m_{jt};\ \Pr\big[(S)\rightsquigarrow m_{jt}\big]=p^j_{it}\Big\},\tag{2.5}$$
where the symbol $(S)\rightsquigarrow m_t$ means that the system lies in the state $m_t$, and $\bigwedge$ is the conjunction between events.

In the multidimensional case, the associated multiple random physical system $S$ is
$$S=\bigwedge_{l_1=1}^{r_1}\cdots\bigwedge_{l_d=1}^{r_d}\Big\{(S)\rightsquigarrow\big(m_{1l_1},\dots,m_{dl_d}\big);\ \Pr\big[(S)\rightsquigarrow\big(m_{1l_1},\dots,m_{dl_d}\big)\big]=p_{l_1,\dots,l_d}\Big\},\tag{2.6}$$
where
$$\sum_{l_1=1}^{r_1}\cdots\sum_{l_d=1}^{r_d}p_{l_1,\dots,l_d}=1.\tag{2.7}$$
The multiple random physical system associated to the marginal distributions is
$$S=\bigwedge_{j=1}^{d}S_j,\tag{2.8}$$
where $\bigwedge$ is the conjunction between single physical systems, and $\{S_j,\ j=1,\dots,d\}$ are the single random physical systems given by
$$S_j=\bigwedge_{t=1}^{r_j}\Big\{(S)\rightsquigarrow m_{jt};\ \Pr\big[(S)\rightsquigarrow m_{jt}\big]=p_{jt}\Big\},\quad j=1,\dots,d,\tag{2.9}$$
$$p_{jl}=\Pr\big[(S)\rightsquigarrow m_{jl}\big]=\sum_{l_1=1}^{r_1}\cdots\sum_{l_{j-1}=1}^{r_{j-1}}\ \sum_{l_{j+1}=1}^{r_{j+1}}\cdots\sum_{l_d=1}^{r_d}p_{l_1,\dots,l_d}.\tag{2.10}$$
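To make the formalism concrete, here is a minimal Python sketch of how the state probabilities $p^j_{it}$ of a single physical system are estimated from a 0/1 table $\Delta_i^j$ by column means; the data values are simply the $V_2$ block of $\omega_1$ from Table 1 in Section 4.

```python
import numpy as np

# 0/1 table Delta_i^j: L_i = 10 observations of a variable with
# r_j = 3 modalities; each line contains exactly one 1.
delta = np.array([
    [1, 0, 0], [0, 0, 1], [0, 1, 0], [1, 0, 0], [1, 0, 0],
    [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 1, 0], [1, 0, 0],
])

# p_{it}^j = Pr(V_j = m_{jt}), estimated by the relative frequency of
# the value 1 in the t-th column (Section 2).
p = delta.mean(axis=0)
print(p)  # [0.4 0.4 0.2] -- the state probabilities of S_i^j
```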

3. Distance between Multiple Random Physical Systems

3.1. Entropy as a Measure of Uncertainty of States of a Physical System

For measuring the degree of uncertainty of the states of a physical system or of a discrete random variable, we use the entropy, a characteristic widely used in information theory.

3.1.1. Shannon's [10] Formula for the Entropy

The entropy of the system is the nonnegative quantity
$$H(S)=-\sum_{t=1}^{r}p_t\log_2 p_t.\tag{3.1}$$

The function 𝐻 has some elementary properties which justify its use as a characteristic for measuring the uncertainty of a system.

(1) If one of the states is certain ($\exists t\in\{1,\dots,r\}$ such that $p_t=\Pr[(S)\rightsquigarrow m_t]=1$), then $H(S)=0$.

(2) The entropy of a physical system with a finite number of states $(m_1,\dots,m_r)$ is maximal if all its states are equiprobable, that is, $p_t=\Pr[(S)\rightsquigarrow m_t]=1/r$ for all $t\in\{1,\dots,r\}$. We also have $0\le H(S)\le\log_2(r)$.
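A quick numerical check of these two properties, as a Python sketch (the third distribution is arbitrary):

```python
import numpy as np

def entropy(p):
    """Shannon entropy H(S) = -sum_t p_t log2 p_t, formula (3.1);
    states with zero probability contribute nothing."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return 0.0 - np.sum(p * np.log2(p))

print(entropy([1.0, 0.0, 0.0]))   # 0.0: one certain state
print(entropy([1/3, 1/3, 1/3]))   # log2(3) ~ 1.585: equiprobable states
print(entropy([0.5, 0.3, 0.2]))   # ~1.485: strictly between the bounds
```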

This characteristic of the entropy function expresses the fact that the probability distribution with maximum entropy is the least biased one that is consistent with the information specified by the constraints [10].

3.2. Entropy of a Multiple Random Physical System

Let $S$ be a multiple random physical system given by (2.6). If the single physical systems $(S_j;\ j=1,\dots,d)$ given by (2.9) are independent, then
$$H(S)=\sum_{j=1}^{d}H\big(S_j\big).\tag{3.2}$$
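A small numerical check of (3.2), assuming two independent single systems whose joint distribution is the outer product of made-up marginals:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return 0.0 - np.sum(p * np.log2(p))

p1 = np.array([0.6, 0.4])        # marginal distribution of S_1
p2 = np.array([0.2, 0.3, 0.5])   # marginal distribution of S_2
joint = np.outer(p1, p2)         # independence: p_{l1 l2} = p_{1 l1} * p_{2 l2}
print(np.isclose(entropy(joint), entropy(p1) + entropy(p2)))  # True
```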

The conditional random physical system $S_1/\big[(S_2)\rightsquigarrow m_{2l}\big]$ is given by
$$S_1/\big[(S_2)\rightsquigarrow m_{2l}\big]=\bigwedge_{j=1}^{r_1}\Big\{(S_1)\rightsquigarrow m_{1j}/(S_2)\rightsquigarrow m_{2l};\ \Pr\big[(S_1)\rightsquigarrow m_{1j}/(S_2)\rightsquigarrow m_{2l}\big]=p_{j/l}\Big\},\tag{3.3}$$
where $p_{j/l}$ is a conditional probability.

The entropy of this system is
$$H\big(S_1/(S_2)\rightsquigarrow m_{2l}\big)=-\sum_{j=1}^{r_1}p_{j/l}\log_2 p_{j/l}.\tag{3.4}$$

The multiple random physical system $S_1/S_2$ is written
$$S_1/S_2=\bigwedge_{l=1}^{r_2}\bigwedge_{t=1}^{r_1}\Big\{(S_1)\rightsquigarrow m_{1t}/(S_2)\rightsquigarrow m_{2l};\ \Pr\big[(S_1)\rightsquigarrow m_{1t}/(S_2)\rightsquigarrow m_{2l}\big]=p_{t/l}\Big\},\tag{3.5}$$
which implies
$$H\big(S_1/S_2\big)=-\sum_{l=1}^{r_2}\sum_{j=1}^{r_1}p_{j/l}\log_2 p_{j/l}.\tag{3.6}$$
Hence,
$$H(S)=H\big(S_1\big)+H\big(S_2/S_1\big)+H\big(S_3/S_1S_2\big)+\dots+H\big(S_d/S_1S_2\cdots S_{d-1}\big).\tag{3.7}$$
The quantity
$$K(P,Q)=-\sum_{i=1}^{r}p_i\log_2\big(q_i/p_i\big)\tag{3.8}$$
is nonnegative, and we have
$$K(P,Q)=K(Q,P)=0\iff P=Q\ \text{almost surely}.\tag{3.9}$$

It is clear that $K(\cdot,\cdot)$ is not a symmetric function; thus it is not a distance in the classical sense, but it characterizes (from a statistical point of view) the deviation between the distributions $P$ and $Q$. It should be noted that $K(P,Q)+K(Q,P)$ is symmetric.

Kullback [12] explains that the quantity $K(P,Q)$ evaluates the average information lost if we use the distribution $P$ while the actual distribution is $Q$.

Let $S_{\Pi_d}$ be a set of random physical systems with $\prod_{j=1}^{d}r_j$ states:
$$S\in S_{\Pi_d}\Longrightarrow S=\bigwedge_{l_1=1}^{r_1}\cdots\bigwedge_{l_d=1}^{r_d}\Big\{(S)\rightsquigarrow\big(m_{l_1},\dots,m_{l_d}\big);\ p_{l_1\cdots l_d}\Big\}.\tag{3.10}$$

Let dist be the map defined by
$$\mathrm{dist}:S_{\Pi_d}\times S_{\Pi_d}\longrightarrow\mathbb{R}_+,\qquad \big(S_1,S_2\big)\longmapsto \mathrm{dist}\big(S_1,S_2\big)=K_d\big(P_1,P_2\big)+K_d\big(P_2,P_1\big)-\big[H\big(S_1\big)+H\big(S_2\big)\big].\tag{3.11}$$

$P_1$ and $P_2$ are the multivariate distributions of order $d$ governing, respectively, the random physical systems $S_1$ and $S_2$, and $K_d$ is defined by
$$K_d\big(P_1,P_2\big)=-\sum_{I_1=1}^{r_1}\cdots\sum_{I_d=1}^{r_d}p^{(1)}_{I_1\cdots I_d}\log_2 p^{(2)}_{I_1\cdots I_d}.\tag{3.12}$$
dist verifies:

(1) $\mathrm{dist}(S_1,S_2)\ge 0$;
(2) $\mathrm{dist}(S_1,S_2)=0\iff S_1=S_2$;
(3) $\mathrm{dist}(S_1,S_2)=\mathrm{dist}(S_2,S_1)$ (symmetry).

We admit that $\mathrm{dist}(S_1,S_2)=0\iff S_1=S_2\iff\omega_1=\omega_2$.

dist measures the proximity between physical systems: the smaller the value of dist, the closer the two systems. dist represents the average quantity of information lost if we use the distribution $P_1$ (respectively, $P_2$) to manage the system while the other distribution is the true one. dist is nothing else than the symmetrized Kullback-Leibler distance between the multivariate distributions $P_1$ and $P_2$. Indeed, the Kullback-Leibler distance between $P_1$ and $P_2$ is given by
$$Ku\big(P_1,P_2\big)=\sum_{I_1=1}^{r_1}\cdots\sum_{I_d=1}^{r_d}\Big(p^{(1)}_{I_1\cdots I_d}-p^{(2)}_{I_1\cdots I_d}\Big)\log_2\Big(p^{(1)}_{I_1\cdots I_d}\big/p^{(2)}_{I_1\cdots I_d}\Big).\tag{3.13}$$

Developing this expression will give dist.
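The equivalence between (3.11) and (3.13) can be checked numerically. The following Python sketch computes dist both ways on the distributions $F_1$ and $F_5$ of Table 2 below (the function names are ours):

```python
import numpy as np

def entropy(p):
    return 0.0 - np.sum(p * np.log2(p))

def K_d(p1, p2):
    """Cross term of formula (3.12); assumes strictly positive probabilities."""
    return 0.0 - np.sum(p1 * np.log2(p2))

def dist(p1, p2):
    """Formula (3.11): K_d(P1,P2) + K_d(P2,P1) - [H(S1) + H(S2)]."""
    return K_d(p1, p2) + K_d(p2, p1) - (entropy(p1) + entropy(p2))

def kullback(p1, p2):
    """Formula (3.13): the symmetrized Kullback-Leibler distance."""
    return np.sum((p1 - p2) * np.log2(p1 / p2))

# F1 and F5 from Table 2, flattened over the 6 joint states.
f1 = np.array([0.2, 0.3, 0.1, 0.2, 0.1, 0.1])
f5 = np.array([0.1, 0.4, 0.2, 0.1, 0.1, 0.1])
print(dist(f1, f5), kullback(f1, f5))  # both print ~0.3415
```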

4. Numerical Application

4.1. Procedure to Estimate the Joint Distribution

In the case where all the variables involved in the description of the individuals are discrete, we give a procedure, taken from classical techniques of factor analysis, to estimate the joint distribution and derive the entropy of the multiple physical system.

Let $\Delta_i=[\Delta_i^1,\dots,\Delta_i^d]$ be a juxtaposition of $d$ 0/1 data tables. For $\omega_i\in\Omega$ fixed, we have
$$p^{(i)}_{l_1\cdots l_d}=\Pr\big[(S)\rightsquigarrow\big(m_{1l_1},\dots,m_{dl_d}\big)\big]=\frac{1}{L_i}N^{(i)}_{l_1,\dots,l_d}.\tag{4.1}$$

$N^{(i)}(\cdot)$ is the number of simultaneous occurrences of the modalities $m_{1l_1},\dots,m_{dl_d}$, and
$$\sum_{l_1=1}^{r_1}\cdots\sum_{l_d=1}^{r_d}p^{(i)}_{l_1\cdots l_d}=1.\tag{4.2}$$
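A Python sketch of this estimation procedure (the function name is ours); applied to the two tables describing $\omega_1$ in Table 1 below, it reproduces the row $F_1$ of Table 2:

```python
import numpy as np
from collections import Counter

def joint_distribution(tables, L):
    """Estimate p^(i)_{l1...ld} of formula (4.1) from a list of d 0/1
    tables (one per variable, each with L lines): count the simultaneous
    occurrences N^(i) of the modalities and divide by L."""
    counts = Counter()
    for l in range(L):
        # The state observed at line l is the tuple of the column
        # indices holding the 1 in each table.
        state = tuple(int(np.argmax(t[l])) for t in tables)
        counts[state] += 1
    return {state: n / L for state, n in counts.items()}

# The two 0/1 tables of omega_1 in Table 1 (V1: 2 modalities, V2: 3).
v1 = np.array([[1,0],[0,1],[1,0],[0,1],[0,1],[1,0],[0,1],[1,0],[1,0],[1,0]])
v2 = np.array([[1,0,0],[0,0,1],[0,1,0],[1,0,0],[1,0,0],
               [0,0,1],[0,1,0],[0,1,0],[0,1,0],[1,0,0]])
print(joint_distribution([v1, v2], L=10))
# {(0,0): 0.2, (1,2): 0.1, (0,1): 0.3, (1,0): 0.2, (0,2): 0.1, (1,1): 0.1}
# i.e., the row F1 of Table 2.
```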

4.2. Algorithm

We use an algorithm for ascending hierarchical classification [13]. We call points either the objects to be classified or the clusters of objects generated by the algorithm.

Step 1. There are 𝑛 points to classify (which are the 𝑛 objects).

Step 2. We find the two points $x$ and $y$ that are closest to one another according to the distance dist and cluster them into a new artificial point $h$.

Step 3. We calculate the distances between the new point and the remaining points using the single linkage $D$ of Sneath and Sokal [14], defined by
$$D(\omega,h)=\min\big\{\mathrm{dist}(\omega,x),\,\mathrm{dist}(\omega,y)\big\},\quad\omega\neq x,y.\tag{4.3}$$
We return to Step 1 with only $(n-1)$ points to classify.

Step 4. We again find the two closest points and aggregate them. We calculate the new distances and repeat the process until there is only one point remaining.

In the case of single linkage, the algorithm uses the distances only through the order relations between them.
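The following Python sketch implements Steps 1-4 with the single-linkage update (4.3), assuming a precomputed symmetric matrix of dist values; the function name and the 4-point example matrix are hypothetical.

```python
import numpy as np

def ahc_single_linkage(D, labels):
    """Ascending hierarchical classification: repeatedly merge the two
    closest points and update the distances with formula (4.3)."""
    D = np.array(D, dtype=float)
    labels = list(labels)
    active = list(range(len(labels)))
    merges = []
    while len(active) > 1:
        # Step 2: find the two closest active points.
        level, a, b = min(
            (D[x, y], x, y)
            for i, x in enumerate(active) for y in active[i + 1:]
        )
        merges.append((labels[a], labels[b], level))
        # Step 3: single linkage -- the new artificial point (kept at
        # index a) is at distance min(dist(w, a), dist(w, b)) from w.
        for w in active:
            if w != a and w != b:
                D[a, w] = D[w, a] = min(D[w, a], D[w, b])
        labels[a] = f"({labels[a]},{labels[b]})"
        active.remove(b)
    return merges

D = [[0.0, 0.2, 0.9, 1.0],
     [0.2, 0.0, 0.8, 1.1],
     [0.9, 0.8, 0.0, 0.3],
     [1.0, 1.1, 0.3, 0.0]]
print(ahc_single_linkage(D, ["w1", "w2", "w3", "w4"]))
# [('w1', 'w2', 0.2), ('w3', 'w4', 0.3), ('(w1,w2)', '(w3,w4)', 0.8)]
```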

4.3. Numerical Example

Consider 6 individuals described by 2 qualitative variables with, respectively, 2 and 3 modalities, and 10 observations for each individual. The observations are grouped in Table 1.


Table 1: The ten observations of each of the six individuals (0/1 coding of the modalities).

          ω1                  ω2                  ω3
     V1       V2         V1       V2         V1       V2
  m11 m12  m21 m22 m23  m11 m12  m21 m22 m23  m11 m12  m21 m22 m23

   1   0    1   0   0    1   0    0   1   0    1   0    1   0   0
   0   1    0   0   1    0   1    0   0   1    0   1    0   0   1
   1   0    0   1   0    0   1    1   0   0    1   0    0   1   0
   0   1    1   0   0    0   1    1   0   0    0   1    0   1   0
   0   1    1   0   0    1   0    0   1   0    0   1    1   0   0
   1   0    0   0   1    0   1    1   0   0    1   0    0   0   1
   0   1    0   1   0    0   1    0   0   1    1   0    1   0   0
   1   0    0   1   0    1   0    0   1   0    0   1    0   1   0
   1   0    0   1   0    0   1    0   1   0    0   1    1   0   0
   1   0    1   0   0    1   0    0   0   1    1   0    0   1   0

          ω4                  ω5                  ω6
     V1       V2         V1       V2         V1       V2
  m11 m12  m21 m22 m23  m11 m12  m21 m22 m23  m11 m12  m21 m22 m23

   1   0    1   0   0    1   0    0   1   0    1   0    1   0   0
   0   1    0   1   0    0   1    1   0   0    1   0    0   0   1
   1   0    0   0   1    1   0    0   0   1    0   1    0   1   0
   0   1    1   0   0    1   0    0   1   0    1   0    1   0   0
   0   1    0   1   0    0   1    0   1   0    0   1    0   0   1
   1   0    0   1   0    1   0    0   1   0    0   1    0   0   1
   1   0    0   1   0    0   1    0   0   1    0   1    1   0   0
   1   0    0   0   1    1   0    0   1   0    1   0    1   0   0
   0   1    0   0   1    1   0    0   1   0    1   0    0   1   0
   1   0    1   0   0    1   0    1   0   0    1   0    1   0   0

4.3.1. Procedure to Build a Hierarchy on these Objects

The empirical distributions which represent the individuals are given by Table 2.


Table 2: Empirical distributions F1, ..., F6 of the six individuals.

        (m11,m21) (m11,m22) (m11,m23) (m12,m21) (m12,m22) (m12,m23)
  F1       0.2       0.3       0.1       0.2       0.1       0.1
  F2       0.1       0.2       0.1       0.3       0.1       0.1
  F3       0.2       0.2       0.1       0.2       0.2       0.1
  F4       0.2       0.2       0.2       0.1       0.2       0.1
  F5       0.1       0.4       0.2       0.1       0.1       0.1
  F6       0.4       0.1       0.1       0.1       0.1       0.1

The program is carried out on this numerical example. We obtain the following results (Table 3).


Table 3: Values of K_d(P_l, P_t) (line l, column t); the diagonal entries equal the entropies H(S_l).

         S1      S2      S3      S4      S5      S6
  S1   2.4464  2.6049  2.5219  2.6219  2.6219  2.8219
  S2   2.6049  2.4464  2.6219  2.8219  2.8219  2.9219
  S3   2.6049  2.7049  2.5219  2.6219  2.8219  2.8219
  S4   2.7049  2.8634  2.6219  2.5219  2.7219  2.8219
  S5   2.4879  2.6634  2.6219  2.5219  2.3219  3.0219
  S6   2.6634  2.8634  2.6219  2.6219  3.0219  2.3219

Step 1. From the similarity matrix, using the single linkage of Sneath (4.3), we obtain
$$\min_{l\neq t;\ l,t=1,\dots,6}\mathrm{dist}\big(S_l,S_t\big)=\mathrm{dist}\big(S_1,S_3\big)=0.1585.\tag{4.4}$$
Then the objects $\omega_1$ and $\omega_3$ are aggregated into the artificial object $\omega_7$, which is placed at the last line, and the rows and columns corresponding to the objects $\omega_1$ and $\omega_3$ are removed from the similarity matrix.

Step 2. From the new similarity matrix, we obtain
$$\min_{l\neq t;\ l,t\in\{2,4,5,6,7\}}\mathrm{dist}\big(S_l,S_t\big)=\mathrm{dist}\big(S_2,S_7\big)=0.317.\tag{4.5}$$
The objects $\omega_2$ and $\omega_7$ are aggregated into the artificial object $\omega_8$.

Step 3.
$$\min_{l\neq t;\ l,t\in\{4,5,6,8\}}\mathrm{dist}\big(S_l,S_t\big)=\mathrm{dist}\big(S_4,S_8\big)=1.268.\tag{4.6}$$
The objects $\omega_4$ and $\omega_8$ are aggregated into the artificial object $\omega_9$.

Step 4.
$$\min_{l\neq t;\ l,t\in\{5,6,9\}}\mathrm{dist}\big(S_l,S_t\big)=\mathrm{dist}\big(S_5,S_9\big)=2.732.\tag{4.7}$$
The objects $\omega_5$ and $\omega_9$ are clustered into the new object $\omega_{10}$. Finally, the object $\omega_6$ is aggregated with the object $\omega_{10}$ at $\mathrm{dist}(S_6,S_{10})=4.8$ (see Figure 1).

In Figure 1, it can be seen that two separated classes appear in the graph by simply cutting the hierarchy at the level just above the one at which the individual $\omega_2$ is aggregated. In this algorithm, we started by aggregating the two closest objects according to the index of distance between the corresponding physical systems. The higher one goes in the construction of the hierarchy, the more uncertain the states of the merged system become. The example shows that the Kullback-Leibler index together with the single-linkage (minimum bound) aggregation index leads to the construction of a system with maximum entropy, that is, a system for which all the states are equiprobable.

If the total number of modalities of the various criteria is large compared with the number of observations, the frequency of each combination of modalities becomes small, and many frequencies are zero. The combinations of modalities whose frequency is zero are disregarded and do not intervene in the calculation of the distances. This can make it impossible to compare the systems.

4.3.2. Classification of the Six Objects after Reduction

If each object is described by its most frequent modalities (the "mode"), we obtain the following table:

         V1    V2
   ω1   m11   m22
   ω2   m12   m23
   ω3   m11   m21
   ω4   m11   m21
   ω5   m11   m22
   ω6   m11   m22
(4.8)

This table contradicts our procedure: in the hierarchy the objects $\omega_1$ and $\omega_2$ are very close while $\omega_1$ and $\omega_5$ are not, yet after reduction $\omega_1$ and $\omega_2$ are described by different modal values whereas $\omega_1$ and $\omega_5$ become identical. This shows that classification after reduction, for this type of data, can lead to contradictory results.

4.4. Application

The data come from a survey concerning the level of development of $n$ departments $E_1,E_2,\dots,E_n$ of a country. The aim is to identify the least developed subregions in order to establish adequate development programs. Every department $E_i$ consists of $L_i$ subregions $C_{i1},C_{i2},\dots,C_{iL_i}$. For every $i=1,\dots,n$ and $l=1,\dots,L_i$, we measured the composite economic development index $id_E$ and the composite human development index $id_H$. These two composite indices are weighted means of variables measuring the degree of economic and human development, developed by experts of the United Nations development program for ranking countries. The indices depend on the geographical situation and on the specificity of the subregions.

For every $i=1,\dots,n$ and $l=1,\dots,L_i$, $0\le id_E(C_{il})\le 1$ and $0\le id_H(C_{il})\le 1$. The closer the value of an index is to 1, the more satisfactory the economic or human development is judged to be. However, these indices are not calculated in the same manner: they depend on whether the subregions are classified as farming or urban zones, so the ordering of the subregions according to each of the indices is no longer meaningful. The structure of the data in entry is, for every $i=1,\dots,n$,
$$E_i(L_i,2)\;\longmapsto\;
\begin{array}{c|cc}
 & id_E & id_H\\\hline
C_{i1} & id_E(C_{i1}) & id_H(C_{i1})\\
C_{i2} & id_E(C_{i2}) & id_H(C_{i2})\\
\vdots & \vdots & \vdots\\
C_{iL_i} & id_E(C_{iL_i}) & id_H(C_{iL_i})
\end{array}\tag{4.9}$$

The structure is not exploitable in this form; it is therefore necessary to transform the tables into a more tractable form. The specialists of the development programs cut the observations of each index into quintile intervals and assign each subregion to the corresponding quintile. We thus determine, for the $n$ series of observations of the two indices, the corresponding quintiles
$$id_E\longrightarrow q^1_{i1},q^1_{i2},\dots,q^1_{i5},\qquad id_H\longrightarrow q^2_{i1},q^2_{i2},\dots,q^2_{i5},\tag{4.10}$$
and the quintile intervals
$$I^1_{i1},I^1_{i2},\dots,I^1_{i5},\qquad I^2_{i1},I^2_{i2},\dots,I^2_{i5}.\tag{4.11}$$
For every $i=1,\dots,n$, the table $E_i$ is thus transformed into a table of 0/1 data.
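A Python sketch of this quintile recoding, under the assumption that the empirical 20th, 40th, 60th, and 80th percentiles of the department's series serve as cut points; the function name and index values are invented for illustration.

```python
import numpy as np

def quintile_coding(values):
    """Turn a series of index values in [0, 1] into a 0/1 table with 5
    columns: column q holds a 1 when the value falls in the q-th
    quintile interval of the series."""
    values = np.asarray(values, dtype=float)
    cuts = np.percentile(values, [20, 40, 60, 80])
    q = np.searchsorted(cuts, values)          # quintile index, 0..4
    table = np.zeros((values.size, 5), dtype=int)
    table[np.arange(values.size), q] = 1
    return table

# Hypothetical id_E values for the L_i subregions of one department;
# juxtaposing the codings of id_E and id_H yields the 10-column table Delta_i.
id_E = np.array([0.12, 0.35, 0.48, 0.57, 0.63, 0.71, 0.80, 0.92])
print(quintile_coding(id_E))
```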

The problem is to build a hierarchy on all the departments of the territory in order to observe the level of development of each subregion according to the two indices and thus to make comparisons. The observations are summarized in the tables $\Delta_1,\Delta_2,\dots,\Delta_n$, which constitute a structure of juxtaposed 0/1 data matrices. The data presented come from a study of 1541 municipalities in Algeria, gathered into 48 departments. The departments do not have the same number of municipalities, and the municipalities differ in their characteristics (size, rural or urban) and in their locations (mountain, plain, coastal, and so forth). The goal is to build typologies of departments according to their levels of economic and human development, following the United Nations standards.

The result of the study made it possible to group together the large departments (cities), which have large, long-established universities and municipalities of long standing. Another group emerged comprising fairly new departments, created by the last administrative redistricting, which develop activities and services of small and medium-sized companies. The other groups are distinguished by great disparities between municipalities in their economic level and human development, and by differences in surface area and importance.

5. Conclusion

In this paper, the definition of the entropy is the one stated by Shannon [10], which is still used in signal and information theory. The suggested formalism gives an explanation and a practical use of the Kullback-Leibler distance as an index of distance between the representative elements of a structure of tables of categorical data. It is possible to extend these results to structures of data tables of measurements and to adapt a classification algorithm to the case of functional data.

References

  1. I. C. Lerman and B. Tallur, "Classification des éléments constitutifs d'une juxtaposition de tableaux de contingence," Revue de Statistique Appliquée, vol. 28, no. 3, pp. 5–28, 1980.
  2. A. Rebbouh, "Clustering the constitutive elements of measuring tables data structure," Communications in Statistics: Simulation and Computation, vol. 35, no. 3, pp. 751–763, 2006.
  3. A. Agresti, Categorical Data Analysis, Wiley Series in Probability and Statistics, John Wiley & Sons, New York, NY, USA, 2nd edition, 2002.
  4. I. T. Adamson, Data Structures and Algorithms: A First Course, Springer, Berlin, Germany, 1996.
  5. J. Beidler, Data Structures and Algorithms, Springer, New York, NY, USA, 1997.
  6. T. W. Anderson and J. D. Jeremy, The New Statistical Analysis of Data, Springer, New York, NY, USA, 1996.
  7. F. de A. T. de Carvalho, R. M. C. R. de Souza, M. Chavent, and Y. Lechevallier, "Adaptive Hausdorff distances and dynamic clustering of symbolic interval data," Pattern Recognition Letters, vol. 27, no. 3, pp. 167–179, 2006.
  8. P. Orlik and H. Terao, Arrangements of Hyperplanes, vol. 300 of Grundlehren der Mathematischen Wissenschaften, Springer, Berlin, Germany, 1992.
  9. J. M. Bouroche, Analyse des données ternaires: la double analyse en composantes principales, M.S. thesis, Université de Paris VI, Paris, France, 1975.
  10. C. E. Shannon, "A mathematical theory of communication," The Bell System Technical Journal, vol. 27, pp. 379–423, 623–656, 1948.
  11. G. Celeux and G. Soromenho, "An entropy criterion for assessing the number of clusters in a mixture model," Journal of Classification, vol. 13, no. 2, pp. 195–212, 1996.
  12. S. Kullback, Information Theory and Statistics, John Wiley & Sons, New York, NY, USA, 1959.
  13. G. N. Lance and W. T. Williams, "A general theory of classificatory sorting strategies—II: clustering systems," Computer Journal, vol. 10, pp. 271–277, 1967.
  14. P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, A Series of Books in Biology, W. H. Freeman, San Francisco, Calif, USA, 1973.

Copyright © 2008 Amar Rebbouh. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

