Journal of Probability and Statistics

Volume 2015, Article ID 657965, 14 pages

http://dx.doi.org/10.1155/2015/657965

## Estimating Parameters of a Probabilistic Heterogeneous Block Model via the EM Algorithm

Institute of Mathematics, Budapest University of Technology and Economics, Műegyetem Rakpart 3, Budapest 1111, Hungary

Received 16 June 2015; Revised 6 November 2015; Accepted 10 November 2015

Academic Editor: Hyungjun Cho

Copyright © 2015 Marianna Bolla and Ahmed Elbanna. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

We introduce a semiparametric block model for graphs, where the within- and between-cluster edge probabilities are not constant within the blocks but are described by logistic-type models, reminiscent of the 50-year-old Rasch model and the newly introduced - models. Our purpose is to find a partition of the vertices of an observed graph such that the induced subgraphs and bipartite subgraphs obey these models, whose strongly interlaced parameters provide a multiscale evaluation of the vertices at the same time. In this way, a profoundly heterogeneous version of the stochastic block model is built via mixtures of the above submodels, while the parameters are estimated with a special EM iteration.

#### 1. Introduction

So far, many parametric and nonparametric methods have been proposed for so-called community detection in networks. In the nonparametric scenario, hierarchical or spectral methods were applied to maximize the two- or multiway Newman-Girvan modularity [1–4]; more generally, spectral clustering tools (SC), based on Laplacian or modularity spectra, proved feasible for finding community, anticommunity, or regular structures in networks [5]. In the parametric setup, certain model parameters are estimated, usually by maximizing the *likelihood function* of the graph, that is, the joint probability of our observations under the model equations. This so-called ML estimation is a promising method of statistical inference: it has solid theoretical foundations [6, 7] and also supports the common-sense goal of accepting parameter values under which our sample is the most likely.

As for the parametric scenario, in the 2010s, the and models [8, 9] were developed as the unique graph models in which the *degree sequence is a sufficient statistic*: given the degree sequence, the distribution of the random graph no longer depends on the parameters (microcanonical distribution over the model graphs). This fact makes it possible to derive the ML estimate of the parameters in a standard way [10]. Indeed, in the context of network data, a lot of information is contained in the degree sequence, though perhaps in a more sophisticated way. The vertices may have *clusters* (groups or modules), and their memberships may affect their affinity to make ties. We will find groups of the vertices such that the within- and between-cluster edge probabilities admit certain parametric graph models, the parameters of which are highly interlaced. Here the degree sequence is no longer a sufficient statistic, unless it is restricted to the subgraphs. When making inference, we are inspired partly by the stochastic block model, partly by the Rasch model, and by the rectangular analogue of the - models.

The generalized random graph model, sometimes called the stochastic block model (SBM), was first introduced in [11] and discussed later in [12–18]. This model is a generalization of the classical Erdős–Rényi random graph , the first random graph model in history, introduced in [19], which corresponds to the one-cluster case: between any pair of vertices, edges come into existence independently, with the same probability . The graph on vertices is a *generalized random graph* with symmetric probability matrix and proper -partition of the vertices if vertices of and are connected independently, with probability ; further, any pair of vertices within is connected with probability . Therefore, the subgraph of confined to the vertex set is an Erdős–Rényi type random graph, while the bipartite subgraphs connecting vertices of and are random bipartite graphs with edge probability . Sometimes we refer to as a *clustering*, where are the clusters. In fact, Szemerédi's seminal regularity lemma [20] guarantees the existence of such a structure in any huge graph, albeit with an enormously large number of clusters; therefore, it is not applicable for practical purposes.

Though in [18] the SBM is called heterogeneous, it is in fact homogeneous: the probability of making ties is the same within the clusters or between the cluster pairs. Nonetheless, this probability depends on the actual cluster memberships; given the memberships of the vertices, the probability that they are connected is a given constant, for the estimation of which an algorithm (maximizing the likelihood modularity) is proposed in [12], and the EM algorithm is described in [5]. Here we want to model more complicated within- and between-cluster relations.

As for the nonparametric scenario, the role of the degree sequence is best captured in normalized Laplacian or normalized modularity based SC. In [4] we extended the notion of the *modularity matrix* to weighted graphs as follows. Let be an *edge-weighted graph* on the -element vertex set with the symmetric weight matrix ; the entries satisfy , , and they are similarities between the vertex pairs. The *modularity matrix* of is defined as , where the entries of are the *generalized vertex degrees* . Here is normalized in such a way that , an assumption that does not hurt generality, since the forthcoming *normalized modularity matrix*, to be mostly used, is not affected by the scaling of the entries of : where is the diagonal *degree matrix*.

In [4], we also introduced the following spectral relaxation technique to approximate the -partition of the vertices minimizing the within- and between-cluster discrepancies in the spirit of Szemerédi's regularity lemma. (This discrepancy measures the homogeneity of the clusters; we will not use this notion here; see [21] for more information.) Let the eigenvalues of , enumerated in decreasing absolute values, be . Assume that , and denote by the corresponding unit-norm, pairwise orthogonal eigenvectors. Let be the row vectors of the matrix of column vectors ; they are called -dimensional representatives of the vertices. The *weighted -variance* of these representatives is defined as where is the weighted center of cluster . It is the *weighted -means algorithm* that attains this minimum, and the point is that the optimum is just the minimum distance between the eigensubspace corresponding to and that of the suitably transformed step-vectors over the -partitions of . In Chapter of [5] we also discussed that, in view of subspace perturbation theorems, the larger the gap between and , the smaller .

Note that normalized modularity based spectral clustering is the same as the normalized Laplacian based one (see [5]), except that here not only the bottom eigenvalues but those of largest absolute value are considered; further, the heterogeneity of the vertex degrees is encoded into the diagonal degree matrix . With the above technique we embed the vertices into a low dimensional space (spectral relaxation via eigenvectors) and hence perform metric clustering. However, beyond the spatial location, SC does not assign any parameters to the vertices. Our method, to be introduced, finds parameters and classifies the vertices at the same time.

Here we propose a profoundly heterogeneous block model by carrying on the Rasch model, developed more than 50 years ago for evaluating psychological tests [22, 23]. We will call it the Logistic Block Model (LBM). Given the number of clusters and a classification of the vertices, we will use the Rasch model for the bipartite subgraphs and the - models for the subgraphs themselves, and run an iteration (inner cycle) to find the ML estimate of their parameters. Then, based on their contributions to the overall likelihood, we find a new classification of the vertices by taking conditional expectation and using the Bayes rule. Eventually, the two steps are alternated, giving the outer cycle of the iteration.

Our algorithm fits into the framework of the EM algorithm [7, 24], in the context of exponential families. The method was originally developed for missing data, and the name comes from the alternating *expectation* (E) and *maximization* (M) steps: in the E-step (assignment phase) we complete the data by substituting for the missing data via conditional expectation, while in the M-step (estimation phase) we find the usual ML estimate of the parameters based on the so completed data. The EM algorithm naturally extends to situations where not the data itself is missing, but the data come from a finite mixture, and the grouping memberships are the missing parameters. This special type of EM algorithm, developed for mixtures, is often called collaborative filtering [25, 26] or Gibbs sampling [27], the roots of which can be traced back to [28].

After proving the convergence of the inner cycle to the unique solution of the likelihood equation in each block separately, the convergence of the outer cycle to a local maximum of the likelihood function is easily seen. The advantage of the LBM is that, unlike SC, beyond clustering the vertices, it also assigns parameters to them, where the parameters depend on their cluster memberships. Therefore we call it semiparametric. In the context of social networks, the clusters can be identified with social strata and the parameters with attitudes of people of one group towards people of the other, where the attitude is the same for people in the second group but depends on the individual in the first group. The number of clusters is fixed during the iteration, but an initial number of clusters is obtained by SC, via inspecting the normalized modularity spectrum of the graph. We apply the algorithm to randomly generated and real-world data, where the initial clustering was the one obtained by SC. Then we compare the results of SC, SBM, and LBM by the Rand index, and our LBM shows a better agreement with the SC clusters than the SBM. It seems that SC gives a solution close to a local maximum of the LBM, so the LBM can be regarded as a fine-tuning of SC. In fact, without a good starting clustering, the LBM can run into a local maximum (there are many) far from the global one. Therefore, it needs an SC start; its advantage, however, is that it is also able to estimate parameters.

The paper is organized as follows. In Section 2 we describe the building blocks of our model. In the context of the - models we refer to already proved facts about the existence of the ML estimate and, if it exists, discuss the algorithm proposed by [9] together with convergence facts; meanwhile, in the context of the - model, we introduce a novel algorithm and prove its convergence in the appendix. In Section 3 we use both of the above algorithms for the subgraphs and bipartite subgraphs of our sample graph and connect them together in the framework of the EM algorithm. In Section 4, the algorithm is applied to randomly generated and real-world data, while Section 5 is devoted to a brief discussion.

#### 2. The Building Blocks of the LBM

Log-linear and logistic type models to describe contingency tables in a combinatorial fashion were proposed, for example, by [11, 29] and are widely used in statistics. Together with the Rasch model, they give the foundation of the unweighted graph and bipartite graph models which are the building blocks of our EM iteration.

##### 2.1. Models for Undirected Random Graphs

With different parameterizations, [8, 9] introduced the following random graph model, in which the degree sequence is a sufficient statistic. We have an unweighted, undirected random graph on vertices without loops, such that edges between distinct vertices come into existence independently, but not with the same probability as in the classical Erdős–Rényi model [19]. This random graph can be uniquely characterized by its symmetric adjacency matrix , which has zero diagonal and whose entries above the main diagonal are independent Bernoulli random variables whose parameters obey the following rule. Actually, we formulate this rule for the ratios, the so-called *odds*: where the parameters are positive reals. This model is called the model in [9]. With the parameter transformation , it is equivalent to the model of [8], which applies to the *logits*: with real parameters .
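The model symbols were lost in extraction above, but the logit form of the model in [8] is standard: the logit of each edge probability is the sum of the two endpoint parameters. A minimal sketch under this assumption:

```python
import numpy as np

def edge_probs_beta(beta):
    """Edge probabilities of a beta-model-type graph.

    A sketch assuming the standard form of the model in [8]:
    logit p_ij = beta_i + beta_j for distinct vertices i, j,
    and no loops (zero diagonal)."""
    beta = np.asarray(beta, dtype=float)
    s = beta[:, None] + beta[None, :]      # beta_i + beta_j
    p = 1.0 / (1.0 + np.exp(-s))           # inverse logit
    np.fill_diagonal(p, 0.0)               # no loops
    return p
```

With opposite parameters the logits cancel, so two such vertices are connected with probability 1/2.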

Conversely, the probabilities and can be expressed in terms of the parameters; these formulas will be used intensively in the subsequent calculations.

We are looking for the ML estimate of the parameter vector or based on the observed unweighted, undirected graph as a statistical sample. (It may seem that we have a one-element sample here; however, there are independent random variables, the adjacencies, in the background.)

Let denote the degree vector of the above random graph, where . The random vector , as a function of the sample entries 's, is a *sufficient statistic* for the parameter , or, equivalently, for . Roughly speaking, a sufficient statistic itself contains all the information for the parameter that can be retrieved from the data. More precisely, a statistic is sufficient when the conditional distribution of the sample, given the statistic, does not depend on the parameter any more. By the *Neyman-Fisher factorization theorem* [6], a statistic is sufficient if and only if the likelihood function of the sample can be factorized into two parts: one that does not contain the parameter, and the other, which includes the parameter, containing the sample entries merely compressed into this sufficient statistic. Consider this factorization of the likelihood function (the joint probability of 's) in our case. Because of the symmetry of , this is where we used (5) and the facts that , and , . Here the partition function depends only on , and the whole likelihood function depends on 's merely through 's. Therefore, is a sufficient statistic. The other factor is constantly 1, indicating that the conditional joint distribution of the entries, given , is uniform, but we will not make use of this fact. The whole model comes from the so-called log-linear way of model building; see [29]. In [8, 10], the converse statement is also proved: the above model (reparametrized as the model) is the unique one in which the degree sequence is a sufficient statistic.

Let be the matrix of sample realizations (the adjacency entries of the observed graph), let be the actual degree of vertex , and let be the observed degree vector. The above factorization also indicates that the joint distribution of the entries belongs to the exponential family; hence, with natural parameterization [24], the maximum likelihood estimate (or, equivalently, ) is derived from the fact that, with it, the observed degree equals the expected one; that is, . Therefore, is the solution of the following *maximum likelihood equation*: The ML estimate is easily obtained from by taking the logarithms of its coordinates.

Before discussing the solution of the system of (7), let us see what conditions a sequence of nonnegative integers should satisfy so that it can be realized as the degree sequence of a graph. The sequence of nonnegative integers is called *graphic* if there is an unweighted, undirected graph on vertices such that its vertex degrees are the numbers in some order. Without loss of generality, 's can be enumerated in nonincreasing order. The Erdős–Gallai theorem [30] gives the following necessary and sufficient condition for a sequence to be graphic. The sequence of integers is graphic if and only if it satisfies the following two conditions: is even and
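The two Erdős–Gallai conditions (even degree sum plus the partial-sum inequalities for the nonincreasingly sorted sequence) can be checked directly; a sketch:

```python
def is_graphic(seq):
    """Erdos-Gallai test: True iff `seq` can be realized as the degree
    sequence of a simple undirected graph.  A direct transcription of
    the two conditions stated above: even degree sum, plus the
    partial-sum inequalities for the nonincreasingly sorted sequence."""
    d = sorted(seq, reverse=True)
    n = len(d)
    if sum(d) % 2 != 0:
        return False
    if d and (d[0] > n - 1 or d[-1] < 0):
        return False
    for k in range(1, n + 1):
        if sum(d[:k]) > k * (k - 1) + sum(min(x, k) for x in d[k:]):
            return False
    return True
```

For example, `[3, 3, 3, 3]` is graphic (the complete graph on 4 vertices), while `[3, 3, 1, 1]` is not.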

Note that for nonnegative (not necessarily integer) real sequences, a continuous analogue of (8) is derived in [8]. For given , the convex hull of all possible graphic degree sequences is a polytope, to be denoted by . Its extreme points are the so-called *threshold graphs* [31]. It is interesting that for all undirected graphs are threshold, since there are 8 possible graphs on 3 nodes and also 8 vertices of ; the case is also not of much interest; therefore, we will treat the cases only. The number of vertices of grows superexponentially with [32]; therefore, the problem of characterizing threshold graphs has a high computational complexity. Its facial and cofacial sets are fully described in [10]. Apart from the trivial cases (when there is at least one degree equal to 0 or ), in [33] the authors give the following equivalent characterization of a threshold graph for : it has no four different vertices , such that and are connected by an edge but and are not; that is, it has no two disjoint copies of the complete graph .

The authors of [8, 9] prove that is the topological closure of the set of expected degree sequences and that, for given , if is an interior point, then maximum likelihood equation (7) has a unique solution. Later it turned out that the converse is also true: in [10] the authors prove that the ML estimate exists if and only if the observed degree vector is an inner point of . In contrast, when the observed degree vector is a boundary point of , there is at least one 0 or 1 probability, which can be obtained only by a parameter vector such that at least one of the 's is not finite. In this case, the likelihood function cannot be maximized with a finite parameter set; its supremum is approached by a parameter vector with at least one coordinate tending to or .

The authors in [9] recommend the following algorithm and prove that, provided , its iteration converges to the unique solution of system (7). To motivate the iteration, we rewrite (7) as Then, starting with initial parameter values and using the observed degree sequence , which is an inner point of , the iteration is as follows: for , until convergence.
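The iteration of [9] can be sketched as a fixed-point scheme. The formulas were elided above, so the sketch below assumes the rearrangement of the likelihood equation that is standard for this model: each degree equals the sum of the corresponding edge probabilities with logit p_ij = beta_i + beta_j.

```python
import numpy as np

def fit_beta_model(d, iters=500, tol=1e-10):
    """Fixed-point iteration for the likelihood equations, a sketch
    following the rearrangement d_i = sum_{j != i} p_ij with
    logit p_ij = beta_i + beta_j.  Convergence is guaranteed in [9]
    only when the observed degree vector is an inner point of the
    polytope of graphic sequences."""
    d = np.asarray(d, dtype=float)
    beta = np.zeros(len(d))
    for _ in range(iters):
        # e^{-beta_j} + e^{beta_i}, excluding the j = i term
        denom = np.exp(-beta)[None, :] + np.exp(beta)[:, None]
        np.fill_diagonal(denom, np.inf)
        new = np.log(d) - np.log((1.0 / denom).sum(axis=1))
        done = np.max(np.abs(new - beta)) < tol
        beta = new
        if done:
            break
    return beta
```

At convergence, the expected degrees computed from the fitted parameters reproduce the observed ones, which is exactly the ML equation.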

##### 2.2. - Model for Bipartite Graphs

This bipartite graph model, since there is a one-to-one correspondence between bipartite graphs and 0-1 rectangular arrays, traces back to Haberman [34], Lauritzen [29], and Rasch [22, 23], who applied it to psychological and educational measurements and, later, market research. According to the Rasch model, the entries of an binary table are independent Bernoulli random variables, where for the parameter of the entry the following holds: with real parameters and . As an example, Rasch in [22] investigated binary tables in which the rows corresponded to persons and the columns to items of some psychological test, and the th entry of the th row was 1 if person answered test item correctly and 0 otherwise. He also gave an interpretation of the parameters: was the ability of person , while was the difficulty of test item . Therefore, in view of model equation (11), the more intelligent the person and the less difficult the test, the larger the success/failure ratio on a logarithmic scale.

Given an random binary table , or, equivalently, a bipartite graph, our model is with real parameters and ; further, .

In terms of the transformed parameters and , model (12) is equivalent to where and are positive reals.

Conversely, the probabilities can be expressed in terms of the parameters:

Observe that if (12) holds with the parameters ’s and ’s, then it also holds with the transformed parameters and with some . Equivalently, if (13) holds with the positive parameters ’s and ’s, then it also holds with the transformed parameters with some . Therefore, the parameters and are arbitrary to within a multiplicative constant.
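This scale invariance is easy to verify numerically: multiplying one parameter set by a constant and dividing the other by the same constant leaves every edge probability unchanged. A minimal sketch, with hypothetical parameter values:

```python
import numpy as np

def probs(b, g):
    """Edge probabilities p_ij = b_i g_j / (1 + b_i g_j) of the
    bipartite model above (positive parameters b_i, g_j)."""
    bg = np.outer(b, g)
    return bg / (1.0 + bg)
```

Since the probabilities depend on the parameters only through the products b_i g_j, the pair (b, g) and the rescaled pair (k b, g / k) define the same model.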

Here the row-sums and the column-sums are the sufficient statistics for the parameters collected in and . Indeed, the likelihood function is factorized as Since the likelihood function depends on only through its row- and column-sums, by the Neyman-Fisher factorization theorem, and are a sufficient statistic for the parameters. The first factor (including the partition function) depends only on the parameters and the row- and column-sums, whereas the seemingly absent factor, which would depend merely on , is constantly 1, indicating that the conditional joint distribution of the entries, given the row- and column-sums, is uniform (microcanonical) in this model. Note that, in [35], the author characterizes random tables sampled uniformly from the set of 0-1 matrices with fixed margins. Given the margins, the contingency tables coming from the above model are uniformly distributed, and a typical table of this distribution is produced by the - model with parameters estimated via the row- and column-sums as sufficient statistics. In this way, we obtain another view of the typical table of [35].

Based on an observed binary table , since we are in an exponential family and , are natural parameters, the likelihood equation is obtained by setting the expectation of the sufficient statistic equal to its sample value. Therefore, with the notations and , the following *system of likelihood equations* is yielded: Note that, for any sample realization of , holds automatically. Therefore, there is a dependence between the equations of system (17), indicating that the solution is not unique, in accord with our previous remark about the arbitrary scaling factor of (15). We will prove that, apart from this scaling, the solution is unique if it exists at all. For convenience, let denote the equivalence class of the parameter vector , which consists of parameter vectors satisfying (15) with some . To avoid this indeterminacy, we may impose conditions on the parameters; for example,

Like the graphic sequences, here the following sufficient conditions can be given for the sequences and of integers to be the row- and column-sums of an matrix with 0-1 entries (see, e.g., [36]): Observe that the cases imply and , whereas the and cases together imply . This statement is the counterpart of the Erdős–Gallai conditions for bipartite graphs, where, due to (18), the sum of the degrees is automatically even. In fact, the conditions in (20) are redundant: one of them, either the one for the rows or the one for the columns, suffices together with (18) and or . The necessary and sufficient conditions so obtained define *bipartite realizable sequences* in the wording of [33]. Already in 1957, the author of [37] determined arithmetic conditions for the construction of a 0-1 matrix having given row- and column-sums. The construction was given via swaps. More generally, [38] referred to the transportation problem and the Ford-Fulkerson max-flow–min-cut theorem [39].
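These conditions (of Gale–Ryser type) can be checked mechanically: equal sums plus partial-sum inequalities for the sorted row-sums. A sketch:

```python
def is_bipartite_realizable(r, c):
    """Gale-Ryser-type test: True iff the integer sequences r and c can
    be the row- and column-sums of a 0-1 matrix, i.e. a bipartite
    realizable pair in the wording above.  Checks the equal-sum
    condition and the partial-sum inequalities for the rows."""
    if sum(r) != sum(c):
        return False
    m, n = len(r), len(c)
    if any(x < 0 or x > n for x in r) or any(x < 0 or x > m for x in c):
        return False
    rs = sorted(r, reverse=True)
    for k in range(1, m + 1):
        if sum(rs[:k]) > sum(min(cj, k) for cj in c):
            return False
    return True
```

For instance, row-sums (2, 2, 0) with column-sums (3, 1) are not realizable: two rows of sum 2 in a two-column table force column sums (2, 2).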

The convex hull of the bipartite realizable sequences and forms a polytope in , actually, because of (18), in its -dimensional hyperplane. It is called the *polytope of bipartite degree sequences* and denoted by in [33]. It is a special case of the transportation polytope describing margins of contingency tables with nonnegative integer entries. There is an expanding literature on the number of such matrices, for example, [40], and on the number of 0-1 matrices with prescribed row- and column-sums, for example, [41].

Analogously to the considerations of the - models, and applying the ideas of the proofs in [8–10], is the closure of the set of expected row- and column-sum sequences in the above model. In [33], it is proved that an binary table, or equivalently a bipartite graph on the independent sets of and vertices, is on the boundary of if it does not contain two vertex-disjoint edges. In this case, the likelihood function cannot be maximized with a finite parameter set; its supremum is approached by a parameter vector with at least one coordinate, or , tending to or , or, equivalently, with at least one coordinate, or , tending to or . Based on the proofs of [10], and stated as Theorem 6.3 in the supplementary material of [10], the maximum likelihood estimate of the parameters of model (13) exists if and only if the observed row- and column-sum sequence lies in the relative interior of , satisfying (18). In this case, holds for the probabilities calculated by formula (14) through the estimated positive parameter values 's and 's (solutions of (17)).

Under these conditions, we define an algorithm that converges to the unique (up to the above equivalence) solution of maximum likelihood equations (17).

Theorem 1. *If , then the following algorithm gives a unique equivalence class of the parameter vectors as the fixed point of the iteration, which therefore provides the ML estimate of the parameters. Starting with positive parameter values and and using the observed row- and column-sums, the iteration is as follows: for , until convergence.*
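The iteration formulas were elided above; the sketch below assumes the natural rearrangement of the likelihood equations for the model p_ij = b_i g_j / (1 + b_i g_j): each row-sum determines an update for b with g held fixed, and vice versa, alternated until convergence.

```python
import numpy as np

def fit_bipartite_model(r, c, iters=5000, tol=1e-12):
    """Alternating fixed-point iteration for the bipartite model
    p_ij = b_i g_j / (1 + b_i g_j), rearranging the likelihood
    equations r_i = sum_j p_ij and c_j = sum_i p_ij.  A sketch
    assuming (r, c) lies in the relative interior of the polytope of
    bipartite degree sequences; (b, g) is determined only up to the
    scaling b -> k b, g -> g / k."""
    r = np.asarray(r, dtype=float)
    c = np.asarray(c, dtype=float)
    b, g = np.ones(len(r)), np.ones(len(c))
    for _ in range(iters):
        b_new = r / (g[None, :] / (1.0 + np.outer(b, g))).sum(axis=1)
        g_new = c / (b_new[:, None] / (1.0 + np.outer(b_new, g))).sum(axis=0)
        done = max(np.max(np.abs(b_new - b)), np.max(np.abs(g_new - g))) < tol
        b, g = b_new, g_new
        if done:
            break
    return b, g
```

The returned parameters are only one representative of the equivalence class, but the fitted probabilities, and hence the expected row- and column-sums, are scale-free and reproduce the observed margins.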

#### 3. Parameter Estimation in the LBM

In the case of several clusters, we put the building blocks together. The above discussed - and - models will be the building blocks of the LBM to be introduced. Here the degree sequences are no longer sufficient for the whole graph, only for the building blocks of the subgraphs.

Given , we are looking for a -partition, in other words, clusters of the vertices, such that:

(i) different vertices are independently assigned to a cluster with probability , where ;

(ii) given the cluster memberships, vertices and are connected independently, with probability , such that for any , pair. Equivalently, where is the cluster membership of vertex and .

The parameters are collected in the vector and the matrix of 's . The likelihood function is the following mixture: Here is the incomplete data specification, as the cluster memberships are missing. Therefore, it is straightforward to use the EM algorithm, proposed by [24] and also discussed in [7, 42], for parameter estimation from incomplete data. This special application to mixtures is sometimes called *collaborative filtering* (see [25, 26]), which is rather applicable to fuzzy clustering.

First we complete our data matrix with latent membership vectors of the vertices that are -dimensional i.i.d. (multinomially distributed) random vectors. More precisely, , where if and zero otherwise. Thus, the sum of the coordinates of any is 1, and .

Note that if the cluster memberships were known, then the complete likelihood would be , which is valid only in the case of known cluster memberships.

Starting with initial parameter values and and membership vectors , the th step of the iteration is the following ():

(i) E-step: we calculate the conditional expectation of each , conditioned on the model parameters and on the other cluster assignments obtained in step , collectively denoted by . The responsibility of vertex for cluster in the th step is defined as the conditional expectation , and, by the Bayes theorem, it is . For each , is proportional to the numerator; therefore, the conditional probabilities should be calculated for . But this is just the part of likelihood (25) affecting vertex under the condition . Therefore, where is the number of edges within that are connected to .

(ii) M-step: we update and : and if and 0 otherwise (in case of ambiguity, we select the smallest index for the cluster membership of vertex ). This is an ML estimation (a discrete one, in the latter case, for the cluster membership). In this way, a new clustering of the vertices is obtained. Then we estimate the parameters in the actual clustering of the vertices. In the within-cluster scenario, we use the parameter estimation of model (3), obtaining estimates of 's () in each cluster separately; as for cluster , corresponds to , and the number of vertices is . In the between-cluster scenario, we use bipartite graph model (13) in the following way. For , the edges connecting vertices of and form a bipartite graph, based on which the parameters and are estimated with the above algorithm; here 's correspond to 's, 's correspond to 's, and the numbers of rows and columns of the rectangular array corresponding to this bipartite subgraph of are and , respectively. With the estimated parameters, collected in the matrix , we go back to the E-step, and so forth.
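The E-step can be sketched compactly. The container `P` below is a hypothetical bookkeeping choice, not notation from the paper: `P[a, i, j]` holds the current model probability of the edge {i, j} if vertex i were placed in cluster a, computed from the fitted block parameters and the other vertices' current assignments.

```python
import numpy as np

def e_step(A, P, pi):
    """Responsibilities r[i, a] via the Bayes rule, as in the E-step
    above.  A is the adjacency matrix, pi the vector of cluster
    proportions, and P a hypothetical (k, n, n) array: P[a, i, j] is
    the current edge probability between i and j if i were placed in
    cluster a."""
    A = np.asarray(A, dtype=float)
    k, n, _ = P.shape
    logr = np.tile(np.log(np.asarray(pi, dtype=float)), (n, 1))
    for a in range(k):
        Pa = np.clip(P[a], 1e-12, 1.0 - 1e-12)   # guard the logarithms
        L = A * np.log(Pa) + (1.0 - A) * np.log(1.0 - Pa)
        np.fill_diagonal(L, 0.0)                 # no loops
        logr[:, a] += L.sum(axis=1)
    logr -= logr.max(axis=1, keepdims=True)      # softmax stabilization
    r = np.exp(logr)
    return r / r.sum(axis=1, keepdims=True)
```

Working with log-responsibilities and normalizing at the end avoids the numerical underflow that the raw product of many edge probabilities would cause.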

As in the M-step we increase the likelihood in all parts, and in the E-step we relocate the vertices into the clusters where their likelihoods are maximized, the nonnegative likelihood function is increased in each iteration. Since the likelihood function is bounded from above (unless in some inner cycle we start from the boundary of a polytope of Section 2), it must converge to a local maximum.

Note that here the parameter with embodies the affinity of vertex of cluster towards vertices of cluster ; likewise, with embodies the affinity of vertex of cluster towards vertices of cluster . By the model, these affinities are added together on the level of the logits. This so-called - model, introduced in [43], is applicable to social networks, where the attitudes of individuals in the same social group (say, ) are the same toward members of another social group (say, ), though this attitude also depends on the individual in group . The model may also be applied to biological networks, where the clusters correspond, for example, to different functioning synapses or other units of the brain; see [44].

After normalizing the and to meet requirement (19), for any pair, the sum of the parameters will be zero, and their sign and magnitude indicate the affinity of the nodes of to make ties with the nodes of and vice versa: This becomes important when we want to compare the parameters corresponding to different cluster pairs. For selecting the initial number of clusters, we can use the considerations of [45], while for the initial clustering, we can use the spectral clustering tools of [5].

#### 4. Applications

Now we illustrate the performance of our algorithm on randomly generated and real-world data. Note that, while running the iteration, we sometimes encounter threshold subgraphs or bipartite subgraphs on the boundary of the polytope of bipartite degree sequences. Even in this case, our iteration converged for most coordinates of the parameter vectors, while some coordinates tended to or 0 (numerically, when stopping the iteration, they took on a very “large” or “small” value). This means that the affinity of node towards nodes of cluster is infinitely “large” or “small”; that is, this node is liable always or never to make ties with nodes of cluster .

First we generated a random graph on vertices with underlying vertex-clusters , , in the following way. Let , , and . The parameters , , and were chosen independently and uniformly from the intervals , , and , respectively. The parameters , , and were chosen independently and uniformly from the intervals , , and , respectively. The parameters , , and were chosen independently and uniformly from the intervals , , and , respectively.
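The generation scheme can be sketched as follows. Since the sizes, intervals, and symbols above were elided, the containers here (`z` for cluster labels, `B[i, b]` for the affinity of vertex i toward cluster b) and the test values are hypothetical:

```python
import numpy as np

def sample_lbm(z, B, rng=None):
    """Draw one graph from an LBM-type model: an edge {i, j} appears
    independently with logit p_ij = B[i, z[j]] + B[j, z[i]], where z
    holds cluster labels and row B[i] the affinities of vertex i
    toward each cluster (hypothetical containers sketching the model
    described above)."""
    rng = np.random.default_rng(rng)
    z = np.asarray(z)
    B = np.asarray(B, dtype=float)
    n = len(z)
    idx = np.arange(n)
    logits = B[idx[:, None], z[None, :]] + B[idx[None, :], z[:, None]]
    p = 1.0 / (1.0 + np.exp(-logits))
    U = rng.random((n, n))
    A = (np.triu(U, 1) < np.triu(p, 1)).astype(int)  # upper triangle only
    return A + A.T                                   # symmetrize, zero diagonal
```

Drawing only the strict upper triangle and symmetrizing keeps the graph simple (zero diagonal) and the edges mutually independent.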

Starting with 3 clusters, obtained by spectral clustering tools, and initial parameter values collected in with all-1 entries, after some outer steps, the iteration converged to . With , we plotted the , pairs for , . Figure 1 shows a good fit of the estimated parameters to the original ones. Indeed, by the general theory of ML estimation [6], for “large” , the ML estimate should approach the true parameter based on which the model was generated. So the good fit means that our algorithm finds estimates close to the true parameters.