Journal of Probability and Statistics

Volume 2018, Article ID 1348147, 17 pages

https://doi.org/10.1155/2018/1348147

## Similarity Statistics for Clusterability Analysis with the Application of Cell Formation Problem

Department of Mechanical and Manufacturing Engineering, University of Calgary, Alberta, Canada

Correspondence should be addressed to Simon Li; simoli@ucalgary.ca

Received 9 August 2018; Accepted 17 October 2018; Published 2 December 2018

Academic Editor: Luis A. Gil-Alana

Copyright © 2018 Yingyu Zhu and Simon Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

#### Abstract

This paper proposes the use of the statistics of similarity values to evaluate the clusterability or structuredness associated with a cell formation (CF) problem. Typically, the structuredness of a CF solution cannot be known until the CF problem is solved. In this context, this paper investigates the similarity statistics of machine pairs to estimate the potential structuredness of a given CF problem without solving it. One key observation is that a well-structured CF solution matrix has a relatively high percentage of high-similarity machine pairs. Histograms are then used as a statistical tool to study the distributions of similarity values. This study leads to the development of the U-shape criteria and a criterion based on the Kolmogorov-Smirnov test. Accordingly, a procedure is developed to classify whether an input CF problem can potentially lead to a well-structured or ill-structured CF matrix. In the numerical study, 20 matrices were initially used to determine the threshold values of the criteria, and 40 additional matrices were used to verify the results. Further, these matrix examples show that a genetic algorithm cannot effectively improve the well-structured CF solutions (those with high grouping efficacy values) that are obtained by hierarchical clustering (one type of heuristic). This result supports the relevance of similarity statistics for preexamining an input CF problem instance and suggesting a proper solution approach.

#### 1. Introduction

This research lies at the crossroads of manufacturing systems and computer science. Based on our disciplinary background, we initially studied the cell formation (CF) problem, which seeks to cluster similar machines and parts to support mass customization [1]. In other words, a CF problem is a two-mode clustering problem [2]. Because the CF problem is NP-hard [3], many algorithms, including exact, metaheuristic, and heuristic approaches, have been proposed (to be discussed in Section 2.2.3). In our study of hierarchical clustering (abbreviated as HC and classified as a greedy heuristic approach), we found that, although HC is not the most powerful method for finding near-optimal solutions, it can yield satisfactory results comparable to those of some powerful metaheuristic approaches (e.g., genetic algorithms) for “well-structured” solutions. In this context, this research investigates conditions, based on the statistics of similarity values, for estimating the potential structuredness of a given CF problem without solving it.

In the domain of computer science, the notion of structuredness roughly corresponds to the concept of clusterability [4]. Intuitively, clusterability can be interpreted as a measure of the “intrinsic structure” of a dataset to be clustered [5]. Computer scientists have observed that a dataset of good clusterability can be clustered quite effectively (i.e., with less impact from the NP-hard nature of the clustering problem). This observation has been summarized in the statement that “clustering is difficult only when it does not matter” (abbreviated as the CDNM thesis) [4, 6].

Notably, the measure of clusterability remains an open topic in computer science. Ackerman and Ben-David [4] have surveyed different definitions of clusterability and shown their incompatibility in pairwise comparisons. Nowakowska et al. [5] argued that a clusterability measure should be partition-independent so that it does not depend on the clustering algorithms and the resulting solutions. Ackerman et al. [7] proposed the use of the statistical distributions of pairwise distances between any two objects to evaluate clusterability.

Returning to the CF problem, and in line with the CDNM thesis, we also observed that a heuristic approach (HC in our case) can yield satisfactory results. To utilize this observation in practice, this research develops criteria that assess the potential structuredness (corresponding to clusterability in computer science) of a given CF problem and suggest using either HC or a genetic algorithm (GA) to solve it. To verify the development, we have applied numerical examples to examine the results of the structuredness criteria and the quality of CF solutions obtained via HC and GA.

Though developed independently, we want to acknowledge that our approach of evaluating the structuredness criteria is similar to the statistical approach by Ackerman et al. [7]. The difference lies in our application’s focus on the CF problem, while Ackerman et al. [7] have focused on the relatively high-level development for clustering tasks. This difference explains our use of similarity measures (instead of distances) in statistical analysis since they are common for the CF problem and allow for some normalization in setting the structuredness criteria. Further, our work numerically checks the relations between structuredness criteria and the solution quality by two different clustering approaches (i.e., HC and GA).

Notably, this paper extends our conference paper [8] with improved techniques (e.g., the threshold setting and the normalization approach). Also, additional numerical examples have been used in the evaluation.

The rest of this paper is organized as follows. Section 2 will overview the CF problem and discuss the three properties of a well-structured CF solution in order to clarify the logical relation of similarity statistics. Section 3 will introduce the histogram analysis of similarity values and develop the U-shape criteria. Section 4 will introduce the Kolmogorov-Smirnov (K-S) test, which is used to develop another criterion to inform the matrix’s structuredness. Section 5 will discuss the procedure that applies the developed criteria to classify well-structured and ill-structured matrices. Section 6 will examine the structuredness criteria via numerical examples, which are also used to check the effectiveness of metaheuristics via a two-stage solution process. Section 7 will conclude this paper.

#### 2. Background: Cell Formation Problem

##### 2.1. Problem Introduction

In the design of a cellular manufacturing system, one early and important decision is the formation of machine groups and part families, which is often referred to as the cell formation (CF) problem. A simple CF problem can be compactly captured by a machine-part incidence matrix. Let $M = \{m_i\}$ (for $i = 1$ to $m$) be the set of machines and $P = \{p_j\}$ (for $j = 1$ to $n$) be the set of parts. Then, an incidence matrix, denoted as $B = [b_{ij}]$, indicates whether machine $m_i$ is required to produce part $p_j$ (if so, $b_{ij} = 1$; otherwise, $b_{ij} = 0$). After solving the CF problem, the matrix’s rows and columns can be reordered to reveal which subset of machines (i.e., a machine group) is highly related to which subset of parts (i.e., a part family).
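As a minimal sketch of the reordering step described above, the following Python snippet permutes the rows and columns of a small incidence matrix so that the block-diagonal structure becomes visible. The matrix values, the row and column orders, and the helper name `reorder` are hypothetical and chosen only for illustration; they are not taken from the paper's examples.

```python
# Hypothetical 4-machine x 5-part incidence matrix B (illustrative data only).
# B[i][j] == 1 means machine m_(i+1) is required to produce part p_(j+1).
B = [
    [1, 0, 1, 0, 0],  # m1
    [0, 1, 0, 1, 1],  # m2
    [1, 0, 1, 0, 0],  # m3
    [0, 1, 0, 0, 1],  # m4
]

# Assumed CF solution, expressed as new orders of the 0-based indices:
# machines (m1, m3, m2, m4) and parts (p1, p3, p2, p4, p5).
row_order = [0, 2, 1, 3]
col_order = [0, 2, 1, 3, 4]


def reorder(matrix, rows, cols):
    """Return a copy of `matrix` with rows and columns permuted."""
    return [[matrix[i][j] for j in cols] for i in rows]


B_sorted = reorder(B, row_order, col_order)
for row in B_sorted:
    print(row)
```

In this toy case, the reordered matrix places machines m1 and m3 with parts p1 and p3 in one diagonal block, and machines m2 and m4 with parts p2, p4, and p5 in the other.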

When incidence matrices are used to represent CF solutions (i.e., as block-diagonal matrices), the solutions can be roughly classified into two types: well-structured and ill-structured matrices [2, 9]. As illustrated in Figure 1, a well-structured matrix has few nonzero matrix entries outside the blocks (defined as exceptional elements) and few zero matrix entries inside the blocks (defined as voids). Precisely, exceptional elements are the matrix entries with $b_{ij} = 1$ where $m_i$ and $p_j$ are in different cells, and voids are the matrix entries with $b_{ij} = 0$ where $m_i$ and $p_j$ are in the same cell. The opposite conditions apply to an ill-structured matrix (i.e., a matrix solution with many exceptional elements and voids). A well-structured matrix implies that part families can be produced almost exclusively by their own machine groups, so that changes to a few part families will not adversely affect the production of other parts. This is one desirable feature of cellular manufacturing systems [1].
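The definitions of exceptional elements and voids can be illustrated with a short Python sketch that counts both for a given cell assignment. The matrix, the cell labels, and the function name `count_exceptions_and_voids` are hypothetical and serve only as an example under these assumptions.

```python
def count_exceptions_and_voids(B, machine_cell, part_cell):
    """Count exceptional elements and voids of a CF solution.

    B            : incidence matrix as a list of rows (entries 0 or 1)
    machine_cell : machine_cell[i] is the cell label of machine i
    part_cell    : part_cell[j] is the cell label of part j
    """
    exceptional = 0
    voids = 0
    for i, row in enumerate(B):
        for j, b in enumerate(row):
            same_cell = machine_cell[i] == part_cell[j]
            if b == 1 and not same_cell:
                exceptional += 1  # a 1 outside the diagonal blocks
            elif b == 0 and same_cell:
                voids += 1        # a 0 inside a diagonal block
    return exceptional, voids


# Hypothetical solution matrix with two cells (illustrative data only).
B = [
    [1, 1, 0, 0, 0],
    [1, 1, 0, 0, 1],  # the last entry lies outside this machine's block
    [0, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],  # the fourth entry is a zero inside the block
]
machine_cell = [0, 0, 1, 1]
part_cell = [0, 0, 1, 1, 1]

exceptional, voids = count_exceptions_and_voids(B, machine_cell, part_cell)
print(exceptional, voids)
```

Low counts for both quantities relative to the matrix size correspond to what the text calls a well-structured matrix; high counts correspond to an ill-structured one.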