Research Article  Open Access
Yingyu Zhu, Simon Li, "Similarity Statistics for Clusterability Analysis with the Application of Cell Formation Problem", Journal of Probability and Statistics, vol. 2018, Article ID 1348147, 17 pages, 2018. https://doi.org/10.1155/2018/1348147
Similarity Statistics for Clusterability Analysis with the Application of Cell Formation Problem
Abstract
This paper proposes the use of the statistics of similarity values to evaluate the clusterability or structuredness associated with a cell formation (CF) problem. Typically, the structuredness of a CF solution cannot be known until the CF problem is solved. In this context, this paper investigates the similarity statistics of machine pairs to estimate the potential structuredness of a given CF problem without solving it. One key observation is that a well-structured CF solution matrix has a relatively high percentage of high-similarity machine pairs. Then, histograms are used as a statistical tool to study the statistical distributions of similarity values. This study leads to the development of the U-shape criteria and a criterion based on the Kolmogorov-Smirnov test. Accordingly, a procedure is developed to classify whether an input CF problem can potentially lead to a well-structured or ill-structured CF matrix. In the numerical study, 20 matrices were initially used to determine the threshold values of the criteria, and 40 additional matrices were used to verify the results. Further, these matrix examples show that the genetic algorithm cannot effectively improve the well-structured CF solutions (of high grouping efficacy values) that are obtained by hierarchical clustering (as one type of heuristic). This result supports the relevance of similarity statistics for pre-examining an input CF problem instance and suggesting a proper solution approach.
1. Introduction
The research of this paper lies at the crossroads of manufacturing systems and computer science. Based on our disciplinary background, we initially study the cell formation (CF) problem, which seeks to cluster similar machines and parts to support mass customization [1]. In other words, a CF problem is a two-mode clustering problem [2]. Due to the NP-hard nature of the CF problem [3], many algorithms, including exact, metaheuristic, and heuristic approaches, have been proposed (to be discussed in Section 2.2.3). In the study of hierarchical clustering (abbreviated as HC, classified as a greedy-based heuristic approach), although HC is not the most powerful in searching for near-optimal solutions, it can yield satisfactory results comparable to some powerful metaheuristic approaches (e.g., genetic algorithms) for "well-structured" solutions. In this context, this research investigates conditions based on the statistics of similarity values to estimate the potential structuredness of a given CF problem without solving it.
In the domain of computer science, the notion of structuredness roughly corresponds to the clusterability concept [4]. Intuitively, clusterability can be interpreted as a measure of the "intrinsic structure" of a dataset to be clustered [5]. Computer scientists have observed that a dataset of good clusterability can be clustered quite effectively (i.e., with less impact from the NP-hard nature of the clustering problem). This observation has been summarized in the statement that "clustering is difficult only when it does not matter" (abbreviated as the CDNM thesis) [4, 6].
Notably, the measure of clusterability remains an open topic in computer science. Ackerman and Ben-David [4] have surveyed different definitions of clusterability and shown their incompatibility in pairwise comparisons. Nowakowska et al. [5] argued that a clusterability measure should be partition-independent so that it does not depend on the clustering algorithms and the resulting solutions. Ackerman et al. [7] proposed the use of the statistical distributions of pairwise distances between any two objects to evaluate clusterability.
Back to the context of the CF problem, in response to the CDNM thesis, we also observed that a heuristic approach (e.g., HC in our case) can yield satisfactory results. To further utilize this observation in practice, this research develops the criteria that assess the potential structuredness (corresponding to clusterability in computer science) of a given CF problem and suggest either using HC or genetic algorithm (GA) for problem solving. To verify the development, we have applied numerical examples to examine the results of the structuredness criteria and the quality of CF solutions via HC and GA.
Though developed independently, we want to acknowledge that our approach of evaluating the structuredness criteria is similar to the statistical approach by Ackerman et al. [7]. The difference lies in our application's focus on the CF problem, while Ackerman et al. [7] have focused on relatively high-level development for clustering tasks. This difference explains our use of similarity measures (instead of distances) in the statistical analysis, since they are common for the CF problem and allow for some normalization in setting the structuredness criteria. Further, our work numerically checks the relations between the structuredness criteria and the solution quality of two different clustering approaches (i.e., HC and GA).
Notably, this paper was extended from our conference paper [8] with the improvement of the techniques (e.g., the threshold setting and the normalization approach). Also, additional numerical examples have been used in the evaluation.
The rest of this paper is organized as follows. Section 2 will overview the CF problem and discuss the three properties of a well-structured CF solution in order to clarify the logical relation of similarity statistics. Section 3 will introduce the histogram analysis of similarity values and develop the U-shape criteria. Section 4 will introduce the Kolmogorov-Smirnov (KS) test, which is used to develop another criterion to inform the matrix's structuredness. Section 5 will discuss the procedure that applies the developed criteria to classify well-structured and ill-structured matrices. Section 6 will examine the structuredness criteria via numerical examples, which are also used to check the effectiveness of metaheuristics via a two-stage solution process. Section 7 will conclude this paper.
2. Background: Cell Formation Problem
2.1. Problem Introduction
In the design of a cellular manufacturing system, one early and important decision is the formation of machine groups and part families, often referred to as the cell formation (CF) problem. A simple CF problem can be compactly captured by a machine-part incidence matrix. Let M = {m_{i}} (for i = 1 to m) be the set of machines and P = {p_{j}} (for j = 1 to n) be the set of parts. Then, an incidence matrix, denoted as B = [b_{ij}], indicates whether machine m_{i} is required to produce part p_{j} (if so, b_{ij} = 1; otherwise, b_{ij} = 0). After solving the CF problem, the matrix's rows and columns can be reordered to reveal which subset of machines (i.e., a machine group) is highly related to which subset of parts (i.e., a part family).
By using incidence matrices to represent CF solutions (i.e., block-diagonal matrices), the solutions can be roughly classified into two types: well-structured and ill-structured [2, 9]. As illustrated in Figure 1, a well-structured matrix has few nonzero matrix entries outside the blocks (defined as exceptional elements) and few zero matrix entries inside the blocks (defined as voids). Precisely, exceptional elements are the matrix entries of b_{ij} = 1 with m_{i} and p_{j} in different cells, and voids are the matrix entries of b_{ij} = 0 with m_{i} and p_{j} in the same cell. The opposite conditions apply for an ill-structured matrix (i.e., a matrix solution with many exceptional elements and voids). A well-structured matrix implies that part families can be produced quite exclusively by some machine groups, so that changes to a few part families will not adversely impact the production of other parts. This is one desirable feature of cellular manufacturing systems [1].
To quantify the structuredness of a CF matrix solution, we use the traditional grouping efficacy (denoted as μ), which is formulated as follows [10]:

μ = (n_{e} − n_{out}) / (n_{e} + n_{in}),   (1)

where n_{e}, n_{out}, and n_{in} are the total numbers of nonzero matrix entries, exceptional elements, and voids, respectively. In a perfect CF solution where n_{out} = n_{in} = 0, the grouping efficacy is equal to its maximum value, i.e., one. When there are more exceptional elements (n_{out}) and voids (n_{in}), the grouping efficacy value becomes smaller.
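For concreteness, the grouping efficacy in (1) can be computed directly from the three counts. The following Python sketch (with our own function name) illustrates this:

```python
def grouping_efficacy(n_e: int, n_out: int, n_in: int) -> float:
    """Grouping efficacy mu = (n_e - n_out) / (n_e + n_in).

    n_e: total nonzero entries; n_out: exceptional elements; n_in: voids.
    """
    return (n_e - n_out) / (n_e + n_in)

# A perfect block-diagonal solution (n_out = n_in = 0) gives mu = 1.
print(grouping_efficacy(100, 0, 0))    # 1.0
# More exceptional elements and voids push mu below 1.
print(grouping_efficacy(100, 10, 10))
```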
Yet, not all incidence matrices can be converted to a well-structured matrix due to the original complex interdependency of the production requirements among machines and parts. This situation cannot be resolved by advanced optimization techniques, as the root cause stems from the original inputs of the CF problem. However, we cannot practically know whether a given CF problem will lead to a well-structured matrix until we actually solve the problem. In this context, the purpose of this paper is to assess the structuredness of a given CF problem by analyzing the similarity of machines without actually solving it. In the traditional CF notion, two machines can be said to be similar if they are mainly required to produce a subset of common parts. In this work, the Jaccard similarity coefficient is applied [11, 12]. Let s_{xy} be the similarity value between machines m_{x} and m_{y}. The formulation of the Jaccard similarity coefficient is provided below:

s_{xy} = a_{xy} / (a_{xy} + b_{xy} + c_{xy}),   (2)

where a_{xy} is the number of parts that need both machines m_{x} and m_{y}; b_{xy} is the number of parts that need machine m_{x} but not machine m_{y}; c_{xy} is the number of parts that need machine m_{y} but not machine m_{x}. Conceptually, the Jaccard similarity coefficient focuses on the number of common features (i.e., a_{xy}), normalized by the total number of relevant features (i.e., a_{xy}, b_{xy}, and c_{xy}). Notably, similarity is only evaluated for any two machines (i.e., a machine pair).
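The Jaccard similarity coefficient in (2) can be sketched in Python as follows; the implementation details (the function name and the convention of returning 0 when the denominator is zero) are our own:

```python
def jaccard(row_x, row_y):
    """Jaccard similarity s_xy between two machine rows of the incidence matrix."""
    a = sum(1 for bx, by in zip(row_x, row_y) if bx == 1 and by == 1)  # a_xy: common parts
    b = sum(1 for bx, by in zip(row_x, row_y) if bx == 1 and by == 0)  # b_xy: only machine x
    c = sum(1 for bx, by in zip(row_x, row_y) if bx == 0 and by == 1)  # c_xy: only machine y
    denom = a + b + c
    # Convention (our assumption): similarity is 0 when neither machine is used.
    return a / denom if denom else 0.0
```

For example, two machine rows [1, 1, 0, 0] and [1, 0, 1, 0] share one part out of three relevant parts, giving s_{xy} = 1/3.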
After specifying the notion of machine similarity, let us revisit the two examples in Figure 1. Each example has 30 machines, leading to 30×(30−1)/2 = 435 machine pairs. By examining the similarity of any two machines (i.e., machine pairs), we find that the well-structured matrix has a higher number of machine pairs with high-similarity values. In the examples of Figure 1, we can make the following two statements concerning the statistics of the machine similarity values.
(i) Well-structured matrix: 81 (out of 435) machine pairs have similarity values higher than or equal to 0.80.
(ii) Ill-structured matrix: 4 (out of 435) machine pairs have similarity values higher than or equal to 0.50.
In this illustration, it is roughly identified that a well-structured matrix can have quite a different statistical distribution of machine similarity values compared to an ill-structured matrix. This observation leads to an investigation question on the statistical conditions under which a well-structured matrix can be classified. This investigation is the focus of this paper. By knowing such statistical conditions, engineers designing cellular manufacturing systems can initially assess their production requirements via the statistics of machine similarity. If the statistical data show unfavorable results (i.e., the chance of getting a well-structured matrix is low), they can either modify the production requirements (e.g., buy more machines) or consider other manufacturing systems. Such an initial assessment can save the effort of solving the CF problem. Also, this paper will show that a well-structured matrix can be satisfactorily obtained by some less time-consuming heuristics (where complex optimization methods may not bring additional benefits).
2.2. Properties of a Well-Structured CF Solution
To investigate the statistical conditions of the structuredness of a CF solution, this section will discuss three properties of a well-structured matrix: high grouping efficacy, a high percentage of high-similarity machine pairs, and relative ease of obtaining satisfactory CF solutions. Afterward, a research plan will be discussed.
2.2.1. Property I: High Grouping Efficacy
The original formulation of the grouping efficacy (GE) in (1) can be found in Kumar and Chandrasekharan [10]; it was intended to replace a weighted sum function with a simple ratio to assess the goodness of a CF solution (in a block-diagonal form). Since then, the GE measure has become popular in CF research (e.g., [9, 13]). Despite its popularity, some researchers have criticized its "built-in weights" [14], where a lower number of voids (i.e., n_{in}) tends to give a better GE measure (as compared to exceptional elements (i.e., n_{out})). Brusco [15] has commented that the nonlinearity of the GE measure poses a challenge for finding exact solutions to CF problems. As commented by Sarker and Mondal [16] in their survey paper, it is not easy to develop a standard measure that fits all CF problems. Still, it is generally recognized that the GE measure is good at discerning the structuredness of matrix-based CF solutions [2]. Thus, we choose the GE measure in this study.
Based on its definition, a well-structured matrix should have few exceptional elements and voids, leading to a high value of GE. While GE is effective in indicating the structuredness of a CF solution (a high value implies a well-structured matrix), this value cannot be known until the CF problem is solved. Thus, in this research, GE is used as a verification measure to examine how well machine similarity can be related to the structuredness of a CF solution.
2.2.2. Property II: High Percentage of High-Similarity Machine Pairs
Compared to the property of high grouping efficacy, it is less obvious that a well-structured matrix has a high percentage of high-similarity machine pairs. In view of the Jaccard similarity coefficient in (2), two types of factors are used to assess machine similarity. While a_{xy} (i.e., the number of common parts) is taken as a commonality factor, both b_{xy} and c_{xy} (i.e., the numbers of parts processed on one machine but not the other) serve as differentiating factors to normalize the similarity measure. In turn, if the similarity value of two machines is high, a_{xy} cannot be zero and the values of b_{xy} and c_{xy} should be small, implying not only commonality but also exclusiveness of these two machines in processing their common parts. This feature can potentially lead to smaller numbers of voids and exceptional elements, leading to a well-structured matrix.
In the literature, the notion of similarity has been applied for many years to address the CF problem, and the Jaccard similarity coefficient is one of the early applications [11]. Since then, many similarity coefficients have been proposed; comparison studies of similarity coefficients can be found in Sarker [17], Mosier et al. [18], and Yin and Yasuda [19]. Notably, similarity is a context-dependent concept, and it depends on the application and relevant information to assess how similar two objects are. In our investigation, we choose the Jaccard similarity coefficient because its notion of commonality and differentiating factors is straightforward for the simple CF application.
While similarity coefficients have been studied extensively for CF problems, the statistical distribution of the similarity values of a CF problem has not, to our understanding, been investigated thoroughly. Notably, these similarity values can be found without solving the CF problem. Then, if we know the relation between the statistical distribution of similarity values and the GE measure, we can use the statistical distribution of similarity values to assess the potential of yielding a well-structured matrix for a CF problem. This is the major aim of this paper.
2.2.3. Property III: Relative Ease of Obtaining Satisfactory CF Solutions
At this point, we may wonder why it is important to know the potential of yielding a well-structured matrix before solving the CF problem. First of all, it has been recognized that the CF problem is NP-hard [3], so a practical algorithm that guarantees an exact solution for a moderate-size problem is unlikely to exist. As a result, the effort required to solve a CF problem is not trivial. In the literature, many metaheuristic algorithms have been proposed to solve CF problems, such as genetic algorithms [20, 21] and simulated annealing [22, 23]. Related comprehensive reviews can be found in Papaioannou and Wilson [24] and Renzi et al. [25]. While metaheuristic algorithms have the capacity to yield high-quality solutions, they generally require users to have good mathematical skills to understand these algorithms [26] and good experience to make some "implementation decisions" [15, p. 293] (e.g., terminating conditions in genetic algorithms).
In contrast to metaheuristic algorithms, heuristic algorithms are easier to implement, but the quality of their solutions is often criticized ([27, p. 159]; [24]). In a nutshell, a common feature of heuristic algorithms is their greedy or hill-climbing approach, which focuses on the best solution at each stage without backtracking to other solution possibilities. This feature allows them to converge to feasible solutions quickly with the trade-off of checking a smaller solution space (and thus potentially weaker solution quality). Hierarchical clustering (HC), which was one early approach for CF problems [11], is one example of a heuristic algorithm, since HC always groups the object pairs with the highest similarity values progressively without backtracking.
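As an illustrative sketch of this greedy behavior, the following Python code applies a generic agglomerative HC from SciPy to a toy incidence matrix with Jaccard distances. The toy matrix, the average-linkage choice, and the two-group cut are our own assumptions for illustration; the paper does not prescribe these details here.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

# Toy incidence matrix: rows = machines, columns = parts (illustrative only)
B = np.array([[1, 1, 0, 0],
              [1, 1, 0, 0],
              [0, 0, 1, 1],
              [0, 0, 1, 1]], dtype=bool)

# Jaccard distance (1 - Jaccard similarity) between machine rows
d = pdist(B, metric="jaccard")

# Greedy agglomeration: always merge the closest pair, no backtracking
Z = linkage(d, method="average")
groups = fcluster(Z, t=2, criterion="maxclust")
print(groups)  # machines 1-2 and machines 3-4 end up in separate groups
```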
As the third property, it is observed that a well-structured matrix can be obtained relatively easily by a heuristic approach (referring specifically to HC in this paper), where the metaheuristic approach does not necessarily have an advantage in getting higher-quality solutions. Alternately, the advantage of the metaheuristic approach is observed more often in the case of ill-structured matrices. As discussed before, a well-structured matrix exhibits sharp differences between similar and dissimilar machine pairs. This feature supports the "greedy" nature of the heuristic approach, which can easily distinguish high-similarity pairs in the progressive grouping process. In contrast, an ill-structured matrix has more machine pairs with middle-similarity values, so some borderline cases can potentially lead to solutions of lower quality. While this third property may not be obvious, more verifying examples will be reported in Section 6.3 as part of the investigation effort of this paper.
Given this third property of a well-structured matrix, the statistical analysis of similarity values leads to another application, i.e., supporting the choice of the algorithmic approach for solving CF problems. If the statistical analysis shows a high potential for obtaining a well-structured matrix, we can choose a heuristic approach to solve the CF problem. Alternately, if it indicates a high chance of getting an ill-structured matrix, we may consider revising the input incidence matrix (e.g., adding more machines or changing some part requirements). Also, we can prepare to use the metaheuristic approach to seek high-quality solutions. In sum, the statistical analysis can preliminarily probe the structure of a given CF problem in order to determine the next problem-solving step.
2.3. Research Plan
In view of the three properties of a well-structured matrix discussed above, the research and development questions are set as follows.
(i) What are the criteria related to the statistics of similarity values to assess the potential of getting a well-structured matrix?
(ii) How do we decide whether to use a metaheuristic or heuristic approach for solving a CF problem?
To address the first question, this paper will utilize two statistical tools: the histogram and the Kolmogorov-Smirnov (KS) test. Histograms will be used to analyze the distribution of machine similarity values of a given CF problem, and twenty CF solutions will be set to investigate the threshold values for informing the potential structuredness of a matrix. The KS test will be used to assess the normality of the distribution of machine similarity values. That is, if the set of similarity values roughly follows the normal distribution, many machine pairs have near-average similarity values, implying a low proportion of high-similarity values (i.e., an ill-structured matrix).
Based on the investigation using the histogram and the KS test, we will develop a procedure to probe the structure of a given CF matrix and suggest whether to use a metaheuristic or a heuristic for problem solving (i.e., addressing the second question). In this paper, we have implemented the genetic algorithm (GA) and hierarchical clustering (HC) as the metaheuristic and heuristic approaches, respectively, for solving the CF problems. To verify the procedure, forty additional CF matrices will be set. These CF matrices will be solved by HC and then by the genetic algorithm to observe the relation between the matrix's structuredness and the utility of the metaheuristic approach for better CF solutions.
3. Histogram Analysis of Similarity Values
3.1. Histogram and the U-Shape
In this study, histograms are used to report the frequency distribution of machine similarity values with an increment of 0.1. Figure 2 shows two histograms for the wellstructured and illstructured matrices of Figure 1, respectively. In these histograms, the horizontal axis stands for the machine similarity values ranging from 0 to 1, and the vertical axis stands for the number of machine pairs within those ranges of similarity values. Notably, these histograms are independent of the orders of a matrix’s rows and columns. That is, we can get these histograms of similarity values without solving the CF problem.
From these two histograms, it is observed that a well-structured matrix tends to yield a U-shape histogram, i.e., relatively high numbers of extreme similarity values. The right peak of the U-shape can be explained by the property of a high percentage of high-similarity machine pairs discussed in Section 2.2.2. While the numbers of low-similarity machine pairs are high in both the well-structured and ill-structured cases, a well-structured matrix has a low number of machine pairs with similarity values between 0.2 and 0.4. In contrast, an ill-structured matrix has a good number of those middle-similarity machine pairs, which pose a challenge for clear grouping in cell formation. Given this general U-shape observation, the next subsections will discuss the criteria that classify the structuredness of a matrix (i.e., well-structured or ill-structured) based on the histogram data.
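As a sketch of how such a histogram can be produced without solving the CF problem, the following Python function (our own naming) bins the pairwise Jaccard similarities of a machine-part incidence matrix into ten intervals of width 0.1:

```python
import numpy as np
from itertools import combinations

def similarity_histogram(B):
    """Bin the pairwise Jaccard similarities of machine rows into 10 bins of width 0.1."""
    B = np.asarray(B)
    sims = []
    for x, y in combinations(range(B.shape[0]), 2):
        common = np.sum((B[x] == 1) & (B[y] == 1))   # a_xy
        union = np.sum((B[x] == 1) | (B[y] == 1))    # a_xy + b_xy + c_xy
        sims.append(common / union if union else 0.0)
    counts, _ = np.histogram(sims, bins=np.linspace(0.0, 1.0, 11))
    # counts[0] covers [0, 0.1); ...; counts[9] covers [0.9, 1.0]
    return counts
```

Because the bin counts depend only on the set of pairwise similarities, reordering the matrix's rows and columns leaves the histogram unchanged.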
3.2. Setup of 20 Benchmark Matrices
Since the frequency distribution of a histogram is not altered by the orders of a matrix's rows and columns, we can set CF solution matrices with known structuredness and then observe their histograms to develop the structuredness criteria. In this investigation, twenty 30×40 solution matrices (i.e., 30 machines and 40 parts) with three cells (or blocks) are set. These matrices are varied by two factors: block sizes and numbers of exceptional elements and voids. Concerning the block sizes, five cases are set as follows, where each bracket indicates the size of a block as (number of machines × number of parts).
(i) Case A ⟶ even case: (10×13) (10×13) (10×14)
(ii) Case B ⟶ uneven case with a large block: (20×26) (5×7) (5×7)
(iii) Case C ⟶ uneven numbers of machines and parts in two blocks: (20×7) (5×26) (5×7)
(iv) Case D ⟶ uneven numbers of machines and parts in three blocks: (20×5) (5×17) (5×18)
(v) Case E ⟶ uneven case with two large blocks: (14×18) (14×19) (2×3)
Besides, four cases are set below to characterize the structuredness of matrices via control of the numbers of exceptional elements and voids.
(i) Case I (well-structured): few exceptional elements and no voids
(ii) Case II (well-structured): no exceptional elements and few voids
(iii) Case III (well-structured): few exceptional elements and few voids
(iv) Case IV (ill-structured): good numbers of exceptional elements and voids
The resulting 20 matrices are shown in Figure 3. On general inspection, the matrices in Cases I and II have clear boundaries of three cells. The matrices in Case III have more exceptional elements and voids, but their structures are still quite discernible. In contrast, the structure of the matrices in Case IV is messier, with higher numbers of exceptional elements and voids. Based on these matrices, the next subsection will investigate their histograms and develop the U-shape criteria to classify the matrix's structuredness.
3.3. Histogram-Based U-Shape Criteria
To inform the matrix's structuredness, two conditions toward the low- and high-similarity values are set as the U-shape criteria. Let F_{left}(x) be the fraction of similarity values that are lower than or equal to x and F_{right}(y) be the fraction of similarity values that are higher than or equal to y. Then, the general U-shape criteria can be expressed as follows:

F_{left}(x) ≥ a,   (3)
F_{right}(y) ≥ b,   (4)

where a and b are the thresholds of the minimum fractions of low- and high-similarity values, respectively, to characterize the U-shape of a well-structured matrix. The setup of these parametric values (i.e., x, y, a, and b) will be based on the above 20 benchmark matrices.
Figure 4 shows the histograms of the 20 benchmark matrices. As preliminary observations, the frequency distributions of these histograms are perceived as quite different between the well-structured (i.e., Cases I, II, and III) and ill-structured matrices (i.e., Case IV). Yet, some U-shapes are not plainly obvious (e.g., Cases AIII and CII), and the peaks of high-similarity values of the well-structured matrices are not always located at the rightmost region (e.g., Cases CI and DI). The U-shape criteria will then be set based on these observations.
Concerning the region of low-similarity values (i.e., the left side of the U-shape), it is found that both well-structured (i.e., Cases I, II, and III) and ill-structured (i.e., Case IV) matrices have high proportions, because machines, as long as they are not in the same cell, have few common parts to work with in both cases. As a result, the proportions of low-similarity values from a well-structured matrix can become statistically less discernible. Thus, we choose to investigate the extreme value where the similarity values equal zero, i.e., F_{left}(x=0). Table 1 records the number of machine pairs with similarity values equal to zero. As observed, while the matrices of Cases II and III have low right-side peaks, they have high proportions of such zero-similarity machine pairs. As the U-shape criteria will be used for early screening, we set this criterion rather strictly as follows:

F_{left}(0) ≥ 0.5.   (5)

This criterion requires 50% of machine pairs to have zero-similarity values in order to qualify a well-structured matrix. By checking the benchmark matrices with 30 machines (i.e., 435 machine pairs), the threshold is 218 machine pairs, and the matrices in Case II pass this criterion.

Concerning the region of high-similarity values (i.e., the right side of the U-shape), as discussed earlier, not all well-structured matrices have high proportions of high-similarity values at the rightmost region. By inspecting the histograms in Figure 4, we identify 0.5 as a reasonable cutoff of high-similarity values, i.e., F_{right}(y=0.5). Table 2 records the number of machine pairs with similarity values greater than or equal to 0.5. As observed, the proportions of high-similarity values (s_{xy} ≥ 0.5) in Case IV (i.e., ill-structured matrices) are relatively low. In contrast, Case CII is the well-structured matrix with the lowest number of high-similarity values (i.e., 91), and the corresponding fraction is 91/435 ≈ 0.21. As a result, the other U-shape criterion for the right-hand side is set as follows:

F_{right}(0.5) ≥ 0.2.   (6)

In sum, if an input incidence matrix satisfies one of the two U-shape criteria formulated in (5) and (6), this matrix has a good chance of yielding a well-structured CF solution. Notably, we treat the histogram-based U-shape criteria as a preliminary filter in this work. That is, if a matrix does not satisfy these criteria, it does not immediately imply that this matrix is ill-structured. In fact, other parameters of an input incidence matrix, such as the number of machines and the density of nonzero matrix entries, can impact the frequency distribution of a histogram. Thus, the next section will develop another criterion based on the KS test.
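A minimal sketch of this screening step in Python, assuming the thresholds derived above (50% zero-similarity pairs, or roughly 20% of pairs with similarity at least 0.5), could look as follows; the function name and interface are illustrative:

```python
def u_shape_wellstructured(sims, left_thresh=0.5, right_thresh=0.2):
    """Preliminary U-shape screening on a list of pairwise similarity values.

    Returns True if either U-shape criterion holds. Thresholds follow the
    paper's benchmark-derived values (our reading); treat this as an early
    filter, not a definitive classification.
    """
    n = len(sims)
    f_left = sum(1 for s in sims if s == 0) / n     # F_left(x = 0)
    f_right = sum(1 for s in sims if s >= 0.5) / n  # F_right(y = 0.5)
    return f_left >= left_thresh or f_right >= right_thresh
```

For the 30-machine benchmarks (435 pairs), the left criterion corresponds to at least 218 zero-similarity pairs, matching the threshold stated above.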

4. Criterion Setting Based on the Kolmogorov-Smirnov (KS) Test
4.1. Background
The Kolmogorov-Smirnov (KS) test is one type of hypothesis test in statistics (Corder and Foreman [28]). As one of its applications, the KS test is used in this paper to evaluate how well a dataset represents a normal distribution (i.e., the normality of the dataset). The use of the KS test in this study is mainly motivated by the observation from the histograms in Figure 2 that a well-structured matrix tends to give a U-shape. As the U-shape generally exhibits two peaks in the histogram representation, the normality of the associated data (i.e., similarity values) will be weak in comparison to that of an ill-structured matrix.
Figure 5 illustrates the concept of the normality of similarity values with two cases: a single-peak histogram and a U-shape histogram. The KS test essentially compares the curves of two cumulative distribution functions (CDFs) [29, 30]. While one CDF represents the empirical data points (i.e., the empirical CDF, solid line), the other CDF is based on the normal distribution curve fitted to the empirical data (i.e., the hypothesized normal CDF, dashed line). As seen in Figures 5(c) and 5(d), the single-peak histogram has higher normality than the U-shape histogram, since the single-peak histogram yields a closer match between the empirical and hypothesized normal CDFs. In contrast, the U-shape histogram yields an empirical CDF in Figure 5(d) with rapid increases at the beginning and the end, along with a relatively flat region in the middle, and this CDF curve deviates significantly from normality [31].
(a) A single-peak histogram
(b) A U-shape histogram
(c) CDFs of the single-peak histogram
(d) CDFs of the U-shape histogram
The P value is a common concept in hypothesis testing [32]. It can be interpreted as the smallest significance level at which a given dataset rejects the null hypothesis (i.e., the smaller the P value, the more likely the null hypothesis is rejected). In this work, we treat the P value of a KS test as a proxy measure of the normality of a set of similarity values. That is, if the P value is smaller, the dataset tends to be less normal [33]. Interpreted in our context, a less-normal condition implies a U-shape and thus a well-structured matrix. For example, the P value of the single-peak histogram in Figure 5(c) is 7.44×10^{−4}, and the P value of the U-shape histogram in Figure 5(d) is 9.27×10^{−22}.
Notably, the purpose of using the KS test in this work is not hypothesis testing per se; we only use its P value as a proxy measure to assess the normality of a set of similarity values and thereby inform the structuredness of a CF matrix. Yet, the P values in our applications tend to be very small. To handle this proxy measure conveniently, let P_{value} be the P value of a set of similarity values based on the KS test, and an alternative proxy measure (denoted as L_{p}) is defined as follows:

L_{p} = −log_{10}(P_{value}).   (7)

As L_{p} is the negative logarithm of the P value, a higher value of L_{p} implies a higher tendency toward a U-shape in the dataset. For example, the values of L_{p} for the single-peak histogram (i.e., Figure 5(c)) and the U-shape histogram (i.e., Figure 5(d)) are 3.13 and 21.03, respectively. In other words, if a CF matrix yields a higher value of L_{p}, it has a better chance of being solved as a well-structured CF solution.
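A sketch of this proxy measure in Python, using SciPy's one-sample KS test with the normal parameters fitted to the data itself (a Lilliefors-style shortcut; the paper does not spell out its exact fitting procedure), could be:

```python
import numpy as np
from scipy import stats

def l_p(sims):
    """Proxy normality measure L_p = -log10(P value) of a one-sample KS test.

    The hypothesized normal distribution is fitted to the data's own mean and
    standard deviation (our assumption about the fitting procedure).
    """
    sims = np.asarray(sims, dtype=float)
    _, p = stats.kstest(sims, "norm", args=(sims.mean(), sims.std()))
    return -np.log10(p)

# A "bipolar" set of similarities (the U-shape extreme) yields a much larger
# L_p than a unimodal, roughly normal set.
bipolar = [0.0] * 50 + [1.0] * 50
unimodal = list(np.random.default_rng(0).normal(0.5, 0.08, 200))
print(l_p(bipolar), l_p(unimodal))
```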
Given this property of L_{p}, the next question is how to set the threshold value of L_{p} to classify ill-structured and well-structured matrices. To do so, it is recognized that the values of L_{p} can be sensitive to the number of machines and the density of nonzero entries of a given matrix. Thus, the next subsection investigates the upper bound of L_{p} of a given matrix to normalize the value of L_{p}. Then, we apply the 20 benchmark matrices in Figure 3 to determine the threshold.
4.2. Estimate the Upper Bound of L_{p} for Normalization
The upper bound of L_{p} can be estimated by a perfect block-diagonal matrix, where the numbers of exceptional elements (n_{out}) and voids (n_{in}) are zero (i.e., the grouping efficacy μ = 1). In this case, the machine pairs have similarity values equal to either one (when two machines belong to the same block) or zero (when two machines are in different blocks). This kind of “bipolar” distribution can be viewed as a far extreme of the normal distribution, and the L_{p} computed from the corresponding P value can be taken as the upper bound of L_{p}.
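This bipolar pattern can be checked directly. The sketch below uses the Jaccard coefficient as an illustrative stand-in for the machine-pair similarity measure (the paper's actual measure is defined by its equation (2)); in a perfect block-diagonal matrix, every machine pair then scores exactly 0 or 1:

```python
import numpy as np
from itertools import combinations

def jaccard(a, b):
    """Jaccard similarity of two binary machine rows (an illustrative
    stand-in for the paper's similarity measure in (2))."""
    both = int(np.sum(a & b))
    either = int(np.sum(a | b))
    return both / either if either else 0.0

# A perfect block-diagonal matrix: two machine groups with disjoint part
# sets, so n_out = n_in = 0 and the grouping efficacy is 1.
M = np.array([[1, 1, 1, 0, 0, 0],
              [1, 1, 1, 0, 0, 0],
              [0, 0, 0, 1, 1, 1],
              [0, 0, 0, 1, 1, 1]])

sims = [jaccard(M[i], M[j]) for i, j in combinations(range(len(M)), 2)]
# sims contains only 0.0 (different blocks) and 1.0 (same block),
# i.e., the "bipolar" distribution described above.
```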
In the normalization process, we first identify the size and the number of nonzero entries of a given matrix. Let m and n be the numbers of machines and parts, respectively (the size of the matrix), and recall that n_{e} denotes the number of nonzero matrix entries. Then, the density of nonzero entries of a matrix (denoted as D_{s}) can be determined as follows:

D_{s} = n_{e}/(m×n). (8)

Given an incidence matrix, its upper bound of L_{p} can be considered for the case where its nonzero entries can be freely moved to form a nearly perfect block-diagonal matrix. By fixing the values of m, n, and D_{s}, there is a corresponding theoretical upper bound of L_{p}. Let L_{bp} denote such an upper bound of L_{p} of a given matrix. Then, for any given matrix, we can determine its L_{p} and L_{bp}, where L_{bp} is treated as a normalizing factor. Since this paper focuses on machine similarity, we drop the consideration of n to simplify the investigation. Then, the next step is to determine the following function:

L_{bp} = f(m, D_{s}). (9)

To estimate the function of L_{bp}, our strategy is to systematically generate a large number of perfect block-diagonal matrices by varying the numbers of machines, parts, and even-size cells (note: the number of even-size cells determines the number of nonzero entries). The ranges of these varying parameters in this work are as follows:
(i) number of machines: from 10 to 50 machines;
(ii) number of parts: from 10 to 110 parts (with an increment of 10);
(iii) number of even-size cells: from 2 to 14 cells (also restricted by the matrix's size to avoid extremely large and small cells).
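The generation step can be sketched as follows (a hypothetical construction, not the paper's exact procedure, which is in Zhu [34]: even-size cells are laid out along the diagonal, and any remainder machines and parts are folded into the last cell):

```python
import numpy as np

def perfect_matrix(m, n, k):
    """Build a perfect block-diagonal incidence matrix with k even-size
    cells from m machines and n parts (remainders go to the last cell)."""
    A = np.zeros((m, n), dtype=int)
    mc, nc = m // k, n // k
    for c in range(k):
        r0, c0 = c * mc, c * nc
        r1 = m if c == k - 1 else r0 + mc       # last cell absorbs remainder rows
        c1 = n if c == k - 1 else c0 + nc       # last cell absorbs remainder cols
        A[r0:r1, c0:c1] = 1
    return A

A = perfect_matrix(10, 20, 2)
n_e = int(A.sum())                         # number of nonzero entries
D_s = n_e / (A.shape[0] * A.shape[1])      # density of nonzero entries, (8)
```

Varying m, n, and k over the listed ranges yields the family of perfect matrices whose L_{p} values are used to fit the upper-bound function.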
Further details of the setup of these perfect matrices can be found in Zhu [34]. As a result, this work has generated 2519 perfect matrices. Then, the P value and L_{p} are determined for each of these matrices, giving 2519 points to approximate the function formulated in (9) via curve-fitting techniques; the fitted result is the regression equation (10). In practice, we can determine the values of L_{p} via (7) and L_{bp} via (10) for a given matrix. Then, we can check the ratio of L_{p} to L_{bp} and examine the U-shapeness, and thus the possible structuredness, of the matrix. The next subsection discusses the criterion based on the ratio of L_{p} to L_{bp}.
4.3. Ratio Criterion Based on L_{p} and L_{bp}
The setting of the ratio threshold for L_{p} and L_{bp} is based on the 20 benchmark matrices in Figure 3. The values of L_{p}, L_{bp}, and their ratios are recorded in Table 3. Recall that Cases I, II, and III are set to represent the well-structured matrices, and Case IV represents the ill-structured matrices. As an initial assessment, the average of the ratios of Cases I, II, and III (i.e., well-structured matrices) is 0.48, while the ratio average of Case IV is 0.07. This observation indicates that the ratio L_{p}/L_{bp} can distinguish well-structured from ill-structured matrices quite effectively from a statistical standpoint.

Yet, when we examine the extreme situations, the lowest ratio among the well-structured cases is 0.17 (i.e., Case D-I, bold in Table 3), and the highest ratio among the ill-structured cases is 0.15 (i.e., Case A-IV, also bold in Table 3). As the gap between the two is small, we intend to impose a tight criterion to classify well-structured matrices. As a result, we set the threshold value at 0.2, formulated as follows:

L_{p}/L_{bp} ≥ 0.2. (11)

At this point, Case D-I is the only well-structured matrix that does not satisfy this criterion. Yet, Case D-I satisfies one of the earlier U-shape criteria. Thus, our next step is to combine the U-shape criteria and the ratio criterion in a procedure to examine the potential structuredness of an incidence matrix. That is, if a given matrix satisfies at least one of these criteria, it has a high potential to yield a well-structured CF solution. The next section discusses this procedure to apply these criteria to inform the potential structuredness of a given matrix.
5. Procedure
This section provides a four-step procedure to assess the potential structuredness of an incidence matrix using the histogram-based U-shape criteria and the criterion based on the P value of the KS test. Figure 6 illustrates the decision branches of this procedure.
Step 1 (construct histogram). Receiving an incidence matrix as an input, the similarity values of machine pairs are first determined based on (2). If there are m machines, there will be m×(m−1)/2 machine pairs with their similarity values, forming the dataset of the statistical analysis. A histogram is then constructed to analyze these similarity values.
Step 2 (apply the histogram-based U-shape criteria). This step represents the preliminary check based on the frequencies of high- and low-similarity values. If either of the criteria F_{left}(0) ≥ 0.5 or F_{right}(0.5) ≥ 0.2 is satisfied, the incidence matrix is considered to have a good potential to yield a well-structured CF solution. If neither criterion is satisfied, we move on to the analysis based on the P value of the KS test.
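This preliminary check can be sketched as follows. Two assumptions of this sketch: F_{left}(0) is read off as the fraction of machine pairs with similarity equal to 0, and F_{right}(0.5) as the fraction with similarity of at least 0.5; the paper's own histogram-bin definitions of these quantities, given with the U-shape criteria, are the authority.

```python
import numpy as np

def u_shape_check(sims):
    """Step 2: preliminary U-shape check on machine-pair similarities."""
    sims = np.asarray(sims, dtype=float)
    f_left = float(np.mean(sims == 0.0))    # stand-in for F_left(0)
    f_right = float(np.mean(sims >= 0.5))   # stand-in for F_right(0.5)
    return f_left >= 0.5 or f_right >= 0.2

# Bipolar similarities from a near-perfect matrix pass the check;
# a cluster of mid-range similarities does not.
print(u_shape_check([0.0, 0.0, 0.0, 0.0, 1.0, 1.0]))    # True
print(u_shape_check([0.3, 0.35, 0.4, 0.45, 0.4, 0.35]))  # False
```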
Step 3 (compute L_{p} and L_{bp}). The dataset of similarity values is treated as the input to determine the P value of the KS test in view of assessing the normality of the dataset. This calculation can be performed via statistics software tools; in this work, we have used the statistics functions of MATLAB to compute the P value. Then, the value of L_{p} can be evaluated using (7). With the incidence matrix, the value of L_{bp} can be evaluated using (10) by identifying the number of machines (i.e., m) and the density of nonzero entries (i.e., D_{s}).
Step 4 (apply the ratio criterion L_{p}/L_{bp}). With the values of L_{p} and L_{bp}, we check whether L_{p}/L_{bp} ≥ 0.2. If this criterion is satisfied, the input matrix should have a good potential to yield a well-structured CF solution. If not, the input matrix would have a good chance to result in an ill-structured CF solution. The practitioners may then consider modifying the input matrix by adding machines or revising the production requirements.
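Steps 3 and 4 can be condensed into a small classifier. This is a sketch only: L_{p} is assumed to come from the KS test as in (7), and L_{bp} is taken as a precomputed input because the coefficients of the fitted regression (10) are not reproduced here.

```python
def classify_structuredness(lp, lbp, threshold=0.2):
    """Step 4: apply the ratio criterion L_p / L_bp >= threshold."""
    ratio = lp / lbp
    label = "well-structured" if ratio >= threshold else "ill-structured"
    return label, ratio

# Illustrative values only (not the paper's data):
label, r = classify_structuredness(lp=21.0, lbp=42.0)   # ratio 0.5
```

In a full implementation, this call would sit behind the U-shape check of Step 2, so the KS-test branch runs only when neither histogram criterion is satisfied.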
6. Application and Verification
To examine the statistical analysis of similarity values for CF problems in this paper, another 40 matrices (in addition to the earlier 20 benchmark matrices, for a total of 60 matrices) are generated and applied in this section. These 60 matrices are used to examine the following two issues:
(i) given the three criteria for assessing the potential structuredness of a matrix, we use these 60 matrices to examine their effectiveness in distinguishing well-structured and ill-structured matrices;
(ii) Property III (i.e., relative ease of obtaining satisfactory CF solutions) of a well-structured matrix, discussed in Section 2.3, is verified via these 60 matrices through two stages of CF problem solving.
6.1. Setup of the 60 Incidence Matrices
The strategy to generate the 60 matrices extends the setup of the 20 benchmark matrices in Section 3.2. The additional varying factors include the following:
(i) in addition to the 30×40 matrix size, another size of 40×100 is set;
(ii) cases with more cells are added (from 3 to 6, 8, and 12 cells);
(iii) the evenness of cell sizes is also varied for each case.
Table 4 shows the setup of the 60 matrices, where Cases A and E are repeated from Section 3.2 for comparison. Notably, the structuredness classification of matrices into Cases I, II, III, and IV from Section 3.2 is also applied, leading to the study of 15×4 = 60 incidence matrices. By the intention of the setup, the matrices of Cases I and II have no voids and no exceptional elements, respectively, and they should be classified as well-structured matrices. The matrices of Case III have only a few exceptional elements and voids, and they should also be classified as well-structured matrices. In contrast, the matrices of Case IV have more exceptional elements and voids, and they should be classified as ill-structured matrices. The images and histograms of these 60 matrices are provided as supplementary materials.

6.2. Examination of the Criteria
To evaluate the effectiveness of the criteria in assessing the structuredness of the matrices, we have evaluated the criteria values for the 60 matrices. The results are provided in Table 5, where the values satisfying the criteria of well-structured matrices are in bold. As observed in these results, the structuredness criteria can discern the well-structured matrices of Cases I, II, and III, where every matrix there satisfies at least one criterion. In contrast, no matrix of Case IV satisfies any criterion of well-structured matrices.

In view of the effectiveness of individual criteria, it is observed that F_{left}(0) is effective in filtering the matrices of Case II (i.e., few voids and no exceptional elements). Due to the absence of exceptional elements in this case, any two machines of different blocks have similarity values equal to zero, which explains the high values of F_{left}(0) observed in Case II. In contrast, F_{right}(0.5) is less effective when the matrices have more cells (e.g., Cases H and I) and larger sizes (e.g., Cases J to O). Notably, the values of F_{right}(0.5) for Case IV are quite low (ranging from 0.00 to 0.09). In this view, the criterion of F_{right}(0.5) is quite tight.
By comparison, the ratio criterion (i.e., L_{p}/L_{bp}) is effective in distinguishing well-structured matrices, where Case D-I is the only case not identified as a well-structured matrix by this criterion alone. Notably, the gap between the well-structured matrices (lowest ratio at 0.17 in Case D-I) and the ill-structured matrices (highest ratio at 0.16 in Case L-IV) is small. This explains the need for F_{left}(0) and F_{right}(0.5), along with the ratio criterion, in the assessment of the structuredness of the matrices.
6.3. Examination of Property III via Optimization
Recall from Section 2.2.3 that Property III states that a well-structured matrix can be solved satisfactorily via a heuristic approach, where more complex metaheuristics may not bring additional benefits. To verify this property, the 60 matrices were tested with a two-stage solution process. First, each matrix is solved by a hierarchical clustering (HC) method, as one heuristic, to yield a CF solution. Then, we examine whether we can further optimize the obtained CF solution via the genetic algorithm (GA), representing a metaheuristic method. In this way, we can check the correlation between grouping efficacy and the percentage of improvement of solution quality by GA. The algorithmic details of the HC method and the implementation details of GA applied in this study can be found in Zhu [34].
Table 6 lists the grouping efficacy (μ) results for the 60 matrices after running hierarchical clustering (HC) and then the genetic algorithm (HC+GA). Also, the percentages of improvement in grouping efficacy by GA are reported for comparison. As observed, the matrix solutions in Cases I and II cannot be further improved by GA, while three matrix solutions in Case III can be improved by GA by small percentages (between 0.20% and 0.25%). In contrast, the ill-structured matrix solutions in Case IV can be improved by GA with improvement percentages between 0.63% and 22.69%. Overall, we consider that the numerical results generally follow Property III, given that the matrices in Case III are close to the boundary between well-structured and ill-structured matrices.

Figure 7 shows the plots of the percentages of solution improvement versus the values of grouping efficacy based on HC+GA. Based on the 60 matrices studied in this paper, GA did not improve the quality of matrix solutions that have 0.60 or higher grouping efficacy. For the data points with grouping efficacy values less than 0.60, we find that these data points are negatively correlated, where the correlation value [32, p. 173] is −0.62. In the statistical interpretation, we can state that a lower value of grouping efficacy tends to leave more room for improvement by GA, but the linearity is not strong. Notably, the capabilities of HC and GA to yield high-quality solutions can depend on other factors (e.g., the density of nonzero entries in a matrix). Thus, it is not easy to observe a linear correlation between the percentage of improvement and the grouping efficacy alone. More control factors and samples would be required for an in-depth investigation.
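The correlation computation used here can be reproduced in form as follows, with made-up illustrative pairs of grouping efficacy and improvement percentage (not the paper's Table 6 data):

```python
import numpy as np

# Hypothetical (grouping efficacy, % improvement by GA) pairs for
# solutions below 0.60 efficacy -- illustrative values only.
efficacy = np.array([0.35, 0.40, 0.45, 0.50, 0.55, 0.58])
improvement = np.array([22.7, 15.1, 10.4, 6.2, 3.0, 0.6])

r = float(np.corrcoef(efficacy, improvement)[0, 1])  # Pearson correlation
# A negative r indicates that lower efficacy leaves more room for
# improvement by GA, mirroring the trend described in the text.
```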
7. Conclusions
This paper has explored the statistics of similarity values to investigate the structuredness of cell formation (CF) matrix solutions. Using grouping efficacy (μ) as one recognized index of the quality of a CF matrix, it is found that a well-structured matrix has a high percentage of high-similarity machine pairs (i.e., Property II). Accordingly, this paper sets up 20 benchmark matrices, with varying structuredness, to develop the U-shape criteria and the criterion based on the Kolmogorov-Smirnov test. Then, a procedure is developed to assess the potential structuredness of a CF matrix without solving the CF problem. The criteria for assessing the structuredness of matrices are examined via an additional 40 matrices, and agreeable results are observed. The genetic algorithm (GA) is used to see if it can improve the CF solutions obtained by hierarchical clustering (as one type of heuristic). The results show that the matrix solutions with high grouping efficacy values (i.e., well-structured matrices) cannot be effectively improved by GA.
While the worst-case computational complexity of clustering problems (e.g., NP-hardness) is well recognized, the CDNM thesis (discussed in Section 1) implies that not all clustering problems in practice are difficult to solve. This research corresponds to the “clustering pipeline” proposed by Ackerman et al. [7], where clusterability (or structuredness in our context) can be evaluated to inform the selection of effective clustering algorithms. In this view, one intended contribution of this work is to implement this idea in the context of the CF problem. In future work, we will explore more applications in manufacturing systems that require grouping and combinatorial decisions (e.g., product and systems modularity). Also, we can explore more statistical and machine learning techniques, such as multimodality tests and random forests, to replace the KS test for better prediction performance.
Data Availability
The matrix data used to support the findings of this study are included within the supplementary information file (pictorial illustrations). Other data formats (e.g., Excel file) can be available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
This work was supported by the NSERC Discovery Grants, Canada.
Supplementary Materials
One file of supplementary materials is included with the manuscript. This file contains the images of the matrix data for the problems presented in Section 6.
References
[1] N. L. Hyer and U. Wemmerlov, Reorganizing the Factory: Competing through Cellular Manufacturing, Productivity Press, Portland, OR, USA, 2002.
[2] M. J. Brusco, “An exact algorithm for maximizing grouping efficacy in part-machine clustering,” IIE Transactions, vol. 47, no. 6, pp. 653–671, 2015.
[3] C. Liu, Y. Yin, K. Yasuda, and J. Lian, “A heuristic algorithm for cell formation problems with consideration of multiple production factors,” International Journal of Advanced Manufacturing Technology, vol. 46, no. 9–12, pp. 1201–1213, 2010.
[4] M. Ackerman and S. Ben-David, “Clusterability: a theoretical study,” in Proceedings of the International Conference on Artificial Intelligence and Statistics, JMLR: Workshop and Conference Proceedings, N. Lawrence, Ed., vol. 5, p. 18, 2009.
[5] E. Nowakowska, J. Koronacki, and S. Lipovetsky, “Clusterability assessment for Gaussian mixture models,” Applied Mathematics and Computation, vol. 256, pp. 591–601, 2015.
[6] A. Daniely, N. Linial, and M. Saks, “Clustering is difficult only when it does not matter,” Computer Science (Machine Learning), 2012.
[7] M. Ackerman, A. Adolfsson, and N. Brownstein, “An effective and efficient approach for clusterability evaluation,” Computer Science (Machine Learning), 2016.
[8] Y. Zhu and S. Li, “Statistical analysis of similarity measures for solving cell formation problems,” Procedia CIRP, vol. 63, pp. 248–253, 2017.
[9] M. M. Paydar and M. Saidi-Mehrabad, “A hybrid genetic-variable neighborhood search algorithm for the cell formation problem based on grouping efficacy,” Computers & Operations Research, vol. 40, no. 4, pp. 980–990, 2013.
[10] C. S. Kumar and M. P. Chandrasekharan, “Grouping efficacy: a quantitative criterion for goodness of block diagonal forms of binary matrices in group technology,” International Journal of Production Research, vol. 28, no. 2, pp. 233–243, 1990.
[11] J. McAuley, “Machine grouping for efficient production,” Production Engineering Research and Development, vol. 51, no. 2, pp. 53–57, 1972.
[12] P. H. A. Sneath and R. R. Sokal, Numerical Taxonomy: The Principles and Practice of Numerical Classification, Freeman, San Francisco, CA, USA, 1973.
[13] T.-H. Wu, C.-C. Chang, and J.-Y. Yeh, “A hybrid heuristic algorithm adopting both Boltzmann function and mutation operator for manufacturing cell formation problems,” International Journal of Production Economics, vol. 120, no. 2, pp. 669–688, 2009.
[14] G. J. K. Nair and T. T. Narendran, “Grouping index: a new quantitative criterion for goodness of block-diagonal forms in group technology,” International Journal of Production Research, vol. 34, no. 10, pp. 2767–2782, 1996.
[15] M. J. Brusco, “An iterated local search heuristic for cell formation,” Computers & Industrial Engineering, vol. 90, pp. 292–304, 2015.
[16] B. R. Sarker and S. Mondal, “Grouping efficiency measures in cellular manufacturing: a survey and critical review,” International Journal of Production Research, vol. 37, no. 2, pp. 285–314, 1999.
[17] B. R. Sarker, “The resemblance coefficients in group technology: a survey and comparative study of relational metrics,” Computers & Industrial Engineering, vol. 30, no. 1, pp. 103–116, 1996.
[18] C. T. Mosier, J. Yelle, and G. Walker, “Survey of similarity coefficient based methods as applied to the group technology configuration problem,” Omega, vol. 25, no. 1, pp. 65–79, 1997.
[19] Y. Yin and K. Yasuda, “Similarity coefficient methods applied to the cell formation problem: a taxonomy and review,” International Journal of Production Economics, vol. 101, no. 2, pp. 329–352, 2006.
[20] C. Zhao and Z. Wu, “A genetic algorithm for manufacturing cell formation with multiple routes and multiple objectives,” International Journal of Production Research, vol. 38, no. 2, pp. 385–395, 2000.
[21] F. M. Defersha and M. Chen, “A linear programming embedded genetic algorithm for an integrated cell formation and lot sizing considering product quality,” European Journal of Operational Research, vol. 187, no. 1, pp. 46–69, 2008.
[22] S. Lee and H. P. Wang, “Manufacturing cell formation: a dual-objective simulated annealing approach,” The International Journal of Advanced Manufacturing Technology, vol. 7, no. 5, pp. 314–320, 1992.
[23] R. Tavakkoli-Moghaddam, N. Safaei, and F. Sassani, “A new solution for a dynamic cell formation problem with alternative routing and machine costs using simulated annealing,” Journal of the Operational Research Society, vol. 59, no. 4, pp. 443–454, 2008.
[24] G. Papaioannou and J. M. Wilson, “The evolution of cell formation problem methodologies based on recent studies (1997–2008): review and directions for future research,” European Journal of Operational Research, vol. 206, no. 3, pp. 509–521, 2010.
[25] C. Renzi, F. Leali, M. Cavazzuti, and A. O. Andrisano, “A review on artificial intelligence applications to the optimal design of dedicated and reconfigurable manufacturing systems,” International Journal of Advanced Manufacturing Technology, vol. 72, no. 1–4, pp. 403–418, 2014.
[26] A. Stawowy, “Evolutionary strategy for manufacturing cell design,” Omega, vol. 34, no. 1, pp. 1–18, 2006.
[27] G. J. Nair and T. T. Narendran, “CASE: a clustering algorithm for cell formation with sequence data,” International Journal of Production Research, vol. 36, no. 1, pp. 157–180, 1998.
[28] G. W. Corder and D. I. Foreman, Nonparametric Statistics: A Step-by-Step Approach, John Wiley & Sons, Hoboken, NJ, USA, 2014.
[29] A. P. Bradley, “ROC curve equivalence using the Kolmogorov-Smirnov test,” Pattern Recognition Letters, vol. 34, no. 5, pp. 470–475, 2013.
[30] T. Arnold and J. Emerson, “Nonparametric goodness-of-fit tests for discrete null distributions,” The R Journal, vol. 3, no. 2, pp. 34–39, 2011.
[31] A. Justel, D. Peña, and R. Zamar, “A multivariate Kolmogorov-Smirnov test of goodness of fit,” Statistics & Probability Letters, vol. 35, no. 3, pp. 251–259, 1997.
[32] D. C. Montgomery and G. C. Runger, Applied Statistics and Probability for Engineers, Wiley, Hoboken, NJ, USA, 5th edition, 2011.
[33] Z. Drezner and O. Turel, “Normalizing variables with too-frequent values using a Kolmogorov–Smirnov test: a practical approach,” Computers & Industrial Engineering, vol. 61, no. 4, pp. 1240–1244, 2011.
[34] Y. J. Zhu, Hierarchical Clustering and Similarity Statistics for Solving and Investigating Cell Formation Problems, Department of Mechanical and Manufacturing Engineering, University of Calgary, 2017.
Copyright
Copyright © 2018 Yingyu Zhu and Simon Li. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.