Abstract

Data is the most valuable asset in any firm, and it grows at a breakneck speed as time passes. Extracting meaningful information from a large and complex data source is therefore a major research issue, and clustering is one of the methods used for this purpose. The basic K-Mean and Parallel K-Mean partition clustering algorithms work by picking random starting centroids. Both the basic and the Parallel K-Mean clustering methods are investigated in this work using two different datasets of 10,000 and 5000 data items, respectively. The study finds that the results of the Simple K-Mean clustering algorithm change from run to run, and so does the number of iterations; in some circumstances the outcomes differ on every run. The work identifies the properties that distinguish the Simple K-Mean clustering algorithm from the Parallel K-Mean clustering algorithm, since understanding these differences helps improve cluster quality, elapsed time, and the number of iterations. Experiments are designed to show that the parallel algorithm considerably improves on the Simple K-Mean technique. The findings of the parallel technique are also consistent, whereas the results of the Simple K-Mean algorithm vary from run to run. Both the 10,000 and the 5000 data item datasets are divided into ten subdatasets for ten different client systems. Clusters are generated in two iterations, where one iteration is the time it takes for all client systems to complete one pass (see Section 4). In the first execution, Client No. 5 has the longest elapsed time (8 ms), whereas the longest elapsed time in the following iterations is 6 ms, for a total elapsed time of 12 ms for the K-Mean clustering technique. In addition, the parallel algorithm reduces the number of executions and the time it takes to complete a task.

1. Introduction

Most commercial organizations generate vast amounts of data during their daily operations. These businesses require an easy means to store and access their data, which calls for centralized storage in the form of a database. A database is a collection of data that is organized in such a way that it is easy to access, retrieve, manage, and change. Because business analysts require stored data to make business decisions, important information should be extracted using data mining, which is also known as knowledge discovery [1, 2].

Today, everything is based on data. People come across vast volumes of data daily and save it for later review or analysis. Because such massive datasets keep growing, extracting and mining valuable information with traditional techniques is becoming increasingly difficult [3]. Data is a set of facts, numbers, or statistics that a computer system can process in a specific order. Companies today are collecting vast volumes of data in a variety of circumstances, formats, and databases.

Clustering is a form of unsupervised learning, whereas classification is a form of supervised learning. Both methods are frequently employed when extracting information from large databases. Supervised and unsupervised techniques are illustrated in Figure 1.

1.1. Classification of Clustering Techniques

The clustering techniques are categorized into four fundamental categories, as illustrated in Figure 2.

A massive amount of data can be difficult to turn into usable information. By using data mining algorithms, researchers can predict and evaluate students’ academic progress based on their academic records and forum involvement.

Although many studies around the world have evaluated student academic performance, there is a dearth of acceptable studies that examine the aspects that can help students improve it. The goal of this study was to evaluate the factors that influence student academic achievement in Pakistan.

Both simple and parallel clustering approaches are implemented and studied in this work to highlight their main properties. The Simple K-Mean method has shortcomings that the Parallel K-Mean approach resolves. The results of the Parallel K-Mean technique are the same on every run, with improved cluster quality, fewer executions, and shorter execution times. The outcomes of the Simple K-Mean method, in contrast, vary between executions, and so does the number of iterations; in some circumstances the outcomes differ on every run. These properties distinguish the Simple K-Mean clustering algorithm from the Parallel K-Mean clustering algorithm. In several tests, the Parallel K-Mean algorithm proved more efficient than the Simple K-Mean algorithm, reducing both the number of executions and the time it takes to complete a task.

2. Literature Review

2.1. Simple K-Mean Clustering

J. B. MacQueen introduced the K-means clustering technique in 1967 and was one of its first users. The most recent research on K-Mean clustering, together with some of the related work published since then, is described here. The author of [4] introduced the Min-Max distance measure. The input dataset is first normalized, and then initial centroids are chosen at random within the normalized range (0, 1). The distance is estimated using the min-max similarity measure.
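Rescaling a dataset into the unit range, as mentioned above, is commonly done with min-max normalization. The following is a minimal sketch of that standard rescaling only; it is not taken from [4], the function name `min_max_normalize` is hypothetical, and the min-max similarity measure itself is not reproduced here. Note that this rescaling maps the smallest value to exactly 0 and the largest to exactly 1.

```python
def min_max_normalize(values):
    """Rescale a list of numbers linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Illustrative scores (made-up values, not data from the paper).
scores = [40, 55, 70, 100]
normalized = min_max_normalize(scores)
```

Initial centroids drawn at random from the unit range then live on the same scale as the normalized data.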

Reference [5] divides the entire data collection into unit blocks using the lowest and highest bounds (UB). The items in the dataset are then sorted by distance and separated into k sets (subclusters). The median of each set is computed, the initial centroids are derived from these medians, and the clusters are built from this initial configuration [6]. This method relies on sorting algorithms, which are time consuming. The simplest representation of the dataset is obtained by finding the centroid of each unit block.
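The median-based selection of initial centroids can be sketched as follows. This is an illustration under stated assumptions rather than the method of [5]: distance is taken to be the Euclidean distance from the origin, the k sets are consecutive, nearly equal slices of the sorted data, and the function name `median_initial_centroids` is hypothetical.

```python
import math
from statistics import median


def median_initial_centroids(points, k):
    """Sort points by distance from the origin, split them into k consecutive
    sets, and return the per-coordinate median of each set as a centroid."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    ordered = sorted(points, key=lambda p: math.hypot(*p))
    n = len(ordered)
    centroids = []
    for j in range(k):
        # Integer arithmetic keeps the k slices nearly equal in size and non-empty.
        chunk = ordered[j * n // k:(j + 1) * n // k]
        centroids.append(tuple(median(coord) for coord in zip(*chunk)))
    return centroids


# Illustrative two-dimensional points (made-up values).
sample = [(1, 1), (2, 2), (3, 3), (10, 10), (11, 11), (12, 12)]
initial = median_initial_centroids(sample, 2)
```

Because no random choice is involved, repeated calls on the same data return the same centroids, at the cost of the sorting step.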

In the Simple K-Mean algorithms described in [7, 8], the information from each iteration is stored in a data structure and then used in the next iteration. A dynamic K-Mean clustering algorithm was introduced in [9]. In the first phase, subdatasets are created on the server side from the provided dataset. The items are then sorted by distance and arranged into k sets (subclusters).

2.2. Simple Parallel K-Mean Clustering Algorithm

Sanpawat and Alva [2, 10] proposed a parallelized K-Mean clustering method that uses a client-server approach. Technology, Earth sciences, engineering, social and economic sciences, medical sciences, and life sciences are just a few of the fields that employ clustering.

A parallel K-Mean clustering technique is proposed in [6, 11]. Each data point’s distance from the next is calculated. The data items that are the furthest away from the origin are segregated from the rest of the dataset and placed in a separate list. For this new list, a threshold value is chosen. The authors of [12] developed the ParaMeans program for parallel K-Mean clustering. They adopt the basic parallelized K-Mean clustering technique for regular laboratory use. ParaMeans is a client-server application that is simple to use and manage.

2.3. Simple and Parallel K-Mean

References [13, 14] explain the Simple K-Mean clustering technique. The distance between the initial centroids and the data items is determined, and each data item is assigned to the appropriate cluster. The input dataset is first normalized, and then initial centroids are chosen at random within the normalized range (0, 1). The min-max similarity measure [15] is used to calculate the distance. The K-Mean algorithm developed by Singh and Bhatia [16] identifies items with the lowest frequency. The centroids are calculated as the average of each section. All clusters received from the clients are compiled on the server, and the arithmetic mean of each cluster is then determined according to the clustering method. The approach in [17, 18] is efficient and progressive because it integrates a dynamic load balancing technique with the K-Mean clustering method. In this strategy, the main system assigns each client system a subdataset of the same size [19, 20].

The parallel K-Mean clustering approach and the basic K-Mean clustering technique have both been thoroughly investigated. Many researchers worked separately on Simple and Parallel K-Mean techniques, offering the alternative methodologies discussed in Section 2. However, none of them makes explicit recommendations on how to use the parallel and simple K-Mean approaches in the domains where they are useful [21–23].

3. Research Methodology

Researchers have created many methods for Simple and Parallel K-Mean clustering. Some existing strategies concentrate on sorting the dataset to select the initial centroids, while others rely on random selection of the initial centroids. There is no clear understanding of which of the Parallel and Simple K-Mean techniques is the best approach and which should be used in which situation. In this work, both Parallel and Simple K-Mean clustering algorithms are examined, implemented, and evaluated to determine their properties and how well they perform on these problems in general. The overall research flow is depicted in Figure 3.

3.1. Data Sets

Two datasets of 10,000 and 5000 integers are used. They represent the scores of 10,000 students in two different subjects and the attendance of 5000 employees over two months, respectively. Table 1 displays a typical representation of these students in two distinct subjects using two different methodologies, while Table 2 illustrates the employees’ attendance. The challenge of randomly selecting initial centroids in K-Mean clustering is addressed in this paper.

The input and output of these two algorithms are as follows:
(i) Input: the number of clusters and the two datasets, i.e., the scores of 10,000 students in two separate subjects and the attendance of 5000 employees over two months.
(ii) Output: a set of clusters.

3.2. Method

Simple and parallel approaches are used on these data components individually. The flowchart of the Basic Simple K-Mean clustering technique is created using standard UML (Unified Modeling Language) notations.

The differences between the Parallel and Simple K-Mean clustering methods are assessed and analyzed using experimental findings from both techniques. The two algorithms are implemented in Java, with NetBeans as the IDE, and in C++, and they are executed several times for varied data ranges.

3.2.1. Simple K-Mean Clustering Algorithm

The K-Mean clustering approach first chooses “k” initial centroids at random. In the second phase, the distances between the data items and the centroids are calculated using the Euclidean distance function. References [24, 25] mention a couple of other distance functions.
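The distance step can be illustrated with a short sketch. The helper names `euclidean` and `nearest_centroid` are hypothetical and chosen only for this example; the points are made-up values.

```python
import math


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the point (the first one wins on ties)."""
    return min(range(len(centroids)), key=lambda j: euclidean(point, centroids[j]))


distance = euclidean((0, 0), (3, 4))
label = nearest_centroid((1, 1), [(0, 0), (10, 10)])
```

Each data item is labelled with the index of its nearest centroid, which is the basis for forming the clusters.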

During relocation, each data item is moved to the cluster whose centroid is at the least distance from it. The initial clusters are created in this manner. The arithmetic mean of each cluster is then calculated, and each data point is reassigned to the cluster whose arithmetic mean is closest to it. The process is repeated until no more data points move from one cluster to another [26].

(1) Steps in the Simple K-Mean Clustering Algorithm. The pseudocode for the basic K-Mean clustering approach [14] is shown below:

Input: D = {d1, d2, d3, …, dn} (data points)
k = Number of Required Clusters
Output: A set of k Clusters
Steps:
(1) Randomly select k data points from dataset D as initial centers.
(2) Calculate the distance between each data point di (1 ≤ i ≤ n) and all the k cluster centers Cj (1 ≤ j ≤ k), and assign di to the cluster with the nearest center.
(3) Recalculate each cluster center by taking the Arithmetic Mean of its cluster.
(4) Repeat steps (2) and (3) until there is no change in the centers of the clusters.
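The steps above can be sketched in Python as follows. This is an illustrative sketch, not the Java or C++ implementation used in the experiments; the function name `simple_kmeans`, the `seed` and `max_iter` parameters, and the choice to stop when the assignment no longer changes (which implies the centers no longer change) are assumptions made for the example.

```python
import math
import random


def simple_kmeans(points, k, seed=None, max_iter=100):
    """Basic K-Mean clustering with randomly chosen initial centers.
    Returns (centers, assignment, passes), where passes counts every
    assignment pass, including the final one that detects no change."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # random initial centers
    assignment = None
    for passes in range(1, max_iter + 1):
        # Assign every point to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_assignment == assignment:
            break  # no point moved, so the centers would not change either
        assignment = new_assignment
        # Recompute each center as the arithmetic mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, assignment, passes


# Two well-separated groups of made-up points.
data = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
centers, labels, passes = simple_kmeans(data, k=2, seed=0)
```

Because the initial centers depend on the random seed, the number of passes, and on less well-separated data the final clusters, can differ from one run to the next.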

(2) Flow Chart of Simple K-Mean Algorithm. The flowchart of the basic K-Mean method, drawn using standard UML (Unified Modeling Language) notation, is shown in Figure 4.

3.2.2. Parallel K-Mean’s Clustering Algorithm

When the dataset is sufficiently large, the space and processing performance requirements for the Simple K-Mean clustering approach are the most significant hurdles. The Simple or Basic K-Mean clustering technique is parallelized to solve these challenges.

(1) Main Steps of Parallel K-Mean Clustering Algorithm. The three main steps of the Simple Parallel K-Mean algorithm are as follows:
(i) Compilation
(ii) Partition
(iii) Computation

In the first phase, subdatasets are created on the server side from the provided dataset. Each client computer connected to the server receives one of these subdatasets together with the number of clusters, “k,” and the starting centroids. Each client system computes the clusters for its subdataset and sends the results to the server. The process continues until the clusters no longer change.
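The partition, computation, and compilation phases can be sketched as follows. This is a single-machine illustration under stated assumptions, not the client-server implementation used in the experiments: threads stand in for the client systems, the dataset is split by striding, each client returns per-cluster coordinate sums and counts, and the starting centroids are passed in by the caller. The names `client_step` and `parallel_kmeans` are hypothetical.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def client_step(subdataset, centers):
    """Work of one client: assign its points to the nearest center and
    return per-cluster coordinate sums and counts for the server."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in subdataset:
        j = min(range(k), key=lambda c: math.dist(p, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
    return sums, counts


def parallel_kmeans(points, initial_centers, n_clients=10, max_iter=100):
    """Returns (centers, iterations) for fixed, caller-supplied starting centers."""
    k, dim = len(initial_centers), len(initial_centers[0])
    # Partition: the server splits the dataset into one subdataset per client.
    subdatasets = [points[i::n_clients] for i in range(n_clients)]
    centers = [tuple(c) for c in initial_centers]
    with ThreadPoolExecutor(max_workers=n_clients) as pool:
        for iterations in range(1, max_iter + 1):
            # Computation: every client processes its own subdataset.
            partials = list(pool.map(lambda s: client_step(s, centers), subdatasets))
            # Compilation: the server merges the partial results into new centers.
            new_centers = []
            for j in range(k):
                count = sum(cnt[j] for _, cnt in partials)
                if count == 0:
                    new_centers.append(centers[j])  # empty cluster keeps its center
                else:
                    new_centers.append(tuple(
                        sum(s[j][d] for s, _ in partials) / count for d in range(dim)
                    ))
            if new_centers == centers:
                break  # the clusters no longer change
            centers = new_centers
    return centers, iterations


# Made-up points and fixed starting centers, split across three simulated clients.
data = [(0, 0), (0, 1), (1, 0), (100, 100), (100, 101), (101, 100)]
centers, iterations = parallel_kmeans(data, [(0, 0), (100, 100)], n_clients=3)
```

With fixed starting centers the result and the iteration count are the same on every run; only the elapsed time depends on how the work is spread across the clients.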

(2) Flow Chart of Parallel K-Mean Clustering Algorithm. The above-mentioned steps are depicted as a flow chart in Figure 5, drawn using standard UML (Unified Modeling Language) notation.

4. Results and Discussion

The features of simple K-Mean and Parallel K-Mean techniques are highlighted in this research. Some existing strategies concentrated on sorting the dataset to select initial centroids, while others focused on the random selection of first centroids.

For the experiments, two datasets of 10,000 and 5000 integers, representing students and employees, are chosen at random. The performance of the Simple and Parallel clustering methods is tested using these datasets. The experimental results are presented in detail in the remainder of this section.

4.1. Experimental Results Analysis

Both techniques are tested and compared with each other on datasets of 10,000 and 5000 integer data items. Both produced positive experimental results. The comparison of the Simple and Parallel algorithms is presented next.

4.1.1. Comparison of Parallel and Simple K-Mean Algorithm

A comparison between the Simple and the Parallel K-Mean method is performed by considering the number of executions, elapsed time, and cluster quality.

4.1.2. Number of Iterations

The tables and graphs below illustrate the performance of the Parallel and Simple K-Mean clustering algorithms for varying numbers of clusters (K).

Table 3 compares the Simple K-Mean algorithm with the Parallel K-Mean algorithm for the same dataset and number of clusters (K = 3). The same dataset (10,000 data points) is used in each run to observe how the number of executions of the Simple K-Mean algorithm changes from run to run. Because the starting centroids are not produced at random, the number of executions of the Parallel K-Mean method is fixed.

The data in Table 3 are plotted in Figure 6. With the Parallel K-Mean technique, 3 executions are performed for k = 3, whereas with the Simple K-Mean method the number fluctuates from run to run.

According to Tables 4 and 5, the number of executions of the Parallel K-Mean clustering is lower than that of the Simple K-Mean clustering, as represented in Figures 6 and 7, respectively.

As shown in Table 5, the parallel approach requires fewer executions of the K-Mean clustering method for k = 5, and this number is fixed, as given in Figure 8.

Table 6 shows that, for k = 6, the number of iterations of the Parallel K-Mean clustering method is fixed and lower than that of the Simple K-Mean clustering method, as depicted in Figure 9.

The number of times the Parallel and Simple K-Mean algorithms were run for k = 7 is shown in Table 7 and presented in Figure 10.

4.2. Elapsed Time

The following tables and graphs show the elapsed time of the Simple and Parallel K-Mean clustering methods for varied numbers of clusters (K).

A comparison between the Simple K-Mean algorithm and parallel K-Mean algorithm can be found in Table 8 for K = 3, which is depicted in Figure 11. Parallel K-Mean clustering consumes less time for each iteration than Simple K-Mean clustering.

The parallel K-Mean method takes less time than the Simple K-Mean method at different runs or executions.

According to Table 9, the Parallel K-Mean clustering method takes about half the time of the Simple K-Mean clustering method for k = 4, as presented in Figure 12.

Table 10 compares the elapsed time of the Parallel K-Mean clustering algorithm with that of the Simple K-Mean algorithm for k = 5, as depicted in Figure 13.

Table 11 shows the elapsed time of the Parallel and Simple K-Mean algorithms for k = 6, which is also given in Figure 14, while Table 12 shows the elapsed time for k = 7.

4.3. Cluster Quality

This section compares the cluster quality of the Simple K-Mean and Parallel K-Mean methods, given in Tables 13 and 14 and represented in Figures 16 and 17, respectively.

Table 13 displays the outcomes of numerous runs or executions on the same data collection of 10,000 data items.

Table 14 shows that the same results are obtained over numerous runs or executions on the same 10,000 data items.

5. Conclusion

The fundamental flaw of the current technique is that it produces different results for the same data. Both simple and parallel clustering approaches are implemented and studied in this work to highlight their main properties. The Simple K-Mean method has shortcomings that the Parallel K-Mean approach resolves. The results of the Parallel K-Mean technique are the same on every run, with improved cluster quality, fewer executions, and shorter execution times. The outcomes of the Simple K-Mean method, in contrast, vary between executions, and so does the number of iterations; in some circumstances the outcomes differ on every run. These properties distinguish the Simple K-Mean clustering algorithm from the Parallel K-Mean clustering algorithm. In several tests, the Parallel K-Mean algorithm proved more efficient than the Simple K-Mean algorithm, outperforming it by a wide margin and reducing both the overall number of iterations and the elapsed time [27].

6. Future Work

In the future, a K-Mean clustering technique that works for many types of data should be developed. For example, the method should perform better when dealing with categorical data. The selection of the number of clusters, “k,” is still a work in progress. In the upgraded framework, the user should input the number of clusters, and more sophisticated procedures might be used to choose “k.” Although the Parallel K-Mean approach has only been tested on integer-type data, it might be extended to text-type data, such as English words. Clustering datasets that include many keywords results in the same keywords being assigned to the same groups or clusters. A search engine based on the extended K-Mean clustering technique could then be introduced to search for certain terms in a document.

Data Availability

The data supporting the findings of this study are included within the article.

Disclosure

This paper is part of a research project and a Master’s thesis in Software Engineering. It is based on the second objective of the project; a paper based on the first objective of the thesis has already been submitted to the same journal [26].

Conflicts of Interest

All the authors declare no conflicts of interest.

Authors’ Contributions

All authors contributed equally to this work.

Acknowledgments

The authors would like to thank Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R193), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. This research work was supported by the University of Nangarhar, Jalalabad Afghanistan, and University of Peshawar, Pakistan.