Abstract
Data is among the most valuable assets of any firm, and it expands at a breakneck pace over time. Extracting meaningful information from large, complex data sources is a major research problem, and clustering is one of the principal extraction methods. Both the basic (Simple) KMean and the Parallel KMean partitional clustering algorithms begin by picking random initial centroids. In this work, the Simple and Parallel KMean clustering methods are investigated using two datasets of 10,000 and 5,000 items, respectively. The study finds that the results of the Simple KMean algorithm change across runs, so the number of iterations differs from one execution to the next. These differences separate the distinctive properties of the Simple KMean algorithm from those of the Parallel KMean algorithm, and identifying them helps improve cluster quality, elapsed time, and iteration counts. Experiments show that the parallel algorithm considerably improves on the Simple KMean technique: the results of the parallel approach are consistent, whereas the Simple KMean results vary from run to run. Both the 10,000- and 5,000-item datasets are divided into ten subdatasets for ten client systems. Clusters are generated in two iterations, where an iteration is the time taken by all client systems to complete one pass (see Section 4). In the first iteration, Client No. 5 has the longest elapsed time (8 ms), whereas the longest elapsed time in the following iteration is 6 ms, for a total elapsed time of 12 ms for the KMean clustering technique. In addition, the parallel algorithm reduces the number of executions and the overall time to complete the task.
1. Introduction
Most commercial organizations generate vast amounts of data during their daily operations. These businesses need an easy way to store and access that data, which calls for a centralized storage concept known as a database: a collection of data organized and compacted so that it is easy to access, retrieve, manage, and change. Because business analysts rely on stored data to make business decisions, important information must be extracted from it using a discovery process known as data mining, also called knowledge discovery [1, 2].
Today, everything is based on data. People encounter vast volumes of data daily and save it for later review or analysis. Because such massive datasets keep growing, extracting and mining valuable information with traditional techniques is becoming increasingly difficult [3]. Data is any set of facts, numbers, or statistics that a computer system can process in a defined order. Companies today collect vast volumes of data in a variety of circumstances, formats, and databases.
Clustering is a phenomenon of unsupervised learning, whereas classification is a process of supervised learning. These two methods are frequently employed when extracting data from large databases. The graphical representation of Supervised and Unsupervised Techniques is shown in Figure 1.
1.1. Classification of Clustering Techniques
The clustering techniques are categorized into four fundamental categories, as illustrated in Figure 2.
A massive amount of data can be difficult to turn into usable information. By using data mining algorithms, researchers can predict and evaluate students’ academic progress based on their academic records and forum involvement.
Although various studies have been conducted around the world to evaluate student academic performance, there is a dearth of suitable studies examining the aspects that can help students improve it. The goal of this study was to evaluate the factors that influence student academic achievement in Pakistan.
Both simple and parallel clustering approaches are implemented and studied in this work to highlight their strongest qualities. The Simple KMean method has shortcomings that the Parallel KMean approach resolves: the parallel technique consistently delivers improved cluster quality, fewer executions, and faster execution times. The outcomes of the Simple KMean, by contrast, vary between executions, so the number of iterations differs from run to run. These differences separate the distinctive properties of the Simple KMean algorithm from those of the Parallel KMean algorithm. Several tests show the Parallel KMean algorithm to be more efficient than the Simple KMean algorithm: it reduces both the number of executions and the time needed to complete the task.
2. Literature Review
2.1. Simple KMean Clustering
The KMeans clustering technique was introduced by J. B. MacQueen in 1967. The most recent research on KMean clustering is described here, along with related published work. The authors of [4] introduced the min-max distance measure: the input dataset is first normalized, initial centroids are then chosen at random within the normalized range (0, 1), and distances are estimated using the min-max similarity measure.
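As an illustration of the normalization and random-centroid step described above, the following is a minimal Python sketch (the function names are illustrative assumptions, not from [4]; the paper's own implementations were written in Java and C++):

```python
import random

def min_max_normalize(values):
    """Scale a list of numbers into the normalized range [0, 1] via min-max scaling."""
    lo, hi = min(values), max(values)
    span = hi - lo
    return [(v - lo) / span for v in values]

def random_initial_centroids(normalized, k, seed=None):
    """Pick k initial centroids at random from the normalized data."""
    rng = random.Random(seed)
    return rng.sample(normalized, k)

# Example: normalize a handful of student scores, then draw 3 starting centroids.
scores = [35, 60, 82, 47, 91, 12, 74]
norm = min_max_normalize(scores)
centroids = random_initial_centroids(norm, 3, seed=1)
```

Because the centroids are sampled at random, different seeds yield different starting points, which is the source of the run-to-run variability discussed later in the paper.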
Reference [5] divides the entire data collection into unit blocks using lower and upper bounds. After this adjustment, the items in the dataset are sorted by distance and separated into k subclusters. The median of each set is computed, initial centroids are derived from these medians, and clusters are built from this initial design [6]. Because this method relies on sorting algorithms, it is more time consuming. The simplest representation of the dataset is obtained by finding the centroid of each unit block.
Simple KMean variants in which the information from each iteration is stored in a data structure are described in [7, 8]; the recorded information is then utilized in the next iteration. A dynamic KMean clustering algorithm was introduced in [9]: in the first phase, subdatasets are created on the server side from the provided dataset, and after adjustment, items are sorted by distance and arranged into k subclusters.
2.2. Simple Parallel KMean Clustering Algorithm
Sanpawat and Alva [2, 10] proposed a parallelized KMean clustering method based on a client-server model. Clustering is employed in many fields, including technology, Earth sciences, engineering, the social and economic sciences, the medical sciences, and the life sciences.
A parallel KMean clustering technique is proposed in [6, 11]. The distance of each data point from the others is calculated; the data items furthest from the origin are segregated from the rest of the dataset and placed in a separate list, and a threshold value is chosen for this new list. For parallel KMean clustering, [12] developed the ParaMeans program, which applies the basic parallelized KMean clustering technique to routine laboratory use. ParaMeans is a client-server application that is simple to use and manage.
2.3. Simple and Parallel KMean
The Simple KMean clustering technique is explained in [13, 14]. The distance between the initial centroids and the data items is determined, and each data item is assigned to its proper cluster. The input dataset is first normalized, initial centroids are chosen at random within the normalized range (0, 1), and the min-max similarity measure [15] is used to calculate the distance. The KMean variant developed by Singh and Bhatia [16] identifies items with the lowest frequency; the centroids are calculated as the average of each section. All clusters received from the clients are compiled on the server, and the arithmetic mean of each cluster is determined according to the clustering method. The approach in [17, 18] is efficient and progressive because it integrates a dynamic load balancing technique with the KMean clustering method; in this strategy, the main system assigns each client system a subdataset of the same size [19, 20].
Both the parallel KMean clustering approach and the basic KMean clustering technique have been thoroughly investigated. Many researchers worked individually on Simple and Parallel KMean techniques, offering the alternative methodologies discussed in Section 2; however, they make no explicit recommendations about which approach to use in which of the domains where these techniques are useful [21–23].
3. Research Methodology
Researchers have created many methods for Simple and Parallel KMean clustering. Some existing strategies concentrate on sorting the dataset to select initial centroids, while others rely on random selection of the initial centroids. There is no clear understanding of which of the Parallel and Simple KMean techniques is best, or which should be used in which situation. In this work, both Parallel and Simple KMean clustering algorithms are implemented and evaluated to determine their qualities and how well they perform in general. The overall research flow is depicted in Figure 3.
3.1. Data Sets
The two datasets of 10,000 and 5,000 integers represent, respectively, the scores of 10,000 students in two different subjects and the attendance of 5,000 employees over two months. This paper addresses the challenge of randomly selecting initial centroids in KMean clustering. Table 1 displays a typical representation of the students' scores in the two subjects, while Table 2 illustrates the employees' attendance.
For these two algorithms, sample input and output are as follows: (i) Input: the number of clusters, k, derived from the students' and employees' scores in two separate subjects and months, and the two datasets of 10,000 students and 5,000 employees. (ii) Output: a set of clusters.
3.2. Method
Simple and parallel approaches are applied to these data items individually. The flowchart of the basic Simple KMean clustering technique is created using standard UML (Unified Modeling Language) notations.
The differences between the Parallel and Simple KMean clustering methods are assessed and analyzed using experimental findings from both techniques. The two algorithms are implemented in Java (with NetBeans as the IDE) and C++ and executed over varied data ranges and run counts.
3.2.1. Simple KMean Clustering Algorithm
The KMean clustering approach randomly chooses k initial centroids. In the second phase, the distances between data items and centroids are calculated using the Euclidean distance function; several distance functions are mentioned in [24, 25].
During relocation, each data item is moved to the cluster whose centroid is nearest. The initial clusters are created in this manner. The arithmetic mean of each cluster is then calculated; that cluster's data points are closest to this mean. Data points are then reassigned to clusters based on the new means, and the process repeats until no data point moves from one cluster to another [26].
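The distance computation and relocation step described above can be sketched as follows (a minimal Python illustration with hypothetical function names; the paper's implementations used Java and C++):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def nearest_cluster(point, centroids):
    """Index of the nearest centroid: the cluster the data item is relocated to."""
    return min(range(len(centroids)), key=lambda i: euclidean(point, centroids[i]))
```

For example, a data item at (1, 1) would be relocated to the cluster whose centroid is (0, 0) rather than one at (10, 10), since its Euclidean distance to the former is smaller.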
(1) Steps in the Simple KMean Clustering Algorithm. The pseudocode for the basic KMean clustering approach [14] is shown below:
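The steps of the basic procedure can be realized in a short, self-contained sketch such as the following (Python is used here for brevity; this is an illustration of the standard algorithm, not the authors' Java/C++ code):

```python
import math
import random

def kmeans(points, k, max_iters=100, seed=None):
    """Basic (Simple) KMean: random centroids, assign, recompute means, repeat."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # step 1: random initial centroids
    for iteration in range(1, max_iters + 1):
        # Step 2: assign each data item to the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        # Step 3: each centroid becomes the arithmetic mean of its cluster.
        new_centroids = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
        # Step 4: stop when no data point moves between clusters.
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters, iteration
```

Note that because step 1 is random, repeated runs with different seeds can converge in different numbers of iterations, which is the behavior the experiments in Section 4 measure.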

(2) Flow Chart of Simple KMean Algorithm. The flowchart of the basic KMean method is created using standard UML (Unified Modeling Language) notations and is depicted in Figure 4.
3.2.2. Parallel KMean’s Clustering Algorithm
When the dataset is sufficiently large, the space and processing requirements of the Simple KMean clustering approach become the most significant hurdles. The Simple (basic) KMean clustering technique is parallelized to address these challenges.
(1) Main Steps of Parallel KMean Clustering Algorithm. The three main steps of the Parallel KMean algorithm are as follows: (i) Partition, (ii) Computation, (iii) Compilation.
In the first phase, subdatasets are created on the server side from the provided dataset. Each client computer connected to the server receives one of these subdatasets, along with the number of clusters, k, and the starting centroids. The client systems compute the clusters and send the results back to the server. The process continues until the clusters no longer change.
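The partition–computation–compilation loop above can be sketched as follows, with threads standing in for the paper's client systems (a Python illustration under my own assumptions: the first k points serve as the deterministic initial centroids, reflecting the non-random start noted in Section 4, and each "client" returns partial per-cluster sums and counts):

```python
import math
from concurrent.futures import ThreadPoolExecutor

def client_step(subdata, centroids):
    """Computation on one client: partial per-cluster sums and counts
    for its subdataset, given the current centroids."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in subdata:
        idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
        counts[idx] += 1
        for d in range(dim):
            sums[idx][d] += p[d]
    return sums, counts

def parallel_kmeans(points, k, n_clients=10, max_iters=100):
    """Server-side loop: partition the data, let clients compute in parallel,
    then compile (aggregate) their partial results into new centroids."""
    # Partition: the server splits the dataset into n_clients subdatasets.
    parts = [p for p in (points[i::n_clients] for i in range(n_clients)) if p]
    centroids = [tuple(points[i]) for i in range(k)]   # deterministic start (assumption)
    for _ in range(max_iters):
        # Computation: each "client" processes its subdataset concurrently.
        with ThreadPoolExecutor(max_workers=len(parts)) as pool:
            results = list(pool.map(lambda sd: client_step(sd, centroids), parts))
        # Compilation: the server aggregates partial results and recomputes means.
        dim = len(points[0])
        totals = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        for sums, cnts in results:
            for i in range(k):
                counts[i] += cnts[i]
                for d in range(dim):
                    totals[i][d] += sums[i][d]
        new_centroids = [
            tuple(t / counts[i] for t in totals[i]) if counts[i] else centroids[i]
            for i in range(k)
        ]
        if new_centroids == centroids:   # clusters no longer change
            break
        centroids = new_centroids
    return centroids
```

Because the aggregated sums and counts are mathematically identical to a global computation, the parallel loop converges to the same clusters as a serial run from the same starting centroids; only the per-iteration work is divided among the clients.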
(2) Flow Chart of Parallel KMean Clustering Algorithm. The steps above are depicted as a flowchart in Figure 5, created using standard UML (Unified Modeling Language) notations.
4. Results and Discussion
The features of the Simple KMean and Parallel KMean techniques are highlighted in this research. Some existing strategies concentrate on sorting the dataset to select initial centroids, while others rely on random selection of the initial centroids.
For the experiments, two datasets of 10,000 and 5,000 integers, representing students and employees, are chosen at random. The performance of the Simple and Parallel clustering methods is tested using these datasets. The experimental results are presented in detail in the following subsections.
4.1. Experimental Results Analysis
Both techniques are tested and compared on the datasets of 10,000 and 5,000 integer data items, and both produced positive experimental results. The comparison of the Simple and Parallel algorithms is presented next.
4.1.1. Comparison of Parallel and Simple KMean Algorithm
A comparison between the Simple and the Parallel KMean methods is performed considering the number of executions, the elapsed time, and the cluster quality.
4.1.2. Number of Iterations
The tables and graphs below illustrate the performance of the Parallel and Simple KMean clustering algorithms for varying numbers of clusters (K).
Table 3 compares the Simple KMean technique with the Parallel KMean algorithm on identical datasets with the number of clusters K = 3. The same dataset (10,000 data points) is used in each run to observe how the number of executions of the Simple KMean algorithm changes over time. Because the starting centroids are not chosen at random, the number of executions of the Parallel KMean method is fixed.
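This run-to-run variability can be reproduced with a small self-contained sketch (not the paper's code) that runs basic K-Means from several random seeds and records the iteration count of each run; with random initialization, the counts typically differ between seeds, whereas a fixed start always yields the same count:

```python
import math
import random

def kmeans_iterations(points, k, seed):
    """Run basic K-Means from a random start and return the iteration count at convergence."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for iteration in range(1, 1000):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centroids[i]))].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            return iteration
        centroids = new
    return iteration

# Same data every run; only the random seed (i.e., the initial centroids) changes.
data = [(float(x), float(y)) for x in range(10) for y in range(10)]
counts = {kmeans_iterations(data, k=3, seed=s) for s in range(5)}
```

The set `counts` collects the distinct iteration counts observed over five seeds; more than one distinct value illustrates exactly the fluctuation the table reports for the Simple KMean method.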
The data in Table 3 is plotted in Figure 6. With the Parallel KMean technique, k = 3 results in a fixed 3 executions, whereas in the Simple KMean method the count fluctuates from run to run.
According to Tables 4 and 5, the number of iterations of the Parallel KMean clustering is lower than that of the Simple KMean clustering, as represented in Figures 6 and 7, respectively.
As shown in Table 5, there are fewer executions of the KMean clustering method with the parallel approach for k = 5, where the execution count is fixed, as given in Figure 8.
Table 6 shows the fixed and lower number of iterations of the Parallel KMean clustering compared with the Simple KMean for k = 6, as depicted in Figure 9.
The number of times the Parallel and Simple KMean algorithms were run for k = 7 is shown in Table 7 and presented in Figure 10.
4.2. Elapsed Time
The following tables and graphs show the elapsed time of the Simple and Parallel KMean clustering methods for varied numbers of clusters (K).
A comparison between the Simple KMean algorithm and the Parallel KMean algorithm for K = 3 can be found in Table 8 and is depicted in Figure 11. Parallel KMean clustering consumes less time per iteration than Simple KMean clustering.
The parallel KMean method takes less time than the Simple KMean method at different runs or executions.
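The paper does not specify how elapsed time was measured; one straightforward way to instrument a run, shown here as a Python sketch under that assumption, is to wrap the clustering loop in a wall-clock timer:

```python
import math
import random
import time

def timed_kmeans_run(points, k, seed):
    """Run basic K-Means once and return its elapsed wall-clock time in milliseconds
    (an illustrative harness, not the paper's measurement code)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    start = time.perf_counter()
    for _ in range(100):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: math.dist(p, centroids[i]))].append(p)
        new = [tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centroids[i]
               for i, cl in enumerate(clusters)]
        if new == centroids:
            break
        centroids = new
    return (time.perf_counter() - start) * 1000.0
```

Timing several runs with different seeds would show the Simple KMean elapsed time varying alongside its iteration count, as the tables in this section report.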
According to Table 9, for k = 4 the Parallel KMean clustering method takes about half the time of the Simple KMean clustering method, as presented in Figure 12.
Table 10 compares the elapsed time of the Parallel and Simple KMean clustering algorithms for k = 5, as depicted in Figure 13.
Table 11 shows the elapsed time of the Parallel and Simple KMean algorithms for k = 6, as given in Figure 14, while Table 12 shows the elapsed time for k = 7.
4.3. Cluster Quality
The next section compares the cluster quality of the Simple KMean and Parallel KMean methods, given in Tables 13 and 14 and represented in Figures 16 and 17, respectively.
Table 13 displays the outcomes of numerous runs or executions of the same data collection of 10,000 data items.
Table 14 shows the corresponding results over numerous runs or executions of the same dataset of 10,000 data items.
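The paper does not state which quality metric underlies these tables; one common measure, used here purely as an assumed example, is the within-cluster sum of squared errors, where a lower value indicates tighter, better-separated clusters:

```python
import math

def within_cluster_sse(clusters, centroids):
    """Within-cluster sum of squared errors (SSE): a common cluster-quality
    measure, assumed here since the paper does not define its exact metric."""
    sse = 0.0
    for cluster, centroid in zip(clusters, centroids):
        for point in cluster:
            sse += math.dist(point, centroid) ** 2
    return sse
```

Comparing the SSE of the clusterings produced by the Simple and Parallel methods over repeated runs would quantify the quality differences the tables report.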
5. Conclusion
The fundamental flaw of the current (Simple KMean) technique is that it produces varying results for the same data. Both simple and parallel clustering approaches were implemented and studied in this work to highlight their strongest qualities. The Simple KMean method has shortcomings that the Parallel KMean approach resolves: the parallel technique consistently delivers improved cluster quality, fewer executions, and faster execution times, whereas the outcomes of the Simple KMean vary between executions, so the number of iterations differs from run to run. These differences separate the distinctive properties of the Simple KMean algorithm from those of the Parallel KMean algorithm. Several experiments show that the Parallel KMean algorithm outperforms the Simple KMean algorithm by a wide margin: its findings are consistent, while the Simple KMean technique assembles different outcomes with each execution, and the parallel approach reduces both the overall iterations and the elapsed time [27].
6. Future Work
A KMean clustering technique that works for many types of data should be developed in the future; such a method should, for example, perform better when dealing with categorical data. The process of selecting the number of clusters, k, is still in progress: in the upgraded framework the user should input the number of clusters, and sophisticated procedures might be used to choose k. Although the Parallel KMean approach has only been tested on integer-type data, it could be extended to text-type data, such as English words. Clustering datasets that contain many keywords results in the same keywords being assigned to the same groups or clusters, so a search engine based on the extended KMean clustering technique could be introduced to search for certain terms in a document.
Data Availability
The data used to support the findings of this study are included within the article.
Disclosure
This paper is part of a research project and a Master's thesis in Software Engineering. It is based on the second objective of the project; another paper, based on the first objective of the thesis, has already been submitted to the same journal [26].
Conflicts of Interest
All the authors declare no conflicts of interest.
Authors’ Contributions
Each author has worked equally.
Acknowledgments
The authors would like to thank Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2022R193), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia. This research work was supported by the University of Nangarhar, Jalalabad Afghanistan, and University of Peshawar, Pakistan.