Handling Big Data Scalability in Biological Domain Using Parallel and Distributed Processing: A Case of Three Biological Semantic Similarity Measures
In the field of biology, researchers need to compare genes or gene products using semantic similarity measures (SSM). Continuous data growth and diversity in data characteristics comprise what is called big data; current biological SSMs cannot handle big data. Therefore, these measures need the ability to control the size of big data. We used parallel and distributed processing by splitting data into multiple partitions and applied SSM measures to each partition; this approach helped manage big data scalability and computational problems. Our solution involves three steps: split gene ontology (GO), data clustering, and semantic similarity calculation. To test this method, split GO and data clustering algorithms were defined and assessed for performance in the first two steps. Three of the best SSMs in biology [Resnik, Shortest Semantic Differentiation Distance (SSDD), and SORA] are enhanced by introducing threaded parallel processing, which is used in the third step. Our results demonstrate that introducing threads in SSMs reduced the time of calculating semantic similarity between gene pairs and improved performance of the three SSMs. Average time was reduced by 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. Total time was reduced by 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA. Using these threaded measures in the distributed system, combined with using split GO and data clustering algorithms to split input data based on their similarity, reduced the average time more than did the approach of equally dividing input data. Time reduction increased with increasing number of splits. Time reduction percentage was 24.1%, 39.2%, and 66.6% for Threaded SSDD; 33.0%, 78.2%, and 93.1% for Threaded SORA in the case of 2, 3, and 4 slaves, respectively; and 92.04% for Threaded Resnik in the case of four slaves.
Massive data is generated daily from multiple sources such as electronic devices or the Internet; network sensors and healthcare and laboratory equipment; and sources of mobile data. Data generated from the Internet comes from social networking sites, governments, or large companies such as Google and Yahoo. In recent years, these data sources have grown continuously; traditional approaches to data management cannot handle this growth. This phenomenon is called “big data.”
Laney defined challenges present in big data management in three dimensions (a.k.a., the 3Vs): volume, variety, and velocity. Volume refers to the increasing size of data. Variety refers to the types of data including text, graphs, images, video, audio, and other types. Velocity means that data are generated continuously as a stream at high speeds and need to be processed as they are generated. Fan et al. added two more Vs to this model: variability and value. Variability means there are changes in data structure and interpretation. Value is the business value that gives a competitive advantage to the organization. Volume and velocity were the focus of previous research; the variety of available data worldwide has received less attention. Abawajy discussed dimensions in the variety of big data, terming them structure diversity, content diversity, source diversity, and processing diversity. Structure diversity includes three types of data: structured data, semistructured data, and unstructured data. Content diversity means data are single-media data, multimedia data, or graph data. Source diversity means data are machine-generated, human-generated, or process-generated. Finally, processing diversity represents the data processing types, namely, batch processing, stream processing, interactive processing, or graph processing.
Genetics is one of the biggest sources of big data. A single sequenced human genome is approximately 140 gigabytes; therefore, storing and comparing human genomes require more than a personal computer and online file-sharing applications. The European Bioinformatics Institute (EBI), one of the most important repositories of big data in biology, stores more than 20 petabytes of data on genes, proteins, and small molecules; one petabyte is 10^15 bytes. Genomic data represents two petabytes of that, and this number doubles every year. Biology labs access approximately one terabyte (10^12 bytes) of big data stored at EBI or the National Center for Biotechnology Information daily and generate more new data. Therefore, small labs are also generators of big data.
Biological data increase not only in size but also in diversity. Biological data are produced via a wide range of procedures; each procedure generates various pieces of information such as those on genetic or protein interactions. These data are analyzed within or across different heterogeneous sources, providing information that cannot be found from analyzing the literature or individual data sources. Therefore, it is important that companies and researchers have the ability to mine and analyze big data to find information, establish patterns, and form hypotheses.
Calculating semantic similarity is essential for comparing genes and gene products. A semantic similarity measure is a function that takes two GO terms or two sets of terms representing the annotations of two entities and returns a numerical value representing the closeness in meaning between them. Standard SSMs such as Palmer’s, cosine similarities [7, 8], and semantic proximity [9, 10] are suitable for some fields of study but are inaccurate for calculating semantic similarity between objects in other fields. In the field of biology, for example, comparing GO annotation terms is not enough; therefore, semantic similarity is measured by comparing features that describe the objects and the hierarchical relationships between these features [11–13]. Consequently, some SSMs are defined specifically for biology to measure the similarity between genes and gene products. A biological SS measure can be used to compute similarity between gene ontology (GO) terms (term similarity), similarity between GO products (where each product is annotated with a set of GO terms), and gene product similarity.
There is no standard approach to determine the best similarity measures for each application; therefore, literature and recent surveys [14–16] compare and test SSMs. Recent reviews indicate that Resnik is the best SS measure in certain settings, followed by SSDD and SORA. To the best of our knowledge, no previous studies have applied these similarity measures to big data. However, using semantic similarity measures to analyze large sets of biomedical data has been addressed in earlier work: one study used parallel computation on a multicore processor, and another stored GO information in a hash table to avoid repeatedly traversing the GO graph, thereby improving computational efficiency. Here, we aimed to enhance the three best SSMs designed for biology (Resnik, SSDD, and SORA), enabling them to handle big data volume using a distributed processing system. Biological SSMs cannot handle big data. Therefore, a distributed processing system can be used to split data into multiple partitions. SS measures are then applied to each partition. This manages big data scalability and avoids computational problems, leading to good performance. Consequently, in this study, we investigated how using a distributed processing system can improve the performance of Resnik, SSDD, and SORA in the field of biology.
The rest of this paper is organized as follows. Section 2 provides background on gene ontology (GO) and semantic similarity measures (SSMs). Section 3 describes in detail the materials and methods for enhancing the best three biological semantic similarity measures. Section 4 discusses and analyzes the results. Section 5 provides the conclusions and future directions.
2.1. Gene Ontology
GO is a valuable resource in bioinformatics. GO provides a structured, precisely defined, and controlled vocabulary to describe genes and gene products according to three categories: biological process (BP), molecular function (MF), and cellular component (CC). Each of these categories is represented by a separate ontology of terms structured as a rooted directed acyclic graph (rDAG) (Figure 1). Each term in GO is associated with annotations describing MF, biological role, and localization. Annotation can be computationally inferred, such as Inferred from Electronic Annotation (IEA), or experimentally determined, which is indicated by an Evidence Code (EC). EC is more reliable than IEA in representing the type of process that generates the annotation.
A semantic similarity measure (SS measure) is a function that takes two GO terms, or two sets of terms representing annotations of two entities, and returns a numerical value representing the closeness in meaning between them. An SS measure can be used to compute similarity between GO terms (term similarity), similarity between GO products (where each product is annotated with a set of GO terms), and gene product similarity. Term similarity and gene product similarity are described below.
2.2.1. Term Similarity
This SS measure was developed by Rada et al., who proposed a metric called distance to measure the distance between two concepts in a graph via the shortest path between these concepts. Distance has some limitations. It considers that all edges in the graph have the same weight. This is not the case in GO, where edges may have different weights even if they are at the same level. Moreover, it takes the shortest path between two nodes regardless of their distance to the root (depth). Previous studies used two methods to solve these issues. The internal method was to calculate the semantic similarity between two concepts based on GO structure. The external method was to calculate semantic similarity based on external corpora.
(i) Internal methods resolved the aforementioned issues by considering the depth of the lowest common ancestor (LCA) between terms, distance to the nearest leaf node, depth of the distinguished GO subgraph, and distance to the LCA between terms with a number of subclasses.
(ii) External methods were developed by Resnik, where the semantic similarity between two terms is calculated based on Information Content (IC) and GO taxonomy structure. IC measures the similarity between two concepts by measuring how much information they share. The IC of a concept is acquired by calculating the probability of the occurrence of the concept in a selected corpus. Uniformly scaling the IC values simplifies interpretation. There are two methods for applying IC to the common ancestor of two concepts: considering the most informative common ancestor (MICA) with the highest IC or considering all the disjoint common ancestors (DCA) [29–31]. Therefore, the similarity between two concepts can be the IC of MICA or the combined IC of MICA and that of the two concepts, which are weighted according to the IC value of MICA.
(iii) Hybrid methods combine both internal and external methods, such as combining the IC-based strategy with the edge, number of descendants, depth and descendants, or entropy.
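As an illustration of the IC-based strategy described above, the following Python sketch computes IC values on a toy ontology and returns the IC of the MICA as the similarity of two terms. The ontology, the annotation counts, and the function names are hypothetical; the only formula assumed is the usual IC(t) = -log p(t), with p(t) estimated from annotation counts of a term and its descendants. It is not the authors' Java implementation.

```python
import math

# Hypothetical toy ontology: each term maps to its set of parents (rooted DAG).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}

# Hypothetical number of direct annotations per term (stand-in for a corpus).
DIRECT_COUNTS = {"root": 0, "A": 2, "B": 1, "C": 4, "D": 3}


def ancestors(term):
    """Return the term itself plus every ancestor reachable through PARENTS."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log p(t); p(t) counts annotations of t and of all its descendants."""
    total = sum(DIRECT_COUNTS.values())
    cumulative = {term: 0 for term in PARENTS}
    for term, count in DIRECT_COUNTS.items():
        for anc in ancestors(term):
            cumulative[anc] += count
    # log(total / c) equals -log(c / total) and yields exactly 0.0 for the root.
    return {t: math.log(total / c) for t, c in cumulative.items() if c > 0}


def resnik(term_a, term_b, ic):
    """Similarity is the IC of the most informative common ancestor (MICA)."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(ic.get(t, 0.0) for t in common)


ic = information_content()
print(resnik("C", "D", ic))  # MICA of C and D is A in this toy graph
```

In this toy graph, terms whose only common ancestor is the root receive a similarity of zero, which matches the intuition that a MICA near the root carries little shared information.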
2.2.2. Gene Product Similarity
A gene product can be annotated by several GO terms. To calculate SS measure for these terms, pairwise or groupwise methods can be used:
(i) Pairwise method calculates individual semantic similarity among all terms annotating two gene products and then calculates the average, maximum, minimum, or sum for all the pairs of terms or only for the best-matched pair of each term. For example, average (AVG) calculates the average of all pairwise similarities; maximum (MAX) calculates the maximum of all pairwise similarities; best match average (BMA) calculates the average of the best-matched pairs; and FunSim combines two semantic similarities by finding AVG, MAX, or BMA values and combining them in a nonlinear approach. IC-based semantic similarity averages the best-matched pairs. FuSSiMeG is similar to MAX, but it weights the IC pairwise similarities of the terms, after which the term with maximum IC weight is selected.
(ii) Groupwise methods calculate semantic similarity using a set, graph, or vector approach:
  (a) The set method applies set-based techniques to all direct annotations. The main disadvantage of this method is that it does not take into account the shared ancestry between GO terms.
  (b) In the graph method, the direct and indirect annotations of gene products are represented as a graph, and set-based or graph-matching methods are used afterward to calculate semantic similarity. This method is better than the set method because it considers all direct and indirect annotations.
  (c) With vector methods, gene products are represented in a vector space, where each term is represented as a dimension; similarity is calculated using the vector similarity measure.
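The pairwise strategies can be sketched as follows. The term names and similarity values are hypothetical, and the BMA variant shown (the mean of the two directional best-match averages) is one common formulation that is assumed here; the cited works may define it slightly differently.

```python
# Hypothetical term-to-term similarities between the annotations of two gene products.
TERM_SIM = {
    ("t1", "t3"): 0.2, ("t1", "t4"): 0.8, ("t1", "t5"): 0.4,
    ("t2", "t3"): 0.6, ("t2", "t4"): 0.1, ("t2", "t5"): 0.3,
}


def pairwise_matrix(terms_a, terms_b):
    """Rows follow terms_a and columns follow terms_b."""
    return [[TERM_SIM[(a, b)] for b in terms_b] for a in terms_a]


def avg(matrix):
    """AVG: mean of all pairwise similarities."""
    values = [v for row in matrix for v in row]
    return sum(values) / len(values)


def maximum(matrix):
    """MAX: largest pairwise similarity."""
    return max(v for row in matrix for v in row)


def bma(matrix):
    """BMA (assumed variant): mean of row-wise and column-wise best-match averages."""
    row_best = [max(row) for row in matrix]
    col_best = [max(col) for col in zip(*matrix)]
    row_avg = sum(row_best) / len(row_best)
    col_avg = sum(col_best) / len(col_best)
    return (row_avg + col_avg) / 2


matrix = pairwise_matrix(["t1", "t2"], ["t3", "t4", "t5"])
print(avg(matrix), maximum(matrix), bma(matrix))
```

AVG is pulled down by every weakly related pair, whereas BMA only looks at the best partner of each term, which is why the two can differ substantially for gene products with many annotations.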
Several previous studies combined the groupwise approach with the IC of terms. One study considered using the IC of terms to perform similarity computations, such as in simGIC, which compares two sets based on an IC-weighted Jaccard similarity. Additionally, IC can be used as a scalar value, such as in InteliGO, which combines the IC value and evidence content of annotations. Moreover, IC can be used to compute the IC value of shared subgraphs.
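A minimal sketch of an IC-weighted Jaccard index in the spirit of simGIC is given below. The IC values and term sets are hypothetical, and the sets are assumed to already include the ancestors of each annotated term.

```python
def sim_gic(terms_a, terms_b, ic):
    """Sum of IC over shared terms divided by sum of IC over all terms of either set."""
    shared = sum(ic[t] for t in terms_a & terms_b)
    union = sum(ic[t] for t in terms_a | terms_b)
    return shared / union if union > 0 else 0.0


# Hypothetical IC values and ancestor-extended annotation sets of two gene products.
ic = {"root": 0.0, "A": 0.5, "B": 1.0, "C": 2.0}
gene_1 = {"root", "A", "C"}
gene_2 = {"root", "A", "B"}
print(sim_gic(gene_1, gene_2, ic))
```

Compared with a plain Jaccard index, shared terms near the root contribute little because their IC is close to zero.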
2.2.3. The Best SS Measure
There is no standard approach to determine the best similarity measure for each application; therefore, literature and recent surveys [14–16] have compared and tested SSMs. These reviews indicate that Resnik is the best SS measure in certain settings, followed by SSDD and SORA. Resnik determines the semantic similarity of a protein based on the IC of the MICA. Additionally, the Best-match-avg function (Resnik) determines the semantic similarity of proteins based on the average of best-matched terms. Shortest semantic differentiation distance (SSDD) measures the semantic similarity of GO terms based on the “totipotency” concept, where each term is assigned a value representing its distance to the root and the number of descendants at each level in that path. The similarity between two terms is the smallest sum of “totipotency” along the path between them. In SORA, the IC value of the term and those of its inherited and extended terms are calculated separately and then combined into one IC value using term-set similarity. The similarity between two genes is the average of the IC values of their term sets.
3. Materials and Methods
In our study, enhancing Resnik, SSDD, and SORA to handle big data volume is based on distributed processing. In distributed processing, SSMs are used with a master-slave architecture, such that one device has unidirectional control over other devices. Our proposed process consists of three steps: the first two steps are the responsibility of the master node, and the third step is the responsibility of the slaves.
(1) Split GO: this initial step divides GO into N splits while keeping the similarity within each split very high, reducing the percentage of descendants shared with other splits, and making the splits as balanced as possible.
(2) Data clustering: this step clusters or splits the input data into N splits based on the N splits generated during the first step. Each resulting cluster is then sent to one of the slaves.
(3) Semantic similarity calculation: in this step, Resnik, SSDD, or SORA is applied to the input data cluster; the results are then sent back to the master node.
There are two methods for using these enhanced SSMs with the distributed system. The first method is to divide the input data equally among the number of slaves. The second method is to divide the input data based on their similarity using split GO and data clustering algorithms. These two methods were applied to compare the average and total time used by enhanced Resnik, SSDD, and SORA. The details of each step are discussed in the following subsections.
3.1. Split GO
We tested several methods for splitting GO into several parts to be used in the distributed system. In this approach, each split is assigned to one of the slave systems. The main goal of our approach was to divide GO into N splits, ensuring a very high similarity within each split, reducing the percentage of shared descendants with other splits, and rendering the split as balanced as possible. Using these methods, the input is GO and the number of splits is N. The master node is responsible for dividing GO into N splits (one split for each slave node). Figure 2 illustrates the division of GO into 4 splits and assignment of each split to one slave.
The proposed methods are as follows: split graph by the main three roots (molecular function, cellular component, and biological process); split graph by roots or subroots (there are a total of 65 subroots, with 25 subroots under root1, 21 subroots under root2, and 19 subroots under root3); or split graph by subroots only. Each of these methods first initializes each split with one of the largest roots/subroots and continues until no roots/subroots remain. To address the balance problem in these methods, we initialized each split with a pair of the most similar subroots and continued adding the most similar subroots to each split until no more similar subroots remained. Then, we selected the smallest split, found the most similar subroot among the remaining subroots, and assigned it to this split. If there were no more similar subroots for this split, we added one of the largest remaining subroots and repeated this process until there were no remaining subroots. This method increases the similarity of subroots within each split and reduces the overlap between the splits, which is our goal. This method can be used as an initial step before partitioning the input data (pairs of genes) on the distributed systems in order to calculate the similarity among gene pairs. A flowchart of the algorithm used in this method is shown in Figure 3.
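The balanced, similarity-driven splitting described above can be approximated by the greedy sketch below. Several details are assumptions made for illustration only: subroots are represented by their descendant sets, similarity is the Jaccard overlap of those sets, two subroots count as "similar" when the overlap is nonzero, and the seeding and growth phases are simplified into two loops. The flowchart in Figure 3 remains the reference for the actual algorithm.

```python
from itertools import combinations


def jaccard(a, b):
    """Overlap of two descendant sets (assumed similarity between subroots)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def split_go(subroots, n_splits):
    """Greedy sketch. subroots maps a subroot name to its set of descendant term IDs."""
    if len(subroots) < 2 * n_splits:
        raise ValueError("need at least two subroots per split to seed the splits")
    remaining = dict(subroots)
    splits = []
    # Seed every split with the most similar pair of subroots still available.
    for _ in range(n_splits):
        pair = max(
            combinations(sorted(remaining), 2),
            key=lambda p: jaccard(remaining[p[0]], remaining[p[1]]),
        )
        splits.append({"subroots": list(pair),
                       "terms": remaining[pair[0]] | remaining[pair[1]]})
        for name in pair:
            del remaining[name]
    # Grow the smallest split with its most similar subroot; if nothing overlaps,
    # fall back to the largest remaining subroot.
    while remaining:
        target = min(splits, key=lambda s: len(s["terms"]))
        best = max(sorted(remaining),
                   key=lambda r: jaccard(remaining[r], target["terms"]))
        if jaccard(remaining[best], target["terms"]) == 0.0:
            best = max(sorted(remaining), key=lambda r: len(remaining[r]))
        target["subroots"].append(best)
        target["terms"] |= remaining.pop(best)
    return splits


# Hypothetical subroots with small integer term IDs.
EXAMPLE_SUBROOTS = {
    "s1": {1, 2, 3}, "s2": {2, 3, 4}, "s3": {10, 11},
    "s4": {11, 12}, "s5": {3, 5}, "s6": {20, 21, 22, 23},
}
for split in split_go(EXAMPLE_SUBROOTS, 2):
    print(split["subroots"], len(split["terms"]))
```

Always growing the smallest split is what keeps the splits balanced, while preferring overlapping subroots keeps shared descendants inside one split instead of duplicating them across slaves.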
3.2. Data Clustering
The data clustering step is used for clustering the input data into N splits. The process starts at the master node. First, we took the N splits generated from the gene ontology splitting algorithm and a text file of gene pairs. Then, we clustered the input file into N clusters, based on the data clustering algorithm, before sending each cluster to one of the slaves. In this algorithm, a gene pair (X, Y) is added to the smallest cluster whose split contains at least one of the pair's LCAs. If the LCA does not belong to any split, then the algorithm adds the pair to the smallest cluster containing both genes X and Y. If the genes X and Y belong to different splits, the algorithm adds the pair to the smallest cluster that contains X or Y. LCA is used to group the neighbors/most similar gene pairs to speed up the similarity calculation at the end. LCA plays the main role in the algorithm because it is used in the similarity calculation employed by many SSMs such as Resnik and SSDD. On the other hand, SORA does not use LCA directly, but it depends on the IC value of the term and those of its inherited and extended terms (neighbors/most similar genes). Also, LCA represents the nearest ancestors to both X and Y. If LCA is near the root, the difference between X and Y is very high; however, if LCA is farther from the root, the difference between X and Y is low. A flowchart of the data clustering algorithm is shown in Figure 4.
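The assignment rules above can be sketched as follows. The splits, the LCA lookup table, and the final fallback (the smallest cluster overall when no rule matches) are hypothetical choices made for this illustration; Figure 4 remains the reference for the actual algorithm.

```python
def cluster_pairs(pairs, splits, lca_of):
    """Assign each gene pair to one cluster per GO split.

    pairs: list of (x, y) tuples; splits: list of sets of GO term IDs;
    lca_of: callable returning the set of lowest common ancestors of a pair.
    """
    clusters = [[] for _ in splits]

    def smallest(candidates):
        # Ties are resolved in favor of the lowest split index.
        return min(candidates, key=lambda i: len(clusters[i]))

    for x, y in pairs:
        lcas = lca_of(x, y)
        by_lca = [i for i, s in enumerate(splits) if lcas & s]
        by_both = [i for i, s in enumerate(splits) if x in s and y in s]
        by_either = [i for i, s in enumerate(splits) if x in s or y in s]
        # Assumed fallback: any cluster, if none of the three rules applies.
        candidates = by_lca or by_both or by_either or list(range(len(splits)))
        clusters[smallest(candidates)].append((x, y))
    return clusters


# Hypothetical splits and LCA table.
SPLITS = [{"A", "C", "D"}, {"B", "E"}]
LCA_TABLE = {("C", "D"): {"A"}, ("E", "B"): {"B"}, ("C", "E"): {"root"}}


def lookup_lca(x, y):
    return LCA_TABLE.get((x, y), set())


print(cluster_pairs([("C", "D"), ("E", "B"), ("C", "E")], SPLITS, lookup_lca))
```

Choosing the smallest eligible cluster at every step is what keeps the per-slave workloads from drifting too far apart.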
3.3. Semantic Similarity Calculation
In this step, each slave applies one of the SSMs (Resnik, SSDD, or SORA) to its assigned data cluster. So, each slave calculates the similarity of each gene pair and then sends the results to the master node. Then, the master node combines the results into one file. To enhance the performance of these SSMs, we suggest using threads, as detailed in the following subsections.
3.4. Enhanced Semantic Similarity Measure in the Distributed System
Our proposed framework is composed of one master and N slaves, which communicate with each other via sockets. Data are shared among the master and slave nodes via Samba file and print services located at the master node. When a slave starts running, it loads GO from the Samba server, opens a socket, and waits for any request from the master node. When all slaves are running and ready, the master node reads the input data from the Samba server, divides them into N splits, sends input data splits to slaves (one split for each slave), and waits for the response. When responses are sent to the Samba server at the master node, the results can be combined into one output file. The master node splits input data by dividing the total input equally into N blocks, where N is the number of slaves. If there are fewer than N remaining lines not assigned to any block, they can be added to the last block. Finally, to each slave, the master node sends the path to the original input file, the number of lines in the block, and the offset of the first line in the block. The master node can also divide the data based on their similarity by using the GO splitting algorithm to divide GO into N splits and then clustering the input data via the data clustering algorithm according to these splits. Then, each data split is placed into a separate file before the file paths are sent to slaves.
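The equal division of input lines can be illustrated with a short sketch. The function name is hypothetical, and the offset is assumed to be a zero-based line index rather than a byte position.

```python
def equal_blocks(total_lines, n_slaves):
    """Return one (offset, line_count) tuple per slave.

    Lines left over after the integer division are added to the last block.
    """
    if n_slaves <= 0:
        raise ValueError("n_slaves must be positive")
    base = total_lines // n_slaves
    remainder = total_lines % n_slaves
    blocks = []
    for i in range(n_slaves):
        count = base + (remainder if i == n_slaves - 1 else 0)
        blocks.append((i * base, count))
    return blocks


print(equal_blocks(10, 4))  # [(0, 2), (2, 2), (4, 2), (6, 4)]
```

Sending only the file path, an offset, and a count keeps the messages to the slaves small, because every slave reads its own block from the shared file.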
We tested the original Resnik, SSDD, and SORA SSM; then, we used parallel processing to enhance the performance of these SSMs. In Resnik, threads are introduced at the points of finding the ancestors of gene pairs X and Y. In SSDD, threads are introduced at the points of finding the T values for each vertex in the path from X/Y to LCA. In SORA, threads are introduced at the points of calculating the IC value of X descendants, Y descendants, and the union of X and Y descendants. The performance of the threaded measures is shown in the results section.
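The threading point described for Resnik, finding the ancestors of X and Y, can be sketched structurally as follows. The toy ontology and function names are hypothetical. The sketch only shows where the two independent lookups are dispatched to threads; in CPython the global interpreter lock limits the speedup of CPU-bound work, so timings from this sketch would not be comparable to those of a Java implementation.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy ontology: each term maps to its set of parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "C": {"A"},
    "D": {"A", "B"},
}


def ancestors(term, parents):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def common_ancestors_threaded(x, y, parents):
    """Look up the ancestors of x and y in two worker threads, then intersect."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        future_x = pool.submit(ancestors, x, parents)
        future_y = pool.submit(ancestors, y, parents)
        return future_x.result() & future_y.result()


print(sorted(common_ancestors_threaded("C", "D", PARENTS)))  # ['A', 'root']
```

The two lookups are independent and only read the ontology, so no locking is needed before the results are intersected.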
3.5. Implementation and Testing
We validated the performance of the GO split algorithm, the data clustering algorithm, and the enhanced SSMs (Resnik, SSDD, and SORA), and applied these enhanced methods in the distributed system. Implementation and testing were conducted using the following settings and equipment:
(i) Equipment:
  (a) Dell PowerEdge T620 server with VMware Workstation Pro 14 software to create a set of five virtual machines (VMs); each machine runs Ubuntu 16.04 LTS with Intel® Xeon® processor E5-2600 product family × 4 processors and 8 GB of memory. One VM works as the master and the rest work as slaves.
(ii) Programming language:
  (a) Java programming language version 1.8.
  (b) Libraries:
    (1) Semantic measure library and toolkit (SML) to read and process the GO.
    (2) JCIFS library to access and manage shared data on a Samba server installed on the master node using Java.
(iii) Input:
  (a) GO as input in Open Biomedical Ontologies (OBO) file format; it is composed of 36638 genes.
  (b) Gene pairs are written in a text file, where genes are generated randomly to create six samples with different sizes. Sizes range from 10 to 1000000 (increased by a factor of 10).
    (1) Due to the diversity in the number of descendants under the main three roots in the original GO and to ensure that the generated samples are distributed equally among the GO genes, we generated 0.08% of the pairs from the descendants under the smallest root, 0.26% of the pairs from the descendants under the medium root, and the rest of the pairs from the descendants under the largest root for each sample. These percentages relate the number of descendants under each root to the total number of descendants under GO.
(iv) SSM:
  (a) The original SSMs (Resnik, SSDD, and SORA).
  (b) Enhanced versions of the SSMs (Threaded Resnik, Threaded SSDD, and Threaded SORA).
(v) Algorithms:
  (a) GO split algorithm to generate N GO splits, where N ranged from 1 to 4, because in our settings we can have 2, 3, or 4 slaves.
  (b) Data clustering algorithm to divide input data into N clusters based on the results of the GO split algorithm.
(vi) Test cases:
  (a) Case 1: testing the performance of the enhanced SSMs. This test is performed on a single virtual machine to measure the following:
    (1) Performance of original SSMs (Resnik, SSDD, and SORA).
    (2) Performance of enhanced SSMs (Threaded Resnik, Threaded SSDD, and Threaded SORA).
    (3) Comparison of the performance of enhanced SSMs with the original SSMs.
  (b) Case 2: testing the performance of enhanced SSMs in the distributed system. This test is conducted three times using one master and two slaves, one master and three slaves, and one master and four slaves. This test is used to measure the following:
    (1) Performance of enhanced SSMs (Threaded Resnik, Threaded SSDD, and Threaded SORA) if the input data are divided equally.
    (2) Performance of enhanced SSMs (Threaded Resnik, Threaded SSDD, and Threaded SORA) if the input data are divided by their similarity using the GO split and data clustering algorithms.
    (3) Comparison of the performance of enhanced SSMs (Threaded Resnik, Threaded SORA, and Threaded SSDD) when the data are divided equally and when the data are divided based on their similarity.
In all these cases, performance is the total and the average time required to calculate the semantic similarity of the gene pairs. In our opinion, and based on earlier experiments, the average time is more important than the total time because average time reflects the time required to measure semantic similarity for the majority of gene pairs. That is not the case with total time, which can be inflated by values far from the average when calculating the semantic similarity of certain genes. In assessment 1, the Improvement Percentage (IP) of average/total time was measured according to (1):

$$\mathrm{IP} = \frac{T_{\mathrm{threaded}} - T_{\mathrm{original}}}{T_{\mathrm{original}}} \times 100\% \qquad (1)$$

where $T_{\mathrm{threaded}}$ is the average/total time in nanoseconds (ns) obtained using the Threaded SSM and $T_{\mathrm{original}}$ is the average/total time required by the original SSM using the same sample and settings. A negative IP value of x indicates that the time obtained using the Threaded SSM was reduced by x% compared with the original SSM. A positive IP value of x indicates that the time obtained using the Threaded SSM was increased by x% compared with the original SSM. The average of the IP is then measured to find the mean value of the IPs. Figure 5 shows a flowchart of this procedure.
In assessment 2, the IP of average/total time is measured according to (2):

$$\mathrm{IP} = \frac{T_{\mathrm{similarity}} - T_{\mathrm{equal}}}{T_{\mathrm{equal}}} \times 100\% \qquad (2)$$

where $T_{\mathrm{similarity}}$ is the average/total time in nanoseconds (ns) obtained using the Threaded SSM with input data divided by their similarity via the GO split and data clustering algorithms, and $T_{\mathrm{equal}}$ is the average/total time obtained using the Threaded SSM with input data divided equally, using the same sample and settings. A negative IP value of x means that the time was reduced by x%; a positive IP value of x means that the time was increased by x%. Also, the average of the IP is measured to obtain the mean value of the IPs. Figure 6 shows a flowchart of this assessment. Detailed results of these assessments are shown in the following sections.
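The sign convention used in both assessments can be made concrete with a short sketch. It assumes that IP is the relative change of the new time against the baseline time, expressed in percent, which is consistent with negative values meaning a reduction; the timings below are hypothetical.

```python
def improvement_percentage(new_time_ns, baseline_time_ns):
    """Relative change in percent; negative means the new time is shorter."""
    return (new_time_ns - baseline_time_ns) / baseline_time_ns * 100.0


def mean_improvement(new_times_ns, baseline_times_ns):
    """Average of the per-sample IP values."""
    ips = [improvement_percentage(n, b)
           for n, b in zip(new_times_ns, baseline_times_ns)]
    return sum(ips) / len(ips)


# Hypothetical average times (ns) for two samples.
baseline = [1000, 2000]   # original SSM, or Threaded SSM with equal division
candidate = [750, 1800]   # Threaded SSM, or Threaded SSM with similarity-based division
print(mean_improvement(candidate, baseline))
```

For assessment 1 the baseline is the original SSM, and for assessment 2 it is the Threaded SSM with equally divided input; the computation itself is the same.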
4. Results and Discussion
4.1. Performance of Enhanced SSMs
(i) Threaded Resnik. Our results show a reduction in the average time required to calculate the Resnik semantic similarity between each pair of genes (Table 1). The average reduction percentage in average time was 24.51% of that obtained using original Resnik. Conversely, the total time required to calculate the semantic similarity measure in Resnik fluctuated; total time decreased in some test samples (such as sample sizes of 10, 100, and 10000) and increased in others (Table 2). The average reduction percentage of the total time was 8.88%.
(ii) Threaded SSDD. Introducing threads in SSDD reduced the average and total time. The average reduction percentage of average time was 22.93%, and the average reduction percentage of total time was 23.14% (Tables 3 and 4).
(iii) Threaded SORA. As in Resnik and SSDD, threads reduced the average and total time of calculating semantic similarity. Table 5 shows that the average reduction percentage of the average time was 33.68%. Also, the average reduction percentage of total time was 39.27%, as shown in Table 6. Unlike Resnik and SSDD, SORA requires more memory to measure similarity. For example, with an input size of 100000, it took 48 hours to find the similarity of 38045 pairs using original SORA and of 38098 pairs using the threaded version. When the input size equals 1 million, it took 61 hours to find the similarity of 139487 pairs using original SORA and of 148182 pairs using the threaded version. In these two cases, the reduction percentage of the total time was approximately 0.14% and 5.87%, respectively.
Introducing threads in Resnik, SSDD, and SORA reduced the time of calculating semantic similarity between gene pairs and improved the performance of these SSMs. The reduction percentage of average time was 24.51% for Resnik, 22.93% for SSDD, and 33.68% for SORA. The reduction percentage of total time was 8.88% for Resnik, 23.14% for SSDD, and 39.27% for SORA.
4.2. Performance of Enhanced SSMs in the Distributed System (Input Data Divided Equally)
(i) Threaded Resnik. In the distributed system, applying Threaded Resnik and dividing input data equally dramatically reduced total time. The average reduction percentage increased as the number of slaves increased. The average total time was reduced by 73.93% in the case of 2 slaves, by 80.65% in the case of 3 slaves, and by 82.29% in the case of 4 slaves (Table 7). Total time was reduced because data were distributed and slaves worked in parallel. However, unlike total time, the average time of calculating the semantic similarity between pairs was not reduced (Table 8). This is because average time fluctuated; it was reduced in some cases and increased in others. Additionally, there was an enormous increase in average time when the number of slaves was increased.
(ii) Threaded SSDD. Similar to the results obtained using Threaded Resnik, using Threaded SSDD with a distributed system and input data divided equally reduced total time (Table 9). The reduction increased as the number of slaves increased. The average total time was reduced by 59.86%, 65.34%, and 68.19% for 2, 3, and 4 slaves, respectively. Conversely, the average time required for calculating similarity via Enhanced SSDD increased markedly as the number of slaves increased (Table 10).
(iii) Threaded SORA. The results for Threaded SORA were similar to those obtained with Threaded Resnik and Threaded SSDD in a distributed system with input data divided equally. The average total time of calculation in Threaded SORA was reduced by 65.01%, 72.31%, and 66.09% for 2, 3, and 4 slaves, respectively (Table 11). As we mentioned previously, SORA needs a lot of memory to complete the semantic similarity measure in the Original SORA and Threaded SORA (for input sample sizes of 100,000 and 1,000,000 genes). In the distributed system, with input data divided equally for the same input samples, some slaves finished early; others continued working for a long time until we stopped the test due to suspension. Conversely, the average calculation time increased notably (Table 12).
Using the Threaded versions of Resnik, SSDD, and SORA and dividing input data equally dramatically reduced total time. In contrast, the average time of calculating similarity increased markedly by increasing the number of slaves.
4.3. Performance of Enhanced SSMs with Input Data Divided by Their Similarity
(i) Threaded Resnik. In the distributed system, with input data divided by their similarity and using our GO split and data clustering algorithms, Threaded Resnik reduced the average time in the case of 4 slaves and a sample of size 10,000; average time was increased, however, in the remaining cases, as shown in Table 13. Conversely, total time was decreased with sample sizes of 10 and 100 when using 2 and 3 slaves, and with a sample size of 100 when using 4 slaves. In the remaining cases, total time was increased gradually by increasing the sample size and the number of slaves, as shown in Table 14.
(ii) Threaded SSDD. The average time obtained with Threaded SSDD in the distributed system, with input data divided by their similarity, was reduced in the case of 2 slaves and sample sizes ranging from 10 to 100,000, but increased with a sample size of 1,000,000. In the case of 3 slaves, average time increased notably at sample size 10, decreased gradually to a lower value, and then increased again at sample size 1,000,000 (Table 15). In the case of 4 slaves, the average time was markedly increased with a sample size of 10. Average time was then reduced gradually until it was lower than the average time obtained with original SSDD, 4 slaves, and a sample size of 100,000. Average time then increased again with a sample size of 1,000,000. The average reduction percentage of total time was 46.10%, 59.26%, and 48.19% with 2, 3, and 4 slaves, respectively (Table 16).
(iii) Threaded SORA. Using Threaded SORA with a distributed system and input data divided by their similarity reduced total time (Table 17). The average reduction percentage of total time was 80.07%, 83.25%, and 84.24% using 2, 3, and 4 slaves, respectively. Table 18 shows a decrease in average time using 2 slaves and sample sizes of 100 to 10,000. Average time was also decreased using 3 and 4 slaves with a sample size of 10,000. However, the average time was considerably increased in the rest of the cases.
Using the Threaded versions of Resnik, SSDD, and SORA with input data divided by their similarity produced different behavior in each case. In Threaded Resnik, the average time unexpectedly increased in all but one case. In most cases, total time increased gradually with increasing sample size and number of slaves; in a few cases, however, total time was reduced. In Threaded SSDD and SORA, the average time decreased in some cases and increased in others. Total time was reduced by 46.10%, 59.26%, and 48.19% using Threaded SSDD and by 80.07%, 83.25%, and 84.24% using Threaded SORA, with 2, 3, and 4 slaves, respectively.
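The reduction percentages quoted above can be read as relative differences against a baseline timing. The following is a minimal sketch of that arithmetic, assuming the usual definition (baseline minus new, divided by baseline); the function names and the timings are hypothetical and are not taken from the tables of this study.

```python
def reduction_percentage(baseline_time: float, new_time: float) -> float:
    """Percentage by which new_time is lower than baseline_time.

    A negative result means the new configuration was slower.
    Assumed definition: (baseline - new) / baseline * 100.
    """
    if baseline_time <= 0:
        raise ValueError("baseline_time must be positive")
    return (baseline_time - new_time) / baseline_time * 100.0


def mean_reduction(baseline_times, new_times) -> float:
    """Average the per-sample-size reduction percentages."""
    percentages = [
        reduction_percentage(b, n) for b, n in zip(baseline_times, new_times)
    ]
    return sum(percentages) / len(percentages)


if __name__ == "__main__":
    # Hypothetical timings (seconds) for three sample sizes.
    original = [10.0, 20.0, 40.0]
    threaded = [5.0, 10.0, 30.0]
    print(f"mean reduction: {mean_reduction(original, threaded):.2f}%")
```

Averaging per-sample percentages, as done here, weights every sample size equally; averaging the raw times first would let the largest sample dominate.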
4.4. Comparing the Performance of Data Divided Equally and Data Divided by Their Similarity
(i) Threaded Resnik. We compared dividing data equally with dividing data by their similarity, using each approach with our distributed system, based on (2). With Threaded Resnik and four slaves, dividing data by their similarity reduced the average time by an average of 92.04% (Table 19) relative to the results obtained by dividing data equally (Table 8). Dividing data based on their similarity reduced the average time of calculating similarity as the number of splits increased. In other words, defining more splits resulted in splits with high internal similarity. This reduced the time required to calculate semantic similarity because each slave calculated similarity for a group of nodes located near each other. With a smaller number of splits, each split still contained many unrelated or dissimilar genes, so the average time was not reduced as it was with more splits. Conversely, total time increased with the number of splits and slaves because of the overhead of dividing the GO and clustering the input data (Table 20). Increasing the number of splits and slaves resulted in the highest total time in the majority of cases. In Resnik, dividing data based on their similarity did not reduce the average time in any of the test cases; this is because Resnik depends on the IC value in the calculation more than on the relationships and distances between genes, on which the other SSMs depend. Resnik therefore needs the node relationships only to find the ancestors of a pair and obtain the IC value of the MICA.
(ii) Threaded SSDD. Using Threaded SSDD with our distributed system and input data divided by their similarity reduced the average time by an average of 24.1%, 39.2%, and 66.6% with 2, 3, and 4 slaves, respectively (Table 21). The reduction grew as the number of slaves and splits increased. With an input sample size of 1,000,000 pairs, average time increased when using 2 and 3 slaves; when 4 slaves were used, however, the average time was reduced by 38.32%. This is because defining more splits produces splits whose genes are more similar, more closely related, and nearer to each other. This positively affects the average time of the SSDD semantic similarity calculation, which depends on the distances and relationships between genes. Total time increased more when input data were divided based on their similarity than when they were divided equally, because dividing data by their similarity requires additional processing to split the GO and cluster the data. As shown in Table 22, total time increased in most of the test cases when using 3 slaves with sample sizes of 10 and 100, and when using 4 slaves with a sample size of 1,000,000.
(iii) Threaded SORA. Using Threaded SORA in the distributed system, with input data divided by their similarity, reduced the total and average time in nearly all of the test cases (Tables 23 and 24). The average reductions in average time were approximately 33.0%, 78.2%, and 93.1% using 2, 3, and 4 slaves, respectively. Two exceptions were observed. In the first, there was a slight increase in average time when using two slaves and a sample size of 1,000. In the second, there was a minor increase in total time when using three slaves and a sample size of 1,000. As mentioned previously, the reduction in average and total time occurs because SORA depends on calculating the distances and relationships between genes. Grouping similar and closely related genes in one cluster therefore reduces both the total and the average time of calculating similarity with the SORA SSM.
We compared the two methods of data allocation in the distributed system in order to measure the performance of Threaded Resnik, Threaded SSDD, and Threaded SORA. Dividing input data based on their similarity, using the data clustering algorithm, gave better performance than dividing data equally. The reduction in average time reflects the performance of the enhanced SSMs more effectively than the reduction in total time does. Average time reflects the time required to measure semantic similarity for the majority of gene pairs, which is not the case with total time: total time can be inflated by a few gene pairs whose calculation time is far from the average value.
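The difference between the two allocation strategies can be illustrated with a toy partitioner. This is only a sketch: the cluster labels are given directly as a hypothetical lookup table (`cluster_of`), standing in for the output of the GO split and data clustering algorithms, which are not reproduced here, and the greedy assignment to the least-loaded split is an assumption made for illustration.

```python
from collections import defaultdict


def split_equally(pairs, n_splits):
    """Deal gene pairs round-robin into n_splits partitions of near-equal size."""
    return [pairs[i::n_splits] for i in range(n_splits)]


def split_by_cluster(pairs, cluster_of, n_splits):
    """Keep pairs from the same cluster together in one partition.

    cluster_of maps a term to a (hypothetical) cluster label; a pair is
    assigned to the cluster of its first term. Whole clusters are placed
    greedily on the currently smallest partition.
    """
    groups = defaultdict(list)
    for pair in pairs:
        groups[cluster_of[pair[0]]].append(pair)

    splits = [[] for _ in range(n_splits)]
    # Place the largest clusters first to keep partition sizes balanced.
    for _, members in sorted(groups.items(), key=lambda kv: -len(kv[1])):
        min(splits, key=len).extend(members)
    return splits


if __name__ == "__main__":
    # Hypothetical term identifiers and cluster labels.
    pairs = [("a1", "a2"), ("a2", "a3"), ("b1", "b2"), ("b2", "b3")]
    cluster_of = {t: t[0].upper() for pair in pairs for t in pair}

    print("equal:     ", split_equally(pairs, 2))
    print("by cluster:", split_by_cluster(pairs, cluster_of, 2))
```

In the equal split each partition mixes terms from both clusters, whereas the cluster-based split sends related terms to the same slave, which is the property the distance-based measures benefit from.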
In Threaded Resnik, the average time was reduced as the number of slaves/splits increased; average time was reduced by 92.04% in the case of 4 slaves. This indicates that defining splits with more similar and related genes produces high similarity within each split and minimal overlap with other splits. Average time was not reduced as much as it was using the other SSMs. This is because, in Threaded Resnik, calculating the semantic similarity of a term depends on the number of genes annotated with it (the IC value) and does not depend on its location in the GO hierarchy; the term location is needed only to obtain the IC value of the MICA. Threaded SSDD and SORA, however, depend on the term location in the GO hierarchy. Therefore, the average time is reduced dramatically, by 24.1%, 39.2%, and 66.6% in Threaded SSDD and by 33.0%, 78.2%, and 93.1% in Threaded SORA, using 2, 3, and 4 slaves, respectively. The reduction increases gradually with the number of slaves/splits.
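To make the dependence of Resnik on IC concrete, the following sketch applies the standard Resnik definition, IC(t) = -log p(t) and sim(t1, t2) = IC of the most informative common ancestor (MICA), to a toy ontology. The ontology, the annotation counts, and all names are hypothetical, and the sketch assumes each gene is annotated with exactly one term so that counts can simply be summed over descendants.

```python
import math

# Hypothetical toy ontology: term -> set of direct parents (is_a edges).
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}

# Hypothetical number of genes annotated directly with each term.
DIRECT_ANNOTATIONS = {"root": 0, "A": 2, "B": 1, "A1": 3, "A2": 2, "B1": 2}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content():
    """IC(t) = -log(p(t)), where p(t) counts annotations of t and its descendants."""
    counts = {term: 0 for term in PARENTS}
    for term, n in DIRECT_ANNOTATIONS.items():
        for anc in ancestors(term):
            counts[anc] += n
    total = counts["root"]
    return {t: -math.log(c / total) for t, c in counts.items() if c > 0}


def resnik(t1, t2, ic):
    """Resnik similarity: IC of the most informative common ancestor."""
    common = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in common)


if __name__ == "__main__":
    ic = information_content()
    print("sim(A1, A2) =", round(resnik("A1", "A2", ic), 4))
    print("sim(A1, B1) =", round(resnik("A1", "B1", ic), 4))
```

The hierarchy is consulted only to collect common ancestors; the score itself is a lookup of a precomputed IC value, with no path lengths involved, which is consistent with the smaller benefit Resnik draws from grouping nearby terms.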
Total time increased using Threaded SSDD, and increased markedly using Threaded Resnik, as the number of slaves/splits increased. This is because the time required to run the data clustering algorithm was longer than that required for the semantic similarity calculation. In Threaded SORA, the time required to perform the semantic similarity calculation was very long compared to the time required to run the data clustering algorithm; therefore, total time was reduced considerably.
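The trade-off between clustering overhead and calculation time can be illustrated with a toy cost model. This is a sketch under stated assumptions, not a model fitted to the measurements of this study: it assumes a fixed splitting/clustering overhead per split, a perfectly parallel similarity calculation, and a hypothetical `locality_speedup` factor for the gain obtained when related terms are grouped together.

```python
def total_time_equal(n_pairs, per_pair_cost, n_slaves):
    """Total time when pairs are divided equally: calculation only."""
    return n_pairs * per_pair_cost / n_slaves


def total_time_by_similarity(n_pairs, per_pair_cost, n_slaves,
                             overhead_per_split, locality_speedup):
    """Total time when pairs are divided by similarity.

    overhead_per_split: assumed cost of GO splitting and clustering per split.
    locality_speedup:   assumed fraction (0..1) by which grouping related
                        terms shortens each similarity calculation.
    """
    overhead = overhead_per_split * n_slaves
    compute = n_pairs * per_pair_cost * (1.0 - locality_speedup) / n_slaves
    return overhead + compute


if __name__ == "__main__":
    # Hypothetical per-pair costs (seconds): a cheap and an expensive measure.
    for name, cost in [("cheap measure", 0.001), ("expensive measure", 1.0)]:
        equal = total_time_equal(1000, cost, 4)
        similar = total_time_by_similarity(1000, cost, 4, 2.0, 0.5)
        print(f"{name}: equal={equal:.3f}s, by similarity={similar:.3f}s")
```

Under these assumptions the overhead dominates for a cheap measure, so total time rises, while for an expensive measure the saving in calculation time outweighs the overhead, matching the qualitative pattern described for Resnik and SORA.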
Here, we proposed a method to enhance the three best SSMs in the field of biology using parallel and distributed processing. Our approach showed a dramatic reduction in average processing time. The reduction was increased gradually by increasing the number of slaves/splits.
In Threaded Resnik, if the number of splits is small, the resulting splits contain numerous unrelated or dissimilar genes. This does not decrease the average time, as is the case with more splits. Dividing the data based on their similarity in Resnik did not reduce the average time for any of the test cases. This is because the Resnik semantic similarity measure depends on the IC value in the calculation more than on the relationship and distance of the gene. Resnik depends on term location in the GO hierarchy only to obtain the IC value of MICA. Conversely, Threaded SSDD and SORA depend on the term location in the GO hierarchy. Therefore, the average time is reduced dramatically, and the reduction is increased gradually by increasing the number of slaves/splits.
Total time increased in Threaded Resnik and SSDD as the number of slaves/splits increased. This is because the time required to run the GO split and data clustering algorithms is longer than that required to calculate semantic similarity. The percentage increase in Resnik was large because the time required for the semantic similarity calculation in Resnik is much less than that required by SSDD. In Threaded SORA, total time was reduced significantly. This is because, in SORA, the time required for the semantic similarity calculation is very long compared to that required to run the GO split and data clustering algorithms.
These results were mainly limited by the system used to run our assessment, which considerably restricted the number of VMs we could run and the processors and RAM available to each virtual machine. Given a more powerful machine, we could complete assessments using larger sample sizes, which we could not achieve in this study. Further experiments are therefore needed to find the minimum and maximum number of VMs required to enhance performance.
In future studies, we will build a framework that will depend on the GO split and data clustering algorithms to automatically integrate big data in the field of biology. We will use Threaded Resnik, SSDD, and SORA to measure the similarity between genes and gene products, handling big data scalability and computational problems with good performance. Also, we will propose an algorithm to calculate the minimum and the maximum number of VMs that need to be used to enhance the performance.
The data used to support the findings of this study are available online at .
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
The authors thank the Deanship of Scientific Research and RSSU at King Saud University for their technical support. The authors would also like to thank the Deanship of Scientific Research for funding and supporting this research through the initiative of DSR Graduate Students Research Support (GSR).