BioMed Research International

Special Issue

Big Data and Network Biology


Research Article | Open Access

Volume 2014 | Article ID 831751

Sony Hartono Wijaya, Husnawati Husnawati, Farit Mochamad Afendi, Irmanida Batubara, Latifah K. Darusman, Md. Altaf-Ul-Amin, Tetsuo Sato, Naoaki Ono, Tadao Sugiura, Shigehiko Kanaya, "Supervised Clustering Based on DPClusO: Prediction of Plant-Disease Relations Using Jamu Formulas of KNApSAcK Database", BioMed Research International, vol. 2014, Article ID 831751, 15 pages, 2014.

Supervised Clustering Based on DPClusO: Prediction of Plant-Disease Relations Using Jamu Formulas of KNApSAcK Database

Academic Editor: Samuel Kuria Kiboi
Received 30 Nov 2013
Accepted 18 Feb 2014
Published 07 Apr 2014


Indonesia has the largest number of medicinal plant species in the world, and these plants are used as Jamu medicines. Jamu medicines are popular traditional medicines from Indonesia, and the formulation of Jamu needs to be systematized and its basic scientific principles developed to meet the requirements of the Indonesian healthcare system. We propose a new approach to predict relations between plants and diseases using network analysis and supervised clustering. In a preliminary step, we assigned 3138 Jamu formulas to 116 diseases of the International Classification of Diseases (ver. 10), which belong to 18 classes of disease from the National Center for Biotechnology Information. Correlation measures between Jamu pairs were determined based on their ingredient similarity. Networks were constructed and analyzed by selecting highly correlated Jamu pairs. Clusters were then generated using the network clustering algorithm DPClusO. Using the matching score of a cluster, the dominant disease and the high-frequency plant associated with the cluster were determined. The plant-disease relations predicted by our method were evaluated against previously published results, and around 90% of the predictions were successful.

1. Introduction

Big data biology, a discipline of data-intensive science, has emerged because of the rapid increase of data in omics fields such as genomics, transcriptomics, proteomics, and metabolomics, as well as in several other fields such as ethnomedicinal surveys. The number of medicinal plants is estimated to be 40,000 to 70,000 around the world [1], and many countries utilize these plants as blended herbal medicines, for example, China (traditional Chinese medicine), Japan (Kampo medicine), India (Ayurveda, Siddha, and Unani), and Indonesia (Jamu). Nowadays, the use of traditional medicines is rapidly increasing [2, 3]. These medicines consist of ingredients made from plants, animals, minerals, or combinations of them. Traditional medicines have been used for generations to treat diseases or maintain health, and the most popular form of traditional medicine is herbal medicine. Blended herbal medicines as well as single-herb medicines include a large number of constituent substances which exert effects on human physiology through a variety of biological pathways. The KNApSAcK Family database systems can be used to comprehensively understand the medicinal usage of plants based upon traditional and modern knowledge [4, 5]. This database has information about selected herbal ingredients, that is, the formulas of Kampo and Jamu, omics information of plants and humans, and physiological activities in humans. Jamu is generally composed based on the experience of users over decades or even hundreds of years. However, versatile scientific analyses are needed to support their efficacy and safety. Attaining this objective is in accordance with the 2010 policy of the Ministry of Health of the Indonesian Government on the scientification of Jamu. Thus, it is necessary to systematize the formulations and develop basic scientific principles of Jamu to meet the requirements of the Indonesian healthcare system. Afendi et al. initiated and conducted scientific analysis of Jamu to find the correlation between plants, Jamu, and their efficacy using statistical methods [6–8]. They used Biplot, partial least squares (PLS), and bootstrapping methods to summarize the data and also focused on the prediction of Jamu formulations. These methods give a good understanding of the relationship between plants, Jamu, and their efficacy. Among 465 plants used in 3138 Jamu, 190 plants were shown to be effective for at least one efficacy, and these plants were considered to be the main ingredients of Jamu. The other 275 plants are considered to be supporting ingredients in Jamu because their efficacy has not been established yet.

Network biology can be defined as the study of the network representations of molecular interactions, both to analyze such networks and to use them as a tool to make biological predictions [9]. This study includes modelling, analysis, and visualization, which play an important role in life science today [10]. Network analysis has been increasingly utilized in interpreting high-throughput data on omics information, including transcriptional regulatory networks [11], coexpression networks [12], and protein-protein interactions [13]. A network makes it easy to describe relationships between entities and to concentrate on the part of the network consisting of important nodes or edges. These advantages can be adopted for analyzing the medicinal usage of plants in Jamu and diseases. Network analysis provides information about groups of Jamu that are closely related to each other in terms of ingredient similarity and thus allows precise investigation to relate plants to diseases. On the other hand, multivariate statistical methods such as PLS can assign plants to efficacy by global linear modeling of the Jamu ingredients and efficacy. However, there is still a lack of appropriate network-based methods to learn how and why many plants are grouped in a certain Jamu formula and what combination rules underlie the numerous Jamu formulas.

The relationship between Indonesian herbal plants used in Jamu medicines and the diseases treated with Jamu medicines needs to be explored. Once the effectiveness of a plant against a disease is firmly established, further analysis of that plant can proceed to the molecular level to pinpoint the drug targets. The present study developed a network-based approach for prediction of plant-disease relations. We utilized the Jamu data from the KNApSAcK database. A Jamu network was constructed based on the similarity of Jamu ingredients, and then Jamu clusters were generated using the network clustering algorithm DPClusO [14, 15]. Plant-disease relations were then predicted by determining the dominant diseases and plants associated with selected Jamu clusters.

2. Methods

2.1. Concept of the Methodology

Jamu medicines consist of combinations of medicinal plants and are used to treat various diseases. In this work we exploit the ingredient similarity between Jamu medicines to predict plant-disease relations. The concept of the proposed method is depicted in Figure 1. In step 1 a network is constructed where a node is a Jamu medicine and an edge represents high ingredient similarity between the corresponding Jamu pair. In Figure 1, nodes of the same color indicate Jamu medicines used for the same disease. The similarity is represented by the Pearson correlation coefficient [16, 17]; that is,

$$r_{ij} = \frac{\sum_{k}(x_{ik}-\bar{x}_i)(x_{jk}-\bar{x}_j)}{\sqrt{\sum_{k}(x_{ik}-\bar{x}_i)^2}\,\sqrt{\sum_{k}(x_{jk}-\bar{x}_j)^2}},$$

where $x_{ik}$ is the weight of plant-$k$ in Jamu $i$, $x_{jk}$ is the weight of plant-$k$ in Jamu $j$, $\bar{x}_i$ is the mean of Jamu $i$, and $\bar{x}_j$ is the mean of Jamu $j$. The higher the similarity between a Jamu pair, the higher the correlation value. In the present study, $x_{ik}$ and $x_{jk}$ are assigned as 1 or 0 when the $k$th plant is, respectively, included or not included in the formula. Under this condition, the Pearson correlation corresponds to the fourfold point correlation coefficient; that is,

$$r_{ij} = \frac{ad - bc}{\sqrt{(a+b)(c+d)(a+c)(b+d)}},$$

where $a$, $b$, $c$, and $d$ represent the numbers of plants included in both $i$ and $j$, in only $i$, in only $j$, and in neither $i$ nor $j$, respectively.
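As an illustration of step 1, the following minimal sketch computes the fourfold point correlation from the four plant counts (plants in both formulas, in only the first, in only the second, and in neither) and checks that it agrees with the Pearson correlation of the corresponding 0/1 vectors. The two formulas and the six-plant universe are hypothetical toy data, and the function names are illustrative only.

```python
import math

def fourfold_correlation(jamu_i, jamu_j, all_plants):
    """Fourfold point correlation between two formulas given as sets of plants."""
    a = len(jamu_i & jamu_j)               # plants in both formulas
    b = len(jamu_i - jamu_j)               # plants only in the first formula
    c = len(jamu_j - jamu_i)               # plants only in the second formula
    d = len(all_plants) - a - b - c        # plants in neither formula
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom if denom else 0.0

def pearson(x, y):
    """Plain Pearson correlation of two equally long numeric vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    var_y = sum((yi - mean_y) ** 2 for yi in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: a universe of six plants and two formulas.
plants = ["P1", "P2", "P3", "P4", "P5", "P6"]
jamu_1 = {"P1", "P2", "P3"}
jamu_2 = {"P2", "P3", "P4"}

phi = fourfold_correlation(jamu_1, jamu_2, set(plants))
x = [1 if p in jamu_1 else 0 for p in plants]   # binary ingredient vectors
y = [1 if p in jamu_2 else 0 for p in plants]
r = pearson(x, y)
print(round(phi, 4), round(r, 4))
```

Because only set sizes are needed, the fourfold form avoids building the full 0/1 vectors when many formula pairs have to be compared.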

In step 2 the Jamu clusters are generated using the network clustering algorithm DPClusO. DPClusO generates clusters characterized by high density and identified by periphery; that is, the Jamu medicines belonging to a cluster are highly cohesive and separated from the rest of the network by a natural boundary. Such clusters contain potential information about plant-disease relations.

In step 3 we assess disease-dominant clusters based on the matching score, represented by the following equation:

$$\text{Matching score} = \frac{\max_{d} n_d}{N},$$

where $n_d$ is the number of Jamu in the cluster associated with disease $d$ and $N$ is the total number of Jamu in the cluster. In other words, the matching score of a cluster is the ratio of the highest number of Jamu associated with a single disease to the total number of Jamu in the cluster. We assign a disease to a cluster when its matching score is greater than a threshold value. In step 4, we determine the frequency of plants associated with a cluster if and only if a disease was assigned to it in the previous step. The highest-frequency plant associated with a cluster is considered to be related to the disease assigned to that cluster. The true positive rate (TPR), or sensitivity, was used to evaluate the resulting plants. TPR is the proportion of true positive predictions out of all actual positives, defined by the following formula [18]:

$$\mathrm{TPR} = \frac{\text{true positive}}{\text{true positive} + \text{false negative}},$$

where true positive is the number of correctly classified entities and false negative is the number of incorrectly rejected entities. We refer to the proposed method as supervised clustering because, after generating the clusters, we narrow down the candidate clusters for further analysis based on supervised learning and thus improve the prediction accuracy of the proposed method.
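The two quantities defined in this step can be sketched in a few lines. The example below is only an illustration under stated assumptions: the cluster contents and the true positive and false negative counts are hypothetical, and the function names are not taken from the original work.

```python
from collections import Counter

def matching_score(cluster_diseases):
    """Return (dominant disease, matching score) for one cluster.

    `cluster_diseases` lists the disease label of every Jamu in the cluster.
    """
    counts = Counter(cluster_diseases)
    disease, highest = counts.most_common(1)[0]
    return disease, highest / len(cluster_diseases)

def true_positive_rate(true_positive, false_negative):
    """Sensitivity: true positives over true positives plus false negatives."""
    return true_positive / (true_positive + false_negative)

# Hypothetical cluster of five Jamu: four share one disease class.
cluster = ["muscle and bone"] * 4 + ["respiratory diseases"]
dominant, score = matching_score(cluster)
print(dominant, score)

# Hypothetical evaluation counts.
tpr = true_positive_rate(true_positive=9, false_negative=1)
print(tpr)
```

With a threshold of 0.6, this toy cluster would be assigned its dominant disease, since its matching score exceeds the threshold.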

3. Results and Discussion

3.1. Construction and Comparison of Jamu and Random Networks

We used the same number of Jamu formulas as the previous research [6], that is, 3138 Jamu formulas, and the set union of all formulas consists of 465 plants. We assigned the 3138 Jamu formulas to 116 diseases of the International Classification of Diseases (ICD) version 10 from the World Health Organization (WHO, Table 1) [19]. Those 116 diseases are mapped to 18 classes of disease, comprising 16 classes of disease from the National Center for Biotechnology Information (NCBI) [20] and 2 additional classes. Table 2 shows the distribution of the 3138 Jamu across the 18 classes of disease. According to this classification, most Jamu formulas are used for muscle and bone, nutritional and metabolic diseases, and the digestive system. Furthermore, no Jamu formula is classified into the glands and hormones or neonatal diseases classes. We excluded from the evaluation process the 4 Jamu formulas that are used to treat fever, because this symptom is very general and appears in almost all disease classes. Jamu-plant-disease relations can be represented using 2 matrices: the first is the Jamu-plant relation, with dimension $3138 \times 465$, and the second is the Jamu-disease relation, whose rows likewise correspond to the 3138 Jamu formulas.

Table 1

| ID | Disease | Class of disease |
|---|---|---|
| 1 | Abdominal pain | 3 |
| 2 | Abdominal pain, diarrhea | 3 |
| 4 | Acne, skin problems (cosmetics) | 16 |
| 5 | Amenorrhoea, dysmenorrhea | 6 |
| 6 | Amenorrhoea, irregular menstruation | 6 |
| 8 | Appendicitis, urinary tract infection, tonsillitis | 3 |
| 10 | Arthralgia, arthritis | 11 |
| 12 | Benign prostatic hyperplasia (Bph) | 10 |
| 13 | Breast disorder | 6 |
| 17 | Cancer pain | 2 |
| 18 | Cancer, inflammation | 2 |
| 19 | Colic abdomen, bloating (in infant) | 3 |
| 20 | Common cold | 15 |
| 21 | Common cold, dyspepsia, insect bites | 15, 3, 16 |
| 22 | Common cold, influenza | 15 |
| 24 | Degenerative disease | 14 |
| 25 | Dermatitis, urticaria, erythema | 16 |
| 27 | Diabetic gangrene | 16 |
| 29 | Diarrhea, abdominal pain | 3 |
| 30 | Diseases of the eye | 5 |
| 31 | Disorders in pregnancy | 6 |
| 33 | Dysmenorrhea, irregular menstruation | 6 |
| 34 | Dysmenorrhea, menstrual syndrome | 6 |
| 37 | Dyspnoea, cough, orthopnoea | 15 |
| 39 | Fatigue, anaemia, loss appetite | 1 |
| 40 | Fatigue, lack of sexual function | 6 |
| 41 | Fatigue, low back pain | 11 |
| 42 | Fatigue, myalgia, arthralgia | 11 |
| 43 | Fatigue, osteoarthritis | 11 |
| 44 | Fertility problem | 6, 10 |
| 46 | Gastritis, gastric ulcer | 3 |
| 49 | Heart diseases | 8 |
| 50 | Heartburn | 3, 8 |
| 51 | Hepatitis, other diseases of liver | 3 |
| 54 | Hypertension, diabetes | 14 |
| 55 | Hypertension, hypercholesterolaemia | 14 |
| 58 | Indigestion (K.30) | 3 |
| 59 | Indigestion, lose appetite | 3 |
| 60 | Infertility | 6, 10 |
| 61 | Irregular menstruation, menstruation syndrome | 6 |
| 62 | Kidney diseases | 17 |
| 63 | Lactation problems | 6 |
| 64 | Leukorrhoea (Vaginalis) | 6 |
| 65 | Leukorrhoea (Vaginalis), dysmenorrhoea | 6 |
| 66 | Lose appetite | 3 |
| 67 | Lose appetite, underweight | 14 |
| 68 | Low back pain, myalgia, arthralgia | 11 |
| 69 | Low back pain, myalgia, constipation | 11 |
| 70 | Low back pain, urinary tract infection | 17 |
| 71 | Lung diseases | 15 |
| 72 | Malaise and Fatigue | 11 |
| 73 | Malaise and Fatigue, Constipation | 11 |
| 74 | Malaise and Fatigue, Fertility Problems | 10, 11 |
| 75 | Malaise and Fatigue, Low Back Pain | 11 |
| 76 | Malaise and Fatigue, Sexual Dysfunction | 11, 6, 10 |
| 77 | Malaise and Fatigue, Skin Problems (Cosmetics) | 16 |
| 78 | Malaria, anaemia | 1 |
| 80 | Menopausal syndrome | 6 |
| 81 | Menopause/menstrual syndrome, leukorrhoea (vaginalis) | 6 |
| 82 | Menstrual syndrome | 6 |
| 83 | Menstrual syndrome, fatigue | 6 |
| 85 | Mood disorder | 18 |
| 86 | Myalgia, arthralgia | 11 |
| 87 | Nausea/vomiting of pregnancy | 6 |
| 89 | Osteoarthritis, fatigue | 11 |
| 90 | Overweight, obesity | 14 |
| 92 | Post partum syndrome | 6 |
| 93 | Prevent from overweight | 14 |
| 94 | Respiratory infection due to smoking | 15 |
| 95 | Respiratory tract infection | 15 |
| 96 | Rheumatoid arthritis, gout | 11 |
| 97 | Secondary amenorrhea | 6 |
| 98 | Secondary amenorrhea, irregular menstruation | 6 |
| 99 | Sexual dysfunction, fatigue | 6, 10 |
| 100 | Skin diseases | 16 |
| 101 | Skin problems (cosmetics) | 16 |
| 102 | Sleeping and Mood Disorders | 18 |
| 103 | Sleeping disorders | 18 |
| 105 | Stomatitis, gingivitis, tonsilitis | 3 |
| 106 | Stone in kidney (N20.0) | 17 |
| 107 | Stone in kidney (N20.0), urinary bladder stone (N21.0) | 17 |
| 111 | Typhoid, dyspepsia | 3 |
| 112 | Ulcer of anus and rectum | 3 |
| 113 | Underweight, lose appetite | 3 |
| 114 | Urinary tract infection (urethritis) | 17 |
| 115 | Vaginal discharges | 6 |
| 116 | Vaginal diseases | 6 |

Table 2

| ID | Class of disease (NCBI) | Ref. | Number of Jamu | Percentage |
|---|---|---|---|---|
| 1 | Blood and lymph diseases | NCBI | 201 | 6.41 |
| 2 | Cancers | NCBI | 32 | 1.02 |
| 3 | The digestive system | NCBI | 457 | 14.56 |
| 4 | Ear, nose, and throat | NCBI | 2 | 0.06 |
| 5 | Diseases of the eye | NCBI | 1 | 0.03 |
| 6 | Female-specific diseases | NCBI | 382 | 12.17 |
| 7 | Glands and hormones | NCBI | 0 | |
| 8 | The heart and blood vessels | NCBI | 57 | 1.82 |
| 9 | Diseases of the immune system | NCBI | 22 | 0.70 |
| 10 | Male-specific diseases | NCBI | 17 | 0.54 |
| 11 | Muscle and bone | NCBI | 649 | 20.68 |
| 12 | Neonatal diseases | NCBI | 0 | |
| 13 | The nervous system | NCBI | 32 | 1.02 |
| 14 | Nutritional and metabolic diseases | NCBI | 576 | 18.36 |
| 15 | Respiratory diseases | NCBI | 313 | 9.97 |
| 16 | Skin and connective tissue | NCBI | 163 | 5.19 |
| 17 | The urinary system | * | 90 | 2.87 |
| 18 | Mental and behavioral disorders | * | 21 | 0.67 |
| | The number of Jamu classified into multiple disease classes | | 119 | 3.79 |
| | The number of Jamu unclassified | | 4 | 0.13 |
| | Total Jamu formulas | | 3138 | 100.00 |

After completing the data acquisition process, we calculated the similarity between Jamu pairs using the correlation measure, based on their ingredients. For $n$ Jamu formulas (3138 in the present case), there can be at most $n(n-1)/2$ = 4,921,953 Jamu pairs. We sorted the Jamu pairs by correlation value in descending order and selected the top 0.7%, 0.5%, and 0.3% of pairs to create 3 sets of Jamu pairs. The numbers of Jamu pairs in the 0.7%, 0.5%, and 0.3% datasets are 34,454, 24,610, and 14,766, and the corresponding minimum correlation values are 0.596, 0.665, and 0.718, respectively. The three datasets of Jamu pairs can be regarded as three undirected networks (step 1 in Figure 1) consisting of 2779, 2496, and 2085 Jamu formulas, respectively (Table 3). Figure 2 shows a visualization of the 0.7% Jamu network using the Cytoscape Spring Embedded layout. We verified that the degree distributions of the Jamu networks are somewhat close to those of scale-free networks, that is, they roughly follow a power law. However, in the high-degree region the power law structure is broken (Figure 3). A nearly exact power-law relation between medicinal herbs and the number of formulas utilizing them was observed in the Jamu system but not in Kampo (the Japanese crude drug system) [4]. The difference in formulas between Jamu and Kampo can be explained by herb selection by medicinal researchers based on an optimization process of selection [4]. Thus, the broken power law structure of the Jamu networks is associated with the fact that selecting Jamu pairs based on ingredient correlation leads to nonrandom selection. We also constructed random networks according to the Erdős-Rényi (ER) model [21], the Barabási-Albert (BA) model [22], and Vazquez's Connecting Nearest Neighbor (CNN) model [23], each of the same size as the corresponding real Jamu network. We used the Cytoscape Network Analyzer plugin [24] and R software to analyze the characteristics of both the Jamu and the random networks.
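The pair counts quoted above can be reproduced with a short calculation. The sketch below assumes that the number of selected pairs is obtained by rounding the fraction of the total to the nearest integer (the text does not state the rounding rule), and the helper names and the toy correlation values are hypothetical.

```python
import math

N_FORMULAS = 3138  # number of Jamu formulas stated in the text

def top_pair_count(total_pairs, fraction):
    """Number of pairs kept for a top fraction (rounding rule is an assumption)."""
    return round(total_pairs * fraction)

def select_top_pairs(correlations, fraction):
    """Return the highest-correlation pairs; `correlations` maps (i, j) -> r."""
    ranked = sorted(correlations.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top_pair_count(len(ranked), fraction)]

total_pairs = math.comb(N_FORMULAS, 2)          # n(n-1)/2 possible pairs
pair_counts = {f: top_pair_count(total_pairs, f) for f in (0.007, 0.005, 0.003)}
print(total_pairs)    # 4921953
print(pair_counts)    # {0.007: 34454, 0.005: 24610, 0.003: 14766}

# Hypothetical toy correlations between four formulas; keep the top half.
toy = {("J1", "J2"): 0.9, ("J1", "J3"): 0.2, ("J2", "J3"): 0.7, ("J3", "J4"): 0.4}
top_half = select_top_pairs(toy, 0.5)
print(top_half)
```

The selected pairs form the edge list of the network; the smallest correlation among them plays the role of the minimum correlation reported for each dataset.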


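Table 3

| | | 0.7% | 0.5% | 0.3% |
|---|---|---|---|---|
| Network statistics | Total pairs | 34,454 | 24,610 | 14,766 |
| | Minimum correlation | 0.596 | 0.665 | 0.718 |
| | Number of Jamu formulas | 2,779 | 2,496 | 2,085 |
| | Average degree | 24.8 | 19.7 | 14.2 |
| | (Random network: ER) | | | |
| | (Random network: BA) | | | |
| | (Random network: CNN) | | | |
| | Clustering coefficient | 0.521 | 0.520 | 0.540 |
| | (Random network: ER) | | | |
| | (Random network: BA) | | | |
| | (Random network: CNN) | | | |
| | Number of connected components | 69 | 119 | 254 |
| | (Random networks: ER, BA, CNN) | (1) | (1) | (1) |
| | Network diameter | 15 | 17 | 20 |
| | (Random network: ER) | | | |
| | (Random network: BA) | | | |
| | (Random network: CNN) | | | |
| | Network density | 0.008 | 0.008 | 0.007 |
| | (Random network: ER) | | | |
| | (Random network: BA) | | | |
| | (Random network: CNN) | | | |
| DPClusO | Total number of clusters | 1,746 | 1,411 | 938 |
| | Number of clusters with more than 2 Jamu | 1,296 | 873 | 453 |
| | Number of Jamu formulas in the biggest cluster | 118 | 104 | 89 |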

We determined five statistical indexes, that is, average degree, clustering coefficient, number of connected components, network diameter, and network density, for each Jamu network and each random network. The clustering coefficient of a node $n$ is defined as $C_n = 2e_n/(k_n(k_n-1))$, where $k_n$ is the number of neighbors of $n$ and $e_n$ is the number of connected pairs between all neighbors of $n$. The network diameter is the largest distance between any two nodes. If a network is disconnected, its diameter is the maximum of the diameters of its connected components. A network's density is the ratio of the number of edges in the network to the total number of possible edges between all pairs of nodes (which is $N(N-1)/2$, where $N$ is the number of vertices, for an undirected graph). The average number of neighbors and the network density are the same for the real and random networks of the same size, as shown in Table 3. For the 0.7% and 0.5% real networks, the clustering coefficient is roughly the same, and for the 0.3% network it is somewhat larger. The number of connected components and the diameter of the Jamu networks gradually decrease as the network grows bigger by addition of more nodes and edges.
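The definitions above can be made concrete with a small pure-Python sketch. It is not the Cytoscape Network Analyzer implementation; the six-node graph is hypothetical toy data, and the functions simply follow the definitions given in the text (local clustering coefficient, density, and diameter taken as the maximum over connected components).

```python
from collections import deque
from itertools import combinations

def clustering_coefficient(adj, node):
    """2 * e / (k * (k - 1)) for one node; 0.0 if it has fewer than 2 neighbors."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0
    e = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * e / (k * (k - 1))

def density(adj):
    """Number of edges divided by N * (N - 1) / 2 for an undirected graph."""
    n = len(adj)
    edges = sum(len(nb) for nb in adj.values()) // 2
    return edges / (n * (n - 1) / 2)

def diameter(adj):
    """Largest shortest-path distance, taking the maximum over components."""
    best = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:                       # breadth-first search from source
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        best = max(best, max(dist.values()))
    return best

# Hypothetical toy network with two components.
edges = [("J1", "J2"), ("J1", "J3"), ("J2", "J3"), ("J3", "J4"), ("J5", "J6")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

average_degree = 2 * len(edges) / len(adj)
print(clustering_coefficient(adj, "J3"), density(adj), diameter(adj), average_degree)
```

The same relation, average degree = 2 × edges / nodes, reproduces the average degrees in Table 3 from the pair and formula counts, for example 2 × 34,454 / 2,779 ≈ 24.8.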

The very different values of clustering coefficient, number of connected components, and network diameter imply that the Jamu networks are quite different from all 3 types of random networks. The differences between the Jamu networks and the ER random networks are the largest. Random networks constructed based on the other two models are also substantially different from the Jamu networks. Because the random networks constructed from all three models differ from the Jamu networks, it can be concluded that the structure of the Jamu networks is nonrandom and thus might contain information about plant-disease relations. In particular, the much higher clustering coefficient indicates that there are clusters in the networks worth investigating. To extract clusters from the Jamu networks (step 2 in Figure 1) we applied the DPClusO network clustering algorithm [14] to generate overlapping clusters based on density and periphery tracking.

3.2. Supervised Clustering Based on DPClusO

DPClusO is a general-purpose clustering algorithm, useful for finding overlapping cohesive groups in an undirected simple graph for any type of application. It ensures coverage and performs robustly under random addition, removal, and rearrangement of edges in protein-protein interaction (PPI) networks [14]. When applying DPClusO, the parameter values of density and cluster property that we used in this experiment are 0.9 and 0.5, respectively [15]. Table 3 shows the summary of the clustering result by DPClusO. Because clusters consisting of two Jamu formulas are trivial clusters, for the next steps we only use clusters that consist of 3 or more Jamu formulas. The total number of clusters increases with the larger dataset, although the threshold correlation between Jamu pairs decreases. We evaluated the clustering result using the matching score to determine the dominant disease for every cluster (step 3 in Figure 1). The matching score of a cluster is the ratio of the highest number of Jamu associated with the same disease to the total number of Jamu in the cluster. Thus the matching score is a measure of how strongly a disease is associated with a cluster. Figure 4 shows the distribution of the clusters with respect to matching score for the three datasets. All datasets have the highest frequency of clusters at matching score >0.9, and overall most of the clusters have a high matching score, which means that most of the clusters generated by DPClusO can be confidently related to a dominant disease. Furthermore, in the 0.3% dataset the number of clusters with matching score >0.9 is remarkably larger than the number in the other ranges of matching score (Figure 4(c)). If we compare the proportion of clusters with matching score >0.9 for every dataset, the 0.3% dataset has the highest proportion, 40.84% (of 453), compared to 29.67% (of 873) and 21.91% (of 1296) for the 0.5% and 0.7% datasets, respectively. Thus, the most reliable plant-disease relations can be predicted at matching score >0.9 from the clusters generated from the 0.3% dataset.

Figure 5(a) shows the success rate for all 3 datasets with respect to threshold matching scores. Success rate is defined as the ratio of the number of clusters with matching score larger than the threshold to the total number of clusters. As expected, the success rate tends to be lower when a lower correlation value is used to create the dataset. However, more clusters are generated and more information can be extracted when we lower the threshold correlation value. The success rate increases rapidly as the threshold matching score decreases from 0.9 to 0.6, and below 0.6 the increase slows down. Therefore in this study we empirically chose 0.6 as the threshold matching score to predict plant-disease relations.
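The success rate defined above is a simple proportion and can be sketched as follows. The list of matching scores is hypothetical toy data, not taken from the Jamu clusters, and the function name is illustrative.

```python
def success_rate(matching_scores, threshold):
    """Fraction of clusters whose matching score is larger than the threshold."""
    passed = sum(1 for score in matching_scores if score > threshold)
    return passed / len(matching_scores)

# Hypothetical matching scores of ten clusters.
scores = [1.0, 0.95, 0.8, 0.65, 0.5, 0.4, 1.0, 0.7, 0.3, 0.62]

# Success rate for a range of threshold matching scores, from strict to lenient.
curve = {t: success_rate(scores, t) for t in (0.9, 0.8, 0.7, 0.6, 0.5)}
print(curve)
```

Scanning the thresholds in this way gives the kind of curve from which an empirical threshold, such as the 0.6 used in this study, can be read off.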

3.3. Assignment of Plants to Disease

Using the clusters produced by DPClusO, we assigned plants to classes of disease. Based on a threshold matching score, we assigned a dominant disease to a cluster. We then assigned a plant to the cluster by analyzing the ingredients of the Jamu formulas belonging to that cluster and determining the highest-frequency plant, that is, the plant that is used in the maximum number of Jamu belonging to that cluster (step 4 in Figure 1). Thus we assign a disease and a plant to each cluster having a matching score greater than the threshold. Our hypothesis is that the disease and the plant assigned to the same cluster are related.
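Steps 3 and 4 together can be illustrated with the minimal sketch below. The clusters, disease labels, and ingredient lists are hypothetical toy data, the function name is illustrative, and ties between equally frequent plants are not handled in any special way (an assumption, since the text does not describe tie-breaking).

```python
from collections import Counter

def assign_disease_and_plant(cluster, jamu_disease, jamu_plants, threshold=0.6):
    """Return (dominant disease, most frequent plant), or None below threshold."""
    disease_counts = Counter(jamu_disease[j] for j in cluster)
    disease, highest = disease_counts.most_common(1)[0]
    if highest / len(cluster) <= threshold:      # matching score must exceed it
        return None
    plant_counts = Counter(p for j in cluster for p in jamu_plants[j])
    plant, _ = plant_counts.most_common(1)[0]    # highest-frequency plant
    return disease, plant

# Hypothetical toy data for six Jamu formulas.
jamu_disease = {
    "J1": "muscle and bone", "J2": "muscle and bone", "J3": "muscle and bone",
    "J4": "respiratory diseases", "J5": "the digestive system",
    "J6": "respiratory diseases",
}
jamu_plants = {
    "J1": {"Zingiber officinale", "Curcuma longa"},
    "J2": {"Zingiber officinale", "Piper nigrum"},
    "J3": {"Zingiber officinale", "Curcuma longa"},
    "J4": {"Zingiber officinale", "Mentha piperita"},
    "J5": {"Psidium guajava"},
    "J6": {"Mentha piperita"},
}

accepted = assign_disease_and_plant(["J1", "J2", "J3", "J4"], jamu_disease, jamu_plants)
rejected = assign_disease_and_plant(["J3", "J4", "J5", "J6"], jamu_disease, jamu_plants)
print(accepted)
print(rejected)
```

The first toy cluster has a matching score of 0.75 and yields a plant-disease pair, whereas the second has a matching score of 0.5 and is discarded.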

The total number of assigned plants depends on the matching score value. Figure 5(b) shows the number of predicted plants that can be assigned to diseases as a function of matching score. With a higher matching score, the number of predicted plants assigned to classes of disease is expected to remain similar or decrease, but the reliability of prediction increases. In Figure 5(b) a sudden change in the number of predicted plants is seen at matching score 0.6, which we use as the empirical threshold in this work. Based on the 0.7% dataset, the largest number of plants (135 plants, Table 4) was assigned to diseases. There are 63 plants assigned to only one class of disease, whereas the other 72 plants are assigned to two or more classes of disease (Figure 6).

NumberPlants  nameHit-miss  status

A.  Disease:  blood  and  lymph diseases     
1Tamarindus  indica Hit *
2Allium  sativum Hit *
3Tinospora  tuberculata Hit *
4Piper  retrofractum Hit   
5Syzygium  aromaticum Hit *
6Bupleurum  falcatum Hit   
7Graptophyllum  pictum Hit   
8Plantago  major Hit   
9Zingiber  officinale Hit *
10Cinnamomum   burmannii Hit *
11Soya  max Miss *
12Kaempferia  galanga Hit   
13Curcuma  longa Hit *
14Piper  nigrum Hit   
15Zingiber  aromaticum Hit *
16Phyllanthus  urinaria Hit *
17Oryza  sativa Hit   
18Myristica  fragrans Hit *
19Alstonia  scholaris Hit *
20Syzygium  polyanthum Miss   
21Andrographis  paniculata Hit *
22Sida  rhombifolia Miss   
23Cyperus  rotundus Hit   
24Sonchus  arvensis Miss   
25Curcuma  aeruginosa Hit *
26Curcuma  xanthorrhiza Hit   

B.  Disease:  cancers      
1Catharanthus  roseus Hit   

C.  Disease:  the  digestive  system      
1Foeniculum  vulgare Hit   
2Glycyrrhiza  uralensis Hit *
3Imperata  cylindrica Hit   
4Zingiber  purpureum Hit *
5Physalis  peruviana Hit   
6Punica  granatum Hit *
7Echinacea  purpurea Hit   
8Zingiber  officinale Hit *
9Psidium  guajava Hit   
10Baeckea  frutescens Hit *
11Amomum  compactum Hit   
12Cinnamomum  burmannii Hit *
13Melaleuca  leucadendra Hit   
14Caesalpinia  sappan Hit *
15Parkia  roxburghii Hit   
16Rheum  tanguticum Hit   
17Kaempferia  galanga Hit   
18Coriandrum  sativum Hit   
19Curcuma  longa Hit   
20Zingiber  aromaticum Hit   
21Phyllanthus  urinaria Hit   
22Myristica  fragrans Hit   
23Hydrocotyle  asiatica Hit *
24Carica  papaya Hit   
25Mentha  arvensis Hit   
26Lepiniopsis  ternatensis Hit   
27Helicteres  isora Hit   
28Andrographis  paniculata Hit   
29Symplocos  odoratissima Hit   
30Schisandra  chinensis Hit   
31Blumea  balsamifera Hit   
32Silybum  marianum Hit *
33Cinnamomum   sintoc Hit   
34Elephantopus  scaber Hit   
35Curcuma  aeruginosa Hit   
36Kaempferia  pandurata Hit   
37Curcuma  xanthorrhiza Hit   
38Curcuma  mangga Hit *
39Curcuma  zedoaria Hit   
40Daucus  carota Hit *
41Matricaria  chamomilla Hit *
42Cymbopogon  nardus Hit *

D.  Disease:  female-specific  diseases      
1Foeniculum  vulgare Hit   
2Imperata  cylindrica Hit   
3Tamarindus  indica Hit   
4Pluchea  indica Hit *
5Piper  retrofractum Hit   
6Punica  granatum Hit   
7Uncaria  rhynchophylla Hit   
8Zingiber  officinale Hit   
9Guazuma  ulmifolia Hit *
10Nigella  sativa Hit   
11Terminalia  bellirica Hit   
12Baeckea  frutescens Hit   
13Phaseolus  radiatus Hit   
14Amomum  compactum Hit *
15Sauropus  androgynus Hit   
16Usnea  misaminensis Hit   
17Cinnamomum   burmannii Hit   
18Melaleuca  leucadendra Hit   
19Parameria  laevigata Hit   
20Parkia  roxburghii Hit   
21Piper  cubeba Hit   
22Kaempferia  galanga Hit   
23Coriandrum  sativum Hit   
24Kaempferia  angustifolia Hit   
25Curcuma  longa Hit   
26Zingiber  aromaticum Hit   
27Languas  galanga Hit   
28Galla  lusitania Hit   
29Quercus  lusitanica Hit   
30Hydrocotyle  asiatica Hit   
31Areca  catechu Hit   
32Lepiniopsis  ternatensis Hit   
33Helicteres  isora Hit *
34Piper  betle Hit   
35Elephantopus  scaber Hit *
36Kaempferia  pandurata Hit   
37Curcuma  xanthorrhiza Hit   
38Sesbania  grandiflora Hit   

E.  Disease:  the  heart  and  blood  vessels      
1Allium  sativum Hit   
2Curcuma  longa Hit *
3Morinda  citrifolia Hit *
4Homalomena  occulta Hit *
5Hydrocotyle  asiatica Hit   
6Alstonia  scholaris Hit *
7Syzygium  polyanthum Miss *
8Andrographis  paniculata Hit *
9Apium  graveolens Miss   
10Imperata  cylindrica Hit   

F.  Disease:  male-specific  diseases      
1Cucurbita  pepo Miss   
2Serenoa  repens Miss   
3Baeckea  frutescens Hit   
4Phaseolus  radiatus Hit   
5Curcuma  longa Hit   
6Elephantopus  scaber Hit   

G.  Disease:  muscle  and  bone      
1Foeniculum  vulgare Hit   
2Clausena  anisum-olens Hit *
3Zingiber  purpureum Hit   
4Allium  sativum Hit   
5Strychnos  ligustrina Hit   
6Tinospora  tuberculata Hit *
7Piper  retrofractum Hit   
8Syzygium  aromaticum Hit   
9Cola  nitida Hit *
10Ginkgo  biloba Hit *
11Panax  ginseng Hit   
12Equisetum  debile Hit *
13Zingiber  officinale Hit   
14Ganoderma  lucidum Hit   
15Nigella  sativa Hit   
16Terminalia  bellirica Hit *
17Baeckea  frutescens Hit *
18Amomum  compactum Hit   
19Cinnamomum   burmannii Hit   
20Melaleuca  leucadendra Hit   
21Parameria  laevigata Hit *
22Psophocarpus  tetragonolobus Hit *
23Parkia  roxburghii Hit   
24Piper  cubeba Hit *
25Kaempferia  galanga Hit   
26Coriandrum  sativum Hit   
27Cola  acuminata Hit   
28Coffea  arabica Hit   
29Orthosiphon  stamineus Hit   
30Curcuma  longa Hit   
31Piper  nigrum Hit   
32Alpinia  galanga Hit   
33Vitex  trifolia Hit   
34Zingiber  amaricans Hit *
35Zingiber  zerumbet Hit   
36Zingiber  aromaticum Hit   
37Languas  galanga Hit   
38Massoia  aromatica Hit   
39Morinda  citrifolia Hit   
40Carum  copticum Hit *
41Panax  pseudoginseng Hit *
42Oryza  sativa Hit   
43Myristica  fragrans Hit   
44Pandanus  amaryllifolius Hit   
45Eurycoma  longifolia Hit   
46Hydrocotyle  asiatica Hit   
47Areca  catechu Hit *
48Mentha  arvensis Hit *
49Lepiniopsis  ternatensis Hit   
50Pimpinella  pruatjan Hit   
51Andrographis  paniculata Hit   
52Blumea  balsamifera Hit   
53Cymbopogon  nardus Hit   
54Sida  rhombifolia Hit   
55Cinnamomum   sintoc Hit   
56Piper  betle Hit *
57Talinum  paniculatum Hit   
58Elephantopus  scaber Hit   
59Cyperus  rotundus Hit   
60Curcuma  aeruginosa Hit   
61Kaempferia  pandurata Hit *
62Curcuma  xanthorrhiza Hit   
63Tribulus  terrestris Hit   
64Corydalis  yanhusuo Hit   
65Pausinystalia  yohimbe Hit   

H.  Disease:  nutritional  and metabolic  diseases      
1Foeniculum  vulgare Hit   
2Glycyrrhiza  uralensis Hit   
3Zingiber  purpureum Hit   
4Allium  sativum Hit   
5Tinospora  tuberculata Hit   
6Pandanus  conoideus Hit   
7Syzygium  aromaticum Hit   
8Punica  granatum Hit   
9Zingiber  officinale Hit   
10Guazuma  ulmifolia Hit   
11Nigella  sativa Hit   
12Amomum  compactum Hit *
13Cinnamomum   burmannii Hit   
14Parameria  laevigata Hit   
15Caesalpinia  sappan Hit   
16Soya  max Hit *
17Cocos  nucifera Hit   
18Rheum  tanguticum Hit   
19Piper  cubeba Hit *
20Murraya  paniculata Hit   
21Kaempferia  galanga Hit *
22Coffea  arabica Hit *
23Orthosiphon  stamineus Hit   
24Curcuma  longa Hit   
25Piper  nigrum Hit *
26Zingiber  aromaticum Hit   
27Aloe  vera Hit   
28Phaleria  papuana Hit   
29Galla  lusitania Hit   
30Quercus  lusitanica Hit   
31Morinda  citrifolia Hit   
32Myristica  fragrans Hit *
33Momordica  charantia Hit   
34Areca  catechu Hit   
35Lepiniopsis  ternatensis Hit   
36Alstonia  scholaris Hit   
37Hibiscus  sabdariffa Hit   
38Laminaria  japonica Hit   
39Syzygium  polyanthum Hit   
40Andrographis  paniculata Hit   
41Sindora  sumatrana Hit *
42Cassia  angustifolia Hit   
43Woodfordia  floribunda Hit   
44Piper  betle Hit   
45Spirulina Hit   
46Stevia  rebaudiana Hit   
47Theae  sinensis Hit   
48Sonchus  arvensis Hit   
49Curcuma  heyneana Hit   
50Curcuma  aeruginosa Hit   
51Kaempferia  pandurata Hit *
52Curcuma  xanthorrhiza Hit   
53Curcuma  zedoaria Hit *
54Olea  europaea Hit   

I.  Disease  respiratory  diseases      
1Foeniculum  vulgare Hit   
2Clausena  anisum-olens Hit   
3Glycyrrhiza  uralensis Hit   
4Zingiber  purpureum Hit   
5Piper  retrofractum Hit *
6Syzygium  aromaticum Hit   
7Gaultheria  punctata Hit   
8Panax  ginseng Hit   
9Equisetum  debile Hit *
10Zingiber  officinale Hit   
11Citrus  aurantium Hit *
12Nigella  sativa Hit *
13Amomum  compactum Hit   
14Cinnamomum   burmannii Hit   
15Melaleuca  leucadendra Hit   
16Parkia  roxburghii Hit   
17Cocos  nucifera Hit   
18Piper  cubeba Hit   
19Kaempferia  galanga Hit   
20Coriandrum  sativum Hit   
21Curcuma  longa Hit   
22Piper  nigrum Hit   
23Zingiber  aromaticum Hit   
24Languas  galanga Hit   
25Mentha  piperita Hit   
26Oryza  sativa Hit *
27Myristica  fragrans Hit   
28Pandanus  amaryllifolius Hit *
29Hydrocotyle  asiatica Hit *
30Mentha  arvensis Hit   
31Lepiniopsis  ternatensis Hit   
32Helicteres  isora Hit   
33Blumea  balsamifera Hit   
34Cymbopogon  nardus Hit   
35Piper  betle Hit   
36Curcuma  xanthorrhiza Hit   
37Salix  alba Hit *
38Matricaria  chamomilla Miss *

J. Disease: skin and connective tissue
1  Strychnos ligustrina  Hit
2  Merremia mammosa  Hit  *
3  Piper retrofractum  Hit  *
4  Santalum album  Hit
5  Zingiber officinale  Hit  *
6  Citrus aurantium  Hit
7  Citrus hystrix  Hit
8  Cassia siamea  Hit
9  Cocos nucifera  Hit
10  Trigonella foenum-graecum  Hit
11  Orthosiphon stamineus  Hit
12  Curcuma longa  Hit
13  Vetiveria zizanioides  Hit
14  Aloe vera  Hit
15  Rosa chinensis  Hit
16  Jasminum sambac  Hit
17  Phyllanthus urinaria  Hit
18  Mentha piperita  Hit
19  Oryza sativa  Hit
20  Myristica fragrans  Hit  *
21  Hydrocotyle asiatica  Hit
22  Lepiniopsis ternatensis  Hit
23  Alstonia scholaris  Hit
24  Andrographis paniculata  Hit
25  Cymbopogon nardus  Hit
26  Piper betle  Hit
27  Theae sinensis  Hit
28  Curcuma heyneana  Hit
29  Kaempferia pandurata  Hit  *
30  Curcuma xanthorrhiza  Hit
31  Melaleuca leucadendra  Hit
32  Matricaria chamomilla  Miss  *

K. Disease: the urinary system
1  Foeniculum vulgare  Hit  *
2  Imperata cylindrica  Hit  *
3  Strychnos ligustrina  Hit  *
4  Plantago major  Hit
5  Zingiber officinale  Hit  *
6  Cinnamomum burmannii  Hit  *
7  Strobilanthes crispus  Hit
8  Kaempferia galanga  Hit  *
9  Orthosiphon stamineus  Hit
10  Phyllanthus urinaria  Hit
11  Blumea balsamifera  Hit  *
12  Sonchus arvensis  Hit
13  Curcuma xanthorrhiza  Hit

*Indicates that the plant will not be assigned if we use a matching score >0.7.
3.4. Evaluation of the Supervised Clustering Based on DPClusO

We used previously published results [6] as the gold standard to evaluate our results. The previous study assigned plants to 9 kinds of efficacy, whereas we assigned plants to 18 disease classes (16 from NCBI and 2 additional classes). For the purpose of evaluation, a professional doctor mapped the 18 disease classes to the 9 efficacy classes; the mapping is shown in Table 5. Table 6 shows the predicted plant-disease relations for all 3 datasets, corresponding to clusters with a matching score greater than 0.6. For each disease class, Table 6 also lists the corresponding efficacy, the number of assigned plants, the number of correctly predicted plants, and the true positive rate (TPR).

Class of disease                        Ref.  Efficacy class

D1 Blood and lymph diseases             NCBI  E7 Pain/inflammation (PIN)
D2 Cancers                              NCBI  E7 Pain/inflammation (PIN)
D3 The digestive system                 NCBI  E4 Gastrointestinal disorders (GST)
                                              E7 Pain/inflammation (PIN)
D4 Ear, nose, and throat                NCBI  E7 Pain/inflammation (PIN)
D5 Diseases of the eye                  NCBI  E7 Pain/inflammation (PIN)
D6 Female-specific diseases             NCBI  E5 Female reproductive organ problems (FML)
D7 Glands and hormones                  NCBI  E7 Pain/inflammation (PIN)
D8 The heart and blood vessels          NCBI  E7 Pain/inflammation (PIN)
D9 Diseases of the immune system        NCBI  E7 Pain/inflammation (PIN)
D10 Male-specific diseases              NCBI  E6 Musculoskeletal and connective tissue disorders (MSC)
D11 Muscle and bone                     NCBI  E6 Musculoskeletal and connective tissue disorders (MSC)
D12 Neonatal diseases                   NCBI  E7 Pain/inflammation (PIN)
D13 The nervous system                  NCBI  E7 Pain/inflammation (PIN)
D14 Nutritional and metabolic diseases  NCBI  E2 Disorders of appetite (DOA)
                                              E4 Gastrointestinal disorders (GST)
D15 Respiratory diseases                NCBI  E8 Respiratory disease (RSP)
                                              E7 Pain/inflammation (PIN)
D16 Skin and connective tissue          NCBI  E9 Wounds and skin infections (WND)
D17 The urinary system                  *     E1 Urinary related problems (URI)
D18 Mental and behavioural disorders    *     E3 Disorders of mood and behavior (DMB)
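
As a minimal sketch of how the mapping in Table 5 could be used during evaluation, the Python snippet below stores the disease-to-efficacy mapping as a dictionary and counts, for each mapped efficacy, how many plants assigned to a disease also appear under that efficacy in a gold standard. The function name `correct_by_efficacy` and the toy gold standard are hypothetical and serve only as an illustration; they are not the actual data or code of the study.

```python
# Disease class -> efficacy classes, transcribed from Table 5.
DISEASE_TO_EFFICACY = {
    "D1": ["E7"], "D2": ["E7"], "D3": ["E4", "E7"], "D4": ["E7"],
    "D5": ["E7"], "D6": ["E5"], "D7": ["E7"], "D8": ["E7"],
    "D9": ["E7"], "D10": ["E6"], "D11": ["E6"], "D12": ["E7"],
    "D13": ["E7"], "D14": ["E2", "E4"], "D15": ["E8", "E7"],
    "D16": ["E9"], "D17": ["E1"], "D18": ["E3"],
}


def correct_by_efficacy(disease, plants, gold):
    """Count, per mapped efficacy, the plants that the gold standard also lists.

    `gold` maps a plant name to the set of efficacy classes assigned to it
    by the reference study (hypothetical structure, for illustration only).
    """
    return {
        eff: sum(1 for plant in plants if eff in gold.get(plant, set()))
        for eff in DISEASE_TO_EFFICACY[disease]
    }


# Hypothetical gold standard with placeholder plant names.
toy_gold = {"plant_a": {"E8"}, "plant_b": {"E7", "E8"}, "plant_c": {"E1"}}
print(correct_by_efficacy("D15", ["plant_a", "plant_b", "plant_c"], toy_gold))
```

Counting hits separately for each efficacy mirrors the layout of Table 6, where a disease mapped to two efficacy classes has one row of correct predictions per efficacy.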

Disease = class of disease; Efficacy = corresponding efficacy; Assigned = number of assigned plants; Correct = correct prediction; TPR = true positive rate; - = no value.

                   0.7% dataset              0.5% dataset              0.3% dataset
Disease  Efficacy  Assigned  Correct  TPR    Assigned  Correct  TPR    Assigned  Correct  TPR

D1       E7        26        22       0.85   24        20       0.83   24        20       0.83
D2       E7        1         1        1.00   5         5        1.00   1         1        1.00
D3       E4        42        42       1.00   33        33       1.00   28        28       1.00
         E7                  38       0.90             30       0.91             25       0.89
D6       E5        38        38       1.00   37        37       1.00   32        32       1.00
D8       E7        10        8        0.80   8         7        0.88   6         5        0.83
D9       E7        0         0        -      0         0        -      1         1        1.00
D10      E6        6         4        0.67   2         0        -      3         1        0.33
D11      E6        65        65       1.00   71        71       1.00   60        60       1.00
D13      E7        0         0        -      0         0        -      5         5        1.00
D14      E2        54        44       0.81   45        36       0.80   35        26       0.74
         E4                  54       1.00             45       1.00             35       1.00
D15      E7        38        37       0.97   34        34       1.00   33        33       1.00
         E8                  31       0.82             30       0.88             29       0.88
D16      E9        32        31       0.97   32        32       1.00   27        27       1.00
D17      E1        13        13       1.00   9         9        1.00   8         8        1.00
D18      E3        0         0        -      5         5        1.00   4         4        1.00

Total assigned plants: 135 (0.7% dataset), 129 (0.5% dataset), 117 (0.3% dataset)

We determined the TPR for a disease/efficacy class as the ratio of the number of correct predictions to the number of all predictions. When a disease corresponds to more than one kind of efficacy, the highest TPR can be considered the TPR for that disease. For all 3 datasets the TPR corresponding to each disease is roughly 90% or more. The 0.3% dataset consists of Jamu pairs with higher correlation values, and based on this dataset 117 plants are assigned to 14 disease classes. The 0.7% dataset contains more Jamu pairs and assigns plants to 11 disease classes, one fewer than the 0.5% dataset. The two disease classes covered by the 0.3% dataset but not by the 0.5% and 0.7% datasets are the nervous system (D13) and diseases of the immune system (D9). The only disease class covered by the 0.3% and 0.5% datasets but not by the 0.7% dataset is mental and behavioural disorders (D18). Networks built from the larger datasets therefore tend to cover fewer disease classes. The number of Jamu pairs, that is, the number of edges in the network, affects the number of clusters produced by DPClusO and the number of Jamu formulas per cluster. As a consequence, for the larger dataset networks the success rate and the coverage of disease classes are lower, but more plant-disease relations can be predicted.
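
The TPR rule described above can be sketched in a few lines of Python. The function names are illustrative assumptions rather than code from the study, and the input numbers are the 0.7% dataset entries of Table 6 for D3, D9, D14, and D15.

```python
def true_positive_rate(correct, assigned):
    """Ratio of correct predictions to all predictions; None if nothing was assigned."""
    if assigned == 0:
        return None
    return correct / assigned


def disease_tpr(assigned, correct_per_efficacy):
    """TPR of a disease class: the highest TPR over its mapped efficacy classes."""
    rates = [true_positive_rate(c, assigned) for c in correct_per_efficacy.values()]
    rates = [r for r in rates if r is not None]
    return max(rates) if rates else None


# (assigned plants, correct predictions per efficacy) for the 0.7% dataset.
rows_07 = {
    "D3": (42, {"E4": 42, "E7": 38}),
    "D9": (0, {"E7": 0}),
    "D14": (54, {"E2": 44, "E4": 54}),
    "D15": (38, {"E7": 37, "E8": 31}),
}

for disease, (assigned, correct) in rows_07.items():
    tpr = disease_tpr(assigned, correct)
    print(disease, "n/a" if tpr is None else f"{tpr:.2f}")
```

Returning `None` when no plant is assigned keeps classes without predictions, such as D9 in the 0.7% dataset, distinct from classes whose predictions were all wrong.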

4. Conclusions

This paper introduces a novel method, called supervised clustering, for analyzing big biological data by integrating network clustering with the selection of clusters based on supervised learning. In the present work we applied the method to data mining of the Jamu formulas accumulated in the KNApSAcK database. Jamu networks were constructed based on correlation similarities between Jamu formulas, and the network clustering algorithm DPClusO was then applied to generate high-density Jamu modules. For the subsequent analysis steps, potential clusters were selected by supervised learning. The successful clusters, which contain several Jamu related to the same disease, might be useful for finding the main ingredient plants for that disease, whereas clusters with lower matching scores are associated with varying plants that might be supporting ingredients. By applying the proposed method, important plants from Jamu formulas were determined for every class of disease. The plant-to-disease relations predicted by the proposed network-based method were evaluated in the context of previously published results and were found to produce a TPR of around 90%. For the larger dataset networks, the success rate and the coverage of disease classes become lower, but more plant-disease relations can be predicted.

Conflict of Interests

The authors declare that there is no financial interest or conflict of interests regarding the publication of this paper.


Acknowledgment

This work was supported by the National Bioscience Database Center in Japan and the Ministry of Education, Culture, Sports, Science, and Technology of Japan (Grant-in-Aid for Scientific Research on Innovation Areas “Biosynthetic Machinery. Deciphering and Regulating the System for Creating Structural Diversity of Bioactivity Metabolites (2007)”).


References

  1. R. Verporte, H. K. Kim, and Y. H. Choi, “Plants as source of medicines,” in Medicinal and Aromatic Plants, R. J. Boger, L. E. Craker, and D. Lange, Eds., chapter 19, pp. 261–273, 2006.
  2. A. Furnharm, “Why do people choose and use complementary therapies?” in Complementary Medicine: An Objective Appraisal, E. Ernst, Ed., pp. 71–88, Butterworth-Heinemann, Oxford, UK, 1996.
  3. E. Ernst, “Herbal medicines put into context,” British Medical Journal, vol. 327, no. 7420, pp. 881–882, 2003.
  4. F. M. Afendi, T. Okada, M. Yamazaki et al., “KNApSAcK family databases: integrated metabolite-plant species databases for multifaceted plant research,” Plant and Cell Physiology, vol. 53, no. 2, p. e1, 2012.
  5. F. M. Afendi, N. Ono, Y. Nakamura et al., “Data mining methods for omics and knowledge of crude medicinal plants toward big data biology,” Computational and Structural Biotechnology Journal, vol. 4, no. 5, Article ID e201301010, 2013.
  6. F. M. Afendi, L. K. Darusman, A. Hirai et al., “System biology approach for elucidating the relationship between Indonesian herbal plants and the efficacy of Jamu,” in Proceedings of the 10th IEEE International Conference on Data Mining Workshops (ICDMW '10), pp. 661–668, Sydney, Australia, December 2010.
  7. F. M. Afendi, L. K. Darusman, A. H. Morita et al., “Efficacy of Jamu formulations by PLS modeling,” Current Computer-Aided Drug Design, vol. 9, pp. 46–59, 2013.
  8. F. M. Afendi, L. K. Darusman, M. Fukuyama, M. Altaf-Ul-Amin, and S. Kanaya, “A bootstrapping approach for investigating the consistency of assignment of plants to Jamu efficacy by PLS-DA model,” Malaysian Journal of Mathematical Sciences, vol. 6, no. 2, pp. 147–164, 2012.
  9. W. Winterbach, P. V. Mieghem, M. Reinders, H. Wang, and D. de Ridder, “Topology of molecular interaction networks,” BMC Systems Biology, vol. 7, article 90, 2013.
  10. C. Bachmaier, U. Brandes, and F. Schreiber, “Biological network,” in Handbook of Graph Drawing and Visualization, pp. 621–651, CRC Press, 2013.
  11. X. Chen, M. Chen, and K. Ning, “BNArray: an R package for constructing gene regulatory networks from microarray data by using Bayesian network,” Bioinformatics, vol. 22, no. 23, pp. 2952–2954, 2006.
  12. P. Langfelder and S. Horvath, “WGCNA: an R package for weighted correlation network analysis,” BMC Bioinformatics, vol. 9, article 559, 2008.
  13. A. Martin, M. E. Ochagavia, L. C. Rabasa, J. Miranda, J. Fernandez-de-Cossio, and R. Bringas, “BisoGenet: a new tool for gene network building, visualization and analysis,” BMC Bioinformatics, vol. 11, article 91, 2010.
  14. M. Altaf-Ul-Amin, M. Wada, and S. Kanaya, “Partitioning a PPI network into overlapping modules constrained by high-density and periphery tracking,” ISRN Biomathematics, vol. 2012, Article ID 726429, 11 pages, 2012.
  15. M. Altaf-Ul-Amin, H. Tsuji, K. Kurokawa, H. Asahi, Y. Shinbo, and S. Kanaya, “DPClus: a density-periphery based graph clustering software mainly focused on detection of protein complexes in interaction networks,” Journal of Computer Aided Chemistry, vol. 7, pp. 150–156, 2006.
  16. S. K. Kachigan, Multivariate Statistical Analysis: A Conceptual Introduction, Radius Press, New York, NY, USA, 1991.
  17. J. L. Rodgers and W. A. Nicewander, “Thirteen ways to look at the correlations coefficient,” The American Statistician, vol. 42, pp. 59–66, 1995.
  18. M. Li, J.-E. Chen, J.-X. Wang, B. Hu, and G. Chen, “Modifying the DPClus algorithm for identifying protein complexes based on new topological structures,” BMC Bioinformatics, vol. 9, article 398, 2008.
  19. World Health Organization, “International Classification of Diseases (ICD) 10,” 2010.
  20. National Center for Biotechnology Information, Genes and Disease, NCBI, Bethesda, Md, USA, 1998.
  21. P. Erdos and A. Renyi, “On the evolution of random graph,” Publicationes Mathematicae Debrecen, vol. 6, pp. 290–297, 1959.
  22. A.-L. Barabási and R. Albert, “Emergence of scaling in random networks,” Science, vol. 286, no. 5439, pp. 509–512, 1999.
  23. A. Vázquez, “Growing network with local rules: preferential attachment, clustering hierarchy, and degree correlations,” Physical Review E: Statistical, Nonlinear, and Soft Matter Physics, vol. 67, no. 5, Article ID 056104, 15 pages, 2003.
  24. Max Planck Institut Informatik, “NetworkAnalyzer,” 2013.

Copyright © 2014 Sony Hartono Wijaya et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
