Research Article  Open Access
Devotha G. Nyambo, Edith T. Luhanga, Zaipuna O. Yonah, Fidalis D. N. Mujibi, "Application of Multiple Unsupervised Models to Validate Clusters Robustness in Characterizing Smallholder Dairy Farmers", The Scientific World Journal, vol. 2019, Article ID 1020521, 12 pages, 2019. https://doi.org/10.1155/2019/1020521
Application of Multiple Unsupervised Models to Validate Clusters Robustness in Characterizing Smallholder Dairy Farmers
Abstract
The heterogeneity of smallholder dairy production systems complicates service provision, information sharing, and dissemination of new technologies, especially those needed to maximize productivity and profitability. In order to obtain homogenous groups within which interventions can be made, it is necessary to define clusters of farmers who undertake similar management activities. This paper explores the robustness of production cluster definition using various unsupervised learning algorithms to assess the best approach to define clusters. Data were collected from 8179 smallholder dairy farms in Ethiopia and Tanzania. From a total of 500 variables, the 35 variables used in defining production clusters and household membership of these clusters were selected by Principal Component Analysis and domain expert knowledge. Three clustering algorithms, K-means, fuzzy, and Self-Organizing Maps (SOM), were compared in terms of their grouping consistency and prediction accuracy. The model with the least household reallocation between clusters for training and testing data was deemed the most robust. Prediction accuracy was obtained by fitting a fixed effects model, including production clusters, on milk yield, sales, and choice of breeding method. Results indicated that, for the Ethiopian dataset, clusters derived from the fuzzy algorithm had the highest predictive power (77% for milk yield and 48% for milk sales), while for the Tanzania data, clusters derived from Self-Organizing Maps were the best performing. The average cluster membership reallocation was 15%, 12%, and 34% for K-means, SOM, and fuzzy, respectively, for households in Ethiopia. Based on the divergent performance of the various algorithms evaluated, it is evident that, despite similar information being available for the study populations, the uniqueness of the data from each country provided an overriding influence on cluster robustness and prediction accuracy.
The results obtained in this study demonstrate the difficulty of generalizing model application and use across countries and production systems, despite seemingly similar information being collected.
1. Introduction
Despite the high potential of livestock keeping, Ethiopia and Tanzania still suffer from low meat and milk production, given that most livestock populations are dominated by low-producing indigenous breeds [1, 2]. Smallholder farmers dominate the livestock keeping enterprise in Africa, accounting for about 50% of the total livestock production [3]. Dairy farming is an important source of income for smallholder farmers, with high potential for daily cash flow [4]. The majority of these smallholder producers have not reached their production potential in terms of yield and commercialization. However, data from a recent large-scale survey provide evidence that some farmers produce at a level well beyond the average production (PEARL data, 2016; unpublished). There are many constraints that contribute to the unreached potential, including lack of appropriate support in technologies and information dissemination.
Despite the constraints hindering smallholder dairy productivity, milk obtained from smallholder dairy farmers constitutes the bulk of supply available for sale in Eastern Africa [4]. Among the hindering factors in the provision of appropriate support to the dairy sector and evolvement of the dairy farmer beyond subsistence, is the lack of understanding of the production system these farmers are operating in. Characterization of farm typologies is a necessary first step in designing appropriate interventions that allow these farmers to improve farm output and performance. The characterization of production systems and identification of homogenous units that represent contemporary groups in management terms allow us to understand the specific attributes associated with drivers of productivity. This holds the key to unlocking the ingredients of household evolvement through proper planning, adoption, and utilization of appropriate improved technologies and critical policy support [5]. This study sought to provide a mechanism through which farmers that perform similar production activities or have similar production system attributes can be grouped together into production clusters that describe their organization, needs, and outputs.
Given the huge diversity of practices seen in smallholder farms, the need to form homogenous units that group farmers with near similar characteristics has been addressed in several studies. Primarily, this has been done either by domain experts, who allocate farmers to predetermined classes that define their place in the production ecosystem, or through statistical and machine learning approaches [6–10]. The latter approach involves use of various supervised and unsupervised algorithms to study, analyze, model, and predict trends in smallholder production systems. Recently, unsupervised learning algorithms have been applied in various studies to understand production systems [11, 12]. Some of the more popular unsupervised algorithms include hierarchical clustering, nonhierarchical clustering (K-means), unsupervised neural network algorithms (Self-Organizing Maps), Naïve Bayes, and fuzzy clustering algorithms. However, despite their frequent use, unsupervised learning approaches suffer greatly from lack of consistency and predictability [13]. Various attempts have been made to overcome this weakness, including application of multiple algorithms to cluster farm data and selection of the one that yields the most homogeneous groups [14, 15].
In this study, three unsupervised machine learning (ML) models were applied to classify and study the characteristics of smallholder dairy production systems based on data obtained from baseline surveys in Ethiopia and Tanzania. The aim of the study was to identify the most robust approach to accurately assign diverse dairy farming households into homogenous production units that reflect the differences in production practice and performance.
2. Methodology
2.1. Dataset Preparation and Feature Selection
Data were collected under the PEARL project (Program for Emerging Agricultural Research Leaders, funded by the Bill and Melinda Gates Foundation through the Nelson Mandela African Institution of Science and Technology) from June 2015 to June 2016 in Ethiopia and Tanzania. The total number of households surveyed was 3,500 for Tanzania and 4,679 for Ethiopia. Data collection was undertaken using questionnaires developed on the Open Data Kit (ODK) platform. Data quality checks included removal of erroneous data such as negative values, questionnaires whose total collection time was below a defined threshold (16 min), and data collected at night (survey start time beyond 7pm). The data cleaning process trimmed the datasets to 3317 and 4394 records for Tanzania and Ethiopia, respectively. From a total of 500 unique variables (features) available for analysis, a set of 46 variables was selected for inclusion in the cluster analysis based on their relevance to productivity and farmer evolvement.
Feature Selection. In order to identify the most unique features among the 46 variables, Principal Component Analysis (PCA) was undertaken to eliminate correlated variables. The top 21 features (based on the load score) with the lowest communality were then selected for further analysis. An additional 14 variables related to feeding systems and health management practices, which are known to influence productivity in smallholder dairy farming, were included based on expert domain knowledge, such that a total of 35 features were available for cluster analysis and farm type characterization (Table 1). As a prerequisite for clustering, missing values for continuous variables were identified and replaced with population means, while missing values for categorical variables were replaced with the mode. The effect of location (study site) for each country was removed from the response variables by fitting a linear model and extracting adjusted values. Each quantitative variable was tested for normality and scaled to have a mean of zero and unit variance. Additionally, for each variable, outliers were identified as values above or below the bounds estimated using box plots. Outliers were removed to minimize bias and misclustering. Specifically, bias was minimized by applying the following filters.

The total number of cattle owned was restricted to a maximum of 50 per herd for Ethiopian farmers and a maximum of 30 per herd for Tanzanian farmers based on livestock densities [1, 2]. Some smallholder farmers held land holdings above 100 acres; all farmers with land holdings greater than 100 acres were removed. The maximum amount of milk sold by smallholder farmers was restricted to 100 liters per day, based on expert domain knowledge of the herd sizes and yield per cow. It was assumed that an extension officer could visit a farmer once each week. Any farmer who had more than 54 visits per year was considered an outlier.
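The preparation steps described above (mean and mode imputation, scaling, box-plot bounds, and the herd, land, milk, and extension-visit caps) can be sketched as follows. This is only an illustrative sketch in Python and not the authors' SAS or R code; the field names in `CAPS`, the use of the population standard deviation, and the 1.5 × IQR whisker length are assumptions that the text does not specify.

```python
import statistics
from collections import Counter


def impute_mean(values):
    """Replace missing (None) entries of a continuous variable with the mean."""
    observed = [v for v in values if v is not None]
    mean = statistics.fmean(observed)
    return [mean if v is None else v for v in values]


def impute_mode(values):
    """Replace missing (None) entries of a categorical variable with the mode."""
    observed = [v for v in values if v is not None]
    mode = Counter(observed).most_common(1)[0][0]
    return [mode if v is None else v for v in values]


def standardize(values):
    """Scale a variable to mean zero and unit variance (population SD assumed)."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def boxplot_bounds(values, whisker=1.5):
    """Lower and upper box-plot fences: quartiles -/+ whisker * IQR (assumed rule)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - whisker * iqr, q3 + whisker * iqr


# Caps quoted in the text; the dictionary keys are hypothetical field names.
CAPS = {
    "Ethiopia": {"cattle": 50, "land_acres": 100,
                 "milk_sold_l_per_day": 100, "extension_visits": 54},
    "Tanzania": {"cattle": 30, "land_acres": 100,
                 "milk_sold_l_per_day": 100, "extension_visits": 54},
}


def within_caps(farm, country):
    """True if a household record respects every cap for its country."""
    return all(farm[field] <= limit for field, limit in CAPS[country].items())
```

A household record with 40 cattle would pass `within_caps` for Ethiopia but not for Tanzania, which mirrors the country-specific herd limits.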
2.2. Clustering Algorithms
Three unsupervised learning algorithms, fuzzy clustering, Self-Organizing Maps (SOM), and K-means, were used for cluster analysis. In the analysis, the number of groups (K) represented how many farm typologies (clusters) could be defined for each dataset. The number of clusters that best represented the data was determined using the Elbow method, in which the best solution is read from the bend (elbow) in a graph of the within-cluster sum of squares against the number of clusters. Gap statistics and silhouette separation coefficients were used in preliminary analysis to validate the results from the Elbow method [16], while the Euclidean distance was used to assess cluster robustness. The Elbow method was found to be robust and was subsequently used for the rest of the analysis. Given that the selected algorithms have various methods with different convergence rates, two methods for each algorithm were tested and those that minimized convergence time were selected. The final clustering methods used were (i) Fanny for fuzzy clustering [17], (ii) superSOM with batch mode [18], and (iii) Hartigan-Wong [19, 20] for K-means. Evaluation of the clustering algorithms was done by considering ranking consistency in the testing dataset, mean distance of observations from central nodes, and mean silhouette separation coefficients, as well as accuracy of predicting observed values of select response variables using a model fitting the predicted clusters as fixed effects. Data analysis was done using both SAS version 9.2 (SAS Institute Inc., Cary, NC, USA) and R software (Kabacoff, 2011).
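The silhouette separation coefficient used above to cross-check the Elbow result can be computed directly from pairwise distances. The sketch below is a generic, textbook-style implementation in Python and is not taken from the study; the function name and the convention that points in singleton clusters score zero are assumptions.

```python
import math


def mean_silhouette(points, labels):
    """Mean silhouette width; points in singleton clusters contribute 0."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        # a: mean distance to the other members of the point's own cluster
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / len(scores)
```

Values close to 1 indicate compact, well separated clusters, while values near or below 0 indicate overlapping or misassigned members.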
2.3. Clustering Models
Self-Organizing Maps (SOM) have been used to characterize smallholder farmers due to their ability to produce accurate typologies, as explained by Nazari et al. [15] and Galluzzo [21]. The SOM algorithm calculates the Euclidean distance using (1) and selects the best matching unit (BMU) satisfying (2) [21, 22]:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2} \tag{1}$$
$$\lVert v - w_c \rVert = \min_{i} \lVert v - w_i \rVert \tag{2}$$
where $p$ and $q$ are vectors in an $n$-dimensional Euclidean space relating to the position of a member and a neuron, respectively, $v$ is any new weight vector, $w_c$ is the current weight of the winning neuron, and $w_i$ is the weight of any other $i$th neuron on the map.
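To make the best-matching-unit rule concrete, the following Python sketch finds the winning neuron for one input vector and moves that neuron towards it. It uses the simple online update with a fixed learning rate and no neighbourhood function, which is an assumption made for brevity; the study itself reports using superSOM in batch mode, so this is not a reproduction of that method.

```python
import math


def best_matching_unit(v, weights):
    """Index of the neuron whose weight vector is closest to v (Euclidean)."""
    return min(range(len(weights)), key=lambda i: math.dist(v, weights[i]))


def update_winner(v, w, learning_rate=0.5):
    """Move the winning weight vector a fraction of the way towards v."""
    return [wi + learning_rate * (vi - wi) for vi, wi in zip(v, w)]


# Hypothetical three-neuron map in two dimensions.
weights = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
household = [0.9, 1.2]
winner = best_matching_unit(household, weights)
weights[winner] = update_winner(household, weights[winner])
```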
The K-means algorithm has been widely used in nonhierarchical clustering and in characterizing smallholder dairy farms [7, 8, 10]. Similar to SOMs, the algorithm uses Euclidean distance measures to estimate weights of data records. The algorithm minimizes the objective presented in (3), whose inner term is the Euclidean distance in (1):
$$J = \sum_{j=1}^{k} \sum_{i=1}^{n} \lVert x_i^{(j)} - c_j \rVert^2 \tag{3}$$
where $\lVert x_i^{(j)} - c_j \rVert$ computes the Euclidean distance as in (1); $k$ = number of clusters, $n$ = number of observations, $j$ = cluster index (starting from the first cluster), $i$ = observation index (starting from the first observation), $x_i^{(j)}$ = Euclidean vector for any $i$th observation, and $c_j$ = cluster center for any $j$th cluster.
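The objective in (3) and the Elbow method from Section 2.2 can be illustrated together. The sketch below uses Lloyd's algorithm, which is simpler than the Hartigan-Wong variant reported in the study and is swapped in here only for readability; the function names, the random initialisation, and the toy data in the usage note are assumptions.

```python
import math
import random


def assign(points, centers):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centers)), key=lambda j: math.dist(p, centers[j]))
            for p in points]


def within_ss(points, centers, labels):
    """Within-cluster sum of squared Euclidean distances, as in (3)."""
    return sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))


def kmeans(points, k, init=None, max_iter=100, seed=0):
    """Lloyd's algorithm; returns (centers, labels, within-cluster SS)."""
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(points, k)
    centers = [list(c) for c in start]
    for _ in range(max_iter):
        labels = assign(points, centers)
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # An empty cluster keeps its previous centre.
            updated.append([sum(axis) / len(members) for axis in zip(*members)]
                           if members else centers[j])
        if updated == centers:
            break
        centers = updated
    labels = assign(points, centers)
    return centers, labels, within_ss(points, centers, labels)


def elbow_curve(points, ks, seed=0):
    """Within-cluster SS for each candidate K; the bend suggests the K to use."""
    return {k: kmeans(points, k, seed=seed)[2] for k in ks}
```

Calling `elbow_curve(points, range(1, 9))` on scaled farm features and plotting the returned values against K gives the kind of curve from which a four- or six-cluster solution would be read.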
Fuzzy analysis (fanny method) was selected based on its relatively short convergence time and good measures of cluster separation [17]. Various methods based on fuzzy models have been used for cluster analysis [23–26]. The fanny method adds a fuzzifier and a membership value to the common K-means algorithm (see (3)). In addition, the model uses the Dunn coefficient and a silhouette separation coefficient for assessing the solution fuzziness and intercluster cohesion, respectively. The general equation for fuzzy clustering [27] is given in (4) and the Dunn definition of partitioning [28] is given in (5):
$$J_m = \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{m} \lVert x_i - c_j \rVert^2 \tag{4}$$
where $k$ = number of clusters, $n$ = number of observations, $i$ = observation index, $j$ = cluster index, $m$ = fuzzifier, $u_{ij}$ = membership coefficient of the $i$th observation in the $j$th cluster, $x_i$ = Euclidean vector for any $i$th observation, and $c_j$ = cluster center for any $j$th cluster. Given (4), the Dunn definition of partitioning is given by
$$F_k = \frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} u_{ij}^{2} \tag{5}$$
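A small numerical illustration may help in reading the membership coefficients and the Dunn coefficient. The Python sketch below computes fuzzy c-means style memberships for fixed cluster centres and then Dunn's partition coefficient together with its normalised form, which runs from 0 (completely fuzzy) to 1 (crisp). It is a generic sketch under the assumption of a fuzzifier of 2 and is not the fanny implementation used in the study.

```python
import math


def memberships(points, centers, m=2.0):
    """Membership matrix u[i][j] of each point in each cluster, centres fixed."""
    u = []
    for p in points:
        d = [math.dist(p, c) for c in centers]
        if 0.0 in d:
            # A point sitting exactly on a centre gets crisp membership.
            hits = [1.0 if dj == 0.0 else 0.0 for dj in d]
            total = sum(hits)
            u.append([h / total for h in hits])
            continue
        inv = [dj ** (-2.0 / (m - 1.0)) for dj in d]
        total = sum(inv)
        u.append([x / total for x in inv])
    return u


def dunn_partition_coefficient(u):
    """Return (F, normalised F): F = sum of squared memberships / n."""
    n, k = len(u), len(u[0])
    f = sum(x * x for row in u for x in row) / n
    normalised = (k * f - 1.0) / (k - 1.0)
    return f, normalised
```

Memberships that are spread almost evenly across clusters drive the normalised coefficient towards 0, which is the situation a low Dunn value signals.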
2.4. Cluster Validation and Prediction Accuracy
Production clusters outputted from the clustering algorithms were validated in three ways: assessment of cluster robustness, comparison of the cluster membership reallocation (differential allocation of households to clusters for training and testing datasets), and evaluation of the proportion of variation explained by the clusters.
Validation of cluster robustness was first undertaken by comparing three metrics: total within sum of square differences, mean Euclidean distance of observations from the cluster nodes, and the silhouette separation coefficients. Based on these parameters, the most suitable clustering model was identified. In the second stage of validation, the ability of the clustering models to allocate the same group of households into clusters in both training and testing datasets was tested. If all cluster members are colocated in one cluster in training and testing datasets, the reranking is 0 (the rank correlation between the two clusters is 1), and the model would be deemed the most accurate and robust. Parameters considered for evaluation were correlation coefficient, AIC, and residual deviance. The third stage of validation involved fitting linear (or logistic as appropriate) regression models with a set of fixed effects on milk yield, sales, and choice of breeding method. The first model (see (6) and (9)) included the clusters as one of the fixed effects while a second model did not include the clusters (see (7) and (10)). The difference in variance between the two models represented the proportion of total variance in the response variable accounted for by the clusters. The logistic model for choice of breeding method was fitted with only the cluster of production (see (8)) for Ethiopian data while two models were fitted for Tanzania (see (11) and (12)). In preliminary analysis, a model fitted with cluster of production yielded best fit results in the Ethiopia dataset and very low variances as a result of under fitting for the Tanzania dataset. For that reason, two models were fitted for Tanzania and one for Ethiopia to predict the binary variable. Class labels for the logistic regression were 0 and 1 for choice of bull method and Artificial Insemination, respectively. 
For assessing prediction accuracy, one-third of the records for the response variables were removed so that they could be predicted. The predicted values were correlated with the actual values to obtain an estimate of the prediction accuracy. These latter prediction accuracies were compared with those obtained in the previous validation step to help evaluate the algorithms’ consistency and clusters’ robustness. For Ethiopia, the linear models with and without clusters were given by
$$y = C + E + S + e \tag{6}$$
$$y = E + S + e \tag{7}$$
The logistic model used to predict choice of breeding method is shown in
$$\operatorname{logit} P(b = 1) = C \tag{8}$$
For Tanzania, predictive models were given by
$$y = C + E + S + L + F + e \tag{9}$$
$$y = E + S + L + F + e \tag{10}$$
and choice of breeding method was given by two logistic models (see (11) and (12)), where $y$ is milk yield or milk quantity sold and $b$ is choice of breeding method. For the Ethiopia models, $C$ is cluster of production, $e$ is the error term, $E$ is experience in dairy farming, and $S$ is years of schooling. For the Tanzania models, $C$ is cluster of production, $e$ is the error term, $E$ is experience in dairy farming, $S$ is years of schooling, $L$ is total land size, and $F$ is area under fodder production.
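The two ideas in this validation step, the share of variance accounted for by clusters and the correlation between predicted and actual held-out values, can be sketched in Python. To stay short, the sketch fits clusters as the only fixed effect (cluster means) and compares against a mean-only model; the additional covariates of the fitted models are deliberately left out, so this is a simplification rather than the models used in the study, and all function names are hypothetical.

```python
import math
import statistics


def cluster_means(y, clusters):
    """Mean response per cluster, i.e. a model with cluster as the only effect."""
    groups = {}
    for value, c in zip(y, clusters):
        groups.setdefault(c, []).append(value)
    return {c: statistics.fmean(v) for c, v in groups.items()}


def variance_explained(y, clusters):
    """Drop in residual sum of squares when clusters are added, over total SS."""
    grand = statistics.fmean(y)
    means = cluster_means(y, clusters)
    ss_total = sum((v - grand) ** 2 for v in y)
    ss_resid = sum((v - means[c]) ** 2 for v, c in zip(y, clusters))
    return (ss_total - ss_resid) / ss_total


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def holdout_accuracy(y_train, c_train, y_test, c_test):
    """Predict held-out responses from training cluster means and correlate."""
    means = cluster_means(y_train, c_train)
    predicted = [means[c] for c in c_test]
    return pearson(predicted, y_test)
```

With the remaining fixed effects included, the same comparison would be made between the residual variances of the full and reduced regression models instead of between cluster means and the grand mean.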
For all model validation steps, prediction accuracies were obtained by developing the clustering model in a training dataset (70% of all records) and reapplying the resulting model to a testing dataset (the remaining 30%). The model with the least reallocation of households between clusters for the training and testing datasets was considered the most robust. Rank analysis using the Spearman correlation coefficient was used to evaluate the level of household reallocation between clusters.
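As a rough illustration of the rank analysis, the Python sketch below computes a Spearman correlation (with average ranks for ties) and the share of households whose cluster label changes between two allocations. It assumes that each household carries two comparable cluster labels and that cluster numbering has already been aligned between the two runs; how the labels were paired in the study is not stated, so this pairing is an assumption.

```python
import math
import statistics


def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        rank = (pos + end) / 2 + 1
        for idx in order[pos:end + 1]:
            ranks[idx] = rank
        pos = end + 1
    return ranks


def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    ma, mb = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def spearman(a, b):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(a), average_ranks(b))


def reallocation_share(labels_a, labels_b):
    """Fraction of households whose cluster label differs between two runs."""
    moved = sum(1 for x, y in zip(labels_a, labels_b) if x != y)
    return moved / len(labels_a)
```

A rank correlation of 1 together with a reallocation share of 0 corresponds to the case in which every household keeps its cluster.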
3. Results
3.1. Clustering
Based on the Elbow method, a four cluster solution was found to be optimal for the Ethiopia dataset and was fitted in the clustering models (Figure 1). The SOM and Kmeans algorithms clustered the households in the Ethiopia dataset into four groups, while the fuzzy model assigned all households into three clusters, with no members in the fourth cluster. Table 2 shows the cluster densities for each algorithm. For Tanzania, six clusters were defined based on the Elbow method (Figure 2). However, at K=6, the fuzzy model had highly fuzzy cluster memberships of 0.09 and 0.18 for each member. Such low membership values imply an unstable cluster solution. The fuzzy model was therefore discarded for the Tanzania dataset and analysis proceeded with the Kmeans and SelfOrganizing Maps (SOM) algorithms. Cluster densities associated with the six clusters are provided in Table 3.


For the Ethiopian data, cluster densities given in Table 2 indicate the presence of one unchanging cluster for both Kmeans and SOM models (with the exact same list of 487 members). The number of members in the other clusters varied, indicating households being reassigned to different clusters. Figures 3, 4, and 5 represent the cluster visualization for each algorithm in the Ethiopia dataset. Clusters obtained using Kmeans were well separated and showed significant intracluster adhesion (Figure 3), while spatial distribution of SOM clusters (Figure 4) indicated significant overlap between two of the 4 clusters (clusters in red). Cluster densities for Tanzania are displayed in Table 3.
Figures 4(a) and 4(b) are a heatmap representation of cluster densities and dendrogram from the SOM model, respectively. Figure 4(a) shows counts of households within clusters while Figure 4(b) indicates cluster relationship and separation. The numbers on the colored plane indicate number of members in each cluster. Two clusters had equal number of farmers (shown in red color) and on the dendrogram these are categorized as clusters 1 and 4. These two clusters seemingly had few differentiating features since they originate from the same parent node. This phenomenon can also be observed in Figure 3 for the Kmeans model (clusters 2 and 4). These clusters appear to be joined into one cluster in the fuzzy model (cluster 3 in Figure 5). The fuzzy model resulted in 3 clusters, each with a significant number of outliers (Figure 5). The outliers were however more pronounced for cluster 2 than clusters 1 and 3.
Presence of the outliers and cluster overlap in the fuzzy model was supported by a low value of the Dunn coefficient (0.3014) which corresponds to a high level of fuzziness.
Based on the results obtained, the cluster composition parameters related to intercluster adhesion and intracluster cohesion indicated that clusters from the Kmeans model were better separated (higher mean silhouette value) and more compact (lower mean distance from central node) than in the other models for Ethiopia (Table 4).

For Tanzania, the mean silhouette separation coefficients were not significantly different (0.66 and 0.64 for Kmeans and SOM, respectively) as shown in Table 5. However, there was a tendency for the SOM to have better defined clusters given its lower within cluster sum of squares as well as lower mean distance from central node. The spatial distribution is illustrated in Figures 6 and 7.

For Tanzania clusters’ separation and intactness can be observed through Figures 6 and 7. No significant difference can be observed with regard to the intercluster adhesion between Kmeans and SOM (Table 5).
Figure 6 shows clusters visualization from the Kmeans model for Tanzania dataset. Cluster 4 and 5 overlap and are in close proximity to cluster 6, indicating that they have few differentiating characteristics. This overlapping is equally observed in the SOM model (Figure 7).
The numbers on the colored bar in Figure 7(a) indicate densities of members in each cluster. There are only four well-separated clusters based on density (from left: red, orange, yellow, and light gold). However, the dendrogram (Figure 7(b)) shows three clusters branching from the same node, which are also seen as the overlapping clusters (clusters 4, 5, and 6) in the K-means plot (Figure 6).
3.2. Cluster Validation
3.2.1. Cluster Membership Reranking
Rank correlation was used to study the levels of household reallocation between the training and testing datasets. Generally, the clustering models applied to the Ethiopia dataset indicated low membership reallocation. Table 6 summarizes the results for Ethiopia where, despite a lower Akaike Information Criterion (AIC) estimate, the fuzzy model had the highest number of members reallocated to other clusters (32%) compared to the K-means and SOM models. The high correlation coefficients for SOM and K-means indicate lower reallocation of cluster members. In contrast, results from Tanzania indicated very high reranking of cluster membership between training and testing datasets (Table 7).


3.2.2. Prediction Accuracy
Tables 8 and 9 summarize the results for predicting missing values for milk yield, sales, and breeding choice. Results for the Ethiopia dataset indicate that the model fitting fixed effects of clusters derived from the fuzzy model had higher accuracies for peak milk yield (0.77), milk sales (0.48), and probability of choosing AI (0.55), as shown in Table 8. For Tanzania, higher accuracies for milk production and sales (0.46 and 0.41) were obtained when fitting clusters derived from the K-means model (Table 9).


For the Tanzania dataset, clusters from the Kmeans model achieved high prediction accuracies for both milk yield and sales (at 46% and 41%, respectively). However, the Kmeans clusters had lower prediction accuracy for choice of breeding method (29%). Clusters from the SOM model performed poorly on the quantitative traits but had higher probability (46%) for correctly assigning the choice of breeding method.
3.2.3. Cluster Variances
In order to assess whether the clusters defined by the various algorithms reflect differences in production characteristics between households, we evaluated the variance accounted for by these cluster on select performance measures. For Ethiopia, total variance was 1.015 and 0.988 for milk yield and sales, respectively, while in Tanzania, the total variance was 1.076 and 1.09 for milk yield and sales, respectively. The differences between residual variances for two linear models (see (6) versus (7) for Ethiopia and (9) versus (10) for Tanzania) were significant (p < 0.00001). Results show that, for Ethiopia data, the fuzzy model clusters accounted for 89% and 70% of the total variance in milk yield and milk sales, respectively. On the other hand, the Kmeans clusters accounted for 71% and 65% of the total variation in milk yield and milk sales, respectively. Tables 10 and 11 summarize the proportion of variances accounted for by the clusters for each clustering model.
 
Note to Tables 10 and 11: data were scaled to have unit variance and a mean of zero.
4. Discussion
4.1. Characterization of Smallholder Farmers
Unsupervised learning models have been used to characterize smallholder farmers despite the fact that these models lack consistency and are highly unpredictable [13]. In this study, the performance of three commonly used algorithms for clustering farming households, namely, K-means, fuzzy, and SOM, was compared. A set of validation criteria to assess the robustness of the defined clusters is proposed. This approach is seldom used in similar studies.
In Africa, smallholder farming systems have been characterized using common hierarchical and nonhierarchical clustering algorithms. Work done by Mburu et al. [29], Bidogeza et al. [30], Dossa et al. [10], and Kuivanen et al. [7, 8] utilized the Ward and K-means methods to define clusters for smallholder households. In addition to the machine learning approaches, use of expert knowledge to validate cluster-based characterization is highly recommended [7, 8]. In some studies, local knowledge has been used in a participatory approach to accurately estimate farm types. Furthermore, complex clustering approaches have also been explored in studying smallholder farm types, as done by Salasya & Stoorvogel [23], Pelcat et al. [31], Galluzzo [21], and Paas & Groot [12]. These studies present use of fuzzy clustering, Neural Networks, and Naïve Bayes algorithms, respectively. Although all clustering assigns farmers into some types, fuzzy clustering presents a soft clustering approach where a farm can belong to more than one farm type or none [31]. However, in the previous research analyzed, the robustness of clustering models and their ability to predict farm types remain uncharted. Following Goswami et al. [5], the study of smallholder farmers needs to be extended to the formulation of predictive farm types. As such, evolvement of farmers in the homogeneous groups can be predicted because the clusters’ stabilities are known.
4.2. Clustering Algorithms Evaluated
The determination of putative number of clusters that best define the data (K) presents the foremost need in cluster analysis. Bad estimates of K may result into unstable clusters and presence of many members appearing as outliers. Since the goal is to obtain highly homogeneous groups, the within group sum of square difference is commonly used to evaluate how compact the clusters are. We adopted recommendations given by Kassambara [16] and employed the Elbow, Gap statistics, and average silhouette methods to assess the best K for the datasets. The Elbow and Gap statistics estimate a value of K that minimizes the within groups sums of square (WSS) differences such that any additions to the estimated value of K will not significantly change the WSS. Since the study goal was to arrive at highly homogeneous groups, the measure of within sum of square differences seemed most important. However, a common method to estimate optimal number of clusters from other studies is to try out different values of K while observing the silhouette separation or manual inspection of dendrogram produced in hierarchical clustering [15, 16]. While the Elbow method and Gap statistics use within groups sum of square differences, the silhouette method compares the average clusters separation.
The application of the three separate algorithms revealed differences in their performance based on data type and structure. Where observations were highly identical, soft clustering (fuzzy model) failed to categorize the records into appropriate number of clusters. The fuzzy model allocated households into only 3 clusters despite four clusters being determined as appropriate for the Ethiopia dataset (Figure 5). The other models converged at 4 clusters (Figures 3 and 4). Similarly, for the Tanzanian dataset, the fuzzy model could not converge even after many iterations. It would appear that the fuzzy model is best suited to situations where data is highly heterogeneous. Otherwise it does not lend itself well to cluster identification.
Balakrishnan (1994) compared K-means and SOM algorithms in cluster identification against specific criteria of intracluster similarity and intercluster differences. The dataset had known cluster solutions, so the only target was to find performance differences between the two algorithms. Results indicated that the K-means algorithm performed better than the SOM algorithm. Mingoti & Lima [32] compared the performance of K-means and SOM models by using smallholders’ farm data. Results indicated that K-means was more robust. In this study, the SOM performed poorly compared to the fuzzy and K-means models for the Ethiopia dataset, having higher within-cluster dispersion as well as lower separation between clusters. For the Tanzania dataset, the SOM performed similarly to the K-means algorithm. Results from our study show that the performance of SOM is concordant with that of Nazari et al. [15], who characterized dryland farming systems. In contrast to observations by Mingoti & Lima [32], the fuzzy model used in this study failed spectacularly for both datasets. This reinforces observations by Xu [33], who concluded that the performance of clustering algorithms is subject to the nature of data and area of application. More studies need to be undertaken to see how the fuzzy algorithm can be best adapted to farming datasets.
4.3. Cluster Membership Reallocation and Prediction Accuracy
A good clustering model should be able to repeatedly allocate a majority of households into the same clusters, even when the volume of data changes. In order to be sure that our model definitions represented a collection of the most important features that describe each cluster, we tested the ability of the models to redefine the same clusters between training and testing datasets. This strategy aligns well with Xu [33], who recommends that a good clustering model should have the ability to deal with new data cases without the need to relearn. The Spearman rank correlation was used to measure the degree of reranking. For the Tanzania data, the SOM model provided the best cluster allocation, that is, the one that minimized reranking. Nevertheless, the rank correlations seen in Tanzania were very low for both the K-means and SOM models. Given the above premise and the spectacular failure of the fuzzy model in Tanzania, a pattern emerges to suggest a fundamental problem with the Tanzanian dataset rather than issues to do with model suitability. It is possible that there is no significant differentiation between households in Tanzania and that the extreme homogeneity proves a challenge because each household can be allocated to any cluster. Such a scenario could occur due to flawed data collection strategies. We suspect that, due to requirements to finalize data collection within set timelines, groups of farmers were interviewed collectively while data were entered as if they were for an individual farmer.
The fuzzy model in Ethiopia had the best fit, indicated by the lowest AIC value despite higher membership reallocation. Given a standard prediction problem, this would be the best model for the data. This is also corroborated by the fact that the variance accounted for by the clusters was also highest for the fuzzy model. However, given that our intention is to maximize correct reassignment of individuals into clusters, the Kmeans and SOM models would be preferred for household membership allocation.
Three response variables (milk yield, sales, and choice of breeding method) were selected for the prediction exercise because of their vital role in smallholder dairy farm evolvement. They generally represent the commercial orientation of a smallholder farm. Evaluation of prediction accuracies for the selected response variables indicated a very different scenario from the clustering problem. When the clusters were included in the models to predict milk yield, sales, or breeding method, the fuzzy model-derived clusters had the highest prediction accuracies compared to K-means and SOM clusters for the Ethiopia data. For the Tanzania data, the SOM model clusters yielded the best prediction accuracies for the binary trait, choice of breeding method, while the K-means model performed the best for the quantitative traits. However, the prediction accuracies for the Tanzania data were low, underscoring the earlier assertions about data structure and integrity. Given the predictive power of the clusters on select response variables, the fuzzy clustering model performed the best, with defined clusters accounting for significantly higher variation in the response variables than other clustering models.
Based on the results from Ethiopia, where all the models could be evaluated, it would seem that model choice depends on the problem that needs to be solved. For a clustering problem, where the intention is to obtain robust membership allocation, the Kmeans algorithm would be the most appropriate, as it ensures maximal homogeneity within clusters. The use of this model would minimize reranking when it is applied to new datasets without the need for new learning. However, if the clusters are to be used in prediction models, the fuzzy algorithm would be the best choice for cluster definition.
5. Conclusion
The goal of the reported study was to identify the most robust approach to correctly classify diverse households into homogenous groups of farmers with similar production systems and management activities. The characterization was intended to provide groups within which interventions and strategies can be designed to facilitate the evolvement of smallholder dairy farmers beyond subsistence in Ethiopia and Tanzania. Results from this study demonstrate the use of unsupervised learning models in cluster definition for smallholder dairy farmers, as well as strategies to assess the models’ suitability and cluster robustness. Performance varied across the tested models, underscoring the need to choose a method according to the data structure and the questions being answered. The results obtained from this study are a necessary first step in understanding smallholder farmer production systems and in studying household evolvement from subsistence to full commercial orientation.
Data Availability
The data used to support the findings of this study are available from the corresponding author upon request.
Conflicts of Interest
The authors declare that there are no conflicts of interest regarding the publication of this paper.
Acknowledgments
The authors acknowledge the African Development Bank, through the Nelson Mandela African Institution of Science and Technology (NM-AIST), for funding the PhD study of the corresponding author. Sincere appreciation is due to the leadership of the PEARL Project at NM-AIST for granting access to and permitting use of the baseline dairy farm data.
References
[1] T. Guadu and M. Abebaw, Challenges, Opportunities and Prospects of Dairy Farming in Ethiopia: A Review, vol. 11, 2016.
[2] Tanzania, “Tanzania Livestock Modernization Initiative,” 2016.
[3] S. K. Lowder, J. Skoet, and T. Raney, “The Number, Size, and Distribution of Farms, Smallholder Farms, and Family Farms Worldwide,” World Development, vol. 87, pp. 16–29, 2016.
[4] F. Place, R. Roothaert, L. Maina, S. Franzel, J. Sinja, and J. Wanjiku, “The impact of fodder trees on milk production and income among smallholder dairy farmers in East Africa and the role of research,” in ICRAF Occasional Paper, vol. 12, World Agroforestry Centre, Nairobi, Kenya, 2009.
[5] R. Goswami, S. Chatterjee, and B. Prasad, “Farm types and their economic characterization in complex agroecosystems for informed extension intervention: study from coastal West Bengal, India,” India, pp. 1–24, 2014.
[6] J. A. van de Steeg, P. H. Verburg, I. Baltenweck, and S. J. Staal, “Characterization of the spatial distribution of farming systems in the Kenyan Highlands,” Applied Geography, vol. 30, no. 2, pp. 239–253, 2010.
[7] K. S. Kuivanen, S. Alvarez, M. Michalscheck, K. Descheemaeker, and J. C. J. Groot, “Characterising the diversity of smallholder farming systems and their constraints and opportunities for innovation: A case study from the Northern Region, Ghana,” NJAS - Wageningen Journal of Life Sciences, 2016.
[8] K. S. Kuivanen, M. Michalscheck, K. Descheemaeker, and S. Adjei-Nsiah, “A comparison of statistical and participatory clustering of smallholder farming systems: A case study in Northern Ghana,” Journal of Rural Studies, vol. 45, pp. 184–198, 2016.
[9] J. A. Riveiro-Valiño, C. J. Álvarez-López, and M. F. Marey-Pérez, “The use of discriminant analysis to validate a methodology for classifying farms based on a combinatorial algorithm,” Computers and Electronics in Agriculture, vol. 66, no. 2, pp. 113–120, 2009.
[10] L. H. Dossa, A. Abdulkadir, H. Amadou, S. Sangare, and E. Schlecht, “Exploring the diversity of urban and peri-urban agricultural systems in Sudano-Sahelian West Africa: An attempt towards a regional typology,” Landscape and Urban Planning, vol. 102, no. 3, pp. 197–206, 2011.
[11] S. Gizaw, M. Abera, M. Muluye, M. Aliy, and K. Alemayehu, “Validating the Classification of Smallholder Dairy Farming Systems Based on Herd Genetic Structure and Access to Breeding Services,” Agricultural Sciences, vol. 8, no. 7, 2017.
[12] W. Paas and J. C. J. Groot, “Creating adaptive farm typologies using Naive Bayesian classification,” Information Processing in Agriculture, 2017.
[13] R. Gelbard, O. Goldman, and I. Spiegler, “Investigating diversity of clustering methods: An empirical comparison,” Data & Knowledge Engineering, vol. 63, pp. 155–166, 2007.
[14] C. Conrad, “Assessment of cropping system diversity in the Fergana Valley through image fusion of Landsat 8 and Sentinel-1,” ISPRS Annals of Photogrammetry, Remote Sensing and Spatial Information Sciences, vol. III-7, pp. 173–180, 2016.
[15] M. Nazari, F. Razzaghi, D. Khalili, A. A. Kamgar-Haghighi, and S. M. Tahami Zarandi, “Regionalization of dryland farming potential as influenced by droughts in western Iran,” International Journal of Plant Production, vol. 11, no. 2, pp. 315–332, 2017.
[16] A. Kassambara, “The Elbow Method,” in Practical Guide to Introduction in R: Unsupervised Learning, 2017.
[17] B. R. F. Abu-Jamous and A. K. Nandi, “Integrative cluster analysis in bioinformatics,” Integrative Cluster Analysis in Bioinformatics, pp. 1–419, 2015.
[18] M. Cottrell, M. Olteanu, F. Rossi, and N. Villa-Vialaneix, “Theoretical and Applied Aspects of the Self-Organizing Maps,” in Advances in Self-Organizing Maps and Learning Vector Quantization, vol. 428 of Advances in Intelligent Systems and Computing, pp. 3–26, Springer International Publishing, 2016.
[19] N. Nidheesh, K. A. Abdul Nazeer, and P. M. Ameer, “An enhanced deterministic K-Means clustering algorithm for cancer subtype prediction from gene expression data,” Computers in Biology and Medicine, vol. 91, pp. 213–221, 2017.
[20] K. Kazuaki, “Experiment of Document Clustering by Triple-pass Leader-follower Algorithm without Any Information on Threshold of Similarity,” IPSJ SIG Technical Report 23, 2013.
[21] N. Galluzzo, “Technical and economic efficiency analysis on Italian smallholder family farms using Farm Accountancy Data Network dataset,” Studies in Agricultural Economics, vol. 117, no. 1, pp. 35–42, 2015.
[22] T. Vatanen, M. Osmala, T. Raiko et al., “Self-organization and missing values in SOM and GTM,” Neurocomputing, vol. 147, no. 1, pp. 60–70, 2015.
[23] B. Salasya and J. Stoorvogel, “Fuzzy classification for farm household characterization,” Outlook on Agriculture, vol. 39, no. 1, pp. 57–63, 2010.
[24] M. K. Gumma, P. S. Thenkabail, F. Hideto et al., “Mapping irrigated areas of Ghana using fusion of 30 m and 250 m resolution remote-sensing data,” Remote Sensing, vol. 3, no. 4, pp. 816–835, 2011.
[25] M. Söderström, J. Eriksson, C. Isendahl et al., “Using proximal soil sensors and fuzzy classification for mapping Amazonian Dark Earths,” Agricultural and Food Science, vol. 22, no. 4, pp. 380–389, 2013.
[26] W. A. Journal, A. Ecology, D. S. Cirad, and C. Board, Mapping Fertilizer Recommendations for Cocoa Production in Ghana Using Soil Diagnostic and GIS Tools, 2009.
[27] J. C. Bezdek, R. Ehrlich, and W. Full, “FCM: the fuzzy c-means clustering algorithm,” Computers & Geosciences, vol. 10, no. 2-3, pp. 191–203, 1984.
[28] E. Trauwaert, “On the meaning of Dunn's partition coefficient for fuzzy clusters,” Fuzzy Sets and Systems, vol. 25, no. 2, pp. 217–242, 1988.
[29] L. M. Mburu, J. W. Wakhungu, and W. G. Kang'ethe, “Characterization of smallholder dairy production systems for livestock improvement in Kenya highlands,” Livestock Research for Rural Development, vol. 19, no. 8, 2007.
[30] J. C. Bidogeza, P. B. M. Berentsen, J. Graaff, and A. G. J. M. O. Lansink, “A typology of farm households for the Umutara Province in Rwanda,” vol. 10, pp. 321–335, 2009.
[31] Y. Pelcat, B. McConkey, P. Basnyat, G. Lafond, and A. Moulin, “In-Field Management Zone Delineation from Remote Sensing Imagery,” 2015.
[32] S. A. Mingoti and J. O. Lima, “Comparing SOM neural network with Fuzzy c-means, K-means and traditional hierarchical clustering algorithms,” European Journal of Operational Research, vol. 174, no. 3, pp. 1742–1759, 2006.
[33] R. Xu and D. Wunsch II, “Survey of clustering algorithms,” IEEE Transactions on Neural Networks, vol. 16, no. 3, pp. 645–678, 2005.
Copyright
Copyright © 2019 Devotha G. Nyambo et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.