Abstract

Nowadays, with the growing number of cities, urban multimodal big data are freely available to the public and play a critical role in many fields such as transportation, education, medical treatment, and land resource management. Poverty remains a severe challenge for human society, and the successful completion of poverty-relief work can greatly improve people's quality of life and ensure the sustainable development of society. Traditional poverty alleviation methods consume a great deal of manpower, material resources, and financial resources, so it is of great significance to apply machine learning to mine different categories of poverty-stricken households and thereby provide decision support for poverty alleviation. Based on density-based spatial clustering of applications with noise (DBSCAN), this paper designs a hierarchical DBSCAN clustering algorithm to identify and analyze the categories of poverty-stricken households in China. First, the proposed method adjusts the neighborhood radius dynamically to divide the data space into several initial clusters with different densities. Then, neighbor clusters are identified by the border and inner distances and aggregated recursively to form new clusters. Based on this idea of division and aggregation, the proposed method can recognize clusters of different forms and deal with noise effectively in data spaces with imbalanced density distributions. The experiments indicate that the method achieves ideal clustering performance and identifies the commonalities and differences in the characteristics of poverty-stricken households reasonably. In terms of the specific indicator "Accuracy," the proposed method improves by 2.3% compared with other methods.

1. Introduction

With the development of information and communication technology, the era of multimodal big data has fully arrived. Cities are important places for the distribution of big data concerning population, economy, transportation, and landscape [13]. Urban data obtained by traditional collection methods such as field surveys and questionnaire interviews cannot objectively and accurately reflect the status quo of urban development and the patterns of residents' activities over wide ranges of time and space, and the urban operation information obtained in this way lags considerably. Multimodal big data can make up for these defects and depict the urban physical space and social environment in depth. This not only makes it possible to understand the urban system objectively and summarize its development rules but also provides important support for urban planning and related research such as poverty-relief work and urban education.

It must be admitted that urban planning based on urban multimodal big data is a very challenging task for poverty-relief work, although it can improve urban environments, quality of life, and smart city systems [4, 5]. Because targeted poverty alleviation in the early stage faced tight deadlines and heavy tasks, the basic information on each impoverished object and the causes of poverty is not comprehensive and accurate enough and needs to be further enriched and improved. The management mechanism for poor objects is also imperfect: owing to the large number of poor people in poor villages and their complicated family situations, the number of people being lifted out of poverty and returning to poverty changes constantly [6]. In addition, the management mechanism for poor objects at the village level is not sound enough, so changes in the poor population of poor villages are not tracked adequately.

In this paper, we focus on the tasks of identifying and analyzing categories of poverty-stricken households in China. The eradication of poverty is a historic task facing the international community. With the development of artificial intelligence (AI) technologies such as machine learning and deep learning, a growing number of researchers are making great efforts to develop and unleash the huge potential of these AI technologies in alleviating poverty [7]. China, as the largest developing country in the world, has made a significant contribution to global poverty alleviation. In 2013, the Chinese government put forward the concept of targeted poverty alleviation, which aims to take targeted measures to assist each truly poverty-stricken household and fundamentally eliminate the various factors leading to poverty, thus achieving the goal of sustainable poverty alleviation [8]. On the basis of this policy, this paper adopts a clustering algorithm [9] to partition the data of poverty-stricken households in China reasonably and thus identify different categories of poverty-stricken households, supporting the formulation and implementation of antipoverty measures.

Poverty-oriented scientific research depends on the analysis of poverty data. Chinese poverty data generally come from population censuses carried out by the country, society, and universities [10]. Owing to the wide coverage of the population and individual differences in educational level and psychology, respondents may not answer questionnaires according to actual conditions, which makes questionnaire data subjective. Additionally, faults in processes such as data entry and storage can easily lead to outliers and missing values in datasets. Since the quality of poverty datasets obtained from population censuses is hard to guarantee, designing and applying clustering algorithms to them involves certain difficulties.

The design of clustering algorithms for poverty datasets should take reasonable account of the noise caused by missing values and outliers. Common clustering methods mainly include partitional clustering, hierarchical clustering, and density-based clustering [11]. The K-means clustering algorithm achieves clustering through partitioning: it assigns each sample to the closest cluster according to the distances between samples and prototypes, updates each prototype as the average of the samples within its cluster, and repeats these steps until the iteration ends [12]. Although the method is simple and practicable, the number of clusters and the initial prototypes need to be predefined. Agglomerative hierarchical clustering (AHC) regards each sample as a separate cluster and then repeatedly merges the two closest clusters into a new cluster [13]. The AHC algorithm requires no predefined prototypes and can obtain the hierarchical structure of clusters, but it is sensitive to noise within the data. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is a representative density-based clustering method, which defines a cluster as a maximal set of density-connected samples and takes sample regions with high densities as clusters, thus discovering clusters of arbitrary shapes [14]. However, the hyperparameters of the DBSCAN algorithm, i.e., the neighborhood radius and the minimum number of samples required to form a dense region, have a great influence on the result of clustering, and the method is not applicable to datasets with varied density distributions. Many researchers have improved DBSCAN in view of these problems and proposed algorithms such as K-nearest neighbor DBSCAN (KNNDBSCAN), DVBSCAN, and varied density-based spatial clustering of applications with noise (VDBSCAN) [15-18]. For instance, Gaonkar and Sawant [19] drew a k-dist graph based on the distance between each sample and its k-th nearest neighbor so as to identify multiple values of the neighborhood radius and then found the clusters with different densities under each value of the neighborhood radius. Fahim et al. proposed an enhanced DBSCAN (EDBSCAN) algorithm, which defined a density variation for core points and specified that a core point was allowed to expand only when its density variation was less than or equal to a threshold value and its neighborhood satisfied the homogeneity index [20]. Regarding clustering methods more broadly, other researchers have proposed advanced approaches such as robust FCM clustering [21], an improved quantum clustering algorithm [22], and a swarm clustering algorithm [23]. Chen et al. [24] proposed a fast clustering method for large-scale data. Chel et al. [25] presented the HDBSCAN clustering algorithm to find clustering patterns in calcium spiking obtained by confocal imaging of single cells. Znidi et al. [26] introduced a new methodology for discovering the degree of coherency among buses using the correlation index of the voltage angle between each pair of buses and used hierarchical density-based spatial clustering of applications with noise to partition the network into islands. Parmar et al. [27] proposed a residual error-based density peak clustering algorithm named REDPC to better handle datasets comprising various data distribution patterns; specifically, REDPC adopted residual error computation to measure the local density within a neighborhood region. Parmar et al. [28, 29] further proposed a feasible residual error-based density peak clustering algorithm with a fragment merging strategy, where the local density within the neighborhood region was measured through residual error computation and the resulting residual errors were then used to generate residual fragments for cluster formation. Overall, the above methods suffer from low clustering efficiency and high time consumption on high-dimensional data.

Considering that clusters in real-world datasets may have different sizes, shapes, and densities, accompanied by certain noises and outliers, this paper adopts the idea of initial division and hierarchical aggregation to design a clustering algorithm named hierarchical DBSCAN (HDBSCAN). The proposed method comprises two stages, division and aggregation. Our contributions are as follows:
(1) First, the method makes an initial division of the dataset based on sample densities; that is, it uses the neighbor information of samples to calculate local density values and then searches the set of density-connected samples for each unlabeled core point sequentially according to the density values in descending order, thus forming the initial clusters.
(2) Then, the method adopts the idea of hierarchical clustering to perform the aggregation of neighbor clusters. Based on the inner and border distances between clusters, the most similar clusters are regarded as neighbor clusters and merged to form a new cluster, and the process is repeated until the iteration ends.
(3) Based on this combination of division and aggregation, the method can identify clusters with different forms in the dataset. Moreover, noise data cannot be integrated into high-density clusters because their densities are relatively sparse, so the proposed method handles noise reasonably.

The rest of this paper is organized as follows. Section 2 introduces two typical clustering algorithms, i.e., the DBSCAN clustering and the hierarchical clustering. Section 3 describes the proposed hierarchical DBSCAN algorithm in detail. Section 4 discusses the clustering performance of the proposed method, then applies it to the Chinese poverty dataset, and further analyzes the result of clustering. Finally, conclusions are presented in Section 5.

2. Theoretical Foundation

2.1. The DBSCAN Clustering

The DBSCAN algorithm regards regions with high densities as clusters and those with sparse densities as noise. It requires two hyperparameters, i.e., the neighborhood radius $\varepsilon$ and the minimum number of samples $MinPts$ required to form a dense region.

Let $X = \{x_1, x_2, \ldots, x_n\}$ represent the dataset composed of $n$ samples with $m$ attributes, where $x_i$ denotes the i-th sample in the dataset. The $\varepsilon$-neighborhood of $x_i$ is

$$N_{\varepsilon}(x_i) = \{x_j \in X \mid \operatorname{dist}(x_i, x_j) \le \varepsilon\}, \tag{1}$$

where $\operatorname{dist}(x_i, x_j)$ denotes the distance between samples $x_i$ and $x_j$, calculated by the Euclidean distance

$$\operatorname{dist}(x_i, x_j) = \sqrt{\sum_{l=1}^{m} (x_{il} - x_{jl})^2}. \tag{2}$$

If $x_i$ satisfies equation (3), it is called a core point:

$$|N_{\varepsilon}(x_i)| \ge MinPts. \tag{3}$$

There are several definitions in the DBSCAN algorithm, listed as follows:
(1) A sample $x_j$ is directly density-reachable from $x_i$ with respect to $\varepsilon$ and $MinPts$ if $x_i$ is a core point and $x_j \in N_{\varepsilon}(x_i)$.
(2) A sample $x_j$ is density-reachable from $x_i$ with respect to $\varepsilon$ and $MinPts$ if there exists a chain of samples $p_1, \ldots, p_t$ with $p_1 = x_i$ and $p_t = x_j$, where each $p_{l+1}$ is directly density-reachable from $p_l$.
(3) A sample $x_j$ is density-connected to $x_i$ with respect to $\varepsilon$ and $MinPts$ if there exists a sample $x_o$ such that both $x_i$ and $x_j$ are density-reachable from $x_o$.

In the process of clustering, the algorithm randomly selects a core point as the initial point and takes all the core points in its $\varepsilon$-neighborhood for continuous expansion. The expansion continues until the maximal set of density-connected samples is found and labeled as one cluster. After that, the algorithm randomly chooses another unlabeled core point to generate a new cluster. The process of clustering is complete when all the core points are labeled.
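For readers who want to reproduce this baseline behavior, the following minimal sketch runs DBSCAN with scikit-learn; the toy dataset and the values of eps and min_samples are illustrative only and are not the settings used in this paper.

from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative data: two crescent-shaped clusters with mild noise.
X, _ = make_moons(n_samples=300, noise=0.06, random_state=0)

# eps is the neighborhood radius; min_samples is the minimum number of
# samples required to form a dense region (MinPts).
model = DBSCAN(eps=0.15, min_samples=5)
labels = model.fit_predict(X)

# Samples labeled -1 are treated as noise; the others carry cluster labels.
print("clusters found:", len(set(labels) - {-1}), "noise points:", (labels == -1).sum())

Varying eps and min_samples in this sketch makes the sensitivity discussed above easy to observe: a small radius fragments the data and produces many noise points, while a large radius merges distinct groups.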

2.2. Hierarchical Clustering

Hierarchical clustering can be divided into agglomerative hierarchical clustering and divisive hierarchical clustering. Agglomerative hierarchical clustering first takes each sample as a separate cluster, then finds the two closest clusters by measuring the distance between clusters and merges them into a new cluster. Subsequently, the algorithm recalculates the distances between clusters and continues the aggregation process. Divisive hierarchical clustering works in the exact opposite way: it regards the whole dataset as one cluster and then performs division iteratively.

In hierarchical clustering, the distance between clusters $C_i$ and $C_j$ can be calculated by equation (4), i.e., the average of sample distances between the two clusters. Besides, the minimum distance of samples between clusters shown in equation (5), or the maximum distance of samples between clusters, can also be taken to measure the distance between two clusters:

$$d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{|C_i|\,|C_j|} \sum_{x \in C_i} \sum_{y \in C_j} \operatorname{dist}(x, y), \tag{4}$$

$$d_{\min}(C_i, C_j) = \min_{x \in C_i,\, y \in C_j} \operatorname{dist}(x, y). \tag{5}$$
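As an illustration of these linkage choices, the sketch below uses scikit-learn's AgglomerativeClustering (not the method proposed in this paper); the 'average', 'single', and 'complete' linkages correspond to the average, minimum, and maximum inter-cluster sample distances described above, and the toy data are placeholders.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# 'average', 'single', and 'complete' linkage match equations (4), (5), and the
# maximum-distance variant mentioned in the text, respectively.
for linkage in ("average", "single", "complete"):
    labels = AgglomerativeClustering(n_clusters=3, linkage=linkage).fit_predict(X)
    print(linkage, "->", labels[:10])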

2.3. The Hierarchical DBSCAN Algorithm

As global hyperparameters of the DBSCAN algorithm, the numerical values of $\varepsilon$ and $MinPts$ have a direct impact on the expansion of all the clusters. Figure 1 illustrates the expansion of clusters under different values of the neighborhood radius $\varepsilon$, where the red points denote the initial core points in each iteration of expansion. According to Figure 1(a), the two dense clusters can be identified while the other samples are regarded as noise and cannot be partitioned properly if the DBSCAN algorithm takes a small neighborhood radius. It can be seen from Figure 1(b) that all the samples are merged into one cluster through four iterations of expansion if the algorithm takes a larger neighborhood radius.

In view of the above problem, this paper adopts the strategy of division and aggregation to design the HDBSCAN clustering algorithm. First, the proposed method makes an initial division of the dataset according to sample densities. During the expansion of each cluster, the method adaptively adjusts the neighborhood radius based on the neighbor information of samples within the cluster. Then, the idea of hierarchical clustering is adopted to perform recursive aggregation; that is, the method takes the cluster pair with the minimum distance as neighbor clusters and merges them into a new cluster. Based on division and aggregation, the method can perceive clusters with different forms in the data space.

2.4. Initial Division

During the process of initial division, the parameter $k$ is used to calculate the local density. Let $N_k(x_i)$ represent the set composed of the $k$ samples closest to $x_i$; the average distance between $x_i$ and all samples in this set is

$$d_k(x_i) = \frac{1}{k} \sum_{x_j \in N_k(x_i)} \operatorname{dist}(x_i, x_j).$$

The distance $d_k(x_i)$ captures the density distribution around the sample $x_i$: the smaller the value, the greater the density. Therefore, the local density of $x_i$ can be defined as the reciprocal of this average distance, i.e.,

$$\rho(x_i) = \frac{1}{d_k(x_i)}.$$

The neighborhood radius of $x_i$, namely $\varepsilon(x_i)$, is the distance between $x_i$ and its $k$-th nearest sample. The process of the initial division includes the following steps.

Step 1. Calculate the local density $\rho(x_i)$ for each sample and then sort the samples in descending order of local density to form the sequence $S = \langle x_{(1)}, x_{(2)}, \ldots, x_{(n)} \rangle$ with $\rho(x_{(1)}) \ge \rho(x_{(2)}) \ge \cdots \ge \rho(x_{(n)})$. The cluster label is initialized as $m = 1$.

Step 2. Select an unlabeled sample $x_i$ from the sequence $S$ in order and set the iteration number $t = 1$.

Step 3. Let $C_m^t$ and $Q_m^t$ represent the set of samples and the sequence of core points for the $m$-th cluster in the $t$-th iteration, both initialized with the selected sample, i.e., $C_m^1 = Q_m^1 = \{x_i\}$.

Step 4. Calculate the adaptive neighborhood radius for the expansion of the current cluster from all samples in the cluster:

$$\varepsilon_m^t = \frac{1}{|C_m^t|} \sum_{x \in C_m^t} \varepsilon(x).$$

Step 5. Select a core point $x_c$ from the sequence $Q_m^t$ in order and continue the expansion based on $\varepsilon_m^t$.

Step 6. Calculate the set of neighbor samples to be expanded according to

$$N(x_c) = \{x_j \in X \mid \operatorname{dist}(x_c, x_j) \le \varepsilon_m^t \text{ and } x_j \text{ is unlabeled}\}.$$

Step 7. Update $C_m^t$ and $Q_m^t$ by

$$C_m^t = C_m^t \cup N(x_c), \qquad Q_m^t = Q_m^t \cup \{x_j \in N(x_c) \mid x_j \text{ is a core point}\}.$$

Step 8. The expansion of the $m$-th cluster is completed if every core point in $Q_m^t$ has been processed, and the procedure then goes to Step 9. Otherwise, it sets $t = t + 1$ and returns to Step 4.

Step 9. The initial division ends if all the samples are labeled. Otherwise, it sets the cluster label as $m = m + 1$ and returns to Step 2.
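The division stage can be summarized by the following sketch. It is a simplified reading of Steps 1-9 under the definitions above: the local density is taken as the reciprocal of the average k-NN distance, the per-sample radius is the distance to the k-th neighbor, the adaptive radius of a cluster is the mean per-sample radius of its current members, and, for brevity, the expansion proceeds from all cluster members rather than only from core points. Function and variable names are illustrative, not part of the original algorithm.

import numpy as np
from sklearn.metrics import pairwise_distances
from sklearn.neighbors import NearestNeighbors

def local_density_and_radius(X, k):
    # k + 1 neighbors because the nearest neighbor of each sample is itself.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, _ = nn.kneighbors(X)
    knn_dist = dist[:, 1:]                 # drop the zero self-distance
    avg_dist = knn_dist.mean(axis=1)       # average distance to the k nearest neighbors
    rho = 1.0 / (avg_dist + 1e-12)         # smaller average distance -> higher density
    eps = knn_dist[:, -1]                  # per-sample radius: distance to the k-th neighbor
    return rho, eps

def initial_division(X, k):
    # A sketch of Steps 1-9: expand clusters in descending order of local density,
    # using the mean per-sample radius of the current members as the adaptive radius.
    rho, eps = local_density_and_radius(X, k)
    D = pairwise_distances(X)
    labels = np.full(len(X), -1)
    cluster = 0
    for seed in np.argsort(-rho):          # Step 1: samples sorted by descending density
        if labels[seed] != -1:
            continue                        # Step 2: pick the next unlabeled sample
        labels[seed] = cluster
        members, queue = [seed], [seed]
        while queue:                        # Steps 4-8: expand the current cluster
            radius = eps[members].mean()    # adaptive radius from the current members
            point = queue.pop(0)
            for j in np.where((D[point] <= radius) & (labels == -1))[0]:
                labels[j] = cluster
                members.append(j)
                queue.append(j)
        cluster += 1                        # Step 9: start the next cluster
    return labels

On a toy dataset, initial_division(X, k=5) returns one integer label per sample; the aggregation stage described next then merges these micro-clusters.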

2.5. Aggregation of Neighbor Clusters

In this paper, the similarity between clusters is measured by the border distance and the inner distance. Figure 2 takes two clusters during the aggregation as an example to describe the two kinds of distances. In Figure 2, the red points denote the core points and the grey ones denote the border points distributed around the clusters.

Suppose that the dataset can be represented by $X = \{C_1, C_2, \ldots, C_K\}$ after the initial division, where $K$ denotes the number of clusters and $C_i$ denotes the $i$-th cluster. While the neighbor clusters are merged to form new clusters continuously during the aggregation, the dataset is described by $X = \{C_1, C_2, \ldots, C_{K'}\}$ with $K' \le K$. The set of border points in $C_i$ is

$$B(C_i) = \{x \in C_i \mid x \text{ is not a core point under } \varepsilon_i\},$$

where $\varepsilon_i$ denotes the neighborhood radius at the completion of the division for $C_i$. The value of $\varepsilon_i$ changes dynamically due to the adaptive adjustment of the neighborhood radius. According to Figure 2(a), the border distance between clusters $C_i$ and $C_j$ is the minimum distance between the border points of the two clusters, namely,

$$d_{\mathrm{border}}(C_i, C_j) = \min_{x \in B(C_i),\, y \in B(C_j)} \operatorname{dist}(x, y).$$

As can be seen from Figure 2(b), the cluster consists of four initial clusters, and the inner distance of the cluster is accordingly defined over the border distances among these constituent initial clusters; it summarizes how far apart the merged initial clusters are from one another within the cluster.

During the aggregation, the two clusters with the minimum border distance are considered as neighbor clusters and merged further if their difference in inner distances and their difference in densities are below certain limits. Algorithm 1 is a simple implementation of the aggregation of neighbor clusters. In the actual implementation of the algorithm, values such as border distances and inner distances are stored to avoid repeated calculation. According to the 14th line of Algorithm 1, two clusters are considered as candidate neighbor clusters only when their density difference, border distance, and inner-distance difference satisfy certain conditions.

(1) Input: clusters after the initial division $X = \{C_1, \ldots, C_K\}$; the threshold $\lambda$
(2) Output: final clusters after the aggregation
(3) Initialize the candidate neighbor-cluster pair $P = \varnothing$
(4) While True
(5)  Calculate $\theta_\rho$ as the average of the density differences between clusters; set $d_{\min} = +\infty$
(6)  For each cluster $C_i$ in $X$
(7)   For each cluster $C_j$ in $X$ with $j \ne i$
(8)    Calculate the distances between the border points of $C_i$ and $C_j$
(9)    Calculate $\rho(C_i)$ and $\rho(C_j)$ as the averaged densities of the samples in the clusters
(10)    Calculate the inner distances $d_{\mathrm{inner}}(C_i)$ and $d_{\mathrm{inner}}(C_j)$
(11)    $\Delta\rho = |\rho(C_i) - \rho(C_j)|$
(12)    $\Delta d_{\mathrm{inner}} = |d_{\mathrm{inner}}(C_i) - d_{\mathrm{inner}}(C_j)|$
(13)    $d_{ij} = d_{\mathrm{border}}(C_i, C_j)$, i.e., the minimum of the border-point distances
(14)    If $\Delta\rho \le \theta_\rho$ and $\Delta d_{\mathrm{inner}} \le \lambda$ and $d_{ij} < d_{\min}$
(15)     $P = (C_i, C_j)$; $d_{\min} = d_{ij}$
(16)   End For
(17)  End For
(18)  If $P \ne \varnothing$
(19)   Merge the pair $P$ into a new cluster, update $X$, and reset $P = \varnothing$
(20)  Else
(21)   Break
(22) End While
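At a high level, the aggregation in Algorithm 1 can be sketched in Python as follows. The sketch assumes the border distance is the minimum pairwise distance between two clusters' border samples, uses the average density difference between clusters as the admissibility threshold, and merges one closest admissible pair per iteration; the inner-distance condition is omitted for brevity, and all names and the threshold lam are illustrative rather than the exact implementation evaluated in this paper.

import numpy as np
from sklearn.metrics import pairwise_distances

def aggregate_clusters(X, labels, border_sets, density, lam):
    # Merge, one pair at a time, the admissible neighbor clusters with the smallest
    # border distance; stop when no pair satisfies the conditions.
    clusters = {c: np.where(labels == c)[0] for c in np.unique(labels)}
    while len(clusters) > 1:
        ids = list(clusters)
        dens = {c: density[idx].mean() for c, idx in clusters.items()}
        diffs = [abs(dens[a] - dens[b]) for i, a in enumerate(ids) for b in ids[i + 1:]]
        theta = np.mean(diffs)                       # average density difference between clusters
        best, best_dist = None, np.inf
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                d = pairwise_distances(X[border_sets[a]], X[border_sets[b]]).min()
                if d < best_dist and d <= lam and abs(dens[a] - dens[b]) <= theta:
                    best, best_dist = (a, b), d
        if best is None:
            break                                    # no admissible neighbor clusters remain
        a, b = best
        clusters[a] = np.concatenate([clusters[a], clusters[b]])
        border_sets[a] = np.concatenate([border_sets[a], border_sets[b]])
        del clusters[b], border_sets[b]
    return clusters

Here labels and density come from the division stage, and border_sets maps each initial cluster to the indices of its border samples; storing the pairwise border distances, as noted above, would avoid recomputing them in every iteration.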

The proposed HDBSCAN clustering algorithm can capture clusters with different forms in the data space. The aggregation of neighbor clusters weakens the sensitivity of the algorithm to the hyperparameters of the initial division. Besides, the result of the division in the DBSCAN algorithm depends on the selection order of the initial core points; the proposed method can weaken the fluctuation caused by this selection order to some extent. Algorithm 2 summarizes the whole process.

(1) Input: parameter $k$; the dataset $X$; the threshold $\lambda$
(2) Output: final clusters
(3) Stage 1: initial division
(4) Calculate the local density $\rho(x_i)$ for each sample and sort the samples in descending order of density
(5) While unlabeled samples remain
(6)  Select an unlabeled sample from the sequence and start a new cluster
(7)  While the current cluster can still be expanded
(8)   Calculate the adaptive neighborhood radius from the samples in the cluster
(9)   Select the next core point from the core-point sequence of the cluster
(10)   Calculate the set of neighbor samples within the adaptive radius and add them to the cluster
(11)  End While
(12)  The expansion of the current cluster is completed
(13) End While
(14) Stage 2: aggregation of neighbor clusters
(15) While True
(16)  Calculate the average of the density differences between clusters
(17)  For each pair of clusters
(18)   Calculate the border distance, the inner distances, and the density difference
(19)   If the density difference, the inner-distance difference, and the border distance satisfy the conditions in Algorithm 1
(20)    Record the pair as candidate neighbor clusters
(21)  End For
(22)  If a candidate pair exists
(23)   Merge the candidate pair with the minimum border distance into a new cluster
(24)  Else
(25)   Break
(26) End While

3. Experimental Results and Analysis

3.1. Experimental Design
3.1.1. Datasets

Three public artificial datasets and four real-world datasets are chosen to verify the effectiveness of the proposed clustering algorithm. The description of artificial datasets is listed in Table 1. The visualization of artificial datasets is shown in Figure 3.

The description of the real-world datasets is listed in Table 2, where Banknote, Parkinson, Codon usage, HCV, and Planning relax are taken from the UCI machine learning repository, and CFPS2016 is the dataset of poverty-stricken households in China. The CFPS2016 dataset comes from the China Family Panel Studies (CFPS) released by the Institute of Social Science Survey of Peking University, China, in 2016. In the experiment, the CFPS2016 dataset consists of 14,019 samples and 320 attributes, covering the family economy as well as the health, education, and psychology of adults and children, so it can reflect the status of each Chinese household objectively. During the data preprocessing, we fill in missing values with the K-nearest neighbor imputation method [30], and 1,778 poverty-stricken households are then identified among the 14,019 Chinese households based on the Alkire-Foster method, the main measurement method of multidimensional poverty [31]. The parameters in this experiment are set the same as those of DBSCAN under the same experimental platform.
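The missing-value treatment mentioned above can be reproduced with scikit-learn's KNNImputer; the toy matrix and the number of neighbors below are placeholders rather than the exact preprocessing configuration used for CFPS2016.

import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix standing in for the household attributes; np.nan marks missing entries.
X = np.array([[1.0, 2.0, np.nan],
              [3.0, np.nan, 6.0],
              [7.0, 8.0, 9.0]])

# Each missing value is replaced using the values of that attribute from the
# nearest complete-enough rows (n_neighbors=2 here, chosen only for illustration).
X_filled = KNNImputer(n_neighbors=2).fit_transform(X)
print(X_filled)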

3.1.2. Evaluation Metrics

We take the silhouette coefficient (SC) [32], the Davies-Bouldin index (DBI) [33], the adjusted Rand index (ARI), and normalized mutual information (NMI) [34] to measure the performance of clustering. The silhouette coefficient is defined by

$$\mathrm{SC} = \frac{1}{n} \sum_{i=1}^{n} \frac{b(i) - a(i)}{\max\{a(i), b(i)\}},$$

where $n$ denotes the total number of samples; $a(i)$ denotes the average distance between the sample $x_i$ and all other samples in its cluster, which reflects the cohesiveness of the clustering; and $b(i)$ denotes the minimum of the average distances between the sample $x_i$ and all samples in any other cluster, which reflects the separation of the clustering. A larger SC represents a higher performance of clustering. Besides, the definition of the Davies-Bouldin index is

$$\mathrm{DBI} = \frac{1}{K} \sum_{i=1}^{K} \max_{j \ne i} \frac{s_i + s_j}{d_{ij}},$$

where $K$ denotes the number of clusters; $s_i$ and $s_j$ denote the average distance between all the samples within a cluster and the centroid of that cluster; and $d_{ij}$ denotes the distance between the cluster centroids. A smaller DBI denotes a higher performance of clustering.
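Both internal metrics are available in scikit-learn; the sketch below shows how they are typically computed from a feature matrix and predicted labels, with a toy dataset and K-means used only as placeholders.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score, davies_bouldin_score

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("SC :", silhouette_score(X, labels))        # larger is better
print("DBI:", davies_bouldin_score(X, labels))    # smaller is better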

With respect to external performance, the adjusted Rand index (ARI) and normalized mutual information (NMI) are also used for evaluation. ARI represents the similarity between two clusterings adjusted for chance and is related to accuracy, while NMI quantifies the amount of information one clustering provides about the other (i.e., the mutual dependence between the two). When observations are identified as noise, each noise observation is treated as a distinct singleton cluster for both ARI and NMI.
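The noise handling described above can be implemented by relabeling each noise point (label -1) as a singleton cluster of its own before computing the external metrics, as in the following sketch; the helper name and the toy labels are illustrative.

import numpy as np
from sklearn.metrics import adjusted_rand_score, normalized_mutual_info_score

def relabel_noise_as_singletons(labels):
    # Give every noise point (label -1) a unique cluster id of its own.
    labels = np.asarray(labels).copy()
    next_id = labels.max() + 1
    for i in np.where(labels == -1)[0]:
        labels[i] = next_id
        next_id += 1
    return labels

y_true = [0, 0, 1, 1, 2, 2]
y_pred = relabel_noise_as_singletons([0, 0, 1, -1, 2, -1])

print("ARI:", adjusted_rand_score(y_true, y_pred))
print("NMI:", normalized_mutual_info_score(y_true, y_pred))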

3.1.3. Compared Methods

This paper compares the proposed method with five existing clustering algorithms, which are described as follows:
(1) AHC: as described in Section 2.2, the method regards every sample as a separate cluster and then merges the two closest clusters continuously until the iteration ends.
(2) DBSCAN: as described in Section 2.1, the method performs continuous expansion for each cluster based on core points and thus takes regions with high densities as clusters and those with low densities as noise.
(3) EDBSCAN: the method calculates the density variation for each core point and specifies that a core point is allowed to expand only when its density variation is below a specified threshold and its neighborhood satisfies the homogeneity index [35].
(4) NS-DBSCAN: the NS-DBSCAN algorithm uses a strategy similar to the DBSCAN algorithm; furthermore, it provides a new technique for visualizing the density distribution and indicating the intrinsic clustering structure [36].
(5) ADBSCAN: unlike many other algorithms that estimate the density of each sample using various density estimators and then choose core samples based on a threshold, ADBSCAN utilizes the inherent properties of the nearest neighbor graph [37].

4. Results and Analysis

4.1. Artificial Datasets and Real-World Datasets from UCI

First, we conduct experiments on the effect of the parameter on the local sensitivity, as shown in Figure 4. The selected value is then used in the following experiments to provide an equitable comparison. From Figure 4, we can see that when the parameter is set to 0.5, the local sensitivity is small and the proposed method performs better. Therefore, we set the parameter to 0.5 in this paper.

The clustering results for the three artificial datasets based on the proposed method are shown in Figure 5, where each colored region can be regarded as one cluster. According to Figures 5(a), 5(c), and 5(e), the datasets are cut into several regions with different densities after the initial division. As can be seen from Figures 5(b), 5(d), and 5(f), the adjacent regions with similar densities aggregate continuously during the aggregation of neighbor clusters, which contributes to the ideal clustering results. In Figure 5(f), some discrete points are distributed around the four large clusters; the proposed method identifies these points as noise since their densities differ appreciably from those of the surrounding clusters.

The metric values for the three UCI datasets obtained by the compared methods are shown in Table 3, in which the optimal results are bolded and the suboptimal results are italicized.

According to Table 3, all the SC values obtained by the proposed HDBSCAN are better than those obtained by the other methods, and the method also achieves ideal DBI values. For instance, on the Parkinson dataset, the SC value of HDBSCAN is 8.91% higher than that of the suboptimal method AHC; although the DBI value of HDBSCAN is suboptimal, it is only 2.63% worse than that of EDBSCAN. These results indicate that the proposed HDBSCAN has ideal clustering performance. Table 3 also shows the ARI performance of the different methods on the artificial datasets. From these results, HDBSCAN ranks first on these datasets. More importantly, in each case HDBSCAN is able to identify the underlying classes of each dataset, whereas each of the other approaches fails at this task in at least one case.

4.2. The Dataset of Poverty-Stricken Households in China

We perform clustering on the 1,778 poverty-stricken households of CFPS2016 so as to identify different categories of poverty-stricken households. Table 4 shows the metric values for CFPS2016 obtained by the compared methods, where the optimal results are bolded and the suboptimal results are italicized. Table 4 also shows the NMI performance results on the same set of artificial datasets and clustering approaches; here, the ranking of HDBSCAN is identical to that discussed with respect to ARI.

We also compare the accuracy with the other methods; the results, given as average values, are shown in Table 5.

It can be seen from Table 5 that the values of SC and DBI obtained by HDBSCAN are better than those obtained by the other compared methods. Therefore, the proposed method has ideal clustering performance on the CFPS2016 dataset. The clustering result based on HDBSCAN is listed in Table 6.

According to Table 6, the proposed method divides CFPS2016 into 10 clusters and identifies 70 noise points. Additionally, the numbers of households in the different clusters are distributed unevenly; for instance, Cluster 1 contains 382 households while Cluster 9 and Cluster 10 contain 61 and 34, respectively. To evaluate the rationality of the clustering result, we adopt the random forest algorithm to calculate the importance of each attribute in the ten clusters and thus analyze the characteristics of each cluster. Specifically, based on the labels generated by HDBSCAN clustering, we take each cluster in turn as the positive class and the remaining clusters as the negative class to construct multiple binary classification models, thereby mining the important attributes within each cluster.
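The one-vs-rest importance analysis can be sketched as follows: for each cluster, a binary random forest is fitted with that cluster as the positive class, and its feature importances indicate which attributes characterize the cluster. The function name, hyperparameters, and inputs are placeholders rather than the exact configuration used for CFPS2016.

import numpy as np
from sklearn.ensemble import RandomForestClassifier

def per_cluster_importances(X, cluster_labels, n_top=5):
    # For each cluster, fit a one-vs-rest random forest and return the indices
    # of the most important attributes (a sketch of the analysis in the text).
    result = {}
    for c in np.unique(cluster_labels):
        y = (cluster_labels == c).astype(int)          # positive class: this cluster
        rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
        result[c] = np.argsort(rf.feature_importances_)[::-1][:n_top]
    return result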

Based on the important attributes within clusters, the characteristics of Cluster 1 are as follows: (1) the household has no children under the age of 16; (2) the annual net income of the household is higher than the average level; (3) medical expenses are prominent in the household's expenditure. The characteristics of Cluster 9 are as follows: (1) the average age of adults in the household is 76; (2) almost no household member has pension insurance. The characteristics of Cluster 10 are as follows: (1) the annual per capita income of the household is 35,914 yuan, 1.43 times higher than the average level; (2) more than half of the members use computers. The living standard of households in Cluster 10 is therefore relatively high compared with the other clusters, and Cluster 10 accounts for a small proportion of poverty-stricken households. According to the above analysis, the causes of poverty and the characteristics of most households are similar, so the numbers of households in some clusters are large, whereas the characteristics of a few poverty-stricken households clearly differ from the others, which leads to small numbers of households in clusters such as Cluster 9 and Cluster 10.

Figure 6 shows the distribution of attribute importances in each cluster, where the abscissa indicates the indices of the 320 attributes, the ordinate indicates the attribute importances, and the ten curves represent the distributions of attribute importances in the ten clusters.

As can be seen from Figure 6, the distributions of attribute importances represented by the ten curves differ markedly from each other. For instance, the attribute with the highest importance in Cluster 7 is the 165th attribute, which denotes the stage of schooling of household members at the last survey, while that in Cluster 8 is the 218th attribute, which denotes the total post-tax annual income from work. This phenomenon shows that poverty-stricken households in different categories differ in their characteristics and causes of poverty. Therefore, the proposed method can identify the commonalities and differences in poverty effectively. Finally, for all the datasets, we conduct computational complexity experiments with the different methods; the results are shown in Table 7. Because the proposed method is a hierarchical DBSCAN algorithm based on the initial division and the aggregation of neighbor clusters, its running time is higher than that of traditional DBSCAN; however, it is lower than those of the other, more recent methods.

5. Conclusions

This paper designs a hierarchical DBSCAN algorithm based on initial division and the aggregation of neighbor clusters. First, the proposed method HDBSCAN adopts an adaptive neighborhood radius to perceive regions with different densities and thus makes an initial division of the dataset. Then, iterative aggregation is performed on neighbor clusters according to the border and inner distances. Experiments on artificial datasets and UCI real-world datasets indicate that HDBSCAN has ideal clustering performance. Additionally, HDBSCAN divides the dataset of Chinese poverty-stricken households, namely CFPS2016, into 10 clusters, and the experimental results verify the rationality of this clustering result. The main reasons for the ideal performance of HDBSCAN lie in two aspects. First, the adaptive neighborhood radius helps to identify regions of different densities in data spaces with imbalanced density distributions. Second, the aggregation further merges neighbor clusters with similar densities, which effectively weakens the impact of the accuracy of the initial partition on the clustering performance. However, when the dimensionality of the dataset is very high, the clustering effect degrades. In the future, more research will be conducted on the clustering result of the CFPS2016 dataset. Specifically, we will study the characteristics of poverty-stricken households in each category so as to support the formulation and implementation of antipoverty measures, and the proposed clustering technique will be applied to targeted poverty alleviation in poverty-stricken counties in China.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.