Abstract

Big data technology has developed rapidly in recent years. This paper studies the performance improvement mechanism of targeted poverty alleviation through big data technology to further promote its comprehensive application in poverty alleviation and development. Using data mining to accurately identify the poor population under a big data framework is clearly more accurate and persuasive than traditional identification methods, and it also helps to find the real causes of poverty and to assist poor residents in the future. In current targeted poverty alleviation work, the identification of poor households and the matching of assistance measures rely mainly on visits by village cadres and the establishment of documentation files. These traditional methods are time-consuming, laborious, and difficult to manage, and they often omit useful family information. New technologies are therefore needed to realize intelligent identification of poverty-stricken households and reduce labor costs. In this paper, we introduce a novel DBSCAN clustering algorithm via an edge computing-based deep neural network model for targeted poverty alleviation. First, we deploy an edge computing-based deep neural network model. Then, within this constructed model, we perform data mining on poverty-stricken families. The DBSCAN clustering algorithm is used to mine the poverty features of poor households and complete their intelligent identification. In view of the high-dimensional and large-volume nature of poverty alleviation data, the algorithm uses the relative density difference of grids to divide the data space into regions with different densities and then applies DBSCAN to cluster each region, which improves the accuracy of DBSCAN and avoids the need to traverse all data when searching for density connections. Finally, the proposed method is applied to analyze and mine poverty alleviation data. The average accuracy exceeds 96%, and the average F-measure, NMI, and PRE values exceed 90%. The results show that the method provides decision support for precise matching and intelligent pairing of village cadres in poverty alleviation work.

1. Introduction

In recent years, deep learning has achieved great success in many fields; in particular, the deep neural network (DNN) has achieved good results on various tasks, such as autonomous driving, intelligent speech, and image recognition [1–4]. A DNN is mainly composed of multiple convolutional layers and fully connected layers. Each layer processes the input data and transmits it to the next layer, and the results are output at the final layer.

A high-precision DNN model has many network layers and requires substantial computing resources. Currently, for intelligent application tasks, DNNs are usually deployed on cloud servers with sufficient computing resources, which requires the data source to transfer the task data to the model in the cloud. The cloud computing model suffers from high latency, low privacy, and high communication cost, so it cannot meet the task requirements well. Some researchers have attempted to deploy cloud models on mobile devices, but the limited resources of mobile devices allow only simple machine learning methods to run, resulting in low recognition accuracy [5–7].

As a new computing paradigm, edge computing pushes computing power to the network edge and has attracted extensive attention from researchers [8–10]. In the edge computing scenario, the DNN model is deployed on edge computing nodes near the device [11]. Edge computing nodes are much closer to data sources than cloud servers, so low latency can be achieved easily [12]. However, the processing capacity of current edge computing devices is limited, and a single edge computing node may not be able to complete the inference task of a complex network model; so, multiple edge computing nodes are required to jointly deploy the DNN model. The main challenge of deploying a DNN is how to select appropriate computing nodes to host the model. By segmenting the neural network model and taking into account the computing requirements of the model and the network conditions of the edge computing nodes, the delay of running the neural network model jointly on multiple computing nodes can be optimized. Figure 1 shows the deep learning model under cloud computing and edge computing, respectively. In the cloud computing scenario, the data sent by the user device is uploaded to the cloud server through the network. In the edge computing scenario, the cloud server only conducts the delay-tolerant model training process, and the trained model is then deployed on the edge nodes. The data source transmits the task data to the edge node, and the DNN model returns the result to the data source.

Targeted poverty alleviation is a way to accurately identify, assist, and manage poverty alleviation targets through scientific and effective procedures in accordance with the conditions of farmers living in different poverty-stricken areas. Targeted poverty alleviation based on big data builds a poverty portrait of poor households through the mining and analysis of poverty alleviation data and carries out all-round identification and evaluation of poor people [13–16], so as to find the poor population, determine the causes of poverty, and deliver poverty alleviation policies. A poverty portrait generally includes the poverty index, poverty characteristics, and matching assistance measures of poor households. However, data mining methods are rarely adopted for targeted poverty alleviation. Data mining discovers the rules hidden in vast amounts of data. It includes classification based on prior knowledge, clustering without prior knowledge, association rule mining, and intelligent algorithms based on machine learning.

Cluster analysis is an important data mining technology that plays a very important role in data mining [17–19]. It is also widely used in other research fields, such as machine learning, artificial intelligence, image processing, and cloud computing. Clustering is the process of dividing a data set into different clusters according to the similarity between data objects: objects belonging to the same cluster are as similar as possible, while objects belonging to different clusters are as different as possible. So far, many different clustering algorithms have been proposed, including partition-based, hierarchical, density-based, grid-based, and fuzzy-based clustering algorithms.

The partition-based clustering algorithm uses an iterative control strategy to optimize an objective function, constantly changing the cluster centers by iterative relocation and thus improving the partitioning result each time. The k-means algorithm is a classical partition-based clustering algorithm. Since k-means randomly selects the initial cluster centers, this selection has a great impact on the clustering results, and the algorithm cannot deal with nonspherical clusters. The hierarchical clustering algorithm can handle nonspherical clusters. The Chameleon algorithm is a hierarchical clustering algorithm that mixes "top-down" and "bottom-up" strategies: it first constructs a k-nearest-neighbor graph on the original data set, then uses an efficient graph partitioning algorithm to partition the graph into initial subclusters, and finally merges the subclusters. However, the algorithm is sensitive to noise points and has high time complexity. The density-based clustering algorithm [20] measures the density correlation between data points; adjacent regions whose density exceeds a set threshold are connected into clusters. Compared with partition-based clustering, density-based clustering can find clusters of arbitrary shape, and its outlier handling strategy can process abnormal data effectively. The DBSCAN algorithm can discover clusters of arbitrary shape and is not sensitive to noise, but its time complexity is high, and it is sensitive to the neighborhood parameters, since different parameters can lead to different clustering results. To address this parameter sensitivity, Bryant et al. [21] proposed a density estimation method using the reverse nearest neighbors of each data object and used a k-nearest-neighbor graph similar to DBSCAN for clustering.

The grid-based clustering algorithm divides the data object space into finite units along each dimension, and all processing takes the unit as the object [22]. In this way, the clustering of data sets is transformed into the processing of blocks in data space, improving the efficiency of the algorithm. The CLIQUE algorithm [23] combines the characteristics of grid-based and density-based clustering and uses the a priori property of frequent pattern and association rule mining to obtain the monotonicity of dense units with respect to dimensionality; clustering is then performed by identifying dense units. The FCM algorithm is a classic fuzzy-based clustering algorithm that iteratively optimizes an objective function and allocates data objects according to a membership matrix. Because the number of clusters must be specified in advance, improper parameter selection makes it easy to fall into a local optimum.

Rodriguez et al. [24] proposed the density peak clustering (DPC) algorithm in 2014, which does not need the number of clusters to be specified in advance, requires only a few parameters, can find nonspherical clusters, and is insensitive to noise. However, the DPC algorithm also has several shortcomings: (1) the truncation distance is set manually with a certain randomness, which greatly influences the clustering results; (2) when calculating the local density of data objects, it does not consider the structural differences within the data set, so when the density difference between clusters is large, the ideal clustering result cannot be obtained; and (3) it cannot handle high-dimensional and large-scale data well. In view of these shortcomings, many algorithms have been proposed. Chen et al. [25] redefined the measurement of the truncation distance and local density by combining the concept of k-nearest neighbors, which generates the truncation distance adaptively for any data set and makes the calculated local density more consistent with the real distribution of the data; at the same time, a distance ratio was introduced in the decision graph to replace the original distance parameter, making the cluster centers more obvious in the decision graph. Guo et al. [26] adopted the k-nearest-neighbor idea to calculate the local density of data objects; although this largely removes the influence of the truncation distance parameter on the clustering results, the choice of the parameter k needs further study. Jin et al. [27] introduced natural neighbors to calculate the local density of data objects, thus solving the parameter selection problem. Li et al. [28] proposed the prominent peak clustering (PPC) algorithm based on significant density peaks; its main idea is to divide data objects into multiple clusters and then merge the clusters with no obvious density peak to obtain accurate clustering results. Chen et al. [29] defined the local density of data objects through the law of universal gravitation and established a two-step strategy based on the first cosmic velocity to allocate the remaining data objects more accurately. Yu et al. [30] improved the DPC algorithm by introducing a weighted local density sequence and a two-step allocation strategy and then used a nearest-neighbor dynamic table to improve clustering efficiency. Du et al. [31] introduced the k-nearest neighbor and PCA methods into the DPC algorithm, enabling it to handle high-dimensional data well. Xu et al. [32] used the similarity index MS to allocate the data points of the data set and then redefined the noise points in the boundary region by the k-nearest-neighbor method; this algorithm is suitable for data sets with high dimensionality and complex structure, but it cannot automatically determine the cluster centers. Xu et al. [33] proposed a grid-based method to process large-scale data. Nevertheless, these methods still have low efficiency when processing high-dimensional data.

The DBSCAN clustering algorithm uses fixed Eps and minPts (two input parameters), so its effect on multidensity data is not ideal, and its time complexity is $O(n^2)$. To solve these problems, a DBSCAN multidensity clustering algorithm based on region division is proposed. Our main contributions are as follows:
(1) First, we deploy an edge computing-based deep neural network model.
(2) Then, in this constructed model, we execute data mining for poverty-stricken families.
(3) The DBSCAN clustering algorithm is used to mine the poverty features of poor households and complete their intelligent identification.
(4) In view of the high-dimensional and large-volume poverty alleviation data, the algorithm uses the relative density difference of grids to divide the data space into regions with different densities and adopts DBSCAN to cluster each region, which improves the accuracy of DBSCAN.
(5) Experiments show that the algorithm can cluster multidensity data effectively, has strong adaptability to various data, and offers better efficiency.

This paper is organized as follows. Section 2 introduces the DNN model based on edge computing. Section 3 details the proposed DBSCAN algorithm. Section 4 presents the experiments and analysis. Section 5 concludes the paper.

2. Model Deployment Based on Edge Computing

To deploy the DNN model in an edge computing scenario, the training process of the model first needs to be completed on the cloud server. The structure of a DNN is an ordered sequence of layers: each layer receives the output of the previous layer, processes it, and transmits it to the next layer. The DNN can therefore be constructed as a model with multiple branches, where each branch is composed of specific neural network layers and the branches together constitute the complete DNN model. The edge computing nodes consist of a variety of devices with different computing performance, which are numerous, independent, and scattered around the users. A single edge computing node can only run a simple model with low accuracy [34, 35]; so, the branched DNN model is distributed across multiple edge computing nodes. Because each branch has a different network layer structure, each branch places different requirements on the computing resources of its deployment node; that is, deploying the same branch to different nodes leads to different running delays. Moreover, considering the different network conditions between edge computing nodes, the distributed deployment of the branched neural network needs to comprehensively consider the computing capacity of the nodes, the branch model structure, and the data transmission between nodes. Therefore, it is necessary to select the best edge computing nodes for a given DNN branching model structure.

In the edge computing scenario, we are given an edge computing node set $E = \{e_1, e_2, \ldots, e_m\}$ and a DNN model with $n$ branches $B = \{b_1, b_2, \ldots, b_n\}$, where $b_i$ stands for the $i$th branch. The model branches must respect the running order: $b_i \rightarrow b_{i+1}$ represents the running order of branch models $b_i$ and $b_{i+1}$.
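To make the selection problem concrete, the following Python sketch enumerates assignments of ordered branches to nodes and picks the lowest-latency plan. The branch costs, node speeds, and link latencies are illustrative assumptions, and brute-force search merely stands in for whatever optimization a real deployment would use:

```python
# Illustrative sketch: choose edge nodes for an ordered branch sequence
# b1 -> b2 -> ... -> bn so that compute time plus inter-node transfer
# latency is minimized. All numbers are assumed for demonstration.
from itertools import product

branch_cost = [2.0, 4.0, 1.5]         # hypothetical compute demand per branch (GFLOPs)
node_speed = [1.0, 2.5, 1.8]          # hypothetical node throughput (GFLOPs/s)
link_delay = [[0.0, 0.3, 0.5],        # hypothetical transfer latency between nodes (s)
              [0.3, 0.0, 0.2],
              [0.5, 0.2, 0.0]]

def plan_latency(assignment):
    """Total latency of running the ordered branches on the assigned nodes."""
    total = 0.0
    for i, node in enumerate(assignment):
        total += branch_cost[i] / node_speed[node]      # compute time of branch i
        if i > 0 and assignment[i - 1] != node:         # data handoff between nodes
            total += link_delay[assignment[i - 1]][node]
    return total

# Brute force is fine at this toy size; many branches and nodes would
# require a smarter search strategy.
best = min(product(range(len(node_speed)), repeat=len(branch_cost)), key=plan_latency)
print("best assignment:", best, "latency:", round(plan_latency(best), 3))
```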

3. Proposed DBSCAN Algorithm

The DBSCAN algorithm is a classical density-based clustering algorithm. It calculates the Eps neighborhood of each data object and obtains clustering results by grouping density-connected data objects into a cluster. DBSCAN can automatically determine the number of clusters, find clusters of any shape, and is not sensitive to noise data. Given a $d$-dimensional data set $D$, DBSCAN is defined as follows:

Definition 1. Eps neighborhood. The Eps neighborhood of a data object $p$ is defined as the set of points contained within the $d$-dimensional hypersphere with $p$ as the center and Eps as the radius, i.e., $N_{\mathrm{Eps}}(p) = \{q \in D \mid \mathrm{dist}(p, q) \le \mathrm{Eps}\}$, where $D$ is the data set in the $d$-dimensional space and $\mathrm{dist}(p, q)$ represents the distance between points $p$ and $q$ in $D$.

Definition 2. Core data object. Given the parameters Eps and minPts, for a data object $p \in D$, if the number of objects in the Eps neighborhood of $p$ satisfies $|N_{\mathrm{Eps}}(p)| \ge \mathrm{minPts}$, then $p$ is called a core object.

Definition 3. Directly density reachable. Given Eps and minPts, for data objects $p, q \in D$, if the two conditions $q \in N_{\mathrm{Eps}}(p)$ and $|N_{\mathrm{Eps}}(p)| \ge \mathrm{minPts}$ are satisfied, then $q$ is called directly density reachable from $p$ with respect to Eps and minPts. Direct density reachability does not satisfy symmetry.

Definition 4. Density reachable. Given Eps and minPts, for data objects $p, q \in D$, if there is an object sequence $p_1, p_2, \ldots, p_n$ with $p_1 = q$ and $p_n = p$ such that each $p_{i+1}$ is directly density reachable from $p_i$, then $p$ is called density reachable from $q$ with respect to Eps and minPts. Density reachability does not satisfy symmetry.

Definition 5. Density connected. Given Eps and minPts, for data objects $p, q \in D$, if there is a data object $o$ such that both $p$ and $q$ are density reachable from $o$, then $p$ and $q$ are called density connected with respect to Eps and minPts. Density connection satisfies symmetry.

When Eps and minPts are given, the basic flow of the DBSCAN algorithm is as follows. It selects any unpartitioned data object and determines whether it is a core object. If so, it finds all data objects density reachable from it and labels them as one cluster. Otherwise, the object is checked as possible noise: if it is a noise point, it is labeled as such; if not, it is left unprocessed for now. This is repeated until all data objects have been partitioned.
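For concreteness, a minimal Python re-implementation of this flow is sketched below (the paper's experiments use MATLAB; this is an illustrative sketch, not the authors' code):

```python
# Minimal DBSCAN sketch following the flow above (O(n^2) distance matrix;
# an indexed implementation such as sklearn.cluster.DBSCAN is preferable
# in practice).
import numpy as np

def dbscan(X, eps, min_pts):
    n = len(X)
    labels = np.full(n, -2)                  # -2 unvisited, -1 noise, >=0 cluster id
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    nbrs = [np.where(dist[i] <= eps)[0] for i in range(n)]
    cluster = 0
    for i in range(n):
        if labels[i] != -2:
            continue
        if len(nbrs[i]) < min_pts:           # not a core object: provisional noise
            labels[i] = -1
            continue
        labels[i] = cluster                  # new cluster seeded by a core object
        seeds = list(nbrs[i])
        while seeds:                         # collect all density-reachable objects
            j = seeds.pop()
            if labels[j] == -1:              # former noise becomes a border point
                labels[j] = cluster
            if labels[j] != -2:
                continue
            labels[j] = cluster
            if len(nbrs[j]) >= min_pts:      # core objects keep expanding the cluster
                seeds.extend(nbrs[j])
        cluster += 1
    return labels

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 8])
print(np.unique(dbscan(X, eps=1.0, min_pts=5)))
```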

Given a $d$-dimensional data set $D$ containing $n$ data objects, where every dimension attribute of $D$ is bounded, let the values of the $i$th dimension lie in the interval $[l_i, u_i]$; then $S = [l_1, u_1] \times [l_2, u_2] \times \cdots \times [l_d, u_d]$ is the $d$-dimensional data space. Each dimension of the data space is divided into intervals of equal length that are mutually disjoint, forming grid cells; these grids are left-closed and right-open in every dimension. In this way, the data space is divided into $N = \prod_{i=1}^{d} \mathrm{num}_i$ hyper-rectangular grid cells of equal volume, where $\mathrm{num}_i$ is the number of intervals on the $i$th dimension. The grid edge length $\mathrm{len}$ is set by equation (1), in which $\lambda$ is the grid control factor used to control the size of the grid; the same value of $\lambda$ is used in all experiments in this paper. According to the grid edge length, the number of intervals on each dimension is calculated as
$$\mathrm{num}_i = \left\lceil \frac{u_i - l_i}{\mathrm{len}} \right\rceil \quad (2)$$

Each object in the data set is mapped to its corresponding grid cell. For each data object $x = (x_1, x_2, \ldots, x_d)$, the subscript of the corresponding grid on the $i$th dimension is
$$g_i = \left\lfloor \frac{x_i - l_i}{\mathrm{len}} \right\rfloor + 1 \quad (3)$$

Each object in the data set is mapped to its corresponding grid according to equation (3), and the number of objects in grid $g$ is denoted as $\mathrm{den}(g)$, the density of the grid.

Adjacent grids are defined as follows: grids $g$ and $g'$ are adjacent if their subscripts differ by at most 1 in every dimension, i.e., $|g_i - g'_i| \le 1$ for all $i$ (with $g \ne g'$); thus each grid has at most $3^d - 1$ adjacent grids.
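A short sketch of this binning step follows, assuming a user-chosen cell edge length in place of equation (1) (`grid_bin` and `cell_len` are illustrative names):

```python
# Sketch of the binning step (equations (2)-(3)): map each object to a
# left-closed, right-open grid cell and count per-cell densities. `cell_len`
# stands in for the edge length produced by equation (1).
import numpy as np
from collections import Counter

def grid_bin(X, cell_len):
    lo = X.min(axis=0)                                  # lower bound l_i per dimension
    idx = np.floor((X - lo) / cell_len).astype(int)     # per-dimension cell subscripts
    density = Counter(map(tuple, idx))                  # den(g): objects per cell
    return idx, density

X = np.random.rand(200, 2)
idx, density = grid_bin(X, cell_len=0.25)
print(len(density), "non-empty cells; densest cell holds", max(density.values()), "points")
```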

The relative density difference of grids is used: for two grid cells $g_1$ and $g_2$ with densities $\mathrm{den}(g_1)$ and $\mathrm{den}(g_2)$, respectively, the relative density difference of $g_2$ relative to $g_1$ is defined as
$$\mathrm{RDD}(g_1, g_2) = \frac{|\mathrm{den}(g_1) - \mathrm{den}(g_2)|}{\mathrm{den}(g_1)} \quad (4)$$

Formula (4) is used as the condition for grid merging. First, the grid with the highest density is selected as the initial cell grid $g_0$, and the relative density difference between each adjacent grid $g$ and $g_0$ is calculated according to equation (4) in turn. If $\mathrm{RDD}(g_0, g) < \varepsilon$ (where $\varepsilon$ is a given parameter), then $g$ and $g_0$ are merged. The initial cell grid becomes the merged larger grid region, and its density is updated to $\mathrm{den}(g_0) = \frac{1}{m}\sum_{j=1}^{m} \mathrm{den}(g_j)$, where $m$ represents the number of merged grids; that is, the density of the initial grid cell is dynamic. The region continues to extend outward by merging grids in this way until all boundary grids fail to satisfy $\mathrm{RDD} < \varepsilon$. Because some points in a boundary grid may be boundary points of the region, as shown in Figure 2 (data points A and B are boundary points of cluster C, which cannot be discarded as noise), the boundary grids are also merged into the region, but they are no longer extended outward. The merged grids form one region, whose data set is denoted as $D_1$.

Then, the densest of the remaining unprocessed grids is taken as the initial cell grid, and the above steps are repeated until the remaining data points can no longer be clustered. Thus, the data set is divided into regions $D_1, D_2, \ldots, D_k$ and the initial noise points.
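The region-growing procedure might look like the following sketch, with `eps_rdd` playing the role of the parameter $\varepsilon$ and the toy cell densities chosen for illustration; the dynamic density is recomputed as the running average of the merged cells, as described above:

```python
# Sketch of region growing: start from the densest cell, merge adjacent cells
# whose relative density difference (equation (4)) is below eps_rdd, absorb
# boundary cells without expanding them, and keep the region density dynamic.
from itertools import product

def grow_region(density, eps_rdd):
    seed = max(density, key=density.get)        # densest unprocessed cell
    region, frontier = {seed}, [seed]
    dyn_density = float(density[seed])          # dynamic (running-average) density
    while frontier:
        cell = frontier.pop()
        for off in product((-1, 0, 1), repeat=len(seed)):
            nb = tuple(c + o for c, o in zip(cell, off))
            if nb == cell or nb not in density or nb in region:
                continue
            rdd = abs(dyn_density - density[nb]) / dyn_density
            region.add(nb)                      # boundary cells are merged too
            if rdd < eps_rdd:                   # only similar-density cells expand
                frontier.append(nb)
                dyn_density = sum(density[c] for c in region) / len(region)
    return region

toy = {(0, 0): 30, (0, 1): 28, (1, 0): 25, (2, 0): 5, (3, 3): 12}
print(sorted(grow_region(toy, eps_rdd=0.5)))    # (3, 3) stays outside the region
```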

The above region division method roughly divides the data set into data regions with different densities and noise points. Then, DBSCAN clustering is carried out on each region. According to the grid partitioning method, the densest of the remaining grids is taken as the initial grid cell each time; so, the density basically decreases from the first region to the $k$th region.

When clustering, Eps$_1$ and minPts are input for the first region to perform DBSCAN clustering, and Eps$_i$ for the $i$th region ($i = 2, \ldots, k$) is automatically obtained from Eps$_1$ by equation (5), where $n_i$ represents the number of data points in $D_i$ and $m_i$ represents the number of merged grids in the $i$th region.

In this paper, the data space is first roughly divided into different data regions through grid partitioning. Then, Eps is automatically obtained for each region according to its density, and DBSCAN clustering is performed. Region partitioning reduces unnecessary density-connection query operations in the DBSCAN algorithm and improves efficiency. Because Eps is obtained automatically according to the density of each region, the algorithm is more adaptable to the data, especially when processing multidensity data, where the effect is better.

The specific description of the proposed algorithm is as follows.

Input: data set $D$, Eps, minPts, and $\varepsilon$.

According to equations (1)–(3), the grid is partitioned and the data is binned: the data set is mapped to the partitioned grid, and the density of each grid is counted.

According to equation (4) and the grid density difference parameter $\varepsilon$, grids with similar densities and adjacent positions are merged. In this way, the data space is divided into different regions, generating data blocks $D_1, D_2, \ldots, D_k$ and preliminary noise points.

DBSCAN clustering is performed on the data regions of different densities according to the parameters Eps and minPts and equation (5).

Output: clustering results.
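Putting the pieces together, a hedged end-to-end sketch follows; since the exact form of equation (5) is not reproduced here, a density-based scaling heuristic stands in for it, and `cluster_by_region`, `base_eps`, and the toy regions are illustrative:

```python
# End-to-end sketch: given the regions from the division step, run DBSCAN per
# region with a region-specific Eps. The sqrt density scaling is a heuristic
# stand-in for equation (5), not the paper's exact formula.
import numpy as np
from sklearn.cluster import DBSCAN

def cluster_by_region(X, regions, base_eps, min_pts):
    """regions: list of index arrays, densest region first."""
    labels = np.full(len(X), -1)
    next_id = 0
    base_size = max(len(regions[0]), 1)
    for pts in regions:
        eps_i = base_eps * np.sqrt(base_size / max(len(pts), 1))  # sparser => larger Eps
        sub = DBSCAN(eps=eps_i, min_samples=min_pts).fit_predict(X[pts])
        for local, g in zip(sub, pts):
            if local >= 0:                     # keep noise points labeled -1
                labels[g] = next_id + local
        next_id += sub.max() + 1 if sub.max() >= 0 else 0
    return labels

X = np.vstack([np.random.randn(80, 2) * 0.3, np.random.randn(40, 2) * 1.2 + 6])
regions = [np.arange(80), np.arange(80, 120)]   # pretend region division found these
print(np.unique(cluster_by_region(X, regions, base_eps=0.4, min_pts=5)))
```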

4. Experiments and Analysis

All algorithms in this paper are implemented in MATLAB. The experimental environment is as follows: Intel i7 CPU, 4 GB memory, and Windows 10. We compare the proposed algorithm with CDBSCAN [36], FSDBSCAN [37], and NARDBSCAN [38]. Three indexes that are widely used for clustering algorithms, precision (PRE), normalized mutual information (NMI), and F-measure, are adopted as the performance measurement criteria; the values of NMI, F-measure, and PRE lie in [0, 1], and higher values denote better clustering results. The experimental data sets are from the UCI machine learning repository (http://archive.ics.uci.edu/ml/datasets.php).
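As an illustration of how such external indexes can be computed when ground-truth labels are available (NMI via scikit-learn; PRE read here as cluster purity and F-measure computed over point pairs, two common interpretations that may differ from the paper's exact definitions):

```python
# Illustrative computation of the external indexes (assumes ground-truth
# labels; PRE is read as cluster purity and F-measure as the pairwise
# F-score, which may differ from the paper's exact definitions).
import numpy as np
from itertools import combinations
from sklearn.metrics import normalized_mutual_info_score

def purity(y_true, y_pred):
    # each predicted cluster votes for its majority true class
    return sum(np.bincount(y_true[y_pred == c]).max()
               for c in np.unique(y_pred)) / len(y_true)

def pairwise_f(y_true, y_pred):
    tp = fp = fn = 0
    for i, j in combinations(range(len(y_true)), 2):
        same_t, same_p = y_true[i] == y_true[j], y_pred[i] == y_pred[j]
        tp += same_t and same_p
        fp += same_p and not same_t
        fn += same_t and not same_p
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    return 2 * p * r / (p + r) if p + r else 0.0

y_true = np.array([0, 0, 0, 1, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 1, 1, 1, 2, 2])
print("NMI:", round(normalized_mutual_info_score(y_true, y_pred), 3))
print("PRE:", round(purity(y_true, y_pred), 3))
print("F:  ", round(pairwise_f(y_true, y_pred), 3))
```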

4.1. Experiment 1

Data set DS1 has four classes, with noise points and multiple densities. The clustering results are shown in Figure 3.

As can be seen from Figure 3(a), when noise points are not absorbed into the high-density regions, data loss in the low-density regions is serious; conversely, when the low-density regions lose no data, the high-density regions absorb noise points, or the two U-shaped classes on the left are merged into one class. The clustering result in Figure 3(b) is ideal and identifies the classes in the noisy multidensity data, but the efficiency of the algorithm needs to be improved. In Figure 3(c), the boundary points of irregular shapes are handled poorly. In Figure 3(d), the proposed algorithm adopts the grid division method and performs DBSCAN clustering on regions of different densities, which not only identifies the four classes but also absorbs almost no noise points. The average index values of the different methods on the DS1 data set are shown in Table 1.

Table 1 shows the clustering results of the four algorithms on data set DS1. The comparison of F-measure, NMI, and ACC shows that the clustering performance of the proposed algorithm is higher than that of the other algorithms on DS1.

4.2. Experiment 2

This experiment uses data set DS2, which has class boundary interference. The results are shown in Figure 4.

As can be seen from Figure 4(a), CDBSCAN identifies three classes: the low-density class close to the high-density region is absorbed into that region, and more noise points are absorbed as well. In Figure 4(b), the data in the upper right corner has a high-density center whose density gradually decreases toward the edges, which suits the edge data; as a result, the data in the lower right corner, which is close to the upper right corner and has a similar density, is merged into the same class. In Figure 4(c), the NARDBSCAN algorithm divides the high-density region into multiple classes and lacks processing of the data from the density center to the edge. The proposed algorithm adopts grid division and applies DBSCAN with different parameters to blocks of different densities, so it identifies clusters of different densities well while absorbing fewer noise points. Table 2 shows the comparison results on data set DS2.

4.3. Experiment 3

In this experiment, data set DS3, which has a large data volume, many clusters, and rich shapes, is used for comparison. The data contains nine clusters. The experimental results are shown in Figure 5.

In Figure 5(b), the cluster processing of the FSDBSCAN method is close to ideal, but the two classes that are close to each other are inevitably merged into one class. In Figure 5(c), low-density edges are treated as noise points, and relatively isolated points within classes are also treated as noise. In Figure 5(d), with the proposed method, the high-density regions do not absorb too many noise points, and the low-density regions are neither split nor treated as noise; so, the multidensity data is processed well. Table 3 shows the detailed index values.

The above experiments mainly show the adaptability of the proposed algorithm to multidensity data and clusters of arbitrary shape. A quantitative analysis follows. The original data sets DS1, DS2, and DS3 are three-dimensional, with the third dimension being the class label. If the class label in the clustering result is the same as the class label of the original data, the clustering is correct. The number of correctly clustered samples is denoted as $T_c$ and the total number of data as $T$ [39]. The accuracy is calculated as
$$\mathrm{ACC} = \frac{T_c}{T} \quad (6)$$
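Because cluster ids are arbitrary, counting $T_c$ requires matching cluster ids to true class labels; a common way to do this is the Hungarian algorithm, as in the sketch below (an assumption, since the paper does not state its matching procedure):

```python
# Counting Tc for equation (6): cluster ids are arbitrary, so they are first
# matched to true class labels via the Hungarian algorithm (an assumed but
# standard choice).
import numpy as np
from scipy.optimize import linear_sum_assignment

def clustering_accuracy(y_true, y_pred):
    classes, clusters = np.unique(y_true), np.unique(y_pred)
    cost = np.zeros((len(clusters), len(classes)))
    for i, c in enumerate(clusters):
        for j, k in enumerate(classes):
            cost[i, j] = -np.sum((y_pred == c) & (y_true == k))
    rows, cols = linear_sum_assignment(cost)    # maximizes matched counts
    tc = -cost[rows, cols].sum()                # Tc: correctly clustered samples
    return tc / len(y_true)                     # ACC = Tc / T

y_true = np.array([0, 0, 0, 1, 1, 1])
y_pred = np.array([1, 1, 0, 0, 0, 0])
print(clustering_accuracy(y_true, y_pred))      # 5/6 ≈ 0.833
```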

The data in the following tables are the average values over 20 runs. The experimental results for DS1, DS2, and DS3 are shown in Tables 4–6.

4.4. Experiment 4. Identification of Poverty Features

This section applies the proposed DBSCAN algorithm to real poverty alleviation data. The data comes from a prefecture-level city, covering 11,423,500 rural residents, 196,700 rural households, and 68,000 poverty-stricken households. The original data includes the registration (archived card) data, visit data, agricultural cloud project data, and data from the education, health, and sanitation departments. According to the designed poverty alleviation data indicator system, the poverty alleviation data is populated with an ETL tool. Finally, the proposed DBSCAN algorithm is applied to analyze the poverty alleviation index data, intelligently identify the poverty characteristics of poor households, and carry out the corresponding visual display and analysis to provide decision support for regional assistance work.

In this paper, the proposed region-division-based DBSCAN clustering algorithm is applied to mine and analyze the preprocessed poverty alleviation index data matrix to identify the poverty characteristics of poor households in different regions. Feature classification is carried out by the clustering algorithm, and 8 data clusters are finally identified, with each cluster corresponding to a feature category. For each data cluster, the central data object is taken as the cluster representative, and the indexes whose values exceed the threshold of 0.7 are taken as the poverty indexes of that object. The poverty characteristics identified in this paper are compared with and matched to the causes of poverty in the registration cards of poor households. On the one hand, this refines and complements the general causes of poverty in the registration cards, facilitating the later matching of assistance measures and assistance cadres; on the other hand, the identification results and the recorded causes of poverty can be used to verify each other.
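A minimal sketch of this thresholding step follows (the index names, center values, and variable names are illustrative; only the 0.7 threshold comes from the text):

```python
# Sketch of reading poverty features off one cluster: the cluster's central
# object is a vector of normalized poverty indexes, and indexes above the
# 0.7 threshold become that cluster's poverty characteristics. Index names
# and values are illustrative only.
import numpy as np

index_names = ["illness", "schooling_burden", "lack_of_labor",
               "lack_of_funds", "housing", "transport"]
center = np.array([0.85, 0.30, 0.72, 0.65, 0.20, 0.10])  # hypothetical cluster center

THRESHOLD = 0.7
features = [n for n, v in zip(index_names, center) if v > THRESHOLD]
print("poverty characteristics:", features)    # ['illness', 'lack_of_labor']
```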

Table 7 shows the accuracy of the comparison between the poverty characteristics identified in this paper and the causes of poverty recorded in the registration cards; we give the overall comparison results. Because the poverty characteristics identified in this paper are more refined, more indexed, and richer in content than the causes of poverty in the registration cards, households with the same poverty characteristics may correspond to multiple causes of poverty. When calculating the accuracy, if some or all of the identified poverty characteristics of a household correspond to the causes of poverty recorded in its registration card, the recognition result is considered consistent with the registration card.

5. Conclusions

Targeted poverty alleviation aims at building a well-off society in an all-round way. However, current poverty alleviation work relies mainly on the establishment of documentation files, visits by village cadres, and so on, which costs a great deal of manpower and material resources and lacks scientific and effective means of supervision and management. This paper studies an intelligent identification method for the poverty characteristics of poor households, designs an intelligent identification scheme, and further provides data reference and guidance for the accurate matching of assistance measures and the pairing of assistance cadres. However, when the dimensionality of the data sets is very high, the clustering effect is not perfect under the current experimental environment. In the future, we will adopt new deep learning methods to improve clustering effectiveness, and the improved and advanced clustering technologies will be applied to targeted poverty alleviation in poverty-stricken counties in China.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there is no conflict of interest regarding the publication of this paper.