Abstract

With the recent emergence of big data, there has been significant progress in the study of big data mining and rapid developments in urban computing. With the integration of planning and management in urban areas, there is an urgent need to focus on the identification of urban functional areas (UFAs) based on big data. This paper describes the concept of communication activity intensity, which is more meaningful than the number of communication activities or the user density in identifying UFAs. The impact of diverse geographical area subdivisions on the accuracy of UFA recognition is discussed, and a k-means clustering method for dynamic call detail record data and kernel density estimation technique for static point of interest data are established at the traffic analysis zone level. A case study on the region within Beijing’s 3rd Ring Road is conducted, and the results of UFA identification are qualitatively and quantitatively verified. The causes of large passenger flows on certain metro lines in Beijing are also analyzed. The highest identification accuracy is obtained for park and scenery areas, followed by residential areas and office areas. In conclusion, the proposed method offers a significant improvement over the identification accuracy of previous techniques, which verifies the reliability of the method.

1. Introduction

In the process of urban planning and management [1], the division of urban functional areas (UFAs) is a fundamental step. The distribution of UFAs is directly related to decision-making regarding urban transportation, resource management, and factory relocation [2]. As a city develops, the requirements for the integration of urban planning and management change, requiring some dynamic adjustment to the urban planning procedure. At the same time, as urban traffic congestion increases, it is important to alleviate this congestion to prevent an imbalance between urban traffic supply and demand caused by an unreasonable layout of urban functions. However, there is a certain deviation between the existing urban planning and real-world urban development. Therefore, the precise and timely identification of UFAs is urgently required. Furthermore, the identification of UFAs has positive significance for policy formulation, resource allocation, transportation, and enterprise development [3]. Of course, it also has great significance for refining future traffic demand management.

Traditional urban land use classification is largely based on questionnaire surveys, which are time-consuming, labor-intensive, and nonexhaustive and do not reflect the structure of the city in real time [4]. However, some researchers believe that the arrival of the big data era signifies a change in our mode of thinking [5–7], and so the application of big data in planning is currently a hot topic of research [8]. There is also a recognition that constructing UFAs based on big data is essentially self-fulfilling.

In recent years, many studies have made full use of big data for urban land use classification or UFA detection [9, 10]. For example, the number of regional mobile phone calls has been used to represent the characteristics of urban functions [11], and points of interest (POIs) data have been collected to demonstrate the land use of an area [12, 13]. However, three challenging problems must be solved before mapping the functional area to very-high-resolution images [14], namely, the spatial units, features used for the analysis, and category criteria.

There have been many studies on UFAs using massive mobile phone data, including Call Detail Record (CDR) databases and Location-Based Service (LBS) databases. Several studies have also focused on the division and selection of Geographical Area Subdivisions (GASs) when using big data. However, previous studies considering CDR volumes did not take the GAS size into account. Additionally, applications of CDR data have rarely been combined with data such as POIs, which carry land use attributes, and few studies have verified their results with both qualitative and quantitative methods.

This paper describes a set of data-driven methods for UFA identification. We consider the abovementioned factors comprehensively, including the influence of different GAS sizes, statistical indicators of CDR data, data sources containing land use features, and verification methods that are both qualitative and quantitative. The purpose of this study is to develop a practicable method for UFA identification, thus enabling reliable decision-making for urban planning and traffic planning and improving the utilization rate of existing big data applications in the engineering field.

Based on CDR data and POI data, the proposed approach makes the following contributions. First, a novel data-driven method of UFA recognition is proposed. Because CDR data are large in scale and cover long periods, they can be used to record citizens’ daily activities, and the intensity of these activities, which depends on land use, contributes greatly to identifying the function of a district within a city. Second, this study demonstrates that both calculation indicators and GASs must be considered before constructing the CDR model. The size of GASs is shown to have a significant impact on the results of numerical experiments, which further affect the indicators of CDR data such as the Number of Communication Activities (NCA), the Communication Activity Intensity (CAI), and the User Density (UD). Third, POI data overcome the shortcomings of CDR data in the analysis of land use characteristics. Combining POI data and CDR data can improve the accuracy of UFA identification. Finally, although many previous methods have employed qualitative verification, few have also adopted quantitative verification; this study uses both.

The remainder of this paper is organized as follows. In the next section, research related to UFA identification is reviewed. Novel region classification methods (GASs, CDR data model, and POI data model) and the required data sources are then presented in Section 3. Section 4 describes a case study of Beijing along with qualitative and quantitative verification and presents the results and a detailed discussion. Finally, Section 5 provides some conclusions and recommendations for future studies.

2. Literature Review

Research on land use classification and UFA recognition has been the subject of considerable effort in the field of geographic information [15–17]. However, satellite remote sensing data and other traditional detection methods have some shortcomings, such as a long collection cycle, high cost, and poor representation of the difference between intrinsic functions. Several scholars [18, 19] have studied land use classification models, but their accuracy varies greatly depending on the input data. To overcome these limitations, mobile phone data have been used to explore the spatial structure of cities [20]. The results of spatiotemporal changes in urban activities based on mobile phone data can be displayed using heat maps [21]. This opened the way to a wide range of big data applications in urban computing [22–26]. For example, Croce et al. [27] used Floating Car Data (FCD) for zoning and graph building, while Alonso et al. [28] used a large quantity of observed traffic data to estimate the effects of traffic control regulation on the macroscopic fundamental diagram of the traffic network. Croce et al. [29] integrated transport models with big data on transport and energy in an attempt to design transport services with electric vehicles.

CDR data use the auxiliary positioning function of the Global Positioning System (GPS) [30], allowing the analysis of crowd activities or human activity patterns. A literature review has investigated the use of mobile phone data to track travel behavior [31]. Population activities and human activity patterns are closely related to urban land use and UFAs [32], allowing urban functional types to be distinguished from the perspective of “humans” by CDR data. Of course, there has been much research on the application of big data for land use, for example, traffic data from loop sensors [28], Smart Card Data (SCD) [24], FCD data [27], and GPS data [33].

The employment space and commuting scope of the urban population in the suburbs of New York were analyzed by using CDR data in different periods [34]. Urban activities have also been analyzed dynamically in Monza and Brianza province, Italy, using the amount of mobile phone conversations, messages, and the number of mobile switching center users in different time intervals [11]. However, some experts have mentioned the greater influence of density than volume for CDR data applications [35]. Iounousse has identified the land use of a city using unsupervised clustering based on satellite data [36].

In terms of GASs, their size differs from buildings to administrative regions [37]. Moreover, the GASs may not represent a complete region in the city [38]. Additionally, researchers have conducted experiments that indicated the significance of traffic analysis zones (TAZs) in CDR applications and provided useful suggestions for urban transportation planning agencies [39].

UFA recognition typically uses a clustering method [24, 36, 38, 40, 41] or a semantic model [9, 15, 31, 42]. Semantic models can realize hierarchical recognition, but ignore the shape and size of objects, which have a great impact on the results. In addition, erroneous classification objects can also lead to incorrect results, and the correlations between the UFAs are known to have a strong influence on the overall classification. Clustering methods can overcome these shortcomings; furthermore, the clustering approach is adaptive to individuals and obtains results quickly and precisely. The lack of discussion on GASs and quantitative verification in previous studies has led to inaccurate recognition results, and the combination of static data in existing methods is inadequate when using CDR data. In this study, to identify UFAs, the k-means clustering model is applied to dynamic CDR data that have been translated into the CAIs of GASs, and kernel density estimation (KDE) is used for static POI data based on TAZs. Additionally, in the verification of UFA recognition, qualitative and quantitative analyses are used based on static Baidu high-resolution image map data and field survey data.

3. Methodology

3.1. Data Sources
3.1.1. CDR Data

The case study covers the region within Beijing’s 3rd Ring Road, an area of 159 km². The study described in this paper was conducted using Beijing CDR data from June 1–30, 2015. These data were obtained from strategic cooperation projects undertaken by our research team and the Beijing branch of China Mobile Communications Group Co., Ltd. As a result, the data have strict privacy protection (with private information removed) and right of use protection. The research included 3198 mobile communication base stations, with 880 macrocellular stations and 2318 microcellular stations. This covered an average of about 4.94 million daily users and 100.73 million daily records. The CDR data format and examples are presented in Table 1.

3.1.2. POI Data

The POI data refer to all geographic entities that can be abstracted as points. The POI data were extracted from the Beijing electronic map in 2015 (see Table 2).

3.1.3. Baidu High-Resolution Image (BHRI) Map Data

The BHRI map data used in this paper can be found at https://map.baidu.com and are publicly available.

3.2. Discussions of GASs

Five different GASs were collected from previous studies, namely, the raster layer [43], Voronoi layer [41, 44], road network segmentation layer [33], TAZ layer [24], and administrative layer [6]. The influence of the different GASs on the identification of UFAs is discussed in Table 3.

From the discussion in Table 3, the TAZ layer appears to have several advantages in terms of UFA identification. There are 235 TAZs within Beijing’s 3rd Ring Road.

3.3. CDR Data Model

Compared with other methods, clustering has many advantages, such as easy operation, rapid output of results, and the ability to focus on individuals. The k-means clustering method is widely used in clustering analysis of UFA recognition based on human activity data. Hence, the k-means clustering is used in this study to deal with CDR data. As there is some difference between human travel characteristics on workdays and at the weekend [45], these periods are analyzed separately, which is very helpful for the recognition of UFAs. In addition, NCA, CAI, and UD are also considered in the CDR data model.

3.3.1. Several Definitions

The following items are used in our model of CDR data (see Table 4):
(1) CAIs of GAS: the ratio of the number of calls made or received in a certain GAS at a fixed time interval of the day to the area of the GAS coverage
(2) UDs of GAS: the ratio of the number of users in a certain GAS at a fixed time interval of the day to the area of the GAS coverage
(3) Matrix of NCAs, CAIs, and UDs: the distributions of NCA, CAI, and UD in each GAS at a 5 min time slot of the day, expressed as $\mathrm{NCA}_n^{w}(t)$, $\mathrm{CAI}_n^{w}(t)$, and $\mathrm{UD}_n^{w}(t)$, where $t$ denotes the 5 min time slot, $t = 1, 2, \ldots, 288$
(4) Signature of each GAS: the aggregation result of the NCA matrix, CAI matrix, and UD matrix, which indicates a certain UFA, expressed as $S^{\mathrm{NCA}}_{n,w}$, $S^{\mathrm{CAI}}_{n,w}$, $S^{\mathrm{UD}}_{n,w}$, $D_1$ = {Monday, Tuesday, Wednesday, Thursday, Friday}, $D_2$ = {Saturday, Sunday}, and $w \in \{1, 2\}$.

3.3.2. Index Calculation

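(1) Matrix of user numbers:
$$U_n = \left[ U_n^{d,t} \right]_{D \times 288}, \qquad U_n^{d,t} = \sum_{c=1}^{C_n} u_{n,c}^{d,t}, \qquad n = 1, 2, \ldots, N,$$
where $U_n$ is the matrix of user numbers in the $n$th GAS; $N$ is the number of GASs; $d$ represents each day of a month ($d = 1, 2, \ldots, 30$ in this paper); $C_n$ is the number of mobile communication base stations in a certain GAS; and $u_{n,c}^{d,t}$ represents the number of users in the $n$th GAS connected to the $c$th mobile communication base station in the $t$th 5 min interval of day $d$.
(2) Matrix of communication numbers:
$$M_n = \left[ M_n^{d,t} \right]_{D \times 288}, \qquad M_n^{d,t} = \sum_{c=1}^{C_n} m_{n,c}^{d,t},$$
where $M_n$ is the matrix of communication numbers in the $n$th GAS and $m_{n,c}^{d,t}$ represents the number of communications in the $n$th GAS connected to the $c$th mobile communication base station in the $t$th 5 min interval of day $d$.
(3) Area of GASs: area statistics are mainly determined using ArcGIS, and the $n$th GAS area is referred to as $A_n$. The specific statistical operations are not discussed in this article.
(4) Average value calculation:
(a) The average user numbers in GASs:
$$\bar{U}_n^{w}(t) = \frac{1}{|D_w|} \sum_{d \in D_w} U_n^{d,t},$$
where $\bar{U}_n^{w}$ represents the matrix of average user numbers in GASs on weekdays or weekends; $D_w$ is the set of days of type $w$ in the month; and $w \in \{1, 2\}$, with 1 denoting weekday and 2 denoting weekend.
(b) Average communication numbers of GASs:
$$\bar{M}_n^{w}(t) = \frac{1}{|D_w|} \sum_{d \in D_w} M_n^{d,t},$$
where $\bar{M}_n^{w}$ represents the matrix of average communication numbers in GASs on a weekday or weekend.
(5) Intensity calculations:
$$\mathrm{CAI}_n^{w}(t) = \frac{\bar{M}_n^{w}(t)}{A_n}, \qquad \mathrm{UD}_n^{w}(t) = \frac{\bar{U}_n^{w}(t)}{A_n},$$
where $\mathrm{CAI}_n^{w}$ and $\mathrm{UD}_n^{w}$ denote the matrices of CAIs and UDs, respectively, and $A_n$ is the area of the $n$th GAS (km²).
(6) Signature calculations:
(a) Signature of NCA:
$$S^{\mathrm{NCA}}_{n,w} = \left( \bar{M}_n^{w}(1), \bar{M}_n^{w}(2), \ldots, \bar{M}_n^{w}(288) \right)$$
(b) Signature of CAI:
$$S^{\mathrm{CAI}}_{n,w} = \left( \mathrm{CAI}_n^{w}(1), \mathrm{CAI}_n^{w}(2), \ldots, \mathrm{CAI}_n^{w}(288) \right)$$
(c) Signature of UD:
$$S^{\mathrm{UD}}_{n,w} = \left( \mathrm{UD}_n^{w}(1), \mathrm{UD}_n^{w}(2), \ldots, \mathrm{UD}_n^{w}(288) \right)$$
where $S^{\mathrm{NCA}}_{n,w}$, $S^{\mathrm{CAI}}_{n,w}$, and $S^{\mathrm{UD}}_{n,w}$ are the signatures of GASs based on the NCA, CAI, and UD on a weekday or a weekend. The signatures are calculated by SPSS.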
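The averaging and intensity steps above can be illustrated with a minimal Python sketch. It assumes that the per-slot counts of one GAS have already been summed over its base stations; the function names, the toy counts, and the 2 km² area are hypothetical, and only three time slots are used instead of 288 to keep the example short.

```python
from datetime import date


def day_type(d):
    """Return 1 for a weekday (Mon-Fri) and 2 for a weekend day (Sat-Sun)."""
    return 1 if d.weekday() < 5 else 2


def average_profile(daily_counts, w):
    """Average the per-slot counts over all days of type w (1 or 2)."""
    days = [counts for d, counts in daily_counts.items() if day_type(d) == w]
    if not days:
        raise ValueError("no days of the requested type")
    return [sum(day[t] for day in days) / len(days) for t in range(len(days[0]))]


def intensity(avg_counts, area_km2):
    """Divide an averaged count profile by the GAS area (km^2)."""
    if area_km2 <= 0:
        raise ValueError("area must be positive")
    return [c / area_km2 for c in avg_counts]


# Hypothetical communication counts of one GAS (3 slots per day for brevity).
calls = {
    date(2015, 6, 1): [10, 20, 30],  # Monday
    date(2015, 6, 2): [30, 40, 50],  # Tuesday
    date(2015, 6, 6): [5, 5, 5],     # Saturday
}
cai_weekday = intensity(average_profile(calls, 1), area_km2=2.0)
cai_weekend = intensity(average_profile(calls, 2), area_km2=2.0)
```

Passing user counts instead of communication counts to the same two functions would give the UD profile. Dividing by the area is what makes profiles of GASs of different sizes comparable.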

3.3.3. Clustering Analysis

Unsupervised clustering with k-means requires the number of clusters k to be specified beforehand. The optimal number of clusters depends on whether close clustering or good separation is required. A validation method [46] can be used to select a better value of k.

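The cluster validity index is the ratio of the intracluster distance to the intercluster distance. The ideal classification will minimize the intracluster distance and maximize the intercluster distance, so a smaller value of the validity index indicates better classification. The cluster validity index is calculated as follows:
$$\mathrm{intra} = \frac{1}{N} \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - z_i \rVert^2,$$
$$\mathrm{inter} = \min_{i \neq j} \lVert z_i - z_j \rVert^2,$$
$$VI = \frac{\mathrm{intra}}{\mathrm{inter}},$$
where $\mathrm{intra}$ and $\mathrm{inter}$ denote the intracluster and intercluster distances; $VI$ is the validity index; $N$ is the number of GASs; $C_i$ is the set of signatures belonging to the cluster defined by centroid $z_i$; $x$ represents a signature, such as $S^{\mathrm{NCA}}$, $S^{\mathrm{CAI}}$, and $S^{\mathrm{UD}}$; and $i, j = 1, 2, \ldots, k$.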
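To make the ratio concrete, the following Python sketch clusters a few toy signatures with a plain k-means loop and evaluates the validity index. It is an illustration under stated assumptions, not the SPSS procedure used in the study: the squared Euclidean distance, the use of the minimum pairwise centroid distance as the intercluster term, the deterministic initialization at the first k points, and the two-dimensional toy signatures are all choices made for this sketch.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=50):
    """Plain Lloyd iterations; centroids start at the first k points."""
    centroids = [list(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: nearest centroid for every point.
        labels = [min(range(k), key=lambda i: sq_dist(p, centroids[i]))
                  for p in points]
        # Update step: mean of the members (an empty cluster keeps its centroid).
        for i in range(k):
            members = [p for p, lab in zip(points, labels) if lab == i]
            if members:
                centroids[i] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


def validity_index(points, labels, centroids):
    """Intracluster distance divided by intercluster distance (needs k >= 2)."""
    intra = sum(sq_dist(p, centroids[lab])
                for p, lab in zip(points, labels)) / len(points)
    inter = min(sq_dist(centroids[i], centroids[j])
                for i in range(len(centroids))
                for j in range(i + 1, len(centroids)))
    return intra / inter


# Hypothetical two-dimensional signatures forming two well-separated groups.
signatures = [(0, 0), (10, 0), (0, 2), (10, 2)]
labels, centroids = kmeans(signatures, k=2)
vi = validity_index(signatures, labels, centroids)
```

Repeating the last two lines for several values of k and keeping the k with the smallest validity index mirrors the selection rule described above. Because the result of k-means depends on the initial centroids, a practical run would normally try several initializations.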

Figure 1 shows the clustering results obtained with different values of k. The following can be inferred: (1) The validity value of the CAI data is smaller than that of the NCA and UD data, which indicates that clustering analysis based on CAIs results in a small intracluster distance and a large intercluster distance. This suggests a better clustering result and demonstrates that the size of the GAS has a significant impact on the recognition of urban functions. (2) The UD and NCA data do not provide good results, indicating that the NCAs or UDs may not have as great an impact on the clustering results as the CAIs, which can broadly distinguish the mechanisms of CDR data. There are many situations in which the communication base stations are triggered, including active triggering and passive triggering, cross-region triggering, and switching on-off. In practice, there may be great deviations in results if the communication activity is ignored. (3) Different k values produce different values of VI. The smallest value is given by the CAI data with k = 6. Thus, combined with some relevant research about the types of urban functions and the Chinese standard [47], five single UFAs (residential, commercial, park and scenery, office, and education areas) plus mixed areas are considered in this paper.

3.4. POI Model
3.4.1. POI Data Processing

The POI data were classified for modeling. First, any POI data unrelated to functional identification were removed, for instance, bus station data, exit and entrance data, and other POI data. Several POIs were then reclassified according to the Chinese standard [47]; in this study, the school POI data were divided into university, high school and middle school, and primary school and kindergarten. Residential areas were permitted to include some public facilities, such as primary schools, kindergartens, and convenience stores. In addition, office buildings, government agencies, and parking lots of office buildings were classified as office areas. Commercial areas were distinguished by supermarkets and shopping malls, hotels, and restaurants. The processing results of the POI data are presented in Table 5.
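The filtering and reclassification rules described above can be sketched as a simple lookup in Python. The raw category labels below are hypothetical (the labels used in the 2015 electronic map may differ), and assigning universities and high schools to an education class is an assumption of this sketch; only the groupings named in the text are included.

```python
# POI categories unrelated to functional identification are dropped.
DROPPED = {"bus station", "exit and entrance"}

# Hypothetical raw category label -> functional class.
FUNCTION_OF = {
    "university": "education",                     # assumed mapping
    "high school and middle school": "education",  # assumed mapping
    "primary school and kindergarten": "residential",
    "convenience store": "residential",
    "office building": "office",
    "government agency": "office",
    "office parking lot": "office",
    "supermarket and shopping mall": "commercial",
    "hotel": "commercial",
    "restaurant": "commercial",
}


def reclassify(pois):
    """Map (name, raw_category) pairs to (name, functional_class) pairs.

    POIs in DROPPED, and POIs whose category is not in FUNCTION_OF,
    are left out of the result.
    """
    result = []
    for name, category in pois:
        if category in DROPPED:
            continue
        function = FUNCTION_OF.get(category)
        if function is not None:
            result.append((name, function))
    return result


sample = [
    ("Stop 12", "bus station"),
    ("Campus A", "university"),
    ("Inn B", "hotel"),
    ("Kiosk C", "newsstand"),
]
cleaned = reclassify(sample)
```

Keeping the rules in plain dictionaries makes it easy to audit the reclassification against the standard and to adjust individual categories.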

3.4.2. POI Model

In general, nonparametric estimation, which is not affected by the overall parameters, is the most widely used method for determining the probability density. Moreover, it can be applied to any sample analysis. KDE is a nonparametric estimation method for the unknown probability density function. Thus, the POI data model uses KDE. The calculation can be expressed as follows:
$$f(s) = \sum_{i=1}^{n} \frac{1}{h^2} k\left( \frac{s - c_i}{h} \right),$$
where $f(s)$ is the KDE function at spatial position $s$; $h$ is a distance attenuation threshold; $n$ is the number of elements for which the distance is less than or equal to $h$ from location $s$; $c_i$ is the sample element; and $k$ is the spatial weighting function.

The two key parameters in the KDE function are the spatial weighting function $k$ and the distance attenuation threshold $h$. The choice of weighting function determines how the weights are distributed. The uniform function gives the same weight to all points within the scope of the study; the triangular function gives weights that decrease linearly with distance; the Epanechnikov function gives weights that decrease relatively slowly; and the Gaussian function has no boundaries, allowing weights to be assigned to all points. This study adopts an adaptive bandwidth for the KDE of the Gaussian kernel function [48], as this ensures better convergence and smoothness than the fixed-bandwidth KDE function.
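As an illustration of how a Gaussian kernel assigns weights to all POIs, the sketch below evaluates a density surface at one location in Python. It is a simplified fixed-bandwidth version, whereas the study uses an adaptive bandwidth [48]; the two-dimensional standard Gaussian kernel, the planar coordinates in kilometres, and the sample POI locations are assumptions of this sketch.

```python
import math


def gaussian_kernel(u):
    """Two-dimensional standard Gaussian kernel at scaled distance u."""
    return math.exp(-0.5 * u * u) / (2.0 * math.pi)


def kde_at(s, pois, h):
    """Sum the kernel contributions of all POIs at location s.

    Each POI at c contributes k(|s - c| / h) / h^2, with bandwidth h > 0.
    """
    if h <= 0:
        raise ValueError("bandwidth must be positive")
    return sum(gaussian_kernel(math.dist(s, c) / h) / (h * h) for c in pois)


# Hypothetical POI coordinates (km) clustered around the origin.
pois = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (3.0, 3.0)]
near = kde_at((0.0, 0.0), pois, h=0.5)
far = kde_at((1.5, 1.5), pois, h=0.5)
```

Evaluating `kde_at` on a regular grid gives a density surface that can then be summarized per TAZ. A larger bandwidth smooths the surface, while a smaller one preserves local peaks.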

3.5. Recognition Procedures

The identification procedure is illustrated in Figure 2. First, based on CDR data, we calculate the parameters required for the index calculations. Second, the characteristics of the CDR clustering results are analyzed, including the weekday and weekend features, number of peak values, intensity of peak values, and distribution of peak values; this allows the travel behaviors and public cognitions to be understood. Third, based on the clustering of POI data, the UFA identification results are modified. Fourth, verification is conducted using the BHRI data, field data, and the identification index. Finally, we obtain the final UFA results.

4. Case Study and Discussion

4.1. Results and Discussion

To recognize the UFAs, the clustering results of the CDR data and the POI data are shown in Figures 3 and 4, respectively. The characteristics of residents’ travel behavior and public cognition are now introduced to explain the signatures. UFA identification based on Figure 3 is also discussed.

Cluster 1: the main feature of this cluster is that it has a very high CAI on a weekday and an obvious double peak concentrated at 08:00–11:00 and 14:00–16:00. The CAI in the morning is greater than that in the afternoon. Furthermore, there is still a certain level of CAI between 18:00 and 21:00, which indicates that some people are still working during this period. The CAIs of this cluster are lower on the weekend than on weekdays, and the remaining weekend activity is probably the result of overtime being worked. However, the value of CAI begins to decrease at 16:00 and is very low after 17:00, which suggests that work is much more flexible on weekends than on weekdays, so employees can leave their offices early. Based on this analysis, cluster 1 is considered to represent office areas.

Cluster 2: this cluster is characterized by the fact that the CAIs on the weekend are higher than those on weekdays, and there is no double peak on weekdays. In contrast, there is a peak activity from 15:00 to 17:00 on weekends, which indicates that people in these areas use their mobile phones to contact friends, fellow travelers, or drivers to arrange their journey home. Combined with the POI results in Figure 4(a), we can infer that this is the signature of park and scenery areas.

Cluster 3: this cluster has the obvious feature that the CAI values on workdays and weekends are relatively low. Additionally, there is a double peak on weekdays and higher CAIs than on weekends. However, no double peak occurs at the weekend. This can be explained by residents working at home on workdays and taking a nap after lunch. In contrast, people who are enjoying their leisure time do not need a specific period of rest. There are around 500 calls/km² on weekdays, which might be to invite friends or clients to dinner. Thus, these GASs are likely to be residential areas.

Cluster 4: in this cluster, a notable double peak occurs on the left side and has a higher value on weekdays than on weekends. However, the CAIs on both workdays and weekends are not especially high. The trends on weekdays and weekends are similar after 19:00, and the intensity values are only slightly different between day and night, which indicates that it is mainly young people living and working here. In conclusion, this kind of schedule suggests universities and high schools. With the help of the POI data in Figure 4(b), we can firmly conclude that these are education areas.

Cluster 5: the fifth cluster type features a slight difference in intensity between workdays and weekends, and there is a double peak on working days. Furthermore, high CAIs are maintained from 08:00 to 21:00 and longer into the night on weekends. Although the CAI values decline on both weekdays and weekends after 21:00, the activity during this period is stronger and lasts longer on weekends than on weekdays. All of these features are more likely to occur in commercial areas.

Cluster 6: with the lowest CAI values on weekdays and almost the highest values (albeit with significant fluctuations) on weekends, this cluster cannot be accurately summarized, especially at weekends. At the same time, no travel behaviors or human activities can fully explain this pattern. Thus, this region is tagged as a mixed area.

The cluster members of each signature, as calculated by SPSS, were displayed in GIS, and the spatial distribution of the UFA recognition results is shown in Figure 5.

According to the UFA recognition results in Figure 5, several conclusions can be drawn. First, the residential areas have a high density of occupation and are widely distributed. However, the distribution of park and scenery areas is relatively concentrated. Second, most of the GASs south of metro line 1 are residential areas; in contrast, the educational areas are largely located to the north of metro line 1, which may result in tidal traffic situations. As a result, the passenger flow on north-south subway lines (e.g., metro lines 4 and 5) is very high. Third, office areas are mainly distributed around and between metro lines 6 and 1. This places significant traffic pressure on these metro lines, with the spatial and temporal characteristics of passenger flow making for heavy daily average passenger numbers. Fourth, the concentrated distribution of park and scenery areas, especially in urban central areas, brings greater centripetal traffic pressure to the urban traffic operations and management. This phenomenon is particularly remarkable on holidays and at weekends.

4.2. Result Verification

To check the accuracy of the results, qualitative and quantitative verifications are applied based on the recognition results, field survey results, and BHRI map data. The following are typical GASs considered in this analysis: Temple of Heaven Park, Financial Street, Beijing Institute of Technology (BIT), Fangzhuang Region, Qianmen Street, and Beijing South Railway Station.

4.2.1. Qualitative Verification

In terms of qualitative verification, we consider some typical GASs and field survey data, as well as the BHRI map. Representations of the six clusters are discussed in Table 6.

4.2.2. Quantitative Analysis

In the quantitative analysis of the proposed methodology, the mixed areas (about 8.5% of the total area) are neglected because more than one functional component is present. For those areas with a sole functional result, the identification index is defined as the ratio of the area covered by that function to the whole area of the GAS. This is schematically illustrated in Figure 6. This index can be used to represent the accuracy of recognition. The actual function area is calculated using the field survey results and the BHRI map, and the area of each GAS is computed by GIS. The identification index is computed as follows:
$$I_i = \frac{S_i}{A_i} \times 100\%,$$
where $i$ is the cluster type; $I_i$ is the identification index of type $i$; $S_i$ is the actual function area of cluster $i$ (km²); and $A_i$ is the area of the GAS in which $S_i$ is located (km²).
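The ratio defined above can be checked with a short Python sketch. The function name and the areas used in the example are hypothetical and are not taken from Table 7; the function returns a fraction, which is multiplied by 100 to give a percentage.

```python
def identification_index(function_area_km2, gas_area_km2):
    """Share of a GAS actually covered by its identified function (0 to 1)."""
    if gas_area_km2 <= 0:
        raise ValueError("GAS area must be positive")
    if not 0 <= function_area_km2 <= gas_area_km2:
        raise ValueError("function area must lie between 0 and the GAS area")
    return function_area_km2 / gas_area_km2


# Hypothetical GAS of 0.50 km^2, of which 0.48 km^2 is actually parkland.
index = identification_index(0.48, 0.50)
print(f"identification index: {index:.2%}")
```

The range check guards against the actual function area exceeding the GAS area, which would signal inconsistent measurements between the field survey and the GIS layer.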

The lowest identification index was found to be 63.16% for the commercial area, which is higher than the overall accuracy obtained in previous studies [49, 50]. The dynamic needs of urban planning and management can be satisfied if the identification index is above 60% [50]. Thus, the identifications are of practical significance because the results are all above 60%. The average identification index is 78.30%, which exceeds the mean value achieved in previous research and indicates the improvement offered by this study.

As Table 7 shows, the park and scenery areas have the highest identification index of 96.00%. This can be explained by the fact that the GASs or TAZs were considered when these functions were divided; furthermore, it shows the significance of choosing reasonable GASs before identifying the urban functions.

The next-highest identification index values are given by the residential areas and office areas. The POIs of multitype residential facilities (e.g., kindergartens, drug stores, and convenience stores) are very helpful in identifying residential areas. Moreover, there are very high CAIs in office areas, so an impressive identification index can be achieved.

The education areas and commercial areas have lower identification index values. This can be explained by the many hotels for conference attendees and apartments for school staff around the education areas; likewise, with complex land use close to commercial areas, people come and go but do not stay long, which affects the CAIs to a certain extent.

5. Conclusions and Future Work

The tendency toward integrated urban planning and management requires dynamic recognition of UFAs. However, the selection of GASs in the previous research has nonnegligible effects on the identification of UFAs. In this study, three indexes of CDR were presented, and the concept of CAI was selected as the main focus of the study. Moreover, POI data were found to be very helpful in identifying UFAs. Thus, k-means clustering for CDR data and the KDE method for POI data were applied to the region within Beijing’s 3rd Ring Road. It is worth noting that the proposed method could be used with a combination of other information, such as SCD data or blog check-in data, which contain POI data. In the final UFA identification results, the park and scenery areas were identified most accurately. The average identification index was about 78.30%, higher than in previous research.

The findings of this study are conducive to dynamic urban management and planning. Note that the proposed method has not yet been applied to the whole city in a case study. In addition, urban planning theories and related planning data should be considered in future research on UFA identification. Further research may also focus on the application of new technologies in big data mining, such as deep learning and machine learning, which can provide reliable information for the integration of planning and management.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Disclosure

The manuscript abstract was previously presented at the Transportation Research Board 98th Annual Meeting.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

The authors would like to acknowledge the financial support for this study provided by the National Key R&D Program of China (2017YFC0803903 and 2016YFE0206800), Transportation Technology Project of Henan Province (2017Z8), Beijing Natural Science Foundation (no. L181001), Key Program of Beijing Natural Science Foundation (no. 4181002), Science and Technology Project of Beijing Municipal Transportation Commission (201825-HNBJ2), and Project of Beijing Municipal Science & Technology Commission (Z191100002519002).