With advances in smartphones and global positioning system (GPS) devices, travelers’ long-term travel behavior can now be observed. This study investigates the pattern of individual travel behavior and its correlation with social-demographic features. For different social-demographic groups (e.g., full-time employees and students), individual travel behavior may be subject to specific temporal-spatial-mobile constraints. The study first extracts home-based tours, including Home-to-Home and Home-to-Non-Home tours, from long-term raw GPS data. The travel behavior pattern is then delineated by home-based tour features, such as departure time, destination location entropy, travel time, and driving time ratio. Travel behavior variability describes the variance of these tour features over an extended period. The variability pattern of an individual’s travel behavior is then used to estimate the individual’s social-demographic information, such as social-demographic role, with a supervised learning approach, the support vector machine. A long-term (18-month) GPS data set from the Puget Sound Regional Council is used. The experimental results are promising: the overall accuracy of employment status prediction reaches 94.95%. The sensitivity analysis shows that as the number of tours threshold increases, the variability of most travel behavior features converges, while the prediction performance does not change much for a fixed test set.

1. Introduction

Activity-travel behavior pattern analysis includes the identification of activity patterns, such as types, duration, sequence, and locations, and the recognition of travel behavior patterns regarding departure time, travel time, and travel types, such as commuting and noncommuting. It is one of the most fundamental research topics for many real-world applications, including Active Traffic and Demand Management (ATDM), Mobility-as-a-Service, and transportation demand management. The activity-travel behavior pattern is derived from either manually collected traveler activity diaries in travel surveys or passively obtained data, such as global positioning system (GPS) trajectory data [1–5], geolocation data [6, 7], and transit smart card data [8, 9].

Travel demand management, such as ATDM, aims to reduce traffic demand or to redistribute it temporally or spatially [10]. There are “hard” and “soft” strategies [11]. The “hard” strategies, also called hard policy measures, use penalties to enforce travel behavior changes [12]; examples include road pricing [13], toll roads [14–16], and parking pricing [17]. The “soft” measures fall into two categories. The first offers traffic information to influence travelers’ decisions without forcing behavior change [18]. Implementation cases include a comparison study of passengers’ travel choice behavior under an altered train timetable, proposed by Kusakabe in Japan [19, 20]; a dynamic ridesharing service, Virtual Bus, in Italy [21]; and a Predict-a-Trip traffic information forecast program in San Francisco [22]. The second category of “soft” measures uses incentives to influence traveler behavior and has recently attracted attention worldwide. A study in Germany showed an increase in bus use after prepaid bus tickets were offered [23]. An early bird, free-ticket program in Melbourne, Australia, aimed to mitigate rail overcrowding and to shift demand from peak to nonpeak hours [24]. In 2013, a 10-week pilot study was conducted by Metropia in the Los Angeles area using an incentive-based activity demand management smartphone app [10], and significant travel behavior changes, including departure time choices and route options, were observed.

For incentive strategies, the challenge is that the travel patterns and social-demographic features of the target users are not entirely understood. Some ATDM programs use incentives to influence travelers in specific groups [25, 26], such as transit riders, while others apply incentives to general auto drivers directly [10]. Distributing limited incentive resources across a large number of general travelers may not be an efficient way to influence travel behavior. To stimulate travelers to change their travel behavior efficiently and effectively, an incentive strategy in ATDM needs to recognize travelers’ social-demographic information, such as social-demographic roles and the associated travel patterns, and to dispatch incentives to targeted individuals or specific groups accordingly.

However, collecting travelers’ social-demographic information is not trivial. The most common method is collecting an activity diary in a traffic survey, using paper-based questionnaires or telephone interviews [27]. Such surveys usually recruit only a small number of participants for a short period (days or weeks) and suffer from high cost, heavy labor requirements, and unguaranteed accuracy. Fortunately, with the prevalence of location-aware devices, such as smartphones and GPS-enabled devices, the long-term (months or years) continuous collection of individualized trajectory data offers an unprecedented opportunity to gain insight into a traveler’s daily travel pattern. In particular, the GPS data provided by smartphone apps, such as Uber [28], Google, and Metropia [10], and the instrumented data derived from GPS devices mounted in vehicles are among the latest sources for this new information collection mechanism. Rich information relevant to one’s travel behavior is embedded in such long-term, continuously collected raw GPS data. Nevertheless, extracting travel behavior patterns from raw GPS trajectory data and using them to predict an individual’s social-demographic role remain challenging.

Travel behavior variability describes the variance of travel behavior over an extended period and has recently been recognized and studied [29–31]. Some researchers focus on the temporal variability of travel behavior characteristics, such as daily travel time [4, 32, 33]. The spatial variability (e.g., of activity locations), in which travelers either repeat or vary their location choices over days, has also been studied [32, 34]. In addition to temporal and spatial variability, mobile variability, such as driving time ratio variance and travel time variance, describes the individual’s movement characteristics. The temporal-spatial-mobile variability reflects travel characteristics with respect to time, space, and mobility. It is directly correlated with a traveler’s demographic features, especially the social-demographic role (i.e., employment status), such as full-time employee, part-time employee, student, or retired worker [35, 36]. For example, a full-time employee is usually a daily commuter from home to work with tight departure time and destination constraints. The commuter may not have much flexibility to stop during the trip or to detour onto a different route. On the other hand, a retired worker may not be a regular commuter and has looser temporal-spatial-mobile restrictions.

This study proposes a social-demographic role prediction framework based on individuals’ travel behavior variability. It first extracts travel behavior variability from a long-term GPS data set. The travel behavior variability is decomposed into three dimensions: temporal, spatial, and mobile. The temporal dimension represents the departure time variability, and the spatial dimension indicates the destination location variability. The fluctuations of trip travel time and driving time ratio form the mobile variability dimension. In this study, the travelers’ home sites are detected from the raw GPS data. Then, the home-based tours and the travel behavior variability are produced. Next, the travel behavior variability is fed into a supervised machine learning model (support vector machine) to predict travelers’ social-demographic roles. The study builds upon the Puget Sound Regional Council 2004–2006 household survey data, which are provided by the National Renewable Energy Laboratory’s Transportation Secure Data Center [37]. The data set includes 18 months of continuous GPS tracking of 450 surveyed vehicles from 275 households and the individual traveler demographic information from the travel survey. This complete data set is used not only to extract the travel behavior variability pattern of the survey respondents from their long-term continuous GPS data but, more importantly, to cross-reference the traditional household survey data and build machine learning models for social role prediction. Other social-demographic variables, such as income, age, and gender, are also tested to understand the general performance of the proposed social-demographic prediction model. Additionally, this study conducts a sensitivity analysis, which investigates the impact of the data collection criterion (i.e., number of tours) on tour variability and social-demographic character prediction.
The major features and contributions of this research are summarized below:
(i) This study proposes an individual social-demographic role prediction model based on travel behavior variability. The travel behavior variability and its correlation to the social-demographic role are explored.
(ii) A sensitivity analysis of the sampling threshold for a long-term data set reveals how the travel behavior variability and social-demographic role prediction change under different data sampling thresholds.

This research is expected to provide a practical process framework that takes full advantage of emerging data (i.e., continuous GPS tracking data) and integrates them into existing modeling or behavior-related research and applications. The remainder of the paper is organized as follows. The details of travel behavior variability extraction and the social role prediction method are introduced in Methodology. Case Study and Discussion describes the experimental details and results on the testing data set, together with the sensitivity study of the impact of data collection on travel behavior variability and social-demographic role prediction. Finally, Conclusions closes the paper and summarizes the principal findings.

2. Methodology

The framework of the proposed social-demographic role prediction method is shown in Figure 1. The framework includes three modules: (1) GPS trajectory preprocessing and tour extraction, (2) tour features and variability generation, and (3) social-demographic role prediction. First, it preprocesses the raw GPS data and determines travelers’ home locations. Based on that, the home-based tours from the original trajectories are detected. Then, the method extracts the tour variability by the tour features and variability generation procedure. Next, the individual tour variability data set is fed into the social role prediction module to estimate social-demographic roles. The methodology details are illustrated in the following sections.

2.1. GPS Trajectory Preprocessing and Tour Extraction

The GPS trajectory preprocessing aims to remove outliers from vehicle-instrumented GPS data. Initially, a data cleaning and smoothing process derived from Schüssler’s raw GPS data-processing procedure [38] is carried out to address GPS system errors, such as warm/cold start problems, and random errors, such as urban canyon errors. Several criteria are used for removing system errors, such as the number of satellites, the ground elevation, and the distance between consecutive GPS points. A further procedure detects repeated measurements, that is, two or more consecutive GPS points that record nearly the same coordinates and zero or almost zero travel speed. Such repeated measurements are represented by a single point: for example, a vehicle stopped in front of a red light is represented by one GPS point rather than by duplicated measurement points. After the data cleaning and filtering processes have been applied, most of the outliers are removed, and the GPS trajectories are ready for use.
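As a rough illustration of the repeated-measurement step, the sketch below keeps a single point for each run of consecutive fixes that are nearly co-located and nearly stationary. The function name, the use of projected metric coordinates, and the tolerance values are assumptions made for this example; they are not taken from the procedure in [38].

```python
from math import hypot


def collapse_repeats(points, dist_tol=5.0, speed_tol=0.5):
    """Keep one point per run of consecutive near-identical, near-stationary fixes.

    points: list of (x_m, y_m, speed_mps) tuples in a projected metric frame.
    dist_tol (meters) and speed_tol (m/s) are illustrative assumptions.
    """
    kept = []
    for x, y, speed in points:
        if kept:
            px, py, pspeed = kept[-1]
            same_place = hypot(x - px, y - py) <= dist_tol
            stationary = speed <= speed_tol and pspeed <= speed_tol
            if same_place and stationary:
                continue  # repeated measurement, already represented by kept[-1]
        kept.append((x, y, speed))
    return kept


# Hypothetical trace: moving, stopped at a light (three fixes), moving again.
trace = [(0, 0, 10), (100, 0, 0.1), (101, 0, 0), (100.5, 0.2, 0), (200, 0, 8)]
cleaned = collapse_repeats(trace)
```

Comparing each new fix against the last kept point, rather than the last raw point, prevents a slowly drifting stationary cluster from being split into several points.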

For continuous GPS trajectory data, it is not hard to detect the home location and then to generate home-based tours. The places that the user has visited (as a departure or arrival location) are clustered, and the centroids of the three most visited clusters that are at least 1 mile away from each other are taken as candidates, since they are the most likely to be home or other regular locations, such as the workplace. According to the characteristics of the trips related to these sites, such as the departure location of the first trip of the day, the arrival location of the last trip of the day, and the duration of the stay (e.g., more than 8 hours) at the site, the home location can be identified.
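The text does not spell out how the three cues are combined, so the following is only one possible sketch: each candidate place receives a vote whenever it is the first departure location of a day, the last arrival location of a day, or the site of a long stay, and the candidate with the most votes is taken as home. The record layout, the function name, and the equal weighting of cues are assumptions made for illustration.

```python
def detect_home(day_records, candidates, min_stay_hours=8.0):
    """Pick the candidate place with the most home-like evidence.

    day_records: list of dicts with keys 'first_origin', 'last_destination',
        and 'stays' (place id -> longest stay that day, in hours).
    candidates: place ids of the most visited clusters (assumed precomputed).
    """
    votes = {c: 0 for c in candidates}
    for day in day_records:
        for c in candidates:
            if day["first_origin"] == c:
                votes[c] += 1  # first trip of the day departs from here
            if day["last_destination"] == c:
                votes[c] += 1  # last trip of the day arrives here
            if day["stays"].get(c, 0.0) >= min_stay_hours:
                votes[c] += 1  # long stay at this site
    return max(candidates, key=lambda c: votes[c])


# Hypothetical two-day history with three candidate places.
days = [
    {"first_origin": "A", "last_destination": "A", "stays": {"A": 12.0, "B": 8.5}},
    {"first_origin": "A", "last_destination": "C", "stays": {"A": 10.0, "B": 9.0}},
]
home = detect_home(days, ["A", "B", "C"])
```

Here the workplace-like place "B" also has long stays, but only "A" combines long stays with first-departure and last-arrival evidence.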

After determining the home location, the individual home-based tours, Home-to-Home (HH) and Home-to-Non-Home (HN), can be produced. An HH tour is defined as the traveler departing from and returning home with a reasonable travel time during the day (such as 3 hours). An HN tour is travel during the day that departs from home and arrives at any other location, such as the workplace. A traveler may make more than one HH or HN tour in a day; this is especially common for HH tours.

2.2. Tour Features and Variability Generation

In this study, a home-based tour, either HH or HN, may comprise one or more consecutive trips, each of which is described by departure time, destination location, driving time, and travel time. Similar to a trip, a tour has features encompassing departure time, destination location, driving time ratio, and travel time. The tour departure time is the departure time of the first trip of the tour, which is a temporal travel behavior feature. The spatial feature, tour destination location, includes the in-tour trip destination locations and the tour destination location, and is represented by a set of position coordinates. The tour travel time is defined as the total elapsed time (in minutes) from the tour origin (i.e., home) to the destination (i.e., home or another location), while the tour driving time ratio is calculated as the total driving time of all trips divided by the tour travel time. Both tour travel time and tour driving time ratio are mobile features, also described as the “degree of trip chain.” The travel behavior variability is derived from the variance pattern of the tour features during the data collection period.

For a traveler $i$ at the $k$th tour, $x_{i,k}^{m,f}$ represents tour feature $f$ of a home-based tour of type $m$ ($m \in \{\mathrm{HH}, \mathrm{HN}\}$; $f \in \{1, 2, 3, 4\}$, with 1: departure time; 2: destination location; 3: travel time; 4: driving time ratio), and $v_{i}^{m,f}$ denotes the tour variability of feature $f$, derived from $K$ tours. The descriptions of tour features and variability variables are listed in Table 1. The details of the tour features and variability are elaborated in the following parts.

2.2.1. Temporal Feature and Variability

The tour temporal feature is the tour departure time, which is converted into a 15-minute time slot index counted from the beginning of the day (00:00) to describe the departure time within a day numerically. Any time of day can thus be expressed as an integer slot index ranging from 0 to 95. The tour temporal variability is defined as the standard error of the sample mean (SEM) of the departure time slot index across $K$ tours of type $m$. With $x_{i,k}^{m,1}$ denoting the departure time slot index of traveler $i$ at the $k$th tour of type $m$, the variability of the departure time feature (i.e., departure time SEM) is

$$v_{i}^{m,1} = \frac{1}{\sqrt{K}} \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left( x_{i,k}^{m,1} - \bar{x}_{i}^{m,1} \right)^{2}}, \quad (1)$$

where $\bar{x}_{i}^{m,1} = \frac{1}{K} \sum_{k=1}^{K} x_{i,k}^{m,1}$ is the sample mean of the departure time slot index.
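The slot conversion and the departure time SEM can be sketched in a few lines. This is a minimal example that assumes the conventional SEM definition (sample standard deviation with an n − 1 denominator, divided by the square root of n); the departure times are made up for illustration.

```python
from math import sqrt


def slot_index(hour, minute):
    """Map a time of day to its 15-minute slot index (0 to 95)."""
    return (hour * 60 + minute) // 15


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(values)
    if n < 2:
        raise ValueError("SEM needs at least two observations")
    mean = sum(values) / n
    sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return sqrt(sample_var) / sqrt(n)


# Hypothetical departure times (hour, minute) of four tours of one type.
departures = [(7, 30), (7, 45), (8, 0), (7, 30)]
slots = [slot_index(h, m) for h, m in departures]
departure_sem = sem(slots)
```

A commuter with tightly clustered departures, as in this example, ends up with an SEM well below one slot, whereas widely scattered departures would push the value up.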

2.2.2. Spatial Feature and Variability

The spatial feature is represented by the destination locations, which are the destinations of all trips in the tour. For example, although the HH tours have a fixed origin and destination (i.e., home), an HH tour may include multiple trips with different purposes, such as grocery shopping trips, children-pickup trips, or social trips. They may have different destination locations. For HN tours, except for the tour destination variation, the in-tour trip destination locations may vary a lot like the HH tours. To numerically describe the variability of the destination locations, Shannon’s entropy [34, 39] is used in this study.

First, for individual $i$ and tour type $m$, all destination locations from the $K$ tours are collected. The destination locations of individual $i$ for the $K$ tours are represented as a random variable $L_{i}^{m}$, where the locations are denoted as $l_{1}, l_{2}, \ldots, l_{N}$, and $N \geq K$ because a tour may have more than one trip. A clustering procedure merges close destination locations into clusters whenever the distance between two locations is less than 1 km. After the location merging and clustering procedure, the clusters’ centroid locations for a traveler are collected as $c_{1}, c_{2}, \ldots, c_{J}$. The location variability of individual $i$ for the $K$ tours can be measured as the entropy

$$v_{i}^{m,2} = -\sum_{j=1}^{J} p_{i}^{m}(c_{j}) \log p_{i}^{m}(c_{j}), \quad (2)$$

where $p_{i}^{m}(c_{j})$ is the historical probability of individual $i$ visiting cluster location $c_{j}$ during the $K$ tours of tour type $m$. The property of Shannon entropy indicates that if a traveler repeatedly visits a single location, the location variability of the individual equals zero, while a larger value of $v_{i}^{m,2}$ results from regular visits to a larger number of locations.
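To make the clustering and entropy steps concrete, here is a small sketch. It makes several assumptions that the text does not fix: locations are given in planar kilometer coordinates, clusters are formed greedily by comparing each point with existing cluster centroids, and the logarithm is taken to base 2. The paper's actual clustering procedure may differ.

```python
from math import hypot, log2


def cluster_locations(points_km, radius_km=1.0):
    """Greedy clustering: join the first cluster whose centroid is within radius_km."""
    clusters = []  # each cluster is a list of member points
    for p in points_km:
        for members in clusters:
            cx = sum(q[0] for q in members) / len(members)
            cy = sum(q[1] for q in members) / len(members)
            if hypot(p[0] - cx, p[1] - cy) < radius_km:
                members.append(p)
                break
        else:
            clusters.append([p])  # no nearby cluster found, start a new one
    return clusters


def location_entropy(clusters):
    """Shannon entropy (base 2) of the visit frequencies over clusters."""
    total = sum(len(m) for m in clusters)
    return -sum((len(m) / total) * log2(len(m) / total) for m in clusters)


# Hypothetical destinations: two visits near one place, two near another.
destinations = [(0.0, 0.0), (0.2, 0.1), (5.0, 5.0), (5.3, 5.1)]
clusters = cluster_locations(destinations)
entropy = location_entropy(clusters)

# A traveler who always visits the same place has zero location variability.
single_place_entropy = location_entropy(cluster_locations([(1.0, 1.0)] * 4))
```

With two equally visited clusters the entropy is one bit, and it grows as visits spread over more clusters.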

2.2.3. Mobile Features and Variability

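The mobile features reflect the vehicle movement behavior and travel property. They are delineated by travel time and driving time ratio. The variability of tour travel time is defined as the SEM of the tour travel time over $K$ tours. Similarly, the driving time ratio variability is calculated as the SEM of the tour driving time ratio over $K$ tours. The details are given by (3):

$$v_{i}^{m,f} = \frac{1}{\sqrt{K}} \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left( x_{i,k}^{m,f} - \bar{x}_{i}^{m,f} \right)^{2}}, \quad f \in \{3, 4\}, \quad (3)$$

where $x_{i,k}^{m,f}$ is the travel time ($f = 3$) or driving time ratio ($f = 4$) of traveler $i$ at the $k$th tour of type $m$, and $\bar{x}_{i}^{m,f}$ is its sample mean over the $K$ tours.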
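The two mobile features and their SEM values can be computed as in the sketch below. The trip representation (departure and arrival in minutes from midnight) and the example tours are assumptions for illustration, and the SEM again uses the conventional n − 1 sample variance.

```python
from math import sqrt


def tour_mobile_features(trips):
    """Return (travel_time_min, driving_time_ratio) for one tour.

    trips: ordered (depart_min, arrive_min) pairs, in minutes from midnight.
    """
    travel_time = trips[-1][1] - trips[0][0]  # total elapsed time of the tour
    driving_time = sum(arrive - depart for depart, arrive in trips)
    return travel_time, driving_time / travel_time


def sem(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    n = len(values)
    mean = sum(values) / n
    sample_var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return sqrt(sample_var) / sqrt(n)


# Hypothetical tours: two with an intermediate stop, one driven directly.
tours = [
    [(480, 500), (510, 530)],  # 50 min elapsed, 40 min driving
    [(485, 525)],              # 40 min elapsed, 40 min driving
    [(475, 490), (520, 535)],  # 60 min elapsed, 30 min driving
]
features = [tour_mobile_features(t) for t in tours]
travel_time_sem = sem([f[0] for f in features])
driving_ratio_sem = sem([f[1] for f in features])
```

A driving time ratio of 1.0 means the tour was driven without stops; longer or more frequent stops lower the ratio.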

2.3. Social-Demographic Role Prediction

After collecting individuals’ variability variables, with the individuals’ social-demographic role labels as the ground truth data, a supervised machine learning model describing the correlation between travel behavior variability and social-demographic role can be developed. The eight variability variables are the independent features defining an individual’s travel behavior variability pattern, and the ground truth social-demographic role is used as the dependent variable. The support vector machine (SVM) [36, 40] is a widely used supervised machine learning approach for multiclass and binary classification and prediction applications. SVM is known as a large margin classifier: it determines the decision hyperplanes that provide the largest possible margin among classes. The primal problem is formulated as

$$\min_{w, b, \xi} \; \frac{1}{2} w^{T} w + C \sum_{i=1}^{n} \xi_{i} \quad \text{s.t.} \quad y_{i} \left( w^{T} \phi(x_{i}) + b \right) \geq 1 - \xi_{i}, \;\; \xi_{i} \geq 0, \;\; i = 1, \ldots, n, \quad (4)$$

where $w$ is the weight vector of features that defines the decision boundary; $C \sum_{i=1}^{n} \xi_{i}$ is a regularization (or penalty) term that relaxes the objective function, where $\xi_{i}$ is the distance of point $i$ from the margin if it is misclassified, and $C$ is a constant coefficient that weights the penalty; $b$ is the intercept and $\phi(\cdot)$ is the data transformation function; $n$ represents the data sample size; and $y_{i}$ is the class label for data sample $i$ (i.e., −1 or 1 for binary classes). The dual problem is developed to help solve the constrained primal optimization problem,

$$\max_{\alpha} \; \sum_{i=1}^{n} \alpha_{i} - \frac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_{i} \alpha_{j} y_{i} y_{j} \mathcal{K}(x_{i}, x_{j}) \quad \text{s.t.} \quad \sum_{i=1}^{n} \alpha_{i} y_{i} = 0, \;\; 0 \leq \alpha_{i} \leq C, \;\; i = 1, \ldots, n, \quad (5)$$

where $\alpha_{i}$ is the Lagrange multiplier, which is the decision variable. The dual objective function is expressed through the kernel $\mathcal{K}(x_{i}, x_{j}) = \phi(x_{i})^{T} \phi(x_{j})$. The radial basis function kernel was suggested to be the most appropriate kernel [41, 42] and is used in this model. The dual problem solutions, the Lagrange multipliers $\alpha_{i}$, are used for predicting the data class by computing the decision function $f(x) = \sum_{i=1}^{n} \alpha_{i} y_{i} \mathcal{K}(x_{i}, x) + b$ for a sample $x \in \mathbb{R}^{d}$, where $d$ is the vector dimension number (eight travel behavior variability variables in this study). The binary classification is determined by the positive or negative value of the decision function $f(x)$.
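To show how a trained kernel SVM classifies a new sample, the sketch below evaluates a decision function of the usual form: a sum over support vectors of multiplier times label times kernel value, plus an intercept. All numbers (support vectors, multipliers, intercept, and the kernel width gamma) are hypothetical and chosen by hand; they are not fitted values from this study, and two features are used instead of eight to keep the example short.

```python
from math import exp


def rbf_kernel(u, v, gamma):
    """Radial basis function kernel: exp(-gamma * squared Euclidean distance)."""
    return exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def decision_function(x, support_vectors, labels, alphas, b, gamma):
    """Kernel-weighted sum over support vectors plus the intercept b."""
    return sum(
        a * y * rbf_kernel(sv, x, gamma)
        for sv, y, a in zip(support_vectors, labels, alphas)
    ) + b


def predict(x, support_vectors, labels, alphas, b, gamma):
    """Binary class from the sign of the decision function."""
    score = decision_function(x, support_vectors, labels, alphas, b, gamma)
    return 1 if score >= 0 else -1


# Hypothetical trained model with one support vector per class.
support_vectors = [(0.0, 0.0), (2.0, 2.0)]
labels = [-1, 1]
alphas = [1.0, 1.0]  # satisfies the dual constraint sum(alpha_i * y_i) == 0
b, gamma = 0.0, 0.5

near_positive = predict((1.8, 2.1), support_vectors, labels, alphas, b, gamma)
near_negative = predict((0.1, -0.2), support_vectors, labels, alphas, b, gamma)
```

Because the RBF kernel decays with distance, each sample is dominated by the support vectors closest to it, which is what lets the classifier draw nonlinear boundaries in the variability feature space.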

3. Case Study and Discussion

The Puget Sound Regional Council traffic choices study was an 18-month (2004 to 2006) study of travel behavior in response to road use. With 450 vehicles from over 275 households, the raw GPS trajectory data indicated that more than 4.5 million vehicle miles were traveled. Travelers’ social-demographic features were collected as well. The National Renewable Energy Laboratory’s Transportation Secure Data Center [37] summarized the data with high-resolution GPS trajectory data and traditional household survey data. In this experiment, the home-based tour features are extracted from the raw GPS data and, based on that, the variability of tour features is generated. In conjunction with the collected individual social role data, and taking the tour variability features as the independent variables, an SVM-based prediction model is developed and validated. A sensitivity analysis of the number of tours threshold, based on the experiment data, indicates how the threshold impacts tour variability and prediction.

3.1. Experiment
3.1.1. Case Study and Variability Observations

After the raw data were preprocessed and incomplete records were removed, a total of 218 individuals had complete variability variables for at least five HH or HN tours together with social-demographic information. For those 218 individuals, the distributions of the number of HH tours (green) and HN tours (red) per individual are illustrated in Figure 2. The mean number of HH tours is about 195, while the mean number of HN tours is about 155. The HH histogram is shifted toward the right-hand side, which implies that general travelers make more HH tours than HN tours. A likely reason is that a traveler can make more than one HH tour in a single day.

The individuals’ social-demographic roles (employment status) include six types: (1) full-time employee; (2) part-time employee; (3) student; (4) homemaker; (5) retired; and (6) other. Type 1 (full-time employees) dominates the other types. Because of this class imbalance, the original data set is also converted into a binary class data set with type 1 and type 0. The type 1 class is the original type 1 class, while the type 0 class combines type 2 through type 6. The type 1 class has 165 travelers; the type 0 class has 53 travelers. The tours’ variability variables of the binary class data set are discussed below. Table 2 lists the statistical details for type 1 and type 0.

The statistically significant variables are the HH tours departure time SEM, the HN tours departure time SEM, and the HN tours driving time ratio SEM.
(i) For HN tours, type 1 travelers have a significantly lower mean departure time SEM than type 0 travelers (2.51 versus 4.32). The reason is that type 1 travelers have tighter departure time restrictions on tours from home to other places, for example, the morning home-to-work commute.
(ii) For HN tours, the mean driving time ratio SEM for type 1 travelers is smaller than that for type 0 (0.06 versus 0.08), which indicates that the driving time ratio of type 1 travelers varies less than that of type 0 travelers. A possible explanation is that type 1 travelers are more dedicated to their trips and do not frequently stop during their tours.
(iii) For HH tours, the departure time situation is reversed. The mean departure time SEM of type 1 is 6.81, which is higher than that of type 0 (5.84). This indicates that type 1 travelers have slightly more departure time variability for HH tours.

3.1.2. Social-Demographic Role Prediction Result

In the prediction model, the SVM classification is implemented with the Python library scikit-learn (sklearn) using its default configuration and the radial basis function kernel. The multiclass and binary class prediction accuracy results are listed in Table 3, which reports two accuracy metrics. The recall is defined as the number of correctly estimated individuals over the total number of actual individuals of the class. The precision is the number of correctly estimated individuals over the total number of individuals estimated to be of the class.

From Table 3, the prediction results are promising, and the overall prediction accuracy reaches 94.95%. For multiclass prediction, the type 1 class has the highest recall (100%), while the type 3 student class has the worst recall (50%), with three students falsely labeled as full-time employees. From a precision perspective, all classes have high values. For binary class prediction, the recall of the type 1 class is still 100%, and 11 travelers from the type 0 class are predicted as type 1, which yields a type 0 recall of 79.25%. The precision values for type 1 and type 0 are 93.75% and 100%, respectively.
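The binary-case figures can be checked from the counts given in the text alone (165 type 1 travelers, 53 type 0 travelers, and 11 type 0 travelers predicted as type 1). The short calculation below is a worked check of the reported recall, precision, and overall accuracy; the variable names are chosen for this example.

```python
# Counts stated in the text for the binary case.
n_type1, n_type0 = 165, 53
type0_predicted_as_type1 = 11
type1_predicted_as_type0 = 0  # implied by the 100% recall of type 1

correct_type1 = n_type1 - type1_predicted_as_type0
correct_type0 = n_type0 - type0_predicted_as_type1

# Recall: correct predictions over actual members of the class.
recall_type1 = correct_type1 / n_type1
recall_type0 = correct_type0 / n_type0

# Precision: correct predictions over everything predicted as the class.
precision_type1 = correct_type1 / (correct_type1 + type0_predicted_as_type1)
precision_type0 = correct_type0 / (correct_type0 + type1_predicted_as_type0)

# Overall accuracy: all correct predictions over all travelers.
accuracy = (correct_type1 + correct_type0) / (n_type1 + n_type0)

summary = {
    "recall_type1": round(100 * recall_type1, 2),
    "recall_type0": round(100 * recall_type0, 2),
    "precision_type1": round(100 * precision_type1, 2),
    "precision_type0": round(100 * precision_type0, 2),
    "accuracy": round(100 * accuracy, 2),
}
```

The resulting percentages match the values quoted above: 79.25% recall for type 0, 93.75% precision for type 1, and 94.95% overall accuracy.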

One observation from the results is the relatively poor prediction performance for the type 2 to type 6 classes in the multiclass case and for type 0 in the binary case. This may be caused by the unbalanced data set and the limited sample size.

3.1.3. Income, Age, and Gender Prediction Results

In addition to the employment status, an individual’s other social-demographic variables, including income, age, and gender, are discussed in this study. Similar to the experiment results of employment status shown previously, the prediction results of those three variables (income, age, and gender) are shown in Tables 4, 5, and 6. The individual’s income is defined at five levels: (1) less than $25,000, (2) $25,000–$50,000, (3) $50,000–$75,000, (4) $75,000–$150,000, and (5) greater than $150,000. The individual’s age is categorized in different classes: (1) less than 21, (2) 22–34, (3) 35–44, (4) 45–54, (5) 55–65, and (6) greater than 65. The individual’s gender includes female and male.

The overall prediction accuracy values of the three variables (income level = 86.7%, age level = 83.03%, and gender = 90.83%) are still acceptable, although they are lower than the prediction accuracy of employment status (94.95%). This indicates that an individual’s employment status is easier to predict than the other variables. A likely reason is that employment status is more directly and closely correlated with travel behavior variability than the other social-demographic variables.

3.2. Sensitivity Analysis

The test data were collected over nearly 18 months, and for a data set collected over a long time, it is feasible to carry out a sensitivity analysis for the sampling threshold, that is, the number of tours. The sensitivity analysis investigates how the threshold impacts the tour variability and even social-demographic role prediction, aiming to answer the questions about the data collection sufficiency for travel behavior variability convergence and estimating the individuals’ social-demographic roles. As a comparison to the SVM model used in the study, another machine learning classification model, logistic regression (LR), is implemented in the analysis.

The number of tours threshold is defined as the required minimum number of tours, for both HH and HN, for a successful data collection, and it is varied over a range of values. For example, a value of 5 indicates that any traveler with fewer than five tours is disqualified and discarded, while travelers with five or more tours are qualified and retained. For each qualified individual, five tours are randomly selected from that traveler’s tour pool. If the threshold value is high, the number of individuals satisfying the number of tours requirement becomes small, and vice versa.
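The qualification and sampling rule can be sketched as follows. This is one reading of the description: a traveler qualifies only if both the HH and the HN pool reach the threshold, and the random draw is made separately for each tour type. The data layout, the function name, and the fixed random seed are assumptions made for the example.

```python
import random


def sample_at_threshold(tour_pools, threshold, seed=0):
    """Keep qualified travelers and draw `threshold` tours per tour type.

    tour_pools: dict mapping traveler id -> {"HH": [...], "HN": [...]}.
    Travelers with fewer than `threshold` tours of either type are discarded.
    """
    rng = random.Random(seed)  # fixed seed only to make the sketch repeatable
    sampled = {}
    for traveler, pools in tour_pools.items():
        if all(len(pools[t]) >= threshold for t in ("HH", "HN")):
            sampled[traveler] = {
                t: rng.sample(pools[t], threshold) for t in ("HH", "HN")
            }
    return sampled


# Hypothetical pools: traveler "b" has too few HN tours for a threshold of 5.
pools = {
    "a": {"HH": list(range(10)), "HN": list(range(6))},
    "b": {"HH": list(range(10)), "HN": list(range(3))},
}
sampled = sample_at_threshold(pools, threshold=5)
```

Raising the threshold shrinks the set of qualified travelers while giving each remaining traveler more tours, which is the trade-off the sensitivity analysis examines.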

3.2.1. Tour Variability

The tour variability variables plotted against the number of tours thresholds for type 1 and type 0 travelers are illustrated in Figure 3. For all diagrams, the x-axis is the number of tours threshold, and the y-axis is the variability variable; each curve represents an individual’s variability values across the number of tours thresholds. In the diagrams, all variability variables of type 1 and type 0 follow the same or similar patterns.

The HH and HN tours’ departure time SEM, travel time SEM, and driving time ratio SEM variability curves oscillate at the beginning (at number of tours thresholds below about 40) and then converge to small values. This indicates that a larger sample size reduces the variability of the tour features.

The destination location entropy of HH tours (Figures 3(c) and 3(d)) increases sharply at the beginning (up to about 40 tours) and then converges to large individual values for all people. This means that more samples bring more uncertainty to the destination locations in the small threshold range, after which the entropy stays constant at high thresholds. However, the destination location entropy of HN tours (Figures 3(k) and 3(l)) does not follow this pattern of significant increase and convergence.

Generally, for a large sample size, the variances of travel behavior features do not change much, and the variability values are low. According to the diagrams, one rule of thumb is that when the number of tours reaches about 40, the variances of travel behavior features stay constant at low values (except destination location entropy), and the travel behavior variability is more reliable and predictable.

Statistical analyses of the two types of travelers are conducted for all eight travel behavior variability variables to understand the variances of travel behavior features before and after the 40-tour threshold. The statistical results are listed in Table 7. The feature variability values for each type of traveler are separated into two groups, “less than or equal to 40” (≤40) and “greater than 40” (>40), according to the number of tours threshold. The sample sizes of the two groups are comparable for each type of traveler. For type 0, the “≤40” group has 379 measurements, while the “>40” group has 197 measurements. For type 1, the “≤40” group has 1,272 measurements, while the “>40” group has 847 measurements.

The standard deviation and mean values of the “>40” group for each type of traveler are significantly smaller than those of the “≤40” group for nearly all features, except for a few cases (e.g., HN location entropy and HN travel time SEM for type 1). In addition, the hypothesis tests are significant for almost all cases, except the HN travel time SEM for type 1 and the HH travel time SEM for type 0. This indicates that the two groups are statistically different for both types of travelers.

The statistical analysis results are consistent with the observations from Figure 3. They validate the conclusion that when the number of tours is more than 40, the variances of travel behavior features keep constant at low values (except for destination location entropies, which are at high values) compared to the cases which are within the “equal to and less than 40” group.

3.2.2. Social-Demographic Role Prediction

The sensitivity study includes logistic regression (LR) as a comparison prediction approach to the SVM model used in this study. The comparison focuses on the overall recall accuracy on the data set. Since the number of qualified individuals decreases as the number of tours threshold goes up, the varying sample set sizes at different thresholds may impact the prediction results. Figure 4 shows the decreasing trend of the number of qualified individuals as the number of tours threshold increases.

From Figure 4, we can see that, beyond a threshold of 90, the number of qualified individuals decreases more sharply and is mostly below 100. A fixed sample set, consisting of the 126 individuals qualified at threshold 90 (i.e., those with at least 90 tours), is used for the test in this study. These 126 individuals are a subset of the qualified individual sets at all thresholds from 5 to 90.

The sensitivity results for the fixed sample set, for number of tours thresholds ranging from 5 to 90, are illustrated in Figure 5. The SVM and LR prediction accuracy values stay roughly constant across all thresholds, while the average prediction accuracy of SVM (about 95%) is consistently higher than that of the LR model (about 84%).

The prediction results illustrate that requiring a larger number of tours for data collection does not significantly improve the prediction accuracy. Since the traveler type detection result depends heavily on the travel behavior variability differences between the two types of travelers, the same or similar travel behavior variability patterns of both types (observed in the diagrams in Figure 3) may explain the result. Although the variability of most travel behavior features converges as the number of tours threshold increases, the relative variability difference between type 1 and type 0 travelers may not change much across thresholds. For the fixed sample set (126 qualified individuals), the variability mean difference ratio between the two types, (type_1 − type_0)/type_1, is computed for each travel behavior variability variable at each threshold to describe the relative variability difference indirectly and to help interpret the prediction result statistically. Figure 6 illustrates the trend of the variability mean difference ratio for all eight variability features as the number of tours threshold increases. The variability mean difference ratios of most features (except (3) HH travel time SEM and (7) HN travel time SEM) stay low and roughly constant across all thresholds. This suggests that the relative variability differences of almost all features are not significantly changed by different thresholds, which partially explains why an increase in the number of tours threshold does not cause significant changes in social-demographic prediction, at least for this fixed sample set.
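The variability mean difference ratio is a simple quantity, and the sketch below computes it for one variability variable at one threshold. The function names and the example values are hypothetical; in practice the computation would be repeated for each of the eight variables and each threshold.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def mean_difference_ratio(type1_values, type0_values):
    """(mean of type 1 - mean of type 0) / mean of type 1 for one variable."""
    m1 = mean(type1_values)
    m0 = mean(type0_values)
    return (m1 - m0) / m1


# Hypothetical per-individual SEM values of one feature at one threshold.
type1_sem = [2.0, 3.0]
type0_sem = [4.0, 6.0]
ratio = mean_difference_ratio(type1_sem, type0_sem)

# Applied to the group means reported earlier for the HN departure time SEM
# (2.51 for type 1 versus 4.32 for type 0).
reported_ratio = mean_difference_ratio([2.51], [4.32])
```

A ratio that stays nearly constant across thresholds means the two groups keep the same relative separation even as both variability values shrink, which is consistent with the flat prediction accuracy.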

4. Conclusions

This paper proposes a social-demographic role prediction method based on the travel behavior variability pattern. It rests on the principle that different social groups have specific travel behavior patterns. The paper provides a way to formalize a traveler’s travel behavior variability pattern by analyzing long-term raw GPS data and to predict individuals’ social-demographic roles from that variability with a support vector machine model.

The method is applied to the Puget Sound Regional Council data set, which includes a long-term (18-month) GPS trajectory data set and a corresponding individual social-demographic data set. The variability derived from the data set indicates that (1) for HN tours, full-time employees have tighter departure time restrictions on tours from home to other places, for example, the morning home-to-work commute; (2) they are more dedicated to their trips and do not stop frequently; and (3) for HH tours, full-time employees have more departure time flexibility. Based on these travel behavior variability properties, the prediction accuracy rates for social-demographic features, including employment status, income, age, and gender, are evaluated. Among the social-demographic features, an individual’s employment status is the most closely related to travel behavior variability and can be predicted accurately. The impacts of sampling size (number of tours threshold) on the tour variability and the prediction accuracy are also studied through sensitivity analyses. The tour variability converges as the number of tours threshold increases. However, for the fixed sample set, the social-demographic role predictions do not change much as the number of tours threshold increases.

This study is a preliminary exploration of the possibility of using travel behavior variability to predict an individual’s social-demographic information. This prediction method helps to obtain social-demographic data for people whose activity data have been collected over the long term, without requiring traditional travel surveys. The sensitivity analysis can guide data collection and experiment design in future studies. However, this study has several limitations. The first is that there are only a few individuals in the test data set; a larger traveler sample size may improve the model’s performance. The second is that the model only considers home-based tours and a limited set of travel behavior variability attributes. More measures of travel behavior features and their variability, such as travel mode, trip purpose, and other types of tours (e.g., work-based tours), should be considered in future work.


The publisher, by accepting the article for publication, acknowledges that the U.S. Government retains a nonexclusive, paid-up, irrevocable, worldwide license to publish or reproduce the published form of this work or allow others to do so, for U.S. Government purposes.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this manuscript.


This work was supported by the U.S. Department of Energy under Contract no. DE-AC36-08GO28308 with Alliance for Sustainable Energy, LLC, the Manager and Operator of the National Renewable Energy Laboratory. Funding was provided by the Federal Highway Administration.