Over the last decade, demand for active transportation modes such as walking and bicycling has increased. While it is desirable to provide high levels of safety for these eco-friendly modes of travel, unfortunately, the overall percentage of pedestrian and bicycle fatalities increased from 13% to 18% of total road-related fatalities in the last decade. In San Diego County, although the total number of pedestrian and bicyclist fatalities decreased over the same period of time, a similar trend with a more drastic change is observed; the overall percentage of pedestrian and bicycle fatalities increased from 19.5% to 31.8%. This study aims to estimate pedestrian and bicyclist exposure and identify signalized intersections with highest risk for walking and bicycling within the city of San Diego, California, USA. Multiple data sources such as automated pedestrian and bicycle counters, video cameras, and crash data were utilized. Data mining techniques, a new sampling strategy, and automated video processing methods were adopted to demonstrate a holistic approach that can be applied to identify facilities with highest need of improvement. Cluster analysis coupled with stratification was employed to select a representative sample of intersections for data collection. Automated pedestrian and bicycle counting models utilized in this study reached a high accuracy, provided certain conditions exist in video data. Results from exposure modeling showed that pedestrian and bicyclist volume was characterized by transportation network, population, traffic generators, and land use variables. There were both similarities and differences between pedestrian and bicycle models, including different spatial scales of influence by mode. Additionally, the study quantified risk incorporating injury severity levels, frequency of victims, distance crossed, and exposure into a single equation. 
It was found that not all intersections with the highest number of pedestrian and bicyclist victims were identified as high-risk after exposure and other factors such as crash severity were taken into account.

1. Introduction

According to the Fatality Analysis Reporting System (FARS) encyclopedia, 818 cyclists and 5987 pedestrians were killed in traffic crashes in 2016, making up 18.2 percent of all crash fatalities. After some reduction between 2005 and 2009, pedestrian and cyclist fatalities have followed an increasing trend since 2009. The share of pedestrian and bicycle fatalities among total road-related fatalities increased from 13% to 18% in the last decade [1]. In San Diego County, although the total number of pedestrian and cyclist fatalities decreased over the same period, a similar but more drastic trend is observed: the share of pedestrian and bicycle fatalities increased from 19.5% to 31.8%. Statistics also show that more Americans have been walking and bicycling for commuting and recreation over the last decade [2]. Bicycle and pedestrian volume, known as exposure data, is an essential part of safety assessment. However, most existing bicycle and pedestrian networks are not equipped to routinely collect count data in the way that vehicular networks are (e.g., via loop detectors). Given this lack of bicycle and pedestrian data, local agencies are not able to accurately assess which facilities are in highest need of improvement. Technological advancements in transportation are creating opportunities to investigate new sources of data that measure pedestrian and bicycle activity and risk exposure more accurately, thereby improving safety modeling and planning.

Average annual daily pedestrian volume (AADP) and average annual daily bicyclist volume (AADB) are common metrics used in nonmotorized transportation studies. These metrics can be used as a measure of exposure in crash risk analyses. Due to limited resources and time, it is not possible to collect data for a whole year at many locations. Thus, a common practice is to collect short-term counts at several locations and apply an extrapolation method to convert short-term counts into yearly counts [3–9], which can be expressed in AADP or AADB forms. Exposure modeling can then be applied to estimate AADP or AADB for all other locations that were not originally selected. Direct-demand models have been the most popular approach for exposure estimation. Several studies have used the direct-demand model approach to develop demand models based on observed pedestrian or bicycle traffic volume and associated socioeconomic and land use variables in specific areas [10–13]. The estimated exposure can be used in risk quantification for normalization; risk is generally calculated by dividing the number of safety events by the estimated exposure derived from pedestrian and bicycle counts. Depending on how safety events (number of crashes, victims, etc.) and exposure (populations, traffic volume, AADP, etc.) are defined, different crash risk values can result for the same location. Some studies have focused on pedestrian and bicyclist crash risk modeling to investigate the factors that increase or decrease crash risk as well as selecting or developing a metric to quantify risks at specific locations [14–19].

While exposure—or the amount of pedestrian, bicyclist, or motorist activity occurring within a certain timeframe within a certain area—is highly correlated with pedestrian and bicycle crashes [20], there is a well-acknowledged dearth of data on walking and bicycling behaviors and activity levels, and researchers globally are working to address this gap [21]. This study contributed to the field by utilizing data from multiple sources, including automated pedestrian and bicycle counters, video cameras, crash databases, and other sources (e.g., GIS), to identify high-risk signalized intersections within the city of San Diego, California, USA. Data mining techniques, a modern sampling strategy, and automated video processing methods were adopted to demonstrate a holistic approach that can be applied to identify facilities with highest need of improvement. The study was conducted in four major steps: (1) Identifying the intersections for short-term video data collection. (2) Developing a vision-based intersection monitoring system to automatically detect, track, and count pedestrians and bicyclists. (3) Converting short-term counts to long-term counts collected at the selected intersections. (4) Conducting exposure modeling and risk quantification for walking and bicycling at signalized intersections. The remainder of the paper is organized as follows. Background on exposure modeling, risk quantification, and data needs are presented next, followed by a data description section. Subsequently, the methodology and results sections discuss the four major steps mentioned above. Finally, conclusions and future direction are provided.

2. Background

Exposure and roadway crash risk studies have been conducted at different geographic levels. These studies can be divided into two categories [22]: (1) area-wide approaches in which wide areas such as traffic analysis zones and census tracts are considered as units of analysis [23, 24], and (2) facility-specific approaches that take roadway segments or intersections as units of analysis [25–30]. The present study focuses on identifying high-risk signalized intersections in the city of San Diego, and therefore signalized intersections are the units of analysis for this study. Out of all these intersections, a sampling strategy is required to identify a subset of intersections for collecting short-term counts that are utilized to develop an exposure model. The exposure model can then be applied to estimate the volumes at other locations that were not part of the sample. The two general groups of sampling methods for site selection are probabilistic and nonprobabilistic sampling [22, 31]. Nonprobabilistic sampling techniques are mostly based on nonrandom factors such as engineering judgment [31]. In contrast, probabilistic sampling techniques involve some random selection in the process and thus result in better generalization. Probabilistic sampling techniques include cluster analysis, stratified sampling, and multistage random sampling, which is essentially a combination of clustering and stratification [22, 31].

2.1. Short-Term and Long-Term Data Collection

Most studies and agencies have used a manual approach [14, 32] in which people either collect count data in the field or extract counts by watching video recorded at the selected intersections. To reduce this manual work, advanced video processing algorithms [33–35] can be applied to automatically count pedestrians and bicyclists. Different methods and technologies for collecting data on vulnerable road users are extensively discussed in [36]. Three steps can be identified when conducting video data analysis: object detection, object tracking, and behavior analysis. Object detection is the process of identifying different objects in images or video frames. Once the objects are detected, they need to be tracked frame by frame to monitor their spatial and temporal characteristics. Finally, behavior analysis of the object trajectories can be carried out to obtain a desired outcome, such as speed of moving objects [37, 38], object counts [35, 38–40], waiting time [35, 40], lane keeping data [41], traffic violations [37, 42], and overtaking information [43].

As it is not feasible to collect count data for a whole year at the selected sites, even automatically, a common practice is to collect data for shorter periods of time and apply an extrapolation method to convert short-term count data to yearly data. Extrapolation has been used in several studies for estimating AADP and AADB [4, 6, 25, 44, 45]. When collecting short-term counts, several factors such as counting period length, time of day, month, and year could potentially impact the accuracy of yearly count estimation [3, 4, 6, 45–47]. In addition to collecting short-term data, long-term count data at several locations are also required to perform the extrapolation. The long-term counts provide daily, weekly, and monthly count patterns that are used for calculating adjustment factors to perform extrapolation.

Several extrapolation methods have been used in previous studies, such as the traditional or standard method, day-by-month, day-of-year, and the weather model [4, 6, 9, 45, 47]. The selection of the method depends on the geographic location and weather variation of the study area as well as the availability and duration of short-term and long-term counts. For example, the variation in temperature has a great impact on the number of pedestrians and bicyclists in certain areas. Therefore, some methods were proposed to incorporate the influence of weather on AADP and AADB [7, 9]. Another consideration in the extrapolation process is to apply a suitable method for matching short-term counters to appropriate long-term counters [6, 7, 9, 30, 47–50].

2.2. Exposure Modeling and Risk Quantification

Crash risk is generally estimated by dividing the number of safety events such as crashes by a total number of people who were likely to be involved in the safety events (i.e., exposure). Hence, crash risk is the probability of crash occurrence per unit of exposure. Focusing on pedestrians and bicyclists, the exposure has been defined in different ways such as pedestrian or bicyclist volume and estimated number of streets or travel lanes crossed [51]. At a theoretical level, it has been suggested that exposure may be defined as a measure of the number of potential opportunities for a crash to occur [14]. In most studies, the number of pedestrian/bicyclist crashes has been divided by a single exposure variable in order to calculate the crash risk. However, exposure can also be defined using multiple variables simultaneously. For example, the number of pedestrians and vehicles were used in [14]; average daily pedestrians, average daily traffic, and distance crossed were used in [52]; and hundred million pedestrian/bicycle miles of roadway traveled were used in a study conducted in Washington, D.C. [51].

When investigating intersection-related crash risk, it is important to consider the roles that different transportation modalities including pedestrians, bicyclists, motorcyclists, and low-speed mobility vehicles play in traffic and the differential risks that are faced by people traveling in these modalities. In a large-scale study, crash risks were investigated by travel mode, gender, and age groups [53]. In a study in Washington, D.C., it was found that although the number of crashes involved with pedestrians was more than two times the number of bicycle-involved crashes, the crash risks were similar after taking exposure (per 100M miles traveled) into account [32]. Also, in a tendency known as the “safety in numbers” phenomenon, areas with higher bicycle traffic flow have been found to be less risky for individual bicyclists in some cases [54].

3. Data Description

The unit of analysis in the present study is signalized intersections in the city of San Diego. A total of 1522 signalized intersections were identified using an ArcGIS shapefile. Short-term video data were collected by National Data and Surveying Services (NDS) at a sample of 45 intersections (the selection of these 45 intersections is discussed in the site selection section below). These intersections were equipped with cameras, and the videos were recorded for 12 hours (7:00 am to 7:00 pm) on a weekday (Tuesday, Wednesday, or Thursday) in May, June, or July 2018. Short-term data collection can be conducted for different lengths from a few hours to a few weeks, and generally, more data leads to smaller extrapolation errors. In this study, a data collection length of 12 hours was used for two reasons. First, data collection for longer periods was not feasible due to budget constraints. Second, Nordback et al. found that extrapolation error rates do not decrease significantly from 12 to 24 hours [6]. Subsequently, pedestrian and bicycle short-term counts were automatically obtained through machine-vision modeling. In addition to the short-term counters (at the selected intersections), 43 of the automated counters in the county of San Diego were also utilized. These counters are not located at intersections, but they have continuously collected pedestrian and bicyclist counts since 2012. Data have been collected for several years from these counters. However, because some equipment was vandalized and some counters had battery issues, data from 2015, which are believed to be the most reliable, were used. For every intersection, demographic, socioeconomic, and built environment variables were obtained by buffer analysis in ArcGIS. 
In addition, crash data involving pedestrians and bicyclists were obtained from the Statewide Integrated Traffic Records System (SWITRS) through the Transportation Injury Mapping System (TIMS). Crash data per victim for each intersection was extracted from 2006 to 2016. For all data sets, missing values and outliers were identified and dealt with by comparing saturation flow rate of bicyclists and pedestrians with actual observations, manually reviewing the data, and imputing data (e.g., using interpolation).

4. Methodology

4.1. Site Selection

A sampling strategy entailing cluster analysis and stratified sampling was utilized to identify a representative subset of intersections based on several variables. Cluster analysis is a classification technique that can be used to classify observations (intersections in our case) into distinct categories with similar characteristics. The algorithms used for data clustering can be categorized into several groups, such as partitional, hierarchical, density-based, grid-based, and model-based algorithms [55, 56]. In the present study, partitional algorithms were considered as they are the most widely used approaches. To select the best number of clusters, the Silhouette metric, the Elbow method, and the gap statistic can be employed [57, 58].

In stratified sampling, a few variables with different levels or thresholds are typically used to create strata, and each analysis unit can then be associated with a stratum [26]. Stratified sampling is effective in that it ensures the sample contains observations with different levels for the variables used. However, as the number of variables increases, the number of strata grows rapidly, and thus selecting one intersection per stratum could make the sample too large. Therefore, stratified sampling could reduce flexibility in selecting many variables in the site selection process. Many factors could potentially impact pedestrian and bicyclist volume at intersections, and these factors are typically used in site selection. A number of variables have been used in similar studies for selecting count locations, such as population density, median income, and proximity to commercial properties [4, 26, 32, 45, 59–61].

In this study, a multistage random approach was adopted to benefit from both stratification and clustering. The data in this study contain both numerical and categorical variables, and thus the Gower coefficient [62] was used to calculate distances (i.e., dissimilarities) between pairs of observations. After the pairwise distances among observations were determined, a partitional clustering algorithm, Partitioning Around Medoids (PAM) [57], was employed to identify the clusters. The most popular partitional algorithm is K-means due to its efficiency and simplicity [63, 64]. However, K-means is more sensitive to outliers than PAM, since K-means uses centroids as cluster centers whereas PAM uses actual observations (medoids) to define clusters. In the stratification step, depending on the number of clusters used, a stratified sampling method can be applied using one or more intersection characteristics to ensure that intersections with different levels for these characteristics are included in the sample. If the number of clusters turns out to be very high or very low, then the number of variables and/or levels of variables used for stratification could be adjusted to obtain the desired sample size.
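As an illustration of the clustering step, the following minimal Python sketch computes a Gower dissimilarity for mixed numerical and categorical records and then groups the records around medoids. It is only a sketch under stated assumptions: the records, variable names, and ranges are hypothetical, and the clustering routine is a simplified alternating k-medoids procedure rather than the full PAM build-and-swap algorithm used in the study.

```python
import random

def gower_distance(a, b, kinds, ranges):
    """Gower dissimilarity: mean of per-variable dissimilarities in [0, 1].
    kinds[i] is "num" or "cat"; ranges[i] is the observed range of a numeric variable."""
    total = 0.0
    for x, y, kind, rng in zip(a, b, kinds, ranges):
        if kind == "num":
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
    return total / len(kinds)

def k_medoids(dist, k, seed=0, max_iter=100):
    """Simplified alternating k-medoids on a precomputed distance matrix
    (assign to nearest medoid, then re-pick each cluster's medoid)."""
    n = len(dist)
    medoids = random.Random(seed).sample(range(n), k)

    def assign(meds):
        return [min(range(k), key=lambda c: dist[i][meds[c]]) for i in range(n)]

    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # keep the old medoid if a cluster became empty
                new_medoids.append(medoids[c])
                continue
            new_medoids.append(
                min(members, key=lambda m: sum(dist[m][j] for j in members))
            )
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Hypothetical intersections: (population density, presence of school)
records = [(100.0, "yes"), (120.0, "yes"), (110.0, "yes"),
           (900.0, "no"), (950.0, "no"), (1000.0, "no")]
kinds = ["num", "cat"]
ranges = [900.0, None]  # range of the numeric variable; unused for categorical
dist = [[gower_distance(a, b, kinds, ranges) for b in records] for a in records]
medoids, labels = k_medoids(dist, k=2)
print(medoids, labels)
```

Because each medoid is an actual observation, a single extreme record cannot pull the cluster center away from the bulk of the data, which is the robustness argument made above.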

4.2. Vision-Based Pedestrian and Bicyclist Monitoring

A vision-based monitoring system was used to count the number of pedestrians and bicyclists crossing the intersections from short-term video data. The system consists of three steps: object detection, object tracking, and object counting. Object detection was performed by utilizing Faster R-CNN [33] to detect pedestrians and bicyclists in video frames. Faster R-CNN is a state-of-the-art real-time object detection method that has reached a high performance when applied to the PASCAL VOC 2007 test data. Subsequently, detection results were used to perform object tracking with an Intersection-Over-Union (IOU) tracker [65]. The main idea of this tracker is to extend each trajectory with the detection in the current frame that has the highest IOU with the trajectory's last detection in the previous frame. Detections that cannot be associated with any existing trajectory start new trajectories, and trajectories that receive no associated detection are ended. Finally, pedestrian and bicyclist counts were obtained from regions of interest, which were defined as areas typically used by pedestrians and bicyclists to cross intersections. To obtain the counts, any trajectory that entered a region of interest was counted as a crossing pedestrian or bicyclist.
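The tracking and counting logic described above can be sketched in a few lines of Python. This is a simplified illustration, not the system used in the study: the boxes stand in for detector output (no Faster R-CNN is run), the threshold `sigma_iou` and the rectangular region of interest are hypothetical, and refinements of the published IOU tracker such as minimum track length and confidence filtering are omitted.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def track(frames, sigma_iou=0.5):
    """frames: list of per-frame detection lists. Returns trajectories (lists of boxes)."""
    active, finished = [], []
    for detections in frames:
        unmatched = list(detections)
        still_active = []
        for traj in active:
            # Best-overlapping detection relative to the trajectory's last box
            best = max(unmatched, key=lambda d: iou(traj[-1], d), default=None)
            if best is not None and iou(traj[-1], best) >= sigma_iou:
                traj.append(best)
                unmatched.remove(best)
                still_active.append(traj)
            else:
                finished.append(traj)  # no match: the trajectory ends
        # Every detection left unmatched starts a new trajectory
        still_active.extend([d] for d in unmatched)
        active = still_active
    return finished + active

def count_crossings(trajectories, roi):
    """Count trajectories whose box centre enters the rectangular ROI at least once."""
    def inside(box):
        cx, cy = (box[0] + box[2]) / 2, (box[1] + box[3]) / 2
        return roi[0] <= cx <= roi[2] and roi[1] <= cy <= roi[3]
    return sum(any(inside(b) for b in t) for t in trajectories)

# Hypothetical detections over four frames: one object near the origin, one far away
frames = [
    [(0, 0, 10, 10)],
    [(1, 0, 11, 10)],
    [(2, 0, 12, 10), (50, 50, 60, 60)],
    [(51, 50, 61, 60)],
]
tracks = track(frames)
print(len(tracks), count_crossings(tracks, roi=(0, 0, 20, 20)))
```

In practice the region of interest would be a polygon drawn over each crosswalk, and occlusion within groups breaks the frame-to-frame overlap that this association rule depends on.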

4.3. Extrapolation

To estimate AADP and AADB from short-term counts, similar pedestrian and bicyclist volume patterns for each short-term data collection site need to be identified. These volume patterns are utilized for extrapolating long-term counts from short-term counts. Permanent counters, even at a different location from the short-term counters, are typically used to identify similar demand patterns. In this study, the matching process was performed in two steps as proposed in [30]: in the first step, the PAM clustering method was applied to classify long-term counters into different clusters based on traffic distribution indexes including AMI, WWI, and PPI for bicycle counters and AMI and WWI for pedestrian counters. In the second step, the classified long-term counters were used as the training data for developing a K-nearest neighbor (KNN) [57] model to match short-term counters to appropriate clusters.

Several variables have been calculated and utilized in the extrapolation process. These variables include population density, employment density, and land use density (commercial, residential, government, industrial, park and recreational) within a given counter buffer (0.25-mile buffer), as well as traffic distribution indices such as AMI, WWI, and PPI introduced in [7, 9]. These indices reflect bicyclist volume in morning peak hour over midday peak hour, weekend over weekday, and monthly variations, respectively.
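To make the first two indices concrete, the sketch below computes AMI and WWI from hypothetical average volumes at one counter. The choice of hours that define the morning and midday periods is an assumption made for illustration and may differ from the definitions in [7, 9]; PPI is not shown.

```python
def ami(hourly_weekday, morning=(7, 8), midday=(11, 12)):
    """Morning-to-midday ratio from 24 average weekday hourly volumes.
    The hour indices for each period are illustrative assumptions."""
    am = sum(hourly_weekday[h] for h in morning) / len(morning)
    mid = sum(hourly_weekday[h] for h in midday) / len(midday)
    return am / mid

def wwi(daily_by_weekday):
    """Weekend-to-weekday ratio from 7 average daily volumes (Monday=0 .. Sunday=6)."""
    weekday = sum(daily_by_weekday[:5]) / 5
    weekend = sum(daily_by_weekday[5:]) / 2
    return weekend / weekday

# Hypothetical commuter-like counter: strong morning peak, quieter weekends
hourly = [5] * 24
hourly[7], hourly[8] = 60, 80
hourly[11], hourly[12] = 30, 40
daily = [500, 520, 510, 530, 490, 300, 280]
print(round(ami(hourly), 2), round(wwi(daily), 2))
```

A counter with AMI well above one and WWI well below one behaves like a utilitarian (commute-oriented) location, whereas the opposite pattern suggests recreational use.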

Several extrapolation methods have been used in previous studies, and a few (such as day-of-year [4], day-by-month [9], and the weather model) have been shown to produce lower AADP and AADB estimation errors. The day-of-year method was not applicable in this study because it requires the short-term and long-term data to be collected in the same year, whereas short-term data collection was conducted in 2018 and the long-term data used were collected in 2015. In addition, the weather model was not deemed to be beneficial due to San Diego's year-round mild weather. Thus, the day-by-month method was applied with minor modification as described below.

First, 12-hour counts were converted to 24-hour counts using (1). Equation (2) shows how the day-by-month factor was calculated for every day d of the week and month m. Subsequently, AADP and AADB counts were estimated by applying day-by-month adjustment factors to the 24-hour counts using (3). The long-term counter data in the following equations refer to a counter that has been matched with the short-term counter of interest. However, it should be noted that if two or more permanent counters were matched to a short-term counter, the mean of the adjustment factors across all matched counters was used.

$$C_{d,m,s} = \sum_{h \in H_{12}} C_{h,d,m,s} \times \frac{\sum_{h=1}^{24} C_{h,d,m,l}}{\sum_{h \in H_{12}} C_{h,d,m,l}} \tag{1}$$

$$F_{d,m} = \frac{AADP_{l}}{ADP_{d,m,l}} \tag{2}$$

$$AADP_{s} = C_{d,m,s} \times F_{d,m} \tag{3}$$

where $C_{d,m,s}$: pedestrian (bicyclist) count on day d of a week, in month m estimated for short-term counter s; $C_{h,d,m,s}$: pedestrian (bicyclist) count in hour h of day d of a week, in month m from short-term counter s; $C_{h,d,m,l}$: pedestrian (bicyclist) count in hour h of day d of a week, in month m from matched long-term counters; $H_{12}$: the 12 hours observed at the short-term counter (7:00 am to 7:00 pm); $F_{d,m}$: pedestrian (bicyclist) day-by-month factor for day d of a week, in month m; $AADP_{l}$: average annual daily pedestrian (bicyclist) count obtained from matched long-term counters; $ADP_{d,m,l}$: average daily pedestrian (bicyclist) volume on day d, in month m from matched long-term counters; $AADP_{s}$: average annual daily pedestrian (bicyclist) count estimated for short-term counter s.
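The three steps described above (12-hour to 24-hour expansion, day-by-month factor, and annualization) can be sketched as follows. This is a hedged illustration: the function names, dictionary keys, and numbers are hypothetical, the 12-to-24-hour expansion is assumed to use the hourly share observed at the matched long-term counter, and averaging both kinds of factors across several matched counters is an assumption about how the mean was taken.

```python
def expansion_factor(long_12h, long_24h):
    """Ratio of the 24-hour count to the 7:00 am - 7:00 pm count at a long-term counter."""
    return long_24h / long_12h

def day_by_month_factor(aadp_long, adp_long_dm):
    """Annual average daily count divided by the average daily count
    for the same day of week and month at a long-term counter."""
    return aadp_long / adp_long_dm

def estimate_aadp(short_12h, matched):
    """Estimate AADP (or AADB) for a short-term site.
    matched: list of dicts, one per matched long-term counter, with the
    hypothetical keys 'count_12h', 'count_24h', 'aadp', and 'adp_dm'."""
    n = len(matched)
    expand = sum(expansion_factor(c["count_12h"], c["count_24h"]) for c in matched) / n
    factor = sum(day_by_month_factor(c["aadp"], c["adp_dm"]) for c in matched) / n
    count_24h = short_12h * expand   # step 1: 12-hour count to 24-hour count
    return count_24h * factor        # steps 2 and 3: apply day-by-month factor

# Hypothetical example: 600 pedestrians counted over 12 hours on a Wednesday in June
matched = [{"count_12h": 800, "count_24h": 1000, "aadp": 900, "adp_dm": 1000}]
print(estimate_aadp(600, matched))
```

In this example the counted day is busier than the annual average at the matched counter, so the day-by-month factor is below one and scales the expanded count down.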

4.4. Exposure Modeling and Risk Quantification

A wide variety of approaches and methods have been used in predicting nonmotorized activity using direct-demand models. Because the dependent variable is a discrete count and its variance was greater than its mean (i.e., the data were overdispersed), the negative binomial model was selected as the exposure model of this study. As discussed earlier, several variables were considered in the analysis. Univariate and bivariate correlation analyses were first conducted to explore the variables' distributions and patterns and to investigate the relationship between the dependent and independent variables. Several variable forms and functions were examined to obtain the best data fit.
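The overdispersion condition that motivates the negative binomial model can be checked with a simple comparison of sample variance and mean. The counts below are hypothetical and serve only to illustrate the check; a Poisson model would imply that the two quantities are approximately equal.

```python
from statistics import mean, variance

def dispersion_ratio(counts):
    """Sample variance divided by sample mean; values well above 1 suggest overdispersion."""
    return variance(counts) / mean(counts)

# Hypothetical AADP values at sampled intersections
aadp_counts = [120, 85, 2300, 410, 960, 75, 1540, 330]
print(round(dispersion_ratio(aadp_counts), 1))
```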

After several model trials with different combinations of the key variables, the best models were evaluated based on their predictive accuracy in terms of mean absolute error (MAE) and root mean squared error (RMSE). A cross-validation technique was employed for performance evaluation. Cross-validation is a resampling technique that helps identify a parameter value, ensuring a proper balance between bias and variance [66]. For cross-validation, a subset of the data, known as the training set, is used to train the model, and the remaining data points serve as a test set or validation set. While fitting a model on a training set, it is desirable to minimize the mean squared error (MSE), that is, the difference between the predictions and the actual observations. This research used a 10-fold cross-validation method to evaluate and compare the performance of the developed models. This method split the feature vector sets into ten approximately equally sized distinct partitions. While one partition was used for testing, the other partitions were used for training. The procedure was repeated ten times, and the accuracy measures over these ten runs were averaged to improve the estimate. The performance evaluation criterion was the average accuracy. The final models were identified based on statistical, predictive, and intuitive considerations as well as insights from the literature.
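The 10-fold procedure can be sketched as below. This is an illustration only: the stand-in model is a one-parameter line through the origin rather than the negative binomial model of the study, and the data, function names, and random seed are hypothetical.

```python
import math
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal, disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cross_validate(xs, ys, fit, predict, k=10, seed=0):
    """Return MAE and RMSE averaged over the k held-out folds."""
    maes, rmses = [], []
    for test_idx in k_fold_indices(len(xs), k, seed):
        held_out = set(test_idx)
        train_x = [x for i, x in enumerate(xs) if i not in held_out]
        train_y = [y for i, y in enumerate(ys) if i not in held_out]
        model = fit(train_x, train_y)
        errors = [predict(model, xs[i]) - ys[i] for i in test_idx]
        maes.append(sum(abs(e) for e in errors) / len(errors))
        rmses.append(math.sqrt(sum(e * e for e in errors) / len(errors)))
    return sum(maes) / k, sum(rmses) / k

# Stand-in model (NOT the negative binomial model): least-squares slope through the origin
def fit_slope(train_x, train_y):
    return sum(x * y for x, y in zip(train_x, train_y)) / sum(x * x for x in train_x)

def predict_slope(slope, x):
    return slope * x

# Hypothetical data for 45 sites: exposure grows with a single predictor plus noise
noise = random.Random(1)
xs = [float(i + 1) for i in range(45)]
ys = [20.0 * x + noise.uniform(-5, 5) for x in xs]
mae, rmse = cross_validate(xs, ys, fit_slope, predict_slope)
print(round(mae, 2), round(rmse, 2))
```

With only 45 sampled intersections, each fold holds four or five sites, so averaging across folds is what makes the error estimate reasonably stable.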

Utilizing the estimated pedestrian and bicycle activity (i.e., exposure), risks associated with walking and bicycling at signalized intersections can be calculated. A general equation to quantify risk is shown in (4). Several studies have used the number of crashes in the risk equation as the number of safety events. As a result, two locations with the same number of crashes and the same exposure would be assigned the same level of risk, even if more victims were involved at one location than at the other. When the number of victims is considered instead of the number of crashes, a crash with multiple victims is associated with a higher risk than a crash with only one victim. In addition, crashes with higher levels of severity should be attributed to higher risks. Also, the number of fatalities (instead of crashes) could be used to provide an estimate of the relative lethality of intersections. Therefore, a combination of fatalities and injuries was utilized to provide a more holistic estimate of the risk. Crash severity was incorporated in risk quantification by utilizing crash costs associated with severity levels. Two other factors, AADP (AADB) as the exposure and the distance crossed, were also included in the risk equation as presented in (5). In addition to the crash cost, the equation numerator includes a term, $N^{\alpha}$ (the total number of victims $N$ raised to the power $\alpha$), to place more weight (i.e., importance) on the locations with a higher frequency of victims. Since crashes are rare events, it is important to magnify the number of occasions that led to fatalities and injuries. The tuning parameter, $\alpha$, controls the extent of the weight. For example, if zero is selected for this parameter, $N^{\alpha}$ becomes one, which means that no weight is given to the victim frequency. As $\alpha$ increases, more weight is placed on the victim frequency and consequently higher risk values result.

Crash cost has been used for different purposes, such as analyzing the effectiveness of a specific roadway enhancement and measuring the effect of seatbelt use, and it has been estimated based on injury severity in several studies [67–71]. In an early study, Miller estimated motor-vehicle crash comprehensive costs by injury severity and body region [67]. Another study estimated crash costs of medium and heavy trucks by seven injury severity levels [68]. Miller et al. broke down pedestrian and pedalcyclist crash costs by age, injury severity, and body region in the United States [69]. The Federal Highway Administration (FHWA 2005) also presented an estimation of crash cost based on maximum police-reported injury severity.

In crash cost studies, maximum abbreviated injury severity (MAIS) is defined as the maximum threat of a crash to a victim's life [72]. Crash cost generally results from a combination of different cost categories, including medical, emergency service, lost productivity, the monetized value of pain and suffering, and lost quality of life costs. Collectively, these costs have been called comprehensive costs. The monetary or economic cost of a crash can be obtained by subtracting lost quality of life from the comprehensive cost [67]. In this study, lifetime costs were used, which include medical, work loss, and quality of life costs [69].

$$\text{Risk} = \frac{\text{Number of safety events}}{\text{Exposure}} \tag{4}$$

$$\text{Risk} = \frac{C \times N^{\alpha}}{AADP \times D} \tag{5}$$

where $C = \sum_{s} n_{s} c_{s}$: total crash cost weighted by severity; $n_{s}$: number of pedestrian or bicyclist victims with injury severity level s; $c_{s}$: cost per victim with injury severity level s; $s$: severity level = fatal, severe injury, other visible injury, and complaint of pain; $N$: total number of victims; $\alpha$: exponent of $N$, a tuning parameter to magnify the frequency of victims; $AADP$: average annual daily pedestrian (bicyclist) count; $D$: distance a pedestrian or bicyclist crossed.
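A small Python sketch may help show how the risk measure behaves as the tuning parameter changes. It assumes a form in which the severity-weighted crash cost times the victim-frequency term is divided by the product of exposure and distance crossed; the per-victim costs, counts, and distances below are hypothetical placeholders, not values from the study.

```python
def crash_cost(victims_by_severity, cost_by_severity):
    """Total crash cost weighted by severity: sum of victims times cost per victim."""
    return sum(n * cost_by_severity[s] for s, n in victims_by_severity.items())

def risk(victims_by_severity, cost_by_severity, exposure, distance, alpha=1.0):
    """Assumed form: cost * N**alpha / (exposure * distance), where N is the total
    number of victims and alpha magnifies locations with frequent victims."""
    n_total = sum(victims_by_severity.values())
    cost = crash_cost(victims_by_severity, cost_by_severity)
    return cost * n_total ** alpha / (exposure * distance)

# Hypothetical per-victim costs (placeholders only)
costs = {"fatal": 1000.0, "severe": 300.0, "visible": 50.0, "pain": 20.0}

# Two hypothetical intersections with equal exposure and crossing distance
site_a = {"fatal": 0, "severe": 1, "visible": 4, "pain": 5}   # many minor injuries
site_b = {"fatal": 1, "severe": 0, "visible": 0, "pain": 0}   # one fatality

for alpha in (0.0, 1.0):
    ra = risk(site_a, costs, exposure=2000.0, distance=60.0, alpha=alpha)
    rb = risk(site_b, costs, exposure=2000.0, distance=60.0, alpha=alpha)
    print(alpha, round(ra, 5), round(rb, 5))
```

With the exponent set to zero the single-fatality site ranks higher on cost alone, while a larger exponent shifts the ranking toward the site with many victims, which is the trade-off the tuning parameter is meant to control.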

5. Results and Discussion

5.1. Site Selection

Any variable that is expected to affect pedestrian and bicycle activity at intersections could potentially be included in the site selection process. A total of 18 variables were examined and after several trials with different subsets of variables and excluding variables with high correlation with other variables, a subset of 12 variables were selected to perform site selection as follows: population density, land use (parks and recreational; residential), presence of college, presence of school, transit stops density, mean traffic volume, pedestrian victims, bicyclist victims, proximity to Balboa Park, proximity to beaches, sidewalk density, and bikeway density.

Using all selected variables, signalized intersections were grouped into clusters by applying the PAM clustering method. The Silhouette metric and the Elbow method were applied to identify the best number of clusters. The highest Silhouette value, which indicates the best clustering performance, was obtained when using five clusters, as shown in Figure 1. Based on the Elbow plot in this figure, the total sum of squares decreases as the number of clusters increases. However, no clear elbow point is visible, and thus the best number of clusters was selected to be five based on the Silhouette method only. The geographic distribution of these five clusters is shown in Figure 2.
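For reference, the Silhouette criterion used to choose the number of clusters can be computed directly from a distance matrix, as in the following minimal sketch. The one-dimensional points and the two candidate labelings are hypothetical; in the study the distances would be Gower dissimilarities and the labelings would come from PAM runs with different numbers of clusters.

```python
def mean_silhouette(dist, labels):
    """Average silhouette width for a labeling, given a full distance matrix.
    Requires at least two clusters; singleton clusters score 0 by convention."""
    n = len(labels)
    clusters = set(labels)
    scores = []
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(dist[i][j] for j in own) / len(own)  # mean distance within own cluster
        b = min(                                      # mean distance to nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c)
            / sum(1 for j in range(n) if labels[j] == c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / n

# Hypothetical points forming two obvious groups
points = [0.0, 1.0, 2.0, 10.0, 11.0, 12.0]
dist = [[abs(p - q) for q in points] for p in points]
good = [0, 0, 0, 1, 1, 1]
bad = [0, 0, 1, 1, 1, 1]
print(round(mean_silhouette(dist, good), 3), round(mean_silhouette(dist, bad), 3))
```

Repeating this calculation for each candidate number of clusters and keeping the labeling with the highest average value is the selection rule described above.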

Within each cluster, stratified sampling was performed using two variables, namely the number of pedestrian victims and the number of bicyclist victims. The purpose was to ensure that the sample includes intersections with a high, moderate, and low number of victims. The number of pedestrian victims, ranging from 0 to 13, was divided into three levels (low: 0, 1, 2; moderate: 3, 4, 5; high: ≥6). Similarly, the number of bicyclist victims was divided into three levels (low: 0, 1; moderate: 2, 3, 4; high: ≥5). Consequently, nine strata resulted for each cluster (3 × 3 = 9). Subsequently, a sample of 45 intersections was identified, as shown in Figure 2, by selecting one intersection per stratum (5 × 9 = 45). It should be pointed out that most adjacent intersections are assigned to the same cluster unless their characteristics are very different from those of their neighbors. In addition, the selected sample was carefully reviewed to ensure that adjacent intersections are not in the sample. This was done to avoid correlation among the variables of intersections that share the same area of influence.
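The stratification rule can be sketched in Python as follows. The victim thresholds follow the levels stated above, while the record layout, identifiers, and random seed are hypothetical, and the manual review for adjacent intersections is not modeled.

```python
import random

def ped_level(n):
    """Pedestrian victim level: low 0-2, moderate 3-5, high 6 or more."""
    return "low" if n <= 2 else "moderate" if n <= 5 else "high"

def bike_level(n):
    """Bicyclist victim level: low 0-1, moderate 2-4, high 5 or more."""
    return "low" if n <= 1 else "moderate" if n <= 4 else "high"

def sample_one_per_stratum(intersections, seed=0):
    """Randomly pick one intersection id per (cluster, ped level, bike level) stratum.
    intersections: dicts with keys 'id', 'cluster', 'ped_victims', 'bike_victims'.
    Strata with no intersections simply yield no pick."""
    rng = random.Random(seed)
    strata = {}
    for rec in intersections:
        key = (rec["cluster"], ped_level(rec["ped_victims"]), bike_level(rec["bike_victims"]))
        strata.setdefault(key, []).append(rec["id"])
    return {key: rng.choice(ids) for key, ids in sorted(strata.items())}

# Hypothetical population: two intersections in each of the 5 x 3 x 3 strata
population = []
for cluster in range(5):
    for ped in (0, 3, 6):
        for bike in (0, 2, 5):
            for copy in range(2):
                population.append({
                    "id": f"c{cluster}-p{ped}-b{bike}-{copy}",
                    "cluster": cluster,
                    "ped_victims": ped,
                    "bike_victims": bike,
                })
sample = sample_one_per_stratum(population)
print(len(sample))
```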

5.2. Vision-Based Pedestrian and Bicyclist Monitoring

Utilizing video data, machine-vision models were trained to detect and track pedestrians and bicyclists. Several pedestrians and bicyclists were labeled to perform the training task. The vision-based monitoring system was tested in several scenarios and system performance was assessed using real-world video data from stationary cameras at several signalized intersections. Figure 3 shows an example in which two pedestrians and a bicyclist were successfully detected, tracked, and counted as they crossed the intersection within the region of interest (transparent blue region). The average pedestrian and bicycle counting accuracies were 85% and 81%, respectively. Several factors impacted the model performance, including the number of pedestrians and bicyclists labeled, intersection shape and size, lighting condition, occluded objects, and video quality. The best counting accuracy of 95% was achieved for both pedestrians and bicyclists in scenarios in which many objects were labeled, good lighting condition was present, video quality was decent, and pedestrians/bicyclists were not crossing in groups.

As expected, more labeled pedestrians and bicyclists led to better model performance, as the models were provided with more information in terms of positioning, angles, and lighting conditions. For example, the detection accuracy increased by 68% when the number of labeled pedestrians increased from 15 to 32. Model transferability was examined by using data from one intersection to train models and testing these models on other intersections. The benefit was that the manual work of labeling was reduced, but the models performed poorly since different intersections have different shapes and sizes. When a single intersection was used for both training and testing, the way people crossed the intersection (individually vs. in groups) and the lighting conditions at different times of day significantly affected model generalizability. For instance, the models had difficulty detecting and tracking pedestrians and bicyclists crossing the intersection in groups since some of them were occluded by others in multiple video frames. In addition, the quality of the video and the distance of objects from the camera impacted the results. Cameras used in this study were set at a corner of each intersection. Detecting objects crossing the two intersection approaches farther from the camera was challenging, especially in large intersections where pedestrians and bicyclists were too small to distinguish.

5.3. Extrapolation

Pedestrian and bicycle volume patterns at permanent counters were identified using the PAM clustering method. Figures 4 and 5 illustrate the pedestrian counters classified into three clusters (Recreational, Mixed, and Utilitarian) based on AMI and WWI. As shown in Figures 6 and 7, bicycle counters were grouped into four clusters (Utilitarian, Recreational, Mixed Recreational, and Mixed Utilitarian) based on AMI, WWI, and PPI calculated for every counter. Pattern classification into three and four clusters has been used in past studies [46, 49, 73]. The only reason three clusters (instead of four) were used for pedestrian counters was the limited number of counters, which would have left only one counter in a cluster; this is not recommended [6, 50].

Utilitarian counters have two distinct peak hours in mornings and evenings on weekdays. They also have a relatively uniform distribution throughout the week as shown by daily pattern. Recreational counters have higher weekend peaks than weekday peaks as expected. The daily patterns also show the highest volume on Saturdays and Sundays compared to the other days of the week. Mixed, mixed recreational, or mixed utilitarian counters represent different combination variations of utilitarian and recreational demand. These classifications for bicycle and pedestrian counters were consistent with the literature as discussed in [49]. Finally, each short-term counter was matched to one of the specified clusters, and the mean of the adjustment factors across all counters within that cluster was used to extrapolate the short-term counts (12-hour counts) to yearly counts (AADP or AADB).
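The last step can be illustrated with a short Python sketch. It assumes that each adjustment factor is a multiplicative expansion factor (annual average daily volume divided by the 12-hour count) computed at a permanent counter; the exact definition of the factors is not restated here, so the function names and the example values are hypothetical.

```python
from statistics import mean


def cluster_mean_factors(counter_factors):
    """Average the adjustment factors of all permanent counters in each cluster.

    `counter_factors` maps a cluster name to a list of per-counter factors.
    Each factor is assumed to be multiplicative (annual average daily volume
    divided by the 12-hour count), which is an assumption of this sketch.
    """
    return {cluster: mean(factors) for cluster, factors in counter_factors.items()}


def extrapolate(short_count_12h, cluster, mean_factors):
    """Scale a 12-hour short-term count to an annual average daily volume."""
    return short_count_12h * mean_factors[cluster]


# Hypothetical factors for two clusters of permanent counters.
factors = cluster_mean_factors(
    {"Utilitarian": [1.0, 1.5], "Recreational": [2.0, 2.0, 2.0]}
)
aadp_estimate = extrapolate(1000, "Utilitarian", factors)
```

A short-term site matched to the Utilitarian cluster is scaled by that cluster's mean factor only, so the choice of cluster directly drives the estimated AADP or AADB.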

5.4. Exposure Modeling and Risk Quantification

After the estimation of AADP and AADB for the sample intersections, exposure modeling was applied to calculate AADP and AADB for the remainder of intersections as discussed below. Tables 1 and 2 show the results of the negative binomial regression models of pedestrian and bicycle annual average daily volume, respectively. The tables present the explanatory variables and their estimates. The variables such as transit stop density for which buffer analysis was conducted were tested in exposure models with three different areas of influence (0.1 miles, 0.25 miles, and 0.5 miles). The number following each variable represents the buffer area of influence.
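As a rough illustration of the buffer analysis, the sketch below computes a point density (for example, transit stops) around an intersection for each of the three buffer radii. It assumes planar coordinates already expressed in miles and defines density as the count divided by the circular buffer area; both are assumptions of this sketch rather than details taken from the study.

```python
import math

# Buffer radii tested in the exposure models, in miles.
BUFFER_RADII_MILES = (0.1, 0.25, 0.5)


def count_within(center, points, radius):
    """Count points whose straight-line distance to `center` is <= radius."""
    cx, cy = center
    return sum(1 for x, y in points if math.hypot(x - cx, y - cy) <= radius)


def buffer_density(center, points, radius):
    """Points per square mile inside a circular buffer (assumed definition)."""
    return count_within(center, points, radius) / (math.pi * radius ** 2)


def densities_by_buffer(center, points, radii=BUFFER_RADII_MILES):
    """Return one candidate explanatory variable per buffer radius."""
    return {radius: buffer_density(center, points, radius) for radius in radii}
```

Each radius yields a separate candidate variable, which is why the same feature can enter the pedestrian and bicycle models at different spatial scales.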

As shown in Table 1, the pedestrian model has seven variables, and all of them were statistically significant at the 90 percent confidence level. As shown in Table 2, the bicycle model also has seven variables, and all variables except one (transit stop density at 0.1 miles) were statistically significant at the 95 percent confidence level. The R-squared values for the pedestrian and bicycle models were 0.7 and 0.67, respectively, which indicates that the models have a decent goodness of fit. The MAE and RMSE of the models also indicated that the models provide a good estimate of the annual average daily pedestrian and bicycle volume at locations without counts.
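For reference, the two error measures mentioned above can be computed as in the following sketch, which uses the standard definitions of mean absolute error and root mean squared error. The observed and predicted volumes in the example are hypothetical and are not results from this study.

```python
import math


def mae(observed, predicted):
    """Mean absolute error between observed and predicted volumes."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean squared error; penalizes large misses more than MAE does."""
    if len(observed) != len(predicted) or not observed:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    return math.sqrt(squared / len(observed))


# Hypothetical daily volumes at three sample intersections.
observed_volumes = [100, 200, 300]
predicted_volumes = [110, 190, 330]
example_mae = mae(observed_volumes, predicted_volumes)
example_rmse = rmse(observed_volumes, predicted_volumes)
```

Because RMSE squares each error, a single intersection with a large miss raises it well above the MAE, which makes the pair useful for spotting outlying predictions.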

The model results revealed important insights. The pedestrian volume was characterized by transportation network (transit stop density and speed limit), population (employment density, and regular transit rider, pedestrian, or bicyclist population), and land use (vacant housing units, commercial or mixed-use land area, and high-crime area). The best model was obtained with variables of different spatial scales. This finding is consistent with previous studies [26, 74, 75] that suggested it is unlikely to have all the variables significant at the same buffer scale. The direct-demand model for bicycle traffic included variables that represent characteristics of the transportation network (density of bicycle facility, maximum posted speed within an intersection, and transit stops density), population (total regular bicyclist population), traffic generator (presence of a school and proximity to the beach), and land use (total commercial or mixed-use area).

Interestingly, some variables, such as commercial or mixed-use land area and transit stop density, influence both pedestrian and bicycle traffic in the study area, but the spatial scale of influence varies. The commercial or mixed-use land area influences pedestrian and bicycle volume within 0.1 miles and 0.25 miles, respectively. This indicates that commercial and mixed-use land area attracts bicyclist traffic from a larger catchment area than it does pedestrian traffic. Previous studies have also indicated that commercial areas attract pedestrian [12, 75] and bicycle [11, 12, 28] activity. However, the study by Tabeshian and Kattan (2014), conducted in Canada, found a significant impact of commercial areas on pedestrian and bicycle traffic within 0.25 miles and 0.1 miles, respectively, which is contrary to this study’s findings [12]. Two separate studies in Alameda County, California, have also found a significant influence of commercial areas on pedestrians within 0.25 miles [26] and bicyclists within 0.1 miles [28]. The comparison suggests that not only the explanatory variables but also their influence areas for nonmotorized traffic activity vary with location and community.

Similarly, transit stop density influences pedestrians within 0.5 miles, which is larger than the bicycle buffer of 0.1 miles. The results indicate that pedestrians are likely to travel farther than bicyclists to reach a transit facility. Previous studies have also observed a significant association of transit facilities with pedestrians within 0.5 miles [76] and bicyclists within 0.5 miles [11]. Given that mass transit facilities are bicycle-friendly in San Diego [77], transit riders probably make up a large proportion of pedestrians and bicyclists in the city.

Pedestrian and bicycle volumes decrease when the maximum intersection speed limit exceeds 40 mph. The finding confirms that pedestrians and bicyclists are more likely to avoid high-speed intersections and find an alternative route. Fagnant and Kockelman [10] also observed a similar relationship for bicycle traffic in the Seattle, Washington, area. The finding is not surprising given the rise in traffic fatalities. A report [78] indicated that around 1,000 pedestrians and bicyclists are hit and seriously injured annually in San Diego, and in 2012, pedestrian collisions increased 20 percent compared to previous years. In 2017, there were 12 deaths and 71 serious injuries involving pedestrians and bicyclists [79]. The high crash risk could discourage pedestrians and bicyclists from using high-speed intersections.

The pedestrian model showed a strong positive association between pedestrian volume and the percentage of the population within 0.25 miles who are regular transit riders, pedestrians, or bicyclists. As expected, the population inclined to use active modes and public transportation was more likely to contribute to the walking volume within their neighborhood. The pedestrian volume also increased with increasing employment density within 0.25 miles. Previous studies observed a similar influence in San Francisco [13] and San Diego, California [59]. The results suggest that with more people working in a neighborhood, intersections are more likely to observe higher pedestrian volume. Similarly, the negative association between pedestrian volume and total vacant housing units indicates that pedestrian trips are less likely to originate in neighborhoods with many vacant properties. A negative influence of crime on pedestrian volume, but not on bicyclist volume, was also observed, which shows that people are more likely to avoid high-crime locations and conforms with previous research [80, 81].

As expected, the bicycle model indicated higher bicycle volume in areas with a larger population of regular bicyclists. The model also indicated that the intersections near beach access points (less than 10 miles) were more likely to observe high bicycle traffic. The density of bike facilities also had a positive influence on daily bicycle volume. The finding can be attributed to the recent surge of dockless bicycles in the city [82] as well as the 16 miles of separated bike paths around San Diego Bay, completed in February 2018 [83]. The Bay Shore bikeway was built with a vision to provide a scenic and convenient way for bicyclists to travel in the San Diego area. The dockless bike sharing facilities were first launched in February 2018 and added the convenience of using bicycles. The combined influence may contribute to a higher bicycle volume in locations near beach areas and with better bicycle facilities. Surprisingly, the model revealed a negative association between the presence of a school and bicyclist volume, which contradicts previous studies conducted in Canada [11, 84]. However, a study conducted in the United States suggested that the number of students (ages 5 to 18) who walk or bike to school decreased sharply in recent years due to increased traffic collisions, lack of sidewalks, and urban sprawl [85]. Perhaps the increasing collision rate in the city discouraged children from bicycling to schools, and adult bicyclists tend to avoid locations near schools.

After estimating pedestrian and bicycle volumes (i.e., AADP and AADB) as the exposure measure, the risk was quantified by applying the proposed quantified risk equation presented in (5). In this equation, the number of victims and crash severity levels were obtained from the SWITRS data; the distance pedestrians or bicyclists crossed at the intersection was calculated by multiplying the average number of lanes (across all approaches) by the lane width (12 ft was assumed); and the cost per victim was obtained based on the victim’s age and injury severity as estimated by Miller et al. [69]. After experimenting with several values for the tuning parameter, a value of three was chosen for the final model. Models with smaller values of the tuning parameter identified some intersections as high-risk (i.e., top 50) with only one or two victims in the past ten years. While other factors such as a small AADP (AADB) and/or a high crash cost could lead to high risk values, it may not be practical to recognize an intersection with only one victim in the past ten years as a high-risk intersection even if the AADP is small. High values of the tuning parameter led to extreme values of risk, especially when the number of victims was high, and thus the value of three was selected as it provided a reasonable outcome. The risk for all signalized intersections was calculated to identify high-risk intersections for walking and bicycling, as mapped in Figure 8. Previously, 15 intersections, known as the “fatal 15”, were identified as the deadliest intersections for pedestrians in the city of San Diego [86]. As expected, it was found that these intersections had more victims than other intersections. However, when exposure and other factors were taken into account using the quantified risk equation, not all intersections with the highest number of pedestrian and bicyclist victims were identified as high-risk. For example, out of 39 intersections with the highest number of pedestrian victims (Victims >= 8), only 22 made it into the top 39 high-risk intersections based on the quantified risk. Similarly, out of 36 intersections with the highest number of bicyclist victims (Victims >= 5), 26 remained in the top 36 high-risk intersections.
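The comparison between the victim-based and risk-based rankings can be reproduced with a sketch like the one below. It takes risk values as given and does not implement the risk equation itself; the function name, the dictionary inputs, and the toy values are hypothetical.

```python
def top_n_overlap(victims, risk, threshold):
    """Compare the victim-count ranking with the risk ranking.

    `victims` and `risk` map an intersection id to its victim count and to its
    precomputed risk value (both hypothetical inputs in this sketch).
    Returns (n, overlap): n is the number of intersections with at least
    `threshold` victims, and overlap is how many of them are also among the
    n highest-risk intersections.
    """
    high_victims = {i for i, count in victims.items() if count >= threshold}
    n = len(high_victims)
    ranked_by_risk = sorted(risk, key=risk.get, reverse=True)
    top_risk = set(ranked_by_risk[:n])
    return n, len(high_victims & top_risk)


# Toy example: B has many victims but low risk once exposure is considered.
toy_victims = {"A": 9, "B": 8, "C": 2, "D": 1}
toy_risk = {"A": 5.0, "B": 0.5, "C": 3.0, "D": 0.1}
n_high, n_overlap = top_n_overlap(toy_victims, toy_risk, threshold=8)
```

In the toy data, intersection B drops out of the risk-based list, mirroring how a high-exposure intersection with many victims may not rank as high-risk.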

6. Conclusion

While it is accepted that pedestrian and bicyclist volume is positively correlated with the number of pedestrian and bicyclist crashes, there is a well-known lack of pedestrian and bicycle data, which can undermine accurate risk quantification and safety evaluations. This study leveraged a combination of available data sources, including automated pedestrian and bicycle counters, video cameras, crash databases, and other sources (e.g., GIS), to identify high-risk intersections for walking and bicycling. Cluster analysis and stratification were applied to identify a representative sample of locations to collect short-term video data that were used to develop a vision-based monitoring system for automatic detection, tracking, and counting of pedestrians and bicyclists. When a sufficient number of pedestrians and bicyclists were annotated, pedestrians and bicyclists were not too far from the camera, they did not cross the intersection in groups, and good lighting conditions were present, a high counting accuracy of 95% was obtained. Utilizing permanent counters, an extrapolation method along with a novel matching method was employed to estimate yearly counts that were used for estimating exposure by direct-demand models. Exposure analysis identified transportation network, population, traffic generator, and land use variables as statistically significant in estimating pedestrian and bicyclist volume. Accounting for exposure as a normalization factor, along with other factors such as frequency of victims and crash severity, had a significant impact on the selection of high-risk intersections; not all intersections with the highest number of pedestrian and bicyclist victims were identified as high-risk. The variables were found to be influential at multiple buffer areas and showed differences between pedestrian and bicycle activity. The results underscored the importance of location and community in characterizing nonmotorized demand and in targeting improvements to encourage nonmotorized activities.

The modeling framework and data sources used in this study can support future analyses of other facility types, such as roadway segments, and of more aggregate levels, such as traffic analysis zones. The approach can also help public agencies identify high-risk facilities and prioritize them for countermeasure implementation. Since crashes are rare events, identifying high-risk facilities from crash data alone takes a long time, and thus a potential future direction is to proactively assess safety by discovering near-crash situations through video analysis. This would enable researchers and practitioners to quantify risk and evaluate safety in a much shorter period of time. Another future research topic is to investigate advanced spatial modeling methodologies with direct-demand models to better understand the impact of intersections sharing the same area of influence (e.g., adjacent intersections).

Data Availability

Except for the video data, which may include personally identifiable information, the data may be made available. However, since the project has not yet been finalized, it is not currently possible to make them available in supplementary files or in any other form. The data may be released after the project is finalized and any required permissions are obtained from the project sponsor.


The contents of this paper reflect the views of the authors, who are responsible for the facts and the accuracy of the information presented herein. This document is disseminated in the interest of information exchange.

Conflicts of Interest

The authors declare that they have no conflicts of interest.


This research was supported by Safe-D University Transportation Center. We also much appreciate the work of our student research assistants in this project, namely, Chenlei Zhang, James Davisson, and Christopher Galan at SDSU and Kyuhyun Lee at TTI. The report is funded, partially or entirely, by a grant from the U.S. Department of Transportation’s University Transportation Centers Program. However, the U.S. Government assumes no liability for the contents or use thereof.