Abstract

Traffic light recognition is an essential task for an advanced driving assistance system (ADAS) as well as for autonomous vehicles. Recently, deep learning has become increasingly popular in vision-based object recognition owing to its high classification performance. In this study, we investigate how to design a deep-learning based high-performance traffic light detection system. Two main components of the recognition system are investigated: the color space of the input video and the deep-learning network model. We apply six color spaces (RGB, normalized RGB, Ruta's RYG, YCbCr, HSV, and CIE Lab) and three network models (based on the Faster R-CNN and R-FCN models). All combinations of color spaces and network models are implemented and tested on a traffic light dataset with 1280×720 resolution. Our simulations show that the best performance is achieved with the combination of the RGB color space and the Faster R-CNN model. These results can provide a comprehensive guideline for designing a traffic light detection system.

1. Introduction

Over the past few years, various advanced driving assistance systems (ADAS) have been developed and commercialized, and most automotive companies are now doing their best to launch autonomous vehicles as soon as possible. According to the Society of Automotive Engineers (SAE) international standard defining the six levels of driving automation, autonomous driving corresponds to level 3 and higher [1]. Obviously, traffic light recognition is an essential task for ADAS as well as for autonomous vehicles.

For traffic light recognition, various methods have been proposed. These can be analyzed from three aspects: color space, feature extraction, and verification/classification. Different color spaces, namely, gray scale [2, 3], RGB [4, 5], normalized RGB [6], Ruta's RGB [7], YCbCr [8, 9], HSI [10], HSV [11, 12], HSL [13], and CIE Lab [14], have been used, and some studies [15–18] have used more than one color space. For feature extraction, Haralick's circularity measure [19], Sobel edge detection [20], the circle Hough transform [21], 2D Gabor wavelets [22], Haar-like features [23], histograms of oriented gradients (HOG) [24], and geometric features [25] have been applied. For verification/classification, various conventional classifiers have been used, e.g., k-means clustering [26], template matching [27], 2D independent component analysis (ICA) [28], linear discriminant analysis (LDA) [29], decision-tree classifiers, the k-nearest neighbor (kNN) classifier [30], the adaptive boosting algorithm (Adaboost) [31], and the support vector machine (SVM) [32]. Recently, some basic deep-learning networks such as LeNet [33], AlexNet [34], and YOLO [35, 36] have been applied to traffic light recognition. Other approaches based on visible light road-to-vehicle communication have also been developed: LED-type traffic lights broadcast information, and a photodiode [37, 38] or a high-frame-rate image sensor [39, 40] receives the optical signal. In this paper, we mainly focus on vision-based traffic light recognition using deep learning.

In the last couple of years, deep learning has achieved remarkable success in various artificial intelligence research areas. In particular, deep learning has become very popular in vision-based object recognition owing to its high classification performance. One of the first advances was OverFeat, which applied a convolutional neural network (CNN) [34] to a multiscale sliding window algorithm [41]. Girshick et al. proposed regions with CNN features (R-CNN), which achieved up to almost 50% improvement in object detection performance [42]. In R-CNN, object candidate regions are detected, features are extracted using a CNN, and objects are classified using an SVM. Girshick then proposed Fast R-CNN, which uses selective search to generate object candidates and applies a fully connected neural network to classify the objects [43]. However, the selective search algorithm slows down the detection system. Redmon et al. proposed YOLO, which uses a simple CNN approach to achieve real-time processing by reducing computational complexity [35]. Ren et al. proposed Faster R-CNN, which replaces selective search with a region proposal network (RPN) [44]. The RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position; this makes end-to-end training possible. Recently, two notable deep-learning network models were proposed: the single shot detector (SSD) and region-based fully convolutional networks (R-FCN). SSD uses convolutional feature maps of multiple sizes to achieve better accuracy and higher speed than YOLO [45]. R-FCN is a modified version of Faster R-CNN that consists only of convolutional networks [46]. Note that feature extraction is included in the deep-learning detection network in Fast R-CNN, YOLO, Faster R-CNN, SSD, and R-FCN. These deep-learning methods have been widely applied to detect objects such as vehicles and pedestrians [47–51]. However, only a few deep-learning based network models have been applied to traffic light detection [52–55].

From the viewpoint of color representation, various color spaces of the input video data have been used in conventional traffic light recognition methods, but only a few color spaces have been applied in deep-learning based methods. Because color information plays an important role in traffic light detection performance, the color space must be selected carefully in deep-learning based methods. In this study, we focus on how to design a high-performance deep-learning based traffic light recognition system. To find the color space most suitable for deep-learning based traffic light recognition, six color spaces (RGB, normalized RGB, Ruta's RYG, YCbCr, HSV, and CIE Lab) are investigated. For the deep-learning network models, three models based on Faster R-CNN and R-FCN are applied. All combinations of color spaces and network models are implemented and compared.

The rest of this paper is organized as follows. Section 2 discusses previous research on traffic light detection. Section 3 describes the various color spaces and deep-learning network models; all combinations of color spaces and network models are designed in this study. Section 4 explains the configurations, such as the parameters and dataset, used for the performance evaluation. Section 5 presents the simulation results, and Section 6 draws the conclusions.

2. Related Works

In this section, we briefly review the work done so far on traffic light detection. These works are categorized into two groups, namely, conventional classification based and deep-learning based methods, depending on whether deep learning is used. They are examined mainly from the viewpoints of color representation and verification/classification. The analysis is summarized in Table 1.

2.1. Conventional Classification Based Methods

In general, conventional traffic light recognition methods consist of two steps: candidate detection and classification. Various color representations have been used. Charette and Nashashibi did not use any color information [2, 3]; they used the gray-scale image as input data, and after top-hat morphological filtering, adaptive template matching with geometry and structure information was applied to detect the traffic lights. Park and Jeong used color extraction and k-means clustering for candidate detection [4]; the average and standard deviation of each component in the RGB color space were calculated and used, and Haralick's circularity was used for verification. Yu et al. used the difference of each pair of components in RGB space to extract the dominant color [5]; they applied region growing and segmentation for candidate detection, and shape and position information was used for verification. Omachi and Omachi used normalized RGB for color segmentation [6], with edge detection and the circle Hough transform applied for verification. Kim et al. used Ruta's RGB based color segmentation to detect traffic light candidates at night [7]; some geometric and stochastic features were extracted and fed to an SVM classifier. Kim et al. used YCbCr color-based thresholding and shape filtering for candidate detection [8], with Haar-like features and Adaboost used for classification. Kim et al. used YCbCr color segmentation for candidate detection [9]; candidate blobs with red and green lights were detected by thresholding the Cb and Cr components, and various shape and modified Haar-like features were extracted and used in a decision-tree classifier. Siogkas et al. used the CIE Lab color space [14], where the products of the L and a components (RG) and of the L and b components (YB) were used to enhance the discrimination of red and green regions; they used the fast radial symmetry transform and persistency to identify the color of traffic lights. Cylindrical color spaces such as HSI, HSV, and HSL have also been used [10–13]. Hwang et al. used HSI color-based thresholding, morphological filtering, and blob labeling for candidate detection [10]; for verification, they convolved the candidate region with a Gaussian mask using an existence-weight map. The HSV color space was also used, with histograms of the hue and saturation components used for candidate extraction [11] and probabilistic template matching applied for classification. Recently, it has been reported that the detection performance of traffic lights can be improved by using 3D geometry maps prebuilt from GPS, INS/IMU, and range sensors such as stereo cameras or 2D/3D lidar [12, 13]. Jang et al. used Haar-like feature based Adaboost with 3D map information for candidate detection [12]; HOG features and HSV color histograms were then applied to an SVM classifier. In [13], traffic light candidates were detected by a HOG-feature based linear SVM classifier, with the uncertainty of a 3D prior constraining the search regions; for classification, the image color distribution in the HSL color space was used.

Some researchers have used two color spaces [15–18]. Omachi and Omachi used the RGB and normalized RGB color spaces to find candidates, and the circle Hough transform was applied for verification [15]. The combination of RGB and normalized RGB was also used for color segmentation based on fuzzy logic clustering, where some geometric and stochastic features served as primary clues to discriminate traffic lights from other objects [16]. Cai et al. used the RGB and YCbCr color spaces for candidate extraction and classification, respectively [17]; Gabor wavelet transform and ICA based features were extracted and applied to a nearest neighbor classifier. Furthermore, red, yellow, and green traffic light regions were detected by thresholding-based HSV color segmentation and geometrical features [18]; HOG features were extracted in RGB space and used to determine whether an arrow sign is on the light, and three classification algorithms, LDA, kNN, and SVM, were applied.

Since the selection of color space strongly affects traffic light detection performance, past studies have explored a wide range of options. Clearly, it is important to identify the best-performing color space among them.

2.2. Deep-Learning Based Methods

Deep learning has also been used for traffic light detection and classification [52–55]. At first, deep learning was applied only to the classification of traffic lights, with candidates detected by conventional methods [52, 53]. Saini et al. used HSV color space based segmentation, aspect-ratio and area-based analysis, and maximally stable extremal regions (MSER) to localize candidates [52]; HOG features and an SVM were used for verification, while a simple CNN was used for classification. Lee and Park used CIE Lab color space based segmentation to find candidate regions [53]. To reduce false regions, they used an SVM with size, aspect ratio, filling ratio, and position features. Classification was performed by two cascaded CNNs, LeNet followed by AlexNet: LeNet quickly differentiates between traffic lights and background, while AlexNet classifies the traffic light types. Recently, deep learning has been applied to both candidate detection and classification: Behrendt et al. [54] and Jensen et al. [55] applied YOLO-v1 [35] and YOLO-9000 [36], respectively, for traffic light detection and classification.

As discussed, only a few color spaces have been applied in deep-learning based traffic light detection. Because color information plays an important role in the performance of detection, it is necessary to select the color space carefully. It is also required to apply more sophisticated and efficient deep-learning network models to traffic light detection and classification.

3. Deep-Learning Based Traffic Light Detection

In this section, we present a deep-learning based traffic light detection system that consists of preprocessing, deep-learning based detection, and postprocessing, as shown in Figure 1. In preprocessing, the input video data is transformed into another color space; six color spaces are considered. The deep-learning based detection uses an ensemble of a feature extraction network and a detection network. Here, we consider three kinds of network models based on Faster R-CNN and R-FCN, which perform both localization and classification. In postprocessing, redundant detections are removed using the nonmaximum suppression (NMS) technique [56, 57].

We focus on finding the combination of color space and ensemble network that achieves the highest traffic light detection performance. The color spaces and ensemble network models considered are described below.

3.1. Color Spaces

In vision-based object detection and classification, it is necessary to choose a color space in which the characteristics of the object appear clearly. The RGB color space is defined by the three chromatic components: red ($R$), green ($G$), and blue ($B$) [58]. For robustness under changing lighting conditions, normalized RGB has often been used. The normalized components, denoted as $r$, $g$, and $b$, are obtained by $r = R/S$, $g = G/S$, and $b = B/S$, respectively, where $S = R + G + B$ [59]. At low illumination, it is difficult to distinguish between the normalized RGB colors [60]. To overcome this difficulty, Ruta et al. proposed new red and blue color transforms for traffic sign detection [60]. Since traffic lights are red, green, and yellow, we modify Ruta's color representation accordingly. Ruta's red, green, and yellow components, denoted here as $R'$, $G'$, and $Y'$, are obtained as below:

$$R' = \max\left(0,\ \frac{\min(R-G,\,R-B)}{S}\right),\qquad G' = \max\left(0,\ \frac{\min(G-R,\,G-B)}{S}\right),\qquad Y' = \max\left(0,\ \frac{\min(R-B,\,G-B)}{S}\right),$$

where $S = R + G + B$.
The YCbCr color space is obtained from RGB via a linear transform [61]; the Y component is the luma signal, and Cb and Cr are the chroma components. Color can also be represented in cylindrical coordinates, as in the HSV color space [62]. The hue component, H, refers to the pure color it resembles; all tints, tones, and shades of red have the same hue. The saturation, S, describes how white the color is. The value component, V, also called lightness, describes how dark the color is. The CIE Lab color space consists of one luminance component, L, and two color components, a and b [63]. The CIE Lab space is known to be more suitable for many digital image manipulations than the RGB color space.
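
For illustration, a minimal preprocessing sketch covering the six color spaces is given below. It assumes OpenCV and NumPy; the function names are ours, and the Ruta's RYG transform follows the equations above rather than any released implementation.

```python
import cv2
import numpy as np

def to_normalized_rgb(img_bgr):
    """Normalized RGB: each channel divided by S = R + G + B."""
    img = img_bgr.astype(np.float32)
    s = img.sum(axis=2, keepdims=True) + 1e-6  # avoid division by zero
    return (img / s)[..., ::-1]                # reorder BGR -> RGB

def to_ruta_ryg(img_bgr):
    """Ruta-style red/yellow/green enhancement (see equations above)."""
    b, g, r = cv2.split(img_bgr.astype(np.float32))
    s = r + g + b + 1e-6
    r_p = np.maximum(0, np.minimum(r - g, r - b) / s)
    g_p = np.maximum(0, np.minimum(g - r, g - b) / s)
    y_p = np.maximum(0, np.minimum(r - b, g - b) / s)
    return np.dstack([r_p, y_p, g_p])

def preprocess(img_bgr, space="RGB"):
    """Convert a BGR frame into one of the six color spaces."""
    if space == "RGB":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2RGB)
    if space == "nRGB":
        return to_normalized_rgb(img_bgr)
    if space == "RYG":
        return to_ruta_ryg(img_bgr)
    if space == "YCbCr":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2YCrCb)  # OpenCV orders it Y, Cr, Cb
    if space == "HSV":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV)
    if space == "Lab":
        return cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB)
    raise ValueError(f"unknown color space: {space}")
```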

In this paper, these six color spaces are considered in the preprocessing step of the traffic light detection system, as shown in Figure 1. Each color representation is applied in turn and its performance is compared.

3.2. Deep-Learning Based Ensemble Networks

It is known that end-to-end trainable deep-learning models are more efficient than other models in general object detection [35, 36, 44–46], because they allow sophisticated training by sharing weights between feature extraction and detection. YOLO [35, 36], Faster R-CNN [44], SSD [45], and R-FCN [46] follow this end-to-end model. In our traffic light detection system, we therefore consider only end-to-end deep-learning network models that perform both feature extraction and detection.

According to COCO [64], a dataset is divided into three groups depending on the size of the object to be detected: small (area < 32²), medium (32² ≤ area ≤ 96²), and large (area > 96²), where area denotes the number of pixels the object occupies. Therefore, the detection performance of a system can differ across object sizes. Traffic lights are relatively small compared with other objects such as vehicles and pedestrians. For example, almost 90% of the traffic lights in our evaluation dataset belong to the small-size group (area < 32²), as shown in Table 2. Therefore, it is necessary to choose a deep-learning network model that is suitable for small-size object detection.
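
A small helper reflecting these COCO size buckets might look as follows (a sketch using the standard COCO thresholds; the function name is ours):

```python
def coco_size_group(box_w, box_h):
    """Classify a box by the COCO object-size convention (area in pixels)."""
    area = box_w * box_h
    if area < 32 ** 2:
        return "small"
    if area <= 96 ** 2:
        return "medium"
    return "large"

# e.g., a typical traffic light of 14x32 pixels falls in the "small" group
assert coco_size_group(14, 32) == "small"
```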

Huang et al. applied various network models to general object detection on the COCO dataset and compared their performances [65]. Fourteen combinations of meta-architectures and feature extractors are analyzed. Five feature extractors, VGGNet [66], MobileNet [67], Inception-v2 [68], Resnet-101 [47], and Inception-Resnet-v2 [69], are compared, together with three network models based on Faster R-CNN, R-FCN, and SSD. They show that SSD (similar to YOLO) has higher performance for medium- and large-sized objects, but significantly lower performance than Faster R-CNN and R-FCN for small objects. Three ensemble networks, Faster R-CNN with Inception-Resnet-v2, Faster R-CNN with Resnet-101, and R-FCN with Resnet-101, outperform the others for small-size object detection. This is why these three ensemble networks are adopted in our traffic light detection method.

In Faster R-CNN [44], the selective search is replaced by a small fully convolutional network called the RPN to generate regions of interest (RoIs). To handle variations in the aspect ratio and scale of objects, Faster R-CNN introduces the idea of anchor boxes. At each location, three anchor scales (128×128, 256×256, and 512×512) and three aspect ratios (1:1, 2:1, and 1:2) are used, yielding nine anchor boxes. The RPN predicts the probability of being background or foreground for the nine anchor boxes at each location. The remaining network is similar to the Fast R-CNN model. Faster R-CNN is reported to be 10 times faster than Fast R-CNN while maintaining a similar accuracy level [44].
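
As an illustration, a minimal sketch of this anchor generation, assuming the scales and ratios above (the function name is ours):

```python
import numpy as np

def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(1.0, 2.0, 0.5)):
    """Generate the nine Faster R-CNN anchor boxes centered at (cx, cy).

    Each anchor keeps the area scale*scale while its sides follow the
    aspect ratio h:w, i.e. w = scale / sqrt(ratio), h = scale * sqrt(ratio).
    Returns boxes as (x1, y1, x2, y2).
    """
    boxes = []
    for s in scales:
        for r in ratios:
            w = s / np.sqrt(r)
            h = s * np.sqrt(r)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

anchors = make_anchors(64, 64)   # nine anchors at one feature-map location
print(anchors.shape)             # (9, 4)
```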

R-FCN [46] is a region-based object detection framework leveraging deep fully convolutional networks. In contrast to other region-based detectors such as Fast R-CNN and Faster R-CNN, which apply a per-region subnetwork hundreds of times, R-FCN uses a fully convolutional network that operates on the entire image. Instead of the RoI pooling at the end of Faster R-CNN, R-FCN uses position-sensitive score maps and a position-sensitive RoI pooling layer to resolve the dilemma between translation invariance in image classification and translation variance in object detection.
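
To make the position-sensitive idea concrete, below is a simplified NumPy sketch of position-sensitive RoI pooling for a single class with k = 3; it is illustrative only and omits the per-class score-map banks and the regression branch of the real R-FCN:

```python
import numpy as np

def ps_roi_pool(score_maps, roi, k=3):
    """Simplified position-sensitive RoI pooling.

    score_maps: array of shape (k*k, H, W); channel i holds the score map
                specialized for the i-th of the k*k spatial bins.
    roi:        (x1, y1, x2, y2) in feature-map coordinates.
    Returns the k*k pooled votes; averaging them gives the RoI score.
    """
    x1, y1, x2, y2 = roi
    bin_w = (x2 - x1) / k
    bin_h = (y2 - y1) / k
    votes = np.empty((k, k))
    for i in range(k):          # bin row
        for j in range(k):      # bin column
            ys = int(y1 + i * bin_h); ye = max(ys + 1, int(y1 + (i + 1) * bin_h))
            xs = int(x1 + j * bin_w); xe = max(xs + 1, int(x1 + (j + 1) * bin_w))
            # each bin pools only from its own dedicated score map
            votes[i, j] = score_maps[i * k + j, ys:ye, xs:xe].mean()
    return votes

maps = np.random.rand(9, 40, 64)                  # toy (k*k, H, W) score maps
print(ps_roi_pool(maps, (10, 5, 34, 29)).mean())  # averaged RoI score
```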

Fully convolutional image classifier backbones, such as Resnet-101 [47] and Inception-Resnet-v2 [69], can be used for object detection. Resnet [47] is a residual learning framework that makes training easier for deeper neural networks; the residual network with 101 layers (Resnet-101) is reported to give the best performance for object classification [47]. Inception-Resnet-v2 [69] is a hybrid Inception version that combines the residual and Inception networks.

In this paper, three kinds of deep-learning based ensemble network models are considered for traffic light detection (see Figures 2–4). The first, Faster R-CNN with Inception-Resnet-v2, consists of Inception-Resnet-v2 for feature extraction, the RPN for candidate extraction, and the RoI pooling of Fast R-CNN for classification. The second, Faster R-CNN with Resnet-101, consists of Resnet-101 for feature extraction, the RPN, and RoI pooling. The third, R-FCN with Resnet-101, consists of Resnet-101, the RPN, and position-sensitive score maps with position-sensitive RoI pooling for classification. Each network model is applied and its performance is compared.

4. Configuration for Evaluation

In this section, we introduce the dataset and data augmentation method for traffic light detection, parameter tuning, and measurement metrics.

4.1. Dataset and Data Augmentation

For the simulations, we use the Bosch Small Traffic Lights Dataset (BSTLD) offered by Behrendt et al. [54]. To use the same types of traffic lights both for training and testing, we use only the training data set of BSTLD, which consists of 5,093 images. Among them, 2,042 images containing 4,306 annotated traffic lights are randomly selected and used as the test data set. The training data set consists of 6,102 images containing 12,796 annotated traffic lights. For the training set, 3,051 images are taken from the BSTLD training set and the others are generated using the following data augmentation techniques (a sketch of these augmentations follows the list):

(i) Additional Noise and Blur. Random addition of Gaussian, speckle, and salt-and-pepper noise, and generation of an image with signal-dependent Poisson noise.
(ii) Brightness Changes in the Lab Space. Addition of random values to the luminance (lightness) component.
(iii) Saturation and Brightness Changes in the HSV Space. Additive jitter generated at random by means of exponentiation, multiplication, and addition of random values to the saturation and value channels.
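
A minimal sketch of these three augmentation families, assuming OpenCV and NumPy (parameter values and function names are ours, chosen for illustration):

```python
import cv2
import numpy as np

rng = np.random.default_rng()

def add_gaussian_noise(img, sigma=10.0):
    """(i) Additive Gaussian noise; similar helpers cover speckle/salt-and-pepper."""
    noisy = img.astype(np.float32) + rng.normal(0, sigma, img.shape)
    return np.clip(noisy, 0, 255).astype(np.uint8)

def jitter_lab_brightness(img_bgr, max_shift=20):
    """(ii) Random shift of the L (lightness) channel in CIE Lab."""
    lab = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2LAB).astype(np.int16)
    lab[..., 0] = np.clip(lab[..., 0] + rng.integers(-max_shift, max_shift + 1), 0, 255)
    return cv2.cvtColor(lab.astype(np.uint8), cv2.COLOR_LAB2BGR)

def jitter_hsv(img_bgr, gamma_range=(0.8, 1.2), gain_range=(0.9, 1.1), bias=10):
    """(iii) Exponentiation, multiplication, and additive jitter on S and V."""
    hsv = cv2.cvtColor(img_bgr, cv2.COLOR_BGR2HSV).astype(np.float32)
    for c in (1, 2):  # saturation and value channels
        g = rng.uniform(*gamma_range)
        hsv[..., c] = ((hsv[..., c] / 255.0) ** g) * 255.0
        hsv[..., c] = hsv[..., c] * rng.uniform(*gain_range) + rng.uniform(-bias, bias)
    return cv2.cvtColor(np.clip(hsv, 0, 255).astype(np.uint8), cv2.COLOR_HSV2BGR)
```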

Both the training and testing data sets consist of 1,280×720 images with annotations including the bounding boxes of traffic lights as well as the current state of each traffic light. An active traffic light is annotated with one of six traffic light states (green, red, yellow, red left, green left, and off), as shown in Figure 5. Detailed descriptions of the training and testing data sets are summarized in Table 3.

4.2. Parameter Tuning for Training

All three ensemble networks are trained for a maximum of 20,000 epochs, starting from pretrained weights obtained on the COCO dataset [64]. The Faster R-CNN and R-FCN networks are trained by stochastic gradient descent (SGD) with momentum [70, 71], where the batch size is 1 and the momentum value is 0.9. We manually tune the learning rate schedule individually for each feature extractor. In our implementation, the learning rate for the SGD with momentum optimizer is scheduled as follows (a sketch of this schedule is given after the list):

(i) Initial learning rate: 0.0003
(ii) Learning rate for 0 ≤ step < 900,000: 0.0003
(iii) Learning rate for 900,000 ≤ step < 1,200,000: 0.00003
(iv) Learning rate for 1,200,000 ≤ step: 0.000003
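
As a sketch, this piecewise-constant schedule can be written as follows (the step thresholds and rates are those listed above):

```python
def learning_rate(step: int) -> float:
    """Piecewise-constant SGD learning rate schedule used for training."""
    if step < 900_000:
        return 3e-4      # initial rate, kept for the first 900k steps
    if step < 1_200_000:
        return 3e-5      # decayed by 10x
    return 3e-6          # final rate

assert learning_rate(0) == 3e-4
assert learning_rate(1_000_000) == 3e-5
```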

As suggested by Huang et al. [65], we limit the number of proposals to 50 in all three networks to attain similar traffic light detection speeds.

4.3. Measurement Metrics

To evaluate the performances of traffic light detection, we use measurement metrics such as average precision (AP), mean average precision (mAP), overall AP, and overall mAP that have been widely used in VOC challenge [72, 73] and the COCO 2015 detection challenge [74].

AP is the precision averaged across all values of recall between 0 and 1. Here, AP is calculated by averaging the interpolated precision over eleven equally spaced recall values [0, 0.1, 0.2, …, 0.9, 1.0] [75]. To evaluate the performance over two or more classes, mAP is calculated by averaging the APs over every class. We also use overall AP and overall mAP, which are obtained by averaging the APs and mAPs, respectively, over the IoU thresholds [0.5, 0.55, 0.60, …, 0.90, 0.95], where IoU stands for intersection over union [73].
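
A sketch of the 11-point interpolated AP computation (the helper below assumes precision/recall arrays computed over detections ranked by descending score):

```python
import numpy as np

def ap_11_point(recall, precision):
    """11-point interpolated average precision (VOC-style).

    recall, precision: 1-D arrays over the ranked detections.
    At each recall level r in {0, 0.1, ..., 1.0}, take the maximum
    precision among points whose recall is >= r, then average.
    """
    recall = np.asarray(recall)
    precision = np.asarray(precision)
    ap = 0.0
    for r in np.linspace(0.0, 1.0, 11):
        mask = recall >= r
        p = precision[mask].max() if mask.any() else 0.0
        ap += p / 11.0
    return ap

# toy example: three detections, two of which are true positives
print(ap_11_point([0.5, 0.5, 1.0], [1.0, 0.5, 0.67]))
```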

5. Simulation Results

In this section, we analyze the simulation results and detection examples. For the evaluation, we use measurement metrics such as overall mAP, overall AP, mAP, and AP. For analysis of the detection examples, we apply NMS.

5.1. Simulation Results

Eighteen methods, combining the six color spaces with the three network models, are implemented and compared. Tables 4 and 5 show the detection performances, where every combination is listed in the left columns. In the tables, bold and underlined numbers indicate the top-ranked method, bold numbers the second rank, and underlined numbers the third rank.

The first two network models, Faster R-CNN with Inception-Resnet-v2 and Faster R-CNN with Resnet-101, generally perform better than R-FCN with Resnet-101 in terms of mAP. In all three networks, RGB and normalized RGB yield higher performance than the other color spaces. No method performs well over every type of traffic light.

In Table 4, from the viewpoint of color space, the RGB, normalized RGB, and HSV spaces have higher mAP in Faster R-CNN with Inception-Resnet-v2, while RGB and normalized RGB perform well in Faster R-CNN with Resnet-101. In the case of yellow traffic lights, normalized RGB and HSV in Faster R-CNN with Inception-Resnet-v2 and HSV in Faster R-CNN with Resnet-101 outperform the other methods, but most methods show limited overall mAPs depending on the traffic light size (small versus nonsmall). The medium and large size data are combined into the nonsmall set, because our dataset contains very few large traffic lights. The top-ranked three methods, RGB based Faster R-CNN with Inception-Resnet-v2, normalized RGB based Faster R-CNN with Inception-Resnet-v2, and HSV based Faster R-CNN with Inception-Resnet-v2, retain their good performance on small-size objects.

Table 5 shows the detection performances in terms of AP and mAP when the IoU is fixed at 0.5. The top-ranked two methods, RGB based Faster R-CNN with Inception-Resnet-v2 and normalized RGB based Faster R-CNN with Inception-Resnet-v2, again perform better than the others, although the ranking order changes slightly with object size. Ruta's RYG and HSV in Faster R-CNN with Resnet-101 and HSV in Faster R-CNN with Inception-Resnet-v2 have significantly higher performance for yellow traffic lights. As in Table 4, the CIE Lab color space performs relatively poorly regardless of the network model, and the size-dependent mAP results are similar to those in Table 4.

5.2. Detection Examples

After the traffic light detection procedure, we use NMS to remove redundant detections. In the final test process, the IoU threshold of NMS is fixed at 0.5 (a sketch of this step is given below). The traffic light detection examples of the top-ranked two methods are shown in Figures 6 and 7. Six example images are selected to show the detection results for the six types of traffic lights. Traffic lights with an object score greater than 0.5 are detected and classified. True positives are indicated by the corresponding traffic light symbol; false positives and false negatives are marked FP and FN, respectively. As shown in Figure 6, the top-ranked method, RGB based Faster R-CNN with Inception-Resnet-v2, produces twenty-six true positives, three false positives, and four false negatives over the six images. Figure 7 shows that normalized RGB based Faster R-CNN with Inception-Resnet-v2 produces twenty-four true positives, two false positives, and seven false negatives. Neither method detects yellow traffic lights well.
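
For reference, here is a minimal sketch of the NMS step with the IoU threshold of 0.5 used here; it is a generic greedy implementation, not the authors' exact code:

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    """Greedy non-maximum suppression; boxes are (x1, y1, x2, y2)."""
    boxes = np.asarray(boxes, dtype=np.float32)
    order = np.argsort(scores)[::-1]          # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # IoU of the best box with the remaining ones
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * \
                (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter + 1e-9)
        order = order[1:][iou <= iou_thresh]  # drop overlapping detections
    return keep
```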

5.3. Summary

Based on the performance analysis, the Faster R-CNN model is more suitable for traffic light detection than R-FCN. Inception-Resnet-v2 shows better feature extraction performance than Resnet-101 in the Faster R-CNN framework. From the viewpoint of color space, RGB yields the highest performance in all ensemble networks. Normalized RGB is also a good color space for the Inception-Resnet-v2 model.

6. Conclusions

In this paper, we present a deep-learning based traffic light detection system that consists mainly of a color space transform and an ensemble network model. The simulations show that the Faster R-CNN with Inception-Resnet-v2 model is more suitable for traffic light detection than the others. Regardless of the network model, the RGB and normalized RGB color spaces yield high performance. However, most methods have limited performance in detecting yellow traffic lights. We observe that yellow lights are often misclassified as red lights because yellow lights are relatively scarce in the training dataset; the performance could be improved if yellow light training data were as plentiful as for the other colors. These results can help developers choose an appropriate color space and network model when deploying deep-learning based traffic light detection.

Data Availability

For the simulations, Bosch Small Traffic Lights Dataset offered by Behrendt et al. is used: https://hci.iwr.uni-heidelberg.de/node/6132.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported in part by Basic Science Research Programs through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (Grant no. NRF-2018R1A2B6005105), in part by the Brain Korea 21 Plus Program (no. 22A20130012814) funded by National Research Foundation of Korea (NRF), and in part by MSIT (Ministry of Science, ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2018-2016-0-00313) supervised by the IITP (Institute for Information & Communications Technology Promotion).