Abstract

An intelligent transportation system (ITS) is an advanced application that supports multiple transport and traffic management modes. ITS services include calling for emergency rescue and monitoring compliance with traffic laws with the help of roadside units. Many people lose their lives in motorbike accidents, mainly because they are not wearing helmets. Automatic helmet violation detection of motorcyclists from real-time videos is therefore a demanding application in ITS, as it enables authorities to spot and penalize bikers riding without a helmet. Hence, there is a need for a system that automatically detects and captures motorbikers without a helmet in real time. This work proposes a system that automatically detects helmet violations in surveillance videos captured by roadside-mounted cameras. The proposed technique is based on the Faster Region-based Convolutional Neural Network (Faster R-CNN) deep learning model, which takes video as input and performs helmet violation detection so that necessary actions can be taken against traffic rule violators. Experimental analysis shows that the proposed system achieves an accuracy of 97.69% and outperforms its competitors.

1. Introduction

The world’s population is increasing at an unprecedented rate. As per a survey report, the world population was around 600 million at the start of the eighteenth century and had grown to about 7.8 billion by 2020 [1]. The growth in population is accompanied by a proportional increase in the use of vehicles. In 2018, the total number of registered vehicles was 23,588,268, compared to 21,506,641 in the previous year [2]. The motorbike is a cheap and affordable means of transportation for middle-class people. The number of registered motorbikes reached 17,465,880 in 2018, compared to 15,664,098 in the previous year [1, 2]. According to the statistics for 2018, 74% of all registered vehicles were motorbikes [3]. Due to the increased number of vehicles, road congestion has caused more accidents [4]. An intelligent transportation system (ITS) is an advanced transportation system built from a collection of integrated technologies such as electronics, communication, sensors, and cameras [5]. It aims to provide a risk-free system that saves human lives and time and keeps travellers informed about road conditions such as weather, construction, and other calamities [6–8]. ITS is capable of implementing a transportation system that is smart, fully functional, and based on real-time calculations. Such a system typically calls a helpline in case of any emergency or accident encountered by travellers and uses surveillance cameras mounted on roads to check for violations [9, 10]. It incorporates applications from basic to advanced; vehicle navigation systems, variable message signs, and roadside surveillance cameras are some examples [11–14]. Figure 1 displays some applications of ITS.

Figure 2 shows the increase in the number of accidents in Pakistan, separating fatal and non-fatal accidents [15]. The motorbike is not only the most widely used vehicle but also the most dangerous mode of transportation [16]. According to a study conducted at the Pakistan Institute of Medical Sciences (PIMS) during September–December 2015, a total of 709 accidents were reported at the hospital, and 71% of them involved motorbikers [17]. This shows that most victims of traffic accidents are bike riders and leads to a high casualty rate among bikers during or after accidents. In such cases, riding without a helmet is the primary cause of death. According to the statistics, wearing a helmet reduces the death rate by 37% and the head injury rate by 69% [18]. Therefore, it is mandatory by law to wear a helmet while riding a bike [19]. It is difficult for a traffic warden standing on the road to capture every person violating the rules. Worldwide reviews of studies show that fatal accidents causing severe injuries were reduced from 40% to 11% in the presence of surveillance cameras [20]. It is therefore evident that an intelligent system is needed that automatically detects bikers not wearing a helmet with the help of surveillance cameras.

This paper develops a system for automatically detecting bikers without a helmet using a Faster Region-based Convolutional Neural Network (Faster R-CNN). The system takes video as input and converts it into frames to perform helmet violation detection. The dataset has been collected from two sources, i.e., online repositories and self-captured videos from different locations in Lahore, Pakistan. The experimental analysis shows that the proposed system achieves 97.69% accuracy. It may help to take necessary actions against traffic rule violators.

The rest of the paper is organized as follows. Section 2 consists of a literature review. Section 3 contains the proposed helmet violation detection technique. Experimental analysis is performed in Section 4. Finally, Section 5 concludes the paper.

2. Literature Review

Computer vision and digital image processing are used in various applied domains such as remote sensing, pose detection, decision making, path detection, defect detection, and automatic driving [21–26]. The recent focus of research in this field is the use of deep learning models, which have shown good results in various applied domains [27–29].

Many researchers have suggested different methods to solve the problem of automatic helmet detection in a real-time environment. Chiverton [30] implemented a system using a support vector machine (SVM) and background subtraction to identify bikers with and without a helmet. A self-generated dataset was used to develop the system. However, the system has two main limitations: first, it examines the whole frame for helmet detection, which increases the overall computational cost; second, it incorrectly counts the number of heads without a helmet in some cases. Silva et al. [31] introduced a hybrid descriptor model based on texture and geometric features to detect bikers without a helmet. The Hough transform (HT) and SVM are used to detect the biker's head, and a self-generated dataset was used to train the algorithm. They extended their work with a multilayer perceptron model to differentiate among different objects, achieving an accuracy of 94.23%.

Silva et al. [32] proposed a system based on HT and the histogram of oriented gradients (HOG) to extract attributes from the image. The input images are taken from roadside cameras, and a database of 255 images was established. The developed system achieved an accuracy of 91.37%. Waranusast et al. [33] suggested a system based on the K-nearest neighbor (KNN) classifier to determine and detect motorcyclists with and without helmets. The system was tested on a self-created dataset, with input images taken from a web camera. The experimental results showed that the system gave correct detections for the far lane, near lane, and both lanes of 68%, 84%, and 74%, respectively. Dahiya et al. [34] developed a system to detect motorcyclists without a helmet using HOG, scale-invariant feature transform (SIFT), local binary patterns (LBP), and SVM machine learning techniques. The input is taken from the camera as video and then converted into frames for further processing. A background subtraction technique is applied to select moving objects from the frames, and HOG, SIFT, and LBP are used to extract features; any object other than a bike is discarded. SVM is then used to classify bikers with and without helmets. A self-generated dataset was used for testing, and the system achieved an accuracy of 93.80%.

Boonsirisumpun et al. [35] deployed a convolutional neural network (CNN) system to detect bikers without a helmet. The input was captured using cameras, and a dataset of 493 images was used for training. The system used four CNN-based models: GoogLeNet, MobileNet, VGG19, and VGG16. MobileNet gave the highest accuracy of 85.19%. Raj et al. [36] detected bikers who violated helmet-wearing rules using a deep learning technique. Motorcycles are detected using HOG, after which the region of interest is selected. A CNN is then applied to identify bikers without helmets and to perform number plate recognition. A self-generated dataset collected from different sources was used, and they reported an accuracy of 94.70%. Wu et al. [37] used YOLOv3 and YOLO-dense models to detect bikers without a helmet. They collected datasets from two sources, i.e., self-generated and the Internet. The experimental results indicated that they achieved 95.15% mAP for YOLOv3 and 97.59% for the YOLO-dense model.

Siebert and Lin [38] utilized a deep learning approach, RetinaNet50, to detect bikers without a helmet. The proposed system used self-generated data for training, with two classes, i.e., "With Helmet" and "Without Helmet." The experimental results showed that an accuracy of 72.8% was achieved. Vishnu et al. [39] used an adaptive search method to identify moving objects. A CNN trained on a self-generated dataset was then used to identify bikers among the moving objects, and finally a CNN was applied to single out the bikers not wearing a helmet.

Mistry et al. [40] used a CNN to detect bikers without a helmet, applying YOLOv2 at two levels: first to detect different objects and then to detect motorcyclists without helmets. The COCO dataset was used for training, and the experimental results give an accuracy of 92.87%. Afzal et al. [41] used Faster R-CNN to detect bikers who were not wearing helmets. The system was trained on a self-generated dataset, and the experimental results gave an accuracy of 97.26%. Kharade et al. [42] introduced a system for detecting motorcyclists not wearing helmets through deep learning algorithms based on the YOLOv4 model. The proposed model shows sound performance on traffic videos compared to current CNN-based algorithms.

The primary goal of Sridhar et al. [43] is to check whether a person is wearing a helmet or not using YOLOv2. A method that uses deep convolutional neural networks (CNNs) to reveal motorcycle riders who disobey the legal guidelines has been established. It first detects the motorbike and then classifies the rider as with or without a helmet. The proposed architecture yielded better experimental results than traditional algorithms.

Kathane et al. [44] used the YOLOv3 algorithm for implementation. Separate deep learning models are trained for object detection, and the developed system uses three different deep learning models to detect the objects of interest. The established system gives 88.5% precision for motorcycle detection and 91.8% for number plate detection. Rajalakshmi and Saravanan [45] developed a system for monitoring and handling persons breaking the rules through a convolutional neural network (CNN). The system performs vehicle classification, helmet detection, and mask detection through an appropriate CNN-based model. Table 1 summarizes the abovementioned related work.

The existing systems described above can detect bikers without a helmet, but they also have some limitations. Most of them have low accuracy, and the datasets used to develop them are limited. Furthermore, some of these systems cannot differentiate between a helmet and a scarf, whereas the proposed system can. The significant contributions of this work are the establishment of a dataset that covers almost all types of bikes and a detection technique, trained on this comprehensive dataset, that achieves higher accuracy than the existing systems.

3. Proposed System

This section presents the proposed technique to automatically detect helmet violations in surveillance videos captured by roadside-mounted cameras. The proposed technique is based on the Faster R-CNN deep learning model, which takes video as input and performs helmet violation detection so that necessary actions can be taken against traffic rule violators. The proposed system performs multiple operations in sequence: it first detects motorbikes and separates them from other vehicles, and it then categorizes riders into two classes, i.e., "With Helmet" and "Without Helmet." A deep learning algorithm, Faster R-CNN, is used to detect the bikers without helmets. Figure 3 shows the block diagram of the proposed technique, and a short sketch of the overall processing sequence is given below. The following sections describe each component of the proposed technique.
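The following minimal sketch only illustrates the order of operations described above (frames in, per-rider class labels out). The `detect_riders` function stands in for the trained Faster R-CNN detector described in Section 3.4; it and the score threshold are assumptions for illustration, not the authors' code.

```python
from typing import Iterable, List, Tuple

# A detection is (xmin, ymin, xmax, ymax, label, score),
# where label is "with helmet" or "without helmet".
Detection = Tuple[int, int, int, int, str, float]

def detect_riders(frame) -> List[Detection]:
    """Placeholder for the trained Faster R-CNN detector (see Section 3.4)."""
    raise NotImplementedError

def find_violations(frames: Iterable, score_threshold: float = 0.5) -> List[Detection]:
    """Run the detector over every frame and keep confident 'without helmet' detections."""
    violations: List[Detection] = []
    for frame in frames:
        for det in detect_riders(frame):
            *_, label, score = det
            if label == "without helmet" and score >= score_threshold:
                violations.append(det)
    return violations
```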

3.1. Data Acquisition

A dataset of bikers with and without helmets is required to develop the system. Data is acquired from three sources: two datasets from existing works [41, 46] and one self-captured dataset, so that most of the motorcycle types used in different countries are covered. The self-captured dataset consists of surveillance videos recorded by Lahore Safe City cameras mounted on different roads of Lahore, Pakistan. The captured videos contain frontal and back views of the motorcyclists and are converted into frames at a rate of 25 fps. Figure 4 shows sample images from the dataset, and a minimal frame-extraction sketch is given below.
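The paper does not describe the exact extraction tooling. The sketch below shows one common way to convert a surveillance video into frames with OpenCV; the file names, output layout, and optional subsampling factor are assumptions for illustration.

```python
import cv2
from pathlib import Path

def extract_frames(video_path: str, out_dir: str, every_nth: int = 1) -> int:
    """Read a surveillance video and dump frames as JPEG images.

    every_nth lets you subsample (e.g., keep 1 of every 5 frames of a 25 fps clip).
    """
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    saved, index = 0, 0
    while True:
        ok, frame = cap.read()
        if not ok:                      # end of stream
            break
        if index % every_nth == 0:
            cv2.imwrite(f"{out_dir}/frame_{index:06d}.jpg", frame)
            saved += 1
        index += 1
    cap.release()
    return saved

# Example (hypothetical file name): extract_frames("lahore_cam_01.mp4", "frames/cam_01")
```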

3.2. Preprocessing

The dataset must be preprocessed to obtain data appropriate to the problem. The obtained dataset contained redundant data, frames with irrelevant content, incomplete objects, etc. Manual preprocessing is done to select the appropriate frames from the dataset [47], and redundant images are removed. A total of 23800 frames are selected after preprocessing, of which 13631 contain riders with helmets and the remaining 10169 contain riders without helmets.
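The paper describes this selection as manual. Purely for illustration, the sketch below shows one way near-duplicate consecutive frames could be filtered automatically, by thresholding the mean absolute difference between grayscale frames; the threshold value is an assumption.

```python
import cv2
import numpy as np

def drop_near_duplicates(frame_paths, diff_threshold=8.0):
    """Keep a frame only if it differs enough from the last kept frame.

    diff_threshold is the mean absolute grayscale difference (0-255 scale)
    below which two consecutive frames are treated as redundant.
    """
    kept, last_gray = [], None
    for path in frame_paths:
        img = cv2.imread(path)
        if img is None:                 # unreadable/corrupt frame
            continue
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY).astype(np.float32)
        if last_gray is None or np.abs(gray - last_gray).mean() > diff_threshold:
            kept.append(path)
            last_gray = gray
    return kept
```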

3.3. Annotation

Annotation is used for image labelling [48, 49]. In this work, a bounding box is drawn around each object of interest in the image, and four values (the box coordinates) are assigned to each bounding box. The label "with helmet" is assigned to images containing bikers with helmets, and the "without helmet" label is assigned to bikers not wearing a helmet. A sample annotated image is shown in Figure 5.
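The annotation tool and on-disk format are not specified in the paper. Assuming corner coordinates (xmin, ymin, xmax, ymax), a single annotation can be represented as in the sketch below; the example coordinate values are illustrative only.

```python
from dataclasses import dataclass

@dataclass
class BoxAnnotation:
    """One labelled bounding box: four coordinate values plus a class label."""
    xmin: int
    ymin: int
    xmax: int
    ymax: int
    label: str            # "with helmet" or "without helmet"

# Example annotation for one rider in a frame (values are made up for illustration)
ann = BoxAnnotation(xmin=612, ymin=340, xmax=770, ymax=565, label="without helmet")
```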

3.4. Faster R-CNN

This work uses Faster R-CNN [50] to detect bikers without a helmet. It is the extended version of Fast R-CNN [51] and consists of two main modules: a region proposal network (RPN) and a Fast R-CNN detector. The RPN guides the Fast R-CNN detection module to find objects in the image [52]: the RPN generates region proposals, and Fast R-CNN performs object detection on the proposed regions. The general architecture of Faster R-CNN is shown in Figure 6.
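The paper does not state which framework was used for implementation. As a hedged illustration only, the sketch below builds a two-class (plus background) Faster R-CNN detector with torchvision (version 0.13 or later for the `weights` argument); the COCO-pretrained backbone and the dummy input are assumptions for the example, not the authors' configuration.

```python
import torch
import torchvision
from torchvision.models.detection.faster_rcnn import FastRCNNPredictor

# Classes: background (index 0) + "with helmet" + "without helmet"
NUM_CLASSES = 3

def build_helmet_detector():
    # Start from a COCO-pretrained Faster R-CNN and replace the box classification head
    model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
    in_features = model.roi_heads.box_predictor.cls_score.in_features
    model.roi_heads.box_predictor = FastRCNNPredictor(in_features, NUM_CLASSES)
    return model

model = build_helmet_detector()
model.eval()
with torch.no_grad():
    # One dummy 3-channel frame; a real input would be a video frame tensor scaled to [0, 1]
    prediction = model([torch.rand(3, 480, 640)])[0]
print(prediction["boxes"].shape, prediction["labels"], prediction["scores"])
```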

This task is performed with the help of a fully convolutional network that shares computation with the Fast R-CNN object detection network. The RPN takes an image of any dimension as input and outputs a set of rectangular object proposals, each with an objectness score. Consequently, the RPN does not require extra time to generate region proposals, unlike competitors such as selective search, and the sharing of convolutional layers also helps reduce the training time.

A small window is slid over the feature map to generate region proposals. The RPN consists of a regressor and a classifier: the classifier gives the probability that an object is present at a specific location, while the regressor gives its coordinates. The anchors are parameterized by scale and aspect ratio, with 3 values used for each, so there are a total of 9 anchors at each position by default; the centre of the sliding window is known as the anchor. Each anchor is assigned a binary label telling whether an object is present or not. A positive label is assigned to the anchors that either have the maximum intersection-over-union (IoU) overlap with a ground-truth box or have an IoU overlap greater than 0.7 with any ground-truth box. A negative label is assigned to an anchor if its IoU is less than 0.3. For the training of RPNs, the loss function given in equation (1) is used [53]:

$$L(\{a_i\},\{b_i\}) = \frac{1}{N_{cls}}\sum_i L_{cls}(a_i, a_i^{*}) + \lambda \frac{1}{N_{reg}}\sum_i a_i^{*}\, L_{reg}(b_i, b_i^{*}), \tag{1}$$

where $i$ indicates the anchor index in a mini-batch and $a_i$ is the predicted probability that anchor $i$ is an object. $a_i^{*}$ denotes the ground-truth label, and its value is 1 or 0 depending on whether the anchor is positive or negative. The coordinates of the predicted bounding box are represented by the vector $b_i$, and $b_i^{*}$ represents the ground-truth box associated with a positive anchor. $L_{cls}$ is the classification loss, the log loss over the object and non-object classes. $L_{reg}$ is the regression loss; the term $a_i^{*} L_{reg}$ shows that the regression loss is active only for positive anchors, where $a_i^{*} = 1$. The classification and regression layer outputs comprise $\{a_i\}$ and $\{b_i\}$, which are normalized by $N_{cls}$ and $N_{reg}$, respectively. $\lambda$ is used as the balancing weight.
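To make the anchor labelling rule concrete, the sketch below computes the IoU between one anchor and the ground-truth boxes and applies the 0.7/0.3 thresholds described above. Boxes are assumed to be in (xmin, ymin, xmax, ymax) form, and the additional "highest-IoU anchor is positive" rule is omitted for brevity.

```python
import numpy as np

def iou(box_a, box_b):
    """IoU of two boxes given as (xmin, ymin, xmax, ymax)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def label_anchor(anchor, gt_boxes, pos_thr=0.7, neg_thr=0.3):
    """Return 1 (positive), 0 (negative), or -1 (ignored) for one anchor."""
    best = max(iou(anchor, gt) for gt in gt_boxes)
    if best > pos_thr:
        return 1
    if best < neg_thr:
        return 0
    return -1   # anchors in between do not contribute to the RPN loss

# Example: this anchor overlaps the first ground-truth box with IoU ~0.86, so it is positive
print(label_anchor((10, 10, 60, 60), [(12, 8, 58, 62), (200, 200, 260, 260)]))
```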

The input to the proposed model is a cropped helmet image of size 224 × 224 × 3. There are 8 blocks in the backbone architecture, of which 3 are fully connected layers and the remaining 5 are convolutional layers. Each convolutional layer is followed by non-linearities in the form of max pooling and rectification (ReLU) layers. The outputs of two of the three fully connected layers are 4049-dimensional, while the size of the last fully connected layer depends on the classes present in the dataset, with N = 2622. The softmax layer, placed right after the second fully connected layer, handles the un-normalized vectors. The output of all these layers is the prediction probability, represented as probabilistic scores as shown in equation (2):

$$P(y_j \mid \mathbf{z}) = \frac{e^{z_j}}{\sum_{k=1}^{N} e^{z_k}}, \tag{2}$$

where $\mathbf{z}$ is the un-normalized score vector produced by the network and $P(y_j \mid \mathbf{z})$ is the probability assigned to class $j$.
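As a small illustration of equation (2), the sketch below applies a numerically stable softmax to a pair of raw scores for the two helmet classes; the score values themselves are made up.

```python
import numpy as np

def softmax(z):
    """Convert un-normalized scores into probabilities (numerically stable)."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())          # subtract the max to avoid overflow
    return e / e.sum()

# Hypothetical raw scores for ("with helmet", "without helmet")
scores = [2.1, -0.4]
print(softmax(scores))               # approximately [0.924, 0.076]
```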

Table 2 compares Faster R-CNN with other models such as Fast R-CNN and R-CNN. The comparison is performed using three attributes, i.e., the region proposal method, computation time, and prediction time. Faster R-CNN uses an RPN for region proposal instead of the selective search method used in R-CNN [54] and Fast R-CNN. Moreover, the computation and prediction times of Faster R-CNN are better than those of its predecessors, making it appropriate for this work.

4. Experimental Analysis

A Core i7 system with 32 GB RAM running the Ubuntu operating system is used to develop the proposed technique, and a GTX 1080 Ti GPU is used for training and validation of the model. The dataset, containing a total of 23800 images, is divided into two parts, with 70% of the data used for training and 30% for validation. The number of epochs is set to 200000. During the training process, an early stopping function is used, so the model is trained until convergence occurs. Figure 7 shows the training and validation loss. It indicates that the validation loss is initially high, but as training continues, the loss gradually decreases, and by 200000 epochs it has decreased significantly. The detected object must then be passed to the model for classification.
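The exact training script is not given in the paper. The sketch below only illustrates the two details the text mentions, a 70/30 split and early stopping on the validation loss, using a generic loop; the patience value and the callables passed to the loop are assumptions.

```python
import random

def split_dataset(items, train_fraction=0.70, seed=42):
    """Shuffle and split annotated frames into training and validation sets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * train_fraction)
    return items[:cut], items[cut:]

def train_with_early_stopping(train_one_epoch, validation_loss, max_epochs=200000, patience=10):
    """Generic early-stopping loop.

    train_one_epoch: callable that runs one pass over the training data.
    validation_loss: callable returning the current validation loss.
    Training stops once the loss has not improved for `patience` epochs.
    """
    best, stale = float("inf"), 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best

# Example split on a placeholder list standing in for the 23800 annotated frames
train_set, val_set = split_dataset(range(23800))
print(len(train_set), len(val_set))   # 16660 7140
```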

Figure 8 illustrates the training and validation accuracy. The system attains an accuracy of 97.69%. When training starts, accuracy is low and the loss is high; as training progresses, the accuracy increases. The maximum accuracy, obtained at 200000 epochs, is 97.69%.

The confusion matrix for the proposed technique is shown in Table 3. In this work, 7133 samples are used for validation, of which 4089 contain riders with helmets and 3044 contain riders without helmets. Of the samples with helmets, 3995 are predicted correctly and only 94 are predicted incorrectly. Of the remaining 3044 samples (the without-helmet case), 71 are predicted incorrectly while 2973 are predicted correctly.

Several performance metrics are computed to evaluate the proposed system. Table 4 lists the performance metrics and their values, indicating that the proposed system achieves 97.67% accuracy, 97.70% precision, a 97.98% F1 score, and 98.25% sensitivity.
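As a check on how such values follow from the confusion matrix in Table 3, the sketch below recomputes the standard metrics from the counts reported above, treating "with helmet" as the positive class; the exact figures depend on which class is taken as positive and on rounding.

```python
# Confusion-matrix entries from Table 3, with "with helmet" taken as the positive class
tp, fn = 3995, 94     # with-helmet samples: correctly / incorrectly predicted
tn, fp = 2973, 71     # without-helmet samples: correctly / incorrectly predicted

accuracy    = (tp + tn) / (tp + tn + fp + fn)
precision   = tp / (tp + fp)
sensitivity = tp / (tp + fn)           # recall
f1_score    = 2 * precision * sensitivity / (precision + sensitivity)

print(f"accuracy    {accuracy:.4f}")   # ~0.9769
print(f"precision   {precision:.4f}")
print(f"sensitivity {sensitivity:.4f}")
print(f"f1 score    {f1_score:.4f}")   # ~0.9798
```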

Table 5 lists the comparative analysis of the proposed technique against the existing systems. It shows that the proposed system achieves an accuracy of 97.69% and surpasses its competitors.

Figure 9 displays some predictions made by the proposed system. A yellow bounding box indicates a motorcyclist who is not wearing a helmet, whereas a green bounding box indicates one who is. In Figure 9(a), the system correctly predicts the helmet violation, and the yellow bounding box encloses the motorcyclist without a helmet. Figure 9(b) portrays a case with both correct and incorrect predictions of helmet violations. Similarly, in Figures 9(c)–9(f), the algorithm correctly predicts both kinds of motorcyclist, i.e., with and without helmet. It is evident from Figures 9(c) and 9(e) that the proposed system successfully differentiates among a helmet, a scarf, and a cap.

5. Conclusion

Automatic helmet violation detection of motorcyclists from real-time videos is a demanding application in ITS, as it enables authorities to spot and penalize bikers riding without a helmet. This work proposes an automatic helmet violation detection technique for ITS. The proposed technique is based on the Faster R-CNN deep learning model, which takes video as input and performs helmet violation detection so that necessary actions can be taken against traffic rule violators. The experimental analysis shows that the proposed technique achieved 97.69% accuracy. In the future, this work may be extended to incorporate more features, such as number plate detection and the detection of other traffic violations.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.