Implementing image processing-based artificial intelligence algorithms is a critical task that requires careful selection of both the algorithm and the processing unit. With the advancement of technology, researchers have developed many algorithms that achieve high accuracy with minimal processing requirements. At the same time, cost-effective high-end graphical processing units (GPUs) are now available to handle complex processing tasks. However, the optimum configurations of the various deep learning algorithms implemented on GPUs are yet to be investigated. In this work, we test Convolutional Neural Network (CNN) based You Only Look Once (YOLO) variants on the NVIDIA Jetson Xavier to assess the compatibility between the GPU and the YOLO models. Furthermore, the performance of the YOLOv3, YOLOv3-tiny, YOLOv4, and YOLOv5s models is evaluated during training on our Dell PowerEdge R740 server. We demonstrate that YOLOv5s is a good benchmark for object detection, classification, and traffic congestion monitoring on the Jetson Xavier GPU board. YOLOv5s achieved an average precision of 95.9% among all YOLO variants, and the highest success rate achieved is 98.89%.

1. Introduction

In recent years, technology has constantly been evolving with the spread of artificial intelligence techniques such as deep learning. Various machine learning algorithms have been developed to solve one of the biggest challenges in computer vision, namely, object detection and identification [1]. Object detection is the problem of identifying, localizing, and classifying single or multiple objects in an image [2]. It is well established that deep learning algorithms outperform conventional techniques. There are two main categories of approaches for object detection, identification, and tracking. The first is based on a single-stage neural network with a convolutional architecture [3] that generates a fixed number of predictions on a grid, such as SSD [4], YOLO [5], and M2Det [6]. The second is based on networks with two or more stages: the first stage finds regions of interest that have a high probability of containing an object, and the second or later stages produce the classification score and spatial offsets, such as FPN [7], YOLOv5 [8], and Faster R-CNN [9]. Object detection techniques have been successfully used in many real-time applications, ranging from autonomous driving [10] to robotics and machine vision [11], from video surveillance [12] to traffic monitoring [13], and from medical imaging [14] to diagnostic systems [15].

Despite the remarkable breakthroughs in machine learning and deep learning [16], there is plenty of room for improvement. At present, object detection, identification, and tracking depend on both an efficient algorithm and an embedded platform capable of running that computationally expensive algorithm. The optimal selection of the embedded platform is critical for real-time applications [17].

Over the last few years, embedded hardware has increasingly taken the form of platforms with graphical processing units (GPUs) [18]. Embedded platforms based on GPUs provide high performance and low power consumption and execute functions in parallel. In addition, the compatibility of NVIDIA GPU-based embedded systems with the JetPack SDK [19] and the Open-Source Computer Vision Library (OpenCV) is advantageous, as these provide libraries for deep learning, computer vision, and accelerated computing. However, the performance of a GPU-based embedded system depends on various parameters such as GPU and memory usage, temperature, and inference time [20]. Furthermore, most reported works use offline or batch mode, where historical or recorded datasets are analyzed. This article examines the performance of the NVIDIA Jetson Xavier using deep learning algorithms in real-time environments. The performance of the standard YOLO variants YOLOv1, YOLOv2, YOLOv3, YOLOv4, and YOLOv5 is tested and evaluated in real time on the NVIDIA Jetson Xavier. The main contributions of this research work are as follows:

(1) The performance of YOLO variants with improved CNN algorithms is evaluated in real time.

(2) The performance parameters of the NVIDIA Jetson Xavier AGX, such as memory, temperature, and inference time, are measured and evaluated with the real-time implementation of the YOLO variants.

(3) The GPU processing board NVIDIA Jetson Xavier is evaluated to analyze real-time road traffic performance using real-time traffic data.

The remainder of the paper is structured as follows. Section 2 reviews recent research on CNN-based object detection algorithms. Section 3 presents the methodology of this research, and analytical results are provided in Section 4. Finally, Section 5 states the conclusions.

2. Literature Review

Table 1 summarizes the literature on various deep learning algorithms applied to GPUs over the last few years. In [21], Blair and Robertson used the Histogram of Oriented Gradients (HOG) and Mixture of Gaussians (MoG) for object and event detection on a field-programmable gate array (FPGA), a central processing unit (CPU), and a GPU. They concluded that the detectors using GPUs process faster but consume more power, whereas the setups that perform processing on the FPGA consume relatively less power at lower accuracy. In [22], Artamonov et al. implemented YOLO on mobile graphics processors such as the NVIDIA Jetson for traffic sign recognition. Komasilovs et al. developed a vehicle detection and tracking system using an outdoor surveillance camera. The pretrained SSD MobileNet V1 model was fine-tuned to obtain the vehicle detection model. The real-time tracking was done using a CPU (Intel i5, 16 GB RAM) and achieved an average of 92% vehicle detection and tracking accuracy. They concluded that the deep learning detection model is viable only when executed on better GPU-equipped hardware.

Zhou et al. [23] used automotive light detection and ranging (LiDAR) sensors and an NVIDIA GTX 1080i GPU to implement a spiking convolutional neural network in YOLOv2 for real-time object detection in autonomous cars. The proposed networks were compared with various other frameworks, and the authors reported better average precision than other typical models in the literature.

Khazukov et al. [24] used YOLOv3 on a CPU equipped with a GPU to detect vehicles and monitor traffic parameters. They used the YOLOv3 neural network architecture and the Simple Online and Real-time Tracking (SORT) open-source tracker, and achieved almost 90% vehicle count accuracy in day and night images. In [25], Avramović et al. used different variants of YOLO implemented on a GeForce GTX 1080 to assess real-time performance for automotive applications, including driving assistance, detecting road objects, autonomous vehicles, and automatic traffic sign inventory maintenance.

Barba-Guaman et al. [26] used the Jetson Nano and implemented various algorithms to detect vehicles and pedestrians’ luggage, namely, single shot detection (SSD), PedNet, MultiPed, MobileNet V1 and V2, and SSD-Inception V2. They used different datasets for the identification of vehicles and pedestrians. The maximum accuracy for vehicle detection was 84.01% for SSD-MobileNet V1 and SSD-Inception V2. In the case of pedestrian detection, the maximum accuracy obtained was 90.23% with the PedNet framework. They also found that the models with the lowest processing time were SSD-MobileNet V2, SSD-MobileNet V1, and SSD-Inception V2.

Castellano et al. [27] used embedded hardware platforms to detect human crowds in aerial images using CNNs. The training was run offline on the VisDrone dataset [28] using an Intel Core i5 system with 8 GB RAM and an NVIDIA GeForce MX110 GPU with 2 GB of memory, running the Windows 10 operating system. The trained networks were deployed on two computational hardware platforms, Raspberry Pi 3 and NVIDIA Jetson TX2; the TX2 outperformed the Raspberry Pi 3 in detection accuracy and processing speed for all implemented models.

Kim et al. [29] used a multistage convolutional neural network (MSCNN) and variants of YOLOv3 to improve vehicle detection on a conventional Intel i5 CPU. The proposed MSCNN and YOLOv3 were applied to three datasets: KITTI VD [30], AUTTI [31], and crowd AI [32]. The algorithms were trained using the PyTorch package [33] in Python on a GTX TITAN X GPU. Tests were performed on an Intel i5-4670 CPU without a dedicated GPU; however, the feasibility of real-time embedded implementation of MSCNN and YOLOv3 was not discussed.

It is important to note that the previous works are based on either offline or batch mode, where historical or recorded datasets are used for object detection, identification, and tracking. They have been using computationally less expensive algorithms, which provide compromised accuracy in terms of detection and tracking. Other methods discussed in the literature can often be considered expensive, given the real-time requirements, implementation on embedded platforms, and application’s computational limits.

2.1. Object Detection Algorithms

There are various object recognition and detection algorithms, such as YOLO [34]. Among them, YOLO (You Only Look Once) has gained significant importance in the computer vision community due to its real-time and accurate object detection in a wide range of applications [35]. YOLOv2 was released in 2017 with several iterative improvements in the layers, including batch normalization, higher resolution, and adequately defined anchor boxes.

YOLOv3 was released in 2018, and the improvements in this version added an objectness score to the bounding box prediction and improved the backbone network layers. Predictions are made at three levels of granularity, which improves performance on smaller objects. YOLOv2 and YOLOv3 achieve higher mAP and FPS than Faster R-CNN and SSD; Girshick first published R-CNN [36] and Faster R-CNN [37] in 2014 and 2015, respectively.

YOLOv4 was released in 2020 and has been reported in the literature to be more accurate than YOLOv3. However, the relative accuracy of YOLOv4 and YOLOv5 is still open to question, as some researchers claim that YOLOv4 is more accurate while others claim that YOLOv5 is. Jocher et al. released YOLOv5 in 2020, shortly after the release of YOLOv4, with further improvements. The reported results are based on different datasets and tuned hyperparameters, and none of the related works evaluates real-time live-stream processing on the NVIDIA Jetson AGX Xavier across different YOLO variants under the specific criteria used in this research.

2.2. Mobile Computation Platforms

GPUs are characterized by excellent memory bandwidth and computation power [38]. With the same number of transistors available, a GPU achieves higher arithmetic intensity owing to the parallel nature of graphics computation. In addition, GPUs are both inexpensive and readily available. These features make the GPU an excellent choice for implementing deep learning frameworks. With the advent of developer kits such as the NVIDIA Jetson developer kits, real-time portable applications have become possible. Table 2 presents the NVIDIA Jetson developer boards available as computation platforms.

3. Methodology

This section presents the proposed approach for detecting and counting vehicles for traffic congestion monitoring using hardware acceleration. At present, different hardware accelerators can solve complex problems through their capabilities. This research uses the NVIDIA developer kit from the Jetson family (Jetson Xavier AGX) because of its high computation performance, as mentioned in Table 2, and its power efficiency as a standalone system. The selection of the Xavier AGX hardware accelerator depends on data characteristics such as size, quantity, and application. This information assists in selecting the right combination based on the data properties [40].

3.1. Data Selection

The initial step in selecting and implementing the deep learning-based YOLO-variant algorithms is to select the data. Data preparation involves converting videos into images, where resolution must be considered to obtain good-quality images for training the model and to determine the input size of the algorithms. The data quantity should also be monitored, because a training set that is too small leaves too few samples for each class. The COCO dataset [41] is mentioned in Table 3, from which 20,500 images of five classes of traffic vehicles have been taken. Our dataset contains one week of traffic video sequences under various challenging conditions, such as daytime and nighttime recording and different camera angles at high resolution, and it was augmented with rotation, zooming, and top-to-bottom flipping. The dataset is first divided into 80% for training and 20% for testing. The next step is to consider the color samples from the combined dataset for feature extraction, which are then used by the deep learning models.
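The 80/20 division described above can be sketched as follows. This is a minimal illustration, not the authors' tooling: the function name `split_dataset`, the fixed seed, and the frame file names are assumptions introduced here.

```python
import random


def split_dataset(image_paths, train_fraction=0.8, seed=0):
    """Shuffle image paths reproducibly and split them into train/test lists."""
    if not 0.0 < train_fraction < 1.0:
        raise ValueError("train_fraction must be between 0 and 1")
    shuffled = list(image_paths)           # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


# Hypothetical file names standing in for the extracted video frames.
frames = [f"frame_{i:05d}.jpg" for i in range(5000)]
train, test = split_dataset(frames)
print(len(train), len(test))
```

Shuffling before the cut avoids placing all frames from one recording period (for example, nighttime) in the same partition.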

3.2. Hardware Accelerator

DL methods rely on hardware accelerators, and the accelerator must be chosen to fit the data needs and the application of the model, which necessitates an assessment to find the best hardware for the processing. The growth of deep neural networks has raised their computational complexity and, as a result, their resource consumption, creating implementation issues. As mentioned in Table 2, the NVIDIA Jetson Xavier AGX has been used for its power efficiency as a standalone system. The other components of the prototype hardware include an Ethernet-supported Hikvision HD camera and a battery backup system for portability. The camera (model: DS-2CD4A85-IZH, ultra HD 4K resolution) [46] communicates with the NVIDIA Jetson Xavier AGX using the Real-Time Streaming Protocol (RTSP) for live data analysis, as shown in Figure 1. Furthermore, the Internet Protocol (IP) addresses assigned to the camera and the NVIDIA media processor belong to the same IP pool.
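Whether two devices share an address pool can be checked before attempting to open the stream. The sketch below uses Python's standard `ipaddress` module; the addresses and the /24 network are hypothetical values from a documentation range, not the addresses used in the deployment.

```python
import ipaddress


def same_subnet(addr_a, addr_b, network):
    """Return True when both host addresses fall inside the given network."""
    net = ipaddress.ip_network(network)
    return ipaddress.ip_address(addr_a) in net and ipaddress.ip_address(addr_b) in net


# Hypothetical addresses from a documentation range, not the deployed ones.
CAMERA_IP = "192.0.2.10"
JETSON_IP = "192.0.2.20"
print(same_subnet(CAMERA_IP, JETSON_IP, "192.0.2.0/24"))
```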

The core part of the proposed AI-enabled object detection-based traffic monitoring system is the development and training of several YOLO variants. Figure 2 represents the overall process followed. The implementation is explained further in the sections below.

3.3. Implementation of YOLO Variants

You Only Look Once (YOLO) performs object detection: it takes an image as input and predicts the bounding box coordinates of the objects in it. The YOLO algorithm locates each object and assigns a corresponding class label using a bounding box, and it has an advantage in speed and performance compared to other deep learning algorithms. YOLO uses a convolutional neural network backbone, divided into three kinds of layers: input, hidden, and output. Table 4 lists the total number of layers of the individual YOLO models. YOLO works well for multiple objects. Each object is associated with one grid cell; when one grid cell contains the center points of two different objects, anchor boxes are used to handle the overlap. Each anchor box has a specific height and width. Figure 3 illustrates a field test in which YOLO detects multiple objects in the image.
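The association between an object and a grid cell can be illustrated with a short sketch. The function name `responsible_cell`, the 416x416 frame size, and the 13x13 grid are assumptions chosen for illustration; the grid size depends on the model and input resolution.

```python
def responsible_cell(box, image_w, image_h, grid_size):
    """Return (row, col) of the grid cell containing the box center.

    box is (x_min, y_min, x_max, y_max) in pixels.
    """
    x_min, y_min, x_max, y_max = box
    cx = (x_min + x_max) / 2.0
    cy = (y_min + y_max) / 2.0
    # Clamp so a center lying exactly on the right/bottom edge stays in range.
    col = min(int(cx / image_w * grid_size), grid_size - 1)
    row = min(int(cy / image_h * grid_size), grid_size - 1)
    return row, col


# Hypothetical 416x416 frame and 13x13 grid; both box centers fall in one cell.
car = (100, 200, 180, 260)
bike = (130, 215, 160, 255)
print(responsible_cell(car, 416, 416, 13), responsible_cell(bike, 416, 416, 13))
```

Both hypothetical boxes map to the same cell, which is the situation in which boxes of different shapes (anchor boxes) are needed to keep the two objects apart.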

3.3.1. Loss Function

In equation (1), the symbols are defined as follows:

$S^2$ refers to the number of grids;

$B$: total number of forecasting boxes in a cell;

$x_i$ and $y_i$: center coordinates, different for each cell;

$w_i$ and $h_i$ are the dimensions of the prediction box;

$C_i$: confidence of forecasting;

$p_i(c)$: vehicle detection assurance;

$\lambda_{coord}$: position loss function weight;

$\lambda_{noobj}$: classification loss function weight;

$\mathbb{1}_{ij}^{obj}$: vehicle object in the $j$th prediction frame; in case of a target vehicle, its value is 1, otherwise, 0;

subsequent predicted value = ($\hat{x}_i$, $\hat{y}_i$, $\hat{w}_i$, $\hat{h}_i$, $\hat{C}_i$, $\hat{p}_i(c)$).

The general loss function is a sum of five terms. The first computes the sum-squared error of the position predictions, and the second uses the square roots of the predicted width and height of the box. The third and fourth summing elements give the loss for the confidence, and the fifth gives the loss for the class probability. In the YOLO calculation, the Intersection over Union (IoU) loss error and the classification loss error are determined using multiclass cross-entropy classification.
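For reference, a loss with the five-term structure described above can be written in the standard sum-squared-error form of the original YOLO formulation. This is an assumed form, given as a sketch: $S^2$ is the number of grid cells, $B$ the number of boxes per cell, hatted symbols denote predictions, and $\mathbb{1}_{ij}^{noobj}$ is introduced here as the complement of the object indicator. Later YOLO variants modify individual terms.

```latex
\begin{equation}
\begin{aligned}
\mathcal{L} ={}& \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ (x_i - \hat{x}_i)^2 + (y_i - \hat{y}_i)^2 \right] \\
&+ \lambda_{coord} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj}
  \left[ \left(\sqrt{w_i} - \sqrt{\hat{w}_i}\right)^2 + \left(\sqrt{h_i} - \sqrt{\hat{h}_i}\right)^2 \right] \\
&+ \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{obj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \lambda_{noobj} \sum_{i=0}^{S^2} \sum_{j=0}^{B} \mathbb{1}_{ij}^{noobj} \left( C_i - \hat{C}_i \right)^2 \\
&+ \sum_{i=0}^{S^2} \mathbb{1}_{i}^{obj} \sum_{c \in \text{classes}} \left( p_i(c) - \hat{p}_i(c) \right)^2
\end{aligned}
\end{equation}
```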

3.3.2. Training and Evaluation

The network has been trained using the MS COCO (Common Objects in Context) dataset, which has 80 classes of objects with annotations and labels. Our own dataset has been combined with MS COCO to improve the algorithm’s performance. For dataset development, HD cameras were deployed at various locations (Dr Ziauddin Hospital North Nazimabad, Dr Ziauddin Hospital OPD North Nazimabad, and Ziauddin Engineering University), the video streams were converted into frames, and the frames were annotated and saved in text file format using the LabelImg tool, as demonstrated in Figure 4. A total of 5000 images were taken at this stage. The dataset was augmented using the Augmentor library. The purpose was to train the algorithm to face real-time data challenges such as noise, picture brightness variation, and frame tilting. Five classes of images were considered for training the network: truck, bus, car, bicycle, and motorbike. The Dell R740 server combined with an NVIDIA Tesla T4 GPU is utilized for training the models, as shown in Figure 2.
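Text annotations for YOLO training are commonly stored one object per line as a class index followed by the normalized box center, width, and height. The sketch below assumes that layout; the class order in `CLASSES`, the helper name `to_yolo_line`, and the example box are hypothetical, since the source lists the classes but not their indices.

```python
# Class order is an assumption; the source lists the classes but not their indices.
CLASSES = ["truck", "bus", "car", "bicycle", "motorbike"]


def to_yolo_line(label, box, image_w, image_h):
    """Convert a pixel box (x_min, y_min, x_max, y_max) to a YOLO label line."""
    x_min, y_min, x_max, y_max = box
    x_center = (x_min + x_max) / 2.0 / image_w   # normalized to [0, 1]
    y_center = (y_min + y_max) / 2.0 / image_h
    width = (x_max - x_min) / image_w
    height = (y_max - y_min) / image_h
    class_id = CLASSES.index(label)
    return f"{class_id} {x_center:.6f} {y_center:.6f} {width:.6f} {height:.6f}"


# Hypothetical car box in a 1920x1080 frame.
print(to_yolo_line("car", (480, 270, 960, 810), 1920, 1080))
```

Normalizing by the frame size keeps the labels valid when frames are resized to the network input resolution.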

For evaluation, the trained algorithms YOLOv3, YOLOv3-tiny, YOLOv4, and YOLOv5s have been tested on a high-definition live video stream at 20 fps. The research work is evaluated using the YOLO models, and the performance metrics are given in Equations (2)-(5). The presented work is performed using the NVIDIA Jetson AGX Xavier controller. The processing is evaluated based on RAM utilization, inference time, temperature, and GPU utilization at different resolutions. The overall hardware and software packages required for system training and testing are shown in Table 3.
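For reference, the evaluation metrics discussed in Section 4 can be written in their standard forms. This is a sketch under the assumption that the paper's metric equations follow the usual definitions; the symbols $n$ (number of thresholds), $N$ (number of classes), $P_k$, and $R_k$ (precision and recall at threshold $k$) are notation introduced here, with $R_n = 0$.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \\
\text{Recall} &= \frac{TP}{TP + FN} \\
AP &= \sum_{k=0}^{n-1} \left( R_k - R_{k+1} \right) P_k \\
mAP &= \frac{1}{N} \sum_{i=1}^{N} AP_i
\end{align}
```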

4. Results and Discussion

The performance of the YOLOv3, YOLOv3-tiny, YOLOv4, and YOLOv5s models is evaluated during training, and the parameters are given in Table 5. A 5000-image dataset is used for evaluating each algorithm. The YOLOv3-tiny, YOLOv3, and YOLOv4 models perform quite similarly, but YOLOv5s is an adaptive learner and has higher precision and recall than the other YOLO models. In Table 5, the first column gives the labels of the trained objects, divided into five classes: car, truck, motorbike, bicycle, and bus. The second column gives the image size used during training, and the third column gives the batch size for the model; the fourth and fifth columns give recall and precision. Finally, the sixth and seventh columns give the average precision of the proposed model. The latest variant of the YOLO family, YOLOv5s, has high precision and recall and a reduced weight size, making it the lightest of the models compared.

In Equation (2), a true positive (TP) is a correct detection of an object that exists in the frame extracted from the video. A false positive (FP) is an invalid or incorrect detection (sometimes the algorithm detects incorrect objects in the frame while detecting an object). In Equation (3), a false negative (FN) is an object that the algorithm does not detect. The Intersection over Union (IoU) assesses the overlap between the predicted box and the ground truth bounding box of the actual item in object detection. A detection is categorized as correct or wrong by comparing its IoU to a specified threshold. Equations (4) and (5) are used for computing the average precision (AP) and the mean average precision (mAP). AP summarizes the precision-recall curve as a single numeric value representing the overall average precision, where $n$ is defined as the number of thresholds. The AP is the weighted sum of the precision at each threshold, where the weight is the corresponding increase in recall. The IoU takes a value between 0 and 1, indicating how much the predicted and ground truth bounding boxes overlap. Because each value of the IoU threshold yields a different AP measure, this threshold must be specified when reporting mAP.
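The quantities described above can be computed with a few lines of Python. This is a minimal sketch of the standard definitions, not the evaluation code used in this work; the function names and all example numbers are hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x_min, y_min, x_max, y_max) boxes."""
    ix_min = max(box_a[0], box_b[0])
    iy_min = max(box_a[1], box_b[1])
    ix_max = min(box_a[2], box_b[2])
    iy_max = min(box_a[3], box_b[3])
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def precision_recall(tp, fp, fn):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def average_precision(precisions, recalls):
    """Sum precision weighted by the recall gained at each threshold.

    Inputs are ordered by decreasing recall (loosest threshold first);
    recall after the last threshold is taken as 0.
    """
    ap = 0.0
    for k, p in enumerate(precisions):
        next_recall = recalls[k + 1] if k + 1 < len(recalls) else 0.0
        ap += (recalls[k] - next_recall) * p
    return ap


# Hypothetical values for illustration only.
print(iou((0, 0, 10, 10), (5, 5, 15, 15)))
print(precision_recall(tp=90, fp=10, fn=5))
print(average_precision([0.5, 1.0], [1.0, 0.5]))
```

A detection would count as a true positive when its IoU with a ground truth box of the same class meets the chosen threshold; mAP is then the mean of the per-class AP values.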

Figure 5 shows the loss curves of YOLOv5s, YOLOv4, YOLOv3, and YOLOv3-tiny for class, object, and box. The horizontal axis shows the iterations, and the vertical axis represents the loss amplitude. The trend shows that the overall performance of the individual models is similar, and the loss decreases at a similar rate as the number of iterations increases. Table 6 exhibits classification performance by presenting the YOLOv5s, YOLOv4, YOLOv3, and YOLOv3-tiny model training results. Table 7 gives the success rate for classifying each object in the scene and the instances of misclassification. The highest success rate of 98.89% has been obtained for cars, whereas the algorithm has correctly identified the bus. The lowest success rate obtained is 89.88% for bicycles. Overall, the misclassifications are within tolerable limits. Figure 3 illustrates the objects detected by the deep learning models, with bounding boxes drawn around the detected objects.

The performance evaluation of the models in terms of RAM utilization, inference time, GPU utilization, and the temperature of the on-shelf controller is presented in Figures 6-9. For YOLOv3-tiny (Figure 6), the GPU temperature remains on the lower side, whereas for YOLOv5s (in red in Figure 6), the higher temperatures correspond to the two higher image resolutions. YOLOv3 and YOLOv4 (Figure 6) also had considerably high temperatures for all resolutions. YOLOv5s at lower resolutions keeps the GPU temperature low; however, the temperature rapidly increases at higher resolutions, which shows that YOLOv5s utilizes more of the GPU. The inference time (in seconds) is one of the crucial factors in live streaming tasks. However, it is highly dependent on detection accuracy. As illustrated in Figure 7, the inference time generally increased with image resolution, with YOLOv3 obtaining the highest inference time of 3.13 seconds. YOLOv3-tiny and YOLOv4 maintain lower inference times than the others, with lowest values of 0.46 and 0.6 seconds, respectively. On the other hand, YOLOv5s takes 1.37 and 2.579 seconds at two of the tested resolutions.
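Inference time per frame can be measured with a simple wall-clock harness and compared against the per-frame budget of the 20 fps stream. The sketch below is an illustration only: `dummy_detect` is a hypothetical stand-in for a YOLO forward pass, and the helper names are assumptions introduced here.

```python
import time


def mean_inference_time(detect, frames, warmup=1):
    """Average wall-clock seconds per frame for a detector callable."""
    for frame in frames[:warmup]:
        detect(frame)                      # warm-up runs are not timed
    timed = frames[warmup:]
    if not timed:
        raise ValueError("need more frames than warm-up runs")
    start = time.perf_counter()
    for frame in timed:
        detect(frame)
    return (time.perf_counter() - start) / len(timed)


def frame_budget(fps):
    """Seconds available per frame for a stream at the given frame rate."""
    return 1.0 / fps


# Stand-in detector (hypothetical); a real YOLO forward pass would go here.
def dummy_detect(frame):
    return sum(frame)


frames = [list(range(1000)) for _ in range(11)]
latency = mean_inference_time(dummy_detect, frames)
print(f"{latency:.6f} s per frame, budget {frame_budget(20):.3f} s at 20 fps")
```

Excluding warm-up runs from the timing avoids counting one-off costs such as model loading or memory allocation in the reported average.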

The histogram in Figure 8 illustrates the relation between image resolution and GPU utilization. YOLOv5s (Figure 8) uses more GPU than the other models for all image resolutions, whereas YOLOv3-tiny (Figure 8) uses the least GPU for all image resolutions. Therefore, YOLOv3-tiny seems to be the most reasonable option for implementing the algorithm on the Jetson Xavier board if the most significant concern is keeping GPU utilization in check. Finally, the lowest RAM utilization has been achieved by YOLOv5s, as shown in Figure 9. As a general trend, RAM utilization increases as the image resolution is increased. For example, YOLOv3-tiny and YOLOv5s utilize 2.8 and 2.4 GB of RAM, respectively, out of 16 GB at lower resolution. However, YOLOv3 uses 8.2 GB, and YOLOv4 uses 7.1 GB of RAM.

5. Conclusion and Future Work

The main scientific contribution of the project is the development of a standalone system using the Jetson Xavier AGX to perform traffic surveillance and monitoring. This research shows that the Jetson Xavier AGX is an excellent choice for implementing complex CNN-based models (YOLOv3, YOLOv3-tiny, YOLOv4, and YOLOv5s) with exceptional performance. Furthermore, improvements to the deep learning libraries on NVIDIA platforms can lead to even better results. The proposed system has been tested day and night, showing a success rate of 98.89%. Traffic monitoring and management are among the biggest challenges for developing countries like Pakistan. By implementing the presented system to detect and count vehicles, traffic problems can be addressed through applications such as false parking detection, traffic management using traffic control, and congestion detection. Finally, the work has successfully demonstrated that the powerful computational abilities of the Jetson Xavier AGX can be exploited for object detection in live video streams.

Data Availability

The data are available on request; contact the authors for access and further assistance.

Conflicts of Interest

There are no conflicts of interest.


Acknowledgments

The authors would like to acknowledge the help and support of the whole research lab at the Data Acquisition, Processing, and Predictive Analytics Lab, National Center in Big Data and Cloud Computing, Ziauddin University, Karachi, Pakistan.