Abstract

Autonomous vehicles are equipped with multiple sensors that allow perception of the road environment. However, challenges remain in measurement accuracy, the dynamics of road driving conditions, and the availability of extended perception. Vehicular communication technologies have already been extensively researched, and the IEEE 802.11p standard has been approved. Therefore, communications could help extend the perception of autonomous vehicles if proper information transmission mechanisms are utilized. This paper proposes a novel design that allows vehicles to extend their perception by exchanging fewer packets than complete transmission would require and estimating the actual perception of the environment from them. First, we propose a novel MAC layer, compatible with the IEEE 802.11p standard, that allows vehicles to recover extra perceptional areas of the environment as they receive new packets. Second, we demonstrate that this approach results in better utilization of the communication channel and acceptable perception accuracy of the environment, compared to transmitting the complete information.

1. Introduction

It is forecast that by 2025, autonomous cars will enter several markets in North America and other developed countries [1]. Autonomous cars might not fully replace the current model of human-driven cars, especially in environments that lack proper transportation infrastructure. Each autonomous car has to rely mainly on its own set of sensors to perceive the environment. Common sensors include cameras, lidars, and radars. It is perhaps safe to consider these sensors reliable; however, autonomous cars should utilize all accessible information regarding the surrounding environment in order to ensure safety. Hence the concept of extended perception.

In the sequel, extended perception is the exchange of perception information from one vehicle to another (assuming this other vehicle does not have access to the exact information although this could be possible). Extended perception can bring completely new information to the vehicle such as in bringing attention to a new object on the road, or can enhance the current information by providing the same information with a better accuracy in terms of detection, localization, or resolution. An example is providing better localization of an object using stereo matching. Figure 1 provides an example in which vehicle A has good perception of the roadsign, while other vehicles do not have the complete perception of that sign. In this example, vehicle A could transmit the information it has to other vehicles in order to provide extended perception.

Autonomous vehicles would be able to exchange information using vehicular communications technology. The current standard allows vehicles to exchange messages over the 5.9 GHz band. However, there are two obstacles to using the dedicated short-range communication (DSRC) spectrum [2, 3]. First, the channel capacity is limited, and hence, the number of packet collisions would be large, resulting in channel congestion in certain scenarios [4, 5]. Second, the amount of sensory information is large, and hence, flooding the communication channel with all sensory information would cause congestion [6]. Hence, the data acquisition must be optimized to be compatible with the limitations of the communications technology, and this must be done with a channel congestion avoidance and environment perception method.

In this paper, we propose an extended perception scheme that allows vehicles to sense the road environment and exchange that information with their neighboring vehicles. We focus on providing the sensing information to the other vehicles with a number of transmitted packets small enough for the communication channel to operate without congestion. Moreover, each vehicle already senses the environment with its own sensors, and the extra information can be used to enlarge the sensing capability of each vehicle.

In this paper, we assume that extended perception is required. That is, we do not discuss the scenario where vehicle A analyzes the environment and then transmits its understanding of the information although that is possible. However, we assume that extended perception can be used for different applications such as localization, collision avoidance, and navigation. Moreover, due to the dynamics of the vehicular environment, it is possible to generate scenarios where extended perception can be useful, or perhaps does not help at all.

The rest of this paper is organized as follows: Section 2 reviews vehicular communication networks and machine learning for object detection. Section 3 describes the system model, including the transmission scheme and the object detection scheme. Section 4 shows the performance evaluation of the proposed scheme. Section 5 concludes the paper.

2.1. Vehicular Communication Network

In mobile networks, sensors can be ubiquitous, and a vehicular sensor network (VSN) is a clear example of a mobile sensor network (MSN). Vehicular communication systems [4] and sensing services [6] form the main framework within which a VSN works. Vehicular communication technology offers a foundation for different applications, including enhancing road safety, localization of vehicles, traffic monitoring, transportation management, multimedia streaming, and data collection. The majority of these applications require the vehicle to act as a sensor, hence the name VSN.

A VSN is a vehicular network where sensors attached to the vehicles sense the environment and transmit the sensed data to a data center or to a destination vehicle for processing. There are some serious projects that focus on implementing the VSN in industry. However, most of the current projects focus on V2V and V2I communication modes according to the IEEE 802.11p WAVE protocol stack of standards [2, 3]. In addition to several network management and security layers, WAVE includes the IEEE 802.11p in the MAC layer and the 1609.4 multichannel coordination layer [2] which assumes that vehicles broadcast beacons on the dedicated control channel (CCH) every 100 ms and can communicate over the six available service channels (SCHs) on the licensed band of the DSRC spectrum.

A DSRC-operated network raises awareness and provides communication for vehicles but excludes cyclists and pedestrians from the vehicular network. Therefore, an effective vehicular communication system must include non-DSRC-operated elements in the vehicular network. Device-to-device (D2D) communications provide a suitable platform to fill the gap between vehicles, pedestrians, and cyclists. However, such a D2D communication system faces multiple challenges, such as accurate localization of wireless terminals, reliable and delay-sensitive communication, multichannel operation, and energy conservation of smartphones.

2.2. Machine Learning for Object Detection

Before the rise of deep learning in visual processing, one of the well-known machine learning techniques that has been used to detect vehicles is the Haar-like Cascade detector with AdaBoosting [7]. The algorithm is very fast because it deals with the integral image, and the learning process via AdaBoosting selects a small number of critical features. Furthermore, the Cascade procedure attempts to discard the background regions of the image and focus more on object-like candidate regions of the image. The algorithm has been used to detect vehicles in gas stations [8] with a detection rate of at least and a false detection rate of at most .

A similar classifier has been proposed that uses the vertical and horizontal edges and the shadows under the vehicles to provide a rough estimate of the vehicles in the image [9]. After that, the detection procedure involves a Histogram of Oriented Gradients (HOG) transformation and an AdaBoost classifier to optimize vehicle detection and eliminate the background. The authors also use Harris corner detection to track the detected vehicle and estimate the distance it has moved. In [10], the Cascade detector is shown to detect both the car driving ahead and the face of the driver in the rear car, with fine details such as the eye locations. In [11], the Cascade classifier is used in a manner similar to [12]. The authors suggested some preprocessing of the data, such as aligning the vehicles and providing variations of each image in the database.

There are object detection algorithms other than the Cascade classifier. A Canny edge detector with a temporal difference is used to detect vehicles in [13]. The results show that multiple vehicles can be detected in one frame as one segment, which is usually not desirable. An excellent detection scheme combines histograms of oriented gradients, matching algorithms for deformable part-based models, and discriminative learning with latent support vector machines (SVMs) [14]. It demonstrates excellent detection rates for vehicles, pedestrians, and multiple objects, but it is much slower than the Cascade detector with integral images and AdaBoost [7]. In [15], a robust client-server-based roadsign detector used cross-correlation, achieved good sign detection, and estimated the speed of the driving car.

The recent breakthrough in deep neural networks has led to excellent results in classifying images using convolutional neural networks (CNNs) [16, 17] and in detecting objects within images using Region-based Convolutional Neural Networks (R-CNNs) [18]. Both training and running such networks currently require significant computation. In [19], the proposed fast R-CNN has a running time of 2 seconds on a CPU but runs at a faster rate of 17 frames per second on a GPU.

3. System Model

We consider vehicles to operate according to the IEEE 802.11p standard [3]. Each vehicle has one radio interface and operates over the CCH. The CCH is used to transmit any new safety information. Each vehicle obtains its location via a GPS device every 100 ms. Vehicle operation is extended to multiple channels, over which the nonsafety information is transmitted. Assume that a number N of MAC layer frames forms a virtual super-frame. During a super-frame, M of the N samples are randomly chosen to be transmitted at the MAC layer, and the remaining N − M are inactive transmissions. This mechanism can be thought of as a transmission rate reduction approach. Ideally, the transmission frequency is decreased in order to send fewer packets and reduce congestion on the communication channel. Similarly, one can think of the proposed method as a congestion avoidance approach. In each packet, we send the current information and a random encoding of some previous information.
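The random choice of active frames within a super-frame can be sketched as follows. This is a minimal illustration, not the paper's MAC implementation: the helper name `choose_active_frames` and the values N = 10 and M = 3 are assumptions made only for the example.

```python
import random

def choose_active_frames(n_frames, m_active, rng=None):
    """Pick which m_active of the n_frames MAC frames in a super-frame carry a transmission."""
    if not 0 <= m_active <= n_frames:
        raise ValueError("m_active must lie between 0 and n_frames")
    rng = rng or random.Random()
    # Sampling without replacement: each frame index is chosen at most once.
    return sorted(rng.sample(range(n_frames), m_active))

# Illustrative values only: a super-frame of N = 10 frames with M = 3 active ones.
active = choose_active_frames(10, 3, random.Random(0))
reduction = 1 - len(active) / 10  # fraction of frames in which the vehicle stays silent
```

With these example values the vehicle stays silent in 70% of the frames, which is the rate reduction the paragraph describes.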

3.1. Network Transmission Model

Let $h \times w$ be the actual image size at time i. Moreover, define $x_i$ to be an $N \times 1$ vector, where $N = hw$, at time i. Here, $x_i$ is a stack of the columns of the image as in Figure 2. We assume that vehicles capture images at a specific frame rate, resulting in an interval of time $[i, i + \tau]$, where $\tau$ is the inverse of the frame rate.

We assume that an image or the vector $x_i$ (we use the words image and vector interchangeably to refer to $x_i$, as it contains the same information as the image) can be represented in the discrete cosine transform (DCT) domain, and the corresponding vector $s_i$ would be sparse. This is normal, as images can be represented by their largest coefficients in the DCT or Fourier domains. Hence, the image can be represented by $x_i = \Psi s_i$, where $\Psi$ represents the basis in which $x_i$ is sparse.
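The column stacking and the DCT representation can be illustrated with a small sketch. It assumes a one-dimensional orthonormal DCT-II applied directly to the stacked vector and a toy constant-intensity image; the function names are hypothetical and the choice of transform variant is an assumption, since the text does not specify it.

```python
import math

def stack_columns(image):
    """Flatten a row-major 2-D list into one vector, column by column."""
    rows, cols = len(image), len(image[0])
    return [image[r][c] for c in range(cols) for r in range(rows)]

def dct(x):
    """Orthonormal DCT-II of a 1-D sequence (direct O(N^2) evaluation)."""
    n = len(x)
    out = []
    for k in range(n):
        scale = math.sqrt((1 if k == 0 else 2) / n)
        out.append(scale * sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n)
                               for j in range(n)))
    return out

def idct(s):
    """Inverse of dct(): rebuilds the signal from its DCT coefficients."""
    n = len(s)
    return [sum(math.sqrt((1 if k == 0 else 2) / n) * s[k]
                * math.cos(math.pi * (j + 0.5) * k / n) for k in range(n))
            for j in range(n)]

image = [[10, 10], [10, 10], [10, 10]]  # toy 3 x 2 "image" with constant intensity
x = stack_columns(image)                 # stacked vector with N = 6 entries
s = dct(x)                               # for a constant signal only s[0] is non-zero
```

A constant image is the extreme case of sparsity: a single coefficient carries all the information. Natural images are only approximately sparse, with most of the energy in a few large coefficients.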

Consider that each image is captured by the camera of the autonomous car, and then, before it is transmitted to the neighboring vehicles, each image is multiplied by another matrix:

$$ y_i = \Phi x_i, \qquad (1) $$

where $\Phi$ is a sampling matrix of size $M \times N$ (where $M < N$) and $y_i$ is the vector of M linearly combined measurements at time i corresponding to the vector $x_i$. The sampling matrix, $\Phi$, reduces the dimensionality of the transmitted captured image vector from size N to M. This is illustrated in Figure 3.
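The dimensionality reduction performed by the sampling matrix can be sketched as below. The i.i.d. Gaussian construction is an assumption made for illustration (the text does not state how the sampling matrix is generated), and the sizes N = 16 and M = 4 are arbitrary.

```python
import random

def gaussian_sampling_matrix(m, n, rng):
    """M x N matrix with i.i.d. zero-mean Gaussian entries of variance 1/M (assumed construction)."""
    sigma = 1.0 / m ** 0.5
    return [[rng.gauss(0.0, sigma) for _ in range(n)] for _ in range(m)]

def measure(phi, x):
    """Compute y = Phi x: each measurement is one linear combination of all N entries."""
    return [sum(p * v for p, v in zip(row, x)) for row in phi]

rng = random.Random(1)
N, M = 16, 4                              # illustrative sizes only, with M < N
x = [float(j % 5) for j in range(N)]      # stand-in for a stacked image vector
phi = gaussian_sampling_matrix(M, N, rng)
y = measure(phi, x)                       # M numbers are sent instead of N
```

Each transmitted measurement mixes every pixel, so any subset of received measurements carries information about the whole image rather than about one region of it.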

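It is crucial that $M \ge cK\log(N/K)$ holds, where c is a constant and K is the number of nonzero coefficients in the sparse representation, in order for the captured image to be recovered with high probability [20, 21]. The reduced image can then be represented in the sparse domain as

$$ y_i = \Phi \Psi s_i. \qquad (2) $$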

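In the sequel, we consider that M and the matrix $\Phi\Psi$ satisfy the conditions for sparse recovery, namely, incoherence of the measurements and the restricted isometry property (RIP). That is, the matrix $\Phi\Psi$ satisfies the RIP when there is a constant $\delta \in (0, 1)$ such that, for every K-sparse vector $s$,

$$ (1 - \delta)\|s\|_2^2 \le \|\Phi\Psi s\|_2^2 \le (1 + \delta)\|s\|_2^2. \qquad (3) $$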

In other words, the number of measurements should satisfy

$$ M \ge cK\log(N/K), \qquad (4) $$

where c is a constant and K is the sparsity level [20, 21].
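As a rough worked example of how such a condition translates into a packet budget, the sketch below assumes the commonly used compressive sensing bound M ≥ c K log(N/K), with K the number of significant coefficients. The helper name and the choice c = 1 are assumptions; in practice the constant depends on the sampling matrix.

```python
import math

def min_measurements(n, k, c=1.0):
    """Smallest integer M with M >= c * K * log(N / K), natural logarithm."""
    if not 0 < k <= n:
        raise ValueError("need 0 < K <= N")
    return math.ceil(c * k * math.log(n / k))

# Illustrative values: N = 1000 samples, K = 10 significant coefficients, c = 1.
m_needed = min_measurements(n=1000, k=10)
```

For these values the bound asks for 47 measurements out of 1000, which shows why a sparse image can be conveyed with far fewer transmissions than its raw size suggests.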

3.2. Network Reception Model

We assume that each vehicle will receive M measurements of an image after being transmitted through the communication channel. It is important to note that as long as the M received measurements satisfy the sparse recovery conditions, the original image can be recovered with acceptable accuracy. Each vehicle can then use $\ell_1$-norm minimization to recover the original captured image. In other words, $s_i$ (the vector of DCT coefficients) can be recovered with high probability by applying

$$ \hat{s}_i = \arg\min_{s} \|s\|_1 \quad \text{subject to} \quad y_i = \Phi\Psi s. \qquad (5) $$

This is a standard basis pursuit optimization problem [20, 21]. After that, the time domain image can be obtained from the corresponding recovered coefficients in the DCT domain by a straightforward application of the inverse discrete cosine transform. The use of basis pursuit is standard in compressive sensing, and many alternative solvers can be used as well.
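As an example of one such alternative solver, the sketch below uses iterative soft-thresholding (ISTA), which minimizes an ℓ1-regularized least-squares objective instead of solving basis pursuit exactly. It is not the solver used in the paper. The matrix `A` stands for the product of the sampling matrix and the sparsifying basis, and the sizes, the regularization weight `lam`, and the iteration count are all illustrative assumptions.

```python
import random

def matvec(a, v):
    """Compute A v for a matrix stored as a list of rows."""
    return [sum(p * q for p, q in zip(row, v)) for row in a]

def rmatvec(a, r):
    """Compute A^T r without forming the transpose."""
    return [sum(a[i][j] * r[i] for i in range(len(a))) for j in range(len(a[0]))]

def soft_threshold(v, t):
    """Shrink v toward zero by t; values with |v| <= t become zero."""
    return max(abs(v) - t, 0.0) * (1.0 if v > 0 else -1.0)

def objective(a, y, s, lam):
    """0.5 * ||A s - y||^2 + lam * ||s||_1."""
    resid = [p - q for p, q in zip(matvec(a, s), y)]
    return 0.5 * sum(r * r for r in resid) + lam * sum(abs(v) for v in s)

def ista(a, y, lam=0.01, iters=500):
    """Iterative soft-thresholding for the l1-regularized least-squares problem."""
    # The squared Frobenius norm upper-bounds the largest eigenvalue of A^T A,
    # so 1 / frob2 is a safe (if conservative) step size.
    frob2 = sum(v * v for row in a for v in row)
    step = 1.0 / frob2
    s = [0.0] * len(a[0])
    for _ in range(iters):
        resid = [p - q for p, q in zip(matvec(a, s), y)]
        grad = rmatvec(a, resid)
        s = [soft_threshold(sj - step * gj, step * lam) for sj, gj in zip(s, grad)]
    return s

rng = random.Random(7)
N, M = 20, 12                                   # illustrative sizes, M < N
A = [[rng.gauss(0.0, 1.0 / M ** 0.5) for _ in range(N)] for _ in range(M)]
s_true = [0.0] * N
s_true[3], s_true[11] = 1.5, -2.0               # a 2-sparse coefficient vector
y = matvec(A, s_true)                           # noiseless measurements
s_hat = ista(A, y)                              # estimate of the sparse coefficients
```

The conservative step size keeps the objective from increasing at any iteration, at the cost of slow convergence; a production solver would use a tighter step or an accelerated variant.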

3.3. Object Recognition at the Receiver

In our problem formulation, we have a number of objects $o_1, \dots, o_J$. The object $o_j$ is captured by the camera and becomes $x_i = o_j + w_i$, where $w_i$ is the noise affecting the capturing process with respect to the surrounding environment at time i. Now, we have three factors in our system, namely, $o_j$, which is the original image of the object, $x_i$, representing a noisy version of the image of the object, and

$$ y_i = \Phi x_i = \Phi(o_j + w_i), \qquad (6) $$

representing the compressed sampling version of the image of the object. With all these values being known at the receiver, we recover an estimated version of the image using $\hat{s}_i$ in the DCT domain, and its corresponding estimate in the time domain is $\hat{x}_i = \Psi\hat{s}_i$. Hence, our object classifier matches the estimated value $\hat{x}_i$ to an image from the set of images.

For classification, we used a convolutional neural network and trained it on the dataset we have. We then used the sparsely recovered samples as a testing set. The following section explains the exact procedure.
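The idea of matching a recovered estimate to one image from a known set can be illustrated with a much simpler rule than a CNN. The sketch below substitutes nearest-template matching by squared Euclidean distance purely for illustration; it is not the classifier used in this work, and the labels and four-pixel templates are made up.

```python
def nearest_template(estimate, templates):
    """Return the label of the template closest to the estimate in squared Euclidean distance."""
    def dist2(u, v):
        return sum((p - q) ** 2 for p, q in zip(u, v))
    return min(templates, key=lambda label: dist2(estimate, templates[label]))

# Hypothetical reference images (flattened) and a slightly distorted recovered estimate.
templates = {"stop": [1.0, 0.0, 0.0, 1.0], "yield": [0.0, 1.0, 1.0, 0.0]}
label = nearest_template([0.9, 0.1, 0.2, 0.8], templates)
```

A rule like this degrades quickly when templates resemble each other, which is one motivation for using a learned classifier on the recovered images instead.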

3.4. Network Transmission Model

In order to compare the impact of our proposed scheme on the network performance, we use a simple, but accurate model that represents the transmission of the packets at the MAC layer of the vehicular network. The model has been used in the literature [22, 23]. Based on [22], we define the probability of successful reception of a packet at the MAC layer for vehicle aswhere p is the probability of transmission at each timeslot, n is the number of interfering nodes, and R is the MAC frame length. Assume all the n nodes are using a standard MAC for transmission. We consider the number of vehicles at the same communication range to be . Only a vehicles operate normally, and b vehicles operate according to our proposed information transmission scheme. Therefore, (7) becomes

The above equation means that all vehicles operate normally as interferers to the transmitting vehicle. However, in the proposed scheme, the number of interferers becomes . Hence, (8) becomes

We will show in the next sections that as b increases and a decreases, the probability of successful transmission for the vehicles increases, and when a increases and b decreases, the probability of successful transmission for the vehicles decreases. In the normal case of evaluation, shows the proposed scheme and shows the original p-persistent MAC.
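The qualitative effect of reducing the number of interferers can be explored with a toy calculation. The expression used below is an assumed textbook form for slotted p-persistent access (the tagged vehicle transmits in a slot while all n interferers stay silent, with at least one such slot in a frame of R slots); the expression taken from [22] may differ. Counting each vehicle of the proposed scheme as M/N of an interferer is likewise an assumption, as are all numeric values.

```python
def success_probability(p, n, frame_len):
    """Chance of at least one collision-free transmission in a frame (assumed form)."""
    per_slot = p * (1.0 - p) ** n       # tagged node transmits, all n interferers stay silent
    return 1.0 - (1.0 - per_slot) ** frame_len

def effective_interferers(a, b, m, n_frames):
    """a normal vehicles plus b vehicles active in only m of n_frames frames (assumption)."""
    return a + b * m / n_frames

# Illustrative values: 30 vehicles, p = 0.05, R = 20 slots, M/N = 3/10.
baseline = success_probability(0.05, effective_interferers(30, 0, 3, 10), 20)
proposed = success_probability(0.05, effective_interferers(0, 30, 3, 10), 20)
```

Under these assumptions the proposed setting sees 9 effective interferers instead of 30, so `proposed` exceeds `baseline`, in line with the trend described in the text.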

4. Performance Evaluation

In this section, we describe the simulation experiment that we used to evaluate the proposed scheme.

4.1. Performance Metrics

We use two performance metrics. The first one is the classification accuracy, which is a measure of the accuracy of object detection. The second performance metric is the probability of successful transmission in the network, which represents a normalized measure of the successfully transmitted packets in a network MAC frame.
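For concreteness, the first metric can be computed as the fraction of test images whose predicted class equals the true class. The sketch below is a generic definition with made-up labels, not code from the experiments.

```python
def classification_accuracy(predicted, actual):
    """Fraction of samples whose predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

# Hypothetical labels for four recovered images: three of four are correct.
accuracy = classification_accuracy(["stop", "yield", "stop", "stop"],
                                   ["stop", "yield", "yield", "stop"])
```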

4.2. Dataset Preparation

For our sparse recovery and object detection experiments, we used traffic signs as the main objects from the BelgiumTS dataset [24]. We used 72 classes from the dataset, and a sample of the data is shown in Figure 4. Without loss of generality, any dataset can be used. However, the level of sparsity in the images would result in different values in the performance metrics.

We can observe from the figure that there are several traffic signs that are very similar. The similarity is crucial in our testing as it shows that our classifier is able to classify similar objects despite any noise. It is important for our sparse recovery scheme to restore the image with the highest possible accuracy in order for the object classifier to recognize the image.

In order to make the observation realistic, we add Gaussian noise to each traffic sign image and generate 100 different variant images out of the original source image. These images are then used for training the CNN classifier along with the original source image. Adding noise makes the input image harder to classify. Figure 5 shows samples of one image affected by Gaussian noise. When an image is captured, we assume the original traffic sign image as the input to the sparse recovery scheme. Then, the recovered image is fed into the CNN classifier for recognition.
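The augmentation step can be sketched as follows, assuming 8-bit grayscale pixels stored as a flat list and an arbitrary noise standard deviation of 10 intensity levels; neither the pixel format nor the noise level is specified in the text, and the function name is hypothetical.

```python
import random

def noisy_variants(image, count=100, sigma=10.0, rng=None):
    """Create `count` copies of a flat 8-bit image with additive Gaussian noise."""
    rng = rng or random.Random()
    variants = []
    for _ in range(count):
        # Add zero-mean noise to each pixel, then round and clip to the valid range.
        noisy = [min(255, max(0, round(px + rng.gauss(0.0, sigma)))) for px in image]
        variants.append(noisy)
    return variants

source = [0, 64, 128, 255] * 4  # toy 16-pixel grayscale "sign"
training_set = [source] + noisy_variants(source, rng=random.Random(42))
```

The resulting list holds the original image followed by its 100 noisy variants, mirroring the way the source image is kept in the training data.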

In total, we had labeled images for transfer learning of the CNN. The CNN that we used is the standard AlexNet [16]. Our testing was on 6 classes of the dataset. However, using these different parameters and experiments, we estimate images in total. That is, we estimate each image 213 times using different parameters and experiments.

4.3. Evaluation Results

In Figure 6, we show how the sparse recovery algorithm performs given different sampling rates of the transmitted images. The figure shows that as the sampling rate increases, the clarity of the transmitted image increases, which is expected. However, we do not know how much the distortion will affect the object detection. It is clear that more transmitted samples will cause network congestion as the number of transmitting vehicles increases; however, we should evaluate the object detection rate for different sampling rates of the transmitted image, which is shown in Figure 7.

In Figure 7, we use a deep CNN to detect the object using the proposed training and testing dataset. We can see from the figure that as the number of samples M increases (or the percentage of transmitted samples $M/N$), the classification accuracy increases. One of the reasons that the classification accuracy does not reach 100% is the fact that several roadsign images have similar designs, shapes, and colors. This makes it harder for the classifier to distinguish the objects.

Finally, we evaluate the impact of using our proposed scheme on the network performance in Figures 8 and 9. Figure 8 shows the probability of successful reception of a packet for different transmission rates. This figure shows that the probability of success increases using our proposed scheme even when we transmit of the samples. As the number of samples increases in the proposed scheme, the probability of successful reception at the MAC layer decreases; however, as we discussed before, the detection accuracy increases, which is a trade-off between two metrics.

In Figure 9, we fix the number of transmitted samples and we change the number of communicating vehicles for the proposed scheme. As the number of communicating vehicles increases, the proposed scheme becomes more congested and the probability of successful transmission decreases. However, even when the number of vehicles is tripled, and using M = 30 samples, the proposed scheme still outperforms the normal p-persistent MAC in terms of probability of success.

5. Conclusion

This paper proposes an innovative solution to extend the perception of vehicular networks while minimizing the overhead on the communication channel. In the proposed methodology, vehicles with good perception capture the information and transmit a minimal version of it as extended vision for the vehicles that cannot clearly sense the object. We use sparse recovery mechanisms to transmit the extended vision component to the vehicles. We show that with the proposed sparse recovery model, we can transmit packets that provide extended vision to vehicles while achieving a higher probability of successful transmission for each packet than a standard p-persistent MAC. Moreover, we show that object detection improves whenever more samples are collected. The strength of compressive sensing is that as new packets arrive, a more accurate detection can be performed.

This paper focuses on image perception where images are captured and transmitted using the proposed MAC scheme. It is possible to extend this paper to consider the captured data as video, which will make the problem interesting by encoding the input data over time to a reference frame. However, this is out of the scope of this paper.

Data Availability

The data used in this paper are open source data that are available online (https://btsd.ethz.ch/shareddata/; accessed from November 1, 2019) and are cited at relevant places within the text as references.

Conflicts of Interest

The author declares that there are no conflicts of interest.

Acknowledgments

The author would like to thank Al-Muallem Mohammed Bin Ladin Chair for Creativity and Entrepreneurship at Umm Al-Qura University for the continuous support. This work was supported financially by Al-Muallem Mohammed Bin Ladin Chair for Creativity and Entrepreneurship at Umm Al-Qura University (grant number: DSR-UQU-BLIE-002).