Abstract

This study examines a challenge faced by CNNs in the task of traffic sign detection: how to achieve robustness to distributional shift. Current CNN models rely on strong data augmentation methods, such as Mosaic and Mixup, to enrich training samples and improve robustness. We observe that these methods do not counter different kinds of noise equally well. We explore how augmentation strategies perform against disturbances in different frequency bands and provide an explanation from the perspective of Fourier analysis. This understanding can guide the selection of data augmentation strategies for different detection tasks and benchmark datasets.

1. Introduction

In the field of traffic sign detection, detectors based on convolutional neural networks (CNNs) are highly competitive, and many classic CNN-based detectors have achieved excellent results. However, as datasets become more and more complex, the combined requirements for detection accuracy and robustness keep rising, and achieving high detection accuracy while ensuring real-time performance and robustness has proven to be a difficult task.

In the task of traffic sign detection in particular, many computer vision models based on deep learning achieve remarkable performance on standard benchmarks. For example, Dou et al. [1] proposed a new lightweight backbone network, and experiments on the CCTSDB [2] dataset showed that the algorithm offers strong real-time performance and accuracy. He et al. [3] presented an automatic traffic sign recognition algorithm based on visual inspection and built a recognition architecture on CapsNet; experiments on the LISA (Laboratory for Intelligent and Safe Automobiles) traffic sign dataset showed that the model runs faster and recognizes better than the baseline methods.

However, strong benchmark performance does not guarantee a positive effect in practical applications. On the one hand, real-world traffic sign detection has many special and complex properties. On the other hand, these models lack the robustness of the human visual system when the image returned by the camera contains noise of various types and frequency bands.

Data augmentation is a common and effective way to keep a model robust. Many mature deep learning models include a specially designed data augmentation module that improves their overall performance. The reason is not only that augmentation enriches the training samples; some augmentation methods also embody the idea of regularization and therefore yield stronger robustness under particular disturbances.

However, data augmentation rarely improves robustness across all corruption types. For example, in a real driving scene the weather is complex and changeable: heavy fog, rain, or snow introduces atomization effects or occlusion by raindrops and snowflakes in the field of view. In addition, the camera inevitably introduces some photon shot noise, dark current shot noise, and readout noise during imaging. This begs a natural question: can a data augmentation strategy maintain the robustness of the model under different noise conditions?

Only by understanding this question can our models improve robustness across these different corruptions, making such systems safer in real traffic scenarios.

As is known, Mosaic is an efficient augmentation strategy proposed in ultralytics-YOLOv3 [4]. Its main idea is to randomly crop four pictures and splice them into one, which enriches the background, expands the samples, and effectively increases the batch size without introducing extra computational load. It has therefore been adopted in many well-known detectors, including the YOLO-series follow-ups YOLOv4 [5] and YOLOv5 [6]. Mixup [7] is another well-known method; it was originally designed for image classification and later modified in BoF [8] for object detection training. In [9], the effectiveness of Mixup in improving detection accuracy and its anti-interference advantages in classification networks are discussed. This article focuses on Mixup and Mosaic; taking these two mature data augmentation methods as examples helps us extend the analysis to other tasks and methods.
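To make the two strategies concrete, the following minimal Python sketch illustrates only their core operations (a four-image tiling for Mosaic and a Beta-weighted blend for Mixup). It is not the YOLOX or BoF implementation: box and label remapping, scale jitter, and border padding are deliberately omitted, and the function names and parameters are our own.

```python
import numpy as np

def mosaic(imgs, out_size=640, rng=np.random):
    """Minimal Mosaic sketch: crop four images and tile them into one canvas.
    Assumes each input image is already resized to out_size x out_size."""
    assert len(imgs) == 4
    canvas = np.zeros((out_size, out_size, 3), dtype=imgs[0].dtype)
    cx, cy = rng.randint(out_size // 4, 3 * out_size // 4, size=2)  # random split point
    anchors = [(0, 0, cx, cy), (cx, 0, out_size, cy),
               (0, cy, cx, out_size), (cx, cy, out_size, out_size)]
    for img, (x1, y1, x2, y2) in zip(imgs, anchors):
        h, w = y2 - y1, x2 - x1
        canvas[y1:y2, x1:x2] = img[:h, :w]  # naive top-left crop of each source image
    return canvas

def mixup(img_a, img_b, alpha=1.5, rng=np.random):
    """Minimal Mixup sketch: convex combination of two same-sized images
    with a Beta-sampled mixing weight."""
    lam = rng.beta(alpha, alpha)
    blended = lam * img_a.astype(np.float32) + (1.0 - lam) * img_b.astype(np.float32)
    return blended.astype(img_a.dtype)
```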

Through a range of frequency-domain experiments, we find that these two augmentation schemes make the model rely on information from different frequency bands of the input. This frequency bias is also reflected in the accuracy and robustness of the model: a model that leans toward low-frequency information tends to achieve better detection accuracy and to be more robust to high-frequency disturbances, but less robust to low-frequency disturbances; a model that prefers high-frequency information shows the opposite behavior.

2. Preliminaries

We use $(x, y)$ to represent a point on a two-dimensional digital image, where $x$ and $y$ can be understood as discrete spatial variables. Therefore, for any image that has been augmented and converted to grayscale, the gray value can be understood as a function $f(x, y)$. The two-dimensional discrete Fourier transform (DFT) of $f(x, y)$ is

$$F(u, v) = \sum_{x=0}^{M-1} \sum_{y=0}^{N-1} f(x, y)\, e^{-j 2\pi \left(ux/M + vy/N\right)},$$

where $f(x, y)$ is a digital image of size $M \times N$, $u$ and $v$ are frequency variables with $u = 0, 1, \ldots, M-1$ and $v = 0, 1, \ldots, N-1$, and their ranges define the frequency domain.

A two-dimensional DFT is generally a complex function, which can be expressed in the following polar form:

$$F(u, v) = |F(u, v)|\, e^{j\phi(u, v)},$$

where the amplitude $|F(u, v)|$ is called the Fourier spectrum and $\phi(u, v)$ is the phase angle.
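As a quick reference for the definitions above, the following sketch computes the Fourier spectrum $|F(u, v)|$ of a grayscale image with NumPy's FFT; the helper function name is ours.

```python
import numpy as np

def fourier_spectrum(gray):
    """Return the amplitude |F(u, v)| of a 2-D grayscale image f(x, y)."""
    # F(u, v) = sum_x sum_y f(x, y) * exp(-j * 2*pi * (u*x/M + v*y/N))
    F = np.fft.fft2(gray)
    return np.abs(F)  # the Fourier spectrum; the phase angle would be np.angle(F)
```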

YOLOX [10] is a CNN-based object detection framework that adds Mosaic and Mixup to its data augmentation strategies to boost performance.

The YOLO series has always pursued the best real-time trade-off between speed and accuracy, with researchers continually integrating and optimizing the most advanced detection techniques of the time. In July 2021, the YOLOX authors skillfully combined recent advances in object detection, such as the decoupled head, strong data augmentation, anchor-free detection, and advanced label assignment, with the YOLO framework and proposed YOLOX [10], which not only surpassed the AP of previous versions but also achieved highly competitive inference speed.

The TT100K [11] dataset is an open-source Chinese traffic sign dataset. We used 9171 images covering 45 categories for training. All photos were taken from moving vehicles at different times and under different conditions to simulate real driving, so they include challenges such as blur, tilt, strong light, rain and fog, and dim light. We train YOLOX on TT100K and, by controlling the data augmentation strategy, generate samples and models for natural training, Mixup only, Mosaic only, and Mosaic combined with Mixup. The effect is shown in Figure 1.
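The four training variants can be obtained by toggling the augmentation switches in a YOLOX experiment file. The sketch below is a hypothetical configuration: the attribute names (`mosaic_prob`, `enable_mixup`, `mixup_prob`) follow the experiment files of recent YOLOX releases as we recall them and should be checked against the installed version, and `TT100KExp` with its constructor flags is our own.

```python
from yolox.exp import Exp as BaseExp  # assumes the official YOLOX package is installed

class TT100KExp(BaseExp):
    """Hypothetical experiment config for the four variants studied here."""
    def __init__(self, use_mosaic=True, use_mixup=True):
        super().__init__()
        self.num_classes = 45                          # TT100K categories used for training
        self.mosaic_prob = 1.0 if use_mosaic else 0.0  # probability of applying Mosaic
        self.enable_mixup = use_mixup                  # master switch for Mixup
        self.mixup_prob = 1.0 if use_mixup else 0.0    # probability of applying Mixup
```

Under these assumptions, natural training corresponds to `TT100KExp(use_mosaic=False, use_mixup=False)`, and the other three variants to the remaining flag combinations.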

3. Analysis from the Viewpoint of Frequency Domain

For better visualization of the two-dimensional Fourier transform, we apply the following processing: the low-frequency components are always shifted to the center of the spectrum, and because the intensity at the zero-frequency peak is so large that the intensity elsewhere appears close to 0, we take the logarithm first and then normalize the spectrum.
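A minimal sketch of this visualization pipeline (our own helper, built on NumPy's FFT): shift the zero frequency to the center, log-compress, and normalize to [0, 1].

```python
import numpy as np

def log_spectrum(gray):
    """Centered, log-scaled, normalized Fourier spectrum used for the plots in this section."""
    F = np.fft.fftshift(np.fft.fft2(gray))   # move the low-frequency components to the center
    mag = np.log1p(np.abs(F))                # log compression so the zero-frequency peak does not dominate
    return (mag - mag.min()) / (mag.max() - mag.min() + 1e-12)  # normalize to [0, 1]
```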

We randomly selected four pictures from the dataset, as shown in Figure 2.

First, the two-dimensional DFT is applied to the four pictures in Figure 2, and their characteristics are analyzed from the frequency-domain perspective, as shown in Figures 3–5.

In these plots, the horizontal plane is the $(u, v)$ plane after the Fourier transform and the vertical axis represents intensity. The peak at the center of each image is the zero-frequency component. To compare different images, the height of this peak can be treated as identical, so we normalize the zero-frequency intensity of each plot to the same value.

Figure 3 clearly shows that the intensity of the central peak of the Mixup-processed image is lower than that of the original image, while the other regions show little change. Mixup can therefore be considered to show no obvious tendency in the frequency domain; compared with the unprocessed image, the changes in each region are relatively balanced.

From Figure 4, we can see that after Mosaic processing, the intensity of the central peak is significantly higher than in the original image, and the intensity in the high-frequency region far from the center is also higher, while the low-frequency region is roughly unchanged. In other words, Mosaic concentrates the information in the high-frequency part, and much of the low-frequency information appears to be lost.

From Figure 5, we can see that combining the two methods yields a compromise between high-frequency and low-frequency behavior, with no region standing out. In other words, Mosaic naturally pushes the model to rely more on the high-frequency information in the image and to ignore the low-frequency information, whereas Mixup does not appear to have this effect.

4. Experiments and Results

4.1. Basic Accuracy Experiment

We first tested the original YOLOX, which applies the Mosaic and Mixup augmentation strategies. The results are shown in Table 1.

The YOLOX authors [10] note that Mosaic has been verified by YOLOv5 and YOLOv4 to bring significant gains on a strong baseline model, and that when the model capacity is large enough, stronger augmentation is more helpful. In [10], they also found that pairing Mosaic with Mixup still yields a good improvement.

This first touches on the fact that accuracy does not always improve after introducing a strong data augmentation strategy. Some research groups have studied this situation; a possible reason is that augmentation strategies may introduce noisy or ambiguous augmented samples, which limits their ability to improve overall performance [12]. The appropriate augmentation differs from dataset to dataset, and many designs therefore focus on choosing the most suitable data augmentation for a specific dataset [13–15].

4.2. Dataset Shift Experiment

In many practical applications, different data augmentation strategies are not necessarily suitable for all datasets. Whether the final model can learn and recognize the features in these high-dimensional data samples is one criterion for judging whether the model is applicable. The previous experiments already show clear differences in traffic sign detection accuracy and in robustness to noise under different data augmentation strategies.

The dataset shift problem is essentially a change in the distribution of the data. It sits at a lower level than network design, so exploring it may provide theoretical guidance for higher-level methods in practice.

Many real-world visual applications face this problem, since the generation or evolution of a dataset is not always fully controlled and is affected by noise, degradation, and other natural changes. In machine learning, if there is no consistent regularity before and after the data, within the data, or between their governing laws, actual performance degrades because the assumed constraints are violated [16].

Dataset shift is complex and can be interpreted in various ways; works such as [17–20] provide a foundational understanding of the subject.

To describe the distribution change between the original dataset and the augmented dataset, many measures can be used for reference, for example, MMD (maximum mean discrepancy) [21], KL divergence [22], and DSI (distance-based separability index) [23]. Behind these blending and cropping augmentations, however, we are more concerned with how the data change from a frequency-domain perspective. Therefore, to stay consistent with the theme and the previous experiments, we adopt a method from traditional morphological image processing to describe the distribution change: the particle measure (granulometry), which determines the distribution of particle sizes in an image.

First, the original image is converted to grayscale, and a binarization threshold is set according to empirical knowledge of the observed differences. The size of the structuring element used in the opening operation is then increased step by step, and the gray-level difference between adjacent structuring-element sizes is recorded; the distribution of particle sizes in the original image is obtained by dividing this sum of changed pixels by the size of the structuring element.
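The following sketch implements a standard granulometry of this kind with scikit-image. The binarization threshold, the choice of a disk-shaped structuring element, and the final normalization are our assumptions and may differ in detail from the exact procedure used for Figures 6 and 7.

```python
import numpy as np
from skimage.morphology import opening, disk

def particle_measure(gray, threshold, max_radius=32):
    """Granulometry sketch of the particle measure: binarize, open with growing
    structuring elements, and record how much mass disappears at each size step."""
    binary = (gray > threshold).astype(np.uint8)
    sums = [binary.sum()]
    for r in range(1, max_radius + 1):
        sums.append(opening(binary, disk(r)).sum())   # pixels surviving an opening of radius r
    sums = np.asarray(sums, dtype=np.float64)
    diffs = sums[:-1] - sums[1:]                      # pixels removed between consecutive sizes
    return diffs / max(diffs.sum(), 1.0)              # normalized particle-size distribution
```

Plotting the returned distribution against the structuring-element radius gives curves of the kind discussed below.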

Here we again process the four randomly selected images in Figure 2. In the following figures, the abscissa represents the size of the structuring element and the ordinate represents the frequency.

Figure 6 shows that the Mixup-augmented images still have their largest distribution in the smallest-particle-size region (i.e., where the abscissa is close to 0 and the structuring element is very small), but this peak is significantly weaker than in the original images: the most fine-grained information is weakened and the distribution shifts toward larger particle sizes. At the same time, in the larger-particle-size region (farther from 0 on the abscissa, with larger structuring elements), some of the large-scale information introduced in Figure 6(b) is retained. This is consistent with the conclusion of Section 3: Mixup shows no obvious tendency in the frequency domain, and the changes in each region are relatively balanced compared with the unprocessed images.

Figure 7 shows that the distribution of the Mosaic-processed image (Figure 7(e)) is heavily concentrated in the smallest-particle-size region, with a peak that even exceeds that of the original image, while the distribution is sparse at larger particle sizes and the largest-scale information is lost entirely. The image processed by both Mosaic and Mixup (Figure 7(f)) behaves like a trade-off between the two: its first peak lies between the two cases, the distribution is concentrated at small particle sizes, and some large-particle-size information is retained. This again matches the conclusion of Section 3: Mosaic enhances the high-frequency information while losing much of the low-frequency information, and combining the two methods yields a compromise between high-frequency and low-frequency behavior.

From the perspective of dataset shift, Mosaic causes a drift of the original dataset that concentrates the information in a very fine-grained range, that is, in high-frequency regions. The behavior of Mixup in this respect is much gentler and does not cause a prominent drift.

4.3. Against Perturbation Experiment

The observed differences in the Fourier statistics and the particle-measure statistics suggest an explanation: different data augmentation strategies preserve different frequency-band information in the input data and encourage the model to rely more heavily on that frequency information. We investigate this hypothesis through several perturbation analyses of the four models in question.

Even though these methods do not necessarily improve detection accuracy, we conducted perturbation experiments to further explore the robustness that these data augmentation strategies provide.

First, 919 images are randomly selected from the original test set. We then add perturbations to this test set to examine the robustness of the models. From the properties of the Fourier transform, we know that on the two-dimensional plane the Fourier basis vectors are symmetric after the transform. We therefore choose a simple form of noise: a fixed sinusoidal perturbation along a chosen Fourier basis direction. By sweeping this fixed basis direction, we can observe the trend of robustness over the whole two-dimensional frequency plane.
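A hedged sketch of this perturbation is given below: a single planar sinusoid at spatial frequency (u, v) with random phase is added to the image, and sweeping (u, v) over the frequency plane produces one robustness measurement per frequency. The amplitude convention (`eps`) and the clipping to [0, 255] are our assumptions rather than the exact experimental settings.

```python
import numpy as np

def add_sinusoidal_perturbation(img, u, v, eps=8.0, rng=np.random):
    """Add a fixed sinusoidal perturbation at spatial frequency (u, v) to an image.
    `eps` controls the perturbation amplitude in gray levels (an assumed convention)."""
    h, w = img.shape[:2]
    yy, xx = np.mgrid[0:h, 0:w]
    phase = rng.uniform(0.0, 2.0 * np.pi)
    wave = np.cos(2.0 * np.pi * (u * yy / h + v * xx / w) + phase)  # unit-amplitude planar sinusoid
    if img.ndim == 3:
        wave = wave[..., None]                                      # broadcast over color channels
    perturbed = img.astype(np.float32) + eps * wave
    return np.clip(perturbed, 0, 255).astype(img.dtype)
```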

Then, we use the above four different models to test these noisy data.

From Figure 8, the robustness of the model using only Mixup is the best, followed by the model using Mosaic and Mixup together. Most of the time the results of these two models are very close, and their robustness is much higher than that of the naturally trained model. Mosaic's performance is generally satisfactory in the low-frequency band but poor in the high-frequency band. This verifies the understanding and prediction of its frequency-domain behavior in Section 3: Mosaic processing enriches the high-frequency part of the image information and reduces the low-frequency part, which makes the model pay more attention to high-frequency information. Therefore, when the disturbance is added to the low-frequency part of the image, the model remains robust; when high-frequency disturbance is mixed into the high-frequency part of the image, the model struggles to stay robust. The naturally trained model generally fluctuates: after the noise is introduced, its accuracy at the several maxima is about 0.5, while the minima are basically below 0.2. The fluctuation pattern of the models trained with data augmentation is broadly consistent with that of natural training, which suggests that these maxima, minima, and fluctuations may be caused by characteristics of the network structure itself or of the dataset, and are not strongly related to the data augmentation module.

In [24], the authors mention that Mixup delivers a similar prediction accuracy but captures much more high-frequency information, which is probably not surprising because the Mixup augmentation does not explicitly encourage anything about low-frequency information, and the performance gain is likely due to attention towards high-frequency information. In other words, after adding Mixup, the gap between the training and testing accuracy curves becomes smaller. The reason is that the Mixup operation actually confuses low-frequency semantic information, thus encouraging the CNN to capture as much high-frequency information as possible. During learning, a CNN first tries to fit the low-frequency information; as training proceeds, if the loss stops decreasing, it gradually introduces higher-frequency components [25–27].

From this point of view, Mixup treats high-frequency and low-frequency components almost alike when blending pictures, as can be seen from Figure 3: it can enhance both. The results in Figure 6 suggest that Mixup's good performance across frequency bands may also benefit from this. According to the frequency principle, the frequency range of the function learned by a CNN expands according to the needs of the training data [25], so Mixup can push the CNN to exploit high-frequency components as quickly and as fully as possible, thereby improving performance [24]. Generally speaking, from the frequency-domain perspective, Mixup shows no special preference for information in particular frequency bands but learns high-frequency information better; as a result, it performs well under disturbances in different frequency bands, and its accuracy is both improved and stable.

5. Conclusion

Starting from training YOLOX on the TT100K dataset, this paper uses the Fourier spectrum to establish the relationship between disturbance frequency and model performance. Mosaic data augmentation concentrates the information of the original image in the high-frequency part and makes the model favor the high-frequency information in the input, which improves robustness against low-frequency disturbances at the cost of reduced robustness to high-frequency disturbances. However, this augmentation strategy also shifts the dataset to a certain extent: when the particle-measure method is used to describe this shift, the probability distribution of particles becomes very dense at small particle sizes while the large-particle-size part becomes rare, which again reveals the drift in the frequency domain. The behavior of Mixup is gentler and more stable over the whole range, without extreme changes.

It is true that both methods can improve detection accuracy under certain circumstances, but the experimental results suggest that we should be vigilant about whether the model remains robust against disturbances in the real environment, and should choose the most suitable data augmentation method according to the specific environmental characteristics, disturbance types, and the characteristics of the dataset itself. Of course, to obtain better robustness the technique should not be abused to the point of overfitting. After all, our goal is for the model to learn domain-invariant features, adapt better to the real environment, and solve practical problems, rather than merely being robust in one specific situation.

Based on the experimental results, improving the robustness of a model is complex and challenging. We cannot enumerate all disturbance types and frequencies, or every combination of dataset and model structure, but we do need a general method to analyze these problems. The experiments in this paper show that Fourier analysis in the frequency domain is an effective tool for analyzing and improving the robustness of the model.

As a future extension, if robustness to large-area occlusion, rotation, and certain feature changes is also considered, traditional methods rely on specific feature extraction [28], for example using the orthogonal wavelet transform [29] of the image to capture details and a multiresolution representation. If moving object detection in real scenes is also considered, visual attention is an effective means [30]. These methods are therefore also of great significance for CNN-based traffic sign detection.

Data Availability

The data supporting this research are from previously reported studies and datasets, which have been cited. The processed data used during the current study can be obtained from the corresponding author upon reasonable request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.