Research Article

Evaluation of Vision Transformers for Traffic Sign Classification

Table 6

Evaluation results on the traffic sign datasets.

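| Model | Germany Training | Germany Validation | Germany Testing | India Training | India Validation | India Testing | China Training | China Validation | China Testing |
|---|---|---|---|---|---|---|---|---|---|
| Convolutional neural networks | | | | | | | | | |
| VGG16 | 99.89% | 99.94% | 98.84% | 99.77% | 98.75% | 98.44% | 99.65% | 99.52% | 99.21% |
| ResNet | 99.88% | 99.82% | 98.37% | 99.92% | 99.06% | 97.47% | 99.72% | 99.41% | 99.25% |
| DenseNet | 99.97% | 99.90% | 98.82% | 100.00% | 99.38% | 98.59% | 99.95% | 99.69% | 99.42% |
| MobileNet | 99.87% | 99.56% | 97.41% | 99.77% | 96.83% | 95.98% | 99.70% | 98.40% | 98.05% |
| SqueezeNet | 99.52% | 99.56% | 96.69% | 98.54% | 96.21% | 96.65% | 99.21% | 98.91% | 98.24% |
| ShuffleNet | 98.96% | 98.81% | 95.49% | 99.92% | 98.75% | 99.11% | 98.96% | 98.84% | 95.53% |
| MnasNet | 99.96% | 99.18% | 96.17% | 100.00% | 98.10% | 96.80% | 99.67% | 99.18% | 96.26% |
| Vision Transformers | | | | | | | | | |
| ViT | 98.27% | 98.89% | 83.77% | 98.80% | 96.54% | 97.10% | 94.35% | 94.79% | 93.53% |
| ViT (RealFormer) | 98.45% | 99.19% | 86.03% | 98.67% | 95.94% | 96.65% | 93.62% | 94.21% | 94.22% |
| ViT (Sinkhorn Transformer) | 94.69% | 97.04% | 82.29% | 95.99% | 94.02% | 94.79% | 80.68% | 85.61% | 84.71% |
| ViT (Nyströmformer) | 79.15% | 83.15% | 62.41% | 90.47% | 80.13% | 80.95% | 86.97% | 79.08% | 79.10% |
| TNT | 96.83% | 97.73% | 84.39% | 97.71% | 92.75% | 94.42% | 96.25% | 94.52% | 95.05% |
| Performance gap | | | | | | | | | |
| CNN (best) - Transformer (best) | | | 12.81% | | | 2.01% | | | 4.37% |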
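
The performance-gap figures are consistent with taking, for each dataset, the best testing accuracy among the CNNs minus the best testing accuracy among the Transformers. The following is a minimal sketch of that arithmetic, not the authors' code: the testing values are transcribed from the table, while the names `CNN_TEST`, `TRANSFORMER_TEST`, `best_per_dataset`, and `performance_gap` are hypothetical helpers introduced only for illustration.

```python
# Testing accuracies (%) transcribed from Table 6, ordered as (Germany, India, China).
DATASETS = ("Germany", "India", "China")

CNN_TEST = {
    "VGG16": (98.84, 98.44, 99.21),
    "ResNet": (98.37, 97.47, 99.25),
    "DenseNet": (98.82, 98.59, 99.42),
    "MobileNet": (97.41, 95.98, 98.05),
    "SqueezeNet": (96.69, 96.65, 98.24),
    "ShuffleNet": (95.49, 99.11, 95.53),
    "MnasNet": (96.17, 96.80, 96.26),
}

TRANSFORMER_TEST = {
    "ViT": (83.77, 97.10, 93.53),
    "ViT (RealFormer)": (86.03, 96.65, 94.22),
    "ViT (Sinkhorn Transformer)": (82.29, 94.79, 84.71),
    "ViT (Nyströmformer)": (62.41, 80.95, 79.10),
    "TNT": (84.39, 94.42, 95.05),
}


def best_per_dataset(results):
    """Return {dataset: (model_name, accuracy)} for the top model on each dataset."""
    best = {}
    for i, dataset in enumerate(DATASETS):
        model = max(results, key=lambda name: results[name][i])
        best[dataset] = (model, results[model][i])
    return best


def performance_gap(cnn, transformer):
    """Best CNN testing accuracy minus best Transformer testing accuracy, per dataset."""
    best_cnn = best_per_dataset(cnn)
    best_tr = best_per_dataset(transformer)
    # Round to two decimals to match the precision reported in the table.
    return {d: round(best_cnn[d][1] - best_tr[d][1], 2) for d in DATASETS}


if __name__ == "__main__":
    for dataset, gap in performance_gap(CNN_TEST, TRANSFORMER_TEST).items():
        print(f"{dataset}: {gap:.2f} percentage points")
```

Under this reading, the best models on the test sets are VGG16 vs. ViT (RealFormer) for Germany, ShuffleNet vs. ViT for India, and DenseNet vs. TNT for China, which reproduces the reported gaps of 12.81, 2.01, and 4.37 percentage points.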