In many supervised computer vision tasks such as object detection, manual annotation crowdsourcing platforms are widely used to acquire large-scale labeled data. However, the resulting annotations may be of low quality, which can severely affect the training of models. As a result, evaluating the annotations within a dataset is critical, yet it has seldom been addressed in object detection. In this paper, we present a fine-grained annotation quality assessment (FGAQA) framework for evaluating the quality of object detection datasets. First, we formulate a generic annotation quality assessment framework based on the core general-purpose data quality dimensions, covering the two annotation attributes of the bounding box and the label. Second, cognition theory, in terms of hierarchy and continuity, is utilized to refine the basic framework, adding the consistency of the bounding box, the completeness of the category, the hierarchical accuracy of the label, and the consistency of the label. Comprehensive experiments on two object detection datasets are used for performance evaluation. We find that the ground truth annotations of the Urban Traffic Surveillance dataset have more quality issues than those of the PASCAL VOC 2007 detection dataset. The proposed FGAQA framework performs an effective fine-grained evaluation of the annotations, which is significant for the quality assurance of annotations from crowdsourcing platforms and for subsequent model training.

1. Introduction

In supervised learning, annotation quality plays a vital role in the training and assessment of models for several computer vision tasks such as object classification [1, 2], detection [3–6], and segmentation [7–9]. The training of object detection models relies on accurate and sufficient annotations. For large-scale object detection datasets, annotations are usually obtained through crowdsourcing platforms, where labels contributed by anonymous participants can be collected efficiently [10–12]. However, mainly because untrained participants take on professional and time-consuming annotation tasks, the collected annotations inevitably suffer from subjective inconsistency and relatively low quality. As a result, annotation quality cannot be guaranteed, and assessing the quality of such annotations becomes a challenge in this context.

Annotation quality in object detection is a specialized-purpose data quality problem. Data quality has been widely studied since the 1980s [13]. According to [14], data quality can be defined as the degree to which a set of characteristics of data fulfills the requirements. Data of high quality should represent real-world entities accurately in structure and be fit for their intended uses. Moreover, data quality is multidimensional. By reviewing the related literature [14–19], a core set of data quality dimensions can be defined, including completeness, accuracy, and consistency. There is also a fair amount of research on annotation quality. Regarding annotation quality in classification, accuracy is generally employed [20], without considering the hierarchy of categories. For annotation quality in object detection, quality is evaluated by Intersection-over-Union (IoU) [21]. IoU is the ratio of the intersection area of the ground truth and the human annotation to the area of their union, which only considers the quality of the bounding box [22]. There is little systematic research on the annotation quality of object detection. Consequently, we refer to general-purpose data quality and construct an annotation quality framework.
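For reference, the IoU just described can be computed as in the following minimal sketch; the corner-format box representation is an assumption for illustration.

```python
# A minimal sketch of the IoU computation; the corner-format box
# representation (x_min, y_min, x_max, y_max) is an assumption.
def iou(box_a, box_b):
    """Intersection-over-Union of two axis-aligned bounding boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```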

To date, there are relatively few works reported on this topic, and it has been addressed only from the perspectives of the object category and IoU [21]. However, several general-purpose metrics can also be applied to annotation quality assessment, and the assessment should cover various aspects of the two annotation attributes: the bounding box and the label.

Evaluation measures for object classification, detection, and segmentation can serve as a reference for annotation quality in object detection. Regarding flat object classification, precision and recall are employed to assess performance [23–26]. As for hierarchical object classification, distance in the tree or the directed acyclic graph (DAG) is used [27–30]; such distances penalize different prediction errors differently. In terms of object detection, the mAP is usually employed [31–36], integrating precision, recall, and IoU. The mAP is calculated from the predicted results and their confidence scores; however, reasonable confidence scores are hard to obtain for annotations. As a result, in this paper, we employ the metrics of precision and recall. Regarding object segmentation, evaluation measures can be categorized into three types: area-based measures, location-based measures, and combined measures [37–41]. These image segmentation measures pay more attention to details and intrinsic visual characteristics. Consequently, the idea of image segmentation evaluation is introduced into the annotation quality assessment framework.

In this paper, we propose a fine-grained framework for annotation quality assessment of object detection datasets, containing three dimensions: accuracy, completeness, and consistency. First, we construct the basic quality assessment framework based on the core general-purpose data quality (DQ) measurement, including accuracy and completeness, while considering the characteristics of annotations. For consistency, we find that a strict definition is difficult to give; furthermore, the relationship between classes should be considered. Previous literature indicates that human cognition is hierarchical in concept [42, 43] and consistent in space-time representations [44–46]. Inspired by these observations, the consistency of the bounding box, the completeness of the category, the hierarchical accuracy of the label, and the consistency of the label are extracted as four additional elements for annotation quality assessment. The main contributions of this paper are as follows:

(1) We present a fine-grained annotation quality assessment (FGAQA) framework for evaluating the quality of object detection datasets. By analyzing the characteristics of the two annotation attributes, the bounding box and the corresponding label, the annotation quality is organized into three dimensions: accuracy, completeness, and consistency.

(2) To tackle the limitations of the basic quality assessment framework, we introduce the theory of cognitive perception to analyze annotation quality and add four elements: the consistency of the bounding box, the completeness of the category, the hierarchical accuracy of the label, and the consistency of the label. In particular, the hierarchical accuracy of the label can treat annotation errors distinctively and softly.

(3) Comprehensive case studies on the Urban Traffic Surveillance (UTS) dataset and the PASCAL VOC 2007 detection dataset verify the effectiveness of the proposed annotation quality assessment framework. We find that the ground truth annotations of the UTS dataset have more quality issues than those of the PASCAL VOC 2007 detection dataset.

The rest of this paper is organized as follows. In Section 2, the proposed cognition-driven FGAQA framework is presented in detail. Section 3 presents two case studies on the UTS and PASCAL VOC datasets. Finally, concluding remarks and future work are given in Section 4.

2. Annotation Quality Assessment Framework

A novel annotation quality assessment framework for object detection is given in this section and shown in Figure 1. An annotation has two attributes: the bounding box and the label. Quality issues depend on the characteristics of these attributes: for the bounding box, the size, location, and quantity can all exhibit quality problems; for the label, there may be problems of value and quantity. The annotation quality, in turn, serves as a reference for the training of the object detection model. Therefore, we define the quality dimensions according to the quality problems and the intended use of the annotations. Inspired by existing work [14–19], the dimensions of completeness, accuracy, and consistency are selected as the core set of data quality dimensions. By considering the theory of cognitive perception, we then redefine some elements based on annotation characteristics, yielding the fine-grained framework in Figure 1. The framework is constructed from the views of the bounding box and the label. Regarding the quality of the bounding box, completeness, accuracy, and consistency are defined; the completeness of the bounding box is divided into the completeness of the bounding box's quantity and the completeness of the bounding box's size. For the quality of the label, we likewise define completeness, accuracy, and consistency; the completeness of the label consists of the completeness of the bounding box's label and the completeness of the category, and the accuracy of the label contains flat and hierarchical accuracy. Most of these dimensions are computed per object and averaged over an image and over the whole dataset.

2.1. Annotation Quality of Bounding Box
2.1.1. Completeness of Bounding Box

The dimension can be defined as the extent to which the bounding boxes are of sufficient quantity and coverage for the objects. The completeness dimension focuses on null values. For the completeness of the bounding box's quantity, the null values correspond to unannotated objects; in object detection datasets, small objects are often neglected, and during the modeling process the unannotated objects are treated as background. For the completeness of the bounding box's size, the null values correspond to the areas of the object not covered by the bounding box. Both metrics are illustrated in a code sketch after this list.

(1) Completeness of bounding box's quantity: for image $i$, the completeness of the bounding box's quantity is defined as

$$\mathrm{ComQ}_B^{i} = \frac{\hat{n}_i}{n_i},$$

where $n_i$ is the true object number and $\hat{n}_i$ is the number of human annotations, namely, the number of bounding boxes. For the dataset, $\mathrm{ComQ}_B$ is

$$\mathrm{ComQ}_B = \frac{1}{N}\sum_{i=1}^{N}\mathrm{ComQ}_B^{i},$$

where $N$ is the number of images in the dataset.

(2) Completeness of bounding box's size: the completeness of the bounding box's size is a pixel-count-based metric. For object $j$ in image $i$, the metric is

$$\mathrm{ComS}_B^{ij} = \frac{S_{o \cap b}^{ij}}{S_{o}^{ij}},$$

where $S_{o \cap b}^{ij}$ is the intersection area of the object and the bounding box, and $S_{o}^{ij}$ is the area of the object. For image $i$, $\mathrm{ComS}_B^{i}$ is

$$\mathrm{ComS}_B^{i} = \frac{1}{\hat{n}_i}\sum_{j=1}^{\hat{n}_i}\mathrm{ComS}_B^{ij}.$$

For the dataset, $\mathrm{ComS}_B$ is

$$\mathrm{ComS}_B = \frac{1}{N}\sum_{i=1}^{N}\mathrm{ComS}_B^{i}.$$
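The two completeness metrics can be sketched as follows; this is a minimal illustration in which object extents are approximated by boxes in (x_min, y_min, x_max, y_max) format and the matching between objects and annotations is assumed to be given.

```python
def box_area(b):
    """Area of an axis-aligned box (x_min, y_min, x_max, y_max)."""
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def intersect_area(a, b):
    """Intersection area of two axis-aligned boxes."""
    w = min(a[2], b[2]) - max(a[0], b[0])
    h = min(a[3], b[3]) - max(a[1], b[1])
    return max(0.0, w) * max(0.0, h)

def completeness_quantity(n_true, n_annotated):
    """ComQ_B for one image: annotated boxes over true objects.

    Values above 1 would indicate over-annotation.
    """
    return n_annotated / n_true if n_true > 0 else 1.0

def completeness_size(objects, boxes):
    """ComS_B for one image: mean covered fraction of each object's area.

    `objects` and `boxes` are matched lists of corner-format boxes.
    """
    fractions = [intersect_area(o, b) / box_area(o)
                 for o, b in zip(objects, boxes) if box_area(o) > 0]
    return sum(fractions) / len(fractions) if fractions else 1.0
```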

2.1.2. Accuracy of Bounding Box

The dimension is intended to measure the closeness of the bounding box to the object. When the accuracy is low, the bounding box contains too much background, which blurs the distinction between the object and the background. For the bounding box of object $j$ in image $i$, the accuracy is

$$\mathrm{Acc}_B^{ij} = \frac{S_{o \cap b}^{ij}}{S_{b}^{ij}},$$

where $S_{b}^{ij}$ is the area of the bounding box. In image $i$, the accuracy is

$$\mathrm{Acc}_B^{i} = \frac{1}{\hat{n}_i}\sum_{j=1}^{\hat{n}_i}\mathrm{Acc}_B^{ij}.$$

For a dataset, the accuracy can be given as follows:

$$\mathrm{Acc}_B = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Acc}_B^{i}.$$
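A per-object sketch of this accuracy, under the same corner-format box assumption as before:

```python
def bbox_accuracy(object_box, annot_box):
    """Acc_B for one object: intersection area over the annotated box's area.

    Both boxes are (x_min, y_min, x_max, y_max); image- and dataset-level
    values are plain averages of these per-object scores.
    """
    w = min(object_box[2], annot_box[2]) - max(object_box[0], annot_box[0])
    h = min(object_box[3], annot_box[3]) - max(object_box[1], annot_box[1])
    inter = max(0.0, w) * max(0.0, h)
    area = (annot_box[2] - annot_box[0]) * (annot_box[3] - annot_box[1])
    return inter / area if area > 0 else 0.0
```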

2.1.3. Consistency of Bounding Box

The dimension focuses on violations of the spatiotemporal continuity of size and location. On crowdsourcing platforms, bounding boxes in adjacent frames may be drawn by different workers and can therefore conflict in size and location. In this case, we can assess the consistency of the bounding box and enforce the continuity constraints in the corresponding postprocessing, after which the annotations satisfy the constraints. Concretely, for example, if an object moves straight toward the camera, the constraints can be written as

$$x_c^{i-1} \le x_c^{i} \le x_c^{i+1} \quad \text{or} \quad x_c^{i-1} \ge x_c^{i} \ge x_c^{i+1},$$
$$y_c^{i-1} \le y_c^{i} \le y_c^{i+1} \quad \text{or} \quad y_c^{i-1} \ge y_c^{i} \ge y_c^{i+1},$$
$$w^{i-1} \le w^{i} \le w^{i+1}, \qquad h^{i-1} \le h^{i} \le h^{i+1},$$

where $x_c$ and $y_c$ are the coordinates of the center of the bounding box, and $w$ and $h$ are the width and height of the bounding box. When object $j$ in image $i$ satisfies the constraints, the metric $\mathrm{Con}_B^{ij} = 1$; otherwise, $\mathrm{Con}_B^{ij} = 0$. For image $i$, the consistency is

$$\mathrm{Con}_B^{i} = \frac{1}{\hat{n}_i}\sum_{j=1}^{\hat{n}_i}\mathrm{Con}_B^{ij}.$$

For the dataset, $\mathrm{Con}_B$ is

$$\mathrm{Con}_B = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Con}_B^{i}.$$
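The following sketch checks the monotonic constraints above for one object across three consecutive frames; the (x_c, y_c, w, h) box format and the frame-to-frame object correspondence are assumptions.

```python
def box_consistency(prev_box, cur_box, next_box):
    """Con_B for one object across three consecutive frames: 1 if the box
    center moves monotonically and the box size does not shrink, else 0.

    Boxes are (x_c, y_c, w, h); frame correspondence is assumed given.
    """
    def monotonic(a, b, c):
        return a <= b <= c or a >= b >= c

    (x0, y0, w0, h0) = prev_box
    (x1, y1, w1, h1) = cur_box
    (x2, y2, w2, h2) = next_box
    return int(monotonic(x0, x1, x2) and monotonic(y0, y1, y2)
               and w0 <= w1 <= w2 and h0 <= h1 <= h2)
```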

2.2. Annotation Quality of Label
2.2.1. Completeness of Label

The dimension can be split into two types. The completeness of the bounding box's label measures whether each box has a label. The completeness of the category describes whether each category has a sufficient quantity of samples, from the perspective of computational learning theory. Common benchmarks for object detection contain minority categories; for such a category, if the metric does not meet the requirement, the detection accuracy suffers.

(1) Completeness of bounding box's label: for image $i$, the completeness is

$$\mathrm{ComL}^{i} = \frac{m_i}{\hat{n}_i},$$

where $m_i$ is the number of labels. For a dataset, the metric is

$$\mathrm{ComL} = \frac{1}{N}\sum_{i=1}^{N}\mathrm{ComL}^{i}.$$

(2) Completeness of category: the completeness of the category measures whether the number of samples suffices for training the object detection model. In a dataset, the classes are usually organized in a semantic hierarchy tree. For a leaf node $t$, if it meets the condition $n_t \ge \theta$, where $n_t$ is the number of samples of class $t$ and $\theta$ is a preset threshold, the completeness is 1; otherwise, the completeness is 0. For a parent node, the completeness is

$$\mathrm{ComC} = \frac{1}{K}\sum_{k=1}^{K}\mathrm{ComC}^{k},$$

where $K$ is the number of the corresponding child nodes. Applying this recursively up to the root yields the completeness of the category for the dataset; a sketch of the recursion is given below.
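The recursive computation over the semantic hierarchy tree can be sketched as follows; the nested-dict tree schema and the helper names are illustrative assumptions.

```python
def category_completeness(node, counts, threshold):
    """Com_C for a node of the semantic hierarchy tree.

    A leaf scores 1 when its sample count reaches `threshold`, else 0;
    a parent scores the average of its children's scores.
    """
    children = node.get('children', [])
    if not children:  # leaf node
        return 1.0 if counts.get(node['name'], 0) >= threshold else 0.0
    return sum(category_completeness(c, counts, threshold)
               for c in children) / len(children)

# Toy usage with a two-leaf hierarchy and a UTS-style threshold of 1000:
tree = {'name': 'vehicle',
        'children': [{'name': 'sedan'}, {'name': 'SUV'}]}
print(category_completeness(tree, {'sedan': 1500, 'SUV': 600}, 1000))  # 0.5
```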

2.2.2. Accuracy of Label

The dimension measures the closeness of the human annotations to the ground truth annotations. For a dataset collected through a crowdsourcing annotation platform, label noise is the most common error and directly influences the training of the object detection model. The dimension has two elements: flat accuracy and hierarchical accuracy. The flat accuracy of the label is the conventional element. However, the label space is often hierarchical, and the hierarchical element can treat annotation errors distinctively, forming the foundation for exploiting annotation errors. We therefore introduce both elements for label accuracy evaluation; a sketch of the hierarchical metrics follows this list.

(1) Flat accuracy of label: the flat accuracy of the label includes two metrics, precision and recall. The precision and recall of class $t$ are

$$P_t = \frac{tp_t}{tp_t + fp_t}, \qquad R_t = \frac{tp_t}{n_t},$$

where $n_t$ is the number of ground truth annotations for class $t$, and $tp_t$ and $fp_t$ are the numbers of true-positive and false-positive objects, respectively. For a dataset with $T$ classes, precision can be calculated as

$$P = \frac{1}{T}\sum_{t=1}^{T}P_t,$$

which treats each class equally. The recall is obtained similarly.

(2) Hierarchical accuracy of label: this element also has two metrics. The metrics of class $t$ are

$$hP_t = \frac{\sum_{j=1}^{m_t}\left|\mathrm{ans}(C_h^{j}) \cap \mathrm{ans}(C_g^{j})\right|}{\sum_{j=1}^{m_t}\left|\mathrm{ans}(C_h^{j})\right|}, \qquad hR_t = \frac{\sum_{j=1}^{n_t}\left|\mathrm{ans}(C_h^{j}) \cap \mathrm{ans}(C_g^{j})\right|}{\sum_{j=1}^{n_t}\left|\mathrm{ans}(C_g^{j})\right|},$$

where $m_t$ and $n_t$ are the corresponding numbers of human and ground truth annotations, $C_g$ and $C_h$ denote the ground truth and human annotation labels, and $\mathrm{ans}(C)$ is the operation for computing the ancestors of class $C$, with $C \in \mathrm{ans}(C)$. Then, by macroaveraging the metrics over all classes, the hierarchical precision and recall are calculated.
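The ancestor-based metrics can be sketched as follows; for brevity, this illustration micro-averages over annotation pairs rather than computing per-class values and macroaveraging as the framework specifies, and the child-to-parent mapping is an assumed input.

```python
def ancestors(label, parent):
    """Set containing `label` and all of its ancestors, given a
    child -> parent mapping in which the root maps to None."""
    out = set()
    while label is not None:
        out.add(label)
        label = parent.get(label)
    return out

def hierarchical_pr(pairs, parent):
    """Hierarchical precision and recall over (human, ground_truth) pairs."""
    inter = pred = true = 0
    for human, gt in pairs:
        a_h, a_g = ancestors(human, parent), ancestors(gt, parent)
        inter += len(a_h & a_g)
        pred += len(a_h)
        true += len(a_g)
    return inter / pred, inter / true

# Toy usage: hatchback and sedan share the ancestors 'car' and 'vehicle',
# so a hatchback-for-sedan error is penalized more softly than a flat miss.
parent = {'hatchback': 'car', 'sedan': 'car', 'car': 'vehicle', 'vehicle': None}
print(hierarchical_pr([('hatchback', 'sedan')], parent))  # (0.666..., 0.666...)
```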

2.2.3. Consistency of Label

Similar to the consistency of the bounding box, the consistency of the label focuses on violations of the spatiotemporal continuity of the label. On crowdsourcing platforms, the labels in adjacent frames often conflict because of low-skilled workers. If the label of object $j$ in image $i$ is consistent with its labels in the previous and next frames, the metric $\mathrm{Con}_L^{ij} = 1$; otherwise, $\mathrm{Con}_L^{ij} = 0$. For image $i$, the consistency is

$$\mathrm{Con}_L^{i} = \frac{1}{\hat{n}_i}\sum_{j=1}^{\hat{n}_i}\mathrm{Con}_L^{ij}.$$

For the dataset, $\mathrm{Con}_L$ is

$$\mathrm{Con}_L = \frac{1}{N}\sum_{i=1}^{N}\mathrm{Con}_L^{i}.$$
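Per object, the check reduces to a simple equality test across adjacent frames (a minimal sketch; the tracking correspondence is assumed to be given). As with the bounding box consistency, image- and dataset-level values average the per-object scores.

```python
def label_consistency(prev_label, cur_label, next_label):
    """Con_L for one object: 1 if its label agrees with the labels in the
    previous and next frames, else 0."""
    return int(prev_label == cur_label == next_label)
```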

3. Case Study

To verify the effectiveness of the quality framework, two case studies are conducted on the UTS dataset [47] and the PASCAL VOC 2007 detection dataset [48]. The UTS dataset is a video dataset with varying illumination conditions and viewpoints. The PASCAL VOC 2007 dataset is an image dataset containing twenty categories; note that a few dimensions of the quality assessment framework, namely the temporal consistency metrics, do not apply to it. To acquire the annotations, we recruited a group of students to perform the annotation work. Generally, ground truth annotations are employed as the gold standard. However, in the evaluation process, we find that the ground truth annotations themselves have quality problems to a certain extent, especially for the UTS dataset. Consequently, the ground truth annotations are also evaluated, with the human annotations regarded as the reference "ground truth annotations." Additionally, to verify the completeness of the category, the relationship between this metric and detection performance is studied by conducting object detection experiments.

3.1. Case Study for UTS Dataset

In this case study, the UTS dataset is utilized for verification. To reduce the annotation labor, four shots are selected, and we annotate one image out of every four or five. Finally, the numbers of images in the four shots are 75, 120, 100, and 120, with 1166, 686, 639, and 919 objects, respectively. The evaluation is presented at the image level and at the dataset level. We find that the ground truth annotations have quality problems, especially in the completeness of the bounding box's quantity and the flat recall of the label.

3.1.1. Annotation Quality of an Image

To describe the annotation quality clearly, one image is selected for evaluation, as shown in Figure 2. The semantic hierarchy tree we defined is presented in Figure 3. The quality evaluation results for the image are given in Table 1, and the accuracy of the bounding box for each object is shown in Figure 4.

The analysis is given below. According to Table 1, the flat precision of hatchback is only 0.25; however, this is caused by quality problems in the ground truth annotations. Reviewing the annotations, we find two small unannotated objects, as shown in Figure 2. Hierarchical measures can reflect the relation between the classes: for instance, the hierarchical precision for hatchback is 0.42, while the flat precision is 0.25. Further, the consistency of the label is less than 1, showing that some labels are inconsistent with those in adjacent frames. In Table 1, four metrics are equal to 1, reflecting that there are no errors in these aspects.

3.1.2. Annotation Quality of Human and Ground Truth Annotations

Afterward, we show the annotation quality of the UTS dataset for the human and ground truth annotations. The annotation accuracies of the label are given in Tables 2 and 3. The completeness of the category of the ground truth annotations for each class and the original vehicle dataset is given in Figure 3, where the threshold is set to 1000. The results of other quality dimensions are presented in Table 4.

The quality of the human annotations is analyzed first. According to Tables 2 and 4, the overall annotation quality of the bounding box is good, while the annotation quality of the label is relatively poor. It can thus be inferred that labeling is the more difficult task. In particular, for SUV and MPV, the accuracy and recall are quite low. The hierarchical accuracy is higher than the flat accuracy, as it treats errors distinctively. According to Table 4, compared with the other dimensions, the consistency of the label is lower, which reflects the intrinsic difficulty of keeping labels consistent across frames.

The quality of the ground truth annotations is evaluated next. According to Tables 2–4, the completeness of the bounding box's quantity, the flat and hierarchical recall of the label, and the consistency of the label for the ground truth annotations are lower than those for the human annotations. Reviewing the ground truth annotations, we find that they neglect some small and incomplete objects, although experienced annotators can annotate such objects properly. There are also more inconsistent labels in the ground truth annotations than in the human annotations. Figure 3 shows that the completeness of the category for MPV and pickup is 0, as the corresponding category quantities do not reach the threshold. In general, quality problems do exist in the ground truth annotations. Therefore, it is important to perform quality assessment during annotation and ground truth inference.

3.1.3. Relationship between the Completeness of Category and Detection Performance

To explore the relationship between the completeness of the category and detection performance, the following experiment is conducted, which demonstrates the effectiveness of this dimension. The object detection experiment on the UTS dataset is performed on the original dataset and on a downsampled dataset; for downsampling, we simply keep one image out of every two. The detection algorithm is Faster RCNN [3]. Table 5 presents the corresponding results.

According to Table 5, the detection result is closely related to the completeness of the category. Overall, for the complete classes whose training sample quantities exceed 1000, the corresponding mAP is high, while the detection mAPs of the other classes are quite low. However, for SUV in the downsampled dataset, the quantity is about 880, yet the detection performance is still acceptable, owing to its salient visual features. Thus, the threshold varies with the class. Additionally, for the incomplete classes, the performance declines with downsampling.

3.2. Case Study for PASCAL VOC 2007 Detection Dataset

In this case study, the PASCAL VOC 2007 detection dataset is utilized for verification. To save labor, we select twenty images for each class as annotation samples, obtaining a randomly selected subset containing 353 images. As the PASCAL VOC 2007 dataset is an image dataset, a few quality dimensions do not apply to it.

3.2.1. Annotation Quality for Human and Ground Truth Annotation

The quality of the human and ground truth annotations for the PASCAL VOC 2007 dataset is given below. The accuracies of the label for the human and ground truth annotations are given in Tables 6 and 7. The semantic hierarchy tree and the completeness of the category are given in Figure 5, where the threshold is set to 400. The results of the other quality dimensions are provided in Table 8.

According to Tables 6 and 8, the human annotation quality for the dataset is good overall. However, the accuracies for chair, potted plant, and dining table are relatively poor; for instance, the average flat recall for potted plant is 0.54, because potted plants are small and tend to be neglected. For the other dimensions, the quality of the human annotations is relatively reliable.

Next, we evaluate the quality of the ground truth annotations. According to Tables 6–8, the quality of the ground truth annotations is slightly worse than that of the human annotations. Specifically, the completeness of the bounding box's quantity and the flat recall of the label are relatively low, indicating that there are more unannotated objects. As there are not enough images in the randomly selected subset, we calculate the completeness of the category on the original training set. The total completeness of the category is 0.62, as 38% of the classes do not have enough samples.

3.2.2. Relationship between the Completeness of Category and Detection Performance

To explore the relationship between the completeness of the category and detection performance, an experiment is conducted in the same way as in the previous section. We conduct object detection experiments on the original dataset and on a downsampled dataset with a sampling ratio of 0.5; the majority classes of person, car, and chair are not downsampled. Table 9 presents the detection results, with classes listed in descending order of training sample quantity.

According to Table 9, the detection performance declines overall after the dataset is downsampled. For the majority classes of person, car, and chair, there are no obvious declines in mAP, as these classes are not downsampled. As for the minority classes, the mAPs for bottle and potted plant decline considerably; these can be regarded as hard classes. The mAPs for the other minority classes remain relatively high and change little; these can be regarded as easy classes. The hard classes are usually of small scale and have nonsalient visual features, hindering the learning of the object detection model. Therefore, the threshold for hard classes is relatively high, and when constructing datasets in the future, the quantity of training samples for hard classes should be increased.

4. Conclusion

Annotation quality is essential for the training of object detection models. In this paper, conceptual cognitive modeling for fine-grained annotation quality assessment is proposed, and the annotation quality is calculated from the perspectives of the bounding box and the label. To begin with, a generic framework based on general-purpose data quality dimensions is constructed from two aspects, the bounding box and the class label, and is used to assess completeness and accuracy. Nonetheless, the basic framework has limitations in assessing consistency, the category's quantity, and annotation errors. Therefore, cognitive theory is introduced, and we add the corresponding elements: the consistency of the bounding box, the hierarchical accuracy of the label, the consistency of the label, and the completeness of the category. Case studies on the Urban Traffic Surveillance dataset and the PASCAL VOC 2007 detection dataset indicate the validity of the framework. Currently, the annotation quality framework is constructed under ideal conditions; future research should consider more practical factors.

Data Availability

The Urban Traffic Surveillance dataset and PASCAL VOC 2007 detection dataset used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by the Key Research and Development Plan of Shanxi Province (Nos. 201703D111027 and 201703D111023), Shanxi International Cooperation Project (No. 201803D421039), and Natural Science Foundation of Shanxi Province (No. 201801D121144).