Abstract

Enormous progress has been made in face detection due to the rapid development of deep learning techniques. Meanwhile, debates arise on whether face detection should be treated as a generic object detection task or considered differently. In this paper, we design an efficient anchor-free face detector that focuses on a low-flops regime and combines recent advances in generic object detection with methods for detecting tiny faces. Specifically, we adopt the anchor-free Fully Convolutional One-Stage (FCOS) method with a recently developed Visual Attention Network (VAN) as the base detector. In accordance with the characteristics of the face dataset, we reallocate the computation across the network components by adjusting the network configurations of the base detector. We then redesign the criteria for marking positive samples to realize a balanced distribution across pixel maps, and we adopt quadruple pixel prediction, which enables more positive samples to be matched with the model outputs. Under VGA resolution, our face detector achieves 70.5% AP on the hard subset of the WIDER FACE dataset, while the computational cost is only 1.05 Gflops. This accuracy-efficiency trade-off is comparable to state-of-the-art results.

1. Introduction

Face detection, as the upstream task of face tracking [1], face alignment [2], and face verification [3], has received significant attention in the computer vision community. Moreover, its accuracy has been boosted by a large margin due to the emergence of deep learning techniques. In the literature, there are debates over whether face detection differs from generic object detection and requires extra effort to improve performance. On the one hand, TinaFace [4] bridges the gap between generic object detection and face detection by introducing a simple baseline method. Based on existing general modules, TinaFace achieves state-of-the-art performance of 92.4% AP on the WIDER FACE dataset [5]. On the other hand, Guo et al. [6] propose the Sample and Computation Redistribution for efficient Face Detection (SCRFD) method and argue that the characteristics of the face dataset should be considered. They believe the optimal design can only be obtained by reconfiguring the whole network structure from backbone to head. In particular, the SCRFD algorithm redistributes not only training samples but also computation within the network. As a result, the efficient SCRFD outperforms TinaFace when tested under VGA resolution (i.e., 640 × 480). At the same time, other approaches, such as the additional branch that outputs extra face landmarks in RetinaFace [7] and the compensation of outer faces with high-quality anchors in HAMBox [8], help increase detection performance.

This paper takes ideas from both sides and proposes an efficient anchor-free face detector based on FCOS [9]. This design focuses on a low compute regime (around 1 Gflops) and works under VGA resolution. More specifically, we first take advantage of attention mechanisms and employ VAN [10] as the network backbone; this visual attention backbone, built on Large Kernel Attention (LKA) modules, replaces the ResNet [11] backbone in the original FCOS design. We make this modification because a basic block in VAN with the LKA module captures long-range dependency and thus has a better capability of selecting important features. Secondly, the network configurations are chosen so that more computation is reallocated to the shallow stages of the backbone. Experimental results validate this reallocation design principle. We then adjust the responsible range for different pixel maps to fit the small-sized input image. Finally, we propose a quadruple pixel prediction method that produces four bounding box predictions at a single pixel location, which boosts performance at a small additional computational overhead. To summarize, our main contributions are the following:
(i) Incorporating attention mechanisms into the detector backbone.
(ii) Reallocating the computation distribution within the network and redesigning the positive sample matching criteria according to the characteristics of the face dataset.
(iii) Utilizing quadruple pixel prediction to enable the detector to produce more predictions.

We organize the remaining paper as follows. Section 2 gives the related works on face detection, anchor-free detectors, attention mechanisms, and network design spaces. Section 3 describes the proposed efficient anchor-free detector along with our main contributions. Section 4 provides experimental results and analyses. Section 5 concludes the paper with limitations and future work.

2. Related Works

2.1. Face Detection

Detecting faces in an image has received continuous attention in the computer vision community. Before deep learning techniques were involved in face detection, traditional methods [12, 13] were mainly boosting-based algorithms and relied on manually designed features. With the power of deep neural networks, features are automatically extracted given a large amount of training data. The main challenge in face detection is the unconstrained conditions where faces can be occluded or dimly illuminated or have extreme poses and tiny scales. Nowadays the most commonly used benchmark for unconstrained face detection is the WIDER FACE dataset [5], on which many recently developed CNN-based face detectors report their results. Among them, RetinaFace [7] utilizes five extra key points on a face to advance training. TinaFace [4] considers the face detection task as a generic object detection problem and combines existing modules and techniques to achieve state-of-the-art performance. HAMBox [8] and MogFace [14] use different online anchor mining strategies to compensate for outer faces or improve label assignment. On the hard subset of the WIDER FACE dataset, these state-of-the-art algorithms all exceed AP 91.0%. However, high performance comes at the cost of heavy computation. The above face detectors either adopt a multiscale testing method or employ heavy backbones. As SCRFD [6] points out, TinaFace introduces more than 40 Tflops due to its multiscale testing strategy. Even when tested under the single scale of 640 × 640, TinaFace consumes over 100 Gflops. At the same time, its performance drops to AP 81.4%.

Therefore, another challenge in face detection is the trade-off between detection accuracy and computational complexity. Due to the nature of CNN-based face detectors, the computational complexity can be reduced directly by shrinking the input image to a smaller size, e.g., VGA resolution. The price of this low computation is a reduction in accuracy. There have been algorithms that consider low-resolution inputs. In particular, RefineFace [15] measures its speed under VGA resolution but provides test results under the multiscale testing strategy. OS-LFFD [16] proposes an ommateum structure with shared parameters to shrink the model size and reports its results under single inference on the original schema. BlazeFace [17], with its focus on mobile applications, takes the input image at the size of 192 × 192 to reduce computational costs. SCRFD, specially designed for VGA resolution input, provides a family of face detectors with flops ranging from 0.5 G to 34 G. This family of models is sampled from network design spaces with the design rule that the lower stages of the backbone should receive more computational resources than other network components. The proposed detector in this paper focuses on the low compute regime (1 Gflops) and validates itself under the 640 × 480 input size. Moreover, the cumbersome work of designing and matching anchors is eliminated due to its anchor-free nature.

2.2. Anchor-Free Detectors

Mainstream object detectors such as Faster-RCNN [18], SSD [19], YOLOv2, and YOLOv3 [20, 21] predict offsets to predefined anchor boxes to get final bounding boxes. Thus, they are categorized into anchor-based methods. Meanwhile, anchor-free methods have recently gained substantial attention due to their simplicity. For example, CornerNet [22] treats object detection as a keypoint detection problem and predicts a pair of keypoints, i.e., the top-left and bottom-right corners of an object’s bounding box. CenterNet [23] goes a little further by adding another center keypoint to detect, improving both precision and recall.

ObjectsAsPoints [24] predicts a keypoint heatmap and local offset features at stride 4. The top 100 peaks in the output keypoint heatmap are the detected object centers, and the bounding box predictions are obtained by combining peak locations and corresponding local features. FCOS [9] has similar local offset regression targets as ObjectsAsPoints, with differences in three aspects. First, FCOS marks a pixel in the pixel map as a positive sample when it falls inside any ground truth box, whereas ObjectsAsPoints spreads object center keypoints over a heatmap with a Gaussian kernel. Secondly, five pixel map levels with strides 8, 16, 32, 64, and 128 are used in FCOS to detect objects of various sizes. Furthermore, FCOS employs an additional centerness branch indicating the relative distance between the pixel location and the predicted bounding box center. Since there are many tiny faces to be detected, the multilevel FCOS is adopted as our base anchor-free face detector.

2.3. Attention Mechanisms

Recent years have witnessed the success of attention mechanisms [25, 26]. While initially designed for natural language processing, attention mechanisms have been widely adopted in computer vision tasks, from image classification and object detection to instance segmentation [27–32]. This new adoption brings substantial performance boosts, and the attention-based algorithms dominate nearly all leaderboards in computer vision tasks. The main idea is that attention mechanisms work as an adaptive process of selecting input features. An attention map is produced by this process, and according to the map, essential features are selected.

Attention mechanisms in computer vision can be categorized into four basic categories [33], i.e., channel attention [34], spatial attention [35], temporal attention [36], and branch attention [37]. The self-attention-based vision transformer [27] and its successors [28–32] capture global information by using spatial attention. However, while long-range dependence is captured by self-attention, the computational costs become vast when dealing with a sizeable 2D input. Inspired by MobileNets [38–40], the LKA module [10] utilizes depth-wise convolution, dilated depth-wise convolution, and point-wise convolution to overcome this shortcoming. Local contextual information, long-range dependence, and adaptability are all considered by this simple design. We use the LKA-based Visual Attention Network as our detector backbone.

2.4. Network Design Spaces

In the pioneering work of Radosavovic et al. [41], a new network design paradigm is proposed. Instead of designing a convolutional neural network at the instance level, they try to find sound design principles that generalize to a population of networks. This is achieved by parameterizing the network. Network configurations, such as the number of blocks per stage, the block width, and the bottleneck ratio of each block, are the parameters of the network. While each of these configurations has a limited range, their combination yields around 10^18 possibilities [41], which form the original unconstrained network design space. Hundreds of network configuration samples that meet a predefined flop regime are drawn as representatives of this vast design space. They then train and test each sampled configuration.

The analysis of these results reveals design principles. For example, keeping the bottleneck ratio consistent across stages does not affect model performance, and networks whose widths increase towards deeper stages tend to perform better. These design principles shrink the original large design space to a smaller one. Meanwhile, new network configurations are sampled within the shrunk design space, and new trends can be observed and become design principles. This shrinking process is iterative. While Radosavovic et al. apply the paradigm to the image classification task, SCRFD [6] uses the same method for the face detection problem. Networks for classification consist of only the backbone, whereas a detection network needs additional neck and head structures. SCRFD incorporates the neck and head network configurations into the design space and then trains and validates configurations on the WIDER FACE dataset. Due to the existence of many tiny-sized faces, the design principle for face detection learned from WIDER FACE is that more computation should be allocated to the early stages of the network, where tiny face detection occurs, and that the computation of the backbone, neck, and head should be jointly adjusted. Inspired by the above works, our anchor-free face detector directly applies the knowledge gained in SCRFD to the LKA-based visual attention backbone, the feature pyramid neck, and the FCOS head.

3. Our Proposed Anchor-Free Face Detector

This section first demonstrates the structure of the proposed anchor-free face detector and its network configurations. Then, we introduce the changes to the network configurations as well as the positive sample marking criteria that better fit the face dataset. Finally, we describe the quadruple pixel prediction method.

3.1. VAN-Based FCOS

Our face detector consists of three components: the backbone, the neck, and the head, as shown in Figure 1. Backbone feature maps from VAN are fed into the Feature Pyramid Network (FPN) [42], which generates neck features. The FCOS head takes the neck features and produces several pixel maps that are responsible for different sizes of face bounding boxes.

3.1.1. Visual Attention Backbone

The novel LKA module adopted by the visual attention backbone is the key to achieving state-of-the-art performance [10]. As shown in Figure 2, the LKA module generates an attention map by three successive convolutions of different types. The first is a 5 × 5 depth-wise convolution (DW Conv) that captures local feature information within the same channel. Then a 7 × 7 depth-wise dilation convolution (DW-D Conv) captures long-range spatial dependence, still within the same channel. Lastly, a 1 × 1 point-wise convolution (PW Conv) fuses information across channels. This channel convolution provides the missing channel adaptability, which is not considered by the depth-wise convolution and the depth-wise dilation convolution. The produced attention map represents the importance of the feature at each spatial and channel location: a high value in the attention map means the feature at the corresponding location is important. When the attention map is multiplied elementwise with the input feature, discriminative features are preserved and noisy features are suppressed. The attention map generation can be viewed as a decomposition of a large kernel convolution, while the considerable computational cost required by a large kernel convolution is avoided.
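
For illustration, a minimal PyTorch sketch of the LKA module is given below. The 5 × 5, 7 × 7 (dilated), and 1 × 1 convolutions follow the description above; the dilation rate of 3 and the exact paddings are not stated in this section and follow the public VAN implementation, so they should be read as assumptions.

import torch
import torch.nn as nn

class LKA(nn.Module):
    """Large Kernel Attention: DW conv -> dilated DW conv -> PW conv, then
    elementwise multiplication of the resulting attention map with the input."""
    def __init__(self, dim):
        super().__init__()
        # 5 x 5 depth-wise convolution: local information within each channel
        self.dw_conv = nn.Conv2d(dim, dim, kernel_size=5, padding=2, groups=dim)
        # 7 x 7 depth-wise dilated convolution: long-range spatial dependence
        self.dwd_conv = nn.Conv2d(dim, dim, kernel_size=7, padding=9,
                                  dilation=3, groups=dim)
        # 1 x 1 point-wise convolution: channel adaptability
        self.pw_conv = nn.Conv2d(dim, dim, kernel_size=1)

    def forward(self, x):
        attn = self.pw_conv(self.dwd_conv(self.dw_conv(x)))
        # High attention values preserve discriminative features,
        # low values suppress noisy ones.
        return x * attn

feat = torch.randn(1, 32, 80, 80)
print(LKA(32)(feat).shape)  # torch.Size([1, 32, 80, 80])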

The overall architecture of the VAN is straightforward. Figure 3(a) shows that it consists of four stages. At the beginning of each stage, an Overlap Patch Embed (OPE) module downsamples the input feature at stride 2 or 4, depending on the stage location. Within each stage there is a sequence of identical VAN blocks, as shown in Figure 3(b); the length of this sequence is one network configuration that can be tuned. Finally, a layer normalization layer ends each stage. Output channel sizes for VAN blocks are consistent within a stage but may vary across stages. Consequently, we have four output channel sizes, which are also network configurations. Taking a closer look at a VAN block in Figure 3(c), the input feature takes three paths. The middle path is the identity path, directly added to the output feature. The left path is the spatial attention path employing the LKA module described above. The remaining path uses a multilayer perceptron (MLP) module consisting of a series of point-wise and depth-wise convolutions, as illustrated in Figure 3(c). In the MLP, a hidden channel size is used across convolutions and is defined by multiplying the output channel size of the block by an MLP ratio; MLP ratios are also network configurations. By tuning the MLP ratio, we can easily configure a basic VAN block into a bottleneck or an inverted bottleneck structure. With the equivalent large kernel convolution and the customizable MLP ratio, the visual attention backbone can maintain the same representation power with fewer parameters and flops than the ResNet backbone.
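
A sketch of the MLP path alone, with the MLP ratio controlling the hidden width, might look as follows; the exact ordering of the depth-wise convolution and the activation, and the use of GELU, are assumptions taken from the public VAN implementation rather than from the text.

import torch.nn as nn

class VANMlp(nn.Module):
    """MLP path of a VAN block: PW conv -> DW conv -> activation -> PW conv.
    The hidden width is channels * mlp_ratio, so mlp_ratio < 1 yields a
    bottleneck and mlp_ratio > 1 an inverted bottleneck."""
    def __init__(self, channels, mlp_ratio=0.25):
        super().__init__()
        hidden = max(1, int(channels * mlp_ratio))
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)    # point-wise
        self.dw = nn.Conv2d(hidden, hidden, kernel_size=3,
                            padding=1, groups=hidden)             # depth-wise
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)    # point-wise

    def forward(self, x):
        return self.fc2(self.act(self.dw(self.fc1(x))))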

Guo et al. [10] provide a family of VAN backbones, i.e., VAN-Tiny, VAN-Small, VAN-Base, and VAN-Large. Their network configurations are listed in Table 1, where we also add the model sizes and flops of the VAN variants when fed a VGA resolution image. Since we are designing an efficient face detector in a low flop regime, even the tiny version of VAN incurs a substantial computational cost. A simple option to shrink the model is downscaling all output channel sizes by the same factor. We choose 4 as the scaling factor and 0.25 as the MLP ratio. Empirically, the first-stage output channel size is set to 16 to capture enough features. With these modifications combined, we term the new set of backbone network configurations VAN-Reduce, whose backbone has low flops.

3.1.2. FPN and FCOS Head

We use the same FPN [42] as in the original FCOS to acquire high-level semantic feature maps at different levels. The number of feature maps and the output channel number are also network configurations. The original FCOS head includes four standard convolutions before the final prediction layer. Since our detector only needs to detect faces rather than multiple object classes, we manually reduce this to a single convolution layer to keep the model compact.
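
A minimal sketch of such a reduced head is given below. Normalization layers and the choice of a single convolution shared across the classification, regression, and centerness branches are assumptions not specified in the text; the preds_per_pixel argument anticipates the quadruple pixel prediction described in Section 3.4.

import torch.nn as nn

class LightFCOSHead(nn.Module):
    """FCOS-style head with a single shared convolution before the prediction
    layers (the original FCOS head stacks four convolutions per branch)."""
    def __init__(self, in_channels, preds_per_pixel=1):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
        )
        k = preds_per_pixel
        self.cls_out = nn.Conv2d(in_channels, 1 * k, kernel_size=3, padding=1)  # face score
        self.reg_out = nn.Conv2d(in_channels, 4 * k, kernel_size=3, padding=1)  # (l, t, r, b)
        self.cnt_out = nn.Conv2d(in_channels, 1 * k, kernel_size=3, padding=1)  # centerness

    def forward(self, neck_feats):
        # neck_feats: list of FPN feature maps, one per pyramid level
        outs = []
        for x in neck_feats:
            x = self.conv(x)
            outs.append((self.cls_out(x), self.reg_out(x), self.cnt_out(x)))
        return outs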

The FCOS head produces predictions in a per-pixel way. For simplicity of illustration, we assume one output pixel map with stride S produced by the FCOS head. As shown in Figure 4, the output pixel map has a spatial size of W/S and H/S, where W and H are the width and height of the input image.

A (4 + 1 + 1)-d vector at each pixel location contains the distances from the pixel location to the four boundaries of a bounding box, d = (l, t, r, b), the face classification score p, and one centerness score c. A pixel location is indexed by (x_i, y_i), a tuple of two integers. If a pixel at (x_i, y_i) falls into the bounding box of a face in the original image, we mark it as a positive sample and set its label p* = 1. In Figure 4, the magenta pixel is a positive sample because its corresponding location in the input image is within the bounding box. The four magenta arrows are the four regression targets (l, t, r, b). If a pixel locates within more than one bounding box, the minimum distance is used as the regression target. By contrast, the cyan pixel is a negative sample (p* = 0). We can recover the ground truth bounding box at the positive sample location by the following formulas:

$$l_{box} = x_c - l, \quad t_{box} = y_c - t, \quad r_{box} = x_c + r, \quad b_{box} = y_c + b,$$

where l_box, t_box, r_box, and b_box are the left, top, right, and bottom boundaries of the ground truth box; (l, t, r, b) are the regression targets at the pixel location, and (x_c, y_c) denote the coordinates of the pixel center. The centerness score is used to indicate a predicted high-quality bounding box and is defined as follows:

$$c = \sqrt{\frac{\min(l, r)}{\max(l, r)} \times \frac{\min(t, b)}{\max(t, b)}}.$$

During inference, the overall score of a prediction is the product of the face classification score p and the centerness score c. A centerness score close to 1 means the pixel center (x_c, y_c) is near the center of the predicted bounding box, and the prediction should be considered high-quality.
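
The following PyTorch sketch decodes per-pixel distances into boxes and computes the centerness defined above for one pixel map. The pixel-center convention (x_i + 0.5) · S is an assumption consistent with the original FCOS.

import torch

def decode_boxes(reg, stride):
    """Recover (x1, y1, x2, y2) boxes from per-pixel (l, t, r, b) distances.

    reg: tensor of shape (H, W, 4) for one pixel map; stride: the map stride S.
    """
    h, w = reg.shape[:2]
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    xc = (xs.float() + 0.5) * stride  # pixel centers in the input image
    yc = (ys.float() + 0.5) * stride
    l, t, r, b = reg.unbind(dim=-1)
    # l_box = xc - l, t_box = yc - t, r_box = xc + r, b_box = yc + b
    return torch.stack([xc - l, yc - t, xc + r, yc + b], dim=-1)

def centerness(reg):
    """Centerness score from (l, t, r, b) distances, as defined above."""
    l, t, r, b = reg.unbind(dim=-1)
    return torch.sqrt((torch.minimum(l, r) / torch.maximum(l, r)) *
                      (torch.minimum(t, b) / torch.maximum(t, b)))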

The loss function, which is the same as that in the original FCOS, is given below:

$$L = \frac{1}{N_{pos}} \sum_{x,y} L_{cls}\left(p_{x,y}, p^{*}_{x,y}\right) + \frac{1}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{p^{*}_{x,y} = 1\}} L_{reg}\left(d_{x,y}, d^{*}_{x,y}\right) + \frac{1}{N_{pos}} \sum_{x,y} \mathbb{1}_{\{p^{*}_{x,y} = 1\}} L_{cnt}\left(c_{x,y}, c^{*}_{x,y}\right),$$

where the classification loss L_cls, the regression loss L_reg, and the centerness loss L_cnt are focal loss [43], GIoU loss [44], and binary cross entropy loss, respectively; p, d, and c are the predictions, the starred symbols are the corresponding targets, N_pos is the number of positive samples, and the indicator function restricts the regression and centerness terms to positive samples. Since we only have one class to detect, the positive sample label p* is also the class label and is used in the classification loss.
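
A hedged sketch of this loss using torchvision's focal and GIoU losses is shown below; the loss weights (all set to 1 here) and the exact normalization are assumptions in line with the original FCOS.

import torch.nn.functional as F
from torchvision.ops import sigmoid_focal_loss, generalized_box_iou_loss

def detection_loss(cls_logits, box_preds, cnt_logits,
                   cls_targets, box_targets, cnt_targets, pos_mask):
    """All tensors are flattened over pixel locations: cls_logits and
    cls_targets (N, 1), box_preds and box_targets (N, 4) as decoded
    (x1, y1, x2, y2) boxes, cnt_logits and cnt_targets (N,), pos_mask (N,)
    boolean. Each term is normalized by the number of positive samples."""
    num_pos = pos_mask.sum().clamp(min=1).float()
    loss_cls = sigmoid_focal_loss(cls_logits, cls_targets,
                                  reduction="sum") / num_pos
    loss_reg = generalized_box_iou_loss(box_preds[pos_mask],
                                        box_targets[pos_mask],
                                        reduction="sum") / num_pos
    loss_cnt = F.binary_cross_entropy_with_logits(cnt_logits[pos_mask],
                                                  cnt_targets[pos_mask],
                                                  reduction="sum") / num_pos
    return loss_cls + loss_reg + loss_cnt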

FCOS uses multilevel pixel maps to detect bounding boxes across large scale variations. In the original design of FCOS, there are five pixel maps with strides 8, 16, 32, 64, and 128, respectively. We define the above five pixel maps as P8, P16, P32, P64, and P128, together with the corresponding neck outputs as N8, N16, N32, N64, and N128. In this scenario, marking positive samples has one more criterion. Each pixel map has a valid responsible range (R_i, R_{i+1}), where i is the map index. Two adjacent pixel maps share the same range bound: R_{i+1} is the upper bound of pixel map i and the lower bound of pixel map i + 1. When the maximum of the four regression targets lies within this range, the corresponding pixel is a positive sample. Algorithm 1 shows the positive sample matching process. The range numbers R1, R2, R3, R4, R5, and R6 for the original FCOS (FPNH-Ori) are listed in Table 2. They are hyperparameters of the FCOS detection algorithm.

Input:
R = {R_1, ..., R_{K+1}} is a set of range numbers
B is a set of bounding boxes in the input image
S = {S_1, ..., S_K} is a set of pixel maps' strides
Output:
P_k is the kth pixel map
(1) for each level k = 1, ..., K do
(2)  build a mesh grid M_k according to the stride S_k at the kth level
(3)  for each pixel (x_i, y_i) ∈ M_k do
(4)   calculate the pixel center coordinate (x_c, y_c)
(5)   for each bounding box box ∈ B do
(6)    if (x_c, y_c) locates within box then
(7)     compute distances d = (l, t, r, b) from the pixel center to the boundaries of box
(8)     if R_k < max(l, t, r, b) < R_{k+1} then
(9)      if the pixel (x_i, y_i) has not been marked positive then
(10)      mark the pixel (x_i, y_i) as a positive sample; assign regression target d, classification target p* = 1, and centerness target c*
(11)     else
(12)      compare the four distances (l, t, r, b) with the existing target d and replace each component of d that is larger, keeping the minimum distances
(13)     end if
(14)    end if
(15)   end if
(16)  end for
(17) end for
(18) end for
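
For concreteness, Algorithm 1 can be expressed as the following Python sketch. The pixel-center convention (x_i + 0.5) · S and the returned data structure are illustrative assumptions; only the matching rules come from the algorithm above.

def match_positive_samples(boxes, strides, ranges, map_sizes):
    """Assign ground-truth boxes to pixel maps following Algorithm 1.

    boxes:     list of (x1, y1, x2, y2) ground-truth boxes in input-image pixels
    strides:   stride per pixel map level, e.g. [8, 16, 32]
    ranges:    K + 1 range numbers, e.g. [0, 32, 64, float("inf")]
    map_sizes: (height, width) per pixel map level

    Returns one dict per level mapping (row, col) pixel indices to their
    regression target (l, t, r, b).
    """
    targets = [dict() for _ in strides]
    for k, (stride, (h, w)) in enumerate(zip(strides, map_sizes)):
        lo, hi = ranges[k], ranges[k + 1]
        for yi in range(h):                      # mesh grid over the pixel map
            for xi in range(w):
                xc = (xi + 0.5) * stride         # pixel center in the input image
                yc = (yi + 0.5) * stride
                for x1, y1, x2, y2 in boxes:
                    if not (x1 <= xc <= x2 and y1 <= yc <= y2):
                        continue                 # center not inside this box
                    d = (xc - x1, yc - y1, x2 - xc, y2 - yc)  # (l, t, r, b)
                    if not (lo < max(d) < hi):
                        continue                 # box outside this level's range
                    key = (yi, xi)
                    if key not in targets[k]:
                        targets[k][key] = d
                    else:
                        # keep the smaller distance per component (step 12)
                        targets[k][key] = tuple(min(p, q)
                                                for p, q in zip(targets[k][key], d))
    return targets
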
3.2. Reallocating Computation Distribution

In determining the network configurations of the backbone, manually reducing the model size could be suboptimal, especially when the original configurations are based on an image classification dataset. SCRFD [6] points out that detecting small-scale faces requires more computation allocated in the shallow stage of the backbone. We transfer this design principle and apply it to VAN backbone design. In SCRFD, the backbone is based on ResNet, which has a hierarchical structure similar to VAN. A sequence of blocks is divided into four stages, and the deeper stage has a smaller spatial resolution. Design choices in the ResNet backbone are the number of blocks per stage and the output channel size per stage. We can find that these design choices have corresponding network configurations in the VAN backbone. Given the same face dataset, we believe that an optimal design choice in SCRFD can work well in other networks if they share similar structures. Therefore, we use the output channels and block numbers of SCRFD and define a new VAN backbone named VAN-Realloc. The configurations are shown in Table 1.

The computation reallocation happens not only within the backbone but also across network components. In the original full-sized FCOS, the FPN output channel size is 256. When connecting the FPN to the VAN-Reduce backbone, we use the same downscale factor of 4, resulting in a 64-channel FPN output. For the VAN-Realloc backbone, the connected FPN has 24 output channels, which is consistent with the SCRFD design. Although the VAN-Realloc backbone has higher flops than the VAN-Reduce backbone, the gap closes when the complete structures are considered: the FPN and FCOS head in the VAN-Realloc detector require fewer flops than those in VAN-Reduce. Their computation distributions and performances are given in Section 4.2.

3.3. Redesigning Positive Sample Matching

Although FCOS is anchor-free, the responsible range of each pixel map level plays a similar role to anchors in anchor-based detection algorithms, and these ranges should be adjusted when facing a new dataset. During training, we resize the input image to 640 × 640. The bounding boxes are resized correspondingly, and most of them are below 64 × 64, as shown in Figure 5. If the matching criteria are not changed, the pixel map P8 is responsible for producing almost all positive predictions, and the other levels are rarely trained. This sample imbalance across different pixel maps degrades the detection performance. Therefore, we modify the positive sample matching criteria and reduce the number of pixel maps and the corresponding feature pyramid levels. We name the modification FPNH-Rematch and show its configuration in Table 2. Pixel maps P64 and P128 are removed. Boxes whose maximum regression target is below 32 are assigned to the pixel map P8; around 72% of boxes can be assigned at this level. Targets between 32 and 64 are matched to the pixel map P16, and the remaining boxes are detected in the pixel map P32. The reduced number of pixel maps and feature pyramid levels is a better fit for the face dataset.
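
Restated as configuration for the matching sketch given after Algorithm 1, the assignment rule reads as follows; the boundary values restate the rule described above, while Table 2 remains authoritative.

# FPNH-Rematch: three pixel maps; boxes with max(l, t, r, b) below 32 go to P8,
# between 32 and 64 to P16, and the rest to P32.
rematch_strides = [8, 16, 32]
rematch_ranges = [0, 32, 64, float("inf")]
# e.g. targets = match_positive_samples(boxes, rematch_strides,
#                                       rematch_ranges, map_sizes)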

3.4. Quadruple Pixel Prediction

FCOS predicts only one bounding box at each pixel location, whereas the anchor-based SCRFD tiles two anchors per pixel, resulting in twice as many predictions. The WIDER FACE dataset is characterized not only by its small-scale faces but also by a considerable number of faces per image, and detectors that produce more predictions tend to perform better. Producing multiple predictions at the same pixel location is easy for anchor-based detectors because their positive sample matching rules are based on the Intersection over Union (IoU) between anchors and ground truth bounding boxes. For the anchor-free FCOS method, multiple predictions at the same location cause ambiguity in matching ground truths. To gain the benefits of multiple predictions while eliminating the ambiguity, we propose a quadruple pixel prediction method that defines a matching strategy. The idea is simple, and the implementation is straightforward. As shown in Figure 6, we quadruple the box predictions per location, which only requires the FCOS head to quadruple its output channels. The four predictions at each pixel are then reorganized and tiled as subpixels at the top-left, top-right, bottom-left, and bottom-right of the pixel, forming a new pixel map. If the original pixel map has a stride of S with four predictions per pixel, the new pixel map can be equivalently viewed as a normal pixel map with one prediction per location but at stride S/2. In other words, we squeeze the pixel map along the channel dimension and expand it along the spatial dimension. Therefore, quadruple pixel predictions can be trained with no ambiguity. We present the configuration with quadruple pixel prediction in Table 2 and name it FPNH-Quad. As shown in the table, the new model produces the three neck outputs N8, N16, and N32, but the FCOS head turns them into P4, P8, and P16. No offset values are added to differentiate the four predictions at the same location during training and inference; we encourage the network to learn the offsets by itself because every quadrupled pixel is squeezed and expanded consistently. It is worth mentioning that although quadruple pixel prediction expands the pixel map by a factor of 2, other expansion values can be used to make more predictions.
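
Assuming the head's output channels are grouped so that the four sub-pixel copies of each output channel are contiguous, this squeeze-and-expand operation can be realized with PyTorch's pixel_shuffle; the channel grouping itself is an implementation assumption, since the network can learn any ordering that is kept consistent between training and inference.

import torch
import torch.nn.functional as F

def quad_to_subpixel(head_out):
    """Reorganize quadruple pixel predictions into a finer pixel map.

    head_out: (N, C * 4, H, W) head output with four predictions per location,
    where C is the per-prediction channel size (4 box distances + 1 class score
    + 1 centerness = 6 here). Returns (N, C, 2H, 2W): one prediction per
    location at half the original stride.
    """
    # pixel_shuffle places each group of 2 x 2 channels at the four sub-pixel
    # positions (top-left, top-right, bottom-left, bottom-right) of a pixel.
    return F.pixel_shuffle(head_out, upscale_factor=2)

p16_quad = torch.randn(1, 6 * 4, 30, 40)   # stride-16 map for a 640 x 480 input
p8_equiv = quad_to_subpixel(p16_quad)
print(p8_equiv.shape)                       # torch.Size([1, 6, 60, 80])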

4. Experiments and Analyses

4.1. Experimental Setup

We train and validate models on the WIDER FACE dataset [5]. The dataset contains 12880 images for training, 3226 for validation, and 16097 for testing. During training, we randomly crop and resize images to 640 × 640 without preserving the aspect ratio. Other data augmentation methods are used, such as random flip and random color jittering.

Moreover, we utilize a sample redistribution technique similar to SCRFD. Images are expanded at a ratio of 2 with a 50% chance at the beginning of the preprocessing pipeline. To be specific, an image is pasted onto a double-sized canvas; the paste location is random, and the rest of the canvas is filled with the mean value of the WIDER FACE dataset. Since the random crop-and-resize operation is based on the original image size, a double-sized image leads to smaller bounding boxes when resized to the same 640 × 640. Therefore, more small faces are fed into the network, encouraging it to learn from them.
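
A sketch of this expansion step is given below; the per-channel mean values used to fill the canvas are placeholders (ImageNet-style means), not the actual WIDER FACE statistics.

import random
import numpy as np

# Placeholder per-channel mean; the paper fills the canvas with the WIDER FACE
# dataset mean, whose exact values are not given here.
DATASET_MEAN = (123.675, 116.28, 103.53)

def expand_to_double_canvas(image, boxes, prob=0.5, ratio=2):
    """With probability `prob`, paste the image at a random location on a
    `ratio`-times larger canvas filled with the dataset mean, so that the
    subsequent crop-and-resize produces smaller faces.

    image: (H, W, 3) array; boxes: (N, 4) array of (x1, y1, x2, y2).
    """
    if random.random() >= prob:
        return image, boxes
    h, w = image.shape[:2]
    canvas = np.empty((h * ratio, w * ratio, 3), dtype=image.dtype)
    canvas[...] = DATASET_MEAN
    top = random.randint(0, h * (ratio - 1))    # random paste location
    left = random.randint(0, w * (ratio - 1))
    canvas[top:top + h, left:left + w] = image
    shifted_boxes = boxes + np.array([left, top, left, top])
    return canvas, shifted_boxes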

We train the networks for 300 epochs with a batch size of 16, using the Adam optimizer. The learning rate has a linear warmup, increasing from 1e-6 to 1e-3 over the first 3 epochs. At epochs 120 and 240, we decay the learning rate by a factor of 10. All models are trained from scratch, and no pre-trained weights are used to initialize parameters. We evaluate models on the validation set. During validation, we resize the image to 640 × 480 and use no test-time augmentation. The evaluation metric is AP at an IoU threshold of 0.5 on the WIDER FACE hard subset.
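
The optimizer and learning-rate schedule can be sketched as follows; stepping the scheduler once per iteration and the exact warmup granularity are assumptions.

import torch

def build_optimizer_and_scheduler(model, iters_per_epoch):
    """Adam with a linear warmup from 1e-6 to 1e-3 over 3 epochs and 10x decays
    at epochs 120 and 240 (out of 300); the scheduler is stepped per iteration."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    warmup_iters = 3 * iters_per_epoch

    def lr_lambda(it):
        if it < warmup_iters:
            start = 1e-6 / 1e-3                      # warmup starts at 1e-6
            return start + (1.0 - start) * it / warmup_iters
        epoch = it // iters_per_epoch
        if epoch >= 240:
            return 0.01                              # second 10x decay
        if epoch >= 120:
            return 0.1                               # first 10x decay
        return 1.0

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler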

4.2. Computation Reallocation

To test the effectiveness of applying the SCRFD design principle, we train and validate two model configurations, VAN-Reduce-FPNH-Rematch and VAN-Realloc-FPNH-Rematch. VAN-Reduce-FPNH-Rematch takes a quarter of the VAN-Tiny channels as the backbone, and the FPN is reduced correspondingly. VAN-Realloc-FPNH-Rematch uses the VAN-Realloc backbone, and its FPN output channel size is guided by SCRFD. Both networks produce three pixel maps and use no quadruple pixel prediction. We present the computation distributions of the two model configurations and the WIDER FACE validation results in Table 3. The comparison is clear: while both have similar total flops, the backbone in VAN-Realloc-FPNH-Rematch accounts for a much larger proportion of the computation than that in VAN-Reduce-FPNH-Rematch, and the first two stages of VAN-Realloc-FPNH-Rematch take up more than half of the total computational cost. The superior performance of VAN-Realloc-FPNH-Rematch, which beats its counterpart by 4.7%, indicates the necessity of this computation reallocation.

4.3. Modification in Detection Structure

We present the numbers of positive samples for each pixel map under different detection configurations in Table 4. The numbers are accumulated over one training epoch; since training sample generation includes randomness, the positive sample counts reported in Table 4 are averages across epochs. FPNH-Ori is the original design in FCOS. It can be seen that almost all positive samples lie in the P8 pixel map. This extreme imbalance limits the network performance. With positive sample rematching and FPN reduction, FPNH-Rematch has a more balanced distribution of positive samples. When quadruple pixel prediction is introduced, we observe even more matched cases in FPNH-Quad. We evaluate the different detection configurations using the same VAN-Realloc backbone, and the performance is given in Table 4. FPNH-Rematch outperforms FPNH-Ori by 0.7% due to the balance across pixel maps. FPNH-Quad achieves the best performance at 70.5%, owing to having the most positive samples.

4.4. Comparison with State-of-the-Art Model

We compare our best model (VAN-Realloc-FPNH-Quad) with state-of-the-art efficient face detectors (the SCRFD series [6] and BlazeFace [17]) in flops, number of parameters, and detection accuracy under VGA resolution. We also report the FPNH-Quad structure with a ResNet backbone, named ResNet-Redesign-FPNH-Quad; its backbone is redesigned to have the same amount of computation as VAN-Realloc-FPNH-Quad. Results are shown in Table 5. Since BlazeFace does not report its performance on WIDER FACE, we train BlazeFace ourselves. In Table 5, although SCRFD-2.5 G has the best AP of 77.9%, this comes at the cost of the largest model size and the most flops. SCRFD-0.5 G needs the least computation but has a lower AP of 68.5%. The trained BlazeFace has the lowest AP but the fewest parameters. VAN-Realloc-FPNH-Quad outperforms its ResNet-based counterpart by 1.0%, thanks to its attention mechanisms. Our proposed model, which ranks second-best at 70.5% AP, needs only 1.05 Gflops and is thus comparable to state-of-the-art models.

4.5. Engineering Applications

We show the detection results of our best model (VAN-Realloc-FPNH-Quad) in Figure 7 and suggest some potential engineering applications. Bounding box predictions are marked in blue. Figure 7(a) is an example of tiny faces with heavy occlusion. The detector is able to find most of the faces, even those wearing helmets; however, the model also produces a few false predictions that detect a face twice. In Figure 7(b), where partial illumination occurs, our model assigns a correct bounding box to almost every face. With the Internet of Things and big data [45, 46], face detection can find applications in V2X [47] or security surveillance systems [48].

5. Conclusion

We propose an efficient anchor-free face detector that works in a low compute regime. The design absorbs the advancements in generic object detection and devotes extra effort to tackling the tiny face problem. Using FCOS avoids the anchor-related hyperparameters. The visual attention backbone enhances feature extraction by utilizing the LKA module. The design principle of allocating more computation to the shallow stages of the backbone improves detection performance and generalizes from ResNet-based networks to VAN-based networks. Positive sample rematching and quadruple pixel prediction provide sufficient and balanced positive samples in the pixel maps, which further facilitates detection performance. With the above techniques, our efficient anchor-free detector achieves 70.5% AP with only 1.05 Gflops. While achieving competitive results with state-of-the-art methods, we believe there is still room for improvement. Our detector borrows the knowledge gained from SCRFD, which may limit its performance. A future direction is searching the detector's design space directly to find design principles and network configurations for a more optimal design.

Data Availability

The dataset used to support this study is the WIDER FACE dataset [5] (DOI: 10.1109/CVPR.2016.596), which is available at http://shuoyang1213.me/WIDERFACE/.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.