Abstract

Scene text detection methods based on deep learning have recently shown remarkable improvement. Most text detection methods train deep convolutional neural networks with full masks requiring pixel accuracy for good-quality training. Normally, a skilled engineer needs to drag tens of points to create a full mask for a curved text instance. Therefore, data labelling based on full masks is time consuming and laborious, particularly for curved texts. To reduce the labelling cost, a weakly supervised method is proposed in this paper. Unlike other detectors (e.g., PSENet or TextSnake) that use full masks, our method only needs coarse masks for training. More specifically, the coarse mask for one text instance in our method is a line across the text region. Compared with full-mask labelling, the proposed annotation greatly reduces the labelling time, although it discards much annotation information. To compensate, a network pretrained on synthetic data with full masks is used to enhance the coarse masks in real images, and the enhanced masks are fed back to train our network. Experimental analysis shows that the performance of our method is close to that of fully supervised methods on ICDAR2015, CTW1500, Total-Text, and MSRA-TD500.

1. Introduction

At present, natural scene text detection has attracted increasing attention due to its practical applications, such as scene understanding, visual question answering, autonomous driving, and text detection [1] and recognition [2, 3]. Text is one of the most fundamental semantic elements and appears everywhere in daily life, for example, on traffic signs, commodity packages, and advertising posters. Text instances in the real world have varying sizes, random orientations, and arbitrary shapes, making them extremely challenging to label and capture accurately. Unlike general objects, scene text usually cannot be described accurately by an axis-aligned rectangle, and detectors using axis-aligned rectangles achieve only a low F-measure on such text, as reported in TextSnake [4]. Recently, most scene text detectors based on deep learning have tended to detect texts of different shapes with many coordinates for better performance. However, such detectors require accurate pixel-level labels with expensive costs. The labelling consumes a large amount of manpower and financial resources, especially for texts with arbitrary shapes in complex environments.

The precision of text detection is closely connected with the labelling methods of the datasets. For example, several common datasets, ICDAR2013 [5], ICDAR2015 [6], ICDAR2017 [7], Total-Text [8], CTW1500 [9], and MSRA-TD500 [10], use different labelling methods for various texts. ICDAR2013, one of the common datasets, was introduced during the ICDAR Robust Reading Competition in 2013 and mainly includes horizontal bounding boxes defined by two points at the word level. Because of this labelling peculiarity, text detectors [11, 12] using box regression perform well on ICDAR2013. ICDAR2015 was released in the ICDAR2015 Robust Reading Competition for multioriented text detection and uses quadrilateral boxes as annotations, as shown in Figure 1(b). EAST [13] and SPCNet [14], as representative detectors, achieved good results on ICDAR2015. ICDAR2017 is a dataset with texts in nine languages for multilingual scene text detection and, like ICDAR2015, uses quadrilateral boxes as annotations. MSRA-TD500 was released in 2012, and its annotation method is the same as that of ICDAR2015. Unlike the above datasets, Total-Text and CTW1500 contain many curved texts and target the arbitrarily shaped text detection problem. CTW1500 has more than 10k text annotations and at least one curved text per image. Total-Text contains many curved and multioriented texts, which require tens of points for accurate labelling. Recently, segmentation-based text detectors [4, 15, 16] have shown promising performance on existing datasets with high-cost labelling. As the annotation design becomes more complicated to fit the requirements of text detection in the real world, the cost also increases.

The bounding box-based labelling method has low labelling costs but cannot fit text instances accurately in the wild, as shown in Figure 1(b). The pixel-based labelling method matches texts with arbitrary shapes in a complex environment but requires high labelling costs, as shown in Figure 1(c). To mitigate this conflict, we explore detecting texts at the pixel level but with a low labelling cost. Precisely drawing the text region is difficult, but using a line across the text to locate it is simple. Therefore, in this work we simplify the complex text labelling into a line, named the text line. Compared with boxes or full masks, this annotation is extremely simple and contains less pixel information, as shown in Figure 1. Hence, the following two difficulties must be considered:
(i) A weak text line label loses the text edge information and nearly all of the background information, which is rather problematic for supervised training.
(ii) The loss function focuses only on the labelled area and is not sensitive to the unlabelled ground truth.

To solve the above difficulties, a scene text detector based on weakly supervised learning is proposed in this paper. The model is first pretrained on SynthText to make it sensitive to the text region. Subsequently, during training on real data, the pretrained model is used to enhance the text line label. In addition, to enhance the weak label better, a soft label in [0, 1] containing pixel location (distance) information is used. The contributions of this work are summarized as follows:
(i) We propose a scene text detector based on weakly supervised learning that significantly simplifies the annotation process without losing much precision.
(ii) A modified crossentropy loss function named the degree crossentropy is proposed. This loss function can optimize the soft label containing distance information.

2. Related Work

Scene text detection has received significant attention over the past few years, and numerous deep learning-based methods [17–21] have achieved great progress. An increasing number of detectors tend to capture text at the pixel level to detect it more precisely.

2.1. Bounding Box-Level Text Detection Methods

Bounding box regression-based methods [19, 22] are inspired by general object detection methods such as SSD [23] and Faster R-CNN [24]. TextBoxes++ [25] further regresses to quadrangles instead of horizontal bounding boxes for multioriented text detection. RRD [26] uses rotation-invariant and sensitive features from two separate branches for better long text detection. DSRN [2] maps multiscale convolution features onto a scale invariant space and obtains uniform activation of multisize text instances for detecting texts. Although regression-based methods have achieved state-of-the-art performance, it is still difficult to capture all text information in a bounding box without involving a large proportion of background and even other text instances.

2.2. Pixel-Level Text Detection Methods

Pixel-level text detectors draw inspiration from FCN [23] and Mask R-CNN [27]. Using the mask as the annotation, PixelLink [28] performs text/nontext and link prediction at the pixel level. TextSnake [4] learns to predict local attributes, including the text centre line, text region, radius, and orientation, achieving improvements of up to 20% accuracy on curved benchmarks. CRAFT [15] trains a convolutional neural network to produce a character region score and an affinity score. PSENet [16] projects the feature map into several branches to produce multiple segmentation maps. TextField [29] detects scene text by predicting a direction field pointing away from the nearest text boundary at each text point. TextMountain [30] predicts the text centre-border probability and the text centre-direction to detect scene text. Text detectors based on instance segmentation perform better when given higher-precision annotations.

2.3. Weak Supervision Semantic Segmentation

Sun et al. [31] leveraged the power of deep semantic segmentation CNNs while avoiding the need for expensive annotations during training. RTFNet [32] takes advantage of thermal images and fuses both the RGB and thermal information in a novel deep neural network. Tang et al. [33] proposed a normalized cut loss for semisupervised learning; the loss combines a partial crossentropy on labelled pixels and a normalized cut on unlabelled pixels. Wang et al. [1] proposed a self-supervised approach and developed a pipeline to automatically label drivable areas and road anomalies using RGB-D images.

2.4. Weak Supervision Text Detection Methods

WeText [34] trains scene text detection models on a small number of character-level annotated text images, followed by boosting the performance with a much larger number of weakly annotated images at the word/text line level. WordSup [35] trains a character detector by exploiting word annotations in rich large-scale real scene text datasets.

To date, most detectors have been trained with fully annotated masks, requiring pixel-level accuracy for good-quality prediction. Motivated by weakly supervised semantic segmentation [34, 36–38], we propose a weakly supervised scene text detector that reduces the labelling cost without losing high precision.

3. Method

In this section, we first introduce the overall pipeline of the proposed network. Second, the label and the procedure for enhancing the text line are described in detail. Then, the loss function designed for weakly supervised learning is introduced. Finally, we describe the simple postprocessing mechanism.

3.1. Overview

Figure 2 shows the overall pipeline of the proposed method, which is divided into three steps: (1) pretraining the model on a synthetic dataset [17], (2) enhancing the labels on a real dataset, and (3) training with the enhanced labels. In the first step, the model is pretrained on a synthetic dataset with full masks to make it sensitive to the text region. In the second step, the pretrained model outputs an activation map of a real image as a supplement to the weakly annotated label (i.e., the text line). In the final step, the enhanced label is fed back to optimize the network parameters. The output of the model in the final step forms the final prediction result through a contour search.

3.2. Labelling and Label Enhancement
3.2.1. Text Line

In this paper, we define the text line as a line across the text region, as shown in Figure 3. All characters within this text region should be connected with a continuous line (e.g., TL-1 to TL-5). There are no width and curvature requirements for these text lines. However, improper annotations such as TL-6 will result in an obvious decline in text detection accuracy. The BG in Figure 3 represents the background annotation, which has no requirements for the geometric parameters (e.g., shape, width, length, and curvature) of the line. As a result, the TL and the BG constitute the original annotation.

3.2.2. Soft Label

The soft label containing the distance (location) information is used in our method. The shortest distance between each text pixel and the background is calculated. Then, we map these distance values to [0, 1] as the soft label. For pixels concentrated in the centre of the text instance, a strong (high) value that tends to 1 should be given. However, for the estimated edge area, a weak (low) value that tends to 0 should be assigned. As shown in Figure 2 (activation map), the distance-mountain-like activation map is predicted from the model pretrained on SynthText. The shape of the soft label is the same as the distance-mountain shape. The value of the label is calculated using the following equation:

$$y_i = \frac{d_i}{d_{\max}},$$

where $d_i$ is the shortest distance between each text pixel $p_i$ and the background pixels, and $d_{\max}$ is the maximum value of all $d_i$ in the same text instance.
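As a rough illustration only (not the released implementation), the soft label can be generated from a binary text mask with a Euclidean distance transform; the function name soft_label and the use of SciPy here are assumptions.

import numpy as np
from scipy.ndimage import distance_transform_edt, label

def soft_label(text_mask):
    """Map each text pixel to d_i / d_max within its own text instance.

    text_mask: binary array (1 = text, 0 = background).
    Returns a float map in [0, 1]; background pixels stay 0.
    """
    soft = np.zeros(text_mask.shape, dtype=np.float32)
    # Shortest distance from every text pixel to the background.
    dist = distance_transform_edt(text_mask)
    # Normalize per text instance so the centre tends to 1 and the edge to 0.
    instances, num = label(text_mask)
    for k in range(1, num + 1):
        region = instances == k
        d_max = dist[region].max()
        if d_max > 0:
            soft[region] = dist[region] / d_max
    return soft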

3.2.3. Label Enhancement

As shown in Figure 2, label enhancement is an important step in the overall pipeline. The detailed processing of the enhancement is as follows: the network is first pretrained on SynthText for one epoch with full masks, making it sensitive to text areas. The activation maps of real images are generated using this pretrained model. Then, we extract the text skeleton from the given weakly supervised label. Finally, the intersection of the text activation region and the text skeleton is expanded to obtain more annotation information. The only purpose of label enhancement is to use the text line to locate the correct text region in the activation map of the real image and to obtain more supervision information. Enhanced labels only act on the positive part (i.e., the text line), while background annotations are excluded.

Figure 2 (right) illustrates the combination of the text skeleton and the activation map. We first use the text skeleton to locate the corresponding text activation region in the activation map and then seek the corresponding text edge region through continuous dilation of the intersection of the text activation region and the text skeleton. In detail, a pixel is deemed an edge pixel when its value in the activation map approaches 0. Finally, the values of the pixels deemed as edge pixels are used as the supplement to enhance the original annotation (i.e., the text line). Note that the values in the activation map are not common binary probabilities (i.e., text/nontext predictions) but represent location (distance) values. Therefore, we can use the value of each pixel in the text region to infer its relative distance from the background.
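A simplified sketch of this enhancement step is given below. It uses OpenCV and scikit-image primitives, and the seed threshold, edge threshold, and kernel size are illustrative assumptions; it approximates the described procedure rather than reproducing the released code.

import cv2
import numpy as np
from skimage.morphology import skeletonize

def enhance_label(text_line_mask, activation_map,
                  act_thresh=0.3, edge_thresh=0.05, max_iters=50):
    """Grow the weak text-line label over the pretrained model's activation map.

    text_line_mask: binary mask of the annotated text line.
    activation_map: soft map in [0, 1] predicted by the pretrained model.
    Returns a soft enhanced label; pixels outside the grown region stay 0.
    """
    skeleton = skeletonize(text_line_mask > 0)
    # Seed: intersection of the text-line skeleton and the text activation region.
    region = np.logical_and(skeleton, activation_map > act_thresh).astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    for _ in range(max_iters):
        grown = cv2.dilate(region, kernel, iterations=1)
        new_pixels = np.logical_and(grown > 0, region == 0)
        # Stop when the newly covered pixels look like background (values near 0).
        if new_pixels.sum() == 0 or activation_map[new_pixels].mean() < edge_thresh:
            break
        region = grown
    # Use the activation values inside the grown region as the enhanced soft label.
    return np.where(region > 0, activation_map, 0.0).astype(np.float32)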

3.3. Network Design

We chose VGG16 [39] as our feature extractor for a fair comparison with other methods. The images are first downsampled to multilevel features with five convolution blocks, and five feature maps (i.e., $f_1, f_2, \ldots, f_5$) are generated in this step. Then, the features are gradually upsampled to the original size and mixed with the corresponding output of the previous convolution block:

$$h_i = U\big(C(f_i, h_{i+1})\big), \quad i = 4, 3, 2, 1, \qquad h_5 = f_5,$$

where $C(\cdot,\cdot)$ refers to the feature concatenation and $U(\cdot)$ is the upsample function that feeds the feature map into the conv-BN-ReLU layers. The difference for the final map is that it is obtained without the upsample layer, and the channel number is reduced to 1 as the output. Finally, the output obtained through the sigmoid function is used to calculate the loss of the prediction. In addition to VGG16, other backbones (i.e., ResNet) were also adopted in a comparative study in Section 4.6 Ablation Study.
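For illustration, a VGG16 encoder with a U-Net-style decoder of the kind described above can be sketched in PyTorch as follows; the block splits, channel widths, and layer choices here are assumptions and not necessarily the authors' exact configuration.

import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import vgg16

class TextLineNet(nn.Module):
    """VGG16 encoder with a U-Net-style decoder producing a 1-channel soft map."""
    def __init__(self):
        super().__init__()
        features = list(vgg16().features)  # ImageNet weights could be loaded here
        # Split VGG16 into five blocks, each ending with a downsampling pool.
        self.blocks = nn.ModuleList([
            nn.Sequential(*features[0:5]),    # f1: 64 channels, 1/2 resolution
            nn.Sequential(*features[5:10]),   # f2: 128 channels, 1/4
            nn.Sequential(*features[10:17]),  # f3: 256 channels, 1/8
            nn.Sequential(*features[17:24]),  # f4: 512 channels, 1/16
            nn.Sequential(*features[24:31]),  # f5: 512 channels, 1/32
        ])
        self.merge = nn.ModuleList([
            nn.Conv2d(512 + 512, 256, 3, padding=1),
            nn.Conv2d(256 + 256, 128, 3, padding=1),
            nn.Conv2d(128 + 128, 64, 3, padding=1),
            nn.Conv2d(64 + 64, 32, 3, padding=1),
        ])
        self.out = nn.Conv2d(32, 1, 1)  # reduce the channel number to 1

    def forward(self, x):
        feats = []
        h = x
        for block in self.blocks:
            h = block(h)
            feats.append(h)
        h = feats[-1]
        for i, conv in enumerate(self.merge):
            skip = feats[-(i + 2)]
            # Upsample, concatenate with the previous block's output, then mix.
            h = F.interpolate(h, size=skip.shape[2:], mode="bilinear", align_corners=False)
            h = F.relu(conv(torch.cat([h, skip], dim=1)))
        h = F.interpolate(h, scale_factor=2, mode="bilinear", align_corners=False)
        return torch.sigmoid(self.out(h))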

3.4. Loss Function

The prediction is a two-dimensional feature map, and we map the value to [0, 1] using the sigmoid function. These values in a text instance are not the confidences of each pixel but represent the degrees of the shortest distance between each pixel and the background. The common binary crossentropy loss function is

$$L_{bce}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big),$$

where $y$ is the ground truth and $\hat{y}$ is the prediction. The common crossentropy is used to evaluate the confidence of a certain category but cannot calculate a loss value with specific meanings (e.g., our distance values).

In that case, we seek to optimize the loss containing distance values with the $L_1$ or $L_2$ loss. However, we find that $L_1$ and $L_2$ are not sensitive to the distance distribution over [0, 1]. For instance, the $L_1$ loss between a ground truth of 0.5 and a prediction of 0.55 is too small and is not conducive to backpropagation.

To solve the above difficulty, the degree crossentropy is proposed. The degree crossentropy can not only evaluate the confidence of a category but also deal with the distance information. Losses for the positive and negative pixels are calculated according to

$$L = \sum_{i} \Phi(i)\, L_{dce}(\hat{y}_i, y_i),$$

where $L_{dce}(\hat{y}_i, y_i)$ is the degree crossentropy loss of pixel $i$ and $y_i$ is the corresponding ground truth of pixel $i$. Since the enhanced label may not be accurate, we treat the given label and the postenhanced supplements separately. $\Phi$ is a discriminatory mechanism that calculates the losses of the original label and the postenhanced part, respectively. $L_{dce}$ is the degree crossentropy loss:

$$L_{dce}(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y)\log(1 - \hat{y})\big),$$

where $\hat{y}$ is the predicted result after the sigmoid function and $y$ is the ground truth. The loss between the prediction and any target in [0, 1] can be calculated, which helps us deal with the distance degree information of the text. The specific implementation of $\Phi$ is described by

$$\Phi(i) = \begin{cases} 1, & i \in N_{lab} \cup N_{\varepsilon}, \\ 0, & \text{otherwise}, \end{cases} \qquad N_{\varepsilon} = \{\, i \in N_{enh} : |\hat{y}_i - y_i| > \varepsilon \,\},$$

where $i$ refers to pixel $i$ in the entire prediction map, $N_{lab}$ and $N_{enh}$ represent the annotated pixels and the postenhanced pixels, respectively, and $N_{\varepsilon}$ is the set of postenhanced pixels with a difference of more than $\varepsilon$ between the ground truth and the prediction. The postenhanced annotation from the pretrained model may not be quite accurate, and noise interference may exist. Several situations arise during label enhancement; for instance, background pixels may be viewed as text pixels and taken as positive annotations. The causes are the annotation differences between the datasets and the unreliability of the prediction. To make our network learn from noisy or wrong labels, we propose the discriminatory mechanism $\Phi$, which calculates the losses of the original label and the postenhanced part separately. In that case, the network performs strongly supervised learning on the labelled pixels and distribution-supervised learning on the postenhanced pixels. More specifically, the predicted pixel values gradually decrease from the text centre to the edge without exactly fitting the value of the label. The difference between the enhanced annotation and the predicted result is considered reasonable if it is smaller than $\varepsilon$. The value of $\varepsilon$ is set to 0.1 in all the experiments. Therefore, the fault tolerance of $\Phi$ enhances the robustness of the model and avoids some mistakes from the postenhanced annotation.
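A minimal PyTorch sketch of the degree crossentropy and the discriminatory mechanism $\Phi$, following the reconstruction above, is shown below; the mask conventions and the sum-and-normalize reduction are assumptions made for illustration.

import torch

def degree_cross_entropy(pred, target, eps=1e-6):
    """Cross entropy against a soft target in [0, 1] (the distance degree)."""
    pred = pred.clamp(eps, 1.0 - eps)
    return -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))

def weak_supervision_loss(pred, target, labelled_mask, enhanced_mask, epsilon=0.1):
    """Combine losses on annotated and postenhanced pixels.

    pred, target: maps in [0, 1] (after the sigmoid).
    labelled_mask: 1 where the original text line / background is annotated.
    enhanced_mask: 1 where labels were added by label enhancement.
    Postenhanced pixels only contribute when |pred - target| > epsilon.
    """
    per_pixel = degree_cross_entropy(pred, target)
    keep_enhanced = (torch.abs(pred - target) > epsilon).float() * enhanced_mask
    phi = labelled_mask + keep_enhanced          # discriminatory mechanism
    return (phi * per_pixel).sum() / phi.sum().clamp(min=1.0)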

3.5. Postprocessing

Most segmentation-based methods share a common difficulty: separating text instances that are close to each other is challenging. To solve this problem, we propose the apex-edge expansion algorithm, which makes full use of the text-mountain shape. Given the prediction result, each text instance appears as a text mountain, as shown in Figure 4(a), where the text centre line region is the peak and the values of its pixels tend to 1. The text edge pixel areas are similar to the feet of the mountain, and their values are mostly close to zero. Figure 4 presents a vivid example to illustrate the detailed procedure of the apex-edge expansion algorithm.

The detailed procedure of the apex-edge expansion algorithm is shown in Figures 4(b) and 4(c). The postprocessing mainly includes three parts. (1) The peak of each text mountain is selected to differentiate the different text instances. The peak is the pixel block in which the value of each inner pixel approaches 1. (2) The dilate operation in OpenCV is used to expand the peak region continuously until it reaches the mountain foot or meets other text areas. The expansion process is divided into many steps: $E_k$ represents the entire expansion area at step $k$, and $E_{k+1} \setminus E_k$ is called the extended area between two adjacent steps. The criterion for ending the expansion is that the average score of the extended area approaches 0 or starts to increase. An average score approaching 0 means that the expansion area is close to the background, while an increase in the score means that the expansion area begins to cover other text instances. (3) After the expansion, the contour of the whole text instance is represented by a set of coordinates as the final prediction result. The entire postprocessing is shown in Algorithm 1, where $S$ represents the prediction result and the output $T$ is the set of text instances. Dilation(·) is the dilate operation in OpenCV; the value and size of the expansion kernel in Dilation(·) can be changed to realize expansion in different directions and at different scales. Mean(·) is used to calculate the average value of a matrix. The operator $\setminus$ denotes the set difference, and $\to$ and $\uparrow$ refer to tending towards a number and to the value increasing, respectively.

Input: S: segmentation result
Output: T: set of text instances
for each candidate peak region P in S do
  if Mean(P) → 1 then
    E_0 ← P; k ← 0
    while true do
      E_{k+1} ← Dilation(E_k) //expansion operation
      s_{k+1} ← Mean(E_{k+1} \ E_k)
      //get the average score of the extended region
      if s_{k+1} → 0 or s_{k+1} ↑ then
        Enqueue(T, E_k) //push result into T
        break
      end if
      k ← k + 1
    end while
  end if
end for
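For reference, a minimal NumPy/OpenCV sketch of the apex-edge expansion loop is given below; the peak and background thresholds and the 3 × 3 kernel are illustrative assumptions, not the exact values used in the released code.

import cv2
import numpy as np

def apex_edge_expansion(score_map, peak_thresh=0.6, bg_thresh=0.05):
    """Grow each text-mountain peak outwards until the newly covered ring looks
    like background (average score near 0) or the score starts to rise again."""
    text_instances = []
    kernel = np.ones((3, 3), np.uint8)
    peaks = (score_map >= peak_thresh).astype(np.uint8)
    num, peak_labels = cv2.connectedComponents(peaks)
    for k in range(1, num):
        region = (peak_labels == k).astype(np.uint8)
        prev_ring_score = 1.0
        while True:
            grown = cv2.dilate(region, kernel, iterations=1)
            ring = (grown > 0) & (region == 0)          # newly covered pixels
            if not ring.any():
                break
            ring_score = float(score_map[ring].mean())
            # Stop at the mountain foot or when another instance is reached.
            if ring_score < bg_thresh or ring_score > prev_ring_score:
                break
            region = grown
            prev_ring_score = ring_score
        text_instances.append(region)
    return text_instances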

4. Experiments

In this section, we evaluate our approach using ICDAR2015, Total-Text, MSRA-TD500, and CTW1500. The experimental results demonstrate that the performance of the proposed method is comparable to those of the other methods.

4.1. Datasets

The datasets used for testing our method are briefly introduced below.

SynthText is a large-scale dataset that contains approximately 800 K synthetic images. These images were created by blending natural images with text rendered in random fonts, sizes, colours, and orientations. We used this dataset to pretrain our model.

ICDAR2015 is a multioriented text detection dataset for English text that includes only 1,000 training images and 500 testing images. The text regions are annotated by the four vertices of a quadrilateral.

MSRA-TD500 contains 500 natural images. The indoor images are mainly signs, doorplates, and caution plates, while the outdoor images are mostly guide boards and billboards in complex backgrounds.

Total-Text is a word-level English curved text dataset that is split into training and testing sets with 1,255 and 300 images, respectively. The text in these images covers three different orientations: horizontal, multioriented, and curved.

SCUT-CTW1500 contains 1,000 training images and 500 test images, which include multioriented text, curved text, and irregularly shaped text. Text regions in this dataset are labelled with 14 scene text boundary points at the sentence level.

Data labelling to test our method: we manually marked Total-Text, CTW1500, and TD500. As shown in Figure 5, the annotation method is brief and inexpensive. For ICDAR2015, the official label was used to fit the text line label for the further verification experiment. The fitting method is simple: the text skeleton, taken as the text line, is extracted directly from the full label. All annotations will be released.

4.2. Implementation Details
4.2.1. Training

The network was pretrained on SynthText for one epoch and fine-tuned on the other datasets. We adopted the Adam optimizer. During the pretraining phase, the learning rate was fixed at 0.001. During the fine-tuning stage, the learning rate was initially set to 0.0001 and decayed at a rate of 0.94 every 10,000 iterations. All of the experiments were conducted on a regular workstation (CPU: Intel (R) Core (TM) i7-7800X CPU @ 3.50 GHz; GPU: GTX 1080). The model was trained with a batch size of 4 on one GPU.
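A minimal PyTorch sketch of this optimization schedule is shown below; it assumes the model and loss sketched earlier and a hypothetical data loader yielding (image, target, labelled mask, enhanced mask) batches, and applies the decay per training iteration as an assumption.

import torch

# Assumes `model`, `weak_supervision_loss`, and a data loader `loader` as above.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)      # 1e-3 during pretraining
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10000, gamma=0.94)

for images, target, labelled_mask, enhanced_mask in loader:
    pred = model(images)
    loss = weak_supervision_loss(pred, target, labelled_mask, enhanced_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()   # decays the learning rate by 0.94 every 10,000 iterations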

VGG16 was adopted as the backbone network for the comparative experiments. All of the experiments use the same training strategy: (1) enhancing the text annotation information with the model pretrained on SynthText and (2) training the network on the target dataset. To validate the robustness of the proposed method and keep the conditions identical across the comparative experiments, all of the models used in label enhancement were the same model pretrained on SynthText for one epoch.

4.2.2. Data Augmentation

The images were randomly rotated, cropped, and mirrored at a probability of 0.4. Then, colour and lightness were randomly adjusted. Finally, the images were uniformly resized to 512 × 512.
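A rough sketch of such an augmentation, applied jointly to the image and its label map, is given below; the rotation range, crop size, and colour jitter ranges are assumptions beyond the probability and output size stated above.

import random
import cv2
import numpy as np

def augment(image, label, p=0.4, size=512):
    """Randomly rotate, crop, and mirror with probability p, jitter colour, then resize."""
    h, w = image.shape[:2]
    if random.random() < p:                       # random rotation
        angle = random.uniform(-10, 10)
        m = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        image = cv2.warpAffine(image, m, (w, h))
        label = cv2.warpAffine(label, m, (w, h))
    if random.random() < p:                       # random crop from the top-left
        x0, y0 = random.randint(0, w // 8), random.randint(0, h // 8)
        image, label = image[y0:, x0:], label[y0:, x0:]
    if random.random() < p:                       # horizontal mirror
        image = np.ascontiguousarray(image[:, ::-1])
        label = np.ascontiguousarray(label[:, ::-1])
    # Random colour and lightness jitter on the image only.
    image = np.clip(image * random.uniform(0.8, 1.2) + random.uniform(-20, 20), 0, 255)
    image = cv2.resize(image.astype(np.uint8), (size, size))
    label = cv2.resize(label, (size, size), interpolation=cv2.INTER_NEAREST)
    return image, label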

4.2.3. Postprocessing

We obtained all of the text instances with the apex-edge expansion and then used findContours in OpenCV to obtain a set of edge coordinates for each text instance. Finally, the text instances of the regular text datasets (i.e., MSRA-TD500) were described by four coordinate points. Methods such as minAreaRect in OpenCV were applied to obtain the bounding boxes of text instances. For curved text datasets, we used a set of coordinate points to describe the text instance (Tables 1 and 2).
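A brief OpenCV sketch of converting one instance mask into output coordinates (a contour polygon for curved text, a rotated box for regular text) might look as follows; the function name and arguments are illustrative, and OpenCV 4 return conventions are assumed.

import cv2
import numpy as np

def instance_to_coords(instance_mask, as_box=False):
    """Convert one binary instance mask into output coordinates.

    as_box=True returns the 4 corner points of a rotated rectangle (regular text);
    otherwise the full contour polygon is returned (curved text).
    """
    contours, _ = cv2.findContours(instance_mask.astype(np.uint8),
                                   cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return None
    contour = max(contours, key=cv2.contourArea)
    if as_box:
        box = cv2.boxPoints(cv2.minAreaRect(contour))   # 4 corner points
        return box.astype(np.int32)
    return contour.reshape(-1, 2)                        # polygon points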

4.3. Detecting Curved Text

The CTW1500 and Total-Text datasets were used to test the ability to detect curved text. In these experiments, the manual text line annotation was used for training. The model pretrained for one epoch on SynthText served two purposes: one was to enhance the annotation information, and the other was to initialize the fine-tuning on the other datasets.

The training started from the pretrained model and achieved the best result between 20 and 40 epochs. The F-measure showed a fluctuation of approximately , while the threshold of the peak was in [0.5, 0.8]. For the comparative experiments, the threshold of the peak in the apex-edge expansion algorithm was set to 0.6 for CTW1500 and Total-Text. We continued to expand the peak region until the average score of the extended area approached 0 or another text instance was met.

The F-measure of our method with text line was on Total-Text, while the F-measure of our method with full masks was , as shown in Table 3. The performance with full masks was close to that of the newest method. The difference () shows that using the text line can still achieve good results on the challenging poor annotation. The recall () was close to the values obtained for the other methods. On CTW1500, our method showed excellent results that were very close to the results obtained by the other strong-supervised methods with an F-measure of . The difference () between the F-measure of using the text line and that of using the full mask was also acceptable.

4.4. Detecting Long Text

TD500 contains many long text scenes and therefore is an excellent dataset for verifying the robustness of the network in long text cases. In the experiment, text line annotation was enhanced by the model pretrained on SynthText. The pretrained model was also used for fine-tuning on TD500. The threshold of the peak in the apex-edge expansion algorithm was set to 0.6, which is the same value as the experiments on CTW1500 and Total-Text. Table 4 compares the proposed method with state-of-the-art methods on TD500. The proposed method achieved an F-measure of , which is competitive with other state-of-the-art detectors trained in a strongly supervised way.

4.5. Detecting Oriented Text

All of the parameter settings and training details for ICDAR2015 were the same as those for the experiments on the curved text datasets. The official label was used to fit the text line label for the further verification experiment on ICDAR2015. Similar to the experiment on TD500, minAreaRect in OpenCV was used to obtain the bounding boxes of the text instances. In contrast, several detectors listed in Table 5 used extra datasets; for instance, the F-measure of PSENet [16] was without an extra dataset. The F-measure () of our method was already comparatively close to those of the other methods.

4.6. Ablation Study

Three groups of comparative experiments were performed to verify the effectiveness of our method.

4.6.1. Baseline

The baseline was trained with the text line without label enhancement, and the F-measure of the baseline on Total-Text was , as shown in Table 3.

4.6.2. Label Enhancement

The results shown in Table 3 are further analysed with respect to label enhancement of the model on Total-Text. Training with an unenhanced text line shows unsatisfactory performance (), while training with a full mask obtained an F-measure of . The large difference () indicates that the text line loses important supervision information. After introducing the model pretrained on SynthText to enhance the text line, the performance of the model improved obviously from to . In addition, using the synthetic text line derived from the full mask shows better performance (). The main reason is that extracting the text skeleton from the manual text line introduces a larger error than using the synthetic text line. In addition, we also compared the performance of the model pretrained on different datasets: synthetic data (i.e., SynthText) and realistic data (i.e., SCUT-CTW1500). The F-measures using SynthText and CTW1500 were and , respectively. Obviously, the model pretrained with realistic data shows some advantages. This also indicates an intrinsic limitation of the method: its dependence on the pretrained model.

4.6.3. Geometric Parameters of the Text Line

As shown in Table 6, the impact of the width and the offset of the text line was evaluated. For the width of the text line, we used synthetic or manually marked text lines of different widths to test our model. For the manually marked text line, we extracted its one-pixel-wide skeleton and dilated the skeleton to different widths, provided the width remained less than that of the original text line. For the synthetic text line, the one-pixel skeleton was extracted from the full mask and dilated to create different widths. For the same text line width, using the synthetic text line usually achieved better performance than using the manual text line, and the average difference was approximately . In addition, with increasing width, the F-measure showed a fluctuation of approximately . The offset of the text line was set to 0 in all experiments evaluating the influence of the text line width.

Apart from the evaluation of the influence of width, the offset between the synthetic text line and the centre line of the text instance was also used to test our detection method. The offset in Table 6 refers to the offset error ratio $r = d/w$, where $d$ is the distance between the text line and the text centre line, and $w$ is the width of the text region. We only performed this experiment on the synthetic text line, because the offset between the manual text line and the text centre line is difficult to calculate. The text centre line was calculated from the original coordinate annotation, and we then created the text line by setting the corresponding offset ratio. The curvature and width of the created text line were the same as those of the text centre line. All widths of the text line or text centre line were one pixel in this experiment. While the offset ratio of the text line was below , the F-measure barely fluctuated. When the offset ratio of the text line exceeded , the performance of the model started to be affected slightly, but the fluctuation of around was still acceptable.

4.6.4. Backbone

As shown in Table 7, a series of experiments comparing different backbones was performed to evaluate their influence on the proposed method. Similar to VGG16, the five feature maps generated by VGG11 were gradually upsampled to the original size. For the ResNet series, four feature maps were merged. The F-measure using VGG11 was similar to that using VGG16, but the latter had a slightly slower inference time. Due to their more sophisticated design, the ResNet series had a longer convergence time, but the performance was comparatively accurate and stable.

4.6.5. Loss Function

As shown in Figure 6(a), due to the instability of the enhanced annotation, the F-measure decreased after dozens of epochs on the four common datasets, particularly for the curved text datasets. As shown in Figure 6(b), training with the text line was unstable relative to training with full labelling, and the model with full labelling showed better convergence with an increasing number of training epochs. After incorporating $\Phi$ into the loss function, the model trained with the text line showed improved convergence, with a convergence fluctuation of approximately .

5. Conclusion and Future Work

In this paper, we introduced a novel text detector based on weakly supervised learning. The most prominent features of the method are a novel labelling method named the text line and the full use of a model pretrained on SynthText. The text line helps the detector decrease the cost of labelling, and the pretrained model improves the performance of the detector. The experiments showed that the text line with low-cost labelling can be used to train an effective text detector and further verified the feasibility of using a synthetic text dataset to enhance weak labels. Efficient low-cost text detectors have potential applications in fields such as photo translation. Synthetic data will play an increasingly important role in deep learning in the future. One reason is that the high cost of annotation hinders the application of algorithms to actual scenes. Another reason is that synthetic data are increasingly similar to real-world images, and the development of auxiliary methods promotes the development of synthetic text. In future work, it will be important to train methods with synthetic data but apply them to the real world.

Data Availability

The data are now publicly available at https://github.com/xingjici/Texts-as-Lines-Text-Detection-with-Weak-Supervision, and the corresponding code is still being cleaned up. A data description can be found in the Abstract section.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Authors’ Contributions

Weijia Wu and Jici Xing contributed equally to this work.

Acknowledgments

This work was supported by the National Key Research and Development Project (Grant no. 2019YFC0118202), National Natural Science Foundation of China (Grant nos. 61803332 and 11574269), Natural Science Foundation of Zhejiang Provence (Grant no. LQ18E050001), Fundamental Research Funds for the Central Universities (Grant no. 2019FZJD005), and Scientific Research Fund of Zhejiang Provincial Education Department (Grant no. Y201941642).