Abstract

Deep learning-based object detection methods have been applied in various fields, such as intelligent transportation systems (ITS) and autonomous driving systems (ADS). Meanwhile, text detection and recognition in different scenes have also attracted much attention and research effort. In this article, we propose a new object-text detection and recognition method termed “DetReco”, which detects objects and texts and recognizes the text contents. The proposed method is composed of an object-text detection network and a text recognition network. YOLOv3 is used for the object-text detection task and CRNN is employed for the text recognition task. We combine datasets of general objects and texts to train the networks. At test time, the detection network detects the various objects in an image; the cropped text regions are then passed to the text recognition network to derive the text contents. The experiments show that the proposed method achieves 78.3 mAP (mean Average Precision) for general objects and 72.8 AP (Average Precision) for texts in terms of detection performance. Furthermore, the proposed method detects and recognizes affine-transformed or occluded texts robustly. In addition, for texts detected around general objects, the text contents can be used as identifiers to distinguish the objects.

1. Introduction

Object detection [1, 2], as one of the most fundamental and challenging problems in computer vision, has received great attention in recent years. In the context of computer vision, object detection deals with the task of detecting instances of visual objects of specific classes such as humans, animals, and cars in digital images. It combines the cutting-edge technologies in many fields such as image processing, pattern recognition, automatic control, and artificial intelligence. Object detection is widely used in many fields including intelligent transportation systems [3, 4], advanced driver assistance systems (ADAS), and autonomous driving systems.

In intelligent traffic surveillance systems [5], vehicle detection and recognition are vital tasks. Automatic monitoring cameras take snapshots of passing vehicles and other moving objects to provide valuable clues, including the license plate number, the vehicle type, and the driver's facial image, for traffic authorities and other security departments. In recent years, autonomous cars and driverless vehicles have significantly changed the manner of transportation. Computer vision systems are used efficiently in the development of ADAS. Sakhare et al. [6] present a detailed study of vehicle detection in dynamic conditions. Yudin et al. [7] study vehicle detection in difficult areas with various architectures of deep neural networks [8].

In automated driving, detection and recognition of pedestrians, vehicles, traffic lights, and traffic signs [9] help avoid accidents and achieve safe driving. Collision avoidance systems are required to help the driver handle emergencies. Detecting pedestrians is essential for autonomous driving [10]. Zhang and Kim [11] propose a pedestrian detector which combines skip pooling from multiscale feature maps with recurrent convolutional layers to detect pedestrians at different scales. Reliable traffic light detection and classification in urban environments are also crucial for automated driving [12, 13]. Kim et al. [12] develop a two-step method to detect traffic lights with the SSD architecture. Lu et al. [14] utilize a visual attention model to detect traffic signals, which is effective for the detection of small objects.

Object detection, which is at the core of various intelligent transportation systems, has been a research hotspot in recent years. Meanwhile, the rapid development of deep learning has accelerated the development of object detection, and many deep learning based object detection techniques have led to giant breakthroughs and remarkable performance. Object detection methods can be divided into one-stage methods and two-stage methods. Two-stage methods usually involve two steps: firstly, region proposals are obtained from the original image; secondly, classification and regression networks such as the R-CNN [15] (Region-based Convolutional Neural Network) series are used to detect the region proposals. One-stage methods need only one step: they accomplish the classification and bounding box regression tasks directly, without finding region proposals separately. Typical one-stage algorithms include SSD [16] (Single Shot Multibox Detector) and YOLO [17] (You Only Look Once).

R-CNN, proposed by Ross B. Girshick, uses the selective search [18] method to generate ROIs (Regions of Interest), which are scaled and passed through a convolutional network for feature extraction. Because R-CNN requires a forward pass for a large number of region candidates which may overlap each other, training and detection are very slow. Fast R-CNN [19] uses a feature extractor to extract the features of the entire image once, instead of extracting features repeatedly for each region proposal, so the processing time is significantly reduced. Faster R-CNN [20] uses a design similar to Fast R-CNN but replaces the selective search method with an RPN (region proposal network), which solves the problem of the excessive time overhead in generating ROIs. Faster R-CNN achieves high accuracy and improves detection speed to some extent, but it still cannot meet real-time requirements.

Compared with Faster R-CNN, SSD has a significant advantage in detection speed. The network generates multiple feature maps at different scales, and the classification and bounding box regression tasks are carried out simultaneously on these multiscale feature maps. SSD is able to detect large objects effectively. YOLO is another one-stage method. It predicts the bounding boxes and class probabilities of multiple objects simultaneously. However, unlike SSD, YOLO does not use multiscale feature maps for detection. Its generalization capability for objects with large scale variations is poorer than that of SSD, which leads to missed detections and low recognition accuracy. The YOLOv2 [21] algorithm adopts an anchor mechanism which uses convolutional layers instead of the fully connected layers in YOLO to predict the bounding boxes. The disadvantage of using fully connected layers to predict bounding boxes is that the spatial information of the feature map is lost, whereas the anchor mechanism predicts the bounding boxes directly on the feature map with convolutional layers, so the spatial information of the feature map is well preserved. Each point of the feature map corresponds to a grid cell of the original image. YOLOv2 thus improves the detection accuracy. The YOLOv3 [22] algorithm adopts multiscale feature maps to predict bounding boxes. YOLOv3 follows the FPN (Feature Pyramid Network) concept, merging the outputs of the middle layers with the outputs of the later layers. The high-level features are passed to the lower layers, so that small objects on low-level feature maps can be better detected. YOLOv3 is greatly improved in terms of both detection speed and accuracy.

The majority of recent works on deep neural networks have been devoted to the detection or classification of object categories [23]. Beyond that, another problem in computer vision that plays a vital role in intelligent transportation systems is image-based text recognition. Text recognition aims to decode a sequence of labels from cropped text images.

Conventional methods recognize the text contents at the character level: the characters are segmented from the cropped text image, and the segmented character regions are then preprocessed and recognized. Different from character-level recognition methods, recent text recognition methods do not require character segmentation in advance. One well-known method is the multidigit number classification proposed by Goodfellow et al. [24], which is based on a DCNN (deep convolutional neural network). The method requires selecting the maximum predictable sequence length in advance, which limits it to recognizing house numbers or license plate numbers whose text length is known beforehand. Another commonly used method is the RNN (recurrent neural network) with CTC [25] (connectionist temporal classification). Shi et al. [26] and He et al. [27] propose RNN models to encode the features from a CNN and adopt CTC to decode the encoded sequence. The advantage of this method is that it can generate texts of any length. Furthermore, the recurrent structure enables the model to learn the temporal dependencies among the characters of the text. Another type of method that does not require character segmentation is the attention mechanism. Lee and Osindero [28] use an attention-based sequence-to-sequence structure to automatically focus on certain extracted CNN features and directly use text images to perform word string learning. This method implicitly learns a character-level language model embodied in the RNN and is able to perform text recognition in unconstrained natural scenes.

Scene text recognition [29] in intelligent transportation systems has many applications, such as vehicle license plate recognition and road sign recognition. As an important part of intelligent transportation systems, vehicle license plate recognition is widely used in intelligent monitoring systems and parking systems. Automatic license plate recognition (ALPR) refers to the extraction of vehicle license plate information from an image or a sequence of images [30]. Chai and Zuo [31] propose an automatic vehicle license plate recognition method which adopts an edge detection algorithm for plate extraction, character segmentation, and recognition. Chang et al. [32] use license plate recognition technology to track vehicles on the road in complex traffic conditions.

Object detection in applications refers to detection under specific application scenarios, such as pedestrian detection, vehicle detection, and scene text detection. Text recognition in specific application scenarios can extract more information from the objects on which the applications focus. In this paper, we propose a model which combines object-text detection and text recognition. The model is able to detect both texts and general objects simultaneously: it combines the object detection and text detection tasks and recognizes the contents of the detected texts. In addition, for the texts detected around general objects, the contents can be used as identifiers to distinguish the objects. The method we propose can be applied to a wide range of applications in intelligent transportation systems and has comprehensive capabilities of detection and recognition.

The contributions are summarized as follows:
(1) We propose an object-text detection model which can simultaneously detect texts and general objects.
(2) We propose a text recognition framework that effectively combines text detection and recognition.
(3) The method we propose can detect multiple types of objects and instantiate the identities of the detected objects based on the recognized text labels; the recognized text label is used as a valid identity of the object.

2. Materials and Methods

The network structure in this paper consists of two parts: the object-text detection network and the text recognition network. We use the YOLOv3 architecture, which adopts a fully convolutional neural network [33], to detect objects and texts in real-scene images. The convolutional network extracts features at multiple scales from the image. The classification and bounding box regression networks directly output the objectness score, the class of the object, and the coordinate offsets of the object on the multiscale feature maps. We use NMS [34] (nonmaximum suppression) to remove redundant bounding boxes which have large overlap on the same object. We adopt a successful scene text recognition algorithm, CRNN [26] (Convolutional Recurrent Neural Network), in conjunction with object-text detection. According to the coordinates of the text-type detections output by the object-text detection network, the text regions are cropped from the original image. A convolutional neural network is used to extract features from the text regions. The text regions are scaled to a uniform height with a fixed aspect ratio before feature extraction. We use the recurrent model to encode the feature sequences from the feature maps and CTC to decode the encoded sequence. The network structure we propose is shown in Figure 1.
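For illustration, the inference pipeline described above can be sketched in a few lines of Python. The callables detect_objects_and_text and recognize_text, the tuple layout of the detections, and the text class id are illustrative placeholders for the trained YOLOv3 and CRNN models, not part of a released implementation.

# Minimal sketch of the DetReco inference pipeline, assuming the image is a
# numpy-style array of shape (H, W, 3) and the detector returns one tuple per box.
def run_detreco(image, detect_objects_and_text, recognize_text, text_class_id):
    detections = detect_objects_and_text(image)  # [(x1, y1, x2, y2, score, class_id), ...]
    results = []
    for (x1, y1, x2, y2, score, cls) in detections:
        text = None
        if cls == text_class_id:
            crop = image[int(y1):int(y2), int(x1):int(x2)]  # crop the text region
            text = recognize_text(crop)                     # CRNN recognition with CTC decoding
        results.append({"box": (x1, y1, x2, y2), "class": cls,
                        "score": score, "text": text})
    return results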

2.1. Architecture of the Object-Text Detection Network

The backbone network adopts Darknet-53, of which the first 52 layers are used and the fully connected layer is removed. The feature extraction network is a fully convolutional network. It is mainly composed of 3 × 3 and 1 × 1 convolution kernels and a large number of shortcut connections with residual units [35–37]. The structure of the feature extraction network is shown in Figure 2. The network uses strided convolution kernels instead of pooling layers to reduce the negative gradient effects brought by pooling. We also adopt extensive data augmentation and batch normalization to avoid overfitting. In order to enhance the accuracy of the algorithm for small object detection, the network adopts upsampling and fusion methods similar to FPN [38] to build the multiscale feature maps.

As shown in Figure 3, we assume the size of the input image to be 416 × 416. We extract three feature maps of different scales from the 26th, 43rd, and 52nd layers of the feature extraction network in Figure 2. The scales of the extracted feature maps are 13 × 13, 26 × 26, and 52 × 52. The feature fusion network outputs three feature maps of different scales via upsampling and fusion: the top-level 13 × 13 map is upsampled and concatenated with the 26 × 26 map, and the result is upsampled again and concatenated with the 52 × 52 map. In this way, the high-level features from the top layer are passed to the lower layers, which makes the model better at detecting small objects on the low-level feature maps. Finally, the network generates three feature maps whose scales are 1/8, 1/16, and 1/32 of the original image.
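The upsample-and-concatenate fusion can be sketched as follows in PyTorch; the channel sizes are illustrative and do not reproduce the exact Darknet-53/YOLOv3 configuration.

# Sketch of FPN-style fusion: reduce channels, upsample the coarse map, concatenate with the finer map.
import torch
import torch.nn as nn

class FuseUp(nn.Module):
    def __init__(self, coarse_ch, out_ch):
        super().__init__()
        self.reduce = nn.Conv2d(coarse_ch, out_ch, kernel_size=1)
        self.up = nn.Upsample(scale_factor=2, mode="nearest")

    def forward(self, coarse, fine):
        x = self.up(self.reduce(coarse))      # e.g. 13x13 -> 26x26
        return torch.cat([x, fine], dim=1)    # fuse along the channel axis

# Example with dummy feature maps at the three scales.
f13 = torch.randn(1, 1024, 13, 13)
f26 = torch.randn(1, 512, 26, 26)
f52 = torch.randn(1, 256, 52, 52)
p26 = FuseUp(1024, 256)(f13, f26)   # shape (1, 768, 26, 26)
p52 = FuseUp(768, 128)(p26, f52)    # shape (1, 384, 52, 52)
print(p26.shape, p52.shape)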

The output layers at the 3 different scales are also convolutional. In our experiments, our dataset has twenty-one classes, including twenty general categories and one text category, and we predict 3 bounding boxes of different sizes on the feature map of each scale. The shape of the output tensor is therefore N × N × [3 × (4 + 1 + 21)], where N × N is the scale of the feature map, 3 is the number of anchor boxes at each scale, 4 is the number of coordinate offsets of the bounding box, 1 is the objectness confidence prediction, and 21 is the number of object classes.
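As a quick sanity check, the following snippet computes the number of output channels for the 21-class setup at the three scales; the numbers follow directly from the tensor shape given above.

# Output channels per scale: anchors * (offsets + objectness + classes).
num_classes = 21
num_anchors_per_scale = 3
channels = num_anchors_per_scale * (4 + 1 + num_classes)  # 3 * (4 + 1 + 21) = 78

for grid in (13, 26, 52):
    print(f"scale {grid}x{grid}: output tensor {grid} x {grid} x {channels}")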

The network adopts an anchor-based mechanism. Each grid cell of the feature maps predicts 3 bounding boxes according to the anchor boxes of 3 different scales. There are in total 9 anchor boxes of different scales, which are generated by k-means clustering. The 9 clusters on the COCO dataset [39] are (10 × 13), (16 × 30), (33 × 23), (30 × 61), (62 × 45), (59 × 119), (116 × 90), (156 × 198), and (373 × 326). The anchor boxes assigned to the feature maps of different scales are shown in Figure 4.
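Such anchors can in principle be reproduced with k-means clustering that uses 1 − IoU as the distance between (width, height) pairs; the sketch below illustrates the procedure on randomly generated box sizes standing in for a real dataset, so its output is not the COCO anchor set listed above.

# k-means over (w, h) pairs with IoU-based assignment, in the spirit of the YOLO anchor clustering.
import numpy as np

def iou_wh(boxes, clusters):
    """IoU between (w, h) boxes and (w, h) cluster centers, both anchored at the origin."""
    w = np.minimum(boxes[:, None, 0], clusters[None, :, 0])
    h = np.minimum(boxes[:, None, 1], clusters[None, :, 1])
    inter = w * h
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + (clusters[:, 0] * clusters[:, 1])[None, :] - inter
    return inter / union

def kmeans_anchors(boxes, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    clusters = boxes[rng.choice(len(boxes), k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, clusters), axis=1)   # nearest cluster = highest IoU
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else clusters[i] for i in range(k)])
        if np.allclose(new, clusters):
            break
        clusters = new
    return clusters[np.argsort(clusters[:, 0] * clusters[:, 1])]  # sort anchors by area

# Random box sizes standing in for ground-truth (w, h) pairs of a real dataset.
boxes = np.abs(np.random.default_rng(1).normal(80, 40, size=(500, 2))) + 5
print(kmeans_anchors(boxes, k=9).round(1))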

The object-text detection network simultaneously predicts bounding boxes of texts and general objects conditioned on its input feature maps. At each grid cell of the associated feature map, it outputs the objectness confidence, classification scores, and coordinate offsets relative to its associated anchor boxes in a convolutional manner.

The object-text detection network adopts logistic regression to predict the bounding boxes and the objectness score of each anchor. Only the anchor with the highest objectness score for a ground truth is used, so each object is detected by only one anchor. This step is performed before prediction, which removes unnecessary anchors and reduces the amount of computation. For bounding box regression, the network outputs the coordinate offsets (t_x, t_y, t_w, t_h). The formula that converts the offsets to bounding box coordinates is defined as

b_x = σ(t_x) + c_x,  b_y = σ(t_y) + c_y,  b_w = p_w·e^(t_w),  b_h = p_h·e^(t_h),  (1)

where (b_x, b_y, b_w, b_h) are the center coordinates, width, and height of the bounding box, (c_x, c_y) are the coordinates of the grid cell, (p_w, p_h) are the width and height of the anchor box, and σ(·) represents the sigmoid function.
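The conversion in equation (1) can be sketched as follows; treating (c_x, c_y) as grid-cell indices scaled to pixels by the stride, as well as the example anchor and offsets, are illustrative assumptions about units rather than part of the original description.

# Decode predicted offsets into a bounding box, following equation (1).
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph, stride):
    """Convert predicted offsets into box center and size in pixels."""
    bx = (sigmoid(tx) + cx) * stride   # center x
    by = (sigmoid(ty) + cy) * stride   # center y
    bw = pw * np.exp(tw)               # width relative to the anchor
    bh = ph * np.exp(th)               # height relative to the anchor
    return bx, by, bw, bh

# Example: a prediction in grid cell (7, 5) of the 13x13 map (stride 32),
# relative to the (116, 90) anchor.
print(decode_box(0.2, -0.1, 0.05, 0.1, 7, 5, 116, 90, 32))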

2.2. Loss Function of the Object-Text Detection Network

Objectness confidence is the probability that an object-text exists in the anchor box. The objectness confidence loss adopts binary cross entropy and is defined as

L_conf = −Σ_i [o_i·log(ĉ_i) + (1 − o_i)·log(1 − ĉ_i)],

where o_i represents the existence of an object-text in anchor box i and ĉ_i represents the predicted sigmoid probability that an object exists in the corresponding bounding box.

The object-text classification score is the probability of the class to which the object-text belongs. The object-text class loss function is defined as

L_cls = −Σ_i Σ_c [y_(i,c)·log(p̂_(i,c)) + (1 − y_(i,c))·log(1 − p̂_(i,c))],

where y_(i,c) represents whether the object-text in anchor box i belongs to class c and p̂_(i,c) represents the predicted sigmoid probability of class c for bounding box i.
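A minimal numpy sketch of the two binary cross-entropy terms above (objectness and classification) is given below; the example targets and probabilities are made up for illustration.

# Element-wise binary cross entropy between targets and sigmoid probabilities.
import numpy as np

def bce(target, prob, eps=1e-7):
    prob = np.clip(prob, eps, 1.0 - eps)   # avoid log(0)
    return -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

# Objectness: one target per anchor (object present or not).
obj_target = np.array([1.0, 0.0, 1.0])
obj_prob = np.array([0.9, 0.2, 0.6])
conf_loss = bce(obj_target, obj_prob).sum()

# Classification: one-hot class targets for anchors that contain an object.
cls_target = np.array([[0.0, 1.0, 0.0], [1.0, 0.0, 0.0]])
cls_prob = np.array([[0.1, 0.8, 0.1], [0.7, 0.2, 0.2]])
cls_loss = bce(cls_target, cls_prob).sum()
print(conf_loss, cls_loss)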

Object-text detection model predicts the coordinate offsets between anchor boxes and bounding boxes. Equation (1) is used to convert the offsets to the coordinates of the bounding box. The object-text location loss adopts the GIoU [40] (Generalized Intersection over Union) method to calculate the error between the bounding box and ground truth. The GIoU Loss Algorithm is defined as in Algorithm 1.

Input: The regions of the predicted box B_p and the ground-truth box B_g, and the input image size W × H.
Output: The location loss L_loc.
Step 1. Calculate the smallest enclosing region C of B_p and B_g;
Step 2. IoU = |B_p ∩ B_g| / |B_p ∪ B_g|;
Step 3. GIoU = IoU − |C \ (B_p ∪ B_g)| / |C|;
Step 4. Calculate the bounding box scale: s = 2 − (w_g × h_g)/(W × H), where w_g and h_g are the width and height of the ground-truth box;
Step 5. Calculate the location loss function: L_loc = Σ o·s·(1 − GIoU), where o indicates the existence of an object-text in the associated bounding box.
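A compact Python rendering of Algorithm 1 for a single predicted/ground-truth box pair is sketched below, assuming boxes given as (x1, y1, x2, y2); it follows the steps above but is not the authors' released code.

# GIoU-based location loss for one box pair, following Algorithm 1.
def giou_loss(pred, gt, img_w, img_h):
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt
    # Intersection and union of the two boxes.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    union = area_p + area_g - inter
    iou = inter / union
    # Smallest enclosing region C (Step 1) and GIoU (Step 3).
    cx1, cy1 = min(px1, gx1), min(py1, gy1)
    cx2, cy2 = max(px2, gx2), max(py2, gy2)
    area_c = (cx2 - cx1) * (cy2 - cy1)
    giou = iou - (area_c - union) / area_c
    # Bounding box scale factor (Step 4): weight small boxes more.
    scale = 2.0 - area_g / (img_w * img_h)
    return scale * (1.0 - giou)

print(giou_loss((50, 50, 150, 150), (60, 60, 160, 160), 416, 416))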

We use the GIoU-based L_loc as the object-text location loss. The total loss function can be represented as

L = λ_1·L_conf + λ_2·L_cls + λ_3·L_loc,

where λ_1, λ_2, and λ_3 are the weights of each loss term and are set empirically.

2.3. NMS Module

The NMS module is applied to remove redundant object-text bounding boxes detected for the same object. We apply NMS to the object-text bounding boxes after the object-text detection step.
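A minimal sketch of greedy NMS as used here is given below; the IoU threshold of 0.45 and the example boxes are illustrative.

# Greedy non-maximum suppression over boxes given as (x1, y1, x2, y2) with scores.
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Return indices of boxes kept after non-maximum suppression."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]      # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]   # drop boxes that overlap too much
    return keep

boxes = np.array([[10, 10, 100, 100], [12, 12, 98, 102], [200, 200, 260, 260]], float)
scores = np.array([0.9, 0.8, 0.7])
print(nms(boxes, scores))   # the second box is suppressed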

2.4. Text Recognition

After the object positions are produced by the object-text detection network, we pick out the text-type bounding boxes based on the text class. Firstly, the text extractor crops the text regions corresponding to the coordinates of the text bounding boxes output by the object-text detection module. Then the text recognition module preprocesses the extracted text regions by resizing them before they are fed into the convolutional neural network: each text region is scaled, with a fixed aspect ratio, to H × W × C, where H is the fixed height, W is the maximum length, and C is the number of image channels. Finally, the scaled text region is used as the input of the convolutional layers.
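The preprocessing step can be sketched as follows; the height of 32 pixels and maximum width of 100 pixels are assumed values for H and W, and right-padding short regions with zeros is one possible way to reach the maximum length.

# Crop a text box, rescale to a fixed height keeping the aspect ratio, pad to a maximum width.
import numpy as np
from PIL import Image

def prepare_text_region(image, box, height=32, max_width=100):
    x1, y1, x2, y2 = [int(v) for v in box]
    crop = image.crop((x1, y1, x2, y2))                     # text region
    w, h = crop.size
    new_w = min(max_width, max(1, round(w * height / h)))   # keep aspect ratio
    crop = crop.resize((new_w, height), Image.BILINEAR)
    canvas = np.zeros((height, max_width, 3), dtype=np.uint8)
    canvas[:, :new_w] = np.asarray(crop)                    # right-pad with zeros
    return canvas

img = Image.new("RGB", (416, 416), color=(128, 128, 128))
print(prepare_text_region(img, (50, 60, 210, 100)).shape)   # (32, 100, 3)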

We adopt the CRNN model as our text recognizer. Firstly, the convolutional layers extract the feature maps from the preprocessed text region, and a sequence of feature vectors is extracted from the feature maps from left to right. Each frame of the sequence, which corresponds to a vertical region of the original text image, then becomes the input of the recurrent layers. The recurrent layers adopt a deep bidirectional LSTM [41] (long short-term memory) to encode the sequence of feature vectors. Finally, we adopt CTC to predict the text label corresponding to the sequences from the recurrent layers.
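A compact PyTorch sketch of a CRNN-style recognizer, in the spirit of the pipeline just described, is given below; the layer sizes and the 37-symbol alphabet are illustrative, not the exact CRNN configuration.

# Convolutional features -> left-to-right frame sequence -> BiLSTM -> per-frame class scores.
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        self.cnn = nn.Sequential(                      # feature extraction
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),           # collapse the height dimension to 1
        )
        self.rnn = nn.LSTM(256, 128, num_layers=2,
                           bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * 128, num_classes)      # num_classes includes the CTC blank

    def forward(self, x):                              # x: (B, 1, H, W)
        f = self.cnn(x)                                # (B, 256, 1, W')
        seq = f.squeeze(2).permute(0, 2, 1)            # (B, W', 256) left-to-right frames
        out, _ = self.rnn(seq)                         # BiLSTM encoding
        return self.fc(out)                            # (B, W', num_classes) per-frame scores

model = TinyCRNN(num_classes=37)                       # e.g. 26 letters + 10 digits + blank
logits = model(torch.randn(2, 1, 32, 100))
print(logits.shape)                                    # torch.Size([2, 25, 37])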

3. Results and Discussion

3.1. Experiment Setup

The object-text detection network is trained with training images using Adam (adaptive moment estimation) [42]. We initialize the model with pretrained weights on the COCO dataset. We divide the training process into two stages. In the first stage, we fix the backbone network and just train the classification and regression network. In the second stage, we train the whole network.

When we train the object-text detection network, during the first two epochs we gradually increase the learning rate from low to high; this is called the "warmup stage". The network converges quickly with a large learning rate, and afterwards the learning rate must be lowered to stabilize the network and avoid gradient oscillation. We adopt the cosine annealing strategy proposed by Loshchilov et al. [43]. At the t-th training step, the learning rate decays with cosine annealing as follows:

η_t = η_end + (1/2)·(η_init − η_end)·(1 + cos(π·(t − t_warm)/(T − t_warm))),

where η_init is the initial value of the learning rate, η_end is the end value, t accounts for how many steps have been performed, T is the total number of steps during training, and t_warm represents the warmup steps in the first two epochs. The learning rate curve is shown in Figure 5.
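The warmup-plus-cosine-annealing schedule can be sketched as follows; the initial and end learning rates and the step counts are illustrative values, not the exact settings of our training runs.

# Warmup followed by cosine annealing, as in the formula above.
import math

def learning_rate(step, total_steps, warmup_steps, lr_init=1e-3, lr_end=1e-6):
    if step < warmup_steps:
        # Warmup stage: grow linearly from 0 to the initial learning rate.
        return lr_init * step / warmup_steps
    # Cosine annealing from lr_init down to lr_end.
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_end + 0.5 * (lr_init - lr_end) * (1.0 + math.cos(math.pi * progress))

total, warm = 10000, 1000
for s in (0, 500, 1000, 5500, 10000):
    print(s, round(learning_rate(s, total, warm), 8))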

The training algorithm of the object-text detection model is summarized as in Algorithm 2.

Input: Parameter_1: The training set X = {x_1, x_2, …, x_N}, where N is the number of batches and x_i is the i-th batch of the training set;
Parameter_2: The labels Y = {y_1, y_2, …, y_N} corresponding to the training images, each of which records the class and bounding box coordinates of the annotated objects;
Output: Weights θ of the model
for epoch e = 1, 2, … do
  if e is within the warmup epochs then
   η ← warmup learning rate;
  else
   η ← cosine-annealed learning rate;
  for batch i = 1 to N do
   Predict the offsets, objectness, and classes:
   (t_x, t_y, t_w, t_h) ← f_loc(x_i; θ);
   (o, c) ← f_conf,cls(x_i; θ);
   Calculate the loss:
   L ← λ_1·L_conf + λ_2·L_cls + λ_3·L_loc;
   Calculate the gradients:
   g ← ∇_θ L;
   Update the model parameters:
   θ ← Adam(θ, g, η);
   end for
end for

We use a CRNN model proposed by Shi et al. [26] as the text recognition network. The experiment uses a pretrained model trained on the synth90k dataset [44] to initialize the parameters of the text recognition model. We use NEOCR [45] dataset and SCUT FORU dataset to fine-tune the pretrained model. We set the training parameters as follows: The model training runs for 2000000 epochs. The batch size is 32. The initial learning rate is 0.01 with exponential decay of 0.1 every 500000 epochs. The experiment adopts gradient descent with momentum [46] to train the text recognition network. We set the parameter of momentum to 0.9.
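The combination of staircase exponential decay and SGD with momentum can be sketched as follows in PyTorch; the placeholder linear model simply stands in for the CRNN.

# Exponential learning-rate decay (x0.1 every 500000 steps) applied to SGD with momentum 0.9.
import torch
import torch.nn as nn

model = nn.Linear(10, 5)                                   # placeholder for the CRNN
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)

def decayed_lr(step, lr_init=0.01, decay_rate=0.1, decay_steps=500000):
    """Staircase exponential decay: multiply by decay_rate every decay_steps."""
    return lr_init * (decay_rate ** (step // decay_steps))

for step in (0, 499999, 500000, 1000000):
    for group in optimizer.param_groups:
        group["lr"] = decayed_lr(step)
    print(step, optimizer.param_groups[0]["lr"])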

The training algorithm of the text recognition model is summarized as in Algorithm 3.

Input: Parameter_1: The training set of text images X = {x_1, x_2, …, x_M}, where M is the number of batches and x_i is the i-th batch of the training set;
Parameter_2: The text labels Y = {y_1, y_2, …, y_M} corresponding to the text training images;
Output: Weights θ of the model
for epoch e = 1, 2, … do
 η ← exponentially decayed learning rate;
 for batch i = 1 to M do
  Predict the recognition result:
  s ← CNN-BiLSTM(x_i; θ);
  ŷ ← CTC-decode(s);
  Calculate the CTC loss:
  L ← L_CTC(s, y_i);
  Calculate the gradients:
  g ← ∇_θ L;
  Update the model parameters:
  θ ← SGD-Momentum(θ, g, η);
  end for
end for

3.2. Dataset

We evaluate the proposed method on four datasets: VOC2007 [47], VOC2012 [48], ICDAR 2013 [49], and SCUT FORU DB. VOC2007 and VOC2012 are object detection datasets, while ICDAR 2013 and SCUT FORU DB are text detection datasets. We integrate them into a comprehensive dataset for detecting text objects and general objects simultaneously.

VOC2007 is the challenge to recognize objects from a number of visual object classes in realistic scenes. The database contains a total of 9963 annotated images. We use 5011 images as training set and 4952 images as testing set. There are twenty object classes in the dataset.

VOC2012 poses the same challenge as VOC2007 with an increased training set size. There are 17125 training images in total. The testing set has not been released yet.

ICDAR 2013 is Challenge 2 of the ICDAR 2013 Robust Reading Competition, which contains horizontal texts. The dataset focuses on reading text in real scenes, and its images are focused around the text content of interest. The dataset consists of 229 training images and 233 testing images. Because there are too few training images, we additionally use 1200 images from the SCUT FORU training set.

The SCUT FORU Database is released by the South China University of Technology. The dataset consists of Chinese2k and English2k; we only use the English2k dataset. The English2k dataset contains character annotations and word annotations. The characters in the dataset include 52 uppercase and lowercase letters and 10 Arabic numerals. Each label in the dataset records the top-left coordinates of the rectangular box, the width and height of the rectangular box, and the word label of the text region. There are a total of 1715 images, of which 1200 are training images and 515 are testing images. The dataset has an average of 18.4 characters and 3.2 words per image.

COCO Dataset is a large-scale dataset for object detection, segmentation, and captioning. It contains more than 330K images and 200K labels. The COCO dataset has 80 object categories in total.

In the experiment, we integrate the datasets into a comprehensive dataset of 29265 images in total: 23565 training images and 5700 testing images. Since these datasets have different annotation formats, we convert them into a unified annotation format in which each object is recorded by its bounding box coordinates and class label. We shuffle the combined dataset before feeding it into the model.

Text recognition is performed with the CRNN model proposed by Shi et al. We use a model pretrained on the synth90k dataset and fine-tune it with the NEOCR dataset and the SCUT FORU dataset. The annotations in the NEOCR dataset contain characters that are not in the English alphabet, so we modify the annotations by replacing those special characters with English letters that look similar. The text images in the SCUT FORU dataset are cropped from the original images according to the coordinates in the annotations. The text images are resized to the fixed input size of the recognition network before they are fed into it.

3.3. Evaluation Metrics

We use mAP (mean Average Precision) to evaluate the detection model performance. The mAP calculation is based on the following metrics [50]:
True Positives (TP): detections that correctly match a ground truth with IoU above the threshold;
False Positives (FP): detections that do not match any ground truth with IoU above the threshold;
False Negatives (FN): ground truths that are not detected.

The IoU threshold is usually set to 0.5, 0.75, or 0.95; in our evaluation, we set it to 0.5.
Precision. Precision is the percentage of correct positive predictions and is defined as Precision = TP / (TP + FP).
Recall. Recall is the percentage of true positives detected among all relevant ground truths and is defined as Recall = TP / (TP + FN).
PR (Precision-Recall) Curve. The PR curve is a good way to evaluate the performance of an object detector: the precision and recall values of the detected objects are plotted to obtain a PR curve. The area under the PR curve is called the AP (Average Precision) and is calculated as AP = ∫_0^1 p(r) dr, where p(r) is the measured precision against recall r.
mAP. The mAP is the average of the AP over all categories.
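Average precision can be computed from the sorted detections as sketched below; the example assumes the IoU matching at the 0.5 threshold has already labeled each detection as TP or FP, and uses all-points interpolation of the PR curve.

# AP as the area under the interpolated precision-recall curve.
import numpy as np

def average_precision(is_tp, num_gt):
    """is_tp: boolean array over detections sorted by descending confidence."""
    tp = np.cumsum(is_tp)
    fp = np.cumsum(~is_tp)
    recall = tp / num_gt
    precision = tp / (tp + fp)
    # Make precision monotonically non-increasing from right to left, then integrate over recall.
    interp = np.maximum.accumulate(precision[::-1])[::-1]
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, interp):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

detections = np.array([True, True, False, True, False])    # sorted by confidence
print(round(average_precision(detections, num_gt=4), 3))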

3.4. Analysis of Experimental Results

In order to verify the choice of YOLOv3 as the detection network in the proposed method, we compare the detection performance of different detection frameworks, namely, Fast R-CNN, Faster R-CNN, SSD, YOLO, YOLOv2, and YOLOv3. We also compare different input size settings of YOLOv3. The results are shown in Table 1. All the detection frameworks being compared are trained on the VOC2007 and VOC2012 training datasets, and the mAP is measured on the VOC2007 testing dataset. As can be seen in Table 1, the YOLOv3 framework with a network input size of 416 × 416 achieves the highest mAP among the frameworks being tested. Further, the YOLO series, whether YOLOv2 or YOLOv3, generally achieves higher mAP than the other frameworks. It can therefore be concluded that the choice of the YOLOv3 framework in the proposed method is well justified.

After confirming the performance of YOLOv3 on object detection, we further train it on the comprehensive dataset, which is composed of the general object detection datasets VOC2007 and VOC2012 and the text detection datasets SCUT FORU and ICDAR2013. Then we test the performance on different testing datasets: the general object detection testing set of VOC2007 and the text detection testing sets of SCUT FORU and ICDAR2013. We compare the performance of different detection frameworks on 3 categories out of the total 20 categories in the PASCAL VOC 2007 dataset. As shown in Table 2, the performance of YOLOv3 on these 3 categories is much better than that of the other detection frameworks, which confirms that YOLOv3 has excellent performance on object detection. Our model achieves 70.0 AP in the text detection task. We do not list the text detection performance of the other methods because they do not feature text detection and recognition.

One may notice that the mAP of YOLOv3 in Table 2 is lower than that in Table 1. This is because we further train the YOLOv3 network on the text detection datasets, and the detection of text objects reduces the mAP to a certain extent. In addition, text objects in VOC2007 and VOC2012 are not marked in the annotations, so texts detected in VOC2007 and VOC2012 are counted as false positives, which decreases the mAP.

Due to this imperfection of the comprehensive dataset, which consists of general object datasets and text datasets, we improve its annotation information: we label the text objects in VOC2007 and VOC2012 and the general objects in ICDAR2013 and SCUT FORU. This makes the annotations of the comprehensive dataset more accurate, reduces the false positive rate of the detection model in training and testing, and improves the detection accuracy as a whole. As can be seen from Table 2, the detection model achieves the highest detection accuracy with the YOLOv3 framework at the input size of 544 × 544. Table 3 compares the detection performance on the comprehensive dataset before and after the modification, using the YOLOv3 framework with the input size of 544 × 544. As shown in Table 3, the detection network trained on the modified comprehensive dataset has higher accuracy on person and text objects than on the original dataset. The detection accuracy for the text objects is significantly improved, and the mAP on the modified comprehensive dataset is also improved.

3.5. Performance on Object-Text Detection and Recognition

The model we propose performs two tasks: object-text detection and text recognition. The object-text detection network can detect general objects and text objects simultaneously. The text contents of the detected text regions from the detection network are recognized by the text recognition network. This section shows the detection and recognition results of test images in the experiment.

As shown in Figure 6, we mainly show detection results of test images in transportation scenes. The detection model can detect multiple objects in one image and performs well on both small and large objects. The text detection dataset contains billboards, signboards, road signs, etc. Some texts exist in complex environments and may be occluded. As shown in Figure 7, the detection model can detect text in complex scenes. However, some text bounding boxes are not accurate enough, which may cause wrong recognition of the texts. The object-text detection model we propose can simultaneously detect text and general objects; some detection examples are demonstrated in Figure 8.

The text recognition model can recognize the text contents of the text regions detected by the detection model. As shown in Figure 9, we demonstrate some examples of the text recognition model on road signs. As shown in Figure 10, the text recognition model can recognize not only horizontal text but also affine-distorted text. Affine-distorted texts are common due to variations of the camera view. The proposed model not only locates these texts but also recovers their contents.

Figure 11 gives a more application-specific demonstration of the proposed object detection and text recognition model. In this scenario, the information extracted by the text recognition module identifies the detected object. We use some images of cars with license plates as a proof of concept. The object-text detection model we have proposed can simultaneously detect the car and the plate on the car, and the text recognition model then recognizes the text content on the plate.

4. Conclusions

We present an object-text detection and recognition model in this article. The model not only detects texts and general objects simultaneously but also recognizes the text contents inside the detected text bounding boxes. The method we have proposed combines object detection and text recognition. In some application scenarios, the recognized text contents around the general objects can be used as identifiers to distinguish the objects. The proposed method has potential in extensive applications, such as intelligent transportation systems and autonomous driving.

Possible directions for future research include the following:
(1) Improving the dataset: adding more samples which contain both text and general objects to train the network.
(2) Improving the detection network for text detection: for example, anchor boxes suited to text sizes can be used; k-means clustering on the dataset containing text objects can make the sizes of the generated anchor boxes more suitable for text.
(3) Optimizing the connection between the detection network and the recognition network: in our proposed model, the connection between the detection and recognition networks is the text region cropped from the original image according to the coordinates of the detected text boxes. To optimize the connection, we could instead extract the feature map from the detection network as the input of the recognition network, applying an affine transformation to the extracted feature map to fit the input size of the recognition network. Then, during backpropagation, the gradients could flow from the recognition network back to the detection network, and the detection and recognition model could be regarded as an end-to-end model.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This research was funded by the Foundation of the National Key Research and Development Program (grant number 2016YFC0801800), National Natural Science Foundation of China (grant number 51874300), National Natural Science Foundation of China and Shanxi Provincial People’s Government Jointly Funded Project of China for Coal Base and Low Carbon (grant number U1510115), and the Open Research Fund of Key Laboratory of Wireless Sensor Network & Communication, Shanghai Institute of Microsystem and Information Technology, Chinese Academy of Sciences (grant numbers 20190902 and 20190913).